This file was last updated 17 April 2006. See the end of the file for the
revision history.

The files provided in this release:

na_armX.dmel.RELEASE5
na_arm2L.dmel.RELEASE5
na_arm2R.dmel.RELEASE5
na_arm3L.dmel.RELEASE5
na_arm3R.dmel.RELEASE5
na_arm4.dmel.RELEASE5
na_XHet.dmel.RELEASE5
na_YHet.dmel.RELEASE5
na_2LHet.dmel.RELEASE5
na_2RHet.dmel.RELEASE5
na_3LHet.dmel.RELEASE5
na_3RHet.dmel.RELEASE5
na_armU.dmel.RELEASE5
na_armUextra.dmel.RELEASE5
gb_acc.dmel.RELEASE5
mapping.sql
README.RELEASE5

Euchromatic Sequence

The first six files are fasta files for the assembled euchromatic arms
of D. melanogaster. This sequence is the result of a combination of BAC
and whole genome shotgun data and is all finished to high quality. All
sequence has been compared to the restriction digest fingerprints in
multiple enzymes for validity. The details of this analysis will be
described in a forthcoming publication.

The chromosome arm sequences represent all of the euchromatin and parts of
the centric heterochromatin. They extend into the centric heterochromatin
on arms X, 2L, 2R, 3L and 3R by a combined total of 4.7 Mb.


Table 1: Release 5 Statistics

Arm      Length  Gaps   Major differences compared to Release 4
X      22422827     3   8kb added to the distal end; gaps filled
                        in regions 1-11.
2L     23011544     2   591kb added to the proximal end.
2R     21146708     1   380kb added to the proximal end.
3L     24543557     1   16kb added to the distal end; 718kb added to
                        the proximal end; other gaps filled.
3R     27905053     0   None.
4       1351857     1   70kb added to the distal end.

Gaps of unknown size are denoted by 100 N's in the fasta files. Two of
the gaps on X have size estimates. The 6 other gaps listed in Table 1
are not sized.
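
As an illustration of the gap convention above, the following Python
sketch locates runs of N's in a sequence string and picks out those that
are exactly 100 long. The function names are hypothetical and not part of
this release. Note that a run of exactly 100 N's is only a candidate for
an unsized gap: this README does not say how the sized gaps on X are
written, so the check is a heuristic, not a definitive classification.

```python
import re

def find_n_runs(sequence):
    """Return (start, end, length) for each run of N's.

    Coordinates are 1-based and inclusive.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        start = match.start() + 1   # 0-based offset -> 1-based coordinate
        end = match.end()           # 0-based exclusive end == 1-based inclusive end
        runs.append((start, end, end - start + 1))
    return runs

def unsized_gaps(sequence, marker_length=100):
    """Return the runs whose length equals the 100-N marker length."""
    return [run for run in find_n_runs(sequence) if run[2] == marker_length]

# Toy sequence (not real data): one 100-N gap and one shorter run of N's.
toy = "ACGT" + "N" * 100 + "GGCC" + "N" * 7 + "TT"
print(find_n_runs(toy))   # [(5, 104, 100), (109, 115, 7)]
print(unsized_gaps(toy))  # [(5, 104, 100)]
```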

In addition to the major changes, all arms except 3R had minor
sequence changes in regions where the fingerprint digest showed evidence
of misassembly, or in regions that required quality improvements.

Each of the fasta files begins with a simple header giving the arm
identifier, the assembly date, and an md5 checksum of the sequence.
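
The checksum in the header can be used to confirm that a file was
transferred intact. The Python sketch below shows one way this might be
done. It rests on assumptions that this README does not confirm: that the
checksum appears in the header as a 32-digit hexadecimal token, and that
it was computed over the sequence with line breaks removed and case
unchanged. All function names are hypothetical. If the comparison fails
on a real file, check these assumptions before suspecting the data.

```python
import hashlib
import re

def read_fasta_record(lines):
    """Return (header, sequence) for the first record in an iterable of lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break               # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)

def sequence_md5(sequence):
    """md5 of the sequence with line breaks already removed (an assumption)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()

def header_md5(header):
    """Return the first 32-hex-digit token in the header, or None if absent."""
    match = re.search(r"\b[0-9a-fA-F]{32}\b", header)
    return match.group(0).lower() if match else None

# Toy record built in memory; the header layout here is invented.
seq = "ACGTACGTNNACGT"
toy_lines = [">toy_arm 2006-03-29 " + sequence_md5(seq), seq[:8], seq[8:]]
header, sequence = read_fasta_record(toy_lines)
print(header_md5(header) == sequence_md5(sequence))

# For a real file, pass an open file handle instead of toy_lines, e.g.
#   with open("na_arm4.dmel.RELEASE5") as handle:
#       header, sequence = read_fasta_record(handle)
```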

Heterochromatic Sequence

The next seven files are fasta files for sequence scaffolds that lie
in heterochromatic regions of the genome. Within each scaffold, we have
attempted to order and orient the contigs in a way consistent with
BAC end alignment and/or STS mapping. Intervals between scaffolds are
denoted by 100 N's, and the order and orientation of the scaffolds with
respect to one another is not known. The details of this analysis and
finishing techniques will be in forthcoming publications. Scaffolds
that map genetically or cytologically to particular chromosomes are in
the respective file; scaffolds which either cannot be localized or have
conflicting localization data are in na_armU.dmel.RELEASE5.

The most recent version of na_armU.dmel.RELEASE5 dated 13 April 2006 has
had four scaffolds removed from the version dated 29 March 2006. The
initial version was discovered to have scaffolds that are part of
X included.

The sequence in the heterochromatic files was generated starting with the
whole genome shotgun assembly provided by Celera in 2001 and improved by
gap closure, quality improvement, and joining scaffolds. The primary focus
in this release has been scaffolds from the shotgun assembly which were
larger than 40kb. Smaller scaffolds have not been targeted for finishing
in this release except in cases where they joined with larger scaffolds.
We have attempted to keep these seven files nonredundant with one another
and with the euchromatic arms.

The file na_armUextra.dmel.RELEASE5 contains 34,630 small scaffolds
produced by the Celera shotgun assembler which could not be consistently
joined with larger scaffolds. This data has not been previously
released. The majority consists of short sequences: 32,804 are less than
1000 bp long and 31,656 were generated from fewer than 10 traces. We have
attempted to remove redundant scaffolds in which the data for a scaffold
was used in the assembly of a finished region, but we have not excluded
scaffolds which may be redundant with euchromatic or other heterochromatic
regions. Nor can we exclude the possibility of contamination from other
organisms. We are making this data available as a resource for analysis
of regions which cannot be assembled well, such as satellites or simple
repeats. Since some of this data is low quality, researchers are encouraged
to contact either BDGP or DHGP for further details on this resource.

GenBank Accession Numbers

Individual components - either separate BACs or whole genome scaffolds - 
are submitted to GenBank as part of the public release. The file
gb_acc.dmel.RELEASE5 is a listing of the GenBank accession numbers and
coordinates for each constituent. The accession numbers in this file
are of 3 types: ACNNNNNN, CPNNNNNN, and AABUNNNNNNNN. The first type is
for BACs, the second is for finished or improved WGS scaffolds, and the
third is for unimproved WGS scaffolds. In addition, some regions are
marked with internal identifiers of the form DNNNN or 2110000222NNNNN.
The first of these mark new submissions to GenBank for which we do not
yet have accession numbers; they will be submitted to GenBank soon, and
the coordinate file will be updated as accession numbers become
available. The regions labeled with 2110000222NNNNN identifiers are all
part of 'armUextra' and are not likely to be submitted to GenBank in the
immediate future.

The sequence of a BAC in GenBank is not necessarily in the same
orientation as it is on the chromosome. Also, there are cases of
polymorphisms in overlap regions in which the sequence of the BAC does
not match the assembled arm. Some BAC sequences have been submitted to
GenBank as composites of two adjacent BACs. Please see the GenBank record
for the coordinates of the individual BACs within these submissions.

The coordinates in this file are 1-based.
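
Because many programming languages index strings from 0, and because a
BAC may lie on the opposite strand from the arm, a small amount of care
is needed when extracting a component from an arm sequence. The Python
sketch below shows the conversion. It assumes that the end coordinate is
inclusive, which this README does not state, and it does not parse
gb_acc.dmel.RELEASE5, whose column layout is not described here. The
function names are hypothetical.

```python
# Translation table for complementing bases; N is left as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]

def extract_region(arm_sequence, start, end, reverse=False):
    """Extract a region given 1-based coordinates.

    The end coordinate is treated as inclusive (an assumption).
    Set reverse=True for a component on the opposite strand.
    """
    if start < 1 or end < start or end > len(arm_sequence):
        raise ValueError("coordinates out of range: %d-%d" % (start, end))
    region = arm_sequence[start - 1:end]   # 1-based inclusive -> Python slice
    return reverse_complement(region) if reverse else region

# Toy arm sequence (not real data).
arm = "AACCGGTTAC"
print(extract_region(arm, 3, 6))                 # CCGG
print(extract_region(arm, 7, 10, reverse=True))  # GTAA
```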

Release 4 Mapping

The final file, mapping.sql, is a table that can be used to generate
mappings between Release 4 euchromatin and Release 5. The file contains
the PostgreSQL commands needed to find the coordinate on a Release 5 arm
corresponding to a Release 4 coordinate, and vice versa. Load the data
with the command:

           psql -h db_server db_name < mapping.sql

Then an example query to find the coordinate on Release 5 corresponding
to 32415 on Release 4 is:

           db_name=# select r4_map('2R',32415);
            r4_map
           --------
            412533
           (1 row)

and the converse to find the Release 4 coordinate corresponding to a
Release 5 location is:

           db_name=# select r5_map('2R',412533);
            r5_map
           --------
             32415
           (1 row)

The file may need to be modified to suit the requirements of a particular
site.


Revision History:

29 March 2006	Initial version
17 April 2006	Added GenBank Accession number listing; removed redundant scaffolds from armU