
Release 3 Notes

RE-ANNOTATED GENOMIC SEQUENCE

Release 3 Notes: Updated June 18, 2004


Release 3.2 of the euchromatin is now available at http://flybase.net/annot/.


RELEASE 3.2 ANNOTATION UPDATE

The March 2004 annotation of the Drosophila euchromatic genome, Release 3.2, is available from FlyBase and the public data libraries (NCBI, EBI, DDBJ). At FlyBase, these data are available from FlyBase Gene Annotation reports, the FlyBase BLAST server, the batch query page, and the download site. Previous releases, unannotated BAC-based sequences, and the WGS3 whole-genome shotgun sequence assembly continue to be available from GenBank. See the Heterochromatin section below for information about the release of Heterochromatin.

Annotations and supporting data may be viewed at Gene Annotation reports (e.g., Sos), which are accessible from individual gene report pages or as a result of a query using the Basic Annotation Query Form. Release 3.2 includes new sequence features curated from the fly literature, such as mutational lesions, aberration breakpoints, and insertion sites of transgenic constructs. These new sequence features may also be accessed via the Gene Annotation reports; however, they are not included in the Release 3.2 GenBank submissions.

A summary description of Release 3.2 data available for batch download may be found HERE.

Release 3.2 is the first of the regular updates that will reflect gene-by-gene annotation assessments, rather than a comprehensive survey of the entire genome. Re-annotation of a gene model is triggered by new sequence data, data curated from the literature, or user communications. For updates that include changes to annotations only (and not the underlying sequence), the release numbers increase in decimal increments. These more frequent updates also include new supporting data, represented in the evidence tiers of the gene annotation reports, in the Gbrowse views, and in the Apollo annotation editor and viewer. A major addition to annotated gene models in Release 3.2 is the inclusion of 100 5S rRNAs (of the estimated 160 genes in the 56F 5S rRNA gene cluster); this includes four 5S rRNA pseudogenes. A comparison of gene models annotated in recent releases is tabulated below; a more detailed tabulation may be found HERE.

| Feature | Release 2 Total | Release 3.1 Euchromatin | Release 3.1 Heterochromatin | Release 3.1 Total | Release 3.2 Euchromatin | Release 3.2 Heterochromatin | Release 3.2 Total |
|---|---|---|---|---|---|---|---|
| Protein-coding genes | 13474 | 13369 | 290 | 13659 | 13472 | 320 | 13792 |
| Protein-coding transcripts | 14335 | 18109 | 396 | 18505 | 18746 | 430 | 18906 |
| Unique peptides | 13922 | 15848 | 353 | 16201 | 16356 | 390 | 16746 |
| Peptides unchanged from r2 to r3.1 | - | 8769 | 53 | 8822 | - | - | - |
| Peptides unchanged from r3.1 to r3.2 | - | - | - | - | 16902 | 256 | 17158 |
| tRNAs | 0 | 288 | 0 | 288 | 288 | 0 | 288 |
| rRNAs | 0 | 6 | 6 | 0 | 96 | 6 | 102 |
| Pseudogenes | 0 | 17 | 1 | 18 | 39 | 1 | 40 |
| microRNAs | 0 | 23 | 0 | 23 | 23 | 0 | 23 |
| snRNAs/snoRNAs | 0 | 56 | 0 | 56 | 56 | 0 | 56 |
| Natural transposon insertions | 0 | 1572 | 9 | 1581 | 1572 | 6189\* | 7761\* |
| Misc. non-protein-coding RNA | 0 | 38 | 8 | 46 | 45 | 13 | 58 |
| New compared to previous | 336 | 576 | 226 | 802 | 211 | 56 | 267 |
| Deleted from previous | 114 | 284 | 61 | 345 | 41 | 23 | 64 |
| Mergers of previous | - | 695 | 12 | 707 | 31 | 6 | 37 |
| Splits of previous | - | 675 | 1 | 676 | 26 | 0 | 26 |

\*Natural transposon insertions in heterochromatin are 'repeat\_regions' with high TE homology. See [Transposable Elements](/annot/release3.html#transposons) below.

The confidence we have in the annotated gene models varies considerably; improvements to the gene models will be ongoing, and will require the continued input of the community. If you notice a mistake in annotation, please submit an error report form (also accessed from the gene annotation reports) or write to flybase-updates AT morgan.harvard.edu. Updates may also be submitted as sequence records or as Apollo-generated XML files.


HETEROCHROMATIN

The sequence finishing and annotation of the heterochromatic region of the genome is being performed by the Drosophila Heterochromatin Genome Project (DHGP; see Hoskins et al. 2002). As sequence gaps are filled, and the heterochromatic scaffolds are finished to high quality and re-annotated, they will be contributed to GenBank and FlyBase and integrated into future releases of the Drosophila genomic sequence. Release 3.2 annotation of the heterochromatic regions was released to GenBank in June 2004 and should be available from FlyBase and GenBank by July 2004.

The Release 3.2 heterochromatin annotation represents the latest effort to describe the protein-coding genes, non-coding genes, and other features located in the heterochromatin sequence. In this update, the underlying sequence is the 20.7 Mb of Release 3 whole-genome-shotgun (WGS) scaffolds from Celera that could not be assembled into the euchromatin arms, as well as a few BDGP-sequenced scaffolds.

The WGS3 heterochromatin consists of ~2600 scaffolds that still contain gaps and collapsed repeats, but are otherwise considered relatively high-quality sequence. Some of these have been mapped to particular chromosome arms (i.e. 2h, 3h, 4h, Xh, or Yh), while the remainder have been placed on chromosome U. It is important to note that scaffolds that have been mapped to a particular chromosome arm are provisionally ordered, but not oriented: they are ordered by their experimentally determined cytological locations, but their orientation and exact order remain unclear. Chromosome U consists of unordered, unoriented scaffolds. While the underlying sequence of the scaffolds annotated in Release 3.2 has not changed, the mapping and ordering of these scaffolds on chromosome arms (e.g. 2h, 3h...) may differ from previous releases.

The transition between the euchromatic and heterochromatic regions of the genome is thought to be a gradual one, and there are no objective rules to categorize the sequence in this transitional area as definitively euchromatic or heterochromatic. Currently the boundaries between the euchromatic and heterochromatic portions of the genome are based on cytological data, as described in Hoskins et al. 2002.

Annotation guidelines consistent with FlyBase and the overall Drosophila genome annotation were adhered to whenever possible. However, since these annotations are based on high-quality draft sequence, certain gene models may contain missing or premature stop codons, missing start codons, or gaps within their ORFs. Open reading frames corresponding to fragments of transposable elements are common in heterochromatin; every attempt was made to identify these and exclude them from the gene annotations.

As the DHGP adds new data and improves the quality of the underlying sequence and assembly in future releases, the quality of the annotations will also improve. The DHGP welcomes any feedback and data from the community that will assist in this effort.


KNOWN MUTATIONS IN THE SEQUENCED STRAIN

The sequenced strain, usually described as the y[1]; cn[1] bw[1] sp[1] strain, was known to carry mutations in those four genes. During annotation, mutations in other genes have been discovered (currently known are mutations in oc, LysC, MstProx, GstD5, and Rh6). To allow compilation of a comprehensive proteome, wild-type protein sequences for these genes have been included in sequence entries to GenBank/EMBL/DDBJ. Wherever possible, a RefSeq accession based on an alternative wild-type sequence and curated as a FlyBase Annotated Genome Sequence (ARGS) has been provided.


GENOMIC SEQUENCE RELEASES vs. ANNOTATION RELEASES

The different releases of the D. melanogaster genomic sequence are designated by the whole number component of the release number. The first annotated genomic sequence was released on March 24, 2000, and constituted Release 1 (Adams et al. 2000). After Celera/BDGP filled 330 gaps and changed ~3000 annotations, Release 2 was made public in October 2000. This whole genome shotgun assembly had ~1300 gaps.

To produce the 116.8 Mb Release 3 euchromatic sequence, the BDGP closed almost all of the gaps in the euchromatic portion of the genome, and raised the sequence quality to an estimated error rate of less than one in 100,000 base pairs in the unique portion of the sequence, and less than one in 10,000 base pairs in the repetitive portion (Celniker et al. 2002). The accuracy of the assembly was verified by restriction digestion of BAC clones, and the composite sequences of transposable elements in the previous releases were replaced in Release 3 with the true sequences of 1572 individual transposon insertions.

The BDGP will continue to improve the genomic sequence to high quality. Release 4 genomic sequence has been submitted as unannotated BACs to GenBank as it has been finished, and the unannotated Release 4 sequence is available as chromosome arms or one file with all of the euchromatin at the BDGP download site. In the Release 4 sequence, 21 of the 44 gaps have been closed and two small inversions in the assembly corrected. Details may be found in the BDGP Release 4 notes.

Commencing with Release 3 and continuing into the future, changes to the gene models and other annotations will occur more often than changes to the underlying sequence. These changes are indicated by fractional release numbers; for example, 'Release 3.2' consists of the second update of annotations on the Release 3 genomic sequence. FlyBase will continue to increment release numbers across the entire genome. Since a new release of the genomic sequence has just been completed, the next annotation update will be Release 4.1.
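The numbering convention described above can be sketched in a few lines of Python. This is an illustrative sketch only, not FlyBase code: the names `Release`, `parse_release`, and `same_sequence` are assumptions made for the example, as is the choice to treat a label with no fractional part (e.g. 'Release 2') as annotation update 0.

```python
from typing import NamedTuple


class Release(NamedTuple):
    sequence: int    # whole-number part: release of the genomic sequence
    annotation: int  # fractional part: annotation update on that sequence


def parse_release(label: str) -> Release:
    """Split a label such as 'Release 3.2' or '3.2' into its two components.

    Hypothetical helper for illustration. The fractional part is read as an
    update counter, so '3.2' means the second annotation update on Release 3.
    """
    number = label.replace("Release", "").strip()
    major, _, minor = number.partition(".")
    return Release(int(major), int(minor) if minor else 0)


def same_sequence(a: str, b: str) -> bool:
    """True when two releases are annotated on the same genomic sequence."""
    return parse_release(a).sequence == parse_release(b).sequence


print(parse_release("Release 3.2"))
print(same_sequence("Release 3.1", "Release 3.2"))
print(same_sequence("Release 3.2", "Release 4.1"))
```

Under this reading, coordinates can be compared directly only between releases for which `same_sequence` is true, since only then is the underlying sequence unchanged.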

In FlyBase, the release number will appear at the top of each annotation query and report page, and also at the FlyBase download sites for sequence. Please make a note of the release number you are working with.

The annotated sequence is submitted to GenBank as chromosome arms, and GenBank cuts these into 250-350kb segments. When the underlying sequence for a given segment changes, GenBank increments the decimal version number. Note that this does not occur genome-wide, so some accession version numbers will change and others will not. On occasion, the underlying sequence has not changed, but the extent of a given segment may differ (to avoid dividing a gene model between two segments). Such a change in extent will also result in an increment of the version number. Changes to annotations are indicated by an updated date stamp.

Examples of release number changes and corresponding GenBank version numbers are shown in the table below.

| Date | Release | GenBank Version |
|---|---|---|
| March 2000 | Release 1 | AE003452.1 |
| October 2000 | Release 2 | AE003452.2 |
| June 2002 | Release 3.0 | AE003452.3 |
| February 2003 | Release 3.1 | AE003452.4 |
| March 2004 | Release 3.2 | AE003452.4 |
| March 2000 | Release 1 | AE003463.1 |
| October 2000 | Release 2 | AE003463.1 |
| June 2002 | Release 3.0 | AE003463.2 |
| February 2003 | Release 3.1 | AE003463.2 |
| March 2004 | Release 3.2 | AE003463.2 |

Links from FlyBase gene and annotation reports will go to the most recent release at NCBI. If you need access to a previous release, you can query at NCBI using the accession number including the version number suffix; click on 'revision history.'
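As a rough illustration of how the accession and version suffix relate to releases, the sketch below splits an identifier such as AE003452.4 and checks whether the version changed between two releases, using only the example rows tabulated above. The function names and the `HISTORY` list are assumptions made for this example; this is not a GenBank or FlyBase API.

```python
from typing import Dict, List, Tuple


def split_accession(accession_version: str) -> Tuple[str, int]:
    """Split 'AE003452.4' into ('AE003452', 4)."""
    accession, _, version = accession_version.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {accession_version!r}")
    return accession, int(version)


# Rows copied from the example table: (release, GenBank accession.version).
HISTORY: List[Tuple[str, str]] = [
    ("Release 1", "AE003452.1"),
    ("Release 2", "AE003452.2"),
    ("Release 3.0", "AE003452.3"),
    ("Release 3.1", "AE003452.4"),
    ("Release 3.2", "AE003452.4"),
    ("Release 1", "AE003463.1"),
    ("Release 2", "AE003463.1"),
    ("Release 3.0", "AE003463.2"),
    ("Release 3.1", "AE003463.2"),
    ("Release 3.2", "AE003463.2"),
]


def versions_by_release(history: List[Tuple[str, str]], accession: str) -> Dict[str, int]:
    """Map each release to the version number recorded for one accession."""
    result: Dict[str, int] = {}
    for release, accession_version in history:
        name, version = split_accession(accession_version)
        if name == accession:
            result[release] = version
    return result


def version_changed(history, accession: str, earlier: str, later: str) -> bool:
    """True when the version suffix differs between two releases.

    A change means the segment's sequence or its extent changed;
    annotation-only updates leave the version number as it was.
    """
    versions = versions_by_release(history, accession)
    return versions[later] != versions[earlier]


print(version_changed(HISTORY, "AE003452", "Release 3.1", "Release 3.2"))
print(version_changed(HISTORY, "AE003463", "Release 1", "Release 2"))
```

Both segments keep the same version between Release 3.1 and Release 3.2, which is consistent with Release 3.2 being an annotation-only update.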


RELEASE 3.1

When Release 3 of the genomic sequence became available, FlyBase conducted a comprehensive review of all euchromatic annotations (Misra et al. 2002). The goals of this re-annotation were:

To address these goals, a new computational pipeline was created (Mungall et al. 2002) with an exhaustive list of Drosophila sequence datasets and SwissProt/trEMBL SWALL peptide datasets from other species. The results and datasets are stored in the new FlyBase genome annotation database, so that evidence for the annotations can be tracked and queried. A new graphical user interface, Apollo, was developed in a collaboration between FlyBase BDGP and Ensembl to allow FlyBase biologist curators to easily view the results of computational analysis and efficiently edit the annotations (Lewis et al. 2002). A set of curation rules and a controlled vocabulary of comments were created to allow the group of ten curators to annotate consistently. Finally, a set of validation steps was created, including software to compare each predicted peptide to those curated peptides in SwissProt with experimental evidence.

The Release 3 re-annotation improved the quality of the majority of gene models. The length of UTRs and the number of alternative transcripts increased, due to the increase in EST and complete cDNA sequences. The fine details of the exon-intron structure were significantly improved. Numerous genes were merged and/or split, based on the cDNA and BLASTX data; some genes predicted in earlier releases were deleted, and others are newly predicted. Genes were deleted if they overlapped transposons or if they fell below a minimum size cutoff (100 aa) and had no experimental evidence beyond a computational gene prediction. Overall, these improved annotations resulted in changes in >45% of the predicted proteins.
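The deletion criteria described above could be expressed roughly as follows. This is a sketch under stated assumptions, not the curators' actual procedure: it reads the rule as "overlaps a transposon, OR (shorter than the cutoff AND lacking experimental evidence)", and the `GeneModel` fields and example gene names are hypothetical.

```python
from dataclasses import dataclass

MIN_PEPTIDE_LENGTH_AA = 100  # minimum size cutoff quoted in the text


@dataclass
class GeneModel:
    # All field names are hypothetical, chosen for this illustration.
    symbol: str
    peptide_length_aa: int
    overlaps_transposon: bool
    has_experimental_evidence: bool  # support beyond a computational prediction


def should_delete(model: GeneModel) -> bool:
    """Apply the assumed reading of the Release 3 deletion rule."""
    too_small_and_unsupported = (
        model.peptide_length_aa < MIN_PEPTIDE_LENGTH_AA
        and not model.has_experimental_evidence
    )
    return model.overlaps_transposon or too_small_and_unsupported


# Made-up example models, not real annotations.
models = [
    GeneModel("example-1", 80, False, False),
    GeneModel("example-2", 80, False, True),
    GeneModel("example-3", 450, True, True),
]
for m in models:
    print(m.symbol, should_delete(m))
```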


TRANSPOSABLE ELEMENTS

As a result of the whole genome shotgun assembly, the sequence of each transposon in Releases 1 and 2 was a composite derived from a number of elements of that transposon type. In Release 3, the sequence of each transposon insertion in the euchromatin of the y[1]; cn[1] bw[1] sp[1] strain was determined and characterized (Kaminker et al. 2002). See the BDGP Natural Transposable Element page for more information. The transposons in euchromatin have not been updated between Release 3.1 and Release 3.2.

The Drosophila heterochromatin sequence is extremely rich in repetitive satellite elements, simple repeats, and transposable element fragments. At the time of Release 3.1, more than 55% of the Release 3 heterochromatin sequence was determined to have homology to a repetitive element of some type. Currently the Drosophila Heterochromatin Genome Project estimates that 75% of the Release 3 heterochromatin sequence is composed of repetitive sequence. Since the repetitive regions in the heterochromatin are so fragmented and located in regions with many gaps and potential assembly errors, we did not rigorously curate and hand-identify transposable elements in the same manner as Kaminker et al. 2002 for the Release 3 euchromatin. Instead, we used the Kaminker et al. "Natural Transposable Element" dataset as a library for RepeatMasker to identify stretches of sequence that were likely to be a transposable element or repeat. Since these regions may not represent complete elements, or may contain many nested elements, the DHGP refers to these as 'repeat regions'. Essentially, 'repeat regions' are stretches of genomic sequence with a significant alignment to a known Drosophila transposable element or simple repeat. In most cases a repeat region is composed of thousands of nested fragments of other transposable elements. Since our method relies on alignment to known elements, it is likely that some legitimate repeats remain to be identified.

The results of the heterochromatin repeat analysis can be seen as the 'Repeatmasker' result tier when using the Apollo genome viewer or obtained as FASTA, GFF, or GAME-XML from the DHGP FTP site.


GENE AND TRANSCRIPT IDENTIFIERS

In Releases 3.0 and 3.1, protein-coding genes were given 'CG' identifiers of the form CGnnnnn. For non-protein-coding genes, such as tRNAs, snRNAs, snoRNAs, microRNAs, miscellaneous non-coding RNAs, and pseudogenes, 'CR' identifiers of the form CRnnnnn were assigned. Transposable elements were given TEnnnnn identifiers. Transcripts were assigned FlyBase transcript identifiers, for which the gene identifier is followed by a suffix -RX; e.g., CG12345-RA, CG12345-RB. For peptides, the -RX suffix is replaced by a -PX suffix, with the second identifying letter always in agreement with that of the corresponding transcript; e.g., CG12345-PA, CG12345-PB.
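The identifier conventions above lend themselves to a small parsing sketch. The regular expression and function names below are assumptions made for illustration, not a FlyBase specification; in particular, the pattern accepts any number of digits (CG8094 and CG12345 both occur) and assumes the suffix after -R or -P is one or more uppercase letters.

```python
import re

# Gene identifier (CG = protein-coding, CR = non-protein-coding), then
# -R (transcript) or -P (peptide), then a letter suffix such as A or B.
ANNOTATION_ID = re.compile(r"^(?P<gene>C[GR]\d+)-(?P<kind>[RP])(?P<suffix>[A-Z]+)$")


def gene_of(annotation_id: str) -> str:
    """Return the gene identifier part, e.g. 'CG12345' for 'CG12345-RB'."""
    match = ANNOTATION_ID.match(annotation_id)
    if match is None:
        raise ValueError(f"not a transcript or peptide identifier: {annotation_id!r}")
    return match["gene"]


def peptide_for_transcript(transcript_id: str) -> str:
    """Map a protein-coding transcript ID to its peptide ID (-RX becomes -PX)."""
    match = ANNOTATION_ID.match(transcript_id)
    if match is None or match["kind"] != "R":
        raise ValueError(f"not a transcript identifier: {transcript_id!r}")
    if not match["gene"].startswith("CG"):
        # CR genes are non-protein-coding, so no peptide is expected.
        raise ValueError(f"{match['gene']} is not a protein-coding gene")
    return f"{match['gene']}-P{match['suffix']}"


print(gene_of("CG12345-RB"))
print(peptide_for_transcript("CG12345-RA"))
```

Keeping the letter suffix intact is what guarantees that a peptide identifier always agrees with the transcript it was derived from.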

In Release 3.2, the CGnnnnn annotation symbols have been replaced with the accepted gene symbol (where available). For example, CG8094, CG8094-RA, and CG8094-PA become gene Hex-C, transcript Hex-C-RA, and protein Hex-C-PA. The CG8094 ID is still supported as a more computable alternative to this symbolic name, but will be less visible.

In Releases 1 and 2, only protein-coding genes were annotated, and CGnnnnn identifiers were assigned to genes, CTnnnnn identifiers to transcripts, and pp-CTnnnnn identifiers to peptides. These old Release 1 and 2 CT identifiers are now obsolete, and there is no mapping between CT identifiers and the Release 3 CGnnnnn-RA identifiers. However, in most cases the CT identifier has become a synonym of the gene, and can be queried using the FlyBase Gene Search page to find the gene it was associated with in Release 2. In some cases, a Release 2 gene may correspond to more than one Release 3 gene, e.g. if exons were redistributed or split between two new Release 3 genes.


QUESTIONS?

Please address any questions to [email protected]