The pipeline included an autopromotion step in which certain data classes were promoted to gene models. Several data classes were used to establish such initial annotations: (1) Release 2 annotations, in most cases; (2) gene models defined by literature-based FlyBase Annotated Reference Gene Sequences (ARGS) for the 776 genes for which they were available; these superseded previous annotations; (3) DGC cDNA clones, which served as templates for autopromotion in those regions for which no previous annotation existed and where a DGC cDNA clone could be aligned.
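The precedence among these data classes can be illustrated with a minimal Python sketch. The function name, its boolean arguments, and the returned labels are hypothetical and are not part of the pipeline itself; the sketch only encodes the ordering stated above (ARGS supersede earlier annotations, and a cDNA template is used only where nothing else exists).

```python
from typing import Optional


def autopromotion_source(has_args: bool,
                         has_release2: bool,
                         has_dgc_cdna: bool) -> Optional[str]:
    """Pick the data class used for the initial gene model in a region.

    Hypothetical helper: the three flags say which data classes are
    available for the region under consideration.
    """
    if has_args:
        # Literature-based ARGS supersede previous annotations.
        return "ARGS"
    if has_release2:
        # Release 2 annotations were the usual starting point.
        return "Release 2"
    if has_dgc_cdna:
        # A cDNA template is used only where no annotation existed.
        return "DGC cDNA"
    # Nothing to autopromote; the region is left to the annotator.
    return None
```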
The job of the annotator was two-fold: first, to determine whether the autopromoted gene models were correct and complete, and second, to determine if there was evidence for additional genes or transcripts not already represented by the autopromoted models. Data that typically supported a new gene annotation included BDGP EST and cDNA data, GenBank/EMBL/DDBJ submissions from the scientific community, convincing BLASTX homologies, community submissions of sequence data to FlyBase, or reassessment of gene prediction data.
Annotators followed a consistent set of rules during the following basic steps of annotation:
Does a protein-coding gene exist in this region? Gene prediction algorithms are sufficiently robust that this is rarely an issue for larger genes (>200 amino acids), unless the gene consists of many small dispersed exons. When making a judgment in cases of small genes, or genes composed of small exons, available evidence was classified into three types: (1) Matches to cDNA sequence data, which could be the BDGP cDNA/EST data or data generated by the community; such data were considered more significant if they included an intron with consensus splice sites. (2) Gene prediction data; in the absence of other data, only exons corresponding to a GENSCAN or Genie prediction with a score exceeding 45 were considered. (3) BLASTX homology; matches with an expected value less than 1e-7 were considered. For gene models with only one of these three types of supporting data, models with a predicted CDS greater than 100 amino acids were created or retained. In cases for which two or more types of supporting data existed, a gene model was created if the predicted CDS exceeded 50 amino acids.
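The thresholds described above can be summarized in a short Python sketch. The class and function names are hypothetical, and the sketch makes three simplifying assumptions: any cDNA/EST match counts as one evidence type without further weighting, the prediction score cutoff of 45 is applied uniformly rather than only in the absence of other data, and the BLASTX cutoff is read as an expected value of 1e-7. In practice these rules guided curator judgment rather than being applied mechanically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Supporting data for a candidate gene model (hypothetical structure)."""
    has_cdna_match: bool = False               # cDNA/EST alignment present
    prediction_score: Optional[float] = None   # GENSCAN or Genie score
    blastx_evalue: Optional[float] = None      # best BLASTX expected value


def count_evidence_types(ev: Evidence) -> int:
    """Count how many of the three evidence types pass their cutoff."""
    n = 0
    if ev.has_cdna_match:
        n += 1
    if ev.prediction_score is not None and ev.prediction_score > 45:
        n += 1
    if ev.blastx_evalue is not None and ev.blastx_evalue < 1e-7:
        n += 1
    return n


def supports_gene_model(ev: Evidence, cds_length_aa: int) -> bool:
    """Apply the CDS length rule: >100 aa with one type, >50 aa with two or more."""
    n = count_evidence_types(ev)
    if n == 0:
        return False
    if n == 1:
        return cds_length_aa > 100
    return cds_length_aa > 50
```

For example, a candidate with a predicted CDS of 80 amino acids would be rejected if supported only by an EST match, but accepted if a qualifying BLASTX match were also present.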
Is it one gene or several? Gene splits or merges were a common annotation correction and were based upon EST/cDNA data, BLASTX homologies, or corrections submitted by the community. Typically, the original erroneous gene models were based exclusively or primarily upon gene prediction data. A general comment referring to the type of data supporting the change was added to the annotation record.
What is the structure of the transcript(s)? Internal exon-intron structures were based primarily upon EST/cDNA data; if these data were absent, they were based upon gene prediction data; in some cases, approximate gene structures were inferred from BLASTX alignments. In practice, many annotations were based upon a combination of these data types. For example, when EST/cDNA data covered only the termini, internal structures were usually based upon gene prediction data. The 5' terminus of a transcript was extended to the start of the overlapping EST that extended furthest 5'. If no 5' EST data were available, it was extended to the first in-frame ATG consistent with the gene prediction or BLASTX data. The 3' terminus was extended to the 3' end of a complete cDNA, if available, or to the 3' end of an overlapping 3' EST; if cDNA/EST data were not available, it was extended to the first stop codon consistent with the gene prediction data or BLASTX alignment. For gene models in which the alignment of the closest terminal EST(s) ended outside, but within 60 bp of, the predicted coding region, the transcript was extended to include the ESTs. Whenever splice sites other than GT/AG were annotated, a comment was appended to the transcript.
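The splice-site check mentioned at the end of this step can be sketched in Python as follows. The function names and the coordinate convention (0-based, half-open intron coordinates on the forward strand) are assumptions made for illustration; the comment strings are taken from the sample controlled comments, but how the annotation tool attached them is not specified here.

```python
from typing import Optional, Tuple


def intron_splice_sites(genomic: str, intron_start: int,
                        intron_end: int) -> Tuple[str, str]:
    """Return the (donor, acceptor) dinucleotides of an intron.

    Assumes forward-strand sequence and 0-based, half-open coordinates,
    so the intron spans genomic[intron_start:intron_end].
    """
    donor = genomic[intron_start:intron_start + 2].upper()
    acceptor = genomic[intron_end - 2:intron_end].upper()
    return donor, acceptor


def splice_comment(genomic: str, intron_start: int,
                   intron_end: int) -> Optional[str]:
    """Return a controlled comment for a non-GT/AG intron, else None."""
    donor, acceptor = intron_splice_sites(genomic, intron_start, intron_end)
    if (donor, acceptor) == ("GT", "AG"):
        return None  # canonical splice sites need no comment
    if donor == "GC" and acceptor == "AG":
        return "GC splice donor site postulated"
    return "Unconventional splice site postulated"
```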
What is the extent of the coding region? The Apollo annotation tool set the translation start site to the 5'-most in-frame ATG in most cases, but curators sometimes used Genie- or GENSCAN-predicted starts of translation (and thereby inadvertently used a downstream ATG in the model; K. Burtis, S. Langley, J. Carlson, personal communication; these will be fixed in future releases) or set the start site to match the entry in SWISS-PROT. In cases supported by the literature, a non-ATG translation start or a downstream ATG was used. In some cases, especially for annotations supported only by BLASTX data, it was not possible to identify a likely ATG start codon. In such cases, translation was started at the 5'-most internal in-frame codon and an explanatory comment was added.
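One way to picture the default rule (5'-most in-frame ATG, with a fallback to the 5'-most in-frame codon) is the Python sketch below. It is not the Apollo implementation: the function name and arguments are hypothetical, the sketch assumes the caller already knows one codon position inside the open reading frame (for instance from a BLASTX alignment or gene prediction), and pairing the fallback with the particular comment string shown is likewise an assumption.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_translation_start(mrna: str,
                           in_frame_pos: int) -> Tuple[int, Optional[str]]:
    """Locate a translation start by scanning upstream in frame.

    mrna         -- spliced transcript sequence, 5' to 3'
    in_frame_pos -- 0-based index of the first base of any codon known
                    to lie inside the open reading frame (assumed input)

    Returns (start_index, comment). The comment is None when an ATG was
    found; otherwise the 5'-most in-frame codon is returned with a note.
    """
    mrna = mrna.upper()
    best_atg = None
    first_codon = in_frame_pos
    pos = in_frame_pos
    while pos >= 0:
        codon = mrna[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # an upstream in-frame stop ends the open frame
        if codon == "ATG":
            best_atg = pos  # keep moving 5'; later hits are further upstream
        first_codon = pos
        pos -= 3
    if best_atg is not None:
        return best_atg, None
    return first_codon, ("5' exon not determined "
                         "(no ATG translation start identified)")
```

Literature-supported exceptions, such as non-ATG starts or deliberately chosen downstream ATGs, fall outside this sketch and would have to be set by hand.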
How many alternative transcripts exist? Generally, we annotated as many alternative transcripts as were supported by the EST, cDNA, and community data. If non-contiguous EST data supported alternative exons in several regions of the gene, it was not always possible to determine which of all possible combinations actually exist in vivo. The number of these alternative transcripts created was left to the discretion of the annotator, and appropriate comments were added. However, it should be noted that combinations of 5' ESTs and 3' ESTs from different cDNA clones were used to make gene models, and this may have artificially increased the number of alternative transcripts, since not all of these combinations may exist in vivo.
Partial annotations were avoided whenever possible. If, among many ESTs, a single EST differed (and appeared otherwise to be valid), an alternative transcript was annotated and the comment "Only one EST supports this alternative transcript" was appended. If EST/cDNA data seemed suspect, for example, if they predicted a significantly shorter protein, more than one EST/cDNA was required to justify an annotation. Except in cases of non-overlapping alternative 5' or 3' exons, no systematic attempt was made to determine if alternative transcripts existed due to multiple transcription starts or multiple stops (alternative polyadenylation sites).
After completion of a gene model, the DGC cDNA clones corresponding to that gene were assessed. Problematic clones were flagged and marked as incomplete, containing unspliced introns, or possibly chimeric. Conversely, quality control tests performed on the sequences of the DGC clones, which involved comparison of their predicted protein products with those of genomic annotations, revealed a number of cases where small exons were not properly aligned in the curator's model. We believe these errors result from inherent limitations of the SIM4 alignment software.
Curator comments. The Apollo annotation tool allows for the inclusion of comments associated with an annotated gene or a specific transcript. We made extensive use of this capability, including controlled comments, and at times free text, whenever an annotation required clarification. The collection of controlled comments was developed during the initial re-annotation stages, and was used as often as possible to facilitate consistency and to provide a means of tracking or querying for various atypical gene structures. For example, all predicted splices that fail to use the canonical GT/AG donor and acceptor splice site dinucleotides were noted, as were genes that have been reported to make use of non-ATG translation starts, genes with overlapping UTRs (on the same strand), genes encoded on dicistronic transcripts, and genes known to be or appearing to be mutant in the sequenced strain. Many of the controlled comments addressed weaknesses or anomalies in the annotation: that an unusual alternative transcript was supported by a single EST, that incomplete supporting data required extension of a gene model to the nearest translation start or stop, or that an ATG translation start codon could not be identified. Genes that were split or merged were noted and the type of evidence supporting the change indicated. Finally, DGC cDNA clones that failed to accurately reflect the annotation, typically those that were incomplete or appeared to include intronic sequences, were noted.
Gene and Transcript Annotation Comments (sample text)
Only one EST supports this alternative transcript
EST data support existence of multiple transcripts
Unconventional splice site postulated
GC splice donor site postulated
Gene prediction data only
5' terminus extended to first start codon; no experimental data confirming this prediction
3' terminus extended to first stop codon; no experimental data confirming this prediction
EST data suggest additional 3' exon(s)
EST data suggest additional 5' exon(s)
Evidence indicates that 3' UTR overlaps 5' UTR of downstream gene; extends to coordinate AE00*: * (based upon 3' extent of *)
Although this gene model is supported by multiple pieces of evidence, the computed ORF is questionable because it is much smaller than the predicted transcript
Multiple ESTs homologous to non-coding strand of this gene
DGC clone * appears problematic: incomplete CDS
DGC clone * appears problematic: unspliced intron
DGC clone * appears problematic: chimeric
DGC clone * appears problematic: contains transposon sequences
First Priority for Reannotation (internal view only)
Transposon inserted in intron
Gene split based on BLASTX data
Gene split based on EST data
Gene merge based on BLASTX data; no experimental evidence for splice sites
Gene merge based on EST data
EST data support existence of multiple transcripts due to variable use of mini-exons
EST data support existence of multiple transcripts due to variable use of 5' exons
EST data support existence of multiple transcripts due to variable use of poly-A sites
5' exon not determined (no ATG translation start identified)
Unconventional translation start
Translation start as per FBrf
EST data support dicistronic gene model
May be component of a dicistronic gene; available data inconclusive
Probable mutation in sequenced strain: premature stop
Probable mutation in sequenced strain: [other]
Known mutation in sequenced strain
EST * from opposite strand used to extend gene model
Although multiple ESTs support this model, read-through of predicted intron results in shorter CDS
Putative repetitive region: the region matches several EST and P insertions that map to various locations in the genome