Adh datasets for assessing the Genome Annotation Experiment results

One of the major challenges in evaluating the effectiveness of sequence annotation systems is the lack of powerful reference data sets.

We developed two data sets to help us evaluate participants' submissions:

Std1 (original ISMB file) is based on high-quality cDNA-to-genomic sequence alignments. Starting with a set of 80 full-length cDNA sequences from the Adh region, we ended up with 43 annotated transcripts (start_codon, stop_codon, exon, CDS, splice3, and splice5 features) that align strongly to the genomic sequence and whose splice sites match a simple "GT/AG" consensus and score well with a neural net splice site predictor.
We hope that this data set, with its narrow and stringent criteria, can be used as an effective estimate of a set of "known to be correct" annotations.
Std1_corrected is a further-corrected version (1/31/00) of the original ISMB std1. Five suspicious cDNA alignments were removed, leaving a total of 38 transcripts.
Std3 is based on the BDGP's annotations of the Adh region, as described in Ashburner et al. These annotations combine computational and biological research results under the supervision of experienced Drosophila biologists. With 222 transcript annotations, this set is much more extensive than std1. Approximately 182 of the annotations are similar to a known protein sequence or a Drosophila EST, while approximately 40 are based on computational results.
While there is less experimental evidence for std3's annotations, we hope that with its broad coverage and careful curation it can be used as an effective estimate of the full set of genes that exist in this region.
Additionally, we compiled a list of 92 5' UTR start sites from std3. All of them were confirmed by full-length cDNA alignment and contain a complete open reading frame downstream. This set of UTRs was used to evaluate the promoter predictions.
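
The "GT/AG" consensus criterion used when building std1 can be illustrated with a minimal sketch. This is not the code used to produce the data sets: the function names, the 1-based inclusive exon coordinates, and the forward-strand-only handling are all assumptions made for illustration, and the neural net splice site scoring step is not covered.

```python
def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) lying between sorted exons.

    `exons` is a list of (start, end) tuples; this coordinate convention is
    an assumption for the sketch, not taken from the data set files.
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def has_gt_ag_introns(genomic, exons):
    """Return True if every intron starts with GT and ends with AG.

    Only forward-strand transcripts are handled here; a reverse-strand
    transcript would need the sequence reverse-complemented first.
    """
    seq = genomic.upper()
    for start, end in introns_from_exons(exons):
        intron = seq[start - 1:end]  # convert 1-based inclusive to a slice
        if len(intron) < 4:
            return False
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


# Toy sequences (not real Adh data): exon, intron, exon.
good_genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
bad_genomic = "ATGGCC" + "GCAAGTTTTCAG" + "GATTAA"
toy_exons = [(1, 6), (19, 24)]

good_result = has_gt_ag_introns(good_genomic, toy_exons)
bad_result = has_gt_ag_introns(bad_genomic, toy_exons)
```

In this toy case the first sequence passes the check and the second fails, because its intron begins with GC rather than GT. A transcript with a single exon has no introns and therefore passes trivially.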
