Adh datasets for assessing the Genome Annotation Experiment results
One of the major challenges in evaluating the effectiveness of 
sequence annotation systems is the lack of powerful reference data 
sets. 
 
We developed two data sets to help us evaluate participants' 
submissions:
-  Std1 (original ISMB file) is based on high-quality cDNA<->genomic
     sequence alignments.  Starting with a set of 80 full-length cDNA
     sequences from the Adh region, we retained 43 annotated transcripts
     (start_codon, stop_codon, exon, CDS, splice3, and splice5 
     features) with strong alignments to the genomic sequence and 
     whose splice sites matched a simple "GT/AG" consensus and scored 
     well using a neural net splice site predictor. 
 
     We hope that this data set, with its narrow and stringent 
     criteria, can be used as an effective estimate of a set of "known 
     to be correct" annotations. 
  
 
-  Std1_corrected is the original ISMB std1, further corrected on
     1/31/00.  Five suspicious cDNA alignments were removed, leaving a
     total of 38 transcripts.
 
 
-  Std3 is based on the BDGP's annotations of the Adh region, as 
     described in Ashburner et al.  These 
     annotations combine computational and biological research results 
     under the supervision of experienced Drosophila biologists.  With 
     222 transcript annotations, this set is much more extensive than 
     std1.  Approximately 182 of the annotations are similar to a 
     known protein sequence or a Drosophila EST, while approximately 40 
     are based on computational results. 
 
     While there is less experimental evidence for std3's annotations, 
     we hope that with its broad coverage and careful curation it can 
     be used as an effective estimate of the full set of genes that 
     exist in this region. 
 
 
-  Additionally, we compiled a list of 92 5' UTR start sites from 
     std3.  All of them were confirmed by full-length cDNA alignment 
     and contain a complete open reading frame downstream.  This set 
     of UTRs was used to evaluate the promoter predictions. 
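
The simple "GT/AG" splice site criterion used for std1 can be illustrated with a minimal sketch. It is not the filtering code actually used to build the data set: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, only forward-strand transcripts are handled, and the neural net splice site scoring step is not reproduced.

```python
def introns_from_exons(exons):
    """Return 1-based inclusive intron intervals between consecutive exons.

    `exons` is a list of (start, end) pairs on the forward strand
    (an assumption of this sketch).
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def has_gt_ag_consensus(genomic, exons):
    """Return True if every intron begins with GT and ends with AG."""
    seq = genomic.upper()
    for start, end in introns_from_exons(exons):
        intron = seq[start - 1:end]  # convert to 0-based, end-exclusive slice
        if len(intron) < 4:
            return False  # too short to hold both dinucleotides
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


if __name__ == "__main__":
    # Toy sequence: exon 1-6, intron 7-18 (GT...AG), exon 19-24.
    toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
    print(has_gt_ag_consensus(toy, [(1, 6), (19, 24)]))
```

A single-exon transcript has no introns and therefore passes this check trivially; a real filter would also need to handle the reverse strand.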
	