Date: 23 Oct 1999

http://www.fruitfly.org/seq_tools/datasets/Drosophila

This directory contains a set of GenBank flat-file format entries of
multiple exon and single exon genes to be used to test and train
gene-finding algorithms for Drosophila melanogaster genomic DNA.

UC Berkeley and LBNL provide these data sets "as is" in an effort to
create a common data set to be used by all gene-finding algorithms.
We encourage others to compare their results using these data sets.

Accompanying the data set is a ".sets" file listing 5 test/train
subsets. These subsets can be used for cross-validation.

The files in this directory are:

  multi_exon_GB.dat.gz      GenBank format file
                            275 unique multi exon genes
                            (less than 80% BLASTN identity between genes)
  multi_exon_GB.sets        proposed cross-validation sets

  For specific training we have also split the complete GenBank
  entries into various gene feature sets:

  ./CDS_v109/               complete CDS
  ./intron_v109/            complete introns
  ./exon_v109/              complete exons
  ./promoter/               promoter sequences from the Eukaryotic
                            Promoter Database (EPD) '98

  single_exon_GB.dat.gz     GenBank format file
                            141 single exon genes
  single_exon_GB.sets       proposed cross-validation sets

In addition to this highly curated gene set we extracted all possible
sequences in GenBank for Drosophila melanogaster that had CDS feature
annotation. For this set only genes with identical coding transcripts
were removed.

  multi_exon_GB_all.dat.gz  Additional GenBank format file
                            455 genes (only identical genes excluded)
                            from mRNA and DNA entries in GenBank
  multi_exon_GB_all.sets    proposed cross-validation sets

  For specific training we have also split the complete GenBank
  entries into various gene feature sets:

  ./CDS_all_v109/           complete CDS
  ./exon_all_v109/          complete exons
  ./intron_all_v109/        complete introns

We also provide the Drosophila Genie data sets from 1996 and Spring
1998 in ./GENIE_96 and ./GENIE_98

A highly curated coding transcript data set can be found in
ftp://www.fruitfly.org/seq_tools/datasets/OtherDataSets/ASHBURNER_97

In addition, several Perl and shell routines for generating
cross-validated sets are provided in the scripts directory
(http://www.fruitfly.org/seq_tools/datasets/scripts/).

=====

Martin Reese (LBNL)       martinr@bdgp.lbl.gov

With help from:
David Kulp (UCSC)         dkulp@cse.ucsc.edu
Andrew Gentles (Stanford)
Uwe Ohler (UCB)           ohler@fruitfly.berkeley.edu
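
=====

Example: checking the number of entries in a data file

The gene counts listed above can be checked against a downloaded file.
The following is a minimal Python sketch, not part of the original
distribution. It assumes only the standard GenBank flat-file convention
that every entry begins with a LOCUS line and ends with a line
containing "//". The function names (iter_genbank_records, locus_name,
count_entries) are hypothetical helpers introduced for illustration,
and the layout of the ".sets" files is not assumed here.

```python
import gzip
from typing import Iterable, Iterator, List


def iter_genbank_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per GenBank entry (entries end with '//')."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines that appear between entries.
        if not line.strip() and not record:
            continue
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    # Tolerate a final entry that lacks the closing '//' terminator.
    if record:
        yield record


def locus_name(record: List[str]) -> str:
    """Return the entry name from the LOCUS line, or '' if none is found."""
    for line in record:
        if line.startswith("LOCUS"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else ""
    return ""


def count_entries(path: str) -> int:
    """Count entries in a gzip-compressed GenBank flat file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in iter_genbank_records(handle))
```

Calling count_entries("multi_exon_GB.dat.gz") on a local copy gives a
number to compare with the 275 genes stated above. That comparison
assumes one gene per GenBank entry, which this README does not state
explicitly.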