Date: 11 Apr 1998

ftp://www-hgc.lbl.gov/pub/genesets/Drosophila/GENIE_98/

This directory contains a set of GenBank flat-file format entries of
multiple exon and single exon genes to be used to test and train
gene-finding algorithms for Drosophila melanogaster DNA.

UC Berkeley and LBNL provide these data sets "as is" in an effort to
create a common data set to be used by all gene-finding algorithms.
We encourage others to compare their results using these data sets.

Accompanying the data set is a ".sets" file listing 5 test/train
subsets. These subsets can be used for cross-validation.

The files in this directory are:

    multi_exon_GB.dat.gz     GenBank format file
                             258 multi exon genes
    multi_exon_GB.sets       proposed cross-validation sets

    For specific training we have also split the complete GenBank
    entries into various gene feature sets:

    ./CDS_v105/              complete CDS
    ./intron_v105/           complete introns
    ./exon_v105/             complete exons

    single_exon_GB.dat.gz    GenBank format file
                             137 single exon genes
    single_exon_GB.sets      proposed cross-validation sets

In addition, several Perl and shell routines are provided in the
scripts directory (ftp://www-hgc.lbl.gov/pub/genesets/scripts/).

=====

Martin Reese (LBNL)          mgreese@lbl.gov

With help from:

David Kulp (UCSC)            dkulp@cse.ucsc.edu
Andrew Gentles (Stanford)
Uwe Ohler (UCB)              ohler@fruitfly.berkeley.edu
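
Appendix (illustrative sketch only, not one of the scripts shipped in the scripts directory): the Python fragment below shows one way the GenBank entries and the 5 test/train subsets described above might be combined for cross-validation. It assumes only the standard GenBank flat-file conventions (each entry starts with a LOCUS line and ends with a "//" line). The layout of the ".sets" files is not described in this README, so the sketch does not parse them. Instead it takes a hypothetical dictionary mapping subset names to lists of locus names, which the user would have to build from the ".sets" file by hand. The function names are likewise made up for illustration.

```python
import gzip
from typing import Dict, Iterable, Iterator, List, Tuple


def read_genbank_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (locus_name, record_text) for each GenBank flat-file entry."""
    locus = None
    buffer: List[str] = []
    for line in lines:
        if line.startswith("LOCUS"):
            # A LOCUS line opens a new entry; its second field is the name.
            fields = line.split()
            locus = fields[1] if len(fields) > 1 else None
            buffer = [line]
        elif locus is not None:
            buffer.append(line)
            if line.rstrip() == "//":  # "//" terminates a GenBank entry
                yield locus, "".join(buffer)
                locus, buffer = None, []


def leave_one_subset_out(
    subsets: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_ids, test_ids), holding out each subset once.

    `subsets` is a hypothetical mapping of subset name -> locus names;
    the real ".sets" layout is not documented here and may differ.
    """
    names = sorted(subsets)
    for held_out in names:
        train = [x for name in names if name != held_out for x in subsets[name]]
        yield held_out, train, list(subsets[held_out])


def open_genbank(path: str):
    """Open a gzip-compressed GenBank file (e.g. multi_exon_GB.dat.gz) as text."""
    return gzip.open(path, "rt")
```

A possible use, under the same assumptions: read all entries with `dict(read_genbank_records(open_genbank("multi_exon_GB.dat.gz")))`, then loop over `leave_one_subset_out(subsets)` and train on the `train_ids` entries while evaluating on the `test_ids` entries, so that each of the 5 subsets serves as the test set exactly once.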