Date: 23 Oct 1999

http://www.fruitfly.org/seq_tools/datasets/Human

This directory contains a set of GenBank flat-file format entries of
multiple-exon and single-exon genes, to be used to test and train
gene-finding algorithms for human genomic DNA.

UCSC and LBNL provide these data sets "as is" in an effort to create a
common data set to be used by all gene-finding algorithms. We encourage
others to compare their results using these data sets.

Accompanying each data set is a ".sets" file listing 7 test/train
subsets. These subsets can be used for cross-validation.

The files in this directory are:

  multi_exon_GB.dat.gz     GenBank format file, 462 multi-exon genes
  multi_exon_GB.sets       proposed cross-validation sets

For specific training we have also split the complete GenBank entries
into various gene feature sets:

  ./CDS_v105/              complete CDS
  ./intron_v105/           complete introns
  ./exon_v105/             complete exons
  ./genomic_DNA_v105/      complete genomic DNA from start codon to stop codon
  ./coding_data/           all possible unrelated CDSes in GenBank

  single_exon_GB.dat.gz    GenBank format file, 331 single-exon genes
  single_exon_GB.sets      proposed cross-validation sets

We also provide the Genie data sets from 1995 and 1996 in ./GENIE_95
and ./GENIE_96.

The published Burset/Guigo data set (GENOMICS '95) can be obtained at
ftp://www-hgc.lbl.gov/pub/genesets/OtherDataSets/GUIGO_96

In addition, several Perl and shell routines for generating
cross-validated sets are provided in the scripts directory
(http://www.fruitfly.org/seq_tools/datasets/scripts/).

=====
Martin Reese (LBNL)        martinr@bdgp.lbl.gov

With help from:
David Kulp (UCSC)          dkulp@cse.ucsc.edu
Andrew Gentles (Stanford)
Uwe Ohler (UCB)            ohler@fruitfly.berkeley.edu
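
Note on using the test/train subsets: the sketch below shows one
possible way to turn a fixed list of subsets into cross-validation
folds, holding out one subset for testing and training on the rest.
It is not one of the provided Perl or shell routines, and it does not
parse the ".sets" files, whose layout is not described here. The
function name, subset names, and entry IDs are hypothetical
placeholders; the real files list 7 subsets, which would give 7 folds.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_subset_out(
    subsets: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_ids, test_ids), once per subset.

    Each subset is used exactly once as the test set; all remaining
    subsets are concatenated (in sorted name order) to form the
    training set.
    """
    names = sorted(subsets)
    for held_out in names:
        test_ids = list(subsets[held_out])
        train_ids = [
            entry_id
            for name in names
            if name != held_out
            for entry_id in subsets[name]
        ]
        yield held_out, train_ids, test_ids


# Hypothetical subset names and entry IDs, for illustration only.
example_subsets = {
    "set1": ["ENTRY_A", "ENTRY_B"],
    "set2": ["ENTRY_C"],
    "set3": ["ENTRY_D", "ENTRY_E"],
}

folds = list(leave_one_subset_out(example_subsets))
for name, train_ids, test_ids in folds:
    print(name, "train:", len(train_ids), "test:", len(test_ids))
```

Keeping the subsets fixed, rather than reshuffling entries for every
experiment, means that different gene finders are evaluated on the
same held-out genes, so their results can be compared directly.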