Date: 28 Oct 1996

This directory contains 3 sets of GSDB GenBank flat-file format entries
to be used to test and train gene-finding algorithms for human DNA.

UCSC and LBNL provide these data sets "as is" in an effort to create a
common data set to be used by all gene-finding algorithms. We encourage
others to compare their results using these data sets.

Accompanying each data set is a ".sets" file listing 9 test/train
subsets. These subsets can be used for cross-validation. Data sets
PART0 - PART6 are consistent with the old cross-validation from 1995
(see ../data_1995). Data sets PART7 and PART8 contain only new intron
genes that were added to GenBank between June '95 and July '96.

The 3 data sets are:

Only single exon genes:
    single_exon_GB.dat.gz
    single_exon_GB.sets

Multiple exon genes:
    multi_exon_GB.dat.gz
    multi_exon_GB.sets

Combined -- single and multiple exon genes:
    combined_GB.dat.gz
    combined_GB.sets

We also provide our data set split into intron and exon sequences:
    2107exons.gz
    1754introns.gz

=====
David Kulp (UCSC)
Martin Reese (LBNL)
mgreese@lbl.gov
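
Note on cross-validation: the 9 test/train subsets mentioned above can be
rotated so that each subset serves once as the test set while the other 8
are used for training. The following is a minimal sketch of that rotation
only. The function name cross_validation_folds is hypothetical, and the
layout of the ".sets" files is not described in this README, so mapping
subset names to actual entries is left out.

```python
def cross_validation_folds(part_names):
    """Yield (test_part, train_parts) pairs, holding out one subset at a time."""
    for held_out in part_names:
        train_parts = [p for p in part_names if p != held_out]
        yield held_out, train_parts


# Subset names PART0..PART8 follow this README. How each name maps to
# entries inside a .sets file is an open assumption and is not handled here.
parts = [f"PART{i}" for i in range(9)]
folds = list(cross_validation_folds(parts))

for test_part, train_parts in folds:
    print(test_part, "<-", ",".join(train_parts))
```

Passing only PART0 through PART6 would restrict the folds to the subsets
described above as consistent with the 1995 cross-validation.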