Date: Aug 1995

This directory contains the data sets used to test and train GENIE, as
described in:

  D. Kulp, D. Haussler, M.G. Reese, and F.H. Eeckman (1996).
  A generalized Hidden Markov Model for the recognition of human genes
  in DNA. In ISMB-96. St. Louis, AAAI/MIT Press.

File "removed_to 285" contains 19 genes that were removed later.
File "removed_to269" contains 16 genes that were removed in the new
GenBank version.

This directory contains 3 sets of entries in GSDB GenBank flat-file
format, taken from GenBank release 89 (1995), to be used to test and
train gene-finding algorithms. The original 305 genes were curated and
questionable entries were removed, so that only 285 genes remain (the
old data is in data_1995).

UCSC and LBNL provide these data sets "as is" in an effort to create a
common data set to be used by all gene-finding algorithms. We encourage
others to compare their results using these data sets.

Accompanying each data set is a ".sets" file listing 7 test/train
subsets. These subsets can be used for cross-validation.

The 3 data sets are:

  Only single exon genes:
    single_exon_GB.dat.gz
    single_exon_GB.sets

  Multiple exon genes:
    multi_exon_GB.dat.gz
    multi_exon_GB.sets

  Combined -- single and multiple exon genes:
    combined_GB.dat.gz
    combined_GB.sets

In addition, several Perl and shell routines are provided in the
scripts directory:

scramble.pl
  usage: scramble.pl <genbank-flat-file>

  <genbank-flat-file> is a file of one or more GenBank entries
  separated by "//". The program outputs the LOCUS IDs in 7 separate
  sets. This program generates the ".sets" files.

  example: scramble.pl single_exon_GB.dat > single_exon_GB.sets

split.pl
  usage: split.pl <data_set_name>, followed by a set number (see the
  example)

  <data_set_name> is the basename of a GenBank flat file, i.e. the
  file name without the ".dat" extension. The program uses the ".sets"
  file to extract the genes of the specified set and places them in
  "genes.test". The remaining entries in the data set are placed in
  "genes.train".

  example: split.pl combined_GB 3
  (places the genes listed in the 3rd set in "genes.test")

unpack.pl
  usage: unpack.pl

  Creates a subdirectory and writes a single file for each GenBank
  entry. Some programs may find the unpacked format more convenient.

  example: unpack.pl combined_GB

pack.sh
  usage: pack.sh

  Creates a single GenBank flat file, as described above, from a
  subdirectory in which each entry is a single file.

=====
David Kulp (UCSC)
Martin Reese (LBNL)
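
Note: for readers who want to reproduce the cross-validation workflow without the Perl scripts, the idea behind scramble.pl and split.pl can be sketched as follows. This is a minimal Python sketch and not the scripts' actual implementation. It assumes only what is stated above: entries end with a "//" line and are identified by their LOCUS ID. The function names, the random seed, and the round-robin assignment of IDs to the 7 sets are illustrative assumptions, and the on-disk layout of the ".sets" files is not modeled.

```python
import random


def read_entries(text):
    """Split GenBank flat-file text into entries.

    Each entry is assumed to end with a line containing only "//".
    Any trailing text without a closing "//" is ignored.
    """
    entries, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            entries.append("\n".join(current) + "\n")
            current = []
    return entries


def locus_id(entry):
    """Return the identifier that follows the LOCUS keyword."""
    for line in entry.splitlines():
        if line.startswith("LOCUS"):
            return line.split()[1]
    raise ValueError("entry has no LOCUS line")


def make_sets(ids, k=7, seed=0):
    """Shuffle the IDs and deal them into k disjoint sets (assumed scheme)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def split_test_train(entries, sets, n):
    """Entries whose ID is in set n (1-based) go to test, the rest to train."""
    test_ids = set(sets[n - 1])
    test = [e for e in entries if locus_id(e) in test_ids]
    train = [e for e in entries if locus_id(e) not in test_ids]
    return test, train
```

For 7-fold cross-validation, call split_test_train once for each n from 1 to 7, train on the "train" entries and evaluate on the "test" entries each time.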