Date: 23 Oct 1999

http://www.fruitfly.org/seq_tools/datasets/scripts/

This directory contains several Perl and shell routines to generate
cross-validated training and test sets. These scripts are written for
the data files in

  http://www.fruitfly.org/seq_tools/datasets/Human
  http://www.fruitfly.org/seq_tools/datasets/Drosophila


scramble.pl

  usage: scramble.pl <genbank-flat-file>

  <genbank-flat-file> is a file of one or more GenBank entries, each
  terminated by "//". The program outputs the LOCUS IDs in 5 separate
  sets. This is the program that generates the ".sets" files.

  example: scramble.pl single_exon_GB.dat > single_exon_GB.sets


split.pl

  usage: split.pl <data_set_name> <set_number>

  <data_set_name> is the base name of a genbank-flat-file without its
  extension, i.e. without the ".dat". <set_number> selects one of the
  sets listed in the ".sets" file. The program uses the ".sets" file to
  extract the genes in that set and place them in "genes.test". The
  remaining entries in the data set are placed in "genes.train".

  example: split.pl combined_GB 3
  (places the genes listed in the 3rd set in "genes.test")


unpack.pl

  usage: unpack.pl <data_set_name>

  Creates a subdirectory and writes one file for each GenBank entry.
  Some programs may find the unpacked format more convenient.

  example: unpack.pl combined_GB


pack.sh

  usage: pack.sh

  Creates a single genbank-flat-file, as described above, from a
  subdirectory in which each entry is a single file.

=====
David Kulp (UCSC) dkulp@cse.ucsc.edu
Martin Reese (LBNL) martinr@bdgp.lbl.gov
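
The scramble.pl and split.pl steps described above can be sketched in
shell as follows. This is only an illustration, not the original Perl
code. The function names (locus_ids, make_sets, split_set) are
hypothetical, the "set-number locus-id" line format is an assumption and
may differ from the real ".sets" format, and dealing shuffled IDs
round-robin is just one plausible way to form the sets.

```shell
#!/bin/sh
# Hypothetical sketch of the scramble/split idea; not the original scripts.

# Print the LOCUS ID of every entry in a GenBank flat file, one per line.
locus_ids() {
    awk '$1 == "LOCUS" { print $2 }' "$1"
}

# Shuffle the LOCUS IDs and deal them round-robin into k sets (default 5).
# Assumed output format: one "<set-number> <locus-id>" pair per line.
make_sets() {
    file=$1
    k=${2:-5}
    locus_ids "$file" |
        awk 'BEGIN { srand() } { printf "%.8f\t%s\n", rand(), $0 }' |
        sort -n |
        cut -f2 |
        awk -v k="$k" '{ print (NR - 1) % k + 1, $0 }'
}

# Copy the entries whose LOCUS ID is in set n to genes.test and all
# other entries to genes.train (both written to the current directory).
split_set() {
    file=$1
    sets=$2
    n=$3
    : > genes.test
    : > genes.train
    awk -v n="$n" '
        NR == FNR { if ($1 == n) held[$2] = 1; next }   # read the sets file
        $1 == "LOCUS" { out = ($2 in held) ? "genes.test" : "genes.train" }
        out != "" { print > out }                       # copy entry lines
    ' "$sets" "$file"
}
```

For example, "make_sets combined_GB.dat 5 > my.sets" followed by
"split_set combined_GB.dat my.sets 3" would hold out the third set.
Repeating the second command for set numbers 1 through 5 gives the five
train/test pairs of a 5-fold cross-validation.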