Date: 08 Apr 1998

This directory contains a set of FASTA-format sequence files which can
be used for the training and evaluation of promoter prediction programs
on Drosophila DNA. The set consists essentially of three parts: promoter
sequences, CDS sequences, and non-coding (intron) sequences. Each part
contains three sequence files to be used for threefold crossvalidation.

CDS sequences were generated as follows:

*) The starting point was three of the five files (No. 0--2) from the
   1998 GENIE multiple exon genes set. Each of these files contains
   sequences from the same number of genes.
*) The exons were concatenated to form single CDS sequences.
*) These complete CDS sequences were cut consecutively into 300 bp long
   non-overlapping sequences. CDS sequences shorter than 300 bp and the
   leftover fragments at the ends were discarded.
*) Because the CDS sequences differ in length, the files contained
   different numbers of sequences (237--277). To ensure an equal amount
   of training/validation/test data for every crossvalidation
   experiment, the number of sequences in each file was reduced to the
   smallest number (237). This was done by skipping single sequences
   throughout the file rather than dropping a whole block at the end,
   to avoid the risk of losing whole genes.

Intron sequences were generated as follows:

*) The starting point was three of the five files (No. 0--2) from the
   1998 GENIE multiple exon genes set. Each of these files contains
   sequences from the same number of genes.
*) The intron sequences were cut consecutively into 300 bp long
   non-overlapping sequences. Intron sequences shorter than 300 bp and
   the leftover fragments at the ends were discarded.
*) Because the intron sequences differ in length and number, the files
   contained different numbers of sequences (80--101). To ensure an
   equal amount of training/validation/test data for every
   crossvalidation experiment, the number of sequences in each file was
   reduced to the smallest number (80). This was done by skipping single
   sequences throughout the file rather than dropping a whole block at
   the end, to avoid the risk of losing the data of whole genes.

The promoter data was generated as follows:

*) All sequences of the Drosophila Promoter Database (Irina Arkhipova,
   Harvard University) were taken as a starting point.
*) The subset containing EPD Drosophila entries was discarded, and the
   most recent Drosophila entries of the EPD rel. 50 independent subset
   were added.
*) Entries with fewer than 40 bp upstream and/or fewer than 5 bp
   downstream were discarded, leaving 256 entries.
*) From each of these, 250 bp upstream and 50 bp downstream were
   extracted, resulting in 300 bp long sequences which may have flanking
   N regions where data is lacking at the beginning and/or end of the
   promoter region.
*) These sequences were split into three crossvalidation files with 85
   promoter sequences each.

=====
Uwe Ohler (UCB) ohler@fruitfly.berkeley.edu
With help from: Martin Reese (LBNL) mgreese@lbl.gov
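
Illustrative sketch: the windowing, thinning, and promoter-extraction
steps described above could look roughly like this in Python. This is
not the code that was used to build these files. The function names are
hypothetical, the even-spacing rule in thin_evenly is only one possible
reading of "skipping single sequences", and promoter_window assumes a
0-based transcription start index that is counted as the first
downstream base.

```python
WINDOW = 300  # window length in bp, as stated in the text


def cut_windows(seq, size=WINDOW):
    """Cut seq into consecutive non-overlapping windows of `size` bp.

    A sequence shorter than `size` yields no windows, and the leftover
    fragment at the end is dropped.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def thin_evenly(items, target):
    """Reduce `items` to `target` entries by skipping single entries
    spread over the whole list instead of cutting off the end.

    ASSUMPTION: the source does not say which entries were skipped;
    even spacing is used here purely for illustration.
    """
    n = len(items)
    if target >= n:
        return list(items)
    return [items[(i * n) // target] for i in range(target)]


def promoter_window(seq, tss, upstream=250, downstream=50):
    """Return `upstream` bp before and `downstream` bp from index `tss`,
    padding with N wherever the entry does not cover the window.

    ASSUMPTION: `tss` is a 0-based index with 0 <= tss <= len(seq).
    """
    start, end = tss - upstream, tss + downstream
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad
```

For instance, thin_evenly(windows, 237) would bring a file holding 277
CDS windows down to 237 while still sampling from every part of the
file, and promoter_window applied to an entry with only 40 bp upstream
would return a 300 bp string beginning with 210 N characters.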