Date: 08 Apr 1998

This directory contains a set of FASTA-format sequence files that can be
used for training and evaluating promoter prediction programs on
Drosophila DNA. The set consists essentially of three parts:
promoter sequences, CDS sequences, and non-coding (intron) sequences.
Each part contains three sequence files to be used for threefold
crossvalidation.


CDS sequences were generated as follows:

*) The starting point was three of the five files (No. 0--2) from the 1998
GENIE multiple exon genes set. Each of these files contains sequences from
the same number of genes.
*) The exons were concatenated to form single CDS sequences.
*) These complete CDS sequences were cut consecutively into non-overlapping
sequences of 300 bp. CDS sequences shorter than 300 bp and the remainder at
the end of each CDS sequence were discarded.
*) Because the CDS sequences differ in length, the files contained
different numbers of sequences (237--277). To ensure an equal amount of
training/validation/test data for every crossvalidation experiment, the
number of sequences in each file was reduced to the smallest number (237).
This was done by skipping single sequences rather than dropping a whole
block at the end, so as not to risk losing whole genes.
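
The concatenation and cutting steps above can be sketched in Python as
follows. This is only an illustration, not the script that produced the
files: the function names are hypothetical, and exons are assumed to be
given as plain strings in transcript order.

```python
WINDOW = 300  # window length in bp, as stated in the text


def cut_windows(sequence, window=WINDOW):
    """Cut a sequence into consecutive non-overlapping windows.

    Any trailing remainder shorter than `window` is dropped, so a
    sequence shorter than `window` yields no windows at all.
    """
    full = len(sequence) // window
    return [sequence[i * window:(i + 1) * window] for i in range(full)]


def cds_windows(exons, window=WINDOW):
    """Concatenate exon strings into one CDS, then cut it into windows."""
    return cut_windows("".join(exons), window)


# Toy example: exons of 350 bp and 300 bp give a 650 bp CDS, which
# yields two full windows; the last 50 bp are discarded.
example = cds_windows(["A" * 350, "C" * 300])
```

Note that a window may span an exon boundary, since cutting happens after
the exons have been joined.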


Intron sequences were generated as follows:

*) The starting point was three of the five files (No. 0--2) from the 1998
GENIE multiple exon genes set. Each of these files contains sequences from
the same number of genes.
*) The intron sequences were cut consecutively into non-overlapping
sequences of 300 bp. Intron sequences shorter than 300 bp and the remainder
at the end of each intron sequence were discarded.
*) Because the intron sequences differ in length and number, the files
contained different numbers of sequences (80--101). To ensure an equal
amount of training/validation/test data for every crossvalidation
experiment, the number of sequences in each file was reduced to the
smallest number (80).
This was done by skipping single sequences rather than dropping a whole
block at the end, so as not to risk losing the data of whole genes.
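
One way to reduce a file to the target count by skipping single sequences
is to keep entries at evenly spaced positions. The text does not say which
sequences were actually skipped, so the even spacing below is an assumption
made for illustration, and thin_evenly is a hypothetical name.

```python
def thin_evenly(items, target):
    """Keep `target` items by dropping single items spread across the list.

    Assumption: kept items sit at evenly spaced positions. The original
    selection rule is not documented.
    """
    n = len(items)
    if target > n:
        raise ValueError("target exceeds number of items")
    # With n >= target, these integer positions are strictly increasing,
    # so no item is picked twice.
    keep = [(i * n) // target for i in range(target)]
    return [items[k] for k in keep]


# Toy example: reduce 101 windows to 80, the counts quoted above.
kept = thin_evenly(list(range(101)), 80)
```

Because the dropped entries are scattered over the whole file, windows from
genes near the end of the file are still represented.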


The promoter data was generated as follows:

*) All sequences of the Drosophila Promoter Database (Irina Arkhipova, Harvard
University) were taken as a starting point.
*) The subset containing EPD Drosophila entries was discarded, and the most
recent Drosophila entries of the EPD rel. 50 independent subset were added.
*) Entries with fewer than 40 bp upstream and/or 5 bp downstream were
discarded, leaving 256 entries.
*) From each of these, 250 bp upstream and 50 bp downstream were extracted,
resulting in 300 bp long sequences, which may have flanking N regions where
data is missing at the beginning and/or end of the promoter region.
*) These sequences were split into three crossvalidation files with 85 promoter
sequences each.
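
The filtering and extraction steps above can be sketched as follows. The
coordinate convention is an assumption, since the text does not define it:
here tss is a 0-based index of the transcription start site, and the start
site itself is counted as the first downstream base. Both function names
are hypothetical.

```python
UPSTREAM = 250       # bp extracted upstream of the start site
DOWNSTREAM = 50      # bp extracted downstream (assumed to include the site)
MIN_UPSTREAM = 40    # minimum upstream bp required to keep an entry
MIN_DOWNSTREAM = 5   # minimum downstream bp required to keep an entry


def has_minimum_flanks(sequence, tss):
    """Check that an entry has enough sequence on both sides of `tss`."""
    return tss >= MIN_UPSTREAM and len(sequence) - tss >= MIN_DOWNSTREAM


def promoter_window(sequence, tss, upstream=UPSTREAM, downstream=DOWNSTREAM):
    """Return a fixed-length window around `tss`, padded with 'N'
    wherever the entry does not extend far enough."""
    start = tss - upstream
    end = tss + downstream
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(sequence))
    core = sequence[max(0, start):min(len(sequence), end)]
    return left_pad + core + right_pad


# Toy example: an entry with exactly 40 bp upstream and 5 bp downstream
# passes the filter and is padded with N on both sides.
short_entry = "A" * 40 + "C" * 5
window = promoter_window(short_entry, tss=40)
```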

=====
Uwe Ohler (UCB) ohler@fruitfly.berkeley.edu
With help from:
Martin Reese (LBNL) mgreese@lbl.gov