Date: 18 Mar 1998 

This directory contains a set of FASTA-format sequence files which can be
used for the training and evaluation of promoter prediction
programs in human DNA. The set consists essentially of three parts: 
promoter sequences, CDS (coding) sequences, and non-coding (intron) sequences.
Each part contains five sequence files to be used for five-fold cross-validation.
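
A minimal sketch of how the five files of one part could be rotated for
five-fold cross-validation. The files are referred to by index only, and the
function name is hypothetical. This README does not state how the four
remaining files are divided between training and validation, so the sketch
simply holds out one file per fold and returns the other four together.

```python
def cross_validation_folds(n_files=5):
    """Return (remaining_indices, held_out_index) for each fold."""
    folds = []
    for held_out in range(n_files):
        # Every file except the held-out one stays available for training
        # (and, if desired, validation); the split between the two is an
        # assumption left to the user.
        remaining = [i for i in range(n_files) if i != held_out]
        folds.append((remaining, held_out))
    return folds


folds = cross_validation_folds()
for remaining, held_out in folds:
    print("held out:", held_out, "remaining:", remaining)
```

Because every file of a part holds the same number of sequences, each fold
then sees the same amount of held-out data.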

CDS sequences were generated as follows:

*) The starting point was five of the seven files (No 0--4) from the 1998 GENIE 
multiple exon genes set. These files contain unrelated CDS sequences.
*) The exons were concatenated to form single CDS sequences.
*) These complete CDS sequences were cut consecutively into non-overlapping
sequences of 300 bp. CDS sequences shorter than 300 bp and the remainder left
over at the end of each sequence were discarded.
*) Because the CDS sequences differ in length, the files contained different
numbers of sequences (178--192). To ensure an equal amount of
training/validation/test data for every cross-validation experiment, the
number of sequences in each file was reduced to the smallest number (178).
This was done by skipping single sequences rather than removing a whole block
at the end, to avoid the danger of missing whole genes.
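
The concatenation and cutting steps can be sketched as follows. This is an
illustration only, not the script that produced the files: the function name
and the toy exon strings are hypothetical.

```python
def cut_into_windows(sequence, window=300):
    """Cut a sequence into consecutive non-overlapping windows.

    The remainder at the end is dropped, and a sequence shorter than
    one window yields an empty list.
    """
    n_full = len(sequence) // window
    return [sequence[i * window:(i + 1) * window] for i in range(n_full)]


# Hypothetical exons of one gene: 450 bp + 300 bp = 750 bp of CDS.
exons = ["ATG" * 150, "CCA" * 100]
cds = "".join(exons)          # concatenate the exons into a single CDS
windows = cut_into_windows(cds)
print(len(windows), [len(w) for w in windows])
```

In this toy case the 750 bp CDS gives two 300 bp sequences and the last
150 bp are discarded.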


Intron sequences were generated as follows:

*) The starting point was five of the seven files (No 0--4) from the 1998 GENIE 
multiple exon genes set. These files contain intron sequences.
*) The sequences were cut consecutively into non-overlapping sequences of
300 bp. Intron sequences shorter than 300 bp and the remainder left over at
the end of each sequence were discarded.
*) Because the intron sequences differ in length and number, the files
contained different numbers of sequences (869--1722). To ensure an equal
amount of training/validation/test data for every cross-validation
experiment, the number of sequences in each file was reduced to the smallest
number (869). This was done by skipping single sequences rather than removing
a whole block at the end, to avoid the danger of missing the data of whole
genes.
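
One way to reduce a file to a fixed number of sequences by skipping single
entries is sketched below. Which sequences were actually skipped is not
recorded here. The even spacing used in this sketch is an assumption, and the
function name is hypothetical.

```python
def thin_to_size(items, target):
    """Keep `target` items spread over the whole list.

    Single items are skipped at roughly even intervals (an assumed
    scheme), so no contiguous block is lost at the end.
    """
    n = len(items)
    if target > n:
        raise ValueError("target exceeds the number of items")
    # Position k maps to floor(k * n / target); the indices are strictly
    # increasing because n >= target.
    return [items[(k * n) // target] for k in range(target)]


# Stand-in for the largest intron file: 1722 entries reduced to 869.
kept = thin_to_size(list(range(1722)), 869)
print(len(kept))
```

The kept entries come from the whole file, so the sequences of the last genes
are thinned like all the others instead of being cut off.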


The promoter data was generated as follows:

*) All vertebrate sequences (except retroviruses) of the independent subset 
were extracted from the Eukaryotic Promoter Database rel. 50 (575 sequences).
*) Entries with fewer than 40 bp upstream and/or 5 bp downstream were 
discarded, leaving 565 entries.
*) From each of these, 250 bp upstream and 50 bp downstream were extracted,
resulting in 300 bp long sequences, which may have flanking N regions where
data is missing at the beginning and/or end of the promoter region.
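
The filtering and extraction steps can be sketched as follows. This is an
illustration under stated assumptions, not the original procedure: the
function names are hypothetical, and `tss_index` is assumed to be the 0-based
index of the first downstream base, a convention that is not given here.

```python
def has_enough_context(sequence, tss_index, min_up=40, min_down=5):
    """True if at least min_up bases lie upstream and min_down downstream."""
    return tss_index >= min_up and len(sequence) - tss_index >= min_down


def extract_promoter_window(sequence, tss_index, upstream=250, downstream=50):
    """Return upstream + downstream bases, padded with N where data is missing."""
    start = tss_index - upstream
    end = tss_index + downstream
    left_pad = "N" * max(0, -start)                 # missing upstream data
    right_pad = "N" * max(0, end - len(sequence))   # missing downstream data
    core = sequence[max(0, start):min(len(sequence), end)]
    return left_pad + core + right_pad


# Hypothetical entry at the minimum allowed size: 40 bp upstream, 5 bp downstream.
entry = "A" * 40 + "C" * 5
assert has_enough_context(entry, tss_index=40)
promoter = extract_promoter_window(entry, tss_index=40)
print(len(promoter))
```

For this minimal entry the result is still 300 characters long: 210 N on the
left, the 45 known bases, and 45 N on the right.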