Date: 18 Mar 1998

This directory contains a set of FASTA-format sequence files that can be used for training and evaluating promoter prediction programs on human DNA. The set consists essentially of three parts: promoter sequences, CDS (coding) sequences, and non-coding (intron) sequences. Each part contains five sequence files to be used for five-fold cross-validation.

CDS sequences were generated as follows:

*) The starting point was five of the seven files (No 0--4) from the 1998 GENIE multiple exon genes set. These files contain unrelated CDS sequences.
*) The exons were concatenated to form single CDS sequences.
*) These complete CDS sequences were cut consecutively into non-overlapping sequences of 300 bp. Shorter sequences and the remainders at the end were discarded.
*) Because the CDS sequences differ in length, the files contained different numbers of sequences (178--192). To ensure an equal amount of training/validation/test data for every cross-validation experiment, the number of sequences in each file was reduced to the smallest number (178). This was done by skipping single sequences rather than dropping a whole block at the end, to avoid the risk of losing whole genes.

Intron sequences were generated as follows:

*) The starting point was five of the seven files (No 0--4) from the 1998 GENIE multiple exon genes set. These files contain intron sequences.
*) The sequences were cut consecutively into non-overlapping sequences of 300 bp. Shorter sequences and the remainders at the end were discarded.
*) Because the intron sequences differ in length and number, the files contained different numbers of sequences (869--1722). To ensure an equal amount of training/validation/test data for every cross-validation experiment, the number of sequences in each file was reduced to the smallest number (869). This was done by skipping single sequences rather than dropping a whole block at the end, to avoid the risk of losing the data of whole genes.

The promoter data was generated as follows:

*) All vertebrate sequences (except retroviruses) of the independent subset were extracted from the Eukaryotic Promoter Database rel. 50 (575 sequences).
*) Entries with fewer than 40 bp upstream and/or 5 bp downstream were discarded, leaving 565 entries.
*) From these, 250 bp upstream and 50 bp downstream were extracted, resulting in 300 bp long sequences, which may have flanking N regions where data is missing at the beginning and/or end of the promoter region.
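
The cutting and thinning steps used for the CDS and intron files can be illustrated with a minimal Python sketch. It is not the original tooling: the function names chunk_sequence and thin_evenly are hypothetical, and the rule for choosing which single sequences to skip (evenly spaced indices) is an assumption, since the description above only says that single sequences were skipped instead of a block at the end.

```python
def chunk_sequence(seq, size=300):
    """Cut seq into consecutive, non-overlapping windows of `size` bp.

    A trailing remainder shorter than `size` is discarded, as is any
    sequence that is shorter than `size` to begin with.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def thin_evenly(items, target):
    """Reduce `items` to `target` entries by skipping single items.

    Assumption: the skipped items are spread evenly over the list.
    The source does not state which sequences were actually skipped.
    """
    n = len(items)
    if target > n:
        raise ValueError("target exceeds the number of available items")
    # i * n // target is strictly increasing when n >= target,
    # so every selected index is distinct and within range.
    return [items[(i * n) // target] for i in range(target)]


# Toy example: concatenate exons into one CDS, then cut it into windows.
exons = ["A" * 400, "C" * 250]      # made-up exon sequences
cds = "".join(exons)                # 650 bp
windows = chunk_sequence(cds)       # two 300 bp windows, 50 bp discarded

# Thinning a file of 192 windows down to the smallest file size, 178.
kept = thin_evenly(list(range(192)), 178)
```

With evenly spaced skipping, no two dropped sequences are adjacent in this example, so a gene spanning several consecutive windows keeps most of its data.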
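
The promoter filtering and extraction steps can be sketched in the same way. This is an illustration under stated assumptions, not the procedure actually used: promoter_window is a hypothetical name, tss is taken to be the 0-based index of the transcription start site, and the start site itself is counted as the first downstream base. The description above does not specify that convention.

```python
def promoter_window(seq, tss, upstream=250, downstream=50,
                    min_up=40, min_down=5):
    """Return a fixed-length promoter window around `tss`, or None.

    Assumptions (not stated in the source): `tss` is a 0-based index
    and the base at `tss` counts as the first downstream base.
    """
    avail_up = tss                  # bases available before the TSS
    avail_down = len(seq) - tss     # bases available from the TSS onward
    if avail_up < min_up or avail_down < min_down:
        return None                 # entry has too little flanking data
    # Pad with N where the entry has no data on either side.
    left_pad = "N" * max(0, upstream - avail_up)
    right_pad = "N" * max(0, downstream - avail_down)
    core = seq[max(0, tss - upstream):tss + downstream]
    return left_pad + core + right_pad


# Entry with only 100 bp upstream and 20 bp downstream: padded to 300 bp.
short_entry = "A" * 100 + "C" * 20
padded = promoter_window(short_entry, tss=100)

# Entry with only 30 bp upstream: discarded.
rejected = promoter_window("A" * 30 + "C" * 60, tss=30)
```

Padding keeps every sequence at 300 bp with the start site at a fixed offset, which is what allows the promoter files to be compared directly with the 300 bp CDS and intron windows.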