Date: Oct 7th 2002

This directory contains a set of FASTA-format sequence files that can be
used for training and evaluating promoter prediction programs on Drosophila
DNA. The set consists essentially of three parts: promoter sequences, CDS
sequences, and non-coding (intron) sequences.

We used this data set in the annotation of Drosophila core promoters as
described in Ohler U, Liao GC, Niemann H, Rubin GM, Computational Analysis
of Core Promoters in the Drosophila Genome (2002).

The 1998 subdirectory contains our previous, smaller set of sequences,
first used in Ohler U, Harbeck S, Niemann H, Noth E, Reese MG, Interpolated
Markov Chains for Eukaryotic Promoter Recognition, Bioinformatics 1999, and
later in the Genome Annotation Assessment Project (GASP).

CDS and intron sequences were generated as follows:
-----------------------------------------------------

* We took all fully sequenced cDNAs in the Drosophila Gene Collection as of
  Jan 1st, 2002, and aligned them with tblastx to the SwissProt database.
  We retained only those cDNAs for which a protein sequence aligned fully
  within the cDNA. Start and stop codons could thus be inferred from the
  first and last amino acid in the alignment. Using sim4, we aligned the
  cDNAs to the Drosophila release 2 sequence to retrieve the exon-intron
  structure.
* Next, we added these sequences to the representative set of Drosophila
  single- and multi-exon genes used in the 1999 version of the GENIE gene
  finder, subject to the rule that any coding sequence has <=80% identity
  to any other sequence in the set.
* We concatenated the coding parts of exons to form single CDS sequences.
  These complete CDS sequences were cut consecutively into non-overlapping
  sequences of 300 bp. Sequences shorter than 300 bp and any remainder at
  the end were discarded.
* Similarly, introns were cut consecutively into non-overlapping sequences
  of 300 bp. Sequences shorter than 300 bp and any remainder at the end
  were discarded.
* For cross-validation, these sequences were divided into five subsets,
  each containing sequences from an equal number of genes. To ensure an
  equal amount of training/validation/test data for every cross-validation
  experiment, the number of sequences in each file was reduced to the
  smallest number of CDS or intron sequences, respectively. A few ambiguous
  nucleotides were replaced randomly.
  (cds-part.*, intron-part.*)

Promoter sequences were determined as follows:
-----------------------------------------------

* As described in the paper, we selected 1,941 5' EST clusters based on
  several criteria. The promoter sequences here span positions -250 to +50
  relative to the end of the most 5' EST alignment of each cluster to the
  release 2 genome sequence. A few ambiguous nucleotides were replaced
  randomly.
  (promoter_all_1941.fa)
* In addition to this full set of 1,941 sequences, we provide a set of
  1,864 sequences from which redundant sequences were removed, using
  criteria similar to those of the Eukaryotic Promoter Database (EPD):
  sequences with more than 50% similarity to any other sequence in the set
  in the region from -60 to +40 were removed. This set was split into five
  equally sized subsets for cross-validation.
  (dro-part.*)
* Finally, to evaluate our predictor on the Adh region, we eliminated a
  further 23 sequences that aligned to the 2.9 Mb Adh region.
  (promoter_training_1842.fa)
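
For readers who want to prepare comparable sequences of their own, the
cutting, cleaning, and windowing steps described above can be sketched in
Python roughly as follows. This is a minimal illustration under stated
assumptions, not the code that produced the files in this directory. The
function names are hypothetical. The sketch assumes that ambiguous
nucleotides are replaced by a uniformly chosen base, that the genome is given
as a forward-strand string, and that the -250 to +50 window has no position 0
(so it is 300 bp long).

```python
import random

CHUNK_LEN = 300  # fragment length stated in this README


def cut_into_chunks(seq, size=CHUNK_LEN):
    """Cut seq consecutively into non-overlapping pieces of `size` bp.

    Sequences shorter than `size` yield nothing, and an incomplete
    remainder at the end is discarded.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def replace_ambiguous(seq, rng=None):
    """Replace every character other than A, C, G, T with a random base.

    Assumption: uniform choice among the four bases. The README only says
    that ambiguous nucleotides were "replaced randomly".
    """
    if rng is None:
        rng = random.Random()
    return "".join(
        c if c.upper() in "ACGT" else rng.choice("ACGT") for c in seq
    )


def promoter_window(genome, tss_index, upstream=250, downstream=50):
    """Return the window from -upstream to +downstream around a start site.

    Assumptions: `genome` is the forward-strand sequence, `tss_index` is
    the 0-based index of the +1 base, and there is no position 0, so the
    default window is 300 bp long. Returns None if the window would extend
    beyond either end of the sequence.
    """
    start = tss_index - upstream
    end = tss_index + downstream
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]
```

With these definitions, a 650 bp CDS gives two 300 bp fragments and the
final 50 bp are dropped. Reverse-strand start sites would additionally need
reverse complementing, which is left out here.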