Representative benchmark data sets of D. melanogaster DNA sequences

Please bookmark this site and NOT the sites of the data it is pointing to!

This site will serve as a stable interface to access the datasets, but the location of the sets will be subject to change.

!NEW! (10/15/2002) We added the sequence data sets used for the analysis of core promoter regions in the re-annotated Drosophila genome (see below).

At the current time (October 2002), we are submitting publications on the re-annotation of the complete D. melanogaster genome (release 3), and much of this annotation is based at least partly on computational tools that need to be trained on reliable data sets. We therefore aim to provide common sets to be shared among research groups as a stable basis for evaluating and comparing different methods for the analysis of D. melanogaster DNA sequences. Generating data sets of confirmed genes is very time-consuming, so we make our ready-to-use training and test sets available and encourage researchers in the community to use them when developing their methods. Common data sets also allow a fair and rigorous scientific comparison between different methods.

We provide these data sets "as is" in an effort to create a common data set to be used by all algorithms which are aimed towards gene finding and the identification of regulatory sequences (promoters and splice sites).

Our goal when generating these data sets was to make them relatively "clean" and to ensure that each sequence conforms to specific criteria, which are listed in the documentation files that accompany the data sets. To this end we used restrictive filters, which are run at irregular intervals on the databases to create larger sets as more data become available. Each set is divided into a number of disjoint parts which can be used for a cross-validated evaluation.
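As a minimal sketch of how the disjoint parts might be used for cross-validation, the Python snippet below holds out one part at a time as the test set and trains on the rest. The function name `cross_validation_folds` and the toy sequences are assumptions made for illustration; they are not part of the data sets or their documentation, and the real parts would be read from the distributed files.

```python
from typing import Iterator, List, Sequence, Tuple


def cross_validation_folds(
    parts: Sequence[Sequence[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training, test) pairs, holding out one disjoint part at a time."""
    for held_out in range(len(parts)):
        test = list(parts[held_out])
        # Training data is every sequence outside the held-out part.
        train = [
            seq
            for i, part in enumerate(parts)
            if i != held_out
            for seq in part
        ]
        yield train, test


# Toy stand-ins for the real parts; these sequences are made up.
example_parts = [["ACGT", "TTGA"], ["GGCA"], ["CATG", "AATT"]]

for fold, (train, test) in enumerate(cross_validation_folds(example_parts)):
    print(f"fold {fold}: {len(train)} training, {len(test)} test sequences")
```

Because the parts are disjoint, no test sequence appears in the corresponding training set, and every sequence is used for testing exactly once across the folds.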

These datasets were generated in a collaboration between the Informatics group of the Berkeley Drosophila Genome Project at the Lawrence Berkeley National Laboratory (LBNL), the Computational Biology Group at UC Santa Cruz, the Mathematics Department at Stanford and the Chair for Pattern Recognition at the University of Erlangen, Germany.

Currently, we can offer the following data sets to the scientific community:

A similar collection by the same authors is also available for human DNA sequences. Also, check out the Website for C. elegans gene finding resources.

In addition to these data sets, we also used data collected by other authors to evaluate the performance of our programs. These sets comprise

We would very much appreciate comments on the appropriateness of the data sets and on the results obtained with them. We also encourage anyone to create similar data sets for other DNA pattern recognition tasks.

Martin Reese (LBNL)
Uwe Ohler (University of Erlangen) (email)