At present (October 2002), we are submitting publications on the re-annotation of the complete D. melanogaster genome (release 3), and much of this annotation is at least partly based on computational tools which need to be trained on reliable data sets. We therefore aim to provide common sets to be shared among research groups as a stable basis for the evaluation and comparison of different methods for the analysis of D. melanogaster DNA sequences. Generating data sets of confirmed genes is very time-consuming, so we make our ready-to-use training and test sets available and encourage researchers in the community to use them when developing their methods. Common data sets also allow a fair and rigorous scientific comparison between different methods.

We provide these data sets "as is" in an effort to create a common data set to be used by all algorithms which are aimed towards gene finding and the identification of regulatory sequences (promoters and splice sites).

Our goal when generating these data sets was to make them relatively "clean" and to ensure that each sequence conforms to specific criteria, which are listed in the documentation files that accompany the data sets. We therefore use restrictive filters, which are run at irregular intervals on the databases to create larger sets as more data become available. Each set is divided into a number of disjoint parts which can be used for a cross-validated evaluation.
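As an illustration of how the disjoint parts can be used, the following is a minimal sketch of a cross-validated evaluation in Python. The part names, the toy sequences, and the function cross_validation_splits are hypothetical and are not taken from the data sets or their documentation; the sketch only assumes that each part has already been loaded as a list of sequences.

```python
from typing import Dict, Iterator, List, Tuple


def cross_validation_splits(
    parts: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, training_sequences, test_sequences) per part.

    Each disjoint part is held out once as the test set, and all remaining
    parts are pooled into the training set.
    """
    names = sorted(parts)
    for held_out in names:
        test = list(parts[held_out])
        train = [seq for name in names if name != held_out for seq in parts[name]]
        yield held_out, train, test


# Toy stand-in for the real parts (hypothetical names and sequences).
parts = {
    "part1": ["ACGT", "TTGA"],
    "part2": ["GGCA"],
    "part3": ["CATG", "ATAT"],
}

splits = list(cross_validation_splits(parts))
for name, train, test in splits:
    # A real evaluation would train a model on `train` and score it on `test`.
    print(name, len(train), len(test))
```

Because the parts are disjoint, no test sequence appears in the corresponding training set, and every sequence is used for testing exactly once across all rounds.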

These data sets were generated in a collaboration between the Informatics group of the Berkeley Drosophila Genome Project at the Lawrence Berkeley National Laboratory (LBNL), the Computational Biology Group at UC Santa Cruz, the Mathematics Department at Stanford and the Chair for Pattern Recognition at the University of Erlangen, Germany.

Currently, we can offer the following data sets to the scientific community:

A similar collection by the same authors is also available for human DNA sequences. See also the website for C. elegans gene finding resources.

In addition to these data sets, we also used data collected by other authors to evaluate the performance of our programs. These sets comprise

Martin Reese (LBNL) martinr@bdgp.lbl.gov
Uwe Ohler (University of Erlangen) (email)