Representative benchmark data sets of D. melanogaster DNA sequences
Please bookmark this site and NOT the sites of the data it is pointing to!
This site will serve as a stable interface to access the datasets, but the location of the sets will be subject to change.
!NEW! (10/15/2002) We added the sequence data sets used for the analysis of core promoter regions in the re-annotated Drosophila genome (see below).
At the current time (October 2002), we are submitting publications on the re-annotation of the complete D. melanogaster genome (release 3), and much of this annotation is at least partly based on computational tools which need to be trained on reliable data sets. We therefore aim to provide common sets, shared among research groups, as a stable basis for the evaluation and comparison of different methods for the analysis of D. melanogaster DNA sequences. Generating data sets of confirmed genes is very time-consuming, so we make our ready-to-use training and test sets available and encourage researchers in the community to use them for the development of their methods. Common data sets also allow a fair and rigorous scientific comparison between different methods.
We provide these data sets "as is" in an effort to create a common data set to be used by all algorithms which are aimed towards gene finding and the identification of regulatory sequences (promoters and splice sites).
Our goal when generating these data sets was to make them relatively "clean" and to ensure that each sequence conforms to specific criteria, which are listed in the documentation files that accompany the data sets. To this end we use restrictive filters, which are run at irregular intervals on the databases to create larger sets as more data becomes available. Each set is divided into a number of disjoint parts which can be used for a cross-validated evaluation.
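As a rough illustration of how the disjoint parts could be used for a cross-validated evaluation, the following Python sketch holds out one part per round and trains on the rest. The function name `cross_validation_splits` and the toy sequences are hypothetical and are not taken from the data sets or their documentation files; loading the actual parts is left out because the file layout is described only in the accompanying documentation.

```python
from typing import Iterator, List, Sequence, Tuple


def cross_validation_splits(
    parts: Sequence[Sequence[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training, test) pairs, holding out one disjoint part per round."""
    for held_out in range(len(parts)):
        # The held-out part is the test set for this round.
        test = list(parts[held_out])
        # All remaining parts are pooled into the training set.
        train = [
            seq
            for i, part in enumerate(parts)
            if i != held_out
            for seq in part
        ]
        yield train, test


# Toy stand-ins for the disjoint parts (hypothetical sequences, not real data).
parts = [["ACGT", "TTGA"], ["GGCA"], ["CATG", "AACC"]]
splits = list(cross_validation_splits(parts))

for round_no, (train, test) in enumerate(splits, start=1):
    print(f"round {round_no}: {len(train)} training, {len(test)} test sequences")
```

Because the parts are disjoint, no sequence appears in both the training and the test set of the same round, and every sequence is used for testing exactly once.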
These datasets were generated in a collaboration between the Informatics group of the Berkeley Drosophila Genome Project at the Lawrence Berkeley National Laboratory (LBNL), the Computational Biology Group at UC Santa Cruz, the Mathematics Department at Stanford and the Chair for Pattern Recognition at the University of Erlangen, Germany.
Currently, we can offer the following data sets to the scientific community:
The GENIE set of unrelated D. melanogaster genes. This data set was used to train the GENIE gene finding system developed at LBNL and UCSC. The last update was done in October 1998 using GenBank v.109. Please see the documentation file for further information, including links to previous versions of this data collection.
The collection of data for D. melanogaster splice sites used in the GENIE system. The splice sites were extracted from the 1996 GENIE data set using GenBank v.95. The set also contains negative samples. Further information
!NEW! The 2002 collection of data of D. melanogaster core promoter regions. The promoters were derived from a large set of cap-trapped full-length cDNAs, the non-promoters from representative sets for gene-finding augmented by full-length cDNAs. Further information (Our previous set from 1998 is still available here.)
In addition to these data sets, we also used data collected by other authors to evaluate the performance of our programs. These sets comprise:
- The collection of unique coding transcript sequences of D. melanogaster by Ashburner (1997). For additional D. melanogaster sequences, check out their FTP site.
We would very much appreciate comments on the appropriateness of the data sets and on the results obtained with them. Also, we encourage anyone to create similar data sets for other DNA pattern recognition tasks.