How Swiss-Prot and TrEMBL entries are used to validate gene models in
release 3 reannotation of the Drosophila genome

What are the files?
*******************

The Drosophila melanogaster Swiss-Prot and TrEMBL entries have been
divided into five datasets:

1) aa_thypo.dros.embl        - hypothetical TrEMBL entries
2) aa_treal.real.dros.embl   - real TrEMBL entries
3) aa_treal.genome.dros.embl - real TrEMBL entries that have only a
                               genome project reference
4) aa_shypo.dros.embl        - hypothetical Swiss-Prot entries
5) aa_sreal.dros.embl        - real Swiss-Prot entries

Within Swiss-Prot and TrEMBL, a hypothetical protein is defined as a
translation derived from an automated gene model prediction. There is
no evidence that this ORF is translated or expressed in vivo.

For Drosophila melanogaster, hypothetical proteins are those defined by
the current and terminated genome projects, together with transcription
units identified in the literature that have no known function.

Source of the files:
********************

Entries were extracted from the publicly available weekly external SPTr
release corresponding to release 42.5 of the Swiss-Prot Protein
Knowledgebase, 21-Nov-2003, and release 25.6 of TrEMBL, 28-Nov-2003.

Identification of hypothetical and real protein entries:
********************************************************

Whether a protein entry belongs in a 'real' or a hypothetical file was
decided on the basis of the gene symbol prefix in the FlyBase
cross-reference line. Entries that had a valid FlyBase symbol beginning
with CG, BG, EG or anon were considered to be hypothetical.

CG   corresponds to Release 1 and 2 Drosophila genome predicted
     annotations.
EG   corresponds to European Drosophila Genome Project predictions.
BG   corresponds to Berkeley Drosophila Genome Project predictions.
anon corresponds to transcription units identified in the literature
     that have no known function.

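The prefix rule can be illustrated with a short script. This is only a
minimal sketch, not the procedure that was actually used to build the
files: the function names are hypothetical, the regular expression
assumes the usual flat-file layout of a FlyBase DR line, and the prefix
test is a plain case-sensitive "starts with", which is an assumption
about how the rule was applied.

```python
import re

# Prefixes that this README lists as marking hypothetical gene symbols.
HYPOTHETICAL_PREFIXES = ("CG", "BG", "EG", "anon")

# Assumed layout of a FlyBase cross-reference line, e.g.
# "DR   FlyBase; FBgn0052799; CG32799."
FLYBASE_DR = re.compile(r"^DR\s+FlyBase;\s*(FBgn\d+);\s*(.+?)\.\s*$")


def flybase_symbol(dr_line):
    """Return the gene symbol from a FlyBase DR line, or None."""
    match = FLYBASE_DR.match(dr_line)
    return match.group(2) if match else None


def classify(dr_line):
    """Return 'hypothetical', 'real', or None for a non-FlyBase line."""
    symbol = flybase_symbol(dr_line)
    if symbol is None:
        return None
    # Assumption: a simple case-sensitive prefix test is sufficient.
    if symbol.startswith(HYPOTHETICAL_PREFIXES):
        return "hypothetical"
    return "real"


if __name__ == "__main__":
    for line in ("DR   FlyBase; FBgn0000363; cpo.",
                 "DR   FlyBase; FBgn0052799; CG32799."):
        print(flybase_symbol(line), classify(line))
```

Run on the two cross-reference lines quoted in this README, the sketch
labels cpo as real and CG32799 as hypothetical. A bare prefix test
could mislabel a real symbol that happens to begin with the same
letters, so a stricter pattern may be needed in practice.
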
The remainder of the entries have FlyBase cross-reference lines for
phenotypically, mutationally and/or molecularly defined genes. The cDNA
set defined by the Berkeley Drosophila Genome Project was considered to
be 'real' for this division of the proteins.

For example:

CPO_DROME (AC Q01617)
DR FlyBase; FBgn0000363; cpo.
is a real gene.

Y799_DROME (AC P83501)
DR FlyBase; FBgn0052799; CG32799.
is a hypothetical gene.

The FlyBase cross-references are actively maintained through
collaboration between Swiss-Prot/TrEMBL and FlyBase.

Generation of the files:
************************

Swiss-Prot and TrEMBL entries corresponding to the four types of gene
symbol (CG, BG, EG and anon prefixes) were put in the
aa_shypo.dros.embl and aa_thypo.dros.embl files, respectively.

The remainder of the Swiss-Prot and TrEMBL entries were put in the
aa_sreal.dros.embl and aa_treal.dros.embl files, respectively.

One further subdivision of aa_treal.dros.embl was made. Some of the
entries contain just one reference, which is a genome project
reference, yet have a gene symbol that is 'real'. In these entries the
CG/EG/BG/anon-prefixed name has been replaced because the gene sequence
has strong homology to a functionally defined gene or gene family.
These entries were placed in the aa_treal.genome.dros.embl file and the
remainder in aa_treal.real.dros.embl, which therefore contains the
TrEMBL entries identified by research groups.

Numbers of entries in the files:
********************************

1) aa_thypo.dros.embl        - 11906 entries
2) aa_treal.real.dros.embl   -  8344 entries
3) aa_treal.genome.dros.embl -   696 entries
4) aa_shypo.dros.embl        -   438 entries
5) aa_sreal.dros.embl        -  3410 entries

The United Protein Databases (UniProt) project will create a central
database of protein sequence and function by joining the forces of the
Swiss-Prot, TrEMBL and PIR protein database activities
(http://www.ebi.ac.uk/uniprot/). The project is funded by the U.S.
National Human Genome Research Institute (NHGRI), in cooperation with
five other institutes and centers at the National Institutes of Health
(NIH) (Grant Number: 1 U01 HG02712-01).

The broad, long-term objectives of this project are:

   To provide a stable and comprehensive resource for information on
   proteins, their sequences and their functions.

   To enable scientists to use these data to identify and analyse genes
   and their products and to make queries across databases containing
   complementary information.

   To provide efficient and unencumbered access to the database.

The specific aims are:

   To develop and maintain a central database of curated protein
   sequences with annotations of sequence and functional information.

   To facilitate use of the database by providing user-friendly
   interfaces, tools for simple and complex queries and for retrieval
   of large datasets, downloadable database records in a defined,
   parsable format, and user support services.

   To provide the flexibility and adaptability needed to be responsive
   to the changing needs of the scientific community.

References:
***********

Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A,
Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S,
Schneider M.
The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
Nucleic Acids Res. 31:365-370 (2003).

The FlyBase Consortium.
The FlyBase database of the Drosophila genome projects and community
literature.
Nucleic Acids Res. 31:172-175 (2003).

Adams MD, Celniker SE, Holt RA, et al.
The genome sequence of Drosophila melanogaster.
Science 287(5461):2185-2195 (2000).

Benos PV, Gatt MK, Ashburner M, et al.
From sequence to chromosome: the tip of the X chromosome of
D. melanogaster.
Science 287(5461):2220-2222 (2000).

Rubin GM, Hong L, Brokstein P, et al.
A Drosophila complementary DNA resource.
Science 287(5461):2222-2224 (2000).

Misra S, Crosby MA, Mungall CJ, et al.
Annotation of the Drosophila melanogaster euchromatic genome: a
systematic review.
Genome Biol. 3(12):RESEARCH0083 (2002).

Contact information:
********************

Eleanor Whitfield
European Bioinformatics Institute
EMBL Outstation
Wellcome Trust Genome Campus
Hinxton
Cambs CB10 1SD

Tel 0044 1223 494680
Fax 0044 1223 494468
email: eleanor@ebi.ac.uk