How Swiss-Prot and TrEMBL entries are used to validate gene models in
release 3 reannotation of the Drosophila genome

What are the files?
*******************

The Drosophila melanogaster Swiss-Prot and TrEMBL entries have been
divided into five datasets:

1) aa_thypo.dros.embl        - hypothetical TrEMBL entries
2) aa_treal.real.dros.embl   - real TrEMBL entries
3) aa_treal.genome.dros.embl - real TrEMBL entries that have only a
                               genome project reference
4) aa_shypo.dros.embl        - hypothetical Swiss-Prot entries
5) aa_sreal.dros.embl        - real Swiss-Prot entries

Within Swiss-Prot and TrEMBL, a hypothetical protein is defined as a
translation derived from an automated gene model prediction, with no
evidence that the ORF is translated or expressed in vivo. For
Drosophila melanogaster, hypothetical proteins are those defined by the
current and terminated genome projects, together with transcription
units identified in the literature that have no known function.

Source of the files:
********************

Entries were extracted from the weekly, publicly available external
SPTr corresponding to release 40.20 of the Swiss-Prot Protein
Knowledgebase and release 20.10 of TrEMBL (19 June 2002).

Identification of hypothetical and real protein entries:
********************************************************

Whether a protein entry belongs in a 'real' or a hypothetical file was
decided on the basis of the gene symbol prefix in the FlyBase
cross-reference line. Entries that had a valid FlyBase symbol beginning
with CG, BG, EG or anon were considered to be hypothetical.

CG   corresponds to Release 1 and 2 Drosophila genome predicted
     annotations
EG   corresponds to European Drosophila Genome Project predictions
BG   corresponds to Berkeley Drosophila Genome Project predictions
anon corresponds to transcription units identified in the literature
     that have no known function.
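
The prefix rule above can be sketched in a few lines of Python. This is
only an illustration, not the script that produced the files: the
function names are hypothetical, the layout assumed for the FlyBase
cross-reference line ("DR   FlyBase; FBgn...; symbol.") is taken from
the two entries CPO_DROME and Y799_DROME, and the test is a literal
"begins with" check. A plain prefix test of this kind could in principle
catch a named gene whose symbol happens to start with the same letters,
so a real pipeline may need stricter patterns.

```python
# Literal reading of the rule: a FlyBase symbol beginning with one of
# these prefixes marks the entry as hypothetical.
HYPOTHETICAL_PREFIXES = ("CG", "BG", "EG", "anon")


def flybase_symbol(dr_line):
    """Return the gene symbol from a 'DR FlyBase; FBgn...; symbol.' line.

    Returns None if the line is not a FlyBase cross-reference.
    (Hypothetical helper; the line layout is an assumption.)
    """
    if not dr_line.startswith("DR"):
        return None
    fields = [field.strip() for field in dr_line[2:].split(";")]
    if len(fields) < 3 or fields[0] != "FlyBase":
        return None
    symbol = fields[2]
    # Drop only the single terminating full stop; symbols such as
    # 'EG:100G10.1' contain full stops of their own.
    return symbol[:-1] if symbol.endswith(".") else symbol


def classify(dr_line):
    """Return 'hypothetical', 'real', or None for a non-FlyBase line."""
    symbol = flybase_symbol(dr_line)
    if symbol is None:
        return None
    if symbol.startswith(HYPOTHETICAL_PREFIXES):
        return "hypothetical"
    return "real"


if __name__ == "__main__":
    for line in (
        "DR   FlyBase; FBgn0000363; cpo.",
        "DR   FlyBase; FBgn0052799; CG32799.",
    ):
        print(line, "->", classify(line))
```

The check is case-sensitive on purpose, so that a lower-case symbol such
as cpo is not confused with a CG-prefixed annotation.
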

The remainder of the entries have FlyBase cross-reference lines for
phenotypically, mutationally and/or molecularly defined genes. The cDNA
set defined by the Berkeley Drosophila Genome Project was considered to
be 'real' for this division of the proteins.

For example:

CPO_DROME (AC Q01617)
DR   FlyBase; FBgn0000363; cpo.
is a real gene.

Y799_DROME (AC P83501)
DR   FlyBase; FBgn0052799; CG32799.
is a hypothetical gene.

The FlyBase cross-references are actively maintained through
collaboration between Swiss-Prot/TrEMBL and FlyBase.

Generation of the files:
************************

Swiss-Prot and TrEMBL entries corresponding to the 4 types of gene
symbol (CG, BG, EG and anon prefixes) were put in the
aa_shypo.dros.embl and aa_thypo.dros.embl files, respectively. The
remainder of the Swiss-Prot and TrEMBL entries were put in the
aa_sreal.dros.embl and aa_treal.dros.embl files, respectively.

One further subdivision of aa_treal.dros.embl was made. Some of its
entries contain just one reference, which is a genome project
reference, yet have a gene symbol that is 'real'. In these entries the
CG/EG/BG/anon-prefixed name has been replaced because the gene sequence
has strong homology to a functionally defined gene or gene family.
These entries were placed in the aa_treal.genome.dros.embl file and the
remainder in aa_treal.real.dros.embl, which therefore contains the
TrEMBL entries that were identified by research groups.

Numbers of entries in the files:
********************************

1) aa_thypo.dros.embl        - 10841 entries
2) aa_treal.real.dros.embl   -  3906 entries
3) aa_treal.genome.dros.embl -  1716 entries
4) aa_shypo.dros.embl        -   180 entries
5) aa_sreal.dros.embl        -  1445 entries

United Protein Databases (UniProt):
***********************************

The United Protein Databases (UniProt) project will create a central
database of protein sequence and function by joining the forces of the
Swiss-Prot, TrEMBL and PIR protein database activities.

The protein sequences will be stored in UNIPARC. There will be
cross-references to EMBL, TrEMBL, Swiss-Prot, PIR-PSD, PIR-PSD archive,
EnsEMBL, RefSeq, EMSD, EPO, USPO and JPO entries. Each unique sequence
will be stored only once and will get a unique sequence identifier.
Only database cross-references that are specific to the protein
sequence will be stored (e.g. /protein_id in EMBL). It will be possible
to request the cross-references as retrieved from a particular release
or update.

UNIPARC is planned to be available from 1 April 2003.

References:
***********

Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A,
Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S and
Schneider M.
The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
Nucleic Acids Res. 31:365-370 (2003).

The FlyBase Consortium.
The FlyBase database of the Drosophila genome projects and community
literature.
Nucleic Acids Res. 31:172-175 (2003).

Adams MD, Celniker SE, Holt RA, et al.
The genome sequence of Drosophila melanogaster.
Science 287(5461):2185-2195 (2000).

Benos PV, Gatt MK, Ashburner M, et al.
From sequence to chromosome: the tip of the X chromosome of
D. melanogaster.
Science 287(5461):2220-2222 (2000).

Rubin GM, Hong L, Brokstein P, et al.
A Drosophila complementary DNA resource.
Science 287(5461):2222-2224 (2000).

Misra S, Crosby MA, Mungall CJ, et al.
Annotation of the Drosophila melanogaster euchromatic genome: a
systematic review.
Genome Biol. 3(12):RESEARCH0083 (2002).

Contact information:
********************

Eleanor Whitfield
European Bioinformatics Institute
EMBL Outstation
Wellcome Trust Genome Campus
Hinxton
Cambs CB10 1SD

Tel 0044 1223 494680
Fax 0044 1223 494468
email: eleanor@ebi.ac.uk