!dros_sequence_set.README.v3.87
!October 25 2003

TRAINING AND SCREENING SETS FOR DROSOPHILA MELANOGASTER.
========================================================

These sets of sequences were compiled by Takis Benos (EBI), Leyla
Bayraktaroglu (Harvard) and Michael Ashburner (EBI & Cambridge) with help
from Aubrey de Grey (Cambridge), Joe Chillemi (Harvard) and Martin Reese
(LBL). We thank Suzi Lewis (Berkeley) for inspiration and discussion, and
Guochun Liao (Berkeley) for his repeat sequence set and newly discovered
transposable element sequences from the Berkeley P1 clones.

These sequence sets have been designed to provide either training sets for
gene prediction or screening sets for Drosophila melanogaster genomic
sequence data.

These files are available from:

ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/

    nuclear_cds_set
    mitochondrial_cds_set
    repeat_sequence_set
    transposon_sequence_set
    unique-50-genes.80.fa
    unique-50-genes.50.fa

Files with .embl extension are in EMBL format.
Files with .fa extension are in FASTA format.

[nuclear_cds.embl.v1.5 was the last release made only by Method 1, see
below. It will remain available. After v1.5 all nuclear_cds.embl sets are
combined sets made by Method 1 and Method 2.]

The directory:

ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/ARCHIVE

will include old versions of sequence data sets.

*** FOR VERSION RELEASE NOTES (below) SEARCH ON VERSION NUMBER. ***

===============================================================================

~/nuclear_cds_set.embl
======================

This sequence set was made by two methods. Sets up to and including
nuclear_cds_set.embl.v1.5 were made by Method 1. Set v2.0 and subsequent
sets were made by combining set v1.5 with that made by Method 2.

Method 1.
---------

1. Records were retrieved from the EMBL and EMBLNEW data libraries at the
   EBI using the search terms "Drosophila melanogaster" + "CDS" with the
   public version of SRS5 (http://srs.ebi.ac.uk:5000).

2. Using the AC numbers stored by FlyBase, FlyBase gene symbols were added
   to all records as:

   FT /db_xref="FLYBASE:"

3. The correspondence between AC number, FBgn number, PID and
   SWISS-PROT/TREMBL identifiers was checked against FlyBase. Some
   incorrect relationships (in both EMBL and FlyBase) were corrected for
   this dataset and were reported back to FlyBase (from where they will be
   reported to EBI/NCBI/DDBJ).

4. For those records which lacked FlyBase identifier numbers the sequence
   AC & PID numbers were used to add both FlyBase gene symbols and FlyBase
   FBgn identifiers to all records. Records for genes not yet assigned FBgn
   identifiers have no FT /db_xref="FLYBASE:FBgn" line.

5. All FlyBase genes encoded by the mitochondrial genome or transposons
   were filtered out.

6. All FlyBase genes with >1 record were examined and filtered so that for
   each gene there remains only a single record. The following filter
   criteria were used:

   a. If >1 record from the same gene and both start with ATG and end with
      a terminator then only the longest was retained. If >1 were of the
      same length then the choice was arbitrary.

   b. All records start with ATG, unless a translation_exception is an
      explicit feature qualifier in the record (listed below). These all
      have an added:

      CC !!start_exception=""

      line.

   c. All records end with TAA|TAG|TGA. If this was missing from the
      feature then the source record was examined in case of
      mis-annotation. If there was no mis-annotation the record was
      deleted. If there was, the record was edited and a comment line that
      begins:

      CC !!CDS annotation wrong .

      was added to the record. A list of these is included below.

7. Records derived from Berkeley P1 sequences were filtered out.

8. All extraneous FT and CC lines of source were deleted. Only these were
   retained:

   CC !!
   FT CDS [nucleotides of parent sequence]
   FT /db_xref="FLYBASE:"
   FT /db_xref="FLYBASE:FBgn"
   FT /db_xref="PID:"
   FT /db_xref="SPTREMBL:"
   FT /db_xref="SWISS-PROT:"
   FT /translation=""

9. All edited sequence records have this comment line:

   CC !!

All positions within these CC lines refer to daughter sequences, not parent,
unless it is obvious from context that a direct comparison is being made.

Method 2.
---------

1. A list of all genes and accessions in FlyBase was made available by
   Aubrey de Grey 12/15/98. Those already in nuclear_cds_set.embl.v1.5 were
   removed, and the bodies of the remaining accessions were obtained from
   EMBL.

2. Transposons, non-coding RNAs, pseudogenes, mitochondrial sequences and
   viruses were removed.

3. All FlyBase genes with >1 record were examined and filtered so that for
   each gene there remains only a single record. The following filter
   criteria were used:

   a. If >1 record from the same gene and both start with ATG and end with
      a terminator then only the longest was retained. If >1 were of the
      same length then the choice was arbitrary.

   b. All full records start with ATG, unless a translation_exception is an
      explicit feature qualifier in the record (listed below). These all
      have an added:

      CC !!start_exception=""

      line.

   c. All records end with TAA|TAG|TGA. If this was missing from the
      feature then the source record was examined in case of
      mis-annotation. If so, then the record was edited and a comment line
      that begins:

      CC !!CDS annotation corrected from

      was added to the record.

4. a. For incomplete CDS's, the longest available CDS was selected. If the
      choice was between a longer incomplete CDS and a shorter complete
      CDS, the shorter complete CDS was selected. If the choice was between
      a longer predicted CDS and a shorter experimental CDS, the shorter
      experimental CDS was selected.

   b. <> signs were entered around the coordinates of incomplete CDS's that
      had not been marked correctly in databank records. Subsequently,
      incomplete CDS's were automatically marked with:

      CC !!Thought to be an incomplete sequence.

   c. If translation of an incomplete CDS started at the 2nd or 3rd codon,
      the 5' orphan nucleotides were removed so all CDS's now start with a
      complete codon.
      If there were only 2 nucleotides specifying the last codon, but the
      codon assignment could be made unambiguously, these cases were marked
      with:

      CC !!Last base of last codon unknown, assignment unambiguous.

      3' CDS sequences not assigned to a codon were removed.

   d. Fragments encoding less than 50 amino acids were removed.

5. Predicted genes were marked with:

   CC !!Predicted gene.

   Records of EDGP and BDGP predicted genes were marked with:

   CC !!Predicted gene by EDGP.
   CC !!Predicted gene by BDGP.

6. Genes predicted based on PCR fragments were marked with:

   !!Predicted based on PCR fragment.

7. For either unannotated CDS entries, or for annotated CDS's that could be
   extended, new coordinates were curated by the FB curator, and marked
   with the CC line:

   CC !!CDS compiled by FB curator.

   If more than one accession was involved, the CC line includes the
   accession numbers:

   CC !!CDS compiled by FB curator from Acc#1 and Acc#2.

8. The formatting of the file was done as a last step, to comply with step
   8 of dros_sequence_sets.doc: All extraneous FT and CC lines of source
   were deleted. Only these were retained:

   CC !!
   FT source
   FT CDS
   FT /db_xref="FLYBASE:"
   FT /db_xref="FLYBASE:FBgn"
   FT /db_xref="PID:"
   FT /db_xref="SPTREMBL:"
   FT /db_xref="SWISS-PROT:"
   FT /translation=""

9. All edited sequence records have this comment line:

   CC !!

All positions within these CC lines refer to daughter sequences, not parent,
unless it is obvious from context that a direct comparison is being made.

July 30 1997. v1.0:
====================

Database searched: EMBL version 51; SRS indexed 18 June 1997.
                   EMBLNEW indexed 5am 22 July 1997.

Source records: 2998.
Retained records: 1345.

Records edited: X04427; X75541; Y00133; X05245; X00854; X07311; X07181;
                X79243; V00248; Y00049.

Non-ATG starts: U22825; M63724; Z14974; L11345.

Records deleted due to known problems: M25662.

Records compiled from >1 AC number: M25292+M25294 [given ID & AC XX1111]

Known problems:

1.
   The cases of translation start correction need to be checked with the
   literature. Cavener and Thummel asked for advice.

2. In-frame stop codons, translated as seleno-cysteine, not dealt with.
   The only case known is kel, and the 'second' ORF, 3' to the TGA codon
   of L08483, is not included. Cooley asked for advice.

July 31 1997. v1.1:
====================

1. Cavener confirms the translational start exceptions as ok.

2. Cooley confirms in-frame TGA in kelch (FBgn0001301).

   ID DMRCPA_4; parent: DMRCPA
   and
   ID DMRCPA_6; parent: DMRCPA
   combined to:
   ID DMRCPA_X; parent: DMRCPA
   by joining both ORF's of L08483.

3. oaf (FBgn0011818) also has in-frame stop codon.

   ID DMTNDPRO_5; parent: DMTNDPRO
   and
   ID DMTNDPRO_9; parent: DMTNDPRO
   combined to:
   ID DMTNDPRO_X; parent: DMTNDPRO
   by joining both ORF's of L31349.

August 6 1997. v1.2:
====================

File cleaned as follows after checks by Martin Reese at LBL:

1. Mis-annotation of CDS of U61976 corrected (incorrect start).

2. Base 2034 of U49439 changed from 'S' to 'N'.

3. Following records deleted: Y09063; Y12701; K01664.

4. Following records edited to correct annotation of terminator:
   V00204; X04426; M62975; X15400; X58826.

August 13 1997. v1.3:
=====================

1. Following records deleted (partial): X01524; X16992; Y09064; Y10911;
   X78219; Y10910.

2. Following records edited to correct annotation of terminator:
   S41484; X03414; Y00308; X04896; X14215; X13331; X05245; V00204; M71249;
   X03121; X04426; Y00228; X07194.

3. The data set includes 1336 sequences.

August 25 1997. v1.4:
=====================

1. U73823 deleted (non-ATG start).

February 18 1999. v.1.5:
========================

1. This is the original training set, identical in content to v1.4 but
   with a correction to the syntax of the 'source' and 'CDS' feature lines
   and updates on some of the db_xref lines.

March 1 1999. v2.0:
===================

1. This is the first released set that combines those made by Method 1 and
   Method 2.
   The reason for this was both to update the sequence sets and to provide
   better coverage for screening newly sequenced genomic data. For the
   latter reason the second set (included in v2.0) includes partial
   sequences, and sequences annotated as predicted genes from the EDGP and
   BDGP sequence data. The following notes are in addition to those
   inherited from v1.5.

   Annotation corrected: Osbp (Y13951), Fs(2)Ket (AJ002729), byn (S74163).

   Annotation extended to include stop codon: dnt (AJ224361),
   Gyc32E (X72800).

   Complete CDS annotated: crm (Y13674), faf (L04959), Fdxh (X06542).

   Partial CDS annotated: kz (X64418), Rtc1 (X04754), bcn92 (Z46608),
   gdl (X58286).

   CDS extended: lab (X13104,X13103), nAcR&bgr;-96A (Y14678,X55676),
   Acp98AB (U85762;U90949).

2. In this version the following genes are annotated as having non-ATG
   start codons:

   amn         (FBgn0000076)  (CTG)
   Cha         (FBgn0000303)  (GTG)
   cpo         (FBgn0000363)  (CTC)
   ewg         (FBgn0005427)  (CTG)
   l(2)k10201  (FBgn0016970)  (CTG)
   anon-fe2D7  (FBgn0022342)  (CTG)

3. In this version the following genes are annotated as having in-frame
   termination codons:

   oaf  (FBgn0011818)  (translated as X)
   kel  (FBgn0001301)  (translated as X)

4. This combined set includes 2053 sequences.

March 17 1999. v2.1:
====================

1. Addition of new records; total number 2062.

April 15 1999. v2.2:
====================

1. A parse error which resulted in the retention of old source & CDS lines
   towards the end of the file (from the record for &agr;Cop to end of
   file) has been corrected. A glitch in the annotation of the Eaat1 record
   has been corrected.

2. Only records new (or changed) in this release have the new protein_id
   from the nucleic acid sequence database. The older records still have
   PIDs. This will be fixed for the next release.

3. The number of records is now 2070.

May 24 1999. v2.3:
==================

1. The PID to protein_id translation is still not complete.

2. Corrections (updates) have been made to gene symbols, following changes
   in FlyBase.
   Three records (FBgn0002354, FBgn000342, FBgn0004514) have been replaced
   by longer sequences.

3. Addition of new records; total number 2122.

July 31 1999. v2.6:
==================

0. Versions v2.4 and v2.5 were internal and not released.

1. All PIDs are now translated to the corresponding protein_ids.

2. Corrections (updates) have been made to gene symbols, following changes
   in FlyBase. Records for the following genes have been replaced by new
   records: Ddc, Dref, Prat, sisA, LanB1, Egfr, EG:56G7.1.

3. Records for 18 genes have been removed due to the fact that they were
   duplicates (due to having had different gene symbols in the past).

4. This data set now includes short CDSs (<150-bp).

5. The number of records is: 2193.

October 3 1999. v2.7:
=====================

1. A number of FlyBase symbols have been updated.

2. As a consequence of an ongoing collaboration with Eleanor Whitfield of
   the EBI SwissProt team a number of EMBL/GenBank/DDBJ records have now
   updated FT lines, resulting in the removal of several !!CC lines from
   these records. In addition some of the translations that had been
   previously compiled by FlyBase are now in the nucleic acid sequence
   records. The records for the following genes are affected: kz, bcn92,
   Rtc1, Gyc32E, Fdxh, gdl.

3. Records for the following genes have been replaced by newer versions:
   crm, Hr38, lok, mus201, oaf, Pfk, Pu, Est-P.

4. Two duplicate records (BcDNA:LD24492, BcDNA:GM07659) have been removed.

5. The CTC "start exception" for the cpo gene was an error; this has been
   corrected. The interpretation of the in-frame stop codon within the oaf
   gene has changed; only the first ORF of this gene is now included.

6. This data set includes translations from genes predicted by the
   Drosophila genome projects, including the BDGP cDNAs. It does not
   include translations of BDGP EST sequences.

7. The number of records is: 2452.

November 8 1999. v2.8.5:
========================

1. The number of records is now 2546.

2.
   Apart from routine updates and corrections to symbols the major change
   initiated with this release is to coordinate this sequence set with the
   curation of gene structures by FlyBase. These records have FBpp db_xref
   identifier numbers.

January 3 2000. v2.9:
=====================

1. The number of records is now 2636.

2. Six sequences have been removed (Bkm, Ptth, Tk3, Tk4, Tk6, Sry-rDM17) as
   we have good reasons to believe that they were not Drosophila.

3. The following duplicate records have been removed: BcDNA:GM12270, dek,
   Ac35C, BcDNA:GM07659, BcDNA:LD07532, BcDNA:LD20207, BcDNA:LD21772,
   BcDNA:LD24440 and ubiquitin-C-terminal-hydrolase.

4. A number of symbols have been updated.

5. The protein_id's of the records for aub and EG:196F3.2 have been
   corrected.

6. The sequences for the following genes have been replaced with longer or
   different versions: Fak56D, Fcp3C, N, prominin-like, Pten, Rab4, rdgA,
   RecQ5.

===============================================================================

~/unique-50-genes.80.fa
~/unique-50-genes.50.fa
=======================

These data sets were made from the ~/nuclear_cds_set.embl file by Martin
Reese at LBL, Berkeley. They have been obtained from:

ftp://www-hgc.lbl.gov/pub/genesets/dro/unique-50-genes.fa
ftp://www-hgc.lbl.gov/pub/genesets/dro/unique-80-genes.fa

BLASTN was used in an all-against-all comparison of the nucleic acid
sequences. From any clump of sequences with an overall identity above 80%
(or 50%) only one member was chosen (at random) for inclusion.

August 13 1997. v1.3:
=====================

Derived from nuclear_cds_set.embl.v1.3. The 80% cutoff data set includes
1114 sequences; the 50% cutoff set, 866.

===============================================================================

mitochondrial_cds_set.embl
==========================

This sequence set includes the coding sequences of mitochondrial proteins.
They are derived from the EMBL accessions U37541 and M37275.
The protein sequences are translated with Translation table 5 of the
International Nucleotide Sequence Data Library
(http://www.ebi.ac.uk/ebi_docs/embl_db/ft/genetic_codes.html).

Coding sequences with start exceptions are noted in CC lines, e.g.:

CC !!Translation_start_exception="ATAA".

and non-ATG starts are noted, e.g.:

CC !!Translation_start="ATA".

January 30 1999. v1.0:
======================

This is the first released set and includes all thirteen known sequences of
the mitochondrial genome that encode proteins.

===============================================================================

~/repeat_sequence_set.embl
==============================

Method.

Records for 'gene' objects annotated in FlyBase as representing a
'repetitive sequence' or 'repetitive element' (in *v field) were retrieved
from the EMBL and EMBLNEW data libraries at the EBI using the SRS5 command
line interface (getz). Records corresponding to transposable elements of
known structural classes were filtered out. Using the FT descriptions of
the sequences a script was used to trim extraneous sequences, leaving only
that annotated as being repetitive. Identical sequences were removed by a
filter, but no attempt has been made to reduce redundancy in this data set.

The sequence retained from the original record is described in a line with
this syntax:

FT source :..

where AC is the primary accession number of the parent record.

The records retain the original ID lines of their EMBL record parents, with
the substitution of the length of the derived sequence. They retain the DR
(data base cross_reference) lines pointing to the corresponding FlyBase
gene identifier and symbol and their parent's AC and NI lines.

The criteria for inclusion in this data set do not include repetitive
coding sequences. This data set excludes, therefore, sequences coding for
homopolymeric amino-acid runs (e.g. the opa and pen repeats) and motifs
such as the EGF-repeat.

August 19 1997. v1.0
====================

1.
   Consists of 89 records; total length 46,489-bp.

2. Includes entries from 24 different FlyBase gene objects.

   FBgn0004084 &agr;&ggr;-element.
   FBgn0004950 1688-3C.
   FBgn0004951 1688-10Ea.
   FBgn0004952 1688-10Eb.
   FBgn0005587 anon-alaA.
   FBgn0005588 anon-alaE.
   FBgn0005589 anon-alaD.
   FBgn0015362 anon-k.
   FBgn0011748 anon-M/SAR.
   FBgn0014123 anon-Xh1.
   FBgn0000192 Bkm.
   FBgn0004633 CAT-repeat.
   FBgn0005663 dodeca.
   FBgn0004142 GATA-sequence.
   FBgn0004409 het-rep1.
   FBgn0004411 het-rep3.
   FBgn0005668 HMR-element.
   FBgn0015786 Porto1.
   FBgn0003315 satDNA.
   FBgn0003561 suffix.
   FBgn0003582 Su(Ste).
   FBgn0015563 vivi-repeat.
   FBgn0004081 XDm.
   FBgn0004083 Y-seqs.

August 23 1997. v2.1
====================

1. Consists of 96 records; total length 64,075-bp.

2. The following entries have been added following analysis of the
   Berkeley na_gb.trans set:

   FBgn0002643 mam. [RS repeat]
   FBgn0000164 AluI-like repeat from rDNA.
   -           His3-His1 spacer sequence.
   FBgn0000126 Ars4.
   FBgn0000128 Ars6.
   FBgn0000002 5SRNA. [Repeat unit.]
   FBgn0000164 bb. [Complete rDNA repeat unit.]
   FBgn0003284 Rsp.
   -           Repetitive sequence flanking su(f).
   FBgn0005668 HMR-element. [#2]

3. These sequences have been removed:

   FBgn0005587 anon-alaA.
   FBgn0005588 anon-alaE.
   FBgn0005589 anon-alaD.

===============================================================================

~/transposon_sequence_set.embl
==============================

Method.

Records for 'gene' objects annotated in FlyBase as representing a
transposon (in the *t field) were retrieved from the EMBL and EMBLNEW data
libraries at the EBI using the SRS5 command line. For each transposon all
records were screened by eye to select the longest sequence not noted as
being from a defective element. In all possible cases the annotation in the
FT lines was checked against that published in the literature; in case of
conflict the literature was preferred over the sequence annotation.
Extraneous sequences, including target site duplications, were trimmed and
joins were made across records in an attempt to assemble a 'virtual'
sequence.

In this data set there is only a single record for each transposon. Many,
if not all, transposons exist in variant forms and no attempt has been made
to annotate these differences. Only three features have been carried across
from the parent sequence records: the limits of the direct or inverted
repeat ends of the element (if present), the open reading frames (CDS) and
the introns.

The sequence retained from the original record is described in a line with
this syntax:

FT source :..

Joins are annotated in an 'FT source' line with this syntax:

join(:..,..)

where AC_1 and AC_2 are the primary accession numbers of the parent
sequences. Sequences that are the complement of their parent are annotated
with this syntax:

complement(:..)

The records retain the original ID lines of their EMBL record parents, with
the substitution of the length of the derived sequence. They retain the DR
(data base cross_reference) lines pointing to the corresponding FlyBase
gene identifier and symbol and their parent's AC and NI lines.

The Drosophila genome sequencing projects will identify both new elements
and more 'complete' sequences of known elements. These sequences will be
used in this data set.

Annotation:

These sequences have minimal annotation; only open reading frames, introns
and the extents of the terminal repeats (if present) are annotated. This
annotation derives from the original sequence record but has, in all cases,
been checked, and if necessary, supplemented, against the original
publication and by analysis of the sequences.

August 19 1997. v1.0
====================

The data set consists of records from 40 different transposons.
These records have no annotation of CDS or terminal repeat features:

FB gene ID  Symbol          EMBL           Size     Comment

Retroviral elements:

FBgn0000004 17.6            X01472         7439bp   complete
FBgn0000007 1731            X07656         4648bp   complete
FBgn0000005 297             X03431         6995bp   complete
FBgn0005384 3S18            U23420         6126bp   complete
FBgn0000006 412             X04132         6897bp   complete
FBgn0010103 aurora-element  X70361         344bp    incomplete; LTR
FBgn0014947 blastopia       Z27119         5034bp   complete
FBgn0000199 blood           AY180916       7419bp   complete
FBgn0010302 Burdock         U89994         6411bp   complete
FBgn0000349 copia           D90356         5143bp   complete
FBgn0000481 Doc             X17551         4725bp   ?incomplete
FBgn0015945 gate            X97139         245bp    incomplete; LTR
FBgn0001167 gypsy           M12927         7469bp   complete
FBgn0001207 HMS-Beagle      J01078         266bp    incomplete; LTR
FBgn0002697 mdg1            X59545         7480bp   complete
FBgn0002698 mdg3            X95908         5519bp   complete
FBgn0002745 micropia        X14037         5457bp   complete
FBgn0000155 roo             Z48503         2138bp   incomplete; cDNA
FBgn0003490 springer        M10908         405bp    incomplete; LTR
FBgn0003519 Stalker         X86075         1651bp   incomplete
FBgn0004082 Tirant          X93507         2484bp   incomplete

LINE-like retrotransposons:

FBgn0000224 BS              X77571         5126bp   ?complete
FBgn0000430 D-element       X05643         408bp    incomplete
FBgn0000652 F-element       M17214         3520bp   incomplete
FBgn0001100 G-element       X06950         4346bp   ?complete G3A
FBgn0004141 HeT-A           U06920         6943bp   complete
FBgn0001249 I-element       M14954         5371bp   complete
FBgn0001283 jockey          M22874         5020bp   complete
FBgn0003908 R1-element      X51968         5356bp   complete
FBgn0003909 R2-element      X51967         3607bp   complete
FBgn0004904 TART-element    U14101         10654bp  ?complete

IR-elements:

FBgn0005773 Bari1           X67681         1728bp   complete
FBgn0001210 hobo            M69216         2959bp   complete
FBgn0001181 HB              X01748         1653bp   ?incomplete
FBgn0014967 hopper          X80025         1435bp   incomplete
FBgn0003055 P-element       X06779         2907bp   complete
FBgn0003122 pogo            X59837         2121bp   ?incomplete
FBgn0004905 S-element       U33463         1736bp   ?incomplete
FBgn0002949 NOF             X15469;X51937  4347bp   complete

FB-elements:

FBgn0000638 FB              V00246         1492bp   ?incomplete

August 25 1997. v2.0
====================

1.
   The sequences have been annotated for ORFs, introns and terminal
   repeats. These features have been checked against the original
   publications, except for X95908, but have not yet been checked by
   sequence analysis.

2. The sequence trimming of the following records has been corrected:
   join(X15469;X51937); X86075.

October 07 1997. v3.0
=====================

1. The roo sequence (Z48503) has been replaced by a 9092-bp sequence from a
   Berkeley P1 clone.

2. A new retroviral element, yoyo, similar to the element of the same name
   from the Medfly, has been added from a Berkeley P1 clone.

3. The sequence of the 1360 element, inadvertently missed in previous
   versions, has been added.

October 10 1997. v3.1
=====================

1. The Tirant sequence has been updated to that of the complete element.

November 24 1997. v3.2
======================

1. The blood sequence has been updated to that of the complete element,
   from a Berkeley P1 clone sequence.

2. The FlyBase gene identifier has been added to the yoyo record.

November 24 1997. v3.3
======================

1. The ZAM element has been added.

September 8 1998. v3.4
======================

1. The full length sequence of the GATE element has been added.

2. The FlyBase gene identifier has been added to the ZAM record.

February 1 1999. v3.6
=====================

1. The full length sequence of the Idefix element has been added.

March 27 1999. v3.7
===================

1. A more complete sequence of the aurora element has been added, replacing
   the LTR sequence previously included.

May 6 2000. v3.8
================

1. K. O'Hare, Personal communication to FlyBase, 1 May 2000: The pogo
   sequence is probably a complete element.

2. The sequence of the F element has been replaced by one of 4708bp from
   K. O'Hare (Personal communication to FlyBase, 1 May 2000).

3. The following three new elements have been added: X-element, Transpac
   and Circe.

November 26 2000. v.4.0
=======================

1. You element added.

November 26 2000. v.4.1
=======================

Line 190 of the file had, mysteriously, lost a base. Now corrected.
Thank you Suzi for discovering this!

April 9 2001. v.4.2
===================

1. Full length springer and cruiser elements added from BDGP.

2. midline element added.

April 9 2001. v.4.3/4.31
========================

1. Full length HMS Beagle element added from BDGP.

2. In v4.31 there have been updates to FBgn and AC numbers; if these are
   not known then 'n' filled dummies are used. The NI lines, sometimes
   containing the NCBI internal tracking number, and the FH lines (which
   contained no information) have been deleted.

May 4 2001. v.4.4
=================

1. The BEL-like Tinker element sequence has been added.

May 4 2001. v.4.5
=================

1. Updated AC etc data on springer, HMS-Beagle and cruiser.

July 29 2001. v.4.5
===================

1. Added FBgn numbers for Tinker & cruiser; FBgn for yoyo corrected.

August 21 2001. v.4.6
=====================

1. midline element sequence deleted; it is an internally deleted form of
   HMS Beagle.

September 21 2001. v.4.7
========================

1. The sequence of the pilgrim element has been added.

2. The incomplete sequence of the Stalker element (EMBL:X86075) has been
   replaced by a full-length sequence from the BDGP.

September 21 2001. v.4.71
=========================

1. AC lines with no accession number, i.e. AC nnnnnnnnn; have been removed.

2. Some extraneous text had been appended to the end of the Stalker
   sequence; now removed.

September 21 2001. v.4.73
=========================

1. Sequence of INE-1 element added.

November 21 2001. v.4.8
=======================

1. The sequences of the new elements identified by REPBASE
   http://www.girinst.org/server/RepBase/RepBase6.6.embl/drorep.ref
   have been assembled by Josh Kaminker and added.

2. The sequences of the Tirant and BS elements have been replaced with
   sequences from the BDGP.

November 26 2001. v.4.82
========================

1. REPBASE copia2 sequence added.

November 26 2001. v.4.9
=======================

1. Following the identification of synonyms the following sequences have
   been removed from the set:

   Ivk        (same element as You)
   Waldo-B    (same element as pilger)
   diver      (same element as Tinker)
   tabor      (same element as Pilgrim)
   D-element  (same element as jockey)

December 20 2001. v.4.91
========================

1. The sequence of the strider element has been added.

January 26 2002. v.4.92
=======================

1. The HeT-A sequence has been revised.

2. The sequences of Juan, frogger and rover have been added.

February 12 2002. v.4.93
========================

1. The Rt1b sequence has been removed; it is the same element as Waldo-A.

March 13 2002. v.4.94
=====================

1. The strider sequence has been removed; it is the same element as Juan.

2. The M55078 sequence of 1360 is replaced by the REPBASE "protop"
   sequence; but the valid name of this element remains 1360.

3. The FlyBase names of the following elements have been changed:

   cruiser to Quasimodo
   Pilgrim to Tabor
   Tinker to diver
   You to Ivk
   Waldo-A to Rt1b

March 15 2002. v.4.95
=====================

1. Name of yoyo changed to opus.

March 15 2002. v.4.96
=====================

1. Duplicate HMS-Beagle record (J01078) removed.

May 4 2002. v.4.97
==================

1. Name update.

August 15 2002. v.5.00
======================

1. Name & FBgn updates.

2. Following added from Repbase 4.4.3: jockey2, looper1, Tom1, G4, G5, G6.

3. FB element corrected to remove 3' HB element sequences.

4. A duplicated record of transib2 has been removed.

August 15 2002. v.5.01
======================

1. Partial Penelope sequence added.

August 28 2002. v.5.02
======================

1. The following sequences have been added from the BDGP: qbert,
   McClintock, hopper2, Stalker4.

November 14 2002. v.6.2
=======================

1. Sequence of Bari2, Max & Stalker2 elements added.

2. FBid's & AC numbers updated.

3. narep1 sequence deleted (synonym of INE-1).

November 14 2002. v.7.1
=======================

1.
   The sequence of 412 was found to be deleted. A new sequence from
   Release 3 has replaced it.

2. CDS translations (from SWALL) have been added to many of the records.
   These have not yet been checked against other annotation.

December 28 2002. v.7.2
=======================

1. Updates of FB identifiers and additions of CDS translations of blood,
   roo, opus and Juan.

October 25 2003. v.7.2.1
=========================

1. Updates of FB identifiers.

The current data set is:

FB gene ID  Symbol          EMBL           Size     Comment

Retroviral elements:

FBgn0000004 17.6            X01472         7439bp   complete
FBgn0000007 1731            X07656         4648bp   complete
FBgn0000005 297             X03431         6995bp   complete
FBgn0005384 3S18            U23420         6126bp   complete
FBgn0000006 412             nnnnnnnn       7567bp   complete
FBgn0063447 accord          nnnnnnnn       7404bp   complete
FBgn0010103 aurora-element  X70361         4263bp   ?complete
FBgn0014947 blastopia       Z27119         5034bp   complete
FBgn0000199 blood           nnnnnnnn       7410bp   complete
FBgn0010302 Burdock         U89994         6411bp   complete
FBgn0022937 Circe           X98424         6356bp   complete
FBgn0000349 copia           D90356         5143bp   complete
FBgn0062343 Dm88            nnnnnnnn       4558bp   complete
FBgn0044355 Quasimodo       AF364550       7387bp   complete
FBgn0063439 diver2          nnnnnnnn       4917bp   complete
FBgn0061513 frogger         AF492763       2483bp   ?complete
FBgn0015945 GATE            AJ010298       8507bp   complete
FBgn0063436 gtwin           nnnnnnnn       7411bp   complete
FBgn0001167 gypsy           M12927         7469bp   complete
FBgn0063435 gypsy2          nnnnnnnn       6841bp   complete
FBgn0063434 gypsy3          nnnnnnnn       6973bp   complete
FBgn0063433 gypsy4          nnnnnnnn       7369bp   complete
FBgn0063432 gypsy5          nnnnnnnn       6852bp   complete
FBgn0063431 gypsy6          nnnnnnnn       7826bp   complete
FBgn0001207 HMS-Beagle      AF365402       7062bp   complete
FBgn0026065 Idefix          AJ009736       7411bp   complete
FBgn0063430 invader1        nnnnnnnn       4032bp   complete
FBgn0063429 invader2        nnnnnnnn       5124bp   complete
FBgn0063428 invader3        nnnnnnnn       5484bp   complete
FBgn0063427 invader4        nnnnnnnn       3105bp   complete
FBgn0063426 invader5        nnnnnnnn       4038bp   complete
FBgn0063919 Max-element     AJ487856       8556bp   complete
FBgn0063917 McClintock      AF541948       6450bp   complete
FBgn0002697 mdg1            X59545         7480bp   complete
FBgn0002698 mdg3            X95908         5519bp   complete
FBgn0002745 micropia        X14037         5457bp   complete
FBgn0063782 qbert           AF541947       7650bp   complete
FBgn0045970 Tabor           nnnnnnnn       7345bp   complete
FBgn0063450 Tom1            nnnnnnnn       410bp    incomplete
FBgn0000155 roo             AY180917       9092bp   complete
FBgn0063394 rooA            nnnnnnnn       7621bp   complete
FBgn0061485 rover           AF492764       7318bp   complete
FBgn0003490 springer        AF364549       7546bp   complete
FBgn0003519 Stalker         AF420242       7365bp   complete
FBgn0063455 Stalker2        nnnnnnnn       7672bp   complete
FBgn0063454 Stalker3        nnnnnnnn       372bp    incomplete
FBgn0063897 Stalker4        AF541949       7359bp   complete
FBgn0043969 diver           nnnnnnnn       6112bp   complete
FBgn0040267 Transpac        AF222049       5249bp   complete
FBgn0004082 Tirant          nnnnnnnn       8526bp   complete
FBgn0003007 opus            AY180918       7521bp   complete
FBgn0023131 ZAM             AJ000387       8435bp   complete

LINE-like retrotransposons:

FBgn0063440 baggins         nnnnnnnn       5453bp   complete
FBgn0000224 BS              nnnnnnnn       5142bp   complete
FBgn0063594 Cr1a            nnnnnnnn       4470bp   complete
FBgn0000481 Doc             X17551         4725bp   ?incomplete
FBgn0063534 Doc2-element    nnnnnnnn       4789bp   complete
FBgn0063533 Doc3-element    nnnnnnnn       4740bp   complete
FBgn0000652 F-element       AC005198       4708bp   complete
FBgn0001100 G-element       X06950         4346bp   ?complete G3A
FBgn0063507 G2              nnnnnnnn       3102bp   complete
FBgn0063506 G3              nnnnnnnn       4605bp   complete
FBgn0063505 G4              nnnnnnnn       3856bp   complete
FBgn0063504 G5              nnnnnnnn       4856bp   complete
FBgn0063503 G6              nnnnnnnn       2042bp   complete
FBgn0004141 HeT-A           U06920         6083bp   complete
FBgn0001249 I-element       M14954         5371bp   complete
FBgn0043055 Ivk             nnnnnnnn       5402bp   complete
FBgn0046110 Juan            AY180919       4236bp   complete
FBgn0001283 jockey          M22874         5020bp   complete
FBgn0063425 jockey2         nnnnnnnn       3428bp   complete
FBgn0046701 Penelope        AF418572       804bp    incomplete
FBgn0003908 R1-element      X51968         5356bp   complete
FBgn0003909 R2-element      X51967         3607bp   complete
FBgn0041728 Rt1a            AJ278684       5108bp   complete
FBgn0063467 Rt1c            nnnnnnnn       5443bp   complete
FBgn0042682 Rt1b            AF281636       5171bp   complete
FBgn0004904 TART-element    U14101         10654bp  ?complete
FBgn0042231 X-element       AF237761       4740bp   complete

IR-elements:

FBgn0005673 1360            nnnnnnnn       3409bp   complete
FBgn0005773 Bari1           X67681         1728bp   complete
FBgn0064134 Bari2           AF541951       1064bp   complete
FBgn0001210 H-element       M69216         2959bp   complete
FBgn0001181 HB              X01748         1653bp   ?incomplete
FBgn0014967 hopper          X80025         1435bp   incomplete
FBgnnnnnnnn hopper2         AF541950       1593bp   incomplete
FBgn0026416 INE-1           U66884         611bp    ?incomplete
FBgn0063402 looper1         nnnnnnnn       1881bp   incomplete
FBgn0063401 mariner2        nnnnnnnn       912bp    complete
FBgn0002949 NOF             X15469;X51937  4347bp   complete
FBgn0003055 P-element       X06779         2907bp   complete
FBgn0003122 pogo            X59837         2121bp   complete
FBgn0004905 S-element       U33463         1736bp   ?incomplete
FBgn0063466 S2              nnnnnnnn       1735bp   complete
FBgn0026410 Tc1             nnnnnnnn       1666bp   complete
FBgn0063372 transib1        nnnnnnnn       2167bp   complete
FBgn0063371 transib2        nnnnnnnn       2844bp   complete
FBgn0063370 transib3        nnnnnnnn       2883bp   complete
FBgn0063369 transib4        nnnnnnnn       2656bp   complete

Foldback elements:

FBgn0000638 FB              V00246         1106bp   ?incomplete

===============================================================================
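
Reading the summary table with a script.

The rows of the transposon summary table above share one layout: FB gene ID,
symbol, EMBL accession, size in bp, and a free-text comment. The following
is a minimal sketch, not part of the distributed data set, of how such rows
might be read into records. The names Element, parse_row and has_accession
are hypothetical and chosen only for illustration. The sketch assumes that
columns are separated by whitespace and that an accession made up only of
'n' characters is one of the 'n' filled dummies described under v4.31.

```python
import re
from typing import NamedTuple, Optional


class Element(NamedTuple):
    """One row of the summary table (hypothetical record type)."""
    fbgn: str        # FB gene ID; may itself be an 'n' filled dummy
    symbol: str      # FlyBase symbol
    accession: str   # EMBL accession(s); 'n' filled if not known
    size_bp: int     # length of the derived sequence in bp
    comment: str     # e.g. "complete", "?incomplete", "incomplete; LTR"


# Assumed layout: FB gene ID, symbol, accession, "<digits>bp", comment.
ROW = re.compile(r"^(FBgn\w+)\s+(\S+)\s+(\S+)\s+(\d+)bp\s*(.*)$")


def parse_row(line: str) -> Optional[Element]:
    """Return an Element for a data row, or None for headings and blanks."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    fbgn, symbol, accession, size, comment = match.groups()
    return Element(fbgn, symbol, accession, int(size), comment.strip())


def has_accession(element: Element) -> bool:
    """False when the accession column holds only an 'n' filled dummy."""
    return element.accession.strip("n") != ""
```

Class headings such as "Retroviral elements:" and the column header line do
not begin with FBgn, so parse_row returns None for them and a caller can
skip those lines when collecting the records.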