Large-scale annotation with Genotator and Genie

Nomi L. Harris
nomi@mhgc.lbl.gov

Barret D. Pfeiffer
bear@mhgc.lbl.gov

Martin G. Reese
martinr@mhgc.lbl.gov

Susan E. Celniker

Lawrence Berkeley National Laboratory
Drosophila Genome Center
Mailstop 46A/1123
One Cyclotron Rd.
Berkeley, CA 94720

The Drosophila Genome Center at LBNL finished five megabases of Drosophila melanogaster genomic sequence in 1997. To be of maximal use to the community, this sequence needs to be annotated automatically. To achieve this goal, we are using a number of computational tools, including two developed in our center: Genotator, an annotation workbench, and Genie, a gene prediction tool.

Genotator is a workbench for automated sequence annotation and annotation browsing. The back end runs a series of sequence analysis tools on a DNA sequence, handling the various input and output formats required by the tools. Genotator runs five different gene finding programs (including Genie) and three homology searches, plus searches for promoters, splice sites, and ORFs. The results of the analyses run by Genotator can be viewed with its interactive graphical browser. By graphically displaying the output of multiple sequence analyses, Genotator provides an intuitive way to identify the significant regions (for example, probable exons) in a sequence.

Genie, our gene prediction program, finds potential exons by using a generalized hidden Markov model to assign a coding probability to each base. Gene predictions are then deduced from the model using a dynamic programming algorithm that identifies the maximum-probability path through the model, where each path is a sequence of gene features. The generalized hidden Markov model architecture allows Genie to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database.
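
The dynamic programming step can be illustrated with a minimal sketch. This is not Genie's implementation: it uses an ordinary two-state hidden Markov model rather than a generalized one with variable-length gene features, and the state names, the `viterbi` function, and all probabilities below are assumptions chosen only for illustration.

```python
import math

STATES = ("coding", "noncoding")

def log_table(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {kk: math.log(v) for kk, v in row.items()} for k, row in table.items()}

# Toy parameters (assumed, not taken from Genie): the coding state favors G/C.
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = log_table({
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
})
LOG_EMIT = log_table({
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
})

def viterbi(sequence, states, log_start, log_trans, log_emit):
    """Return the most probable state path for `sequence` and its log probability."""
    if not sequence:
        return [], 0.0
    # best[i][s]: log probability of the best path that ends in state s at position i
    best = [{s: log_start[s] + log_emit[s][sequence[0]] for s in states}]
    back = []  # back[i][s]: predecessor of state s at position i + 1
    for base in sequence[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][base]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

if __name__ == "__main__":
    path, logp = viterbi("AAAAAGGGGG", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(path, logp)
```

With these toy parameters, the sequence AAAAAGGGGG is labeled noncoding for the five A bases and coding for the five G bases. A generalized model would instead score whole features (for example, an exon of a given length) at each step, which is what allows content and signal sensors to be combined.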

Genie is trained to predict complete genes in both human and Drosophila DNA sequences. In an analysis of 2,613 kilobases of human DNA sequence, Genie correctly predicted 75% of exon boundaries (splice sites, start codons, and stop codons) and identified 85% of the coding nucleotides. In a similar analysis of 766 kilobases of Drosophila DNA sequence, it found 80% of the exon boundaries and identified 94% of the coding nucleotides.
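
The two kinds of figures reported above can be read as sensitivity measures at the nucleotide level and at the exon-boundary level. The sketch below shows one common way to compute such measures; the abstract does not specify the exact evaluation procedure, so the function names, the inclusive (start, end) coordinate convention, and the example coordinates are all assumptions.

```python
def coding_positions(exons):
    """Expand inclusive (start, end) exon coordinates into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}

def nucleotide_sensitivity(true_exons, predicted_exons):
    """Fraction of annotated coding nucleotides that are also predicted as coding."""
    true_pos = coding_positions(true_exons)
    if not true_pos:
        return 0.0
    return len(true_pos & coding_positions(predicted_exons)) / len(true_pos)

def boundaries(exons):
    """Tag each exon start and end so that starts never match ends."""
    return {("start", s) for s, _ in exons} | {("end", e) for _, e in exons}

def boundary_sensitivity(true_exons, predicted_exons):
    """Fraction of annotated exon boundaries that are predicted exactly."""
    true_b = boundaries(true_exons)
    if not true_b:
        return 0.0
    return len(true_b & boundaries(predicted_exons)) / len(true_b)

if __name__ == "__main__":
    # Hypothetical coordinates: the second predicted exon starts 10 bases late.
    annotated = [(100, 199), (300, 399)]
    predicted = [(100, 199), (310, 399)]
    print(nucleotide_sensitivity(annotated, predicted))
    print(boundary_sensitivity(annotated, predicted))
```

In this hypothetical example, 190 of the 200 annotated coding nucleotides are recovered (0.95), while only three of the four boundaries match exactly (0.75). This illustrates why nucleotide-level figures are typically higher than boundary-level figures: a slightly misplaced boundary costs a full boundary but only a few nucleotides.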

We have used Genotator and Genie to annotate much of the sequence produced at LBNL, including the Antennapedia complex of Drosophila. The Antennapedia complex contains a cluster of five homeotic genes (Antennapedia, Sex combs reduced, Deformed, proboscipedia and labial) that are required for proper development of the head and first thoracic segment. The entire complex, which spans 429,822 base pairs, has been sequenced. We present the annotation of the complete complex, which is estimated to contain at least 18 genes (verified by high sequence homology) composed of 65 coding exons, most of which were at least partially predicted by Genie.

A version of this abstract that includes graphics can be viewed at http://www-hgc.lbl.gov/homes/nomi/psb98-abstract.html.