nlharris@lbl.gov
Genotator is a workbench for automated sequence annotation and annotation browsing. The back end runs a series of sequence analysis tools on a DNA sequence, handling the various input and output formats required by the tools. Genotator currently runs five different gene finding programs, three homology searches, and searches for promoters, splice sites, and ORFs.
The results of the analyses run by Genotator can be viewed with the interactive graphical browser. The browser displays color-coded sequence annotations on a canvas that can be scrolled and zoomed, allowing the annotated sequence to be explored at multiple levels of detail. The user can view the actual DNA sequence in a separate window; when a region is selected in the map display, it is automatically highlighted in the sequence display, and vice versa. By displaying the output of all of the sequence analyses, Genotator provides an intuitive way to identify the significant regions (for example, probable exons) in a sequence. Users can interactively add personal annotations to label regions of interest. Additional capabilities of Genotator include primer design and pattern searching.
Many researchers are developing tools for analyzing DNA sequences. These tools include programs that look for homologies to sequences in a database; predict possible exons; find repeats; and identify gene signals such as promoters and splice sites. Many of these sequence analysis tools can offer useful insight into the biological significance and possible function of a new sequence; however, they tend to suffer from several shortcomings. First, each sequence analysis program has its own output format (and often its own input format as well). This makes it difficult to compare the results of multiple programs. Second, although the ability to predict the locations of exons and other genetic signals continues to improve, it would still be rash to place absolute faith in the predictions of any one program. If, however, several different programs, with different approaches, make the same prediction on a sequence region, our confidence in the prediction is increased. Third, the output of many sequence analysis programs is textual rather than graphical, which makes it hard to quickly identify the significant regions of a genomic sequence. Some programs do have graphical displays, but that doesn't ease the problem of comparing the output of several different programs. Finally, most sequence analysis programs are not a solution to the problem of automated annotation, because they don't provide many of the features that one would want in such a tool, such as the ability to add personal annotations or to inspect the sequence at an arbitrary level of detail.
We have developed a sequence annotation workbench, Genotator, that addresses these shortcomings. Genotator provides a flexible, transparent system for automatically running a series of sequence analysis programs on genetic sequences. It also has a graphical display that allows users to view all of the annotations and add or delete their own. Genotator's display allows annotated sequences to be examined at multiple levels of detail, from an overview of the entire sequence down to individual bases.
Genome Topographer (T.G. Marr, unpublished) is another example of an ACeDB-like program that includes a database to hold genome-related data plus displays to allow various views of the data. Like ACeDB, Genome Topographer was not designed as an interactive annotation tool.
Others have written tools more specifically designed for sequence annotation. These include GeneQuiz, SCAN, the BCM Search Launcher, and GAIA. GeneQuiz (Scharf et al., 1994), like Genotator, automatically runs a series of sequence analysis tools, including BLAST and FASTA. The results are displayed as structured text. Darrell Ricke's SCAN program (Ricke et al., in preparation) has a back end similar in some respects to Genotator, although it concentrates more on database homology searches and less on exon prediction. The displays are mostly structured text, some with hyperlinks. The BCM Search Launcher (Smith et al., 1996) provides a point from which to access various sequence (and structure) analysis tools available on the Web. The user can request any of a variety of such searches; the results of each search are displayed separately as hyperlinked structured text. GAIA (Genome Annotation and Information Analysis) (Bailey et al., in preparation), which is being developed at the University of Pennsylvania, is perhaps the most similar system to Genotator in terms of its goals, organization, and features. Sequences are submitted to ATLAS, the data management portion of GAIA, and then automatically annotated by CARTA. The annotated sequence is displayed with Java applets (based on the bioWidget components). Although GAIA calls only one exon prediction program (GRAIL), rather than several as Genotator does, GAIA includes some types of features (e.g., poly-A signals) that are not reported by Genotator.
Recently there has been interest in developing Java displays for visualization of sequences and related information. Groups working on such displays include the Berkeley Drosophila Genome group (Helt and Rubin, 1996), EMBL, and the Computational Biology and Informatics Laboratory at the University of Pennsylvania. The bioWidget Consortium (http://www.biowidgets.org/) involves some of the groups interested in collaborative development of Java displays for bioinformatics. Most of this work has focused on graphical displays rather than on back-end software for sequence analysis. Genotator offers a combined system, which runs a sequence through various analysis tools and then displays the results. The next section describes how Genotator is organized.

[Figure: hierarchy of annotated sequences, organized by user (User 1, User 2), then by sequence (Sequence A, Sequence B), then by annotation type (GRAIL exons, EST hits, ...).]
Out of the many available sequence analysis tools, I chose a reasonable subset to integrate into Genotator. Exclusion of some tools from Genotator's collection is not meant to imply that such tools are inferior. Offsite users who set up Genotator at their site (see Appendix A) are free to modify the code to integrate their favorite sequence analysis tools. Also, various labs are sequencing the DNA of various organisms; I set up Genotator to work on human or Drosophila (which are the organisms being sequenced at LBNL). Users can specify which organism their sequence is from (if left unspecified, human is assumed).
The analysis programs called by Genotator fall into three main categories: gene finders (Genie (Kulp et al., 1997), GRAIL (Xu et al., 1994), GeneFinder (Green, 1994), xpound (Thomas and Skolnick, 1994), and GeneMark (Borodovsky and McIninch, 1993)); database homology searches (BLASTN (Altschul et al., 1990) on dbEST and on a database of human or Drosophila repeat sequences; BLASTX on GenPept (Benson et al., 1993)); and sequence feature predictors (start/stop codons, open reading frames (ORFs), promoters (Reese and Eeckman, 1994), splice sites (Reese et al., 1997), and tRNA genes (Lowe and Eddy, 1997)). The promoter and splice site predictors and the Genie gene finder were developed by members of our group at LBNL. Most of the other programs are freely available (see Appendix A). For each analysis program, there is a Perl filter that parses the results, filters out the insignificant ones, and saves the significant annotations in .ace files, from which they can be read by the browser.
Figure 3 shows a simplified view of the annotation process used by Genotator. First the incoming sequence is cleaned up (nonstandard characters are converted to Ns; long lines are broken up) and converted to FASTA format, which is used as the input format for many of the sequence analysis tools. The sequence is BLASTed against a database of human (or Drosophila) repeats and the repeats that are found are masked out with xblast. The masked sequence is then BLASTed against databases of EST sequences and GenPept (translated coding regions from GenBank). The BLAST hits are filtered and stored both in .ace format and in a file for Blixem (a BLAST hit viewer from the Sanger Centre). Issues having to do with BLAST hits are discussed in the next section.
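The cleanup step described above can be illustrated with a minimal Python sketch. It is not Genotator's own code (the back end uses Perl filters); the function names, the 60-character line width, and the choice to discard whitespace and digits before substituting Ns are all assumptions made for the example.

```python
import re


def clean_sequence(raw: str) -> str:
    """Normalize a raw DNA string.

    Assumption: whitespace and digits (for example, position numbers
    copied from a database entry) are dropped; every remaining character
    other than A, C, G, T, or N becomes N.
    """
    letters = re.sub(r"[\s\d]", "", raw).upper()
    return re.sub(r"[^ACGTN]", "N", letters)


def to_fasta(name: str, raw: str, width: int = 60) -> str:
    """Return a FASTA record whose sequence lines are at most `width` long."""
    seq = clean_sequence(raw)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

Calling `to_fasta("seq1", raw_text)` on arbitrary pasted text yields a record that FASTA-reading tools should accept, with ambiguity codes and stray symbols replaced by N.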
The next phase of processing involves converting the sequence to the appropriate input format for each of five gene prediction tools, running the tools (using parameters appropriate for human or Drosophila sequence), and parsing the results. Stop codons and open reading frames are also found and their positions recorded. Martin Reese's neural network programs are run to find potential promoters and splice sites. tRNAscan-SE is run to look for potential tRNA genes (although these are found so rarely that they are not displayed in the graphical output).
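As an illustration of the open reading frame search, the following Python sketch scans the three forward frames of a sequence. The definition of an ORF used here (an ATG followed by the first in-frame stop codon), the restriction to the forward strand, and the function name are assumptions for the example; the paper states only that ORFs of at least 150 bases are displayed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 150):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is 0-based and end is exclusive; the reported span runs from
    the ATG through the stop codon. Only ORFs of at least min_len bases
    are kept.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Reverse-strand ORFs could be found by running the same function on the reverse complement and mapping the coordinates back.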

Another step we take to try to maximize the information content of the reported BLAST hits is to search first for repeat sequences (such as Alu elements, which are ubiquitous throughout the human genome; repeat sequences are also found in the genomes of other organisms). The repeat sequences are then masked with xblast (Claverie and States, 1993), and the other BLASTs are run on the masked sequence so that the hits that are found don't include repeat sequences.
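The effect of masking can be sketched in a few lines of Python. This is not xblast itself, only an illustration of the idea; the function name, the use of N as the masking character, and the 1-based inclusive coordinates are assumptions for the example.

```python
def mask_regions(seq: str, hits, mask_char: str = "N") -> str:
    """Replace every base covered by a repeat hit with mask_char.

    hits is an iterable of (start, end) pairs, assumed to be 1-based and
    inclusive. A pair given in reverse order (as for a hit on the
    opposite strand) is handled by swapping its ends.
    """
    masked = list(seq)
    for start, end in hits:
        lo, hi = min(start, end), max(start, end)
        lo = max(lo, 1)          # clip to the sequence
        hi = min(hi, len(seq))
        for i in range(lo - 1, hi):
            masked[i] = mask_char
    return "".join(masked)
```

The masked sequence keeps its original length, so coordinates of later BLAST hits still refer to positions in the unmasked sequence.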

The GUI is designed to minimize the number of choices the user has to make; in most cases, the user can simply click "Start annotation" and everything will proceed automatically.
The command-line interface is useful when the user wants to annotate several sequences at the same time. It can be invoked with no arguments in order to run the standard analyses, or it can be called with various command-line options to alter its behavior.
We have also developed a Web front end to Genotator that looks much like the GUI. Like the other approaches to running Genotator, the Web interface allows the user to specify a sequence to be annotated and to select which analyses are to be performed. Once a sequence is submitted, the back end runs the analyses as usual and saves the results in the database, where they can be viewed with the Genotator browser.
When Genotator is invoked, its first step is to check the availability of all the sequence analysis programs it knows about. Any that are missing are not offered to the user as choices. Genotator can run with any subset of the suite of sequence analysis programs it is capable of calling. It is written in such a way that new analysis tools can fairly easily be integrated. (Integrating a new tool would involve creating filters to convert the input and output formats, and adding new functions to the back end and front end to run the tool and display the results.)
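The availability check could look something like the following Python sketch, which keeps only the tools whose executables can be found on the search path. The function name and the executable names in the usage note are hypothetical; how Genotator actually locates its programs is not described here.

```python
import shutil


def available_tools(tools: dict) -> dict:
    """Return the subset of tools whose executable is found on PATH.

    tools maps a display label to an executable name (or full path).
    """
    return {
        label: exe
        for label, exe in tools.items()
        if shutil.which(exe) is not None
    }
```

For example, `available_tools({"GRAIL": "grail", "BLASTN": "blastn"})` (executable names assumed) would return only the entries that are installed, and only those would be offered to the user.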
The Genotator browser is built on top of the bioTkperl widgets (Helt, unpublished) developed by Gregg Helt of the UC Berkeley Drosophila Genome Center (which were in turn inspired by the bioTk widgets developed by David Searls (Searls, 1995)). It can be invoked with the name of an annotated sequence file as an argument. If it is invoked with no arguments, a list of annotated sequences is displayed, with the sequences annotated by the invoking user listed first. Once a sequence has been selected, all of its annotations are loaded and displayed in the map display.
BLAST hits can be double-clicked to view them in more detail. For BLASTN hits (against nucleotide sequences), the complete alignment pops up in a separate window. BLASTX hits against GenPept can be viewed in Blixem.
In Figure 5 the Genotator browser is shown displaying the annotations on HUMTFPB (Mackman et al., 1989), a human tissue factor gene sequence obtained from GenBank. (Splice site predictions and start/stop codons are not displayed until they are explicitly turned on.)

The default annotation colors are:

| Color | Annotation |
| --- | --- |
| Magenta | NNPP promoter predictions |
| Red | GenPept hits (using BLASTX): GenPept consists of all the GenBank coding regions translated to amino acids |
| Orange | EST hits (using BLASTN) |
| Yellow | Human repeat sequence hits (using BLASTN) |
| Chartreuse | xpound exon predictions |
| Green | GeneFinder exon predictions |
| Turquoise | GRAIL exon predictions |
| Dark Blue | Genie exon predictions |
| Purple | GenBank CDS (exons) |
| Magenta/Red/Orange | Open reading frames (>=150 bases), colored by frame |
In Figure 5, the user has clicked on one of the red GenPept BLAST hits. The browser put a black frame around the hit and printed information about the hit in the box labeled "Annotation".

The lineup of exon predictions displayed by Genotator was the inspiration for GeneNomi (Harris et al., 1996), a method for combining information from several different predictions to make conservative exon predictions. GeneNomi starts with the exons predicted by Genotator's suite of gene finders, takes the overlapping portions of the predicted exons (which are weighted by the measured accuracy of the gene finding method used to predict each exon), and refines the end points of the consensus exons by looking for splice sites or start/stop codons. GeneNomi was tested on a standardized data set of 305 "clean" gene sequences carefully selected from GenBank (Kulp et al., 1996). By combining several sources of information, GeneNomi was able to come up with slightly better predictions than the best gene finder used by itself. The fact that its predictions were only a slight improvement suggests that we are not yet at the point where a single consensus exon prediction would inspire confidence. It is more useful for a biologist to see the predictions of all of the gene finders lined up (plus the BLAST hits, splice sites, and other supporting features) and to make an informed decision about which exons are most believable. (GeneNomi was developed for research purposes; its predictions are not currently included in the Genotator display.)
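The idea of combining overlapping predictions by weight can be illustrated with a simple per-base voting scheme in Python. This is only a sketch of the general approach, not the GeneNomi algorithm: the threshold rule, the function name, and the weights are assumptions, and the refinement of exon end points using splice sites or start/stop codons is omitted.

```python
def consensus_exons(predictions, weights, length, threshold):
    """Return maximal regions supported by enough weighted predictions.

    predictions maps a gene finder name to a list of (start, end) exons,
    0-based with end exclusive. weights maps each finder name to a
    hypothetical accuracy weight. A base is kept when the summed weight
    of the finders covering it is at least threshold.
    """
    score = [0.0] * length
    for finder, exons in predictions.items():
        covered = set()
        for start, end in exons:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:  # each finder votes at most once per base
            score[pos] += weights[finder]

    regions = []
    region_start = None
    for pos, value in enumerate(score):
        if value >= threshold and region_start is None:
            region_start = pos
        elif value < threshold and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, length))
    return regions
```

With three finders weighted 0.5, 0.3, and 0.2 and a threshold of 0.75, a region is reported only where the two most heavily weighted finders agree.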
After using the Genotator browser to identify probable exons or other interesting features, biologists may choose to confirm these predictions at the bench. (For their convenience, Genotator can also select primers.) By looking at Genotator's predictions, one may minimize the number of sequence regions that need to be checked.
Annotations that refer to a sizable portion of the sequence are generally added to the map; those referring to a small region (such as a primer) are more appropriately added to the sequence. All personal annotations are saved in the database along with the automatically generated annotations. Examples of personal annotations can be seen in the map display in Figure 5 ("Personal annotation" and "Reverse strand annotation") and the sequence display in Figure 6 ("personal annotation in sequence").

Figure 8 shows a sequence that a group of biologists at LBNL (Collins and Cloutier, unpublished) annotated using Genotator. The sequence, h78_1_c10, is a 3523-basepair subclone from human chromosome 7 that was sequenced at LBNL. Personal annotations have been added to indicate regions of interest. The regions marked GCAP indicate where homologies to two retinal guanylyl cyclase activator proteins were found coincident with predicted exons. These predicted exons may thus belong to some new gene associated in some way with the photoreceptor membrane.

Lewis, E.B., J.D. Knafels, D.R. Mathog, and S.E. Celniker (1995). Sequence analysis of the cis-regulatory regions of the bithorax complex of Drosophila. Proc. Natl. Acad. Sci. USA 92:8403-8407.
Helt, G. (unpublished). bioTkperl: Graphics widgets for genomics.
Searls, D.B. (1995). bioTk: Componentry for genome informatics graphical user interfaces. COMBIS. http://www.cbil.upenn.edu/~dsearls/bioTk_paper/paper.html
Collins, C., and T. Cloutier (unpublished). Annotations on sequence h78_1_c10. Personal communication.
Kulp, D., D. Haussler, M.G. Reese, and F.H. Eeckman (1996). A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Conf. on Intelligent Systems in Molecular Biology '96, St. Louis, Missouri, AAAI/MIT Press.
(Set of 305 genes is available at ftp://www-hgc.lbl.gov/pub/genesets.)
Durbin, R., and J. Thierry-Mieg (1991-). A C. elegans Database. Documentation, code and data available from anonymous FTP servers at lirmm.lirmm.fr, cele.mrc-lmb.cam.ac.uk and ncbi.nlm.nih.gov.
Marr, T.G. (unpublished). Genome Topographer. Personal communication.
Scharf, M., R. Schneider, G. Casari, P. Bork, A. Valencia, C. Ouzounis, and C. Sander (1994). GeneQuiz: a workbench for sequence analysis. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (eds. Altman, R., Brutlag, D., Karp, P., Lathrop, R., Searls, D.) AAAI Press, Menlo Park, California, 348-353.
Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J. Crabtree, D.B. Searls, and G.C. Overton (in preparation). GAIA: A prototype system for high-throughput framework annotation. Manuscript in preparation.
Ricke, D.O., J.M. Buckingham, A.C. Munk, Y. Liu, D.C. Bruce, J.F. Chao, Y. Shi, R. Lobb, E.H. Saunders, H.-C. Chi, J.-R. Wu, N.A. Doggett, M.R. Altherr, L.L. Deaven, and R.K. Moyzis (in preparation). Sample Sequencing (SASE) and analysis of a one megabase region of human chromosome 16p13. Manuscript in preparation.
Smith, R.F., B.A. Wiese, M.K. Wojzynski, D.B. Davison, and K.C. Worley (1996). BCM Search Launcher - an integrated interface to molecular biology data base search and analysis services available on the World Wide Web. Genome Research 6:454-462.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.
Claverie, J.M., and D.J. States (1993). Information enhancement methods for large scale sequence analysis. Computers and Chemistry 17:191-201.
Green, P. (1994). Ancient conserved regions in gene sequences. Current Opinion in Structural Biology 4:404-412.
Xu, Y., R.J. Mural, M.B. Shah and E.C. Uberbacher (1994). Recognizing Exons in Genomic Sequence Using GRAIL II. Genetic Engineering: Principles and Methods, Jane Setlow (ed.), Plenum Press, Vol. 15.
Thomas, A. and M.H. Skolnick (1994). A probabilistic model for detecting coding regions in DNA sequences. IMA Journal of Mathematics Applied in Medicine and Biology, 11:149-160.
Borodovsky, M., and J.D. McIninch (1993). GENEMARK: Parallel gene recognition for both DNA strands. Computers & Chemistry 17(2):123-133.
Benson, D., D.J. Lipman, and J. Ostell (1993). GenBank. Nucleic Acids Research 21, 2963-2965.
Mackman, N., J.H. Morrissey, B. Fowler and T.S. Edgington (1989). Complete sequence of the human tissue factor gene, a highly regulated cellular receptor that initiates the coagulation protease cascade [humtfpb]. Biochemistry 28 (4), 1755-1762.
Reese, M.G., and F.H. Eeckman (1994). New neural network algorithms for improved eukaryotic promoter site recognition. The Seventh International Genome Sequencing and Analysis Conference, Hilton Head Island, South Carolina, September 16-20, 1995.
Reese, M.G., F.H. Eeckman, D. Kulp, D. Haussler (1997). Improved splice site detection in Genie. RECOMB (First Annual International Conference on Computational Molecular Biology, 1997), Santa Fe, ed. M. Waterman.
Lowe, T.M. and S.R. Eddy (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25:955-964.
Rozen, S. and H.J. Skaletsky (1996). Primer3. Code available at http://www-genome.wi.mit.edu/genome_software/other/primer3.html
Sonnhammer, E.L.L., and R. Durbin (1994). A workbench for large scale sequence homology analysis. Comput. Applic. Biosci. 10:301-307.
Harris, N.L., M.G. Reese, G.A. Helt, and F.H. Eeckman (1996). Towards automatic recognition of exons. Intelligent Systems for Molecular Biology '96, St. Louis, Missouri. [Poster]
Helt, G. and G. Rubin (1996). A Java-based interactive genome browser. Fourth International Conference on Intelligent Systems for Molecular Biology, St. Louis, MO, June, 1996.