Towards Automatic Recognition of Exons

Nomi L. Harris, Martin G. Reese, Gregg A. Helt*, and Frank H. Eeckman

Human Genome Informatics Group, Lawrence Berkeley National Laboratory

* Berkeley Drosophila Genome Project, University of California, Berkeley

nlharris@lbl.gov, mgreese@lbl.gov, gregg@fruitfly.berkeley.edu, fheeckman@lbl.gov


Keywords: sequence analysis, genetic; gene finding; user interfaces; databases, sequence

In the quest to decode the meaning of DNA sequences, the recognition of coding regions is of key importance. Although computational gene-finding methods continue to improve, none of them is perfectly accurate. Because different methods have different strengths and weaknesses, we can combine information from several predictions to make conservative exon predictions. If an exon predicted by Genie (a new gene-finding method that we are developing in collaboration with UCSC) coincides with exons predicted by other gene-finding programs such as GRAIL, and also with a region of homology (found by BLAST) to a known sequence, we can be fairly confident that it is a true exon. Moreover, since the middle of a BLAST hit is more likely than its ends to indicate a significant match, we can improve our prediction by restricting it to the region where the exon predictions and sequence homologies overlap. Our goal is to define an automatic scoring scheme that combines all of these methods and assigns a probability score to each predicted exon.

We are developing a workbench for automatic sequence annotation and annotation browsing. The annotation workbench runs a series of analysis tools (including exon predictions and homology searches) on a DNA sequence. The annotated sequences can then be viewed with the browser that we built using Gregg Helt's bioTkperl widgets. Below, the annotation browser is shown displaying the annotations on part of the forward strand of HUMTFPB, a 13,865-basepair human sequence obtained from GenBank. Annotations appear as colored rectangles in the appropriate position in the sequence.

In HUMTFPB, Genie has predicted an exon from 6392 to 6592. xpound, Genefinder, and GRAIL have all predicted exons that overlap it, and a GenPept hit that coincides with the exon predictions provides further support. Our conservative exon prediction is the region common to the various predictions and the GenPept hit, shown by the shaded area; it extends from 6412 to 6591. The true exon (GenBank CDS) runs from 6392 to 6591, so the conservative prediction matches its end exactly and begins 20 bases inside its start. All aspects of the annotation workbench are currently under development, as is the exon scoring method.
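The overlap step described above can be illustrated with a minimal Python sketch. It is not the workbench's code: the function name `conservative_exon` and the inclusive `(start, end)` representation are assumptions made for illustration. Only the Genie coordinates (6392 to 6592) come from the text; the coordinates given for xpound, Genefinder, GRAIL, and the GenPept hit are hypothetical values chosen so that the shared region matches the reported 6412 to 6591.

```python
from typing import Dict, Optional, Tuple

# Inclusive (start, end) positions on the forward strand.
Interval = Tuple[int, int]


def conservative_exon(evidence: Dict[str, Interval]) -> Optional[Interval]:
    """Return the region shared by every interval, or None if there is none."""
    if not evidence:
        return None
    # The shared region starts at the latest start and ends at the earliest end.
    start = max(s for s, _ in evidence.values())
    end = min(e for _, e in evidence.values())
    return (start, end) if start <= end else None


evidence = {
    "Genie": (6392, 6592),        # coordinates given in the text
    "xpound": (6400, 6591),       # hypothetical coordinates
    "Genefinder": (6392, 6600),   # hypothetical coordinates
    "GRAIL": (6405, 6595),        # hypothetical coordinates
    "GenPept hit": (6412, 6640),  # hypothetical coordinates
}

print(conservative_exon(evidence))  # (6412, 6591)
```

Taking the strict intersection means that any single short prediction or homology hit shrinks the result, which is what makes the prediction conservative; a probability-based scoring scheme of the kind planned above would presumably weigh the sources rather than treat them equally.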


References

1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D. J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

2. P. Green (1994). Ancient conserved regions in gene sequences. Current Opinion in Structural Biology, 4:404-412.

3. E.C. Uberbacher and R.J. Mural (1991). Locating Protein-Coding Regions in Human DNA Sequences by a Multiple Sensor-Neural Network Approach. PNAS 88:11261-11265.

4. Y. Xu, R.J. Mural, M.B. Shah and E.C. Uberbacher (1994). Recognizing Exons in Genomic Sequence Using GRAIL II. Genetic Engineering: Principles and Methods, Jane Setlow (ed.), Plenum Press, Vol. 15.

5. A. Thomas and M.H. Skolnick (1994). A probabilistic model for detecting coding regions in DNA sequences. IMA Journal of Mathematics Applied in Medicine and Biology, 11:149-160.

6. D. Benson, D.J. Lipman, and J. Ostell (1993). GenBank. Nucleic Acids Research 21:2963-2965.

7. D. Kulp, D. Haussler, M.G. Reese, and F.H. Eeckman. A generalized Hidden Markov Model for the recognition of human genes in DNA. Submitted to ISMB-96.

8. M.G. Reese and F.H. Eeckman (1995). New neural network algorithms for improved eukaryotic promoter site recognition. The Seventh International Genome Sequencing and Analysis Conference, Hilton Head Island, South Carolina, September 16-20, 1995.

9. G. Helt, manuscript in preparation. See also http://fruitfly.berkeley.edu/BDGP/informatics/bioTkperl.html.

10. D.B. Searls (1995). bioTk: Componentry for genome informatics graphical user interfaces. Gene, 163(2):GC1-16. See also http://www.cbil.upenn.edu/~dsearls/bioTk_paper/paper.html

11. N. Mackman, J.H. Morrissey, B. Fowler, and T.S. Edgington (1989). Complete sequence of the human tissue factor gene, a highly regulated cellular receptor that initiates the coagulation protease cascade. Biochemistry 28(4):1755-1762.