Integrating Database Homology in a Probabilistic Gene Structure Model
David Kulp and David Haussler
Baskin Center for Computer
Engineering and Information Sciences
University of California
Santa Cruz, CA 95064
{dkulp,haussler}@cse.ucsc.edu
Martin G. Reese and Frank H. Eeckman
Lawrence Berkeley Laboratory
Genome Informatics Group
1 Cyclotron Road
Berkeley, CA, 94720
{martinr,eeckman}@genome.lbl.gov
Abstract:
We present an improved stochastic model of
genes in DNA, and describe a method for integrating database
homology into the probabilistic framework. A generalized hidden
Markov model (GHMM) describes the grammar of a legal parse of a
DNA sequence. Probabilities are estimated for gene features by
using dynamic programming to combine information from multiple sensors.
We show how matches to homologous sequences from a database can be integrated
into the probability estimation by interpreting the likelihood of a
sequence in terms of the bit-cost to encode a sequence given a homology
match. We also demonstrate how homology matches in protein databases
can be exploited to help identify splice sites.
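
The bit-cost reading of likelihood can be illustrated with a minimal sketch. It is not Genie's actual scoring: the function names, the uniform background model, and the 0.85 match probability are all assumptions chosen for illustration. The only idea taken from the text is that a sequence costing b bits to encode has likelihood 2^-b, so a sequence that agrees with a homology match is cheaper to encode and therefore more likely.

```python
import math

BASES = "ACGT"

def uniform_model(length):
    """Background model (assumed): every base equally likely at each position."""
    return [{b: 0.25 for b in BASES} for _ in range(length)]

def match_model(match_seq, p_match=0.85):
    """Hypothetical homology model: the matched base gets probability p_match,
    and the remaining probability is split evenly over the other three bases."""
    p_other = (1.0 - p_match) / 3.0
    return [{b: (p_match if b == m else p_other) for b in BASES} for m in match_seq]

def encoding_cost_bits(sequence, position_probs):
    """Bits needed to encode the sequence: the sum of -log2 p(base) per position."""
    if len(sequence) != len(position_probs):
        raise ValueError("sequence and model must have the same length")
    return sum(-math.log2(probs[base]) for base, probs in zip(sequence, position_probs))

def likelihood_from_bits(bits):
    """A cost of b bits corresponds to a likelihood of 2**-b."""
    return 2.0 ** -bits

seq = "ACGT"
background_bits = encoding_cost_bits(seq, uniform_model(len(seq)))  # 2 bits per base
homology_bits = encoding_cost_bits(seq, match_model("ACGT"))        # cheaper given the match
```

Under these assumptions the background cost is 2 bits per base, while the cost given a perfect match is well under 1 bit for the whole four-base sequence, so the homology-conditioned likelihood is correspondingly higher.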
Our experiments show significant improvements in the sensitivity and
specificity of gene structure identification when these new features
are added to our gene-finding system, Genie. In tests on a standard
set of annotated genes, Genie identified 95% of coding nucleotides
correctly with a specificity of 91%, and identified 77% of exons
exactly.
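
The nucleotide-level figures above can be read against a short sketch of how such measures are commonly computed. The abstract does not define its metrics, so this assumes the convention usual in gene-finding evaluations: sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP), counted over coding nucleotides. The function name and the 'C'/'N' labelling are hypothetical.

```python
def nucleotide_accuracy(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for per-nucleotide coding labels.

    Each label string uses 'C' for a coding nucleotide and 'N' for non-coding.
    Assumed convention: specificity is TP / (TP + FP), the fraction of
    predicted coding nucleotides that are truly coding.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == "C" and p == "C")  # coding, predicted coding
    fn = sum(1 for t, p in pairs if t == "C" and p == "N")  # coding, missed
    fp = sum(1 for t, p in pairs if t == "N" and p == "C")  # non-coding, predicted coding
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity

# Toy example: the predicted exon is shifted one base to the left of the true exon.
sn, sp = nucleotide_accuracy("NNCCCCNNNN", "NCCCCNNNNN")
```

In the toy example three of the four true coding bases are recovered and three of the four predicted coding bases are correct, so both measures are 0.75. An exon-level count such as "identified exactly" is stricter, since it requires both exon boundaries to match.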