BDGP: Genie: Abstract


Genie: Abstract

A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA

David Kulp and David Haussler
Baskin Center for Computer
Engineering and Information Sciences
University of California
Santa Cruz, CA 95064

Martin G. Reese and Frank H. Eeckman
Lawrence Berkeley Laboratory
Genome Informatics Group
1 Cyclotron Road
Berkeley, CA, 94720

We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo and Haussler, 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching.
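The dynamic programming step can be illustrated with a minimal sketch of a segment-based Viterbi search, in which each state emits a whole segment scored by a pluggable sensor. This is not Genie's implementation: the function name `best_parse`, the two toy states `E` and `I`, the base frequencies, the uniform transition scores, and the absence of any length distribution are all assumptions chosen to keep the example small.

```python
import math

NEG_INF = float("-inf")

def best_parse(seq, states, start, trans, score, max_len):
    """Return (log_prob, [(state, begin, end), ...]) for the best segmentation.

    start[s] and trans[(p, s)] are log-probabilities; score(s, segment) is the
    log-probability a (hypothetical) sensor assigns to the segment in state s.
    """
    n = len(seq)
    # best[j][s]: best log-prob of parsing seq[:j] with the last segment in state s
    best = [dict.fromkeys(states, NEG_INF) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for end in range(1, n + 1):
        for s in states:
            for begin in range(max(0, end - max_len), end):
                emit = score(s, seq[begin:end])
                if emit == NEG_INF:
                    continue
                if begin == 0:
                    candidates = [(start.get(s, NEG_INF) + emit, None)]
                else:
                    candidates = [
                        (best[begin][p] + trans.get((p, s), NEG_INF) + emit, p)
                        for p in states
                    ]
                for value, prev in candidates:
                    if value > best[end][s]:
                        best[end][s] = value
                        back[end][s] = (begin, prev)
    # Trace back from the best final state to recover the segments.
    state = max(states, key=lambda s: best[n][s])
    log_prob = best[n][state]
    if log_prob == NEG_INF:
        return log_prob, []
    parse, end = [], n
    while state is not None:
        begin, prev = back[end][state]
        parse.append((state, begin, end))
        state, end = prev, begin
    parse.reverse()
    return log_prob, parse

# Toy sensors (assumed values): "E" prefers G/C, "I" prefers A/T.
BASE_PROBS = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def toy_sensor(state, segment):
    return sum(math.log(BASE_PROBS[state][base]) for base in segment)

states = ["E", "I"]
start = {"E": math.log(0.5), "I": math.log(0.5)}
trans = {("E", "I"): 0.0, ("I", "E"): 0.0}  # states must alternate
log_prob, parse = best_parse("GCGCATAT", states, start, trans, toy_sensor, max_len=8)
print(parse)
```

Because the sensor is passed in as a function, swapping in a different scoring model or adding a state only changes the arguments, which mirrors the modularity claimed for the GHMM framework.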

The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon frequency model conditioned on windowed nucleotide frequency and the preceding codon. Two neural networks are used, as in Brunak et al. (1991), for splice site prediction.
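As a rough illustration of a codon model conditioned on the preceding codon, the sketch below estimates codon-pair frequencies from in-frame coding sequences and scores a candidate exon by its log-likelihood. It is a simplification under stated assumptions: the conditioning on windowed nucleotide frequency is omitted, add-one smoothing is an arbitrary choice, and the names `train` and `log_score` are hypothetical rather than taken from Genie.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons

def codons_of(seq):
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def train(coding_seqs):
    """Count single codons and adjacent codon pairs in the training sequences."""
    unigram, bigram = Counter(), Counter()
    for seq in coding_seqs:
        cs = codons_of(seq)
        unigram.update(cs)
        bigram.update(zip(cs, cs[1:]))
    return unigram, bigram

def log_score(seq, unigram, bigram):
    """Log-likelihood of seq: first codon from unigram counts, the rest
    conditioned on the preceding codon, both with add-one smoothing (assumed)."""
    cs = codons_of(seq)
    if not cs:
        return 0.0
    total = sum(unigram.values())
    score = math.log((unigram[cs[0]] + 1) / (total + len(CODONS)))
    for prev, cur in zip(cs, cs[1:]):
        prev_total = sum(bigram[(prev, c)] for c in CODONS)
        score += math.log((bigram[(prev, cur)] + 1) / (prev_total + len(CODONS)))
    return score

unigram, bigram = train(["ATGGCCATGGCC"])
exon_score = log_score("ATGGCCATG", unigram, bigram)
print(exon_score)
```

A function like `log_score` could serve as the segment sensor for an exon state in a segment-based parser; a real sensor would need far more training data than this single toy sequence.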

We show that this simple model performs quite well. On a cross-validated standard test set of 304 genes in human DNA [ftp://www-hgc.lbl.gov/pub/genesets], our gene-finding system correctly identified up to 85% of protein-coding bases with a specificity of 80%. At the exon level, 58% of exons were identified exactly, with a specificity of 51%. Genie is shown to perform favorably compared with several other gene-finding systems.
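The abstract does not define its accuracy measures, so the following sketch assumes the convention common in gene-finding evaluations: sensitivity is the fraction of true items that were predicted, TP / (TP + FN), and specificity is the fraction of predicted items that are true, TP / (TP + FP). The function names and the toy exon coordinates (half-open intervals) are illustrative only and are not taken from the test set.

```python
def coding_positions(exons):
    """Expand half-open (begin, end) exon intervals into a set of base positions."""
    return {pos for begin, end in exons for pos in range(begin, end)}

def base_level(true_exons, predicted_exons):
    """Sensitivity and specificity over individual protein-coding bases."""
    truth = coding_positions(true_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(truth & predicted)
    sensitivity = tp / len(truth) if truth else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity

def exon_level(true_exons, predicted_exons):
    """Sensitivity and specificity counting only exons with both boundaries exact."""
    truth, predicted = set(true_exons), set(predicted_exons)
    exact = len(truth & predicted)
    sensitivity = exact / len(truth) if truth else 0.0
    specificity = exact / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Toy annotation: three true exons, two predictions (one exact, one overlapping).
true_exons = [(10, 20), (30, 40), (50, 60)]
predicted_exons = [(10, 20), (32, 45)]

base_sn, base_sp = base_level(true_exons, predicted_exons)
exon_sn, exon_sp = exon_level(true_exons, predicted_exons)
print(base_sn, base_sp, exon_sn, exon_sp)
```

The toy case shows why exon-level figures are typically lower than base-level ones: the second prediction recovers most of the exon's bases but counts as a miss at the exon level because one boundary is off.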

Download paper (56 Kb compressed postscript).