Interpolated Markov Chains for Eukaryotic Promoter Recognition
Uwe Ohler, Stefan Harbeck, Heinrich Niemann, Elmar Noeth, and Martin G. Reese
Chair for Pattern Recognition (Computer Science V)
University of Erlangen-Nuremberg
Martensstrasse 3
D-91058 Erlangen
and
Department of Molecular and Cell Biology
University of California at Berkeley
539 Life Sciences Addition
Berkeley, CA 94720-3200
Abstract:
We describe a new content-based approach for the detection of promoter regions
of eukaryotic protein-encoding genes. Our system is based on three interpolated
Markov chains (IMCs) of different order, which are trained on coding,
non-coding, and promoter sequences, respectively. It was recently shown
that the interpolation of Markov chains leads to stable parameters and
improves on the results in microbial gene finding (Salzberg et
al., 1998).
Here, we present new methods for the automated estimation of optimal
interpolation parameters and show how the IMCs can be applied to detect
promoters in contiguous DNA sequences.
Our interpolation approach can also be employed to obtain a reliable scoring
function for human coding DNA regions, and the trained models can easily be
incorporated into the general framework of gene recognition systems.
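
To make the idea of an interpolated Markov chain concrete, the following is a minimal sketch, not the implementation used in this work. It assumes fixed, equal interpolation weights and add-one smoothing, whereas the approach described here estimates the interpolation parameters automatically. The class name InterpolatedMarkovChain, the helper log_odds_per_base, and the toy training strings are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class InterpolatedMarkovChain:
    """Mixes Markov chains of order 0..max_order into one distribution."""

    def __init__(self, max_order, weights=None, pseudocount=1.0):
        self.max_order = max_order
        # ASSUMPTION: equal weights per order; the paper estimates these.
        self.weights = list(weights) if weights else [1.0] * (max_order + 1)
        self.pseudocount = pseudocount
        # counts[k][context][symbol] = occurrences of symbol after context
        self.counts = [defaultdict(lambda: defaultdict(float))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, symbol in enumerate(seq):
                # Update every order whose context fits before position i.
                for k in range(min(self.max_order, i) + 1):
                    self.counts[k][seq[i - k:i]][symbol] += 1.0

    def _order_prob(self, k, context, symbol):
        # Smoothed relative frequency for a single order k.
        table = self.counts[k].get(context, {})
        total = sum(table.values())
        return ((table.get(symbol, 0.0) + self.pseudocount)
                / (total + self.pseudocount * len(ALPHABET)))

    def prob(self, history, symbol):
        # Weighted sum over all orders usable with the given history.
        usable = min(self.max_order, len(history))
        norm = sum(self.weights[:usable + 1])
        return sum(
            self.weights[k] / norm
            * self._order_prob(k, history[len(history) - k:], symbol)
            for k in range(usable + 1)
        )

    def log_likelihood(self, seq):
        return sum(
            math.log(self.prob(seq[max(0, i - self.max_order):i], seq[i]))
            for i in range(len(seq))
        )


def log_odds_per_base(seq, model, background):
    """Length-normalized log-odds; positive values favor `model`."""
    return (model.log_likelihood(seq) - background.log_likelihood(seq)) / len(seq)


# Toy data (hypothetical, not from the paper's sequence sets).
promoter = InterpolatedMarkovChain(max_order=2)
background = InterpolatedMarkovChain(max_order=2)
promoter.train(["TATAAATATAAATATAAA"])
background.train(["GCGCGCGCGCGCGCGCGC"])
score = log_odds_per_base("TATAAATA", promoter, background)
```

In a setting with three models (promoter, coding, non-coding), the same log-odds comparison could be applied to each pair of models over a window sliding along a contiguous sequence; how the window scores are combined and thresholded is left open in this sketch.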
Results:
A fivefold cross-validation of our IMC approach on a
representative sequence set yielded mean correlation coefficients of
0.84 (promoter vs. coding sequences) and 0.53 (promoter vs. non-coding
sequences). Applied to the task of eukaryotic promoter region identification
in genomic DNA sequences, our classifier identifies 50% of the promoter
regions in the sequences used in the most recent review and comparison by
Fickett et al. (1997), at a rate of one false positive per 849 bp.
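
The correlation coefficient is not defined in this summary. As a sketch, assuming it is the measure commonly used to evaluate two-class sequence classifiers (also known as the Matthews correlation coefficient), it can be computed from the counts of true and false positives and negatives. The function name correlation_coefficient is hypothetical.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation between predicted and true class labels, in [-1, 1].

    ASSUMPTION: this is the Matthews form; the text above does not
    state which definition was used.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined when a row or column of the confusion matrix is empty;
        # returning 0.0 here is a convention, not part of the definition.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A value of 1 corresponds to perfect classification, 0 to a prediction no better than chance, and -1 to complete disagreement.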