Detection of eukaryotic promoter regions using polygrams

Ohler, U. and Reese, M.G.

In R. Hofestädt, editor, Molekulare Bioinformatik, pages 89-100, Aachen, 1998. Shaker


We present a new search-by-content method to identify transcriptional regulatory regions in eukaryotic genomic sequences. The method is based on stochastic language models, which are a straightforward generalization of oligomer statistics. We describe the theoretical background and the different parameter estimation techniques used to build the models. The resulting language models are applied to two tasks: classifying fixed-length sequences as promoters or non-promoters, and searching for transcription start sites in contiguous sequences.
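To make the two tasks concrete, the following is a minimal sketch of classification by log-odds between two n-gram models, followed by a sliding-window search. It is not the authors' implementation: the fixed context order, the add-one smoothing, the threshold, and all names (`NGramModel`, `log_odds`, `classify`, `scan`) are assumptions chosen for illustration, and the toy training sequences are made up.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class NGramModel:
    """Fixed-order Markov chain over DNA (illustrative assumption,
    not the polygram estimation scheme described in the paper)."""

    def __init__(self, order=1):
        self.order = order
        # counts[context][next_base] = number of occurrences in training data
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.order, len(seq)):
                context = seq[i - self.order:i]
                self.counts[context][seq[i]] += 1

    def log_prob(self, seq):
        """Log-probability of seq given its first `order` bases,
        using add-one smoothing (an assumption)."""
        total = 0.0
        for i in range(self.order, len(seq)):
            context = seq[i - self.order:i]
            ctx_counts = self.counts.get(context, {})
            numerator = ctx_counts.get(seq[i], 0) + 1
            denominator = sum(ctx_counts.values()) + len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def log_odds(seq, promoter_model, background_model):
    """Score a sequence as log P(seq | promoter) - log P(seq | background)."""
    return promoter_model.log_prob(seq) - background_model.log_prob(seq)


def classify(seq, promoter_model, background_model, threshold=0.0):
    """Label a fixed-length sequence as promoter if its score exceeds the threshold."""
    return log_odds(seq, promoter_model, background_model) > threshold


def scan(sequence, promoter_model, background_model, window, threshold=0.0):
    """Slide a fixed-length window over a contiguous sequence and
    return (start, score) for every window scoring above the threshold."""
    hits = []
    for start in range(len(sequence) - window + 1):
        score = log_odds(sequence[start:start + window],
                         promoter_model, background_model)
        if score > threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Made-up toy data, only to exercise the functions.
    promoter = NGramModel(order=1)
    promoter.train(["TATAAATATATAAA", "TATATATAAATATA"])
    background = NGramModel(order=1)
    background.train(["GCGCGGCGCCGCGC", "CGCGCCGGCGCGCG"])
    print(classify("TATATATA", promoter, background))
    print(scan("GCGCGCGC" + "TATATATA" + "GCGCGCGC", promoter, background, window=8))
```

In a sketch like this, raising the threshold trades fewer false positives for fewer recovered start sites, and overlapping windows above the threshold would normally be merged into a single prediction.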

Detailed classification results for human and Drosophila data sets are presented, and the practical applicability of the models is demonstrated on an independent test set of vertebrate genomic sequences. On this set, which has already been used to compare different computational approaches for promoter recognition, the performance of our method is comparable to the best algorithms described so far. The number of false positives can be further reduced by a postprocessing step on the output scores. When both strands of the independent test set are examined, the models identify about half of the annotated transcription start sites (12 out of 22) while making a false prediction roughly every 800 base pairs.