New Neural Network Algorithms for Improved Eukaryotic Promoter Site Recognition

Martin G. Reese and Frank H. Eeckman
Lawrence Berkeley Laboratory
Genome Informatics Group
1 Cyclotron Road
Berkeley, CA, 94720
mgreese@lbl.gov

Abstract:

We present a detailed theoretical study of the organization and sequence of the eukaryotic polymerase II promoter site. The function of the eukaryotic promoter as an initiator of transcription is one of the most complex processes in molecular biology. Multiple functional sites in the primary DNA sequence have been shown to be involved in the polymerase binding process. These elements, such as the TATA-box, GC-box, CAAT-box and the transcription start site, are known to function as binding sites for transcription factors and other proteins that are involved in the initiation process. The elements occur in various combinations and are separated by varying distances in the sequence.

We analyzed the structure of the individual elements within promoter sites using a novel technique that combines neural networks with weight pruning. A neural network is trained to recognize promoter elements until it reaches a local minimum. The pruning procedure then deletes the weights that contribute least to the overall prediction. After pruning, the network is retrained until it again settles in a minimum. This procedure is repeated until a defined error level is reached. Studying the weights that remain in the pruned network then gives clues about the importance of specific positions in the promoter element.
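
The train, prune, retrain cycle described above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the network used in this work: the model is a single logistic unit, "lowest predictive value" is approximated by the smallest weight magnitude, a fixed number of gradient steps stands in for "trained until a minimum is reached", and the names `train`, `error_rate` and `prune_and_retrain` as well as the toy data are hypothetical.

```python
import itertools
import math


def train(weights, mask, data, lr=0.5, epochs=200):
    """Full-batch gradient descent on the logistic loss.

    Pruned weights (mask == 0) are held at zero and never updated.
    A fixed epoch count is an assumed stand-in for "until a minimum".
    """
    n = len(weights)
    for _ in range(epochs):
        grad = [0.0] * n
        for x, y in data:
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(n):
                grad[i] += (p - y) * x[i] / len(data)
        for i in range(n):
            if mask[i]:
                weights[i] -= lr * grad[i]
    return weights


def error_rate(weights, data):
    """Fraction of examples on the wrong side of the decision boundary."""
    wrong = 0
    for x, y in data:
        z = sum(w * xi for w, xi in zip(weights, x))
        wrong += int((z > 0) != (y == 1))
    return wrong / len(data)


def prune_and_retrain(data, n_inputs, max_error=0.05):
    """Repeat: prune the weakest weight, retrain, stop at the error limit."""
    weights = [0.0] * n_inputs
    mask = [1] * n_inputs
    train(weights, mask, data)
    while sum(mask) > 1:
        active = [j for j in range(n_inputs) if mask[j]]
        # Assumption: smallest |weight| approximates lowest predictive value.
        weakest = min(active, key=lambda j: abs(weights[j]))
        trial_weights = list(weights)
        trial_mask = list(mask)
        trial_weights[weakest] = 0.0
        trial_mask[weakest] = 0
        train(trial_weights, trial_mask, data)
        if error_rate(trial_weights, data) > max_error:
            break  # pruning further would exceed the defined error level
        weights, mask = trial_weights, trial_mask
    return weights, mask


# Toy data: four inputs in {-1, +1}; only input 0 determines the label.
data = [(x, 1 if x[0] == 1 else 0)
        for x in itertools.product([-1, 1], repeat=4)]
weights, mask = prune_and_retrain(data, n_inputs=4)
print(mask, [round(w, 2) for w in weights])
```

In this toy setting the surviving entry of `mask` marks the only informative input, which mirrors how the remaining weights of a pruned network point to the important positions of a promoter element.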

We combine the individual predictions for each element into a complete promoter site prediction using time-delay neural networks (TDNNs). TDNNs are appropriate for recognizing promoter elements because they can combine multiple features, even those that appear at different relative positions in different sequences. Subsequent analysis of the weight matrices in these TDNNs reveals the importance of the various elements.
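
The position tolerance of a TDNN comes from applying the same weights at every offset along the sequence. The sketch below shows a single such unit; it is an illustration only, with assumptions that are not taken from this work: the weights are set by hand to favor the string TATAAA instead of being trained, the unit's responses are combined by taking the maximum over positions, and the helper names `one_hot`, `motif_kernel` and `tdnn_unit` are hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def motif_kernel(motif, match=2.0, mismatch=-2.0):
    """Hand-set weights (an assumption; a real TDNN would learn them)."""
    return [[match if base == m else mismatch for base in BASES]
            for m in motif]


def tdnn_unit(seq, kernel, bias):
    """Slide one shared weight window over the sequence.

    Returns the sigmoid activation at every window start position.
    """
    x = one_hot(seq)
    width = len(kernel)
    activations = []
    for start in range(len(x) - width + 1):
        z = bias
        for k in range(width):
            for c in range(4):
                z += kernel[k][c] * x[start + k][c]
        activations.append(1.0 / (1.0 + math.exp(-z)))
    return activations


kernel = motif_kernel("TATAAA")
bias = -10.0  # chosen so that only a full six-base match exceeds 0.5

acts_a = tdnn_unit("GGCCTATAAAGCGCGC", kernel, bias)  # motif at offset 4
acts_b = tdnn_unit("GCGCGCGGTATAAACC", kernel, bias)  # motif at offset 8
acts_none = tdnn_unit("GCGCGCGCGCGCGCGC", kernel, bias)  # no motif

for name, acts in [("a", acts_a), ("b", acts_b), ("none", acts_none)]:
    best = max(acts)
    print(name, acts.index(best), round(best, 3))
```

Because the weights are shared across offsets, the unit responds with the same strength wherever the motif occurs, and a later layer can combine several such units whose motifs sit at varying distances from one another.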

With the TDNN, we achieve very good prediction accuracy for the local core promoter features, which is somewhat surprising given the non-local structure of the promoter site. The network predicts the 9 known promoters in the whole genome of the adeno2-virus with a false positive score of 2.5%. On a test set containing 42 known human gene promoters and 84 random DNA sequences, we were able to recognize 50% of the human gene promoters without any false positive classifications (correlation coefficient of 0.61).
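
As a rough check of how a correlation coefficient follows from such counts, the sketch below assumes that the reported value is the Matthews correlation coefficient computed from the confusion matrix; the abstract does not state which definition was used, and the function name `matthews_cc` is hypothetical. The counts are derived from the test set sizes above, with zero false positives among the 84 random sequences.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Exactly half of the 42 promoters recognized, no false positives.
half = matthews_cc(tp=21, tn=84, fp=0, fn=21)

# One promoter fewer, for comparison.
twenty = matthews_cc(tp=20, tn=84, fp=0, fn=22)

print(round(half, 3), round(twenty, 3))
```

Under this assumption, 21 of 42 recognized promoters gives about 0.63 and 20 of 42 gives about 0.61, so the reported figures are mutually consistent if the 50% is a rounded value. This is an inference from the arithmetic, not something stated in the abstract.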

In the future, we expect to improve performance by combining our TDNN method with gene-finding prediction methods, which should reduce the number of false positive predictions.
