Martin G. Reese and Frank H. Eeckman
Lawrence Berkeley Laboratory
Genome Informatics Group
1 Cyclotron Road
Berkeley, CA, 94720
mgreese@lbl.gov
We analyzed the structure of the individual elements within promoter sites using a novel technique that combines neural networks with weight pruning. A neural network is trained to recognize promoter elements until it reaches a local minimum. The pruning procedure then deletes the weights that contribute least to the overall prediction. After pruning, the network is retrained until it again settles in a local minimum. This cycle is repeated until a defined error level is reached. Finally, the weights that remain in the pruned network indicate which positions in the promoter element are most important.
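The train, prune, retrain cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network is a single logistic unit on one-hot encoded DNA, weight magnitude stands in for the unspecified measure of "predictive value", a fixed number of gradient steps stands in for training to a local minimum, and the data are synthetic sequences with a hypothetical TATA motif planted at a fixed offset.

```python
import numpy as np

ALPHABET = "ACGT"
MOTIF, MOTIF_START, SEQ_LEN = "TATA", 4, 12  # toy values, not from the paper

def one_hot(seq):
    """Flat one-hot encoding (4 inputs per position) plus a constant bias input."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [ALPHABET.index(c) for c in seq]] = 1.0
    return np.append(x.ravel(), 1.0)

def make_toy_data(n=200, seed=0):
    """Random sequences; every second one gets the motif planted at a fixed offset."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(n):
        s = list(rng.choice(list(ALPHABET), size=SEQ_LEN))
        if i % 2 == 0:
            s[MOTIF_START:MOTIF_START + len(MOTIF)] = MOTIF
        X.append(one_hot("".join(s)))
        y.append(float(i % 2 == 0))
    return np.array(X), np.array(y)

def predict(X, w):
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def train(X, y, w, mask, steps=1000, lr=0.5):
    """Gradient descent on the logistic loss; pruned weights are held at zero."""
    w = w * mask
    for _ in range(steps):
        grad = X.T @ (predict(X, w) - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def error_rate(X, y, w):
    return float(np.mean((predict(X, w) > 0.5) != (y > 0.5)))

def prune_and_retrain(X, y, max_error=0.05, prune_per_round=4):
    """Repeat: drop the smallest-magnitude weights, retrain, stop at the error limit."""
    n = X.shape[1]
    mask = np.ones(n)
    w = train(X, y, np.zeros(n), mask)
    while True:
        alive = np.flatnonzero(mask[:-1])  # the bias input (last column) is never pruned
        if len(alive) <= prune_per_round:
            break
        drop = alive[np.argsort(np.abs(w[alive]))[:prune_per_round]]
        trial_mask = mask.copy()
        trial_mask[drop] = 0.0
        trial_w = train(X, y, w, trial_mask)
        if error_rate(X, y, trial_w) > max_error:
            break  # keep the last network that still met the error limit
        w, mask = trial_w, trial_mask
    return w, mask

X, y = make_toy_data()
w, mask = prune_and_retrain(X, y)
surviving = sorted(set(int(i) // 4 for i in np.flatnonzero(mask[:-1])))
print("surviving weights:", int(mask[:-1].sum()), "at positions", surviving)
```

In this toy setting the surviving weights are expected to concentrate on the planted motif positions, which is the kind of positional clue the pruned network is meant to provide.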
We combine the predictions for the individual elements using time-delay neural networks (TDNNs) to obtain a prediction for the complete promoter site. TDNNs are appropriate for recognizing promoter elements because they can combine multiple features, even when those features appear at different relative positions in different sequences. Subsequent analysis of the weight matrices in these TDNNs reveals the importance of the various elements.
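The position tolerance of a TDNN comes from applying the same weight window at every offset in the sequence. The forward pass below is a minimal sketch of that idea, assuming hand-set weights rather than trained ones, two hypothetical element motifs (`TATAAA` and `CAGT`), and a max over offsets as a simplification of how a TDNN integrates evidence over time; none of these choices are taken from the paper.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [ALPHABET.index(c) for c in seq]] = 1.0
    return x

def motif_kernel(motif):
    """Illustrative weight window: +1 for the matching base, -1 for any other."""
    return 2.0 * one_hot(motif) - 1.0

def time_delay_layer(x, kernel):
    """Apply one shared weight window at every offset along the sequence."""
    width = kernel.shape[0]
    bias = -(width - 1)  # only an exact match gives a positive activation
    return np.array([np.sum(x[i:i + width] * kernel) + bias
                     for i in range(len(x) - width + 1)])

def tdnn_score(seq, motifs=("TATAAA", "CAGT")):
    """Detect each element anywhere in the sequence, then combine the detections."""
    x = one_hot(seq)
    hidden = [np.maximum(time_delay_layer(x, motif_kernel(m)), 0.0).max()
              for m in motifs]
    logit = 4.0 * sum(hidden) - 6.0  # both elements are needed for a positive call
    return 1.0 / (1.0 + np.exp(-logit))

score_a = tdnn_score("GGTATAAAGGGGCAGTGG")  # elements four bases apart
score_b = tdnn_score("CCCCTATAAACAGTCCCC")  # same elements, adjacent
score_c = tdnn_score("GGTATAAAGGGGGGGGGG")  # second element missing
print(score_a, score_b, score_c)
```

The first two sequences receive the same score although the spacing between the elements differs, while the sequence lacking one element scores below 0.5.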
With the TDNN, we achieve very good prediction accuracy using only the local core promoter features, which is somewhat surprising given the non-local structure of the promoter site. The network predicts the 9 known promoters in the complete genome of adenovirus 2 with a false positive score of 2.5%. On a test set containing 42 known human gene promoters and 84 random DNA sequences, we recognized 50% of the human gene promoters without any false positive classifications (correlation coefficient of 0.61).
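The abstract does not say which correlation coefficient is reported. The sketch below assumes the Matthews correlation coefficient, which is widely used for two-class sequence prediction, and assumes that "50%" means exactly 21 of the 42 promoters; both are assumptions, since the exact counts are not given.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Assumed counts: 21 of 42 promoters found, none of the 84 random sequences flagged.
tp, fn = 21, 21
tn, fp = 84, 0
mcc = matthews_corrcoef(tp, fp, tn, fn)
print(round(mcc, 3))  # 0.632
```

Under these assumptions the value is about 0.63, close to but not identical to the reported 0.61; the abstract does not give enough detail to account for the difference, and the reported percentage may be rounded.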
In the future, we expect to improve performance by combining our TDNN method with gene-finding prediction methods, which should reduce the number of false positive predictions.