Novel Neural Network Prediction Systems for Human Promoters and Splice Sites

Martin G. Reese and Frank H. Eeckman
Lawrence Berkeley National Laboratory
Genome Informatics Group
1 Cyclotron Road
Berkeley, CA, 94720
mgreese@lbl.gov, fheeckman@lbl.gov

Abstract:

We present a detailed theoretical study of the organization and structure of landmark sequences, such as promoters and splice junctions, in human DNA. Improved detection of these landmark sequences in genomic DNA is important for subsequent combinatorial approaches to exon and gene assembly.

Transcription initiation at eukaryotic promoters and RNA assembly at splice sites are among the most complex processes in molecular biology. Both promoters and splice sites consist of multiple functional sites in primary DNA that are involved in polymerase binding and splicing, respectively.

We analyzed the structure of the individual elements within promoters and splice sites using a novel technique that combines neural networks with weight pruning. A neural network is trained to recognize promoter or splice site elements until it reaches a local minimum. The pruning procedure then deletes the weights that contribute the least to the overall prediction. After pruning, the network is retrained until it settles into a new minimum. This procedure is repeated until a defined error level is reached. Finally, studying the weights that remain in the pruned network gives clues about the importance of specific positions in the promoter element and splice junction.
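The train, prune, retrain cycle described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: a single sigmoid unit stands in for the neural network, weight magnitude stands in for "lowest predictive value", and the toy data (a hypothetical "TA" signal at positions 2 and 3 of a six-base window), the function names, and the error limit are invented.

```python
import math
import random

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat list of 4 * len(seq) zeros and ones."""
    return [1.0 if base == letter else 0.0 for base in seq for letter in ALPHABET]

def predict(weights, bias, x):
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def mean_error(weights, bias, data):
    """Mean squared error of the unit over the data set."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def train(weights, bias, mask, data, epochs=50, rate=0.5):
    """Gradient descent on a sigmoid unit; pruned weights (mask 0) stay at zero."""
    for _ in range(epochs):
        for x, y in data:
            delta = predict(weights, bias, x) - y
            bias -= rate * delta
            for i, xi in enumerate(x):
                if mask[i] and xi:
                    weights[i] -= rate * delta * xi
    return weights, bias

def prune_and_retrain(data, n_inputs, max_error=0.05):
    """Alternate pruning and retraining until the error limit is exceeded."""
    weights, mask, bias = [0.0] * n_inputs, [1] * n_inputs, 0.0
    weights, bias = train(weights, bias, mask, data)
    while sum(mask) > 1:
        # Assumption: the smallest-magnitude weight has the lowest predictive value.
        i = min((j for j in range(n_inputs) if mask[j]), key=lambda j: abs(weights[j]))
        saved = (list(weights), bias)
        mask[i], weights[i] = 0, 0.0
        weights, bias = train(weights, bias, mask, data)
        if mean_error(weights, bias, data) > max_error:
            mask[i] = 1            # undo the last prune and stop
            weights, bias = saved
            break
    return weights, bias, mask

def toy_data(n=60, length=6, seed=0):
    """Hypothetical data: positives carry 'TA' at positions 2 and 3."""
    rng = random.Random(seed)
    data = []
    for k in range(n):
        seq = [rng.choice(ALPHABET) for _ in range(length)]
        label = k % 2
        if label:
            seq[2:4] = "TA"
        elif seq[2] == "T" and seq[3] == "A":
            seq[3] = "C"       # keep the signal out of the negatives
        data.append((one_hot("".join(seq)), float(label)))
    return data

data = toy_data()
weights, bias, mask = prune_and_retrain(data, n_inputs=24)
# Surviving weights as (position, base) pairs: the "clues" left after pruning.
kept = [(i // 4, ALPHABET[i % 4]) for i in range(24) if mask[i]]
print(kept)
```

In this toy setting, the surviving (position, base) pairs play the role of the remaining weights that the abstract interprets as markers of important positions.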

For complete promoter site prediction, we combine the individual predictions for each element using time-delay neural networks (TDNNs). TDNNs are appropriate for recognizing promoter elements because they can combine multiple features, even those that appear at different relative positions in different sequences. Another advantage is the high selectivity of the TDNN, which is extremely important for promoter prediction systems, since these are known to produce a large number of false positive classifications.
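The position tolerance of a time-delay network comes from applying the same weights at every offset along the sequence and then pooling over positions. The sketch below shows only that idea. It is not the network used in this work: the kernels are hand-set rather than learned, there are no nonlinearities, and the motifs "TATA" and "CA" and all function names are assumptions chosen for illustration.

```python
def motif_kernel(motif):
    """Hypothetical hand-set kernel: weight 1.0 for each matching base."""
    return [{base: 1.0} for base in motif]

def time_delay_scores(seq, kernel):
    """Apply the same kernel weights at every window position (shared weights)."""
    width = len(kernel)
    return [
        sum(kernel[k].get(seq[p + k], 0.0) for k in range(width))
        for p in range(len(seq) - width + 1)
    ]

def detect(seq, kernel, threshold):
    """Return (found, position of best score), wherever the feature sits."""
    scores = time_delay_scores(seq, kernel)
    best = max(scores)
    return best >= threshold, scores.index(best)

def combined_score(seq, kernels):
    """Combine several features by summing each one's best score over positions."""
    return sum(max(time_delay_scores(seq, k)) for k in kernels)

tata = motif_kernel("TATA")
print(detect("GGCTATAAGC", tata, threshold=4.0))   # feature at offset 3
print(detect("TATAGGCCGG", tata, threshold=4.0))   # same feature at offset 0
print(combined_score("GGCTATAAGCCAGG", [tata, motif_kernel("CA")]))
```

Because pooling discards the offset, the two features in the last call contribute to the combined score whatever their spacing, which is the property the paragraph above attributes to TDNNs.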

This TDNN predicts most of the annotated promoters in a set of human genes from GenBank (version 86.0). As an example, the TDNN finds the annotated promoter in a 13,865 base pair test gene, HUMTFPB.gb_pr, with a false positive rate of 0.07% (10 false positive predictions out of 13,865). On a test set containing 42 known human gene promoters and 84 random DNA sequences, we were able to recognize 50% of the human gene promoters without any false positive classifications (correlation coefficient of 0.61).
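The figures above can be checked with a short calculation. The abstract does not say which correlation coefficient is meant or give exact counts, so the sketch assumes the Matthews correlation coefficient for a binary classifier and derives the counts from the stated percentages; the function names are illustrative.

```python
import math

def false_positive_rate(false_positives, positions):
    """Fraction of scored positions that were falsely predicted."""
    return false_positives / positions

def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient (assumed definition, see lead-in)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# 10 false positives over the 13,865 bp test gene.
rate = false_positive_rate(10, 13865)

# 42 promoters and 84 random sequences with no false positives.
cc_21 = correlation_coefficient(tp=21, tn=84, fp=0, fn=21)  # exactly half found
cc_20 = correlation_coefficient(tp=20, tn=84, fp=0, fn=22)  # 20 of 42 found

print(f"{rate:.2%}", round(cc_21, 2), round(cc_20, 2))
```

Under these assumptions, exactly 21 of 42 promoters gives a coefficient of about 0.63, while 20 of 42 gives about 0.61, so the reported 50% may be a rounded figure. This is an inference from the arithmetic, not something the abstract states.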

Preliminary tests using these neural networks for splice site prediction show very promising results.

In the future, we expect to improve performance by combining gene-finding methods with our local signal predictors, such as the promoter and splice site networks, to reduce false positive predictions.
