Neural Network Promoter Prediction
About the neural network method
NNPP is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. Transcription initiation at the promoter is one of the most complex processes in molecular biology. Multiple functional sites in the primary DNA sequence have been shown to be involved in the polymerase binding process. These elements, such as the TATA-box and the transcription start site ("Initiator") in eukaryotes, function as binding sites for RNA Polymerase II, transcription factors, and other proteins involved in transcription initiation. The promoter elements occur in various combinations, separated by varying distances in the sequence.
The NNPP program is based on a time-delay neural network (see Further References for details). The network consists mainly of two feature layers, one for recognizing the TATA-box and one for recognizing the "Initiator", the region spanning the transcription start site. Both feature layers feed into a single output unit, which produces scores between 0 and 1. The neural network method is described in detail in:
(1) Reese, M.G. (1994). Diploma Thesis, German Cancer Research Center, Heidelberg.
(2) Reese, M.G. and Eeckman, F.H. (1995). "Novel Neural Network Algorithms for Improved Eukaryotic Promoter Site Recognition". The Seventh International Genome Sequencing and Analysis Conference, Hilton Head Island, South Carolina.
(3) Reese, M.G., Harris, N.L. and Eeckman, F.H. (1996). "Large Scale Sequencing Specific Neural Networks for Promoter and Splice Site Recognition". Biocomputing: Proceedings of the 1996 Pacific Symposium, edited by Lawrence Hunter and Terri E. Klein, World Scientific Publishing Co, Singapore, January 2-7, 1996.
Please cite these when quoting NNPP output.
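The architecture described above (two feature layers whose outputs are combined into one output unit) can be illustrated with a toy sketch. This is not the NNPP network: the weights are hand-set and hypothetical, the max-pooling over positions is an assumption made to keep the example small, and all function names are invented for illustration. It only shows how shared weights slid along a sequence can yield a single score between 0 and 1.

```python
import math

BASES = "ACGT"  # channel order used by the toy weight matrices below


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feature_layer(encoded, weights, bias):
    """Apply one shared weight window at every position (the "time delay").

    Returns the activation of the feature unit at each window start.
    """
    width = len(weights)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += weights[offset][channel] * encoded[start + offset][channel]
        activations.append(sigmoid(total))
    return activations


def promoter_score(seq, tata_w, tata_b, inr_w, inr_b, out_w, out_b):
    """Combine the strongest TATA-like and Initiator-like responses.

    Assumes the sequence is at least as long as the widest weight window.
    Max-pooling is an assumption of this sketch, not a documented NNPP detail.
    """
    encoded = one_hot(seq)
    tata = max(feature_layer(encoded, tata_w, tata_b))
    inr = max(feature_layer(encoded, inr_w, inr_b))
    return sigmoid(out_w[0] * tata + out_w[1] * inr + out_b)


# Hypothetical hand-set weights: reward the motifs "TATA" and "CA".
TATA_W = [[-1, -1, -1, 2], [2, -1, -1, -1], [-1, -1, -1, 2], [2, -1, -1, -1]]
INR_W = [[-1, 2, -1, -1], [2, -1, -1, -1]]
PARAMS = dict(tata_w=TATA_W, tata_b=-4.0, inr_w=INR_W, inr_b=-2.0,
              out_w=[3.0, 3.0], out_b=-3.0)

score_with_motifs = promoter_score("GGCTATAAAGGCCAGTCC", **PARAMS)
score_without = promoter_score("GGGGGGGGGGGGGGGGGG", **PARAMS)
```

In a trained network the weights would be learned from promoter and non-promoter examples rather than set by hand; the sketch only mirrors the layer structure.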
Estimated accuracy of prediction
Eukaryotes
A careful 4-fold cross-validation test on 429 eukaryotic RNA Polymerase II promoters from the Eukaryotic Promoter Database (EPD, version 50)
- Bucher, P. & Trifonov, E.N. (1986). Compilation and analysis of eukaryotic POL II promoter sequences. Nucl. Acids Res. 14, 10009-10026.
- Bucher, P. (1989). Weight Matrix Description of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. J. Mol. Biol. 212, 563-578.
and on 305 unrelated genes with less than 50% pairwise sequence identity (gene data set) gave the following results (averaged over both test sets):
+-----------+------------+-----------+-------------+
| threshold | promoters  | false     | correlation |
|           | recognized | positives | coefficient |
|           | (%)        | (%)       | (CC)        |
+-----------+------------+-----------+-------------+
| 0.99      | 10%        | 0.0%      | 0.38        |
| 0.97      | 20%        | 0.0-0.1%  | 0.38        |
| 0.92      | 30%        | 0.1-0.3%  | 0.50        |
| 0.85      | 40%        | 0.1-0.4%  | 0.60        |
| 0.70      | 50%        | 0.8-1.0%  | 0.65        |
| 0.38      | 60%        | 1.0-3.1%  | 0.61        |
| 0.20      | 70%        | 2.2-5.3%  | 0.58        |
| 0.12      | 80%        | 5.1-12.5% | 0.52        |
+-----------+------------+-----------+-------------+
These quantities are defined by:
promoters recognized = TP / (TP + FN)   (correctly predicted promoters / all observed promoters)
false positives = FP / (TN + FP)   (non-promoters predicted as promoters / all observed non-promoters)
correlation coefficient (CC) = ((TP x TN) - (FN x FP)) / sqrt((TP+FN) x (TN+FP) x (TP+FP) x (TN+FN))
where
TP = true positives = observed promoters predicted as promoters
TN = true negatives = observed non-promoters predicted as non-promoters
FP = false positives = observed non-promoters predicted as promoters
FN = false negatives = observed promoters predicted as non-promoters
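The three measures can be computed directly from the four counts, taking promoters recognized as TP / (TP + FN) and false positives as FP / (TN + FP). The sketch below is only an illustration: the function name is hypothetical and the counts in the example call are made up, not taken from the NNPP evaluation.

```python
import math


def promoter_metrics(tp, tn, fp, fn):
    """Return (promoters recognized, false positive rate, CC) as fractions.

    Assumes at least one observed promoter and one observed non-promoter,
    so that the first two denominators are non-zero.
    """
    recognized = tp / (tp + fn)
    false_positive_rate = fp / (tn + fp)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # CC is undefined when any marginal total is zero; report 0.0 in that case.
    cc = ((tp * tn) - (fn * fp)) / denominator if denominator else 0.0
    return recognized, false_positive_rate, cc


# Made-up counts, for illustration only.
recognized, fp_rate, cc = promoter_metrics(tp=50, tn=990, fp=10, fn=50)
print(f"recognized={recognized:.0%} false positives={fp_rate:.1%} CC={cc:.2f}")
```

Multiplying the first two values by 100 gives the percentages reported in the tables.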
Prokaryotes
A careful cross-validated test on 272 prokaryotic E. coli promoters collected and described in
- Harley, C.B. & Reynolds, R.P. (1987). Analysis of E. coli promoter sequences. Nucl. Acids Res. 15, 2343-2361.
gave the following results:
+-----------+------------+-----------+-------------+
| threshold | promoters  | false     | correlation |
|           | recognized | positives | coefficient |
|           | (%)        | (%)       | (CC)        |
+-----------+------------+-----------+-------------+
| 0.9       | 50%        | 0.3%      | 0.71        |
| 0.8       | 60%        | 0.4%      | 0.72        |
| 0.65      | 70%        | 0.9%      | 0.73        |
| 0.55      | 75%        | 1.3%      | 0.72        |
| 0.35      | 80%        | 1.7%      | 0.72        |
| 0.15      | 90%        | 2.7%      | 0.70        |
| 0.03      | 95%        | 4.7%      | 0.63        |
+-----------+------------+-----------+-------------+
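The table trades promoters recognized against false positives, so one way to choose a score threshold is to fix the false-positive rate that can be tolerated and take the row that recognizes the most promoters within that limit. The sketch below uses the E. coli figures from the table above; the function name and the selection rule are assumptions made for illustration, not part of NNPP.

```python
# (threshold, % promoters recognized, % false positives, CC), from the E. coli table.
ECOLI_TABLE = [
    (0.90, 50, 0.3, 0.71),
    (0.80, 60, 0.4, 0.72),
    (0.65, 70, 0.9, 0.73),
    (0.55, 75, 1.3, 0.72),
    (0.35, 80, 1.7, 0.72),
    (0.15, 90, 2.7, 0.70),
    (0.03, 95, 4.7, 0.63),
]


def pick_threshold(table, max_false_positive_pct):
    """Return the threshold that recognizes the most promoters while keeping
    the false-positive percentage at or below the given limit, or None if
    no row in the table satisfies the limit."""
    allowed = [row for row in table if row[2] <= max_false_positive_pct]
    if not allowed:
        return None
    return max(allowed, key=lambda row: row[1])[0]


chosen = pick_threshold(ECOLI_TABLE, max_false_positive_pct=1.0)
```

The tabulated thresholds are the only operating points available here; values between rows would need the underlying evaluation data.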
The performance per base position was tested on the pBR322 vector:
+-----------+------------+-----------+-------------+
| threshold | promoters  | false     | correlation |
|           | recognized | positives | coefficient |
|           | (%)        | (%)       | (CC)        |
+-----------+------------+-----------+-------------+
| 0.96      | 30%        | 0.03%     | 0.38        |
| 0.92      | 50%        | 0.11%     | 0.48        |
| 0.89      | 80%        | 0.16%     | 0.51        |
+-----------+------------+-----------+-------------+
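A per-position false-positive rate can look small while still producing several spurious hits on a long sequence. As a rough reading aid, the expected number of false positives is the rate multiplied by the number of non-promoter positions scanned. The function name and the 10,000-position sequence below are hypothetical; only the 0.16% rate comes from the table above.

```python
def expected_false_positives(false_positive_pct, non_promoter_positions):
    """Expected count of non-promoter positions scored above the threshold,
    assuming the per-position rate applies uniformly along the sequence."""
    return false_positive_pct / 100.0 * non_promoter_positions


# Hypothetical sequence with 10,000 non-promoter positions at the 0.16% rate.
expected_hits = expected_false_positives(0.16, 10_000)
print(f"about {expected_hits:.0f} false positives expected")
```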
Further References and Abstract
Another promoter finder on the Web
An additional program, SIGNALSCAN, developed by Dr. Dan Prestridge, can be used to search for transcription factor binding sites in promoter regions. The program can be accessed at two different WWW sites: SIGNALSCAN at NIH or SIGNALSCAN in Singapore.
