BDGP - Berkeley Drosophila Genome Project

Neural Network Promoter Prediction

About the neural network method

NNPP is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. Transcription initiation at the promoter is one of the most complex processes in molecular biology. Multiple functional sites in the primary DNA sequence have been shown to be involved in polymerase binding. These elements, such as the TATA-box and the transcription start site ("Initiator") in eukaryotes, function as binding sites for RNA Polymerase II, transcription factors, and other proteins involved in transcription initiation. The elements occur in various combinations and are separated by various distances in the sequence.

The basis of the NNPP program is a time-delay neural network (see Further References for details). The network consists mainly of two feature layers, one for recognizing the TATA-box and one for recognizing the "Initiator", which is the region spanning the transcription start site. Both feature layers feed into a single output unit, which produces a score between 0 and 1. The neural network method is described in detail in:

(1) Reese, M.G.
Diploma Thesis, 1994
German Cancer Research Center, Heidelberg.

(2) Reese, M.G. and Eeckman, F.H. (1995)
"Novel Neural Network Algorithms for Improved Eukaryotic Promoter Site Recognition".
The Seventh International Genome Sequencing and Analysis Conference,
Hilton Head Island, South Carolina.

(3) Reese, M.G., Harris, N.L. and Eeckman, F.H. (1996)
"Large Scale Sequencing Specific Neural Networks for Promoter and Splice Site Recognition"
Biocomputing: Proceedings of the 1996 Pacific Symposium
edited by Lawrence Hunter and Terri E. Klein, World Scientific Publishing Co, Singapore, 1996, January 2-7, 1996.

Please cite these when quoting NNPP output.
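
The following is a minimal Python sketch of the architecture described above, not the NNPP implementation. It assumes one-hot encoded input, one shared-weight window per feature layer, a maximum taken over all window positions, and a sigmoid output unit. All function names, the toy weights (which simply reward the strings TATAAA and TCAGTC, chosen for illustration), and the biases are assumptions; a trained network learns its weights from data, and this sketch ignores the spacing between the two elements.

```python
import math

BASES = "ACGT"


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]


def motif_weights(motif, match=2.0, mismatch=-2.0):
    """Hypothetical, untrained weights that reward one fixed motif string."""
    return [[match if base == m else mismatch for base in BASES] for m in motif]


def feature_layer(encoded, weights, bias):
    """Apply the same weight window at every position (the time-delay idea)
    and return the strongest sigmoid activation found."""
    width = len(weights)
    best = 0.0
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += weights[offset][channel] * encoded[start + offset][channel]
        best = max(best, sigmoid(total))
    return best


# Illustrative assumptions only: these are not NNPP's learned parameters.
TATA_WEIGHTS = motif_weights("TATAAA")
INR_WEIGHTS = motif_weights("TCAGTC")


def promoter_score(seq):
    """Combine both feature layers into one output unit with a score in (0, 1)."""
    encoded = one_hot(seq)
    tata = feature_layer(encoded, TATA_WEIGHTS, bias=-9.0)
    inr = feature_layer(encoded, INR_WEIGHTS, bias=-9.0)
    return sigmoid(4.0 * tata + 4.0 * inr - 4.0)


if __name__ == "__main__":
    print(promoter_score("GGCTATAAAGGCGCGCCGGCGCGGCGCGGTCAGTCGG"))
    print(promoter_score("GCGCGCGCGCGCGCGCGCGC"))
```

With these toy weights, a sequence containing both motifs scores close to 1 and a sequence containing neither scores close to 0.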

Estimated accuracy of prediction

Eukaryotes

A careful 4-fold cross-validation test on 429 eukaryotic RNA Polymerase II promoters from the Eukaryotic Promoter Database (EPD, version 50)

  • Bucher,P. & Trifonov,E.N. (1986). Compilation and analysis of
    eukaryotic POL II promoter sequences. Nucl. Acids Res. 14, 10009-10026.

  • Bucher, P. (1989). Weight Matrix Description of Four Eukaryotic RNA
    Polymerase II Promoter Elements Derived from 502 Unrelated Promoter
    Sequences. J. Mol. Biol. 212, 563-578.

    and on 305 unrelated genes with less than 50% pairwise sequence identity (gene data set) gave the following results (results averaged over both test sets):

    
      +------------+-----------+------------+------------+
      | threshold  | promoters |   false    | correlation|
      |            | recognized| positives  | coefficient|
      |            |    (%)    |    (%)     |    (CC)    |
      +------------+-----------+------------+------------+
      |    0.99    |    10%    |    0.0%    |    0.38    |
      |    0.97    |    20%    |  0.0-0.1%  |    0.38    |
      |    0.92    |    30%    |  0.1-0.3%  |    0.50    |
      |    0.85    |    40%    |  0.1-0.4%  |    0.60    |
      |    0.70    |    50%    |  0.8-1.0%  |    0.65    |
      |    0.38    |    60%    |  1.0-3.1%  |    0.61    |
      |    0.20    |    70%    |  2.2-5.3%  |    0.58    |
      |    0.12    |    80%    | 5.1-12.5%  |    0.52    |
      +------------+-----------+------------+------------+
     
    
    These quantities are defined by:

                                   correctly predicted promoters (TP)
    promoters recognized =       --------------------------------------
                                    all observed promoters (TP + FN)


                                 non-promoters predicted as promoters (FP)
    false positives =            -----------------------------------------
                                   all observed non-promoters (TN + FP)


                                             (TP x TN) - (FN x FP)
    correlation coefficient (CC) =  -------------------------------------------
                                    sqrt((TP+FN) x (TN+FP) x (TP+FP) x (TN+FN))


    TP = true positives = observed promoters predicted as promoters
    TN = true negatives = observed non-promoters predicted as non-promoters
    FP = false positives = observed non-promoters predicted as promoters
    FN = false negatives = observed promoters predicted as non-promoters
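
As a worked illustration of these definitions, the sketch below computes the three quantities from the four counts TP, TN, FP and FN. The function names and the example counts are hypothetical and are not taken from the NNPP evaluation. Each rate is taken relative to the observed class size (TP + FN for promoters, TN + FP for non-promoters).

```python
import math


def promoters_recognized(tp, fn):
    """Fraction of observed promoters that were predicted as promoters."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of observed non-promoters that were predicted as promoters."""
    return fp / (fp + tn)


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient (CC) as defined above.

    Returns 0.0 when the denominator is zero, which is a convention
    assumed here and not stated in the text.
    """
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fn * fp) / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    tp, fn, fp, tn = 50, 50, 10, 990
    print(promoters_recognized(tp, fn))          # 0.5
    print(false_positive_rate(fp, tn))           # 0.01
    print(correlation_coefficient(tp, tn, fp, fn))
```

A perfect predictor (FP = FN = 0) gives CC = 1, and a predictor that gets every case wrong (TP = TN = 0) gives CC = -1.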

Prokaryotes

    A careful cross-validated test on 272 prokaryotic E. coli promoters collected and described in

  • Harley, C. B. & Reynolds, R. P. (1987). Analysis of E.coli promoter
    sequences. Nucl. Acids Res. 15, 2343-2361.

    gave the following results:

    
      +------------+-----------+------------+------------+
      | threshold  | promoters |   false    | correlation|
      |            | recognized| positives  | coefficient|
      |            |    (%)    |    (%)     |    (CC)    |
      +------------+-----------+------------+------------+
      |    0.9     |    50%    |    0.3%    |    0.71    |
      |    0.8     |    60%    |    0.4%    |    0.72    |
      |    0.65    |    70%    |    0.9%    |    0.73    |
      |    0.55    |    75%    |    1.3%    |    0.72    |
      |    0.35    |    80%    |    1.7%    |    0.72    |
      |    0.15    |    90%    |    2.7%    |    0.70    |
      |    0.03    |    95%    |    4.7%    |    0.63    |
      +------------+-----------+------------+------------+
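
One way to use the E. coli table is to choose a score threshold for a given tolerance of false positives. The sketch below copies the table rows above into a list and picks the row that recognizes the most promoters while staying within a false-positive budget. The names PROKARYOTE_TABLE and pick_threshold are hypothetical helpers, not part of NNPP, and the selection rule is only one reasonable choice.

```python
# Rows copied from the E. coli table above:
# (threshold, fraction of promoters recognized, false-positive rate, CC)
PROKARYOTE_TABLE = [
    (0.9, 0.50, 0.003, 0.71),
    (0.8, 0.60, 0.004, 0.72),
    (0.65, 0.70, 0.009, 0.73),
    (0.55, 0.75, 0.013, 0.72),
    (0.35, 0.80, 0.017, 0.72),
    (0.15, 0.90, 0.027, 0.70),
    (0.03, 0.95, 0.047, 0.63),
]


def pick_threshold(table, max_false_positive_rate):
    """Return the row with the highest recognition rate whose false-positive
    rate does not exceed the budget, or None if no row qualifies."""
    candidates = [row for row in table if row[2] <= max_false_positive_rate]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[1])


if __name__ == "__main__":
    # Accept at most 1% false positives.
    print(pick_threshold(PROKARYOTE_TABLE, 0.01))
    # Row with the highest correlation coefficient.
    print(max(PROKARYOTE_TABLE, key=lambda row: row[3]))
```

For a budget of 1% false positives this selects the 0.65 row, which is also the row with the highest correlation coefficient in the table.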
    

    The performance per base position was tested on the pBR322 vector:

    
      +------------+-----------+------------+------------+
      | threshold  | promoters |   false    | correlation|
      |            | recognized| positives  | coefficient|
      |            |    (%)    |    (%)     |    (CC)    |
      +------------+-----------+------------+------------+
      |    0.96    |    30%    |    0.03%   |    0.38    |
      |    0.92    |    50%    |    0.11%   |    0.48    |
      |    0.89    |    80%    |    0.16%   |    0.51    |
      +------------+-----------+------------+------------+
    
    


    Further References and Abstract




    Another promoter finder on the Web
    SIGNALSCAN, a program developed by Dr. Dan Prestridge, can be used to search for transcription factor binding sites in promoter regions. It can be accessed at two WWW sites: SIGNALSCAN at NIH, or SIGNALSCAN in Singapore.


  • Please send comments or questions about the web site to bdgp@fruitfly.org