BDGP - Berkeley Drosophila Genome Project

Human and Drosophila melanogaster Splice Site Prediction using Neural Networks

About the neural network method

Splice sites are the key signal sequences that determine the boundaries of exons. A method for splice site detection should ideally be based on a thorough understanding of the complex eukaryotic splicing process. We trained a backpropagation feedforward neural network with one layer of hidden units to recognize 5' and 3' splice sites, using a representative data set (Drosophila melanogaster data set). We consider only genes with constrained consensus splice sites, i.e., 'GT' for the 5' and 'AG' for the 3' splice site. The output of the network is a score between 0 and 1 for each potential splice site.
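
As a rough illustration of the idea described above, the following Python sketch finds candidate 'GT' and 'AG' positions and scores a sequence window with a feedforward network that has one hidden layer and a sigmoid output. It is not the trained network from this page: the window length, the hidden-layer size, the one-hot input encoding, and the random weights are all assumptions made for the example.

```python
import math
import random

BASES = "ACGT"

def candidate_sites(seq, dinucleotide):
    """Return the 0-based start index of every occurrence of the dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]

def one_hot(window):
    """Encode each base as four inputs; unknown bases become all zeros."""
    encoded = []
    for base in window.upper():
        encoded.extend(1.0 if base == b else 0.0 for b in BASES)
    return encoded

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(window, w_hidden, b_hidden, w_out, b_out):
    """Forward pass through one hidden layer; the result lies between 0 and 1."""
    x = one_hot(window)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical sizes and untrained random weights, for illustration only.
WINDOW = 15      # assumed window length, not taken from the source
N_HIDDEN = 4     # assumed hidden-layer size, not taken from the source
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(4 * WINDOW)]
            for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
b_out = 0.0

seq = "CCAAGGTAAGTCTTTTTCCAGGTACC"         # made-up example sequence
donors = candidate_sites(seq, "GT")       # potential 5' splice sites
acceptors = candidate_sites(seq, "AG")    # potential 3' splice sites
example_score = score(seq[:WINDOW], w_hidden, b_hidden, w_out, b_out)
```

With real trained weights, each candidate position would be scored on the window around it and reported only if the score exceeds a chosen threshold.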

The neural network method is described in detail in References and Abstract.

Estimated accuracy of prediction

Human

A randomly chosen independent test set of 43 human genes (/sequence/human-datasets.html), containing no sequences related to the training set, gave the following results:

Human 5' Splice Site prediction:


  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |   26.0%   |      0.1%      |    0.46    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   50.4%   |      0.7%      |    0.65    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   64.1%   |      1.1%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   72.7%   |      1.4%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   74.4%   |      1.9%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   77.8%   |      1.9%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   81.6%   |      2.7%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   85.0%   |      3.2%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   88.0%   |      3.5%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   89.3%   |      3.7%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   91.5%   |      4.2%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   93.2%   |      4.7%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   93.2%   |      5.2%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   93.6%   |      5.3%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.9%   |      5.8%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   95.3%   |      6.2%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   96.2%   |      6.7%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   96.6%   |      8.2%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   97.9%   |      9.1%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   98.3%   |     11.1%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+

These percentages are defined by:

                               correctly predicted sites
sites recognized =           -----------------------------
                                  all observed sites


                             non-sites predicted as sites
false positive sites =       -----------------------------
                                all observed non-sites


                                       (TP x TN) - (FN x FP)
correlation coefficient (CC) =  --------------------------------------------
                                sqrt((TP+FN) x (TN+FP) x (TP+FP) x (TN+FN))

TP = true positives = observed sites predicted as sites
TN = true negatives = observed non-sites predicted as non-sites
FP = false positives = observed non-sites predicted as sites
FN = false negatives = observed sites predicted as non-sites
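
The three quantities defined above can be computed from the four counts as in the following Python sketch. The function name and the example counts are hypothetical and are not taken from the test sets on this page. The function returns None for CC when the denominator is zero, which is one plausible reading of the "-" entries in the tables.

```python
import math

def prediction_metrics(tp, tn, fp, fn):
    """Return (sites recognized, false positive sites, CC) as fractions."""
    sites_recognized = tp / (tp + fn)        # share of observed sites found
    false_positive_sites = fp / (tn + fp)    # share of non-sites called sites
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # CC is undefined when any factor is zero, e.g. when nothing is predicted.
    cc = ((tp * tn) - (fn * fp)) / denominator if denominator else None
    return sites_recognized, false_positive_sites, cc

# Hypothetical counts, chosen only to exercise the formulas.
recognized, false_positive, cc = prediction_metrics(tp=90, tn=95, fp=5, fn=10)
```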

Human 3' Splice Site prediction:


  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    7.3%   |      0.0%      |    0.25    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   33.3%   |      0.4%      |    0.52    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   47.9%   |      0.5%      |    0.64    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   57.7%   |      0.6%      |    0.70    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   61.2%   |      0.9%      |    0.72    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   65.4%   |      1.1%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   69.7%   |      1.3%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   73.5%   |      1.5%      |    0.79    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   76.5%   |      1.8%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   79.1%   |      2.0%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   80.8%   |      2.4%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   82.5%   |      2.9%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   83.8%   |      3.1%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   86.8%   |      3.7%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   88.5%   |      4.0%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   88.5%   |      4.5%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   90.2%   |      4.8%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   91.0%   |      6.0%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   92.3%   |      7.9%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   94.9%   |     10.4%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
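
The tables can be used to choose a score threshold for a given tolerance of false positives. The Python sketch below copies five rows from the human 3' table above and picks the row that recognizes the most sites without exceeding a false positive limit. The function name and the `max_false_positive` parameter are hypothetical, and this selection rule is only one possible choice.

```python
# (threshold, % sites recognized, % false positive sites, CC), copied from
# five rows of the human 3' splice site table above.
HUMAN_3_PRIME = [
    (0.90, 47.9, 0.5, 0.64),
    (0.70, 69.7, 1.3, 0.77),
    (0.50, 80.8, 2.4, 0.81),
    (0.30, 88.5, 4.0, 0.82),
    (0.10, 92.3, 7.9, 0.77),
]

def pick_threshold(rows, max_false_positive):
    """Return the row that recognizes the most sites while keeping the
    false positive percentage at or below max_false_positive, or None."""
    allowed = [row for row in rows if row[2] <= max_false_positive]
    if not allowed:
        return None
    return max(allowed, key=lambda row: row[1])

best = pick_threshold(HUMAN_3_PRIME, max_false_positive=2.5)
```

Maximizing the CC column instead would be an alternative rule when sensitivity and false positives matter equally.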

Neural network based "consensi" sequences: Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined" 5' and 3' splice site consensus and non-consensus sequences:

5' Splice Site: 

              -7  6  5  4  3   2  -1     +1  2  3   4  5  6  7 +8
consensus:     a  a  a  A C|a  A   G   /  G  T  A   A  G  T  -  c      

non-consensus: g  g  g  G G|T G|T A|T     -  - C|t g|t -  -  t  -     


3' Splice Site: 

               -21 -20 19 18  17 16  15  14  13  12  11  10   9   8   7   6   5  4   3   2 -1
consensus:       -   T  T T|c  T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A  T|C  A  G  
non-consensus:                                                                       G        

               +1  2  3  4  5  6  7  8  9  10  11  12 13 14 15 16  17 18 19 +20
consensus:      G  T  c  -  -  -  g  g  -   g  g|a  c  g  a  a a|c  a  g  -   -
non-consensus: c|t       t    g|t

Capital letters indicate strong weights and lowercase letters weaker weights.
"|" means "or".
"-" means no significant weight.
"non-consensus" indicates bases that are very unlikely to appear at this position.
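
For readers who want to process the consensus rows programmatically, the notation can be decoded as in the following Python sketch. The function name and the "strong"/"weak" labels are hypothetical. The token lists are copied from the 5' consensus row above.

```python
def parse_position(token):
    """Map a consensus token such as 'C|a' to {base: 'strong' or 'weak'}.

    A '-' token (no significant weight) yields an empty dict.
    """
    if token == "-":
        return {}
    weights = {}
    for symbol in token.split("|"):
        # Capital letters mark strong weights, lowercase letters weaker ones.
        weights[symbol.upper()] = "strong" if symbol.isupper() else "weak"
    return weights

# 5' splice site consensus row: positions -7..-1, then +1..+8.
upstream = ["a", "a", "a", "A", "C|a", "A", "G"]
downstream = ["G", "T", "A", "A", "G", "T", "-", "c"]
parsed = [parse_position(token) for token in upstream + downstream]
```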

Drosophila melanogaster

A randomly chosen independent test set of 41 genes (Drosophila melanogaster gene set), containing no sequences related to the training set, gave the following results:

Drosophila melanogaster 5' Splice Site prediction:


  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    0.0%   |      0.0%      |     -      |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   22.9%   |      0.0%      |    0.44    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   53.3%   |      0.0%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   61.9%   |      0.0%      |    0.75    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   66.7%   |      0.0%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   69.5%   |      0.8%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   77.1%   |      0.8%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   78.1%   |      1.0%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   81.9%   |      1.0%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   82.9%   |      1.0%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   88.6%   |      1.8%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   90.5%   |      2.5%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   91.4%   |      3.0%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   91.4%   |      4.0%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.3%   |      4.8%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   96.2%   |      5.3%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   97.1%   |      5.8%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   97.1%   |      8.0%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   99.1%   |     10.3%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   99.1%   |     15.1%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+

Drosophila melanogaster 3' Splice Site prediction:


  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    1.9%   |      0.0%      |    0.12    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   11.4%   |      0.0%      |    0.30    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   28.6%   |      0.6%      |    0.46    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   44.8%   |      0.6%      |    0.60    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   53.3%   |      1.1%      |    0.65    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   60.1%   |      2.0%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   69.5%   |      2.3%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   73.3%   |      2.5%      |    0.76    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   76.2%   |      3.1%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   79.0%   |      4.2%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   83.8%   |      5.4%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   87.6%   |      5.9%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   90.5%   |      6.5%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   92.4%   |      7.0%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.3%   |      9.0%      |    0.79    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   94.3%   |     10.7%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   96.2%   |     13.0%      |    0.75    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   96.2%   |     14.7%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   96.2%   |     17.5%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   97.1%   |     30.7%      |    0.56    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+

Please send comments or questions about the web site to bdgp@fruitfly.org