Human and Drosophila melanogaster Splice Site Prediction using Neural Networks

About the neural network method

Splice sites are the key signal sequences that determine the boundaries of exons. A method for splice site detection should ideally be based on a thorough understanding of the complex eukaryotic splicing process. We trained a backpropagation feedforward neural network with one layer of hidden units to recognize 5' and 3' splice sites, using a representative data set (Drosophila melanogaster data set). We only consider genes that have the constrained consensus splice sites, i.e., 'GT' for the 5' and 'AG' for the 3' splice site. The output of the network is a score between 0 and 1 for a potential splice site.
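The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the trained network itself: the 15-nucleotide window (7 bases upstream and 8 downstream of the exon/intron boundary) is borrowed from the span of the 5' consensus listed further down this page, the one-hot input encoding and the number of hidden units are assumptions, and the weights are random placeholders standing in for weights that would be learned by backpropagation. All function names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def one_hot(window):
    """Encode a DNA window as a flat list with four inputs per base."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_network(n_inputs, n_hidden, seed=0):
    """Build placeholder weights (assumption: a real network is trained)."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def score(network, window):
    """Feed one window through a single hidden layer; returns a value in (0, 1)."""
    hidden, output = network
    x = one_hot(window) + [1.0]  # trailing 1.0 is the bias input
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in hidden]
    h.append(1.0)                # bias for the output unit
    return sigmoid(sum(w * hi for w, hi in zip(output, h)))

def candidate_donor_sites(seq, upstream=7, downstream=8):
    """Yield (index of the G, window) for every 'GT' with a full window around it."""
    seq = seq.upper()
    for i in range(upstream, len(seq) - downstream + 1):
        if seq[i:i + 2] == "GT":
            yield i, seq[i - upstream:i + downstream]

network = make_network(n_inputs=15 * 4, n_hidden=5)
for position, window in candidate_donor_sites("AAAAAAAGGTAAGTCCCC"):
    print(position, window, round(score(network, window), 3))
```

Only positions that carry the required dinucleotide are scored, which mirrors the restriction to 'GT' and 'AG' sites; a 3' version would scan for 'AG' with a longer window.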

The neural network method is described in detail in References and Abstract.

Estimated accuracy of prediction

Human

A randomly chosen, independent test set of 43 human genes (/sequence/human-datasets.html), containing no sequences related to the training set, gave the following results:

Human 5' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |   26.0%   |      0.1%      |    0.46    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   50.4%   |      0.7%      |    0.65    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   64.1%   |      1.1%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   72.7%   |      1.4%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   74.4%   |      1.9%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   77.8%   |      1.9%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   81.6%   |      2.7%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   85.0%   |      3.2%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   88.0%   |      3.5%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   89.3%   |      3.7%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   91.5%   |      4.2%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   93.2%   |      4.7%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   93.2%   |      5.2%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   93.6%   |      5.3%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.9%   |      5.8%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   95.3%   |      6.2%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   96.2%   |      6.7%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   96.6%   |      8.2%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   97.9%   |      9.1%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   98.3%   |     11.1%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+

These percentages are defined by:

                           observed sites predicted as sites (TP)
sites recognized      =  ------------------------------------------
                               all observed sites (TP + FN)


                           observed non-sites predicted as sites (FP)
false positive sites  =  ----------------------------------------------
                               all observed non-sites (TN + FP)


                                        (TP x TN) - (FN x FP)
correlation coefficient (CC) =  ------------------------------------------------
                                 sqrt( (TP+FN) x (TN+FP) x (TP+FP) x (TN+FN) )

TP = true positives = observed sites predicted as sites
TN = true negatives = observed non-sites predicted as non-sites
FP = false positives = observed non-sites predicted as sites
FN = false negatives = observed sites predicted as non-sites
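As a worked illustration of the three definitions above, the following Python sketch computes them from the four counts. The function name and the example counts are hypothetical and are not taken from the test sets; the correlation coefficient is returned as None when the denominator is zero, for example when no site is predicted at all.

```python
import math

def prediction_metrics(tp, tn, fp, fn):
    """Return (sites recognized, false positive sites, CC) from the four counts.

    The first two values are fractions; multiply by 100 for percentages.
    """
    sites_recognized = tp / (tp + fn)
    false_positive_sites = fp / (tn + fp)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # CC is undefined when any of the four sums is zero.
    cc = ((tp * tn) - (fn * fp)) / denominator if denominator else None
    return sites_recognized, false_positive_sites, cc

# Hypothetical counts, for illustration only.
recognized, false_pos, cc = prediction_metrics(tp=90, tn=960, fp=40, fn=10)
print(f"{recognized:.1%} recognized, {false_pos:.1%} false positives, CC = {cc:.2f}")
```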

Human 3' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    7.3%   |      0.0%      |    0.25    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   33.3%   |      0.4%      |    0.52    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   47.9%   |      0.5%      |    0.64    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   57.7%   |      0.6%      |    0.70    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   61.2%   |      0.9%      |    0.72    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   65.4%   |      1.1%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   69.7%   |      1.3%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   73.5%   |      1.5%      |    0.79    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   76.5%   |      1.8%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   79.1%   |      2.0%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   80.8%   |      2.4%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   82.5%   |      2.9%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   83.8%   |      3.1%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   86.8%   |      3.7%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   88.5%   |      4.0%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   88.5%   |      4.5%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   90.2%   |      4.8%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   91.0%   |      6.0%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   92.3%   |      7.9%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   94.9%   |     10.4%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
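One way to read these tables is to fix the largest false positive percentage that is acceptable and then take the row that recognizes the most sites within that limit. The Python sketch below does this for the human 3' table above; the selection rule and the function name are assumptions made for illustration, not a recommendation from the method's authors, and the table values are rounded to one decimal place.

```python
# (threshold, % sites recognized, % false positive sites, CC)
# Rows copied from the human 3' splice site table above.
HUMAN_ACCEPTOR_ROWS = [
    (0.99, 7.3, 0.0, 0.25), (0.95, 33.3, 0.4, 0.52),
    (0.90, 47.9, 0.5, 0.64), (0.85, 57.7, 0.6, 0.70),
    (0.80, 61.2, 0.9, 0.72), (0.75, 65.4, 1.1, 0.74),
    (0.70, 69.7, 1.3, 0.77), (0.65, 73.5, 1.5, 0.79),
    (0.60, 76.5, 1.8, 0.80), (0.55, 79.1, 2.0, 0.81),
    (0.50, 80.8, 2.4, 0.81), (0.45, 82.5, 2.9, 0.81),
    (0.40, 83.8, 3.1, 0.81), (0.35, 86.8, 3.7, 0.82),
    (0.30, 88.5, 4.0, 0.82), (0.25, 88.5, 4.5, 0.81),
    (0.20, 90.2, 4.8, 0.82), (0.15, 91.0, 6.0, 0.80),
    (0.10, 92.3, 7.9, 0.77), (0.05, 94.9, 10.4, 0.74),
]

def pick_threshold(rows, max_false_positive_pct):
    """Return the row with the most sites recognized within the false positive limit.

    Rows are ordered from high to low threshold, so on a tie the row with the
    higher threshold (and lower false positive percentage) is returned.
    Returns None if no row satisfies the limit.
    """
    allowed = [row for row in rows if row[2] <= max_false_positive_pct]
    if not allowed:
        return None
    return max(allowed, key=lambda row: row[1])

print(pick_threshold(HUMAN_ACCEPTOR_ROWS, max_false_positive_pct=2.0))
```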

Neural Network based "consensi" sequences: Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined" 5' and 3' splice site consensus and non-consensus sequences:

5' Splice Site: 

              -7  6  5  4  3   2  -1     +1  2  3   4  5  6  7 +8
consensus:     a  a  a  A C|a  A   G   /  G  T  A   A  G  T  -  c      

non-consensus: g  g  g  G G|T G|T A|T     -  - C|t g|t -  -  t  -     


3' Splice Site: 

               -21 -20 19 18  17 16  15  14  13  12  11  10   9   8   7   6   5  4   3   2 -1
consensus:       -   T  T T|c  T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A  T|C  A  G  
non-consensus:                                                                       G        

               +1  2  3  4  5  6  7  8  9  10  11  12 13 14 15 16  17 18 19 +20
consensus:      G  T  c  -  -  -  g  g  -   g  g|a  c  g  a  a a|c  a  g  -   -
non-consensus: c|t       t    g|t

Capital letters indicate strong weights and lower case letters weaker weights.
"|" means "or".
"-" means no significant weight.
"non-consensus" indicates bases that are very unlikely to appear at this position.
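To make the notation concrete, the following Python sketch parses the 5' consensus line above (positions -7 to +8) and counts how many bases of a 15-nucleotide window agree with a strongly weighted or a weakly weighted consensus base. It only illustrates how to read the notation; it is not how the network scores a site, and the function names are hypothetical.

```python
# 5' consensus for positions -7..-1 and +1..+8, copied from the listing above.
DONOR_CONSENSUS = "a a a A C|a A G G T A A G T - c".split()

def parse_position(token):
    """Return (strong_bases, weak_bases) for one consensus token."""
    strong, weak = set(), set()
    if token == "-":             # no significant weight at this position
        return strong, weak
    for letter in token.split("|"):   # "|" separates alternatives
        if letter.isupper():
            strong.add(letter)
        else:
            weak.add(letter.upper())
    return strong, weak

def consensus_matches(window, consensus=DONOR_CONSENSUS):
    """Count (strong, weak) agreements between a window and the consensus."""
    if len(window) != len(consensus):
        raise ValueError("window length must equal the number of positions")
    strong_hits = weak_hits = 0
    for base, token in zip(window.upper(), consensus):
        strong, weak = parse_position(token)
        if base in strong:
            strong_hits += 1
        elif base in weak:
            weak_hits += 1
    return strong_hits, weak_hits

print(consensus_matches("AAAACAGGTAAGTAC"))
```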

Drosophila melanogaster

A randomly chosen, independent test set of 41 genes (Drosophila melanogaster gene set), containing no sequences related to the training set, gave the following results:

Drosophila melanogaster 5' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    0.0%   |      0.0%      |     -      |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   22.9%   |      0.0%      |    0.44    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   53.3%   |      0.0%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   61.9%   |      0.0%      |    0.75    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   66.7%   |      0.0%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   69.5%   |      0.8%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   77.1%   |      0.8%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   78.1%   |      1.0%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   81.9%   |      1.0%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   82.9%   |      1.0%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   88.6%   |      1.8%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   90.5%   |      2.5%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   91.4%   |      3.0%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   91.4%   |      4.0%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.3%   |      4.8%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   96.2%   |      5.3%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   97.1%   |      5.8%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   97.1%   |      8.0%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   99.1%   |     10.3%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   99.1%   |     15.1%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+



Drosophila melanogaster 3' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    1.9%   |      0.0%      |    0.12    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   11.4%   |      0.0%      |    0.30    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   28.6%   |      0.6%      |    0.46    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   44.8%   |      0.6%      |    0.60    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   53.3%   |      1.1%      |    0.65    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   60.1%   |      2.0%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   69.5%   |      2.3%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   73.3%   |      2.5%      |    0.76    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   76.2%   |      3.1%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   79.0%   |      4.2%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   83.8%   |      5.4%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   87.6%   |      5.9%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   90.5%   |      6.5%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   92.4%   |      7.0%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.3%   |      9.0%      |    0.79    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   94.3%   |     10.7%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   96.2%   |     13.0%      |    0.75    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   96.2%   |     14.7%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   96.2%   |     17.5%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   97.1%   |     30.7%      |    0.56    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+