Human and Drosophila melanogaster Splice Site Prediction using Neural Networks
About the neural network method
Splice sites are the key signal sequences that determine the boundaries
of exons. A method for splice site detection should ideally be based on a thorough
understanding of the complex eukaryotic splicing process.
We trained a feedforward neural network with one layer of hidden
units, using backpropagation, to recognize 5' and 3' splice
sites, using a
representative data set (Drosophila melanogaster data set).
We consider only genes whose splice sites carry the
consensus dinucleotides, i.e., `GT' at the 5' and `AG'
at the 3' splice site.
The output of the network is a score between 0 and 1 for a potential splice site.
The neural network method is described in detail in
References and Abstract.
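The following is a minimal sketch of the idea described above, not the trained network itself. It lists every `GT' and `AG' dinucleotide as a candidate site, encodes a window around each candidate, and passes it through a network with one hidden layer to obtain a score between 0 and 1. The window sizes (LEFT, RIGHT), the number of hidden units (HIDDEN), the one-hot encoding, the example sequence and the random weights are all assumptions made for illustration; none of them is taken from the method as published.

```python
import math
import random

BASES = "ACGT"

def candidate_sites(seq):
    """Return (donors, acceptors): 0-based indices of every GT and AG dinucleotide."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

def one_hot(window):
    """Encode a DNA window as 4 inputs per base; unknown bases become all zeros."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> one hidden layer -> a single output in (0, 1)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical sizes and untrained random weights, for illustration only.
LEFT, RIGHT, HIDDEN = 7, 8, 5
rng = random.Random(0)
n_in = 4 * (LEFT + RIGHT)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(HIDDEN)]
b_hidden = [0.0] * HIDDEN
w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
b_out = 0.0

seq = "CCAAGCAGGTAAGTCTGCATTTTCTCCCTTCAGGTCCA"  # made-up example sequence
donors, acceptors = candidate_sites(seq)
for i in donors:
    # Skip candidates too close to either end to fill the whole window.
    if i - LEFT >= 0 and i + RIGHT <= len(seq):
        window = seq[i - LEFT:i + RIGHT]
        s = score(one_hot(window), w_hidden, b_hidden, w_out, b_out)
        print(i, window, round(s, 3))
```

With untrained weights the printed scores carry no meaning; the sketch only shows how a candidate window maps to a single value between 0 and 1 that can then be compared against a threshold.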
Estimated accuracy of prediction
Human
An independent test set of 43 human genes
(/sequence/human-datasets.html),
randomly chosen and containing no sequences related to the training set,
gave the following results:
Human 5' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 26.0% | 0.1% | 0.46 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 50.4% | 0.7% | 0.65 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 64.1% | 1.1% | 0.73 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 72.7% | 1.4% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 74.4% | 1.9% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 77.8% | 1.9% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 81.6% | 2.7% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 85.0% | 3.2% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 88.0% | 3.5% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 89.3% | 3.7% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 91.5% | 4.2% | 0.85 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 93.2% | 4.7% | 0.85 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 93.2% | 5.2% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 93.6% | 5.3% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 94.9% | 5.8% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 95.3% | 6.2% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 96.2% | 6.7% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 96.6% | 8.2% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 97.9% | 9.1% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 98.3% | 11.1% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
These quantities are defined by:
sites recognized = (observed sites predicted as sites) / (all observed sites) = TP / (TP + FN)
false positive sites = (observed non-sites predicted as sites) / (all observed non-sites) = FP / (FP + TN)
correlation coefficient (CC) = ((TP x TN) - (FN x FP)) / sqrt((TP + FN) x (TN + FP) x (TP + FP) x (TN + FN))
where
TP = true positives = observed sites predicted as sites
TN = true negatives = observed non-sites predicted as non-sites
FP = false positives = observed non-sites predicted as sites
FN = false negatives = observed sites predicted as non-sites
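As a small worked sketch, the three measures can be computed directly from the four counts. The function names and the example counts below are hypothetical and are not taken from the test sets reported here. The correlation coefficient is returned as None when its denominator is zero, for example when nothing at all is predicted as a site.

```python
import math

def sites_recognized(tp, fn):
    """Fraction of observed sites that were predicted as sites."""
    return tp / (tp + fn)

def false_positive_sites(fp, tn):
    """Fraction of observed non-sites that were predicted as sites."""
    return fp / (fp + tn)

def correlation_coefficient(tp, tn, fp, fn):
    """CC as defined above; None when the denominator is zero."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return None
    return (tp * tn - fn * fp) / denom

# Hypothetical counts, for illustration only.
tp, fn, fp, tn = 90, 10, 5, 95
print(f"sites recognized:     {100 * sites_recognized(tp, fn):.1f}%")
print(f"false positive sites: {100 * false_positive_sites(fp, tn):.1f}%")
print(f"CC:                   {correlation_coefficient(tp, tn, fp, fn):.2f}")
```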
Human 3' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 7.3% | 0.0% | 0.25 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 33.3% | 0.4% | 0.52 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 47.9% | 0.5% | 0.64 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 57.7% | 0.6% | 0.70 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 61.2% | 0.9% | 0.72 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 65.4% | 1.1% | 0.74 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 69.7% | 1.3% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 73.5% | 1.5% | 0.79 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 76.5% | 1.8% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 79.1% | 2.0% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 80.8% | 2.4% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 82.5% | 2.9% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 83.8% | 3.1% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 86.8% | 3.7% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 88.5% | 4.0% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 88.5% | 4.5% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 90.2% | 4.8% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 91.0% | 6.0% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 92.3% | 7.9% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 94.9% | 10.4% | 0.74 |
| | | | |
+------------+-----------+----------------+------------+
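One way to use such a table is to fix the largest false positive percentage that can be tolerated and then take the threshold that recognizes the most sites within that limit. The sketch below does this for a subset of rows copied from the human 3' table; the function name pick_threshold and the 2.0% limit are assumptions chosen for illustration.

```python
# (threshold, % sites recognized, % false positive sites, CC):
# a subset of rows copied from the human 3' splice site table.
HUMAN_3PRIME = [
    (0.95, 33.3, 0.4, 0.52),
    (0.90, 47.9, 0.5, 0.64),
    (0.80, 61.2, 0.9, 0.72),
    (0.70, 69.7, 1.3, 0.77),
    (0.60, 76.5, 1.8, 0.80),
    (0.50, 80.8, 2.4, 0.81),
    (0.40, 83.8, 3.1, 0.81),
    (0.30, 88.5, 4.0, 0.82),
    (0.20, 90.2, 4.8, 0.82),
    (0.10, 92.3, 7.9, 0.77),
]

def pick_threshold(rows, max_false_positive_pct):
    """Return the row recognizing the most sites within the false positive limit."""
    allowed = [row for row in rows if row[2] <= max_false_positive_pct]
    if not allowed:
        return None  # no listed threshold is strict enough
    return max(allowed, key=lambda row: row[1])

print(pick_threshold(HUMAN_3PRIME, 2.0))  # (0.6, 76.5, 1.8, 0.8)
```

Lowering the threshold always trades more recognized sites for more false positives, so the limit on false positives is the only choice that has to be made by hand.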
Neural Network based "consensi" sequences:
Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined"
5' and 3' splice site consensus and non-consensus sequences:
5' Splice Site:
position:       -7   6    5    4    3    2    -1      +1   2    3    4    5    6    7    +8
consensus:      a    a    a    A    C|a  A    G    /  G    T    A    A    G    T    -    c
non-consensus:  g    g    g    G    G|T  G|T  A|T     -    -    C|t  g|t  -    -    t    -
3' Splice Site:
-21 -20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 -1
consensus: - T T T|c T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A T|C A G
non-consensus: G
+1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 +20
consensus: G T c - - - g g - g g|a c g a a a|c a g - -
non-consensus: c|t t g|t
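To illustrate how the notation reads, the sketch below compares a 15-base window (positions -7 to +8 around a candidate 5' site) against the 5' consensus and non-consensus rows, position by position. It only tallies matches and ignores the difference between strong and weak weights. It is a reading aid under these assumptions, not the scoring performed by the network, and the function name and example windows are hypothetical.

```python
# 5' splice site rows as listed above, one token per position (-7 .. +8).
CONSENSUS_5 = "a a a A C|a A G G T A A G T - c".split()
NON_CONSENSUS_5 = "g g g G G|T G|T A|T - - C|t g|t - - t -".split()

def count_matches(window, pattern):
    """Count positions whose base appears in the pattern token; '-' tokens are skipped."""
    if len(window) != len(pattern):
        raise ValueError("window and pattern must have the same length")
    count = 0
    for base, token in zip(window.upper(), pattern):
        if token != "-" and base in token.upper().split("|"):
            count += 1
    return count

# Hypothetical example windows; the candidate GT sits at positions +1 and +2.
good = "AAAACAGGTAAGTAC"
poor = "GGGGGGAGTCGTTTT"
for window in (good, poor):
    print(window,
          "consensus:", count_matches(window, CONSENSUS_5),
          "non-consensus:", count_matches(window, NON_CONSENSUS_5))
```

A window that agrees with many consensus positions and few non-consensus positions resembles the sites the network weights favor; the legend that follows explains the symbols used in the rows.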
Capital letters indicate strong weights; lower-case letters indicate weaker weights.
"|" means "or".
"-" means no significant weight.
"non-consensus" indicates bases that are very unlikely to appear at this position.
Drosophila melanogaster
An independent test set of 41 genes
(Drosophila melanogaster gene set),
randomly chosen and containing no sequences related to the training set,
gave the following results:
Drosophila melanogaster 5' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 0.0% | 0.0% | - |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 22.9% | 0.0% | 0.44 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 53.3% | 0.0% | 0.69 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 61.9% | 0.0% | 0.75 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 66.7% | 0.0% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 69.5% | 0.8% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 77.1% | 0.8% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 78.1% | 1.0% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 81.9% | 1.0% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 82.9% | 1.0% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 88.6% | 1.8% | 0.88 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 90.5% | 2.5% | 0.88 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 91.4% | 3.0% | 0.88 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 91.4% | 4.0% | 0.85 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 94.3% | 4.8% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 96.2% | 5.3% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 97.1% | 5.8% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 97.1% | 8.0% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 99.1% | 10.3% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 99.1% | 15.1% | 0.73 |
| | | | |
+------------+-----------+----------------+------------+
Drosophila melanogaster 3' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 1.9% | 0.0% | 0.12 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 11.4% | 0.0% | 0.30 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 28.6% | 0.6% | 0.46 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 44.8% | 0.6% | 0.60 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 53.3% | 1.1% | 0.65 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 60.1% | 2.0% | 0.69 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 69.5% | 2.3% | 0.74 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 73.3% | 2.5% | 0.76 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 76.2% | 3.1% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 79.0% | 4.2% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 83.8% | 5.4% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 87.6% | 5.9% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 90.5% | 6.5% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 92.4% | 7.0% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 94.3% | 9.0% | 0.79 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 94.3% | 10.7% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 96.2% | 13.0% | 0.75 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 96.2% | 14.7% | 0.73 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 96.2% | 17.5% | 0.69 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 97.1% | 30.7% | 0.56 |
| | | | |
+------------+-----------+----------------+------------+