Splice Sites: A detailed neural network study
Martin G. Reese and Frank H. Eeckman
Lawrence Berkeley National Laboratory, Genome Informatics Group, 1 Cyclotron Road, Berkeley, CA 94720
Many large-scale sequencing efforts are entering a phase in which sequence is produced at a rate of 40 kb or more per day per laboratory. Sophisticated tools are needed to annotate this volume of data. Given the nature of human genomic sequence, where large introns are known to exist, we recognize the need for highly specific gene-finding algorithms.
Existing gene-finding methods rely on a heuristic mix of content statistics and specific signal sequences, with or without homology information. The goal is always to obtain the complete and exact sequence of a particular gene. Homology hits or content statistics may give a strong hint that a particular sequence has coding potential, but these predictions must always be refined using signal sequences if the exact structure of a gene is to be found. Splice sites are the key signal sequences that determine the boundaries of the exons. We therefore need an optimal splice site recognition method. Such a method should ideally be based on a thorough understanding of the complex eukaryotic splicing process. The newest available genomic data can be used to shed some light on this process.
We analyze the structure of donor and acceptor sites using a separate neural network recognizer for each type of site. The two recognizers were developed as described in Brunak et al., 1991. For each, we trained a feedforward neural network with one layer of hidden units by backpropagation, using a novel, optimized, representative data set. Unlike Brunak et al., we only consider genes whose splice sites are constrained to the consensus dinucleotides, i.e., GT for the donor site and AG for the acceptor site. The output of each network is a score for a potential splice site.
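The scoring step described above can be sketched as follows. This is only an illustration under stated assumptions, not the recognizers used in this study: the window sizes, the one-hot input encoding, the number of hidden units, and the function names (candidate_sites, one_hot_window, init_network, score) are all hypothetical, and the weights are random rather than trained by backpropagation.

```python
import math
import random

BASES = "ACGT"

def candidate_sites(seq, dinucleotide):
    """Return the start index of every occurrence of the dinucleotide."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]

def one_hot_window(seq, pos, left, right):
    """Encode seq[pos-left : pos+right] with four inputs per base.

    Returns None when the window would run off either end of the sequence.
    Bases other than A, C, G, T (e.g. N) are encoded as four zeros.
    """
    start, end = pos - left, pos + right
    if start < 0 or end > len(seq):
        return None
    vec = []
    for base in seq[start:end]:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(n_inputs, n_hidden, seed=0):
    """Small random weights; the last weight of each unit is its bias.

    Assumption: these weights are untrained placeholders.
    """
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def score(network, x):
    """Forward pass through one hidden layer to a single output in (0, 1)."""
    w_hidden, w_out = network
    hidden = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
              for w in w_hidden]
    return sigmoid(w_out[-1] + sum(wi * hi for wi, hi in zip(w_out, hidden)))

if __name__ == "__main__":
    seq = "ACGGTAAGTCCAGGTGAGTTTACGGTAAGTCC"
    left, right = 3, 6                      # hypothetical window around GT
    net = init_network(4 * (left + right), n_hidden=5)
    for pos in candidate_sites(seq, "GT"):
        x = one_hot_window(seq, pos, left, right)
        if x is not None:
            print(pos, round(score(net, x), 3))
```

An acceptor recognizer would follow the same pattern with "AG" as the dinucleotide and its own window and weights.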
The correlation coefficient (CC) for donor site prediction is 0.855 in an optimized network with one layer of hidden units, versus 0.810 for a network with no hidden units; for acceptor site prediction, the corresponding values are 0.824 and 0.750. At the 5% false positive level, 6% of the real donor sites and 9% of the real acceptor sites are missed.
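For readers who want to reproduce this kind of evaluation, the two measures can be sketched as below. The sketch rests on two assumptions that the text does not spell out: that CC is the Matthews correlation coefficient computed from the confusion counts, and that the "false positive level" is the fraction of non-sites scoring above the decision threshold. The function names are hypothetical.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion counts.

    Returns 0.0 when the denominator vanishes (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def confusion_counts(scores, labels, threshold):
    """Count tp, tn, fp, fn when a site is predicted if score >= threshold."""
    tp = tn = fp = fn = 0
    for s, is_real in zip(scores, labels):
        predicted = s >= threshold
        if predicted and is_real:
            tp += 1
        elif predicted and not is_real:
            fp += 1
        elif is_real:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def missed_fraction_at_fp_rate(scores, labels, fp_rate):
    """Fraction of real sites missed when at most fp_rate of the
    non-sites are allowed to score above the cutoff."""
    negatives = sorted((s for s, y in zip(scores, labels) if not y),
                       reverse=True)
    positives = [s for s, y in zip(scores, labels) if y]
    if not positives:
        raise ValueError("no real sites in the data")
    allowed = int(fp_rate * len(negatives))
    if allowed >= len(negatives):
        return 0.0                      # every score is accepted
    cutoff = negatives[allowed]         # a site is predicted iff score > cutoff
    missed = sum(1 for s in positives if s <= cutoff)
    return missed / len(positives)
```

With real network outputs, `missed_fraction_at_fp_rate(scores, labels, 0.05)` would give a figure comparable in kind to the miss rates quoted above, provided the assumed definition matches the one used in the study.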
We studied the distributions of the scores produced by the neural networks as a function of the corresponding exon length, the frame of the splice site, and the sequence distance to nearby GT and AG dinucleotides. We noticed several interesting features; for example, GT sites closest to a real donor site have weaker scores than GT sites farther away, whereas the same is not true for AG sequences near real acceptor sites. More detailed results will be presented at the conference.
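One way to tabulate scores against distance to a real site is sketched here. It is a hypothetical illustration, not the analysis performed in this study: the binning scheme, the use of the mean as a summary, and the function names are assumptions.

```python
from collections import defaultdict

def nearest_real_distance(pos, real_sites):
    """Distance in bases from pos to the closest annotated real site."""
    return min(abs(pos - r) for r in real_sites)

def mean_score_by_distance(scored_candidates, real_sites, bin_width):
    """Average score of non-real candidate sites, grouped by distance bin.

    scored_candidates: list of (position, score) pairs, e.g. all GT
    occurrences with their donor-network score.
    real_sites: non-empty list of annotated real site positions.
    Returns {bin_start: mean_score}, with bins of bin_width bases.
    """
    real = set(real_sites)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, s in scored_candidates:
        if pos in real:
            continue                    # only false candidates are binned
        d = nearest_real_distance(pos, real_sites)
        b = (d // bin_width) * bin_width
        sums[b] += s
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

if __name__ == "__main__":
    # Made-up positions and scores, for illustration only.
    candidates = [(5, 0.1), (12, 0.2), (40, 0.6), (100, 0.9)]
    print(mean_score_by_distance(candidates, real_sites=[10, 100], bin_width=10))
```

The same tabulation applies to AG candidates and real acceptor sites, which allows the two distributions to be compared directly.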