In a collaboration between the Computational Biology Group at the University of California, Santa Cruz, headed by David Haussler, and the Genome Informatics Group at Lawrence Berkeley National Laboratory (LBNL), headed by Frank Eeckman, we have developed a new gene-finding program called GENIE.
GENIE uses a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence. Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized gene data set, which we provide for the community to test gene-finding tools.
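To make the idea of transition and emission probabilities concrete, the following minimal sketch finds the most probable parse of a short sequence with the Viterbi algorithm. It is a plain two-state hidden Markov model, not GENIE's generalized model (in a GHMM a state can generate a whole variable-length segment, such as an exon, at once). The state names and every probability below are invented for illustration and are not GENIE's parameters.

```python
import math

# Hypothetical two-state model. All numbers are illustrative only.
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # Work in log space so long sequences do not underflow.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this base.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("ATATGCGCGCGCGCGCATAT"))
```

In this toy model the "coding" state simply prefers G and C, so a long G/C-rich run is labeled as coding once the gain in emission probability outweighs the cost of switching states. Training, as described above, would amount to estimating the entries of `START`, `TRANS` and `EMIT` from annotated sequences.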
GENIE's performance was tested on a second data set provided by Burset and Guigo (1996). This data set of 570 genes from different organisms was used in Burset and Guigo (1996) to compare different gene-finding methods. In the tables below, GENIE's results are added to those reproduced from the Burset and Guigo paper. The first table shows the results for gene finders that do not use information about existing protein homologs in the databases. The second table shows the performance of GENIE and two other programs when information about existing protein homologs is used for prediction.
| Method | Base-level Sn | Base-level Sp | Base-level AC | Exon-level Sn | Exon-level Sp | Exon-level (Sn+Sp)/2 | ME | WE |
|---|---|---|---|---|---|---|---|---|
| Genie | 0.87 | 0.88 | 0.85 | 0.69 | 0.70 | 0.69 | 0.10 | 0.15 |
| FGENEH | 0.77 | 0.85 | 0.78 | 0.61 | 0.61 | 0.61 | 0.15 | 0.11 |
| GeneID | 0.63 | 0.81 | 0.67 | 0.44 | 0.45 | 0.45 | 0.28 | 0.24 |
| GeneParser2 | 0.66 | 0.79 | 0.66 | 0.35 | 0.39 | 0.37 | 0.29 | 0.17 |
| GenLang | 0.72 | 0.75 | 0.69 | 0.50 | 0.49 | 0.50 | 0.21 | 0.21 |
| GRAILII | 0.72 | 0.84 | 0.75 | 0.36 | 0.41 | 0.38 | 0.25 | 0.10 |
| SORFIND | 0.71 | 0.85 | 0.73 | 0.42 | 0.47 | 0.45 | 0.24 | 0.14 |
| Xpound | 0.61 | 0.82 | 0.68 | 0.15 | 0.17 | 0.16 | 0.32 | 0.13 |
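Sn is sensitivity, Sp is specificity, AC is the approximate correlation, ME is the fraction of missing exons, and WE is the fraction of wrong exons. In the first six numeric columns (Sn through (Sn+Sp)/2), higher values indicate better performance. In the last two columns (ME and WE), lower values indicate better performance.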
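As a rough guide to how such figures are obtained, the sketch below computes the base-level and exon-level measures from a known annotation and a prediction. It follows the definitions as they are commonly stated for this benchmark (note that Sp here is TP/(TP+FP), the convention in the gene-finding literature). The function names and input encodings are our own assumptions for illustration, not code from GENIE or from the Burset and Guigo evaluation.

```python
def base_level(actual, predicted):
    """Sn, Sp and AC from per-base labels ('C' coding, 'N' noncoding)."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)
    tn = sum(a == "N" and p == "N" for a, p in pairs)
    fp = sum(a == "N" and p == "C" for a, p in pairs)
    fn = sum(a == "C" and p == "N" for a, p in pairs)
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    # AC rescales the average of up to four conditional probabilities
    # to the range [-1, 1]; terms with a zero denominator are skipped.
    ratios = [(tp, tp + fn), (tp, tp + fp), (tn, tn + fp), (tn, tn + fn)]
    terms = [num / den for num, den in ratios if den]
    ac = 2 * (sum(terms) / len(terms) - 0.5) if terms else float("nan")
    return sn, sp, ac


def overlaps(x, y):
    """True if two inclusive (start, end) intervals share any base."""
    return x[0] <= y[1] and y[0] <= x[1]


def exon_level(actual, predicted):
    """Sn, Sp, ME and WE from non-empty sets of (start, end) exons."""
    correct = len(actual & predicted)  # both boundaries must match exactly
    sn = correct / len(actual)
    sp = correct / len(predicted)
    # Missing exons: actual exons that no predicted exon overlaps.
    me = sum(not any(overlaps(a, p) for p in predicted) for a in actual) / len(actual)
    # Wrong exons: predicted exons that overlap no actual exon.
    we = sum(not any(overlaps(p, a) for a in actual) for p in predicted) / len(predicted)
    return sn, sp, me, we
```

An exon that is predicted with one wrong boundary counts against exon-level Sn and Sp, but it is neither missing nor wrong, which is why ME and WE can be low even when exon-level Sn and Sp are modest.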
| Method | Base-level Sn | Base-level Sp | Base-level AC | Exon-level Sn | Exon-level Sp | Exon-level (Sn+Sp)/2 | ME | WE |
|---|---|---|---|---|---|---|---|---|
| Genie | 0.95 | 0.91 | 0.91 | 0.77 | 0.74 | 0.76 | 0.04 | 0.13 |
| GeneID+ | 0.91 | 0.90 | 0.88 | 0.73 | 0.70 | 0.71 | 0.07 | 0.13 |
| GeneParser3 | 0.86 | 0.91 | 0.86 | 0.56 | 0.58 | 0.57 | 0.14 | 0.09 |
GENIE was presented at the 4th Conference on Intelligent Systems for Molecular Biology in St. Louis, June 1996.
Read abstract.
Download paper (56 Kb compressed postscript).
GENIE uses a neural network recognizer for splice sites; this splice site predictor can also be accessed independently via our splice site WWW interface.