ORFFinder, a program to detect open reading frames, frameshifts and start codons in Drosophila melanogaster cDNAs

E. Frise, M.G. Reese and G.M. Rubin.
Department of Molecular and Cell Biology,
Life Science Addition #3200,
University of California,
Berkeley, CA 94720-3200.
erwin@fruitfly.berkeley.edu, mgreese@lbl.gov, gerry@fruitfly.berkeley.edu

Abstract:

The Berkeley Drosophila Genome Project (BDGP) systematically sequences cDNAs from various tissue sources. The large number of sequences generated in such a project makes a tool for sequence quality control necessary. Identifying a credible coding region also enables the BDGP and others to carry out subsequent protein analysis. We have developed a program called ORFFinder that finds the open reading frame (ORF), putative frameshifts and the start codon in a cDNA by integrating Markov models and neural networks. ORFFinder is written entirely in Java (JDK 1.1) for portability. The program can be run from the command line, in graphical mode or as an applet in a web browser, and it allows visualization of the results.

ORFFinder uses a 5th-order Markov model to predict the most likely open reading frame. We trained the Markov model on a collection of 809 Drosophila melanogaster cDNA sequences derived from GenBank. The program finds positions of putative frameshifts by comparing in-frame and out-of-frame Markov model scores before and after every sequence position. Potential frameshifts are corrected automatically and the resulting reading frame is evaluated using a knowledge base. The positions of detected frameshifts are reported for further detailed analysis, and the corrected ORF is translated and its amino acid sequence is shown in the program output.
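
The frame comparison described above can be illustrated with a minimal sketch. It is not ORFFinder's code: the class and method names, the add-one smoothing, the uniform background and the use of the whole prefix and suffix (rather than a local window) are all assumptions made for illustration, and it is written in modern Java rather than JDK 1.1. The model counts which base follows each context of `order` bases at each codon phase, scores a region as a log-odds sum in a chosen frame, and proposes a frameshift where scoring the two sides in different frames gains the most over the best single frame.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; class and method names are hypothetical, not ORFFinder's API. */
final class FrameMarkovSketch {
    private static final String BASES = "ACGT";
    private final int order; // the abstract uses order 5
    // Key: codon phase (0..2) + ":" + the preceding `order` bases; value: counts of the next base.
    private final Map<String, int[]> counts = new HashMap<>();

    FrameMarkovSketch(int order) { this.order = order; }

    /** Trains on a coding sequence whose first base is assumed to be codon position 0. */
    void train(String cds) {
        for (int i = order; i < cds.length(); i++) {
            int b = BASES.indexOf(cds.charAt(i));
            if (b < 0) continue; // skip ambiguous bases such as N
            String key = (i % 3) + ":" + cds.substring(i - order, i);
            counts.computeIfAbsent(key, k -> new int[4])[b]++;
        }
    }

    /** Log-odds (against a uniform background) of seq[start, end) read in the given frame. */
    double score(String seq, int start, int end, int frame) {
        double sum = 0.0;
        for (int i = Math.max(start, order); i < end; i++) {
            int b = BASES.indexOf(seq.charAt(i));
            if (b < 0) continue;
            int phase = ((i - frame) % 3 + 3) % 3;
            int[] c = counts.get(phase + ":" + seq.substring(i - order, i));
            if (c == null) continue; // unseen context: treated as neutral (assumption)
            int total = c[0] + c[1] + c[2] + c[3];
            sum += Math.log(4.0 * (c[b] + 1) / (total + 4)); // add-one smoothing (assumption)
        }
        return sum;
    }

    /** Frame (0, 1 or 2) with the highest score on seq[start, end). */
    int bestFrame(String seq, int start, int end) {
        int best = 0;
        for (int f = 1; f < 3; f++) {
            if (score(seq, start, end, f) > score(seq, start, end, best)) best = f;
        }
        return best;
    }

    /** Position where changing frame gains the most score, or -1 if no gain exceeds minGain. */
    int likelyFrameshift(String seq, double minGain) {
        int n = seq.length();
        double single = score(seq, 0, n, bestFrame(seq, 0, n));
        int bestPos = -1;
        double bestGain = minGain;
        for (int p = 1; p < n; p++) {
            double split = score(seq, 0, p, bestFrame(seq, 0, p))
                         + score(seq, p, n, bestFrame(seq, p, n));
            if (split - single > bestGain) {
                bestGain = split - single;
                bestPos = p;
            }
        }
        return bestPos;
    }

    /** Small helper so the sketch does not depend on String.repeat (Java 11+). */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) sb.append(unit);
        return sb.toString();
    }

    public static void main(String[] args) {
        FrameMarkovSketch model = new FrameMarkovSketch(5);
        model.train(repeat("GCTGAA", 20)); // toy training data, not real cDNA
        // Toy query: one G has been deleted after the fifth repeat.
        String shifted = repeat("GCTGAA", 5) + "CTGAA" + repeat("GCTGAA", 4);
        System.out.println("best overall frame: " + model.bestFrame(shifted, 0, shifted.length()));
        System.out.println("putative frameshift at: " + model.likelyFrameshift(shifted, 1.0));
    }
}
```

Rescoring the prefix and suffix at every position is quadratic in sequence length; a practical implementation would more likely use running sums or a fixed window, which the abstract does not specify.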

To recognize the correct start codon, the program uses a combination of a neural network and the in-frame coding probability. A feed-forward neural network is trained on the nucleotide sequence around the annotated start codons in the cDNA dataset described above (similar to Brunak et al. (1991), JMB 220: 49-65). The program combines the predictions of the neural network with the differences between the 5th-order Markov model coding scores before and after the start codon position to derive the most likely translation initiation site.
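
One way to picture the combination of the two signals is the following sketch. It is an assumption-laden illustration, not the published method: the abstract does not say how the two scores are weighted or over which regions the coding scores are taken, so the weighted sum, the `flank` and `siteWeight` parameters, and all names are hypothetical. The trained neural network and the Markov model are represented by pluggable scorers, and the two `toy...` methods are deliberately trivial stand-ins so the example runs on its own.

```java
import java.util.function.ToDoubleFunction;

/** Illustrative sketch only; names, weighting and toy scorers are assumptions. */
final class StartCodonSketch {
    /** Stand-in for a frame-aware coding score such as a Markov model log-odds. */
    interface CodingScore {
        double score(String seq, int start, int end, int frame);
    }

    /**
     * Returns the index of the ATG with the highest combined score, or -1 if none qualifies.
     * combined = siteWeight * siteScore(window) + (coding after ATG - coding before ATG)
     */
    static int pickStart(String seq, int flank, ToDoubleFunction<String> siteScore,
                         CodingScore coding, double siteWeight) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = seq.indexOf("ATG"); i >= 0; i = seq.indexOf("ATG", i + 1)) {
            if (i < flank || i + 3 + flank > seq.length()) continue; // window must fit
            String window = seq.substring(i - flank, i + 3 + flank);
            int frame = i % 3; // frame defined by this candidate ATG
            double after = coding.score(seq, i, seq.length(), frame);
            double before = coding.score(seq, 0, i, frame);
            double combined = siteWeight * siteScore.applyAsDouble(window) + (after - before);
            if (combined > bestScore) {
                bestScore = combined;
                best = i;
            }
        }
        return best;
    }

    /** Toy stand-in for the neural network: rewards an A at the first window position. */
    static double toySite(String window) {
        return window.charAt(0) == 'A' ? 1.0 : 0.0;
    }

    /** Toy stand-in for the Markov model: counts in-frame GCT codons in seq[start, end). */
    static double toyCoding(String seq, int start, int end, int frame) {
        int first = start + ((frame - start) % 3 + 3) % 3; // first in-frame codon at or after start
        double n = 0;
        for (int c = first; c + 3 <= end; c += 3) {
            if (seq.startsWith("GCT", c)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Two candidate ATGs in the same frame; only the second has an A three bases upstream.
        String seq = "TTTTTTATGTTTAAAATGGCTGCTGCTGCTTAA";
        int pick = pickStart(seq, 3, StartCodonSketch::toySite, StartCodonSketch::toyCoding, 2.0);
        System.out.println("predicted start codon at index " + pick);
    }
}
```

In this toy input both candidates have the same coding score difference, so the site score decides between them; with `siteWeight` set to 0 the first candidate is kept instead.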

We will present data on the performance of ORFFinder on known and newly sequenced cDNAs.