From: "Anders Krogh"
To: mgreese@lbl.gov
Subject: Re: Information on submissions
Date: Mon, 26 Jul 1999 15:10:20 -0600

Dear Martin,

Here is my description of the gene finder.

I'm just back from vacation, and I found a fax telling me that ISMB is
full and that I cannot register. So I may not make it to the dinner
(which of course is the main reason that I want to go to ISMB). It's
sad. Do you happen to know anyone who can pull some strings?

- Anders

HMMgene
=======

by Anders Krogh
Center for Biological Sequence Analysis
The Technical University of Denmark
Building 208, 2800 Lyngby, Denmark
e-mail: krogh@cbs.dtu.dk

------

HMMgene is an HMM that has modules for coding regions, splice sites,
translation start/stop, etc. This prototype version incorporates
database hits and other `external information' probabilistically. It
will be a fully automated system.

-------

The hidden Markov model was essentially identical to the one I am
currently using for human (experimental). It has modules to recognize
splice sites (and branch points), the regions around translation start
and end, simple modules for 5' and 3' UTR, and of course a module for
coding regions. The prediction is made by finding the most probable
gene structure for a given sequence. Unlike Genscan, it analyzes each
strand separately.

I have recently added the capability of using database matches. For
each database match, a begin and an end are specified along with a type
and a score. For such a region, the probabilities of certain features
are increased in proportion to the score. For an EST, for example, the
probability of coding, 5' UTR, and 3' UTR is increased relative to that
of intergenic and intron. By using probabilities, it is possible for
the system to ignore a database match that does not fit well in a gene
structure, or to use other splice sites than the ones implied by the
database match. The difficulty is tuning the scale of these
probabilities correctly.
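
As a rough illustration of the database-match idea described above, here
is a minimal Python sketch. It is not HMMgene's code. The state names,
the FAVOURED table, the scale parameter, the toy_model helper, and the
choice to add a bonus linear in the match score to the log-score of the
favoured states are all assumptions made for illustration. A standard
Viterbi search then picks the most probable labelling, so a weak match
can be ignored while a strong one pulls the prediction toward
gene-like labels.

```python
import math
from dataclasses import dataclass

# Hypothetical, much simplified label set (the real model has many more states).
STATES = ("intergenic", "utr5", "coding", "intron", "utr3")

# Hypothetical table of which labels each match type favours.
FAVOURED = {
    "EST": {"utr5", "coding", "utr3"},
    "protein": {"coding"},
}


@dataclass
class Match:
    begin: int    # first covered position (0-based, inclusive)
    end: int      # last covered position (inclusive)
    kind: str     # key into FAVOURED
    score: float  # match score; larger means stronger evidence


def match_bonus(length, matches, scale=1.0):
    """Per-position, per-state log-score bonus derived from database matches."""
    bonus = [dict.fromkeys(STATES, 0.0) for _ in range(length)]
    for m in matches:
        for pos in range(max(m.begin, 0), min(m.end, length - 1) + 1):
            for state in FAVOURED[m.kind]:
                bonus[pos][state] += scale * m.score
    return bonus


def viterbi(emission, transition, bonus):
    """Most probable state path from log emissions, log transitions, bonuses."""
    best = [{s: emission[0][s] + bonus[0][s] for s in STATES}]
    back = []
    for t in range(1, len(emission)):
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + transition[p][s])
            row[s] = (best[-1][prev] + transition[prev][s]
                      + emission[t][s] + bonus[t][s])
            ptr[s] = prev
        best.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers from the last position
        state = ptr[state]
        path.append(state)
    return path[::-1]


def toy_model(length, stay=0.9, non_intergenic_penalty=-3.0):
    """Toy log-scores for a sequence that looks intergenic everywhere."""
    switch = (1.0 - stay) / (len(STATES) - 1)
    transition = {p: {s: math.log(stay if p == s else switch) for s in STATES}
                  for p in STATES}
    emission = [{s: 0.0 if s == "intergenic" else non_intergenic_penalty
                 for s in STATES} for _ in range(length)]
    return emission, transition
```

With toy_model(10), a single EST match over positions 3 to 6 can be
passed through match_bonus and viterbi with different scores to see how
the scale decides whether the match is used or ignored.
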
This prediction was the first serious attempt to make a prediction with
the new system. The parameters governing the database match
probabilities need more tuning to optimize performance. One of the most
difficult problems is that EST hits often occur on the wrong strand.
For this problem, a model taking both strands at once has an advantage.
In the prediction submitted, I used several ad-hoc rules to filter the
EST hits.

The model was estimated from a set of genes that was a combination of
the set provided by Martin Reese [multi_exon_GB.dat and
single_exon_GB.dat (*)] and a set originally made by Victor Solovyev
and then checked by Staffan Bergh, Anneli Attersand, and Luis Parodi at
Pharmacia & Upjohn in Stockholm (I will see if the data can be made
publicly available).

I used database matches to Swiss-Prot 37 (I didn't have local access to
a non-redundant database), ESTs [na_EST.dros (*)], cDNA
[na_gb.dros.cDNA.unique.fa.gz (*)], transposons
[transposon_sequence_set.embl.v3.7 (*)], and repeat sequences
[repeat_sequence_set.embl.v2.1.Z (*)].

(*) Data found at
http://www-hgc.lbl.gov/homes/reese/genome-annotation/data/data.html

-------

Bits and pieces of HMMgene are described in

A. Krogh 1998. An Introduction to Hidden Markov Models for Biological
Sequences. In S. L. Salzberg et al., eds., Computational Methods in
Molecular Biology, 45-63. Elsevier.

A. Krogh 1997. Two methods for improving performance of an HMM and
their application for gene finding. In T. Gaasterland et al., eds.,
Proc. ISMB 97, 179-186. Menlo Park, CA: AAAI Press.

Both are available from http://www.cbs.dtu.dk/krogh/refs.html

-------

The old HMMgene is available at
http://www.cbs.dtu.dk/services/HMMgene/
but not for Drosophila.