Format for submitting annotations to the Genome Annotation Experiment

General guidelines

Example of format:


# organism: Drosophila melanogaster 
# std1 
Adh     std1    TFBS    32002   32006   .       +       .        
Adh     std1    TATA_signal     32009   32012   .       +       .       transcript "1" 
Adh     std1    TSS     32033   32034   .       +       .       transcript "1" 
Adh     std1    prim_transcript 32034   33122   .       +       .       transcript "1" 
Adh     std1    exon    32034   32277   .       +       .       transcript "1" 
Adh     std1    start_codon     32122   32124   .       +       .       transcript "1" 
Adh     std1    CDS     32122   32277   .       +       .       transcript "1" 
Adh     std1    splice5 32277   32278   .       +       .       transcript "1" 
Adh     std1    splice3 32332   32333   .       +       .       transcript "1" 
Adh     std1    exon    32333   32452   .       +       .       transcript "1" 
Adh     std1    CDS     32333   32452   .       +       .       transcript "1" 
Adh     std1    splice5 32452   32453   .       +       .       transcript "1" 
Adh     std1    splice3 32571   32572   .       +       .       transcript "1" 
Adh     std1    exon    32572   32729   .       +       .       transcript "1" 
Adh     std1    CDS     32572   32729   .       +       .       transcript "1" 
Adh     std1    splice5 32729   32730   .       +       .       transcript "1" 
Adh     std1    splice3 32784   32785   .       +       .       transcript "1" 
Adh     std1    exon    32785   32830   .       +       .       transcript "1" 
Adh     std1    CDS     32785   32830   .       +       .       transcript "1" 
Adh     std1    splice5 32830   32831   .       +       .       transcript "1" 
Adh     std1    splice3 32825   32826   .       +       .       transcript "1" 
Adh     std1    CDS     32826   33003   .       +       .       transcript "1" 
Adh     std1    exon    32826   33122   .       +       .       transcript "1" 
Adh     std1    stop_codon      33001   33003   .       +       .       transcript "1" 
Adh     std1    polyA_signal    33090   33095   .       +       .       transcript "1" 
Adh     std1    polyA_site      33101   33102   .       +       .       transcript "1" 
Adh     std1    prim_transcript 38100   41973   .       -       .       transcript "2" 
Adh     std1    exon    38100   41973   .       -       .       transcript "2" 
Adh     std1    polyA_site      39620   39621   .       -       .       transcript "2" 
Adh     std1    polyA_signal    39685   39690   .       -       .       transcript "2" 
Adh     std1    stop_codon      40125   40127   .       -       .       transcript "2" 
Adh     std1    CDS     40125   40390   .       -       .       transcript "2"  
Adh     std1    start_codon     40388   40390   .       -       .       transcript "2" 
Adh     std1    TSS     41973   41974   .       -       .       transcript "2" 
Adh     std1    TATA_signal     41998   42001   .       -       .       transcript "2" 
Adh     std1    TFBS    42187   42193   .       -       .         
Adh     std1    TFBS    42211   42216   .       -       .         

Explanations for the format

The organizers guarantee that they will analyze predictions for the following features in the annotation file. This list should not limit the variety of features you predict.

TFBS		transcription factor binding site.

TATA_signal	TATA-box (TBP binding site).

TSS 		transcription start site (note that the transcription start lies
		between the <start> and <end> positions in the GFF file; historically,
		<start> is position -1 and <end> is position +1).

prim_transcript	primary (initial, unprocessed) transcript.

exon 		region of the genome that codes for a portion of the spliced mRNA (does not always contain CDS).

start_codon	start codon (ATG).

CDS		coding sequence; sequence of nucleotides that
                corresponds with the sequence of amino acids in a
                protein (location includes start and stop codon). 

splice5		5' splice site (note that the exon/intron boundary lies between <start> and <end>:
		the last base of the exon is <start> and the "G" of the "GT" consensus sequence is
		<end>, for a gene on the forward strand).

splice3		3' splice site (note that the intron/exon boundary lies between <start> and <end>:
		the "G" of the "AG" consensus sequence is <start> and the first base of the exon
		is <end>, for a gene on the forward strand).

stop_codon	stop codon (TAA|TGA|TAG).

polyA_signal	recognition region necessary for endonuclease cleavage
                of an RNA transcript.

polyA_site 	site on an RNA transcript to which adenine residues will be
                added by post-transcriptional polyadenylation (note that the site
		lies between <start> and <end>).
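
The splice5 and splice3 coordinate convention described above can be sketched in a few lines of Python. This is only an illustration, assuming forward-strand exons given as 1-based inclusive (start, end) pairs sorted by position; the function name splice_sites is hypothetical and not part of any GFF tool.

```python
def splice_sites(exons):
    """Derive splice5/splice3 features from sorted forward-strand exons.

    exons: list of (start, end) tuples, 1-based and inclusive.
    Returns a list of (feature, start, end) tuples.
    """
    features = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # splice5: <start> is the last exon base, <end> is the "G" of "GT".
        features.append(("splice5", prev_end, prev_end + 1))
        # splice3: <start> is the "G" of "AG", <end> is the first exon base.
        features.append(("splice3", next_start - 1, next_start))
    return features


# First three exons of transcript "1" from the std1 example.
exons = [(32034, 32277), (32333, 32452), (32572, 32729)]
sites = splice_sites(exons)
for feature, start, end in sites:
    print(feature, start, end)
```

For these three exons the derived coordinates agree with the corresponding splice5 and splice3 lines of the std1 example.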

Please also note that it is very important that you group your predictions together (usually by gene) using the last column of the GFF format. In our example there are 2 genes predicted, transcript "1" and transcript "2". Note also that this last column now follows the newer GFF version 2!
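
As a rough sketch of how the grouping column can be read back, the Python below splits each line into the eight fixed fields plus the optional last column and collects records per transcript. It assumes whitespace-separated fields as shown in the example; the helper names parse_gff_line and group_by_transcript are hypothetical, and features without a transcript tag (such as TFBS) are collected under None.

```python
import re
from collections import defaultdict


def parse_gff_line(line):
    """Parse one annotation line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 8)  # eight fixed fields + optional group column
    if len(fields) < 8:
        raise ValueError("expected at least 8 fields: %r" % line)
    group = fields[8] if len(fields) > 8 else ""
    match = re.search(r'transcript\s+"([^"]*)"', group)
    return {
        "seqname": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "transcript": match.group(1) if match else None,
    }


def group_by_transcript(lines):
    """Collect parsed records under their transcript identifier."""
    groups = defaultdict(list)
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            groups[record["transcript"]].append(record)
    return dict(groups)


EXAMPLE = """\
# std1
Adh     std1    TFBS    32002   32006   .       +       .
Adh     std1    exon    32034   32277   .       +       .       transcript "1"
Adh     std1    CDS     32122   32277   .       +       .       transcript "1"
Adh     std1    exon    38100   41973   .       -       .       transcript "2"
"""

groups = group_by_transcript(EXAMPLE.splitlines())
for name, records in groups.items():
    print(name, [r["feature"] for r in records])
```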

A typical "gene"-finding prediction for transcript "1" submitted by groupX should look like the following. (Note that in this example the 3' splice site of the second exon is off, which also results in a different CDS annotation. In addition, the third exon is missed.)


# groupX 
Adh     groupX    start_codon     32122   32124   .       +       .       transcript "1" 
Adh     groupX    CDS     32122   32277   .       +       .       transcript "1" 
Adh     groupX    splice5 32277   32278   .       +       .       transcript "1" 
Adh     groupX    splice3 32382   32383   .       +       .       transcript "1" 
Adh     groupX    CDS     32383   32452   .       +       .       transcript "1" 
Adh     groupX    splice5 32452   32453   .       +       .       transcript "1" 
Adh     groupX    splice3 32571   32572   .       +       .       transcript "1" 
Adh     groupX    CDS     32572   32830   .       +       .       transcript "1" 
Adh     groupX    splice5 32830   32831   .       +       .       transcript "1" 
Adh     groupX    splice3 32825   32826   .       +       .       transcript "1" 
Adh     groupX    CDS     32826   33003   .       +       .       transcript "1" 
Adh     groupX    stop_codon      33001   33003   .       +       .       transcript "1" 
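
To see where the groupX prediction departs from std1, the CDS lines of both can be compared by exact coordinates. The Python below is a minimal sketch of such a comparison; exact matching of (start, end, strand) is an assumption made for illustration and is not necessarily how the organizers analyze submissions. The helper name feature_set is hypothetical.

```python
def feature_set(text, feature):
    """Return the set of (start, end, strand) for one feature type."""
    found = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if fields[2] == feature:
            found.add((int(fields[3]), int(fields[4]), fields[6]))
    return found


# CDS lines of transcript "1", copied from the two examples.
STD1 = """\
Adh     std1    CDS     32122   32277   .       +       .       transcript "1"
Adh     std1    CDS     32333   32452   .       +       .       transcript "1"
Adh     std1    CDS     32572   32729   .       +       .       transcript "1"
Adh     std1    CDS     32785   32830   .       +       .       transcript "1"
Adh     std1    CDS     32826   33003   .       +       .       transcript "1"
"""

GROUPX = """\
Adh     groupX    CDS     32122   32277   .       +       .       transcript "1"
Adh     groupX    CDS     32383   32452   .       +       .       transcript "1"
Adh     groupX    CDS     32572   32830   .       +       .       transcript "1"
Adh     groupX    CDS     32826   33003   .       +       .       transcript "1"
"""

standard = feature_set(STD1, "CDS")
predicted = feature_set(GROUPX, "CDS")

shared = standard & predicted   # predicted exactly as annotated
missed = standard - predicted   # annotated but not predicted exactly
extra = predicted - standard    # predicted but not annotated

print("shared:", sorted(shared))
print("missed:", sorted(missed))
print("extra: ", sorted(extra))
```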

GenBank to GFF converter:

genbank2gff: a Perl script that converts GenBank format to GFF format. You will also need the Perl GFF module Gff.pm. Both are provided by David Kulp.


Last modified: Fri Jun 25 10:43:39 PDT 1999