The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

Abstract

Tutorial goals

Tutorial organization

What is a gene?

What are annotations?

How does an annotation differ from a gene?

Transcription and translation

Schematic gene structure

Sequence feature types

DNA transcription unit features

mRNA features

Definitions for data modeling

Annotation

Annotation process overview

Types of sequence data

Auxiliary data

Computational annotation tools

Database resources

Biological issues in annotation

Engineering issues in annotation

Engineering issues in annotation (cont.)

Engineering issues in annotation (cont.)

Engineering issues in annotation (cont.)

Engineering issues in annotation (cont.)

Drosophila melanogaster

Drosophila Genome Project

Goals of the Drosophila Genome Project

Sequencing at the BDGP

The BDGP sequence annotation process

What sequence to start with?

Which analyses need to be run?

Which analyses need to be run and how?

What public sequence data sets are needed?

Which analyses need to be run and how? (cont.)

How do you achieve computational throughput?

What do you do with the results?

Is human curation needed?

Gene Skimmer

Gene Skimmer (cont.)

CloneCurator

How do we annotate gene/protein function?

Ontology browser

Ontology browser: searching for terms

How do you distribute the data?

Ribbon

Ribbon (cont.)

How do you manage the data?

How do you maintain annotations?

Integrated annotation systems

Integrated annotation systems: ACeDB

ACeDB

Genotator

Magpie

GAIA

TIGR Human Gene Index

Computational analysis tools

Gene finding: Prokaryotes vs. Eukaryotes

Gene finding: Prokaryotes vs. Eukaryotes (cont.)

Integrated gene finding

Integrated gene finding: Dynamic programming

Integrated gene finding: Dynamic programming (cont.)

Integrated gene finding: Linear and Quadratic Discriminant Analysis (LDA/QDA)

Integrated gene finding: Feed-forward neural networks

Approaches to gene finding: Hidden Markov models

Approaches to gene finding: Generalized hidden Markov models

Gene finding software

Promoter recognition

Promoter recognition (cont.)

Promoter recognition (cont.)

Promoter recognition (cont.)

Example: NNPP

Promoter recognition (cont.)

Splice site prediction

Splice site prediction (cont.)

Splice site prediction (cont.)

Start codon prediction

Polyadenylation signal prediction

Prediction of coding potential

Prediction of coding potential (cont.)

Prediction of coding potential (cont.)

Prediction of coding potential (cont.)

Prediction of coding potential (cont.)

Prediction of coding exons

“Integrated” gene models: LDA/QDA

“Integrated” gene models: NN

“Integrated” gene models: Artificial intelligence approaches

“Integrated” gene models: Artificial intelligence approaches (cont.)

“Integrated” gene models: HMMs

“Integrated” gene models: GHMMs

Example: Genie

“Integrated” gene models: GHMMs (cont.)

EST/cDNA alignment for gene finding: Spliced alignments

EST/cDNA alignment

EST/cDNA alignment (cont.)

Repeat finders

Repeat finders (cont.)

Homology searching

Gene family searching

The genome annotation experiment (GASP1)

Goals of the experiment

Adh contig

Adh paper (to appear in Genetics)

Raw sequence: Adh.fa

Drosophila data sets provided to participants

Timetable

Resources for assessing predictions

Curated data sets for assessing predictions

Curated data sets for assessing predictions (cont.)

Curated data sets for assessment

Submission format

Sample submission

Submissions

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submissions (cont.)

Submission classes

Submission classes (cont.)

Gene finding techniques

Measuring success

Definitions and formulae

Genes: True positives (TP)

Genes: False positives (FP)

Genes: False negatives (FN)

Toy example 1 (1)

Genes: Missing genes (MG)

Genes: Wrong genes (WG)

Toy example 1 (2)

Genes: Std1 versus Std3

Toy example 1 (3)

Genes: Std1 and Std3 versus “real” gene structure

Toy example 1 (4)

Toy example 1 (5): Exon level

Genes: Joined genes (JG)

Genes: Split genes (SG)

Definition: “Joined” and “split” genes

Toy example 2 (1)

Annotation experiment results

Results: Base level

Results: Exon level

Results: Gene level

Results: Gene level (cont.)

Results (protein homology): Base level

Results (protein homology): Exon level

Results (protein homology): Gene level

Transcription start site (TSS): Standard 1

TSS: Standard 3

Results: TSS recognition

Interesting gene examples: bubblegum

Adh/Adhr (Alcohol dehydrogenase/Adh related)

Adh/Adhr (cont.)

osp (outspread)

cact (cactus)

kuz (kuzbanian)

beat (beaten path)

Idgf1, Idgf2, Idgf3 (Imaginal disc growth factor)

Idgf1, Idgf2, Idgf3 (cont.)

Conclusion of GASP1

Conclusion of GASP1 (cont.)

Discussion GASP1

Conclusions on annotating complete eukaryotic genomes

Conclusions on annotating complete eukaryotic genomes (cont.)

Discussion on annotating complete eukaryotic genomes

Acknowledgments