This talk is focused on the sequence module; we will also discuss parts of the cv module as ontologies are crucial to how chado represents all data.
The actual chado tables themselves are not discussed in attribute-by-attribute detail; these can be browsed by checking out the 'schema' module from the GMOD cvs.
One of the main strengths of chado is that it brings the sequence and genetics views of the world together - I will be mentioning some aspects of the genetics module in this talk.
Simplified Sequence Ontology
isa: subtypes, specialisation/generalization
partof: compositional
Basic Central Dogma Example
One gene, one transcript, one exon, one protein
Alternate Splicing
Dicistronic Gene
Trans-splicing
for example: mod(mdg4) - exons on both strands
(other cases of trans-splicing may involve spatially distributed
primary transcripts)
CDS boundaries + exons IMPLIES CDS exons
exons IMPLIES introns
CDS boundaries + transcript IMPLIES UTR
UTR + exons IMPLIES UTR exons
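The implication rules above can be sketched in a few lines. This is only an illustration: the function names are hypothetical (not part of chado), intervals are assumed to be (start, end) pairs in interbase coordinates on the forward strand, and the example numbers are made up.

```python
# Illustrative sketch only: intervals are (start, end) pairs in interbase
# coordinates on the forward strand. Function names are hypothetical.

def introns_from_exons(exons):
    """exons IMPLIES introns: the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
            if b_start > a_end]

def cds_exons(exons, cds):
    """CDS boundaries + exons IMPLIES CDS exons: clip each exon to the CDS."""
    cds_start, cds_end = cds
    clipped = [(max(s, cds_start), min(e, cds_end)) for s, e in sorted(exons)]
    return [(s, e) for s, e in clipped if s < e]

def utr_exons(exons, cds):
    """UTR + exons IMPLIES UTR exons: exon portions outside the CDS."""
    cds_start, cds_end = cds
    parts = []
    for s, e in sorted(exons):
        if s < cds_start:
            parts.append((s, min(e, cds_start)))
        if e > cds_end:
            parts.append((max(s, cds_end), e))
    return parts

# Made-up example: three exons, CDS starting in the first and ending in the last.
exons = [(100, 200), (300, 400), (500, 650)]
cds = (150, 600)
introns = introns_from_exons(exons)
coding = cds_exons(exons, cds)
utrs = utr_exons(exons, cds)
print(introns, coding, utrs)
```

Genes that break the central dogma rules (trans-splicing, dicistronic genes) would need more than this simple interval arithmetic.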
Two-table structure required for representing graphs
cvterms (controlled vocabulary terms) connected by cvrelationships
The relationship type is itself a controlled term.
Each cvrelationship can be thought of as a SUBJECT
PREDICATE OBJECT statement (eg "GPCR is-a
transmembrane_receptor").
The structure above is exactly the same as the RDF datamodel - many modern ontology languages (eg DAML, OWL) are layered on top of RDF, so the above structure ensures we will be able to represent all the most advanced ontological concepts.
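A minimal sketch of the two-table structure, using an in-memory SQLite database from Python. The cvrelationship column names (subjterm_id, reltype_id, objterm_id) are assumptions for illustration, modeled on the cvpath columns used elsewhere in this talk; check the schema module for the real definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    termname  TEXT NOT NULL UNIQUE
);
-- Column names below are assumed for illustration only.
CREATE TABLE cvrelationship (
    subjterm_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    reltype_id  INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    objterm_id  INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")

# The relationship types (isa, partof) are ordinary rows in cvterm.
terms = ["isa", "partof", "GPCR", "transmembrane_receptor"]
conn.executemany("INSERT INTO cvterm (termname) VALUES (?)", [(t,) for t in terms])

def term_id(name):
    """Look up the cvterm_id for a term name."""
    row = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE termname = ?", (name,)
    ).fetchone()
    return row[0]

# One SUBJECT PREDICATE OBJECT statement: GPCR isa transmembrane_receptor.
conn.execute(
    "INSERT INTO cvrelationship VALUES (?, ?, ?)",
    (term_id("GPCR"), term_id("isa"), term_id("transmembrane_receptor")),
)

statements = conn.execute("""
    SELECT s.termname, p.termname, o.termname
    FROM cvrelationship r
    JOIN cvterm s ON r.subjterm_id = s.cvterm_id
    JOIN cvterm p ON r.reltype_id  = p.cvterm_id
    JOIN cvterm o ON r.objterm_id  = o.cvterm_id
""").fetchall()
print(statements)
```

Because the predicate is just another cvterm row, new relationship types can be added without changing the table structure.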
features are the nodes - feature_relationships are the arcs
Note: the different classes of features could be modeled
relationally; the principle is to keep the stable stuff modeled
relationally, and the fluid/extensible stuff modeled in an
ontology that sits in a generic database structure.
Problem: find all genes;
find all genes (generic)
find all noncoding genes
find all protein coding genes
find all tRNA genes
find all snRNA genes
find all snoRNA genes
...etc eek!
Solution: pre-compute transitive closure
GO Ontology subgraph
Transitive closure of graph:
Solid lines represent the actual relationships. The collection
of dotted lines is the closure of the relationships.
forall x ALWAYS TRUE: x R* x
x R y IMPLIES: x R* y
x R y, y R* z IMPLIES: x R* z
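The three rules above translate directly into a small fixed-point computation. The sketch below is a naive illustration, not the code used to populate the database, and the term names in the example are a made-up fragment of an isa hierarchy.

```python
def reflexive_transitive_closure(nodes, edges):
    """Compute R* from R by applying the three rules until nothing changes."""
    closure = {(x, x) for x in nodes}   # forall x ALWAYS TRUE: x R* x
    closure |= set(edges)               # x R y IMPLIES: x R* y
    changed = True
    while changed:
        changed = False
        for (x, y) in edges:
            # x R y, y R* z IMPLIES: x R* z
            for (y2, z) in list(closure):
                if y2 == y and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

# Made-up isa fragment: tRNA_gene isa noncoding_gene isa gene.
nodes = ["gene", "noncoding_gene", "tRNA_gene"]
edges = [("tRNA_gene", "noncoding_gene"), ("noncoding_gene", "gene")]
closure = reflexive_transitive_closure(nodes, edges)
print(sorted(closure))
```

The pair ("tRNA_gene", "gene") appears in the result even though it is not a stored relationship; it corresponds to one of the dotted lines in the diagram.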
CREATE VIEW fgene AS
SELECT
feature.*
FROM
feature INNER JOIN cvpath ON (feature.ftype_id = cvpath.subjterm_id)
INNER JOIN cvterm ON (cvpath.objterm_id = cvterm.cvterm_id)
WHERE cvterm.termname = 'gene';
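To see the view at work, here is a toy run against an in-memory SQLite database from Python. Only the columns the view touches are modeled (plus a name column, which is an assumption), and the feature rows and term ids are made up; cvpath is filled by hand with the closure that would normally be pre-computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy tables: only the columns the view needs (plus a name) are modeled.
CREATE TABLE cvterm  (cvterm_id INTEGER PRIMARY KEY, termname TEXT);
CREATE TABLE cvpath  (subjterm_id INTEGER, objterm_id INTEGER);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, ftype_id INTEGER);

INSERT INTO cvterm VALUES (1, 'gene'), (2, 'tRNA_gene'), (3, 'exon');
-- cvpath holds the closure: every term reaches itself,
-- and tRNA_gene reaches gene.
INSERT INTO cvpath VALUES (1, 1), (2, 2), (3, 3), (2, 1);
INSERT INTO feature VALUES
    (10, 'some_gene', 1), (11, 'some_tRNA_gene', 2), (12, 'some_exon', 3);

CREATE VIEW fgene AS
  SELECT feature.*
  FROM feature INNER JOIN cvpath ON (feature.ftype_id = cvpath.subjterm_id)
               INNER JOIN cvterm ON (cvpath.objterm_id = cvterm.cvterm_id)
  WHERE cvterm.termname = 'gene';
""")

# One query now returns features of type gene and of every subtype of gene.
gene_names = [row[0] for row in
              conn.execute("SELECT name FROM fgene ORDER BY feature_id")]
print(gene_names)
```

The exon is excluded because no cvpath row links its type to 'gene'; the tRNA gene is included through the closure row.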
Here in Berkeley we will mostly be using chado in data-mining mode - i.e. we will be querying, not updating. This means we can materialize views for speed.
We can attach any properties we like to feature:
A feature can have multiple locations - however, "split" locations should not be used (for an example of a split location, look at how genbank represents a transcript).
Any feature can have 0 to many locations:
Each location is relative to another feature (the srcfeature)
The featureloc table includes the following attributes:
Interbase coordinates (top) and base-oriented (below)
The position of the ATG in interbase is [3, 6] (between the 3rd and 6th gaps)
The position of the ATG in base coordinates is [4, 6] (between 4th and 6th bases inclusive)
Note the different arithmetic for calculating length in these two systems.
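The arithmetic difference can be spelled out with the ATG example. The helper names below are hypothetical, and the conversions assume start <= end.

```python
def interbase_length(start, end):
    """Length in interbase coordinates: positions count the gaps."""
    return end - start

def base_length(start, end):
    """Length in base-oriented coordinates: both ends are inclusive."""
    return end - start + 1

def base_to_interbase(start, end):
    """Convert an inclusive base range to interbase coordinates."""
    return (start - 1, end)

def interbase_to_base(start, end):
    """Convert interbase to an inclusive base range (non-empty ranges only)."""
    return (start + 1, end)

atg_interbase = (3, 6)
atg_base = (4, 6)
print(interbase_length(*atg_interbase), base_length(*atg_base))
print(base_to_interbase(*atg_base), interbase_to_base(*atg_interbase))
```

A zero-length feature such as an insertion site is simply start == end in interbase coordinates; it has no clean equivalent in the base-oriented system, which is why interbase_to_base is restricted to non-empty ranges.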
Unlike mathematical vectors, we must also explicitly store the directionality (strand). Even though this is surplus to requirements most of the time, it is required for zero-length features and for circular chromosomes.
Central dogma - with exons and CDSs localised
Using the principle of minimal storage (do not store anything
that does not increase the information content of the
database - i.e. nothing redundant), we store only exon and CDS
boundary localisations. In the BDGP data warehouse instantiation
of chado, we may choose to store locations for all features
where known - this can vastly simplify some queries, but care
must be taken to make sure we don't end up with inconsistent
data.
For the most part, inferring the boundaries of composite features requires fairly simple graph transformations, although care must be taken for the genes that break central dogma rules.
repeat localised to a contig, itself on a chromosome arm
featurelocs are represented by dashed lines.
Note that the position of the repeat on the chromosome arm is implicit, and can be calculated with a simple graph transform, but following the principle of minimal storage, we do not store this in the management db.
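A rough sketch of that graph transform, under simplifying assumptions: a location is (srcfeature, start, end, strand) in interbase coordinates with start <= end and strand either 1 or -1. The Loc tuple, the project function, the feature names and the numbers are all made up for illustration and do not mirror the actual featureloc columns.

```python
from collections import namedtuple

# Hypothetical, simplified location record (not the real featureloc columns).
Loc = namedtuple("Loc", "srcfeature start end strand")

def project(inner, outer):
    """Re-express `inner` (located on some feature F) relative to the
    srcfeature of `outer`, where `outer` is the location of F itself."""
    if outer.strand == 1:
        start = outer.start + inner.start
        end = outer.start + inner.end
    else:
        # F lies on the reverse strand: coordinates run backwards from outer.end.
        start = outer.end - inner.end
        end = outer.end - inner.start
    return Loc(outer.srcfeature, start, end, inner.strand * outer.strand)

repeat_on_contig = Loc("contig1", 100, 250, 1)
contig_on_arm = Loc("arm", 10000, 15000, 1)
repeat_on_arm = project(repeat_on_contig, contig_on_arm)
print(repeat_on_arm)

# Same repeat if the contig were placed on the reverse strand of the arm.
flipped = project(repeat_on_contig, Loc("arm", 10000, 15000, -1))
print(flipped)
```

Longer chains (repeat on contig on scaffold on arm) are handled by applying project repeatedly along the path of featurelocs.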
If we wish to store the redundant position in a for-querying copy of the db, chado allows us this option - we can have as many locations as we like for a feature. We use an extra attribute called locgroup to distinguish locations. locgroup=0 is conventionally used for the non-redundant location.
the repeat feature now has two locations
If you look at the underlying data, you will see that the
featureloc that locates the repeat on the arm has a locgroup
value of 1.
Genscan predicts CDSs and CDS exons (not genes in the Sequence Ontology sense). A typical genscan prediction may look like this:
Genscan 3-exon 'gene' prediction
Pairwise alignments produce HSPs. HSPs are scored features with two locations - one on the query, one on the subject.
Blast hit with 3 HSPs
Each HSP has two featurelocs. featureloc has an attribute "rank"
to order the locations; by convention 0 is the query loc and 1
is the subject loc.
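As a rough picture of the data, one HSP with its two ranked locations might look like the sketch below. The FeatureLoc class is a simplified stand-in rather than the real featureloc table, and the sequence names and coordinates are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureLoc:
    """Simplified stand-in for a featureloc row (field names are assumptions)."""
    feature: str
    srcfeature: str
    start: int
    end: int
    strand: int
    rank: int

# One HSP feature, two locations: rank 0 on the query, rank 1 on the subject.
hsp_locs = [
    FeatureLoc("hsp1", "subject_seq", 1200, 1290, 1, rank=1),
    FeatureLoc("hsp1", "query_seq", 0, 90, 1, rank=0),
]

def query_and_subject(locs):
    """Return (query loc, subject loc) using the rank convention."""
    by_rank = {loc.rank: loc for loc in locs}
    return by_rank[0], by_rank[1]

query_loc, subject_loc = query_and_subject(hsp_locs)
print(query_loc.srcfeature, subject_loc.srcfeature)
```

Relying on rank rather than on storage order means the two locations can be retrieved in any order and still be told apart.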
Examples of BDGP chado xml (which were used to make the diagrams in this document) can be found here. (You will need Stag to convert the hand-edited lisp expressions to xml)
Gadfly3 data moved to chado schema
psql -h scabrous chado_gadfly3
$as = $adap_handle->get_AnnotatedSeq({seq=>"AE003677"});
print $as->to_chado->xml;
Uses the Data::Stag module
bioperl allows a feature to have multiple non-contiguous locations. Even though the chado schema allows multiple locations, attaching multiple non-contiguous locations is a violation of the chado semantics. To cope with this, we create extra features for the sublocations. For instance, for a transcript, we would create an exon for each of the sublocations.
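The transform can be pictured roughly as follows. This is a sketch in Python rather than the actual bioperl-to-chado code; the function name, the dictionary layout and the naming scheme for the new features are all hypothetical.

```python
def split_to_subfeatures(parent_name, sublocations, child_type="exon"):
    """Turn one feature with a split location into one child feature per
    sublocation, each with a single contiguous location and a partof
    relationship to the parent."""
    features = []
    relationships = []
    for i, (start, end) in enumerate(sorted(sublocations), start=1):
        child = f"{parent_name}:{child_type}{i}"
        features.append(
            {"name": child, "type": child_type, "start": start, "end": end}
        )
        relationships.append((child, "partof", parent_name))
    return features, relationships

# A transcript whose location was given as three non-contiguous pieces.
features, relationships = split_to_subfeatures(
    "transcript1", [(500, 650), (100, 200), (300, 400)]
)
for f in features:
    print(f)
print(relationships)
```

The parent transcript itself then carries no split location; its extent can be inferred from its exons.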
If you parse a genbank file including variations (eg SNPs) into bioperl objects, you will get a feature with two properties of type "allele". This can be represented using chado; however, the chado semantics state that these variations should be represented using multiple locations. Transforms will have to be written to fix this.
The data files are found here.
The bubble graphs were drawn using a program called bubbles.pl, part of the experimental cabal package, currently available from the BDGP CVS repository (see project scratch/cabal).
The commands to build the diagrams are here.