Increased complexity in the GO

Chris Mungall, BDGP / GO Consortium

!draft document!

The path-to-term ratio is one measure of complexity in an ontology. Complexity in GO has been rising steadily since 2002, particularly in the BP ontology. It should be managed both in the ontology itself and at the display level to keep the ontologies tractable.

Intro

One measure of the complexity of an ontology is the average number of paths-to-top of a term. The paths-to-top count for a term is the number of distinct paths that can be taken from that term to the most general term in the ontology.

Difficulties in navigating and visualising the ontology rise in proportion to the average number of paths a term has. This can be illustrated by looking at the term cysteine biosynthesis in AmiGO, [see also AmiGO Graphical View], or EGO. The EGO default display is the clearest here.

It is possible that these difficulties may result in curation errors in the ontology.

I shall refer to the average number of paths per term as the p/t ratio. For any ontology, p/t must be greater than or equal to 1 (every term must have a parent, even if it is the universal root class). If p/t=1, then the ontology is a tree (each term has a single parent, and thus a single path to the most general term).
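
The definition above can be sketched in a few lines of Python. This is a minimal illustration only, not the go-perl implementation: the toy ontology and the names PARENTS, paths_to_top and pt_ratio are hypothetical. It assumes the root term counts as having one path, which seems consistent with the tree-shaped ontologies in the tables, where total paths equals total terms.

```python
from functools import lru_cache

# Hypothetical toy ontology: each term maps to its direct parents.
# Term names are illustrative only and are not taken from GO.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],  # two parents, so two paths to the root
    "d": ["c", "a"],  # two paths via c plus one path via a
}


@lru_cache(maxsize=None)
def paths_to_top(term):
    """Count distinct paths from a term up to the root (a term with no parents)."""
    parents = PARENTS[term]
    if not parents:
        return 1  # assumption: the root itself counts as one path
    return sum(paths_to_top(parent) for parent in parents)


def pt_ratio():
    """Average number of paths-to-top per term (total paths / total terms)."""
    total_paths = sum(paths_to_top(term) for term in PARENTS)
    return total_paths / len(PARENTS)


print(paths_to_top("d"), pt_ratio())
```

Memoising the per-term count keeps the calculation linear in the number of relationships even when the number of paths itself grows very large.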

Multifaceted classes (aka cross-product terms) result in higher p/t ratios for an ontology. This is illustrated by comparing p/t for cellular component (5.25, as of 2005-02-01) with that for biological process (19.83 as of the same date). Some ontologies in OBO have very high p/t ratios - for example, fly_anatomy (p/t=144.8, maximum(p)=1974, FBbt:00004471). I believe there are a few simple ways to re-organise this ontology to reduce p/t.

Results

I plotted the growth in complexity since we began archiving the ontologies.

The graphs below illustrate the p/t ratio for the 3 GO ontologies over time.

Molecular Function
Cellular Component
Biological Process

Red lines show number of terms (increasing at a steady but slow rate for all ontologies), and blue lines show the total number of paths in the ontology. p/t can be determined by dividing total paths by total terms.

The blip in early 2001 may reflect unparseable ontology files in the archive.

p/t has remained constant for the MF ontology throughout its existence. p/t for CC has remained mostly constant with a big rise in the 2004-11-01 release (reason: integrating obol results??? cellul). BP, with the highest p/t ratio overall, has been climbing steadily since the end of 2002, with a very large increase also at 2004-11-01 (presumably there was some large reorganisation at this time?).

Current Complexity

Here are stats from the current (2005-02-01) ontology:
Ontology            Total terms  Total rels  Total paths  p/t               max(p)  ID with max(p)
molecular_function         7459        8149        11586  1.553291325915        19  GO:0015375
cellular_component         1532        2122         7545  4.92493472584856      64  GO:0012507
biological_process         9383       14996       179821  19.164552914846      406  GO:0046641
The final column gives the ID of the term with the most paths (in the case of a tie, one is chosen arbitrarily). The resulting terms, "positive regulation of alpha-beta T-cell proliferation" (BP), "ER-Golgi transport vesicle membrane" (CC) and "glycine:sodium symporter activity" (MF), are all highly multi-faceted. The DAG-as-tree view for the BP term is particularly uninformative (in fact AmiGO truncates this as it is so huge!)
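
As a quick check, the p/t column can be recomputed from the totals in the table above by dividing total paths by total terms. The snippet below is only a sketch; the dictionary layout and the name pt are my own, while the numbers are copied from the table.

```python
# (total terms, total paths) for each GO ontology, copied from the table above.
STATS = {
    "molecular_function": (7459, 11586),
    "cellular_component": (1532, 7545),
    "biological_process": (9383, 179821),
}


def pt(total_terms, total_paths):
    """p/t ratio: total paths divided by total terms."""
    return total_paths / total_terms


for name, (terms, paths) in STATS.items():
    print(f"{name}: p/t = {pt(terms, paths):.4f}")
```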

OBO

The same statistics were calculated for OBO, only for a static time slice (2005-02-01). This gives an idea of how complex GO would be if we were to make full cross-products with external ontologies; for example, fly_anatomy has p/t=144.79. Such an ontology would obviously need to be filtered in some way prior to making cross-products.
Ontology                           Total terms  Total rels  Total paths  p/t               max(p)  ID with max(p)
Sequence Ontology                          968        1168         1727  1.78409090909091      25  SO:0000470
fly_anatomy.ontology                      6077        9828       879879  144.788382425539    1974  FBbt:00004471
Drosophila_qualifier                       160         159          160  1                      1  FBql:00535257
human                                     8340        8339         8340  1                      1  EHDA:4586
medaka ontology                           4358        4402         5393  1.23749426342359      46  MFO:0000470
Mouse_anatomy_by_time_xproduct           13731       13730        13731  1                      1  EMAP:11737
Mouse_anatomy_by_time_xproduct            2405        2922         7704  3.2033264033264       18  MA:0000962
Staged_ZebraFish_Gross_Anatomy            6708        6719         6810  1.01520572450805       3  ZFIN:0006768
Dictyostelium discoideum Anatomy            38          57           73  1.92105263157895       6  DDANAT:0000055
microbial structure ontology                64          87          125  1.953125              14  FAO:0001017
Arabidopsis ontology                       299         479         1232  4.12040133779264      43  TAIR:0000358
cereal plant ontology                      480         639         1608  3.35                  41  GRO:0005860
plant ontology                             587         907         3083  5.25212947189097      71  PO:0006494
fly_development.ontology                    97         160        63303  652.60824742268     3692  FBdv:00005375
medaka ontology                           4358        4402         5393  1.23749426342359      46  MFO:0000470
C. elegans Development DAG                  69          68           69  1                      1  WBls:0000014
cereal plant ontology                      230         253          382  1.66086956521739       4  GRO:0007018
cell.ontology                              706         992         3372  4.77620396600567      34  CL:0000129
environment_ontology                       479         477          508  1.06054279749478       4  EO:0007332
Biological Imaging Methods                 259         258          259  1                      1  FBbi:00000246
environment ontology                       935         980         1133  1.21176470588235       6  GEO:0007219
pathology_ontology                         459         458          459  1                      1  MPATH:225
trait ontology                             582         654          877  1.50687285223368      10  TO:0000110
relationship                                12          11           12  1                      1  OBO_REL:improper_part_of
chebi_ontology                           10338       11694        24260  2.34668214354808      22  CHEBI:28240
anatomicalsystem                           391         390          391  1                      1  EV:0100140
associatedwith                              24          23           24  1                      1  EV:0600004
celltype                                   162         161          162  1                      1  EV:0200075
developmentalstage                         155         154          155  1                      1  EV:0300099
experimentaltechnique                       28          27           28  1                      1  EV:0900000
microarrayplatform                          19          18           19  1                      1  EV:0700005
pathology                                  174         173          174  1                      1  EV:0400125
pooling                                      8           7            8  1                      1  EV:0500005
tissuepreparation                            8           7            8  1                      1  EV:1000001
treatment                                   24          23           24  1                      1  EV:0800005
pato.ontology                             1196        1635         2650  2.21571906354515      14  PATO:0000286
rex.ontology                               494         651         1385  2.80364372469636      16  REX:0000308
microbial_anatomical_structure               1           1            2  2                      2  UBO:0000009
Phenotypic_manifestation_ontology           30          29           30  1                      1  PM:0000011
biophysical chemistry                     1128        1614         8162  7.23581560283688      52  FIX:0000520
evidence_code                              129         136          162  1.25581395348837       4  ECO:0000114

Conclusions and discussion

The complexity of the GO as measured by p/t is high, in particular for the BP ontology. It is difficult to extrapolate into the future, as the Nov-2004 burst may have been a one-off occurrence. Nevertheless, it seems likely that the complexity will rise, especially as many of the terms likely to be added will be multi-faceted as different organisms are annotated with the GO.

Complexity is detrimental both to navigating the ontology and to comprehending the parentage of a term. There are a number of band-aids we can apply to hide this complexity from the casual user. EGO's graph view display seems to use a better algorithm than AmiGO's (GraphViz). We can hide the DAG-as-tree display for certain complex terms and instead show, as the default display, a flat list of all ancestors and/or a list of immediate parents.
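
The flat-list alternative mentioned above is cheap to compute, because each ancestor is visited once regardless of how many paths lead to it. The following is a rough sketch under my own assumptions: the toy ontology and the names PARENTS, immediate_parents and all_ancestors are hypothetical, and nothing here reflects how AmiGO or EGO are actually implemented.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
# Term names are illustrative only and are not taken from GO.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c", "a"],
}


def immediate_parents(term):
    """Direct parents only, sorted for a stable display order."""
    return sorted(PARENTS[term])


def all_ancestors(term):
    """Flat, de-duplicated list of every ancestor of a term."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return sorted(seen)


print(immediate_parents("d"))
print(all_ancestors("d"))
```

In this toy example the term d has three paths to the root but only four ancestors, so the flat list stays short even where a DAG-as-tree view would repeat terms.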

None of these is quite satisfactory, as the fundamental problem remains. In order to address it, the very structure of the ontology itself must be tackled.

Reducing Complexity

Take cysteine biosynthesis as an example. The complexity of this term arises from the graph product of biosynthesis (a simple term, with p=1) and cysteine (which is implicitly represented in GO, and in ChEBI, as a complex term with p=6 and at least two axes of classification: family,..). Combining these gives a total of 87 paths! Clearly the DAG-as-tree display idiom becomes more of a hindrance than a help here.

The appendix contains more on calculating the explosive growth of the number of paths for multifaceted classes.

The composite graph is rarely useful to a user. What is most useful is (a) the immediate parentage and (b) the full de-tangled ancestry in each of the composing terms' ontologies.

For biosynthesis we would have a simple tree:

Displaying cysteine is still problematic, as p=6. Even so, displaying two trees with p 1 and 6 respectively is better than one tree with p=87. Ideally we would disentangle cysteine along different axes of classification too.

It's not quite as simple as this - the decomposition above hides the transitive is_a relationship between "cysteine biosynthesis" and "cellular metabolism" (via "amine biosynthesis"), which is not derivable from the cross-products. (This is the reason cross-product terms should not be excised from the ontologies.) The new display algorithm would have to find all such additional relationships that cannot be derived from the detangled trees.

There are a number of other possibilities, such as dynamic filters. Users could choose what level of resolution they prefer in each of the various ontologies, and the display would be tailored accordingly. For example, a biochemist may be interested in the entire ancestry of "cysteine biosynthesis", but for the default user a low level of resolution could be used over ChEBI (for example, cysteine is_a amino acid is_a molecule). Similarly, people working on non-pathogenic prokaryotes may prefer to filter out anything regarding anatomies or high-level behaviour.
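
One way such a filter might work is sketched below. Everything here is an assumption made for illustration: the record layout, the depth values, the placeholder class names and the function filter_ancestry are all hypothetical, and no such filter exists in AmiGO as far as this document states.

```python
# Hypothetical ancestry records: (ancestor name, source ontology, depth below that
# ontology's root). Depths and the "intermediate class" entries are placeholders,
# not real GO or ChEBI data.
ANCESTRY = [
    ("molecule", "ChEBI", 1),
    ("amino acid", "ChEBI", 2),
    ("intermediate class A", "ChEBI", 3),
    ("intermediate class B", "ChEBI", 4),
    ("metabolism", "GO", 1),
    ("biosynthesis", "GO", 2),
]


def filter_ancestry(ancestry, max_depth):
    """Keep ancestors no deeper than the user's chosen depth for their ontology.

    Ontologies missing from max_depth are shown at full resolution.
    """
    return [
        name
        for name, ontology, depth in ancestry
        if depth <= max_depth.get(ontology, float("inf"))
    ]


# A default user sees only the top two levels of ChEBI; GO is left unfiltered.
print(filter_ancestry(ANCESTRY, {"ChEBI": 2}))
# A biochemist sets no limits and sees everything.
print(filter_ancestry(ANCESTRY, {}))
```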

To drive these displays, the ontology would have to be recast. For example, the ontology would have to have a link between "cysteine biosynthesis" and "cysteine". See the main Obol pages for more descriptions here.

Methods

This data was generated using go-dag-summary.pl, part of go-perl. Ontologies were sourced from ftp://ftp.geneontology.org/pub/go/ontology-archive. The results of running the script on the archive are available in this table. Graphs were generated from this file using this R script.

Appendix

A. Formula for number of paths

If we were to combine two trees (ie DAGs with p/t=1) and instantiate every term in the cross-product (ie a maximally dense cross-product matrix), then the formula for the number of paths for the most specific term in the resulting DAG is:

 v: depth of first tree
 w: depth of second tree
 
 p(v,w) = (v+w)!/(v!w!)
          
Of course, we may not wish to create every term in the XP (it makes no sense to combine "physiological process" with "cysteine"). Nevertheless, even with sparsely populated XP matrices, the complexity does rise rapidly.
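
The formula can be checked numerically by counting paths directly. The sketch below makes two assumptions of my own: depth is measured as the number of is_a steps from the root, and each tree is reduced to the single chain of ancestors above its most specific term, since only those ancestors contribute paths. The function names are hypothetical.

```python
from math import factorial


def p_formula(v, w):
    """Closed form from the text: (v+w)! / (v! w!)."""
    return factorial(v + w) // (factorial(v) * factorial(w))


def p_by_counting(v, w):
    """Count paths in a fully populated cross-product of two chains.

    Term (i, j) combines the depth-i term of the first chain with the depth-j
    term of the second. Its parents are (i-1, j) and (i, j-1) where they exist,
    and (0, 0) is the root of the cross-product.
    """
    paths = {}
    for i in range(v + 1):
        for j in range(w + 1):
            if i == 0 and j == 0:
                paths[(i, j)] = 1  # the root has a single path
                continue
            via_first = paths[(i - 1, j)] if i > 0 else 0
            via_second = paths[(i, j - 1)] if j > 0 else 0
            paths[(i, j)] = via_first + via_second
    return paths[(v, w)]


for v, w in [(1, 1), (3, 2), (5, 5), (8, 6)]:
    print(v, w, p_formula(v, w), p_by_counting(v, w))
```

Under these assumptions, two trees of depth 5 already give 252 paths for the most specific term, and depths of 8 and 6 give 3003.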

Combining a DAG with a tree, combining two DAGs, or three-dimensional XPs yields even more complexity. I haven't been able to work out any formula for this. Note that many existing terms in GO are recursively nested cross-products (eg "negative regulation of X biosynthesis").


Last modified: Mon Feb 21 11:56:08 PST 2005