Increased complexity in the GO
Chris Mungall, BDGP / GO Consortium
The path-to-term ratio is one measure of complexity in an ontology. Complexity has been rising steadily for GO since 2002, in particular the BP ontology. The complexity should be managed both in the ontology and at the display level to make the ontologies more tractable.
Intro
One measure of the complexity of an ontology is the average number of paths-to-top of a term. This is the number of distinct paths that can be taken from a particular term to the most general term in the ontology.
Difficulties in navigating and visualising the ontology rise in proportion to the average number of paths a term has. This can be illustrated by looking at the term cysteine biosynthesis in AmiGO, [see also AmiGO Graphical View], or EGO. The EGO default display is the clearest here.
It is possible that these difficulties may result in curation errors in the ontology.
I shall refer to the average number of paths per term as the p/t ratio. For any ontology, p/t must be greater than or equal to 1 (every term must have a parent, even if it is the universal root class). If p/t=1, then the ontology is a tree (each term has a single parent, and thus a single path to the most general term).
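The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the go-perl implementation: the toy ontology, the `PARENTS` mapping and the function names are all hypothetical. It assumes the root counts as having exactly one path, which matches the tables in this document where tree-shaped ontologies have total paths equal to total terms.

```python
from functools import lru_cache

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],  # two parents, so two paths to the root
    "d": ["c", "a"],  # inherits both paths of "c" plus the one path of "a"
}


@lru_cache(maxsize=None)
def paths_to_top(term):
    """Number of distinct paths from `term` up to the root."""
    parents = PARENTS[term]
    if not parents:
        return 1  # assumption: the root itself counts as one path
    return sum(paths_to_top(parent) for parent in parents)


def path_term_ratio():
    """p/t: total paths over all terms divided by the number of terms."""
    total_paths = sum(paths_to_top(term) for term in PARENTS)
    return total_paths / len(PARENTS)


if __name__ == "__main__":
    for term in PARENTS:
        print(term, paths_to_top(term))
    print("p/t =", path_term_ratio())
```

Memoising the recursion means each term is evaluated once, so the count stays cheap even when the number of paths itself is very large.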
Multifaceted classes (aka cross-product terms) result in higher p/t ratios for an ontology. This is illustrated by comparing p/t for cellular component (5.25, as of 2005-02-01) with p/t for biological process (19.83 as of the same date). Some ontologies in OBO have very high p/t ratios, for example fly_anatomy (p/t=144.8, maximum(p)=1974, FBbt:00004471). I believe there are a few simple ways to re-organise this ontology to reduce p/t.
Results
I plotted the growth in complexity since we began archiving the ontologies.
The graphs below illustrate the p/t ratio for the 3 GO ontologies over time.
Red lines show number of terms (increasing at a steady but slow rate for all ontologies), and blue lines show the total number of paths in the ontology. p/t can be determined by dividing total paths by total terms.
The blip in early 2001 may reflect unparseable ontology files in the archive.
p/t has remained constant for the MF ontology throughout its existence. p/t for CC has remained mostly constant, apart from a big rise in the 2004-11-01 release (reason unknown; possibly the integration of Obol results?). BP, which has the highest p/t ratio overall, has been climbing steadily since the end of 2002, with a very large increase also at 2004-11-01 (presumably there was some large reorganisation at this time?).
Current Complexity
Here are stats from the current (2005-02-01) ontology:
| Ontology | Total terms | Total rels | Total paths | p/t | max(p) | ID with max(p) |
|---|---|---|---|---|---|---|
| molecular_function | 7459 | 8149 | 11586 | 1.553291325915 | 19 | GO:0015375 |
| cellular_component | 1532 | 2122 | 7545 | 4.92493472584856 | 64 | GO:0012507 |
| biological_process | 9383 | 14996 | 179821 | 19.164552914846 | 406 | GO:0046641 |
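As a quick check, the p/t column can be recomputed from the "Total terms" and "Total paths" columns of the table above. The snippet below is only a sketch; the `STATS` dictionary and `pt_ratio` function are names introduced here for illustration.

```python
# (total terms, total paths) copied from the 2005-02-01 table above.
STATS = {
    "molecular_function": (7459, 11586),
    "cellular_component": (1532, 7545),
    "biological_process": (9383, 179821),
}


def pt_ratio(total_terms, total_paths):
    """p/t is the total number of paths divided by the total number of terms."""
    return total_paths / total_terms


if __name__ == "__main__":
    for name, (terms, paths) in STATS.items():
        print(f"{name}: p/t = {pt_ratio(terms, paths):.2f}")
```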
OBO
The same statistics were calculated for OBO, only for a static time slice (2005-02-01). This gives an idea of how complex GO would be if we were to make full cross-products with external ontologies; for example, fly_anatomy has p/t=144.79. This would obviously need to be filtered in some way prior to making cross-products.
| Ontology | Total terms | Total rels | Total paths | p/t | max(p) | ID with max(p) |
|---|---|---|---|---|---|---|
| Sequence Ontology | 968 | 1168 | 1727 | 1.78409090909091 | 25 | SO:0000470 |
| fly_anatomy.ontology | 6077 | 9828 | 879879 | 144.788382425539 | 1974 | FBbt:00004471 |
| Drosophila_qualifier | 160 | 159 | 160 | 1 | 1 | FBql:00535257 |
| human | 8340 | 8339 | 8340 | 1 | 1 | EHDA:4586 |
| medaka ontology | 4358 | 4402 | 5393 | 1.23749426342359 | 46 | MFO:0000470 |
| Mouse_anatomy_by_time_xproduct | 13731 | 13730 | 13731 | 1 | 1 | EMAP:11737 |
| Mouse_anatomy_by_time_xproduct | 2405 | 2922 | 7704 | 3.2033264033264 | 18 | MA:0000962 |
| Staged_ZebraFish_Gross_Anatomy | 6708 | 6719 | 6810 | 1.01520572450805 | 3 | ZFIN:0006768 |
| Dictyostelium discoideum Anatomy | 38 | 57 | 73 | 1.92105263157895 | 6 | DDANAT:0000055 |
| microbial structure ontology | 64 | 87 | 125 | 1.953125 | 14 | FAO:0001017 |
| Arabidopsis ontology | 299 | 479 | 1232 | 4.12040133779264 | 43 | TAIR:0000358 |
| cereal plant ontology | 480 | 639 | 1608 | 3.35 | 41 | GRO:0005860 |
| plant ontology | 587 | 907 | 3083 | 5.25212947189097 | 71 | PO:0006494 |
| fly_development.ontology | 97 | 160 | 63303 | 652.60824742268 | 3692 | FBdv:00005375 |
| C. elegans Development DAG | 69 | 68 | 69 | 1 | 1 | WBls:0000014 |
| cereal plant ontology | 230 | 253 | 382 | 1.66086956521739 | 4 | GRO:0007018 |
| cell.ontology | 706 | 992 | 3372 | 4.77620396600567 | 34 | CL:0000129 |
| environment_ontology | 479 | 477 | 508 | 1.06054279749478 | 4 | EO:0007332 |
| Biological Imaging Methods | 259 | 258 | 259 | 1 | 1 | FBbi:00000246 |
| environment ontology | 935 | 980 | 1133 | 1.21176470588235 | 6 | GEO:0007219 |
| pathology_ontology | 459 | 458 | 459 | 1 | 1 | MPATH:225 |
| trait ontology | 582 | 654 | 877 | 1.50687285223368 | 10 | TO:0000110 |
| relationship | 12 | 11 | 12 | 1 | 1 | OBO_REL:improper_part_of |
| chebi_ontology | 10338 | 11694 | 24260 | 2.34668214354808 | 22 | CHEBI:28240 |
| anatomicalsystem | 391 | 390 | 391 | 1 | 1 | EV:0100140 |
| associatedwith | 24 | 23 | 24 | 1 | 1 | EV:0600004 |
| celltype | 162 | 161 | 162 | 1 | 1 | EV:0200075 |
| developmentalstage | 155 | 154 | 155 | 1 | 1 | EV:0300099 |
| experimentaltechnique | 28 | 27 | 28 | 1 | 1 | EV:0900000 |
| microarrayplatform | 19 | 18 | 19 | 1 | 1 | EV:0700005 |
| pathology | 174 | 173 | 174 | 1 | 1 | EV:0400125 |
| pooling | 8 | 7 | 8 | 1 | 1 | EV:0500005 |
| tissuepreparation | 8 | 7 | 8 | 1 | 1 | EV:1000001 |
| treatment | 24 | 23 | 24 | 1 | 1 | EV:0800005 |
| pato.ontology | 1196 | 1635 | 2650 | 2.21571906354515 | 14 | PATO:0000286 |
| rex.ontology | 494 | 651 | 1385 | 2.80364372469636 | 16 | REX:0000308 |
| microbial_anatomical_structure | 1 | 1 | 2 | 2 | 2 | UBO:0000009 |
| Phenotypic_manifestation_ontology | 30 | 29 | 30 | 1 | 1 | PM:0000011 |
| biophysical chemistry | 1128 | 1614 | 8162 | 7.23581560283688 | 52 | FIX:0000520 |
| evidence_code | 129 | 136 | 162 | 1.25581395348837 | 4 | ECO:0000114 |
Conclusions and discussion
The complexity of the GO, as measured by p/t, is high, particularly for the BP ontology. It is difficult to extrapolate into the future, as the Nov-2004 burst may have been a one-off occurrence. Nevertheless, it seems likely that the complexity will rise, especially as many of the terms likely to be added will be multifaceted as different organisms are annotated with the GO.
Complexity is detrimental to both navigation and comprehending the parentage of a term. There are a number of band-aids we can patch on to hide this complexity from the casual user. EGO's graph view display seems to use a better algorithm than AmiGO's (GraphViz). We can hide the DAG-as-tree display for certain complex terms, and instead write a flat list of all ancestors as the default display and/or a list of immediate parents.
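The flat-list alternative can be sketched as follows. This is an illustrative toy, not how AmiGO or EGO work: the `PARENTS` mapping and the function names are hypothetical. The point is that the set of ancestors can never be larger than the number of terms in the ontology, whereas the number of paths through those same ancestors can be far larger.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c", "a"],
}


def immediate_parents(term):
    """Direct parents only, as a sorted list for stable display."""
    return sorted(PARENTS[term])


def all_ancestors(term):
    """Every ancestor of `term`, each listed once, regardless of path count."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return sorted(seen)


if __name__ == "__main__":
    print("parents of d:  ", immediate_parents("d"))
    print("ancestors of d:", all_ancestors("d"))
```

In this toy graph the term "d" has three distinct paths to the root but only four ancestors; for heavily cross-classified terms the gap between the two numbers is much wider.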
None of these is quite satisfactory, as the fundamental problem remains. To solve it, the structure of the ontology itself must be addressed.
Reducing Complexity
Take cysteine biosynthesis as an example. The complexity of this term arises from the graph product of biosynthesis (a simple term, with p=1) and cysteine (which is implicitly represented in GO and in ChEBI as a complex term, with p=6 and at least two axes of classification: family,..). Combining these gives a total of 87 paths! Clearly the DAG-as-tree display idiom becomes more of a hindrance than a help here.
The appendix contains more on calculating the explosive growth of the number of paths for multifaceted classes.
The composite graph is rarely useful to a user. What is most useful is (a) the immediate parentage and (b) the full de-tangled ancestry in each of the composing terms' ontologies.
For biosynthesis we would have a simple tree:
- biological process
  - physiological process
    - metabolism
      - biosynthesis
It's not quite as simple as this: the decomposition above hides the transitive is_a relationship between "cysteine biosynthesis" and "cellular metabolism" (via "amine biosynthesis"), which is not derivable from the cross-products. (This is the reason cross-product terms should not be excised from the ontologies.) The new display algorithm would find all these additional relationships that cannot be derived from the detangled trees.
There are a number of other possibilities, such as dynamic filters. Users could choose what level of resolution they prefer in each of the various ontologies, and the display would be tailored accordingly. For example, a biochemist may be interested in the entire ancestry of "cysteine biosynthesis", but for the default user, a low level of resolution could be used over ChEBI (for example, cysteine is_a amino acid is_a molecule). Similarly, people working on non-pathogenic prokaryotes may prefer to filter out anything regarding anatomies or high-level behaviour.
To drive these displays, the ontology would have to be recast. For example, the ontology would have to have a link between "cysteine biosynthesis" and "cysteine". See the main Obol pages for more descriptions here.
Methods
These data were generated using go-dag-summary.pl, part of go-perl. Ontologies were sourced from ftp://ftp.geneontology.org/pub/go/ontology-archive. The results of running the script on the archive are available in this table. Graphs were generated from this file using this R script.
References
- Hill DP, Blake JA, Richardson JE, Ringwald M: Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res 2002, 12(12):1982-1991.
- Mungall CJ: Obol: Integrating Language and Meaning in Bio-Ontologies. Comparative and Functional Genomics 2004, 5(7):509-520.
Appendix
A. Formula for number of paths
If we were to combine two trees (i.e. DAGs with p/t=1) and instantiate every term in the cross-product (i.e. a maximally dense cross-product matrix), then the formula for the number of paths for the most specific term in the resulting DAG is:
v: depth of first tree
w: depth of second tree
p(v,w) = (v+w)!/(v!w!)
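The formula can be checked numerically with a short sketch. It rests on two assumptions that the text does not state explicitly: depth is counted as the number of edges between a term and its root, and each cross-product term (i, j) has as parents the terms obtained by moving one step up in either of the two source trees. The function names are introduced here for illustration, and `math.comb` needs Python 3.8 or later.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def xp_paths(i, j):
    """Paths from cross-product term (i, j) to the root (0, 0).

    i and j are the depths of the two component terms in their own trees.
    Only the ancestors of the most specific term matter, and in a tree
    these form a single chain, so each tree is modelled as a chain.
    """
    if i == 0 and j == 0:
        return 1
    total = 0
    if i > 0:
        total += xp_paths(i - 1, j)  # step up in the first tree
    if j > 0:
        total += xp_paths(i, j - 1)  # step up in the second tree
    return total


def formula(v, w):
    """Closed form from the text: (v+w)! / (v! w!)."""
    return comb(v + w, v)


if __name__ == "__main__":
    for v, w in [(1, 1), (3, 4), (5, 5), (10, 10)]:
        print(v, w, xp_paths(v, w), formula(v, w))
```

Under these assumptions the count is the number of ways to interleave v upward steps in one tree with w upward steps in the other, which is why it equals the binomial coefficient.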
Combining a DAG with a tree, combining two DAGs, or building three-dimensional XPs yields even more complexity. I haven't been able to work out any formula for this. Note that many existing terms in GO are recursively nested cross-products (e.g. "negative regulation of X biosynthesis").