Position
Currently employed by Howard Hughes Medical Institute as a bioinformatics specialist, with the BDGP and GO Consortium
Research Interests
- Design, evolution and use of ontologies, databases and knowledge representation in molecular biology
- Genome analysis, genomic databases and intelligent automation of analyses and annotation
- Genomic variation and contribution to phenotype
- Natural language, declarative languages, and domain specific languages for biology and bioinformatics
Projects
My current projects, responsibilities and roles include the following:
- Drosophila Informatics at the BDGP: enhancers in the fruitfly brain; patterns of intron evolution
- Informatics for the GO Database
- Integrating the Gene Ontology with other OBO ontologies
- Combinatorial generation of anatomical structure classes in the Plant Ontology
- Chado development
- Curation and maintenance of the OBO Relations ontology
- Publishing the OBO-Download Matrix resource
- Representing, querying and reasoning over sequence feature data in OWL and other formalisms
- Occasional lecturer/TA at the Cold Spring Harbor Labs course Bioinformatics: Writing Software for Genome Research
Software
Current software
I am still actively using and developing the following systems and tools:
- Obol is both an ontological reasoner and a system for deriving meaning encoded in natural language descriptions of class names. Obol is currently being used by the Gene Ontology and the Plant Ontology Consortium, and has so far detected hundreds of errors. See Comparative and Functional Genomics, Volume 5, Issue 6-7, 2004, pages 509-520, and this article in The Scientist.
- The Chado schema (co-authored by David Emmert). Chado is a modular schema for modeling biological data in relational databases, using ontologies such as The Sequence Ontology. Currently there are modules for sequence features, IDs, genetics, phylogeny, maps, bibliographic data and expression data. No publication as yet.
- Chaos-XML and supporting software and documentation (scripts, library, XSLTs, DTD). Publication forthcoming. Chaos is an integral part of CGL.
- go-perl, go-db-perl and the GO MySQL schema, parts of the go-dev toolkit. go-dev also includes AmiGO (written by Brad Marshall and ShengQiang Shu) and OBO-Edit (written by John Day-Richter)
- Stag is a framework for manipulating nested tag-value data in perl, and for mapping between XML and SQL Databases
- Skam is a functional-logical replacement for Makefiles, specifically designed for automating large bioinformatics pipelines on Beowulf clusters. Skam is an example of a bottom-up domain specific language.
- The BioPerl Unflattener. This is for discovering and normalising gene models from lossy representations such as GenBank. Available as part of BioPerl (1.5+ recommended).
- Blipkit: Biomedical Logic Programming Knowledge Integration Toolkit. A collection of SWI-Prolog modules for bioinformatics and ontologies. Website forthcoming; contact me for details.
Previous software projects
I no longer support or maintain the following pieces of software:
- The GadFly genome annotation database, pipeline and browser. This has been superseded by Chado and is no longer supported. Still in use at the BDGP and JGI. Described in Genome Biology. Software available from BDGP CVS, or on request.
- sim4wrap - a simple standalone C program and perl wrapper for speeding up sim4 analyses using blast, described in the above publication
- The Anubis map viewer, which I developed whilst employed at the Roslin Institute. I also contributed to ArkDB. Both Anubis and ArkDB are still in operation at multiple sites across the world.
Availability
All the software I have written is available under an open source license, unless otherwise indicated. Software should be available from the project sites listed above, or in some cases from the following repositories:
- My perl modules on CPAN
- Projects on Sourceforge.net which I contribute to
Selected Presentations
- Managing complexity in the Gene Ontology [ PowerPoint | PDF ] -- GO Meeting, Caltech, 2005
- Chado and interoperability [ PowerPoint | PDF ] -- Generic Model Organism Databases meeting, SRI, 2005
- Modeling with SO [ PowerPoint | PDF ] -- Sequence Ontology meeting, Berkeley, 2004
- Decoding the genome with perl, XML and SQL [ PowerPoint | PDF ] -- San Francisco Perl Mongers meeting, 2005
- Obol: Integrating Bio-Ontologies [ PowerPoint | PDF ] -- Bio-Ontologies, Glasgow, 2004
- BioMake [ PDF | PowerPoint ] -- Bioinformatics Open Source Conference, Glasgow 2004
- Makefiles, pipelines and Skam [ OpenOffice | PowerPoint ] -- BDGP 2004
- Sequence data in Chado [ HTML ] (now outdated) -- Cambridge, 2003
- GadFly: Building a genome annotation database [ StarOffice | PowerPoint | HTML ] -- BDGP 2002
- The GO Database and go-dev toolkit [ StarOffice ] -- Hinxton 2002
Education
BSc Hons in Artificial Intelligence and Computer Science from University of Edinburgh. Currently about to embark on a PhD by research publication, also from University of Edinburgh.
