About
Chaos-XML is currently specified as a DTD. In the future we may also provide XML Schema and/or RELAX NG specifications.
For a detailed description of the meaning of the various elements in Chaos-XML, consult The Chado sequence module DDL. Chaos-XML is an almost direct transformation of the Chado relational schema; see chaos-xml and chado for a description of the differences.
Download
The DTD is located at http://www.fruitfly.org/chaos-xml/dtd/chaos.dtd. Note that you may have to download the file by right-clicking on this link, as browsers may have problems displaying the DTD.
DTD
You can look at the DTD here:
<!--
$Id: chaos.dtd,v 1.7 2005/04/27 19:32:45 cmungall Exp $
CHAOS XML DTD
Version 1
For representing sequences and sequence annotation
The root node is a "chaos" element
Underneath this there is an optional "chaos_metadata" element, followed by
elements of type "feature" or "feature_relationship" in any order
Feature locations are indicated with "featureloc", nested inside the
"feature" element
Contact: Chris Mungall [BDGP] cjm@fruitfly.org
-->
<!-- chaos: this is the top-level node -->
<!ELEMENT chaos (chaos_metadata?,feature+,feature_relationship*)+>
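<!-- Example: a minimal sketch of a document that fits the content model
     above. All identifiers and values are hypothetical.

<chaos>
  <chaos_metadata>
    <chaos_version>1</chaos_version>
  </chaos_metadata>
  <feature>
    <feature_id>contig1</feature_id>
    <uniquename>contig:CTG0001.1</uniquename>
    <type>contig</type>
  </feature>
</chaos>
-->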
<!-- ======== -->
<!-- METADATA -->
<!-- ======== -->
<!-- chaos_metadata: information about the document file itself (node) -->
<!ELEMENT chaos_metadata (chaos_version|chaos_flavour?|focus_feature_id?|feature_unique_key?|equiv_chado_release?|sequence_ontology_revision?|dsn?|dbname?|prog_args?|export_unixtime?|export_localtime?|export_host?|export_user?|export_perl5lib?|export_program?|export_module_cvs_id?)+>
<!-- chaos_version: (s) -->
<!ELEMENT chaos_version (#PCDATA)>
<!-- chaos_flavour: there may be subtle differences in the chaos xml
depending on the generation method. Typical values may be
"bioperl", "ensembl", "chado" -->
<!ELEMENT chaos_flavour (#PCDATA)>
<!-- focus_feature_id: if the chaos-xml document is gene-centric,
this will be the feature_id of the central gene in the document
-->
<!ELEMENT focus_feature_id (#PCDATA)>
<!-- feature_unique_key: this is the element that provides the unique
identifier for features. If set, this should always be equal to
the string "feature_id" -->
<!ELEMENT feature_unique_key (#PCDATA)>
<!-- equiv_chado_release: this is the tag for the equivalent chado
version release that corresponds to the chaos xml. If set, this
should be equal to "chado_1_01" -->
<!ELEMENT equiv_chado_release (#PCDATA)>
<!-- various metadata tags; may be specific to export program -->
<!ELEMENT sequence_ontology_revision (#PCDATA)>
<!ELEMENT dsn (#PCDATA)>
<!ELEMENT prog_args (#PCDATA)>
<!ELEMENT dbname (#PCDATA)>
<!-- export_unixtime: unixtime (seconds since 1970) at which the
document was exported -->
<!ELEMENT export_unixtime (#PCDATA)>
<!-- export_localtime: human readable string of the time at which
the document was exported -->
<!ELEMENT export_localtime (#PCDATA)>
<!-- export_host: name of the machine from which the document was
exported -->
<!ELEMENT export_host (#PCDATA)>
<!-- export_user: fullname or username of person who exported the
document -->
<!ELEMENT export_user (#PCDATA)>
<!-- export_perl5lib: value for env var PERL5LIB at export time -->
<!ELEMENT export_perl5lib (#PCDATA)>
<!-- export_program: name of executable that generated the document -->
<!ELEMENT export_program (#PCDATA)>
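<!-- Example: a sketch of a chaos_metadata block. The values for
     feature_unique_key and equiv_chado_release follow the comments above;
     every other value is hypothetical.

<chaos_metadata>
  <chaos_version>1</chaos_version>
  <chaos_flavour>chado</chaos_flavour>
  <feature_unique_key>feature_id</feature_unique_key>
  <equiv_chado_release>chado_1_01</equiv_chado_release>
  <export_unixtime>1114560000</export_unixtime>
  <export_host>myhost.example.org</export_host>
  <export_user>jdoe</export_user>
</chaos_metadata>
-->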
<!-- ======== -->
<!-- FEATURES -->
<!-- ======== -->
<!-- feature:
A feature is any biological sequence entity or any entity that
can be potentially located relative to a biological sequence.
It is any entity that can be typed using the Sequence Ontology
(SO)
It is equivalent to the chado table "feature"
-->
<!ELEMENT feature (feature_id|dbxrefstr?|name?|uniquename|type|residues?|md5checksum?|featureprop*|organismstr?|featureloc*|feature_dbxref*|seqlen?|is_analysis?)+>
<!-- feature_id:
A unique identifier for this feature that is NOT required to be
persistent outside the document. This field is necessary for
resolving feature graphs.
Unlike the equivalent chado column "feature.feature_id", this is
not required to be an integer.
If the chaos-xml is to be mapped to a chado database, then the
feature_id element should NOT be stored in the database. The
feature_id is purely for resolving feature graphs WITHIN a
document.
If a chado database is to be mapped to chaos-xml, then the chado
feature.feature_id values may be used in the document.
-->
<!ELEMENT feature_id (#PCDATA)>
<!-- dbxrefstr:
A string representing a database name plus database identifier
If the dbxrefstr element is a subelement of the feature element,
then this is the PRIMARY dbxrefstr for this feature. A primary
dbxrefstr is optional for features (for example, features
representing blast HSPs are not expected to have persistent external
identifiers)
If the dbxrefstr element is a subelement of the feature_dbxref
element (see below), then this is a SECONDARY dbxrefstr. A feature
can have any number of secondary dbxrefstrs. These will generally be
to external databases.
A dbxrefstr always has the format DBNAME:ACCESSION
or the format DBNAME:ACCESSION.VERSION
The ':' symbol is not allowed in the DBNAME
(if the dbxrefstr contains more than one ':' then all but the first
are considered part of the ACCESSION)
The '.' symbol is not allowed in the ACCESSION
(if the dbxrefstr contains more than one '.' then all but the first
are considered part of the VERSION)
ACCESSION and VERSION are mapped to the chado columns
dbxref.accession and dbxref.version
DBNAME is mapped to the chado column db.name
-->
<!ELEMENT dbxrefstr (#PCDATA)>
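<!-- Example: how the splitting rules above would apply to three
     dbxrefstr values. The database names and accessions are hypothetical.

<dbxrefstr>ExampleDB:ABC123</dbxrefstr>
     DBNAME=ExampleDB  ACCESSION=ABC123  (no VERSION)

<dbxrefstr>ExampleDB:ABC123.2</dbxrefstr>
     DBNAME=ExampleDB  ACCESSION=ABC123  VERSION=2

<dbxrefstr>ExampleDB:a:b.1.2</dbxrefstr>
     DBNAME=ExampleDB  ACCESSION=a:b  VERSION=1.2
     (the split is on the first ':' and then on the first '.')
-->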
<!-- name:
A feature can have an optional human-readable name (for example,
gene symbol). This is the label that is generally presented to the
end-user.
Names do not have to be unique, but it is wise to make the names
unique whenever possible.
One circumstance where names may not be unique is if the document
contains features for two organisms. There may be name clashes
between orthologous genes and other features.
Names are optional, but for annotated features data providers should
aim to supply them whenever possible, even if these names are
automatically generated (eg p53-mRNA-1).
Names should in general not be supplied for computed features (eg
HSPs)
-->
<!ELEMENT name (#PCDATA)>
<!-- uniquename:
Every feature must have a uniquename. The uniquename is guaranteed
to be a unique identifier for that feature, not just in that
document, but in the whole world.
Uniquenames should be human readable where possible, but they do not
need to be.
It is up to the data provider to come up with a policy for
generating uniquenames.
One policy would be to combine the feature name and the feature
organismstr (for example "Drosophila_melanogaster:p53"). The
assumption here is that the naming authorities for each organism
would take care to ensure name uniqueness within that organism.
Care must be taken if the data provider has gene models for
different haplotypes. Each model is a different instantiation of a
canonical form of the gene. In this case the haplotype must be
included in the uniquename somehow (if it is not somehow included in
the name, as it is in the case of FlyBase alleles).
The safest policy is to make the uniquename from the primary
location of the feature, and the feature type. For example, if an
exon is located on position 0->1000 on contig CTG0001.1 then a valid
uniquename would be exon:CTG0001.1:0-1000.
The data consumer should make no assumptions about the syntactic
structure of the uniquename, other than it is globally unique
-->
<!ELEMENT uniquename (#PCDATA)>
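<!-- Example: a sketch of a gene feature whose uniquename is built from
     the organismstr and the name, as in the first policy described
     above. The feature_id is hypothetical and only meaningful inside
     the document.

<feature>
  <feature_id>gene1</feature_id>
  <name>p53</name>
  <uniquename>Drosophila_melanogaster:p53</uniquename>
  <type>gene</type>
  <organismstr>Drosophila melanogaster</organismstr>
</feature>
-->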
<!-- residues:
The IUPAC residue sequence of the feature. This can be either DNA or
Amino Acid residues.
Residues are optional for most features. They should be provided
for features of type transcript and subtypes (in which case it is
the cDNA sequence of the processed transcript) and for
proteins/polypeptides. Residues should also be provided for any
features that have other features located relative to them
-->
<!ELEMENT residues (#PCDATA)>
<!-- md5checksum:
The unique 32 character signature of the residues
The md5checksum may be present even if the residues are not
-->
<!ELEMENT md5checksum (#PCDATA)>
<!-- seqlen:
The length of the residues sequence. If residues is present, this
MUST be identical to length(residues).
For contiguous features (eg exons) this MUST be identical to
abs(featureloc/nbeg - featureloc/nend)
Note that the seqlen element may be present even if the residues
element is not present
-->
<!ELEMENT seqlen (#PCDATA)>
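<!-- Example: a sketch showing the two seqlen constraints together.
     The residues string has 9 characters, and abs(nbeg - nend) is
     abs(0 - 9) = 9, so seqlen must be 9. The feature_ids, the
     uniquename and the strand value are hypothetical.

<feature>
  <feature_id>exon1</feature_id>
  <uniquename>exon:CTG0001.1:0-9</uniquename>
  <type>exon</type>
  <residues>ATGCCGAAA</residues>
  <seqlen>9</seqlen>
  <featureloc>
    <srcfeature_id>contig1</srcfeature_id>
    <nbeg>0</nbeg>
    <nend>9</nend>
    <strand>1</strand>
  </featureloc>
</feature>
-->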
<!-- is_analysis:
Boolean. Set to 1 if this is an analysis feature
-->
<!ELEMENT is_analysis (#PCDATA)>
<!-- type:
This element can occur in different contexts
feature/type
- this should be the name of a term from the SO
http://song.sf.net
featureprop/type
- this is the name of the property
feature_relationship/type
- this is the type of relationship occurring between two features
-->
<!ELEMENT type (#PCDATA)>
<!-- feature_cvterm:
features can be associated with cvterms (eg GO IDs)
-->
<!ELEMENT feature_cvterm (cvterm)>
<!ELEMENT cvterm (dbxrefstr)>
<!-- featureprop:
Features can have multiple tag=value properties
For some tags (types) there may be multiple values - the ordering is
indicated with featureprop/rank
-->
<!ELEMENT featureprop (type|value|rank?)+>
<!-- value: (s)
The value of a property
-->
<!ELEMENT value (#PCDATA)>
<!-- rank:
This element can occur in different contexts
featureprop/rank
- indicates the ordering of values for a particular property type
feature_relationship/rank
- indicates the ordering of child nodes in the feature graph
(for example, the ordering of exons that are part_of a transcript)
featureloc/rank
- Some features have paired locations (eg Hits and HSPs)
In this case the match feature should have one featureloc of
rank=0 (indicating the location of the match on the query
feature) and one featureloc of rank=1 (indicating the location of
the match on the sbjct feature)
rank is unbounded in the case of multiple alignments
multiple locations with rank > 0 are also used for representing
variation features (eg SNPs) relative to different haplotype
srcfeatures
-->
<!ELEMENT rank (#PCDATA)>
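<!-- Example: a sketch of two values for the same property type, ordered
     with featureprop/rank. The property name "synonym" and both values
     are hypothetical.

<featureprop>
  <type>synonym</type>
  <value>first value</value>
  <rank>0</rank>
</featureprop>
<featureprop>
  <type>synonym</type>
  <value>second value</value>
  <rank>1</rank>
</featureprop>
-->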
<!-- organismstr: (s)
The scientific name of an organism
Corresponds to the columns of the "organism" table in chado
The syntax of organismstr is
"GENUS SPECIES" (eg "Drosophila melanogaster")
The GENUS field cannot have spaces in it; the SPECIES field can
(eg "Homo sapiens neanderthalensis")
The common name can also be included in the organismstr; in this
case the syntax is
"GENUS SPECIES (COMMON_NAME)"
The common name can have any number of spaces or other characters
(including nested brackets) in it.
For other cases (eg hybrids, engineered genes) [TODO syntax policy]
-->
<!ELEMENT organismstr (#PCDATA)>
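<!-- Example: how the syntax above would break down for three values.
     The common name "fruit fly" is given purely for illustration.

<organismstr>Drosophila melanogaster</organismstr>
     GENUS=Drosophila  SPECIES=melanogaster

<organismstr>Homo sapiens neanderthalensis</organismstr>
     GENUS=Homo  SPECIES=sapiens neanderthalensis
     (everything after the first space belongs to SPECIES)

<organismstr>Drosophila melanogaster (fruit fly)</organismstr>
     GENUS=Drosophila  SPECIES=melanogaster  COMMON_NAME=fruit fly
-->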
<!-- feature_dbxref:
For representing secondary identifiers on a feature
(eg OMIM IDs, LocusLink IDs, UniProt IDs)
Each identifier must have a separate feature_dbxref element
-->
<!ELEMENT feature_dbxref (dbxrefstr)>
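<!-- Example: a sketch of a feature with one primary dbxrefstr and two
     secondary identifiers, each in its own feature_dbxref element.
     All database names and accessions are hypothetical.

<feature>
  <feature_id>gene1</feature_id>
  <dbxrefstr>ExampleDB:G0001</dbxrefstr>
  <uniquename>Drosophila_melanogaster:p53</uniquename>
  <type>gene</type>
  <feature_dbxref>
    <dbxrefstr>OtherDB:12345</dbxrefstr>
  </feature_dbxref>
  <feature_dbxref>
    <dbxrefstr>ThirdDB:P00001.2</dbxrefstr>
  </feature_dbxref>
</feature>
-->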
<!-- =========== -->
<!-- FEATURELOCS -->
<!-- =========== -->
<!-- featureloc:
A feature can have zero or more locations relative to another
feature
!!!!!!!! VERY IMPORTANT !!!!!!!!
**** ALL LOCATIONS ARE INTERBASE ****
!!!!!!!! VERY IMPORTANT !!!!!!!!
interbase starts with origin 0, and counts the spaces between bases
for example:
sequence:  a t g c c g a a a
position: 0 1 2 3 4 5 6 7 8 9
the first three bases "atg" are indicated by the interbase range
[0,3]
All featurelocs are relative to the feature indicated by srcfeature_id
-->
<!ELEMENT featureloc (nbeg|nend|strand|srcfeature_id|locgroup?|rank?)+>
<!-- nbeg: INTEGER
Position of the natural begin (ie 5' end) of a feature
(relative to the feature indicated by srcfeature_id)
-->
<!ELEMENT nbeg (#PCDATA)>
<!-- nend: INTEGER
Position of the natural end (ie 3' end) of a feature
(relative to the feature indicated by srcfeature_id)
-->
<!ELEMENT nend (#PCDATA)>
<!-- strand: INTEGER
Relative direction of the location
Note: this is usually implicit from whether nbeg<nend
EXCEPT in the case of zero-length features (ie insertions)
-->
<!ELEMENT strand (#PCDATA)>
<!-- srcfeature_id:
This is the feature_id of the feature to which the featureloc is
relative
-->
<!ELEMENT srcfeature_id (#PCDATA)>
<!-- locgroup: INT
(optional)
A feature can have multiple redundant locations; for example,
located relative to both chromosome and to contig. Redundant
featurelocs are indicated with a locgroup>0
-->
<!ELEMENT locgroup (#PCDATA)>
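<!-- Example: sketches of featureloc usage. The feature_ids and
     coordinates are hypothetical. The strand values 1 and -1 are an
     assumed convention for forward and reverse; the DTD itself only
     says that strand is an INTEGER.

     A feature covering the first 1000 bases of contig1, forward strand:

<featureloc>
  <srcfeature_id>contig1</srcfeature_id>
  <nbeg>0</nbeg>
  <nend>1000</nend>
  <strand>1</strand>
</featureloc>

     The same interval on the reverse strand. nbeg and nend swap,
     because nbeg is always the natural (5') begin:

<featureloc>
  <srcfeature_id>contig1</srcfeature_id>
  <nbeg>1000</nbeg>
  <nend>0</nend>
  <strand>-1</strand>
</featureloc>

     A match feature with paired locations: rank 0 on the query
     feature and rank 1 on the sbjct feature:

<feature>
  <feature_id>hsp1</feature_id>
  <uniquename>match:query1:10-60:sbjct1:200-250</uniquename>
  <type>match</type>
  <is_analysis>1</is_analysis>
  <featureloc>
    <srcfeature_id>query1</srcfeature_id>
    <nbeg>10</nbeg>
    <nend>60</nend>
    <strand>1</strand>
    <rank>0</rank>
  </featureloc>
  <featureloc>
    <srcfeature_id>sbjct1</srcfeature_id>
    <nbeg>200</nbeg>
    <nend>250</nend>
    <strand>1</strand>
    <rank>1</rank>
  </featureloc>
</feature>
-->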
<!-- ===================== -->
<!-- FEATURE_RELATIONSHIPS -->
<!-- ===================== -->
<!-- feature_relationship:
For representing feature graphs
A feature_relationship can be thought of as an arc in a graph, or
as a SUBJECT PREDICATE OBJECT statement; for example
p53-exon-1 :part_of p53-mRNA-1
p53-protein-1 :derived_from p53-mRNA-1
The subject of the statement is the feature on the left (child node)
The object of the statement is the feature on the right (parent node)
-->
<!ELEMENT feature_relationship (subject_id|object_id|type|rank?)+>
<!-- subject_id:
feature_id of subject (child) feature
-->
<!ELEMENT subject_id (#PCDATA)>
<!-- object_id:
feature_id of object (parent) feature
-->
<!ELEMENT object_id (#PCDATA)>
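<!-- Example: a sketch of the two statements from the feature_relationship
     comment above, written as elements. The feature_id values are
     hypothetical and must match feature_id elements of features in the
     same document; the rank value is illustrative.

<feature_relationship>
  <subject_id>p53-exon-1</subject_id>
  <object_id>p53-mRNA-1</object_id>
  <type>part_of</type>
  <rank>0</rank>
</feature_relationship>
<feature_relationship>
  <subject_id>p53-protein-1</subject_id>
  <object_id>p53-mRNA-1</object_id>
  <type>derived_from</type>
</feature_relationship>
-->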
<!-- END OF CHAOS XML DTD -->
<!-- $Author: cmungall $ -->
<!-- $Id: chaos.dtd,v 1.7 2005/04/27 19:32:45 cmungall Exp $ -->