[ <About> <Download> <DTD> ]

About

Currently the specification for Chaos-XML is as a DTD. In the future we may also provide XML Schema and/or Relax-NG specifications.

For a detailed description of the meaning of the various elements in a Chado XML you should consult The Chado sequence module DDL. Chaos-XML is an almost direct transformation of the Chado relational schema, see chaos-xml and chado for a description of differences.

Download

The DTD is located at http://www.fruitfly.org/chaos-xml/dtd/chaos.dtd. Note that you may have to download the file by right-clicking on this link, as browsers may have problems displaying the DTD.

DTD

You can look at the DTD here:


<!--

  $Id: chaos.dtd,v 1.7 2005/04/27 19:32:45 cmungall Exp $

  CHAOS XML DTD

  Version 1

  For representing sequences and sequence annotation

  The root node is a "chaos" element

  Underneath this there is a "metadata" element, followed by
  elements of type "feature" or "feature_relationship" in any order

  Feature locations are indicated with "featureloc", nested inside the
  "feature" element

  Contact: Chris Mungall [BDGP] cjm@fruitfly.org

-->

<!-- chaos: this is the top-level node -->
<!ELEMENT chaos (chaos_metadata?,feature+,feature_relationship*)+>

<!-- ======== -->
<!-- METADATA -->
<!-- ======== -->

<!-- chaos_metadata: information about the document file itself (node) -->
<!ELEMENT chaos_metadata (chaos_version|chaos_flavour?|focus_feature_id?|feature_unique_key?|equiv_chado_release?|sequence_ontology_revision?|dsn?|dbname?|prog_args?|export_unixtime?|export_localtime?|export_host?|export_user?|export_perl5lib?|export_program?|export_module_cvs_id?)+>

<!-- chaos_version: (s) -->
<!ELEMENT chaos_version (#PCDATA)>

<!-- chaos_flavour: there may be subtle differences in the chaos xml
     depending on the generation method. Typical values may be
     "bioperl", "ensembl", "chado" -->
<!ELEMENT chaos_flavour (#PCDATA)>

<!-- focus_feature_id: if the chaos-xml document is gene-centric,
     this will be the feature_id of the central gene in the document
     -->
<!ELEMENT focus_feature_id (#PCDATA)>

<!-- feature_unique_key: this is the element that provides the unique
     identifier for features. If set, this should always be equal to
     the string "feature_id"  -->
<!ELEMENT feature_unique_key (#PCDATA)>

<!-- equiv_chado_release: this is the tag for the equivalent chado
     version release that corresponds to the chaos xml. If set, this
     should be equal to "chado_1_01"    -->
<!ELEMENT equiv_chado_release (#PCDATA)>

<!-- various metadata tags; may be specific to export program   -->
<!ELEMENT sequence_ontology_revision (#PCDATA)>
<!ELEMENT dsn (#PCDATA)>
<!ELEMENT prog_args (#PCDATA)>
<!ELEMENT dbname (#PCDATA)>

<!-- export_unixtime: unixtime (seconds since 1970) at which the
     document was exported       -->
<!ELEMENT export_unixtime (#PCDATA)>

<!-- export_localtime: human readable string of the time at which
     the document was exported -->
<!ELEMENT export_localtime (#PCDATA)>

<!-- export_host: name of the machine from which the document was
     exported  -->
<!ELEMENT export_host (#PCDATA)>

<!-- export_user: fullname or username of person who exported the
     document  -->
<!ELEMENT export_user (#PCDATA)>

<!-- export_perl5lib: value for env var PERL5LIB at export time -->
<!ELEMENT export_perl5lib (#PCDATA)>

<!-- export_program: name of executable that generated the document -->
<!ELEMENT export_program (#PCDATA)>

<!-- ======== -->
<!-- FEATURES -->
<!-- ======== -->

<!-- feature: 

     A feature is any biological sequence entity or any entity that
     can be potentially located relative to a biological sequence. 

     It is any entity that can be typed using the Sequence Ontology
     (SO)

     It is equivalent to the chado table "feature"

      -->
<!ELEMENT feature (feature_id|dbxrefstr?|name?|uniquename|type|residues?|md5checksum?|featureprop*|organismstr?|featureloc*|feature_dbxref*|seqlen?|is_analysis?)+>

<!-- feature_id:

     A unique identifier for this feature that is NOT required to be
     persistent outside the document. This field is necessary for
     resolving feature graphs.

     Unlike the equivalent chado column "feature.feature_id", this is
     not required to be an integer.

     If the chaos-xml is to be mapped to a chado database, then the
     feature_id element should NOT be stored in the database. The
     feature_id is purely for resolving feature graphs WITHIN a
     document.

     If a chado database is to be mapped to chaos-xml, then the chado
     feature.feature_id values may be used in the document.

 -->
<!ELEMENT feature_id (#PCDATA)>

<!-- dbxrefstr:

  A string representing a database name plus database identifier

  If the dbxrefstr element is a subelement of the feature element,
  then this is the PRIMARY dbxrefstr for this feature. A primary
  dbxrefstr is optional for features (for example, features
  representing blast HSPs are not expected to have persistent external
  identifiers)

  Of the dbxrefstr element is a subelement of the feature_dbxref
  element (see below), then this is a SECONDARY dbxrefstr. A feature
  can have any number of secondary dbxrefstrs. These will generally be
  to external databases.

  A dbxrefstr always has the format  DBNAME:ACCESSION
  or the format                      DBNAME:ACCESION.VERSION

  The ':' symbol is not allowed in the DBNAME

  (if the dbxrefstr contains more than one ':' then all but the first
  are considered part of the ACCESSION)

  The ':' symbol is not allowed in the ACCESSION

  (if the dbxrefstr contains more than one '.' then all but the first
  are considered part of the VERSION)

  ACCESSION and VERSION are mapped to the chado columns
  dbxref.accession and dbxref.version

  DBNAME is mapped to the chado column db.name

 -->
<!ELEMENT dbxrefstr (#PCDATA)>

<!-- name:

  A feature can have an optional human-readable name (for example,
  gene symbol). This is the label that is generally presented to the
  end-user.

  Names do not have to be unique, but it is wise to make the names
  unique whenever possible.

  One circumstance where names may not be unique is if the document
  contains features for two organisms. There may be name clashes
  between orthologous genes and other features.

  Names are optional, but for annotated features data providers should
  aim to supply them whenever possible, even if these names are
  automatically generated (eg p53-mRNA-1).

  Names should in general not be supplied for computed features (eg
  HSPs)

 -->
<!ELEMENT name (#PCDATA)>

<!-- uniquename:

  Every feature must have a uniquename. The uniquename is guaranteed
  to be a unique identifier for that feature, not just in that
  document, but in the whole world.

  Uniquenames should be human readable where possible, but they do not
  need to be

  It is up to the data provider to come up with a policy for
  generating uniquenames.

  One policy would be to combine the feature name and the feature
  organismstr (for example "Drosophila_melanogaster:p53"). The
  assumption here is that the naming authorities for each organism
  would take care to ensure name uniqueness within that organism.

  Care must be taken if the data provider has gene models for
  different haplotypes. Each model is a different instantiation of a
  canonical form of the gene. In this case the haplotype must be
  included in the uniquename somehow (if it is not somehow included in
  the name, as it is in the case of FlyBase alleles).

  The safest policy is to make the uniquename from the primary
  location of the feature, and the feature type. For example, if an
  exon is located on position 0->1000 on contig CTG0001.1 then a valid
  uniquename would be exon:CTG00001.1:0-1000.

  The data consumer should make no assumptions about the syntactic
  structure of the uniquename, other than it is globally unique

-->
<!ELEMENT uniquename (#PCDATA)>

<!-- residues:

  The IUPAC residue sequence of the feature. This can be either DNA or
  Amino Acid residues.

  Residues are optional for most features. They shoould be provided
  for features of type transcript and subtypes (in which case it is
  the cDNA sequence of the processed transcript) and for
  proteins/polypeptides. Residues should also be provided for any
  features that have other features located relative to them

 -->
<!ELEMENT residues (#PCDATA)>

<!-- md5checksum:

  The unique 32 character signature of the residues

  The md5checksum may be present even if the residues are not

 -->
<!ELEMENT md5checksum (#PCDATA)>

<!-- seqlen:

  The length of the residues sequence. If residues is present, this
  MUST be identical to length(residues).

  With non-discontiguous features (eg exons) this MUST be identical to
  abs(featureloc/nbeg - featureloc/nend)

  Note that the seqlen element may be present even if the residues
  element is not present

 -->
<!ELEMENT seqlen (#PCDATA)>

<!-- is_analysis:

  Boolean. Set to 1 if this is an analysis feature

-->
<!ELEMENT is_analysis (#PCDATA)>

<!-- type:

  This element can occur in different contexts

  feature/type
    - this should be the name of a term from the SO

      http://song.sf.net

  featureprop/type
    - this is the name of the property
  
  feature_relationship/type
    - this is the type of relationship occuring between two features

 -->
<!ELEMENT type (#PCDATA)>



<!-- feature_cvterm:

features can be associated with cvterms (eg GO IDs)

-->

<!ELEMENT feature_cvterm (cvterm)>
<!ELEMENT cvterm (dbxrefstr)>


<!-- featureprop:

  Features can have multiple tag=value properties

  For some tags (types) there may be multiple values - the ordering is
  indicated with featureprop/rank

-->

<!ELEMENT featureprop (type|value|rank?)+>

<!-- value: (s) 

  The value of a property

-->

<!ELEMENT value (#PCDATA)>

<!-- rank:

  This element can occur in different contexts

  featureprop/rank
    - indicates the ordering of values for a particular property type

  feature_relationship/rank
    - indicates the ordering of child nodes in the feature graph
      (for example, the ordering of exons that are part_of a transcript)

  featureloc/rank
    - Some features have paired locations (eg Hits and HSPs)

     In this case the match feature should have one featureloc of
     rank=0 (indicating the location of the match on the query
     feature) and one featureloc of rank=1 (indicating the location of
     the match on the sbjct feature)

     rank is unbounded in the case of multiple alignments

     multiple locations with rank > 0 are also used for representing
     variation features (eg SNPs) relative to different haplotype
     srcfeatures
 -->

<!ELEMENT rank (#PCDATA)>

<!-- organismstr: (s) 

  The scientific name of an organism

  Corresponds to the columns of the "organism" table in chado

  The syntax of organismstr is

  "GENUS SPECIES" (eg "Drosophila melanogaster")

  The GENUS field cannot have spaces in it; the SPECIES field can
  (eg "Homo sapiens neanderthalis")

  The common name can also be included in the organismstr; in this
  case the syntax is

  "GENUS SPECIES (COMMON_NAME)"

  The common name can have any number of spaces or other characters
  (including nested brackets) in it.

  For other cases (eg hybrids, engineered genes) [TODO syntax policy]

-->

<!ELEMENT organismstr (#PCDATA)>

<!-- feature_dbxref:

 For representing secondary identifiers on a feature
 (eg OMIM IDs, LocusLink IDs, UniProt IDs)

 Each identifier must have a seperate feature_dbxref element

 -->
<!ELEMENT feature_dbxref (dbxrefstr)>

<!-- =========== -->
<!-- FEATURELOCS -->
<!-- =========== -->

<!-- featureloc:

  A feature can have zero or more locations relative to another
  feature
  
  !!!!!!!! VERY IMPORTANT !!!!!!!!
  **** ALL LOCATIONS ARE INTERBASE ****
  !!!!!!!! VERY IMPORTANT !!!!!!!!

  interbase starts with origin 0, and counts the spaces between bases

  for example:

  sequence:    a t g c c g a a a
  position:   0 1 2 3 4 5 6 7 8 9  

  the first three bases "atg" are indicated by the interbase range
  [0,3]

  All featurelocs are relative to the feature indicated by srcfeature_id

 -->
<!ELEMENT featureloc (nbeg|nend|strand|srcfeature_id|locgroup?|rank?)+>

<!-- nbeg: INTEGER

  Position of the natural begin (ie 5' end) of a feature
  (relative to the feature indicated by srcfeature_id)

 -->
<!ELEMENT nbeg (#PCDATA)>

<!-- nend: INTEGER

  Position of the natural end (ie 3' end) of a feature
  (relative to the feature indicated by srcfeature_id)

 -->
<!ELEMENT nend (#PCDATA)>

<!-- strand: INTEGER

  Relative direction of the location

  Note: this is usually implicit from whether nbeg<nend
  EXCEPT in the case of zero-length features (ie insertions)

-->
<!ELEMENT strand (#PCDATA)>

<!-- srcfeature_id:

  This is the feature_id of the feature to which the featureloc is
  relative to

 -->
<!ELEMENT srcfeature_id (#PCDATA)>

<!-- locgroup: INT 

   (optional)

   A feature can have multiple redundant locations; for example,
   located relative to both chromosome and to contig. Redundant
   featurelocs are indicated with a locgroup>0

-->
<!ELEMENT locgroup (#PCDATA)>

<!-- ===================== -->
<!-- FEATURE_RELATIONSHIPS -->
<!-- ===================== -->

<!-- feature_relationship:

  For representing feature graphs

  A feature_relationship can be thought of as an arc in a graph, or
  as a SUBJECT PREDICATE OBJECT statement; for example

  p53-exon-1       :part_of          p53-mRNA-1
  p53-protein-1    :derived_from     p53-mRNA-1

  The subject of the statement is the feature on the left (child node)
  The object of the statement is the feature on the right (parent node)

  

 -->
<!ELEMENT feature_relationship (subject_id|object_id|type|rank?)+>

<!-- subject_id:

  feature_id of subject (child) feature

 -->
<!ELEMENT subject_id (#PCDATA)>

<!-- object_id:

   feature_id of object (parent) feature

 -->
<!ELEMENT object_id (#PCDATA)>

<!--  END OF CHAOS XML DTD -->
<!--  $Author: cmungall $ -->
<!--  $Id: chaos.dtd,v 1.7 2005/04/27 19:32:45 cmungall Exp $ -->