The DBStag module

This is an extension to DBI - DBStag objects can be used in exactly the same way as DBI. There are additional methods for conversion between relations and hierarchical datastructures (STags).

see http://stag.sf.net for more details.

Similarity between relational schema and hierarchical structures

For any relational schema with relations linked by primary and foreign keys, there exists one or more hierarchical trees that can be built from those relations.

Click here to see the SQL table creation statements;


Gene-centric tree:
  gene
    gene_function
    transcript
      protein
    transcript_to_exon
      exon
    

Protein-centric tree:

  protein
    transcript
      gene
        gene_function
      transcript_to_exon
        exon
    

Automatically decomposing cartesian products into trees

The following query does natural joins between all primary and foreign keys - it fetches everything in the database.
  SELECT *
  FROM 
  (gene NATURAL JOIN gene_function)
       NATURAL JOIN 
    (transcript NATURAL JOIN protein)
         NATURAL JOIN 
       exon_to_transcript NATURAL JOIN 
         exon;
            
            

The resulting relation looks like this:
gene_accsymbolcytologymol_functiontranscript_accmrna_seqprotein_accprotein_seqexon_accseq_startseq_endseq_strandchromosome_arm
gn00002 foo35A7transmembrane receptorgn00002A ATGAATTGCGGCTTNNNNNNNATTgn00002P1 MQVVQVQVQVQSCSCSgn00002X1 110012L
gn00002 foo35A7transmembrane receptorgn00002A ATGAATTGCGGCTTNNNNNNNATTgn00002P1 MQVVQVQVQVQSCSCSgn00002X2 20120012L
gn00003 bar60F1glutamine-tRNA ligasegn00003A ATGCCGTGATAGCTAGCGTAgn00003P1 MSVXXXXXQLQLQgn00003X1 100112001X
gn00003 bar60F1ATP bindinggn00003A ATGCCGTGATAGCTAGCGTAgn00003P1 MSVXXXXXQLQLQgn00003X1 100112001X
gn00003 bar60F1glutamine-tRNA ligasegn00003B ATGCCGTGATAGTTTTGGGGGGgn00003P2 MSVXXXXXSSSSSSSVVVgn00003X1 100112001X
gn00003 bar60F1ATP bindinggn00003B ATGCCGTGATAGTTTTGGGGGGgn00003P2 MSVXXXXXSSSSSSSVVVgn00003X1 100112001X
gn00003 bar60F1glutamine-tRNA ligasegn00003A ATGCCGTGATAGCTAGCGTAgn00003P1 MSVXXXXXQLQLQgn00003X2 200122001X
gn00003 bar60F1ATP bindinggn00003A ATGCCGTGATAGCTAGCGTAgn00003P1 MSVXXXXXQLQLQgn00003X2 200122001X
gn00003 bar60F1glutamine-tRNA ligasegn00003B ATGCCGTGATAGTTTTGGGGGGgn00003P2 MSVXXXXXSSSSSSSVVVgn00003X3 140116001X
gn00003 bar60F1ATP bindinggn00003B ATGCCGTGATAGTTTTGGGGGGgn00003P2 MSVXXXXXSSSSSSSVVVgn00003X3 140116001X

When you feed the same query to DBStag, it will take the results of the natural join, and decompose it into a hierarchically structured tag-value pairs, which can be expressed in XML.

The following piece of code:

$stag = $dbh->selectall_stag("SELECT * FROM (gene NATURAL JOIN
gene_function) NATURAL JOIN (transcript NATURAL JOIN protein) NATURAL
JOIN exon_to_transcript NATURAL JOIN exon");

print $stag->xml;
            
Click here to download

Produces this output:

<dataset>
  <gene>
    <gene_acc>gn00002     </gene_acc>
    <symbol>foo</symbol>
    <cytology>35A7</cytology>
    <gene_function>
    </gene_function>
    <transcript>
      <transcript_acc>gn00002A    </transcript_acc>
      <gene_acc>gn00002     </gene_acc>
      <protein_acc>gn00002P1   </protein_acc>
      <mrna_seq>ATGAATTGCGGCTTNNNNNNNATT</mrna_seq>
      <protein>
        <protein_acc>gn00002P1   </protein_acc>
        <molecular_weight>62</molecular_weight>
        <protein_seq>MQVVQVQVQVQSCSCS</protein_seq>
      </protein>
      <exon_to_transcript>
        <exon>
          <exon_acc>gn00002X1   </exon_acc>
          <seq_start>1</seq_start>
          <seq_end>100</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>2L</chromosome_arm>
        </exon>
        <exon>
          <exon_acc>gn00002X2   </exon_acc>
          <seq_start>201</seq_start>
          <seq_end>200</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>2L</chromosome_arm>
        </exon>
      </exon_to_transcript>
    </transcript>
  </gene>
  <gene>
    <gene_acc>gn00003     </gene_acc>
    <symbol>bar</symbol>
    <cytology>60F1</cytology>
    <gene_function>
    </gene_function>
    <transcript>
      <transcript_acc>gn00003A    </transcript_acc>
      <gene_acc>gn00003     </gene_acc>
      <protein_acc>gn00003P1   </protein_acc>
      <mrna_seq>ATGCCGTGATAGCTAGCGTA</mrna_seq>
      <protein>
        <protein_acc>gn00003P1   </protein_acc>
        <molecular_weight>55.8</molecular_weight>
        <protein_seq>MSVXXXXXQLQLQ</protein_seq>
      </protein>
      <exon_to_transcript>
        <exon>
          <exon_acc>gn00003X1   </exon_acc>
          <seq_start>1001</seq_start>
          <seq_end>1200</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>X </chromosome_arm>
        </exon>
        <exon>
          <exon_acc>gn00003X2   </exon_acc>
          <seq_start>2001</seq_start>
          <seq_end>2200</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>X </chromosome_arm>
        </exon>
      </exon_to_transcript>
    </transcript>
    <transcript>
      <transcript_acc>gn00003B    </transcript_acc>
      <gene_acc>gn00003     </gene_acc>
      <protein_acc>gn00003P2   </protein_acc>
      <mrna_seq>ATGCCGTGATAGTTTTGGGGGG</mrna_seq>
      <protein>
        <protein_acc>gn00003P2   </protein_acc>
        <molecular_weight>57.2</molecular_weight>
        <protein_seq>MSVXXXXXSSSSSSSVVV</protein_seq>
      </protein>
      <exon_to_transcript>
        <exon>
          <exon_acc>gn00003X1   </exon_acc>
          <seq_start>1001</seq_start>
          <seq_end>1200</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>X </chromosome_arm>
        </exon>
        <exon>
          <exon_acc>gn00003X3   </exon_acc>
          <seq_start>1401</seq_start>
          <seq_end>1600</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>X </chromosome_arm>
        </exon>
      </exon_to_transcript>
    </transcript>
  </gene>
  <gene>
    <gene_acc>gn00004     </gene_acc>
    <symbol>whee</symbol>
    <cytology>58C2</cytology>
    <gene_function>
    </gene_function>
    <transcript>
      <transcript_acc>gn00004A    </transcript_acc>
      <gene_acc>gn00004     </gene_acc>
      <protein_acc>gn00004P1   </protein_acc>
      <mrna_seq>GTGATGTGGCTGNNNNNTTGTTAA</mrna_seq>
      <protein>
        <protein_acc>gn00004P1   </protein_acc>
        <molecular_weight>82.3</molecular_weight>
        <protein_seq>MLSTRQQCSCSXXXXLV</protein_seq>
      </protein>
      <exon_to_transcript>
        <exon>
          <exon_acc>gn00004X1   </exon_acc>
          <seq_start>4389</seq_start>
          <seq_end>4418</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>3L</chromosome_arm>
        </exon>
        <exon>
          <exon_acc>gn00004X2   </exon_acc>
          <seq_start>4890</seq_start>
          <seq_end>5101</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>3L</chromosome_arm>
        </exon>
        <exon>
          <exon_acc>gn00004X3   </exon_acc>
          <seq_start>5812</seq_start>
          <seq_end>6043</seq_end>
          <seq_strand>1</seq_strand>
          <chromosome_arm>3L</chromosome_arm>
        </exon>
      </exon_to_transcript>
    </transcript>
  </gene>
</dataset>

Which can be expressed visually like this:
gene_accsymbolcytologygene_functiontranscript
gn00002 foo35A7
mol_function
transmembrane receptor
transcript_accmrna_seqproteinexon_to_transcript
gn00002A ATGAATTGCGGCTTNNNNNNNATT
protein_accprotein_seq
gn00002P1 MQVVQVQVQVQSCSCS
exon
exon_accseq_startseq_endseq_strandchromosome_arm
gn00002X1 110012L
gn00002X2 20120012L
gn00003 bar60F1
mol_function
glutamine-tRNA ligase
ATP binding
transcript_accmrna_seqproteinexon_to_transcript
gn00003A ATGCCGTGATAGCTAGCGTA
protein_accprotein_seq
gn00003P1 MSVXXXXXQLQLQ
exon
exon_accseq_startseq_endseq_strandchromosome_arm
gn00003X1 100112001X
gn00003X2 200122001X
gn00003B ATGCCGTGATAGTTTTGGGGGG
protein_accprotein_seq
gn00003P2 MSVXXXXXSSSSSSSVVV
exon
exon_accseq_startseq_endseq_strandchromosome_arm
gn00003X1 100112001X
gn00003X3 140116001X

DBStag comes with a program selectall_html that produces nested HTML tables from SQL queries.

Advanced Stuff

The results of DBStag queries are Data::Stag objects (type "man Data::Stag", or have a look at Data::Stag for more details). This provides both a handy way of traversing the hierarchical datastructure, and also provides auto-generated objects.

For example, the following script will query and pretty-print genes from the database:
$stag = $dbh->selectall_stag("SELECT * FROM (gene NATURAL JOIN
gene_function) NATURAL JOIN (transcript NATURAL JOIN protein) NATURAL
JOIN exon_to_transcript NATURAL JOIN exon");

foreach $gene ($stag->get_gene) {
    print "*****\n";
    printf "Gene:%s\nCytology:%s\n\n",
      $gene->get_symbol,
      $gene->get_cytology;
    foreach $func ($gene->get_gene_function) {
        printf "  Function: %s\n", $func->get_mol_function;
    }
    foreach $transcript ($gene->get_transcript) {
        printf "  Transcript:%s\n",
          $transcript->get_transcript_acc;
        printf "  Transcript Seq:%s\n",
          $transcript->get_mrna_seq;

        my $protein = $transcript->get_protein;
        printf "  Protein:%s\n",
          $protein->get_protein_acc;
        printf "  Transcript Seq:%s\n",
          $protein->get_protein_seq;

        printf "  Exons:\n";
        my @exons = $transcript->find_exon;
        foreach my $exon (@exons) {
            printf "    [%d, %d] strand:%s on %s\n",
              $exon->get_seq_start,
              $exon->get_seq_end,
              $exon->get_seq_strand,
              $exon->get_chromosome_arm;
        }
    }
}
            
Autogenerared method calls are highlighted

Click here to download

This script produces this as output:

*****
Gene:foo
Cytology:35A7

  Function: transmembrane receptor
  Transcript:gn00002A
  Transcript Seq:ATGAATTGCGGCTTNNNNNNNATT
  Protein:gn00002P1
  Transcript Seq:MQVVQVQVQVQSCSCS
  Exons:
    [1, 100] strand:1 on 2L
    [201, 200] strand:1 on 2L
*****
Gene:bar
Cytology:60F1

  Function: glutamine-tRNA ligase
  Function: ATP binding
  Transcript:gn00003A
  Transcript Seq:ATGCCGTGATAGCTAGCGTA
  Protein:gn00003P1
  Transcript Seq:MSVXXXXXQLQLQ
  Exons:
    [1001, 1200] strand:1 on X
    [2001, 2200] strand:1 on X
  Transcript:gn00003B
  Transcript Seq:ATGCCGTGATAGTTTTGGGGGG
  Protein:gn00003P2
  Transcript Seq:MSVXXXXXSSSSSSSVVV
  Exons:
    [1001, 1200] strand:1 on X
    [1401, 1600] strand:1 on X
*****
Gene:whee
Cytology:58C2

  Function: GPCR
  Function: taste receptor
  Transcript:gn00004A
  Transcript Seq:GTGATGTGGCTGNNNNNTTGTTAA
  Protein:gn00004P1
  Transcript Seq:MLSTRQQCSCSXXXXLV
  Exons:
    [4389, 4418] strand:1 on 3L
    [4890, 5101] strand:1 on 3L
    [5812, 6043] strand:1 on 3L
      
Note that this is all achieved *without* a domain object model - the domain model is purely relational.

Storing Data

DBStag provides a method storenode for storing a hierarchical structure into a database. You don't need to provide any meta-data, the schema is introspected automatically. DBStag will figure out where to use foreign keys for these nested elements.
Previous |
chris mungall
Last modified: Wed Aug 6 18:18:09 PDT 2003