Bio::ObjectCompat
Bio::ObjectCompat - Object compatibility for phylogenetic software in OO perl
Rutger A. Vos rvos@interchange.ubc.ca Department of Zoology, 6270 University Boulevard University of British Columbia Vancouver, BC, V6T 1Z4, Canada
The most recent version of this document can be found at (user=guest, pass=guest):
$URL: http://nladr-cvs.sdsc.edu/svn/CIPRES/cipresdev/trunk/cipres/framework/perl/phylo/lib/Bio/ObjectCompat.pod $
The trunk version of this document is written in pod, a simple source code documentation format for perl5. To view it in an nroff-like formatter, use 'perldoc ObjectCompat.pod'. Pod can be converted to a number of different formats; by default the pod2text, pod2latex and pod2html utilities should be available for this purpose on systems with a recent perl installation.
The version you are reading now is: $Revision: 3409 $
Please help improve this document by making sure you are reading the most recent version, and sharing your feedback with the author.
This document describes the steps required to obtain object compatibility between three software packages written in object-oriented perl5: Bio::Perl, Bio::NEXUS and Bio::Phylo. Of these three, BioPerl is by far the most commonly used, largest and oldest project. We therefore suggest an approach that requires minimal, optional changes on its part, playing to the strength of its design in using interfaces such as Bio::Tree::TreeI and Bio::Tree::NodeI. We are implementing several new such interfaces, in particular for characters or character sequences, character state matrices and a character-data-and-tree object that forms a container for comparative data and phylogenetic trees. Implementation of these interfaces is largely left to Bio::NEXUS and Bio::Phylo, which thereby become compatible, such that users can draw on the strengths of both packages more easily.
Phylogenetic analysis is a field that, from a programmer's perspective, deals with a limited set of objects: trees, which are composed of nodes; matrices, which are composed of character sequences of some sort; and a containing context that describes the relationship between the two: a character-data-and-tree object.
Objects in perl5 are references to data structures 'blessed
into' a package, which defines the methods implemented by the
object. Perl5 allows for multiple inheritance either by using the
base pragma or by manipulating the @ISA array. Runtime
modification of the inheritance tree and the symbol table allows for
optional implementation of java-like interfaces, so that classes
from different packages can become loosely coupled through the
interfaces they implement. These properties can be used to make
different packages written in object-oriented perl5 object-compatible.
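The following is a minimal sketch of these properties, not code from any of the three packages. The package names My::NodeI and My::Node are hypothetical, and the sketch assumes perl 5.14 or later for the block form of package. It shows a hash reference blessed into a class, and a runtime change to @ISA that makes the class claim an interface it did not inherit from at compile time.

```perl
use strict;
use warnings;

# Hypothetical interface: it only names the method implementers must provide.
package My::NodeI {
    sub get_name { die "get_name is not implemented\n" }
}

# Hypothetical class whose objects are hash references blessed into the package.
package My::Node {
    sub new {
        my ( $class, %args ) = @_;
        return bless { name => $args{name} }, $class;
    }
    sub get_name { $_[0]->{name} }
}

package main;

my $node = My::Node->new( name => 'Homo_sapiens' );

# Before the runtime change, the object does not claim the interface.
my $before = $node->isa('My::NodeI') ? 1 : 0;

# Manipulating @ISA at runtime makes My::Node inherit from My::NodeI.
push @My::Node::ISA, 'My::NodeI';
my $after = $node->isa('My::NodeI') ? 1 : 0;

print "before=$before after=$after name=", $node->get_name, "\n";
```

Because My::Node defines its own get_name, the interface's version is never reached, even after the interface has been added to @ISA.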
Several software libraries written in object-oriented perl5 now exist that all implement objects from the phylogenetic problem space - though all in slightly different ways. The largest among these packages is Bio::Perl, which is widely used by molecular biologists around the world. BioPerl's architecture is broad, with branches being maintained by many different developers who maintain compatibility with each other by implementing interfaces such as Bio::Tree::TreeI and Bio::Tree::NodeI (see also: http://search.cpan.org/~birney/bioperl-1.4/biodesign.pod). Here we will describe how two smaller packages, Bio::NEXUS and Bio::Phylo, can be modified to become compatible with BioPerl so that their respective strengths become more easily accessible to the BioPerl user community. The approach we suggest may be a model for other phylogenetic software written in OO perl5, with BioPerl taking on the role of defining the standard interfaces - a kind of W3C for phyloinformatics.
The typical approach taken in BioPerl is that java-like interfaces are defined in classes whose names are suffixed with an 'I', e.g. Bio::Tree::TreeI. These classes inherit from Bio::Root::RootI, which defines exception handling methods.
The interfaces are never instantiated directly. Rather, the implementation class objects such as Bio::Tree::Tree are instantiated by the IO system, in this case Bio::TreeIO.
The interfaces define the method names to be implemented. Each method body
simply calls throw_not_implemented, so an exception is raised if the
interface's own code block is ever executed. Classes in BioPerl such as
Bio::Tree::Tree implement the actual subroutines defined in the interfaces
they list in their @ISA arrays, in this case Bio::Tree::TreeI, thereby
preventing these exceptions from ever being thrown.
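The pattern can be sketched without depending on BioPerl itself. In the sketch below, My::Tree::TreeI and My::Tree::Tree are hypothetical stand-ins, and the local throw_not_implemented only imitates the behaviour described above; it is not the Bio::Root::RootI implementation. The implementing class overrides one method and deliberately leaves the other abstract to show the exception.

```perl
use strict;
use warnings;

# Hypothetical interface: every method throws unless a subclass overrides it.
package My::Tree::TreeI {
    sub throw_not_implemented {
        my $self   = shift;
        my $method = ( caller(1) )[3];    # name of the abstract method called
        die "Abstract method '$method' is not implemented by "
            . ref($self) . "\n";
    }
    sub get_nodes     { shift->throw_not_implemented }
    sub get_root_node { shift->throw_not_implemented }
}

# Hypothetical implementation: lists the interface in @ISA, overrides get_nodes.
package My::Tree::Tree {
    our @ISA = ('My::Tree::TreeI');
    sub new {
        my $class = shift;
        return bless { nodes => [@_] }, $class;
    }
    sub get_nodes { @{ $_[0]->{nodes} } }
}

package main;

my $tree  = My::Tree::Tree->new(qw(a b c));
my @nodes = $tree->get_nodes;                    # overridden, so no exception

my $ok    = eval { $tree->get_root_node; 1 };    # still abstract, so it throws
my $error = $@;

print scalar(@nodes), " nodes\n";
print "caught: $error" unless $ok;
```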
BioPerl's general design philosophy is that "complex" operations (generally, anything that is computationally intensive and/or requires external tools) are provided by separate factory classes that operate on the objects. The basic objects modelling biological data (trees, matrices) are therefore intentionally fairly concise.
Third-party packages can become compatible with BioPerl by declaring, with the
base pragma, which BioPerl interfaces they implement (and then correctly
implementing the methods defined in those interfaces). However, this creates a
permanent compile-time dependency between the third-party package and BioPerl.
A more dynamic option is to test at runtime whether an interface is
installed, and only then to inherit from it by including the class in the
@ISA array.
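One way the runtime test might look is sketched here. The class name My::Phylo::Tree and the $HAS_BIOPERL flag are hypothetical; Bio::Tree::TreeI is the BioPerl interface named above. The sketch runs whether or not BioPerl is installed, which is the point of the technique.

```perl
use strict;
use warnings;

# Hypothetical third-party class that inherits from the BioPerl interface
# only when that interface can actually be loaded.
package My::Phylo::Tree {
    our @ISA;

    # require dies if the module is missing; eval traps that failure.
    our $HAS_BIOPERL = eval { require Bio::Tree::TreeI; 1 } ? 1 : 0;
    push @ISA, 'Bio::Tree::TreeI' if $HAS_BIOPERL;

    sub new { return bless {}, shift }
}

package main;

my $tree = My::Phylo::Tree->new;
printf "Bio::Tree::TreeI %s\n",
    $My::Phylo::Tree::HAS_BIOPERL ? 'inherited' : 'not installed, skipped';
```

The trade-off is that the class hierarchy now depends on the installation, so code that relies on the interface should check $tree->isa('Bio::Tree::TreeI') rather than assume it.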
I (RAV) found that in many instances the interface-defined methods differ only slightly from those implemented natively by the Bio::Phylo classes (e.g. return values passed as a list versus an array reference), so implementing adaptor classes to create object compatibility with BioPerl was fairly straightforward - as shown in the Bio::Phylo::Adaptor architecture.
The Bio::NEXUS::Tree and Bio::NEXUS::Node objects could be modified in a similar way, such that tree objects and node objects from Bio::NEXUS can similarly masquerade as BioPerl objects.
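The list-versus-array-reference difference mentioned above can be bridged by a thin wrapper. The sketch below is illustrative only and is not the Bio::Phylo::Adaptor code: My::Native::Node and My::Adaptor::Node are hypothetical classes, and the method names id and each_Descendent are borrowed from Bio::Tree::NodeI purely to show the shape of the translation.

```perl
use strict;
use warnings;

# Hypothetical native class: children come back as one array reference.
package My::Native::Node {
    sub new {
        my ( $class, %args ) = @_;
        my $self = { name => $args{name}, children => $args{children} || [] };
        return bless $self, $class;
    }
    sub get_name     { $_[0]->{name} }
    sub get_children { $_[0]->{children} }
}

# Hypothetical adaptor: wraps a native node and answers with a list instead.
package My::Adaptor::Node {
    sub new {
        my ( $class, $node ) = @_;
        return bless { node => $node }, $class;
    }
    sub id { $_[0]->{node}->get_name }

    # Dereference the native array reference and wrap each child in turn.
    sub each_Descendent {
        my $self = shift;
        return map { My::Adaptor::Node->new($_) }
            @{ $self->{node}->get_children };
    }
}

package main;

my @tips    = map { My::Native::Node->new( name => $_ ) } qw(A B);
my $root    = My::Native::Node->new( name => 'root', children => \@tips );
my $adapted = My::Adaptor::Node->new($root);
my @names   = map { $_->id } $adapted->each_Descendent;

print join( ',', @names ), "\n";
```

Wrapping rather than subclassing keeps the native class untouched, at the cost of creating one adaptor object per node visited.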
Bio::NEXUS and Bio::Phylo can integrate further along three tracks:
The next section discusses these interfaces in more detail.
The interfaces we propose are meant to be fairly minimal, providing mostly just accessors and mutators for the object's data. Substantial operations (e.g. calculations) will be provided by factory objects. For example, inferring a tree would be something like:
my $inferrer = Bio::Tools::InferTree::FooBar->new;
my $tree = $inferrer->inferTree( $matrix );
Rather than:
my $tree = $matrix->inferTree;
At present, no suitable interface for character state matrices has been
defined in BioPerl. However, a Bio::Phylo::Matrices::Matrix can be made to
masquerade as a Bio::Align::AlignI instance well enough that it is
written as proper #nexus without too much trouble, as shown in
Bio::Phylo::Adaptor::Bioperl::Matrix. But a character state
matrix object can be many other things besides an alignment. The #nexus
format specifies many other data types (categorical, continuous values)
which should also be validated.
A character state matrix has a pre-defined data type
(dna/rna/nucleotide; amino acid; standard categorical; continuous)
against which data inserted in the matrix must be validated. Once data has
been inserted in the matrix there is little point in changing the datatype,
so perhaps this should be a constant specified in the constructor, so that
subsequently the interface only defines a readonly $matrix->datatype()
method. Likewise, the number of taxa and characters in a matrix should be
an emergent property of its contents so the $matrix->ntax() and
$matrix->nchar() methods should be readonly.
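A minimal sketch of such a matrix follows, assuming perl 5.14 or later. The class My::Matrix, its add_row method, and the spelling of the datatype names are hypothetical; only datatype, ntax and nchar come from the text above. The datatype is fixed in the constructor, and the two counts are computed from the contents rather than stored.

```perl
use strict;
use warnings;

package My::Matrix {
    use Carp 'croak';

    # Assumed set of datatype names, for illustration only.
    my %KNOWN = map { $_ => 1 } qw(dna rna protein standard continuous);

    sub new {
        my ( $class, %args ) = @_;
        my $type = lc( $args{datatype} // '' );
        croak "Unknown datatype '$type'" unless $KNOWN{$type};
        return bless { datatype => $type, rows => {} }, $class;
    }

    # Read-only accessor: passing an argument is an error.
    sub datatype {
        my $self = shift;
        croak 'datatype is read-only' if @_;
        return $self->{datatype};
    }

    # Hypothetical mutator used here to put data into the matrix.
    sub add_row {
        my ( $self, $taxon, @chars ) = @_;
        $self->{rows}{$taxon} = [@chars];
        return $self;
    }

    # Emergent properties: derived from the rows on every call.
    sub ntax { scalar keys %{ $_[0]->{rows} } }

    sub nchar {
        my $self = shift;
        my $max  = 0;
        for my $row ( values %{ $self->{rows} } ) {
            $max = @$row if @$row > $max;
        }
        return $max;
    }
}

package main;

my $matrix = My::Matrix->new( datatype => 'DNA' );
$matrix->add_row( taxon_a => qw(A C G T) );
$matrix->add_row( taxon_b => qw(A C G -) );

my $changed = eval { $matrix->datatype('rna'); 1 } ? 1 : 0;

printf "%s: ntax=%d nchar=%d changed=%d\n",
    $matrix->datatype, $matrix->ntax, $matrix->nchar, $changed;
```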
In a character state matrix, some symbols may be more ambiguous than others - most sequence alignments have gaps in them, and sometimes the sequences are just bad, with many N's or ?'s. Under the IUPAC single character ambiguity conventions, ambiguous symbols map to non-ambiguous ones as follows:
my $IUPAC = {
    'A' => [ 'A' ],
    'B' => [ 'C','G','T' ],
    'C' => [ 'C' ],
    'D' => [ 'A','G','T' ],
    'G' => [ 'G' ],
    'H' => [ 'A','C','T' ],
    'K' => [ 'G','T' ],
    'M' => [ 'A','C' ],
    'N' => [ 'A','C','G','T' ],
    'R' => [ 'A','G' ],
    'S' => [ 'C','G' ],
    'T' => [ 'T' ],
    'U' => [ 'U' ],
    'V' => [ 'A','C','G' ],
    'W' => [ 'A','T' ],
    'X' => [ 'A','C','G','T' ],
    'Y' => [ 'C','T' ],
    '-' => [ ],
    '?' => [ 'A','C','G','T' ],
};
The matrix interface should be able to take this ambiguity into account when parsing matrices, or when transforming them, for example for serialization to the CIPRES architecture.
To allow for this, validation of a character $c should perform a character
state lookup, such as by checking the $IUPAC hash reference. If
$matrix->datatype =~ /^dna$/i, the $IUPAC hash reference is the lookup
table, and an exception is thrown if not exists $IUPAC->{$c}.
For instances where none of the default lookup tables suffice (i.e. when handling a 'mixed' matrix) the matrix interface should allow a lookup table as an argument to the constructor.
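The lookup-based validation, including a caller-supplied table for a 'mixed' matrix, might be sketched as follows. The function validate_chars, the %LOOKUP_FOR table and the example 'mixed' symbol set are hypothetical; the DNA symbols are the keys of the $IUPAC table above. The sketch assumes lookup keys are upper case and that only key existence matters for validation.

```perl
use strict;
use warnings;

# Default lookup tables keyed by datatype; only the DNA symbols are filled in.
my %LOOKUP_FOR = (
    dna => { map { $_ => 1 } qw(A B C D G H K M N R S T U V W X Y - ?) },
);

# Hypothetical helper: dies on the first symbol missing from the lookup table.
# A custom lookup, if given, takes precedence over the default for the type.
sub validate_chars {
    my ( $datatype, $chars, $custom_lookup ) = @_;
    my $lookup = $custom_lookup || $LOOKUP_FOR{ lc $datatype }
        or die "No lookup table for datatype '$datatype'\n";
    for my $c (@$chars) {
        die "Invalid symbol '$c' for datatype '$datatype'\n"
            unless exists $lookup->{ uc $c };
    }
    return 1;
}

# Valid DNA, invalid DNA, and a mixed matrix with its own lookup table.
my $dna_ok  = eval { validate_chars( 'DNA', [qw(A C G T N -)] ) } ? 1 : 0;
my $dna_bad = eval { validate_chars( 'dna', [qw(A C Z)] ) }       ? 1 : 0;
my $dna_err = $@;

my %mixed    = map { $_ => 1 } qw(A C G T 0 1 ?);
my $mixed_ok = eval { validate_chars( 'mixed', [qw(A 0 1 ?)], \%mixed ) } ? 1 : 0;

print "dna_ok=$dna_ok dna_bad=$dna_bad mixed_ok=$mixed_ok\n";
```

In an object-oriented version the custom table would be passed to the matrix constructor once, rather than to every validation call.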
title and link tokens, or possibly
just by allowing only one taxa block, one tree block and one characters block
to be in context at any one time). This facility may be defined as in
Bio::Phylo::Matrices::Matrix, using $matrix->set_cdat($cdat) and
$matrix->get_cdat() methods, or just from the perspective of the CDAT
container, e.g. $cdat->add_matrix($matrix), or by using a mediator
architecture that manages the bi-directional relationships between the objects
involved.
BioPerl does not define a suitable interface for character sequences. We propose a character sequence interface that meets the following requirements:
$char->set_type($type) and $char->get_type() methods.
Conceptually, nodes in phylogenetic trees and character sequences in
matrices both refer to biological entities (e.g. OTUs). We want to make
this relationship explicit by creating an intersection object that
links the two. The CDAT object would be a thin wrapper around the more
fine grained BioPerl objects (Bio::Tree::TreeI and
Bio::CDAT::CharMatrixI) it contains. This CDAT object must meet the
following requirements:
$cdat->set_tree($tree) and $cdat->get_trees()
(and perhaps $cdat->remove_trees($tree)).
Bio::CDAT::CharMatrixI
objects, e.g. using $cdat->set_matrices($matrix) and
$cdat->get_matrices($matrix) (and perhaps
$cdat->remove_matrices($matrix)) methods.
We suggest as a namespace Bio::CDAT.
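As a rough sketch of how thin such a wrapper could be, the hypothetical My::CDAT class below stores trees and matrices and exposes the accessor names mentioned above. It is not a proposed Bio::CDAT implementation: plain hash references stand in for the tree and matrix objects, no interface checking is done, and get_matrices is written without an argument as an assumption. It assumes perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical container linking trees and matrices.
package My::CDAT {
    use Scalar::Util 'refaddr';

    sub new { return bless { trees => [], matrices => [] }, shift }

    sub set_tree {
        my ( $self, $tree ) = @_;
        push @{ $self->{trees} }, $tree;
        return $self;
    }
    sub get_trees { @{ $_[0]->{trees} } }

    # Remove by identity: compare reference addresses, not contents.
    sub remove_trees {
        my ( $self, $tree ) = @_;
        $self->{trees} =
            [ grep { refaddr($_) != refaddr($tree) } @{ $self->{trees} } ];
        return $self;
    }

    sub set_matrices {
        my ( $self, $matrix ) = @_;
        push @{ $self->{matrices} }, $matrix;
        return $self;
    }
    sub get_matrices { @{ $_[0]->{matrices} } }

    sub remove_matrices {
        my ( $self, $matrix ) = @_;
        $self->{matrices} =
            [ grep { refaddr($_) != refaddr($matrix) } @{ $self->{matrices} } ];
        return $self;
    }
}

package main;

# Stand-ins for Bio::Tree::TreeI and Bio::CDAT::CharMatrixI objects.
my $tree_1 = { newick => '(A,B);' };
my $tree_2 = { newick => '(B,A);' };
my $matrix = { A => 'ACGT', B => 'ACG-' };

my $cdat = My::CDAT->new;
$cdat->set_tree($tree_1)->set_tree($tree_2)->set_matrices($matrix);
$cdat->remove_trees($tree_1);

my @trees    = $cdat->get_trees;
my @matrices = $cdat->get_matrices;
printf "%d tree(s), %d matrix(es)\n", scalar @trees, scalar @matrices;
```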
Bio::NEXUS, Bio::Phylo and BioPerl should become better integrated at the input/output level, for example by adopting the standard BioPerl architectures for parsers (e.g. Bio::TreeIO), and by making trees received from CIPRES conform to the BioPerl interfaces.
In order to ensure quality coding, we should adopt a set of test data files and a regression testing strategy. This is likely to develop out of the use cases.
The intent is that the design phase takes place on CPAN releases of Bio::NEXUS and Bio::Phylo, and that changes to the BioPerl core will be proposed only once the API has stabilized.