SDBP: An R package for assessing statistical reliability of phylogenetic trees

Speedy double bootstrap method

Evaluating the reliability of estimated phylogenetic trees is of critical importance in molecular phylogenetics and in other fields that depend on accurate phylogenetic reconstruction. The bootstrap method is a well-known computational approach to assessing phylogenetic trees and, more generally, to assessing the reliability of statistical models. However, it is known to be biased under certain circumstances, which calls the accuracy of the method into question. Therefore, several advanced bootstrap methods have been developed to achieve higher accuracy, one of which is the speedy double bootstrap approach. For the phylogenetic tree selection problem, it has been shown that the speedy double bootstrap approach has accuracy comparable to that of the double bootstrap approach and is much more computationally efficient.
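To make the ordinary bootstrap probability concrete, the following is a minimal sketch in base R. It assumes a matrix of site-wise log-likelihoods with one row per site and one column per tree, and it resamples sites in the RELL style (resampling the log-likelihoods rather than re-estimating each tree). The matrix loglik is simulated and the function name bootstrap_prob is hypothetical; this is not the SDBP implementation, and it does not include the double bootstrap correction.

```r
# Sketch: bootstrap probability (BP) of each tree from site-wise log-likelihoods.
# `loglik` is a simulated, hypothetical sites-by-trees matrix.
set.seed(1)
n_sites <- 200
n_trees <- 4
loglik <- matrix(rnorm(n_sites * n_trees, mean = -5), nrow = n_sites,
                 dimnames = list(NULL, paste0("t", seq_len(n_trees))))
loglik[, "t1"] <- loglik[, "t1"] + 0.1  # make tree t1 slightly better on average

bootstrap_prob <- function(loglik, n_boot = 1000) {
  n <- nrow(loglik)
  wins <- integer(ncol(loglik))
  names(wins) <- colnames(loglik)
  for (b in seq_len(n_boot)) {
    idx <- sample.int(n, n, replace = TRUE)        # resample sites with replacement
    total <- colSums(loglik[idx, , drop = FALSE])  # total log-likelihood per tree
    best <- which.max(total)                       # tree selected in this replicate
    wins[best] <- wins[best] + 1L
  }
  wins / n_boot  # fraction of replicates in which each tree was selected
}

bp_sketch <- bootstrap_prob(loglik)
print(bp_sketch)
```

The returned values sum to one, because exactly one tree is selected in each replicate. The bias mentioned above concerns how well such frequencies behave as p-values, which is what the double bootstrap variants aim to improve.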

R package SDBP

SDBP is an R package for assessing the statistical reliability of phylogenetic trees. It is distributed for academic use free of charge by Aizhen Ren. The package was written in the S language using the S3 object system. For each phylogenetic tree in a given candidate tree set, p-values are calculated via the speedy double bootstrap method. The p-value of a tree indicates how strongly the tree is supported by the data.
SDBP provides three types of p-value: sDBP (speedy double bootstrap probability), DBP (double bootstrap probability), and BP (bootstrap probability).

Download

UNIX/Windows

The source code is available from the CRAN web site:

The official SDBP page at CRAN

SDBP and its supporting document are also available from this web site:

On Windows, you can put the SDBP_1.0.zip file downloaded from CRAN anywhere on your computer, for example on the desktop. On a UNIX machine, you can put SDBP_1.0.tar.gz in your home directory.

Installation

The SDBP package was built under R version 3.0.0, so this version of R (or later) is needed to install it. On Windows, start R, choose Packages in the upper toolbar, select the Install Package(s) from zip files option, and then choose the SDBP_1.0.zip file downloaded from CRAN. On a UNIX machine, install the source package SDBP_1.0.tar.gz by typing the following command at the command line in your home directory, where you put the source file:

R CMD INSTALL SDBP_1.0.tar.gz

Then start R from the command line with the command:

R

Then type the following at the R console to load the package (this command is the same on UNIX and Windows machines):

library("SDBP") # load our package

If you have not received any error up to this step, the package is installed. For details on installing R packages, see Ligges (2008).
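As an alternative to the steps above, the downloaded source archive can also be installed from inside an R session with the standard install.packages function. The sketch below assumes that SDBP_1.0.tar.gz sits in the current working directory; it only attempts the installation when the file is present and then reports whether the package can be found.

```r
# Sketch: install the downloaded source archive from inside R, then verify.
pkg_file <- "SDBP_1.0.tar.gz"  # assumed to be in the current working directory
if (file.exists(pkg_file)) {
  # repos = NULL tells R to install from the local file, not from a repository
  install.packages(pkg_file, repos = NULL, type = "source")
}

# TRUE if the package is installed and its namespace can be loaded
sdbp_installed <- requireNamespace("SDBP", quietly = TRUE)
if (sdbp_installed) {
  message("SDBP is installed and can be loaded with library(\"SDBP\").")
} else {
  message("SDBP was not found; check the installation steps above.")
}
```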

Data files

These data files are available as supplementary material.

Compressed file: mam20files.tgz (unix), mam20files.zip (win)

These data files are described briefly by Ren et al. (2013).

How to obtain an input log likelihood file for this tool

Use the software package PAML to calculate the site-wise log-likelihoods for each tree. The output is an .lnf file, for example mam20-conc.lnf.
Because SDBP cannot read the PAML .lnf format directly, convert the file with CONSEL by executing the command "seqmt --paml .lnf", for example "seqmt --paml mam20-conc.lnf". This produces the matrix of site-wise log-likelihoods for each tree, saved in an .mt file, for example mam20-conc.mt.
Place the .mt file produced by CONSEL in the R working directory.

Requirement

The R package "scaleboot" is required to read .mt files. "scaleboot" is available from CRAN.
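Before running the examples, it can help to confirm that both prerequisites are in place. The following sketch uses only base R; the helper name check_sdbp_inputs is hypothetical and not part of SDBP or scaleboot, and the file name is the example used on this page.

```r
# Sketch: check that scaleboot is installed and that the .mt file is in the
# R working directory. `check_sdbp_inputs` is a hypothetical helper.
check_sdbp_inputs <- function(mt_file) {
  c(scaleboot_installed = requireNamespace("scaleboot", quietly = TRUE),
    mt_file_found = file.exists(mt_file))
}

print(getwd())  # the directory in which R looks for the .mt file
status <- check_sdbp_inputs("mam20-conc.mt")
print(status)

# If scaleboot is missing, it can be installed from CRAN with:
# install.packages("scaleboot")
```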

Usage

The following is an example of typical usage of "SDBP", using the data set mam20 stored in the SDBP data file mam20.rda, and also the mam20-conc.mt file.

> data(mam20) # load the data set mam20
> dim(mam20) # dimensions of the mam20 matrix
[1] 5879 15

For mam20-conc.mt

> library(scaleboot) # load the scaleboot package
> dat <- read.mt("mam20-conc.mt") # read the mam20-conc.mt file
> dim(dat) # dimensions of the dat matrix

Calculating the sDBP value for each tree takes only the following one line.

> result <- sdbp.default(mam20)
> result

The results are listed in decreasing order of log-likelihood.

Call:
sdbp.default(dat = mam20)
SDBP double bootstrap probabilities:
t1     t4     t3     t7     t2     t5
0.7503 0.4281 0.3794 0.3338 0.3054 ...
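The ordering of trees by log-likelihood can be illustrated with base R. The sketch below assumes that rows are sites and columns are trees, so that the total log-likelihood of a tree is its column sum; the matrix toy is simulated for illustration and is not the mam20 data.

```r
# Sketch: order trees by decreasing total log-likelihood.
# `toy` is a simulated sites-by-trees matrix of site-wise log-likelihoods.
set.seed(2)
toy <- matrix(rnorm(50 * 3, mean = -4), nrow = 50,
              dimnames = list(NULL, c("t1", "t2", "t3")))

total_loglik <- colSums(toy)  # one total per tree
tree_order <- names(sort(total_loglik, decreasing = TRUE))
print(total_loglik[tree_order])
```

Applied to a real matrix such as mam20, the same two lines would give the order in which the trees are listed, provided the matrix has the assumed layout.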

> summary(result)

The output is

$Call:
sdbp.default(dat = mam20)

$coefficients

stdErr p.value
t1 0.0043 0.7503
t4 0.0049 0.4281
t3 0.0048 0.3794
t7 0.0047 0.3338
...
attr(,"class")
[1] "summary.sdbp"

To calculate the reliability of a single tree, for example tree 2, use the command sdbpk; the output is shown below.

> result1 <- sdbpk(mam20,2)
> result1
The output is:
Call:
sdbpk(dat = mam20, k = 2)

t2
0.3018

The bootstrap probabilities can be calculated with the command bp, again shown with its output.

> result2 <- bp(mam20)
The output is as follows:
Call:
bp(dat = mam20)
Bootstrap probabilities:
t1     t4     t3     t7     t2     
0.4887 0.1978 0.1128 0.0882 0.0270 ...

Similarly, the bootstrap probability for one tree can be calculated with the command bpk(mam20), and the double bootstrap probability for one tree with the command dbpk(mam20).

References

Ligges, U., 2008. Programmieren mit R. Springer.
Ren, A., Ishida, T. and Akiyama, Y., 2013. Assessing statistical reliability of phylogenetic trees via a speedy double bootstrap method. Molecular Phylogenetics and Evolution, 67(2), 429-435. doi:10.1016/j.ympev.2013.02.011