Bioinformatics Toolbox 3.4
Bootstrapping Phylogenetic Trees
This demonstration shows how to generate bootstrap replicates of DNA sequences. The data generated by bootstrapping are used to estimate the confidence of the branches in a phylogenetic tree.
Contents
Introduction
Bootstrap, jackknife, and permutation tests are common tests used in phylogenetics to estimate the significance of the branches of a tree. This process can be very time consuming because of the large number of samples that have to be taken in order to have an accurate confidence estimate. The more times the data are sampled the better the analysis. A cluster of computers can shorten the time needed for this analysis by distributing the work to several machines and recombining the data.
Load Sequence Data and Build the Original Tree
This demonstration uses the primates data from primatesdemo. The data are 12 pre-aligned sequences from different primates in FASTA format. Using the Bioinformatics Toolbox™, you can construct a phylogenetic tree by computing the pairwise distances with the seqpdist function and then link the species using the UPGA method with the seqlinkage function. seqlinkage builds the tree and returns the data in a phytree object. You can use the phytreetool function to visualize and explore the tree.
primates = fastaread('primatesaligned.fa'); num_seqs = length(primates) orig_primates_dist = seqpdist(primates); orig_primates_tree = seqlinkage(orig_primates_dist,'average',primates); phytreetool(orig_primates_tree);
num_seqs =
12
Make Bootstrap Replicates from the Data
A bootstrap replicate is a shuffled representation of the DNA sequence data. To make a bootstrap replicate of the primates data, bases are sampled randomly from the sequences with replacement and concatenated to make new sequences. The same number of bases as the original multiple alignment is used in this analysis, and then gaps are removed to force a new pairwise alignment. randsample from Statistics Toolbox™ samples the data with replacement. This function can also sample the data randomly without replacement to perform jackknife analysis. For this analysis, 100 bootstrap replicates for each sequence are made, the sequences are grouped first into structs and then these are stored in a cell array.
num_boots = 100; seq_len = length(primates(1).Sequence); boots = cell(num_boots,1); for n = 1:num_boots reorder_index = randsample(seq_len,seq_len,true); for i = num_seqs:-1:1 %reverse order to preallocate memory bootseq(i).Header = primates(i).Header; bootseq(i).Sequence = strrep(primates(i).Sequence(reorder_index),'-' ,''); end boots{n} = bootseq; end
Computation of the Distances Between Bootstraps and Phylogenetic Reconstruction
Determining the distances between DNA sequences for a large data set and building the phylogenetic trees can take a long time. Distributing these calculations over several machines/cores decreases the computation time. This demonstration assumes that you have already started a MATLAB® pool with additional parallel resources. For information about setting up and selecting parallel configurations, see "Programming with User Configurations" in the Parallel Computing Toolbox™ documentation. If you do not have the Parallel Computing Toolbox™ the following PARFOR loop executes sequentially without any further modification.
fun = @(x) seqlinkage(x,'average',{primates.Header}); boot_trees = cell(num_boots,1); parfor (n = 1:num_boots) dist_tmp = seqpdist(boots{n}); boot_trees{n} = fun(dist_tmp); end
Counting the Branches with Similar Topology
The topology of every bootstrap tree is compared with that of the original tree. Any interior branch that gives the same partition of species is counted. Since branches may be ordered differently among different trees but still represent the same partition of species it is necessary to get the canonical form for each subtree before comparison. The first step is to get the canonical subtrees of the original tree using the subtree and getcanonical methods from the Bioinformatics Toolbox™.
for i = num_seqs-1:-1:1 % for every branch, reverse order to preallocate branch_pointer = i + num_seqs; sub_tree = subtree(orig_primates_tree,branch_pointer); orig_pointers{i} = getcanonical(sub_tree); orig_species{i} = sort(get(sub_tree,'LeafNames')); end
Now you can get the canonical subtrees for all the branches of every bootstrap tree.
for j = num_boots:-1:1 for i = num_seqs-1:-1:1 % for every branch branch_ptr = i + num_seqs; sub_tree = subtree(boot_trees{j},branch_ptr); clusters_pointers{i,j} = getcanonical(sub_tree); clusters_species{i,j} = sort(get(sub_tree,'LeafNames')); end end
For each subtree in the original tree count how many times it appears within the bootstrap subtrees. To be considered as similar they must have the same topology and span the same species.
count = zeros(num_seqs-1,1); for i = 1 : num_seqs -1 % for every branch for j = 1 : num_boots * (num_seqs-1) if isequal(orig_pointers{i},clusters_pointers{j}) if isequal(orig_species{i},clusters_species{j}) count(i) = count(i) + 1; end end end end Pc = count ./ num_boots % confidence probability (Pc)
Pc =
1.0000
1.0000
1.0000
1.0000
0.5900
0.5900
1.0000
0.4800
0.4600
0.4600
0.4600
Visualizing the Confidence Values in the Original Tree
Store the data from this analysis in a new tree. The tree is the same as the original one but the branch names contain the confidence information. Using the phytree function a new tree is created with the updated values. phytreetool displays this data when the mouse is over the internal nodes of the tree.
[ptrs dist names] = get(orig_primates_tree,'POINTERS','DISTANCES','NODENAMES '); for i = 1 : num_seqs -1 % for every branch branch_ptr = i + num_seqs; names{branch_ptr} = [names{branch_ptr} ', confidence: ' num2st r(100*Pc(i)) ' %']; end tr = phytree(ptrs,dist,names) phytreetool(tr);
Phylogenetic tree object with 12 leaves (11 branches)
References
[1] J. Felsenstein, 2004. "Inferring Phylogenies", Sinaur Associates,Inc.
[2] M. Nei and S. Kumar, 2000. "Molecular Evolution and Phylogenetics". Oxford University Press. Chapter 4.
Store