Building a Phylogenetic Tree

    Note:   For information on creating a phylogenetic tree with multiply aligned sequences, see the phytree function.

Overview of the Primate Example

In this example, a phylogenetic tree is constructed from mitochondrial DNA (mtDNA) sequences for the family Hominidae. This family includes gorillas, chimpanzees, orangutans, and humans.

The following procedures demonstrate the phylogenetic analysis features in the Bioinformatics Toolbox™ software. They are not intended to teach the process of phylogenetic analysis, but to show you how to use MathWorks® products to create a phylogenetic tree from a set of nonaligned nucleotide sequences.

The origin of modern humans is a heavily debated issue that scientists have recently tackled by using mitochondrial DNA (mtDNA) sequences. One hypothesis explains the limited genetic variation of human mtDNA in terms of a recent common genetic ancestry, implying that the mtDNA of all modern populations originated from a single woman who lived in Africa less than 200,000 years ago.

Why Use Mitochondrial DNA Sequences for Phylogenetic Study?

Mitochondrial DNA sequences, like the Y chromosome, do not recombine. Unlike the Y chromosome, mtDNA is inherited from the maternal parent. This lack of recombination allows sequences to be traced through one genetic line, and all polymorphisms can be assumed to be caused by mutations.

Mitochondrial DNA in mammals has a faster mutation rate than nuclear DNA sequences. This faster rate of mutation produces more variance between sequences and is an advantage when studying closely related species. The mitochondrial control region (Displacement or D-loop) is one of the fastest mutating sequence regions in animal DNA.

Neanderthal DNA

The ability to isolate mitochondrial DNA (mtDNA) from palaeontological samples has allowed genetic comparisons between extinct species and closely related living species. mtDNA is isolated from fossil samples instead of nuclear DNA for two reasons:

  • mtDNA, because it is circular, is more stable and degrades more slowly than nuclear DNA.

  • Each cell can contain a thousand copies of mtDNA but only a single copy of nuclear DNA.

While there is still controversy as to whether Neanderthals are direct ancestors of humans or evolved independently, the use of ancient genetic sequences in phylogenetic analysis adds an interesting dimension to the question of human ancestry.

Searching NCBI for Phylogenetic Data

The NCBI taxonomy Web site includes phylogenetic and taxonomic information from many sources, including the published literature, Web databases, and taxonomy experts. Although the NCBI taxonomy database is not a phylogenetic or taxonomic authority, it can be useful as a gateway to the NCBI biological sequence databases.

This procedure uses the family Hominidae (orangutans, chimpanzees, gorillas, and humans) as a taxonomy example for searching the NCBI Web site and locating mitochondrial D-loop sequences.

  1. Use the MATLAB® Help browser to search for data on the Web. In the MATLAB Command Window, type

    web('http://www.ncbi.nlm.nih.gov')

    A separate browser window opens with the home page for the NCBI Web site.

  2. Search the NCBI Web site for information. For example, to search for the human taxonomy, from the Search list, select Taxonomy, and in the for box, enter hominidae.

    The NCBI Web search returns a list of links to relevant pages.

  3. Select the taxonomy link for the family Hominidae. A page with the taxonomy for the family is shown.

Creating a Phylogenetic Tree for Five Species

Drawing a phylogenetic tree using sequence data is helpful when you are trying to visualize the evolutionary relationships between species. The input can be a set of multiply aligned sequences or a set of nonaligned sequences. You can select a method for calculating pairwise distances between sequences, and you can select a method for calculating the hierarchical clustering distances used to build a tree.
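
To make the idea of a pairwise distance concrete, the following minimal sketch computes a Jukes-Cantor distance by hand for two short, already-aligned sequences. The sequences and variable names are made up for illustration, and the sketch ignores gaps and ambiguous symbols. It assumes the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), where p is the fraction of mismatched sites.

```matlab
% Two toy, already-aligned DNA sequences of equal length (hypothetical data).
seqA = 'ACGTACGTAC';
seqB = 'ACGTTCGTAA';

% p: proportion of sites at which the two sequences differ.
p = sum(seqA ~= seqB) / numel(seqA);

% Jukes-Cantor correction for nucleotides: d = -(3/4) * ln(1 - (4/3) * p).
% The formula is only defined for p < 0.75.
d = -(3/4) * log(1 - (4/3) * p);

fprintf('p-distance = %.4f, Jukes-Cantor distance = %.4f\n', p, d);
```

With 2 mismatches in 10 sites, p is 0.2 and the corrected distance is slightly larger (about 0.233), which accounts for substitutions that a raw mismatch count can hide.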

After locating the GenBank® accession codes for the sequences you are interested in studying, you can create a phylogenetic tree with the data. For information on locating accession codes, see Searching NCBI for Phylogenetic Data.

In the following example, you will use the Jukes-Cantor method to calculate distances between sequences, and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) for linking the tree nodes.

  1. Create a MATLAB cell array with information about the sequences. This step uses the accession codes for the mitochondrial D-loop sequences isolated from different hominid species.

    data = {'German_Neanderthal'      'AF011222';
            'Russian_Neanderthal'     'AF254446';
            'European_Human'          'X90314'  ;
            'Mountain_Gorilla_Rwanda' 'AF089820';
            'Chimp_Troglodytes'       'AF176766';
           };
    
  2. Retrieve sequence data from the GenBank database and copy into the MATLAB environment.

    for ind = 1:5
        seqs(ind).Header   = data{ind,1};
        seqs(ind).Sequence = getgenbank(data{ind,2},...
                                        'sequenceonly', true);
    end
    
  3. Calculate pairwise distances and create a phytree object. For example, compute the pairwise distances using the Jukes-Cantor distance method and build a phylogenetic tree using the UPGMA linkage method. Since the sequences are not prealigned, seqpdist pairwise aligns them before computing the distances.

    distances = seqpdist(seqs,'Method','Jukes-Cantor','Alphabet','DNA');
    tree = seqlinkage(distances,'UPGMA',seqs)    

    The MATLAB software displays information about the phytree object. The function seqpdist calculates the pairwise distances between pairs of sequences while the function seqlinkage uses the distances to build a hierarchical cluster tree. First, the most similar sequences are grouped together, and then sequences are added to the tree in descending order of similarity.

    Phylogenetic tree object with 5 leaves (4 branches)
    
  4. Draw a phylogenetic tree.

    h = plot(tree,'orient','top');
    ylabel('Evolutionary distance')
    set(h.terminalNodeLabels,'Rotation',65)
    

    The MATLAB software draws a phylogenetic tree in a Figure window. In the figure, the hypothesized evolutionary relationships between the species are shown by the location of the species on the branches. The horizontal distances do not have any biological significance.
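
The grouping described in step 3 can be illustrated with a small, self-contained sketch of textbook UPGMA clustering. The distance matrix and taxon names below are hypothetical, and the loop is a simplified illustration of the general technique, not the implementation used by seqlinkage.

```matlab
% Hypothetical symmetric distance matrix for four taxa (not real data).
names = {'A', 'B', 'C', 'D'};
D = [ 0  2  6 10;
      2  0  6 10;
      6  6  0 10;
     10 10 10  0];

labels  = names;                    % current cluster labels
sizes   = ones(1, numel(names));    % number of leaves in each cluster
heights = [];                       % node heights, in merge order

while numel(labels) > 1
    % Find the closest pair of clusters, ignoring the diagonal.
    Dm = D;
    Dm(logical(eye(size(D, 1)))) = Inf;
    [dmin, idx] = min(Dm(:));
    [i, j] = ind2sub(size(Dm), idx);
    if i > j
        tmp = i; i = j; j = tmp;    % keep i < j so cluster i survives
    end

    % UPGMA update: size-weighted average distance to all other clusters.
    newRow = (sizes(i) * D(i, :) + sizes(j) * D(j, :)) / (sizes(i) + sizes(j));

    % Record the merge; the new node sits at half the merge distance.
    heights(end + 1) = dmin / 2;
    labels{i} = ['(' labels{i} ',' labels{j} ')'];
    sizes(i)  = sizes(i) + sizes(j);

    % Overwrite cluster i with the merged cluster, then delete cluster j.
    D(i, :) = newRow;
    D(:, i) = newRow';
    D(i, i) = 0;
    D(j, :) = [];
    D(:, j) = [];
    labels(j) = [];
    sizes(j)  = [];
end

newick = labels{1};
disp(newick)
```

For this toy matrix the merge order is A with B, then C, then D, giving the nested grouping (((A,B),C),D) with node heights 1, 3, and 5.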

Creating a Phylogenetic Tree for Twelve Species

Plotting a simple phylogenetic tree for five species seems to indicate a number of monophyletic groups (see Creating a Phylogenetic Tree for Five Species). After a preliminary analysis with five species, you can add more species to your phylogenetic tree. Adding more species to the data set helps you confirm that the observed monophyletic groups are valid.

  1. Add more sequences to a MATLAB cell array. For example, add mtDNA D-loop sequences for other hominid species.

    data2 = {'Puti_Orangutan'          'AF451972';
             'Jari_Orangutan'          'AF451964';
             'Western_Lowland_Gorilla' 'AY079510';
             'Eastern_Lowland_Gorilla' 'AF050738';
             'Chimp_Schweinfurthii'    'AF176722';
             'Chimp_Vellerosus'        'AF315498';
             'Chimp_Verus'             'AF176731';
           };
    
    
  2. Get additional sequence data from the GenBank database, and copy the data into the next indices of a MATLAB structure.

    for ind = 1:7
        seqs(ind+5).Header   = data2{ind,1};
        seqs(ind+5).Sequence = getgenbank(data2{ind,2},...
                                          'sequenceonly', true);
    end

  3. Calculate pairwise distances and the hierarchical linkage.

    distances = seqpdist(seqs,'Method','Jukes-Cantor','Alphabet','DNA');
    tree = seqlinkage(distances,'UPGMA',seqs);

  4. Draw a phylogenetic tree.

    h = plot(tree,'orient','top');
    ylabel('Evolutionary distance')
    set(h.terminalNodeLabels,'Rotation',65)

    The MATLAB software draws a phylogenetic tree in a Figure window. You can see four main clades for humans, gorillas, chimpanzees, and orangutans.

Exploring the Phylogenetic Tree

After you create a phylogenetic tree, you can explore the tree using the MATLAB command line or the Phylogenetic Tree app. This procedure uses the tree created in Creating a Phylogenetic Tree for Twelve Species as an example.

  1. List the members of a tree.

    names = get(tree,'LeafNames')
    names = 
    
        'German_Neanderthal'
        'Russian_Neanderthal'
        'European_Human'
        'Chimp_Troglodytes'
        'Chimp_Schweinfurthii'
        'Chimp_Verus'
        'Chimp_Vellerosus'
        'Puti_Orangutan'
        'Jari_Orangutan'
        'Mountain_Gorilla_Rwanda'
        'Eastern_Lowland_Gorilla'
        'Western_Lowland_Gorilla'

    From the list, you can determine the indices for its members. For example, the European Human leaf is the third entry.

  2. Find the closest species to a selected species in a tree. For example, find the species closest to the European human.

    [h_all,h_leaves] = select(tree,'reference',3,...
                              'criteria','distance',...
                              'threshold',0.6);
    

    h_all is a list of indices for the nodes within a patristic distance of 0.6 to the European human leaf, while h_leaves is a list of indices for only the leaf nodes within the same patristic distance.

    A patristic distance is the path length between species calculated from the hierarchical clustering distances. The path distance is not necessarily the biological distance.

  3. List the names of the closest species.

    subtree_names = names(h_leaves)
    

    The MATLAB software prints a list of species whose patristic distance to the European human is within the specified threshold. In this case, the patristic distance threshold is 0.6.

    subtree_names =
    
        'German_Neanderthal'
        'Russian_Neanderthal'
        'European_Human'
        'Chimp_Schweinfurthii'
        'Chimp_Verus'
        'Chimp_Troglodytes'
    
  4. Extract a subtree from the whole tree by removing unwanted leaves. For example, prune the tree to species within 0.6 of the European human species.

    leaves_to_prune = ~h_leaves;
    pruned_tree = prune(tree,leaves_to_prune)
    h = plot(pruned_tree,'orient','top');
    ylabel('Evolutionary distance')
    set(h.terminalNodeLabels,'Rotation',65)    

    The MATLAB software returns information about the new subtree and plots the pruned phylogenetic tree in a Figure window.

    Phylogenetic tree object with 6 leaves (5 branches)

  5. Explore, edit, and format a phylogenetic tree using the Phylogenetic Tree app.

    phytreeviewer(pruned_tree)
    

    The Phylogenetic Tree window opens, showing the tree.

    You can interactively change the appearance of the tree using the app. For information on using this app, see Phylogenetic Tree App Reference.
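
The patristic distance used in step 2 can be illustrated on a toy tree without the toolbox. The sketch below stores a hypothetical four-leaf tree as parent pointers and branch lengths (a representation chosen for illustration, not how a phytree object stores its data), sums branch lengths along the path between a reference leaf and every leaf, and then applies a cutoff in the same spirit as the select call. All names and values are assumptions made for the example.

```matlab
% Hypothetical rooted tree with leaves 1..4 and internal nodes 5..7.
% parent(k) is the parent of node k (0 marks the root);
% len(k) is the length of the branch joining node k to its parent.
parent = [5 5 6 7 6 7 0];
len    = [1 1 3 5 2 2 0];
nLeaves   = 4;
ref       = 1;    % reference leaf
threshold = 7;    % illustrative cutoff

% Distance from the reference leaf to itself and to each of its ancestors.
% A value of -1 marks nodes that are not on the path to the root.
distFromRef = -ones(1, numel(parent));
node  = ref;
total = 0;
while node ~= 0
    distFromRef(node) = total;
    total = total + len(node);
    node  = parent(node);
end

% For each leaf, climb until reaching a node on the reference path,
% then add the two partial path lengths.
patristic = zeros(1, nLeaves);
for leaf = 1:nLeaves
    node  = leaf;
    total = 0;
    while distFromRef(node) < 0
        total = total + len(node);
        node  = parent(node);
    end
    patristic(leaf) = total + distFromRef(node);
end

% Logical index of the leaves that fall within the cutoff.
withinThreshold = patristic <= threshold;
disp(patristic)
disp(withinThreshold)
```

Here the distances from leaf 1 are 0, 2, 6, and 10, so the first three leaves fall within the cutoff. Negating the logical vector, as in step 4, marks the remaining leaf for pruning.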
