multialign

Align multiple sequences using progressive method

Syntax

SeqsMultiAligned = multialign(Seqs)
SeqsMultiAligned = multialign(Seqs, Tree)

multialign(..., 'PropertyName', PropertyValue,...)
multialign(..., 'Weights', WeightsValue)
multialign(..., 'ScoringMatrix', ScoringMatrixValue)
multialign(..., 'SMInterp', SMInterpValue)
multialign(..., 'GapOpen', GapOpenValue)
multialign(..., 'ExtendGap', ExtendGapValue)
multialign(..., 'DelayCutoff', DelayCutoffValue)
multialign(..., 'UseParallel', UseParallelValue)
multialign(..., 'Verbose', VerboseValue)
multialign(..., 'ExistingGapAdjust', ExistingGapAdjustValue)
multialign(..., 'TerminalGapAdjust', TerminalGapAdjustValue)

Input Arguments

Seqs

Vector of structures with the fields 'Sequence' for the residues and 'Header' or 'Name' for the labels.

Seqs can also be a cell array of strings or a char array.

TreePhylogenetic tree calculated with the seqlinkage or seqneighjoin function.
WeightsValueProperty to select the sequence weighting method. Enter 'THG' (default) or 'equal'.
ScoringMatrixValue

Either of the following:

  • String specifying the scoring matrix to use for the alignment. Choices for amino acid sequences are:

    • 'BLOSUM62'

    • 'BLOSUM30' increasing by 5 up to 'BLOSUM90'

    • 'BLOSUM100'

    • 'PAM10' increasing by 10 up to 'PAM500'

    • 'DAYHOFF'

    • 'GONNET'

    Default is:

    • 'BLOSUM80' to 'BLOSUM30' series — When AlphabetValue equals 'AA'

    • 'NUC44' — When AlphabetValue equals 'NT'

      Note:   The above scoring matrices, provided with the software, also include a structure containing a scale factor that converts the units of the output score to bits. You can also use the 'Scale' property to specify an additional scale factor to convert the output score from bits to another unit.

  • Matrix representing the scoring matrix to use for the alignment. It can be a matrix, such as returned by the blosum, pam, dayhoff, gonnet, or nuc44 function. It can also be an M-by-M matrix or M-by-M-by-N array of matrices with N user-defined scoring matrices.

      Note:   If you use a scoring matrix that you created or was created by one of the above functions, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix. When passing your own series of scoring matrices, ensure they share the same scale.

    Note:   If you need to compile multialign into a stand-alone application or software component using MATLAB® Compiler™, use a matrix instead of a string for ScoringMatrixValue.

SMInterpValueProperty to specify whether linear interpolation of the scoring matrices is on or off. When false, the scoring matrix is assigned to a fixed range depending on the distances between the two profiles (or sequences) being aligned. Default is true.
GapOpenValueScalar or a function specified using @. If you enter a function, multialign passes four values to the function: the average score for two matched residues (sm), the average score for two mismatched residues (sx), and, the length of both profiles or sequences (len1, len2). Default is @(sm,sx,len1,len2) 5*sm.
ExtendGapValueScalar or a function specified using @. If you enter a function, multiialign passes four values to the function: the average score for two matched residues (sm), the average score for two mismatched residues (sx), and the length of both profiles or sequences (len1, len2). Default is @(sm,sx,len1,len2) sm/4.
DelayCutoffValueProperty to specify the threshold delay of divergent sequences. Default is unity where sequences with the closest sequence farther than the median distance are delayed.
UseParallelValueControls the computation of the pairwise alignments using parfor-loops. When true, and Parallel Computing Toolbox™ is installed and a parpool is open, computation occurs in parallel. If there are no open parpool, but automatic creation is enabled in the Parallel Preferences, the default pool will be automatically open and computation occurs in parallel. If Parallel Computing Toolbox is installed, but there are no open parpool and automatic creation is disabled, then computation uses parfor-loops in serial mode. If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.
VerboseValueProperty to control displaying the sequences with sequence information. Default is false.
ExistingGapAdjustValueProperty to control automatic adjustment based on existing gaps. Default is true.
TerminalGapAdjustValueProperty to adjust the penalty for opening a gap at the ends of the sequence. Default is false.

Output Arguments

SeqsMultiAligned

Vector of structures (same as Seqs) but with the field 'Sequence' updated with the alignment.

When Seqs is a cell or char array, SeqsMultiAligned is a char array with the output alignment following the same order as the input.

Description

SeqsMultiAligned = multialign(Seqs) performs a progressive multiple alignment for a set of sequences (Seqs). Pairwise distances between sequences are computed after pairwise alignment with the Gonnet scoring matrix and then by counting the proportion of sites at which each pair of sequences are different (ignoring gaps). The guide tree is calculated by the neighbor-joining method assuming equal variance and independence of evolutionary distance estimates.

SeqsMultiAligned = multialign(Seqs, Tree) uses a tree (Tree) as a guide for the progressive alignment. The sequences (Seqs) should have the same order as the leaves in the tree (Tree) or use a field ('Header' or 'Name') to identify the sequences.


multialign(..., 'PropertyName', PropertyValue,...)
enters optional arguments as property name/property value pairs. Specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:

multialign(..., 'Weights', WeightsValue) selects the sequence weighting method. Weights emphasize highly divergent sequences by scaling the scoring matrix and gap penalties. Closer sequences receive smaller weights.

Values of the property Weights are:

  • 'THG' (default) — Thompson-Higgins-Gibson method using the phylogenetic tree branch distances weighted by their thickness.

  • 'equal' — Assigns the same weight to every sequence.

multialign(..., 'ScoringMatrix', ScoringMatrixValue) selects the scoring matrix (ScoringMatrixValue) for the progressive alignment. Match and mismatch scores are interpolated from the series of scoring matrices by considering the distances between the two profiles or sequences being aligned. The first matrix corresponds to the smallest distance, and the last matrix to the largest distance. Intermediate distances are calculated using linear interpolation.

multialign(..., 'SMInterp', SMInterpValue), when SMInterpValue is false, turns off the linear interpolation of the scoring matrices. Instead, each supplied scoring matrix is assigned to a fixed range depending on the distances between the two profiles or sequences being aligned.

multialign(..., 'GapOpen', GapOpenValue) specifies the initial penalty for opening a gap.

multialign(..., 'ExtendGap', ExtendGapValue) specifies the initial penalty for extending a gap.

multialign(..., 'DelayCutoff', DelayCutoffValue) specifies a threshold to delay the alignment of divergent sequences whose closest neighbor is farther than

(DelayCutoffValue) * (median patristic distance between sequences)

multialign(..., 'UseParallel', UseParallelValue) specifies whether to use parfor-loops when computing the pairwise alignments. When true, and Parallel Computing Toolbox is installed and a parpool is open, computation occurs in parallel. If there are no open parpool, but automatic creation is enabled in the Parallel Preferences, the default pool will be automatically open and computation occurs in parallel. If Parallel Computing Toolbox is installed, but there are no open parpool and automatic creation is disabled, then computation uses parfor-loops in serial mode. If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.

multialign(..., 'Verbose', VerboseValue), when VerboseValue is true, turns on verbosity.

The remaining input optional arguments are analogous to the function profalign and are used through every step of the progressive alignment of profiles.

multialign(..., 'ExistingGapAdjust', ExistingGapAdjustValue), when ExistingGapAdjustValue is false, turns off the automatic adjustment based on existing gaps of the position-specific penalties for opening a gap.

When ExistingGapAdjustValue is true, for every profile position, profalign proportionally lowers the penalty for opening a gap toward the penalty of extending a gap based on the proportion of gaps found in the contiguous symbols and on the weight of the input profile.

multialign(..., 'TerminalGapAdjust', TerminalGapAdjustValue), when TerminalGapAdjustValue is true, adjusts the penalty for opening a gap at the ends of the sequence to be equal to the penalty for extending a gap.

Examples

Align multiple sequences

This example shows how to align multiple protein sequences.

Use the fastaread function to read p53samples.txt, a FASTA-formatted file included with Bioinformatics Toolbox™, which contains p53 protein sequences of seven species.

p53 = fastaread('p53samples.txt')
p53 = 

7x1 struct array with fields:

    Header
    Sequence

Compute the pairwise distances between each pair of sequences using the 'GONNET' scoring matrix.

dist = seqpdist(p53,'ScoringMatrix','GONNET');

Build a phylogenetic tree using an unweighted average distance (UPGMA) method. This tree will be used as a guiding tree in the next step of progressive alignment.

tree = seqlinkage(dist,'average',p53)
    Phylogenetic tree object with 7 leaves (6 branches)

Perform progressive alignment using the PAM family scoring matrices.

ma = multialign(p53,tree,'ScoringMatrix',...
                {'pam150','pam200','pam250'})
showalignment(ma)
ma = 

7x1 struct array with fields:

    Header
    Sequence

Align Nucleotide Sequences

  1. Enter an array of sequences.

    seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
    
  2. Promote terminations with gaps in the alignment.

    multialign(seqs,'terminalGapAdjust',true)
    
    ans =
    --CACGTAACATCTC--
    ACGACGTAACATCTTCT
    -AAACGTAACATCTCGC
  3. Compare the alignment without termination gap adjustment.

    multialign(seqs)
    
    ans =
    CA--CGTAACATCT--C
    ACGACGTAACATCTTCT
    AA-ACGTAACATCTCGC
Was this topic helpful?