seqpdist - Calculate pairwise distance between sequences

Syntax

D = seqpdist(Seqs)

D = seqpdist(Seqs, ...'Method', MethodValue, ...)
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...)
D = seqpdist(Seqs, ...'Optargs', OptargsValue, ...)
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...)
D = seqpdist(Seqs, ...'JobManager', JobManagerValue, ...)
D = seqpdist(Seqs, ...'WaitInQueue', WaitInQueueValue, ...)
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...)
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...)
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...)
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...)
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...)
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...)

Arguments

SeqsAny of the following:
  • Cell array containing nucleotide or amino acid sequences

  • Vector of structures containing a Sequence field

  • Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence

MethodValueString that specifies the method for calculating pairwise distances. Default is Jukes-Cantor.
IndelsValueString that specifies how to treat sites with gaps. Default is score.
OptargsValueString or cell array specifying one or more input arguments required or accepted by the distance method specified by the Method property.
PairwiseAlignmentValueControls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Choices are true or false. Default is:
  • true — When all input sequences do not have the same length.

  • false — When all input sequences have the same length.

    Tip   If your input sequences have the same length, seqpdist will assume they aligned. If they are not aligned, do one of the following:

    • Align the sequences before passing them to seqpdist, for example, using the multialign function.

    • Set PairwiseAlignment to true when using seqpdist.

JobManagerValueA jobmanager object, such as returned by the Parallel Computing Toolbox™ function findResource, that represents an available distributed MATLAB® resource. Specifying this property distributes pairwise alignments into a cluster of computers using the Parallel Computing Toolbox software. You must have the Parallel Computing Toolbox software to use this property.
WaitInQueueValueControls whether seqpdist waits for a distributed MATLAB resource to be available when you have set the JobManager property. Choices are true or false (default). You must have the Parallel Computing Toolbox software to use this property.
SquareFormValue

Controls the conversion of the output into a square matrix. Choices are true or false (default).

AlphabetValueString specifying the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).
ScoringMatrixValue

String specifying the scoring matrix to use for the global pairwise alignment. Choices for amino acid sequences are:

  • 'PAM40'

  • 'PAM250'

  • 'DAYHOFF'

  • 'GONNET'

  • 'BLOSUM30' increasing by 5 up to 'BLOSUM90'

  • 'BLOSUM62'

  • 'BLOSUM100'

Default is:

  • 'NUC44' (when AlphabetValue equals 'NT')

  • 'BLOSUM50' (when AlphabetValue equals 'AA')

ScaleValuePositive value that specifies the scale factor used to return the score in arbitrary units. If the scoring matrix information also provides a scale factor, then both are used.
GapOpenValuePositive integer specifying the penalty for opening a gap in the alignment. Default is 8.
ExtendedGapValuePositive integer specifying the penalty for extending a gap. Default is equal to GapOpenValue.

Return Values

DVector containing biological distances between each pair of sequences stored in the M elements of Seqs.

Description

D = seqpdist(Seqs) returns D, a vector containing biological distances between each pair of sequences stored in the M sequences of Seqs, a cell array of sequences, a vector of structures, or a matrix or sequences.

D is a 1-by-(M*(M-1)/2) row vector corresponding to the M*(M-1)/2 pairs of sequences in Seqs. The output D is arranged in the order ((2,1),(3,1),..., (M,1),(3,2),...(M,2),.....(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).

D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...) calls seqpdist with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:


D = seqpdist(Seqs, ...'Method', MethodValue, ...)
specifies a method to compute distances between every pair of sequences. Choices are shown in the following tables.

Methods for Nucleotides and Amino Acids

MethodDescription
p-distanceProportion of sites at which the two sequences are different. p is close to 1 for poorly related sequences, and p is close to 0 for similar sequences.
d = p
Jukes-Cantor (default)

Maximum likelihood estimate of the number of substitutions between two sequences. p is described with the method p-distance.

For nucleotides:

d = -3/4 log(1-p * 4/3)

For amino acids:

d = -19/20 log(1-p * 20/19)
alignment-scoreDistance (d) between two sequences (1, 2) is computed from the pairwise alignment score between the two sequences (score12), and the pairwise alignment score between each sequence and itself (score11, score22) as follows:
d = (1-score12/score11)* (1-score12/score22)    
This option does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where the score between sequences is greater than the score when aligning a sequence with itself, then d = 0.

Methods with No Scoring of Gaps (Nucleotides Only)

MethodDescription
Tajima-NeiMaximum likelihood estimate considering the background nucleotide frequencies. It can be computed from the input sequences or given by setting Optargs to [gA gC gG gT]. gA, gC, gG, gT are scalar values for the nucleotide frequencies.
KimuraConsiders separately the transitional nucleotide substitution and the transversional nucleotide substitution.
TamuraConsiders separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. GC content can be computed from the input sequences or given by setting Optargs to the proportion of GC content (scalar value form 0 to 1).
HasegawaConsiders separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the Optargs property to [gA gC gG gT].
Nei-TamuraConsiders separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the Optargs property to [gA gC gG gT].

Methods with No Scoring of Gaps (Amino Acids Only)

MethodDescription
PoissonAssumes that the number of amino acid substitutions at each site has a Poisson distribution.
GammaAssumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter a. You can set a by using the Optargs property. Default is 2.

You can also specify a user-defined distance function using @, for example, @distfun. The distance function must be of the form:

function D = distfun(S1, S2, OptArgsValue)

The distfun function takes the following arguments:

The distfun function returns a scalar that represents the distance between S1 and S2.

D = seqpdist(Seqs, ...'Indels', IndelsValue, ...) specifies how to treat sites with gaps. Choices are:

D = seqpdist(Seqs, ...'Optargs', OptargsValue, ...) passes one or more arguments required or accepted by the distance method specified by the Method property. Use a string or cell array to pass one or multiple input arguments. For example, you can provide the nucleotide frequencies for the Tajima-Nei distance method, instead of computing them from the input sequences.

D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...) controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Default is:

D = seqpdist(Seqs, ...'JobManager', JobManagerValue, ...) distributes pairwise alignments into a cluster of computers using the Parallel Computing Toolbox software. JobManagerValue is a jobmanager object such as returned by the Parallel Computing Toolbox function findResource, that represents an available distributed MATLAB resource. You must have the Parallel Computing Toolbox software to use this property.

D = seqpdist(Seqs, ...'WaitInQueue', WaitInQueueValue, ...) controls whether seqpdist waits for a distributed MATLAB resource to be available when you have set the JobManager property. When WaitInQueueValue is true, seqpdist waits in the job manager queue for an available worker. When WaitInQueueValue is false (default) and there are no workers immediately available, seqpdist stops and displays an error message. You must have the Parallel Computing Toolbox™ software and have also set the JobManager property to use this property.

D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...), controls the conversion of the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Choices are true or false (default). Setting Squareform to true is the same as using the squareform function in the Statistics Toolbox™ software.

D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...) specifies the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).

The remaining input properties are available when the Method property equals 'alignment-score' or the PairwiseAlignment property equals true.

D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...) specifies the scoring matrix to use for the global pairwise alignment. Default is:

D = seqpdist(Seqs, ...'Scale', ScaleValue, ...) specifies the scale factor used to return the score in arbitrary units. Choices are any positive value. If the scoring matrix information also provides a scale factor, then both are used.

D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...) specifies the penalty for opening a gap in the alignment. Choices are any positive integer. Default is 8.

D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...) specifies the penalty for extending a gap in the alignment. Choices are any positive integer. Default is equal to GapOpenValue.

Examples

  1. Read amino acids alignment data into a MATLAB structure.

    seqs = fastaread('pf00002.fa');
  2. For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.

    dist = seqpdist(seqs,'Method','alignment-score',...
                    'Indels','pairwise-delete',...
                    'ScoringMatrix','pam250');
  3. Force the realignment of every pair of sequences ignoring the provided multiple alignment.

    dist = seqpdist(seqs,'Method','alignment-score',...
                    'Indels','pairwise-delete',...
                    'ScoringMatrix','pam250',...
                    'PairwiseAlignment',true);
  4. Measure the 'Jukes-Cantor' pairwise distances after realigning every pair of sequences, counting the gaps as point mutations.

     dist = seqpdist(seqs,'Method','jukes-cantor',...
                     'Indels','score',...
                     'Scoringmatrix','pam250',...
                     'PairwiseAlignment',true);

See Also

Bioinformatics Toolbox™ functions: fastaread, dnds, dndsml, multialign, nwalign, phytree (object constructor), seqlinkage

Bioinformatics Toolbox object: phytree object

Bioinformatics Toolbox method of a phytree object: pdist

  


 © 1984-2008- The MathWorks, Inc.    -   Site Help   -   Patents   -   Trademarks   -   Privacy Policy   -   Preventing Piracy   -   RSS