Pairwise distance between pairs of observations

`D = pdist(X)`

`D = pdist(X,Distance)`

`D = pdist(X,Distance,DistParameter)`

`D = pdist(X,Distance,DistParameter)` returns the distance by using the method specified by `Distance` and `DistParameter`. You can specify `DistParameter` only when `Distance` is `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

Compute the Euclidean distance between pairs of observations, and convert the distance vector to a matrix using `squareform`.

Create a matrix with three observations and two variables.

rng('default') % For reproducibility
X = rand(3,2);

Compute the Euclidean distance.

D = pdist(X)

`D = `*1×3*
0.2954 1.0670 0.9448

The pairwise distances are arranged in the order (2,1), (3,1), (3,2). You can easily locate the distance between observations `i` and `j` by using `squareform`.

Z = squareform(D)

`Z = `*3×3*
0 0.2954 1.0670
0.2954 0 0.9448
1.0670 0.9448 0

`squareform` returns a symmetric matrix where `Z(i,j)` corresponds to the pairwise distance between observations `i` and `j`. For example, you can find the distance between observations 2 and 3.

Z(2,3)

ans = 0.9448

Pass `Z` to the `squareform` function to reproduce the output of the `pdist` function.

y = squareform(Z)

`y = `*1×3*
0.2954 1.0670 0.9448

The outputs `y` from `squareform` and `D` from `pdist` are the same.

Create a matrix with three observations and two variables.

rng('default') % For reproducibility
X = rand(3,2);

Compute the Minkowski distance with the default exponent 2.

`D1 = pdist(X,'minkowski')`

`D1 = `*1×3*
0.2954 1.0670 0.9448

Compute the Minkowski distance with an exponent of 1, which is equal to the city block distance.

`D2 = pdist(X,'minkowski',1)`

`D2 = `*1×3*
0.3721 1.5036 1.3136

`D3 = pdist(X,'cityblock')`

`D3 = `*1×3*
0.3721 1.5036 1.3136

Define a custom distance function that ignores coordinates with `NaN` values, and compute pairwise distance by using the custom distance function.

Create a matrix with three observations and two variables.

rng('default') % For reproducibility
X = rand(3,2);

Assume that the first element of the first observation is missing.

X(1,1) = NaN;

Compute the Euclidean distance.

D1 = pdist(X)

D1 = NaN NaN 0.9448

If observation `i` or `j` contains `NaN` values, the function `pdist` returns `NaN` for the pairwise distance between `i` and `j`. Therefore, `D1(1)` and `D1(2)`, the pairwise distances (2,1) and (3,1), are `NaN` values.

Define a custom distance function `naneucdist` that ignores coordinates with `NaN` values and returns the Euclidean distance.

function D2 = naneucdist(XI,XJ)
%NANEUCDIST Euclidean distance ignoring coordinates with NaNs
n = size(XI,2);
sqdx = (XI-XJ).^2;
nstar = sum(~isnan(sqdx),2); % Number of pairs that do not contain NaNs
nstar(nstar == 0) = NaN; % To return NaN if all pairs include NaNs
D2squared = nansum(sqdx,2).*n./nstar; % Correction for missing coordinates
D2 = sqrt(D2squared);

Compute the distance with `naneucdist` by passing the function handle as an input argument of `pdist`.

D2 = pdist(X,@naneucdist)

D2 = 0.3974 1.1538 0.9448

`X` — Input data (numeric matrix)

Input data, specified as a numeric matrix of size
*m*-by-*n*. Rows correspond to
individual observations, and columns correspond to individual
variables.

**Data Types:** `single` | `double`

`Distance` — Distance metric (character vector | string scalar | function handle)

Distance metric, specified as a character vector, string scalar, or function handle, as described in the following table.

Value | Description |
---|---|
`'euclidean'` | Euclidean distance (default). |
`'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
`'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, `nanstd(X)` by default. Use `DistParameter` to specify different scaling factors. |
`'mahalanobis'` | Mahalanobis distance using the sample covariance of `X`, `nancov(X)` by default. Use `DistParameter` to specify a different covariance matrix, which must be symmetric and positive definite. |
`'cityblock'` | City block distance. |
`'minkowski'` | Minkowski distance. The default exponent is 2. Use `DistParameter` to specify a different exponent, which must be a positive scalar. |
`'chebychev'` | Chebychev distance (maximum coordinate difference). |
`'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
`'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
`'hamming'` | Hamming distance, which is the percentage of coordinates that differ. |
`'jaccard'` | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |
`'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
`@distfun` | Custom distance function handle. A distance function has the form `function D2 = distfun(ZI,ZJ)`, followed by the calculation of the distance. `ZI` is a `1`-by-`n` vector containing a single observation. `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations. `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`. If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle. |

For definitions, see Distance Metrics.
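As a minimal sketch of the function-handle form described in the table, the following defines an anonymous function that reproduces the built-in Euclidean distance. The name `eucfun` is hypothetical and chosen only for illustration. The sketch assumes a MATLAB release with implicit expansion (R2016b or later), so that `ZI - ZJ` subtracts the single row `ZI` from every row of `ZJ`.

```matlab
% Small data set chosen so the distances are easy to verify by hand
% (3-4-5 right triangles).
X = [0 0; 3 4; 6 8];

% Hypothetical custom distance: ZI is 1-by-n, ZJ is m2-by-n,
% and the result is an m2-by-1 column vector of distances.
eucfun = @(ZI,ZJ) sqrt(sum((ZI - ZJ).^2, 2));

Dcustom  = pdist(X, eucfun);       % custom function handle
Dbuiltin = pdist(X, 'euclidean');  % built-in metric for comparison
```

Both calls are expected to return the distances 5, 10, and 5 for the pairs (2,1), (3,1), and (3,2).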

When you use `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`, you can specify an additional input argument `DistParameter` to control these metrics. You can also use these metrics in the same way as the other metrics with a default value of `DistParameter`.

**Example:** `'minkowski'`

`DistParameter` — Distance metric parameter values (positive scalar | numeric vector | numeric matrix)

Distance metric parameter values, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify `Distance` as `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

If `Distance` is `'seuclidean'`, `DistParameter` is a vector of scaling factors for each dimension, specified as a positive vector. The default value is `nanstd(X)`.

If `Distance` is `'minkowski'`, `DistParameter` is the exponent of Minkowski distance, specified as a positive scalar. The default value is 2.

If `Distance` is `'mahalanobis'`, `DistParameter` is a covariance matrix, specified as a numeric matrix. The default value is `nancov(X)`. `DistParameter` must be symmetric and positive definite.

**Example:** `'minkowski',3`

**Data Types:** `single` | `double`
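The following sketch illustrates how `DistParameter` interacts with the defaults described above. The data matrix is an assumption chosen so that the column standard deviations are exactly 3 and 4, which makes the explicit scaling vector `[3 4]` coincide with the default for `'seuclidean'`.

```matlab
X = [0 0; 3 4; 6 8];   % column standard deviations are 3 and 4

% 'seuclidean' with the default scaling and with explicit scaling factors.
Ddefault = pdist(X, 'seuclidean');
Dscaled  = pdist(X, 'seuclidean', [3 4]);  % same scaling as the default here

% 'minkowski' with exponent 1 matches the city block distance.
Dmink1 = pdist(X, 'minkowski', 1);
Dcity  = pdist(X, 'cityblock');
```

With this data, the scaled coordinate differences between observations 1 and 2 are (1,1), so the first standardized distance is `sqrt(2)`, and the Minkowski and city block results are both 7, 14, and 7.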

`D` — Pairwise distances (numeric row vector)

Pairwise distances, returned as a numeric row vector of length *m*(*m*–1)/2, corresponding to pairs of observations, where *m* is the number of observations in `X`.

The distances are arranged in the order (2,1), (3,1), ..., (*m*,1), (3,2), ..., (*m*,2), ..., (*m*,*m*–1), that is, the lower-left triangle of the *m*-by-*m* distance matrix in column order. The pairwise distance between observations *i* and *j* is in `D((i-1)*(m-i/2)+j-i)` for *i* < *j*.
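The index formula can be checked against `squareform` with a short loop. This is a sketch on an assumed four-observation data set; the variable names `k` and `ok` are illustrative only.

```matlab
X = [0 0; 3 4; 6 8; 20 20];   % m = 4 observations
m = size(X,1);
D = pdist(X);                 % 1-by-6 vector of pairwise distances
Z = squareform(D);            % 4-by-4 symmetric distance matrix

% Check the index formula for every pair with i < j.
ok = true;
for i = 1:m-1
    for j = i+1:m
        k = (i-1)*(m - i/2) + j - i;   % position of pair (i,j) in D
        ok = ok && (D(k) == Z(i,j));
    end
end
```

For *m* = 4, the pairs (1,2), (1,3), (1,4), (2,3), (2,4), and (3,4) map to positions 1 through 6, so `ok` is expected to remain `true`.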

You can convert `D` into a symmetric matrix by using the `squareform` function. `Z = squareform(D)` returns an *m*-by-*m* matrix where `Z(i,j)` corresponds to the pairwise distance between observations *i* and *j*.

If observation *i* or *j* contains `NaN`s, then the corresponding value in `D` is `NaN` for the built-in distance functions.

`D` is commonly used as a dissimilarity matrix in clustering or multidimensional scaling. For details, see Hierarchical Clustering and the function reference pages for `cmdscale`, `cophenet`, `linkage`, `mdscale`, and `optimalleaforder`. These functions take `D` as an input argument.
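As a brief sketch of this workflow, the output of `pdist` can be passed directly to `linkage` and `cophenet`. The data set and the choice of the `'average'` linkage method are assumptions made for illustration.

```matlab
X = [0 0; 3 4; 6 8; 20 20];    % four observations, two variables
D = pdist(X);                  % 1-by-6 distance vector

tree = linkage(D, 'average');  % (m-1)-by-3 hierarchical cluster tree
c    = cophenet(tree, D);      % cophenetic correlation coefficient
```

Here `tree` has three rows, one for each merge of the four observations, and `c` measures how faithfully the tree preserves the original pairwise distances.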

A distance metric is a function that defines a distance between two observations. `pdist` supports various distance metrics: Euclidean distance, standardized Euclidean distance, Mahalanobis distance, city block distance, Minkowski distance, Chebychev distance, cosine distance, correlation distance, Hamming distance, Jaccard distance, and Spearman distance.

Given an *m*-by-*n* data matrix `X`, which is treated as *m* (1-by-*n*) row vectors $x_1, x_2, \ldots, x_m$, the distances between the vectors $x_s$ and $x_t$ are defined as follows.

Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t})({x}_{s}-{x}_{t}{)}^{\prime}.$$

The Euclidean distance is a special case of the Minkowski distance, where *p* = 2.

Standardized Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t}){V}^{-1}({x}_{s}-{x}_{t}{)}^{\prime},$$

where *V* is the *n*-by-*n* diagonal matrix whose *j*th diagonal element is $(S(j))^2$, where *S* is a vector of scaling factors for each dimension.

Mahalanobis distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t}){C}^{-1}({x}_{s}-{x}_{t}{)}^{\prime},$$

where *C* is the covariance matrix.

City block distance

$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{x}_{tj}\right|}.$$

The city block distance is a special case of the Minkowski distance, where *p* = 1.

Minkowski distance

$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{x}_{tj}\right|}^{p}}}.$$

For the special case of *p* = 1, the Minkowski distance gives the city block distance. For the special case of *p* = 2, the Minkowski distance gives the Euclidean distance. For the special case of *p* = ∞, the Minkowski distance gives the Chebychev distance.

Chebychev distance

$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{x}_{tj}\right|\right\}.$$

The Chebychev distance is a special case of the Minkowski distance, where *p* = ∞.

Cosine distance

$${d}_{st}=1-\frac{{x}_{s}{{x}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({x}_{t}{{x}^{\prime}}_{t}\right)}}.$$
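The Minkowski special cases and the cosine formula above can be evaluated by hand and compared with `pdist`. This is a sketch on assumed two-observation data; the helper `mink` and the other variable names are hypothetical.

```matlab
% Minkowski special cases on a single pair of observations.
X = [0 0; 3 4];
d = abs(X(1,:) - X(2,:));       % coordinate differences: [3 4]
mink = @(p) sum(d.^p)^(1/p);    % Minkowski formula written out directly

D1   = pdist(X, 'minkowski', 1);  % equals mink(1), the city block distance
D2   = pdist(X, 'minkowski', 2);  % equals mink(2), the Euclidean distance
Dmax = pdist(X, 'chebychev');     % equals max(d)

% Cosine distance from its definition.
Y  = [1 0; 1 1];
xs = Y(1,:);
xt = Y(2,:);
dcosManual  = 1 - (xs*xt')/sqrt((xs*xs')*(xt*xt'));
dcosBuiltin = pdist(Y, 'cosine');
```

For this data the three Minkowski-family distances are 7, 5, and 4, and the cosine distance is `1 - 1/sqrt(2)`, because the two vectors in `Y` are 45 degrees apart.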

Correlation distance

$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}},$$

where

$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$ and $${\overline{x}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{tj}}$$.

Hamming distance

$${d}_{st}=(\#({x}_{sj}\ne {x}_{tj})/n).$$

Jaccard distance

$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}.$$
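The counting in the Hamming and Jaccard definitions can be made concrete with logical masks. The following sketch uses an assumed pair of observations and hypothetical variable names, and compares the hand-computed values with the built-in metrics.

```matlab
X  = [1 0 2 0; 1 1 0 0];
xs = X(1,:);
xt = X(2,:);
n  = size(X,2);

differ  = xs ~= xt;                % coordinates that differ
nonzero = (xs ~= 0) | (xt ~= 0);   % coordinates where either value is nonzero

hamManual = sum(differ)/n;                       % 2 of 4 coordinates differ
jacManual = sum(differ & nonzero)/sum(nonzero);  % 2 of 3 nonzero coordinates differ

hamBuiltin = pdist(X, 'hamming');
jacBuiltin = pdist(X, 'jaccard');
```

The fourth coordinate is zero in both observations, so it counts toward the Hamming denominator but is excluded from the Jaccard denominator, giving 0.5 and 2/3 respectively.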

Spearman distance

$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}},$$

where

$r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mj}$, as computed by `tiedrank`.

$r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $x_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$.

$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$.

$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$.

Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

The distance input argument value (`Distance`) must be a compile-time constant. For example, to use the Minkowski distance, include `coder.Constant('Minkowski')` in the `-args` value of `codegen`.

The distance input argument value (`Distance`) cannot be a custom distance function.

For code generation, `pdist` uses `parfor` (by default) to create loops that run in parallel on supported shared-memory multicore platforms. If your compiler does not support the Open Multiprocessing (OpenMP) application interface or you disable the OpenMP library, MATLAB® Coder™ treats the `parfor`-loops as `for`-loops. To find supported compilers, see Supported Compilers. To disable the OpenMP library, specify the `EnableOpenMP` property of the `codegen` configuration object as `false`. For details, see `coder.CodeConfig`.
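A minimal sketch of the compile-time constant requirement follows. The entry-point function `myPdist` and its file name are hypothetical, and the fixed-size example input `zeros(3,2)` is an assumption; adapt both to your own code.

```matlab
% File myPdist.m (hypothetical entry-point function)
function D = myPdist(X, distance) %#codegen
D = pdist(X, distance);
end
```

```matlab
% Generate code for a 3-by-2 double input, with the distance metric
% fixed at compile time through coder.Constant.
codegen myPdist -args {zeros(3,2), coder.Constant('minkowski')}
```

Because `distance` is declared with `coder.Constant`, the generated code is specialized for that one metric.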

Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

The `Distance` argument must be specified as a character vector.

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
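As a short sketch of GPU usage, assuming Parallel Computing Toolbox and a supported GPU are available, the input can be moved to the device with `gpuArray` and the result returned to host memory with `gather`. The data set is chosen only for illustration.

```matlab
X    = [0 0; 3 4; 6 8];
Xgpu = gpuArray(X);               % copy the data to the GPU
Dgpu = pdist(Xgpu, 'euclidean');  % Distance given as a character vector
D    = gather(Dgpu);              % copy the result back to host memory
```

The gathered result is expected to match `pdist(X,'euclidean')` computed on the CPU, which is 5, 10, and 5 for this data.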

