# pdist2

Pairwise distance between two sets of observations

## Syntax

D = pdist2(X,Y)
D = pdist2(X,Y,distance)
D = pdist2(X,Y,'minkowski',P)
D = pdist2(X,Y,'mahalanobis',C)
D = pdist2(X,Y,distance,'Smallest',K)
D = pdist2(X,Y,distance,'Largest',K)
[D,I] = pdist2(X,Y,distance,'Smallest',K)
[D,I] = pdist2(X,Y,distance,'Largest',K)

## Description

D = pdist2(X,Y) returns a matrix D containing the Euclidean distances between each pair of observations in the mx-by-n data matrix X and the my-by-n data matrix Y. Rows of X and Y correspond to observations; columns correspond to variables. D is an mx-by-my matrix, with the (i,j) entry equal to the distance between observation i in X and observation j in Y. The (i,j) entry is NaN if observation i in X or observation j in Y contains NaNs.
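For example, the following sketch (with illustrative data, not from this reference page) shows the shape of D and the NaN propagation:

```
% Sketch: pairwise Euclidean distances between two small data sets
% (illustrative values).
X = [0 0; 3 4];         % mx = 2 observations, n = 2 variables
Y = [0 0; 0 4; 1 NaN];  % my = 3 observations; the last row contains a NaN
D = pdist2(X,Y);        % D is a 2-by-3 matrix
% D(2,1) = norm([3 4] - [0 0]) = 5, and D(:,3) is NaN because Y(3,:) has a NaN
```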

D = pdist2(X,Y,distance) computes D using the metric specified by distance. Choices are:

| Metric | Description |
| --- | --- |
| 'euclidean' | Euclidean distance (default). |
| 'seuclidean' | Standardized Euclidean distance. Each coordinate difference between rows in X and Y is scaled by dividing by the corresponding element of the standard deviation computed from X, S = nanstd(X). To specify another value for S, use D = pdist2(X,Y,'seuclidean',S). |
| 'cityblock' | City block metric. |
| 'minkowski' | Minkowski distance. The default exponent is 2. To compute the distance with a different exponent, use D = pdist2(X,Y,'minkowski',P), where the exponent P is a positive scalar. |
| 'chebychev' | Chebychev distance (maximum coordinate difference). |
| 'mahalanobis' | Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist2(X,Y,'mahalanobis',C), where the matrix C is symmetric and positive definite. |
| 'cosine' | One minus the cosine of the included angle between points (treated as vectors). |
| 'correlation' | One minus the sample correlation between points (treated as sequences of values). |
| 'spearman' | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
| 'hamming' | Hamming distance, the percentage of coordinates that differ. |
| 'jaccard' | One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ. |
| function | A custom distance function specified using @, for example D = pdist2(X,Y,@distfun). |

A distance function must be of the form

`function D2 = distfun(ZI, ZJ)`

taking as arguments a 1-by-n vector ZI containing a single observation from X or Y, and an m2-by-n matrix ZJ containing multiple observations from X or Y, and returning an m2-by-1 vector of distances D2 whose Jth element is the distance between the observations ZI and ZJ(J,:).

If your data is not sparse, it is generally faster to use a built-in distance metric than a function handle.
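As a sketch, a handle of the required form that reproduces the built-in city block metric might look like the following (the function name cityblockfun is illustrative, not part of the toolbox):

```
% Sketch: a custom distance function of the required form, reproducing
% the city block metric (the function name is illustrative).
function D2 = cityblockfun(ZI, ZJ)
% ZI is 1-by-n (one observation); ZJ is m2-by-n; D2 is m2-by-1.
D2 = sum(abs(bsxfun(@minus, ZJ, ZI)), 2);
end
```

With this function on the path, pdist2(X,Y,@cityblockfun) should match pdist2(X,Y,'cityblock') up to round-off.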

D = pdist2(X,Y,distance,'Smallest',K) returns a K-by-my matrix D containing the K smallest pairwise distances to observations in X for each observation in Y. pdist2 sorts the distances in each column of D in ascending order. D = pdist2(X,Y,distance,'Largest',K) returns the K largest pairwise distances sorted in descending order. If K is greater than mx, pdist2 returns an mx-by-my distance matrix. For each observation in Y, pdist2 finds the K smallest or largest distances by computing and comparing the distance values to all the observations in X.

[D,I] = pdist2(X,Y,distance,'Smallest',K) returns a K-by-my matrix I containing indices of the observations in X corresponding to the K smallest pairwise distances in D. [D,I] = pdist2(X,Y,distance,'Largest',K) returns indices corresponding to the K largest pairwise distances.
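A sketch of a nearest-neighbor lookup using these options (the data is illustrative):

```
% Sketch: find, for each observation in Y, its 3 nearest observations in X.
X = randn(100,5);
Y = randn(25,5);
[D,I] = pdist2(X,Y,'euclidean','Smallest',3);
% D is 3-by-25: D(:,j) holds the 3 smallest distances from Y(j,:) to the
% rows of X, sorted in ascending order; I(:,j) holds the matching row
% indices, so X(I(1,j),:) is the nearest neighbor of Y(j,:).
```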

### Metrics

Given an mx-by-n data matrix X, treated as mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, treated as my (1-by-n) row vectors y1, y2, ..., ymy, the various distances between the vectors xs and yt are defined as follows:

• Euclidean distance

$d_{st}^{2}=\left(x_{s}-y_{t}\right)\left(x_{s}-y_{t}\right)^{\prime}$

Notice that the Euclidean distance is a special case of the Minkowski metric, where p=2.

• Standardized Euclidean distance

$d_{st}^{2}=\left(x_{s}-y_{t}\right)V^{-1}\left(x_{s}-y_{t}\right)^{\prime}$

where V is the n-by-n diagonal matrix whose jth diagonal element is $S(j)^{2}$, where S is the vector of standard deviations.

• Mahalanobis distance

$d_{st}^{2}=\left(x_{s}-y_{t}\right)C^{-1}\left(x_{s}-y_{t}\right)^{\prime}$

where C is the covariance matrix.

• City block metric

${d}_{st}=\sum _{j=1}^{n}|{x}_{sj}-{y}_{tj}|$

Notice that the city block distance is a special case of the Minkowski metric, where p=1.

• Minkowski metric

${d}_{st}=\sqrt[p]{\sum _{j=1}^{n}{|{x}_{sj}-{y}_{tj}|}^{p}}$

Notice that for the special case of p = 1, the Minkowski metric gives the City Block metric, for the special case of p = 2, the Minkowski metric gives the Euclidean distance, and for the special case of p=∞, the Minkowski metric gives the Chebychev distance.

• Chebychev distance

${d}_{st}={\mathrm{max}}_{j}\left\{|{x}_{sj}-{y}_{tj}|\right\}$

Notice that the Chebychev distance is a special case of the Minkowski metric, where p=∞.

• Cosine distance

$d_{st}=1-\frac{x_{s}y_{t}^{\prime}}{\sqrt{\left(x_{s}x_{s}^{\prime}\right)\left(y_{t}y_{t}^{\prime}\right)}}$

• Correlation distance

${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime }}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime }}\sqrt{\left({y}_{t}-{\overline{y}}_{t}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime }}}$

where

${\overline{x}}_{s}=\frac{1}{n}\sum _{j}{x}_{sj}$ and

${\overline{y}}_{t}=\frac{1}{n}\sum _{j}{y}_{tj}$

• Hamming distance

$d_{st}=\#\left(x_{sj}\ne y_{tj}\right)/n$

• Jaccard distance

$d_{st}=\frac{\#\left[\left(x_{sj}\ne y_{tj}\right)\cap\left(\left(x_{sj}\ne 0\right)\cup\left(y_{tj}\ne 0\right)\right)\right]}{\#\left[\left(x_{sj}\ne 0\right)\cup\left(y_{tj}\ne 0\right)\right]}$

• Spearman distance

${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime }}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}}$

where

• rsj is the rank of xsj taken over x1j, x2j, ..., xmx,j, as computed by tiedrank

• rtj is the rank of ytj taken over y1j, y2j, ..., ymy,j, as computed by tiedrank

• rs and rt are the coordinate-wise rank vectors of xs and yt, i.e., rs = (rs1, rs2, ..., rsn) and rt = (rt1, rt2, ..., rtn)

• ${\overline{r}}_{s}=\frac{1}{n}\sum _{j}{r}_{sj}=\frac{\left(n+1\right)}{2}$

• ${\overline{r}}_{t}=\frac{1}{n}\sum _{j}{r}_{tj}=\frac{\left(n+1\right)}{2}$
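The Minkowski special cases noted above can be checked numerically; a sketch (illustrative data, with comparisons up to floating-point round-off):

```
% Sketch: the Minkowski metric at p = 1, 2, Inf matches the named metrics
% (up to round-off; the data is illustrative).
X = randn(4,3);
Y = randn(6,3);
err1 = max(max(abs(pdist2(X,Y,'minkowski',1)   - pdist2(X,Y,'cityblock'))));
err2 = max(max(abs(pdist2(X,Y,'minkowski',2)   - pdist2(X,Y,'euclidean'))));
errI = max(max(abs(pdist2(X,Y,'minkowski',Inf) - pdist2(X,Y,'chebychev'))));
% err1, err2, and errI should all be zero or negligibly small
```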

## Examples

Generate random data and compute the unweighted Euclidean distance; then compute weighted distances in two different ways:

```
% Compute the ordinary Euclidean distance
X = randn(100, 5);
Y = randn(25, 5);
D = pdist2(X,Y,'euclidean');    % Euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation
Dstd = pdist2(X,Y,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently
Wgts = [.1 .3 .3 .2 .1];
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist2(X,Y, @(Xi,Xj) weuc(Xi,Xj,Wgts));
```