
# pdist

Pairwise distance between pairs of objects

## Syntax

D = pdist(X)
D = pdist(X,distance)

## Description

D = pdist(X) computes the Euclidean distance between pairs of objects in the m-by-n data matrix X. Rows of X correspond to observations, and columns correspond to variables. D is a row vector of length m(m-1)/2, with one element for each pair of observations in X. The distances are arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m-1). D is commonly used as a dissimilarity matrix in clustering or multidimensional scaling.

To save space and computation time, D is formatted as a vector. However, you can convert this vector into a square matrix using the squareform function so that element i, j in the matrix, where i < j, corresponds to the distance between objects i and j in the original data set.
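As a small illustration of the vector and matrix forms, the following sketch uses three made-up points in the plane; the data values and variable names are chosen only for this example.

```matlab
% Three points in the plane (values chosen for illustration).
X = [0 0; 3 4; 6 8];

D = pdist(X);         % 1-by-3 vector holding pairs (2,1), (3,1), (3,2)
Z = squareform(D);    % 3-by-3 symmetric matrix with zeros on the diagonal

d12 = Z(1,2);         % distance between observations 1 and 2
Dback = squareform(Z);  % squareform also converts the matrix back to a vector
```

Because Z is symmetric, Z(1,2) and Z(2,1) hold the same distance.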

D = pdist(X,distance) computes the distance between objects in the data matrix, X, using the method specified by distance, which can be any of the following character strings or a handle to a custom distance function.

| Metric | Description |
|---|---|
| `'euclidean'` | Euclidean distance (default). |
| `'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between rows in X is scaled by dividing by the corresponding element of the standard deviation S=nanstd(X). To specify another value for S, use D=pdist(X,'seuclidean',S). |
| `'cityblock'` | City block metric. |
| `'minkowski'` | Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',P), where P is a positive scalar exponent. |
| `'chebychev'` | Chebychev distance (maximum coordinate difference). |
| `'mahalanobis'` | Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist(X,'mahalanobis',C), where the matrix C is symmetric and positive definite. |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
| `'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
| `'hamming'` | Hamming distance, which is the fraction of coordinates that differ. |
| `'jaccard'` | One minus the Jaccard coefficient, which is the fraction of nonzero coordinates that differ. |
| custom distance function | A distance function specified using @, as in `D = pdist(X,@distfun)`. |

A custom distance function must be of the form

`d2 = distfun(XI,XJ)`

taking as arguments a 1-by-n vector XI, corresponding to a single row of X, and an m2-by-n matrix XJ, corresponding to multiple rows of X. distfun must accept a matrix XJ with an arbitrary number of rows. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).
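The following is a minimal sketch of a function with the required signature. The name cityfun and the data are hypothetical; the function reimplements the city block distance by hand so that its output can be compared with the built-in 'cityblock' metric.

```matlab
% Hypothetical custom distance: city block distance written by hand.
% XI is 1-by-n, XJ is m2-by-n, and the result is m2-by-1.
cityfun = @(XI,XJ) sum(abs(bsxfun(@minus, XJ, XI)), 2);

X = [1 2; 4 6; 7 1];               % sample data, for illustration only

Dcustom  = pdist(X, cityfun);      % distances from the custom function
Dbuiltin = pdist(X, 'cityblock');  % same distances from the built-in metric
```

Summing along the second dimension is what produces one distance per row of XJ, which is the shape the form above requires.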

The output D is arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m-1), i.e., the lower left triangle of the full m-by-m distance matrix in column order. To get the distance between the ith and jth observations (i < j), either use the formula D((i-1)*(m-i/2)+j-i), or use the helper function Z = squareform(D), which returns an m-by-m square symmetric matrix, with the (i,j) entry equal to the distance between observation i and observation j.
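The index formula can be checked against squareform. This sketch uses magic(5) purely as sample data, and the pair (i,j) = (2,4) is an arbitrary choice.

```matlab
X = magic(5);                  % 5 observations, used only as sample data
m = size(X,1);

D = pdist(X);                  % vector form, length m*(m-1)/2
Z = squareform(D);             % matrix form

i = 2;  j = 4;                 % any pair with i < j
k = (i-1)*(m - i/2) + j - i;   % position of pair (i,j) in D
dij = D(k);                    % same value as Z(i,j)
```

For m = 5 the ordering is (2,1), (3,1), (4,1), (5,1), (3,2), (4,2), ..., so the pair (4,2) sits at position 6, which is the value the formula gives for k.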

### Metrics

Given an m-by-n data matrix X, which is treated as m (1-by-n) row vectors $x_1, x_2, \ldots, x_m$, the various distances between the vectors $x_s$ and $x_t$ are defined as follows:

• Euclidean distance

${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){\left({x}_{s}-{x}_{t}\right)}^{\prime }$

Notice that the Euclidean distance is a special case of the Minkowski metric, where p = 2.

• Standardized Euclidean distance

${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){V}^{-1}{\left({x}_{s}-{x}_{t}\right)}^{\prime }$

where V is the n-by-n diagonal matrix whose jth diagonal element is $S(j)^2$, where S is the vector of standard deviations.

• Mahalanobis distance

${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){C}^{-1}{\left({x}_{s}-{x}_{t}\right)}^{\prime }$

where C is the covariance matrix.

• City block metric

${d}_{st}=\sum _{j=1}^{n}|{x}_{sj}-{x}_{tj}|$

Notice that the city block distance is a special case of the Minkowski metric, where p=1.

• Minkowski metric

${d}_{st}=\sqrt[p]{\sum _{j=1}^{n}{|{x}_{sj}-{x}_{tj}|}^{p}}$

Notice that for the special case of p = 1, the Minkowski metric gives the city block metric, for the special case of p = 2, the Minkowski metric gives the Euclidean distance, and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.

• Chebychev distance

${d}_{st}={\mathrm{max}}_{j}\left\{|{x}_{sj}-{x}_{tj}|\right\}$

Notice that the Chebychev distance is a special case of the Minkowski metric, where p = ∞.

• Cosine distance

${d}_{st}=1-\frac{{x}_{s}{{x}^{\prime }}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime }}_{s}\right)\left({x}_{t}{{x}^{\prime }}_{t}\right)}}$

• Correlation distance

${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime }}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime }}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime }}}$

where

${\overline{x}}_{s}=\frac{1}{n}\sum _{j}{x}_{sj}$ and ${\overline{x}}_{t}=\frac{1}{n}\sum _{j}{x}_{tj}$

• Hamming distance

${d}_{st}=\frac{\#\left({x}_{sj}\ne {x}_{tj}\right)}{n}$

• Jaccard distance

${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}$

• Spearman distance

${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime }}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}}$

where

• $r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mj}$, as computed by tiedrank

• $r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $x_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$

• ${\overline{r}}_{s}=\frac{1}{n}\sum _{j}{r}_{sj}=\frac{\left(n+1\right)}{2}$

• ${\overline{r}}_{t}=\frac{1}{n}\sum _{j}{r}_{tj}=\frac{\left(n+1\right)}{2}$
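Several of the definitions above can be evaluated by hand for one pair of rows and compared with the first element of the pdist output, which corresponds to the pair (2,1). The following is a sketch under stated assumptions: the data, the variable names, and the choice of metrics to check are all illustrative, and the data contain no NaN values, so std and cov stand in for nanstd and nancov.

```matlab
% Sample data, chosen only for illustration (5 observations, 3 variables).
X  = [1 2 3; 2 1 0; 4 4 1; 0 3 5; 3 0 2];
xs = X(1,:);  xt = X(2,:);   % pair (2,1) is the first element of D
df = xs - xt;

% Standardized Euclidean and Mahalanobis: quadratic forms in V^-1 and C^-1.
V    = diag(std(X).^2);      % no NaNs here, so std matches nanstd
C    = cov(X);               % no NaNs here, so cov matches nancov
dse  = sqrt(df / V * df');
dmah = sqrt(df / C * df');

% Minkowski with p = 1 (city block) and p = 2 (Euclidean), and Chebychev.
dp1  = sum(abs(df));
dp2  = sqrt(sum(df.^2));
dche = max(abs(df));

% Cosine and correlation distances.
dcos = 1 - (xs*xt') / sqrt((xs*xs')*(xt*xt'));
cs   = xs - mean(xs);  ct = xt - mean(xt);
dcor = 1 - (cs*ct') / (sqrt(cs*cs') * sqrt(ct*ct'));

% Hamming and Jaccard distances on a small binary example.
B    = [1 0 1 0 0; 1 1 0 0 0];
bs   = B(1,:);  bt = B(2,:);
nz   = (bs ~= 0) | (bt ~= 0);            % coordinates where either is nonzero
dham = sum(bs ~= bt) / numel(bs);        % 2 of 5 coordinates differ
djac = sum((bs ~= bt) & nz) / sum(nz);   % 2 of the 3 nonzero coordinates differ

% Absolute difference between each hand-computed value and pdist.
names  = {'seuclidean','mahalanobis','cityblock','euclidean', ...
          'chebychev','cosine','correlation'};
manual = [dse dmah dp1 dp2 dche dcos dcor];
err    = zeros(size(manual));
for k = 1:numel(names)
    Dk     = pdist(X, names{k});
    err(k) = abs(Dk(1) - manual(k));
end
errB = [abs(pdist(B,'hamming') - dham), abs(pdist(B,'jaccard') - djac)];
```

The entries of err and errB should be zero up to floating-point rounding. The Spearman distance is left out of this check.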

## Examples

Generate random data, compute the unweighted Euclidean distance, and then compute a weighted distance using two different methods:

```matlab
% Compute the ordinary Euclidean distance.
X = randn(100, 5);
D = pdist(X,'euclidean');  % Euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation.
Dstd = pdist(X,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently.
Wgts = [.1 .3 .3 .2 .1];     % coordinate weights
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wgts));
```
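The weighted distance can also be cross-checked without a function handle. From the standardized Euclidean definition, dividing each coordinate difference by S(j) = 1/sqrt(W(j)) multiplies its squared contribution by W(j), so passing that S to 'seuclidean' should reproduce the weighted result. The sketch below makes that comparison; the name Dalt and the smaller sample size are choices made only for this illustration.

```matlab
X    = randn(20, 5);             % smaller random sample, for illustration
Wgts = [.1 .3 .3 .2 .1];         % coordinate weights, as in the example above

% Weighted Euclidean distance through a function handle.
weuc = @(XI,XJ,W) sqrt(bsxfun(@minus,XI,XJ).^2 * W');
Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wgts));

% Same weighting through 'seuclidean' with an explicit scale vector S.
Dalt = pdist(X, 'seuclidean', 1./sqrt(Wgts));

maxdiff = max(abs(Dwgt - Dalt)); % expected to be at rounding level
```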