Pairwise distance between pairs of objects

`D = pdist(X)`

`D = pdist(X,distance)`

`D = pdist(X)` computes the Euclidean distance between pairs of objects in the *m*-by-*n* data matrix `X`. Rows of `X` correspond to observations, and columns correspond to variables. `D` is a row vector of length *m*(*m*–1)/2, corresponding to pairs of observations in `X`. The distances are arranged in the order (2,1), (3,1), ..., (*m*,1), (3,2), ..., (*m*,2), ..., (*m*,*m*–1). `D` is commonly used as a dissimilarity matrix in clustering or multidimensional scaling.

To save space and computation time, `D` is formatted as a vector. However, you can convert this vector into a square matrix using the `squareform` function, so that element (*i*,*j*) in the matrix, where *i* < *j*, corresponds to the distance between objects *i* and *j* in the original data set.
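As a quick sketch of this round trip, using a tiny hand-made data set for illustration:

```matlab
X = [0 0; 3 0; 0 4];   % three 2-D points
D = pdist(X);          % row vector [d(2,1) d(3,1) d(3,2)] = [3 4 5]
Z = squareform(D);     % 3-by-3 symmetric matrix with zero diagonal
d12 = Z(1,2);          % distance between objects 1 and 2, i.e. 3
```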

`D = pdist(X,distance)` computes the distance between objects in the data matrix `X`, using the method specified by `distance`, which can be any of the metrics in the following table.

Metric | Description |
---|---|
`'euclidean'` | Euclidean distance (default). |
`'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
`'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between rows in `X` is scaled by dividing by the corresponding element of the standard deviation. |
`'cityblock'` | City block metric. |
`'minkowski'` | Minkowski distance. The default exponent is 2. To specify a different exponent `P`, use `D = pdist(X,'minkowski',P)`. |
`'chebychev'` | Chebychev distance (maximum coordinate difference). |
`'mahalanobis'` | Mahalanobis distance, using the sample covariance of `X`. |
`'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
`'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
`'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
`'hamming'` | Hamming distance, which is the percentage of coordinates that differ. |
`'jaccard'` | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |
custom distance function | A distance function specified using `@`. A distance function must be of the form `d2 = distfun(XI,XJ)`, taking as arguments a 1-by-*n* vector `XI`, corresponding to a single row of `X`, and an *m*2-by-*n* matrix `XJ`, corresponding to multiple rows of `X`. `distfun` must accept a matrix `XJ` with an arbitrary number of rows, and must return an *m*2-by-1 vector of distances `d2`, whose *k*th element is the distance between `XI` and `XJ(k,:)`. |
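As a sketch of the custom-function calling convention, a city-block-style distance could be written as an anonymous function (`cityfun` is a hypothetical name for illustration): it takes a single row `XI` and a block of rows `XJ`, and returns one distance per row of `XJ`:

```matlab
% cityfun maps a 1-by-n row XI and an m2-by-n block XJ
% to an m2-by-1 vector of distances.
cityfun = @(XI,XJ) sum(abs(bsxfun(@minus, XJ, XI)), 2);

X  = [1 2; 4 6; 0 0];
D1 = pdist(X, cityfun);       % [7 3 10]
D2 = pdist(X, 'cityblock');   % same result as the built-in metric
```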

The output `D` is arranged in the order ((2,1), (3,1), ..., (*m*,1), (3,2), ..., (*m*,2), ..., (*m*,*m*–1)), i.e., the lower-left triangle of the full *m*-by-*m* distance matrix in column order. To get the distance between the *i*th and *j*th observations (*i* < *j*), either use the formula `D((i-1)*(m-i/2)+j-i)`, or use the helper function `Z = squareform(D)`, which returns an *m*-by-*m* square symmetric matrix whose (*i*,*j*) entry equals the distance between observation *i* and observation *j*.
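For example, both access methods should pick out the same pairwise distance (`i` and `j` here are arbitrary indices chosen for illustration):

```matlab
X = randn(6,3);
D = pdist(X);
Z = squareform(D);
m = size(X,1);
i = 2; j = 5;                    % any pair with i < j
dv = D((i-1)*(m-i/2) + j - i);   % entry in the vector form
dm = Z(i,j);                     % same entry in the square form
```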

Given an *m*-by-*n* data matrix `X`, which is treated as *m* (1-by-*n*) row vectors *x*_{1}, *x*_{2}, ..., *x*_{m}, the various distances between the vectors *x*_{s} and *x*_{t} are defined as follows:

Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t})({x}_{s}-{x}_{t}{)}^{\prime}$$

Notice that the Euclidean distance is a special case of the Minkowski metric, where `p` = 2.

Standardized Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t}){V}^{-1}({x}_{s}-{x}_{t}{)}^{\prime}$$

where `V` is the *n*-by-*n* diagonal matrix whose *j*th diagonal element is `S`(*j*)^{2}, where `S` is the vector of standard deviations.

Mahalanobis distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t}){C}^{-1}({x}_{s}-{x}_{t}{)}^{\prime}$$

where `C` is the covariance matrix.

City block metric

$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{x}_{tj}\right|}$$

Notice that the city block distance is a special case of the Minkowski metric, where `p` = 1.

Minkowski metric

$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{x}_{tj}\right|}^{p}}}$$

Notice that for the special case of `p` = 1, the Minkowski metric gives the city block metric; for the special case of `p` = 2, the Minkowski metric gives the Euclidean distance; and for the special case of `p` = ∞, the Minkowski metric gives the Chebychev distance.

Chebychev distance

$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{x}_{tj}\right|\right\}$$

Notice that the Chebychev distance is a special case of the Minkowski metric, where `p` = ∞.

Cosine distance

$${d}_{st}=1-\frac{{x}_{s}{{x}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({x}_{t}{{x}^{\prime}}_{t}\right)}}$$
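For instance, parallel vectors have cosine distance 0 and orthogonal vectors have cosine distance 1:

```matlab
X = [1 0; 2 0; 0 5];     % rows 1 and 2 are parallel; row 3 is orthogonal to both
D = pdist(X,'cosine');   % [d(2,1) d(3,1) d(3,2)]
% d(2,1) = 0 (parallel); d(3,1) = d(3,2) = 1 (orthogonal)
```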

Correlation distance

$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}}$$

where

$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$ and $${\overline{x}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{tj}}$$

Hamming distance

$${d}_{st}=(\#({x}_{sj}\ne {x}_{tj})/n)$$
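For example, two binary observations that differ in 2 of 4 coordinates have Hamming distance 0.5:

```matlab
X = [0 1 1 0; 0 1 0 1];   % the rows differ in coordinates 3 and 4
d = pdist(X,'hamming');   % 2/4 = 0.5
```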

Jaccard distance

$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}$$
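For example, coordinates where both observations are zero are ignored by the Jaccard distance:

```matlab
X = [1 0 2 0; 1 0 0 3];   % coordinate 2 is zero in both rows, so it is excluded
d = pdist(X,'jaccard');   % 2 differing of 3 nonzero coordinates: 2/3
```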

Spearman distance

$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}}$$

where

*r*_{sj} is the rank of *x*_{sj} taken over *x*_{1j}, *x*_{2j}, ..., *x*_{mj}, as computed by `tiedrank`

*r*_{s} and *r*_{t} are the coordinate-wise rank vectors of *x*_{s} and *x*_{t}, i.e., *r*_{s} = (*r*_{s1}, *r*_{s2}, ..., *r*_{sn})

$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$

$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$
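Given these definitions, the Spearman distance should agree with the correlation distance computed on the coordinate-wise ranks (a sketch; `tiedrank` ranks down each column, so the data is transposed to rank within each observation):

```matlab
X  = randn(10,6);
R  = tiedrank(X')';            % rank each observation's values
d1 = pdist(X,'spearman');
d2 = pdist(R,'correlation');   % agrees with d1 up to round-off
```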

Generate random data and find the unweighted Euclidean distance, and then find the weighted distance using two different methods:

```matlab
% Compute the ordinary Euclidean distance.
X = randn(100,5);
D = pdist(X,'euclidean');    % Euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation.
Dstd = pdist(X,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently.
Wgts = [.1 .3 .3 .2 .1];     % coordinate weights
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wgts));
```

`cluster` | `clusterdata` | `cmdscale` | `cophenet` | `dendrogram` | `inconsistent` | `linkage` | `pdist2` | `silhouette` | `squareform`
