Pairwise distance between pairs of objects

`D = pdist(X)`

`D = pdist(X,distance)`

`D = pdist(X)` computes the Euclidean distance between pairs of objects in the *m*-by-*n* data matrix `X`. Rows of `X` correspond to observations, and columns correspond to variables. `D` is a row vector of length *m*(*m*–1)/2, corresponding to pairs of observations in `X`. The distances are arranged in the order (2,1), (3,1), ..., (*m*,1), (3,2), ..., (*m*,2), ..., (*m*,*m*–1). `D` is commonly used as a dissimilarity matrix in clustering or multidimensional scaling.
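As a minimal sketch of the output layout described above, the following uses a small made-up data matrix (the values are illustrative only) and checks the length of `D` and the pair that its first entry refers to.

```matlab
% Four observations (rows) in two variables (columns); illustrative values.
X = [0 0; 3 4; 6 8; 0 5];
m = size(X,1);

% Default metric is Euclidean.
D = pdist(X);

% D holds one distance per pair of observations: m*(m-1)/2 entries.
numPairs = m*(m-1)/2;

% The first entry of D is the (2,1) pair, i.e. rows 2 and 1 of X.
d21 = norm(X(2,:) - X(1,:));
```

Here `numel(D)` should equal `numPairs`, and `D(1)` should agree with `d21` up to floating-point rounding.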

To save space and computation time, `D` is formatted as a vector. However, you can convert this vector into a square matrix using the `squareform` function so that element (*i*,*j*) in the matrix, where *i* < *j*, corresponds to the distance between objects *i* and *j* in the original data set.
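The conversion between the vector and square forms might look like the sketch below. The data values are made up for illustration.

```matlab
% Illustrative data: four observations in two variables.
X = [0 0; 3 4; 6 8; 0 5];

D = pdist(X);          % vector form, length 4*3/2 = 6
Z = squareform(D);     % 4-by-4 symmetric matrix with a zero diagonal

% Distance between observations 2 and 3, read from the square form.
d23 = Z(2,3);

% squareform also converts a square distance matrix back to vector form.
Dback = squareform(Z);
```

The square form is convenient for lookups, while the vector form stores each distance only once.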

`D = pdist(X,distance)` computes the distance between objects in the data matrix, `X`, using the method specified by `distance`, which can be any of the following:

Metric | Description |
---|---|
`'euclidean'` | Euclidean distance (default). |
`'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
`'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between rows in `X` is scaled by dividing by the corresponding element of the standard deviation of `X`. |
`'cityblock'` | City block metric. |
`'minkowski'` | Minkowski distance. The default exponent is 2. To specify a different exponent, use `D = pdist(X,'minkowski',P)`, where `P` is a positive scalar exponent. |
`'chebychev'` | Chebychev distance (maximum coordinate difference). |
`'mahalanobis'` | Mahalanobis distance, using the sample covariance of `X`. |
`'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
`'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
`'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
`'hamming'` | Hamming distance, which is the percentage of coordinates that differ. |
`'jaccard'` | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |
custom distance function | A distance function specified using `@`. A distance function must be of the form `d2 = distfun(XI,XJ)`, taking as arguments a 1-by-*n* vector `XI`, corresponding to a single row of `X`, and an *m2*-by-*n* matrix `XJ`, corresponding to multiple rows of `X`. `distfun` must accept a matrix `XJ` with an arbitrary number of rows. `distfun` must return an *m2*-by-1 vector of distances `d2`, whose *k*th element is the distance between `XI` and `XJ(k,:)`. |
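To illustrate the custom distance function contract, the sketch below defines a hypothetical handle, `cityblockFun`, that reimplements the city block metric and compares it with the built-in option. The handle name and data values are assumptions made for this example. It also assumes the three-argument call `pdist(X,'minkowski',P)` for setting the exponent.

```matlab
% Illustrative data: four observations in three variables.
X = [1 2 3; 4 6 8; 0 1 1; 2 2 2];

% Hypothetical custom distance function (name chosen for this example).
% XI is a 1-by-n row, XJ is an m2-by-n matrix, and the result must be
% an m2-by-1 column vector of distances.
cityblockFun = @(XI,XJ) sum(abs(bsxfun(@minus,XJ,XI)),2);

Dcustom  = pdist(X,cityblockFun);    % custom function handle
Dbuiltin = pdist(X,'cityblock');     % built-in city block metric
Dmink1   = pdist(X,'minkowski',1);   % Minkowski with exponent 1

% All three should agree up to floating-point rounding.
maxDiffCustom = max(abs(Dcustom - Dbuiltin));
maxDiffMink   = max(abs(Dmink1  - Dbuiltin));
```

Summing along dimension 2 is what makes the handle return one distance per row of `XJ`, which is the shape `pdist` expects.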

The output `D` is arranged in the order (2,1), (3,1), ..., (*m*,1), (3,2), ..., (*m*,2), ..., (*m*,*m*–1), that is, the lower left triangle of the full *m*-by-*m* distance matrix in column order. To get the distance between the *i*th and *j*th observations (*i* < *j*), either use the formula `D((i-1)*(m-i/2)+j-i)`, or use the helper function `Z = squareform(D)`, which returns an *m*-by-*m* square symmetric matrix, with the (*i*,*j*) entry equal to the distance between observation *i* and observation *j*.
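The index formula can be checked against `squareform` with a short loop. This is only a sketch: the data is random, and `pairIndex` is a helper name introduced here for illustration.

```matlab
% Random illustrative data: six observations in three variables.
X = rand(6,3);
m = size(X,1);

D = pdist(X);
Z = squareform(D);

% Hypothetical helper: linear index into D for the pair (i,j), i < j.
pairIndex = @(i,j) (i-1)*(m-i/2) + j - i;

% Compare every pair looked up through the formula with the square form.
maxErr = 0;
for i = 1:m-1
    for j = i+1:m
        maxErr = max(maxErr, abs(D(pairIndex(i,j)) - Z(i,j)));
    end
end
```

Because (*i*–1)·*i* is always even, the formula yields an integer for every valid pair, so it can be used directly as an index.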

Given an *m*-by-*n* data matrix `X`, which is treated as *m* (1-by-*n*) row vectors $${x}_{1}$$, $${x}_{2}$$, ..., $${x}_{m}$$, the various distances between the vectors $${x}_{s}$$ and $${x}_{t}$$ are defined as follows:

Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t})({x}_{s}-{x}_{t}{)}^{\prime}$$

Notice that the Euclidean distance is a special case of the Minkowski metric, where `p` = 2.

Standardized Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t}){V}^{-1}({x}_{s}-{x}_{t}{)}^{\prime}$$

where `V` is the *n*-by-*n* diagonal matrix whose *j*th diagonal element is $$S{(j)}^{2}$$, where `S` is the vector of standard deviations.

Mahalanobis distance

$${d}_{st}^{2}=({x}_{s}-{x}_{t}){C}^{-1}({x}_{s}-{x}_{t}{)}^{\prime}$$

where `C` is the covariance matrix.

City block metric

$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{x}_{tj}\right|}$$

Notice that the city block distance is a special case of the Minkowski metric, where `p` = 1.

Minkowski metric

$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{x}_{tj}\right|}^{p}}}$$

Notice that for the special case of `p` = 1, the Minkowski metric gives the city block metric, for the special case of `p` = 2, the Minkowski metric gives the Euclidean distance, and for the special case of `p` = ∞, the Minkowski metric gives the Chebychev distance.

Chebychev distance

$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{x}_{tj}\right|\right\}$$

Notice that the Chebychev distance is a special case of the Minkowski metric, where `p` = ∞.

Cosine distance

$${d}_{st}=1-\frac{{x}_{s}{{x}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({x}_{t}{{x}^{\prime}}_{t}\right)}}$$

Correlation distance

$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}}$$

where

$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$ and $${\overline{x}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{tj}}$$

Hamming distance

$${d}_{st}=(\#({x}_{sj}\ne {x}_{tj})/n)$$

Jaccard distance

$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}$$

Spearman distance

$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}}$$

where

$${r}_{sj}$$ is the rank of $${x}_{sj}$$ taken over $${x}_{1j}$$, $${x}_{2j}$$, ..., $${x}_{mj}$$, as computed by `tiedrank`.

$${r}_{s}$$ and $${r}_{t}$$ are the coordinate-wise rank vectors of $${x}_{s}$$ and $${x}_{t}$$, i.e., $${r}_{s}=({r}_{s1},{r}_{s2},\dots ,{r}_{sn})$$.

$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$

$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$
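Several of the definitions above can be spot-checked by evaluating them by hand for one pair of rows and comparing with the first entry of the `pdist` output, which corresponds to the (2,1) pair. The following is a sketch with made-up data. It assumes the three-argument form `pdist(X,'minkowski',P)` for setting the exponent.

```matlab
% Illustrative data: three observations in four variables.
X  = [1 2 3 4; 2 0 5 1; 0 3 3 7];
xs = X(1,:);
xt = X(2,:);

% Hand-computed distances between rows 1 and 2, following the formulas.
dEuc  = sqrt((xs-xt)*(xs-xt)');                 % Euclidean
dCity = sum(abs(xs-xt));                        % city block
dMink = sum(abs(xs-xt).^3)^(1/3);               % Minkowski, p = 3
dCheb = max(abs(xs-xt));                        % Chebychev
dCos  = 1 - (xs*xt')/sqrt((xs*xs')*(xt*xt'));   % cosine
dHam  = sum(xs ~= xt)/numel(xs);                % Hamming

% Corresponding pdist results; element 1 is the (2,1) pair.
pEuc  = pdist(X,'euclidean');
pCity = pdist(X,'cityblock');
pMink = pdist(X,'minkowski',3);
pCheb = pdist(X,'chebychev');
pCos  = pdist(X,'cosine');
pHam  = pdist(X,'hamming');
```

Each hand-computed value should match the first element of the corresponding `pdist` result up to floating-point rounding.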

Generate random data and find the unweighted Euclidean distance and then find the weighted distance using two different methods:

    % Compute the ordinary Euclidean distance.
    X = randn(100, 5);
    D = pdist(X,'euclidean');  % euclidean distance

    % Compute the Euclidean distance with each coordinate
    % difference scaled by the standard deviation.
    Dstd = pdist(X,'seuclidean');

    % Use a function handle to compute a distance that weights
    % each coordinate contribution differently.
    Wgts = [.1 .3 .3 .2 .1];   % coordinate weights
    weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
    Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wgts));
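The two weighting approaches in this example are related. If the weights are chosen as the inverse sample variances of the columns of `X`, the weighted function handle should reproduce the `'seuclidean'` result. The sketch below checks this on a smaller random matrix. The variable `Wstd` and the fixed seed are choices made for this illustration only.

```matlab
rng(1);                      % fixed seed, for repeatability only
X = randn(20,5);

% Same weighted Euclidean handle as in the example.
weuc = @(XI,XJ,W) sqrt(bsxfun(@minus,XI,XJ).^2 * W');

% Weights equal to the inverse variance of each column (1-by-5 row).
Wstd = 1 ./ var(X);

Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wstd));
Dstd = pdist(X,'seuclidean');

% Largest discrepancy between the two results.
maxDiff = max(abs(Dwgt - Dstd));
```

`maxDiff` should be on the order of floating-point rounding error.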

`cluster` | `clusterdata` | `cmdscale` | `cophenet` | `dendrogram` | `inconsistent` | `linkage` | `pdist2` | `silhouette` | `squareform`
