Documentation |
Pairwise distance between pairs of objects
D = pdist(X)
D = pdist(X,distance)
D = pdist(X) computes the Euclidean distance between pairs of objects in m-by-n data matrix X. Rows of X correspond to observations, and columns correspond to variables. D is a row vector of length m(m–1)/2, corresponding to pairs of observations in X. The distances are arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1)). D is commonly used as a dissimilarity matrix in clustering or multidimensional scaling.
To save space and computation time, D is formatted as a vector. However, you can convert this vector into a square matrix using the squareform function so that element i, j in the matrix, where i < j, corresponds to the distance between objects i and j in the original data set.
D = pdist(X,distance) computes the distance between objects in the data matrix, X, using the method specified by distance, which can be any of the following character strings.
Metric | Description |
---|---|
'euclidean' | Euclidean distance (default). |
'seuclidean' | Standardized Euclidean distance. Each coordinate difference between rows in X is scaled by dividing by the corresponding element of the standard deviation S=nanstd(X). To specify another value for S, use D=pdist(X,'seuclidean',S). |
'cityblock' | City block metric. |
'minkowski' | Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',P), where P is a scalar positive value of the exponent. |
'chebychev' | Chebychev distance (maximum coordinate difference). |
'mahalanobis' | Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist(X,'mahalanobis',C), where the matrix C is symmetric and positive definite. |
'cosine' | One minus the cosine of the included angle between points (treated as vectors). |
'correlation' | One minus the sample correlation between points (treated as sequences of values). |
'spearman' | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
'hamming' | Hamming distance, which is the percentage of coordinates that differ. |
'jaccard' | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |
custom distance function | A distance function specified using @: A distance function must be of form d2 = distfun(XI,XJ) taking as arguments a 1-by-n vector XI, corresponding to a single row of X, and an m2-by-n matrix XJ, corresponding to multiple rows of X. distfun must accept a matrix XJ with an arbitrary number of rows. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:). |
The output D is arranged in the order of ((2,1),(3,1),..., (m,1),(3,2),...(m,2),.....(m,m–1)), i.e. the lower left triangle of the full m-by-m distance matrix in column order. To get the distance between the ith and jth observations (i < j), either use the formula D((i–1)*(m–i/2)+j–i), or use the helper function Z = squareform(D), which returns an m-by-m square symmetric matrix, with the (i,j) entry equal to distance between observation i and observation j.
Given an m-by-n data matrix X, which is treated as m (1-by-n) row vectors x_{1}, x_{2}, ..., x_{m}, the various distances between the vector x_{s} and x_{t} are defined as follows:
Euclidean distance
$${d}_{st}^{2}=({x}_{s}-{x}_{t})({x}_{s}-{x}_{t}{)}^{\prime}$$
Notice that the Euclidean distance is a special case of the Minkowski metric, where p = 2.
Standardized Euclidean distance
$${d}_{st}^{2}=({x}_{s}-{x}_{t}){V}^{-1}({x}_{s}-{x}_{t}{)}^{\prime}$$
where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^{2}, where S is the vector of standard deviations.
Mahalanobis distance
$${d}_{st}^{2}=({x}_{s}-{x}_{t}){C}^{-1}({x}_{s}-{x}_{t}{)}^{\prime}$$
where C is the covariance matrix.
City block metric
$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{x}_{tj}\right|}$$
Notice that the city block distance is a special case of the Minkowski metric, where p=1.
Minkowski metric
$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{x}_{tj}\right|}^{p}}}$$
Notice that for the special case of p = 1, the Minkowski metric gives the city block metric, for the special case of p = 2, the Minkowski metric gives the Euclidean distance, and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.
Chebychev distance
$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{x}_{tj}\right|\right\}$$
Notice that the Chebychev distance is a special case of the Minkowski metric, where p = ∞.
Cosine distance
$${d}_{st}=1-\frac{{x}_{s}{{x}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({x}_{t}{{x}^{\prime}}_{t}\right)}}$$
Correlation distance
$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime}}}$$
where
$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$ and $${\overline{x}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{tj}}$$
Hamming distance
$${d}_{st}=(\#({x}_{sj}\ne {x}_{tj})/n)$$
Jaccard distance
$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}$$
Spearman distance
$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}}$$
where
r_{sj} is the rank of x_{sj} taken over x_{1}_{j}, x_{2}_{j}, ...x_{mj}, as computed by tiedrank
r_{s} and r_{t} are the coordinate-wise rank vectors of x_{s} and x_{t}, i.e., r_{s} = (r_{s}_{1}, r_{s}_{2}, ... r_{sn})
$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$
$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$
Generate random data and find the unweighted Euclidean distance and then find the weighted distance using two different methods:
% Compute the ordinary Euclidean distance. X = randn(100, 5); D = pdist(X,'euclidean'); % euclidean distance % Compute the Euclidean distance with each coordinate % difference scaled by the standard deviation. Dstd = pdist(X,'seuclidean'); % Use a function handle to compute a distance that weights % each coordinate contribution differently. Wgts = [.1 .3 .3 .2 .1]; % coordinate weights weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W')); Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wgts));
cluster | clusterdata | cmdscale | cophenet | dendrogram | inconsistent | linkage | pdist2 | silhouette | squareform