pdist2
Pairwise distance between two sets of observations
D = pdist2(X,Y)
D = pdist2(X,Y,distance)
D = pdist2(X,Y,'minkowski',P)
D = pdist2(X,Y,'mahalanobis',C)
D = pdist2(X,Y,distance,'Smallest',K)
D = pdist2(X,Y,distance,'Largest',K)
[D,I] = pdist2(X,Y,distance,'Smallest',K)
[D,I] = pdist2(X,Y,distance,'Largest',K)
D = pdist2(X,Y) returns a matrix D containing the Euclidean distances between each pair of observations in the mx-by-n data matrix X and the my-by-n data matrix Y. Rows of X and Y correspond to observations; columns correspond to variables. D is an mx-by-my matrix, with the (i,j) entry equal to the distance between observation i in X and observation j in Y. The (i,j) entry is NaN if observation i in X or observation j in Y contains NaN values.
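The shape convention can be checked with a small numeric sketch. This is a NumPy transcription of the definition above, not the MATLAB built-in itself, and the data values are illustrative:

```python
import numpy as np

# Illustrative values only: a direct NumPy transcription of the
# Euclidean distance matrix that pdist2(X,Y) describes above.
X = np.array([[0.0, 0.0],
              [3.0, 4.0]])            # mx = 2 observations, n = 2
Y = np.array([[0.0, 0.0],
              [6.0, 8.0],
              [3.0, 0.0]])            # my = 3 observations

diff = X[:, None, :] - Y[None, :, :]  # mx-by-my-by-n coordinate differences
D = np.sqrt((diff ** 2).sum(axis=2))  # mx-by-my distance matrix

# D has shape (2, 3); D[1, 0] is the distance from the second row
# of X, (3, 4), to the first row of Y, (0, 0), which is 5.0.
```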
D = pdist2(X,Y,distance) computes D using the metric specified by distance. Choices are:
Metric | Description |
---|---|
'euclidean' | Euclidean distance (default). |
'seuclidean' | Standardized Euclidean distance. Each coordinate difference between rows in X and Y is scaled by dividing by the corresponding element of the standard deviation computed from X, S = nanstd(X). To specify another value for S, use D = pdist2(X,Y,'seuclidean',S). |
'cityblock' | City block metric. |
'minkowski' | Minkowski distance. The default exponent is 2. To compute the distance with a different exponent, use D = pdist2(X,Y,'minkowski',P), where the exponent P is a scalar positive value. |
'chebychev' | Chebychev distance (maximum coordinate difference). |
'mahalanobis' | Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist2(X,Y,'mahalanobis',C) where the matrix C is symmetric and positive definite. |
'cosine' | One minus the cosine of the included angle between points (treated as vectors). |
'correlation' | One minus the sample correlation between points (treated as sequences of values). |
'spearman' | One minus the sample Spearman's rank correlation between observations, treated as sequences of values. |
'hamming' | Hamming distance, the percentage of coordinates that differ. |
'jaccard' | One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ. |
function | A distance function specified using @. A distance function must be of the form D2 = distfun(ZI, ZJ), taking as arguments a 1-by-n vector ZI containing a single observation from X or Y and an m2-by-n matrix ZJ containing multiple observations from X or Y, and returning an m2-by-1 vector of distances D2, whose jth element is the distance between the observations ZI and ZJ(j,:). If your data is not sparse, it is generally faster to use a built-in distance than a function handle. |
D = pdist2(X,Y,distance,'Smallest',K) returns a K-by-my matrix D containing the K smallest pairwise distances to observations in X for each observation in Y. pdist2 sorts the distances in each column of D in ascending order. D = pdist2(X,Y,distance,'Largest',K) returns the K largest pairwise distances sorted in descending order. If K is greater than mx, pdist2 returns an mx-by-my distance matrix. For each observation in Y, pdist2 finds the K smallest or largest distances by computing and comparing the distance values to all the observations in X.
[D,I] = pdist2(X,Y,distance,'Smallest',K) returns a K-by-my matrix I containing indices of the observations in X corresponding to the K smallest pairwise distances in D. [D,I] = pdist2(X,Y,distance,'Largest',K) returns indices corresponding to the K largest pairwise distances.
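The 'Smallest' behavior — per column of D, keep the K smallest distances in sorted order together with the row indices of the matching observations in X — can be sketched in NumPy. Note that NumPy indices are 0-based, while the MATLAB output I is 1-based, and the distance matrix here is an assumed toy example:

```python
import numpy as np

# Hypothetical 4-by-2 distance matrix: mx = 4 observations in X,
# my = 2 observations in Y. Find the K = 2 smallest per column.
D = np.array([[4.0, 1.0],
              [2.0, 3.0],
              [9.0, 8.0],
              [1.0, 6.0]])
K = 2

I = np.argsort(D, axis=0)[:K, :]       # row indices of the K smallest
Dk = np.take_along_axis(D, I, axis=0)  # the K smallest, ascending per column

# Column 0 of D is [4, 2, 9, 1], so its two smallest are 1.0 (row 3)
# and 2.0 (row 1); column 1 gives 1.0 (row 0) and 3.0 (row 1).
```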
Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors x_{1}, x_{2}, ..., x_{mx}, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors y_{1}, y_{2}, ..., y_{my}, the various distances between the vectors x_{s} and y_{t} are defined as follows:
Euclidean distance
$${d}_{st}^{2}=({x}_{s}-{y}_{t})({x}_{s}-{y}_{t}{)}^{\prime}$$
Notice that the Euclidean distance is a special case of the Minkowski metric, where p=2.
Standardized Euclidean distance
$${d}_{st}^{2}=({x}_{s}-{y}_{t}){V}^{-1}({x}_{s}-{y}_{t}{)}^{\prime}$$
where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^{2}, where S is the vector of standard deviations.
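The two equivalent forms of this definition — dividing each coordinate difference by S(j), or multiplying by the inverse of the diagonal matrix V — can be checked numerically. A NumPy sketch with assumed toy values for x, y, and S:

```python
import numpy as np

# Standardized Euclidean: each coordinate difference divided by S(j).
# x, y, and S are illustrative values, not computed from real data.
x = np.array([1.0, 5.0])
y = np.array([3.0, 1.0])
S = np.array([2.0, 4.0])       # per-coordinate standard deviations

d_direct = np.sqrt((((x - y) / S) ** 2).sum())

Vinv = np.diag(1.0 / S ** 2)   # V^{-1}, with V = diag of S(j)^2
d_matrix = np.sqrt((x - y) @ Vinv @ (x - y))

# Both forms give the same value: ((-2/2)^2 + (4/4)^2)^(1/2) = sqrt(2).
```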
Mahalanobis distance
$${d}_{st}^{2}=({x}_{s}-{y}_{t}){C}^{-1}({x}_{s}-{y}_{t}{)}^{\prime}$$
where C is the covariance matrix.
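The quadratic form above can be evaluated directly. A NumPy sketch with an assumed symmetric positive definite C (with C equal to the identity, the expression reduces to the squared Euclidean distance):

```python
import numpy as np

# Mahalanobis distance with an assumed covariance matrix C
# (symmetric positive definite, as pdist2 requires).
C = np.array([[4.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 0.0])
y = np.array([0.0, 1.0])

d = np.sqrt((x - y) @ np.linalg.inv(C) @ (x - y))

# (x - y) = (2, -1), so d^2 = 2^2/4 + (-1)^2/1 = 2, i.e. d = sqrt(2).
```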
City block metric
$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{y}_{tj}\right|}$$
Notice that the city block distance is a special case of the Minkowski metric, where p=1.
Minkowski metric
$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{y}_{tj}\right|}^{p}}}$$
Notice that for the special case of p = 1, the Minkowski metric gives the city block metric; for the special case of p = 2, the Minkowski metric gives the Euclidean distance; and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.
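These three special cases can be verified numerically. A NumPy sketch with assumed toy vectors (the p = ∞ case is approached as p grows large):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([4.0, 2.0, 3.0])   # |x - y| = [3, 4, 0]

def minkowski(x, y, p):
    """Minkowski distance: p-th root of the sum of |x_j - y_j|^p."""
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

d1 = minkowski(x, y, 1)         # city block: 3 + 4 + 0 = 7
d2 = minkowski(x, y, 2)         # Euclidean: sqrt(9 + 16) = 5
dinf = np.abs(x - y).max()      # Chebychev: max difference = 4
# minkowski(x, y, p) approaches dinf as p grows large.
```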
Chebychev distance
$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{y}_{tj}\right|\right\}$$
Notice that the Chebychev distance is a special case of the Minkowski metric, where p=∞.
Cosine distance
$${d}_{st}=\left(1-\frac{{x}_{s}{{y}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({y}_{t}{{y}^{\prime}}_{t}\right)}}\right)$$
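The formula above is one minus the cosine of the angle between the two vectors. A NumPy sketch with assumed vectors at a 45-degree angle:

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])   # 45 degrees from x

d = 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# cos(45 deg) = 1/sqrt(2), so d = 1 - 1/sqrt(2); parallel vectors
# would give d = 0, opposite vectors d = 2.
```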
Correlation distance
$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({y}_{t}-{\overline{y}}_{t}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime}}}$$
where
$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$ and
$${\overline{y}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{y}_{tj}}$$
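The correlation distance is one minus the Pearson correlation of the two observations treated as sequences. A NumPy sketch with assumed perfectly correlated vectors, for which the distance is 0:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x: perfectly correlated

xc = x - x.mean()                    # center each observation
yc = y - y.mean()
d = 1.0 - (xc @ yc) / (np.sqrt(xc @ xc) * np.sqrt(yc @ yc))

# Perfect correlation gives d = 0; perfect anticorrelation gives d = 2.
```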
Hamming distance
$${d}_{st}=(\#({x}_{sj}\ne {y}_{tj})/n)$$
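The Hamming distance is simply the fraction of coordinates that differ, which in NumPy (with assumed toy vectors) is the mean of an elementwise inequality:

```python
import numpy as np

x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 0, 1])

d = np.mean(x != y)   # fraction of coordinates that differ: 2 of 4
```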
Jaccard distance
$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {y}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right]}$$
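The Jaccard formula restricts the Hamming count to coordinates where at least one of the two values is nonzero. A NumPy sketch mirroring the formula term by term, with assumed toy vectors:

```python
import numpy as np

x = np.array([1, 0, 2, 0, 3])
y = np.array([1, 0, 0, 4, 3])

either_nonzero = (x != 0) | (y != 0)          # denominator set: 4 coords
d = np.sum((x != y) & either_nonzero) / np.sum(either_nonzero)

# Of the 4 coordinates where either value is nonzero, 2 differ,
# so d = 2/4 = 0.5. (Coordinates where both are zero are ignored.)
```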
Spearman distance
$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}}$$
where
r_{sj} is the rank of x_{sj} taken over x_{1j}, x_{2j}, ..., x_{mx,j}, as computed by tiedrank
r_{tj} is the rank of y_{tj} taken over y_{1j}, y_{2j}, ..., y_{my,j}, as computed by tiedrank
r_{s} and r_{t} are the coordinate-wise rank vectors of x_{s} and y_{t}, i.e., r_{s} = (r_{s1}, r_{s2}, ..., r_{sn}) and r_{t} = (r_{t1}, r_{t2}, ..., r_{tn})
$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$
$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$
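Under the definitions above, the Spearman distance is one minus the Pearson correlation of the rank vectors. A NumPy sketch using assumed tie-free rank vectors directly (in the real computation, r_{s} and r_{t} come from tiedrank applied column-wise to X and Y):

```python
import numpy as np

# Assumed tie-free rank vectors for n = 4 coordinates, so each has
# mean rank (n + 1)/2 = 2.5 as in the identities above.
r_s = np.array([1.0, 2.0, 3.0, 4.0])
r_t = np.array([2.0, 1.0, 4.0, 3.0])

rs_c = r_s - r_s.mean()   # center the rank vectors
rt_c = r_t - r_t.mean()
d = 1.0 - (rs_c @ rt_c) / (np.sqrt(rs_c @ rs_c) * np.sqrt(rt_c @ rt_c))

# Here the rank correlation is 0.6, so the Spearman distance is 0.4.
```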
Generate random data and find the unweighted Euclidean distance, then find the weighted distance using two different methods:
% Compute the ordinary Euclidean distance
X = randn(100, 5);
Y = randn(25, 5);
D = pdist2(X,Y,'euclidean');      % euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation
Dstd = pdist2(X,Y,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently
Wgts = [.1 .3 .3 .2 .1];
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist2(X,Y, @(Xi,Xj) weuc(Xi,Xj,Wgts));
See also: createns | ExhaustiveSearcher | KDTreeSearcher | knnsearch | pdist