Pairwise distance between two sets of observations

`D = pdist2(X,Y)`

`D = pdist2(X,Y,distance)`

`D = pdist2(X,Y,'minkowski',P)`

`D = pdist2(X,Y,'mahalanobis',C)`

`D = pdist2(X,Y,distance,'Smallest',K)`

`D = pdist2(X,Y,distance,'Largest',K)`

`[D,I] = pdist2(X,Y,distance,'Smallest',K)`

`[D,I] = pdist2(X,Y,distance,'Largest',K)`

`D = pdist2(X,Y)` returns a matrix `D` containing the Euclidean distances between each pair of observations in the *mx*-by-*n* data matrix `X` and the *my*-by-*n* data matrix `Y`. Rows of `X` and `Y` correspond to observations; columns correspond to variables. `D` is an *mx*-by-*my* matrix, with the (*i*,*j*) entry equal to the distance between observation *i* in `X` and observation *j* in `Y`. The (*i*,*j*) entry is `NaN` if observation *i* in `X` or observation *j* in `Y` contains a `NaN` value.
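As a minimal sketch of the default call, the following uses two small hand-made matrices (the values are chosen only for illustration) to show the shape of `D` and the `NaN` behavior described above.

```matlab
% Two small data sets with the same number of columns (variables).
X = [0 0; 3 4; 1 1];      % mx = 3 observations, n = 2 variables
Y = [0 0; 6 8];           % my = 2 observations, n = 2 variables

D = pdist2(X, Y);         % default metric is Euclidean
size(D)                   % 3-by-2, that is, mx-by-my
D(2,1)                    % distance from X(2,:) = [3 4] to Y(1,:) = [0 0], which is 5

% A NaN in an observation makes every distance involving that observation NaN.
Xnan = X;
Xnan(3,1) = NaN;
Dnan = pdist2(Xnan, Y);
isnan(Dnan(3,:))          % both entries in row 3 are NaN
```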

`D = pdist2(X,Y,distance)` computes `D` using `distance`. Choices are:

Metric | Description |
---|---|
`'euclidean'` | Euclidean distance (default). |
`'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
`'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between rows in `X` and `Y` is scaled by the corresponding standard deviation. |
`'cityblock'` | City block metric. |
`'minkowski'` | Minkowski distance. The default exponent is 2. To compute the distance with a different exponent, use `D = pdist2(X,Y,'minkowski',P)`. |
`'chebychev'` | Chebychev distance (maximum coordinate difference). |
`'mahalanobis'` | Mahalanobis distance, using the sample covariance of `X`. To compute the distance with a different covariance, use `D = pdist2(X,Y,'mahalanobis',C)`. |
`'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
`'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
`'spearman'` | One minus the sample Spearman's rank correlation between observations, treated as sequences of values. |
`'hamming'` | Hamming distance, the percentage of coordinates that differ. |
`'jaccard'` | One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ. |
function | A distance function specified using `@` (for example, `@distfun`). A distance function must be of the form `function D2 = distfun(ZI, ZJ)`, taking as arguments a 1-by-*n* vector `ZI` containing a single observation from `X` or `Y`, and an *m2*-by-*n* matrix `ZJ` containing multiple observations from `X` or `Y`, and returning an *m2*-by-1 vector of distances `D2`, whose `J`th element is the distance between the observations `ZI` and `ZJ(J,:)`. If your data is not sparse, it is generally faster to use a built-in `distance` than a function handle. |
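The sketch below exercises three of the options in the table: a Minkowski exponent, a user-supplied covariance matrix, and a custom distance function. The name `cityfun`, the fixed random seed, and the choice of a pooled covariance for `C` are assumptions made for illustration only.

```matlab
rng(0);                              % fixed seed, only for repeatability
X = randn(10, 3);
Y = randn(4, 3);

% Minkowski distance with exponent P = 3.
D3 = pdist2(X, Y, 'minkowski', 3);

% Mahalanobis distance with a user-supplied covariance matrix C.
C = cov([X; Y]);                     % assumption: pooled covariance, chosen for illustration
Dm = pdist2(X, Y, 'mahalanobis', C);

% Custom distance: ZI is 1-by-n, ZJ is m2-by-n, and the result must be m2-by-1.
cityfun = @(ZI, ZJ) sum(abs(bsxfun(@minus, ZJ, ZI)), 2);   % hypothetical name
Dc = pdist2(X, Y, cityfun);

% The custom function should agree with the built-in 'cityblock' metric.
Db = pdist2(X, Y, 'cityblock');
maxdiff = max(abs(Dc(:) - Db(:)))
```

The custom handle reproduces a built-in metric here only so that the result is easy to check. In practice the built-in name is the faster choice for non-sparse data, as the table notes.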

`D = pdist2(X,Y,distance,'Smallest',K)` returns a `K`-by-*my* matrix `D` containing the `K` smallest pairwise distances to observations in `X` for each observation in `Y`. `pdist2` sorts the distances in each column of `D` in ascending order. `D = pdist2(X,Y,distance,'Largest',K)` returns the `K` largest pairwise distances sorted in descending order. If `K` is greater than *mx*, `pdist2` returns an *mx*-by-*my* distance matrix. For each observation in `Y`, `pdist2` finds the `K` smallest or largest distances by computing and comparing the distance values to all the observations in `X`.

`[D,I] = pdist2(X,Y,distance,'Smallest',K)` returns a `K`-by-*my* matrix `I` containing indices of the observations in `X` corresponding to the `K` smallest pairwise distances in `D`. `[D,I] = pdist2(X,Y,distance,'Largest',K)` returns indices corresponding to the `K` largest pairwise distances.
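A small sketch of the `'Smallest'` option with the index output, using hand-picked points (illustrative values, not from any data set) so that the nearest rows of `X` can be verified by inspection:

```matlab
X = [0 0; 1 0; 5 5; 10 10];   % mx = 4 reference observations
Y = [0.9 0; 6 6];             % my = 2 query observations

K = 2;
[D, I] = pdist2(X, Y, 'euclidean', 'Smallest', K);

% D and I are K-by-my. Column j lists, in ascending order of distance,
% the K rows of X that are closest to Y(j,:).
I(:,1)    % rows of X nearest to [0.9 0]: 2, then 1
I(:,2)    % rows of X nearest to [6 6]:   3, then 4
```

Replacing `'Smallest'` with `'Largest'` returns the farthest rows instead, sorted in descending order of distance.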

Given an *mx*-by-*n* data matrix `X`, which is treated as *mx* (1-by-*n*) row vectors $x_1, x_2, \ldots, x_{mx}$, and an *my*-by-*n* data matrix `Y`, which is treated as *my* (1-by-*n*) row vectors $y_1, y_2, \ldots, y_{my}$, the various distances between the vectors $x_s$ and $y_t$ are defined as follows:

Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{y}_{t})({x}_{s}-{y}_{t}{)}^{\prime}$$

Notice that the Euclidean distance is a special case of the Minkowski metric, where `p = 2`.

Standardized Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{y}_{t}){V}^{-1}({x}_{s}-{y}_{t}{)}^{\prime}$$

where `V` is the *n*-by-*n* diagonal matrix whose *j*th diagonal element is `S(j)^2`, where `S` is the vector of standard deviations.

Mahalanobis distance

$${d}_{st}^{2}=({x}_{s}-{y}_{t}){C}^{-1}({x}_{s}-{y}_{t}{)}^{\prime}$$

where `C` is the covariance matrix.

City block metric

$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{y}_{tj}\right|}$$

Notice that the city block distance is a special case of the Minkowski metric, where `p = 1`.

Minkowski metric

$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{y}_{tj}\right|}^{p}}}$$

Notice that for the special case of `p = 1`, the Minkowski metric gives the city block metric; for the special case of `p = 2`, the Minkowski metric gives the Euclidean distance; and for the special case of `p = ∞`, the Minkowski metric gives the Chebychev distance.

Chebychev distance

$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{y}_{tj}\right|\right\}$$

Notice that the Chebychev distance is a special case of the Minkowski metric, where `p = ∞`.

Cosine distance

$${d}_{st}=\left(1-\frac{{x}_{s}{{y}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({y}_{t}{{y}^{\prime}}_{t}\right)}}\right)$$

Correlation distance

$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({y}_{t}-{\overline{y}}_{t}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime}}}$$

where

$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$ and

$${\overline{y}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{y}_{tj}}$$
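To make the cosine and correlation formulas concrete, the following sketch evaluates both by hand for one pair of row vectors and compares the results with `pdist2`. The vectors `xs` and `yt` are arbitrary illustrative values.

```matlab
xs = [1 2 3 4];
yt = [2 1 4 3];

% Cosine distance, written out from the formula.
dcos = 1 - (xs * yt') / sqrt((xs * xs') * (yt * yt'));

% Correlation distance: center each vector by its own mean first.
xc = xs - mean(xs);
yc = yt - mean(yt);
dcorr = 1 - (xc * yc') / (sqrt(xc * xc') * sqrt(yc * yc'));

% Both differences should be zero up to floating-point round-off.
errcos  = abs(dcos  - pdist2(xs, yt, 'cosine'))
errcorr = abs(dcorr - pdist2(xs, yt, 'correlation'))
```

For these vectors the hand computation gives `dcos = 1 - 28/30` and `dcorr = 1 - 3/5`.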

Hamming distance

$${d}_{st}=(\#({x}_{sj}\ne {y}_{tj})/n)$$

Jaccard distance

$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {y}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right]}$$
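The Hamming and Jaccard definitions differ only in the denominator, which a short example makes visible. The binary vectors below are illustrative values; the manual expressions follow the two formulas term by term.

```matlab
xs = [1 0 1 0];
yt = [1 1 0 0];

differ  = (xs ~= yt);                 % coordinates 2 and 3 differ
nonzero = (xs ~= 0) | (yt ~= 0);      % coordinates 1, 2 and 3 are nonzero in xs or yt

dham_manual = sum(differ) / numel(xs);                 % 2/4
djac_manual = sum(differ & nonzero) / sum(nonzero);    % 2/3

dham = pdist2(xs, yt, 'hamming')      % 0.5
djac = pdist2(xs, yt, 'jaccard')      % 0.6667
```

Coordinate 4 is zero in both vectors, so it counts toward the Hamming denominator but is ignored by the Jaccard distance.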

Spearman distance

$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}}$$

where

$r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mx,j}$, as computed by `tiedrank`.

$r_{tj}$ is the rank of $y_{tj}$ taken over $y_{1j}, y_{2j}, \ldots, y_{my,j}$, as computed by `tiedrank`.

$r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $y_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$ and $r_t = (r_{t1}, r_{t2}, \ldots, r_{tn})$.

$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$

$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$
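The special cases noted above for the Minkowski metric can be checked numerically. This sketch uses random data with a fixed seed (an arbitrary choice) and a large finite exponent, 50, as a stand-in for the limit `p = ∞`.

```matlab
rng(1);                 % fixed seed, only for repeatability
X = randn(6, 4);
Y = randn(3, 4);

% p = 1 reproduces the city block metric.
e1 = max(max(abs(pdist2(X, Y, 'minkowski', 1) - pdist2(X, Y, 'cityblock'))));

% p = 2 reproduces the Euclidean distance.
e2 = max(max(abs(pdist2(X, Y, 'minkowski', 2) - pdist2(X, Y, 'euclidean'))));

% As p grows, the Minkowski distance approaches the Chebychev distance.
e50 = max(max(abs(pdist2(X, Y, 'minkowski', 50) - pdist2(X, Y, 'chebychev'))));

[e1 e2 e50]             % e1 and e2 are round-off sized; e50 is small but not zero
```

With *n* = 4 variables, the Minkowski distance at `p = 50` exceeds the Chebychev distance by a factor of at most `4^(1/50)`, roughly 2.8 percent, which bounds `e50`.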

Generate random data and find the unweighted Euclidean distance, then find the weighted distance using two different methods:

```matlab
% Compute the ordinary Euclidean distance
X = randn(100, 5);
Y = randn(25, 5);
D = pdist2(X,Y,'euclidean'); % euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation
Dstd = pdist2(X,Y,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently.
Wgts = [.1 .3 .3 .2 .1];
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist2(X,Y, @(Xi,Xj) weuc(Xi,Xj,Wgts));
```

See also: `createns` | `ExhaustiveSearcher` | `KDTreeSearcher` | `knnsearch` | `pdist`
