Categorizing query points based on their distance to points
in a training dataset can be a simple yet effective way of classifying
new points. You can use various metrics to determine the distance,
described next. Use `pdist2`

to
find the distance between a set of data and query points.

Given an *mx*-by-*n* data
matrix `X`

, which is treated as *mx* (1-by-*n*)
row vectors `x`

_{1}, `x`

_{2},
..., `x`

_{mx},
and *my*-by-*n* data matrix `Y`

,
which is treated as *my* (1-by-*n*)
row vectors `y`

_{1}, `y`

_{2},
...,`y`

_{my},
the various distances between the vector `x`

_{s} and `y`

_{t} are
defined as follows:

Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{y}_{t})({x}_{s}-{y}_{t}{)}^{\prime}$$

The Euclidean distance is a special case of the Minkowski metric, where

`p`

= 2.Standardized Euclidean distance

$${d}_{st}^{2}=({x}_{s}-{y}_{t}){V}^{-1}({x}_{s}-{y}_{t}{)}^{\prime}$$

where

`V`

is the*n*-by-*n*diagonal matrix whose*j*th diagonal element is`S`

(*j*)^{2}, where`S`

is the vector containing the inverse weights.Mahalanobis distance

$${d}_{st}^{2}=({x}_{s}-{y}_{t}){C}^{-1}({x}_{s}-{y}_{t}{)}^{\prime}$$

where

`C`

is the covariance matrix.City block metric

$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left|{x}_{sj}-{y}_{tj}\right|}$$

The city block distance is a special case of the Minkowski metric, where

`p`

= 1.Minkowski metric

$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left|{x}_{sj}-{y}_{tj}\right|}^{p}}}$$

For the special case of

`p`

= 1, the Minkowski metric gives the city block metric, for the special case of`p`

= 2, the Minkowski metric gives the Euclidean distance, and for the special case of`p`

= ∞, the Minkowski metric gives the Chebychev distance.Chebychev distance

$${d}_{st}={\mathrm{max}}_{j}\left\{\left|{x}_{sj}-{y}_{tj}\right|\right\}$$

The Chebychev distance is a special case of the Minkowski metric, where

`p`

= ∞.Cosine distance

$${d}_{st}=\left(1-\frac{{x}_{s}{{y}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({y}_{t}{{y}^{\prime}}_{t}\right)}}\right)$$

Correlation distance

$${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({y}_{t}-{\overline{y}}_{t}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime}}}$$

where

$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$

and

$${\overline{y}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{y}_{tj}}$$

Hamming distance

$${d}_{st}=(\#({x}_{sj}\ne {y}_{tj})/n)$$

Jaccard distance

$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {y}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right]}$$

Spearman distance

$${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime}}}$$

where

*r*is the rank of_{sj}*x*taken over_{sj}*x*_{1j},*x*_{2j}, ...*x*, as computed by_{mx,j}`tiedrank`

.*r*is the rank of_{tj}*y*taken over_{tj}*y*_{1j},*y*_{2j}, ...*y*, as computed by_{my,j}`tiedrank`

.*r*and_{s}*r*are the coordinate-wise rank vectors of_{t}*x*and_{s}*y*, i.e.,_{t}*r*= (_{s}*r*_{s}_{1},*r*_{s}_{2}, ...*r*) and_{sn}*r*= (_{t}*r*_{t1},*r*_{t2}, ...*r*)._{tn}$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$.

$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$.

Given a set *X* of *n* points
and a distance function, *k*-nearest neighbor (*k*NN)
search lets you find the *k* closest points in *X* to
a query point or set of points `Y`

. The *k*NN
search technique and *k*NN-based algorithms are widely
used as benchmark learning rules. The relative simplicity of the *k*NN
search technique makes it easy to compare the results from other classification
techniques to *k*NN results. The technique has been
used in various areas such as:

bioinformatics

image processing and data compression

document retrieval

computer vision

multimedia database

marketing data analysis

You can use *k*NN search for other machine
learning algorithms, such as:

*k*NN classificationlocal weighted regression

missing data imputation and interpolation

density estimation

You can also use *k*NN search with many distance-based
learning functions, such as K-means clustering.

In contrast, for a positive real value `r`

, `rangesearch`

finds all points in `X`

that
are within a distance `r`

of each point in `Y`

.
This fixed-radius search is closely related to *k*NN
search, as it supports the same distance metrics and search classes,
and uses the same search algorithms.

When your input data meets any of the following criteria, `knnsearch`

uses the exhaustive search method
by default to find the *k*-nearest neighbors:

The number of columns of

`X`

is more than 10.`X`

is sparse.The distance measure is either:

`'seuclidean'`

`'mahalanobis'`

`'cosine'`

`'correlation'`

`'spearman'`

`'hamming'`

`'jaccard'`

A custom distance function

`knnsearch`

also uses the exhaustive
search method if your search object is an `ExhaustiveSearcher`

model
object. The exhaustive search method finds the distance from each
query point to every point in `X`

, ranks them in
ascending order, and returns the *k* points with
the smallest distances. For example, this diagram shows the *k* = 3 nearest neighbors.

When your input data meets all of the following criteria, `knnsearch`

creates
a *K*d-tree by default to find the *k*-nearest
neighbors:

The number of columns of

`X`

is less than 10.`X`

is not sparse.The distance measure is either:

`'euclidean'`

(default)`'cityblock'`

`'minkowski'`

`'chebychev'`

`knnsearch`

also uses a *K*d-tree
if your search object is a `KDTreeSearcher`

model object.

*K*d-trees divide your data into nodes with
at most `BucketSize`

(default is 50) points per node,
based on coordinates (as opposed to categories). The following diagrams
illustrate this concept using `patch`

objects to
color code the different "buckets."

When you want to find the *k*-nearest neighbors
to a given query point, `knnsearch`

does the following:

Determines the node to which the query point belongs. In the following example, the query point (32,90) belongs to Node 4.

Finds the closest

*k*points within that node and its distance to the query point. In the following example, the points in red circles are equidistant from the query point, and are the closest points to the query point within Node 4.Chooses all other nodes having any area that is within the same distance, in any direction, from the query point to the

*k*th closest point. In this example, only Node 3 overlaps the solid black circle centered at the query point with radius equal to the distance to the closest points within Node 4.Searches nodes within that range for any points closer to the query point. In the following example, the point in a red square is slightly closer to the query point than those within Node 4.

Using a *K*d-tree for large data sets with
fewer than 10 dimensions (columns) can be much more efficient than
using the exhaustive search method, as `knnsearch`

needs
to calculate only a subset of the distances. To maximize the efficiency
of *K*d-trees, use a `KDTreeSearcher`

model.

Basically, model objects are a convenient way of storing information. Related models have the same properties with values and types relevant to a specified search method. In addition to storing information within models, you can perform certain actions on models.

You can efficiently perform a *k*-nearest neighbors
search on your search model using `knnsearch`

. Or, you can search for all neighbors
within a specified radius using your search model and `rangesearch`

.
In addition, there are a generic `knnsearch`

and `rangesearch`

functions that search without
creating or using a model.

To determine which type of model and search method is best for your data, consider the following:

Does your data have many columns, say more than 10? The

`ExhaustiveSearcher`

model may perform better.Is your data sparse? Use the

`ExhaustiveSearcher`

model.Do you want to use one of these distance measures to find the nearest neighbors? Use the

`ExhaustiveSearcher`

model.`'seuclidean'`

`'mahalanobis'`

`'cosine'`

`'correlation'`

`'spearman'`

`'hamming'`

`'jaccard'`

A custom distance function

Is your data set huge (but with fewer than 10 columns)? Use the

`KDTreeSearcher`

model.Are you searching for the nearest neighbors for a large number of query points? Use the

`KDTreeSearcher`

model.

This example shows how to classify query data by:

Growing a

*K*d-treeConducting a

*k*nearest neighbors search using the grown tree.Assigning each query point the class with the highest representation among their respective nearest neighbors.

Classify a new point based on the last two columns of the Fisher iris data. Using only the last two columns makes it easier to plot.

load fisheriris x = meas(:,3:4); gscatter(x(:,1),x(:,2),species) legend('Location','best')

Plot the new point.

newpoint = [5 1.45]; line(newpoint(1),newpoint(2),'marker','x','color','k',... 'markersize',10,'linewidth',2)

Prepare a *K* d-tree neighbor searcher model.

Mdl = KDTreeSearcher(x)

Mdl = KDTreeSearcher with properties: BucketSize: 50 Distance: 'euclidean' DistParameter: [] X: [150x2 double]

`Mdl`

is a `KDTreeSearcher`

model. By default, the distance metric it uses to search for neighbors is Euclidean distance.

Find the 10 sample points closest to the new point.

[n,d] = knnsearch(Mdl,newpoint,'k',10); line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o',... 'linestyle','none','markersize',10)

It appears that `knnsearch`

has found only the nearest eight neighbors. In fact, this particular dataset contains duplicate values.

x(n,:)

ans = 5.0000 1.5000 4.9000 1.5000 4.9000 1.5000 5.1000 1.5000 5.1000 1.6000 4.8000 1.4000 5.0000 1.7000 4.7000 1.4000 4.7000 1.4000 4.7000 1.5000

Make the axes equal so the calculated distances correspond to the apparent distances on the plot axis equal and zoom in to see the neighbors better.

```
xlim([4.5 5.5]);
ylim([1 2]);
axis square
```

Find the species of the 10 neighbors.

tabulate(species(n))

Value Count Percent virginica 2 20.00% versicolor 8 80.00%

Using a rule based on the majority vote of the 10 nearest neighbors, you can classify this new point as a versicolor.

Visually identify the neighbors by drawing a circle around the group of them. Define the center and diameter of a circle, based on the location of the new point.

ctr = newpoint - d(end); diameter = 2*d(end); % Draw a circle around the 10 nearest neighbors. h = rectangle('position',[ctr,diameter,diameter],... 'curvature',[1 1]); h.LineStyle = ':';

Using the same dataset, find the 10 nearest neighbors to three new points.

figure newpoint2 = [5 1.45;6 2;2.75 .75]; gscatter(x(:,1),x(:,2),species) legend('location','best') [n2,d2] = knnsearch(Mdl,newpoint2,'k',10); line(x(n2,1),x(n2,2),'color',[.5 .5 .5],'marker','o',... 'linestyle','none','markersize',10) line(newpoint2(:,1),newpoint2(:,2),'marker','x','color','k',... 'markersize',10,'linewidth',2,'linestyle','none')

Find the species of the 10 nearest neighbors for each new point.

tabulate(species(n2(1,:)))

Value Count Percent virginica 2 20.00% versicolor 8 80.00%

tabulate(species(n2(2,:)))

Value Count Percent virginica 10 100.00%

tabulate(species(n2(3,:)))

Value Count Percent versicolor 7 70.00% setosa 3 30.00%

For more examples using `knnsearch`

methods and function, see the individual reference pages.

This example shows how to find the indices of the three nearest observations in `X`

to each observation in `Y`

with respect to the chi-square distance. This distance metric is used in correspondence analysis, particularly in ecological applications.

Randomly generate normally distributed data into two matrices. The number of rows can vary, but the number of columns must be equal. This example uses 2-D data for plotting.

rng(1); % For reproducibility X = randn(50,2); Y = randn(4,2); h = zeros(3,1); figure; h(1) = plot(X(:,1),X(:,2),'bx'); hold on; h(2) = plot(Y(:,1),Y(:,2),'rs','MarkerSize',10); title('Heterogenous Data')

The rows of `X`

and `Y`

correspond to observations, and the columns are, in general, dimensions (for example, predictors).

The chi-square distance between *j*-dimensional points *x* and *z* is

where
is the weight associated with dimension *j*.

Choose weights for each dimension, and specify the chi-square distance function. The distance function must:

Take as input arguments one row of

`X`

, e.g.,`x`

, and the matrix`Z`

.Compare

`x`

to each row of`Z`

.Return a vector

`D`

of length , where is the number of rows of`Z`

. Each element of`D`

is the distance between the observation corresponding to`x`

and the observations corresponding to each row of`Z`

.

w = [0.4; 0.6]; chiSqrDist = @(x,Z)sqrt((bsxfun(@minus,x,Z).^2)*w);

This example uses arbitrary weights for illustration.

Find the indices of the three nearest observations in `X`

to each observation in `Y`

.

k = 3; [Idx,D] = knnsearch(X,Y,'Distance',chiSqrDist,'k',k);

`idx`

and `D`

are 4-by-3 matrices.

`idx(j,1)`

is the row index of the closest observation in`X`

to observation*j*of`Y`

, and`D(j,1)`

is their distance.`idx(j,2)`

is the row index of the next closest observation in`X`

to observation*j*of`Y`

, and`D(j,2)`

is their distance.And so on.

Identify the nearest observations in the plot.

for j = 1:k; h(3) = plot(X(Idx(:,j),1),X(Idx(:,j),2),'ko','MarkerSize',10); end legend(h,{'\texttt{X}','\texttt{Y}','Nearest Neighbor'},'Interpreter','latex'); title('Heterogenous Data and Nearest Neighbors') hold off;

Several observations of `Y`

share nearest neighbors.

Verify that the chi-square distance metric is equivalent to the Euclidean distance metric, but with an optional scaling parameter.

[IdxE,DE] = knnsearch(X,Y,'Distance','seuclidean','k',k,... 'Scale',1./(sqrt(w))); AreDiffIdx = sum(sum(Idx ~= IdxE)) AreDiffDist = sum(sum(abs(D - DE) > eps))

AreDiffIdx = 0 AreDiffDist = 0

The indices and distances between the two implementations of three nearest neighbors are practically equivalent.

The `ClassificationKNN`

class
lets you:

Work with the classifier as you would with `ClassificationTree`

or `ClassificationDiscriminant`

.
In particular, prepare your data for classification according to the
procedure in Steps in Supervised Learning. Then construct
the classifier using `fitcknn`

.

This example shows how to construct a *k*-nearest
neighbor classifier for the Fisher iris data.

Load the Fisher iris data.

load fisheriris X = meas; % use all data for fitting Y = species; % response data

Construct the classifier using `fitcknn`

.

mdl = fitcknn(X,Y)

mdl = ClassificationKNN PredictorNames: {'x1' 'x2' 'x3' 'x4'} ResponseName: 'Y' ClassNames: {'setosa' 'versicolor' 'virginica'} ScoreTransform: 'none' NumObservations: 150 Distance: 'euclidean' NumNeighbors: 1 Properties, Methods

A default *k*-nearest neighbor classifier
uses just the single nearest neighbor. Often, a classifier is more
robust with more neighbors than that. Change the neighborhood size
of `mdl`

to `4`

, meaning `mdl`

classifies
using the four nearest neighbors:

mdl.NumNeighbors = 4;

This example shows how to examine the quality
of a *k*-nearest neighbor classifier using resubstitution
and cross validation.

Construct a KNN classifier for the Fisher iris data as in Construct a KNN Classifier.

load fisheriris X = meas; % use all data for fitting Y = species; % response data mdl = fitcknn(X,Y,'NumNeighbors',4);

Examine the resubstitution loss, which, by default, is
the fraction of misclassifications from the predictions of `mdl`

.
(For nondefault cost, weights, or priors, see `ClassificationKNN.loss`

.)

rloss = resubLoss(mdl)

rloss = 0.0400

The classifier predicts incorrectly for 4% of the training data.

Construct a cross-validated classifier from the model.

cvmdl = crossval(mdl);

Examine the cross-validation loss, which is the average loss of each cross-validation model when predicting on data that is not used for training.

kloss = kfoldLoss(cvmdl)

kloss = 0.0600

The cross-validated classification accuracy resembles the resubstitution
accuracy. Therefore, you can expect `mdl`

to misclassify
approximately 5% of new data, assuming that the new data has about
the same distribution as the training data.

This example shows how to predict classification
for a *k*-nearest neighbor classifier.

Construct a default KNN classifier for the Fisher iris data as in Construct a KNN Classifier.

load fisheriris X = meas; % use all data for fitting Y = species; % response data mdl = fitcknn(X,Y);

Predict the classification of an average flower.

```
flwr = mean(X); % an average flower
flwrClass = predict(mdl,flwr)
```

flwrClass = 'versicolor'

This example shows how to modify a *k*-nearest
neighbor classifier.

Construct a default KNN classifier for the Fisher iris data as in Construct a KNN Classifier.

load fisheriris X = meas; % use all data for fitting Y = species; % response data mdl = fitcknn(X,Y);

Modify the model to use the three nearest neighbors, rather than the default one nearest neighbor.

mdl.NumNeighbors = 3;

Compare the resubstitution predictions and cross-validation loss with the new number of neighbors.

rloss = resubLoss(mdl)

rloss = 0.0400

rng('default') cvmdl = crossval(mdl,'kfold',5); kloss = kfoldLoss(cvmdl)

kloss = 0.0333

The model with three neighbors has lower cross-validated loss than a model with four neighbors (see Examine the Quality of a KNN Classifier).

Modify the model to use cosine distance instead of the default, and examine the loss. To use cosine distance, you must recreate the model using the exhaustive search method.

cmdl = fitcknn(X,Y,'NSMethod','exhaustive',... 'Distance','cosine'); cmdl.NumNeighbors = 3; closs = resubLoss(cmdl)

closs = 0.0200

The classifier now has lower resubstitution error than before.

Check the quality of a cross-validated version of the new model.

cvcmdl = crossval(cmdl); kcloss = kfoldLoss(cvcmdl)

kcloss = 0.0333

The cross-validated loss is the same as before. The lesson is that improving the resubstitution error does not necessarily produce a model with better predictions.

Was this topic helpful?