Classification Using Nearest Neighbors

Pairwise Distance

Categorizing query points based on their distance to points in a training dataset can be a simple yet effective way of classifying new points. You can use various metrics to determine the distance, described next. Use pdist2 to find the distances between a set of data points and a set of query points.
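
For example, this minimal sketch (the coordinates are made up for illustration) computes the Euclidean distance from every data point in X to every query point in Y:

X = [1 2; 3 4; 5 6];   % three data points
Y = [2 3; 4 5];        % two query points
D = pdist2(X,Y)        % returns a 3-by-2 matrix of Euclidean distances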

Distance Metrics

Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors x_1, x_2, ..., x_mx, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors y_1, y_2, ..., y_my, the various distances between the vectors x_s and y_t are defined as follows (a short usage sketch follows this list):

  • Euclidean distance

    d_st^2 = (x_s - y_t)(x_s - y_t)'

    The Euclidean distance is a special case of the Minkowski metric, where p = 2.

  • Standardized Euclidean distance

    d_st^2 = (x_s - y_t)V^(-1)(x_s - y_t)'

    where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^2, where S is the vector containing the inverse weights.

  • Mahalanobis distance

    d_st^2 = (x_s - y_t)C^(-1)(x_s - y_t)'

    where C is the covariance matrix.

  • City block metric

    d_st = sum_{j=1..n} |x_sj - y_tj|

    The city block distance is a special case of the Minkowski metric, where p = 1.

  • Minkowski metric

    d_st = (sum_{j=1..n} |x_sj - y_tj|^p)^(1/p)

    For the special case of p = 1, the Minkowski metric gives the city block metric; for p = 2, it gives the Euclidean distance; and for p = ∞, it gives the Chebychev distance.

  • Chebychev distance

    d_st = max_j |x_sj - y_tj|

    The Chebychev distance is a special case of the Minkowski metric, where p = ∞.

  • Cosine distance

    d_st = 1 - (x_s y_t') / (sqrt(x_s x_s') sqrt(y_t y_t'))

  • Correlation distance

    d_st = 1 - ((x_s - xbar_s)(y_t - ybar_t)') / (sqrt((x_s - xbar_s)(x_s - xbar_s)') sqrt((y_t - ybar_t)(y_t - ybar_t)'))

    where

    xbar_s = (1/n) sum_j x_sj

    and

    ybar_t = (1/n) sum_j y_tj

  • Hamming distance

    d_st = (number of coordinates j where x_sj ≠ y_tj) / n

  • Jaccard distance

    d_st = #[(x_sj ≠ y_tj) and ((x_sj ≠ 0) or (y_tj ≠ 0))] / #[(x_sj ≠ 0) or (y_tj ≠ 0)]

  • Spearman distance

    d_st = 1 - ((r_s - rbar_s)(r_t - rbar_t)') / (sqrt((r_s - rbar_s)(r_s - rbar_s)') sqrt((r_t - rbar_t)(r_t - rbar_t)'))

    where

    • r_sj is the rank of x_sj taken over x_1j, x_2j, ..., x_mx,j, as computed by tiedrank.

    • r_tj is the rank of y_tj taken over y_1j, y_2j, ..., y_my,j, as computed by tiedrank.

    • r_s and r_t are the coordinate-wise rank vectors of x_s and y_t, i.e., r_s = (r_s1, r_s2, ..., r_sn) and r_t = (r_t1, r_t2, ..., r_tn).

    • rbar_s = (1/n) sum_j r_sj = (n + 1)/2.

    • rbar_t = (1/n) sum_j r_tj = (n + 1)/2.
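
The following sketch (with made-up data) shows how a few of these metrics are selected by name in pdist2; for 'minkowski', the exponent p is passed as an additional argument:

X = [0 0; 1 1; 2 2];
Y = [1 0];
Deuc  = pdist2(X,Y)                 % Euclidean (default)
Dcity = pdist2(X,Y,'cityblock')     % city block
Dmink = pdist2(X,Y,'minkowski',3)   % Minkowski with p = 3
Dcheb = pdist2(X,Y,'chebychev')     % Chebychev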

k-Nearest Neighbor Search and Radius Search

Given a set X of n points and a distance function, k-nearest neighbor (kNN) search lets you find the k closest points in X to a query point or set of points Y. The kNN search technique and kNN-based algorithms are widely used as benchmark learning rules. The relative simplicity of the kNN search technique makes it easy to compare the results from other classification techniques to kNN results. The technique has been used in various areas such as:

  • bioinformatics

  • image processing and data compression

  • document retrieval

  • computer vision

  • multimedia databases

  • marketing data analysis

You can use kNN search for other machine learning algorithms, such as:

  • kNN classification

  • local weighted regression

  • missing data imputation and interpolation

  • density estimation

You can also use kNN search with many distance-based learning functions, such as K-means clustering.

In contrast, for a positive real value r, rangesearch finds all points in X that are within a distance r of each point in Y. This fixed-radius search is closely related to kNN search, as it supports the same distance metrics and search classes, and uses the same search algorithms.
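
As a quick illustration of the difference (with randomly generated data), the following sketch finds both the three nearest neighbors of a query point and all points within a fixed radius of it:

rng(0)                             % for reproducibility
X = rand(100,2);                   % data points
Y = [0.5 0.5];                     % query point
idxNN = knnsearch(X,Y,'k',3)       % indices of the 3 nearest neighbors
idxR  = rangesearch(X,Y,0.1)       % cell array of indices within radius 0.1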

k-Nearest Neighbor Search Using Exhaustive Search

When your input data meets any of the following criteria, knnsearch uses the exhaustive search method by default to find the k-nearest neighbors:

  • The number of columns of X is more than 10.

  • X is sparse.

  • The distance measure is either:

    • 'seuclidean'

    • 'mahalanobis'

    • 'cosine'

    • 'correlation'

    • 'spearman'

    • 'hamming'

    • 'jaccard'

    • A custom distance function

knnsearch also uses the exhaustive search method if your search object is an ExhaustiveSearcher object. The exhaustive search method finds the distance from each query point to every point in X, ranks them in ascending order, and returns the k points with the smallest distances. For example, this diagram shows the k = 3 nearest neighbors.
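
For instance, this sketch (using the Fisher iris data that appears later in this section) creates an ExhaustiveSearcher object with the cosine distance and searches it with knnsearch:

load fisheriris
X = meas;
eSearcher = ExhaustiveSearcher(X,'Distance','cosine');
idx = knnsearch(eSearcher,X(1,:),'k',3)   % 3 nearest neighbors of the first observation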

k-Nearest Neighbor Search Using a kd-Tree

When your input data meets all of the following criteria, knnsearch creates a kd-tree by default to find the k-nearest neighbors:

  • The number of columns of X is less than 10.

  • X is not sparse.

  • The distance measure is either:

    • 'euclidean' (default)

    • 'cityblock'

    • 'minkowski'

    • 'chebychev'

knnsearch also uses a kd-tree if your search object is a KDTreeSearcher object.

kd-trees divide your data into nodes with at most BucketSize (default is 50) points per node, based on coordinates (as opposed to categories). The following diagrams illustrate this concept using patch objects to color code the different "buckets."
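
A brief sketch (using two columns of the Fisher iris data) of creating a KDTreeSearcher object with a smaller bucket size and querying it:

load fisheriris
X = meas(:,3:4);
kdSearcher = KDTreeSearcher(X,'BucketSize',20);   % at most 20 points per leaf node
idx = knnsearch(kdSearcher,[5 1.45],'k',3)        % 3 nearest neighbors of the query point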

When you want to find the k-nearest neighbors to a given query point, knnsearch does the following:

  1. Determines the node to which the query point belongs. In the following example, the query point (32,90) belongs to Node 4.

  2. Finds the closest k points within that node and their distances to the query point. In the following example, the points in red circles are equidistant from the query point, and are the closest points to the query point within Node 4.

  3. Chooses all other nodes having any area that is within the same distance, in any direction, from the query point to the kth closest point. In this example, only Node 3 overlaps the solid black circle centered at the query point with radius equal to the distance to the closest points within Node 4.

  4. Searches nodes within that range for any points closer to the query point. In the following example, the point in a red square is slightly closer to the query point than those within Node 4.

Using a kd-tree for large datasets with fewer than 10 dimensions (columns) can be much more efficient than using the exhaustive search method, as knnsearch needs to calculate only a subset of the distances. To maximize the efficiency of kd-trees, use a KDTreeSearcher object.
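
As a rough sketch (random data; actual timings vary by machine and release), you can compare the two search methods directly on a large, low-dimensional dataset:

rng(1)
X = rand(1e5,3);                        % 100,000 points in 3 dimensions
Y = rand(100,3);                        % 100 query points
kdNS = KDTreeSearcher(X);
exNS = ExhaustiveSearcher(X);
tic; knnsearch(kdNS,Y,'k',5); tKD = toc
tic; knnsearch(exNS,Y,'k',5); tEX = toc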

What Are Search Objects?

Basically, objects are a convenient way of storing information. Classes of related objects (for example, all search objects) have the same properties with values and types relevant to a specified search method. In addition to storing information within objects, you can perform certain actions (called methods) on objects.

All search objects have a knnsearch method specific to that class, which lets you perform an efficient k-nearest neighbor search tailored to that object type. In addition, there is a generic knnsearch function that searches without creating or using an object.

To determine which type of object and search method is best for your data, consider the following (a short construction sketch follows this list):

  • Does your data have many columns, say more than 10? The ExhaustiveSearcher object may perform better.

  • Is your data sparse? Use the ExhaustiveSearcher object.

  • Do you want to use one of these distance measures to find the nearest neighbors? Use the ExhaustiveSearcher object.

    • 'seuclidean'

    • 'mahalanobis'

    • 'cosine'

    • 'correlation'

    • 'spearman'

    • 'hamming'

    • 'jaccard'

    • A custom distance function

  • Is your dataset huge (but with fewer than 10 columns)? Use the KDTreeSearcher object.

  • Are you searching for the nearest neighbors for a large number of query points? Use the KDTreeSearcher object.
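
If you prefer not to call the object constructors directly, createns builds the appropriate search object from name-value arguments; a minimal sketch using the Fisher iris data:

load fisheriris
X = meas;
kdNS = createns(X,'NSMethod','kdtree')                          % returns a KDTreeSearcher
exNS = createns(X,'NSMethod','exhaustive','Distance','cosine')  % returns an ExhaustiveSearcher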

Classifying Query Data Using knnsearch

This example shows how to classify query data using knnsearch.

Classify a new point based on the last two columns of the Fisher iris data. Using only the last two columns makes it easier to plot.

load fisheriris
x = meas(:,3:4);
gscatter(x(:,1),x(:,2),species)
set(legend,'location','best')

Plot the new point.

newpoint = [5 1.45];
line(newpoint(1),newpoint(2),'marker','x','color','k',...
   'markersize',10,'linewidth',2)

Find the 10 sample points closest to the new point.

[n,d] = knnsearch(x,newpoint,'k',10);
line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o',...
    'linestyle','none','markersize',10)

It appears that knnsearch has found only the nearest eight neighbors. In fact, this particular dataset contains duplicate values.

x(n,:)
ans =

    5.0000    1.5000
    4.9000    1.5000
    4.9000    1.5000
    5.1000    1.5000
    5.1000    1.6000
    4.8000    1.4000
    5.0000    1.7000
    4.7000    1.4000
    4.7000    1.4000
    4.7000    1.5000

Make the axes equal so that the calculated distances correspond to the apparent distances on the plot, and zoom in to see the neighbors better.

set(gca,'xlim',[4.5 5.5],'ylim',[1 2]); axis square

Find the species of the 10 neighbors.

tabulate(species(n))
       Value    Count   Percent
   virginica        2     20.00%
  versicolor        8     80.00%

Using a rule based on the majority vote of the 10 nearest neighbors, you can classify this new point as a versicolor.
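
You can also compute this majority-vote label programmatically rather than reading it from the tabulate output; a small sketch using a categorical conversion:

predictedClass = mode(categorical(species(n)))   % majority class among the 10 neighbors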

Visually identify the neighbors by drawing a circle around the group of them. Define the center and diameter of a circle, based on the location of the new point.

ctr = newpoint - d(end);
diameter = 2*d(end);
% Draw a circle around the 10 nearest neighbors.
h = rectangle('position',[ctr,diameter,diameter],...
   'curvature',[1 1]);
set(h,'linestyle',':')

Using the same dataset, find the 10 nearest neighbors to three new points.

figure
newpoint2 = [5 1.45;6 2;2.75 .75];
gscatter(x(:,1),x(:,2),species)
legend('location','best')
[n2,d2] = knnsearch(x,newpoint2,'k',10);
line(x(n2,1),x(n2,2),'color',[.5 .5 .5],'marker','o',...
   'linestyle','none','markersize',10)
line(newpoint2(:,1),newpoint2(:,2),'marker','x','color','k',...
   'markersize',10,'linewidth',2,'linestyle','none')

Find the species of the 10 nearest neighbors for each new point.

tabulate(species(n2(1,:)))
       Value    Count   Percent
   virginica        2     20.00%
  versicolor        8     80.00%
tabulate(species(n2(2,:)))
      Value    Count   Percent
  virginica       10    100.00%
tabulate(species(n2(3,:)))
       Value    Count   Percent
  versicolor        7     70.00%
      setosa        3     30.00%

For more examples using the knnsearch methods and the knnsearch function, see the individual reference pages.

K-Nearest Neighbor Classification for Supervised Learning

The ClassificationKNN class lets you:

  • Construct a KNN Classifier

  • Examine the Quality of a KNN Classifier

  • Predict Classification Based on a KNN Classifier

  • Modify a KNN Classifier

Work with the classifier as you would with ClassificationTree or ClassificationDiscriminant. In particular, prepare your data for classification according to the procedure in Steps in Supervised Learning (Machine Learning). Then construct the classifier using fitcknn.

Construct a KNN Classifier

This example shows how to construct a k-nearest neighbor classifier for the Fisher iris data.

Load the Fisher iris data.

load fisheriris
X = meas; % use all data for fitting
Y = species; % response data

Construct the classifier using fitcknn.

mdl = fitcknn(X,Y)
mdl = 

  ClassificationKNN
     PredictorNames: {'x1'  'x2'  'x3'  'x4'}
       ResponseName: 'Y'
         ClassNames: {'setosa'  'versicolor'  'virginica'}
     ScoreTransform: 'none'
    NumObservations: 150
           Distance: 'euclidean'
       NumNeighbors: 1


  Properties, Methods

A default k-nearest neighbor classifier uses just the single nearest neighbor. Often, a classifier is more robust with more neighbors than that. Change the neighborhood size of mdl to 4, meaning mdl classifies using the four nearest neighbors:

mdl.NumNeighbors = 4;

Examine the Quality of a KNN Classifier

This example shows how to examine the quality of a k-nearest neighbor classifier using resubstitution and cross validation.

Construct a KNN classifier for the Fisher iris data as in Construct a KNN Classifier.

load fisheriris
X = meas; % use all data for fitting
Y = species; % response data
mdl = fitcknn(X,Y,'NumNeighbors',4);

Examine the resubstitution loss, which, by default, is the fraction of misclassifications from the predictions of mdl. (For nondefault cost, weights, or priors, see ClassificationKNN.loss.)

rloss = resubLoss(mdl)
rloss =

    0.0400

The classifier predicts incorrectly for 4% of the training data.

Construct a cross-validated classifier from the model.

cvmdl = crossval(mdl);

Examine the cross-validation loss, which is the average loss of each cross-validation model when predicting on data that is not used for training.

kloss = kfoldLoss(cvmdl)
kloss =

    0.0600

The cross-validated classification accuracy resembles the resubstitution accuracy. Therefore, you can expect mdl to misclassify approximately 5% of new data, assuming that the new data has about the same distribution as the training data.

Predict Classification Based on a KNN Classifier

This example shows how to predict classification for a k-nearest neighbor classifier.

Construct a default KNN classifier for the Fisher iris data as in Construct a KNN Classifier.

load fisheriris
X = meas; % use all data for fitting
Y = species; % response data
mdl = fitcknn(X,Y);

Predict the classification of an average flower.

flwr = mean(X); % an average flower
flwrClass = predict(mdl,flwr)
flwrClass = 

    'versicolor'

Modify a KNN Classifier

This example shows how to modify a k-nearest neighbor classifier.

Construct a default KNN classifier for the Fisher iris data as in Construct a KNN Classifier.

load fisheriris
X = meas; % use all data for fitting
Y = species; % response data
mdl = fitcknn(X,Y);

Modify the model to use the three nearest neighbors, rather than the default one nearest neighbor.

mdl.NumNeighbors = 3;

Compare the resubstitution predictions and cross-validation loss with the new number of neighbors.

rloss = resubLoss(mdl)
rloss =

    0.0400

rng('default')
cvmdl = crossval(mdl,'kfold',5);
kloss = kfoldLoss(cvmdl)
kloss =

    0.0333

The model with three neighbors has lower cross-validated loss than a model with four neighbors (see Examine the Quality of a KNN Classifier).

Modify the model to use cosine distance instead of the default, and examine the loss. To use cosine distance, you must recreate the model using the exhaustive search method.

cmdl = fitcknn(X,Y,'NSMethod','exhaustive',...
    'Distance','cosine');
cmdl.NumNeighbors = 3;
closs = resubLoss(cmdl)
closs =

    0.0200

The classifier now has lower resubstitution error than before.

Check the quality of a cross-validated version of the new model.

cvcmdl = crossval(cmdl);
kcloss = kfoldLoss(cvcmdl)
kcloss =

    0.0333

The cross-validated loss is the same as before. The lesson is that improving the resubstitution error does not necessarily produce a model with better predictions.
