MATLAB Examples

Find Nearest Neighbors Using a Custom Distance Metric

This example shows how to find the indices of the three nearest observations in X to each observation in Y with respect to the chi-square distance. This distance metric is used in correspondence analysis, particularly in ecological applications.

Randomly generate normally distributed data into two matrices. The number of rows can vary, but the number of columns must be equal. This example uses 2-D data for plotting.

rng(1); % For reproducibility
X = randn(50,2);
Y = randn(4,2);

h = zeros(3,1);
figure;
h(1) = plot(X(:,1),X(:,2),'bx');
hold on;
h(2) = plot(Y(:,1),Y(:,2),'rs','MarkerSize',10);
title('Heterogenous Data')

The rows of X and Y correspond to observations, and the columns are, in general, dimensions (for example, predictors).

The chi-square distance between j-dimensional points x and z is

$$\chi(x,z) = \sqrt{\displaystyle\sum^J_{j = 1}w_j\left(x_j - z_j\right)^2},$$

where $w_j$ is the weight associated with dimension j.

Choose weights for each dimension, and specify the chi-square distance function. The distance function must:

  • Take as input arguments one row of X, e.g., x, and the matrix Z.
  • Compare x to each row of Z.
  • Return a vector D of length $n_z$, where $n_z$ is the number of rows of Z. Each element of D is the distance between the observation corresponding to x and the observations corresponding to each row of Z.
w = [0.4; 0.6];
chiSqrDist = @(x,Z)sqrt((bsxfun(@minus,x,Z).^2)*w);

This example uses arbitrary weights for illustration.

Find the indices of the three nearest observations in X to each observation in Y.

k = 3;
[Idx,D] = knnsearch(X,Y,'Distance',chiSqrDist,'k',k);

idx and D are 4-by-3 matrices.

  • idx(j,1) is the row index of the closest observation in X to observation j of Y, and D(j,1) is their distance.
  • idx(j,2) is the row index of the next closest observation in X to observation j of Y, and D(j,2) is their distance.
  • And so on.

Identify the nearest observations in the plot.

for j = 1:k;
    h(3) = plot(X(Idx(:,j),1),X(Idx(:,j),2),'ko','MarkerSize',10);
end
legend(h,{'\texttt{X}','\texttt{Y}','Nearest Neighbor'},'Interpreter','latex');
title('Heterogenous Data and Nearest Neighbors')
hold off;

Several observations of Y share nearest neighbors.

Verify that the chi-square distance metric is equivalent to the Euclidean distance metric, but with an optional scaling parameter.

[IdxE,DE] = knnsearch(X,Y,'Distance','seuclidean','k',k,...
    'Scale',1./(sqrt(w)));
AreDiffIdx = sum(sum(Idx ~= IdxE))
AreDiffDist = sum(sum(abs(D - DE) > eps))
AreDiffIdx =

     0


AreDiffDist =

     0

The indices and distances between the two implementations of three nearest neighbors are practically equivalent.