Find Nearest Neighbors Using a Custom Distance Metric
This example shows how to find the indices of the three nearest observations in X to each observation in Y with respect to the chi-square distance. This distance metric is used in correspondence analysis, particularly in ecological applications.
Randomly generate normally distributed data into two matrices. The number of rows can vary, but the number of columns must be equal. This example uses 2-D data for plotting.
rng(1); % For reproducibility X = randn(50,2); Y = randn(4,2); h = zeros(3,1); figure; h(1) = plot(X(:,1),X(:,2),'bx'); hold on; h(2) = plot(Y(:,1),Y(:,2),'rs','MarkerSize',10); title('Heterogenous Data')
The rows of X and Y correspond to observations, and the columns are, in general, dimensions (for example, predictors).
The chi-square distance between j-dimensional points x and z is
where is the weight associated with dimension j.
Choose weights for each dimension, and specify the chi-square distance function. The distance function must:
- Take as input arguments one row of X, e.g., x, and the matrix Z.
- Compare x to each row of Z.
- Return a vector D of length , where is the number of rows of Z. Each element of D is the distance between the observation corresponding to x and the observations corresponding to each row of Z.
w = [0.4; 0.6]; chiSqrDist = @(x,Z)sqrt((bsxfun(@minus,x,Z).^2)*w);
This example uses arbitrary weights for illustration.
Find the indices of the three nearest observations in X to each observation in Y.
k = 3; [Idx,D] = knnsearch(X,Y,'Distance',chiSqrDist,'k',k);
idx and D are 4-by-3 matrices.
- idx(j,1) is the row index of the closest observation in X to observation j of Y, and D(j,1) is their distance.
- idx(j,2) is the row index of the next closest observation in X to observation j of Y, and D(j,2) is their distance.
- And so on.
Identify the nearest observations in the plot.
for j = 1:k; h(3) = plot(X(Idx(:,j),1),X(Idx(:,j),2),'ko','MarkerSize',10); end legend(h,{'\texttt{X}','\texttt{Y}','Nearest Neighbor'},'Interpreter','latex'); title('Heterogenous Data and Nearest Neighbors') hold off;
Several observations of Y share nearest neighbors.
Verify that the chi-square distance metric is equivalent to the Euclidean distance metric, but with an optional scaling parameter.
[IdxE,DE] = knnsearch(X,Y,'Distance','seuclidean','k',k,... 'Scale',1./(sqrt(w))); AreDiffIdx = sum(sum(Idx ~= IdxE)) AreDiffDist = sum(sum(abs(D - DE) > eps))
AreDiffIdx = 0 AreDiffDist = 0
The indices and distances between the two implementations of three nearest neighbors are practically equivalent.