Efficient method for finding index of closest value in very large array for a very large amount of examples

38 views (last 30 days)
I have two very large one dimensional arrays, 'aRef' which is around 11,000,000 elements and 'aTest' which is around 10,000,000 elements. I need to find the index of the closest element in 'aRef' for all elements in 'aTest'. 'aRef' is sorted and 'aTest' can be sorted if that will help performance.
- Method 1: Returns at out of memory error as the arrays are far too large
diff = abs(bsxfun(@minus,aRef,aTest'));
[~, I] = min(diff);
- Method 2: Takes around 0.03 seconds per iteration (but varies greatly) and therefore around 300000 seconds in total
for k = 1:n
diff = abs(aRef- aTest(k));
[~, I(k)] = min(diff);
end
- Method 3: Takes around 0.013 seconds per iteration and therefore 130000 seconds in total
for k = 1:n
i_lower = find(aRef <= aTest(k),1,'last');
i_higher = find(aRef >= aTest(k),1,'first');
end
Is there a more efficient method for this that won't exhaust the memory or take so long to run?
Thanks for your help.
  1 Comment
Stephen23
Stephen23 on 29 May 2015
Do not use diff as a variable name, as this is the names of the very useful inbuilt diff function.
One can check if a name is already defined using which.

Sign in to comment.

Accepted Answer

Guillaume
Guillaume on 22 May 2015
Note: Using diff as a variable name is not a good idea as it shadows the very useful diff function. Also, for method 2, your code does not show the preallocation of I. If you don't preallocate I, it will seriously slow down the code.
Anyway, for two vectors of around 10,000 elements, the following is around 200 times faster than your method 1 on my machine.
edges = [-Inf, mean([aRef(2:end); aRef(1:end-1)]), +Inf];
I = discretize(aTest, edges);
Basically, it construct an edge vector half way between each elements of your aRef, and use the histogram functions of matlab to get the bin index your aTest would fall in. discretize is new in R2015a. On 2014b, you can use the third return value of hiscounts. On even older versions, the 2nd return value of histc (although histc behaves slightly differently with regards to the last bin).
%2014b
[~, ~, I] = histcounts(aTest, edges); %probably slower than discretize
%before 2014b
[~, I] = histc(aTest, edges); %return an extra element (for the +Inf bin)
I(end) = [];
  4 Comments

Sign in to comment.

More Answers (3)

Damon Landau
Damon Landau on 8 Aug 2018
I = interp1(aRef, 1:numel(aRef), aTest, 'nearest', 'extrap');
should be faster and (arguably) more straightforward than "discretize"

James Tursa
James Tursa on 22 May 2015
Edited: James Tursa on 22 May 2015
This is a fairly simple mex routine if you start with both arrays sorted, since you don't need to make comparisons between all of the elements to get the answer. A simple stepping process can get the job done. E.g., the timing on the sizes you have listed:
>> aRef = sort(rand(11000000,1));
>> aTest = sort(rand(10000000,1));
>> tic; aClosest = closest(aRef,aTest); toc
Elapsed time is 0.130508 seconds.
A shorter example to see that the results are as expected:
>> aRef = 1:10
aRef =
1 2 3 4 5 6 7 8 9 10
>> aTest = sort(rand(1,20)*20-5)
aTest =
Columns 1 through 15
-4.4987 -2.2252 -2.0507 -1.2286 -1.0066 -0.9144 -0.2292 -0.0079 4.0035 4.6724 5.1178 5.3893 5.6688 7.4530 10.6621
Columns 16 through 20
10.9377 12.8271 12.9305 13.5336 14.6298
>> aClosest = closest(aRef,aTest)
aClosest =
1 1 1 1 1 1 1 1 4 5 5 5 6 7 10 10 10 10 10 10
The mex routine is below. (Note the caveat about not handling Inf's and NaN's consistently, although that code could be added if needed)
// closest.c
// C = closest(A,B)
// Inputs: A = full real double vector (reference), sorted ascending
// B = full real double vector (test), sorted ascending
// Output: C = index of closest value in A to value in B
// i.e., A(C(i)) is the closest value in A to B(i)
// Ties are resolved in favor of higher index
// Is not currently coded to handle Inf's and NaN's consistently
// C is the same size as B
// Programmer: James Tursa
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
size_t i, j, k, m, n;
double *A, *B, *C;
//
// Check input arguments
if( nrhs != 2 ||
!mxIsDouble(prhs[0]) || !mxIsDouble(prhs[1]) ||
mxIsComplex(prhs[0]) || mxIsComplex(prhs[1]) ||
mxIsSparse(prhs[0]) || mxIsSparse(prhs[1]) ) {
mexErrMsgTxt("Need exactly two full double vectors as input");
}
// Check number of output arguments
if( nlhs > 1 ) {
mexErrMsgTxt("Too many outputs");
}
// Get the number of elements involved
m = mxGetNumberOfElements(prhs[0]);
n = mxGetNumberOfElements(prhs[1]);
// Disallow empty reference vector
if( m == 0 ) {
mexErrMsgTxt("Reference vector (1st input) cannot be empty");
}
// Create uninitialized output
if( mxGetM(prhs[1]) == 1 ) {
plhs[0] = mxCreateUninitNumericMatrix( 1, n, mxDOUBLE_CLASS, mxREAL );
} else {
plhs[0] = mxCreateUninitNumericMatrix( n, 1, mxDOUBLE_CLASS, mxREAL );
}
// If B is empty, simply return empty result
if( n == 0 ) {
return;
}
// Get the data pointers
A = mxGetPr(prhs[0]);
B = mxGetPr(prhs[1]);
C = mxGetPr(plhs[0]);
k = 0;
// Assign 1st index to all values B that are less than 1st A value
while( k < n && B[k] < *A ) {
C[k++] = 1.0;
}
// Step through until B is between two A values, then test for result
i = 0;
for( j=k; j<n; j++ ) {
while( i+1 < m && B[j] >= A[i+1] ) i++;
if( i+1 == m ) break;
if( B[j] - A[i] < A[i+1] - B[j] ) {
C[j] = i + 1;
} else {
C[j] = i + 2;
}
}
// Assign last index to all values B that are more than last A value
while( j < n ) {
C[j++] = m;
}
}

Jos (10584)
Jos (10584) on 29 May 2015
I point you to my NEARESTPOINT function, available on the File Exchange:

Categories

Find more on Sparse Matrices in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!