Efficient method for finding index of closest value in very large array for a very large amount of examples

Question

Ben M on 22 May 2015

0
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/218033-efficient-method-for-finding-index-of-closest-value-in-very-large-array-for-a-very-large-amount-of-e

Answered: Damon Landau on 8 Aug 2018

I have two very large one dimensional arrays, 'aRef' which is around 11,000,000 elements and 'aTest' which is around 10,000,000 elements. I need to find the index of the closest element in 'aRef' for all elements in 'aTest'. 'aRef' is sorted and 'aTest' can be sorted if that will help performance.

- Method 1: Returns at out of memory error as the arrays are far too large

diff = abs(bsxfun(@minus,aRef,aTest'));
[~, I] = min(diff);

- Method 2: Takes around 0.03 seconds per iteration (but varies greatly) and therefore around 300000 seconds in total

for k = 1:n
  diff = abs(aRef- aTest(k));
  [~, I(k)] = min(diff);
end

- Method 3: Takes around 0.013 seconds per iteration and therefore 130000 seconds in total

 for k = 1:n
   i_lower  = find(aRef <= aTest(k),1,'last');
   i_higher = find(aRef >= aTest(k),1,'first');
 end

Is there a more efficient method for this that won't exhaust the memory or take so long to run?

Thanks for your help.

1 Comment
Show -1 older commentsHide -1 older comments

Stephen23 on 29 May 2015

Do not use diff as a variable name, as this is the names of the very useful inbuilt diff function.

One can check if a name is already defined using which.

Sign in to comment.

Sign in to answer this question.

Answer 1

Guillaume on 22 May 2015

5
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/218033-efficient-method-for-finding-index-of-closest-value-in-very-large-array-for-a-very-large-amount-of-e#answer_180037

Open in MATLAB Online

Note: Using diff as a variable name is not a good idea as it shadows the very useful diff function. Also, for method 2, your code does not show the preallocation of I. If you don't preallocate I, it will seriously slow down the code.

Anyway, for two vectors of around 10,000 elements, the following is around 200 times faster than your method 1 on my machine.

edges = [-Inf, mean([aRef(2:end); aRef(1:end-1)]), +Inf];
I = discretize(aTest, edges);

Basically, it construct an edge vector half way between each elements of your aRef, and use the histogram functions of matlab to get the bin index your aTest would fall in. discretize is new in R2015a. On 2014b, you can use the third return value of hiscounts. On even older versions, the 2nd return value of histc (although histc behaves slightly differently with regards to the last bin).

%2014b
[~, ~, I] = histcounts(aTest, edges); %probably slower than discretize
%before 2014b
[~, I] = histc(aTest, edges); %return an extra element (for the +Inf bin)
I(end) = [];

4 Comments
Show 2 older commentsHide 2 older comments

Richard Barrett-Jolley on 1 Feb 2018

thanks!

Richard Barrett-Jolley on 1 Feb 2018

..actually none of these seem to quite work for me yet!

Sign in to comment.

Answer 2

Damon Landau on 8 Aug 2018

3
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/218033-efficient-method-for-finding-index-of-closest-value-in-very-large-array-for-a-very-large-amount-of-e#answer_331968

Open in MATLAB Online

I = interp1(aRef, 1:numel(aRef), aTest, 'nearest', 'extrap');

should be faster and (arguably) more straightforward than "discretize"

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Answer 3

James Tursa on 22 May 2015

2
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/218033-efficient-method-for-finding-index-of-closest-value-in-very-large-array-for-a-very-large-amount-of-e#answer_180105

Edited: James Tursa on 22 May 2015

Open in MATLAB Online

This is a fairly simple mex routine if you start with both arrays sorted, since you don't need to make comparisons between all of the elements to get the answer. A simple stepping process can get the job done. E.g., the timing on the sizes you have listed:

>> aRef  = sort(rand(11000000,1));
>> aTest = sort(rand(10000000,1));
>> tic; aClosest = closest(aRef,aTest); toc
Elapsed time is 0.130508 seconds.

A shorter example to see that the results are as expected:

>> aRef = 1:10
aRef =
     1     2     3     4     5     6     7     8     9    10
>> aTest = sort(rand(1,20)*20-5)
aTest =
  Columns 1 through 15
   -4.4987   -2.2252   -2.0507   -1.2286   -1.0066   -0.9144   -0.2292   -0.0079    4.0035    4.6724    5.1178    5.3893    5.6688    7.4530   10.6621
  Columns 16 through 20
   10.9377   12.8271   12.9305   13.5336   14.6298
>> aClosest = closest(aRef,aTest)
aClosest =
     1     1     1     1     1     1     1     1     4     5     5     5     6     7    10    10    10    10    10    10

The mex routine is below. (Note the caveat about not handling Inf's and NaN's consistently, although that code could be added if needed)

// closest.c
// C = closest(A,B)
// Inputs:  A = full real double vector (reference), sorted ascending
//          B = full real double vector (test),      sorted ascending
// Output:  C = index of closest value in A to value in B
//          i.e., A(C(i)) is the closest value in A to B(i)
//          Ties are resolved in favor of higher index
//          Is not currently coded to handle Inf's and NaN's consistently
//          C is the same size as B
// Programmer:  James Tursa
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    size_t i, j, k, m, n;
    double *A, *B, *C;
//    
// Check input arguments
    if( nrhs != 2 || 
        !mxIsDouble(prhs[0]) || !mxIsDouble(prhs[1]) ||
        mxIsComplex(prhs[0]) || mxIsComplex(prhs[1]) ||
        mxIsSparse(prhs[0])  || mxIsSparse(prhs[1]) ) {
        mexErrMsgTxt("Need exactly two full double vectors as input");
    }
// Check number of output arguments
    if( nlhs > 1 ) {
        mexErrMsgTxt("Too many outputs");
    }
// Get the number of elements involved
    m = mxGetNumberOfElements(prhs[0]);
    n = mxGetNumberOfElements(prhs[1]);
// Disallow empty reference vector
    if( m == 0 ) {
        mexErrMsgTxt("Reference vector (1st input) cannot be empty");
    }
// Create uninitialized output
    if( mxGetM(prhs[1]) == 1 ) {
        plhs[0] = mxCreateUninitNumericMatrix( 1, n, mxDOUBLE_CLASS, mxREAL );
    } else {
        plhs[0] = mxCreateUninitNumericMatrix( n, 1, mxDOUBLE_CLASS, mxREAL );
    }
// If B is empty, simply return empty result
    if( n == 0 ) {
        return;
    }
// Get the data pointers
    A = mxGetPr(prhs[0]);
    B = mxGetPr(prhs[1]);
    C = mxGetPr(plhs[0]);
    k = 0;
// Assign 1st index to all values B that are less than 1st A value
    while( k < n && B[k] < *A ) {
        C[k++] = 1.0;
    }
// Step through until B is between two A values, then test for result
    i = 0;
    for( j=k; j<n; j++ ) {
        while( i+1 < m && B[j] >= A[i+1] ) i++;
        if( i+1 == m ) break;
        if( B[j] - A[i] < A[i+1] - B[j] ) {
            C[j] = i + 1;
        } else {
            C[j] = i + 2;
        }
    }
// Assign last index to all values B that are more than last A value
    while( j < n ) {
        C[j++] = m;
    }
}