What is the complexity of ismember function?

Question

1 vote

I want to know what is the complexity of the ismember function. Assuming that my vectors A and B are from size a and b respectively. both of the vectors have integer numbers that arranged in ascending order (number can be missing) and no number can repeats more than once. I am using this syntax: [lia,locb]=ismember(A,B).

thank you!

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Walter Roberson on 5 Oct 2018

1 vote

The ismember code first checks the array with issorted(B), which is O(length(B)) . If necessary it sorts it, which does not apply in your described case. Then it calls an internal binary search routine passing in A and sorted B; average complexity of that is probably length(A) * log2(length(B)) .

It is not specified whether the internal routine does any optimization such as starting from the position of the previous result if the next value to be searched is greater than the immediately preceding value to be looked for.

3 Comments
Show 1 older comment Hide 1 older comment

Stephen23 on 5 Oct 2018

"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted?"

https://www.mathworks.com/matlabcentral/answers/92533-how-do-i-perform-a-binary-search-of-a-presorted-array

Walter Roberson on 5 Oct 2018

In the situation where the inputs are both sorted, then there is probably a lower complexity algorithm, but it would not be easy to calculate what the complexity would be.

For example, you could do a full binary search of B to find the first element of A. Once you know that, that provides a lower bound for the next search, but not an upper bound. If you do your next binary search from there to the end of B, then each time you are throwing away information about the upper end of B. So you could alternate searching from the two ends, which would narrow down first the lower bound and then the upper bound on pairs of searches.

Now, if you were to implement that in MATLAB, the overhead of the coding would be high compared to a C implementation. It might take tens of millions of input elements for the reduced order complexity to overcome the implementation inefficiencies if you implemented in MATLAB. Algorithmic complexity is not the only thing you need to consider in practice.

Sign in to comment.

Answer 2

Bruno Luong on 5 Oct 2018

Edited: Bruno Luong on 5 Oct 2018

0 votes

"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"

You could do with complexity of (a+b) just by scan alternatively A(i) and B(j) such that B(j-1) <= A(i) < B(j) at all steps and derive the result from that.

Of course (a+b) is the best possible complexity one can achieve.

Now you can do in MATLAB for for loop but as we know MATLAB for loop is far from the speed for C for-loop. So the best is to program a MEX file. Such algorithm is closely related to merge sorted array problem (see this FEX). This problem is one step of the method of sorting by divide and conquer method.

3 Comments
Show 1 older comment Hide 1 older comment

Bruno Luong on 5 Oct 2018

Edited: Bruno Luong on 5 Oct 2018

Theoretically that is right for this degenerated case.

In practice such function - if programmed generically as a library as all MATLAB stock function - must assert for A and B to be sorted, and that verification step alone requires already (a+b).

Walter Roberson on 5 Oct 2018

Stephen Cobeldick pointed to a discussion of the internal functions that do not check first if the values are sorted.

Sorted is being given as a pre-condition, and that potentially changes the algorithms.

Sign in to comment.

Answer 3

Bruno Luong on 5 Oct 2018

Edited: Bruno Luong on 12 Oct 2018

Open in MATLAB Online

0 votes

"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"

You could do with complexity of (a+b) just by scan alternatively A(i) and B(j) such that B(j-1) <= A(i) < B(j) at all steps and derive the result from that.

Of course (a+b) is the best possible complexity one can achieve.

MATLAB does not have such algo in a ready-to-be-use function. Now you can do in MATLAB for for loop but as we know MATLAB for loop is far from the speed for C for-loop. So the best is to program a MEX file. Such algorithm is closely related to merge sorted array problem (see this FEX). This problem is one step of the method of sorting by divide and conquer method.

Note: (a+b) is already the time spend to construct A and B, admittedly outside the ismembersorted algo itself, .

Just for you Walter, who seems to like to scratch the details, I'll modify the algorithm slightly by chopping the non irrelevant head and tail parts of B by two binary searches.

[~,i] = histc(A([1,end]),[B,Inf]); % O(log(b))
i1 = max(i(1),1);
i2 = i(2);

then followed by the (linear-time) alternating merge scan on A and B(i1:i2).

The complexity of both steps combined is O(log(b)) + O(a+i2-i1);

It's now has the lowest complexity not only in average sense but for all the cases.

You might say actually the construction of [B;Inf] is O(b), but I can tweak the chopping without explicitly constructing [B;Inf], thus for simplicity of the explanation I can assume the construction of [B;Inf] is actually free.

Implementation in MEX

/**************************************************************************
 * MATLAB mex function ismembers.c
 * Calling syntax:
 * >> TF = ismembers(A,B)
 *
 * Like MATLAB stock function ISMEMBER but assuming A and B are numerical
 * vectors and sorted (not checked)
 *
 * Complex array and string arrays not supported
 *
 * Author Bruno Luong <brunoluong@yahoo.com>
 *************************************************************************/
#include "mex.h"
#include "matrix.h"
/* Define correct type depending on platform
 * You might have to modify here depending on your compiler */
#if defined(_MSC_VER) || defined(__BORLANDC__)
typedef __int64 int64;
typedef __int32 int32;
typedef __int16 int16;
typedef __int8 int08;
typedef unsigned __int64 uint64;
typedef unsigned __int32 uint32;
typedef unsigned __int16 uint16;
typedef unsigned __int8 uint08;
#else /* LINUX + LCC, CAUTION: not tested by the author */
typedef long long int int64;
typedef long int int32;
typedef short int16;
typedef char int08;
typedef unsigned long long int uint64;
typedef unsigned long int uint32;
typedef unsigned short uint16;
typedef unsigned char uint08;
#endif
// SCAN engine Template
#define SCAN(type) { \
    Ae = (void*)((type*)A + na); \
    Be = (void*)((type*)B + nb); \
    while (1) \
    { \
        if ( *((type*)A) < *((type*)B) ) { \
            TF++; \
            A = (void*)((type*)A+1); \
            if (A>=Ae) break; \
        } \
        else if ( *((type*)A) == *((type*)B) ) { \
            *TF++ = 1; \
            A = (void*)((type*)A+1); \
            if (A>=Ae) break; \
        } \
        else { \
            B = (void*)((type*)B+1); \
            if (B>=Be) break; \
        }; \
    }; \
}
// SCAN_LOC engine Template
#define SCAN_LOC(type) { \
    Ae = (void*)((type*)A + na); \
    Be = (void*)((type*)B + nb); \
    while (1) \
    { \
        if ( *((type*)A) < *((type*)B) ) { \
            TF++; \
            A = (void*)((type*)A+1); \
            loc++; \
            if (A>=Ae) break; \
        } \
        else if ( *((type*)A) == *((type*)B) ) { \
            *TF++ = 1; \
            A = (void*)((type*)A+1); \
            *loc++ = k; \
            if (A>=Ae) break; \
        } \
        else { \
            B = (void*)((type*)B+1); \
            k++; \
            if (B>=Be) break; \
        }; \
    }; \
}
/* Gateway routine ismembers */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[]) {
      const mxArray *A_IN, *B_IN;
      mxClassID ClassID;
      mwIndex m, n, na, nb, nc;
      void *A, *B, *Ae, *Be;
      double *loc;
      uint32 k; 
      uint08 *TF;
      if (nrhs!=2)
          mexErrMsgTxt("ismembers: Two input arguments are required.");
      /* Check data type of input argument  */   
      B_IN = prhs[1];
      m = (mwIndex)mxGetM(B_IN);
      n = (mwIndex)mxGetN(B_IN);
      if ((m > 1) && (n > 1))
          mexErrMsgTxt("ismembers: Second input argument must be a vector.");
      /* Get the total number elements of B */
      nb = m*n;
      /* Check data type of input argument  */   
      A_IN = prhs[0];
      m = (mwIndex)mxGetM(A_IN);
      n = (mwIndex)mxGetN(A_IN);
      if ((m > 1) && (n > 1))
          mexErrMsgTxt("ismembers: First input argument must be a vector.");
      /* Get the total number elements of A */
      na = m*n;
      plhs[0] = mxCreateLogicalMatrix(m, n);
      TF = mxGetLogicals(plhs[0]);
      ClassID = mxGetClassID(A_IN);
      if (ClassID != mxGetClassID(B_IN))
          mexErrMsgTxt("Two inputs must have the same classes.");
      A = mxGetDoubles(A_IN);
      B = mxGetDoubles(B_IN);
      if (nlhs<=1) {
          switch (ClassID) {
              case mxDOUBLE_CLASS:
                  SCAN(double);
                  break;
              case mxSINGLE_CLASS:
                  SCAN(float);
                  break;
              case mxINT64_CLASS:
                  SCAN(int64);
                  break;
              case mxUINT64_CLASS:
                  SCAN(uint64);
                  break;
              case mxINT32_CLASS:
                  SCAN(int32);
                  break;
              case mxUINT32_CLASS:
                  SCAN(uint32);
                  break;
              case mxCHAR_CLASS:
                  SCAN(uint16);
                  break;
              case mxINT16_CLASS:
                  SCAN(int16);
                  break;
              case mxUINT16_CLASS:
                  SCAN(uint16);
                  break;
              case mxLOGICAL_CLASS:
                  SCAN(uint08);
                  break;
              case mxINT8_CLASS:
                  SCAN(int08);
                  break;
              case mxUINT8_CLASS:
                  SCAN(uint08);
                  break;
              default:
                  mexErrMsgTxt("ismembers: Class not supported.");
          } /* switch */
      }
      else
      {
          plhs[1] = mxCreateNumericMatrix(m, n, mxDOUBLE_CLASS, mxREAL);
          loc = mxGetDoubles(plhs[1]);
          k = 1; /* MATLAB indexing starts with 1 */ 
          switch (ClassID) {
              case mxDOUBLE_CLASS:
                  SCAN_LOC(double);
                  break;
              case mxSINGLE_CLASS:
                  SCAN_LOC(float);
                  break;
              case mxINT64_CLASS:
                  SCAN_LOC(int64);
                  break;
              case mxUINT64_CLASS:
                  SCAN_LOC(uint64);
                  break;
              case mxINT32_CLASS:
                  SCAN_LOC(int32);
                  break;
              case mxUINT32_CLASS:
                  SCAN_LOC(uint32);
                  break;
              case mxCHAR_CLASS:
                  SCAN_LOC(uint16);
                  break;
              case mxINT16_CLASS:
                  SCAN_LOC(int16);
                  break;
              case mxUINT16_CLASS:
                  SCAN_LOC(uint16);
                  break;
              case mxLOGICAL_CLASS:
                  SCAN_LOC(uint08);
                  break;
              case mxINT8_CLASS:
                  SCAN_LOC(int08);
                  break;
              case mxUINT8_CLASS:
                  SCAN_LOC(uint08);
                  break;
              default:
                  mexErrMsgTxt("ismembers: Class not supported.");
          } /* switch */
      }
      return;
} /* ismembers */

8 Comments
Show 6 older comments Hide 6 older comments

Anat Kira on 11 Oct 2018

My first question was theoretical and the second was practical, but all the answers were very helpful for me, so thanks to everyone who answered.

Bruno Luong on 12 Oct 2018

Edited: Bruno Luong on 12 Oct 2018

Open in MATLAB Online

Here is a test results on arrays of 10000 elements for both A and B, results might vary for different input size and density.

ismember time = 146.440 [ms]
ismembc/ismembc2 time = 156.300 [ms]
ismembers time = 121.969 [ms]

NOTE: ISMEMBC2 returns that index of the last occurrence in B so it's not 100% compatible with ISMEMBER which returns the first occurrence (unless using 'legacy' argument).

Test script is

ntest = 100;
t = zeros(3,ntest);
for k=1:ntest
    A = sort(ceil(20000*rand(1,1e4)));
    B = sort(ceil(20000*rand(1,1e4)));
      tic
      [TF1,loc1] = ismember(A,B);
      t(1,k) = toc;
      tic
      TF2 = ismembc(A,B);
      loc2 = ismembc2(A,B);
      t(2,k) = toc;
      tic
      [TF3,loc3] = ismembers(A,B);
      t(3,k) = toc;
  end
t = mean(t,2)*1e6;
fprintf('ismember time = %1.3f [ms]\n', t(1));
fprintf('ismembc/ismembc2 time = %1.3f [ms]\n', t(2));
fprintf('ismembers time = %1.3f [ms]\n', t(3));

Sign in to comment.

Answer 4

Bruno Luong on 7 Oct 2018

Edited: Bruno Luong on 7 Oct 2018

Open in MATLAB Online

0 votes

I run this code (R2018b) by changing the length of B is ISMEMBER(A,B), length A fix. It shows clearly the timing is NOT proportional to log(b) but rather to b.

For ISMEMBERC and ISMEMBERC2 is not proportional to b but something lower, certainly log(b).

The complexity of large array A/B of ISMEMBER is (a + b)*log(b)

a = 10;
b = 2.^(1:22);
t = zeros(size(b));
for n=1:length(b)
    A = rand(1,10);
    B = rand(1,b(n));
    tic
    ismember(A,B);
    t(n) = toc;
end
close all
subplot(1,2,1);
semilogx(b,t);
xlabel('length(B)');
ylabel('time of ISMEMBER(...,B)')
subplot(1,2,2);
loglog(b,t);
xlabel('length(B)');
ylabel('time of ISMEMBER(...,B)')

2 Comments
Show None Hide None

Bruno Luong on 7 Oct 2018

For comparison, same curves for ISMEMBC and ISMEMBC2, the engine of ISMEMBER without argument checking:

Walter Roberson on 11 Oct 2018

Open in MATLAB Online

Original question stated "Assuming that my vectors A and B are from size a and b respectively. both of the vectors have integer numbers that arranged in ascending order"

The code

    A = rand(1,10);
    B = rand(1,b(n));
    ismember(A,B);

violates the "arranged in ascending order" pre-condition that is key to the question. The question is specifically about what happens given that pre-condition. As I stated in my Answer, ismember() first checks for B sorted, which is an O(b) process where b = length(B). When B is not sorted, then it needs to sort it; as quicksort is used, that is O(b*log(b)) . Once B is known sorted then ismemberc or (more recently) builtin('_ismemberhelper') is called to do a binary search. With A not being sorted, we can surmise that the overall search process after the sorting check would be a * log2(b) where a = length(A) . In the sorted case this leads to an overall complexity for ismember of b + a*log2(b) .

Except in the degenerate case of b = 1 or 2, this will be worse than a + b, indicating that given those pre-conditions, you can do better than calling ismember() itself -- that at worst you could use the a+b algorithm Bruno indicated earlier.

An algorithm such as the a+b algorithm Bruno indicated might perhaps be the most practical algorithm over a fair range of array sizes. More complicated algorithms can be designed that have a lower theoretical complexity cost, but the overheads of setting them up are not free, and probably have worse cache-line effects, so you might potentially have to go to pretty large matrix sizes before a theoretically better algorithm would do better: "Quality of Implementation" would be important for practical purposes.

On the other hand, consider if a is much smaller than b. Suppose, for example, that a = 100 and b = 2^30 . Then O(a+b) would be approximately 1073741924 comparisons. But using the internal assumed-sorted ismemberc or builtin('_ismemberhelper') would be O(a*log2(b)) = O(100 * log2(2^30)) = O(100 * 30) or about 3000. Thus, an O(a+b) algorithm might be considerably worse when the sizes are rather different from each other.

Are there theoretically better algorithms under the sorted precondition without assuming that the sizes are so unequal? It looks like there may be. https://stackoverflow.com/questions/12993655/find-common-elements-in-2-sorted-arrays describes a "blended" algorithm that it is claimed has better complexity, and references a paper on merging that proves a bound.

Sign in to comment.

What is the complexity of ismember function?

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

3 Comments
Show 1 older comment Hide 1 older comment

More Answers (3)

3 Comments
Show 1 older comment Hide 1 older comment

8 Comments
Show 6 older comments Hide 6 older comments

2 Comments
Show None Hide None

Categories

Tags

Community Treasure Hunt

What is the complexity of ismember function?

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

3 Comments Show 1 older comment Hide 1 older comment

More Answers (3)

3 Comments Show 1 older comment Hide 1 older comment

8 Comments Show 6 older comments Hide 6 older comments

2 Comments Show None Hide None

Categories

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

3 Comments
Show 1 older comment Hide 1 older comment

3 Comments
Show 1 older comment Hide 1 older comment

8 Comments
Show 6 older comments Hide 6 older comments

2 Comments
Show None Hide None