What is the complexity of ismember function?
Show older comments
I want to know what is the complexity of the ismember function. Assuming that my vectors A and B are from size a and b respectively. both of the vectors have integer numbers that arranged in ascending order (number can be missing) and no number can repeats more than once. I am using this syntax: [lia,locb]=ismember(A,B).
thank you!
Accepted Answer
More Answers (3)
Bruno Luong
on 5 Oct 2018
Edited: Bruno Luong
on 5 Oct 2018
0 votes
"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"
You could do with complexity of (a+b) just by scan alternatively A(i) and B(j) such that B(j-1) <= A(i) < B(j) at all steps and derive the result from that.
Of course (a+b) is the best possible complexity one can achieve.
Now you can do in MATLAB for for loop but as we know MATLAB for loop is far from the speed for C for-loop. So the best is to program a MEX file. Such algorithm is closely related to merge sorted array problem (see this FEX). This problem is one step of the method of sorting by divide and conquer method.
3 Comments
Walter Roberson
on 5 Oct 2018
"Of course (a+b) is the best possible complexity one can achieve."
No. Let a = 1 (a single value to look for), then with B being sorted, you can search B in log2(b) steps, and log2(b) < 1+b for b > 1
Bruno Luong
on 5 Oct 2018
Edited: Bruno Luong
on 5 Oct 2018
Theoretically that is right for this degenerated case.
In practice such function - if programmed generically as a library as all MATLAB stock function - must assert for A and B to be sorted, and that verification step alone requires already (a+b).
Walter Roberson
on 5 Oct 2018
Stephen Cobeldick pointed to a discussion of the internal functions that do not check first if the values are sorted.
Sorted is being given as a pre-condition, and that potentially changes the algorithms.
Bruno Luong
on 5 Oct 2018
Edited: Bruno Luong
on 12 Oct 2018
"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"
You could do with complexity of (a+b) just by scan alternatively A(i) and B(j) such that B(j-1) <= A(i) < B(j) at all steps and derive the result from that.
Of course (a+b) is the best possible complexity one can achieve.
MATLAB does not have such algo in a ready-to-be-use function. Now you can do in MATLAB for for loop but as we know MATLAB for loop is far from the speed for C for-loop. So the best is to program a MEX file. Such algorithm is closely related to merge sorted array problem (see this FEX). This problem is one step of the method of sorting by divide and conquer method.
Note: (a+b) is already the time spend to construct A and B, admittedly outside the ismembersorted algo itself, .
Just for you Walter, who seems to like to scratch the details, I'll modify the algorithm slightly by chopping the non irrelevant head and tail parts of B by two binary searches.
[~,i] = histc(A([1,end]),[B,Inf]); % O(log(b))
i1 = max(i(1),1);
i2 = i(2);
then followed by the (linear-time) alternating merge scan on A and B(i1:i2).
The complexity of both steps combined is O(log(b)) + O(a+i2-i1);
It's now has the lowest complexity not only in average sense but for all the cases.
You might say actually the construction of [B;Inf] is O(b), but I can tweak the chopping without explicitly constructing [B;Inf], thus for simplicity of the explanation I can assume the construction of [B;Inf] is actually free.
Implementation in MEX
/**************************************************************************
* MATLAB mex function ismembers.c
* Calling syntax:
* >> TF = ismembers(A,B)
*
* Like MATLAB stock function ISMEMBER but assuming A and B are numerical
* vectors and sorted (not checked)
*
* Complex array and string arrays not supported
*
* Author Bruno Luong <brunoluong@yahoo.com>
*************************************************************************/
#include "mex.h"
#include "matrix.h"
/* Define correct type depending on platform
* You might have to modify here depending on your compiler */
#if defined(_MSC_VER) || defined(__BORLANDC__)
typedef __int64 int64;
typedef __int32 int32;
typedef __int16 int16;
typedef __int8 int08;
typedef unsigned __int64 uint64;
typedef unsigned __int32 uint32;
typedef unsigned __int16 uint16;
typedef unsigned __int8 uint08;
#else /* LINUX + LCC, CAUTION: not tested by the author */
typedef long long int int64;
typedef long int int32;
typedef short int16;
typedef char int08;
typedef unsigned long long int uint64;
typedef unsigned long int uint32;
typedef unsigned short uint16;
typedef unsigned char uint08;
#endif
// SCAN engine Template
#define SCAN(type) { \
Ae = (void*)((type*)A + na); \
Be = (void*)((type*)B + nb); \
while (1) \
{ \
if ( *((type*)A) < *((type*)B) ) { \
TF++; \
A = (void*)((type*)A+1); \
if (A>=Ae) break; \
} \
else if ( *((type*)A) == *((type*)B) ) { \
*TF++ = 1; \
A = (void*)((type*)A+1); \
if (A>=Ae) break; \
} \
else { \
B = (void*)((type*)B+1); \
if (B>=Be) break; \
}; \
}; \
}
// SCAN_LOC engine Template
#define SCAN_LOC(type) { \
Ae = (void*)((type*)A + na); \
Be = (void*)((type*)B + nb); \
while (1) \
{ \
if ( *((type*)A) < *((type*)B) ) { \
TF++; \
A = (void*)((type*)A+1); \
loc++; \
if (A>=Ae) break; \
} \
else if ( *((type*)A) == *((type*)B) ) { \
*TF++ = 1; \
A = (void*)((type*)A+1); \
*loc++ = k; \
if (A>=Ae) break; \
} \
else { \
B = (void*)((type*)B+1); \
k++; \
if (B>=Be) break; \
}; \
}; \
}
/* Gateway routine ismembers */
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[]) {
const mxArray *A_IN, *B_IN;
mxClassID ClassID;
mwIndex m, n, na, nb, nc;
void *A, *B, *Ae, *Be;
double *loc;
uint32 k;
uint08 *TF;
if (nrhs!=2)
mexErrMsgTxt("ismembers: Two input arguments are required.");
/* Check data type of input argument */
B_IN = prhs[1];
m = (mwIndex)mxGetM(B_IN);
n = (mwIndex)mxGetN(B_IN);
if ((m > 1) && (n > 1))
mexErrMsgTxt("ismembers: Second input argument must be a vector.");
/* Get the total number elements of B */
nb = m*n;
/* Check data type of input argument */
A_IN = prhs[0];
m = (mwIndex)mxGetM(A_IN);
n = (mwIndex)mxGetN(A_IN);
if ((m > 1) && (n > 1))
mexErrMsgTxt("ismembers: First input argument must be a vector.");
/* Get the total number elements of A */
na = m*n;
plhs[0] = mxCreateLogicalMatrix(m, n);
TF = mxGetLogicals(plhs[0]);
ClassID = mxGetClassID(A_IN);
if (ClassID != mxGetClassID(B_IN))
mexErrMsgTxt("Two inputs must have the same classes.");
A = mxGetDoubles(A_IN);
B = mxGetDoubles(B_IN);
if (nlhs<=1) {
switch (ClassID) {
case mxDOUBLE_CLASS:
SCAN(double);
break;
case mxSINGLE_CLASS:
SCAN(float);
break;
case mxINT64_CLASS:
SCAN(int64);
break;
case mxUINT64_CLASS:
SCAN(uint64);
break;
case mxINT32_CLASS:
SCAN(int32);
break;
case mxUINT32_CLASS:
SCAN(uint32);
break;
case mxCHAR_CLASS:
SCAN(uint16);
break;
case mxINT16_CLASS:
SCAN(int16);
break;
case mxUINT16_CLASS:
SCAN(uint16);
break;
case mxLOGICAL_CLASS:
SCAN(uint08);
break;
case mxINT8_CLASS:
SCAN(int08);
break;
case mxUINT8_CLASS:
SCAN(uint08);
break;
default:
mexErrMsgTxt("ismembers: Class not supported.");
} /* switch */
}
else
{
plhs[1] = mxCreateNumericMatrix(m, n, mxDOUBLE_CLASS, mxREAL);
loc = mxGetDoubles(plhs[1]);
k = 1; /* MATLAB indexing starts with 1 */
switch (ClassID) {
case mxDOUBLE_CLASS:
SCAN_LOC(double);
break;
case mxSINGLE_CLASS:
SCAN_LOC(float);
break;
case mxINT64_CLASS:
SCAN_LOC(int64);
break;
case mxUINT64_CLASS:
SCAN_LOC(uint64);
break;
case mxINT32_CLASS:
SCAN_LOC(int32);
break;
case mxUINT32_CLASS:
SCAN_LOC(uint32);
break;
case mxCHAR_CLASS:
SCAN_LOC(uint16);
break;
case mxINT16_CLASS:
SCAN_LOC(int16);
break;
case mxUINT16_CLASS:
SCAN_LOC(uint16);
break;
case mxLOGICAL_CLASS:
SCAN_LOC(uint08);
break;
case mxINT8_CLASS:
SCAN_LOC(int08);
break;
case mxUINT8_CLASS:
SCAN_LOC(uint08);
break;
default:
mexErrMsgTxt("ismembers: Class not supported.");
} /* switch */
}
return;
} /* ismembers */
8 Comments
Walter Roberson
on 6 Oct 2018
"Just for you Walter, who seems to like to scratch the details"
The poster did not ask for real-world efficiency: the poster asked about lowest algorithmic complexity, and the details make a difference for that.
To phrase another way: the poster would be happy with an algorithm that did nothing for three years and then processed the complete ismember() in log2(a)*log2(b) time because that would be better algorithmic complexity than a*log2(b) and no-body cares about the constant terms for algorithmic complexity...
Bruno Luong
on 6 Oct 2018
Edited: Bruno Luong
on 6 Oct 2018
"The poster did not ask for real-world efficiency"
Sorry that's your interpretation, not mine
Original poster also asked if there is a better function that can deal for sorted arrays. I propose one for him/her to use that doesn't cost more than time of constructing A and B or check if A and B verifies the precondition. I put down the complexity argument because obviously he/she wants to judge the efficiency from that piece of information. Many people including myself use complexity to chose an algorithm even if there is other factors to count also.
Furthermore none of the method I propose has big constant time, they are even all small, why bring this topic in?
Walter Roberson
on 6 Oct 2018
The poster consistently asked about complexity. The poster qualified "better" with indication of lower complexity. It is a theory question, not a practical question. They did not, for example, give array size estimates and ask about the fastest under those conditions: they ask directly about complexity, repeatedly.
Bruno Luong
on 6 Oct 2018
Edited: Bruno Luong
on 6 Oct 2018
OP : "Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"
If this is not practical question what else is?
And I post both a practical algorithm (merge sorted arrays) as well as theoretical complexity of it. I don't know what's your problem? Must have a last word?
I let him/her read and decide. You also should.
Anything else??? Geez....
Walter Roberson
on 11 Oct 2018
The poster did not ask about "faster", or performance on a particular size of data, only about complexity. It is a theoretical question. In Computing, "complexity" has a specific meaning, one that ignores real-world performance measures. See for example https://en.wikipedia.org/wiki/Matrix_multiplication#Related_complexities
"Secondly, in practical implementations, one never uses the matrix multiplication algorithm that has the best asymptotical complexity, because the constant hidden behind the big O notation is too large for making the algorithm competitive for sizes of matrices that can be manipulated in a computer."
But complexity analysis does not care about trivial things such as cache line effects or whether there is enough memory in the Universe to make an algorithm useful.
Bruno Luong
on 11 Oct 2018
Edited: Bruno Luong
on 11 Oct 2018
So what is your point Walter? I have provided the complexity O(a+b)as well, and even correct the wrong complexity of ISMEMBER you stated earlier.
Whereas you agree or not, the ultimate goal of complexity analysis is to provide a faster code, even if there is other factors also play and there is an intrinsic theoretical interest of this branch.
Second why you keep making over and over the same statement?
I hear you but I don't agree with the fact that OP interests only on complexity. I think OP also wants a faster algorithm for his sorted arrays. Let him decide, and can we (you and me) done with it?
Usually unless in the Answer each of posters might have difference interpretation about the question and reply as he though.
I though the implicit minimal courtesy rule here is once we agree on not agreed, let each individual answer live on each own. Sorry but you are really heavy on this one.
Anat Kira
on 11 Oct 2018
Bruno Luong
on 12 Oct 2018
Edited: Bruno Luong
on 12 Oct 2018
Here is a test results on arrays of 10000 elements for both A and B, results might vary for different input size and density.
ismember time = 146.440 [ms]
ismembc/ismembc2 time = 156.300 [ms]
ismembers time = 121.969 [ms]
NOTE: ISMEMBC2 returns that index of the last occurrence in B so it's not 100% compatible with ISMEMBER which returns the first occurrence (unless using 'legacy' argument).
Test script is
ntest = 100;
t = zeros(3,ntest);
for k=1:ntest
A = sort(ceil(20000*rand(1,1e4)));
B = sort(ceil(20000*rand(1,1e4)));
tic
[TF1,loc1] = ismember(A,B);
t(1,k) = toc;
tic
TF2 = ismembc(A,B);
loc2 = ismembc2(A,B);
t(2,k) = toc;
tic
[TF3,loc3] = ismembers(A,B);
t(3,k) = toc;
end
t = mean(t,2)*1e6;
fprintf('ismember time = %1.3f [ms]\n', t(1));
fprintf('ismembc/ismembc2 time = %1.3f [ms]\n', t(2));
fprintf('ismembers time = %1.3f [ms]\n', t(3));
Bruno Luong
on 7 Oct 2018
Edited: Bruno Luong
on 7 Oct 2018
I run this code (R2018b) by changing the length of B is ISMEMBER(A,B), length A fix. It shows clearly the timing is NOT proportional to log(b) but rather to b.
For ISMEMBERC and ISMEMBERC2 is not proportional to b but something lower, certainly log(b).
The complexity of large array A/B of ISMEMBER is (a + b)*log(b)
a = 10;
b = 2.^(1:22);
t = zeros(size(b));
for n=1:length(b)
A = rand(1,10);
B = rand(1,b(n));
tic
ismember(A,B);
t(n) = toc;
end
close all
subplot(1,2,1);
semilogx(b,t);
xlabel('length(B)');
ylabel('time of ISMEMBER(...,B)')
subplot(1,2,2);
loglog(b,t);
xlabel('length(B)');
ylabel('time of ISMEMBER(...,B)')

2 Comments
Bruno Luong
on 7 Oct 2018
For comparison, same curves for ISMEMBC and ISMEMBC2, the engine of ISMEMBER without argument checking:


Walter Roberson
on 11 Oct 2018
Original question stated "Assuming that my vectors A and B are from size a and b respectively. both of the vectors have integer numbers that arranged in ascending order"
The code
A = rand(1,10);
B = rand(1,b(n));
ismember(A,B);
violates the "arranged in ascending order" pre-condition that is key to the question. The question is specifically about what happens given that pre-condition. As I stated in my Answer, ismember() first checks for B sorted, which is an O(b) process where b = length(B). When B is not sorted, then it needs to sort it; as quicksort is used, that is O(b*log(b)) . Once B is known sorted then ismemberc or (more recently) builtin('_ismemberhelper') is called to do a binary search. With A not being sorted, we can surmise that the overall search process after the sorting check would be a * log2(b) where a = length(A) . In the sorted case this leads to an overall complexity for ismember of b + a*log2(b) .
Except in the degenerate case of b = 1 or 2, this will be worse than a + b, indicating that given those pre-conditions, you can do better than calling ismember() itself -- that at worst you could use the a+b algorithm Bruno indicated earlier.
An algorithm such as the a+b algorithm Bruno indicated might perhaps be the most practical algorithm over a fair range of array sizes. More complicated algorithms can be designed that have a lower theoretical complexity cost, but the overheads of setting them up are not free, and probably have worse cache-line effects, so you might potentially have to go to pretty large matrix sizes before a theoretically better algorithm would do better: "Quality of Implementation" would be important for practical purposes.
On the other hand, consider if a is much smaller than b. Suppose, for example, that a = 100 and b = 2^30 . Then O(a+b) would be approximately 1073741924 comparisons. But using the internal assumed-sorted ismemberc or builtin('_ismemberhelper') would be O(a*log2(b)) = O(100 * log2(2^30)) = O(100 * 30) or about 3000. Thus, an O(a+b) algorithm might be considerably worse when the sizes are rather different from each other.
Are there theoretically better algorithms under the sorted precondition without assuming that the sizes are so unequal? It looks like there may be. https://stackoverflow.com/questions/12993655/find-common-elements-in-2-sorted-arrays describes a "blended" algorithm that it is claimed has better complexity, and references a paper on merging that proves a bound.
Categories
Find more on Matrix Indexing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!