Fastest way to find number of times a number occurs in an array?

31 views (last 30 days)
As simple as that. I am currently using sum(array(:,2) == N) which I am told is faster than using the find function. Is there any faster way? This is bottlenecking my code as the array I am searching is extremely big (470 million by 2). It consists of indices in a 3D volume (array(:,1) and the numbers I am looking for corrospond to which cluster each index belongs to (array(:,2).
  2 Comments
Eric Chadwick
Eric Chadwick on 2 May 2019
Hi Madhan,
nnz() is faster than find() but not faster than linear indexing as I am doing. Its on average about the same speed as linear indexing. Thank you though.
There might just not be a quicker way unless I can do it during a previous step in my code.

Sign in to comment.

Accepted Answer

James Tursa
James Tursa on 2 May 2019
So, you've got some potential speed problems with your posted methods.
When you evaluate the expression X(:,2) repeatedly that expression creates a DEEP copy of the 2nd column each time, which can really slow things down. This can be circumvented by evaluating it once, storing the result in a variable, and then using that variable repeatedly downstream.
Another problem is the dynamic size increasing of V. Each time that happens in a loop the V data gets deep copied. This can be circumvented by pre-allocating V up front.
And finally, everytime you evaluate an element of X with the expression X(i,2) that creates a temporary MATLAB variable for that element ... about 120 bytes of overhead stuff. Do that several million times and it can add up.
The solution here is to use a function that has pre-compiled code in the background ... either a MATLAB function or a custom mex routine. A mex routine can ENTIRELY avoid any deep copies of any data and ENTIRELY avoid all of the overhead associated with temporary variables. A simple mex routine to do this (requires that you install a supported C compiler) is presented below. I then make a sample run to show the speed improvements.
The mex source code:
(You may want to alter the TOO_BIG value in the source code. It is there to prevent trying to allocate too big of an array, but I wasn't sure what the expected range of your data is)
/* counts occurrences of integer numbers in column C of 2D full double matrix
*
* Syntax: Y = count_clusters(X,C)
*
* X = full double MxN matrix
* column C contains only integers > 0 and not too big
* C = scalar integer column number to use
* where 1 <= C <= size(X,2)
* Y = max(X(:,C))x1 full double column vector containing the counts
* where Y(i) = sum(X(:,C)==i)
*
* Programmer: James Tursa
* Date: May 2019
*
*/
/* Includes ----------------------------------------------------------- */
#include "mex.h"
/* Macros */
#define TOO_BIG 1000000
/* Gateway ------------------------------------------------------------ */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double *xpr, *Xpr, *Ypr;
double xmax, c;
mwSize i, m, x, C;
/* Argument checks */
if( nrhs == 0 ) { /* Assume only loading mex routine into memory */
return;
}
if( nrhs != 2 ) {
mexErrMsgTxt("Need exactly two inputs");
}
if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
mxIsSparse(prhs[0]) || mxGetNumberOfDimensions(prhs[0]) > 2 ) {
mexErrMsgTxt("1st input must be a real full double 2D matrix");
}
if( !mxIsNumeric(prhs[1]) || mxIsComplex(prhs[1]) ||
mxGetNumberOfElements(prhs[1]) != 1 ) {
mexErrMsgTxt("2nd input must be a numeric scalar");
}
if( nlhs > 1 ) {
mexErrMsgTxt("Too many outputs.");
}
C = c = mxGetScalar(prhs[1]);
if( c < 0.0 || C != c || C > mxGetN(prhs[0]) ) {
mexErrMsgTxt("Invalid column number (2nd input)");
}
/* Check for empty input */
if( mxGetNumberOfElements(prhs[0]) == 0 ) {
plhs[0] = mxCreateDoubleMatrix( 0, 0, mxREAL );
return;
}
/* Get the max and check for invalid */
m = mxGetM(prhs[0]);
xpr = Xpr = (double *) mxGetData(prhs[0]) + m * (C-1); /* Point to column C */
xmax = *xpr;
for( i=0; i<m; i++ ) {
/* If you know the values are valid, you can comment this check out */
if( *xpr <= 0.0 || (mwSize)*xpr != *xpr ) { /* Make sure it is integer > 0 */
mexErrMsgTxt("Invalid value in column");
}
if( *xpr > xmax ) {
xmax = *xpr;
}
xpr++;
}
if( xmax > TOO_BIG ) {
mexErrMsgTxt("Max value in column is too large");
}
/* Create result */
plhs[0] = mxCreateDoubleMatrix( (mwSize) xmax, 1, mxREAL );
Ypr = mxGetPr(plhs[0]) - 1; /* Offset the pointer to 1-based (not quite kosher) */
/* Fill in the result */
for( i=0; i<m; i++ ) {
Ypr[ (mwSize) *Xpr++ ]++;
}
}
Note that if you KNOW that the column values are positive integers prior to calling the mex routine, you could comment the following lines of code out to increase the speed even more.
/* If you know the values are valid, you can comment this check out */
if( *xpr <= 0.0 || (mwSize)*xpr != *xpr ) { /* Make sure it is integer > 0 */
mexErrMsgTxt("Invalid value in column");
}
Here is some m-code for testing:
% Test code for the mex routine count_clusters
xmax = 5000; % range of column 2 data
m = 10000000; % number of rows
X = randi(xmax,m,2); % generate sample data
disp(' ');
disp('Counting Clusters test code');
fprintf('xmax = %d\n',xmax);
fprintf('m = %d\n',m);
disp(' ')
disp('posted method, LOTS of deep copies of X(:,2), no pre-allocation')
clear V
tic
for i = 1:max(X(:,2))
V(i) = sum(X(:,2) == i);
end
toc
disp(' ')
disp('posted method variation, only one deep copy of X(:,2), with pre-allocation')
clear X2 V1
tic
X2 = X(:,2);
m = max(X2);
V1 = zeros(m,1);
for i = 1:m
V1(i) = sum(X2 == i);
end
toc
isequal(V(:),V1(:))
disp(' ')
disp('accumarray, only one deep copy of X(:,2)')
clear count
tic
count = accumarray(X(:,2),1);
toc
isequal(V(:),count(:))
disp(' ')
disp('mex routine, no copies of anything')
clear Y
count_clusters; % Load mex routine into memory
tic
Y = count_clusters(X,2);
toc
isequal(V(:),Y(:))
disp(' ')
disp('mex routine, no copies of anything, no data check')
clear Y2
count_clusters2; % Load mex routine into memory
tic
Y2 = count_clusters2(X,2);
toc
isequal(V(:),Y2(:))
And here is a sample run:
>> count_clusters_test
Counting Clusters test code
xmax = 5000
m = 10000000
posted method, LOTS of deep copies of X(:,2), no pre-allocation
Elapsed time is 350.707066 seconds.
posted method variation, only one deep copy of X(:,2), with pre-allocation
Elapsed time is 30.428389 seconds.
ans =
logical
1
accumarray, only one deep copy of X(:,2)
Elapsed time is 0.194782 seconds.
ans =
logical
1
mex routine, no copies of anything
Elapsed time is 0.047998 seconds.
ans =
logical
1
mex routine, no copies of anything, no data check
Elapsed time is 0.017843 seconds.
ans =
logical
1
So the mex routine is indeed the fastest. Much faster than the looping methods, and a bit faster than accumarray.
  1 Comment
Eric Chadwick
Eric Chadwick on 3 May 2019
Wow. First of all, thank you so much for putting so much time and effort into my question! This is amazing.
I just implemented this program and it works beautifully! Makes me want to convert my entire code into mex files...maybe in the future.
Thanks again!
Cheers,
Eric

Sign in to comment.

More Answers (2)

Steven Lord
Steven Lord on 2 May 2019
Are you looking for the size of regions in an image? If so calling regionprops from Image Processing Toolbox on your label matrix will probably give you the information you want.
If you do just want counts regardless of whether the labels are contiguous, consider histcounts (or histcounts2) that are part of MATLAB proper.
Alternately some of the grouping functions in the "Grouping and Binning Data" section of this documentation page may be helpful.
If none of those functions do what you want, can you show us an example using a smaller data set? Make a 10-by-2 matrix of data and show us and explain to us exactly what you want the result of this processing step on that example data set to be.
  3 Comments
Steven Lord
Steven Lord on 3 May 2019
Based on that description, I believe it's as simple as calling histcounts on your second column of X.
V = histcounts(X(:, 2));
On my machine running release R2018b, generating a sample list of cluster values actually took longer than counting how many values fell into each bin.
tic;
X = randi(100, 470000000, 1);
toc
tic
[V, E] = histcounts(X, 'BinMethod', 'integers');
toc
Let's check.
nnz(X == 84) % As an example
V(84)
Eric Chadwick
Eric Chadwick on 3 May 2019
Hi Stephen,
Thank you, but I think James' answer is the fastest and is what I am going to use.
Cheers,
Eric

Sign in to comment.


Guillaume
Guillaume on 2 May 2019
Assuming that your clusters are integer numbers from 1 to N with no gaps:
count = accumarray(1, X(:, 2));
If not,
[cluster, ~, id] = unique(X(:, 2));
count = accumarray(1, X(:, 2));
table(cluster, count) %for pretty display.
As suggested by Steven, you can also use histcounts for that.
  2 Comments
Eric Chadwick
Eric Chadwick on 2 May 2019
Hi Guillaume,
I tried accumarray and it was a bit faster, so I will mark this answer as accepted. For the record though it should be count = accumarray(X(:,2), 1).
I actually found it was about twice as fast to create a for loop with respect to X instead of the number of clusters:
for i = 1:size(X,1)
V(X(i,2)) = V(X(i,2)) + 1;
end
Guillaume
Guillaume on 3 May 2019
Yep, got my subs and val swapped in accumarray. The order has never made sense to me.
I find it very suspicious that your loop is faster than accumarray. accumarray is just an pre-compiled implementation of that loop.

Sign in to comment.

Categories

Find more on Resizing and Reshaping Matrices in Help Center and File Exchange

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!