# Fastest way to find number of times a number occurs in an array?

31 views (last 30 days)

Show older comments

Eric Chadwick
on 2 May 2019

Commented: Eric Chadwick
on 3 May 2019

##### 2 Comments

### Accepted Answer

James Tursa
on 2 May 2019

So, you've got some potential speed problems with your posted methods.

When you evaluate the expression X(:,2) repeatedly that expression creates a DEEP copy of the 2nd column each time, which can really slow things down. This can be circumvented by evaluating it once, storing the result in a variable, and then using that variable repeatedly downstream.

Another problem is the dynamic size increasing of V. Each time that happens in a loop the V data gets deep copied. This can be circumvented by pre-allocating V up front.

And finally, everytime you evaluate an element of X with the expression X(i,2) that creates a temporary MATLAB variable for that element ... about 120 bytes of overhead stuff. Do that several million times and it can add up.

The solution here is to use a function that has pre-compiled code in the background ... either a MATLAB function or a custom mex routine. A mex routine can ENTIRELY avoid any deep copies of any data and ENTIRELY avoid all of the overhead associated with temporary variables. A simple mex routine to do this (requires that you install a supported C compiler) is presented below. I then make a sample run to show the speed improvements.

The mex source code:

(You may want to alter the TOO_BIG value in the source code. It is there to prevent trying to allocate too big of an array, but I wasn't sure what the expected range of your data is)

/* counts occurrences of integer numbers in column C of 2D full double matrix

*

* Syntax: Y = count_clusters(X,C)

*

* X = full double MxN matrix

* column C contains only integers > 0 and not too big

* C = scalar integer column number to use

* where 1 <= C <= size(X,2)

* Y = max(X(:,C))x1 full double column vector containing the counts

* where Y(i) = sum(X(:,C)==i)

*

* Programmer: James Tursa

* Date: May 2019

*

*/

/* Includes ----------------------------------------------------------- */

#include "mex.h"

/* Macros */

#define TOO_BIG 1000000

/* Gateway ------------------------------------------------------------ */

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

double *xpr, *Xpr, *Ypr;

double xmax, c;

mwSize i, m, x, C;

/* Argument checks */

if( nrhs == 0 ) { /* Assume only loading mex routine into memory */

return;

}

if( nrhs != 2 ) {

mexErrMsgTxt("Need exactly two inputs");

}

if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||

mxIsSparse(prhs[0]) || mxGetNumberOfDimensions(prhs[0]) > 2 ) {

mexErrMsgTxt("1st input must be a real full double 2D matrix");

}

if( !mxIsNumeric(prhs[1]) || mxIsComplex(prhs[1]) ||

mxGetNumberOfElements(prhs[1]) != 1 ) {

mexErrMsgTxt("2nd input must be a numeric scalar");

}

if( nlhs > 1 ) {

mexErrMsgTxt("Too many outputs.");

}

C = c = mxGetScalar(prhs[1]);

if( c < 0.0 || C != c || C > mxGetN(prhs[0]) ) {

mexErrMsgTxt("Invalid column number (2nd input)");

}

/* Check for empty input */

if( mxGetNumberOfElements(prhs[0]) == 0 ) {

plhs[0] = mxCreateDoubleMatrix( 0, 0, mxREAL );

return;

}

/* Get the max and check for invalid */

m = mxGetM(prhs[0]);

xpr = Xpr = (double *) mxGetData(prhs[0]) + m * (C-1); /* Point to column C */

xmax = *xpr;

for( i=0; i<m; i++ ) {

/* If you know the values are valid, you can comment this check out */

if( *xpr <= 0.0 || (mwSize)*xpr != *xpr ) { /* Make sure it is integer > 0 */

mexErrMsgTxt("Invalid value in column");

}

if( *xpr > xmax ) {

xmax = *xpr;

}

xpr++;

}

if( xmax > TOO_BIG ) {

mexErrMsgTxt("Max value in column is too large");

}

/* Create result */

plhs[0] = mxCreateDoubleMatrix( (mwSize) xmax, 1, mxREAL );

Ypr = mxGetPr(plhs[0]) - 1; /* Offset the pointer to 1-based (not quite kosher) */

/* Fill in the result */

for( i=0; i<m; i++ ) {

Ypr[ (mwSize) *Xpr++ ]++;

}

}

Note that if you KNOW that the column values are positive integers prior to calling the mex routine, you could comment the following lines of code out to increase the speed even more.

/* If you know the values are valid, you can comment this check out */

if( *xpr <= 0.0 || (mwSize)*xpr != *xpr ) { /* Make sure it is integer > 0 */

mexErrMsgTxt("Invalid value in column");

}

Here is some m-code for testing:

% Test code for the mex routine count_clusters

xmax = 5000; % range of column 2 data

m = 10000000; % number of rows

X = randi(xmax,m,2); % generate sample data

disp(' ');

disp('Counting Clusters test code');

fprintf('xmax = %d\n',xmax);

fprintf('m = %d\n',m);

disp(' ')

disp('posted method, LOTS of deep copies of X(:,2), no pre-allocation')

clear V

tic

for i = 1:max(X(:,2))

V(i) = sum(X(:,2) == i);

end

toc

disp(' ')

disp('posted method variation, only one deep copy of X(:,2), with pre-allocation')

clear X2 V1

tic

X2 = X(:,2);

m = max(X2);

V1 = zeros(m,1);

for i = 1:m

V1(i) = sum(X2 == i);

end

toc

isequal(V(:),V1(:))

disp(' ')

disp('accumarray, only one deep copy of X(:,2)')

clear count

tic

count = accumarray(X(:,2),1);

toc

isequal(V(:),count(:))

disp(' ')

disp('mex routine, no copies of anything')

clear Y

count_clusters; % Load mex routine into memory

tic

Y = count_clusters(X,2);

toc

isequal(V(:),Y(:))

disp(' ')

disp('mex routine, no copies of anything, no data check')

clear Y2

count_clusters2; % Load mex routine into memory

tic

Y2 = count_clusters2(X,2);

toc

isequal(V(:),Y2(:))

And here is a sample run:

>> count_clusters_test

Counting Clusters test code

xmax = 5000

m = 10000000

posted method, LOTS of deep copies of X(:,2), no pre-allocation

Elapsed time is 350.707066 seconds.

posted method variation, only one deep copy of X(:,2), with pre-allocation

Elapsed time is 30.428389 seconds.

ans =

logical

1

accumarray, only one deep copy of X(:,2)

Elapsed time is 0.194782 seconds.

ans =

logical

1

mex routine, no copies of anything

Elapsed time is 0.047998 seconds.

ans =

logical

1

mex routine, no copies of anything, no data check

Elapsed time is 0.017843 seconds.

ans =

logical

1

So the mex routine is indeed the fastest. Much faster than the looping methods, and a bit faster than accumarray.

### More Answers (2)

Steven Lord
on 2 May 2019

Are you looking for the size of regions in an image? If so calling regionprops from Image Processing Toolbox on your label matrix will probably give you the information you want.

If you do just want counts regardless of whether the labels are contiguous, consider histcounts (or histcounts2) that are part of MATLAB proper.

Alternately some of the grouping functions in the "Grouping and Binning Data" section of this documentation page may be helpful.

If none of those functions do what you want, can you show us an example using a smaller data set? Make a 10-by-2 matrix of data and show us and explain to us exactly what you want the result of this processing step on that example data set to be.

##### 3 Comments

Steven Lord
on 3 May 2019

Based on that description, I believe it's as simple as calling histcounts on your second column of X.

V = histcounts(X(:, 2));

On my machine running release R2018b, generating a sample list of cluster values actually took longer than counting how many values fell into each bin.

tic;

X = randi(100, 470000000, 1);

toc

tic

[V, E] = histcounts(X, 'BinMethod', 'integers');

toc

Let's check.

nnz(X == 84) % As an example

V(84)

Guillaume
on 2 May 2019

Assuming that your clusters are integer numbers from 1 to N with no gaps:

count = accumarray(1, X(:, 2));

If not,

[cluster, ~, id] = unique(X(:, 2));

count = accumarray(1, X(:, 2));

table(cluster, count) %for pretty display.

As suggested by Steven, you can also use histcounts for that.

##### 2 Comments

Guillaume
on 3 May 2019

Yep, got my subs and val swapped in accumarray. The order has never made sense to me.

I find it very suspicious that your loop is faster than accumarray. accumarray is just an pre-compiled implementation of that loop.

### See Also

### Categories

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!