Asked by Hamad
on 1 Apr 2013

Let **a**

a = 1 4 3 5 4 7 6 8

be integer intervals

[1 2 3 4] [3 4 5] [4 5 6 7] [6 7 8]

I'm looking for an output **b** to count the number of times each element appears. It is clear that **1** appears one time, **2** appears one time, **3** appears twice, and so on...

So the output **b** would list the element and how many times it appears:

b = 1 1 2 1 3 2 4 3 5 2 6 2 7 2 8 1

The challenge is, **a** is an extremely large **nx2** array, so using **find** even with parallel looping is inefficient. Is there a fast way to yield **b** given **a**? If it helps, we can be sure that the elements down the left-hand column of **a** are sorted, but not necessarily so for the right-hand column. Thanks in advance!

*No products are associated with this question.*

Answer by Teja Muppirala
on 2 Apr 2013

Accepted answer

Assuming that the second column of a is always >= the first column of a, then this is an efficient solution.

mA = max(a(:));

b = (1:mA)'; b = [b cumsum(accumarray(a(:,1),1,[mA 1]) + ... accumarray(min(a(:,2)+1,mA),-1,[mA 1]))]; b(end) = b(end) +1;

Answer by Brian B
on 2 Apr 2013

Edited by Brian B
on 2 Apr 2013

Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector `b` is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.

Here I generate intervals of up to 1000, so the loop runs about 10 times.

N=1e8; M=1e7;

a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

idx = 1:max(a(:));

d = diff(a, 1, 2) + 1 ; % interval length s = a(:,1); % interval start point

L = ceil(max(d)/2); b = 0*idx;

tic while L>0 iL = (d>=L); u = hist(s(iL),idx); %b = b + conv(u, ones(1,L),'same'); b = b + filter(ones(1,L),1,u); s(iL) = s(iL)+L; d(iL) = d(iL)-L; L = ceil(max(d)/2); toc end

This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....

EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the *central* part of the convolution, not the *initial* part, as we need.

Answer by Jan Simon
on 2 Apr 2013

Edited by Jan Simon
on 2 Apr 2013

Are the values in the first and second column of `a` distinct? If so:

c = zeros(1, max(a(:, 2)) + 1); c(a(:, 1)) = 1; a2 = a(:, 2) + 1; % [EDITED: +1 inserted] c(a2) = c(a2) - 1; c = cumsum(c); c(end) = [];

If the values are not unique, instead of 1 and -1 `histc` determines the values:

q = 1:max(a(:, 2)); c = histc(a(:, 1), q); c = c - histc(a(:, 2) + 1, q); % [EDITED: +1 inserted] c = cumsum(c);

How efficient is simple FOR loop?

n = max(a(:, 2)); c = zeros(1, n + 1); for ia = 1:length(a) a1 = a(ia, 1); c(a1) = c(a1) + 1; a2 = a(ia, 2) + 1; % [EDITED: +1 inserted] c(a2) = c(a2) - 1; end c(end) = []; c = cumsum(c);

Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.

Jan Simon
on 2 Apr 2013

Interesting timings:

N = 1e7; M = 1e6; a = [randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

FOR-loop: 1.90 seconds HISTC: 9.64 seconds (see also Brian B's HISTC solution) CUMSUM: 1.71 seconds (columns of [a] must be unique!) CONV: 41.00 seconds (Brian B's loop with HIS and CONV) ACCUMARRAY: 1.83 seconds (Teja's CUMSUM/ACCUMARRAY)

The relation my change for the real data. But for the test data, the simple FOR loop can compete with the vectorized methods (R2009a/64, Win7).

Hamad
on 2 Apr 2013

Answer by Image Analyst
on 1 Apr 2013

Edited by Image Analyst
on 1 Apr 2013

Are they known to be integers? I'd use histc(). You could do this in about 4 lines of code -- 1 (or maybe a 3 line for loop) to construct a 1D list of all numbers you gave in the seconds array you showed, 1 to construct the hist bin edges, one to call histc() and one to construct "b" from the result of histc(). Give it a try and come back if you have trouble.

Answer by Cedric Wannaz
on 1 Apr 2013

Edited by Cedric Wannaz
on 1 Apr 2013

Hi Hamad,

Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:

>> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ; >> buffer = [buffer{:}].' ; >> counts = accumarray(buffer, ones(numel(buffer),1)) counts = 1 1 2 3 2 2 2 1

There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.

Show 4 older comments

Cedric Wannaz
on 3 Apr 2013

Well, not sure about that, but you're welcome! The thread started with a homework tag and ended up being one of the most animated that I've seen a while .. I'd bet that your next question won't be mistaken for a HW ;-)

Cheers,

Cedric

Answer by Brian B
on 2 Apr 2013

Edited by Brian B
on 2 Apr 2013

Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:

N=1e8; M=1e7; a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

idx = 1:max(a(:)); b = hist(a(:,1),idx); b = b - hist(a(:,2)+1,idx); b = cumsum(b);

This works even if the values are not distinct.

Answer by Brian B
on 2 Apr 2013

Edited by Brian B
on 2 Apr 2013

You can avoid saving all of the intermediate intervals (e.g., [1 2 3 4]) if you just want the count:

a = [1 4;3 5;4 7;6 8]; b = zeros(max(max(a)),1);

parfor k=1:length(b) b(k) = sum( a(:,1)<=k & k<=a(:,2) ); end

Brian B
on 2 Apr 2013

Or, if the intervals tend to be small compared to `max(max(a))`, you could iterate over interval length:

a = [1 4;3 5;4 7;6 8]; idx = 1:max(max(a));

while ~isempty(a) b = b + hist(a(:,1),idx); % Then adjust the a matrix by increasing the lower bound % by one and removing any rows of a which are no longer valid intervals. end

I'll let you work out the details, since I just noticed this is tagged as homework.

Opportunities for recent engineering grads.

## 2 Comments

## Jan Simon (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/69413#comment_140702

Some contributors spend a lot of time here to solve the problem. What about voting for this question to get the interst of others?

## Hamad (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/69413#comment_140775

First thing to say is that I'm a little overwhelmed by the breadth of effort I have received with this coding challenge. For that reason, I wish to thank each contributor individually.

Firstly,

Image Analyst, thank you for a very speedy initial response. And referring to "why do you need this?"... there are~1e8milliseconds in a 24-hour period. Highly-traded assets such as index futures contracts yield price updates every few milliseconds during the day and every few minutes at night, averaging around 10,000 observations per day. The product of observations for a pair of these is therefore also~1e8.Brian B, you deserve thanks for offering creative contributions based on filters, and for following up with others' contributions.Jan Simon, you deserve thanks for offering a wide range of efficient solutions along with performance metrics.Last but by no means least,

Cedric Wannazdeserves thanks for offering very detailed intuitive explanations to his contributions. You make impenetrable code seem easy.THE RESULTSI have tested all solutions to the same standard, measured by efficiency (minor coding modifications excepted). The most efficient algorithm which solves the problem on my system setup and to my particular needs is provided by

Teja Muppirala, and this is followed closely by the FOR-loop provided byJan Simon. Thank you!!For this reason, I gratefully accept

Teja Muppirala's contribution. At the same time, I offer +1 votes to every contribution here. You have helped me greatly. My sincerest thanks to you all.Best wishes,

Hamad