Help with optimising my code?

Rene Higuita on 11 Feb 2018
Edited: Rene Higuita on 12 Feb 2018
Hello everyone,
Apologies for the vague title; I think I'm missing exactly the knowledge that would let me name the topic precisely...
I am trying to group the values in a large matrix yMat into a single vector yOut, sorting them by binning values from a secondary matrix xMat. Starting with
[N,bin] = histc(xMat,edges); *
Then
for k = 1:numel(edges)-1
    yOut(k) = mean(yMat(bin==k));
end
xMat/yMat vary in size, but their numel can go up to a couple of tens of millions, and I need to go through hundreds of them. I picked a relatively large example: the histc line takes 0.3s, while the loop takes over 4s when numel(edges)=100. Ideally I would like a much finer grain, too, with numel(edges) closer to 1000...
I am pretty confident I do this the very wrong way, for loops never being a good sign, but I can't think of a faster way to do this. Any help appreciated!
*I know histc is a legacy function, but I'm stuck with R2011b right now, and I don't think this impacts the issue.

Accepted Answer

per isakson on 11 Feb 2018
Edited: per isakson on 11 Feb 2018
xMat  = randn( 12, 1 );
edges = ( -4 : 1 : 4 );
yMat  = ( 1 : 12 )';
[ N, bin ] = histc( xMat, edges );
y1 = accumarray( bin, yMat, [], @mean );   % vectorised grouping
for k = 1:numel(edges)-1                   % original loop, for comparison
    y2(k) = mean(yMat(bin==k));
end
Comparison:
>> y1'
ans =
0 11.0000 8.5000 6.6667 4.0000 7.0000
>> y2
y2 =
NaN 11.0000 8.5000 6.6667 4.0000 7.0000 NaN NaN
btw: y2 should have been preallocated
In my example, elements of bin can be zero (histc assigns 0 to values outside the edges), and accumarray requires positive integer subscripts. Hence, use bin+1 in accumarray.
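A minimal sketch of that workaround (trimming off the out-of-range slot at the end is my own choice, not part of the code above):
[ ~, bin ] = histc( xMat, edges );           % bin == 0 for values outside edges
y1 = accumarray( bin+1, yMat, [], @mean );   % slot 1 collects the out-of-range values
y1 = y1(2:end);                              % drop the out-of-range slot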
However, the increase in speed on R2016a is definitely not impressive.
  4 Comments
Rene Higuita on 11 Feb 2018
Edited: Rene Higuita on 12 Feb 2018
Thanks for your edits. Since I'm discovering these functions too, here are quick comments on them:
You avoid zeros in bin by adding -Inf and +Inf to edges (histc yields zero when values are out of bounds).
Here's the more challenging code I used:
tic;
xMat  = randn( 1e8, 1 );
edges = [ -Inf (-4 : .01 : 4) Inf ];   % -Inf/+Inf catch out-of-range values, so no zeros in bin
yMat  = randn(size(xMat));
allocTime = toc;
tic;
[ N, bin ] = histc( xMat, edges );
histcTime = toc;
tic;
y1 = accumarray( bin, yMat, [], @mean );   % y1: accumarray with a @mean handle
y1time = toc;
tic;
y3 = accumarray( bin, yMat );              % y3: default sum, then divide by bin counts
y3 = y3 ./ N(1:end-1);
y3time = toc;
tic;
y2 = zeros( numel(edges)-1, 1 );           % y2: the original loop, now preallocated
for k = 1:numel(edges)-1
    y2(k) = mean(yMat(bin==k));
end
y2time = toc;
% checking relative error levels
maxErr21 = max(abs((y2-y1)./y1));
maxErr32 = max(abs((y3-y2)./y2));
maxErr13 = max(abs((y1-y3)./y3));
fprintf(['--- Numel(xMat)=%g, numel(edges)=%d ---\n'...
    'Creating vectors took %0.1fs, histc %0.1fs\n'...
    'Method y1: %0.1fs  y2: %0.1fs  y3: %0.1fs\n\n'...
    'Largest relative differences between:\n y1 and y2 : %0.2g'...
    ' -- y2 and y3 : %0.2g -- y3 and y1 :%0.2g\n\n'],...
    numel(xMat), numel(edges), allocTime, histcTime,...
    y1time, y2time, y3time, maxErr21, maxErr32, maxErr13);
And the fprintf output at the end there, on a 5-year-old laptop running R2011b:
--- Numel(xMat)=1e+008, numel(edges)=803 ---
Creating vectors took 5.1s, histc 9.5s
Method y1: 18.8s y2: 292.5s y3: 1.5s
Largest relative differences between:
y1 and y2 : 1.6e-012 -- y2 and y3 : 1e-012 -- y3 and y1 :2.4e-012
On a more recent desktop and R2017a:
--- Numel(xMat)=1e+08, numel(edges)=803 ---
Creating vectors took 2.2s, histc 4.7s
Method y1: 10.4s y2: 117.6s y3: 0.7s
Largest relative differences between:
y1 and y2 : 2.1e-12 -- y2 and y3 : 1.8e-12 -- y3 and y1 :2.4e-12
So no dramatic changes in the ratios between R2011b and R2017a, and accumarray is the clear winner in both.
Regarding the Undocumented Matlab post: I can't really compare, because accumarray only takes function handles as input for that argument, unlike cellfun, which also accepts a list of built-in function names that it handles much more efficiently than a handle (the main point of that post). @sum is supposed to be the default, but I suspect it is special-cased internally rather than actually dispatched through a handle to sum.
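A quick, machine-dependent way to test that suspicion (the sizes here are arbitrary):
bin = randi( 1000, 1e7, 1 );                      % fake bin indices
val = rand( 1e7, 1 );
tic; s1 = accumarray( bin, val );           toc   % default sum
tic; s2 = accumarray( bin, val, [], @sum ); toc   % explicit @sum handle; slower if the default is special-cased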
Thanks for your help and interest! This definitely will take down my running times to acceptable levels.
per isakson on 12 Feb 2018
Edited: per isakson on 12 Feb 2018
Back in 2009, Yair Altman described a similar case in which function handles cause a significant performance penalty: cellfun – undocumented performance boost. This has not changed in recent Matlab releases.
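For reference, a minimal reproduction of that effect (timings are machine dependent):
c = num2cell( rand( 1e6, 1 ) );
tic; n1 = cellfun( 'length', c ); toc   % built-in name: fast path
tic; n2 = cellfun( @length, c );  toc   % function handle: noticeably slower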


More Answers (1)

Walter Roberson on 11 Feb 2018
Try accumarray. Use the bin number as subs, use yMat in the values slot, use the empty array as the size, and use @mean as the function.
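In code, that suggestion reads roughly as follows (assuming bin from the histc call in the question, and that all values fall inside edges so that bin > 0):
yOut = accumarray( bin(:), yMat(:), [], @mean );   % (:) flattens matrices into columns, as accumarray expects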
  1 Comment
Rene Higuita on 11 Feb 2018
Edited: Rene Higuita on 11 Feb 2018
You got the answer as well. I tagged Per's because he gives more details, and it'll probably help others more to have the examples he gives.
But thanks a lot!

