Can i vectorize my loop
Show older comments
I have an array of numbers that i wish to reduce based on whether the previous kept value is within a defined tolerance. I don't believe that the diff() function does what i need and therefore I think i will need something more bespoke. Currently I perform this task using the following for loop but this is very slow when processing large amounts of data.
MinInterval=0.1
KeptData(1,1)=OriginalData(1,1);
n=1;
for p=1:length(OriginalData)-1
difference=OriginalData(p+1,1)-KeptData(n,1);
if difference<=MinInterval
%do nothing
else
n=n+1;
KeptData(n,1)=OriginalData(p+1,1);
end
end
For example,
OriginalData=[1; 2; 3; 3.1; 3.2; 3.3; 3.4; 3.5; 3.6; 4; 5];
assuming a tolerance of 0.1 would become
KeptData=[1; 2; 3; 3.2; 3.4; 3.6; 4; 5]
7 Comments
Image Analyst
on 27 Nov 2013
Is that what it becomes with that code, or is that what it would ideally become if everything worked perfectly. I don't understand why 3.1, 3.3, and 3.5 got thrown out when 3.2, 3.4, and 3.6 got kept when all 6 of those numbers are exactly the same 0.1 bigger than the prior number. Why aren't all the numbers from 3.1 through 3.6 getting thrown out?
Jeremy Wurbs
on 27 Nov 2013
I'll look at this more in a second, but don't forget that 0.1 can't be represented in binary (It's the binary equivalent of 1/3.). You can see this by checking 0.3==0.1*3 in the Matlab command window. There are likely precision errors at bay here.
Walter Roberson
on 27 Nov 2013
The tolerance is not relative to the immediately previous number: it is relative to the last number that was retained. So when 3 is retained, 3.1 is not enough bigger relative to 3 so 3.1 is thrown out, but then 3.2 is enough bigger relative to the 3 that is the last retained value.
Jeremy Wurbs
on 28 Nov 2013
Ahh, you are right, though I still contend it is dangerous to compare doubles directly in this fashion without some tolerance or assurance that they will be the same values for all cases they are expected to be.
Alfonso Nieto-Castanon
on 28 Nov 2013
Your code looks perfectly fine, the reason it may be slow for large amounts of data is probably just that KeptData is increasing its size within the loop. To speed the code up simply pre-allocate the variable "KeptData" (e.g. add the line "KeptData=OriginalData;" at the beginning of your code, and the line "KeptData=KeptData(1:n);" at the end of your code). That should give you orders of magnitude faster code...
Nigel
on 29 Nov 2013
Nigel
on 29 Nov 2013
Answers (1)
Jeremy Wurbs
on 27 Nov 2013
Very interesting question.
First off, running your solution as is I get:
[ 1.0000 2.0000 3.0000 3.1000 3.2000 3.4000 3.5000 3.6000 4.0000 5.0000]
It seems there are precision issues with MinInterval being 0.1 and your data being spaced "0.1" apart (which is impossible to represent in binary; you can see this by trying the statement (3*0.1==0.3) in the command window).
Anyway, adding a small wiggle factor to your MinInterval, I get the code:
% Method 1
tic
MinInterval=0.1+eps; % We will have issues with precision without eps
KeptData(1,1)=OriginalData(1,1);
n=1;
for p=1:length(OriginalData)-1
difference=OriginalData(p+1,1)-KeptData(n,1);
if difference<=MinInterval
%do nothing
else
n=n+1;
KeptData(n,1)=OriginalData(p+1,1);
end
end
t1 = toc;
A simple attempt to vectorize the above only using diff could look like:
% Method 2
tic
KeptData2 = OriginalData([true; diff(OriginalData)>MinInterval]);
t2 = toc;
But the above gives a different behavior around "stair-case" sections in your data (the [... 3.0; 3.1; 3.2; 3.3; ...] part). We can get more similar behavior to the original code with the following:
% Method 3
MinInterval = 0.1; % Shouldn't have any precision issues
tic
NewMinInterval = 2*MinInterval;
edges = min(OriginalData):NewMinInterval:max(OriginalData);
hc = histc(OriginalData,edges);
KeptData3 = edges(logical(hc));
t3 = toc;
Using the following:
KeptData'
KeptData2'
KeptData3
disp(['Using for loop: ' num2str(t1)])
disp(['Using diff alone: ' num2str(t2)])
disp(['Using histc + diff: ' num2str(t3)])
I get the results:
ans =
1.0000 2.0000 3.0000 3.2000 3.4000 3.6000 4.0000 5.0000
ans =
1 2 3 4 5
KeptData3 =
1.0000 2.0000 3.0000 3.2000 3.4000 3.6000 4.0000 5.0000
Using for loop: 0.017651
Using diff alone: 0.0047814
Using histc + diff: 0.0071429
Hopefully that helps. Let me know if that doesn't give the behavior you're looking for.
Cheers
Categories
Find more on Logical in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!