Can i vectorize my loop

I have an array of numbers that i wish to reduce based on whether the previous kept value is within a defined tolerance. I don't believe that the diff() function does what i need and therefore I think i will need something more bespoke. Currently I perform this task using the following for loop but this is very slow when processing large amounts of data.
MinInterval=0.1
KeptData(1,1)=OriginalData(1,1);
n=1;
for p=1:length(OriginalData)-1
difference=OriginalData(p+1,1)-KeptData(n,1);
if difference<=MinInterval
%do nothing
else
n=n+1;
KeptData(n,1)=OriginalData(p+1,1);
end
end
For example,
OriginalData=[1; 2; 3; 3.1; 3.2; 3.3; 3.4; 3.5; 3.6; 4; 5];
assuming a tolerance of 0.1 would become
KeptData=[1; 2; 3; 3.2; 3.4; 3.6; 4; 5]

7 Comments

Is that what it becomes with that code, or is that what it would ideally become if everything worked perfectly. I don't understand why 3.1, 3.3, and 3.5 got thrown out when 3.2, 3.4, and 3.6 got kept when all 6 of those numbers are exactly the same 0.1 bigger than the prior number. Why aren't all the numbers from 3.1 through 3.6 getting thrown out?
I'll look at this more in a second, but don't forget that 0.1 can't be represented in binary (It's the binary equivalent of 1/3.). You can see this by checking 0.3==0.1*3 in the Matlab command window. There are likely precision errors at bay here.
The tolerance is not relative to the immediately previous number: it is relative to the last number that was retained. So when 3 is retained, 3.1 is not enough bigger relative to 3 so 3.1 is thrown out, but then 3.2 is enough bigger relative to the 3 that is the last retained value.
Ahh, you are right, though I still contend it is dangerous to compare doubles directly in this fashion without some tolerance or assurance that they will be the same values for all cases they are expected to be.
Your code looks perfectly fine, the reason it may be slow for large amounts of data is probably just that KeptData is increasing its size within the loop. To speed the code up simply pre-allocate the variable "KeptData" (e.g. add the line "KeptData=OriginalData;" at the beginning of your code, and the line "KeptData=KeptData(1:n);" at the end of your code). That should give you orders of magnitude faster code...
Sorry for the delay responding, I've been out of the office. The behavior I require is exactly that shown by the example. 3.1 is close enough to 3 to be ignored but 3.2 is has a difference of 0.2 to 3 and therefore i want to keep it. 3.3 is within 0.1 of 3.2 and is ignored but 3.4 has a difference of 0.2 and this rule needs to be applied throughout the dataset
I have followed your suggestion Alfonso. Great result. Processing reduced from 850 secs to 1.5. Thank you all

Sign in to comment.

Answers (1)

Very interesting question.
First off, running your solution as is I get:
[ 1.0000 2.0000 3.0000 3.1000 3.2000 3.4000 3.5000 3.6000 4.0000 5.0000]
It seems there are precision issues with MinInterval being 0.1 and your data being spaced "0.1" apart (which is impossible to represent in binary; you can see this by trying the statement (3*0.1==0.3) in the command window).
Anyway, adding a small wiggle factor to your MinInterval, I get the code:
% Method 1
tic
MinInterval=0.1+eps; % We will have issues with precision without eps
KeptData(1,1)=OriginalData(1,1);
n=1;
for p=1:length(OriginalData)-1
difference=OriginalData(p+1,1)-KeptData(n,1);
if difference<=MinInterval
%do nothing
else
n=n+1;
KeptData(n,1)=OriginalData(p+1,1);
end
end
t1 = toc;
A simple attempt to vectorize the above only using diff could look like:
% Method 2
tic
KeptData2 = OriginalData([true; diff(OriginalData)>MinInterval]);
t2 = toc;
But the above gives a different behavior around "stair-case" sections in your data (the [... 3.0; 3.1; 3.2; 3.3; ...] part). We can get more similar behavior to the original code with the following:
% Method 3
MinInterval = 0.1; % Shouldn't have any precision issues
tic
NewMinInterval = 2*MinInterval;
edges = min(OriginalData):NewMinInterval:max(OriginalData);
hc = histc(OriginalData,edges);
KeptData3 = edges(logical(hc));
t3 = toc;
Using the following:
KeptData'
KeptData2'
KeptData3
disp(['Using for loop: ' num2str(t1)])
disp(['Using diff alone: ' num2str(t2)])
disp(['Using histc + diff: ' num2str(t3)])
I get the results:
ans =
1.0000 2.0000 3.0000 3.2000 3.4000 3.6000 4.0000 5.0000
ans =
1 2 3 4 5
KeptData3 =
1.0000 2.0000 3.0000 3.2000 3.4000 3.6000 4.0000 5.0000
Using for loop: 0.017651
Using diff alone: 0.0047814
Using histc + diff: 0.0071429
Hopefully that helps. Let me know if that doesn't give the behavior you're looking for.
Cheers

1 Comment

Thank you for your reply. Your code works very well for the example, but when I apply it to my dataset i get a significantly different set of data returned (my code returns 469966 datapoints and yours 34441). If i get chance i will have a look at the returned data to understand what is going on, however, i have preallocated the KeptData array as suggested by Alfonso and this has made an huge difference to the processing time. My loop originally ran for 850 secs and with preallocation it takes 1.5 secs so i will continue with this solution. Your code incidently was much faster at 0.03. Thank you for your help

Sign in to comment.

Products

Asked:

on 27 Nov 2013

Commented:

on 29 Nov 2013

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!