Why is replacing datetimes in a large array slow?

I have a large data set for which I know the time at every 100th point. To get the times of the other points, I'm interpolating between the known times using linspace. While the linspace command itself seems quite fast, writing the resulting datetime values into my preallocated array gets slower the larger the array is: doubling the array size also doubles the writing time.
For a 30,000,000-point array it takes 0.24 seconds to overwrite only 100 points. This seems far too long to me.
Why is the writing time proportional to the array size? And more importantly: how can I reduce this run time?
I have checked that running linspace alone and writing its result to ans is sub-millisecond fast, so the cost really comes from writing into the large array.
short_length=300000;
random = rand(short_length,1)/1001; %random timeshift
DAQ_PC_datetime_short=datetime('now')+seconds((0:0.001:0.001*(short_length-1))'+random); %generate fictive times
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1); %init array
DAQ_PC_datetime(100:100:100*length(DAQ_PC_datetime_short)) = DAQ_PC_datetime_short; %set known values
% DAQ_PC_datetime = DAQ_PC_datetime'; %sizing
DAQ_PC_datetime(1:99) = DAQ_PC_datetime(100) + seconds(-.099:0.001:-.001); %extrapolate at start
for n=100:100:100*length(DAQ_PC_datetime_short)-1 %interpolate in the middle
tic
DAQ_PC_datetime(n:n+100) = linspace(DAQ_PC_datetime(n),DAQ_PC_datetime(n+100),101); %linear interpolation
toc
end
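For reference, the scaling can be reproduced in isolation with a minimal sketch (variable names are mine; exact times will depend on release and machine): the identical 100-element write into preallocated NaT arrays of two sizes.

```matlab
% Time the same 100-element datetime write into arrays of different sizes.
for N = [3e6 3e7]
    A = NaT(N,1);                                 % preallocated datetime array
    tic
    A(1:100) = datetime('now') + seconds(1:100);  % overwrite only 100 elements
    fprintf('N = %8d: %.4f s\n', N, toc);
end
```

On affected releases the larger array should take roughly ten times as long, even though the same 100 elements are written.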

17 Comments

Experiment with the retime function. It may be faster. You will need to convert your array to a timetable first; however, that is not difficult.
dpb on 28 Feb 2020 (edited)
So the deal is you have a set of times that is 1:100 as compared to the actual data array, and you want the time for every point?
I gather the time interval between these points isn't (quite) uniform?
What is the sample rate/dt between samples? I'm wondering if this isn't a place to use the venerable datenum for the interpolation process first, then convert that to datetime.
I'm not seeing a way that retime solves your problem, although that was also one of my first thoughts along with StarS, but on reading the doc again it doesn't seem set up for this problem... it wants a global uniform dt.
I think the problem is twofold: the large array, and that the array is a datetime object, which is more than just a bigger version of a datenum number.
MATLAB is making copies behind the scenes, it appears, and that is slow... I just tried to change the format string on the version in memory to see the difference, and it sent MATLAB into a comatose state: I had to kill the process, which hadn't completed in 15+ minutes and was completely hogging the CPU. :(
I'll not be trying that again! :)
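A sketch of that datenum route, using the OP's variable names (untested; interp1's 'extrap' option also covers the first 99 points that the OP extrapolates by hand):

```matlab
known_idx = 100:100:100*length(DAQ_PC_datetime_short); % indices of the known times
known_num = datenum(DAQ_PC_datetime_short);            % datetime -> plain doubles
all_num   = interp1(known_idx, known_num, 1:known_idx(end), 'linear', 'extrap');
DAQ_PC_datetime = datetime(all_num, 'ConvertFrom', 'datenum')'; % convert back once
```

All the heavy lifting happens on doubles, and the large datetime array is created exactly once at the end.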
"I just tried to change the format string on the version in memory to see the difference and it sent MATLAB into comatose state"
I'm surprised by that. The Format property of a datetime array is common to all the elements, so it's one char vector to change regardless of the size of the array.
Internally, a datetime is stored as a pair of doubles (strictly speaking, as a complex double), so updating a datetime array is going to take as long as updating a double array twice the size.
Well, that's the memory footprint, yes (actually, interestingly enough, it's (16*NUMELEM + 1) bytes; not sure what that one extra byte is for).
BUT the difference between a double or complex double and a datetime is that updating a datetime has to go through the class functions/procedures, and those seem to be expensive relative to a native double. I've noticed that with tables containing datetimes, or with timetables, at much smaller sizes than the OP's here; I think his size just makes the issue large enough to be very observable.
Oh! I see what brought ML to its knees... I didn't put the trailing ; on the command to change the format string, so it tried to output all the results to the screen. Still, one would think that would just start the screen scrolling and one should be able to CTRL-C out of it. Instead, from the disk thrashing, it sounded like it tried to create the whole array to display first, or something else bizarre, because it locked up in a tight loop of some sort and was undoubtedly disk-swapping trying to do whatever...
But it wasn't just changing the format string itself that did it; still, not nice...
The memory footprint is actually 16*numel(datetimearray) + 2*numel(datetimearray.Format) + 2*numel(datetimearray.TimeZone)
I was just looking at the dynamic memory footprint of the elements themselves, not the object variable one-time overhead.
>> datetime(repmat(datestr(now),1,1));
>> whos ans
  Name      Size            Bytes  Class       Attributes
  ans       1x1                 9  datetime
>> datetime(repmat(datestr(now),2,1));
>> whos ans
  Name      Size            Bytes  Class       Attributes
  ans       2x1                17  datetime
>> whos DAQ_PC_datetime
  Name                 Size            Bytes  Class       Attributes
  DAQ_PC_datetime      300000x1      4800001  datetime
>>
That one odd byte is peculiar, methinks...must be a code of some sort for the class.
There certainly has to be other hidden storage for the class object somewhere, yes. I don't know enough about the class mechanism in MATLAB to have any idea where that actually resides, and certainly haven't tried to delve into the internals of the datetime object itself. But I'd presume the metadata is stored with the property when the class variable is created, which is why it doesn't show up in the memory footprint from whos.
Thanks for the replies; sorry I'm a bit late with answering, the weekend happened.
@dpb: I indeed have 1 data point that marks the end of every 100 stored points, and I aim to get the actual date for the other points by linear interpolation. The rate should be 1000 Hz, but isn't quite that: it is ever so slightly faster, and also varies somewhat over time.
So from my understanding, you two are saying MATLAB is making copies of the entire array behind the scenes (after all, linspace itself is quite fast; adding the DAQ_PC_datetime(n:n+100) = makes it slow). That would make sense for the timing scaling. It does not, however, give me a solution.
Would it be possible to directly edit the array elements without making copies?
Alternatively, building on your comment about copies being made in the background, I tried a workaround, although it's a bit ugly. Basically I split the large array into smaller ones, do the linspace on the smaller ones, and then reconstruct the original size:
clear all
tic
short_length=300000;
random = rand(short_length,1)/1001; %random timeshift
DAQ_PC_datetime_short=datetime('now')+seconds((0:0.001:0.001*(short_length-1))'+random); %generate fictive times
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1); %init array
DAQ_PC_datetime(100:100:100*length(DAQ_PC_datetime_short)) = DAQ_PC_datetime_short; %set known values
Divider = 1000; %amount of arrays to make (take care that it fits well, as no rounding checks are implemented)
for m=1:Divider %make multiple smaller datetime arrays out of DAQ_PC_datetime subsets
Small_Array{m}=DAQ_PC_datetime((m-1)*length(DAQ_PC_datetime)/Divider+1:m*length(DAQ_PC_datetime)/Divider);
end
Small_Array{1}(1:99) = Small_Array{1}(100) + seconds(-.099:0.001:-.001); %extrapolate at start
for m=1:Divider %use linear interpolation while reducing the internal copy size for every linspace use
if m>1 %bridge the chunk boundary; indices 1:99 of this chunk would otherwise stay NaT
tmp = linspace(Small_Array{m-1}(end),Small_Array{m}(100),101);
Small_Array{m}(1:99) = tmp(2:100);
end
for n=100:100:length(Small_Array{m})-100
Small_Array{m}(n:n+100) = linspace(Small_Array{m}(n),Small_Array{m}(n+100),101); %linear interpolation
end
end
for m=1:Divider %combine the smaller arrays back into original variable
DAQ_PC_datetime((m-1)*length(Small_Array{m})+1:m*length(Small_Array{m}))=Small_Array{m};
end
toc
This entire piece of code runs in about 5 minutes, as opposed to the (calculated) 19 hours for my original post.
Some trial and error shows a divider of 1000 is fair. I will have to modify it a bit, as my actual data won't divide perfectly into 1000. It may be interesting to look at the ideal division ratio; that should depend on the initial array size.
What version of MATLAB are you using?
@Siddharth Bhutiya: perhaps this answers your question:
@Hans There have been performance improvements to subscripted assignment in recent releases (R2019b and R2020a) that would specifically affect your use case. You can find more information about that in the release notes here.
Specifically, if you use R2020a and wrap this code inside a function (instead of a script), then you would be able to get a significant improvement in performance.
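For illustration, a sketch of that suggestion (the function name fill_times is my invention; the body is the loop from the question moved into a function, where in-place subscripted assignment can be optimized):

```matlab
function DAQ_PC_datetime = fill_times(DAQ_PC_datetime_short)
% Same interpolation as the script version, but inside a function body.
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1);
DAQ_PC_datetime(100:100:end) = DAQ_PC_datetime_short;         % known values
DAQ_PC_datetime(1:99) = DAQ_PC_datetime(100) + seconds(-.099:0.001:-.001);
for n = 100:100:length(DAQ_PC_datetime)-100
    DAQ_PC_datetime(n:n+100) = linspace(DAQ_PC_datetime(n),DAQ_PC_datetime(n+100),101);
end
end
```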
I'm facing the exact same problem (assigning datetime values to an allocated datetime array) and it does not matter if I use the code in a script or a function. Has there been a faster solution?
t(i)=mean(data.Timestamp(info.Max));
probably wouldn't need to be a subscripted assignment on an element-by-element basis but could likely be vectorized. And I'm not positive whether the JIT compiler will replace the function call to mean() with a computed result inside a loop or not; but unless info.Max is changing, it's redundant to compute it each pass.
The statement above appears to be a do-nothing statement in that it calculates the requested mean but doesn't appear to do anything with the result.
We would need to see more of the code than these two lines in isolation, but there appear to be areas that could likely be improved here.
In this instance I'm just doing a map and reduce without the function overhead of mapreduce. Additionally, I want to store some processed data for each read (e.g. the mean of some data). I use the following code:
ds = datastore(); % a custom datastore that is acceptably fast already
total = 0; % renamed from "sum" so the builtin sum() is not shadowed
num = 0;
p = zeros(1000000,1); % preallocate more elements than are ever read & processed
t(1000000) = NaT; % grows t as a datetime array filled with NaT
i = 0;
while hasdata(ds)
i = i + 1;
[data, info] = read(ds);
total = total + sum(data.T);
num = num + numel(data.T);
p(i) = mean(data.p);
t(i) = mean(data.Timestamp(info.Max)); % info.Max changes every read and always contains exactly two indices
end
% remove the unused preallocated elements
t(i+1:end) = [];
p(i+1:end) = [];
% build the average
average = total/num;
% use the data p and t, e.g. plot it
plot(t, p);
I also put this into a (local) function for the JIT compiler. I hope I did not miss anything. So in this case I don't see where I can vectorize that computation or speed it up somewhere else.
What I also experienced is that the profiler generates quite some overhead, probably due to the additional timing code and the lack of good JIT compilation. I would expect that, but I was surprised by the extent:
Elapsed time is 112.928500 seconds. % <- without profiler
Elapsed time is 158.068796 seconds. % <- with profiler on
I now added some timing code (tic/toc) around the different parts above and got the following times:
meanPart = 2.1963; % seconds for the sum and numel part
meanTimePart = 7.7242; % seconds for averaging the time values
assignTime = 2.1287; % seconds for assigning the time value
So I guess I was fooled by the profiler as to where most of the time is spent. But from this experience, datetime is far from usable in speed-relevant code.
Or am I missing something here?
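Since info.Max always holds exactly two indices, one thing worth timing (my suggestion, not from the thread) is replacing the mean call with direct midpoint arithmetic, which for two timestamps gives the same result:

```matlab
ts   = data.Timestamp(info.Max);   % the two bracketing timestamps
t(i) = ts(1) + (ts(2) - ts(1))/2;  % midpoint, bypassing mean()'s generic overhead
```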
[quoting the while-loop from the comment above]
...
Why create such huge arrays for p and t? It looks like you've created them even larger than the datastore content, whereas they really only need to hold the number of segments in the datastore, which I would think is several orders of magnitude smaller.
What is i when the above loop is done?
I've got to run at the moment so can't actually do any experimenting just now, but I'd guess that has a lot to do with the times.
Plus, it would be good to show the code including where you placed the various tic/toc calls, so folks can see unequivocally what they represent.
I have a datastore that reads chunks of data from the hard disk. On top of that I have a second datastore which analyses the buffered data and returns parts of it (e.g. one period/cycle of a really long sine wave). For each period/cycle I do an analysis as shown above. So I allocate memory for the number of cycles, not the amount of data I expect (around 1000x the number of cycles), and i counts the total number of periods/cycles present in the data. I just saw that the code above does not match the timing I provided, as I separated the mean part of the datetime section into a separate call and stored it in a temporary variable; the tic was after the read call and before the two datetime calls. I'm sorry, I'm on my phone so I could not write it as code again. I will post it tomorrow.
...
timing(1000000, 3) = 0; % preallocate the timing array
while hasdata(ds)
i = i + 1;
[data, info] = read(ds);
tic();
sum = sum + sum(data.T);
num = num + numel(data.T);
p(i) = mean(data.p);
timing(i, 1) = toc();
tic();
val = mean(data.Timestamp(info.Max)); % where info.Max changes every read and always just contains two indices
timing(i, 2) = toc();
tic();
t(i) = val;
timing(i, 3) = toc();
end
...
After running the code I just summed up all the timings to get the totals posted before.


Answers (0)

Release: R2017a
Asked: on 28 Feb 2020
Commented: on 10 Aug 2021
