Why is replacing datetimes in a large array slow?
I have a large data set for which I know the time at every 100th point. To get the times of the other points, I interpolate between the known times using linspace. While the linspace call itself seems to be quite fast, replacing the datetime values in my preallocated array gets slower the larger the array is: doubling the array size also doubles the writing time.
A 30,000,000-point array takes 0.24 seconds to overwrite only 100 points. That seems far too slow to me.
Why is the writing time proportional to the array size? And more importantly: how can I reduce this run time?
I have checked that running linspace and writing its result to ans is sub-millisecond fast, so the cost really comes from writing into the large datetime array.
short_length=300000;
random = rand(short_length,1)/1001; %random timeshift
DAQ_PC_datetime_short=datetime('now')+seconds((0:0.001:0.001*(short_length-1))'+random); %generate fictive times
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1); %init array
DAQ_PC_datetime(100:100:100*length(DAQ_PC_datetime_short)) = DAQ_PC_datetime_short; %set known values
% DAQ_PC_datetime = DAQ_PC_datetime'; %sizing
DAQ_PC_datetime(1:99) = DAQ_PC_datetime(100) + seconds(-.099:0.001:-.001); %extrapolate at start
for n=100:100:100*length(DAQ_PC_datetime_short)-1 %interpolate in the middle
tic
DAQ_PC_datetime(n:n+100) = linspace(DAQ_PC_datetime(n),DAQ_PC_datetime(n+100),101); %linear interpolation
toc
end
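For what it's worth, a sketch of how the loop above could be avoided entirely: do the interpolation once on a plain numeric representation (posixtime, i.e. seconds since the epoch) and convert back to datetime in a single step. Variable names mirror the question; the 'extrap' option handles the points before the first known time.

```matlab
short_length = 300000;
known_idx = (100:100:100*short_length)';   % indices where the time is known
known_t = datetime('now') + seconds((0:0.001:0.001*(short_length-1))');

all_idx = (1:100*short_length)';
% interpolate (and extrapolate at the edges) on doubles, not on datetimes
all_sec = interp1(known_idx, posixtime(known_t), all_idx, 'linear', 'extrap');
DAQ_PC_datetime = datetime(all_sec, 'ConvertFrom', 'posixtime'); % one conversion
```

This pays the datetime conversion cost exactly once instead of once per 100-point chunk.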
17 Comments
Star Strider
on 28 Feb 2020
So the deal is you have a set of times that is 1:100 as compared to the actual data array, and you want the time for every point?
I gather the time interval between these points isn't (quite) uniform?
What is the sample rate/dt between samples? I'm wondering if this isn't a place to do the interpolation with the venerable datenum first, then convert the result to datetime.
I'm not seeing a way that retime solves your problem, although that was also one of my first thoughts along with Star Strider's; on reading the doc again it doesn't seem set up for this problem... it wants a globally uniform dt.
dpb
on 28 Feb 2020
I think the problem is twofold -- the large array, and the fact that the array is a datetime object, which is more than just a bigger version of a datenum number.
MATLAB appears to be making copies behind the scenes, and that is slow... I just tried to change the format string on the version in memory to see the difference, and it sent MATLAB into a comatose state -- I had to kill the process after it didn't complete in 15+ minutes and completely hogged the CPU. :(
I'll not be trying that again! :)
Guillaume
on 28 Feb 2020
"I just tried to change the format string on the version in memory to see the difference and it sent MATLAB into a comatose state"
I'm surprised by that. The Format property of a datetime array is common to all the elements, so it's one char vector to change regardless of the size of the array.
Internally, a datetime is stored as a pair of doubles (strictly speaking, as a complex double), so updating a datetime array should take about as long as updating a double array twice the size.
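A rough micro-benchmark (a sketch; exact numbers depend on release and hardware) to compare assigning 100 elements into a preallocated double array versus a datetime array of the same size:

```matlab
N = 30e6;                          % same order of magnitude as in the question
d  = zeros(N, 1);                  % plain double array
dt = NaT(N, 1);                    % preallocated datetime array
idx = 100:100:10000;               % overwrite only 100 elements

tic; d(idx)  = 1;                  toc  % native double assignment
tic; dt(idx) = datetime('now');    toc  % assignment through the datetime class
```

If the class machinery (rather than raw memory size) dominates, the second toc will be far larger than twice the first.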
dpb
on 28 Feb 2020
Well, that's the memory footprint, yes (actually, interestingly enough, it's (16*NUMELEM + 1) bytes; not sure what that one extra byte is for).
BUT the difference between a double or complex double and a datetime is that updating a datetime has to go through the class functions/procedures, and those seem to be expensive relative to a native double. I've noticed that with tables containing datetimes, or with timetables, at much smaller sizes than the OP's here; I think his array size just makes the issue large enough to be very observable.
Oh! I see what brought MATLAB to its knees... I didn't put the trailing ; on the command to change the format string, so it tried to output all the results to the screen. Still, one would think that would just start the screen scrolling and be interruptible with Ctrl-C. Instead, judging from the disk thrashing, it sounded like it tried to build the whole display array first, or something else bizarre, because it locked up in a tight loop and was undoubtedly disk-swapping the whole time.
So it wasn't changing the format string itself that did it -- but still not nice...
Guillaume
on 28 Feb 2020
The memory footprint is actually 16*numel(datetimearray) + 2*numel(datetimearray.Format) + 2*numel(datetimearray.TimeZone)
I was just looking at the dynamic memory footprint of the elements themselves, not the one-time overhead of the object variable.
>> datetime(repmat(datestr(now),1,1));
>> whos ans
  Name    Size    Bytes    Class       Attributes
  ans     1x1     9        datetime
>> datetime(repmat(datestr(now),2,1));
>> whos ans
  Name    Size    Bytes    Class       Attributes
  ans     2x1     17       datetime
>> whos DAQ_PC_datetime
  Name               Size        Bytes      Class       Attributes
  DAQ_PC_datetime    300000x1    4800001    datetime
That one odd byte is peculiar, methinks...must be a code of some sort for the class.
There certainly has to be other hidden storage for the class object somewhere, yes. I don't know enough about the class mechanism in MATLAB to have any idea where that actually resides, and I certainly haven't tried to delve into the internals of the datetime object itself. But I'd presume it's metadata stored with the property when the class variable is created -- which is why it doesn't show up in the memory footprint from whos.
Hans Janssen
on 2 Mar 2020
Edited: Hans Janssen on 2 Mar 2020
Siddharth Bhutiya
on 6 Mar 2020
What version of MATLAB are you using?
Stephen23
on 6 Mar 2020
@Siddharth Bhutiya: perhaps this answers your question:

Siddharth Bhutiya
on 19 Mar 2020
@Hans There have been performance improvements to subscripted assignment in the recent releases (R2019b and R2020a) that would specifically affect your use case. You can find more information about that in the release notes.
Specifically, if you use R2020a and wrap this code inside a function (instead of a script), you should see a significant improvement in performance.
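For reference, a sketch of what wrapping the questioner's loop in a function could look like (fill_times is a hypothetical name):

```matlab
function DAQ_PC_datetime = fill_times(DAQ_PC_datetime_short)
% Interpolate a 1-in-100 vector of known datetimes up to full resolution.
% Running this as a function (not a script) lets the in-place assignment
% improvements of R2019b/R2020a apply.
DAQ_PC_datetime = NaT(numel(DAQ_PC_datetime_short)*100, 1);
DAQ_PC_datetime(100:100:end) = DAQ_PC_datetime_short;
for n = 100:100:numel(DAQ_PC_datetime)-100
    DAQ_PC_datetime(n:n+100) = linspace(DAQ_PC_datetime(n), ...
                                        DAQ_PC_datetime(n+100), 101);
end
end
```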
Eike Blechschmidt
on 6 Aug 2021
I'm facing the exact same problem (assigning datetime values to a preallocated datetime array), and it does not matter whether I use the code in a script or in a function. Has a faster solution been found since?
dpb
on 8 Aug 2021
t(i)=mean(data.Timestamp(info.Max));
probably wouldn't need to be a subscripted assignment on an element-by-element basis but could likely be vectorized. And I'm not positive whether the JIT compiler will replace the function call to mean() with a cached result inside a loop; but unless info.Max is changing, it's redundant to compute it on each pass.
The statement above also appears to be a do-nothing statement in that it calculates the requested mean but doesn't appear to do anything further with the result.
We would need to see more of the code than these two lines in isolation, but there appear to be areas here that could likely be improved.
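One way to vectorize along those lines (a sketch, assuming the datastore loop shown later in this thread): accumulate the timestamps as plain doubles inside the loop and convert to datetime once, after the loop, so the per-iteration datetime assignment disappears.

```matlab
tsec = zeros(1000000, 1);          % accumulate seconds-since-epoch as doubles
i = 0;
while hasdata(ds)
    i = i + 1;
    [data, info] = read(ds);
    tsec(i) = mean(posixtime(data.Timestamp(info.Max)));
end
t = datetime(tsec(1:i), 'ConvertFrom', 'posixtime'); % single conversion at the end
```

The loop then only touches native doubles, and the datetime class machinery is invoked once instead of once per read.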
Eike Blechschmidt
on 9 Aug 2021
In this instance I'm just doing a map and reduce without the function overhead of mapreduce. Additionally, I want to store some processed data on each read (e.g. the mean of some variables). I use the following code:
ds = datastore(); % a custom datastore that is acceptably fast already
total = 0; % renamed from "sum" so the variable doesn't shadow the sum function
num = 0;
p = zeros(1000000,1); % preallocate more elements than will ever be read & processed
t(1000000) = NaT;
i = 0;
while hasdata(ds)
    i = i + 1;
    [data, info] = read(ds);
    total = total + sum(data.T);
    num = num + numel(data.T);
    p(i) = mean(data.p);
    t(i) = mean(data.Timestamp(info.Max)); % info.Max changes every read and always contains just two indices
end
% remove the unused preallocated elements
t(i+1:end) = [];
p(i+1:end) = [];
% build the average
average = total/num;
% use the data p and t, e.g. plot it
plot(t, p);
I also put this into a (local) function for the JIT compiler. I hope I did not miss anything. So in this case I don't see where I can vectorize that computation or speed it up somewhere else.
What I also experienced is that the profiler generates quite some overhead, probably due to the additional timing code and the lack of good JIT compilation. I would expect some of that, but I was surprised by the extent:
Elapsed time is 112.928500 seconds. % <- without profiler
Elapsed time is 158.068796 seconds. % <- with profiler on
I then added some timing code (tic(), toc()) around the different parts above and got the following times:
meanPart     = 2.1963; % seconds for the sum and numel part
meanTimePart = 7.7242; % seconds for averaging the time values
assignTime   = 2.1287; % seconds for assigning the time value
So I guess I was fooled by the profiler as to where most of the time is spent. But from this experience, datetime is far from usable in speed-relevant code.
Or am I missing something here?
dpb
on 9 Aug 2021
[quoting the loop from the previous comment]
Why create such huge arrays for p and t? It looks like you've created them even larger than the datastore content, whereas they really only need to hold the number of segments in the datastore, which I would think is several orders of magnitude smaller.
What is i when the above loop is done?
I've got to run at the moment so can't actually do any experimenting just now, but I'd guess that has a lot to do with the times.
Plus, it would be good to show the code including where you placed the various tic/toc calls so folks can see unequivocally what they represent.
Eike Blechschmidt
on 9 Aug 2021
I have a datastore that reads chunks of data from the hard disk. On top of that I have a second datastore, which analyses the buffered data and returns parts of it (e.g. one period/cycle of a really long sine wave). For each period/cycle I do an analysis as shown above. So I allocate memory for the number of cycles, not for the amount of data I expect (around 1000x the number of cycles). Accordingly, i runs over the total number of periods/cycles present in the data. I just noticed that the code above does not match the timing I provided, as I separated the mean part of the datetime section into a separate call and stored it in a temporary variable; the tic was after the read call and before the two datetime calls. I'm sorry, I'm on my phone so I could not write it as code again. I will post it tomorrow.
Eike Blechschmidt
on 10 Aug 2021
...
timing = zeros(1000000, 3); % preallocate the timing array
while hasdata(ds)
i = i + 1;
[data, info] = read(ds);
tic();
sum = sum + sum(data.T);
num = num + numel(data.T);
p(i) = mean(data.p);
timing(i, 1) = toc();
tic();
val = mean(data.Timestamp(info.Max)); % where info.Max changes every read and always just contains two indices
timing(i, 2) = toc();
tic();
t(i) = val;
timing(i, 3) = toc();
end
...
After running the code, I just summed up all the timings to get the totals posted before.
Answers (0)