Why is replacing datetimes in a large array slow?

I have a large data set for which I know the time at every 100th point. To get the times of the other points, I'm interpolating between the known times using linspace. While the linspace command itself seems quite fast, writing the resulting datetime values into my preallocated array gets slower the larger the array is: doubling the array size also doubles the writing time.
For a 30,000,000-point array it takes 0.24 seconds to overwrite only 100 points. This seems far too long to me.
Why is the writing time proportional to the array size? And more importantly: how can I reduce this run time?
I have checked that running linspace alone and writing its result to ans is sub-millisecond fast, so the cost really comes from writing into the large array.
short_length=300000;
random = rand(short_length,1)/1001; %random timeshift
DAQ_PC_datetime_short=datetime('now')+seconds((0:0.001:0.001*(short_length-1))'+random); %generate fictive times
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1); %init array
DAQ_PC_datetime(100:100:100*length(DAQ_PC_datetime_short)) = DAQ_PC_datetime_short; %set known values
% DAQ_PC_datetime = DAQ_PC_datetime'; %sizing
DAQ_PC_datetime(1:99) = DAQ_PC_datetime(100) + seconds(-.099:0.001:-.001); %extrapolate at start
for n=100:100:100*length(DAQ_PC_datetime_short)-1 %interpolate in the middle
tic
DAQ_PC_datetime(n:n+100) = linspace(DAQ_PC_datetime(n),DAQ_PC_datetime(n+100),101); %linear interpolation
toc
end
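For reference, the scaling can be reproduced in isolation with a minimal sketch (variable names are mine; exact times will depend on release and machine): the identical 100-element write into preallocated NaT arrays of two sizes.

```matlab
% Time the same 100-element datetime write into arrays of different sizes.
for N = [3e6 3e7]
    A = NaT(N,1);                                 % preallocated datetime array
    tic
    A(1:100) = datetime('now') + seconds(1:100);  % overwrite only 100 elements
    fprintf('N = %8d: %.4f s\n', N, toc);
end
```

On affected releases the larger array should take roughly ten times as long, even though the same 100 elements are written.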

17 Comments

Experiment with the retime function. It may be faster. You will need to convert your array to a timetable first; however, that is not difficult.
dpb on 28 Feb 2020 (edited)
So the deal is you have a set of times that is 1:100 as compared to the actual data array, and you want the time for every point?
I gather the time interval between these points isn't (quite) uniform?
What is the sample rate/dt between samples? I'm wondering if this isn't a place to use the venerable datenum for the interpolation process first, then convert that to datetime.
I'm not seeing a way that retime solves your problem, although that was also one of my first thoughts along with StarS, but on reading the doc again it doesn't seem set up for this problem... it wants a global uniform dt.
I think the problem is twofold: the large array, and that the array is a datetime object, which is more than just a bigger version of a datenum number.
MATLAB is making copies behind the scenes, it appears, and that is slow... I just tried to change the format string on the version in memory to see the difference, and it sent MATLAB into a comatose state: I had to kill the process, which hadn't completed in 15+ minutes and was completely hogging the CPU. :(
I'll not be trying that again! :)
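A sketch of that datenum route, using the OP's variable names (untested; interp1's 'extrap' option also covers the first 99 points that the OP extrapolates by hand):

```matlab
known_idx = 100:100:100*length(DAQ_PC_datetime_short); % indices of the known times
known_num = datenum(DAQ_PC_datetime_short);            % datetime -> plain doubles
all_num   = interp1(known_idx, known_num, 1:known_idx(end), 'linear', 'extrap');
DAQ_PC_datetime = datetime(all_num, 'ConvertFrom', 'datenum')'; % convert back once
```

All the heavy lifting happens on doubles, and the large datetime array is created exactly once at the end.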
"I just tried to change the format string on the version in memory to see the difference and it sent MATLAB into comatose state"
I'm surprised by that. The Format property of a datetime array is common to all the elements, so it's one char vector to change regardless of the size of the array.
Internally, a datetime is stored as a pair of doubles (strictly speaking, as a complex double), so updating a datetime array is going to take as long as updating a double array twice the size.
Well, that's the memory footprint, yes (actually, interestingly enough, it's (16*NUMELEM + 1) bytes; not sure what that one extra byte is for).
BUT the difference between a double or complex double and a datetime is that updating a datetime has to go through the class functions/procedures, and those seem to be expensive relative to a native double. I've noticed that with tables containing datetimes, or with timetables, at much smaller sizes than the OP's here; I think his size just makes the issue large enough to be very observable.
Oh! I see what brought ML to its knees... I didn't put the trailing ; on the command to change the format string, so it tried to output all the results to the screen. Still, one would think that would just start the screen scrolling and one should be able to CTRL-C out of it. Instead, from the disk thrashing, it sounded like it tried to create the whole array to display first, or something else bizarre, because it locked up in a tight loop of some sort and was undoubtedly disk-swapping trying to do whatever...
But it wasn't just changing the format string itself that did it; still, not nice...
The memory footprint is actually 16*numel(datetimearray) + 2*numel(datetimearray.Format) + 2*numel(datetimearray.TimeZone)
I was just looking at the dynamic memory footprint of the elements themselves, not the object variable one-time overhead.
>> datetime(repmat(datestr(now),1,1));
>> whos ans
  Name      Size            Bytes  Class       Attributes
  ans       1x1                 9  datetime
>> datetime(repmat(datestr(now),2,1));
>> whos ans
  Name      Size            Bytes  Class       Attributes
  ans       2x1                17  datetime
>> whos DAQ_PC_datetime
  Name                 Size            Bytes  Class       Attributes
  DAQ_PC_datetime      300000x1      4800001  datetime
>>
That one odd byte is peculiar, methinks...must be a code of some sort for the class.
There certainly has to be other hidden storage for the class object somewhere, yes. I don't know enough about the class mechanism in MATLAB to have any idea where that actually resides, and certainly haven't tried to delve into the internals of the datetime object itself. But I'd presume the metadata is stored with the property when the class variable is created, which is why it doesn't show up in the memory footprint from whos.
Thanks for the replies; sorry I'm a bit late with answering, the weekend happened.
@dpb: I indeed have 1 data point that marks the end of every 100 stored points, and I aim to get the actual date for the other points by linear interpolation. The rate should be 1000 Hz, but isn't quite that: it is ever so slightly faster, and also varies somewhat over time.
So from my understanding, you two are saying MATLAB is making copies of the entire array behind the scenes (after all, linspace itself is quite fast; adding the DAQ_PC_datetime(n:n+100) = makes it slow). That would make sense for the timing scaling. It does not, however, give me a solution.
Would it be possible to directly edit the array elements without making copies?
Alternatively, building on your comment about copies being made in the background, I tried a workaround, although it's a bit ugly. Basically I split the large array into smaller ones, do the linspace on the smaller ones, and then reconstruct the original size:
clear all
tic
short_length=300000;
random = rand(short_length,1)/1001; %random timeshift
DAQ_PC_datetime_short=datetime('now')+seconds((0:0.001:0.001*(short_length-1))'+random); %generate fictive times
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1); %init array
DAQ_PC_datetime(100:100:100*length(DAQ_PC_datetime_short)) = DAQ_PC_datetime_short; %set known values
Divider = 1000; %amount of arrays to make (take care that it fits well, as no rounding checks are implemented)
for m=1:Divider %make multiple smaller datetime arrays out of DAQ_PC_datetime subsets
Small_Array{m}=DAQ_PC_datetime((m-1)*length(DAQ_PC_datetime)/Divider+1:m*length(DAQ_PC_datetime)/Divider);
end
Small_Array{1}(1:99) = Small_Array{1}(100) + seconds(-.099:0.001:-.001); %extrapolate at start
for m=1:Divider %use linear interpolation while reducing the internal copy size for every linspace use
if m>1 %bridge the chunk boundary; indices 1:99 of this chunk would otherwise stay NaT
tmp = linspace(Small_Array{m-1}(end),Small_Array{m}(100),101);
Small_Array{m}(1:99) = tmp(2:100);
end
for n=100:100:length(Small_Array{m})-100
Small_Array{m}(n:n+100) = linspace(Small_Array{m}(n),Small_Array{m}(n+100),101); %linear interpolation
end
end
for m=1:Divider %combine the smaller arrays back into original variable
DAQ_PC_datetime((m-1)*length(Small_Array{m})+1:m*length(Small_Array{m}))=Small_Array{m};
end
toc
This entire piece of code runs in about 5 minutes, as opposed to the (calculated) 19 hours for my original post.
Some trial and error shows a divider of 1000 is fair. I will have to modify it a bit, as my actual data won't divide perfectly into 1000. It may be interesting to look at the ideal division ratio; that should depend on the initial array size.
What version of MATLAB are you using?
@Siddharth Bhutiya: perhaps this answers your question:
@Hans There have been performance improvements to subscripted assignment in recent releases (R2019b and R2020a) that would specifically affect your use case. You can find more information about that in the release notes here.
Specifically, if you use R2020a and wrap this code inside a function (instead of a script), then you would be able to get a significant improvement in performance.
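For illustration, a sketch of that suggestion (the function name fill_times is my invention; the body is the loop from the question moved into a function, where in-place subscripted assignment can be optimized):

```matlab
function DAQ_PC_datetime = fill_times(DAQ_PC_datetime_short)
% Same interpolation as the script version, but inside a function body.
DAQ_PC_datetime = NaT(length(DAQ_PC_datetime_short)*100,1);
DAQ_PC_datetime(100:100:end) = DAQ_PC_datetime_short;         % known values
DAQ_PC_datetime(1:99) = DAQ_PC_datetime(100) + seconds(-.099:0.001:-.001);
for n = 100:100:length(DAQ_PC_datetime)-100
    DAQ_PC_datetime(n:n+100) = linspace(DAQ_PC_datetime(n),DAQ_PC_datetime(n+100),101);
end
end
```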
I'm facing the exact same problem (assigning datetime values to an allocated datetime array) and it does not matter if I use the code in a script or a function. Has there been a faster solution?
t(i)=mean(data.Timestamp(info.Max));
probably wouldn't need to be a subscripted assignment on an element-by-element basis but could likely be vectorized. And I'm not positive whether the JIT compiler will replace the function call to mean() with a computed result inside a loop or not; but unless info.Max is changing, it's redundant to compute it each pass.
The statement above appears to be a do-nothing statement in that it calculates the requested mean but doesn't appear to do anything with the result.
We would need to see more of the code than these two lines in isolation, but there appear to be areas that could likely be improved here.
In this instance I'm just doing a map and reduce without the function overhead of mapreduce. Additionally, I want to store some processed data for each read (e.g. the mean of some data). I use the following code:
ds = datastore(); % a custom datastore that is acceptably fast already
total = 0; % renamed from "sum" so the builtin sum() is not shadowed
num = 0;
p = zeros(1000000,1); % preallocate more elements than are ever read & processed
t(1000000) = NaT; % grows t as a datetime array filled with NaT
i = 0;
while hasdata(ds)
i = i + 1;
[data, info] = read(ds);
total = total + sum(data.T);
num = num + numel(data.T);
p(i) = mean(data.p);
t(i) = mean(data.Timestamp(info.Max)); % info.Max changes every read and always contains exactly two indices
end
% remove the unused preallocated elements
t(i+1:end) = [];
p(i+1:end) = [];
% build the average
average = total/num;
% use the data p and t, e.g. plot it
plot(t, p);
I also put this into a (local) function for the JIT compiler. I hope I did not miss anything. So in this case I don't see where I can vectorize that computation or speed it up somewhere else.
What I also experienced is that the profiler generates quite some overhead, probably due to the additional timing code and the lack of good JIT compilation. I would expect that, but I was surprised by the extent:
Elapsed time is 112.928500 seconds. % <- without profiler
Elapsed time is 158.068796 seconds. % <- with profiler on
I now added some timing code (tic/toc) around the different parts above and got the following times:
meanPart = 2.1963; % seconds for the sum and numel part
meanTimePart = 7.7242; % seconds for averaging the time values
assignTime = 2.1287; % seconds for assigning the time value
So I guess I was fooled by the profiler as to where most of the time is spent. But from this experience, datetime is far from usable in speed-relevant code.
Or am I missing something here?
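Since info.Max always holds exactly two indices, one thing worth timing (my suggestion, not from the thread) is replacing the mean call with direct midpoint arithmetic, which for two timestamps gives the same result:

```matlab
ts   = data.Timestamp(info.Max);   % the two bracketing timestamps
t(i) = ts(1) + (ts(2) - ts(1))/2;  % midpoint, bypassing mean()'s generic overhead
```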
[quoting the while-loop from the comment above]
...
Why create such huge arrays for p and t? It looks like you've created them even larger than the datastore content, whereas they really only need to hold the number of segments in the datastore, which I would think is several orders of magnitude smaller.
What is i when the above loop is done?
I've got to run at the moment so can't actually do any experimenting just now, but I'd guess that has a lot to do with the times.
Plus, it would be good to show the code including where you placed the various tic/toc calls, so folks can see unequivocally what they represent.
I have a datastore that reads chunks of data from the hard disk. On top of that I have a second datastore which analyses the buffered data and returns parts of it (e.g. one period/cycle of a really long sine wave). For each period/cycle I do an analysis as shown above. So I allocate memory for the number of cycles, not the amount of data I expect (around 1000x the number of cycles), and i counts the total number of periods/cycles present in the data. I just saw that the code above does not match the timing I provided, as I separated the mean part of the datetime section into a separate call and stored it in a temporary variable; the tic was after the read call and before the two datetime calls. I'm sorry, I'm on my phone so I could not write it as code again. I will post it tomorrow.
...
timing(1000000, 3) = 0; % preallocate the timing array
while hasdata(ds)
i = i + 1;
[data, info] = read(ds);
tic();
sum = sum + sum(data.T);
num = num + numel(data.T);
p(i) = mean(data.p);
timing(i, 1) = toc();
tic();
val = mean(data.Timestamp(info.Max)); % where info.Max changes every read and always just contains two indices
timing(i, 2) = toc();
tic();
t(i) = val;
timing(i, 3) = toc();
end
...
After running the code I just summed up all the timings to get the totals posted before.


Answers (0)

Release: R2017a
Asked: on 28 Feb 2020
Commented: on 10 Aug 2021
