MATLAB Answers

Poor R2019b performance with tables

jg on 9 Nov 2019
Commented: Michelle Hirsch on 16 Apr 2020
I read that there were improvements in table object performance in 2019b (at least with subsref stuff). I put together a simple script and directly compared r2019b to r2018b, both fresh installations.
2019b was much slower than 2018b. If this is unexpected, how can I figure out what is going wrong?
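clearvars -except t
t = readtable('patients.dat');
var1 = 'Location';
var2 = 'Age';
nt = 1e4;
% case 1: scalar assignment into a variable accessed by literal name
tic
for i=1:nt
    t.Age(3) = 20;
end
fprintf('case 1: %g\n',toc);
% case 2: whole-variable assignment by literal name
tic
for i=1:nt
    t.Location = t.Age;
end
fprintf('case 2: %g\n',toc);
% case 3: whole-variable assignment by dynamic name
tic
for i=1:nt
    t.(var1) = t.(var2);
end
fprintf('case 3: %g\n',toc);
% case 4: scalar assignment into a variable accessed by dynamic name
tic
for i=1:nt
    t.(var2)(3) = 20;
end
fprintf('case 4: %g\n',toc);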
% R2018b
case 1: 0.987709
case 2: 0.817124
case 3: 0.741722
case 4: 0.922625
% R2019b
case 1: 1.64177
case 2: 1.20244
case 3: 1.19423
case 4: 1.5782
  1 Comment
Martin Lechner on 12 Nov 2019
According to the release notes of R2019b:
"However, the performance improvement occurs only when you make table subscripted assignments within a function. There is no improvement when subscripting into tables at the command line, or within try-catch blocks."
I put your code into a function and got very similar results. I only have MATLAB R2019a and R2019b installed, but your code still runs slower on R2019b.
I also tried modifying your code, but I always got very similar performance in R2019a and R2019b. I am not sure what this test shows, because you are accessing one element of a small table many times. If I modify the test to access more or all elements multiple times, the performance is still the same.
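For readers who want to try the release-note caveat quoted above, here is a minimal sketch of case 1 from the original script wrapped in a function, so the subscripted assignment runs inside a function rather than in a script or at the command line. The function name timeTableAssignment is hypothetical (it is not from the original post), and the sketch assumes patients.dat is on the MATLAB path, as in the original script. Whether this is actually faster will depend on the release and the table size.

```matlab
function elapsed = timeTableAssignment(nt)
% timeTableAssignment  Time repeated scalar assignment into a table variable.
%   Hypothetical helper, not part of the original post. Save it as
%   timeTableAssignment.m so the loop below executes inside a function.
if nargin < 1
    nt = 1e4;                      % same iteration count as the original script
end
t = readtable('patients.dat');     % sample data set used in the original script
tic
for i = 1:nt
    t.Age(3) = 20;                 % case 1 from the original script
end
elapsed = toc;                     % elapsed time in seconds
fprintf('case 1 (inside a function): %g\n', elapsed);
end
```

Calling `timeTableAssignment()` on each installed release and comparing the printed times gives a like-for-like check of case 1.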


Answers (3)

Campion Loong on 27 Mar 2020
Edited: Campion Loong on 27 Mar 2020
Hi all,
Continuing our investment in performance, R2020a has been released with further indexing improvements in table, timetable, categorical, datetime, duration and calendarDuration. Most notably for this thread, the performance gain now applies across all array sizes. Please refer to our Performance Release Notes for details.
Going back to JG's original timing example using a 100-row table (attached): in R2020a, it produces the following results -- note the improvement over the same example measured in R2018b with an identical setup:
% R2020a
>> JG_timing_example;
case 1: 0.663
case 2: 0.528
case 3: 0.405
case 4: 0.621
% R2018b
>> JG_timing_example;
case 1: 0.874
case 2: 0.668
case 3: 0.683
case 4: 0.791
Martin's example using a 1e6-row table (attached) also reveals continual improvement across consecutive releases:
% R2020a
>> MARTIN_timing_example
indexing into table: 0.634
indexing into Matlab vector: 0.008
indexing into vector in struct: 0.007
% R2019b
>> MARTIN_timing_example
indexing into table: 1.271
indexing into Matlab vector: 0.007
indexing into vector in struct: 0.007
% R2019a
>> MARTIN_timing_example
indexing into table: 51.351
indexing into Matlab vector: 0.007
indexing into vector in struct: 0.007
While there remains more work to be done, these improvements do represent our continual commitment to move performance forward in these recent data types -- ones that carry rich ecosystems to represent your data naturally and support your increasingly complex workflows.
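As a quick sanity check on the numbers quoted above for Martin's example, the release-to-release ratios can be worked out directly. This is only arithmetic on the posted timings (the variable names are made up for illustration); it gives roughly 40x from R2019a to R2019b and roughly 2x from R2019b to R2020a, while table indexing in R2020a remains well behind plain vector indexing.

```matlab
% Table-indexing timings in seconds, copied from the results posted above
% (Martin's example, 1e6 rows). Variable names are illustrative only.
tR2019a = 51.351;
tR2019b = 1.271;
tR2020a = 0.634;
tVector = 0.008;    % R2020a "indexing into Matlab vector" timing

fprintf('R2019a -> R2019b speedup: %.1fx\n', tR2019a / tR2019b);
fprintf('R2019b -> R2020a speedup: %.1fx\n', tR2019b / tR2020a);
fprintf('R2020a table vs. plain vector: %.0fx slower\n', tR2020a / tVector);
```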
Michelle Hirsch on 16 Apr 2020
jg: I'm sorry about the impact that the 19b performance regression is having on your work. Could you say more about your license? If it's current on maintenance (or purchased within the last year), there is no charge to access new versions. Feel free to message me directly via my profile if you prefer.


Campion Loong on 20 Nov 2019
Thank you all for keeping track of table performance.
First of all, we regret that the performance improvement in assigning into large table/timetable variables comes at the expense of a performance regression on small variables. R2019b's improvement focused only on large variables. For example, as Martin noted, subscripted assignment into variables with 1e6 rows is substantially (~40x) faster in R2019b compared to R2019a. This is true for both table and timetable, as well as other newer data types such as datetime and categorical. Because our workflows research suggested performance is more important in large variables/arrays, we made a difficult choice to deliver this substantial improvement despite the regression with small variables/arrays. However, although the regression is less substantial than the improvement, we are aware of and will address the issue in upcoming releases.
I would also stress that these R2019b improvements are not a one-time effort. We have active, ongoing projects prioritizing subscripting performance of our data types. Martin is correct to note that struct and numeric arrays received very aggressive optimization over their many years in the language. Data types introduced more recently, such as table and datetime, have a rich ecosystem to support modern analytic workflows, and all of them represent data more naturally in its original form. We are continually working on deeper optimization to improve their performance as well.
To that end, please reach out to us if your workflows are notably impacted by performance in recently introduced data types (i.e. table, datetime, categorical etc.), such as those 'jg' suggested that are affecting "95% of the time" of usage. Send us your workflows on Answers, by direct message or via support. Most importantly, DO include all parts of your workflow (i.e. computations, import/export etc.) that are important to your goal -- not just subscripting alone. Often there are small changes we can recommend to notably improve performance in existing code. Last but not least, your specific examples will go a long way in informing our development - we would very much love to learn about them!
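As one illustration of the kind of small change that can help, here is a sketch of a common workaround (an assumption on my part, not something recommended in this thread): copy the variable out of the table once, do the element-wise updates on the plain vector, and write the result back in a single assignment. The setup mirrors Martin's 1e6-row test; since `table(zeros(1e6,1))` is built from an expression rather than a workspace variable, the variable gets the default name Var1.

```matlab
% Assumed setup, mirroring Martin's test: 1e6-row table, 1e4 random indices.
t = table(zeros(1e6,1));        % variable gets the default name Var1
indices = randi(1e6, 1, 10000);

v = t.Var1;                     % copy the variable out of the table once
tic
for i = indices
    v(i) = rand;                % plain numeric indexing inside the loop
end
t.Var1 = v;                     % write the whole variable back in one step
fprintf('extract, update, write back: %g\n', toc);
```

This only applies when the loop touches a single variable and nothing else reads the table mid-loop; otherwise the table and the local copy would be out of sync until the final assignment.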

Martin Lechner on 12 Nov 2019
To verify the advertised performance improvements, I tested the example from the R2019b release notes and compared its indexing speed with that of an array and a struct (see the code below; the performance results are compared in its comment section). According to the release notes, older releases perform worse when the index is random.
I wasn't aware that table performance is so bad. Compared with indexing into the equivalent struct (a struct with a field Var1), tables are still more than a factor of 100 slower. Indexing into the MATLAB array directly performs nearly the same as indexing into the struct.
function [t, tmp, tmpStruct] = timingTest_MatlabTable()
% performance test
% I added the return values to ensure that the JIT doesn't throw away all the calculations because the results are never used.
% As of R2019b I didn't encounter any difference (with or without return values). I think Java's JIT does much more
% aggressive optimization.
%
% Example run with a typical result:
%   [t, tmp, tmpStruct] = timingTest_MatlabTable();
% R2019b result:
%   indexing into table: 1.66243
%   indexing into Matlab vector: 0.0089993
%   indexing into vector in struct: 0.0091241
% R2019a result:
%   indexing into table: 56.3513
%   indexing into Matlab vector: 0.0097908
%   indexing into vector in struct: 0.0101078

tmp = zeros(1e6,1);
% Name the variable Var1 explicitly; table(tmp) alone would name it "tmp".
t = table(tmp,'VariableNames',{'Var1'});
tmpStruct = struct("Var1",tmp);
indices = randi(1e6,1,10000);

tic
for i = indices
    t.Var1(i) = rand;
end
fprintf(' indexing into table: %g\n',toc);

tic
for i = indices
    tmp(i) = rand;
end
fprintf(' indexing into Matlab vector: %g\n',toc);

tic
for i = indices
    tmpStruct.Var1(i) = rand;
end
fprintf('indexing into vector in struct: %g\n',toc);
end
jg on 14 Nov 2019
I did play around with using dynamic properties for table variables. There was enough of a speed penalty that I decided against it, though in my application the set of variables is consistent enough that I can live with hard-coding variable names (or inheriting from a class with new variable names when needed).
