MATLAB Answers

Performance of table data type

Asked by Michael
on 30 Oct 2014
Latest activity Commented on by LuisCardona on 28 Jun 2017
Hello!
Is it normal that writing into a table data structure is 1000 times slower than writing into a cell array of the same size? And that reading is 50 times slower?
Try the following code:
%Test:
tic;
A = cell(10000, 50);
'Time for initializing cell array:'
toc
tic;
B = cell2table(A);
'Time for initializing table:'
toc
i = 0; % create variable
tic;
for i = 1 : 2500
A{i, 7} = 'aaa';
end
'Time for writing into cell array:'
toc
tic;
for i = 1 : 2500
B{i, 7} = {'aaa'};
end
'Time for writing into table:'
toc
x = ''; % create variable
tic;
for i = 1 : 2500
x = A{i, 7};
end
'Time for reading from cell array:'
toc
tic;
for i = 1 : 2500
x = B{i, 7};
end
'Time for reading from table:'
toc

  2 Comments

While tables do have performance issues, this example is particularly pathological.
The initialization of a table with an array of empty cells is problematic. The following initialization is much faster:
tic;
A = repmat({''},1e4,50);
'Time for initializing cell array:'
toc
Also, named reference is preferred to curly brackets, i.e. B.A7(i) instead of B{i,7}.
I posted a similar question on Stack Overflow, which may be helpful: Matlab Table / Dataset type optimization


6 Answers

Answer by Peter Perkins
on 30 Oct 2014

Michael, table is currently not as fast as datatypes like double and cell when you are reading or writing individual values in a long loop. However, it's often possible to vectorize your code and read or write entire variables, at which point you probably won't notice a speed difference. You may also find that
B.Var7{i} = 'aaa'
is faster than
B{i, 7} = {'aaa'}
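For comparison, here is a minimal sketch of what "read or write entire variables" could look like. The table T and its variable names Var1..Var3 are made up for illustration (they are not the names cell2table would generate in the question), so treat this as an assumption-laden example rather than a benchmark:

```matlab
% Hypothetical table with three text columns, pre-filled with empty strings.
n = 2500;
emptyCol = repmat({''}, n, 1);
T = table(emptyCol, emptyCol, emptyCol, ...
    'VariableNames', {'Var1', 'Var2', 'Var3'});

% Scalar writes in a loop, using dot indexing as suggested above.
for i = 1:n
    T.Var2{i} = 'aaa';
end

% Vectorized write: the same values assigned in a single statement.
T.Var3(1:n) = {'aaa'};
```

Both columns end up with identical contents; the second form only goes through table indexing once instead of n times.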
Hope this helps.



Answer by Michael
on 31 Oct 2014

Thank you for the answer. In my case, I have to write single values. Therefore, the slow performance of the table data type is very disappointing. I will try to use B.Var7{i} = 'aaa', as you wrote. But such an (undocumented) difference in the behavior is also quite unsatisfying...

  1 Comment

Agreed. The table type appeared to be a perfect solution for what I needed to do. I found this question, registered my profile and wrote this while waiting for writetable to complete. The previous code using dlmwrite took a couple of seconds.



Answer by Oleg Komarov on 30 Nov 2016
Edited by Oleg Komarov on 2 Dec 2016

As already posted in the thread "table performance very slow", I repeat my take here.
I have been using table() since well before it was introduced into core MATLAB, since it is de facto a port of the dataset() class from the Statistics Toolbox. I also noticed a long time ago many limitations in terms of performance and functionality, and have logged enhancement requests with TMW.
To address the limitations of table() while waiting for the official implementation of my enhancement requests, I created tableutils(). Among the problems, you would be astonished to know that the disp() of a big table can literally freeze your PC until the next ice age (and I am not talking about the movies...). This is something that I fixed with a buffered disp method.
While my tableutils() do not address directly the problems in subsref/subsasgn, anyone is welcome to contribute to this effort to make the table() class better by submitting an issue or a Pull Request on Github.
Addressing some points in the question
  • It is 50x faster to initialize with {''} rather than with []
N = 500;
A = cell(N);
sprintf('cell2table() on empty cells: %.3fs', timeit(@()cell2table(A)))
A = repmat({''},(N));
sprintf('cell2table() on {''''} cells: %.3fs', timeit(@()cell2table(A)))
  • It is 5x faster to use dot-indexing, i.e. subsasgnDot, than brace-indexing, i.e. subsasgnBraces
S = 1000;
[row,col] = ind2sub([N N],randsample(N^2,S,false)); % size must be given as [rows cols]
% {} assignment
B = cell2table(A);
tic
for ii = 1:S
B{row(ii),col(ii)} = {'aaa'};
end
toc
% . assignment
C = cell2table(A);
vnames = B.Properties.VariableNames;
tic
for ii = 1:S
C.(vnames{col(ii)})(row(ii)) = {'aaa'};
end
toc



Answer by LuisCardona on 5 May 2016

Tables are the slowest thing I have ever used. I had to rewrite my code to use matrices, coding the names of my columns as integers, because of their poor performance.
Stay away from tables!

  3 Comments

Table is nonetheless a valuable programming paradigm (table-oriented programming). Don't throw the concept away altogether! Here are a few hints to make it work reasonably fast:
1) Do not join() the parts you don't need too early. Break the data into small tables first (to be linked later with keys).
2) Use nominal() instead of cellstr if you have a lot of repetitions! Under the hood it only keeps the unique() cellstr and maps them by indices. You are only manipulating indices with nominal(), so it's almost as fast as dealing with doubles.
3) If you need to repeatedly access/update/manipulate a specific section of the table (such as a matrix or variable) in a loop, just 'copy' (remember MATLAB uses copy-on-write) the specific section to a temporary, native data type, work on it, and put it back into the table after you're done.
TMW already knows this trick, but they have yet to implement it consistently everywhere in their code base. If you find performance issues in areas like subsasgn()/subsref(), you can work around them at user level (and 'cast a vote' by telling support about your experience so the developers can fix it).
Think of all the performance issues associated with a relational database as you work with table() objects. Almost all the beginners I trained complained about performance because they still had the traditional way of thinking (for-loops/imperative), while table-based programming accommodates newer schools of thought (functional/declarative programming) better.
Arrays/matrices are still strictly computationally faster, but the amount of work you spend managing them compared to the computational time you save (after following the suggestions above) is not worth it if you know how to play the cards right with table().
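To illustrate hint 2, here is a small sketch. Note that it swaps in categorical(), which ships with core MATLAB, in place of the nominal() class from the Statistics Toolbox mentioned above; the labels and variable names are invented for the example:

```matlab
% 1000 highly repetitive labels, stored once as a cellstr and once as categorical.
labels = repmat({'alpha'; 'beta'; 'gamma'; 'beta'}, 250, 1);
cats = categorical(labels);
T = table(labels, cats, 'VariableNames', {'AsCellstr', 'AsCategorical'});

% Selecting rows: string comparison on the cellstr vs. comparison on the
% categorical, which only has to compare the underlying category codes.
isBetaCell = strcmp(T.AsCellstr, 'beta');
isBetaCat  = (T.AsCategorical == 'beta');

% The distinct labels are stored only once.
uniqueLabels = categories(T.AsCategorical);
```

Both masks select the same rows; whether the categorical version is noticeably faster will depend on the data and the MATLAB release.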
I think the current table datatype seems to be an attempt to support more sophisticated Excel-like functionality, with a trade-off in optimization.
The problem is that with matrices you can't always remember which column name goes with which index, and searching for a string on every access to a variable is not a good solution.
I have used two ways to keep variable/column names: a structure of vectors of the same length, and a vector of structures (a.k.a. a nonscalar struct array).
Both have drawbacks: you can't get simple row-wise and column-wise access at the same time without a slow conversion to another data structure.
But I think there could be a simpler, optimized version of the table data type if we just want to combine row-number and column-variable indexing with ordinary arrays and cell arrays. And if we have only numbers (with no cell/string/sparse functionality), it could be even faster.
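The two layouts mentioned above could be sketched roughly as follows (field names time and value are invented for the example). It shows why each layout makes one access direction easy and the other awkward:

```matlab
% Layout 1: scalar struct whose fields are equal-length column vectors.
s.time  = (1:5)';
s.value = (1:5)'.^2;
col1 = s.value;                      % column access is direct
row1 = structfun(@(v) v(3), s);      % row 3 needs a pass over every field

% Layout 2: nonscalar struct array, one element per row.
r = struct('time', num2cell((1:5)'), 'value', num2cell(((1:5)').^2));
row2 = r(3);                         % row access is direct
col2 = [r.value]';                   % column access needs concatenation
```

Here row1 is a numeric vector with one entry per field, while row2 is a struct, so even the results of the "same" access differ between the two layouts.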
Hoi Wong, I wanted to clarify that I was talking about the tables in MATLAB, not the concept altogether. Thanks for the comment. But I maintain my position that they are terribly slow in MATLAB.



Answer by jbpritts
on 24 Nov 2016

I have MATLAB R2016b and can confirm that tables are terribly slow. Unless you really need them for heterogeneous data, avoid them in any performance-critical code. I will have to rewrite a fairly complicated section of code using legacy data structures. MATLAB should address this extreme performance deficiency.



Answer by Peter Perkins
on 2 Dec 2016
Edited by Peter Perkins
on 2 Dec 2016

As posts on this thread have indicated, while tables are often the right data structure for the job, their performance in scalar indexing is not comparable to that of types such as double and struct. While there have been significant performance improvements since the initial release in R2014b (e.g. writetable), and those improvements will continue, tables are best when operations can be vectorized. That's often true even with plain old double matrices. It's also best to pre-allocate a table rather than growing it row by row, and again, that's true even for double matrices.
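As a rough illustration of the pre-allocation advice, one way to do it (a sketch with made-up variable names Value and Label, using only the plain table constructor) is to build the table at its final height first and then fill it in:

```matlab
% Pre-allocate a table with n rows: one numeric and one text variable.
n = 500;
T = table(zeros(n, 1), repmat({''}, n, 1), ...
    'VariableNames', {'Value', 'Label'});

% Fill existing rows in place instead of appending a new row each iteration.
for k = 1:n
    T.Value(k) = k^2;
    T.Label{k} = sprintf('row%d', k);
end
```

The alternative of starting from an empty table and concatenating one row per iteration forces the table to be reallocated repeatedly, just as it would for a double matrix.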
In situations where code cannot be vectorized, perhaps because the results of one iteration of a loop affect subsequent iterations, it's often possible to encapsulate the body of a loop into a function that you call by passing it a table's variables using dot subscripting, and assign back to a table's variables, rather than completely rewriting code to not use tables. It often looks something like this:
[t.X,t.Y,t.Z] = fun(t.A,t.B,t.C)
where fun is a loop that works on separate arrays. Even when it's not desirable to encapsulate the code in a function body, it's often possible to "hoist" a small number of variables out of a table and into the workspace before a loop, have the loop work on them, and then put them back in the table. In other words, if performance is an issue, consider replacing the bottlenecks with code that uses lower-level data types rather than completely avoiding tables.
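A minimal sketch of the "hoisting" idea, assuming a hypothetical table t with numeric variables A and X and an invented recurrence in which each iteration depends on the previous one:

```matlab
% Hypothetical table: A holds inputs, X will hold the results.
n = 1000;
t = table((1:n)', zeros(n, 1), 'VariableNames', {'A', 'X'});

% Hoist the variables out of the table into plain double arrays.
a = t.A;
x = t.X;

% The loop works only on the plain arrays, so no table indexing occurs inside it.
x(1) = a(1);
for k = 2:n
    x(k) = 0.5 * x(k-1) + a(k);   % result depends on the previous iteration
end

% Put the result back into the table with a single assignment.
t.X = x;
```

The table is touched three times in total (two reads, one write) regardless of n, which is where the saving would come from.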

  2 Comments

Hi Peter, thanks for the suggestion. Is there any particular reason why the table.subsasgnBraces() transforms the RHS into a table?
A lot of overhead is incurred in that operation and subsequent table methods applied to a table-like RHS.
See, e.g., line 121 of @tabular\subsasgnBraces.m and line 191 of @tabular\subsasgnParens.m, which calls a MATLAB-coded repmat instead of the builtin repmat, since the input is the RHS converted to a table.
Your earlier observation that dot-then-parens indexing is faster than braces, for example, B.A7(i) vs B{i,7}, is true. That's one of the "significant performance improvements" I was referring to. It's an ongoing process. Table brace indexing is something we're planning to work on.
