Asked by Julian
on 10 Sep 2013

As an enthusiast for the dataset class, I noticed with interest a new class, table, in the latest MATLAB release (in the promo video). It sounds very similar to the existing dataset class in the Statistics Toolbox, which I have been using since its release.

When I search the documentation/help for "table dataset", all I find are the converter functions dataset2table and table2dataset. My question is: what is the difference in intention between the two classes? When is it appropriate to use a dataset and when a table? How do the designs of the two classes differ?

What about the "new" categorical class? Has it moved from the Statistics Toolbox into base MATLAB?

Should we expect the dataset and categorical classes in the Statistics Toolbox to be deprecated in the future?


Answer by Peter Perkins
on 10 Sep 2013

Accepted answer

Julian, as you noticed, MATLAB R2013b includes two new array types known as tables and categorical arrays. These are very similar to the dataset, nominal, and ordinal array types that have been part of the Statistics Toolbox for about six years. Like a dataset array, a table is a container that holds mixed-type tabular data, the sort of column-oriented data you would often import from a CSV file or a spreadsheet. And like nominal and ordinal arrays, a categorical array represents discrete non-numeric data, the sort of data you might otherwise have used strings or "coded integers" to store.

Generally speaking, these new data types should look and feel very familiar to anyone who has used the ones in the Statistics Toolbox. One obvious difference is that they are included as part of core MATLAB, and you don't need to install the Statistics Toolbox to use them. In addition, their design and terminology make them a bit more accessible for non-statistical uses, though they remain just as useful for statistics.

Tables and categorical arrays are ultimately intended as replacements for dataset, nominal, and ordinal arrays, and we recommend that MATLAB users adopt them for new work. We also recommend that, over time, users update any of their existing code that uses dataset/nominal/ordinal, but we don't expect that changeover to happen immediately. Upcoming releases will provide more details and strategies for making the transition.

In R2013b, all of the Statistics Toolbox functionality that uses nominal and ordinal arrays also supports the new categorical arrays. In R2013b, you'll still need to use dataset arrays in the Statistics Toolbox for things like LinearModel and (new in R2013b) LinearMixedModel, but you might consider creating tables and converting to dataset only when needed, using table2dataset.
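As a rough illustration of that workflow, here is a minimal sketch assuming R2013b or later with the Statistics Toolbox installed. The variable names and values are made up for the example; only table, categorical, table2dataset, and dataset2table come from the discussion above.

```matlab
% Hypothetical example data; names and values are illustrative only.
Name   = {'Smith'; 'Jones'; 'Brown'};
Group  = categorical({'control'; 'treated'; 'control'});
Weight = [71.2; 68.5; 80.1];

% Build a table in base MATLAB (no toolbox needed for this step).
% The table variables take their names from the workspace variables.
T = table(Name, Group, Weight);

% Convert to a dataset array only at the point where a
% Statistics Toolbox function requires one.
ds = table2dataset(T);

% Convert back to a table once the toolbox call is done.
T2 = dataset2table(ds);
```

Keeping the data in a table and converting at the boundary means that most of the code would not need to change if later releases accept tables directly.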


Steve Eddins
on 11 Sep 2013

Julian, I was a bit surprised that you think we were disingenuous, which means lacking in frankness, candor, or sincerity; insincere or calculating. To our way of thinking, it is not necessary for a MATLAB user to understand datasets in the Statistics Toolbox or to know about their existence in order to learn about and successfully use the new table type in MATLAB. In fact, we think it would be mostly a distraction, introducing added complexity into documentation and demonstrations that would not be helpful to most people. That said, it is probably true that some people could have used more information about the connection between dataset and table than we provided. I might suggest "oversight" or "ran out of time" as alternative explanations instead of insincerity.

On your point about large datasets, MATLAB uses reference counting heavily under the hood in order to avoid actual memory copies whenever possible. Changing one variable, or changing the table's metadata, wouldn't normally result in a memory copy of the entire table.

Julian
on 12 Sep 2013

Steve, sure, I agree that not mentioning the existing classes in the main MATLAB doc is clearer for a new MATLAB user (or one without the Statistics Toolbox), and the doc is cleaner that way. But release notes (and videos) speak mainly to existing users rather than new ones, and the new converter functions dataset2table and table2dataset should be mentioned in the release notes for the Statistics Toolbox. TMW modified the head doc page for Dataset Arrays http://www.mathworks.com/help/stats/dataset-arrays.html to reference table2dataset & dataset2table, but there is no remark at all about the relation between dataset and table, or the future implications for Statistics Toolbox users. The head page for Categorical Arrays http://www.mathworks.co.uk/help/stats/categorical-arrays.html fails even to mention its new non-abstract namesake.

I am sure the design of table and categorical leaned heavily on experience with dataset and categorical. TMW didn't forget about datasets or categoricals when you launched their replacements with a big fanfare, but you did forget about their users when you updated the Statistics Toolbox documentation and release notes.

Thank you for your other remark regarding efficiency. BTW It's great to see "datasets" and "categoricals" get a wider audience, I really like them. It will be quite a while before I get to try the new ones (my company just upgraded to R2013a from R2011a). I hope a migration guide will be published by then?

