dataset

(Not Recommended) Arrays for statistical data

The dataset data type is not recommended. To work with heterogeneous data, use the MATLAB® table data type instead. See MATLAB table documentation for more information.

Description

Dataset arrays are used to collect heterogeneous data and metadata including variable and observation names into a single container variable. Dataset arrays are suitable for storing column-oriented or tabular data that are often stored as columns in a text file or in a spreadsheet, and can accommodate variables of different types, sizes, units, and so on.

Dataset arrays can contain different kinds of variables, including numeric, logical, character, string, categorical, and cell. However, a dataset array is a different class than the variables that it contains. For example, even a dataset array that contains only variables that are double arrays cannot be operated on as if it were itself a double array. Using dot subscripting, though, you can operate on a variable in a dataset array as if it were a workspace variable.

You can subscript dataset arrays using parentheses much like ordinary numeric arrays, but in addition to numeric and logical indices, you can use variable and observation names as indices.
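
The subscripting styles described above can be sketched as follows. The variables Age and Smoker and the observation names p1 to p3 are hypothetical values made up for illustration.

```matlab
% Hypothetical workspace variables (not part of any shipped data set)
Age = [38; 43; 29];
Smoker = [true; false; true];
D = dataset(Age,Smoker,ObsNames={'p1','p2','p3'});

row2 = D(2,:);                 % numeric row index, all variables
sub  = D({'p1','p3'},'Age');   % observation names and a variable name
old  = D(D.Age > 30,:);        % logical row index

% Dot subscripting returns the underlying double array, so ordinary
% numeric operations apply to it.
D.Age = D.Age + 1;
meanAge = mean(D.Age);
```

Parenthesis subscripting returns another dataset array, while dot subscripting returns the variable itself.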

Creation

Syntax

A = dataset(varspec,Name=Value)

A = dataset(File=filename,Name=Value)

A = dataset(XLSFile=filename,Name=Value)

A = dataset(XPTFile=filename,Name=Value)

Description

A = dataset(varspec,Name=Value) creates dataset array A using the workspace variable input method varspec and one or more name-value arguments.

A = dataset(File=filename,Name=Value) creates dataset array A from column-oriented data in the text file specified by filename. Variables in A are of type double if data in the corresponding column of the file, following the column header, are entirely numeric; otherwise the variables in A are cell arrays of character vectors. dataset converts empty fields to either NaN (for a numeric variable) or the empty character vector (for a character-valued variable). dataset ignores insignificant white space in the file. You cannot specify both a file and workspace variables as input.

A = dataset(XLSFile=filename,Name=Value) creates dataset array A from column-oriented data in the Excel® spreadsheet specified by filename. Variables in A are of type double if data in the corresponding column of the spreadsheet, following the column header, are entirely numeric; otherwise the variables in A are cell arrays of character vectors.

A = dataset(XPTFile=filename,Name=Value) creates a dataset array from a SAS® XPORT format file. Variable names from the XPORT format file are preserved. Numeric data types in the XPORT format file are preserved, but all other data types are converted to cell arrays of character vectors. The XPORT format allows for 28 missing data types, which are represented in the file by an uppercase letter, '.', or '_'. dataset converts all missing data to NaN values in A.

Input Arguments

`varspec` — Workspace variable input method
variable | cell array

Workspace variable input method, specified as one or more of the following values:

- Workspace variable var. The dataset function uses the workspace name for the variable name in A. To include multiple variables, specify var_1,var_2,...,var_N. Variables can be arrays of any size, but all variables must have the same number of rows. var can also be an expression. In this case, dataset creates a default name automatically.
- Cell array containing a workspace variable, var, and a variable name, name, such as {var,name}. dataset uses name as the variable name. To include multiple variables and names, specify {var_1,name_1},{var_2,name_2},...,{var_N,name_N}.
- Cell array containing an m-columned workspace variable, var, and m variable names, such as {var,name_1,...,name_m}. dataset uses the names name_1,...,name_m as variable names. Include a name for every column in var. Each column becomes a separate variable in A.

You can combine these input methods to include as many variables and names as needed. Names must be valid, unique MATLAB identifiers.
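
As a minimal sketch of combining the three input methods in one call, consider the following. The variables x and M and the names xSquared, left, and right are hypothetical and chosen only for illustration.

```matlab
% Hypothetical inputs: a column vector and a two-column matrix
x = [1; 2; 3];
M = [10 20; 30 40; 50 60];

% x            -> named after the workspace variable
% {x.^2,...}   -> expression with an explicit name
% {M,...}      -> each column of M becomes its own variable
A = dataset(x,{x.^2,'xSquared'},{M,'left','right'});

names = A.Properties.VarNames;
```

All inputs have three rows, which satisfies the requirement that every variable have the same number of rows.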

`filename` — Name of text file, Excel spreadsheet, or SAS XPORT format file
string | character vector

Name of text file, Excel spreadsheet, or SAS XPORT format file, specified as a string or a character vector.

Data Types: string | char

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: patients = dataset(File="hospital.dat",Delimiter=',',ReadObsNames=true)

`VarNames` — Names of `m` variables in the resulting dataset
string array | cell array

Names of m variables in the resulting dataset, specified as a string array or cell array. Names must be valid, unique MATLAB identifiers. The number of names must equal the number of variables in A. Do not use VarNames if you provide names for individual variables using {var,name} pairs. To specify VarNames when using a file as an input, set ReadVarNames to false.

Example: {name_1,...,name_m}

Data Types: string | cell

`ObsNames` — Names of `n` observations in the resulting dataset
string array | cell array

Names of n observations in the resulting dataset, specified as a string array or cell array. The names do not need to be valid MATLAB identifiers, but they must be unique. The number of names must equal the number of observations (rows) in A. To specify ObsNames when using a file as input, set ReadObsNames to false.

Example: {name_1,...,name_n}

Data Types: string | cell

Text Files Only

`Delimiter` — Character separating columns in the file
string scalar | character vector

Character separating columns in the file, specified as a string scalar or character vector. Available characters are:

'\t' (tab, the default when no format is specified)
' ' (space, the default when a format is specified)
',' (comma)
';' (semicolon)
'|' (bar)

Data Types: string | char

`Format` — Format parameter for `textscan`
string scalar | character vector

Format parameter for textscan, specified as a string scalar or character vector. dataset reads the file using textscan, and creates variables in A according to the conversion specifiers in the format parameter. You can also provide any name-value arguments accepted by textscan. Specifying Format is much faster for large files. If ReadObsNames is true, then Format must include a format specifier for the first column of the file.

Data Types: string | char

`HeaderLines` — Number of lines to skip at the beginning of a file
0 (default) | nonnegative integer

Number of lines to skip at the beginning of a file, specified as a nonnegative integer.

Data Types: double

`TreatAsEmpty` — Characters to treat as the empty character vector in a numeric column
string array | character array | cell array of character vectors

Characters to treat as the empty character vector in a numeric column, specified as a string array, character array, or cell array of character vectors. The parameter applies only to numeric columns in the file; dataset does not accept numeric literals, such as '-99'.

Data Types: string | char | cell

Text Files or Excel Spreadsheets

`ReadVarNames` — Indicator for reading variable names from the first row of the file
`true` (default) | `false`

Indicator for reading variable names from the first row of the file, specified as true or false. If ReadVarNames is true, dataset reads the variable names from the first row of the file; otherwise, it treats the first row as data. If ReadVarNames is true, variable names in the column headers of the file or range (if using an Excel spreadsheet) must not be empty.

Data Types: logical

`ReadObsNames` — Indicator for reading observation names from the first column of the file
`false` (default) | `true`

Indicator for reading observation names from the first column of the file, specified as false or true.

If ReadObsNames and ReadVarNames are both true, dataset saves the header of the first column in the file or range as the name of the first dimension in A.Properties.DimNames.

When reading from an XPT format file, ReadObsNames determines whether dataset tries to use the first variable in the file as observation names. Specify ReadObsNames as a logical value (default false). If the contents of the first variable are not valid observation names, then dataset reads the variable into a variable of the dataset array and does not set the observation names.

Data Types: logical
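
The text-file arguments above can be tried on a small temporary file. This is a sketch under stated assumptions: the file contents, the column names id, weight, and group, and the 'NA' missing-value marker are all hypothetical.

```matlab
% Write a small hypothetical comma-separated file to a temporary location
fname = [tempname '.csv'];
fid = fopen(fname,'w');
fprintf(fid,'id,weight,group\n');
fprintf(fid,'a1,70.5,ctl\n');
fprintf(fid,'a2,NA,trt\n');
fclose(fid);

% First row supplies variable names (ReadVarNames defaults to true),
% first column supplies observation names, and 'NA' in a numeric
% column is treated as an empty field.
T = dataset(File=fname,Delimiter=',', ...
            ReadObsNames=true,TreatAsEmpty='NA');

w = T.weight;      % numeric column; the 'NA' field is read as NaN
g = T.group;       % nonnumeric column; cell array of character vectors
delete(fname);     % clean up the temporary file
```

Because both ReadObsNames and ReadVarNames are true here, the header of the first column (id) is stored in T.Properties.DimNames rather than as a variable name.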

Excel Spreadsheets Only

`Sheet` — Sheet number or a quoted sheet name
positive number | character vector | string scalar

Sheet number or a quoted sheet name, specified as a positive number, character vector, or string scalar.

Data Types: double | char | string

`Range` — Range of cells to read
character vector | string scalar

Range of cells to read, specified as a character vector or string scalar of the form 'C1:C2' where C1 and C2 are the names of cells at opposing corners of a rectangular region to be read, as for xlsread. By default, the rectangular region extends to the right-most column containing data. If the spreadsheet contains empty columns between columns of data, or if the spreadsheet contains figures or other non-tabular information, specify a range that contains only data.

Data Types: char | string

Properties

A dataset array D has properties that store metadata (information about your data). Access or assign to a property using P = D.Properties.PropName or D.Properties.PropName = P, where PropName is one of the following:

`Description` — Description of the dataset array
`0×0 empty char array` (default) | character vector

Description of the dataset array, stored as a character vector.

Data Types: char

`DimNames` — Names of the two dimensions of the dataset array
`{'Observations' 'Variables'}` (default) | two-element cell array of character vectors

Names of the two dimensions of the dataset array, stored as a two-element cell array of character vectors.

Data Types: cell

`ObsNames` — Names of the observations in the dataset array
cell array of nonempty, distinct character vectors

Names of the observations in the dataset array, stored as a cell array of nonempty, distinct character vectors. This property can be empty. If it is not empty, then the number of character vectors must equal the number of observations.

Data Types: cell

`Units` — Units of the variables in the dataset array
`0×0 empty cell array` (default) | cell array of character vectors

Units of the variables in the dataset array, stored as a cell array of character vectors. This property can be empty. If it is not empty, then the number of character vectors must equal the number of variables. Any individual character vector can be empty for a variable that does not have units defined.

Data Types: cell

`UserData` — Any variable containing additional information to be associated with the dataset array
`[]` (default) | array

Any variable containing additional information to be associated with the dataset array, stored as an array.

Data Types: double

`VarDescription` — Descriptions of the variables in the dataset array
`0×0 empty cell array` (default) | cell array of character vectors

Descriptions of the variables in the dataset array, stored as a cell array of character vectors. This property can be empty. If it is not empty, then the number of character vectors must equal the number of variables. Any individual character vector can be empty for a variable that does not have a description defined.

Data Types: cell

`VarNames` — Names of the variables in the dataset array
cell array of names for the variables used to create the data set (default) | cell array of nonempty, distinct character vectors

Names of the variables in the dataset array, stored as a cell array of nonempty, distinct character vectors. The number of character vectors must equal the number of variables. The default is the cell array of names for the variables used to create the data set.

Data Types: cell
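
A short sketch of reading and assigning the properties listed above follows. The variables Height and Weight and the metadata values are hypothetical.

```matlab
% Hypothetical measurements
Height = [1.70; 1.82];
Weight = [65; 80];
D = dataset(Height,Weight);

% Assign metadata through D.Properties.PropName
D.Properties.Description = 'Hypothetical body measurements';
D.Properties.Units = {'m','kg'};
D.Properties.VarNames{2} = 'Mass';   % rename the second variable

% Read metadata back
units = D.Properties.Units;
dims  = D.Properties.DimNames;       % default dimension names
```

After the rename, the second variable is accessed as D.Mass instead of D.Weight.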

Object Functions

`cat`	(Not Recommended) Concatenate dataset arrays
`cellstr`	(Not Recommended) Create cell array of character vectors from dataset array
`dataset2cell`	(Not Recommended) Convert dataset array to cell array
`dataset2struct`	(Not Recommended) Convert dataset array to structure
`disp`	(Not Recommended) Display dataset array
`double`	(Not Recommended) Convert dataset variables to double array
`end`	(Not Recommended) Last index in indexing expression for dataset array
`export`	(Not Recommended) Write dataset array to file
`get`	(Not Recommended) Access dataset array properties
`horzcat`	(Not Recommended) Horizontal concatenation for dataset arrays
`intersect`	(Not Recommended) Set intersection for dataset array observations
`isempty`	(Not Recommended) True for empty dataset array
`ismember`	(Not Recommended) Dataset array elements that are members of set
`ismissing`	(Not Recommended) Find dataset array elements with missing values
`join`	(Not Recommended) Merge dataset array observations
`length`	(Not Recommended) Length of dataset array
`ndims`	(Not Recommended) Number of dimensions of dataset array
`numel`	(Not Recommended) Number of elements in dataset array
`replaceWithMissing`	(Not Recommended) Insert missing data indicators into a dataset array
`replacedata`	(Not Recommended) Replace dataset variables
`set`	(Not Recommended) Set and display dataset array properties
`setdiff`	(Not Recommended) Set difference for dataset array observations
`setxor`	(Not Recommended) Set exclusive or for dataset array observations
`single`	(Not Recommended) Convert dataset variables to single array
`size`	(Not Recommended) Size of dataset array
`sortrows`	(Not Recommended) Sort rows of dataset array
`stack`	(Not Recommended) Stack dataset array from multiple variables into single variable
`subsasgn`	(Not Recommended) Subscripted assignment to dataset array
`subsref`	(Not Recommended) Subscripted reference for dataset array
`summary`	(Not Recommended) Print summary of dataset array
`union`	(Not Recommended) Set union for dataset array observations
`unique`	(Not Recommended) Unique observations in dataset array
`unstack`	(Not Recommended) Unstack dataset array from single variable into multiple variables
`vertcat`	(Not Recommended) Vertical concatenation for dataset arrays

Examples

Create `dataset` Arrays from Workspace Variables

Create a dataset array from workspace variables, including observation names.

load cereal
cereal = dataset(Calories,Protein,Fat,Sodium, ...
                 Fiber,Carbo,Sugars,ObsNames=Name);
cereal.Properties.VarDescription = Variables(4:10,2);

Create a dataset array from a single multi-columned workspace variable. Designate variable names for each column.

load cities
categories = cellstr(categories);
cities = dataset({ratings,categories{:}},...
                ObsNames=cellstr(names));

Create `dataset` Arrays from Text Files

Load patient data from the CSV file hospital.dat and store the information in a dataset array with observation names given by the first column in the data (patient identification).

patients = dataset(File="hospital.dat", ...
                   Format="%s%s%s%f%f%f%f%f%f%f%f%f", ...
                   Delimiter=',', ...
                   ReadObsNames=true);

You can also load the data without specifying a format. dataset automatically creates dataset variables that are either double arrays or cell arrays of character vectors, depending on the contents of the file.

patients = dataset(File="hospital.dat", ...
                   Delimiter=',', ...
                   ReadObsNames=true);

Make the {0,1}-valued variable smoke nominal, and change the labels to 'No' and 'Yes'.

patients.smoke = nominal(patients.smoke,{'No','Yes'});

Add new levels to smoke as placeholders for more detailed histories of smokers.

patients.smoke = addlevels(patients.smoke,...
                 {'0-5 Years','5-10 Years','LongTerm'});

Assuming the nonsmokers have never smoked, relabel the 'No' level.

patients.smoke = setlabels(patients.smoke,'Never','No');

Drop the undifferentiated 'Yes' level from smoke. Note that smokers now have an undefined level.

patients.smoke = droplevels(patients.smoke,'Yes');

Set a smoker to one of the new levels, by observation name.

patients.smoke('YPL-320') = '5-10 Years';

Create `dataset` Arrays from Excel Spreadsheets

Load patient data from a spreadsheet file.

patients = dataset(XLSFile="hospital.xls",ReadObsNames=true);

Create Simple Subsets from `dataset` Array

Load a dataset array from a .mat file and create some simple subsets: the first ten observations, and a selection of variables by name.

load hospital
h1 = hospital(1:10,:);
h2 = hospital(:,{'LastName' 'Age' 'Sex' 'Smoker'});

Access and modify metadata.

hospital.Properties.Description
hospital.Properties.VarNames{4} = 'Wgt';

Create a new dataset variable from an existing one.

hospital.AtRisk = hospital.Smoker | (hospital.Age > 40);

Use individual variables to explore the data.

boxplot(hospital.Age,hospital.Sex)

[Figure: box plot of hospital.Age grouped by hospital.Sex]

h3 = hospital(hospital.Age<30,...
   {'LastName' 'Age' 'Sex' 'Smoker'});

Sort the observations based on two variables.

h4 = sortrows(hospital,{'Sex','Age'});

Tips

To convert numeric arrays, cell arrays, structure arrays, or tables to dataset arrays, you can also use the functions mat2dataset, cell2dataset, struct2dataset, and table2dataset, respectively.

Dataset arrays can contain built-in types or array objects as variables. Array objects must implement each of these:
- Standard MATLAB parenthesis indexing of the form var(i,...), where i is a numeric or logical vector corresponding to rows of the variable
- A size method with a dim argument
- A vertcat method
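
Because table is the recommended replacement for dataset, a conversion round trip may be useful. The sketch below assumes the Statistics and Machine Learning Toolbox conversion functions mat2dataset, dataset2table, and table2dataset are available; the matrix X and the names a and b are hypothetical.

```matlab
% Hypothetical numeric matrix; each column becomes one variable
X = [1 2; 3 4; 5 6];
ds = mat2dataset(X,'VarNames',{'a','b'});

% Convert to the recommended table type, then back again
t   = dataset2table(ds);
ds2 = table2dataset(t);
```

The variable names carry over in both directions, so t.a and ds2.a hold the same column as ds.a.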

Version History

Introduced in R2006b
