dataset

Class: dataset

Construct dataset array

The dataset data type might be removed in a future release. To work with heterogeneous data, use the MATLAB® table data type instead. See MATLAB table documentation for more information.

Syntax

A = dataset(varspec,'ParamName',Value)
A = dataset('File',filename,'ParamName',Value)
A = dataset('XLSFile',filename,'ParamName',Value)
A = dataset('XPTFile',xptfilename,'ParamName',Value)

Description

A = dataset(varspec,'ParamName',Value) creates dataset array A using the workspace variable input method varspec and one or more optional name/value pairs (see Parameter Name/Value Pairs).

The input method varspec can be one or more of the following:

  • VAR — a workspace variable. dataset uses the workspace name for the variable name in A. To include multiple variables, specify VAR_1,VAR_2,...,VAR_N. Variables can be arrays of any size, but all variables must have the same number of rows. VAR can also be an expression. In this case, dataset creates a default name automatically.

  • {VAR,name} — a workspace variable, VAR, and a variable name, name. dataset uses name as the variable name. To include multiple variables and names, specify {VAR_1,name_1}, {VAR_2,name_2},..., {VAR_N,name_N}.

  • {VAR,name_1,...,name_m} — an m-columned workspace variable, VAR. dataset uses the names name_1, ..., name_m as variable names. You must include a name for every column in VAR. Each column becomes a separate variable in A.

You can combine these input methods to include as many variables and names as needed. Names must be valid, unique MATLAB identifier strings. For example input combinations, see Examples. For optional name/value pairs, see Parameter Name/Value Pairs.
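As a minimal sketch of the three input forms, the following uses small made-up workspace variables. The names height, scores, and label and their values are hypothetical and are not part of any shipped data set:

```matlab
% Hypothetical sample data: all variables have the same number of rows (3)
height = [1.70; 1.82; 1.65];        % 3-by-1 numeric
scores = [90 85; 70 75; 88 92];     % 3-by-2 numeric
label  = {'a'; 'b'; 'c'};           % 3-by-1 cell array of strings

% VAR form: variable names are taken from the workspace names
A1 = dataset(height, label);

% {VAR,name} form: each variable is given an explicit name
A2 = dataset({height,'Height'}, {label,'Label'});

% {VAR,name_1,...,name_m} form: each column of scores becomes its own variable
A3 = dataset({scores,'Midterm','Final'});

% The forms can be combined in a single call
A4 = dataset(height, {label,'Label'}, {scores,'Midterm','Final'});
```

In this sketch, A4 should contain four variables named height, Label, Midterm, and Final.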

To convert numeric arrays, cell arrays, structure arrays, or tables to dataset arrays, you can also use (respectively):

  • mat2dataset

  • cell2dataset

  • struct2dataset

  • table2dataset

    Note:   Dataset arrays may contain built-in types or array objects as variables. Array objects must implement each of the following:

    • Standard MATLAB parenthesis indexing of the form var(i,...), where i is a numeric or logical vector corresponding to rows of the variable

    • A size method with a dim argument

    • A vertcat method

A = dataset('File',filename,'ParamName',Value) creates dataset array A from column-oriented data in the text file specified by the string filename. Variables in A are of type double if data in the corresponding column of the file, following the column header, are entirely numeric; otherwise the variables in A are cell arrays of strings. dataset converts empty fields to either NaN (for a numeric variable) or the empty string (for a string-valued variable). dataset ignores insignificant white space in the file. You cannot specify both a file and workspace variables as input. See Parameter Name/Value Pairs for more information.
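The column typing and empty-field handling described above can be tried with a small throwaway file. This is only a sketch: the file contents and the column names id, age, and city are invented for illustration.

```matlab
% Write a small comma-delimited file to a temporary location (hypothetical data)
fname = [tempname '.csv'];
fid = fopen(fname,'w');
fprintf(fid,'id,age,city\n');
fprintf(fid,'p1,34,Boston\n');
fprintf(fid,'p2,,Natick\n');        % the age field is left empty on purpose
fclose(fid);

% Read it back; variable names come from the header row by default
A = dataset('File',fname,'Delimiter',',');

% age is entirely numeric, so it is read as double; the empty field becomes NaN.
% id and city contain text, so they are read as cell arrays of strings.
delete(fname);                      % clean up the temporary file
```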

A = dataset('XLSFile',filename,'ParamName',Value) creates dataset array A from column-oriented data in the Excel® spreadsheet specified by the string filename. Variables in A are of type double if data in the corresponding column of the spreadsheet, following the column header, are entirely numeric; otherwise the variables in A are cell arrays of strings. See Parameter Name/Value Pairs for more information.

A = dataset('XPTFile',xptfilename,'ParamName',Value) creates a dataset array from a SAS® XPORT format file. Variable names from the XPORT format file are preserved. Numeric data types in the XPORT format file are preserved, but all other data types are converted to cell arrays of strings. The XPORT format allows for 28 missing data types, which are represented in the file by an uppercase letter, '.', or '_'. dataset converts all missing data to NaN values in A. See Parameter Name/Value Pairs for more information.

Parameter Name/Value Pairs

Specify one or more of the following name/value pairs when constructing a dataset:

'VarNames'

A cell array {name_1,...,name_m} naming the m variables in A with the specified variable names. Names must be valid, unique MATLAB identifier strings. The number of names must equal the number of variables in A. You cannot use the VarNames parameter if you provide names for individual variables using {VAR,name} pairs. To specify VarNames when using a file as input, set ReadVarNames to false.

'ObsNames'

A cell array {name_1,...,name_n} naming the n observations in A with the specified observation names. The names need not be valid MATLAB identifier strings, but must be unique. The number of names must equal the number of observations (rows) in A. To specify ObsNames when using a file as input, set ReadObsNames to false.
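A short sketch of 'VarNames' and 'ObsNames' with workspace input follows. The matrix X and the names a, b, r1, r2, and r3 are hypothetical:

```matlab
X = [1 2; 3 4; 5 6];                % hypothetical 3-by-2 data

% Names supplied per column, plus observation names for the three rows
A = dataset({X,'a','b'}, 'ObsNames', {'r1','r2','r3'});

% Equivalent result using the 'VarNames' parameter instead of {VAR,name} lists.
% X(:,1) and X(:,2) are expressions, so their default names are replaced here.
B = dataset(X(:,1), X(:,2), 'VarNames', {'a','b'}, ...
    'ObsNames', {'r1','r2','r3'});

% Observation names can then be used to select rows
row2 = A('r2',:);
```

Note that the two naming styles are not mixed in one call, in line with the restriction stated for 'VarNames'.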

Name/value pairs available when using text files as inputs:

'Delimiter'

A string indicating the character separating columns in the file. Values are

  • '\t' (tab, the default when no format is specified)

  • ' ' (space, the default when a format is specified)

  • ',' (comma)

  • ';' (semicolon)

  • '|' (bar)

'Format'

A format string, as accepted by textscan. dataset reads the file using textscan, and creates variables in A according to the conversion specifiers in the format string. You may also provide any name/value pairs accepted by textscan. Using the Format parameter is much faster for large files. If ReadObsNames is true, the format string should include a format specifier for the first column of the file.
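To illustrate how 'Format' interacts with 'ReadObsNames', here is a sketch that reads a small temporary file. The file contents and column names are invented; the point is that the format string carries a %s specifier for the first column even though that column ends up as observation names.

```matlab
% Hypothetical three-column file: an identifier followed by two numeric columns
fname = [tempname '.csv'];
fid = fopen(fname,'w');
fprintf(fid,'id,age,weight\n');
fprintf(fid,'p1,34,70.5\n');
fprintf(fid,'p2,51,82.0\n');
fclose(fid);

% One conversion specifier per column in the file, including the first column
A = dataset('File',fname,'Delimiter',',', ...
    'Format','%s%f%f','ReadObsNames',true);

delete(fname);                      % clean up the temporary file
```

Here A should hold two numeric variables, age and weight, with the id column used as observation names.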

'HeaderLines'

Numeric value indicating the number of lines to skip at the beginning of a file.

Default: 0

'TreatAsEmpty'

Specifies strings to treat as the empty string in a numeric column. Values may be a character string or a cell array of strings. The parameter applies only to numeric columns in the file; dataset does not accept numeric literals such as '-99'.
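As a sketch of 'TreatAsEmpty', the following assumes a file in which missing ages are coded as the string NA. The file contents are hypothetical:

```matlab
% Hypothetical file where a missing numeric value is written as NA
fname = [tempname '.csv'];
fid = fopen(fname,'w');
fprintf(fid,'id,age\n');
fprintf(fid,'p1,34\n');
fprintf(fid,'p2,NA\n');
fclose(fid);

% Treat the string NA as an empty field in numeric columns
A = dataset('File',fname,'Delimiter',',','TreatAsEmpty','NA');

% age stays numeric, and the NA entry is read as NaN
delete(fname);                      % clean up the temporary file
```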

Name/value pairs available when using text files or Excel spreadsheets as inputs:

'ReadVarNames'

A logical value indicating whether (true) or not (false) to read variable names from the first row of the file. The default is true. If ReadVarNames is true, variable names in the column headers of the file or range (if using an Excel spreadsheet) cannot be empty.

'ReadObsNames'

A logical value indicating whether (true) or not (false) to read observation names from the first column of the file or range (if using an Excel spreadsheet). The default is false. If ReadObsNames and ReadVarNames are both true, dataset saves the header of the first column in the file or range as the name of the first dimension in A.Properties.DimNames.

When reading from an XPT format file, the ReadObsNames parameter name/value pair determines whether or not to try to use the first variable in the file as observation names. Specify as a logical value (default false). If the contents of the first variable are not valid observation names then dataset reads the variable into a variable of the dataset array and does not set the observation names.
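When a text file has no header row, 'ReadVarNames' can be set to false and the names supplied through 'VarNames'. A minimal sketch, using an invented two-column file and the hypothetical names x and y:

```matlab
% Hypothetical file with data only, no header row
fname = [tempname '.csv'];
fid = fopen(fname,'w');
fprintf(fid,'1,2\n');
fprintf(fid,'3,4\n');
fclose(fid);

% Do not read names from the first row; provide them explicitly instead
A = dataset('File',fname,'Delimiter',',', ...
    'ReadVarNames',false,'VarNames',{'x','y'});

delete(fname);                      % clean up the temporary file
```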

Name/value pairs available when using Excel spreadsheets as input:

'Sheet'

A positive scalar value of type double indicating the sheet number, or a quoted string indicating the sheet name.

'Range'

A string of the form 'C1:C2' where C1 and C2 are the names of cells at opposing corners of a rectangular region to be read, as for xlsread. By default, the rectangular region extends to the right-most column containing data. If the spreadsheet contains empty columns between columns of data, or if the spreadsheet contains figures or other non-tabular information, specify a range that contains only data.

Examples

Create a dataset array from workspace variables, including observation names:

load cereal
cereal = dataset(Calories,Protein,Fat,Sodium,Fiber,Carbo,...
   Sugars,'ObsNames',Name)
cereal.Properties.VarDescription = Variables(4:10,2);
 

Create a dataset array from a single, multi-columned workspace variable, designating variable names for each column:

load cities
categories = cellstr(categories);
cities = dataset({ratings,categories{:}},...
   'ObsNames',cellstr(names))
 

Load data from a text or spreadsheet file:

patients = dataset('File','hospital.dat',...
   'Delimiter',',','ReadObsNames',true)
patients2 = dataset('XLSFile','hospital.xls',...
   'ReadObsNames',true)
 
  1. Load patient data from the CSV file hospital.dat and store the information in a dataset array with observation names given by the first column in the data (patient identification):

    patients = dataset('file','hospital.dat', ...
                 'format','%s%s%s%f%f%f%f%f%f%f%f%f', ...
                 'Delimiter',',','ReadObsNames',true); 
    

    You can also load the data without specifying a format string. dataset will automatically create dataset variables that are either double arrays or cell arrays of strings, depending on the contents of the file:

    patients = dataset('file','hospital.dat',...
                       'delimiter',',',...
                       'ReadObsNames',true);
  2. Make the {0,1}-valued variable smoke nominal, and change the labels to 'No' and 'Yes':

    patients.smoke = nominal(patients.smoke,{'No','Yes'});
    
  3. Add new levels to smoke as placeholders for more detailed histories of smokers:

    patients.smoke = addlevels(patients.smoke,...
                     {'0-5 Years','5-10 Years','LongTerm'});
    
  4. Assuming the nonsmokers have never smoked, relabel the 'No' level:

    patients.smoke = setlabels(patients.smoke,'Never','No');
    
  5. Drop the undifferentiated 'Yes' level from smoke:

    patients.smoke = droplevels(patients.smoke,'Yes');
    
    Warning: OLDLEVELS contains categorical levels that 
    were present in A, caused some array elements to have 
    undefined levels.

    Note that smokers now have an undefined level.

  6. Set each smoker to one of the new levels, by observation name:

    patients.smoke('YPL-320') = '5-10 Years';
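The level-editing steps above can also be tried without hospital.dat. The sketch below applies the same nominal functions to a small made-up vector. The data is hypothetical, and isundefined, which the steps above do not use, is assumed here as a way to find the elements left undefined by droplevels.

```matlab
% Hypothetical {0,1}-valued smoking indicator for four patients
smoke = nominal([0; 1; 0; 1], {'No','Yes'});

% Add placeholder levels, relabel 'No', then drop the generic 'Yes' level
smoke = addlevels(smoke, {'0-5 Years','5-10 Years','LongTerm'});
smoke = setlabels(smoke, 'Never', 'No');
smoke = droplevels(smoke, 'Yes');   % issues a warning; former 'Yes' elements become undefined

% Locate the undefined elements and assign them one of the new levels
undef = isundefined(smoke);
smoke(undef) = '5-10 Years';
```

In a real analysis each smoker would be assigned the level matching that patient's history rather than a single value for all of them.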
