tabularTextDatastore

Datastore for tabular text files

Description

Use a TabularTextDatastore object to manage large collections of text files containing column-oriented or tabular data where the collection does not necessarily fit in memory. Tabular data is data that is arranged in a rectangular fashion with each row having the same number of entries. You can create a TabularTextDatastore object using the tabularTextDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Syntax

ttds = tabularTextDatastore(location)
ttds = tabularTextDatastore(location,Name,Value)

Description

ttds = tabularTextDatastore(location) creates a datastore from the collection of data specified by location.

ttds = tabularTextDatastore(location,Name,Value) specifies additional parameters and properties for ttds using one or more name-value pair arguments. For example, tabularTextDatastore(location,'FileExtensions',{'.txt','.csv'}) creates a datastore from only the files in location with extensions .txt and .csv.
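
The 'FileExtensions' filter described above can be sketched as follows. This is a minimal illustration, not part of the original page: the scratch folder, the file names a.csv, b.txt, and c.dat, and the table contents are all made up for the example.

```matlab
% Build a scratch folder (hypothetical layout) holding three small files.
folder = tempname;
mkdir(folder);
T = table([1;2;3],[10;20;30],'VariableNames',{'Id','Score'});
writetable(T,fullfile(folder,'a.csv'));
writetable(T,fullfile(folder,'b.txt'));
writetable(T,fullfile(folder,'c.dat'));

% Keep only the .txt and .csv files; c.dat is left out of the datastore.
ttds = tabularTextDatastore(folder,'FileExtensions',{'.txt','.csv'});
numFiles = numel(ttds.Files);   % number of files the datastore resolved
```

Inspecting ttds.Files afterward is a quick way to confirm which files the filter kept.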

Input Arguments

Files or folders included in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. If the files are not in the current folder, then location must be full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore.

You can use the wildcard character (*) when specifying location. This character indicates that all matching files or all files in the matching folders are included in the datastore.

If the files are not available locally, then the full path of the files or folders must be an internationalized resource identifier (IRI) of the form hdfs:///path_to_file.

For information on using datastore with Amazon S3™, Windows Azure® Blob Storage, and HDFS™, see Read Remote Data.

When location represents a folder, the datastore includes only supported text file formats and ignores any other format. Supported file formats have the extension .txt, .csv, .dat, .dlm, .asc, .text, or no extension.

Example: 'file1.csv'

Example: '../dir/data/file1'

Example: {'C:\dir\data\file1.csv','C:\dir\data\file2.dat'}

Example: 'C:\dir\data\*.text'

Data Types: char | cell | string
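
As a small sketch of the wildcard form of location, the following selects only the .csv files in a folder. The folder and the file names part1.csv, part2.csv, and notes.txt are hypothetical and exist only for this illustration.

```matlab
% Scratch folder with two .csv files and one .txt file (made-up names).
folder = tempname;
mkdir(folder);
T = table((1:4)',{'a';'b';'c';'d'},'VariableNames',{'Id','Label'});
writetable(T,fullfile(folder,'part1.csv'));
writetable(T,fullfile(folder,'part2.csv'));
writetable(T,fullfile(folder,'notes.txt'));

% The wildcard matches every .csv file in the folder and nothing else.
ttdsWild = tabularTextDatastore(fullfile(folder,'*.csv'));

% Strip folders and extensions to list the matched base names.
[~,names] = cellfun(@fileparts,ttdsWild.Files,'UniformOutput',false);
```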

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: ttds = tabularTextDatastore('C:\dir\textdata','FileExtensions',{'.csv','.txt'})

Subfolder inclusion flag, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

When you do not specify 'IncludeSubfolders', then the default value is false.

Example: 'IncludeSubfolders',true

Data Types: logical | double
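
The effect of 'IncludeSubfolders' can be seen by comparing two datastores built on the same folder. The folder tree below (one file at the top level, one in a subfolder named nested) is an assumption made for the sketch.

```matlab
% Scratch tree: top/root.csv and top/nested/child.csv (made-up names).
top = tempname;
mkdir(top);
mkdir(fullfile(top,'nested'));
T = table([1;2],[3;4],'VariableNames',{'X','Y'});
writetable(T,fullfile(top,'root.csv'));
writetable(T,fullfile(top,'nested','child.csv'));

% Default behavior: only files directly inside the folder are included.
flat = tabularTextDatastore(top);

% With the flag set, files in subfolders are included as well.
deep = tabularTextDatastore(top,'IncludeSubfolders',true);

counts = [numel(flat.Files) numel(deep.Files)];
```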

Text file extensions, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array. The extensions that you specify do not have to belong to a supported format. To include files with unsupported extensions, list every extension that you want in the datastore. Use empty quotes '' to represent files without extensions.

Example: 'FileExtensions','.txt'

Example: 'FileExtensions',{'.text','.csv'}

Data Types: char | cell | string

Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine, but need to access and process the data on another machine (possibly running a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines that use a different platform, you must use 'AlternateFileSystemRoots' to associate the root paths.

  • To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify 'AlternateFileSystemRoots' as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell array of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of 'AlternateFileSystemRoots' must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Output data type of text variables, specified as the comma-separated pair consisting of 'TextType' and either 'char' or 'string'. If the output table from the read, readall, or preview functions contains text variables, then 'TextType' specifies the data type of those variables for TabularTextDatastore. If 'TextType' is 'char', then the output is a cell array of character vectors. If 'TextType' is 'string', then the output has type string.

Data Types: char | string

Type for imported date and time data, specified as the comma-separated pair consisting of 'DatetimeType' and one of these values: 'datetime' or 'text'.

Value         Type for Imported Date and Time Data

'datetime'    MATLAB datetime data type. For more information, see datetime.

'text'        If 'DatetimeType' is specified as 'text', then the type for imported date and time data depends on the value specified in the 'TextType' property:

  • If 'TextType' is 'char', then the TabularTextDatastore object imports dates as a cell array of character vectors.

  • If 'TextType' is 'string', then the TabularTextDatastore object imports dates as an array of strings.

If the specified TextscanFormats property contains a %D, then the TabularTextDatastore object ignores the value specified in DatetimeType.

Example: 'DatetimeType','datetime'

Data Types: char | string
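
The interaction between 'DatetimeType' and 'TextType' can be sketched as follows. The file contents and the variable names When and Reading are invented for the illustration, and the sketch assumes that dates written as yyyy-MM-dd are detected as date and time data.

```matlab
% Write a small file whose first column holds dates (made-up data).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'When,Reading\n2020-01-15,1.5\n2020-02-20,2.5\n');
fclose(fid);

% Import the date column as MATLAB datetime values.
asDatetime = tabularTextDatastore(file,'DatetimeType','datetime');
T1 = readall(asDatetime);

% Import the same column as text; 'TextType' then picks the text class.
asText = tabularTextDatastore(file,'DatetimeType','text','TextType','string');
T2 = readall(asText);

% Compare the classes of the date column under the two settings.
classes = {class(T1.When),class(T2.When)};
```

With these settings, the second entry of classes should be 'string'. Switching 'TextType' to 'char' should give a cell array of character vectors instead.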

Output data type of duration data, specified as the comma-separated pair consisting of 'DurationType' and either 'duration' or 'text'.

Value         Type for Imported Duration Data

'duration'    MATLAB duration data type. For more information, see duration.

'text'        If 'DurationType' is specified as 'text', then the type for imported duration data depends on the value specified in the 'TextType' parameter:

  • If 'TextType' is 'char', then the importing function returns duration data as a cell array of character vectors.

  • If 'TextType' is 'string', then the importing function returns duration data as an array of strings.

Data Types: char | string

In addition to these name-value pairs, you also can specify the properties on this page as name-value pairs, with the exception of the Files property.

Properties

TabularTextDatastore properties describe the files associated with a TabularTextDatastore object. Specifically, the properties describe the format of the data in the files and control how the data should be read from the datastore. When you create a TabularTextDatastore object, the datastore function uses the first file in the Files property to determine the values of the properties. With the exception of the Files property, you can specify the value of TabularTextDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation:

ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.MissingValue = 0;

File Properties

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument in the tabularTextDatastore and datastore functions defines these files.

The first file specified by the Files property determines the variable names and format information for all files in the datastore.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: {'C:\dir\data\mydata1.csv';'C:\dir\data\mydata2.csv'}

Data Types: cell | string

File encoding, specified as a character vector or a string scalar like one of these values.

'IBM866'       'ISO-8859-1'     'windows-874'
'KOI8-R'       'ISO-8859-2'     'windows-1250'
'KOI8-U'       'ISO-8859-3'     'windows-1251'
'Macintosh'    'ISO-8859-4'     'windows-1252'
'US-ASCII'     'ISO-8859-5'     'windows-1253'
'UTF-8'        'ISO-8859-6'     'windows-1254'
               'ISO-8859-7'     'windows-1255'
               'ISO-8859-8'     'windows-1256'
               'ISO-8859-9'     'windows-1257'
               'ISO-8859-11'    'windows-1258'
               'ISO-8859-13'
               'ISO-8859-15'

If each file in the datastore fits into memory, then FileEncoding also can be one of these values.

'Big5'          'EUC-KR'    'GB18030'    'Shift_JIS'
'Big5-HKSCS'    'EUC-JP'    'GB2312'     'windows-949'
'CP949'         'EUC-TW'    'GBK'

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: char | string

Read variable names, specified as a logical true or false.

  • If unspecified, the tabularTextDatastore function detects the presence of variable names automatically.

  • If true, then the first nonheader row of the first file determines the variable names for the data.

  • If false, then the first nonheader row of the first file contains the first row of data. The data is assigned default variable names, Var1, Var2, and so on.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: logical

Names of variables in the datastore, specified as a cell array of character vectors or a string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, they are detected from the first nonheader line in the first file of the datastore. When modifying the VariableNames property, the number of new variable names must match the number of original variable names.

If ReadVariableNames is false, then VariableNames defaults to {'Var1','Var2', ...}.
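
A minimal sketch of the ReadVariableNames and VariableNames behavior described above, using a made-up file that has no header row:

```matlab
% Write a file that starts directly with data (no variable names).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'1,2,3\n4,5,6\n');
fclose(fid);

% Treat the first row as data; the variables get default names.
ds = tabularTextDatastore(file,'ReadVariableNames',false);
defaultNames = ds.VariableNames;

% Alternatively, supply names at creation, in file order.
named = tabularTextDatastore(file,'ReadVariableNames',false, ...
    'VariableNames',{'Time','Name','Quantity'});
T = readall(named);
```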

Example: {'Time','Name','Quantity'}

Data Types: cell | string

Text Format Properties

Number of lines to skip at the beginning of the file (the NumHeaderLines property), specified as a nonnegative integer. If unspecified, the tabularTextDatastore function detects the number of lines to skip automatically.

The tabularTextDatastore function ignores the specified number of header lines before reading the variable names or data.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: double
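
As an illustration of skipping header lines, the sketch below writes a file with two free-text lines before the variable names. The file contents are invented, and the delimiter is given explicitly so that the example does not depend on automatic detection.

```matlab
% Two header lines, then variable names, then data (made-up content).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'Exported by a hypothetical logger\nrun 42\nA,B\n1,2\n3,4\n');
fclose(fid);

% Skip the two header lines before reading variable names and data.
ds = tabularTextDatastore(file,'NumHeaderLines',2,'Delimiter',',');
T = readall(ds);
```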

Field delimiter characters (the Delimiter property), specified as a character vector, cell array of character vectors, string scalar, or string array. Specify multiple delimiters in a cell array of character vectors or a string array. If unspecified, the tabularTextDatastore function detects the delimiter automatically.

Example: '|'

Example: {';','*'}

Repeated delimiter characters in a file are interpreted as separate delimiters with empty fields between them, unless MultipleDelimitersAsOne is true.

When you specify one of the following escape sequences as a delimiter, it is converted to the corresponding control character.

\b    Backspace
\n    Newline
\r    Carriage return
\t    Tab
\\    Backslash (\)

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: char | cell | string
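
A short sketch of an explicit delimiter, using a pipe-delimited file. The city names and numbers are invented for the illustration.

```matlab
% Write a pipe-delimited file (made-up content).
file = [tempname '.txt'];
fid = fopen(file,'w');
fprintf(fid,'City|Pop\nOslo|700\nRome|2800\n');
fclose(fid);

% State the delimiter explicitly instead of relying on detection.
ds = tabularTextDatastore(file,'Delimiter','|');
T = readall(ds);
```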

Row delimiter character (the RowDelimiter property), specified as a character vector or string scalar that must be either a single character or one of '\r', '\n', or '\r\n'.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: ':'

Data Types: char | string

Placeholder text to treat as missing values in numeric fields, specified as a single character vector, cell array of character vectors, string scalar, or string array. Values that match TreatAsMissing are replaced with the value defined in the MissingValue property. For instance, if MissingValue is NaN and TreatAsMissing is 'NA', then all occurrences of 'NA' in the imported data are replaced by NaN.

This option only applies to numeric fields. Also, this property is equivalent to the TreatAsEmpty name-value pair argument for the textscan function.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: 'NA'

Example: '-99'

Example: {'-',''}

Data Types: char | cell | string

Value for missing numeric fields in delimited text files (the MissingValue property), specified as a scalar. This property is equivalent to the EmptyValue name-value pair argument for the textscan function.

Data Types: double
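
The way TreatAsMissing and MissingValue work together can be sketched as follows. The sensor file is made up for the example; the placeholder 'NA' sits in the second data row of a numeric column.

```matlab
% Numeric column with an 'NA' placeholder in one row (made-up data).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'Sensor,Reading\ns1,4.5\ns2,NA\ns3,7.0\n');
fclose(fid);

% 'NA' in numeric fields is replaced by MissingValue (NaN unless changed).
ds = tabularTextDatastore(file,'TreatAsMissing','NA');
withNaN = readall(ds);

% Change the substitute value with dot notation, then read again.
ds.MissingValue = 0;
withZero = readall(ds);
```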

Advanced Text Format Properties

Data field format, specified as a cell array of character vectors or a string array, where each character vector or string contains one conversion specifier.

When you specify or modify the TextscanFormats property, you can use the same conversion specifiers that the textscan function accepts for the formatSpec argument. Valid values for TextscanFormats include conversion specifiers that skip fields using an asterisk (*) character and ones that skip literal text. The number of conversion specifiers must match the number of variables in the VariableNames property.

  • If the value of TextscanFormats includes conversion specifiers that skip fields using asterisk characters (*), then the value of the SelectedVariableNames property automatically updates. MATLAB uses the %*q conversion specifier to skip fields omitted by the SelectedVariableNames property and treats the field contents as literal character vectors. For fixed-width files, indicate a skipped field using the appropriate conversion specifier along with the field width. For example, %*52c skips a field that contains 52 characters.

  • If you do not specify a value for TextscanFormats, then datastore determines the format of the data fields by scanning text from the first nonheader line in the first file of the datastore.

Example: {'%s','%s','%f'}

Data Types: cell | string
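
As a sketch of skipping a field through TextscanFormats, the following marks the middle variable with %*q. The three-column file is invented for the illustration. Following the description above, SelectedVariableNames is expected to update so that it no longer lists the skipped variable.

```matlab
% Three variables; the middle one holds text (made-up content).
file = [tempname '.csv'];
fid = fopen(file,'w');
fprintf(fid,'A,B,C\n1,x,2.5\n3,y,4.5\n');
fclose(fid);

ds = tabularTextDatastore(file);

% One specifier per variable; the asterisk marks B as skipped.
ds.TextscanFormats = {'%f','%*q','%f'};

selected = ds.SelectedVariableNames;   % variables that will be read
T = readall(ds);
```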

Exponent characters (the ExponentCharacters property), specified as a character vector or string scalar. The default exponent characters are e, E, d, and D.

Data Types: char | string

Style of comments in the file, specified as a character vector, cell array of character vectors, string scalar, or string array.

For example, specify a single marker such as '%' to ignore the characters that follow it on the same line. Specify a pair of markers such as {'/*','*/'} to ignore the characters between them.

When reading from a TabularTextDatastore, the read function checks for comments only at the start of each field, not within a field.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: 'CommentStyle',{'/*', '*/'}

Data Types: char | cell | string

White-space characters (the Whitespace property), specified as a character vector or a string scalar of one or more characters.

When you specify one of the following escape sequences as any white-space character, the datastore function converts that sequence to the corresponding control character.

\b    Backspace
\n    Newline
\r    Carriage return
\t    Tab
\\    Backslash (\)

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: ' \b\t'

Data Types: char | string

Multiple delimiter handling (the MultipleDelimitersAsOne property), specified as either true or false. If true, then datastore treats consecutive delimiters as a single delimiter. Repeated delimiters separated by white-space are also treated as a single delimiter.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Properties That Control Table Returned by preview, read, readall

Variables to read from the file (the SelectedVariableNames property), specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

Example: {'Var3','Var7','Var4'}

Data Types: cell | string

Formats of the selected variables to read, specified as a cell array of character vectors or a string array, where each character vector or string contains one conversion specifier. The variables to read are indicated by the SelectedVariableNames property. The number of character vectors or strings in SelectedFormats must match the number of variables to read.

You can use the same conversion specifiers that the textscan function accepts, including specifiers that skip literal text. However, you cannot use a conversion specifier that skips a field. That is, the conversion specifier cannot include an asterisk character (*).

Example: {'%d','%d'}

Data Types: cell | string

Amount of data to read in a call to the read function, specified as a positive scalar or 'file'.

  • If ReadSize is a positive integer, then each call to read reads at most ReadSize rows.

  • If ReadSize is 'file', then each call to read reads all of the data in one file.

When you change ReadSize from a numeric scalar to 'file' or vice versa, MATLAB resets the datastore to the state where no data has been read from it.

Data Types: double | char | string
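
The two ReadSize modes can be sketched with a five-row file, which is invented for the example. With ReadSize set to 2, each call to read returns at most two rows; switching to 'file' resets the datastore, so the next read starts again from the beginning.

```matlab
% Five rows of made-up data in a single file.
file = [tempname '.csv'];
writetable(table((1:5)',(6:10)','VariableNames',{'N','M'}),file);

% Read in chunks of at most two rows and record each chunk height.
ds = tabularTextDatastore(file,'ReadSize',2);
chunkHeights = [];
while hasdata(ds)
    chunk = read(ds);
    chunkHeights(end+1) = height(chunk); %#ok<AGROW>
end

% Switch to whole-file reads; this resets the datastore.
ds.ReadSize = 'file';
whole = read(ds);
```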

Output data type of text variables, specified as 'char' or 'string'. TextType specifies the data type of text variables formatted with %s, %q, or %[...].

  • If TextType is 'char', then the output is a cell array of character vectors.

  • If TextType is 'string', then the output has type string.

Data Types: char | string

Object Functions

hasdata          Determine if data is available to read
numpartitions    Number of datastore partitions
partition        Partition a datastore
preview          Subset of data in datastore
read             Read data in datastore
readall          Read all data in datastore
reset            Reset datastore to initial state
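
The object functions listed above can be combined as in this sketch. The two-file folder is made up for the illustration, and the number of partitions that numpartitions returns is not assumed; the loop only relies on the partitions together covering all rows.

```matlab
% Two small files with the same variables (made-up data).
folder = tempname;
mkdir(folder);
writetable(table((1:3)',(4:6)','VariableNames',{'A','B'}), ...
    fullfile(folder,'f1.csv'));
writetable(table((7:9)',(10:12)','VariableNames',{'A','B'}), ...
    fullfile(folder,'f2.csv'));

ds = tabularTextDatastore(folder);

peek = preview(ds);       % small sample; does not advance the datastore
firstChunk = read(ds);    % advances the read position
reset(ds);                % back to the state where nothing has been read
everything = readall(ds); % all rows from all files

% Split the datastore into parts and count the rows in each part.
n = numpartitions(ds);
rowsPerPart = zeros(1,n);
for k = 1:n
    sub = partition(ds,n,k);
    rowsPerPart(k) = height(readall(sub));
end
```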

Examples

Create a TabularTextDatastore object containing the text file airlinesmall.csv.

ttds = tabularTextDatastore('airlinesmall.csv')
ttds = 

  TabularTextDatastore with properties:

                      Files: {
                             ' ...\matlab\toolbox\matlab\demos\airlinesmall.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows

Create a datastore from the sample file airlinesmall.csv, which contains tabular data.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');

View the variables in the datastore.

ds.VariableNames
ans = 1x29 cell array
  Columns 1 through 5

    {'Year'}    {'Month'}    {'DayofMonth'}    {'DayOfWeek'}    {'DepTime'}

  Columns 6 through 9

    {'CRSDepTime'}    {'ArrTime'}    {'CRSArrTime'}    {'UniqueCarrier'}

  Columns 10 through 13

    {'FlightNum'}    {'TailNum'}    {'ActualElapsedTime'}    {'CRSElapsedTime'}

  Columns 14 through 18

    {'AirTime'}    {'ArrDelay'}    {'DepDelay'}    {'Origin'}    {'Dest'}

  Columns 19 through 22

    {'Distance'}    {'TaxiIn'}    {'TaxiOut'}    {'Cancelled'}

  Columns 23 through 25

    {'CancellationCode'}    {'Diverted'}    {'CarrierDelay'}

  Columns 26 through 28

    {'WeatherDelay'}    {'NASDelay'}    {'SecurityDelay'}

  Column 29

    {'LateAircraftDelay'}

Modify the SelectedVariableNames property to specify the variables of interest.

ds.SelectedVariableNames = {'Year','Month','Cancelled'};

Alternatively, you can specify the variables of interest when you create the datastore.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA','SelectedVariableNames',{'Year','Month','Cancelled'});

Create a datastore from the sample file airlinesmall.csv, which contains tabular data.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');

Specify the variables of interest.

ds.SelectedVariableNames = {'Year','Month','UniqueCarrier'};

View the SelectedFormats property.

ds.SelectedFormats
ans = 1x3 cell array
    {'%f'}    {'%f'}    {'%q'}

The SelectedFormats property indicates that the Year and Month variables will be interpreted as columns of floating-point values, and the UniqueCarrier variable will be interpreted as a column of text.

Specify that the first two variables should be read as signed integers, and the third variable should be read as a categorical value by modifying the SelectedFormats property.

ds.SelectedFormats = {'%d','%d','%C'};

Preview the data.

T = preview(ds)
T=8×3 table
    Year    Month    UniqueCarrier
    ____    _____    _____________

    1987     10           PS      
    1987     10           PS      
    1987     10           PS      
    1987     10           PS      
    1987     10           PS      
    1987     10           PS      
    1987     10           PS      
    1987     10           PS      

Alternatives

You also can create a TabularTextDatastore object using the datastore function. For example, ds = datastore(location,'Type','tabulartext') creates a datastore from a collection of files specified by location.
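
A minimal sketch of this alternative, using a made-up two-column file, shows that the datastore function returns the same kind of object:

```matlab
% Small file with made-up data.
file = [tempname '.csv'];
writetable(table([1;2],[3;4],'VariableNames',{'P','Q'}),file);

% Ask the generic datastore function for a tabular text datastore.
ds = datastore(file,'Type','tabulartext');
kind = class(ds);   % class name of the returned object
```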

Introduced in R2014b
