parquetDatastore

Datastore for collection of Parquet files

Description

Use a ParquetDatastore object to manage a collection of Parquet files, where each individual Parquet file fits in memory, but the entire collection of files does not necessarily fit. You can create a ParquetDatastore object using the parquetDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Description

pds = parquetDatastore(location) creates a datastore pds from the collection of Parquet files specified by location.

pds = parquetDatastore(location,Name,Value) specifies additional parameters and properties for pds using one or more name-value pair arguments.

Input Arguments

Files or folders included in the datastore, specified as a FileSet object, as file paths, or as a DsFileSet object.

  • FileSet object — You can specify location as a FileSet object. Specifying the location as a FileSet object leads to a faster construction time for datastores compared to specifying a path or DsFileSet object. For more information, see matlab.io.datastore.FileSet.

  • File path — You can specify a single file path as a character vector or string scalar. You can specify multiple file paths as a cell array of character vectors or a string array.

  • DsFileSet object — You can specify a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.

Files or folders may be local or remote:

  • Local files or folders — Specify local paths to files or folders. If the files are not in the current folder, then specify full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.

  • Remote files or folders — Specify full paths to remote files or folders as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.

The parquetDatastore function supports the .parquet file format.

Example: "myfile.parquet"

Example: "../dir/data/myfile.parquet"

Example: ["C:\dir\data\myfile01.parquet","C:\dir\data\myfile02.parquet"]

Example: "s3://bucketname/path_to_files/*.parquet"
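The different forms of location described above can be sketched as follows. This is a minimal illustration, not an official example: the temporary folder, the file names part1.parquet and part2.parquet, and the table contents are all hypothetical and are created only so that the snippet is self-contained.

```matlab
% Build a small, throwaway collection of Parquet files (hypothetical data).
folder = tempname;                       % unique temporary folder name
mkdir(folder);
T = table((1:3)', ["a";"b";"c"], 'VariableNames', {'Id','Label'});
file1 = fullfile(folder, 'part1.parquet');
file2 = fullfile(folder, 'part2.parquet');
parquetwrite(file1, T);
parquetwrite(file2, T);

% 1) A folder: supported files directly inside the folder are included.
pdsFolder = parquetDatastore(folder);

% 2) A wildcard pattern: every matching file is included.
pdsWildcard = parquetDatastore(fullfile(folder, '*.parquet'));

% 3) An explicit list of file paths, given as a string array.
pdsList = parquetDatastore([string(file1), string(file2)]);

% 4) A FileSet object built from the same folder.
fs = matlab.io.datastore.FileSet(folder);
pdsFileSet = parquetDatastore(fs);
```

All four datastores should resolve to the same two files, which you can confirm by inspecting the Files property of each one.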

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: "IncludeSubfolders",true
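As a brief sketch of the two name-value syntaxes, the following snippet creates the same datastore twice. The temporary folder and the file data.parquet are hypothetical, and the Name=Value form assumes R2021a or later.

```matlab
% Create one small Parquet file to point the datastore at (hypothetical data).
folder = tempname;
mkdir(folder);
parquetwrite(fullfile(folder, 'data.parquet'), table((1:5)', 'VariableNames', {'Id'}));

% R2021a and later: Name=Value syntax. The order of the pairs does not matter.
pdsNew = parquetDatastore(folder, ReadSize=2, IncludeSubfolders=false);

% Any release: quoted names separated from their values by commas.
pdsOld = parquetDatastore(folder, "IncludeSubfolders", false, "ReadSize", 2);
```

Both calls are expected to produce datastores with identical Files and ReadSize values.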

Extensions to include in datastore, specified as the name-value argument consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array.

  • If you do not specify "FileExtensions", then parquetDatastore automatically includes all files with .parquet and .parq extensions in the specified path.

  • If you want to include Parquet files with nonstandard file extensions in the datastore, then specify those extensions explicitly.

  • If you want to create a parquetDatastore for files without any extensions, then specify "FileExtensions" as an empty character vector, ''.

Example: "FileExtensions",[".parquet",".parq"]

Example: "FileExtensions",".myformat"

Example: "FileExtensions",''

Data Types: char | cell | string
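The effect of "FileExtensions" can be sketched with a folder that holds one Parquet file with the standard extension and one with a nonstandard extension. The folder, the file names, and the .myformat extension are hypothetical. The second file is written as .parquet and then renamed, so the sketch does not rely on how parquetwrite treats unusual extensions.

```matlab
% Create a folder with two Parquet files (hypothetical data).
folder = tempname;
mkdir(folder);
T = table((1:3)', 'VariableNames', {'Id'});
parquetwrite(fullfile(folder, 'standard.parquet'), T);

% Write a second Parquet file, then rename it to a nonstandard extension.
parquetwrite(fullfile(folder, 'custom.parquet'), T);
movefile(fullfile(folder, 'custom.parquet'), fullfile(folder, 'custom.myformat'));

% Default: only files with .parquet and .parq extensions are included.
pdsDefault = parquetDatastore(folder);

% Only files with the nonstandard extension.
pdsCustom = parquetDatastore(folder, 'FileExtensions', '.myformat');

% Both extensions, listed explicitly.
pdsBoth = parquetDatastore(folder, 'FileExtensions', {'.parquet', '.myformat'});
```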

Subfolder inclusion flag, specified as the name-value argument consisting of "IncludeSubfolders" and true or false. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double
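A minimal sketch of the difference between the two "IncludeSubfolders" settings follows. The folder layout (a top-level folder with one subfolder named nested) and the file names are hypothetical.

```matlab
% Create a top-level folder with one nested subfolder (hypothetical layout).
top = tempname;
sub = fullfile(top, 'nested');
mkdir(sub);                              % also creates the parent folder
T = table((1:3)', 'VariableNames', {'Id'});
parquetwrite(fullfile(top, 'a.parquet'), T);
parquetwrite(fullfile(sub, 'b.parquet'), T);

% Default (false): only files directly inside the top-level folder.
pdsFlat = parquetDatastore(top);

% true: files in the top-level folder and in all of its subfolders.
pdsDeep = parquetDatastore(top, 'IncludeSubfolders', true);
```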

Output data type, specified as the name-value argument consisting of "OutputType" and one of these values:

  • "auto" — Detects if the output from the datastore should be a table or a timetable based on whether you specify the "RowTimes" name-value argument. If you specify "RowTimes" then the output is a timetable; otherwise, the output is a table.

  • "table" — Return a table.

  • "timetable" — Return a timetable.

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the "RowTimes" name-value pair to return timetables from ParquetDatastore.

Example: "OutputType","timetable"

Data Types: char | string

Flag to preserve variable names, specified as the name-value argument consisting of "VariableNamingRule" and either "modify" or "preserve".

  • "modify" — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB® identifiers.

  • "preserve" — Preserve variable names that are not valid MATLAB identifiers such as variable names that include spaces and non-ASCII characters.

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string
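The two values of VariableNamingRule can be compared with a file whose only variable name contains a space. This is a sketch under stated assumptions: the file name and the variable name 'Flight Number' are hypothetical, and writing such a name requires R2019b or later.

```matlab
% Write a Parquet file whose variable name is not a valid MATLAB identifier.
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'flights.parquet');
T = table([101; 202], 'VariableNames', {'Flight Number'});
parquetwrite(file, T);

% "modify": the name is converted to a valid MATLAB identifier.
pdsModify = parquetDatastore(file, 'VariableNamingRule', 'modify');

% "preserve": the original name, including the space, is kept.
pdsPreserve = parquetDatastore(file, 'VariableNamingRule', 'preserve');

modifiedNames  = string(pdsModify.VariableNames);
preservedNames = string(pdsPreserve.VariableNames);
```

Inspecting modifiedNames and preservedNames side by side shows how the rule changes the names that the datastore reports.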

Alternate file system root paths, specified as the name-value argument consisting of "AlternateFileSystemRoots" and a string vector or a cell array. Use "AlternateFileSystemRoots" when you create a datastore on a local machine but need to access and process the data on another machine (possibly running a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines that run a different platform, you must use "AlternateFileSystemRoots" to associate the root paths.

  • To associate a set of root paths that are equivalent to one another, specify "AlternateFileSystemRoots" as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify "AlternateFileSystemRoots" as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify "AlternateFileSystemRoots" as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify "AlternateFileSystemRoots" as a cell array of cell arrays of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of "AlternateFileSystemRoots" must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Properties

ParquetDatastore properties describe the format of the files in a datastore object and control how the data is read from the datastore. With the exception of the Files property, you can specify the value of ParquetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use dot notation.

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument defines these files.

The first file specified in the cell array determines the variable names and format information for all files in the datastore.

Example: {"C:\dir\data\file1.ext";"C:\dir\data\file2.ext"}

Data Types: cell | string

The Folders property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the parquetDatastore and datastore functions defines Folders when the datastore is created.

The Folders property is reset when you modify the Files property of a ParquetDatastore object.

Data Types: cell

Filter to select rows to import, specified as a matlab.io.RowFilter object. The matlab.io.RowFilter object designates conditions each row must satisfy to be included in your output table or timetable. If you do not specify RowFilter, then parquetDatastore imports all rows from the input Parquet file.
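As a hedged illustration of row filtering, the following sketch builds a filter with the rowfilter function and passes it to parquetDatastore. The file losses.parquet, its contents, and the threshold of 400 are hypothetical, and the snippet assumes a release in which both rowfilter and the RowFilter argument are available.

```matlab
% Write a small Parquet file to filter (hypothetical data).
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'losses.parquet');
T = table(["East"; "West"; "East"; "West"], [100; 500; NaN; 450], ...
    'VariableNames', {'Region', 'Loss'});
parquetwrite(file, T);

% Create a filter on the Loss variable, then express the row condition.
rf = rowfilter("Loss");
condition = rf.Loss > 400;

% Only rows that satisfy the condition are imported.
pds = parquetDatastore(file, 'RowFilter', condition);
filtered = readall(pds);
```

Rows whose Loss value is NaN do not satisfy the comparison, so they are excluded along with the rows at or below the threshold.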

Amount of data to read per read step, specified as one of these values:

  • "rowgroup" — Each read step reads the number of rows in the row groups of the Parquet data. To get the number of rows in each row group, see the RowGroupHeights property of the ParquetInfo object.

  • "file" — Each read step reads all of the data in one file.

  • positive integer — Each read step reads the specified number of rows.

When you change ReadSize from a positive integer to "file" or "rowgroup", or from "file" or "rowgroup" to a positive integer, MATLAB resets the datastore to an unread state, where no data has been read from it.

In a parallel processing workflow (Parallel Computing Toolbox), the data is read in steps from each parallel worker. In a serial workflow, the data is read in steps from the input location.

Data Types: string | char | double

Since R2023b

Partition unit for parallel processing, specified as one of the values in the following table.

In a parallel processing workflow (Parallel Computing Toolbox), PartitionMethod determines the amount of data to send to each parallel worker. Each worker receives approximately the total number of partition units divided by the number of parallel workers. In a serial workflow, the PartitionMethod name-value argument is ignored.

Value         Description
"auto"        parquetDatastore selects a partition unit based on the ReadSize name-value argument to balance the workload between parallel workers.
"file"        Partitions are based on the total number of files.
"bytes"       Partitions are based on the number of bytes specified by the BlockSize property.
"rowgroup"    Partitions are based on the total number of row groups.

Granularity and speed of processing depend on the combination of PartitionMethod and ReadSize values. While PartitionMethod determines how much data to send to each parallel worker, ReadSize determines how much data to read per read step. This table shows supported PartitionMethod and ReadSize combinations and their relative granularities and partitioning times.

Granularity, Partitioning Time                      PartitionMethod    ReadSize
High granularity, long partitioning time            rowgroup           rowgroup
                                                    rowgroup           positive integer
Moderate granularity, moderate partitioning time    bytes              rowgroup
Low granularity, short partitioning time            file               file
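The low-granularity combination from the table (PartitionMethod "file" with ReadSize "file") can be sketched without a parallel pool by using the numpartitions and partition object functions directly. The folder and file names are hypothetical, and the snippet assumes R2023b or later, where PartitionMethod is available.

```matlab
% Create two small Parquet files (hypothetical data).
folder = tempname;
mkdir(folder);
T = table((1:6)', 'VariableNames', {'Id'});
parquetwrite(fullfile(folder, 'part1.parquet'), T);
parquetwrite(fullfile(folder, 'part2.parquet'), T);

% Partition by file, and read one whole file per read step.
pds = parquetDatastore(folder, 'PartitionMethod', 'file', 'ReadSize', 'file');

% Walk over every partition and count the rows that each one holds.
n = numpartitions(pds);
rowsPerPartition = zeros(1, n);
for k = 1:n
    subds = partition(pds, n, k);        % k-th of n partitions
    rowsPerPartition(k) = height(readall(subds));
end
totalRows = sum(rowsPerPartition);
```

Whatever the number of partitions, the rows across all partitions should add up to the rows in the full datastore.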

Since R2023b

Number of bytes per partition, specified as a positive integer. Specify this argument if PartitionMethod is "bytes". By default, the value of BlockSize is 128000000 bytes (128 MB).

Example: BlockSize=1000000

Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first nonheader line in the first file. You can specify VariableNames as a character vector or string scalar; however, the datastore converts the value to a cell array of character vectors and stores it in that form. When you modify the VariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to "preserve".

If ReadVariableNames is false, then VariableNames defaults to ["Var1","Var2", ...].

Example: ["Time","Date","Quantity"]

Data Types: char | cell | string

Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to "preserve".

Example: ["Var3","Var7","Var4"]

Data Types: cell | string
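A short sketch of selecting a subset of variables after creating the datastore follows. The file sample.parquet and its three variables (Id, Value, and Tag) are hypothetical.

```matlab
% Write a Parquet file with three variables (hypothetical data).
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'sample.parquet');
T = table((1:3)', [10; 20; 30], ["x"; "y"; "z"], ...
    'VariableNames', {'Id', 'Value', 'Tag'});
parquetwrite(file, T);

pds = parquetDatastore(file);
allNames = string(pds.VariableNames);    % every variable in the file

% Read only two of the three variables; the order given here is arbitrary.
pds.SelectedVariableNames = ["Tag", "Id"];
subset = read(pds);
selectedNames = string(subset.Properties.VariableNames);
```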

Name of row times variable, specified as the name-value argument consisting of "RowTimes" and a variable name (such as "Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain a datetime or a duration vector.

If the value of "OutputType" is "timetable", but you do not specify "RowTimes", then ParquetDatastore uses the first datetime or duration variable as the row times for the timetable.

The SupportedOutputFormats property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

The DefaultOutputFormat property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string
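The relationship between these two properties and writeall can be sketched as follows. The input and output folders and the file data.parquet are hypothetical. The snippet uses the "OutputFormat" name-value argument of writeall to override the default format, and it searches the output folder recursively because the layout of the written files depends on writeall settings.

```matlab
% Create an input folder with one Parquet file and an empty output folder.
inFolder = tempname;
outFolder = tempname;
mkdir(inFolder);
mkdir(outFolder);
parquetwrite(fullfile(inFolder, 'data.parquet'), table((1:4)', 'VariableNames', {'Id'}));

pds = parquetDatastore(inFolder);
formats = pds.SupportedOutputFormats;    % formats that writeall can produce
defaultFormat = pds.DefaultOutputFormat; % format used when none is specified

% Write the datastore contents as CSV instead of the default format.
writeall(pds, outFolder, 'OutputFormat', 'csv');

% Look for the written files anywhere under the output folder.
written = dir(fullfile(outFolder, '**', '*.csv'));
```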

Object Functions

hasdata            Determine if data is available to read
numpartitions      Number of datastore partitions
partition          Partition a datastore
preview            Preview subset of data in datastore
read               Read data in datastore
readall            Read all data in datastore
writeall           Write datastore to files
reset              Reset datastore to initial state
transform          Transform datastore
combine            Combine data from multiple datastores
isPartitionable    Determine whether datastore is partitionable
isSubsettable      Determine whether datastore is subsettable
isShuffleable      Determine whether datastore is shuffleable
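Several of these functions are typically used together in a read loop. The following is a minimal sketch of that pattern rather than a prescribed workflow. The file rows.parquet, its ten rows, and the ReadSize value of 4 are hypothetical.

```matlab
% Write a ten-row Parquet file (hypothetical data).
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'rows.parquet');
parquetwrite(file, table((1:10)', 'VariableNames', {'Id'}));

% Read at most four rows per read step.
pds = parquetDatastore(file, 'ReadSize', 4);

% Process the datastore chunk by chunk until no data remains.
totalRows = 0;
maxChunkRows = 0;
while hasdata(pds)
    chunk = read(pds);
    totalRows = totalRows + height(chunk);
    maxChunkRows = max(maxChunkRows, height(chunk));
end

% Rewind the datastore so that it can be read again from the start.
reset(pds);
readyAgain = hasdata(pds);
```

Because each chunk is processed and then discarded, this pattern keeps memory use bounded by the size of one read step rather than by the size of the whole collection.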

Examples

Create a parquetDatastore object using either a FileSet object or a file path.

Create a FileSet object containing the file outages.parquet. Create a parquetDatastore object.

fs = matlab.io.datastore.FileSet("outages.parquet");
pds = parquetDatastore(fs)
pds = 
  ParquetDatastore with properties:

                       Files: {
                              '...\matlab\toolbox\matlab\demos\outages.parquet'
                              }
                     Folders: {
                              '...\matlab\toolbox\matlab\demos'
                              }
               VariableNames: {1x6 cell}
       SelectedVariableNames: {1x6 cell}
                    ReadSize: 'rowgroup'
                  OutputType: 'table'
                    RowTimes: []
    AlternateFileSystemRoots: {}
      SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    ...    ]
         DefaultOutputFormat: "parquet"
          VariableNamingRule: 'modify'

Alternatively, you can use a file path to create your parquetDatastore object.

pds = parquetDatastore("outages.parquet");

Create a datastore for a sample Parquet file, and then read data from the file with different ReadSize values.

Create a datastore for outages.parquet, set ReadSize to 10 rows, and then read from the datastore. The value of ReadSize determines how many rows of data are read from the datastore with each call to the read function.

pds = parquetDatastore("outages.parquet","ReadSize",10);
read(pds)
ans=10×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"

Set the ReadSize property value to "file" and read from the datastore. Each call to the read function now reads all the data from one file. Because this datastore contains a single file, one call returns all 1468 rows.

pds.ReadSize = "file";
data = read(pds)
data=1468×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"
    "SouthEast"    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    "equipment fault"
    "West"         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    "equipment fault"
    "SouthEast"    01-Sep-2002 18:22:00    95.917         36759    01-Sep-2002 19:12:00    "severe storm"   
    "SouthEast"    27-Sep-2003 07:32:00       NaN    3.5517e+05    04-Oct-2003 07:02:00    "severe storm"   
    "West"         12-Nov-2003 06:12:00    254.09    9.2429e+05    17-Nov-2003 02:04:00    "winter storm"   
    "NorthEast"    18-Sep-2004 05:54:00         0             0                     NaT    "equipment fault"
      ⋮

You can also set the value of the ReadSize property to "rowgroup". For more information, see the description of the ReadSize property.

Use the OutputType and RowTimes name-value pairs to make ParquetDatastore return timetables instead of tables.

Create a datastore for airlinesmall.parquet. Specify the "OutputType" name-value argument as "timetable".

pds = parquetDatastore("airlinesmall.parquet","OutputType","timetable");
preview(pds)
ans=12500×26 timetable
       Date        DayOfWeek          DepTime                CRSDepTime               ArrTime                CRSArrTime         UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn     TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
    ___________    _________    ____________________    ____________________    ____________________    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    _______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

    21-Oct-1987        3        21-Oct-1987 06:42:00    21-Oct-1987 06:30:00    21-Oct-1987 07:35:00    21-Oct-1987 07:27:00        "PS"           1503        "NA"           3180 sec           3420 sec       NaN sec     480 sec     720 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    26-Oct-1987        1        26-Oct-1987 10:21:00    26-Oct-1987 10:20:00    26-Oct-1987 11:24:00    26-Oct-1987 11:16:00        "PS"           1550        "NA"           3780 sec           3360 sec       NaN sec     480 sec      60 sec    "SJC"     "BUR"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987        5        23-Oct-1987 20:55:00    23-Oct-1987 20:35:00    23-Oct-1987 22:18:00    23-Oct-1987 21:57:00        "PS"           1589        "NA"           4980 sec           4920 sec       NaN sec    1260 sec    1200 sec    "SAN"     "SMF"       480      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987        5        23-Oct-1987 13:32:00    23-Oct-1987 13:20:00    23-Oct-1987 14:31:00    23-Oct-1987 14:18:00        "PS"           1655        "NA"           3540 sec           3480 sec       NaN sec     780 sec     720 sec    "BUR"     "SJC"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    22-Oct-1987        4        22-Oct-1987 06:29:00    22-Oct-1987 06:30:00    22-Oct-1987 07:46:00    22-Oct-1987 07:42:00        "PS"           1702        "NA"           4620 sec           4320 sec       NaN sec     240 sec     -60 sec    "SMF"     "LAX"       373      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    28-Oct-1987        3        28-Oct-1987 14:46:00    28-Oct-1987 13:43:00    28-Oct-1987 15:47:00    28-Oct-1987 14:48:00        "PS"           1729        "NA"           3660 sec           3900 sec       NaN sec    3540 sec    3780 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    08-Oct-1987        4        08-Oct-1987 09:28:00    08-Oct-1987 09:30:00    08-Oct-1987 10:52:00    08-Oct-1987 10:49:00        "PS"           1763        "NA"           5040 sec           4740 sec       NaN sec     180 sec    -120 sec    "SAN"     "SFO"       447      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    10-Oct-1987        6        10-Oct-1987 08:59:00    10-Oct-1987 09:00:00    10-Oct-1987 11:34:00    10-Oct-1987 11:23:00        "PS"           1800        "NA"           9300 sec           8580 sec       NaN sec     660 sec     -60 sec    "SEA"     "LAX"       954      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    20-Oct-1987        2        20-Oct-1987 18:33:00    20-Oct-1987 18:30:00    20-Oct-1987 19:29:00    20-Oct-1987 19:26:00        "PS"           1831        "NA"           3360 sec           3360 sec       NaN sec     180 sec     180 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    15-Oct-1987        4        15-Oct-1987 10:41:00    15-Oct-1987 10:40:00    15-Oct-1987 11:57:00    15-Oct-1987 11:55:00        "PS"           1864        "NA"           4560 sec           4500 sec       NaN sec     120 sec      60 sec    "SFO"     "LAS"       414      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    15-Oct-1987        4        15-Oct-1987 16:08:00    15-Oct-1987 15:53:00    15-Oct-1987 16:56:00    15-Oct-1987 16:40:00        "PS"           1907        "NA"           2880 sec           2820 sec       NaN sec     960 sec     900 sec    "LAX"     "FAT"       209      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    21-Oct-1987        3        21-Oct-1987 09:49:00    21-Oct-1987 09:40:00    21-Oct-1987 10:55:00    21-Oct-1987 10:52:00        "PS"           1939        "NA"           3960 sec           4320 sec       NaN sec     180 sec     540 sec    "LGB"     "SFO"       354      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    22-Oct-1987        4        22-Oct-1987 19:02:00    22-Oct-1987 18:47:00    22-Oct-1987 20:30:00    22-Oct-1987 19:51:00        "PS"           1973        "NA"           5280 sec           3840 sec       NaN sec    2340 sec     900 sec    "LAX"     "OAK"       337      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    16-Oct-1987        5        16-Oct-1987 19:10:00    16-Oct-1987 18:38:00    16-Oct-1987 20:52:00    16-Oct-1987 19:55:00        "TW"             19        "NA"           9720 sec           8220 sec       NaN sec    3420 sec    1920 sec    "STL"     "DEN"       770      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    02-Oct-1987        5        02-Oct-1987 11:30:00    02-Oct-1987 11:33:00    02-Oct-1987 12:37:00    02-Oct-1987 12:37:00        "TW"             59        "NA"          11220 sec          11040 sec       NaN sec       0 sec    -180 sec    "STL"     "PHX"      1262      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    30-Oct-1987        5        30-Oct-1987 14:00:00    30-Oct-1987 14:00:00    30-Oct-1987 19:20:00    30-Oct-1987 19:34:00        "TW"            102        "NA"          12000 sec          12840 sec       NaN sec    -840 sec       0 sec    "SNA"     "STL"      1570      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
      ⋮

When you do not also specify "RowTimes", parquetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times.

Specify the "RowTimes" option to use the arrival times (ArrTime) as the row times, instead of the flight dates.

pds = parquetDatastore("airlinesmall.parquet","OutputType","timetable","RowTimes","ArrTime");
preview(pds)
ans=12500×26 timetable
          ArrTime              Date        DayOfWeek          DepTime                CRSDepTime              CRSArrTime         UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn     TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
    ____________________    ___________    _________    ____________________    ____________________    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    _______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

    21-Oct-1987 07:35:00    21-Oct-1987        3        21-Oct-1987 06:42:00    21-Oct-1987 06:30:00    21-Oct-1987 07:27:00        "PS"           1503        "NA"           3180 sec           3420 sec       NaN sec     480 sec     720 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    26-Oct-1987 11:24:00    26-Oct-1987        1        26-Oct-1987 10:21:00    26-Oct-1987 10:20:00    26-Oct-1987 11:16:00        "PS"           1550        "NA"           3780 sec           3360 sec       NaN sec     480 sec      60 sec    "SJC"     "BUR"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987 22:18:00    23-Oct-1987        5        23-Oct-1987 20:55:00    23-Oct-1987 20:35:00    23-Oct-1987 21:57:00        "PS"           1589        "NA"           4980 sec           4920 sec       NaN sec    1260 sec    1200 sec    "SAN"     "SMF"       480      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    23-Oct-1987 14:31:00    23-Oct-1987        5        23-Oct-1987 13:32:00    23-Oct-1987 13:20:00    23-Oct-1987 14:18:00        "PS"           1655        "NA"           3540 sec           3480 sec       NaN sec     780 sec     720 sec    "BUR"     "SJC"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    22-Oct-1987 07:46:00    22-Oct-1987        4        22-Oct-1987 06:29:00    22-Oct-1987 06:30:00    22-Oct-1987 07:42:00        "PS"           1702        "NA"           4620 sec           4320 sec       NaN sec     240 sec     -60 sec    "SMF"     "LAX"       373      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    28-Oct-1987 15:47:00    28-Oct-1987        3        28-Oct-1987 14:46:00    28-Oct-1987 13:43:00    28-Oct-1987 14:48:00        "PS"           1729        "NA"           3660 sec           3900 sec       NaN sec    3540 sec    3780 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    08-Oct-1987 10:52:00    08-Oct-1987        4        08-Oct-1987 09:28:00    08-Oct-1987 09:30:00    08-Oct-1987 10:49:00        "PS"           1763        "NA"           5040 sec           4740 sec       NaN sec     180 sec    -120 sec    "SAN"     "SFO"       447      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    10-Oct-1987 11:34:00    10-Oct-1987        6        10-Oct-1987 08:59:00    10-Oct-1987 09:00:00    10-Oct-1987 11:23:00        "PS"           1800        "NA"           9300 sec           8580 sec       NaN sec     660 sec     -60 sec    "SEA"     "LAX"       954      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    20-Oct-1987 19:29:00    20-Oct-1987        2        20-Oct-1987 18:33:00    20-Oct-1987 18:30:00    20-Oct-1987 19:26:00        "PS"           1831        "NA"           3360 sec           3360 sec       NaN sec     180 sec     180 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    15-Oct-1987 11:57:00    15-Oct-1987        4        15-Oct-1987 10:41:00    15-Oct-1987 10:40:00    15-Oct-1987 11:55:00        "PS"           1864        "NA"           4560 sec           4500 sec       NaN sec     120 sec      60 sec    "SFO"     "LAS"       414      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    15-Oct-1987 16:56:00    15-Oct-1987        4        15-Oct-1987 16:08:00    15-Oct-1987 15:53:00    15-Oct-1987 16:40:00        "PS"           1907        "NA"           2880 sec           2820 sec       NaN sec     960 sec     900 sec    "LAX"     "FAT"       209      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    21-Oct-1987 10:55:00    21-Oct-1987        3        21-Oct-1987 09:49:00    21-Oct-1987 09:40:00    21-Oct-1987 10:52:00        "PS"           1939        "NA"           3960 sec           4320 sec       NaN sec     180 sec     540 sec    "LGB"     "SFO"       354      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    22-Oct-1987 20:30:00    22-Oct-1987        4        22-Oct-1987 19:02:00    22-Oct-1987 18:47:00    22-Oct-1987 19:51:00        "PS"           1973        "NA"           5280 sec           3840 sec       NaN sec    2340 sec     900 sec    "LAX"     "OAK"       337      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    16-Oct-1987 20:52:00    16-Oct-1987        5        16-Oct-1987 19:10:00    16-Oct-1987 18:38:00    16-Oct-1987 19:55:00        "TW"             19        "NA"           9720 sec           8220 sec       NaN sec    3420 sec    1920 sec    "STL"     "DEN"       770      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    02-Oct-1987 12:37:00    02-Oct-1987        5        02-Oct-1987 11:30:00    02-Oct-1987 11:33:00    02-Oct-1987 12:37:00        "TW"             59        "NA"          11220 sec          11040 sec       NaN sec       0 sec    -180 sec    "STL"     "PHX"      1262      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
    30-Oct-1987 19:20:00    30-Oct-1987        5        30-Oct-1987 14:00:00    30-Oct-1987 14:00:00    30-Oct-1987 19:34:00        "TW"            102        "NA"          12000 sec          12840 sec       NaN sec    -840 sec       0 sec    "SNA"     "STL"      1570      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
      ⋮

Conditionally select rows from a data set using the RowFilter property.

Create a Parquet datastore using the outages.parquet file. View the first 8 rows of the datastore.

pds = parquetDatastore("outages.parquet");
preview(pds)
ans=8×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"

Create a row filter that identifies rows with a Region of "NorthEast" and a Cause of "winter storm". Then, set the RowFilter property of the datastore to the filter. Preview the datastore and note that it now contains only rows that meet the filter conditions.

rf = rowfilter(pds);
filter = rf.Region == "NorthEast" & rf.Cause == "winter storm";
pds.RowFilter = filter;
preview(pds)
ans=8×6 table
      Region            OutageTime          Loss     Customers       RestorationTime           Cause     
    ___________    ____________________    ______    __________    ____________________    ______________

    "NorthEast"    13-Nov-2004 10:42:00       NaN    1.4227e+05    19-Nov-2004 02:31:00    "winter storm"
    "NorthEast"    26-Dec-2004 22:18:00    255.45    1.0444e+05    27-Dec-2004 14:11:00    "winter storm"
    "NorthEast"    17-Dec-2003 15:11:00       NaN         66692    19-Dec-2003 07:22:00    "winter storm"
    "NorthEast"    28-Jan-2005 18:20:00    401.39         89683    29-Jan-2005 02:36:00    "winter storm"
    "NorthEast"    04-Feb-2005 00:53:00    32.061         46182    09-Feb-2005 02:42:00    "winter storm"
    "NorthEast"    16-Nov-2006 10:04:00    147.25    1.2571e+05    17-Nov-2006 10:55:00    "winter storm"
    "NorthEast"    03-Feb-2007 02:19:00    293.83    1.1628e+05    04-Feb-2007 21:24:00    "winter storm"
    "NorthEast"    18-Feb-2008 05:24:00    353.29         64687    20-Feb-2008 08:56:00    "winter storm"
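
The same approach extends to numeric conditions. The following is a minimal sketch, assuming outages.parquet is available on the MATLAB path as in the example above. The threshold of 300 and the variable name T are chosen only for illustration.

```matlab
% Sketch: combine a string condition and a numeric condition in one row filter.
% Assumes outages.parquet (the example file used above) is on the MATLAB path.
pds = parquetDatastore("outages.parquet");
rf = rowfilter(pds);

% Keep only West-region outages with a Loss above 300 (illustrative threshold).
pds.RowFilter = rf.Region == "West" & rf.Loss > 300;

% Read every row that passes the filter into a single in-memory table.
T = readall(pds);
```

Rows whose Loss is NaN do not satisfy the numeric comparison, so they should not appear in T.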

Limitations

  • If you read Parquet files that were written from a MATLAB table (for example, with parquetwrite), then the result from parquetread or parquetDatastore might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

  • Unlike parquetread, which replaces NULL values with doubles, parquetDatastore replaces NULL integer values with 0 and NULL boolean values with false. This replacement results in a lossy transformation.
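
To see how a round trip can change the format of a table, consider the following sketch. It assumes that text stored as a cell array of character vectors is read back as a string array, as described in Apache Parquet Data Type Mappings. The table contents, the variable names, and the temporary file are made up for illustration.

```matlab
% Sketch: write a small table to a temporary Parquet file and read it back
% through a datastore to compare variable types before and after.
T = table({'a';'b';'c'}, [1;2;3], 'VariableNames', {'Label','Value'});

file = [tempname '.parquet'];   % temporary file name, illustration only
parquetwrite(file, T);

pds = parquetDatastore(file);
R = readall(pds);

% Compare the class of the text variable in the original and imported tables.
origClass = class(T.Label);
readClass = class(R.Label);
fprintf("Label was %s, is now %s\n", origClass, readClass);

delete(file);                   % clean up the temporary file
```

The numeric variable Value is expected to survive unchanged, while the type of Label can differ from the original.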

Extended Capabilities

Version History

Introduced in R2019a
