How can I access data from HDFS in Parquet format?

We have a large dataset stored in Parquet files on a Hadoop file system and would like to use a MATLAB datastore to analyse it. Unfortunately, I couldn't find any reports that anybody has done this yet.
Does MathWorks provide a native way to access Parquet data? Perhaps one can use fileDatastore or a custom MATLAB datastore? Is there a template for that?

Accepted Answer

Hitesh Kumar Dasika on 20 Dec 2018
MathWorks has added support for Parquet files. It is available at the following link.
  5 Comments
Hatem Helal on 10 Apr 2019
R2019a adds support for working with Parquet files; see this answer and let us know if you have any further feedback.


More Answers (2)

Hatem Helal on 10 Apr 2019
MATLAB R2019a adds support for reading and writing Apache Parquet files (doc). Here are the relevant release notes:
1. Import and export column-oriented data from Parquet files in MATLAB. Parquet is a columnar storage format that supports efficient compression and encoding schemes. To work with the Parquet file format, use the parquetread, parquetwrite, and parquetinfo functions.
2. The write function now supports writing tall arrays to Parquet files. To write a tall array, set the FileType parameter to 'parquet', for example:
write('C:\myData',tX,'FileType','parquet')
3. Read a collection of Parquet files into MATLAB workspace using parquetDatastore.
For more information on the Parquet file format, see https://parquet.apache.org/.
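
As a rough illustration of the functions listed above, here is a minimal sketch that writes a small table to a Parquet file, inspects it, reads it back, and then reads the containing folder through parquetDatastore. The table contents, the file name, and the temporary folder are made up for the example. The commented-out HDFS line is only an assumption about how the original question could be approached: the host, port, and path are placeholders, and it presumes that the Hadoop client environment is already configured for MATLAB.

```matlab
% Write a small example table to a Parquet file in a temporary folder.
% The data and the file name are hypothetical.
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'sample.parquet');
T = table((1:3)', [0.5; 1.5; 2.5], 'VariableNames', {'id', 'value'});
parquetwrite(file, T);

% Inspect the file metadata, then read the file back as a table.
info = parquetinfo(file);
disp(info.VariableNames)
T2 = parquetread(file);

% Treat the folder as a collection of Parquet files and read it in chunks.
pds = parquetDatastore(folder);
while hasdata(pds)
    chunk = read(pds);
    disp(height(chunk))
end

% For data on HDFS, the location would be an hdfs:// URL instead of a
% local folder. Placeholder host, port, and path; assumes a configured
% Hadoop client:
% pds = parquetDatastore('hdfs://myhost:8020/path/to/data');
```

A datastore created this way can also be passed to tall, for example tall(pds), when the data does not fit in memory.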

Hitesh Kumar Dasika on 24 Sep 2018
Currently, there is no support for Apache Arrow and Parquet files in MATLAB.
  3 Comments
Hatem Helal on 10 Apr 2019
R2019a adds support for working with Parquet files; see this answer and let us know if you have any further feedback.

