MATLAB Interface for WebHDFS
Introduction
What is WebHDFS?
WebHDFS is a protocol that defines a public HTTP REST API, which permits clients to access the Hadoop® Distributed File System (HDFS) over the web. It retains the security that the native Hadoop® protocol offers and uses parallelism for better throughput. To use this toolbox, the WebHDFS functionality needs to be enabled on the Hadoop® server.
The MATLAB® interface for WebHDFS
This toolbox provides a set of functions that enable the user to directly work with files and folders stored in Hadoop® via a REST API and perform common operations such as read, write, upload, and download files.
When should you use WebHDFS?
When working with Hadoop® files, WebHDFS is not the only option, and you might want to consider other tools depending on the task at hand.
- For Big Data applications, you can prototype an algorithm in MATLAB® either using tall arrays or our Spark API and deploy it directly on a Spark-enabled Hadoop® cluster.
- You can access your files using Hive and Impala, and run any SQL or HQL command. This tool might be more suitable for running queries on large pieces of data.

These tools might be more suitable for running analytics on large sets of data, while the WebHDFS interface might be a better tool for small operations, since the data needs to travel back and forth over the network.
Installing or Updating the toolbox
Requirements
- Only base MATLAB® is required to run all the toolbox functionality. For users accessing Hadoop® via Kerberos authentication, R2019b or newer is recommended.
- [optional] Database Toolbox™ is needed to run Hive and Impala queries.
- [optional] MATLAB® Compiler™ is needed to deploy Spark or mapreduce jobs.
- [optional] MATLAB® Parallel Server™ is needed to run interactive Spark jobs.
Installation
The toolbox can be installed directly from the Add-On Explorer, or by double-clicking the mltbx file. All the functionality will then be accessible under the namespace webhdfs.*
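If you prefer to script the installation, a minimal sketch using the standard Add-On API (the .mltbx file name below is an assumption):
% Install from a downloaded .mltbx file (file name is hypothetical)
tbx = matlab.addons.toolbox.installToolbox("MATLAB Interface for WebHDFS.mltbx");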
Future updates
The toolbox will be updated regularly. To get the newest version, you can simply uninstall and re-install the toolbox directly from the Add-On Explorer in MATLAB®.
License
This toolbox is licensed under an XSLA license. Please see LICENSE.txt.
Getting Started
Complete documentation for the toolbox can be found in the Getting Started guide; the most common tasks are also outlined below.
For more information, please look at the documentation of the toolbox:
doc WebHdfsClient
Setting up the connection
The connection to a Hadoop® cluster is always done via the class WebHdfsClient. This class supports several optional arguments:
- root [optional]: Root folder to parent all requests. If unspecified, all requests are assumed to be relative to "".
- protocol [optional]: whether the connection is done via http or https (default).
- host [optional]: hostname of the server running Hadoop®
- port [optional]: port number where Hadoop® is running
- name [optional]: for unauthenticated servers only. Specify the name of the user.
client = WebHdfsClient(root = 'data', protocol = 'http', host = "sandbox-hdp.hortonworks.com", port = 50070);
To avoid having to set the connection details every time, you can save them so future connections only require the root folder of your requests (if any). These preferences persist between MATLAB® sessions.
client.saveConnectionDetails();
client = WebHdfsClient("root", 'data');
These saved preferences can be removed at any point by running:
client.clearConnectionDetails();
When you work with files and folders you can specify a relative or absolute path. If the specified path starts with "/", it will be interpreted as an absolute path. Otherwise, the code will interpret the path as relative to the "root" folder in the server. For example, the following line lists the status of the folder "/data/myData/testMW"
client = WebHdfsClient("root", 'data');
status = client.hdfs_status("myData/testMW")
status =
accessTime: 0
blockSize: 0
childrenNum: 0
fileId: 2495973
group: 'hdfs'
length: 0
modificationTime: 1.6262e+12
owner: 'maria_dev'
pathSuffix: ''
permission: '755'
replication: 0
storagePolicy: 0
type: 'DIRECTORY'
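Conversely, a path starting with "/" bypasses the root folder. A short sketch querying the same folder with an absolute path (this should return the status shown above):
% Leading "/" means the path is absolute and the "root" folder is ignored
statusAbs = client.hdfs_status("/data/myData/testMW");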
Authentication mechanisms
There are two authentication mechanisms supported by the toolbox:
- For unauthenticated servers, you will need to specify your username when connecting if you want to access any user-specific operations.
client = WebHdfsClient(root = 'data', name = "maria_dev");
- For Kerberos authentication, please use R2019b or newer. The authentication will be done automatically.
Exploring the Hadoop® file system
The following section outlines the most common commands for working with HDFS files. It shows how to navigate directories and open, download, and upload files directly in HDFS.
Listing files and folders
The methods hdfs_list and hdfs_content will give you information about the files inside a specific folder. For example, the following command returns the names and status of all the files within the folder /data/myData
client = WebHdfsClient("root", 'data');
elements = client.hdfs_list("myData", status = true)
Fields | accessTime | blockSize | childrenNum | fileId | group | length | modificationTime | owner | pathSuffix | permission | replication | storagePolicy | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.6262e+12 | 134217728 | 0 | 2499596 | 'hdfs' | 184 | 1.6262e+12 | 'maria_dev' | 'petdata.csv_... | '777' | 1 | 0 | 'FILE' |
2 | 1.6238e+12 | 134217728 | 0 | 2472035 | 'hdfs' | 184 | 1.6238e+12 | 'maria_dev' | 'petdata2.csv... | '777' | 1 | 0 | 'FILE' |
3 | 0 | 0 | 0 | 2495973 | 'hdfs' | 0 | 1.6262e+12 | 'maria_dev' | 'testMW' | '755' | 0 | 0 | 'DIRECTORY' |
Alternatively, if status is set to false, only the names of the files are returned.
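A minimal sketch of a names-only listing, together with a folder summary via hdfs_content (the exact shape of its output is not shown here, so treat that call as an assumption):
% Names only, no per-file metadata
names = client.hdfs_list("myData", status = false);
% Folder-level summary (assumption: same relative-path convention as hdfs_list)
summary = client.hdfs_content("myData");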
Similarly, the method hdfs_recent_files allows you to find the most recently modified files in a directory. By default the function returns one file, but you can specify the maximum number of files returned with the nfiles argument. If you set the nfiles argument to None, you will get back a list of all files. This function returns only the file names. For example, to view the latest file added to the folder /data/myData, we can run:
files = client.hdfs_recent_files("myData", 1)
files =
accessTime: 1.6262e+12
blockSize: 134217728
childrenNum: 0
fileId: 2499552
group: 'hdfs'
length: 184
modificationTime: 1.6262e+12
owner: 'maria_dev'
pathSuffix: 'petdata.csv'
permission: '777'
replication: 1
storagePolicy: 0
type: 'FILE'
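To retrieve more than one recent file, increase the second (nfiles) argument; a sketch assuming the same positional signature as the call above:
% The three most recently modified files in /data/myData
files = client.hdfs_recent_files("myData", 3);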
Creating, moving, and deleting directories
You can use the method hdfs_makedirs to create Hadoop® directories. It will recursively create intermediate directories if they are missing. For example, the following call:
client.hdfs_makedirs("one/two/three");
would also create the directories one and one/two if they were missing. Additionally, this method accepts an optional overwrite parameter (true/false) to specify if the folder needs to be overwritten. Please note that all contents will be discarded if overwrite is set to true.
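For example, a minimal sketch that recreates the directory from scratch (assuming overwrite accepts the name-value syntax used elsewhere in this toolbox):
% Recreate the tree; any existing contents of "one/two/three" are discarded
client.hdfs_makedirs("one/two/three", overwrite = true);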
You can delete files and directories from Hadoop® with the hdfs_delete method. Files/directories are not moved to the HDFS Trash, so they will be permanently deleted. hdfs_delete will return true if the file/directory was deleted and false if the file/directory did not exist.
client.hdfs_delete("one/two/three")
By default, non-empty directories will not be deleted. However, if you set the optional recursive argument to true, files/directories will be deleted recursively.
client.hdfs_delete("one", recursive=true)
Finally, you can move files/directories in Hadoop® with the hdfs_rename method:
- If the destination is an existing directory, then the source file/directory will be moved into it.
- If the destination is an existing file, then this method will return false.
- If the parent destination is missing, then this method will return false.
client.hdfs_rename("share/one/two", "share/one/four")
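Because the method returns a logical, you can guard against the two failure cases above; a minimal sketch:
moved = client.hdfs_rename("share/one/two", "share/one/four");
if ~moved
    warning("Rename failed: the destination is an existing file or its parent is missing.");
end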
Downloading and Uploading files and folders
With hdfs_download you can download files from a Hadoop® directory. For example, to download something into a temporary directory you can run:
client.hdfs_download('<hdfs_path>', tempdir());
If hdfs_path is a file, then that file will be downloaded. If the argument is a directory, then all the files and subfolders (together with their files) in that directory will be downloaded.
Note that wildcards are not supported, so you can either download the complete contents of a directory or individual files. If the local file or directory already exists, it will not be overwritten and an error will be raised. However, you can set an overwrite flag to force the download of the files:
client.hdfs_download(testFileName, tempdir(), overwrite=true);
The same process applies to uploading files to HDFS. You can upload local files and folders with the hdfs_upload method:
- If the target HDFS path exists and is a directory, then the files will be uploaded into it.
- If the target HDFS path exists and is a file, then it will be overwritten if the optional overwrite argument is set to True.
For example, to upload a single file, with the chosen permissions you can run:
client.hdfs_upload("/data/one", "myfolder", overwrite = true, permission = 777)
Reading and Writing files directly on Hadoop®
With hdfs_open, hdfs_read, and hdfs_write you can directly read data from files in Hadoop® folders. The files are read into memory, so you will not create a local copy of the file. Use the standard MATLAB® modes "r" for text files and "rb" for binary files like Parquet. For example:
testFileName = 'myData/Petdata.csv';
reader = client.hdfs_open(testFileName,'r')
reader =
WebHdfsFile with properties:
encoding: "utf-8"
hdfs_path: "/data/myData/Petdata.csv"
mode: 'r'
data = reader.read();
disp(data)
Row,Age,Weight,Height
Sanchez,38,176,71
Johnson,43,163,69
Lee,38,131,64
Diaz,40,133,67
Brown,49,119,64
Sanchez,38,176,71
Johnson,43,163,69
Lee,38,131,64
Diaz,40,133,67
Brown,49,119,64
If the data can be read using a standard MATLAB® command such as readtable, imread, or parquetread, you can pass this command (with its standard inputs or parameters) as:
data = reader.read(@(x) readtable(x,'ReadVariableNames',false))
Var1 | Var2 | Var3 | Var4 | |
---|---|---|---|---|
1 | 'Sanchez' | 38 | 176 | 71 |
2 | 'Johnson' | 43 | 163 | 69 |
3 | 'Lee' | 38 | 131 | 64 |
4 | 'Diaz' | 40 | 133 | 67 |
5 | 'Brown' | 49 | 119 | 64 |
6 | 'Sanchez' | 38 | 176 | 71 |
7 | 'Johnson' | 43 | 163 | 69 |
8 | 'Lee' | 38 | 131 | 64 |
9 | 'Diaz' | 40 | 133 | 67 |
10 | 'Brown' | 49 | 119 | 64 |
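The same pattern works for binary formats: open the file in "rb" mode and pass the matching reader. A sketch, assuming a hypothetical Parquet file myData/petdata.parquet exists on the cluster:
% "rb" mode for binary files such as Parquet
reader = client.hdfs_open('myData/petdata.parquet', 'rb');
data = reader.read(@parquetread);   % parquetread requires R2019a or newer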
With the hdfs_open function you can also write data to files in HDFS. Note that this is different from uploading files, as the file will not exist in a local path; it will be created from the data in memory.
You can overwrite existing files by setting the mode to "wb" (binary) or "wt" (text), and you can append to an existing file by setting the mode to "at" (text) or "ab" (binary). Note that appending is supported with text files and some binary formats like Avro; appending is not supported with Parquet files.
The toolbox has four helper methods for writing data to the file, depending on the format you choose:
- "write": to add any type of text data to the file.
- "writeTable": to add/append tables to a file.
- "writeCell": to add/append cells to a file.
- "writeMatrix": to add/append matrices to a file.
testFileName = '/data/myData/petdata.csv';
file = client.hdfs_open(testFileName, 'w+');
writeTable(file, T)
file.read(@readtable)
Age | Weight | Height | |
---|---|---|---|
1 | 38 | 176 | 71 |
2 | 43 | 163 | 69 |
3 | 38 | 131 | 64 |
4 | 40 | 133 | 67 |
5 | 49 | 119 | 64 |
By default, the writeTable function in "write" mode will only add the variable names to the written table, in UTF-8 encoding. These options can also be set explicitly as follows:
writeTable(file, T, 'WriteVariableNames', false, 'WriteRowNames', true, 'encoding', 'UTF-8')
file.read(@(x) readtable(x,'ReadRowNames', true))
Var1 | Var2 | Var3 | |
---|---|---|---|
1 Sanchez | 38 | 176 | 71 |
2 Johnson | 43 | 163 | 69 |
3 Lee | 38 | 131 | 64 |
4 Diaz | 40 | 133 | 67 |
5 Brown | 49 | 119 | 64 |
close(file);
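The other helper methods follow the same pattern; a minimal sketch of writeMatrix, assuming it mirrors the writeTable calling convention shown above:
% Hypothetical example: write a numeric matrix to a new CSV file in HDFS
file = client.hdfs_open('myData/matrix.csv', 'w+');
M = magic(4);           % sample data
writeMatrix(file, M);   % assumption: same calling pattern as writeTable
close(file);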
Finally, if the file is opened in "append" mode, the data will be added at the end of the file. If the file does not exist, an empty file will be created when it is opened.
testFileName = 'myData/petdata.csv';
file = client.hdfs_open(testFileName, 'a+');
Unlike write mode, the default settings in append mode do not add variable headers or row names to the file. You can, however, specify these same options in append mode.
file.writeTable(T,'WriteVariableNames', true, 'WriteRowNames', true)
file.writeTable(T, 'WriteRowNames', true)
file.read(@(x) readtable(x,'ReadRowNames', true))
Age | Weight | Height | |
---|---|---|---|
1 Sanchez | 38 | 176 | 71 |
2 Johnson | 43 | 163 | 69 |
3 Lee | 38 | 131 | 64 |
4 Diaz | 40 | 133 | 67 |
5 Brown | 49 | 119 | 64 |
6 Sanchez_1 | 38 | 176 | 71 |
7 Johnson_1 | 43 | 163 | 69 |
8 Lee_1 | 38 | 131 | 64 |
9 Diaz_1 | 40 | 133 | 67 |
10 Brown_1 | 49 | 119 | 64 |
Changing file permissions
It is possible to set the read/write/execute file permissions programmatically. You need to pass the file permissions as a three-digit octal number, as shown here: Chmod Calculator (chmod-calculator.com)
client.hdfs_set_permission( 'myData/petdata.csv', 777);
Enhancement requests
To report any bug or enhancement request, please submit a GitHub issue.
Cite As
Edu Benet Cerda (2024). MATLAB Interface for WebHDFS (https://github.com/mathworks/MATLAB-Interface-for-WebHDFS/releases/tag/v1.0.0), GitHub.