MATLAB Answers

How do I create deployed MATLAB® applications to run against Cloudera Spark™?


Accepted Answer

MathWorks Support Team on 14 Jun 2018
Edited: MathWorks Support Team on 14 Jun 2018
In R2016b, MATLAB Compiler supports running MATLAB applications as standalone executables against a Spark-enabled cluster. Deploying MATLAB applications against a Cloudera Spark distribution requires an alternate workflow that is not covered in the release documentation.
Deploying MATLAB applications against a Cloudera distribution of Spark requires a new wrapper type that can be generated using the mcc command. This wrapper type generates a jar file and a shell script that calls spark-submit. The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It supports both yarn-client mode and yarn-cluster mode.
MATLAB applications that use tall arrays or the MATLAB API for Spark can be deployed using this workflow.
Example 1: Deploy Tall Arrays to a Cloudera Spark Enabled Hadoop Cluster
This example shows you how to deploy a MATLAB application that uses tall arrays to a Cloudera Spark enabled Hadoop cluster. The application meanArrivalDemo.m computes the mean arrival delay from airline data. The inputs to the application are:
master—URL to the Spark cluster.
inputFile—the file containing the input data.
outputFile—the file containing the results of the computation.
Prerequisites:
  1. Install the MATLAB Runtime in the default location on the desktop. This example uses as the default location for the MATLAB Runtime.
  2. Install the MATLAB Runtime on every worker node.
  3. Copy the airlinesmall.csv from folder of your MATLAB install area into Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
If you don't have the MATLAB Runtime, you can download it from the website at: https://www.mathworks.com/products/compiler/matlab-runtime.html.
Procedure:
1. At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application meanArrivalDemo.m:
>> mcc -vCW 'Spark:meanArrivalDemoApp' meanArrivalDemo.m
or, if using Spark version 2:
>> mcc -vCW 'Spark:meanArrivalDemoApp, 2' meanArrivalDemo.m
This creates a jar file named meanArrivalDemoApp.jar and a shell script named run_meanArrivalDemoApp.sh.
Note: To use the shell script, the environment variables HADOOP_PREFIX, HADOOP_CONF_DIR, and SPARK_HOME must be set up correctly.
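Before running the generated script, it can help to confirm that these variables are visible to child processes. The following is a minimal sketch, not part of MATLAB Compiler: check_spark_env is a hypothetical helper name, and the checks it makes (each variable is exported, points to an existing directory, and spark-submit is executable under the bin directory of SPARK_HOME) are assumptions about what "set up correctly" means on a typical installation.

```shell
# check_spark_env: hypothetical helper, not shipped with MATLAB Compiler.
# Returns 0 only if every variable named in the note above is exported and
# points to an existing directory, and spark-submit exists under SPARK_HOME/bin.
check_spark_env() {
    status=0
    for name in HADOOP_PREFIX HADOOP_CONF_DIR SPARK_HOME; do
        # printenv only reports exported variables, which is what the
        # generated shell script and spark-submit will inherit.
        if ! value=$(printenv "$name"); then
            echo "$name is not exported" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    # Only look for spark-submit once SPARK_HOME itself has passed the checks.
    if [ "$status" -eq 0 ] && [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
        echo "spark-submit not found under $SPARK_HOME/bin" >&2
        status=1
    fi
    return "$status"
}
```

After defining the function in the terminal you plan to launch from, run check_spark_env. A nonzero exit status means at least one of the messages printed to standard error needs attention.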
2. You can execute the shell script in yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster.
The general syntax to execute the shell script is:
./run_meanArrivalDemoApp.sh <runtime install root> [Spark arguments] [Application arguments]
a. yarn-client mode
Run the following command from a Linux terminal:
$ ./run_meanArrivalDemoApp.sh \
/usr/local/MATLAB/MATLAB_Runtime/v91 \
yarn-client \
hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
To examine the result, enter the following from the MATLAB command prompt:
>> ds = datastore('hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult/*');
>> readall(ds)
b. yarn-cluster mode
Run the following command from a Linux terminal:
$ ./run_meanArrivalDemoApp.sh \
/usr/local/MATLAB/MATLAB_Runtime/v91 \
--deploy-mode cluster --master yarn yarn-cluster \
hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
In yarn-cluster mode, the driver runs on a worker node in the cluster, so standard output from the MATLAB function is not displayed on your desktop, and files written without an explicit location can end up anywhere. To avoid this, the example uses the write function to save the results explicitly to a particular location in HDFS.
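The two invocations above differ only in the Spark arguments that follow the runtime install root. As a small sketch, a helper can return those arguments by mode so that one command line serves both cases. spark_mode_args is a hypothetical name, and the argument strings are copied from the commands in this answer rather than taken from any other reference.

```shell
# spark_mode_args: hypothetical helper that prints the Spark arguments
# used in this answer for the requested deployment mode.
spark_mode_args() {
    case "$1" in
        client)  echo "yarn-client" ;;
        cluster) echo "--deploy-mode cluster --master yarn yarn-cluster" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Example use (left unquoted on purpose so sh/bash split the arguments):
#   ./run_meanArrivalDemoApp.sh /usr/local/MATLAB/MATLAB_Runtime/v91 \
#       $(spark_mode_args cluster) \
#       hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
#       hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
```

The unquoted command substitution relies on the word splitting that sh and bash perform by default. Other shells may need the arguments passed differently.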
Example 2: Deploy Applications Using the MATLAB API for Spark
This example shows you how to deploy a MATLAB application developed using the MATLAB API for Spark against a Cloudera Spark enabled Hadoop cluster. The application flightsByCarrierDemo.m computes the number of airline carrier types from airline data. The inputs to the application are:
master—URL to the Spark cluster.
inputFile—the file containing the input data.
Prerequisites:
  1. Install the MATLAB Runtime in the default location on the desktop. This example uses as the default location for the MATLAB Runtime.
  2. Install the MATLAB Runtime on every worker node.
  3. Copy the airlinesmall.csv from folder of your MATLAB install area into Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
If you don't have the MATLAB Runtime, you can download it from the website at: http://www.mathworks.com/products/compiler/mcr.
Procedure:
1. At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application flightsByCarrierDemo.m:
>> mcc -C -W 'Spark:flightsByCarrierDemoApp' flightsByCarrierDemo.m
This creates a jar file named flightsByCarrierDemoApp.jar and a shell script named run_flightsByCarrierDemoApp.sh.
2. You can execute the shell script in yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster. The results of the computation in both cases are saved to a text file on HDFS by calling the saveAsTextFile method on the RDD.
a. yarn-client mode
Run the following command from a Linux terminal:
$ ./run_flightsByCarrierDemoApp.sh \
/usr/local/MATLAB/MATLAB_Runtime/v91 \
yarn-client \
hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv
To examine the results, enter the following from a Linux terminal:
$ hadoop fs -cat flightsByCarrierResults/*
b. yarn-cluster mode
Run the following command from a Linux terminal:
$ ./run_flightsByCarrierDemoApp.sh \
/usr/local/MATLAB/MATLAB_Runtime/v91 \
--deploy-mode cluster --master yarn yarn-cluster \
hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv

More Answers (2)

Sambit Tripathy on 11 May 2018
Edited: Sambit Tripathy on 11 May 2018
I am getting this error
com.mathworks.toolbox.javabuilder.MWException: Cannot determine the Hadoop version. Verify that the HADOOP_PREFIX, HADOOP_HOME, or the MATLAB_HADOOP_INSTALL environment variable is set to the root of your Hadoop installation folder.
at com.mathworks.toolbox.javabuilder.internal.MWMCR.mclFeval(Native Method)
at com.mathworks.toolbox.javabuilder.internal.MWMCR.access$600(MWMCR.java:31)
at com.mathworks.toolbox.javabuilder.internal.MWMCR$6.mclFeval(MWMCR.java:882)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.mathworks.toolbox.javabuilder.internal.MWMCR$5.invoke(MWMCR.java:769)
at com.sun.proxy.$Proxy0.mclFeval(Unknown Source)
at com.mathworks.toolbox.javabuilder.internal.MWMCR.invoke(MWMCR.java:443)
at com.mathworks.mlspark.mlsubmit.MatlabSubmit$.main(MatlabSubmit.scala:104)
at com.mathworks.mlspark.mlsubmit.MatlabSubmit.main(MatlabSubmit.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have set HADOOP_PREFIX but I am still getting this error.
I am running this script on an edge node (driver) where the Hadoop binaries are already present. I am compiling and running on the same machine. This link, https://www.mathworks.com/help/matlab/import_export/read-remote-data.html, says:
"HDFS data on Hortonworks or Cloudera: If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes."
This is my case: I should not need to set the environment variables, because I am able to run the hadoop commands from the command line on this machine.

wei li on 18 Dec 2019
I'm trying to run this demo on MATLAB 2019 and CDH 5.13. I can submit the task to Spark, but it hangs for about 2 hours and then fails silently.
I used ps to figure out what was happening and found that the command /usr/local/MATLAB/MATLAB_Runtime/v97/bin/glnxa64/ctfserver gets stuck. Its last log entry is "createMatlabWorker". I need help. Thank you!
19/12/18 07:24:29 INFO spark.PackageLogger: InProcessDeployedWorkerFactory:createMatlabWorker
  1 Comment
wei li on 18 Dec 2019
The Spark version is 1.6.0 and the Hadoop version is 3.0. I also tried this demo with the same Spark and Hadoop versions without CDH, and it works fine.

Release

R2016b
