Hadoop cluster for mapreducer, mapreduce, and tall arrays
A parallel.cluster.Hadoop object provides access to a cluster for configuring mapreducer, mapreduce, and tall arrays.
A parallel.cluster.Hadoop object has the following properties.
| Property | Description |
| --- | --- |
| AdditionalPaths | Folders to add to the MATLAB search path of workers, specified as a character vector, string or string array, or cell array of character vectors |
| AttachedFiles | Files and folders that are sent to workers |
| AutoAttachFiles | Specifies whether to automatically attach files |
| ClusterMatlabRoot | Specifies the path to MATLAB for workers to use |
| HadoopConfigurationFile | Application configuration file to be given to Hadoop |
| HadoopInstallFolder | Installation location of Hadoop on the local machine |
| HadoopProperties | Map of name-value property pairs to be given to Hadoop |
| LicenseNumber | License number to use with online licensing |
| RequiresOnlineLicensing | Specify whether the cluster uses online licensing |
| SparkInstallFolder | Installation location of Spark on the local machine |
| SparkProperties | Map of name-value property pairs to be given to Spark |
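As a sketch of typical usage (the installation paths shown are hypothetical and depend on your system), you create the cluster object, set its properties, and pass it to mapreducer:

```matlab
% Create a Hadoop cluster object (paths below are hypothetical examples)
cluster = parallel.cluster.Hadoop;
cluster.HadoopInstallFolder = '/usr/local/hadoop';   % assumed install location
cluster.SparkInstallFolder  = '/usr/local/spark';    % assumed install location

% Use this cluster for subsequent mapreduce and tall array evaluation
mapreducer(cluster);
```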
When you offload computations to workers, any files that are required for computations on the client must also be available on the workers. By default, the client attempts to detect and attach such files automatically. To turn off automatic detection, set the AutoAttachFiles property to false. If automatic detection cannot find all the files, or if sending files from client to workers is slow, use one of the following approaches.
If the files are in a folder that is not accessible on the workers, set the
AttachedFiles property. The cluster copies each file you
specify from the client to workers.
If the files are in a folder that is accessible on the workers, you can set the
AdditionalPaths property instead. Use the
AdditionalPaths property to add paths to each worker's
MATLAB® search path and avoid copying files unnecessarily from the client to
the workers.
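A minimal sketch of both approaches (the file and folder names are hypothetical):

```matlab
cluster = parallel.cluster.Hadoop;

% Copy files from the client to every worker (hypothetical file names)
cluster.AttachedFiles = {'myFunction.m', 'helperData.mat'};

% Or, if the folder is already visible to the workers, add it to each
% worker's MATLAB search path instead of copying (hypothetical path)
cluster.AdditionalPaths = {'/shared/projects/analysis'};
```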
HadoopProperties allows you to override
configuration properties for Hadoop. See the list of properties in
the Hadoop® documentation.
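As a sketch, assuming HadoopProperties is indexed the same way as the SparkProperties examples later on this page (the Hadoop property name and value below are illustrative, not recommendations):

```matlab
cluster = parallel.cluster.Hadoop;

% Override a Hadoop configuration property; 'mapred.child.java.opts'
% and its value here are purely illustrative
cluster.HadoopProperties('mapred.child.java.opts') = '-Xmx2048m';
mapreducer(cluster);
```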
By default, the SparkInstallFolder property is set to the
SPARK_HOME environment variable. This property
is required for tall array evaluation on Hadoop (but not for mapreduce).
For a correctly configured cluster, you only need to set the installation
folder.
SparkProperties allows you to override
configuration properties for Spark. See the list of properties in
the Spark® documentation.
For further help, type:

```matlab
help parallel.cluster.Hadoop
```
Spark-enabled Hadoop clusters place limits on how much memory is available. You must adjust these limits to support your workflow.
The amount of data gathered to the client is limited by the Spark properties spark.driver.memory and spark.executor.memory.
The amount of data to gather from a single Spark task must fit in the memory specified by these properties. A single Spark task processes one block of data from HDFS, which is 128 MB of data by default. If you gather a tall array containing most of the original data, you must ensure these properties are set large enough to fit it.
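As an illustration, a hedged sketch of keeping the gathered result small instead of gathering raw data (the HDFS location and variable name are hypothetical):

```matlab
% Evaluate a tall array on the Hadoop cluster and gather only a small
% summary statistic, not the full data (file location is hypothetical)
ds = datastore('hdfs:///data/flights.csv');
tt = tall(ds);
avgDelay = mean(tt.ArrDelay, 'omitnan');   % deferred tall computation
result = gather(avgDelay);                 % small scalar result
```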
If these properties are set too small, you see an error like the following:

```
Error using tall/gather (line 50)
Out of memory; unable to gather a partition of size 300m from Spark.
Adjust the values of the Spark properties spark.driver.memory and
spark.executor.memory to fit this partition.
```
Adjust the properties either in the default settings of the cluster or
directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to
the SparkProperties property of the cluster. For example:

```matlab
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.driver.memory') = '2048m';
cluster.SparkProperties('spark.executor.memory') = '2048m';
mapreducer(cluster);
```
The amount of working memory for a MATLAB Worker is limited by the Spark property spark.yarn.executor.memoryOverhead.

By default, this is set to 2.5 GB. You typically need to increase this if you
use functions such as cellfun, or custom
datastores, to generate large amounts of data in one go. It is also advisable to
increase this if you come across lost or crashed Spark Executor processes.
You can adjust these properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:
```matlab
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.yarn.executor.memoryOverhead') = '4096m';
mapreducer(cluster);
```