
Use Tall Arrays on a Spark Enabled Hadoop Cluster

Creating and Using Tall Tables

This example shows how to modify a MATLAB® example of creating a tall table to run on a Spark® enabled Hadoop® cluster. You can use this tall table to create tall arrays and calculate statistical properties. You can develop code locally and then scale up to take advantage of the capabilities offered by Parallel Computing Toolbox™ and MATLAB Distributed Computing Server™, without having to rewrite your algorithm. See also Big Data Workflow Using Tall Arrays and Datastores and Configure a Hadoop Cluster (MATLAB Distributed Computing Server).

First, set environment variables and cluster properties as appropriate for your specific Spark enabled Hadoop cluster configuration. Ask your system administrator for the values of these and any other properties necessary for submitting jobs to your cluster.

setenv('HADOOP_HOME', '/path/to/hadoop/install');
setenv('SPARK_HOME', '/path/to/spark/install');
cluster = parallel.cluster.Hadoop;

% Optionally, if you want to control the exact number of workers:
cluster.SparkProperties('spark.executor.instances') = '16';

mapreducer(cluster);

Note

In the setup step, you use mapreducer to set the cluster execution environment. In the next step, you create a tall array. If you modify or delete the cluster execution environment after creating a tall array, then the tall array is invalid and you must recreate it.
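As a sketch of this behavior (the datastore name and file are illustrative, reusing the airlinesmall.csv example from below), a tall array created under one execution environment must be recreated after that environment changes:

```matlab
% Hypothetical sketch: a tall array is tied to the execution environment
% that was active when it was created.
ds = datastore('airlinesmall.csv');   % example datastore
tt = tall(ds);                        % valid under the current mapreducer
mapreducer(0);                        % changing the environment invalidates tt ...
tt = tall(ds);                        % ... so recreate the tall array afterward
```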

Note

If you want to develop in serial and not use local workers, enter the following command.

mapreducer(0);

After setting your environment variables and cluster properties, you can run the MATLAB tall table example on the Spark enabled Hadoop cluster instead of on your local machine. Create a datastore and convert it into a tall table. MATLAB automatically starts a Spark job to run subsequent calculations on the tall table.

ds = datastore('airlinesmall.csv');
varnames = {'ArrDelay', 'DepDelay'};
ds.SelectedVariableNames = varnames;
ds.TreatAsMissing = 'NA';

Create a tall table tt from the datastore.

tt = tall(ds)
Starting a Spark Job on the Hadoop cluster. This could take a few minutes ...done.

tt =

  M×2 tall table 

    ArrDelay    DepDelay
    ________    ________

     8          12      
     8           1      
    21          20      
    13          12      
     4          -1      
    59          63      
     3          -2      
    11          -1      
    :           :
    :           :

The display indicates that the number of rows, M, is not yet known. M is a placeholder until the calculation completes.
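If you need the actual row count, you can force evaluation with gather (a minimal sketch; note that this triggers a Spark computation on the cluster):

```matlab
% Evaluate the deferred size to obtain the real number of rows.
numrows = gather(size(tt,1));
```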

Extract the arrival delay ArrDelay from the tall table. This action creates a new tall array variable to use in subsequent calculations.

a = tt.ArrDelay;

You can specify a series of operations on your tall array, which are not executed until you call gather. Doing so allows you to batch up commands that might take a long time. As an example, calculate the mean and standard deviation of the arrival delay. Use these values to construct the upper and lower thresholds for delays that are within 1 standard deviation of the mean.

m = mean(a,'omitnan');
s = std(a,'omitnan');
one_sigma_bounds = [m-s m m+s];

Use gather to calculate one_sigma_bounds, and bring the answer into memory.

sig1 = gather(one_sigma_bounds)
Evaluating tall expression using the Spark Cluster:
Evaluation completed in 0 sec

sig1 =

  -23.4572    7.1201   37.6975

You can specify multiple inputs and outputs to gather if you want to evaluate several things at once. Doing so is faster than calling gather separately on each tall array. For example, calculate the minimum and maximum arrival delay.

[max_delay, min_delay] = gather(max(a),min(a))
Evaluating tall expression using the Spark Cluster:
- Pass 1 of 1: Completed in 1 sec
Evaluation completed in 1 sec

max_delay =

        1014

min_delay =

   -64

Note

These examples take longer to complete the first time you run them, while MATLAB starts on the cluster workers.

When you use tall arrays on a Spark enabled Hadoop cluster, compute resources from the Hadoop cluster are reserved for the lifetime of the mapreducer execution environment. To release these resources, delete the mapreducer:

delete(gcmr);

Alternatively, you can change to a different execution environment, for example:

mapreducer(0);
