
`spmd`, `parfor`, and `parfeval`

To run computations in parallel, you can use `parfor`, `parfeval`, `parfevalOnAll`, or `spmd`. Each construct relies on different parallel programming concepts. If you require workers to communicate throughout a computation, use `parfeval`, `parfevalOnAll`, or `spmd`.

- Use `parfeval` or `parfevalOnAll` if your code can be split into a set of tasks, where each task can depend on the output of other tasks.
- Use `spmd` if you require communication between workers during a computation.

Computations with `parfeval` are best represented as a graph, similar to a Kanban board with blocking. Generally, results are collected from workers after a computation is complete. You can collect results from execution of a `parfeval` operation by using `afterEach` or `afterAll`. You typically use the results in further calculations.
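As a rough illustration of the pattern described above, the following sketch schedules a few independent tasks with `parfeval` and then reacts to their results with `afterEach` and `afterAll`. The task function, the task count, and the variable names are placeholders chosen for this sketch, and it assumes that a parallel pool is available through `gcp`.

```matlab
% Minimal sketch: schedule independent tasks, then react to their results.
pool = gcp;                         % reuse the current pool, or start the default one
numTasks = 4;                       % illustrative task count (assumption)

futures(1:numTasks) = parallel.FevalFuture;
for k = 1:numTasks
    % Each task sums the integers 1..10*k and returns one output.
    futures(k) = parfeval(pool, @(s) sum(1:s), 1, 10*k);
end

% afterEach runs once per finished future; here it only prints the result.
printFuture = afterEach(futures, @(v) fprintf("Task result: %g\n", v), 0);

% afterAll runs once, after every future has finished. It receives the
% outputs of all futures concatenated into a single array.
totalFuture = afterAll(futures, @(v) sum(v), 1);
total = fetchOutputs(totalFuture);  % blocks until the reduction is done
```

Both `afterEach` and `afterAll` return futures themselves, so further steps can be chained onto them in the same way.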

Computations with `spmd` are best represented by a flowchart, similar to a waterfall workflow. A pool worker executing `spmd` statements is called a lab. Results can be collected from labs during a computation. Sometimes, labs must communicate with other labs before they can finish their computation.
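To make the idea of labs communicating mid-computation concrete, here is a small sketch in which each lab computes a local value and the labs then combine their values with `gplus`. It assumes an open parallel pool and a release in which the `lab*` functions used in this article (`labindex`, `gplus`) are available; the variable names are illustrative.

```matlab
% Minimal sketch: every lab contributes a value, then all labs exchange them.
spmd
    localValue = labindex^2;        % each lab computes its own piece
    total = gplus(localValue);      % communication step: every lab gets the sum
end

% On the client, variables assigned inside spmd are Composite objects
% with one entry per lab.
firstTotal = total{1};
numLabsUsed = numel(total);
expected = sum((1:numLabsUsed).^2); % what the labs should have agreed on
```

No lab can return from `gplus` until all labs have reached it, which is the kind of mid-computation dependency that `parfeval` tasks cannot express.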

If you are unsure, ask yourself the following: **within my communicating parallel code, can each computation be completed without any communication between workers?** If yes, use `parfeval`. Otherwise, use `spmd`.

When choosing between `parfor`, `parfeval`, and `spmd`, consider whether your calculation requires synchronization with the client.

`parfor` and `spmd` require synchronization, and therefore block you from running any new computations on the MATLAB® client. `parfeval` does not require synchronization, so the client is free to pursue other work.
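The difference in blocking behavior can be sketched as follows. The task and the client-side work are placeholders invented for this illustration, and the sketch assumes that a parallel pool is available through `gcp`.

```matlab
% Minimal sketch: parfeval returns immediately, so the client can keep working.
pool = gcp;

f = parfeval(pool, @(n) sum(1:n), 1, 1e6);  % returns a future without waiting

clientWork = cumsum(1:10);                  % the client does unrelated work meanwhile

result = fetchOutputs(f);                   % the client blocks only here, if at all
```

By contrast, a `parfor`-loop or an `spmd` block in the same position would not return control to the client until every worker had finished.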

`ProcessPool`

In this example, you compare how fast functions run on the client and on a `ProcessPool`. Some MATLAB functions make use of multithreading. Tasks that use these functions perform better on multiple threads than a single thread. Therefore, if you use these functions on a machine with many cores, a local cluster can perform worse than multithreading on the client.

The supporting function `clientFasterThanPool`, listed at the end of this example, returns `true` if multiple executions are performed faster on the client than in a `parfor`-loop. The syntax is the same as `parfeval`: use a function handle as the first argument, the number of outputs as the second argument, and then give all required arguments for the function.

First, create a local `ProcessPool`.

`p = parpool('local');`

Starting parallel pool (parpool) using the 'local' profile ... Connected to the parallel pool (number of workers: 6).

Check how fast the `eig` function runs by using the `clientFasterThanPool` supporting function. Create an anonymous function with `eig` to represent your function call.

[~, t_client, t_pool] = clientFasterThanPool(@(N) eig(randn(N)), 0, 500)

t_client = 22.6243

t_pool = 4.9334

The parallel pool computes the answer faster than the client. Divide `t_client` by `maxNumCompThreads` to find the time taken per thread on the client.

t_client/maxNumCompThreads

ans = 3.7707

Workers are single-threaded by default. The result indicates that the time taken per thread is similar on both the client and the pool, as the value of `t_pool` is roughly 1.3 times the value of `t_client/maxNumCompThreads`. The `eig` function does not benefit from multithreading.
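The per-thread comparison can be reproduced from the timings reported in this example. The sketch below hard-codes those timings and assumes six computational threads on the client, which is what the reported quotient implies; on your own machine, substitute the value of `maxNumCompThreads` and your measured timings.

```matlab
% Minimal sketch: compare pool time against client time per thread.
t_client = 22.6243;                 % client timing reported for eig
t_pool = 4.9334;                    % pool timing reported for eig
numThreads = 6;                     % assumption: value of maxNumCompThreads

perThread = t_client/numThreads;    % time per thread on the client
ratio = t_pool/perThread;           % near 1 suggests little multithreading benefit
```

A ratio close to 1 means that a single-threaded worker is about as productive as one client thread. A ratio well above 1 suggests that the function makes effective use of multithreading on the client.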

Next, check how fast the `lu` function runs by using the `clientFasterThanPool` supporting function.

[~, t_client, t_pool] = clientFasterThanPool(@(N) lu(randn(N)), 0, 500)

t_client = 1.0225

t_pool = 0.4693

The parallel pool typically computes the answer faster than the client if your local machine has four or more cores. Divide `t_client` by `maxNumCompThreads` to find the time taken per thread.

t_client/maxNumCompThreads

ans = 0.1704

This result indicates that the time taken per thread is much less on the client than the pool, as the value of `t_pool` is roughly 3 times the value of `t_client/maxNumCompThreads`. Each thread is used for less computational time, indicating that `lu` uses multithreading.

**Define Helper Function**

The supporting function `clientFasterThanPool` checks whether a computation is faster on the client than on a parallel pool. It takes as input a function handle `fcn`, the number of outputs `numout`, and a variable number of input arguments (`in1, in2, ...`). `clientFasterThanPool` executes `fcn(in1, in2, ...)` on both the client and the active parallel pool. As an example, if you wish to test `rand(500)`, your function handle must be in the following form:

fcn = @(N) rand(N);

Then, use `clientFasterThanPool(fcn,0,500)`, where the second argument is the number of outputs to request from `fcn`.

    function [result, t_multi, t_single] = clientFasterThanPool(fcn,numout,varargin)
        % Preallocate cell array for outputs
        outputs = cell(1,numout);

        % Client: run fcn 200 times using the client's threads
        tic
        for i = 1:200
            if numout == 0
                fcn(varargin{:});
            else
                [outputs{1:numout}] = fcn(varargin{:});
            end
        end
        t_multi = toc;

        % Parallel pool: send the inputs to the workers once, then run fcn 200 times
        vararginC = parallel.pool.Constant(varargin);
        tic
        parfor i = 1:200
            % Preallocate cell array for outputs
            outputs = cell(1,numout);
            if numout == 0
                fcn(vararginC.Value{:});
            else
                [outputs{1:numout}] = fcn(vararginC.Value{:});
            end
        end
        t_single = toc;

        % If multithreading is quicker, return true
        result = t_single > t_multi;
    end

`parfor`, `parfeval`, and `spmd`

Using `spmd` can be slower or faster than using `parfor`-loops or `parfeval`, depending on the type of computation. Overhead affects the relative performance of `parfor`-loops, `parfeval`, and `spmd`.

For a set of tasks, `parfor` and `parfeval` typically perform better than `spmd` under these conditions:

- The computational time taken per task is not deterministic.
- The computational time taken per task is not uniform.
- The data returned from each task is small.
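The conditions listed above can be illustrated with a sketch in which tasks of deliberately uneven size are scheduled with `parfeval` and collected with `fetchNext` as they finish. The task sizes and the task function are invented for this illustration, and the sketch assumes that a parallel pool is available through `gcp`.

```matlab
% Minimal sketch: uneven tasks, collected in whatever order they finish.
pool = gcp;
sizes = [4e6 1e5 2e6 5e5];          % hypothetical, deliberately non-uniform workloads

futures(1:numel(sizes)) = parallel.FevalFuture;
for k = 1:numel(sizes)
    futures(k) = parfeval(pool, @(n) sum(1:n), 1, sizes(k));
end

results = zeros(1, numel(sizes));
for k = 1:numel(sizes)
    % fetchNext returns the index and output of the next unread future
    % to finish, which need not be the next one in submission order.
    [idx, value] = fetchNext(futures);
    results(idx) = value;
end
```

Because each task is scheduled independently, a worker that finishes a short task can pick up the next one immediately. In an `spmd` block, all labs typically proceed in step with each other, so uneven workloads leave some labs idle.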

Use `parfeval` when:

- You want to run computations in the background.
- Each task is dependent on other tasks.

In this example, you examine the speed at which matrix operations can be performed when using a `parfor`-loop, `parfeval`, and `spmd`.

First, create a local parallel pool `p`.

`p = parpool('local');`

Starting parallel pool (parpool) using the 'local' profile ... Connected to the parallel pool (number of workers: 6).

**Compute Random Matrices**

Examine the speed at which random matrices can be generated by using a `parfor`-loop, `parfeval`, and `spmd`. Set the number of trials (`n`) and the matrix size (for an `m`-by-`m` matrix). Increasing the number of trials improves the statistics used in later analysis, but does not affect the calculation itself.

m = 1000; n = 20;

Then, use a `parfor`-loop to execute `rand(m)` once for each worker. Time each of the `n` trials.

    parforTime = zeros(n,1);
    for i = 1:n
        tic;
        mats = cell(1,p.NumWorkers);
        parfor N = 1:p.NumWorkers
            mats{N} = rand(m);
        end
        parforTime(i) = toc;
    end

Next, use `parfeval` to execute `rand(m)` once for each worker. Time each of the `n` trials.

    parfevalTime = zeros(n,1);
    for i = 1:n
        tic;
        f(1:p.NumWorkers) = parallel.FevalFuture;
        for N = 1:p.NumWorkers
            f(N) = parfeval(@rand,1,m);
        end
        mats = fetchOutputs(f, "UniformOutput", false)';
        parfevalTime(i) = toc;
        clear f
    end

Finally, use `spmd` to execute `rand(m)` once for each lab. For details on labs and how to execute commands on them with `spmd`, see Run Single Programs on Multiple Data Sets. Time each of the `n` trials.

    spmdTime = zeros(n,1);
    for i = 1:n
        tic;
        spmd
            e = rand(m);
        end
        % Copy the matrix from each lab back to the client
        eigenvals = {e{:}};
        spmdTime(i) = toc;
    end

Use `rmoutliers` to remove the outliers from each of the trials. Then, use `boxplot` to compare the times.

    % Hide outliers
    boxData = rmoutliers([parforTime parfevalTime spmdTime]);
    % Plot data
    boxplot(boxData, 'labels',{'parfor','parfeval','spmd'}, 'Symbol','')
    ylabel('Time (seconds)')
    title('Make n random matrices (m by m)')

Typically, `spmd` requires more overhead per evaluation than `parfor` or `parfeval`. Therefore, in this case, using a `parfor`-loop or `parfeval` is more efficient.

**Compute Sum of Random Matrices**

Next, compute the sum of random matrices. You can do this by using a reduction variable with a `parfor`-loop, a sum after computations with `parfeval`, or `gplus` with `spmd`. Again, set the number of trials (`n`) and the matrix size (for an `m`-by-`m` matrix).

m = 1000; n = 20;

Then, use a `parfor`-loop to execute `rand(m)` once for each worker. Compute the sum with a reduction variable. Time each of the `n` trials.

    parforTime = zeros(n,1);
    for i = 1:n
        tic;
        result = 0;
        parfor N = 1:p.NumWorkers
            result = result + rand(m);
        end
        parforTime(i) = toc;
    end

Next, use `parfeval` to execute `rand(m)` once for each worker. Use `fetchOutputs` on all of the matrices, then use `sum`. Time each of the `n` trials.

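    parfevalTime = zeros(n,1);
    for i = 1:n
        tic;
        f(1:p.NumWorkers) = parallel.FevalFuture;
        for N = 1:p.NumWorkers
            f(N) = parfeval(@rand,1,m);
        end
        % Fetch each matrix separately, then add them element-wise so that
        % the result is m-by-m, matching the parfor and spmd versions
        mats = fetchOutputs(f, "UniformOutput", false);
        result = sum(cat(3, mats{:}), 3);
        parfevalTime(i) = toc;
        clear f
    end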

Finally, use `spmd` to execute `rand(m)` once for each lab. Use `gplus` to sum all of the matrices. To send the result only to the first lab, set the optional `targetlab` argument to `1`. Time each of the `n` trials.

    spmdTime = zeros(n,1);
    for i = 1:n
        tic;
        spmd
            r = gplus(rand(m), 1);
        end
        result = r{1};
        spmdTime(i) = toc;
    end

Use `rmoutliers` to remove the outliers from each of the trials. Then, use `boxplot` to compare the times.

    % Hide outliers
    boxData = rmoutliers([parforTime parfevalTime spmdTime]);
    % Plot data
    boxplot(boxData, 'labels',{'parfor','parfeval','spmd'}, 'Symbol','')
    ylabel('Time (seconds)')
    title('Sum of n random matrices (m by m)')

For this calculation, `spmd` is significantly faster than a `parfor`-loop or `parfeval`. When you use reduction variables in a `parfor`-loop, you broadcast the result of each iteration of the `parfor`-loop to all of the workers. By contrast, `spmd` calls `gplus` only once to do a global reduction operation, requiring less overhead. As such, the overhead for the reduction part of the calculation is $$O(n^{2})$$ for `spmd`, and $$O(mn^{2})$$ for `parfor`.