You can improve the performance of
parfor-loops in various ways.
These techniques include creating arrays in parallel inside the loop, profiling
parfor-loops, slicing arrays, and optimizing your code on local
workers before running on a cluster.
When you create a large array in the client before your
parfor-loop, and access it within the loop, you might observe
slow execution of your code. To improve performance, tell each MATLAB® worker to create its own arrays, or portions of them, in parallel. You
can save the time of transferring data from client to workers by asking each worker
to create its own copy of these arrays, in parallel, inside the loop. Consider
changing your usual practice of initializing variables before a
for-loop to avoid needless repetition inside the loop. You
might find that parallel creation of arrays inside the loop improves
performance. Performance improvement depends on several factors, including the:
size of the arrays
time needed to create arrays
worker access to all or part of the arrays
number of loop iterations that each worker performs
Consider all the factors in this list when you are deciding whether to convert
for-loops into parfor-loops. For more
details, see Convert for-Loops Into parfor-Loops.
As an alternative, consider using
parallel.pool.Constant to
establish variables on the pool workers before the loop. These variables remain on
the workers after the loop finishes, and remain available for multiple
parfor-loops. You might improve performance using
parallel.pool.Constant, because the data is transferred
only once to the workers.
In this example, you first create a large data set
D and repeat a computation that uses D inside a
parfor-loop. You then use D to build a
parallel.pool.Constant object, which allows you to reuse
the data by copying
D to each worker only once. Measure the elapsed time using tic and
toc for each case and note the difference.
function constantDemo
    D = rand(1e7, 1);
    tic
    for i = 1:20
        a = 0;
        parfor j = 1:60
            a = a + sum(D);
        end
    end
    toc

    tic
    D = parallel.pool.Constant(D);
    for i = 1:20
        b = 0;
        parfor j = 1:60
            b = b + sum(D.Value);
        end
    end
    toc
end
>> constantDemo
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Elapsed time is 63.839702 seconds.
Elapsed time is 10.194815 seconds.
You can profile a
parfor-loop by measuring the elapsed time using tic and
toc. You can also measure how
much data is transferred to and from the workers in the parallel pool by using
ticBytes and tocBytes. Note that this
is different from profiling MATLAB code in the usual sense using the MATLAB profiler; see Profile Your Code to Improve Performance.
This example calculates the spectral radius of a matrix and converts a
for-loop into a
parfor-loop. Measure the resulting speedup and the amount of transferred data.
In the MATLAB Editor, enter the following
for-loop. Add tic and
toc to measure the time
elapsed. Save the file as
MyForLoop.m.
function a = MyForLoop(A)
    tic
    for i = 1:200
        a(i) = max(abs(eig(rand(A))));
    end
    toc
end
Run the code, and note the elapsed time.
a = MyForLoop(500);
Elapsed time is 31.935373 seconds.
In MyForLoop.m, replace the
for-loop with a
parfor-loop. Add ticBytes and tocBytes to
measure how much data is transferred to and from the workers in the parallel
pool. Save the file as
MyParforLoop.m.
ticBytes(gcp);
parfor i = 1:200
    a(i) = max(abs(eig(rand(A))));
end
tocBytes(gcp)
Run the new code, and run it again. Note that the first run is slower than the second run, because the parallel pool has to be started and you have to make the code available to the workers. Note the elapsed time for the second run.
By default, MATLAB automatically opens a parallel pool of workers on your local machine.
a = MyParforLoop(500);
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
...
             BytesSentToWorkers    BytesReceivedFromWorkers
             __________________    ________________________
    1        15340                 7024
    2        13328                 5712
    3        13328                 5704
    4        13328                 5728
    Total    55324                 24168

Elapsed time is 10.760068 seconds.
If a variable is initialized before a
parfor-loop, then used inside the
parfor-loop, it has to be passed to each MATLAB worker evaluating the loop iterations. Only those variables used
inside the loop are passed from the client workspace. However, if all occurrences of
the variable are indexed by the loop variable, each worker receives only the part of
the array it needs.
As an example, you first run a
parfor-loop using a sliced
variable and measure the elapsed time.
% Sliced version
M = 100;
N = 1e6;
data = rand(M, N);

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ N;
end
toc
Elapsed time is 2.261504 seconds.
Now suppose that you accidentally use a reference to the variable
data instead of
N inside the
parfor-loop. The problem here is that the call to
size(data, 2) converts the sliced variable into a broadcast
variable.
% Accidentally non-sliced version
clear
M = 100;
N = 1e6;
data = rand(M, N);

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ size(data, 2);
end
toc
Elapsed time is 8.369071 seconds.
In this case, you can easily avoid the non-sliced usage of
data, because the result of size(data, 2) is a constant that can be computed
outside the loop. In general, you can perform computations that depend only on
broadcast data before the loop starts, because the broadcast data cannot be modified
inside the loop. Here the computation is trivial and yields a scalar, so moving it out
of the loop removes the unnecessary broadcast of data.
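As an illustrative fix, you can hoist the size call out of the loop so that data stays sliced. This is a sketch; the variable name nCols is not part of the original example.

```matlab
% Fixed version: compute the broadcast-dependent value once, before the loop.
% Inside the parfor-loop, data is then only indexed by the loop variable,
% so it remains a sliced variable and each worker receives only its rows.
clear
M = 100;
N = 1e6;
data = rand(M, N);
nCols = size(data, 2);    % illustrative helper, computed outside the loop

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ nCols;
end
toc
```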
Running your code on local workers offers the convenience of testing your application without requiring cluster resources. However, local workers have certain limitations. Because the transfer of data does not occur over a network, transfer behavior on local workers might not be representative of how it typically occurs over a network.
With local workers, because all the MATLAB worker sessions are running on the same machine, you might not see any
performance improvement from a
parfor-loop regarding execution
time. This can depend on many factors, including how many processors and cores your
machine has. The key point here is that a cluster might have more cores available
than your local machine. If your code can be multithreaded by MATLAB, then the only
way to go faster is to use more cores to work on the problem, using a cluster.
You might experiment to see if it is faster to create the arrays before the loop (as shown in the first example below), rather than have each worker create its own arrays inside the loop (as shown in the second example).
Try the following examples running a parallel pool locally, and notice the difference in execution time for each loop. First open a local parallel pool:
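A minimal way to open the pool, assuming the default 'local' profile:

```matlab
% Open a parallel pool of workers on the local machine.
% With no arguments, parpool uses the default profile ('local' here).
parpool('local')
```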
Run the following examples, then run them again. Note that the first run for each case is slower than the second, because the parallel pool has to be started and the code has to be made available to the workers. Note the elapsed time of the second run for each case.
tic;
n = 200;
M = magic(n);
R = rand(n);
parfor i = 1:n
    A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
tic;
n = 200;
parfor i = 1:n
    M = magic(n);
    R = rand(n);
    A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
Running on a remote cluster, you might find different behavior, as workers can simultaneously create their arrays, saving transfer time. Therefore, code that is optimized for local workers might not be optimized for cluster workers, and vice versa.