# Distributing individual tasks over my pool of workers, scaling should be perfect

4 views (last 30 days)
Alexis Bottero on 16 Nov 2020
Commented: Edric Ellis on 19 Nov 2020
Here is a parallel computing problem that should be trivial: distributing individual tasks (of equivalent duration) over a pool of workers.
parpool(4) % Open a pool of 4 workers
Ntasks = 4; % Number of tasks to compute (chosen equal to the number of workers for simplicity)
N = 2000; % Big number (not too big, of course, to avoid memory issues)
%% First strategy: serial computing
big_matrix_or_whatever = zeros(N,N);
tic
% norm(pinv(...)) is a dummy example of a costly task
complicated_thing_to_compute(i) = norm(pinv(big_matrix_or_whatever));
end
toc % ~ 2.6s
%% Second strategy: parfor
big_matrix_or_whatever = zeros(N,N);
tic
% norm(pinv(...)) is a dummy example of a costly task
complicated_thing_to_compute(i) = norm(pinv(big_matrix_or_whatever));
end
toc % ~ 8s
%% Third stategy: drange
big_matrix_or_whatever = zeros(N,N);
tic
% norm(pinv(...)) is a dummy example of a costly task
complicated_thing_to_compute(i) = norm(pinv(big_matrix_or_whatever));
end
res = gather(complicated_thing_to_compute, 1);
toc % ~ 2.7s
%% Correct strategy that runs in ~ 0.7 s? Maybe with spmd?
Note that this is a trivial problem to tackle with MPI: each proc receive a list of indices 'i' (just one per proc in this example) and do its thing independently.
Coded in MPI (I've done it in Python, C and Fortran) the scaling is perfect: computational time is exactly inversely proportional to the number of workers. There is absolutely no reason why it should be any different. Of course, in the end (if needed) everybody have to send their results to everybody and this can take a bit of additional time. This must be negligible in this specific example though (procs just exchange a few floats)

Walter Roberson on 16 Nov 2020
Note that this is a trivial problem do tackle with MPI, each proc receive a list of indices 'i' (just one per proc in this example) and do its thing independently.
Coded in MPI (I've done it in Python, C and Fortran) the scaling is perfect: computational time is exactly inversely proportional to the number of workers.
You are assuming that each thread is taking the same amount of time (to within statistical noise) so that you can pre-determine the indices to send for processing. parfor() does not assume that: it allocates roughly 2/3 of the tasks the first time around, then it allocates most of the rest in fixed-sized (but smaller) chunks, and then it runs down the remaining tasks individually. This way, if some threads take different lengths of time, CPUs do not go idle until there is no remaining work for them.
Anyhow, what you are probably looking for is threads-based parfor, which is available as of R2020b, the release after what you have.
Walter Roberson on 17 Nov 2020
I indicated that spmd is MPI based. It is, but remember that that is passing messages between processes without any certainty that you can use shared memory even for the message passing. Shared memory for message passing is an optimization as far as MPI is concerned. And you have to get the big array to the processes, which again is not certain to use shared memory.
How does MATLAB transfer initial data to the workers for the parallel technologies that are not spmd? That is not specified, and might be different compared to transfer for spmd. It is specified that the process is equivalent to save() and transfer the byte stream and load(), which rules out shared memory of original objects but the data transfer itself could be different ways.
I did trace through the data transfer for parallel queues at one point. It involved using Java queues. I do not recall at the moment how the Java queues transferred data.
Anyhow, data transfer costs add up.

Edric Ellis on 17 Nov 2020
There are several confounding factors here that go some way at least to explaining your timings.
In your first example (the for loop running in the client), the call to MATLAB's pinv is intrinsically multithreaded - i.e. MATLAB automatically runs it on multiple threads. That means that all your computational resources are already being fully utilised in running the loop. There's nothing more to give from your computer, and the only way you're going to get a faster time is by getting more hardware.
This (somewhat) explains why your parfor loop is never going to be faster. The fact that it is slower is a little confusing. There are several possibilities here. The first thing to consider is: how many hardware threads your CPU supports. MATLAB's maxNumCompThreads can tell you that. If the answer is not 4, then you are probably not seeing the best performance. This is because (by default) parallel pool workers operate in "single computational thread" mode - so they will not intrinsically multi-thread the calls to pinv. There's overhead to getting your large-ish matrix to the workers, that will slow things down. Walter's comment about the parfor scheduling is not quite right in this specific case (the comment is correct in general) - when the loop bounds happens to be the same as the number of workers in the pool, then exactly one iteration is sent to each worker. So that aspect isn't slowing you down here. All that said, I am surprised at how poorly the parfor loop is performing in this case. I would have expected it to be probably slightly slower than the for loop.
Finally, your for-drange example isn't right - you're running on the client there, so that's why you're seeing the same time as the client. You need to put the loop inside an spmd block, i.e.
spmd
% do stuff
end
end
##### 2 CommentsShowHide 1 older comment
Edric Ellis on 19 Nov 2020
I'd like to focus on one piece of your comment there. You said:
if, of course, it is not possible to disable parallel implementation of pinv
The workers running your parfor loop do have the parallel implementation disabled (by default) - as I mentioned, with default settings, workers run in "single computational thread" mode.
So, I do think parfor is a viable option when you have e.g. 1000 pinv-type tasks. Here's a simple experiment I ran on the largest machine I could find around here - a 32-core machine.
% Launch a pool with one worker per "real" core.
% Input data
A = rand(2000);
% Run 10 times locally.
t = tic();
for idx = 1:10
norm(pinv(A));
end
serialTime1 = toc(t) / 10
t = tic();
norm(pinv(A));
end
On the machine I tried, I got 1.2 seconds for serialTime1, and 0.6 seconds for parallelTime1. This shows that on this machine, I can get twice as much work done per unit wall-clock time using parfor.

R2020a

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!