Main Content

This example looks at why it is so hard to give a concrete answer to the question "How will my (parallel) application perform on my multi-core machine or on my cluster?" The answer most commonly given is "It depends on your application as well as your hardware," and we will try to explain why this is all one can say without more information.

This example shows the contention for memory access that we can run into on a regular multi-core computer. This example illustrates the case of all the workers running on the same multi-core computer with only one CPU.

Related examples:

To simplify the problem at hand, we will benchmark the computer's ability to execute task parallel problems that do not involve disk IO. This allows us to ignore several factors that might affect parallel applications, such as:

Amount of inter-process communication

Network bandwidth and latency for inter-process communication

Process startup and shutdown times

Time to dispatch requests to the processes

Disk IO performance

This leaves us with only:

Time spent executing task parallel code

To understand why it is worthwhile to perform such a simple benchmark, consider the following example: If one person can fill one bucket of water, carry it some distance, empty it, and take it back to refill it in one minute, how long will it take two people to go the same round trip with one bucket each? This simple analogy closely reflects the task parallel benchmarks in this example. At first glance, it seems absurd that there should be any decrease in efficiency when two people are simultaneously doing the same thing as compared to one person.

If all things are perfect in our previous example, two people complete one loop with a bucket of water each in one minute. Each person quickly fills one bucket, carries the bucket over to the destination, empties it and walks back, and they make sure never to interfere or interrupt one another.

However, imagine that they have to fill the buckets from a single, small water hose. If they arrive at the hose at the same time, one would have to wait. This is one example of a **contention for a shared resource**. Maybe the two people don't need to simultaneously use the hose, and the hose therefore serves their needs; but if you have 10 people transporting a bucket each, some might always have to wait.

In our case, the water hose corresponds to the computer hardware that we use, in particular the access to the computer memory. If multiple programs are running simultaneously on one CPU core each, and they all need access to data that is stored in the computer's memory, some of the programs may have to wait because of limited memory bandwidth.

To take our analogy a bit further, imagine that we have contention at the hose when two people are carrying one bucket each, but then we change the task and ask them to carry the water quite a bit further away from the hose. When performing this modified task, the two people spend a larger proportion of their time doing work, i.e., walking with the buckets, and a smaller proportion of their time contending over the shared resource, the hose. They are therefore less likely to need the hose at the same time, so this modified task has a higher **parallel efficiency** than the original one.

In the case of the benchmarks in this example, this corresponds on the one hand to running programs that require lots of access to the computer's memory, but they perform very little work with the data once fetched. If, on the other hand, the programs perform lots of computations with the data, it becomes irrelevant how long it took to fetch the data, the computation time will overshadow the time spent waiting for access to the memory.

The predictability of the memory access of an algorithm also effects how contended the memory access will be. If the memory is accessed in a regular, predictable manner, we will not experience the same amount of contention as when memory is accessed in an irregular manner. This can be seen further below, where, for example, singular value decomposition calculations result in more contention than matrix multiplication.

The code shown in this example can be found in this function:

```
function paralleldemo_resource_bench
```

We will use the parallel pool to call our task parallel test functions, so we start by checking whether the pool is open. Note that if the parallel pool is using workers running on multiple computers, you may not experience any of the resource contention that we attempt to illustrate.

p = gcp; if isempty(p) error('pctexample:backslashbench:poolClosed', ... ['This example requires a parallel pool. ' ... 'Manually start a pool using the parpool command or set ' ... 'your parallel preferences to automatically start a pool.']); end poolSize = p.NumWorkers;

For our calculations, we create an input matrix large enough that it needs to be brought from the computer's memory onto the CPU each time it is processed. That is, we make it so large that we deliberately cause resource contention.

sz = 2048; m = rand(sz*sz, 1);

These computations are very simple: Repeatedly sum a single array. Since the computations are very lightweight, we expect to see resource contention when running multiple copies of this function simultaneously with a large input array.

function sumOp(m) s = 0; for itr = 1:100 % Repeat multiple times to get accurate timing s = s + sum(m); end end

The core of the timing function consists of a simple `spmd`

statement. Notice that we retain the minimum execution time observed for a given level of concurrency, `n`

. As stated at the beginning, we are benchmarking task parallel problems, measuring only the actual runtime. This means that we are not benchmarking the performance of MATLAB, the Parallel Computing Toolbox™, or the `spmd`

language construct. Rather, we are benchmarking the ability of our OS and hardware to simultaneously run multiple copies of a program.

function time = timingFcn(fcn, numConcurrent) time = zeros(1, length(numConcurrent)); for ind = 1:length(numConcurrent) % Invoke the function handle fcn concurrently on n different labs. % Store the minimum time it takes all to complete. n = numConcurrent(ind); spmd(n) tconcurrent = inf; for itr = 1:5 labBarrier; tic; % Measure only task parallel runtime. fcn(); tAllDone = gop(@max, toc); % Time for all to complete. tconcurrent = min(tconcurrent, tAllDone); end end time(ind) = tconcurrent{1}; clear tconcurrent itr tAllDone; if ind == 1 fprintf('Execution times: %f', time(ind)); else fprintf(', %f', time(ind)); end end fprintf('\n'); end

We measure how long it takes to simultaneously evaluate `n`

copies of a summation function for different values of `n`

. Since the summation calculations are so simple, we expect to encounter a resource contention when running multiples of these computations simultaneously on a multi-core CPU. Consequently, we expect it to take longer to evaluate the summation function when we are performing multiple such evaluations concurrently than it takes to execute a single such evaluation on an otherwise idle CPU.

tsum = timingFcn(@() sumOp(m), 1:poolSize);

Execution times: 0.367254, 0.381174, 0.395128, 0.421978

We now look at a more computationally intensive problem, that of calculating the FFT of our vector. Since FFT is more computationally intensive than summation, we expect that we will not see the same performance degradations when concurrently evaluating multiple calls to the FFT function as we saw with calls to the summation function.

function fftOp(m) for itr = 1:10 % Repeat a few times for accurate timing fft(m); end end tfft = timingFcn(@() fftOp(m), 1:poolSize);

Execution times: 1.078532, 1.196302, 1.358666, 1.570749

Finally, we look at an even more computationally intensive problem than either FFT or summation. The memory access in matrix multiplication is also very regular, so this problem therefore has the potential to be executed quite efficiently in parallel on a multi-core machine.

function multOp(m) m*m; %#ok<VUNUS> % No need to repeat for accurate timing. end m = reshape(m, sz, sz); tmtimes = timingFcn(@() multOp(m), 1:poolSize); clear m;

Execution times: 0.667003, 0.768181, 0.846016, 0.989242

We create a simple bar chart showing the speedup achieved by running multiple function invocations concurrently. This kind of graph shows the speedup with what is known as **weak scaling**. Weak scaling is where the number of processes/processors varies, and the problem size on each process/processor is fixed. This has the effect of increasing the total problem size as we increase the number of processes/processors. On the other hand, **strong scaling** is where the problem size is fixed and the number of processes/processors varies. The effect of this is that as we increase the number of processes/processors, the work done by each process/processor decreases.

allTimes = [tsum(:), tfft(:), tmtimes(:)]; % Normalize the execution times. efficiency = bsxfun(@rdivide, allTimes(1, :), allTimes); speedup = bsxfun(@times, efficiency, (1:length(tsum))'); fig = figure; ax = axes('parent', fig); bar(ax, speedup); legend(ax, 'vector sum', 'vector fft', 'matrix mult', ... 'Location', 'NorthWest') xlabel(ax, 'Number of concurrent processes'); ylabel(ax, 'Speedup') title(ax, ['Effect of number of concurrent processes on ', ... 'resource contention and speedup']);

Looking at the graph above, we can see that these simple problems scale very differently on the same computer. When we also look at the fact that other problems and different computers may show very different behavior, it should become clear why it is impossible to give a general answer to the question "How will my (parallel) application perform on my multi-core machine or on my cluster?" The answer to that question truly depends on the application and the hardware in question.

The resource contention does not depend only on the function that we are executing, but also on the size of the data that we process. To illustrate this, we measure the execution times of various functions with various sizes of input data. As before, we are benchmarking the ability of our hardware to perform these computations concurrently, and we are not benchmarking MATLAB or its algorithms. We use more functions than before so that we can investigate the effects of different memory access patterns as well as the effects of different data sizes.

szs = [128, 256, 512, 1024, 2048]; description = {'vector sum', 'vector fft', 'matrix mult', 'matrix LU', ... 'matrix SVD', 'matrix eig'};

We loop through the different data sizes and the functions we want to benchmark, and measure the speedup observed. We compare the sequential execution time to the time it takes to execute concurrently as many invocations as we have labs in the pool.

speedup = zeros(length(szs), length(description)); for i = 1:length(szs) sz = szs(i); fprintf('Using matrices of size %d-by-%d.\n', sz, sz); j = 1; for f = [{@sumOp; sz^2; 1}, {@fftOp; sz^2; 1}, {@multOp; sz; sz}, ... {@lu; sz; sz}, {@svd; sz; sz}, {@eig; sz; sz}] op = f{1}; nrows = f{2}; ncols = f{3}; m = rand(nrows, ncols); % Compare sequential execution to execution on all labs. tcurr = timingFcn(@() op(m), [1, poolSize]); speedup(i, j) = tcurr(1)/tcurr(2)*poolSize; j = j + 1; end end

Using matrices of size 128-by-128. Execution times: 0.001472, 0.001721 Execution times: 0.002756, 0.004069 Execution times: 0.000221, 0.000367 Execution times: 0.000200, 0.000369 Execution times: 0.001314, 0.002186 Execution times: 0.006565, 0.009958 Using matrices of size 256-by-256. Execution times: 0.005472, 0.005629 Execution times: 0.010400, 0.013956 Execution times: 0.002175, 0.002994 Execution times: 0.000801, 0.001370 Execution times: 0.008052, 0.009112 Execution times: 0.042912, 0.057383 Using matrices of size 512-by-512. Execution times: 0.021890, 0.022754 Execution times: 0.055563, 0.083730 Execution times: 0.011029, 0.017932 Execution times: 0.005655, 0.009090 Execution times: 0.052226, 0.067276 Execution times: 0.262720, 0.353336 Using matrices of size 1024-by-1024. Execution times: 0.090294, 0.110154 Execution times: 0.317712, 0.445605 Execution times: 0.097819, 0.131056 Execution times: 0.037662, 0.057474 Execution times: 0.392037, 0.937005 Execution times: 1.063107, 1.579232 Using matrices of size 2048-by-2048. Execution times: 0.365510, 0.422942 Execution times: 1.067788, 1.560758 Execution times: 0.667548, 0.980306 Execution times: 0.271354, 0.391217 Execution times: 4.111523, 7.820437 Execution times: 6.101292, 8.984251

When we look at the results, we have to keep in mind how a function interacts with the cache on a CPU. For small data sizes, we are always working out of the CPU cache for all these functions. In that case, we expect to see perfect speedup. When the input data is too large to fit into the CPU cache, we start seeing the performance degradation caused by contention for memory access. This happens in several ways, but for this example, the following are the most important:

The function performs a relatively small amount of computation with the data. The sum of a vector is a good example of that.

The function accesses data in small chunks or has irregular data access patterns. The following graph shows that with eigenvalue calculations (eig) and singular value decomposition (svd).

fig = figure; ax = axes('parent', fig); plot(ax, speedup); lines = ax.Children; set(lines, {'Marker'}, {'+', 'o', '*', 'x', 's', 'd'}'); hold(ax, 'on'); plot(ax, repmat(poolSize, 1, length(szs)), '--'); hold(ax, 'off'); legend(ax, [description, {'ideal speedup'}], 'Location', 'SouthWest'); ax.XTick = 1:length(szs); ax.XTickLabel = arrayfun(@(x) {sprintf('%d^2', x)}, szs); yl = ylim; ylim([0, max([poolSize + 0.1, yl(2)])]) xlabel(ax, 'Number of elements per process'); ylabel(ax, 'Speedup') title(ax, 'Effect of data size on resource contention and speedup');

```
end
```