# [GPU] Why do GFLOPS of element-wise matrix operations (addition, multiplication) seem to scale poorly as compared to e.g. mtimes?

10 views (last 30 days)
D. Plotnick on 20 May 2019
Commented: Edric Ellis on 22 May 2019
Hello all,
BLUF: Why do element-wise operations on the GPU seem to scale more slowly than operations like mtimes; even if A+B runs in less absolute time than A*B, the relative speed does not seem to scale as I would expect from computational complexity. Thus, A*B is "faster" than A+B. Why?
I am trying to understand some Matlab speed bottlenecks on the GPU using the techniques I found here (Measuring GPU Performance) as well as GPUbench.
What I do not understand is why element-wise array operations appear slow as compared to mtimes and other benchmark measures using matrices of size (N x N). For example, for now I am going to assume that the number of floating point operations for element operations scales linearly with the total number of points (N^2), while for Mtimes I will use the standard (2*N.^3 - N.^ 2).
To benchmark these, I run the following:
sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesRtimes = inf(size(sizes));
mmTimesMtimes = inf(size(sizes));
for ii=1:numel(sizes)
ii
A = single(rand( N(ii), N(ii) ));
B = single(rand( N(ii), N(ii) ));
A = gpuArray(A);
B = gpuArray(B);
mmTimesRtimes(ii) = gputimeit(@() A.*B);
mmTimesMtimes(ii) = gputimeit(@() A*B);
end
semilogx(sizes,sizes./mmTimesRtimes/1E9, 'b.-',sizes,(2*N.^3 - N.^ 2)./mmTimesMtimes/1E9, 'r.-',...
grid on
title('Single precision matrix-matrix rtimes')
xlabel('Matrix size (numel)')
ylabel('Calculation Rate (GFLOPS)')
legend('Rtimes', 'Mtimes', '+','Location', 'NorthWest')
In the resulting plot, I get a maximum ~= 7000 GFLOPS for mtimes (A*B), which jives with GPUbench ...the element-wise operations are at the very bottom of the scale.
I can see them if I set
ylim([0 20])
and see that (assuming my scaling is right) I peg out at 20 GFLOPS which seems insanely slow. I want to know why; I can only think of four (good) options:
1. I have misunderstood how the number of FLOPs scales with the number of elements involved. I don't see how this could be true for element-wise operations like (+). Although I could be missing a scale factor, I do not believe it is 300 as the above speeds suggest, and I don't see why the number of operations should scale more than linearly.
2. There is something inherently "faster" for mtimes than for (+).
3. There is something different in the Matlab implementation of these functions on the GPU, such that for memory or other reasons mtimes has been optimized, but (+) and A.*B have not.
4. There is some implementation issue on my end that I am missing (I did try converting A and B to colums rather than matrices, no dice).
There are probably other things I am not thinking about too, but at the end of the day I want to know:
• why element-wise operations seem so much slower than it seems they should be?
• Is there a way to improve the speed/scaling of these operations? (maybe using the new GPU coder, custom CUDA code, changing the properties of my gpuDevice, or batching my data in a certain way).
Cheers,
-DP

Edric Ellis on 21 May 2019
Edited: Edric Ellis on 22 May 2019
The main factor here is that MTIMES (i.e. matrix-matrix multiplication) is compute bound, where as PLUS and TIMES (element-wise operations) are memory bound. This means that for both PLUS and TIMES, the memory system on the GPU simply cannot feed the CUDA cores quickly enough for them to be limited by the amount of floating-point operations they need to perform. You can read more about this here: https://en.wikipedia.org/wiki/Roofline_model#Arithmetic_intensity . For a GPU, the memory transfer rate from the main memory to the computational cores is slow enough that the computational cores need to be performing ~50-100 FLOPs per element (roughly... it changes a lot based on the GPU generation - and it's getting worse - modern GPUs need to perform way more FLOPs per element to be compute-bound).
I modified your example a bit to show the limiting factor for PLUS and TIMES, which is the device memory bandwidth (this was run on my somewhat old Tesla K20c, using R2019a)
sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesRtimes = inf(size(sizes));
mmTimesMtimes = inf(size(sizes));
for ii=1:numel(sizes)
A = rand(N(ii), 'single', 'gpuArray');
B = rand(N(ii), 'single', 'gpuArray');
mmTimesRtimes(ii) = gputimeit(@() A.*B);
mmTimesMtimes(ii) = gputimeit(@() A*B);
end
%% Compute-bound - MTIMES
subplot(2, 1, 1)
mtimesGflops = (2*N.^3 - N.^ 2)./mmTimesMtimes/1E9;
semilogx(sizes, mtimesGflops, 'r.-');
grid on
title('Single precision MTIMES')
xlabel('Matrix size (numel)')
ylabel('Calculation Rate (GFLOPS)')
legend('MTIMES','Location', 'NorthWest')
%% Memory bound - PLUS, TIMES
subplot(2, 1, 2)
arrayGB = 4 * sizes / 1e9;
semilogx(sizes, 3 .* arrayGB ./ mmTimesRtimes, 'b.-', ...
sizes, 3 .* arrayGB ./ mmTimesAddtimes, 'g.-');
grid on
title('Single precision TIMES and PLUS')
xlabel('Matrix size (numel)')
ylabel('Memory Bandwidth (GB/sec)')
legend('TIMES', 'PLUS','Location', 'NorthWest')

D. Plotnick on 21 May 2019
Thank you for the excellent answer. I had a sneaking suspicion that something like this was the case, but didn't know how to test it. Several of the speed tests make more sense now (esp. the read/write speed test).
This brings up one final question that understandably goes into the weeds:
To me, this bottleneck appears to be characteristic of the GPU itself, not of any implementation on my end or Matlab's end. Does this represent a fundamental bottleneck for element-wise operations, or can I still achieve significant performance gains by really digging into the CUDA code and memory allocation itself? For example, can I force Matlab to use the tensor cores from the Turing microarchitecture, which (afaik) are optimized for large matrix operations?
Edric Ellis on 22 May 2019
Yes, really this comes down to a fundamental feature of the GPU. For element-wise operations, the key is to try and perform as many as you can at once. The MATLAB gpuArray functionality already tries to do that to some extent, but you can force matters by using arrayfun. When you use arrayfun, the body of your function is converted into a form that can run natively on the GPU, and then the GPU has the opportunity to perform multiple computations on each element that it loads from the GPU's main memory.
(Note: I noticed I missed a factor of 3 in my memory bandwidth computations, now corrected - of course, for C = A + B, the total memory transfer is 3 times the size of A - 2 inputs and 1 output).