Measure GPU Memory Bandwidth and Processing Power

This example shows how to measure some of the key performance characteristics of your GPU hardware.

GPUs can be used to speed up certain types of computations. However, GPU performance varies widely between different GPU devices. These three tests quantify the performance of a GPU:

How quickly can data be sent to the GPU or read back from it?
How fast can the GPU kernel read and write data?
How fast can the GPU perform computations in double and single precision?

After assessing these metrics, you can compare the performance of the GPU and the host CPU. This comparison indicated how much data or computation is required for the GPU to provide an advantage over the CPU.

Check GPU Setup

Check whether a GPU is available.

gpu = gpuDevice;
disp(gpu.Name + " GPU detected and available.")

NVIDIA RTX A5000 GPU detected and available.

Measure Host/GPU Bandwidth

The first test estimates how quickly data can be sent to and read from the GPU. Because the GPU is plugged into the PCI bus, the bandwidth largely depends on how fast the PCI bus is and how many other devices are using it. However, there are some overheads that are included in the measurements, particularly the time taken to call the send and read functions and allocate the arrays. Because these are present in any "real world" use of the GPU, it is reasonable to include these overheads.

To define the test parameters:

Create a variable representing the number of bytes required to store a double-precision number.
Create an vector of sizes, where the maximum size is 1/4 of the available GPU memory. The tests loop over this vector and create arrays of increasing size. As an array on the GPU in MATLAB® cannot have more than $2^{31} - 1$ elements, remove any sizes that will create arrays larger than this.

sizeOfDouble = 8;

maxSize = 0.25*gpu.AvailableMemory;
maxNumTests = 15;
sizes = logspace(4,log10(maxSize),maxNumTests);
sizes(sizes/sizeOfDouble > intmax) = [];

To measure the host/GPU bandwidth, for each array size in sizes:

Create random double-precision data on the host and on the GPU.
Use the gputimeit function to time allocating memory and sending data from the host to the GPU using the gpuArray function.
Use the timeit function to time allocating memory and sending data from the GPU to the host using the gather function.
Divide the amount of data sent by the measured time to determine the bandwidth.

numTests = numel(sizes);
numElements = floor(sizes/sizeOfDouble);
sendTimes = inf(1,numTests);
gatherTimes = inf(1,numTests);

for idx=1:numTests
    disp("Test " + idx + " of " + numTests + ". Timing send and gather for array with " + numElements(idx) + " elements.")

    % Generate random data on GPU and host.
    gpuData = randi([0 9],numElements(idx),1,"gpuArray");
    hostData = gather(gpuData);

    % Time sending data to GPU.
    sendFcn = @() gpuArray(hostData);
    sendTimes(idx) = gputimeit(sendFcn);

    % Time gathering data back from GPU.
    gatherFcn = @() gather(gpuData);
    gatherTimes(idx) = gputimeit(gatherFcn);
end

Test 1 of 15. Timing send and gather for array with 1250 elements.
Test 2 of 15. Timing send and gather for array with 3240 elements.
Test 3 of 15. Timing send and gather for array with 8398 elements.
Test 4 of 15. Timing send and gather for array with 21768 elements.
Test 5 of 15. Timing send and gather for array with 56425 elements.
Test 6 of 15. Timing send and gather for array with 146258 elements.
Test 7 of 15. Timing send and gather for array with 379107 elements.
Test 8 of 15. Timing send and gather for array with 982663 elements.
Test 9 of 15. Timing send and gather for array with 2547104 elements.
Test 10 of 15. Timing send and gather for array with 6602203 elements.
Test 11 of 15. Timing send and gather for array with 17113191 elements.
Test 12 of 15. Timing send and gather for array with 44358118 elements.
Test 13 of 15. Timing send and gather for array with 114978124 elements.
Test 14 of 15. Timing send and gather for array with 298028173 elements.
Test 15 of 15. Timing send and gather for array with 772501660 elements.

sendBandwidth = (sizes./sendTimes)/1e9;
gatherBandwidth = (sizes./gatherTimes)/1e9;

Determine the peak send and gather speeds. Note that the GPU used in this test supports PCI Express® version 4.0, which has a theoretical bandwidth of 1.97 GB/s per lane. For the 16-lane slots used by NVIDIA® compute cards this gives a theoretical bandwidth of 31.52 GB/s.

[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf("Achieved peak send speed of %.2f GB/s",maxSendBandwidth)

Achieved peak send speed of 10.13 GB/s

fprintf("Achieved peak gather speed of %.2f GB/s",maxGatherBandwidth)

Achieved peak gather speed of 4.28 GB/s

Plot the data transfer speeds against array size, and circle the peak for each case. With small data set sizes, overheads dominate. With larger amounts of data the PCI bus is the limiting factor.

figure
semilogx(sizes,sendBandwidth,MarkerIndices=maxSendIdx,Marker="o")
hold on
semilogx(sizes,gatherBandwidth,MarkerIndices=maxGatherIdx,Marker="o")
grid on
title("Data Transfer Bandwidth")
xlabel("Array size (bytes)")
ylabel("Transfer speed (GB/s)")
legend(["Send to GPU" "Gather from GPU"],Location="SouthEast")
hold off

Measure Read and Write Speed During Memory Intensive Operations

Many operations do very little computation with each element of an array, so these operations are dominated by the time taken to fetch the data from memory or to write it back. Functions such as ones, zeros, nan, and true only write their output, whereas functions like transpose and tril both read and write but do no computation. Even simple operators like plus and minus do so little computation per element that they are bound only by the memory access speed.

The function plus performs one memory read and one memory write for each floating-point operation. As such, the function is limited by memory access speed and provides a good indicator of the speed of a read plus write operation.

Reset your GPU to clear its memory of GPU arrays allocated in the previous section.

reset(gpu)

Create a vector of sizes.

sizes = logspace(4.5,log10(maxSize),maxNumTests);
sizes(sizes/sizeOfDouble > intmax) = [];

Measure the speed at which the GPU can read and write to GPU memory. Also measure the speed at which the host can read and write to host memory. For each array size in sizes:

Create random double-precision data on the GPU and on the host.
Use the gputimeit function to time the execution of the plus function on the GPU.
Use the timeit function to time the execution of the plus function on the host.
Divide the amount of data fetched and written by the measured time to determine the read and write bandwidth.

numTests = numel(sizes);
numElements = floor(sizes/sizeOfDouble);
memoryTimesGPU = inf(1,numTests);
memoryTimesHost = inf(1,numTests);

for idx=1:numTests
    disp("Test " + idx + " of " + numTests + ". Timing plus operation on GPU and CPU for arrays with " + numElements(idx) + " elements.")

    % Generate random data on GPU and host.
    gpuData = randi([0 9],numElements(idx),1,"gpuArray");
    hostData = gather(gpuData);

    % Time the plus function on GPU.
    plusFcn = @() plus(gpuData,1.0);
    memoryTimesGPU(idx) = gputimeit(plusFcn);

    % Time the plus function on host.
    plusFcn = @() plus(hostData,1.0);
    memoryTimesHost(idx) = timeit(plusFcn);
end

Test 1 of 15. Timing plus operation on GPU and CPU for arrays with 3952 elements.
Test 2 of 15. Timing plus operation on GPU and CPU for arrays with 9437 elements.
Test 3 of 15. Timing plus operation on GPU and CPU for arrays with 22530 elements.
Test 4 of 15. Timing plus operation on GPU and CPU for arrays with 53788 elements.
Test 5 of 15. Timing plus operation on GPU and CPU for arrays with 128416 elements.
Test 6 of 15. Timing plus operation on GPU and CPU for arrays with 306583 elements.
Test 7 of 15. Timing plus operation on GPU and CPU for arrays with 731942 elements.
Test 8 of 15. Timing plus operation on GPU and CPU for arrays with 1747449 elements.
Test 9 of 15. Timing plus operation on GPU and CPU for arrays with 4171886 elements.
Test 10 of 15. Timing plus operation on GPU and CPU for arrays with 9960023 elements.
Test 11 of 15. Timing plus operation on GPU and CPU for arrays with 23778702 elements.
Test 12 of 15. Timing plus operation on GPU and CPU for arrays with 56769618 elements.
Test 13 of 15. Timing plus operation on GPU and CPU for arrays with 135532606 elements.
Test 14 of 15. Timing plus operation on GPU and CPU for arrays with 323572501 elements.
Test 15 of 15. Timing plus operation on GPU and CPU for arrays with 772501660 elements.

memoryBandwidthGPU = 2*(sizes./memoryTimesGPU)/1e9;
memoryBandwidthHost = 2*(sizes./memoryTimesHost)/1e9;

Determine the peak read and write speeds.

[maxBWGPU,maxBWIdxGPU] = max(memoryBandwidthGPU);
[maxBWHost,maxBWIdxHost] = max(memoryBandwidthHost);
fprintf("Achieved peak read+write speed on the GPU: %.2f GB/s",maxBWGPU)

Achieved peak read+write speed on the GPU: 678.83 GB/s

fprintf("Achieved peak read+write speed on the host: %.2f GB/s",maxBWHost)

Achieved peak read+write speed on the host: 59.22 GB/s

Plot the read and write speeds against array size, and circle the peak for each case. Comparing this plot with the data-transfer plot above, it is clear that GPUs can typically read from and write to their memory much faster than they can get data from the host. So, it is important to minimize the number of host-GPU or GPU-host memory transfers. Ideally, programs should initially create data on the GPU. Otherwise, programs should transfer data to the GPU, then do as much as possible with the data while it is on the GPU before bringing the data back to the host only when complete.

figure
semilogx(sizes,memoryBandwidthGPU,MarkerIndices=maxBWIdxGPU,Marker="o")
hold on
semilogx(sizes,memoryBandwidthHost,MarkerIndices=maxBWIdxHost,Marker="o")

grid on
title("Read+Write Bandwidth")
xlabel("Array size (bytes)")
ylabel("Speed (GB/s)")
legend(["GPU" "Host"],Location="NorthWest")
hold off

Measure Processing Power During Computationally Intensive Operations

Memory speed is less important for operations where there are a high number of floating-point computations per element read from or written to memory. These operations are said to have high computational density. In this case, the number and speed of the floating-point units is the limiting factor.

A good test of computational performance is a matrix-matrix multiply. For multiplying two $N \times N$ matrices, the total number of floating-point calculations is

$F L O P (N) = 2 N^{3} - N^{2}$ .

Two input matrices are read and one resulting matrix is written, for a total of $3 N^{2}$ elements read or written. This gives a computational density of (2N - 1)/3 FLOP/element. Contrast this with the plus function as used above, which has a computational density of 1/2 FLOP/element.

Reset your GPU to clear its memory of GPU arrays allocated in the previous section and create a vector of sizes.

reset(gpu)
sizes = logspace(4,log10(maxSize)-1,maxNumTests);
sizes(sizes/sizeOfDouble > intmax) = [];

Double Precision

MATLAB can perform calculations in double precision or single precision. Calculating in single precision instead of double precision can improve the performance of code running on your GPU, as most GPU cards are designed for graphics display, which demands a high single-precision performance. In contrast, CPUs are designed for general purpose computing and so do not provide this improvement when switching from double to single precision. For more information on converting data to single precision and performing arithmetic operations on single-precision data, see Floating-Point Numbers. Typical examples of workflows suitable for single-precision computation on the GPU include image processing and machine learning. However, other types of calculation, such as linear algebra problems, typically require double-precision processing.

For an approximate measure of the relative performance of your GPU in single precision compared to double precision, query the SingleDoubleRatio property of your device. This property describes the ratio of single- to double-precision floating-point units (FPUs) on the device. Most desktop GPUs have 24, 32, or even 64 times as many single-precision floating-point units as double-precision. Some GPUs also contain specialized cores that accelerate common deep learning operations. For example, Ampere architecture and later NVIDIA data center GPUs (A100 and H100) include Tensor Cores that can accelerate double-precision matrix multiplication. For these data center GPUs, the SingleDoubleRatio property might not accurately represent the relative performance in single precision compared to double precision for matrix multiplication.

gpu.SingleDoubleRatio

ans = 
32

To measure the double-precision processing power, for each array size in sizes:

Create random double-precision data on the GPU and on the host.
Use the gputimeit function to time the execution of the plus function on the GPU in single and double precision.
Use the timeit function to time the execution of the plus function on the host in single and double precision.
Divide the amount of data fetched and written by the measured time to determine the read and write bandwidth.

numTests = numel(sizes);
NDouble = floor(sqrt(sizes/sizeOfDouble));
mmTimesHostDouble = inf(1,numTests);
mmTimesGPUDouble = inf(1,numTests);

for idx=1:numTests
    disp("Test " + idx + " of " + numTests + ". Timing double-precision matrix-matrix multiplication with " + NDouble(idx)^2 + " elements.")
    
    % Generate random data on GPU.
    A = rand(NDouble(idx),"gpuArray");
    B = rand(NDouble(idx),"gpuArray");
    
    % Time the matrix multiplication on GPU.
    mmTimesGPUDouble(idx) = gputimeit(@() A*B);

    % Gather the data and time matrix multiplication on the host.
    A = gather(A);
    B = gather(B);
    mmTimesHostDouble(idx) = timeit(@() A*B);
end

Test 1 of 15. Timing double-precision matrix-matrix multiplication with 1225 elements.
Test 2 of 15. Timing double-precision matrix-matrix multiplication with 2601 elements.
Test 3 of 15. Timing double-precision matrix-matrix multiplication with 5776 elements.
Test 4 of 15. Timing double-precision matrix-matrix multiplication with 12321 elements.
Test 5 of 15. Timing double-precision matrix-matrix multiplication with 26569 elements.
Test 6 of 15. Timing double-precision matrix-matrix multiplication with 57121 elements.
Test 7 of 15. Timing double-precision matrix-matrix multiplication with 123201 elements.
Test 8 of 15. Timing double-precision matrix-matrix multiplication with 265225 elements.
Test 9 of 15. Timing double-precision matrix-matrix multiplication with 570025 elements.
Test 10 of 15. Timing double-precision matrix-matrix multiplication with 1227664 elements.
Test 11 of 15. Timing double-precision matrix-matrix multiplication with 2637376 elements.
Test 12 of 15. Timing double-precision matrix-matrix multiplication with 5673924 elements.
Test 13 of 15. Timing double-precision matrix-matrix multiplication with 12201049 elements.
Test 14 of 15. Timing double-precision matrix-matrix multiplication with 26234884 elements.
Test 15 of 15. Timing double-precision matrix-matrix multiplication with 56415121 elements.

Determine the peak double-precision processing power.

mmFlopsHostDouble = (2*NDouble.^3 - NDouble.^2)./mmTimesHostDouble;
[maxFlopsHostDouble,maxFlopsHostDoubleIdx] = max(mmFlopsHostDouble);
mmFlopsGPUDouble = (2*NDouble.^3 - NDouble.^2)./mmTimesGPUDouble;
[maxFlopsGPUDouble,maxFlopsGPUDoubleIdx] = max(mmFlopsGPUDouble);
fprintf("Achieved peak double-precision processing power on the GPU: %.2f TFLOPS",maxFlopsGPUDouble/1e12)

Achieved peak double-precision calculation rate on the GPU: 0.41 TFLOPS

fprintf("Achieved peak double-precision processing power on the host: %.2f TFLOPS",maxFlopsHostDouble/1e12)

Achieved peak double-precision calculation rate on the host: 0.39 TFLOPS

Single Precision

You can convert data to single precision by using the single function, or by specifying single as the data type in a creation function such as zeros, ones, rand, and eye.

Measure the single-precision processing power. Generate single-precision random data using the rand function and specifying single as the data type.

NSingle = floor(sqrt(sizes/(sizeOfDouble/2)));
mmTimesHostSingle = inf(1,numTests);
mmTimesGPUSingle = inf(1,numTests);

for idx=1:numTests
    disp("Test " + idx + " of " + numTests + ". Timing single-precision matrix-matrix multiplication with " + NSingle(idx)^2 + " elements.")

    % Generate random, single-precision data on GPU.
    A = rand(NSingle(idx),"single","gpuArray");
    B = rand(NSingle(idx),"single","gpuArray");

    % Time matrix multiplication on GPU.
    mmTimesGPUSingle(idx) = gputimeit(@() A*B);

    % Gather the data and time matrix multiplication on the host.
    A = gather(A);
    B = gather(B);
    mmTimesHostSingle(idx) = timeit(@() A*B);
end

Test 1 of 15. Timing single-precision matrix-matrix multiplication with 2500 elements.
Test 2 of 15. Timing single-precision matrix-matrix multiplication with 5329 elements.
Test 3 of 15. Timing single-precision matrix-matrix multiplication with 11449 elements.
Test 4 of 15. Timing single-precision matrix-matrix multiplication with 24649 elements.
Test 5 of 15. Timing single-precision matrix-matrix multiplication with 53361 elements.
Test 6 of 15. Timing single-precision matrix-matrix multiplication with 114244 elements.
Test 7 of 15. Timing single-precision matrix-matrix multiplication with 247009 elements.
Test 8 of 15. Timing single-precision matrix-matrix multiplication with 529984 elements.
Test 9 of 15. Timing single-precision matrix-matrix multiplication with 1140624 elements.
Test 10 of 15. Timing single-precision matrix-matrix multiplication with 2455489 elements.
Test 11 of 15. Timing single-precision matrix-matrix multiplication with 5276209 elements.
Test 12 of 15. Timing single-precision matrix-matrix multiplication with 11350161 elements.
Test 13 of 15. Timing single-precision matrix-matrix multiplication with 24403600 elements.
Test 14 of 15. Timing single-precision matrix-matrix multiplication with 52475536 elements.
Test 15 of 15. Timing single-precision matrix-matrix multiplication with 112848129 elements.

Determine the peak single-precision processing power.

mmFlopsHostSingle = (2*NSingle.^3 - NSingle.^2)./mmTimesHostSingle;
[maxFlopsHostSingle,maxFlopsHostSingleIdx] = max(mmFlopsHostSingle);
mmFlopsGPUSingle = (2*NSingle.^3 - NSingle.^2)./mmTimesGPUSingle;
[maxFlopsGPUSingle,maxFlopsGPUSingleIdx] = max(mmFlopsGPUSingle);
fprintf("Achieved peak single-precision processing power on the GPU: %.2f TFLOPS",maxFlopsGPUSingle/1e12)

Achieved peak single-precision calculation rate on the GPU: 16.50 TFLOPS

fprintf("Achieved peak single-precision processing power on the host: %.2f TFLOPS",maxFlopsHostSingle/1e12)

Achieved peak single-precision calculation rate on the host: 0.83 TFLOPS

Plot the processing power against array size for double precision and single precision, and circle the peak for each case.

figure
loglog(NDouble.^2,mmFlopsGPUDouble,MarkerIndices=maxFlopsGPUDoubleIdx,Marker="o")
hold on
loglog(NSingle.^2,mmFlopsGPUSingle,MarkerIndices=maxFlopsGPUSingleIdx,Marker="o")
loglog(NDouble.^2,mmFlopsHostDouble,MarkerIndices=maxFlopsHostDoubleIdx,Marker="o")
loglog(NSingle.^2,mmFlopsHostSingle,MarkerIndices=maxFlopsHostSingleIdx,Marker="o")

grid on
title("Matrix-Matrix Multiply")
xlabel("Matrix size (numel)")
ylabel("processing power (FLOPS)")
legend(["GPU double" "GPU single" "Host double" "Host single"],Location="SouthEast")
hold off

Conclusions

These tests reveal some important characteristics of GPU performance:

Transfers from host memory to GPU memory and back are relatively slow.
The GPU can read and write its memory much faster than the host CPU can read and write its memory.
Given large enough data, a GPU can perform calculations faster than the host CPU.
GPUs perform calculations faster in single precision than double precision, and often much faster.

It is notable that, in each test, large arrays were required for the GPU to outperform the host CPU. GPUs provide the greatest advantage when working with millions of elements at once.

For more detailed GPU benchmarks, including comparisons between different GPUs, see GPUBench on the MATLAB Central File Exchange.