Why does my GPU not outperform my CPU/another GPU?

Accepted Answer

MathWorks Support Team on 13 Dec 2024
Edited: MathWorks Support Team on 23 Dec 2024
There are multiple factors which determine a GPU's performance. The headline number of cores a GPU has is not enough to accurately gauge performance.
To determine whether the code or the GPU is the primary issue, try the following (details for each step are below):
  1. Run a standard benchmark test.
  2. If the benchmark shows different behavior between the precision types, calculate the expected ratio between single and double precision for your GPU card.
  3. If the card is a laptop (mobile) card, adjust your performance expectations.
  4. If performance in the benchmarks is good, then the problem most likely originates in the code.
 
Standard GPU Benchmark:
gpuBench is a benchmarking tool written by the MathWorks Parallel Computing Team and available on the File Exchange.
It runs a variety of tests involving both memory-intensive and compute-intensive tasks, in both single and double precision. It also offers a comparison against a relatively normal display card and a reasonable compute card. The comparison results are matched to the version of MATLAB being used.
>> gpuBench
Comparing GPU Devices:
To answer this question you will need the GPU device specifications; for completeness, the CPU specifications can help as well. The key topics to consider when making the comparison are listed below.
 
1. Double vs Single Precision
Double-precision and single-precision performance can differ widely between graphics cards with the same total number of cores. The variation comes from how many of the cores are FP32 (single) units and how many are FP64 (double) units. Most GPUs are designed primarily for single-precision performance, since this is what graphics display demands. In comparison, CPUs typically do not show a comparable drop in performance for double precision. A full list of NVIDIA's GPUs with hardware statistics is a useful reference guide for information about any specific graphics card.
If NVIDIA has declared the double-precision performance of a card, it will be listed there. If double-precision performance is not listed, then even though the compute capability may be 1.3 or above (needed for double precision), double-precision performance is significantly lower: on the order of 32-64x slower than single precision.
To calculate the ratio between Single Precision and Double Precision:
  1. Find the GPU on the wiki page above.
  2. Get the stated single-precision and double-precision performance values from the table. (If there is no double-precision GFLOPS value, assume double precision is 32x to 64x slower.)
  3. Divide the stated single-precision GFLOPS by the double-precision GFLOPS. The result is how many times slower double precision is than single precision.
At the time of writing, a high-end compute card can get this ratio as low as 2x.
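The calculation in the steps above can be sketched in a few lines of MATLAB. The two GFLOPS figures below are hypothetical placeholders, not the specifications of any real card; replace them with the values from the specification table for your GPU.

```matlab
% Hypothetical values read from a GPU specification table (not a real card).
singleGFLOPS = 8000;   % stated single-precision peak, in GFLOPS
doubleGFLOPS = 250;    % stated double-precision peak, in GFLOPS (NaN if not listed)

if isnan(doubleGFLOPS)
    % No double-precision figure listed: fall back to the 32x-64x rule of thumb.
    ratio = [32 64];
    fprintf('Assume double precision is %dx to %dx slower than single.\n', ratio(1), ratio(2));
else
    % Ratio of stated peaks: how many times slower double is than single.
    ratio = singleGFLOPS / doubleGFLOPS;
    fprintf('Expect double precision to be about %gx slower than single.\n', ratio);
end
```

If the single versus double results reported by the benchmark differ by roughly this factor, the card is probably behaving as designed.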
2. Mobile vs Desktop
Is the graphics card inside a laptop? If yes, then it is highly likely that the card is a mobile graphics card. In many cases the name of such a card is suffixed with an M. Not every M means mobile, so cross-reference with the wiki page above to check definitively.
Mobile graphics cards are smaller and less powerful due to the heat and power restrictions their environment imposes on them. If using a mobile GPU for computation, speed expectations should be adjusted down.
3. Display vs Compute
Is the graphics card also acting as the display card? If there is only one graphics card in the machine and no on-board graphics, then this is likely the case. In this situation the operating system will commonly impose a kernel execution timeout. This is shown in the "gpuDevice" output as:
KernelExecutionTimeout: 1
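The same property can also be read programmatically. The following is a minimal sketch, assuming Parallel Computing Toolbox is installed; the variable name hasTimeout is illustrative only, and the gpuDeviceCount check is there so the snippet does not error on a machine without a supported GPU.

```matlab
hasTimeout = false;   % illustrative variable, set below if a GPU is found

if gpuDeviceCount > 0
    gpu = gpuDevice;  % query the currently selected GPU
    hasTimeout = logical(gpu.KernelExecutionTimeout);
    fprintf('Device: %s (compute capability %s)\n', gpu.Name, gpu.ComputeCapability);
    if hasTimeout
        disp('A kernel execution timeout is active; this GPU is probably also driving a display.');
    else
        disp('No kernel execution timeout is active on this GPU.');
    end
else
    disp('No supported GPU device was found.');
end
```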
A Kernel Execution Timeout's purpose is to make sure the OS is always able to print updates to the screen. If a computation on the GPU takes too much time then the operation will be killed. This tends to disrupt the CUDA environment for MATLAB and further use of the GPU by MATLAB (for either OpenGL or GPU computation) will require a restart of MATLAB. The following article has instructions on how to extend or disable this timeout period.
However, note that performance may be lowered even without hitting this timeout due to the need to share the resource with other programs. Where possible, do not use a GPU for both computation and display. NVIDIA's "nvidia-smi" utility can be used to monitor all the processes that are using each GPU card on your machine. 
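As a small sketch, "nvidia-smi" can be called from within MATLAB using the system function. This assumes the utility is installed and on the system path, which may not be true on every machine; the default report it prints includes a list of the processes currently using each GPU.

```matlab
% Run nvidia-smi from MATLAB and capture its text output.
[status, output] = system('nvidia-smi');

if status == 0
    disp(output);   % includes the processes currently using each GPU
else
    disp('nvidia-smi could not be run; check that it is installed and on the system path.');
end
```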
For Windows machines only:
It is possible to set a card to WDDM mode (for both display and compute) or TCC mode (compute only) using NVIDIA's "nvidia-smi" utility. Changing between these two modes requires the machine to be restarted. Note that TCC mode is not available on all GPU cards (for example, the GeForce family of cards does not have TCC mode).
4. Obtain GPU execution time measurements
For specific parts of a larger workflow that performs GPU computations, you can measure the execution time of particular functions using gputimeit, which runs a function multiple times to average out variation and compensate for overhead. The gputimeit function also ensures that all operations on the GPU are complete before recording the time. Execution time measurements obtained using gputimeit among different GPU cards are comparable only if obtained from the same machine and, ideally, from the same MATLAB process.
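As an illustration, the sketch below uses gputimeit to time the same matrix multiplication in single and double precision, and timeit to time it on the CPU. The matrix size n is an arbitrary assumption; choose a size that is representative of your own workload and that fits in GPU memory.

```matlab
n = 2000;            % arbitrary matrix size, chosen for illustration only
A = rand(n);         % double-precision matrix on the CPU
tCPU = timeit(@() A * A);
fprintf('CPU double: %.4f s\n', tCPU);

if gpuDeviceCount > 0
    As = rand(n, 'single', 'gpuArray');   % single-precision matrix on the GPU
    Ad = rand(n, 'double', 'gpuArray');   % double-precision matrix on the GPU

    % gputimeit waits for the GPU to finish before recording each time.
    tSingle = gputimeit(@() As * As);
    tDouble = gputimeit(@() Ad * Ad);

    fprintf('GPU single: %.4f s\n', tSingle);
    fprintf('GPU double: %.4f s\n', tDouble);
    fprintf('Measured double/single time ratio: %.1fx\n', tDouble / tSingle);
else
    disp('No supported GPU device was found.');
end
```

The measured double/single ratio for a compute-bound operation like this one can be compared with the ratio of the stated peak GFLOPS values for the card.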
Refer to the "Measure and Improve GPU Performance" documentation for more information.
