The purpose of GPU computing in MATLAB^{®} is to speed up your
applications. This topic discusses fundamental concepts and practices
that can help you achieve better performance on the GPU, such as the
configuration of the GPU hardware and best practices within your code.
It discusses the trade-off between implementation difficulty and performance,
and describes the criteria you might use to choose between using gpuArray
functions, `arrayfun`

, MEX-files, or CUDA kernels.
Finally, it describes how to accurately measure performance on the
GPU.

When converting MATLAB code to run on the GPU, it is best to start with MATLAB code that already performs well. While the GPU and CPU have different performance characteristics, the general guidelines for writing good MATLAB code also help you write good MATLAB code for the GPU. The first step is almost always to profile your CPU code. The lines of code that the profiler shows taking the most time on the CPU will likely be ones that you must concentrate on when you code for the GPU.

It is easiest to start converting your code using MATLAB built-in
functions that support gpuArray data. These functions take gpuArray
inputs, perform calculations on the GPU, and return gpuArray outputs.
A list of the MATLAB functions that support gpuArray data is found
in Run Built-In Functions on a GPU.
In general these functions support the same arguments and data types
as standard MATLAB functions that are calculated in the CPU. Any limitations
in these overloaded functions for gpuArrays are described in their
command-line help (e.g., `help gpuArray/qr`

).

If all the functions that you want to use are supported on the
GPU, running code on the GPU may be as simple as calling `gpuArray`

to
transfer input data to the GPU, and calling `gather`

to
retrieve the output data from the GPU when finished. In many cases,
you might need to vectorize your code, replacing looped scalar operations
with MATLAB matrix and vector operations. While vectorizing is generally
a good practice on the CPU, it is usually critical for achieving high
performance on the GPU. For more information, see Vectorize for Improved GPU Performance.

It is possible that even after converting inputs to gpuArrays
and vectorizing your code, there are operations in your algorithm
that are either not built-in functions, or that are not fast enough
to meet your application's requirements. In such situations
you have three main options: use `arrayfun`

to precompile
element-wise parts of your application, make use of GPU library functions,
or write a custom CUDA kernel.

If you have a purely element-wise function, you can improve
its performance by calling it with `arrayfun`

. The `arrayfun`

function
on the GPU turns an element-wise MATLAB function into a custom CUDA
kernel, thus reducing the overhead of performing the operation. Often,
there is a subset of your application that can be used with `arrayfun`

even
if the entire application cannot be. The example Improve
Performance of Element-wise MATLAB Functions on the GPU using ARRAYFUN shows
the basic concepts of this approach; and the example Using
ARRAYFUN for Monte-Carlo Simulations shows how this can be
done in simulations for a finance application.

MATLAB provides an extensive library of GPU-enabled functions in Parallel Computing Toolbox™, Image Processing Toolbox™, Signal Processing Toolbox™, and other products. However, there are many libraries of additional functions that do not have direct built-in analogs in MATLAB's GPU support. Examples include the NVIDIA Performance Primitives library and the CURAND library, which are included in the CUDA toolkit that ships with MATLAB. If you need to call a function in one of these libraries, you can do so using the GPU MEX interface. This interface allows you to extract the pointers to the device data from MATLAB gpuArrays so that you can pass these pointers to GPU functions. You can convert the returned values into gpuArrays for return to MATLAB. For more information see Run MEX-Functions Containing CUDA Code.

Finally, you have the option of writing a custom CUDA kernel for the operation that you need. Such kernels can be directly integrated into MATLAB using the CUDAKernel object.

The example Illustrating
Three Approaches to GPU Computing: The Mandelbrot Set shows
how to implement a simple calculation using three of the approaches
mentioned in this section. This example begins with MATLAB code that
is easily converted to run on the GPU, rewrites the code to use `arrayfun`

for
element-wise operations, and finally shows how to integrate a custom
CUDA kernel for the same operation.

Alternately, you can write a CUDA kernel as part of a MEX-file and call it using the CUDA Runtime API inside the MEX-file. Either of these approaches might let you work with low-level features of the GPU, such as shared memory and texture memory, that are not directly available in MATLAB code. For more details, see the example Accessing Advanced CUDA Features Using MEX.

In general you can achieve the best performance when your GPU is dedicated to computing. It is usually not practical to use the same GPU device for both computations and graphics, because of the amount of memory taken up for problems of reasonable size and the constant use of the device by the system for graphics. If possible, obtain a separate device for graphics. Details of configuring your device for compute or graphics depend on the operating system and driver version.

On Windows^{®} systems, a GPU device can be in one of two modes:
Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC)
mode. For best performance, any devices used for computing should
be in TCC mode. Consult NVIDIA^{®} documentation for more details.

NVIDIA's highest-performance compute devices, the Tesla line, support error correcting codes (ECC) when reading and writing GPU memory. The purpose of ECC is to correct for occasional bit-errors that occur normally when reading or writing dynamic memory. One technique to improve performance is to turn off ECC to increase the achievable memory bandwidth. While the hardware can be configured this way, MathWorks does not recommend this practice. The potential loss of accuracy due to silent errors can be more harmful than the performance benefit.

This topic describes general techniques that help you achieve better performance on the GPU. Some of these tips apply when writing MATLAB code for the CPU as well.

Data in MATLAB arrays is stored in column-major order. Therefore, it is beneficial to operate along the first or column dimension of your array. If one dimension of your data is significantly longer than others, you might achieve better performance if you make that the first dimension. Similarly, if you frequently operate along a particular dimension, it is usually best to have it as the first dimension. In some cases, if consecutive operations target different dimensions of an array, it might be beneficial to transpose or permute the array between these operations.

GPUs achieve high performance by calculating many results in parallel. Thus, matrix and higher-dimensional array operations typically perform much better than operations on vectors or scalars. You can achieve better performance by rewriting your loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For more details, see Using Vectorization.

By default, all operations in MATLAB are performed in double-precision floating-point arithmetic. However, most operations support a variety of data types, including integer and single-precision floating-point. Today's GPUs and CPUs typically have much higher throughput when performing single-precision operations, and single-precision floating-point data occupies less memory. If your application's accuracy requirements allow the use of single-precision floating-point, it can greatly improve the performance of your MATLAB code.

The GPU sits at the end of a data transfer mechanism known as
the PCI bus. While this bus is an efficient, high-bandwidth way to
transfer data from the PC host memory to various extension cards,
it is still much slower than the overall bandwidth to the global memory
of the GPU device or of the CPU (for more details, see the example Measuring
GPU Performance). In addition, transfers from the GPU device
to MATLAB host memory cause MATLAB to wait for all pending operations
on the device to complete before executing any other statements. This
can significantly hurt the performance of your application. In general,
you should limit the number of times you transfer data between the
MATLAB workspace and the GPU. If you can transfer data to the GPU
once at the start of your application, perform all the calculations
you can on the GPU, and then transfer the results back into MATLAB
at the end, that generally results in the best performance. Similarly,
when possible it helps to create arrays directly on the GPU, using
either the `'gpuArray'`

or the `'like'`

option
for functions such as `zeros`

(e.g., ```
Z
= zeros(___,'gpuArray')
```

or `Z = zeros(N,'like',g)`

for
existing gpuArray `g`

).

The best way to measure performance on the GPU is to use `gputimeit`

. This function takes as input
a function handle with no input arguments, and returns the measured
execution time of that function. It takes care of such benchmarking
considerations as repeating the timed operation to get better resolution,
executing the function before measurement to avoid initialization
overhead, and subtracting out the overhead of the timing function.
Also, `gputimeit`

ensures that all operations on
the GPU have completed before the final timing.

For example, consider measuring the time taken to compute the `lu`

factorization
of a random matrix `A`

of size `N`

-by-`N`

.
You can do this by defining a function that does the `lu`

factorization
and passing the function handle to `gputimeit`

:

A = rand(N,'gpuArray'); fh = @() lu(A); gputimeit(fh,2); % 2nd arg indicates number of outputs

You can also measure performance with `tic`

and `toc`

. However, to get accurate timing
on the GPU, you must wait for operations to complete before calling `toc`

.
There are two ways to do this. You can call `gather`

on
the final GPU output before calling `toc`

: this
forces all computations to complete before the time measurement is
taken. Alternately, you can use the `wait`

function with a GPUDevice object as its input.
For example, if you wanted to measure the time taken to compute the `lu`

factorization
of matrix `A`

using `tic`

, `toc`

,
and `wait`

, you can do it as follows:

gd = gpuDevice(); tic(); [l,u] = lu(A); wait(gd); tLU = toc();

You can also use the MATLAB profiler to show how computation
time is distributed in your GPU code. Note, that to accomplish timing
measurements, the profiler runs each line of code independently, so
it cannot account for overlapping (asynchronous) execution such as
might occur during normal operation. For timing whole algorithms,
you should use `tic`

and `toc`

,
or `gputimeit`

, as described above. Also, the profile
might not yield correct results for user-defined MEX functions if
they run asynchronously.

This example shows you how to improve performance by running a function on the GPU instead of the CPU, and by vectorizing the calculations.

Consider a function that performs fast convolution on the columns of a matrix. Fast convolution, which is a common operation in signal processing applications, transforms each column of data from the time domain to the frequency domain, multiplies it by the transform of a filter vector, transforms back to the time domain, and stores the result in an output matrix.

function y = fastConvolution(data,filter) [m,n] = size(data); % Zero-pad filter to the column length of data, and transform filter_f = fft(filter,m); % Create an array of zeros of the same size and class as data y = zeros(m,n,'like',data); % Transform each column of data for ix = 1:n af = fft(data(:,ix)); y(:,ix) = ifft(af .* filter_f); end end

Execute this function in the CPU on data of a particular size,
and measure the execution time using the MATLAB `timeit`

function. The `timeit`

function
takes care of common benchmarking considerations, such as accounting
for startup and overhead.

a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution(a,b); % Calculate output ctime = timeit(@()fastConvolution(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]);

On a sample machine, this code displays the output:

Execution time on CPU = 0.019335

Now execute this function on the GPU. You can do this easily
by changing the input data to be gpuArrays rather than normal MATLAB
arrays. The `'like'`

syntax used when creating the
output inside the function ensures that `y`

will
be a gpuArray if `data`

is a gpuArray.

ga = gpuArray(a); % Move array to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution(ga,gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution(ga,gb)); % Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);

On the same machine, this code displays the output:

Execution time on CPU = 0.019335 Execution time on GPU = 0.027235 Maximum absolute error = 1.1374e-14

Unfortunately, the GPU is slower than the CPU for this problem.
The reason is that the `for`

-loop is executing the
FFT, multiplication, and inverse FFT operations on individual columns
of length 4096. The best way to increase the performance is to vectorize
the code, so that a single MATLAB function call performs more calculation.
The FFT and IFFT operations are easy to vectorize: `fft(A)`

computes
the FFT of each column of a matrix `A`

. You can perform
a multiply of the filter with every column in a matrix at once using
the MATLAB binary scalar expansion function `bsxfun`

.
The vectorized function looks like this:

function y = fastConvolution_v2(data,filter) m = size(data,1); % Zero-pad filter to the length of data, and transform filter_f = fft(filter,m); % Transform each column of the input af = fft(data); % Multiply each column by filter and compute inverse transform y = ifft(bsxfun(@times,af,filter_f)); end

Perform the same experiment using the vectorized function:

a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution_v2(a,b); % Calculate output ctime = timeit(@()fastConvolution_v2(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]); ga = gpuArray(a); % Move data to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution_v2(ga, gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution_v2(ga,gb));% Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);

Execution time on CPU = 0.010393 Execution time on GPU = 0.0020537 Maximum absolute error = 1.1374e-14

In conclusion, vectorizing the code helps both the CPU and GPU versions to run faster. However, vectorization helps the GPU version much more than the CPU. The improved CPU version is nearly twice as fast as the original; the improved GPU version is 13 times faster than the original. The GPU code went from being 40% slower than the CPU in the original version, to about five times faster in the revised version.

Was this topic helpful?