The purpose of GPU computing in MATLAB® is to speed up your
applications. This topic discusses fundamental concepts and practices
that can help you achieve better performance on the GPU, such as the
configuration of the GPU hardware and best practices within your code.
It discusses the trade-off between implementation difficulty and performance,
and describes the criteria you might use to choose between using gpuArray
arrayfun, MEX-files, or CUDA kernels.
Finally, it describes how to accurately measure performance on the
When converting MATLAB code to run on the GPU, it is best to start with MATLAB code that already performs well. While the GPU and CPU have different performance characteristics, the general guidelines for writing good MATLAB code also help you write good MATLAB code for the GPU. The first step is almost always to profile your CPU code. The lines of code that the profiler shows taking the most time on the CPU will likely be ones that you must concentrate on when you code for the GPU.
It is easiest to start converting your code using MATLAB built-in
functions that support gpuArray data. These functions take gpuArray
inputs, perform calculations on the GPU, and return gpuArray outputs.
A list of the MATLAB functions that support gpuArray data is found
in Run Built-In Functions on a GPU. In general these functions support the same arguments and data
types as standard MATLAB functions that are calculated in the CPU.
Any limitations in these overloaded functions for gpuArrays are described
in their command-line help (e.g.,
If all the functions that you want to use are supported on the
GPU, running code on the GPU may be as simple as calling
gpuArray to transfer input data to the GPU, and calling
gather to retrieve the output data from the GPU when finished.
In many cases, you might need to vectorize your code, replacing looped
scalar operations with MATLAB matrix and vector operations. While
vectorizing is generally a good practice on the CPU, it is usually
critical for achieving high performance on the GPU. For more information,
see Vectorize for Improved GPU Performance.
It is possible that even after converting inputs to gpuArrays
and vectorizing your code, there are operations in your algorithm
that are either not built-in functions, or that are not fast enough
to meet your application's requirements. In such situations
you have three main options: use
arrayfun to precompile
element-wise parts of your application, make use of GPU library functions,
or write a custom CUDA kernel.
If you have a purely element-wise function, you can improve
its performance by calling it with
arrayfun function on the GPU turns an element-wise MATLAB
function into a custom CUDA kernel, thus reducing the overhead of
performing the operation. Often, there is a subset of your application
that can be used with
arrayfun even if the entire
application cannot be. The example Improve Performance of Element-wise MATLAB Functions on the GPU using
ARRAYFUN shows the basic concepts of this approach; and the
example Using ARRAYFUN for Monte-Carlo Simulations shows how this
can be done in simulations for a finance application.
MATLAB provides an extensive library of GPU-enabled functions in Parallel Computing Toolbox™, Image Processing Toolbox™, Signal Processing Toolbox™, and other products. However, there are many libraries of additional functions that do not have direct built-in analogs in MATLAB's GPU support. Examples include the NVIDIA Performance Primitives library and the CURAND library, which are included in the CUDA toolkit that ships with MATLAB. If you need to call a function in one of these libraries, you can do so using the GPU MEX interface. This interface allows you to extract the pointers to the device data from MATLAB gpuArrays so that you can pass these pointers to GPU functions. You can convert the returned values into gpuArrays for return to MATLAB. For more information see Run MEX-Functions Containing CUDA Code.
Finally, you have the option of writing a custom CUDA kernel for the operation that you need. Such kernels can be directly integrated into MATLAB using the CUDAKernel object.
The example Illustrating Three Approaches to GPU Computing: The Mandelbrot Set shows how to implement a simple calculation using three of the approaches
mentioned in this section. This example begins with MATLAB code that
is easily converted to run on the GPU, rewrites the code to use
arrayfun for element-wise operations, and finally shows
how to integrate a custom CUDA kernel for the same operation.
Alternately, you can write a CUDA kernel as part of a MEX-file and call it using the CUDA Runtime API inside the MEX-file. Either of these approaches might let you work with low-level features of the GPU, such as shared memory and texture memory, that are not directly available in MATLAB code. For more details, see the example Accessing Advanced CUDA Features Using MEX.
In general you can achieve the best performance when your GPU is dedicated to computing. It is usually not practical to use the same GPU device for both computations and graphics, because of the amount of memory taken up for problems of reasonable size and the constant use of the device by the system for graphics. If possible, obtain a separate device for graphics. Details of configuring your device for compute or graphics depend on the operating system and driver version.
On Windows® systems, a GPU device can be in one of two modes: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC) mode. For best performance, any devices used for computing should be in TCC mode. Consult NVIDIA® documentation for more details.
NVIDIA's highest-performance compute devices, the Tesla line, support error correcting codes (ECC) when reading and writing GPU memory. The purpose of ECC is to correct for occasional bit-errors that occur normally when reading or writing dynamic memory. One technique to improve performance is to turn off ECC to increase the achievable memory bandwidth. While the hardware can be configured this way, MathWorks does not recommend this practice. The potential loss of accuracy due to silent errors can be more harmful than the performance benefit.
This topic describes general techniques that help you achieve better performance on the GPU. Some of these tips apply when writing MATLAB code for the CPU as well.
Data in MATLAB arrays is stored in column-major order. Therefore, it is beneficial to operate along the first or column dimension of your array. If one dimension of your data is significantly longer than others, you might achieve better performance if you make that the first dimension. Similarly, if you frequently operate along a particular dimension, it is usually best to have it as the first dimension. In some cases, if consecutive operations target different dimensions of an array, it might be beneficial to transpose or permute the array between these operations.
GPUs achieve high performance by calculating many results in parallel. Thus, matrix and higher-dimensional array operations typically perform much better than operations on vectors or scalars. You can achieve better performance by rewriting your loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For more details, see Using Vectorization.
By default, all operations in MATLAB are performed in double-precision floating-point arithmetic. However, most operations support a variety of data types, including integer and single-precision floating-point. Today's GPUs and CPUs typically have much higher throughput when performing single-precision operations, and single-precision floating-point data occupies less memory. If your application's accuracy requirements allow the use of single-precision floating-point, it can greatly improve the performance of your MATLAB code.
The GPU sits at the end of a data transfer mechanism known as
the PCI bus. While this bus is an efficient, high-bandwidth way to
transfer data from the PC host memory to various extension cards,
it is still much slower than the overall bandwidth to the global memory
of the GPU device or of the CPU (for more details, see the example Measuring
GPU Performance). In addition, transfers from the GPU device
to MATLAB host memory cause MATLAB to wait for all pending operations
on the device to complete before executing any other statements. This
can significantly hurt the performance of your application. In general,
you should limit the number of times you transfer data between the
MATLAB workspace and the GPU. If you can transfer data to the GPU
once at the start of your application, perform all the calculations
you can on the GPU, and then transfer the results back into MATLAB
at the end, that generally results in the best performance. Similarly,
when possible it helps to create arrays directly on the GPU, using
'gpuArray' or the
'like' option for functions such as
Z = zeros(___,'gpuArray') or
Z = zeros(N,'like',g) for existing gpuArray
The best way to measure performance on the GPU is to use
gputimeit. This function takes as input
a function handle with no input arguments, and returns the measured
execution time of that function. It takes care of such benchmarking
considerations as repeating the timed operation to get better resolution,
executing the function before measurement to avoid initialization
overhead, and subtracting out the overhead of the timing function.
gputimeit ensures that all operations on
the GPU have completed before the final timing.
For example, consider measuring the time taken to compute the
lu factorization of a random matrix
A of size
N. You can do
this by defining a function that does the
and passing the function handle to
A = rand(N,'gpuArray'); fh = @() lu(A); gputimeit(fh,2); % 2nd arg indicates number of outputs
You can also measure performance with
to get accurate timing on the GPU, you must wait for operations to
complete before calling
toc. There are two ways
to do this. You can call
gather on the final GPU output before calling
this forces all computations to complete before the time measurement
is taken. Alternately, you can use the
wait function with a GPUDevice object as its input.
For example, if you wanted to measure the time taken to compute the
lu factorization of matrix
wait, you can do it as follows:
gd = gpuDevice(); tic(); [l,u] = lu(A); wait(gd); tLU = toc();
You can also use the MATLAB profiler to show how computation
time is distributed in your GPU code. Note, that to accomplish timing
measurements, the profiler runs each line of code independently, so
it cannot account for overlapping (asynchronous) execution such as
might occur during normal operation. For timing whole algorithms,
you should use
gputimeit, as described above. Also, the
profile might not yield correct results for user-defined MEX functions
if they run asynchronously.
This example shows you how to improve performance by running a function on the GPU instead of the CPU, and by vectorizing the calculations.
Consider a function that performs fast convolution on the columns of a matrix. Fast convolution, which is a common operation in signal processing applications, transforms each column of data from the time domain to the frequency domain, multiplies it by the transform of a filter vector, transforms back to the time domain, and stores the result in an output matrix.
function y = fastConvolution(data,filter) [m,n] = size(data); % Zero-pad filter to the column length of data, and transform filter_f = fft(filter,m); % Create an array of zeros of the same size and class as data y = zeros(m,n,'like',data); % Transform each column of data for ix = 1:n af = fft(data(:,ix)); y(:,ix) = ifft(af .* filter_f); end end
Execute this function in the CPU on data of a particular size,
and measure the execution time using the MATLAB
timeit function. The
timeit function takes care of common benchmarking considerations, such
as accounting for startup and overhead.
a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution(a,b); % Calculate output ctime = timeit(@()fastConvolution(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]);
On a sample machine, this code displays the output:
Execution time on CPU = 0.019335
Now execute this function on the GPU. You can do this easily
by changing the input data to be gpuArrays rather than normal MATLAB
'like' syntax used when creating the
output inside the function ensures that
be a gpuArray if
data is a gpuArray.
ga = gpuArray(a); % Move array to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution(ga,gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution(ga,gb)); % Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);
On the same machine, this code displays the output:
Execution time on CPU = 0.019335 Execution time on GPU = 0.027235 Maximum absolute error = 1.1374e-14
Unfortunately, the GPU is slower than the CPU for this problem.
The reason is that the
for-loop is executing the
FFT, multiplication, and inverse FFT operations on individual columns
of length 4096. The best way to increase the performance is to vectorize
the code, so that a single MATLAB function call performs more calculation.
The FFT and IFFT operations are easy to vectorize:
fft(A) computes the FFT of each column of a matrix
A. You can perform a multiply of the filter with every column in a
matrix at once using the MATLAB binary scalar expansion function
bsxfun. The vectorized function looks
function y = fastConvolution_v2(data,filter) m = size(data,1); % Zero-pad filter to the length of data, and transform filter_f = fft(filter,m); % Transform each column of the input af = fft(data); % Multiply each column by filter and compute inverse transform y = ifft(bsxfun(@times,af,filter_f)); end
Perform the same experiment using the vectorized function:
a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution_v2(a,b); % Calculate output ctime = timeit(@()fastConvolution_v2(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]); ga = gpuArray(a); % Move data to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution_v2(ga, gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution_v2(ga,gb));% Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);
Execution time on CPU = 0.010393 Execution time on GPU = 0.0020537 Maximum absolute error = 1.1374e-14
In conclusion, vectorizing the code helps both the CPU and GPU versions to run faster. However, vectorization helps the GPU version much more than the CPU. The improved CPU version is nearly twice as fast as the original; the improved GPU version is 13 times faster than the original. The GPU code went from being 40% slower than the CPU in the original version, to about five times faster in the revised version.