Improve Performance of Element-wise MATLAB® Functions on the GPU using ARRAYFUN
This example shows how arrayfun can be used to run a MATLAB® function natively on the GPU. When the MATLAB function contains many element-wise operations, arrayfun can provide improved performance compared with simply executing the MATLAB function directly on the GPU with gpuArray input data. The MATLAB function can be in its own file or can be a nested or anonymous function. It must contain only scalar operations and arithmetic.
We put the example into a function to allow nested functions:
Using Horner's Rule to Calculate Exponentials
Horner's rule allows the efficient evaluation of power series expansions. We will use it to calculate the first 10 terms of the power series expansion for the exponential function exp. We can implement this as a MATLAB function.
function y = horner(x)
%HORNER - series expansion for exp(x) using Horner's rule
y = 1 + x.*(1 + x.*((1 + x.*((1 + ...
        x.*((1 + x.*((1 + x.*((1 + x.*((1 + ...
        x.*((1 + x./9)./8))./7))./6))./5))./4))./3))./2));
end
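As a rough CPU-only cross-check of the nested form, the same ten-term sum, 1 + x(1 + (x/2)(1 + (x/3)(...(1 + x/9)))), can be written as a loop. This sketch is not part of the original example: the variable names are illustrative, and the interval [0, 1] is assumed because it matches the range of the rand inputs used later.

```matlab
% Sketch: loop form of Horner's rule for the first 10 terms of exp(x).
% Variable names are illustrative assumptions, not part of the original example.
x = linspace(0, 1, 101);      % sample points in [0, 1]
nTerms = 10;                  % terms x^0/0! through x^9/9!

y = ones(size(x));            % innermost factor starts at 1
for k = nTerms-1:-1:1
    % Wrap one more level of nesting: y <- 1 + (x/k) * y
    y = 1 + (x ./ k) .* y;
end

% Compare the truncated series with the built-in exp.
maxErr = max(abs(y - exp(x)));
fprintf('Maximum truncation error on [0,1]: %g\n', maxErr);
```

On this interval the truncation error should stay below 1e-6, since the first omitted term is x^10/10!.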
Preparing horner for the GPU
To run this function on the GPU with minimal code changes, we could pass a gpuArray object as input to the horner function. Because horner contains only individual element-wise operations, executing those operations one at a time on the GPU might not give very good performance. However, we can improve the performance by executing all of the element-wise operations in the horner function at one time using arrayfun.
To run this function on the GPU using arrayfun, we use a handle to the horner function. horner automatically adapts to different size and type inputs. We can compare the results computed on the GPU using both gpuArray objects and arrayfun with standard MATLAB CPU execution simply by evaluating the function directly.
hornerFcn = @horner;
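The calling convention can be illustrated with a minimal sketch. It uses a hypothetical two-term polynomial, shortPoly, in place of horner so that the snippet is self-contained, and it assumes a MATLAB release that provides canUseGPU so that the code falls back to the CPU when no supported GPU is present.

```matlab
% Hypothetical stand-in for horner: scalar operations and arithmetic only.
shortPoly = @(t) 1 + t.*(1 + t./2);

x = rand(4, 'single');        % 4-by-4 single-precision input
if canUseGPU()
    x = gpuArray(x);          % move the data to the GPU when one is available
end

% arrayfun applies shortPoly to every element and returns an array
% with the same size as the input.
y = arrayfun(shortPoly, x);

% Copy the result back to host memory if it lives on the GPU.
if isa(y, 'gpuArray')
    y = gather(y);
end

disp(size(y));
disp(class(y));
```

The same handle works for both the CPU and GPU paths; only the type of the input array decides where arrayfun runs.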
Create the Input Data
We create some inputs of different types and sizes, and use gpuArray to send them to the GPU.
data1 = rand( 2000, 'single' );
data2 = rand( 1000, 'double' );
gdata1 = gpuArray( data1 );
gdata2 = gpuArray( data2 );
Evaluating horner on the GPU
To evaluate the horner function on the GPU, we have two choices. With minimal code changes, we can evaluate the original function on the GPU by providing a gpuArray object as input. However, to improve performance on the GPU, we call arrayfun, using the same calling convention as the original MATLAB function.
We can compare the accuracy of the results by evaluating the original function directly in MATLAB on the CPU. We expect some slight numerical differences because the floating-point arithmetic on the GPU does not precisely match the arithmetic performed on the CPU.
gresult1 = arrayfun( hornerFcn, gdata1 );
gresult2 = arrayfun( hornerFcn, gdata2 );
comparesingle = max( max( abs( gresult1 - horner( data1 ) ) ) );
comparedouble = max( max( abs( gresult2 - horner( data2 ) ) ) );
fprintf( 'Maximum discrepancy for single precision: %g\n', comparesingle );
fprintf( 'Maximum discrepancy for double precision: %g\n', comparedouble );
Maximum discrepancy for single precision: 2.38419e-07
Maximum discrepancy for double precision: 0
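To put the single-precision discrepancy in context, it can help to compare it with the spacing between adjacent single-precision values, which eps reports. The sketch below is an illustration rather than part of the original example. The tolerance of four units in the last place is an arbitrary assumption, and reportedDiscrepancy simply copies the value printed above.

```matlab
% Spacing between adjacent single-precision values (one unit in the last place).
ulpAtOne = eps(single(1));    % 2^-23, spacing for values in [1, 2)
ulpAtTwo = eps(single(2));    % 2^-22, spacing for values in [2, 4)

fprintf('Spacing of single values in [1,2): %g\n', ulpAtOne);
fprintf('Spacing of single values in [2,4): %g\n', ulpAtTwo);

% With inputs in [0, 1), the series results lie between 1 and about 2.72,
% so a difference of one spacing at 2 amounts to a single rounding step.
reportedDiscrepancy = 2.38419e-07;        % value copied from the output above
tolerance = 4 * double(ulpAtTwo);         % assumed tolerance: a few spacings
withinTolerance = reportedDiscrepancy <= tolerance;
fprintf('Within assumed tolerance: %d\n', withinTolerance);
```

Under these assumptions, the reported single-precision discrepancy corresponds to about one spacing at the upper end of the result range.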
Comparing Performance between GPU and CPU
We can compare the performance of the GPU versions to the native MATLAB CPU version. Current generation GPUs have much better performance in single precision, so we compare the single-precision timings.
% CPU execution
tic
hornerFcn( data1 );
tcpu = toc;

% GPU execution using only gpuArray objects
tgpuObject = gputimeit(@() hornerFcn(gdata1));

% GPU execution using gpuArray objects with arrayfun
tgpuArrayfun = gputimeit(@() arrayfun(hornerFcn, gdata1));

fprintf( 'Speed-up achieved using gpuArray objects only: %g\n',...
    tcpu / tgpuObject );
fprintf( 'Speed-up achieved using gpuArray objects with arrayfun: %g\n',...
    tcpu / tgpuArrayfun );
Speed-up achieved using gpuArray objects only: 24.6764
Speed-up achieved using gpuArray objects with arrayfun: 98.3555
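The CPU time above comes from a single tic/toc run, which can include one-off overheads. One possible alternative, not used in the original example, is to measure the CPU side with timeit, which calls the function several times and returns a typical time. The sketch below uses a hypothetical short polynomial and a smaller input so that it runs quickly without a GPU.

```matlab
% Hypothetical stand-in for horner, used only to illustrate the timing calls.
shortPoly = @(x) 1 + x.*(1 + x./2);
data = rand(500, 'single');

% Single-run measurement, as in the example above.
tStart = tic;
shortPoly(data);
tSingleRun = toc(tStart);

% Repeated measurement: timeit runs the handle multiple times.
tRepeated = timeit(@() shortPoly(data));

fprintf('Single tic/toc run: %g s\n', tSingleRun);
fprintf('timeit estimate:    %g s\n', tRepeated);
```

The two numbers generally differ from run to run and from machine to machine, so any speed-up ratio derived from them is only indicative.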