For GPU code generation, the primary mechanism for creating CUDA® kernels is by using
for-loops. The way you write loops in
your MATLAB® code has a significant impact on the number of kernels created as well as the
performance of the generated code. When you generate GPU code, check the diagnostic report to
see if your loop segment has
Loop not parallelized notices. Calls to
MATLAB functions in your code may also have
for-loops that contain
these notices. To get maximum performance, ensure that compute-intensive loop
segments in your code are mapped to kernels and executed in parallel. The following
recommendations help you achieve this goal and generate efficient CUDA kernels.
Mapping Nested Loops to Kernels
Consider a function that has nested for-loops.
function y = foo(x)
    ...
    for i1 = 1:N1
        for i2 = 1:N2
            for i3 = 1:N3
                for i4 = 1:N4
                    ...
                end
            end
        end
    end
Assume that one of the intermediate loops,
i3, is not parallelizable.
When GPU Coder™ performs loop analysis to create kernels, it considers only the outermost parallel loops
i1,i2 and creates a kernel with the outer loop dimensions
N1,N2. The loops
i3,i4 are within the kernel body and are executed sequentially.
However, if the innermost loop
i4 has a large iteration count, then better
performance may be achieved by creating kernels for the innermost loop.
There are three ways in which you can parallelize the innermost loop:
Rewrite the code so that the innermost code segment is not within a nested loop.
If the iteration count of the outer loop is small, then attach the loop to the
coder.unroll function. This function unrolls the
for-loop by making a copy of the loop body for each loop iteration. For more information, see the coder.unroll documentation.
function y = foo(x)
    ...
    for i1 = coder.unroll(1:N1)
        ...
    end
Make the outer loop dimension a dynamic bound. This way, parallel loop analysis fails on the outer loop but succeeds on the inner loops.
function y = foo(x,N1)
    ...
    for i1 = 1:N1
        ...
    end
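As an illustration of the coder.unroll option, here is a minimal sketch. The function name scaleRows, the 4-row input shape, and the loop body are hypothetical and chosen only for this example. It assumes the outer loop has a small, fixed iteration count, so unrolling it leaves the inner loop as the candidate for kernel creation. Whether a kernel is actually generated still depends on the loop analysis for your code.

```matlab
function y = scaleRows(x) %#codegen
% Hypothetical example: x is assumed to be 4-by-N, with N much larger than 4.
y = zeros(size(x), 'like', x);
for r = coder.unroll(1:4)       % small outer loop: unrolled during code generation
    for c = 1:size(x, 2)        % large inner loop: candidate for kernel creation
        y(r, c) = r * x(r, c);  % each iteration writes a distinct element
    end
end
end
```

When run in MATLAB, coder.unroll has no effect on the result, so the function can be checked in MATLAB before generating code.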
For-Loops with Break
Loops with break are not supported.
while (i < N)
    ...
    ...
    if (cond2)
        ...
        ...
        break;
    end
end
Remove breaks by creating a guard variable and conditional.
cond = true;
while (i < N)
    if (cond)
        ...
        ...
        if (cond2)
            cond = false;
        end
    end
end
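The guard-variable pattern can be sketched on a concrete for-loop. The function firstNegative and the variable searching are hypothetical names used only for illustration. The loop always runs to completion, and the guard turns the remaining iterations into no-ops once the condition has been met. This sketch only shows how to remove the break. Whether the resulting loop maps to a kernel still depends on dependence analysis.

```matlab
function idx = firstNegative(v) %#codegen
% Return the index of the first negative element of v, or 0 if there is none.
idx = 0;
searching = true;               % guard variable that replaces the break
for k = 1:numel(v)
    if searching
        if v(k) < 0
            idx = k;
            searching = false;  % remaining iterations skip the body
        end
    end
end
end
```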
Dependence Analysis Parallel Loop Check Fails
Kernel extraction uses parallel loop dependence analysis. In some cases, loop
dependence analysis cannot detect a parallel for-loop. The
coder.gpu.kernel pragma allows GPU Coder to override dependence analysis and force kernel creation. The caveat is that
you must be sure that the loop is a “for-all” loop with no inter-iteration
dependencies. Use the coder.gpu.kernel pragma explicitly on each of your for-loops.
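A minimal sketch of placing the pragma follows. The function scaledAdd and its arguments are hypothetical. The sketch assumes that GPU Coder is installed and that the pragma is placed immediately before the for-loop it applies to. Each iteration reads and writes only index k, so the loop has no inter-iteration dependencies.

```matlab
function out = scaledAdd(a, b, alpha) %#codegen
% Hypothetical element-wise update: iteration k touches only element k.
out = zeros(size(a), 'like', a);
coder.gpu.kernel();             % request kernel creation for the loop that follows
for k = 1:numel(a)
    out(k) = alpha * a(k) + b(k);
end
end
```

If a loop does carry a value from one iteration to the next, for example a running sum, forcing kernel creation in this way can produce incorrect results.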
Logical Indexing of Arrays
GPU Coder may not create kernels when logical indexing is used for accessing array elements.
i = (mag ~= 0);
vx(i) = vx(i)./mag(i);
vy(i) = vy(i)./mag(i);
Rewrite the code by using a loop body and guarding with an appropriate conditional.
for i = 1:numel(mag)
    if (mag(i) ~= 0)
        vx(i) = vx(i)./mag(i);
        vy(i) = vy(i)./mag(i);
    end
end
The use of unsupported functions, coder pragmas, toolbox functions, and so on inside a loop prevents the loop from becoming a kernel.
Try rewriting unsupported functions using pure MATLAB.
If smaller loops in a loop nest are the outermost loops, then a kernel could be created with just a subset of the loops in the nesting. If the algorithm allows it, always put the largest loops in the outermost nesting.
Rewrite loop nesting with larger loops as outer loops.
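The loop reordering can be sketched as follows. The function addBias and the 3-by-N input shape are hypothetical and used only for illustration. The sketch assumes that the column count is much larger than the row count and that the iterations are independent, so the two loops can be interchanged without changing the result.

```matlab
function B = addBias(A, bias) %#codegen
% Hypothetical example: A is assumed to be 3-by-N with N much larger than 3.
% bias has one entry per row and is added to every column of A.
B = zeros(size(A), 'like', A);
for c = 1:size(A, 2)            % large loop placed outermost
    for r = 1:size(A, 1)        % small loop nested inside
        B(r, c) = A(r, c) + bias(r);
    end
end
end
```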