
By Daniel Armyr and Dan Doherty, MathWorks

NVIDIA GPUs are becoming increasingly popular for large-scale computations in image processing, financial modeling, signal processing, and other applications—largely due to their highly parallel architecture and high computational throughput. The CUDA programming model lets programmers exploit the full power of this architecture by providing fine-grained control over how computations are divided among parallel threads and executed on the device. The resulting algorithms often run significantly faster than traditional code written for the CPU.

While algorithms written for the GPU are often much faster, the process of building a framework for developing and testing them can be time-consuming. Many programmers write CUDA kernels with the expectation that they will be integrated into C or Fortran programs for production. For this reason, they often use these languages to iterate on and test their kernels, which requires writing significant amounts of “glue code” for tasks such as transferring data to the GPU, managing GPU memory, initializing and launching CUDA kernels, and visualizing kernel outputs. This glue code is time-consuming to write and difficult to modify if, for example, you want to evaluate your kernel for different input data or visualize kernel outputs using a different type of plot.

Using an image white balancing example, this article describes how MATLAB® supports CUDA kernel development by providing a language and development environment for quickly evaluating kernels, analyzing and visualizing kernel results, and writing test harnesses to validate kernel results.

White balancing adjusts the colors in an image to remove an unwanted color cast, such as a reddish or bluish tint.

Suppose you want to write a white balance routine in CUDA C for integration into a larger C program. Before writing any C code, it’s useful to explore the algorithm, investigate different algorithmic approaches, and develop a working prototype.

We do this in MATLAB using the following code:

This code computes the average amount of each color present in the input image and then applies scaling factors to ensure that the output image has an equal amount of each color. Notice that, with MATLAB, developing the algorithm takes just five lines of code—far fewer than it would take in C or Fortran. One reason is that MATLAB is a high-level, interpreted language, and therefore there is no need to perform administrative tasks such as declaring variables and allocating memory. Another is that MATLAB includes thousands of built-in math, engineering, and plotting functions and can be extended with domain-specific algorithms in signal and image processing, computational finance, communications, and other areas.
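The steps described above can be sketched as follows. This is an illustrative reconstruction, not the article's own five-line listing: the synthetic input image, the variable names, and the choice of the mean of the channel averages as the target gray level are all assumptions made for the example.

```matlab
% Illustrative sketch only: not the article's original listing.
% Synthetic 4x4 RGB image with a reddish tint (hypothetical test data).
imageData = uint8(cat(3, 200*ones(4,4), 100*ones(4,4), 100*ones(4,4)));

% Average of each color channel, computed in double precision (1x3 vector).
avgRGB = squeeze(mean(mean(double(imageData), 1), 2))';

% Target gray level: the mean of the three channel averages (assumption).
avgAll = mean(avgRGB);

% One scale factor per channel, shaped 1x1x3 so it lines up with the color pages.
scaleArray = reshape(avgAll ./ avgRGB, 1, 1, 3);

% Scale every pixel; bsxfun expands scaleArray across rows and columns.
adjustedImage = uint8(bsxfun(@times, double(imageData), scaleArray));
```

For this input every channel of `adjustedImage` ends up with the same average, which is the "equal amount of each color" property the text describes.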

We call the MATLAB white balance algorithm using an input image that includes a Gretag Macbeth color chart, which is commonly used to calibrate cameras. We then visualize the output using the `imshow` command in Image Processing Toolbox™:

adjustedImage = whitebalance(imagedata);
imshow(adjustedImage);

The algorithm removes the reddish tint from the original image (Figure 1).

This working MATLAB implementation will serve as a reference as we develop and test CUDA kernels for the white balance algorithm.

We ultimately want to implement the white balance algorithm in C, with each computational step written as a CUDA kernel. Before getting started with CUDA, we use the MATLAB white balance code to explore the algorithm and decide how to break it into kernels. We begin by using the MATLAB Profiler to see how long each section of code takes to execute. The profiler will indicate the bottleneck areas where we will need to spend extra effort to develop efficient CUDA kernels.

We launch the MATLAB Profiler using the Run and Time button on the MATLAB desktop (Figure 2).

We see that the final three lines of code take 0.15 seconds to run, making this the most time-consuming section of the algorithm. This code multiplies every element in the image data by the appropriate scale factor. Because each element is scaled independently of the others, the operation can be massively parallelized and is a good candidate for significant acceleration on the GPU.
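The Profiler is an interactive tool. As a complement, a single step can also be timed from a script with `timeit`. The sketch below is only an illustration: the image size and scale factors are arbitrary assumptions, and the measured time will differ from the 0.15 seconds reported above.

```matlab
% Illustrative timing sketch; image size and scale factors are assumptions.
imageData = uint8(randi([0 255], 512, 512, 3));   % synthetic test image
scaleArray = reshape([0.8 1.1 1.2], 1, 1, 3);     % hypothetical scale factors

% Wrap the scaling step in a function handle so timeit can call it repeatedly.
scaleStep = @() uint8(bsxfun(@times, double(imageData), scaleArray));

% timeit calls the handle several times and returns a typical time in seconds.
elapsedSeconds = timeit(scaleStep);
fprintf('Scaling step: %.4f s\n', elapsedSeconds);
```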

We reimplement this code in CUDA C/C++, as follows:

Before writing kernels for the other computational steps in the white balance algorithm, we will transition back to MATLAB to evaluate and test this kernel to make sure that it runs properly and gives correct results.

To load the kernel into MATLAB, we provide paths to the compiled PTX file and source code:

kernel = parallel.gpu.CUDAKernel( 'applyScaleFactorsKernel.ptx', ...
                                  'applyScaleFactorsKernel.cu' );

Once the kernel is loaded, we must complete a few setup tasks before we can launch it, such as initializing the return data and setting the sizes of the thread blocks and the grid. The kernel can then be used much like any other MATLAB function, except that we launch it using the `feval` command, with the following syntax:

`[outArguments] = feval(kernelName, inArguments)`
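The setup tasks mentioned above might look like the following sketch. The kernel's argument list is not shown in this article, so the signature in the comment is purely hypothetical, as are the image size, the block size of 256, and the scale factors. The launch is guarded so that the script still runs when no GPU or compiled kernel file is present.

```matlab
% Sketch under stated assumptions; the kernel signature below is hypothetical:
%   __global__ void applyScaleFactors(double* out, const unsigned char* in,
%                                     const double* scale, int numPixels)
numRows = 480; numCols = 640;             % example image size (assumption)
numPixels = numRows * numCols;

% One thread per pixel: pick a block size, then enough blocks to cover all pixels.
threadsPerBlock = 256;                    % assumed; must not exceed kernel.MaxThreadsPerBlock
numBlocks = ceil(numPixels / threadsPerBlock);

% Launch only if a GPU and the compiled kernel files are actually available.
haveKernel = exist('applyScaleFactorsKernel.ptx', 'file') == 2 && ...
             exist('applyScaleFactorsKernel.cu', 'file') == 2 && ...
             exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;
if haveKernel
    kernel = parallel.gpu.CUDAKernel('applyScaleFactorsKernel.ptx', ...
                                     'applyScaleFactorsKernel.cu');
    kernel.ThreadBlockSize = [threadsPerBlock, 1, 1];
    kernel.GridSize = [numBlocks, 1];

    imageGPU = gpuArray(uint8(randi([0 255], numRows, numCols, 3)));
    scaleGPU = gpuArray([0.8 1.1 1.2]);
    outGPU = zeros(numRows, numCols, 3, 'gpuArray');   % initialized return data

    % Non-const pointer arguments come back as the outputs of feval.
    outGPU = feval(kernel, outGPU, imageGPU, scaleGPU, int32(numPixels));
    adjustedImage = uint8(gather(outGPU));
end
```

Because the sizes are computed from the image dimensions rather than hard-coded, the same script can be reused for images of other sizes.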

We replace the final three lines of code in our MATLAB white balance algorithm with code that loads and launches the kernel. The updated white balance routine (`whitebalance_gpu.m`) is as follows:

Notice the relative ease of calling CUDA kernels from MATLAB. Each task, such as transferring data to the GPU, initializing return data, and launching the kernel, is performed using a single line of MATLAB code. Furthermore, the code is robust in that we can evaluate the kernel for images of different sizes without updating the code. In lower-level languages like C or Fortran, moving the data to the GPU, managing memory, and launching CUDA kernels requires significantly more code. That code is not only more difficult to write but also more difficult for other developers and project collaborators to understand and modify for their own purposes.

Now that we have integrated our kernel into MATLAB, we test whether the results are correct by comparing the original MATLAB implementation of the white balance algorithm (`whitebalance.m`) with the new version that incorporates the kernel (`whitebalance_gpu.m`).

We could easily use this test harness to test the kernel for additional input images with different characteristics. We could also develop more sophisticated test harnesses to perform more detailed postprocessing or to automate testing.
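A harness of this kind can be sketched as below. Both function handles are stand-ins introduced for illustration: `referenceFcn` plays the role of `whitebalance.m` and `candidateFcn` plays the role of `whitebalance_gpu.m`. The synthetic image, the scale factors, and the one-level tolerance are assumptions as well.

```matlab
% Illustrative test harness; both implementations below are stand-ins.
imageData = uint8(randi([0 255], 64, 48, 3));     % synthetic input image
scaleArray = reshape([0.8 1.1 1.2], 1, 1, 3);     % hypothetical scale factors

% Reference: vectorized scaling, standing in for whitebalance.m.
referenceFcn = @(img) uint8(bsxfun(@times, double(img), scaleArray));

% Candidate: per-channel scaling, standing in for whitebalance_gpu.m.
candidateFcn = @(img) cat(3, uint8(double(img(:,:,1)) * scaleArray(1)), ...
                             uint8(double(img(:,:,2)) * scaleArray(2)), ...
                             uint8(double(img(:,:,3)) * scaleArray(3)));

expected = referenceFcn(imageData);
actual = candidateFcn(imageData);

% Compare in double precision so uint8 subtraction cannot saturate at zero.
maxDifference = max(abs(double(expected(:)) - double(actual(:))));

% Allow a one-level difference to absorb rounding differences between devices.
tolerance = 1;
assert(isequal(size(expected), size(actual)), 'Output sizes differ.');
assert(maxDifference <= tolerance, 'Outputs differ by %g levels.', maxDifference);
```

Swapping in a different input image only requires changing the first line, which is what makes a harness like this convenient for trying images with different characteristics.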

So far, we have reimplemented one portion of the white balance algorithm in CUDA. We ultimately want the entire algorithm written as a collection of CUDA kernels, since the major computational steps appear to be parallelizable and good candidates for the GPU. Performing the entire computation on the GPU would also reduce the overhead associated with transferring data between the CPU and GPU multiple times.

We will not implement the remaining steps in CUDA C in this example; however, you could use the process we just used for the image scaling operation, writing kernels and then testing them against the original MATLAB code. If the computation is available in a CUDA library such as NPP, you can use the GPU MEX API in Parallel Computing Toolbox™ to call the host-side C functions, and pass them pointers to the underlying GPU data.

This process of incrementally developing kernels and testing them as you go makes it easier to isolate bugs in your code, and ensures a more organized development process.

This article has focused on how MATLAB can help you incrementally develop and validate CUDA kernels that will be integrated into a larger C application. For applications that do not have to be delivered in C, you can often save significant development time by staying in MATLAB and leveraging its built-in GPU capabilities. Most core math functions in MATLAB, as well as a growing number of toolbox functions, are overloaded to run on the GPU when given input data of the `gpuArray` data type. This means that you can get the speed advantages of the GPU without the need to write any CUDA kernels, and with minimal changes to your MATLAB code.

Recall our original MATLAB prototype of the white balance routine. Rather than writing a CUDA kernel for the image scaling operation, we could have done it on the GPU simply by transferring the `imageData` variable to the GPU using the `gpuArray` command and then performing the image scaling without any additional changes to the code:
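A minimal sketch of this approach is shown below. It is not the article's listing: the synthetic image and scale factors are assumptions, and the availability check is added only so the script also runs on a machine without a supported GPU or without Parallel Computing Toolbox.

```matlab
% Sketch only; falls back to the CPU when no supported GPU is available.
imageData = uint8(randi([0 255], 64, 48, 3));     % synthetic input image
scaleArray = reshape([0.8 1.1 1.2], 1, 1, 3);     % hypothetical scale factors

useGPU = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;
if useGPU
    imageData = gpuArray(imageData);   % transfer the image to GPU memory
end

% Unchanged scaling code: it runs on the GPU when imageData is a gpuArray.
adjustedImage = uint8(bsxfun(@times, double(imageData), scaleArray));

if isa(adjustedImage, 'gpuArray')
    adjustedImage = gather(adjustedImage);   % copy the result back to host memory
end
```

The scaling line is identical in both cases; only the data type of `imageData` decides where the computation runs.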

This approach reduces the total time for the image scaling operation from 150 ms on the CPU to 9 ms on the GPU¹, of which 2.3 ms is execution time and 6.7 ms is spent transferring the data to the GPU and back. In a larger algorithm that keeps its data on the GPU, the transfer cost is paid only once and often becomes negligible, so it is reasonable to compare execution times alone. On that basis, the image scaling step runs about 65 times faster on the GPU (150 ms versus 2.3 ms).
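The arithmetic behind these figures can be checked directly. The timings are the ones quoted above; the variable names are chosen here for illustration.

```matlab
% Timings quoted in the text, in milliseconds.
cpuTimeMs = 150;          % image scaling on the CPU
gpuExecutionMs = 2.3;     % GPU execution only
gpuTransferMs = 6.7;      % transfer to the GPU and back

gpuTotalMs = gpuExecutionMs + gpuTransferMs;         % 9 ms in total
speedupWithTransfer = cpuTimeMs / gpuTotalMs;        % roughly 16.7x
speedupExecutionOnly = cpuTimeMs / gpuExecutionMs;   % roughly 65.2x

fprintf('With transfer: %.1fx, execution only: %.1fx\n', ...
        speedupWithTransfer, speedupExecutionOnly);
```

The gap between the two ratios shows how much of the GPU time in this small example is data transfer rather than computation.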

As this example has shown, with MATLAB you can develop your algorithms much faster than in C or Fortran, and still take advantage of GPU computing for computationally intensive parts of your code.

¹ Performance measurements made using an Intel Xeon 3690 CPU and an NVIDIA Tesla K20 GPU.

Published 2013 - 92100v00