Run MEX-Functions Containing CUDA Code

Write a MEX-File Containing CUDA Code

As with any MEX-files, those containing CUDA® code have a single entry point, known as mexFunction. The MEX-function contains the host-side code that interacts with gpuArray objects from MATLAB® and launches the CUDA code. The CUDA code in the MEX-file must conform to the CUDA runtime API.

You should call the function mxInitGPU at the entry to your MEX-file. This ensures that the GPU device is properly initialized and known to MATLAB.

The interface you use to write a MEX-file for gpuArray objects is different from the MEX interface for standard MATLAB arrays.

You can see an example of a MEX-file containing CUDA code at:

This file contains the following CUDA device function:

void __global__ TimesTwo(double const * const A,
                         double * const B,
                         int const N)
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        B[i] = 2.0 * A[i];

It contains the following lines to determine the array size and launch a grid of the proper size:

N = (int)(mxGPUGetNumberOfElements(A));
blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
TimesTwo<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, N);

Compile a GPU MEX-File

To compile CUDA code you must have installed the CUDA toolkit version consistent with the ToolkitVersion property of the GPUDevice object.

Use the mexcuda command in MATLAB to compile a MEX-file containing the CUDA code. You can compile the example file using the command:


If mexcuda has trouble locating the NVIDIA compiler (nvcc), it might be installed in a non-default location. You can specify the location of nvcc on your system by storing it in the environment variable MW_NVCC_PATH. You can set this variable using the MATLAB setenv command. For example,


Run the Resulting MEX-Functions

The MEX-function in this example multiplies every element in the input array by 2 to get the values in the output array. To test it, start with a gpuArray in which every element is 1:

x = ones(4,4,'gpuArray');
y = mexGPUExample(x)
y = 

    2    2    2    2
    2    2    2    2
    2    2    2    2
    2    2    2    2

Both the input and output arrays are gpuArray objects:

disp(['class(x) = ',class(x),', class(y) = ',class(y)])
class(x) = gpuArray, class(y) = gpuArray

Comparison to a CUDA Kernel

Parallel Computing Toolbox also supports CUDAKernel objects that can be used to integrate CUDA code with MATLAB. Consider the following when choosing the MEX-file approach versus the CUDAKernel approach:

  • MEX-files can interact with host-side libraries, such as the NVIDIA Performance Primitives (NPP) or CUFFT libraries, and can also contain calls from the host to functions in the CUDA runtime library.

  • MEX-files can analyze the size of the input and allocate memory of a different size, or launch grids of a different size, from C or C++ code. In comparison, MATLAB code that calls CUDAKernel objects must preallocate output memory and determine the grid size.

Access Complex Data

Complex data on a GPU device is stored in interleaved complex format. That is, for a complex gpuArray A, the real and imaginary parts of element i are stored in consecutive addresses. MATLAB uses CUDA built-in vector types to store complex data on the device (see the NVIDIA CUDA C Programming Guide).

Depending on the needs of your kernel, you can cast the pointer to complex data either as the real type or as the built-in vector type. For example, in MATLAB, suppose you create a matrix:

a = complex(ones(4,'gpuArray'),ones(4,'gpuArray'));

If you pass a gpuArray to a MEX-function as the first argument (prhs[0]), then you can get a pointer to the complex data by using the calls:

mxGPUArray const * A = mxGPUCreateFromMxArray(prhs[0]);
mwSize numel_complex = mxGPUGetNumberOfElements(A);
double2 * d_A = (double2 const *)(mxGPUGetDataReadOnly(A));

To treat the array as a real double-precision array of twice the length, you could do it this way:

mxGPUArray const * A = mxGPUCreateFromMxArray(prhs[0]);
mwSize numel_real =2*mxGPUGetNumberOfElements(A);
double * d_A = (double const *)(mxGPUGetDataReadOnly(A));

Various functions exist to convert data between complex and real formats on the GPU. These operations require a copy to interleave the data. The function mxGPUCreateComplexGPUArray takes two real mxGPUArrays and interleaves their elements to produce a single complex mxGPUArray of the same length. The functions mxGPUCopyReal and mxGPUCopyImag each copy either the real or the imaginary elements into a new real mxGPUArray. (There is no equivalent of the mxGetImagData function for mxGPUArray objects.)

Was this topic helpful?