
Execute CUDA or PTX Code on the GPU

Create Kernels from CU Files

This section explains how to make a kernel from CU and PTX (parallel thread execution) files.

Compile a PTX File

If you have a CU file you want to execute on the GPU, you must first compile it to create a PTX file. One way to do this is with the nvcc compiler in the NVIDIA CUDA Toolkit. For example, if your CU file is called myfun.cu, you can create a compiled PTX file with the shell command:

nvcc -ptx myfun.cu

This generates the file named myfun.ptx.

Construct the Kernel Object

With a .cu file and a .ptx file you can create a kernel object in MATLAB that you can then use to evaluate the kernel:

k = parallel.gpu.CUDAKernel('myfun.ptx', 'myfun.cu');

Run the Kernel

Use the feval function to evaluate the kernel on the GPU. The following examples show how to execute a kernel using GPUArray objects and MATLAB workspace data.

Use Workspace Data

Assume that you have already written some kernels in a native language and want to use them in MATLAB to execute on the GPU. You have a kernel that does a convolution on two vectors; load and run it with two random input vectors:

k = parallel.gpu.CUDAKernel('conv.ptx', 'conv.cu');

o = feval(k, rand(100, 1), rand(100, 1));

The output is a GPUArray even when the inputs are constants or variables in the MATLAB workspace.

Use GPU Data

It might be more efficient to use GPUArray objects as input when running a kernel:

k = parallel.gpu.CUDAKernel('conv.ptx', 'conv.cu');

i1 = gpuArray(rand(100, 1, 'single'));
i2 = gpuArray(rand(100, 1, 'single'));

o1 = feval(k, i1, i2);

Because the output is a GPUArray, you can now perform other operations using this input or output data without further transfers between the MATLAB workspace and the GPU. When all your GPU computations are complete, gather your final result data into the MATLAB workspace:

o2 = feval(k, o1, i2);

r1 = gather(o1);
r2 = gather(o2);

Determine Input and Output Correspondence

When calling [out1, out2] = feval(kernel, in1, in2, in3), the inputs in1, in2, and in3 correspond to the input arguments of the C function within your CU file, in order. The outputs out1 and out2 store the values of the first and second non-const pointer input arguments to the C function after the C kernel has been executed.

For example, if the C kernel within a CU file has the following signature:

void reallySimple( float * pInOut, float c )

the corresponding kernel object (k) in MATLAB has the following properties:

MaxNumLHSArguments: 1
   NumRHSArguments: 2
     ArgumentTypes: {'inout single vector'  'in single scalar'}

Therefore, to use the kernel object from this code with feval, you need to provide feval two input arguments (in addition to the kernel object), and you can use one output argument:

y = feval(k, x1, x2)

The input values x1 and x2 correspond to pInOut and c in the C function prototype. The output argument y corresponds to the value of pInOut in the C function prototype after the C kernel has executed.

The following is a slightly more complicated example that shows a combination of const and non-const pointers:

void moreComplicated( const float * pIn, float * pInOut1, float * pInOut2 )

The corresponding kernel object in MATLAB then has the properties:

MaxNumLHSArguments: 2
   NumRHSArguments: 3
     ArgumentTypes: {'in single vector'  'inout single vector'  'inout single vector'}

You can use feval on this code's kernel (k) with the syntax:

[y1, y2] = feval(k, x1, x2, x3)

The three input arguments x1, x2, and x3 correspond to the three arguments that are passed into the C function. The output arguments y1 and y2 correspond to the values of pInOut1 and pInOut2 after the C kernel has executed.
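The correspondence described above can be sketched in plain C. This is only an illustration of the rule, not MATLAB's actual prototype parser: the helpers describe_arg and count_outputs are hypothetical names, only float and double are handled, and the const check is a naive string search that assumes const applies to the pointed-to data.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of MATLAB or CUDA): describe one C
 * kernel argument in the style of the ArgumentTypes property.
 * Only float and double are covered in this sketch. */
int describe_arg(const char *ctype, char *out, size_t outlen)
{
    int is_pointer = strchr(ctype, '*') != NULL;
    int is_const   = strstr(ctype, "const") != NULL;
    const char *cls;

    if (strstr(ctype, "float") != NULL)
        cls = "single";          /* C float corresponds to MATLAB single */
    else if (strstr(ctype, "double") != NULL)
        cls = "double";
    else
        return -1;               /* type not covered by this sketch */

    /* A non-const pointer can be written by the kernel, so it is both
     * an input and an output; everything else is input only. */
    const char *dir   = (is_pointer && !is_const) ? "inout" : "in";
    const char *shape = is_pointer ? "vector" : "scalar";

    snprintf(out, outlen, "%s %s %s", dir, cls, shape);
    return 0;
}

/* Count the outputs (MaxNumLHSArguments): one per non-const pointer. */
int count_outputs(const char *const ctypes[], int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (strchr(ctypes[i], '*') != NULL &&
            strstr(ctypes[i], "const") == NULL)
            ++count;
    }
    return count;
}
```

Applied to the moreComplicated signature, count_outputs returns 2 for the types "const float *", "float *", "float *", which matches the MaxNumLHSArguments value shown for that kernel.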

Kernel Object Properties

When you create a kernel object without a terminating semicolon, or when you type the object variable at the command line, MATLAB displays the kernel object properties. For example:

k = parallel.gpu.CUDAKernel('conv.ptx', 'conv.cu')
k = 
  parallel.gpu.CUDAKernel handle
  Package: parallel.gpu

  Properties:
     ThreadBlockSize: [1 1 1]
  MaxThreadsPerBlock: 512
            GridSize: [1 1]
    SharedMemorySize: 0
          EntryPoint: '_Z8theEntryPf'
  MaxNumLHSArguments: 1
     NumRHSArguments: 2
       ArgumentTypes: {'in single vector'  'inout single vector'}

The properties of a kernel object control some of its execution behavior. Use dot notation to alter those properties that can be changed.

For descriptions of the object properties, see the CUDAKernel object reference page.
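As a rough illustration of how ThreadBlockSize and GridSize might be chosen for a one-dimensional problem, the following C sketch computes both values from the number of elements and the MaxThreadsPerBlock limit. The helper names are hypothetical, and the rounding-up rule is the common CUDA convention rather than anything MATLAB does for you.

```c
#include <stddef.h>

/* Hypothetical helper: pick a 1-D thread block size for n elements,
 * capped at the limit reported as MaxThreadsPerBlock.
 * Always returns at least 1. */
size_t choose_block_size(size_t n, size_t max_threads_per_block)
{
    if (n == 0 || max_threads_per_block == 0)
        return 1;
    return n < max_threads_per_block ? n : max_threads_per_block;
}

/* Smallest number of blocks such that blocks * block_size >= n. */
size_t choose_grid_size(size_t n, size_t block_size)
{
    if (block_size == 0)
        return 0;
    return (n + block_size - 1) / block_size;   /* integer ceiling */
}
```

For 1000 elements and a limit of 512, this gives a block size of 512 and a grid size of 2, values you could then assign to k.ThreadBlockSize and k.GridSize. That launches 1024 threads, so a kernel used this way would need to compute a global index from the block and thread indices and skip indices beyond the last element.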

Specify Entry Points

If your PTX file contains multiple entry points, you can identify the particular kernel in myfun.ptx that you want the kernel object k to refer to:

k = parallel.gpu.CUDAKernel('myfun.ptx', 'myfun.cu', 'myKernel1');

A single PTX file can contain multiple entry points to different kernels. Each of these entry points has a unique name. These names are generally mangled (as in C++ mangling). However, when the PTX file is generated by nvcc, the PTX name always contains the original function name from the CU file. For example, if the CU file defines the kernel function as

__global__ void simplestKernelEver( float * x, float val )

then the PTX code contains an entry that might be called _Z18simplestKernelEverPff.

When you have multiple entry points, specify the entry name for the particular kernel when calling CUDAKernel to generate your kernel.
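To make the naming concrete, here is a minimal C sketch that rebuilds the mangled name from the example above and checks whether an entry contains a given function name. Both helpers are hypothetical. The sketch assumes the simplest case of the mangling scheme (no namespaces, and no repeated pointer types, which real mangling abbreviates), and the substring test is just one plausible way to pick an entry, not a statement of how CUDAKernel matches names.

```c
#include <stdio.h>
#include <string.h>

/* Build the mangled name of a simple kernel: "_Z", the length of the
 * function name, the name itself, then one code per parameter
 * (f = float, d = double, P = pointer to, K = const).
 * Assumption: no repeated pointer types, which the real scheme
 * compresses with substitutions that this sketch does not model. */
int mangle_simple(const char *name, const char *param_codes,
                  char *out, size_t outlen)
{
    int written = snprintf(out, outlen, "_Z%zu%s%s",
                           strlen(name), name, param_codes);
    return (written < 0 || (size_t)written >= outlen) ? -1 : 0;
}

/* An entry is a candidate if the original function name appears in it. */
int entry_matches(const char *ptx_entry, const char *kernel_name)
{
    return strstr(ptx_entry, kernel_name) != NULL;
}
```

Calling mangle_simple with the name simplestKernelEver and the parameter codes Pff (pointer to float, then float) produces _Z18simplestKernelEverPff. Note that a short name can match more than one entry, so a name such as add would match both add1 and add2.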

Provide C Prototype Input

If you do not have the CU file corresponding to your PTX file, you can specify the C prototype for your C kernel instead of the CU file:

k = parallel.gpu.CUDAKernel('myfun.ptx', 'float *, const float *, float');

When parsing the C prototype, MATLAB supports the C data types listed in the following table.

Float Types:                  double, double2
                              float, float2

Integer Types:                short, unsigned short, short2, ushort2
                              int, unsigned int, int2, uint2
                              long, unsigned long, long2, ulong2
                              long long, unsigned long long, longlong2, ulonglong2

Boolean and Character Types:  bool
                              char, unsigned char, char2, uchar2

All inputs can be scalars or pointers, and can be labeled const.

The C declaration of a kernel is always of the form:

__global__ void aKernel(inputs ...)

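When translating the definition of a kernel in C into MATLAB, the argument types shown in the earlier examples imply the following rules. Scalar inputs in C (for example, float c) are input-only scalars in MATLAB. const pointer inputs in C (for example, const float * pIn) are input-only arrays. Non-const pointer inputs in C (for example, float * pInOut) are passed in as arrays and are also returned as outputs, because the kernel can modify them.

These rules have some implications. The most notable is that every output from a kernel must also be an input to the kernel. A kernel cannot allocate memory on the GPU, so the input is what defines the size of the output.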
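The point that an output is always sized by one of the inputs can be illustrated with a host-side C emulation. This is a sketch under stated assumptions, not how feval is implemented: emulate_feval and kernel_body are hypothetical names, an ordinary loop stands in for the GPU threads, and the original input is assumed to stay unchanged because the kernel works on a copy.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a kernel body with one const input and one in/out
 * argument; on the GPU each thread would handle one idx. */
static void kernel_body(const float *pIn, float *pInOut, int idx)
{
    pInOut[idx] += pIn[idx];
}

/* Hypothetical host-side emulation of y = feval(k, x1, x2).
 * The output buffer starts as a copy of the non-const pointer input
 * x2, so its size is fixed by that input before the kernel runs.
 * Returns a malloc'ed array of n floats (caller frees), or NULL. */
float *emulate_feval(const float *x1, const float *x2, int n)
{
    if (n <= 0)
        return NULL;
    float *y = malloc((size_t)n * sizeof *y);
    if (y == NULL)
        return NULL;
    memcpy(y, x2, (size_t)n * sizeof *y);   /* copy x2 into the output */
    for (int idx = 0; idx < n; ++idx)       /* one iteration per thread */
        kernel_body(x1, y, idx);
    return y;                               /* x2 itself is untouched */
}
```

The kernel never allocates anything. It only overwrites a buffer whose length was decided by the caller, which is why a result that is conceptually output-only still has to be passed in as an array of the right size.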

Complete Kernel Workflow

Add Two Numbers

This example adds two doubles together on the GPU. You should have the NVIDIA CUDA Toolkit installed, and have CUDA-capable drivers for your card.

  1. The CU code to do this is as follows.

    __global__ void add1( double * pi, double c ) 
    {
        *pi += c;
    }

    The directive __global__ indicates that this is an entry point to a kernel. The code uses a pointer to send out the result in pi, which is both an input and an output. Put this code in a file called test.cu in the current directory.

  2. Compile the CU code at the shell command line to generate a PTX file called test.ptx.

    nvcc -ptx test.cu
  3. Create the kernel in MATLAB. Currently this PTX file only has one entry so you do not need to specify it. If you were to put more kernels in, you would specify add1 as the entry.

    k = parallel.gpu.CUDAKernel('test.ptx', 'test.cu');
  4. Run the kernel with two inputs of 1. By default, a kernel runs on one thread.

    >> o = feval(k, 1, 1)
    o = 
        2
    

Add Two Vectors

This example extends the previous one to add two vectors together. For simplicity, assume that there are exactly the same number of threads as elements in the vectors and that there is only one thread block.

  1. The CU code is slightly different from the last example. Both inputs are pointers, and one is constant because you are not changing it. Each thread uses its thread index to work out which element it should add, and then adds the elements at that index. (Getting these thread- and block-specific values is a very common pattern in CUDA programming.)

    __global__ void add2( double * v1, const double * v2 ) 
    {
        int idx = threadIdx.x;
        v1[idx] += v2[idx];
    }

    Save this code in the file test.cu.

  2. Compile as before using nvcc.

    nvcc -ptx test.cu
  3. If you put this code in the same CU file as the first example, you must specify the entry point name this time to distinguish it.

    k = parallel.gpu.CUDAKernel('test.ptx', 'test.cu', 'add2');
    
  4. When you run the kernel, you need to set the number of threads correctly for the vectors you want to add.

    >> o = feval(k, 1, 1)
    o = 
        2
    >> N = 128;
    >> k.ThreadBlockSize = N;
    >> o = feval(k, ones(N, 1), ones(N, 1));
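The effect of ThreadBlockSize on this kernel can be sketched with a host-side C emulation, in which a loop index plays the role of threadIdx.x. The function names are hypothetical and the loop runs sequentially, so this only illustrates which elements get updated, not how the GPU schedules threads.

```c
/* Body of add2 for one thread; thread_idx stands in for threadIdx.x. */
static void add2_thread(double *v1, const double *v2, int thread_idx)
{
    v1[thread_idx] += v2[thread_idx];
}

/* Hypothetical emulation of launching add2 on a single thread block
 * of block_size threads: the loop visits each thread index once.
 * The caller must ensure block_size does not exceed the array length. */
void launch_add2(double *v1, const double *v2, int block_size)
{
    for (int t = 0; t < block_size; ++t)
        add2_thread(v1, v2, t);
}
```

With block_size equal to the vector length, every element is updated. With a block size of 1, which corresponds to the default ThreadBlockSize of [1 1 1], only the first element changes, which is why the thread block size has to be set to N before running the kernel on vectors.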
    

  


© 1984-2012 The MathWorks, Inc.