CUDAKernel
Kernel executable on GPU
Constructor
parallel.gpu.CUDAKernel
Description
A CUDAKernel object represents a CUDA kernel that can execute on a GPU. You create the kernel from CU and PTX code. For an example of how to create and use a CUDAKernel object, see Run CUDA or PTX Code on GPU.
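A minimal sketch of constructing a kernel object, assuming a compiled PTX file myKernel.ptx and its source myKernel.cu; these file names are placeholders, not part of this reference page:

```matlab
% Construct the kernel object from hypothetical PTX and CU files.
% An optional third argument can select a specific entry point when the
% PTX file contains more than one kernel.
kern = parallel.gpu.CUDAKernel('myKernel.ptx', 'myKernel.cu');
```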
Methods
Method | Description |
---|---|
existsOnGPU | Determine if gpuArray or CUDAKernel is available on GPU |
feval | Evaluate kernel on GPU |
setConstantMemory | Set some constant memory on GPU |
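A hedged sketch of calling these methods on the hypothetical kern object from the Description section; the input sizes, the kernel's argument list, and the constant-memory symbol name scaleFactor are assumptions for illustration:

```matlab
% Verify that the kernel object is available on the GPU.
existsOnGPU(kern)

% Set a constant-memory symbol assumed to be declared in the CU code as
% __constant__ float scaleFactor;
setConstantMemory(kern, 'scaleFactor', single(2));

% Launch the kernel; inputs can be MATLAB arrays or gpuArray objects,
% and outputs are returned as gpuArray objects.
in1 = gpuArray(single(rand(1, 1024)));
in2 = gpuArray(single(rand(1, 1024)));
out = feval(kern, in1, in2, zeros(1, 1024, 'single'));
```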
Properties
A CUDAKernel object has the following properties:
Property Name | Description |
---|---|
ThreadBlockSize | Size of the block of threads on the kernel. This can be an integer vector of length 1, 2, or 3 (since thread blocks can be up to 3-dimensional). The product of the elements of ThreadBlockSize must not exceed MaxThreadsPerBlock for this kernel, and no element of ThreadBlockSize can exceed the corresponding element of the MaxThreadBlockSize property. See the example after this table. |
MaxThreadsPerBlock | Maximum number of threads permissible in a single block for this CUDA kernel. The product of the elements of ThreadBlockSize must not exceed this value. |
GridSize | Size of the grid (effectively the number of thread blocks that the GPU launches independently). This is an integer vector of length 3. No element of this vector can exceed the corresponding element of the MaxGridSize property of the GPUDevice object. |
SharedMemorySize | The amount of dynamic shared memory (in bytes) that each thread block can use. Each thread block has an available shared memory region. The size of this region is limited in current cards to ~16 kB, and is shared with registers on the multiprocessors. As with all memory, this needs to be allocated before the kernel is launched. It is also common for the size of this shared memory region to be tied to the size of the thread block. Setting this value on the kernel ensures that each thread in a block can access this available shared memory region. |
EntryPoint | (read-only) A character vector containing the actual entry point name in the PTX code that this kernel calls, for example '_Z13returnPointerPKfPy'. |
MaxNumLHSArguments | (read-only) The maximum number of left-hand side arguments that this kernel supports. It cannot be greater than the number of right-hand side arguments, and it is smaller if any inputs are constant or scalar. |
NumRHSArguments | (read-only) The required number of right-hand side arguments needed to call this kernel. Every input must define either the scalar value of an input, the elements of a vector input/output, or the size of an output argument. |
ArgumentTypes | (read-only) Cell array of character vectors, the same length as NumRHSArguments. Each character vector indicates the expected MATLAB type for that input: a numeric type such as uint8, single, or double, followed by the word scalar or vector to indicate whether the argument is passed by value or by reference. In addition, if the argument is only an input to the kernel, it is prefixed by in; if it is an input/output, it is prefixed by inout. This lets you decide how to call the kernel efficiently with both MATLAB arrays and gpuArray objects, and shows which kernel inputs are treated as outputs. |
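As referenced in the ThreadBlockSize row, here is a hedged sketch of using these properties to configure a launch. The element count, the 256-thread block, the per-thread shared-memory budget, and the kern object from the earlier examples are assumptions for illustration:

```matlab
% Configure a 1-D launch that covers N elements with the hypothetical
% kernel object kern created earlier.
N = 1e6;
blockSize = min(256, kern.MaxThreadsPerBlock);    % threads per block
kern.ThreadBlockSize = [blockSize, 1, 1];
kern.GridSize        = [ceil(N / blockSize), 1, 1];

% Reserve dynamic shared memory: one single-precision value (4 bytes)
% per thread in the block.
kern.SharedMemorySize = 4 * blockSize;

% Inspect the kernel interface before calling feval.
kern.EntryPoint           % mangled entry point name from the PTX code
kern.NumRHSArguments      % how many inputs feval expects
kern.MaxNumLHSArguments   % how many outputs feval can return
kern.ArgumentTypes        % e.g. {'in single vector','in single vector','inout single vector'}
```

The ArgumentTypes value shown in the comment is only one possible result; the actual contents depend on the prototype of the kernel in the CU code.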
See Also
gpuArray | GPUDevice | parallel.gpu.CUDAKernel