## cuSOLVER Example

This example solves the systems of linear equations` Ax = B` for `x` by using the cuSOLVER library. The matrices `A` and `B` must have the same number of rows. If `A` is a scalar, then `A\B` is equivalent to `A.\B`. If `A` is a square n-by-n matrix and `B` is a matrix with n rows, then ```x = A\B``` is a solution to the equation `A*x = B`, if it exists. The MATLAB® implementation of `backslash` is:

```function [x] = backslash(A,b) if (isscalar(A)) x = coder.nullcopy(zeros(size(b))); else x = coder.nullcopy(zeros(size(A,2),size(b,2))); end x = A\b; end```

### Prepare `backslash` for Kernel Creation

GPU Coder™ requires no special pragma to generate calls to libraries. Just as before, there are two ways to generate CUDA® kernels — `coder.gpu.kernelfun` and `coder.gpu.kernel`. In this example, we utilize the `coder.gpu.kernelfun` pragma to generate CUDA kernels. The modified `backslash` function is:

```function [x] = backslash(A,b) %#codegen if (isscalar(A)) x = coder.nullcopy(zeros(size(b))); else x = coder.nullcopy(zeros(size(A,2),size(b,2))); end coder.gpu.kernelfun() x = A\b; end```

Note

A minimum size is required on the input data for replacing math operators and functions with cuSOLVER library implementations. The minimum threshold is 128 elements.

### Generated CUDA Code

When you generate CUDA code, GPU Coder creates function calls to initialize the cuSOLVER library, perform `mldivide` operations, and release hardware resources that the cuSOLVER library uses. A snippet of the generated CUDA code is:

``` cusolverEnsureInitialization(); /* Copyright 2017 The MathWorks, Inc. */ cudaMemcpy(b_gpu_A, A, 1152UL, cudaMemcpyHostToDevice); blackslash_kernel1<<<dim3(1U, 1U, 1U), dim3(160U, 1U, 1U)>>>(b_gpu_A,gpu_A); cudaMemcpy(b_A, gpu_A, 1152UL, cudaMemcpyDeviceToHost); cusolverDnDgetrf_bufferSize(cusolverGlobalHandle, 12, 12, &gpu_A, 12, &cusolverWorkspaceReq); cusolverWorkspaceTypeSize = 8; cusolverInitWorkspace(); cudaMemcpy(gpu_A, b_A, 1152UL, cudaMemcpyHostToDevice); cusolverDnDgetrf(cusolverGlobalHandle, 12, 12, &gpu_A, 12, (real_T *) cusolverWorkspaceBuff, &gpu_ipiv_t, gpu_info_t); A_dirtyOnGpu = true; cudaMemcpy(&info_t, gpu_info_t, 4UL, cudaMemcpyDeviceToHost);```

To initialize the cuSOLVER library and create a handle to the cuSOLVER library context, the function `cusolversEnsureInitialization()` calls `cusolverDnCreate()` cuSOLVER API. It allocates hardware resources on the host and device.

```static void cusolverEnsureInitialization(void) { if (cusolverGlobalHandle == NULL) { cusolverDnCreate(&cuSolverGlobalHandle); } }```

`backslash_kernel1` zero pads the matrix `A`. This kernel is launched with a single block of 512 threads.

```static __global__ __launch_bounds__(160, 1) void backslash_kernel1(const real_T * A, real_T *b_A) { int32_T threadId; ; ; threadId = (int32_T)(((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) + (int32_T)((threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x) + threadIdx.x)); if (!(threadId >= 144)) { /* Copyright 2017 The MathWorks, Inc. */ b_A[threadId] = A[threadId]; } }```

Calls to `cudaMemcpy` transfer the matrix `A` from the host to the device. The function `cusolverDnDgetrf` computes the LU factorization of an m×n matrix:

`P*A = L*U`

where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.

### cuSOLVER Standalone Code

For functions like `qr` that only have partial support in cuSOLVER, GPU Coder uses LAPACK library where necessary. For MEX functions, the code generator uses the LAPACK library that is included with MATLAB. For standalone code, the code generator uses the LAPACK library that you specify. To specify the LAPACK library:

• At the command line, define your own `coder.LAPACKCallback` class containing the LAPACK library information and assign it to the `CustomLAPACKCallback` property of the code configuration object.

• In the GPU Coder app, set Custom LAPACK library callback to your LAPACK library.

For example, to generate a standalone executable, you can use the following code generation script. Here `myLAPACK` is the name of the custom `coder.LAPACKCallback` class containing the LAPACK library information.

```cfg = coder.gpuConfig('exe'); cfg.CustomLAPACKCallback = 'myLAPACK'; cfg.GenerateExampleMain = 'GenerateCodeAndCompile'; classdef myLAPACK < coder.LAPACKCallback methods (Static) function hn = getHeaderFilename() hn = 'lapacke.h'; end function updateBuildInfo(buildInfo, buildctx) [~,linkLibExt] = buildctx.getStdLibInfo(); cudaPath = getenv('CUDA_PATH'); libPath = 'lib\x64'; buildInfo.addIncludePaths(fullfile(cudaPath,'include')); libName = 'cusolver'; libPath = fullfile(cudaPath,libPath); buildInfo.addLinkObjects([libName linkLibExt], libPath, ... '', true, true); lapackLocation = 'C:\LAPACK\win64'; % specify path to LAPACK libraries includePath = fullfile(lapackLocation,'include'); buildInfo.addIncludePaths(includePath); libPath = fullfile(lapackLocation,'lib'); libName = 'mllapack'; buildInfo.addLinkObjects([libName linkLibExt], libPath, ... '', true, true); buildInfo.addDefines('HAVE_LAPACK_CONFIG_H'); buildInfo.addDefines('LAPACK_COMPLEX_STRUCTURE'); end end end```
For more information, see Speed Up Linear Algebra in Generated Standalone Code by Using LAPACK Calls.