MATLAB Answers

Unable to compile cuda code containing dynamic parallelism: Error: “ptxas fatal : Unresolved extern function 'cublasCreate_v2'”

48 views (last 30 days)
I’m trying to create a simple mex function that calls cublas functions such as cublasDgemm from inside a kernel so I can utilize nested, or dynamic, parallelism in my calculations which is supposed to be supported on newer GPUs such as the GTX1080 I’m using.
However, when I try to compile my cuda code from Matlab like this:
mexcuda -lcublas
I get the error:
Building with 'NVIDIA CUDA Compiler'.
Error using mex
ptxas fatal : Unresolved extern function 'cublasCreate_v2'
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are
deprecated, and may be removed in a future release (Use
-Wno-deprecated-gpu-targets to suppress warning).
And as soon as I comment out everything inside my kernel which is related to cublas it works fine again...Could someone please advise me on what I need to do to get this to compile and work? I would really appreciate it.
The sample cuda code I’ve written to test this looks like this:
#include "mex.h"
#include "cublas_v2.h"
#include <cuda_runtime.h>
/* Kernel code with dgemm */
__global__ void dgemmkernel(const double* deviceX, double* XX, const int n, const int m) {
/* Cublas handle */
cublasHandle_t handle;
/* Scalar constants */
double alpha = 1.0, beta = 0.0;
/* Calculate XX = X'*X using cublasDgemv. */
cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, m, &alpha, deviceX, m, deviceX, m, &beta, XX, n);
/* The Matlab gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
/* Host-side variables */
const double *X; // Host-side input X.
double *Output1; // Matlab output.
size_t m, n; // size variables.
/* Device-side variables. */
double *deviceX; // Device-side version X.
double *XX; // GPU version XX.
/* Get pointers to input host-side array X from Matlab */
X = mxGetPr(prhs[0]);
/* Get the dimensions of the input variables */
m = mxGetM(prhs[0]); // Number of rows in X.
n = mxGetN(prhs[0]); // Number of columns in X.
/* Allocate memory on the device for the variables involved in the calculations. */
cudaMalloc(&deviceX, m * n * sizeof(double)); // [m-by-n]
cudaMalloc(&XX, n * n * sizeof(double)); // [n-by-n]
/* Use cudaMemcpy to copy X from host to device */
cudaMemcpy(deviceX, X, (m*n) * sizeof(double), cudaMemcpyHostToDevice);
/* Call dgemm kernel */
dgemmkernel<<<1, 1>>>(deviceX, XX, n, m);
/* Deliver results back to matlab as host-side variables */
plhs[0] = mxCreateDoubleMatrix(n, n, mxREAL);
Output1 = mxGetPr(plhs[0]);
cudaMemcpy(Output1, XX, (n*n) * sizeof(double), cudaMemcpyDeviceToHost);
/* Free the cudaMalloc'ed arrays from the device before exit */


Petter Stefansson
Petter Stefansson on 7 Jun 2017
No I hadn’t tried that, but now I have. It still doesn’t work but it changed the error message somewhat into:
Error using mex
nvlink error : Undefined reference to 'cublasCreate_v2' in
(target: sm_35)
nvlink error : Undefined reference to 'cublasDgemm_v2' in
(target: sm_35)

Sign in to comment.

Accepted Answer

Joss Knight
Joss Knight on 12 Jun 2017
You need to link against the cublas device library in the device linking stage and unfortunately there isn't a proper formal API to do this. You can use the variable NVCC_FLAGS to add it there, and then the standard -L and -l options to add it to the host linking stage. In my example command below the cublas device library is located at /usr/local/cuda/lib64 - you should substitute this for the lib64 directory wherever you've installed the CUDA Toolkit.
mexcuda -v -dynamic NVCC_FLAGS=-lcublas_device -L/usr/local/cuda/lib64 -lcublas_device
In the long term I'll take this as a request to have a more convenient way of linking in other device libraries.

More Answers (1)

Joss Knight
Joss Knight on 7 Jun 2017
I don't think you can call cublasDgemm inside a kernel - that's not the way dynamic parallelism works. You can't just call any old host code that happens to contain kernel launches. You can, as always, call device code, and what dynamic parallelism lets you do is launch other kernels.


Petter Stefansson
Petter Stefansson on 7 Jun 2017
Are you sure that’s not how dynamic parallelism works? Because if you look at nvidia’s “Dynamic parallelism in CUDA” technical notes it specifically states for example:
'A kernel can also call GPU libraries such as CUBLAS directly without needing to return to the CPU..'
As I understood it, starting from toolkit 5.0 any device with at least compute capability 3.5 should be able to do this. This feature is the main reason I upgraded from a lower CC version GPU, so it would be extremely disappointing if I have misunderstood the situation.
Joss Knight
Joss Knight on 12 Jun 2017
I did not realise that! Thank you! The MEXCUDA dynamic MEX options file is not properly set up to link against the cublas device library so you'll need to co-opt some existing MEX variables to get the right command line options. See my other Answer for how to do this.

Sign in to comment.

Sign in to answer this question.