Linux CUDA-based Shared Library Crashes MATLAB with Segfault on Kernel Call

Hey all,
I'm running into an issue with a CUDA-based shared library I've written to solve a system of PDEs, which I load through loadlibrary. It's written in pretty generic CUDA with no outside libraries. When compiled to a shared library and executed from MATLAB, it crashes with a segfault once execution reaches the kernel calls (there are 5 distinct kernels, each called about twice). I've commented out the kernels one by one and each of them leads to a segfault, leading me to believe there is some issue with the kernel launch mechanism, maybe?
I believe the kernels to be well-functioning - I've written a C++ caller of the .so, and it works fine (it passes cuda-memcheck as well). This code also works on Windows just fine, exactly as is (called from MATLAB). Therefore, I believe this to be a MATLAB-specific issue, or possibly a compile-flags issue. The odd thing is that a quick, trivial kernel appears to work within MATLAB (same flags as below) - that kernel does execute correctly.
So, I understand that you don't have my code...so I'm not asking for code debugging. My questions are more about MATLAB's requirements for compiling shared libraries. I use the following flags to compile through nvcc:
-std=c++14 -shared -x cu -cudart static -O2 -gencode=arch=compute_50,code=\"sm_50,compute_50\" -m64 -Xcompiler "-fPIC -Wno-narrowing" -w -Wno-deprecated-gpu-targets
and the following to link:
-shared -w
Do you see any issues with that? I mark the functions as extern "C" when compiling with g++ and have a clause for when MATLAB compiles the thunk library to simply use extern (due to using gcc).
#ifdef __linux__
#ifdef __cplusplus
#define EXTC extern "C"
#else
#define EXTC extern
#endif
...
Any issues there? I don't think there are - as mentioned, other function calls work fine.
I'm at a bit of a loss as to how to move forward here. Does anyone have any insight?
Thanks.

7 Comments

There could be a million issues. Are you passing GPU data to your kernel that you allocated in MATLAB? Is your library compiled with the same CUDA version as MATLAB?
It's going to be pretty hard to advise without seeing at least some basic reproduction code. Have you thought about writing a minimal bit of code that reproduces the issue that you can post here?
I don't mean to be overly vague...and I hate to be that guy that asks an ill-defined question. The issue is that I've attempted to produce a compact version of the code that reproduces this error...and I can't seem to reproduce it apart from my code base, which isn't huge, but is too big for here. A simple CUDA kernel (a vector addition I wrote quickly) and a C++ caller compiled with the above flags work just fine. That's why I'm asking more for insights related to compiling a shared library, CUDA considerations, etc., rather than debugging of my code. It seems like I've declared the exported functions correctly (extern "C" when using a C++ compiler, extern when using a C compiler - which I understand MATLAB uses).
To answer your questions:
The function does take in matrices from MATLAB, copy them to the GPU, and operate on them. The cudaMalloc and cudaMemcpy operations appear to happen correctly. It is once a CUDA kernel (any kernel) is called that the entire program segfaults. This is odd behavior: normally a memory fault inside a CUDA kernel causes the kernel to fail and an error code to be returned to the host program, not the entire process to segfault.
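For context, the pattern I mean for surfacing such an error after each launch is roughly this (the CHECK macro and kernel names are illustrative, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative error-check macro; any name would do.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
    } while (0)

// After any launch, e.g.  myKernel<<<grid, block>>>(d_data, n);
// CHECK(cudaGetLastError());        // launch/configuration errors
// CHECK(cudaDeviceSynchronize());   // faults during kernel execution
```

Kernel launches are asynchronous, so cudaGetLastError only catches launch-time problems; the cudaDeviceSynchronize call is what forces faults that happen during execution to be reported.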
I wasn't aware MATLAB had an associated CUDA version, although maybe this could cause the issue. I'm using CUDA 11.1, and on Linux I don't have the GPU toolkit installed (I do on Windows - where it works). Is there still a CUDA version associated with MATLAB? If so, is there a way to use the system CUDA library instead of the one that comes with MATLAB? Would installing the GPU toolkit potentially help?
I'll keep working on a reproducible example and post if I can get it working (or, rather, not working).
If you don't have Parallel Computing Toolbox and/or you are not creating any gpuArray data in MATLAB then no, it doesn't matter what toolkit MATLAB is using.
It's so hard to say what might be wrong, since you seem to be claiming there are no bugs in your CUDA code. One guess is that there are bugs, but it's only when your library is running in the MATLAB process that reading or writing off the end of an array is causing a crash. Sometimes cuda-memcheck doesn't notice these things, especially an illegal read, unless you compile with device debugging.
Try running cuda-memcheck with MATLAB. It's simple enough to launch MATLAB with the -r flag to run some code and then exit; with any luck the segfault will be triggered and cuda-memcheck will tell you where the problem is.
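Something along these lines (the script name is a placeholder for your own entry point):

```shell
# Run the script non-interactively under cuda-memcheck; "myscript" is
# a placeholder. The try/catch keeps MATLAB from hanging on an error.
cuda-memcheck matlab -nodisplay -nosplash -r "try, myscript; catch err, disp(err.message); end; exit"
```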
You could have an alignment issue, or you could be getting your datatype wrong. Try taking the data you copied to device and copying it back to a new host array. Display the array contents and check they're the same as before. Try copying the data to a newly allocated host array and then copy that data to device. That could fix an alignment issue.
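A sketch of that round-trip check (buffer names and the size are illustrative):

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main(void) {
    const size_t n = 1024;                  // illustrative size
    static double in[1024], out[1024];
    for (size_t i = 0; i < n; ++i) in[i] = (double)i;

    double *dev = NULL;
    cudaMalloc(&dev, n * sizeof(double));
    cudaMemcpy(dev, in, n * sizeof(double), cudaMemcpyHostToDevice);
    // Copy straight back into a second host buffer...
    cudaMemcpy(out, dev, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // ...and verify the round trip is bit-exact.
    puts(memcmp(in, out, n * sizeof(double)) == 0 ? "match" : "MISMATCH");
    return 0;
}
```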
Finally, use the NVIDIA NSight debugger to step through your CUDA code.
I appreciate your continued help.
That's a great trick, running the MATLAB script from the command line through cuda-memcheck - I didn't realize you could do that. When I do (now with the library compiled with debug flags, run through cuda-memcheck), it completes with no errors (after taking a lot longer). Spitting out some data, it seems reasonable...although I stopped it a bit early. It's odd that there's a segfault/crash when running in the IDE (debug or release), but from the command line it completes successfully... It's also as fast as I expect and yields the same results when running without cuda-memcheck.
What would cause a crash in the IDE, but not from the command line? Seems odd to me. The next step I'll try is to run it until it converges to a tighter tolerance (what I run on Windows), and see if the output data is the same - I don't know, maybe kick the relevant data to a CSV, then load up full MATLAB and graph as I would normally.
The odd part is that it works just fine on Windows...so I would think it wouldn't be any different on Linux - especially given the lack of cuda-memcheck errors, etc. I've copied the data back after copying it to the GPU and checked it - it is the same, so no alignment issues.
NSight might be my last resort... I'll also keep trying to reproduce with a small example. May be a few days.
So, running to a tighter tolerance, the exact same graph is produced (and I would imagine the values agree to within floating-point error).
So, to recap the facts (for my sake), the issue is this:
I have a MATLAB script that uses a shared-library function consisting of several CUDA kernel calls, looped for ~10000 iterations until convergence. The shared library passes cuda-memcheck and is (as far as I know) free of any segfault-like bugs. This code runs correctly from the command line and produces correct results, at least in the single variable I've checked (although the variables are coupled, so the others would have to be correct as well or that one couldn't be). When running from the IDE, MATLAB crashes with a segfault on reaching a line containing any CUDA kernel call. cudaMemcpy calls complete correctly in both directions with no segfault.
So, what would cause MATLAB (the IDE) to segfault on a kernel launch within a C++ shared library when the same launch works from the command line? Is a version of the CUDA libraries loaded with the IDE that isn't loaded from the command line? I don't have any relevant toolboxes (Parallel Computing Toolbox, GPU Coder, etc.), although I do have Simulink installed.
This really does sound like a CUDA library issue - although I don't know why the IDE would cause a different CUDA library to load vs the command line...
Don't mean to be repetitive, it just helps me to collect the facts in one place.
Try launching MATLAB with -softwareopengl and see whether there's some sort of graphics issue here.
Thanks. I gave that a shot, and it's still not working. At this point, I think I'll just deal with running it from the command line. It just works from there. I can save the workspace at the end of the sim, and load it from the IDE. I can even have the IDE running in a separate process at the same time. So, I think that will be a good enough solution for me on Linux. Thanks for all your help. I appreciate your diligence in continuing to suggest things to try.

Answers (0)


Release: R2020b
Asked: 11 Dec 2020
Edited: 17 Dec 2020
