I have an application where I call my own CUDA functions from a mex file. The memory transferred can be very large (both input and output), which means that pinned memory could speed up the process quite a lot. I have seen several posts, both here and elsewhere on the internet, mentioning that you cannot use pinned memory (cudaMallocHost) with MATLAB variables; however, all of these are from 2017 or older. Now that we are in 2019, and the Parallel Computing Toolbox, CUDA and MATLAB have changed a lot, is this still true? Can pinned memory still not be used? For applications where memory is critical, this is a big drawback.

Allocating pinned memory in matlab mex with CUDA

Alexis Detroyat on 7 Aug 2019

Have you found a way to work around this? I am currently trying to use TIGRE (good job and thank you!) with 2048x2048 projections, and Atb_mex obviously runs out of memory.

Ander Biguri on 13 Aug 2019

Hi Alexis.

No, I haven't, but this has nothing to do with upper limits on memory, just transfer speeds. The answer really is that no, MATLAB cannot allocate pinned memory.

If you have questions about TIGRE, I suggest GitHub or my email (you should not be getting that error).

Ander

Matt J on 13 Aug 2019

Edited: Matt J on 13 Aug 2019

Can TIGRE mex functions return a result to GPU memory instead, i.e., as a gpuArray object? That should diminish the need for frequent GPU-CPU transfers in most applications of TIGRE that I would envision. Also, I have noticed improvements in interoperability between gpuArray and user-supplied CUDA since R2017b.

Ander Biguri on 13 Aug 2019

Hi Matt,

Indeed, that would be an option. It was not implemented for design reasons, and perhaps it is now a bit too late to restructure the entire toolbox, but it would definitely be a solid way to minimize the transfer times.

In any case, for most algorithms and uses of TIGRE, especially when the data is big, the transfer times are just a small fraction of the computational time, so only a small theoretical maximum improvement could be achieved if TIGRE returned gpuArray objects.

Perhaps I will test this more rigorously at some point and put numbers on it, but I think the improvement does not exceed 10%, and for industrial-sized datasets not even 1%.
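To make that bound concrete, here is a minimal sketch of the usual Amdahl-style arithmetic. It reads the 10% and 1% figures as the share of total runtime spent on CPU-GPU transfers, which is an assumption, and the function name `max_speedup` is hypothetical. These are the estimates from the comment above, not measurements.

```python
def max_speedup(transfer_fraction):
    """Best-case speedup if transfers taking `transfer_fraction` of the
    total runtime were removed entirely and nothing else changed."""
    if not 0.0 <= transfer_fraction < 1.0:
        raise ValueError("transfer_fraction must be in [0, 1)")
    return 1.0 / (1.0 - transfer_fraction)


# Assumed transfer shares: 10% (typical case) and 1% (industrial-sized data).
for fraction in (0.10, 0.01):
    print(f"{fraction:.0%} of runtime in transfers -> "
          f"at most {max_speedup(fraction):.3f}x faster")
```

Under this reading, even a perfect removal of transfers gives roughly 1.11x and 1.01x respectively.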

Matt J on 13 Aug 2019

Edited: Matt J on 13 Aug 2019

"In any case, for most algorithms and uses of TIGRE, especially when the data is big, the transfer times are just a small fraction of the computational time"

I'm not sure which algorithms you had in mind here, but performance will definitely suffer for ordered subset algorithms if you have to do a transfer after every forward/back projection of a subset. The total data set may be large, but the size of a subset can be small in comparison, and the more subsets you have, the more transfers you will have to do. If I were to undertake the task of creating dedicated gpuArray versions of the forward/back projection modules only, are you saying it would be a highly challenging task?
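The scaling with the number of subsets can be sketched with a simple traffic model. This is an illustration under stated assumptions, not a description of TIGRE's internals: each forward projection call is assumed to upload the whole image and download one subset of projections, and each backprojection call to upload one subset and download the whole image. The function name and the example sizes are hypothetical.

```python
def bytes_moved_per_iteration(image_bytes, projection_bytes, n_subsets):
    """Host<->device traffic for one pass over all projections.

    Assumed model: every subset update makes one forward projection call
    (image up, subset projections down) and one backprojection call
    (subset projections up, image down).
    """
    if n_subsets < 1:
        raise ValueError("n_subsets must be at least 1")
    # The full image crosses the bus twice per subset update.
    image_traffic = 2 * image_bytes * n_subsets
    # The subsets together cover all projections once in each direction.
    projection_traffic = 2 * projection_bytes
    return image_traffic + projection_traffic


# Hypothetical sizes: 512^3 float32 image, 360 projections of 512x512 float32.
image = 512**3 * 4
projections = 360 * 512 * 512 * 4
for subsets in (1, 10, 360):
    gib = bytes_moved_per_iteration(image, projections, subsets) / 2**30
    print(f"{subsets:3d} subsets -> {gib:.1f} GiB moved per iteration")
```

In this model the image traffic grows linearly with the number of subsets while the compute per iteration stays roughly constant, which is why many small subsets make transfers dominate.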

Ander Biguri on 13 Aug 2019

Hi Matt,

You are absolutely right. In fact, a small test that I did not long ago showed that, particularly for SART (which updates the image projection by projection), a 10x speedup is expected if the memory transfer is removed and all the data is kept on the GPU.

For industrial/scientific image sizes, SART would still be very slow and not recommended. For medical images, this improvement may be very welcome.
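As a rough cross-check of what a 10x figure implies, the same Amdahl-style relation can be inverted. This is a sketch that assumes the whole speedup comes from removing transfers and nothing else; the function name is hypothetical.

```python
def implied_transfer_fraction(speedup):
    """Share of the original runtime that must have been transfers,
    if removing them alone produced the given speedup."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


# The 10x figure reported for SART in the comment above.
print(f"{implied_transfer_fraction(10.0):.0%} of the runtime was transfers")
```

Under that assumption, a 10x speedup means about nine tenths of the SART runtime went into moving data rather than computing.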

Now, about modifying TIGRE: it may be a challenging task.

Recently I updated TIGRE (https://arxiv.org/pdf/1905.03748.pdf) to work with multiple GPUs, where the transfer to the CPU may be required, as TIGRE now breaks the problem up into chunks if it does not fit on the GPU, allowing bigger reconstructions than before. Modifying this version would be quite a huge workload, as it would require fairly big changes on the CUDA side, where there is a lot of memory management involved.

However, modifying the older single-GPU version will likely be considerably easier. Some changes to the CUDA code will be required (as it is what moves memory in and out of the GPU), but just a few lines do the job. If you were to modify it to use dedicated gpuArrays and succeed, we could find a way to add it to the TIGRE code, and I could add some logic for when to use each of the versions (depending on problem size, number of GPUs, etc.). If you are up for the task, please feel free to email me and we can discuss it further.
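The idea of breaking a problem into chunks that fit on the GPU can be illustrated with a small sketch. This is not TIGRE's actual splitting logic, which is described in the linked paper; it only assumes a volume that can be cut along one axis into slices of equal size, and the function name and parameters are hypothetical.

```python
def split_slices(n_slices, bytes_per_slice, gpu_bytes_available):
    """Return (start, stop) slice ranges so that every chunk fits
    within the given GPU memory budget."""
    if bytes_per_slice > gpu_bytes_available:
        raise ValueError("a single slice does not fit in the GPU budget")
    slices_per_chunk = gpu_bytes_available // bytes_per_slice
    return [(start, min(start + slices_per_chunk, n_slices))
            for start in range(0, n_slices, slices_per_chunk)]


# Hypothetical example: 10 slices of 4 bytes each, 12 bytes of GPU budget.
print(split_slices(10, 4, 12))
```

Each chunk has to be copied to the GPU and its result copied back, which is why this version cannot simply keep everything in GPU memory.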

Ander
