How to use lsqr with GPU?

Su Py on 28 Jul 2021
Commented: Joss Knight on 9 Aug 2021
I'm using lsqr to solve a least squares problem of the form A*x = b. Since A is a huge matrix, I implemented a function f that computes the products A*x and A'*x, and I pass it to lsqr instead of holding A in memory. To speed up the calculation, f uses parfor statements to calculate these products.
How can I use my GPU cores in order to speed up these elements?
The general structure of my code:
function Ax = afun(x,flag)
    if strcmp(flag,'notransp') % Compute A*x
        parfor i = 1:K
            Ax_mat(:,i) = ...
        end
        Ax = Ax_mat(:);
    elseif strcmp(flag,'transp') % Compute A'*x
        parfor i = 1:K
            Ax_mat(:,i) = ...
        end
        Ax = Ax_mat(:);
    end
end
solution = lsqr(@afun,b);

Answers (1)

Joss Knight on 4 Aug 2021
Edited: Joss Knight on 4 Aug 2021
solution = lsqr(@afun,gpuArray(b));
Alternatively, move the data to the GPU inside your afun operation. The problem is that using a parallel pool in conjunction with GPU execution is generally counter-productive. You may have many CPU cores to perform your matrix-vector multiplication one chunk at a time, but you only have one GPU. If you do the same on the GPU, the parallelism will be lost as each worker waits to access the GPU. You could attempt to load-balance between CPU and GPU by having only one worker use the GPU, but then you're going to encounter your memory issue: to balance properly, the GPU worker may need to be working on a chunk of data 10 times larger than the CPU workers.
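As a rough illustration of the first suggestion, here is a minimal sketch that passes a gpuArray right-hand side to lsqr together with a function handle. The operator is purely hypothetical, since the original afun body is elided: it assumes A = kron(eye(K), B) for a small dense matrix B, and the names B, K, m, n, xTrue, mulA, mulAt and ops are illustrative only. canUseGPU is used so the same script falls back to the CPU when no GPU is available.

```matlab
% Hypothetical operator (assumption): A = kron(eye(K), B), never formed explicitly.
rng(0);
m = 60; n = 40; K = 8;            % illustrative sizes, not from the original post
B = randn(m, n);
xTrue = randn(n*K, 1);
b = reshape(B * reshape(xTrue, n, K), [], 1);   % consistent right-hand side

if canUseGPU                      % fall back to the CPU when no GPU is available
    B = gpuArray(B);
    b = gpuArray(b);
end

% A*x and A'*x, each computed as one matrix product over all K column blocks.
mulA  = @(x) reshape(B  * reshape(x, n, K), [], 1);
mulAt = @(y) reshape(B' * reshape(y, m, K), [], 1);

% lsqr calls afun(x,'notransp') for A*x and afun(x,'transp') for A'*x.
ops  = struct('notransp', mulA, 'transp', mulAt);
afun = @(x, flag) ops.(flag)(x);

[x, exitFlag, relres] = lsqr(afun, b, 1e-10, 200);
x = gather(x);                    % copy the solution back to host memory
relErr = norm(x - xTrue) / norm(xTrue);
```

Because b is already a gpuArray, the vectors lsqr hands to afun should live on the GPU as well, so no explicit transfers are needed inside the function in this sketch.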
  2 Comments
Joss Knight on 9 Aug 2021
Yes, you can open a pool with 8 workers. However, you are still probably not going to get the most efficient utilisation with your parfor loop. The GPU works best when you vectorize, which means that ideally you will not process column by column but instead do multiple columns at a time, i.e. 1/8th of the columns on each worker.
The point is that even in parfor you are running many operations serially, just spread between multiple workers. On the CPU this doesn't matter because the whole operation was serial anyway, but on GPU it's critical that you maintain the density of array elements being processed in each function call.
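To make the point about density concrete, the sketch below compares three ways of producing the same result: one column per call, 1/8th of the columns per call, and all columns in a single call. It assumes, hypothetically, that each column of Ax_mat is a dense product B*X(:,i); the actual per-column computation is not shown in the question, and B, X, the sizes and numChunks are illustrative.

```matlab
rng(1);
m = 200; n = 150; K = 64;         % illustrative sizes (assumptions)
B = randn(m, n);
X = randn(n, K);
if canUseGPU                      % fall back to the CPU when no GPU is available
    B = gpuArray(B);
    X = gpuArray(X);
end

% Column by column: K small matrix-vector products, one function call each.
Yloop = zeros(m, K, 'like', X);
for i = 1:K
    Yloop(:, i) = B * X(:, i);
end

% Chunked: 1/8th of the columns per call, as one worker might process them.
numChunks = 8;
edges = round(linspace(0, K, numChunks + 1));
Ychunk = zeros(m, K, 'like', X);
for c = 1:numChunks
    cols = edges(c) + 1 : edges(c + 1);
    Ychunk(:, cols) = B * X(:, cols);
end

% Fully vectorized: a single matrix-matrix product over all K columns.
Yvec = B * X;

diffLoop  = gather(max(abs(Yloop(:)  - Yvec(:))));
diffChunk = gather(max(abs(Ychunk(:) - Yvec(:))));
```

All three variants agree up to rounding; they differ only in how much work each call hands to the device, which is what matters for GPU utilisation. On a GPU, gputimeit can be used to compare them.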
