## How to distribute computation on GPU vector-wise?

Asked by Hans-Martin Schwab on 18 Apr 2017
Latest activity: edited by Hans-Martin Schwab on 20 Apr 2017
Hi,
I am trying to accelerate a specific function by assigning each row of a matrix to one GPU core, having that core process the row and return a new matrix. Say my input matrix is n-by-m: I want the computation distributed over n cores, with each of the n cores returning a matrix of size k-by-m. The computation applied to each row is quite complicated, but it only requires functions supported by the GPU.
As I understand it, arrayfun can only be used for element-wise operations, not whole arrays, and the individual elements in one row of the input matrix cannot be computed independently of each other. I think pagefun and bsxfun also won't work, because they do not support user-written functions. Is there any way to proceed like this in MATLAB without having to implement the entire code in CUDA?
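For concreteness, here is a minimal sketch of the element-wise restriction I mean (the data is made up):

```matlab
G = gpuArray([1 2; 3 4]);
H = arrayfun(@(x) 2*x + 1, G);  % fine: scalar in, scalar out, per element
% H = arrayfun(@(row) myRowFun(row), G)  % not possible: arrayfun passes
%                                        % scalars, never whole rows
```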
Thanks!

### Answer by Joss Knight on 20 Apr 2017

You can loop over and read multiple entries of an input array (as an up-value variable) inside arrayfun, but you can't loop over and assign to elements of an output array. There is no general way to do this in MATLAB code.
Your best bet is to tell us what you're trying to do, and we can show how a combination of vectorized MATLAB functions, and possibly pagefun, can give you what you want without you having to write custom CUDA.
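To illustrate the up-value pattern mentioned above (a minimal sketch, not from the original thread; the names `rowSumsOnGPU` and `iSumRow` are hypothetical):

```matlab
function s = rowSumsOnGPU()
% Sketch: arrayfun over row indices, reading the matrix A as an up-value.
    A = gpuArray.rand(4, 8);            % example data
    nCols = size(A, 2);
    rowIdx = gpuArray((1:size(A, 1))');
    s = arrayfun(@iSumRow, rowIdx);     % one GPU thread per row index

    function total = iSumRow(r)
        total = 0;
        for c = 1:nCols
            total = total + A(r, c);    % reading A (up-value) is allowed
        end
        % ...but assigning into elements of an output array here is not
    end
end
```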

### Answer by Hans-Martin Schwab on 20 Apr 2017 (edited 20 Apr 2017)

Hi Joss,
thank you for your answer! It is actually not easy for me to explain, but I will try to break it down as much as I can:
What I am trying to do is compute a matrix M_out of size k-by-m from a matrix M_in of size n-by-m. In this computation, each of the n rows of M_in produces a matrix M_out_i of size k-by-m, and in the end M_out is the sum of all n M_out_i matrices.
Each M_out_i is computed from one row of M_in by a recursion: an element-wise multiplication with a vector v1 followed by a convolution with a vector v2 yields the next row of M_out_i; then the multiplication and convolution are applied again to obtain the row after that, and so on.
The convolution can be processed as a multiplication in the frequency domain. Hence, my code looks like this:
```matlab
M_out = zeros(k, m);
for i = 1:n
    %%% to be executed independently(?): %%%
    v_temp  = M_in(i,:);
    M_out_i = zeros(k, m);
    for j = 1:k
        v_temp = v_temp .* v1;               % multiplication
        v_temp = ifft( fft(v_temp) .* V2 );  % convolution (V2 = fft(v2))
        M_out_i(j,:) = v_temp;
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    M_out = M_out + M_out_i;
end
```
This is pretty much the function I want to execute, and I believe the loop over i = 1:n can run in parallel; I only need to add up the resulting M_out_i matrices at the end. But I am actually not very experienced in GPU processing yet. It is clear, however, that the inner loop over j = 1:k cannot be parallelized, due to its recursive nature.
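One hedged observation (an editorial sketch, not from the thread): both steps in the inner loop are element-wise in v_temp and identical for every i, so — assuming the code above is the full computation — all n rows can be propagated at once as a single batched gpuArray, with no arrayfun needed:

```matlab
% Sketch: batch all n rows at once. Assumes v1 and V2 are 1-by-m row
% vectors; bsxfun expands them against the n-by-m batch V.
V = gpuArray(M_in);             % n-by-m: one row per original i-iteration
M_out = gpuArray.zeros(k, m);
for j = 1:k
    V = bsxfun(@times, V, v1);                            % multiplication
    V = ifft(bsxfun(@times, fft(V, [], 2), V2), [], 2);   % convolution, row-wise
    M_out(j, :) = sum(V, 1);    % accumulate across the n rows
end
```

Since the multiplication and the FFT-based convolution are both linear maps, one could in principle even sum the rows of M_in first and propagate a single 1-by-m vector; the batched form above stays closer to the original loop.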
I hope this is not too confusing.