Thank you for your answer! It is actually not easy for me to explain, but I will try to break it down as much as I can:
What I am trying to do is compute a matrix M_out of size (k)x(m) from a matrix M_in of size (n)x(m). In this computation, each of the n rows of M_in produces a matrix M_out_i of size (k)x(m). In the end, M_out is the sum of all n M_out_i matrices.
Each M_out_i is computed from one row of M_in by a recursion formula. One recursion step consists of an elementwise multiplication with the vector v1 followed by a convolution with the vector v2, which yields the next row of M_out_i. The multiplication and convolution are then applied again to obtain the subsequent row of M_out_i, and so on.
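To state the recursion precisely (this is my transcription of the description above; v_0 denotes the i-th row of M_in):

    v_j = ifft( fft( v_{j-1} .* v1 ) .* fft(v2) ),   j = 1, ..., k
    M_out_i(j,:) = v_j
    M_out = sum over i of M_out_i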
The convolution can be carried out as a multiplication in the frequency domain, with V2 = fft(v2) precomputed. Hence, my code looks like this:
M_out = zeros(k,m);
V2 = fft(v2);                            % precompute FFT of the convolution kernel
for i = 1:n
    v_temp = M_in(i,:);
    M_out_i = zeros(k,m);
    for j = 1:k
        v_temp = v_temp.*v1;             % elementwise multiplication with v1
        v_temp = ifft( fft(v_temp).*V2 );% convolution with v2 via the frequency domain
        M_out_i(j,:) = v_temp;
    end
    M_out = M_out + M_out_i;             % accumulate after the recursion finishes
end
This is pretty much the function I want to execute, and I believe the loop over i = 1:n can run in parallel; I only need to add up the resulting M_out_i matrices in the end. However, I am not very experienced in GPU processing yet. It is clear, though, that the inner loop j = 1:k cannot be parallelized, due to its recursive nature.
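One observation worth adding: because the n row-recursions are independent, you do not even need explicit parallel workers. You can carry all n recursion states as one (n)x(m) array and apply each recursion step to every row at once, which is exactly the kind of batched operation a GPU handles well. Here is a NumPy sketch of that idea (not MATLAB, and the function name `m_out_batched` is mine), with only the unavoidable sequential loop over j remaining:

```python
import numpy as np

def m_out_batched(M_in, v1, v2, k):
    """Carry all n per-row recursions simultaneously.

    Instead of looping over the rows of M_in, keep an (n, m) state
    array, update every row in one batched step per recursion level j,
    and sum over the rows to get row j of M_out directly.
    """
    n, m = M_in.shape
    V2 = np.fft.fft(v2)                   # precompute FFT of the kernel once
    state = M_in.astype(complex)          # one recursion state per input row
    M_out = np.zeros((k, m), dtype=complex)
    for j in range(k):                    # sequential: each step needs the last
        state = state * v1                # elementwise multiply, broadcast over rows
        state = np.fft.ifft(np.fft.fft(state, axis=1) * V2, axis=1)  # circular conv per row
        M_out[j] = state.sum(axis=0)      # sum of all n M_out_i contributions for row j
    return M_out
```

The same idea carries over to MATLAB directly: replace the row vector v_temp by the whole M_in matrix, use implicit expansion (or bsxfun) for the .*v1 step, and fft(..., [], 2) / ifft(..., [], 2) to transform along the rows; if the arrays are gpuArrays, these batched operations run on the GPU without any parfor loop.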
I hope this is not too confusing.