I have a MATLAB script (a least-squares deconvolution algorithm, if you must know) which is very slow - it looks like it will take days to run on a distributed computing cluster - simply as a result of the sheer quantity of data I have to chew through.
I am trying to speed up my code using CUDA, and it works brilliantly except for one piece that is still causing a slowdown. I can vary the array sizes to fit within video memory; I am currently using arrays of roughly 8000*1000.
trian = gpuArray(zeros(8000,1000));
% x is an 8000*1000 gpuArray that changes on each iteration of k
trian_x = trian;
M = x>-1 & x<0;
N = x>0 & x<1;
trian_x(M) = 1+x(M);
trian_x(N) = 1-x(N);
The last two lines before the loop ends assign each value of trian_x so that it is 1+x for -1<x<0 and 1-x for 0<x<1. They increase my execution time by a factor of 50 (and my CPU becomes the bottleneck). The slowdown seems to come from the GPU working with the logical indexing.
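To pin down exactly what those two lines compute, here is a minimal sketch that sets the masked assignment beside an element-wise form with no logical indexing, max(0, 1-abs(x)). It is something that could be tried, not something I have timed. The random x is only a stand-in for my real data, and the name ref is made up for the comparison. One difference to be aware of: the masks leave x == 0 at 0, whereas the element-wise form gives 1 there.

```matlab
% Stand-in data (assumption): the real x changes on each iteration of k.
x = 4*rand(8000, 1000, 'gpuArray') - 2;   % values roughly in (-2, 2)

% Masked assignment, as in the snippet above.
ref = zeros(size(x), 'gpuArray');
M = x > -1 & x < 0;
N = x > 0 & x < 1;
ref(M) = 1 + x(M);
ref(N) = 1 - x(N);

% Element-wise form: 1 - |x|, clamped below at 0. No logical indexing.
trian_x = max(0, 1 - abs(x));

% If x == 0 must stay at 0, as the masks leave it, multiply by a mask
% instead of indexing with one:
% trian_x = trian_x .* (x ~= 0);

% The two versions should agree everywhere except possibly at x == 0.
nz = x ~= 0;
same = isequal(gather(trian_x(nz)), gather(ref(nz)));
disp(same)
```

Whether this actually removes the bottleneck would need measuring, for example with gputimeit on each version.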
If anyone has any ideas on speeding up this calculation that would be fantastic! Those two lines take 0.2 seconds to run. Multiply that by 10^7 executions and I will appreciate any speed-up to be gained!