I have just realized that the execution time of fftn operations inside a loop is not proportional to the number of loop iterations when working on the GPU.
As an example, if I define a cube on the GPU
a = gpuArray(ones(256,256,256,'single'));
I see that the elapsed time does not scale with the number of loop iterations. For a moderately long loop I read:
>> N=100;tic;for i=1:N;g=fftn(a);end;toc
Elapsed time is 0.008618 seconds.
... but for a loop that is 10 times longer
>> N=1000;tic;for i=1:N;g=fftn(a);end;toc
Elapsed time is 7.299844 seconds.
the total time does not scale by a factor of 10 but by a factor of roughly 850 (7.299844 s versus 0.008618 s)! I know tic/toc is not the best way to measure performance, but it is still the time seen by the users of the program. Is there some basic principle of handling gpuArrays inside loops that I am missing?
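One check I could run, assuming the discrepancy comes from GPU calls being queued asynchronously so that toc returns before the FFTs have finished (this is only my guess and is not confirmed by the measurements above), is to synchronize with the device before reading the timer. The sketch below uses wait(gpuDevice) and gputimeit from the Parallel Computing Toolbox; the variable names dev, t and tOne are just illustrative.

```matlab
% Sketch only: assumes Parallel Computing Toolbox and a supported GPU.
a = gpuArray(ones(256,256,256,'single'));
dev = gpuDevice;            % handle to the currently selected GPU

for N = [100 1000]
    wait(dev);              % make sure no earlier GPU work is still pending
    tic;
    for i = 1:N
        g = fftn(a);
    end
    wait(dev);              % block until all queued FFTs have actually finished
    t = toc;
    fprintf('N = %4d: %.3f s total, %.3f ms per fftn\n', N, t, 1e3*t/N);
end

% Alternative: gputimeit synchronizes and repeats the call itself.
tOne = gputimeit(@() fftn(a));
fprintf('gputimeit: %.3f ms per fftn\n', 1e3*tOne);
```

If the guess is right, the per-call time printed for N = 100 and N = 1000 should come out roughly equal, and close to the gputimeit figure.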