## Why does GPU computation time not scale linearly with the number of loops?

### gdgc

on 19 Dec 2017
Latest activity Commented on by Joss Knight

on 20 Dec 2017

Good evening everyone,
I'm trying to speed up my code by moving the computation to the GPU, but I've run into a problem I didn't have on the CPU: the elapsed time doesn't scale linearly with the number of loop iterations.
Here is the code:
Nloops = 500;
dt = 1e-1; % time step
n = 3e3; % number of vortices
z1 = complex(gpuArray.randn(1,n),gpuArray.randn(1,n)); % vortex positions
C1 = repmat(1e-2*gpuArray.randn(n,1),1,n); % circulation
D = gpuArray.eye(n);
nD = ~D;
tic
for ii = 1:Nloops
  Z1 = repmat(z1,n,1);
  z1 = z1 + (dt*0.5i/pi) * (sum((C1.*nD)./(Z1-Z1.'+D),1));
end
toc
The result is quite surprising:
Nloops = 100 : Elapsed time is 0.014950 seconds.
Nloops = 500 : Elapsed time is 17.072178 seconds.
On the CPU it scales well (3 and 15 seconds respectively). Do you have any idea why it scales so badly on the GPU?

### Matt J

on 19 Dec 2017

tic and toc are not accurate measures of GPU execution time. Use gputimeit() instead.
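
As a rough illustration of the advice above, here is a minimal sketch of two ways to time the loop so that pending GPU work is included in the measurement. It assumes Parallel Computing Toolbox and a supported GPU. The names `advance`, `dev`, `tSync` and `tStep`, and the reduced values of `n` and `Nloops`, are illustrative and not from the original post. MATLAB queues GPU operations asynchronously, so a bare `toc` can return before the queued work has finished; `wait(gpuDevice)` blocks until it has, and `gputimeit` handles that synchronization itself.

```matlab
% Sketch: timing GPU code so that asynchronous execution is accounted for.
% Assumes Parallel Computing Toolbox and a supported GPU.
% n and Nloops are reduced for illustration (assumption, not from the post).
dt = 1e-1;                                               % time step
n  = 1e3;                                                % number of vortices
Nloops = 100;
z1 = complex(gpuArray.randn(1,n), gpuArray.randn(1,n));  % vortex positions
C1 = repmat(1e-2*gpuArray.randn(n,1), 1, n);             % circulation
D  = gpuArray.eye(n);
nD = ~D;

% Hypothetical helper: one time step, same expression as the original loop body.
advance = @(z) z + (dt*0.5i/pi) * ...
    sum((C1.*nD) ./ (repmat(z,n,1) - repmat(z,n,1).' + D), 1);

dev = gpuDevice;

% Option 1: tic/toc with explicit synchronization around the whole loop.
wait(dev);                 % make sure earlier GPU work is not counted
tic
z = z1;
for ii = 1:Nloops
    z = advance(z);
end
wait(dev);                 % block until every queued operation has finished
tSync = toc;

% Option 2: gputimeit on a single step; it synchronizes and repeats the call.
tStep = gputimeit(@() advance(z1));

fprintf('Loop of %d steps: %.4f s; single step: %.6f s\n', Nloops, tSync, tStep);
```

If the synchronized timings grow roughly in proportion to `Nloops`, the non-linear scaling in the question is likely an artifact of measuring with an unsynchronized `toc` rather than a property of the computation.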

### Matt J

on 19 Dec 2017
I'm using the Titan X.

### Walter Roberson

on 19 Dec 2017
I am using a 650M card (2012 time frame) with R2017b.
I tried adding a gather() to force the computation to finish, but it did not seem to make any difference. Either way (with or without gather), if I run the tests close together they take longer, but if I wait between runs the measured time is shorter. This suggests the GPU might not have finished, even with the gather.
Note: in the source below, I use 1e3 nodes rather than 3e3, to avoid filling my GPU memory.
function testtime
  dt = 1e-1; % time step
  n = 1e3; % number of vortices
  z1 = complex(gpuArray.randn(1,n),gpuArray.randn(1,n)); % vortex positions
  C1 = repmat(1e-2*gpuArray.randn(n,1),1,n); % circulation
  D = gpuArray.eye(n);
  nD = ~D;
  gputimeit(@()gather(LoopPart(z1)))
  % nested function: shares dt, n, C1, D and nD with testtime
  function z1 = LoopPart(z1)
    Nloops = 500;
    for ii = 1:Nloops
      Z1 = repmat(z1,n,1);
      z1 = z1 + (dt*0.5i/pi) * (sum((C1.*nD)./(Z1-Z1.'+D),1));
    end
  end
end

### Joss Knight

on 20 Dec 2017
Try removing the repmat from your loop. With automatic (implicit) expansion you don't need it: Z1-Z1.' == z1-z1.'
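
A small sketch of why the two expressions agree, assuming R2016b or later (where implicit expansion is available). It uses a tiny `n` and runs on the CPU so it can be checked without a GPU. The variable names `A`, `B` and `sameResult` are illustrative.

```matlab
% Sketch: implicit expansion makes the repmat unnecessary (R2016b or later).
% Runs on the CPU with a small n; the same expressions accept gpuArray inputs.
n  = 5;
z1 = complex(randn(1,n), randn(1,n));   % 1-by-n row vector of positions

Z1 = repmat(z1, n, 1);                  % original approach: n-by-n copy of z1
A  = Z1 - Z1.';                         % A(j,k) = z1(k) - z1(j)

B  = z1 - z1.';                         % row minus column expands to n-by-n,
                                        % B(j,k) = z1(k) - z1(j)

sameResult = isequal(A, B);             % both forms do the same subtractions
disp(sameResult)
```

With that substitution the loop body from the question would read `z1 = z1 + (dt*0.5i/pi) * sum((C1.*nD)./(z1 - z1.' + D), 1);`, which avoids building the n-by-n `Z1` array on every iteration.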