Hello all. This is a reformulated post related to this one, in which I am trying to understand some speed effects I observe in looped GPU operations, how to write better code for the GPU, and how to interpret timing measurements.

In the example below, I loop 100 times over a numerical operation. I found that if the size of the input arrays remained constant, the operation was very fast. However, if the sizes of these inputs changed on each iteration, I saw a massive slowdown. Hopefully the code will make this clearer.

We start by running our loop on the CPU with fixed sizes and measure the timing.

% Initial parameters
M = 1000;
N = 1000;
its = 100;

% CPU operation, fixed size inputs.
tic
B = zeros(its,1); % Initialize the output vector
for ii = 1:its
    A1 = rand(M,N);     % random input 1
    A2 = rand(M,N);     % random input 2
    T = A1.*exp(1i*A2); % numerical operation 1
    P = max(T(:));      % numerical operation 2
    B(ii) = P;          % output value
end
toc

Elapsed time is 4.284955 seconds.

OK, now I will run the same operation on the GPU:

% GPU operation, fixed size inputs.
wait(gpuDevice);
tic
B = gpuArray.zeros(its,1); % Initialize the output vector
for ii = 1:its
    A1 = gpuArray.rand(M,N); % random input 1 on gpu
    A2 = gpuArray.rand(M,N); % random input 2 on gpu
    T = A1.*exp(1i*A2);      % numerical operation 1
    P = max(T(:));           % numerical operation 2
    B(ii) = P;               % output value
end
wait(gpuDevice);
toc

Elapsed time is 0.171926 seconds.

Great! However, the sizes of the input matrices A1 and A2 might change from iteration to iteration, and this is where I see some weirdness.
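As an aside on the timing itself: instead of the wait/tic/toc pattern, gputimeit (Parallel Computing Toolbox) handles device synchronization for you and averages over several runs. A minimal sketch for timing one fixed-size iteration's work (the function handle f below is my own illustrative wrapper, not anything from the original code):

% Sketch: time one fixed-size iteration with gputimeit, which synchronizes
% the GPU and repeats the call for a stable per-iteration estimate.
f = @() max(reshape(gpuArray.rand(M,N).*exp(1i*gpuArray.rand(M,N)), [], 1));
tPerIter = gputimeit(f); % seconds per iteration, synchronization included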

% Set up some changing sizes: random shifts in the sizes of A1 and A2
Mshifts = round(200*rand(its,1))-100; % Change 1: row shifts in [-100, 100]
Nshifts = round(200*rand(its,1))-100; % Change 2: column shifts in [-100, 100]

Now run it on the CPU with these changing sizes.

% CPU operation, variable size inputs.
tic
B = zeros(its,1); % Initialize the output vector
for ii = 1:its
    A1 = rand(M+Mshifts(ii),N+Nshifts(ii)); % random input 1
    A2 = rand(M+Mshifts(ii),N+Nshifts(ii)); % random input 2
    T = A1.*exp(1i*A2); % numerical operation 1
    P = max(T(:));      % numerical operation 2
    B(ii) = P;          % output value
end
toc

Elapsed time is 4.583695 seconds.

About the same as before, which makes sense given that the arrays are sometimes larger and sometimes smaller.

HOWEVER, when I run it on the GPU:

% GPU operation, variable size inputs.
wait(gpuDevice);
tic
B = gpuArray.zeros(its,1); % Initialize the output vector
for ii = 1:its
    A1 = gpuArray.rand(M+Mshifts(ii),N+Nshifts(ii)); % random input 1 on gpu
    A2 = gpuArray.rand(M+Mshifts(ii),N+Nshifts(ii)); % random input 2 on gpu
    T = A1.*exp(1i*A2); % numerical operation 1
    P = max(T(:));      % numerical operation 2
    B(ii) = P;          % output value
end
wait(gpuDevice);
toc

Elapsed time is 1.143043 seconds.

Roughly 6-7x slower than the fixed-size GPU version! Now, I know there are subtleties in how to measure speed on the GPU, but the important point is that the loop does not finish until 1.14 seconds have elapsed, meaning that if this is part of a larger code I cannot actually continue until then.
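To narrow down where the time goes, one diagnostic I'd try (a sketch, reusing the variables above) is timing each iteration separately: if the extra cost comes from per-size allocation or kernel setup, it should inflate most iterations uniformly rather than showing up as a one-off warm-up on the first pass.

% Sketch: per-iteration timing of the variable-size GPU loop.
t = zeros(its,1);
for ii = 1:its
    wait(gpuDevice); tStart = tic;
    A1 = gpuArray.rand(M+Mshifts(ii),N+Nshifts(ii));
    A2 = gpuArray.rand(M+Mshifts(ii),N+Nshifts(ii));
    T = A1.*exp(1i*A2);
    P = max(T(:));
    wait(gpuDevice); t(ii) = toc(tStart);
end
plot(t); xlabel('iteration'); ylabel('seconds'); % look for uniform inflation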

SO, the question: why is there this effective performance drop when A1 and A2 are not of constant size, and can I avoid it? Is there a clever way of pre-allocating? Would anonymous functions help (I tried this with no real gain)?
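One workaround I can think of, as a sketch rather than a guaranteed fix: it assumes the slowdown really does come from creating differently sized GPU arrays each pass. Keep the arrays at their maximum possible size every iteration and zero out the unused margin. Since every real entry of T has magnitude in (0,1) and the padded entries are exactly 0, max(T(:)) (which for complex arrays compares magnitudes) is unaffected by the padding; whether the masking assignments cost less than the variable-size allocations they replace is something to measure.

% Sketch: fixed-size buffers with zero padding (assumes variable-size
% allocation is the bottleneck; padded zeros cannot win the max).
Mmax = M + max(Mshifts);
Nmax = N + max(Nshifts);
wait(gpuDevice);
tic
B = gpuArray.zeros(its,1);
for ii = 1:its
    A1 = gpuArray.rand(Mmax,Nmax); % always the same size
    A2 = gpuArray.rand(Mmax,Nmax);
    m = M + Mshifts(ii);
    n = N + Nshifts(ii);
    A1(m+1:end,:) = 0;  % mask rows below the active m-by-n block
    A1(:,n+1:end) = 0;  % mask columns to its right
    T = A1.*exp(1i*A2); % padded entries are exactly 0
    B(ii) = max(T(:));  % zeros have magnitude 0, so the max is unchanged
end
wait(gpuDevice);
toc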

Thanks y'all.
