Why for GPU is slower than CPU for this code? Is it because of sparsity or because of "for" loop

Question

Kai Liu on 20 Jan 2016

0
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/264457-why-for-gpu-is-slower-than-cpu-for-this-code-is-it-because-of-sparsity-or-because-of-for-loop

Commented: Joss Knight on 10 Feb 2016

I want to generate two 512 by 512 matrices, and change the value of about 1000-2000 points in each matrix.

Yet the GPU is significantly slower than the CPU, about 20 times slower. Is it because of the sparsity of the problem or because of the for loop?

I know that GPU for MATLAB can't deal with sparse matrix efficiently, and "for" loop is not well optimized by GPU, either. But which one is the bottle neck here?

Is there a clue that how I can evade this difficulty? Can't figure it out myself. Kinda of killing me.

Nx=512;
rhsx= zeros(Nx,Nx);
rhsy= zeros(Nx,Nx);
for i=1:256
    bmatrix=randn(4,4);
    rhsx(nx(i)-1:nx(i)+2,ny(i)-1:ny(i)+2)=...
        rhsx(nx(i)-1:nx(i)+2,ny(i)-1:ny(i)+2)+rforcx(i)*bmatrix;
    %"nx" and "ny" denote the position where a 
    %4 by 4 small square in the big matrix is changed.
    rhsy(nx(i)-1:nx(i)+2,ny(i)-1:ny(i)+2)=...
        rhsy(nx(i)-1:nx(i)+2,ny(i)-1:ny(i)+2)+rforcy(i)*bmatrix;
end

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Joss Knight on 4 Feb 2016

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/264457-why-for-gpu-is-slower-than-cpu-for-this-code-is-it-because-of-sparsity-or-because-of-for-loop#answer_208518

Edited: Joss Knight on 4 Feb 2016

Open in MATLAB Online

What you have here is a highly serial algorithm, accumulating small amounts of data inside a large array. To vectorize, perhaps consider looping over the 16 elements of your 4x4 square rather than over your nx and ny arrays. Something like:

bmatrix = randn(256,4,4);
for x = -1:2
  for y = -1:2
    linearIndex = sub2ind(size(rhsx), nx+x, ny+y);
    rhsx(linearIndex) = rhsx(linearIndex) + rforcx.*bmatrix(:,x+2,y+2);
    rhsy(linearIndex) = rhsy(linearIndex) + rforcy.*bmatrix(:,x+2,y+2);
  end
end

This only works as long as the values of nx and ny are unique. If not, we need to go for the motherload, and vectorize the remaining loops using accumarray:

bmatrix = randn(256,16);
[x, y] = meshgrid(-1:2);
x = bsxfun(@plus, nx(:), x(:)');
y = bsxfun(@plus, ny(:), y(:)');
bmatrixx = bsxfun(@times, rforcx(:), bmatrix);
bmatrixy = bsxfun(@times, rforcy(:), bmatrix);
rhsx = accumarray([x(:) y(:)], bmatrixx(:), [Nx Nx], @sum);
rhsy = accumarray([x(:) y(:)], bmatrixy(:), [Nx Nx], @sum);

2 Comments
Show NoneHide None

Kai Liu on 5 Feb 2016

Although I don't quite understand the second one, many Thanks.

Joss Knight on 10 Feb 2016

Open in MATLAB Online

Let me clarify:

% Get all the random boxes for all 256 positions. Each box
% is a row of this matrix.
bmatrix = randn(256,16);
% [x(:) y(:)] will be a list of the 16 offset coordinates for
% each box element
[x, y] = meshgrid(-1:2);
% Add every offset to every coordinate. The result is 256x16, that
% is one row per box, with 16 elements in each box. Now x and y
% identify every element of every box.
x = bsxfun(@plus, nx(:), x(:)');
y = bsxfun(@plus, ny(:), y(:)');
% Apply the multiplier to each box. So for every box b we multiply
% the appropriate elements rforcx(b) and rforcy(b).
bmatrixx = bsxfun(@times, rforcx(:), bmatrix);
bmatrixy = bsxfun(@times, rforcy(:), bmatrix);
% Now we have arrays that identify what to add together and where
% to put the results. accumarray can do the accumulation based
% on the list of coordinates [x(:) y(:)] and the data going into
% each element bmatrixx(:) and bmatrixy(:).
rhsx = accumarray([x(:) y(:)], bmatrixx(:), [Nx Nx], @sum);
rhsy = accumarray([x(:) y(:)], bmatrixy(:), [Nx Nx], @sum);

Sign in to comment.

Why for GPU is slower than CPU for this code? Is it because of sparsity or because of "for" loop

0 Comments
Show -2 older commentsHide -2 older comments

Answers (1)

2 Comments
Show NoneHide None

See Also

Categories

Tags

Community Treasure Hunt

Why for GPU is slower than CPU for this code? Is it because of sparsity or because of "for" loop

0 Comments Show -2 older commentsHide -2 older comments

Answers (1)

2 Comments Show NoneHide None

See Also

Categories

Tags

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

2 Comments
Show NoneHide None