Why does my code only run slightly faster on GPU than CPU?

Question

0 votes

ad3d_gpu.zip

I want to reduce the noise in volumetric data while preserving edges. Towards this goal, I am applying an anisotropic diffusion filter to my 3-D data. The code http://www.mathworks.com/matlabcentral/fileexchange/14995-anisotropic-diffusion--perona---malik-, which is based on work by Perona & Malik, works well for small volumes of data but is quite slow for very large volumes.

I order to speed up the code, I modified so that it runs in parallel on my GPU using Matlab's Parallel Processing Toolbox. In theory, I should get a huge speedup, since computation at each point in the volume is independent of every other point. However, I only get a very slight speedup. I am relatively new to Matlab's GPU computing interface, so I expect there is something wrong with my implementation.

I was wondering if I could get some idea of where I went wrong and how the code should be modified so that I can get the necessary speedup.

I attached the GPU and CPU versions (ad3d_gpu.zip) of the code to this post.

4 Comments
Show 2 older comments Hide 2 older comments

Joss Knight on 14 Oct 2015

When I reduced the paging factor nBlock to 2, the speedup on my K20 was more like 2000% ! On the Quadro I only had enough memory to set it to 4, but I still got a speedup of 400%.

Kevin Miller on 14 Oct 2015

Ok, so I think I understand. Varun, thank you for the GPUBench code. It turns out that my GPU which I inherited when I took my current position (NVIDIA NVS 510) is totally crap for computing. Looks like I will have to go shopping.

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Joss Knight on 14 Oct 2015

Open in MATLAB Online

2 votes

Your real problem here? You're not processing enough data at once. If I reduce the paging factor nBlock I could process more data, this utilized the GPU better and gave me much better performance. However, this may be ruled out by the amount of memory available on your GPU.

I ran the MATLAB profiler on your code. It pointed me to a few issues, but really you should run it yourself and take a look.

A bit naive about the precise algorithm here, but do you really need to be doing a manual image gradient calculation with 26 individual calls to a filter operation? Won't this translate into a single call to imgradient or convn? And why can't you use imfilter like the CPU code does? This is supported for gpuArrays.
arrayfun(@minus, ...) is redundant - just use minus ( A - B ).
Indexing is killing you here - you need to minimize the amount you do. Since all the right-hand-sides of the subtractions are the same, I stored that and reused, and this sped things up a bit, but there's much more you could do I imagine.
Do you have to do the computation in double-precision? See what happens if you cast vol to single instead of double at the start - on your card that may give you a boost (although you'd have to check the effect on accuracy).
diffstep looked like some element-wise multiplies plus a sum across dimension 4 (using bsxfun to apply the scalings from dx, dy etc to the whole array). Try doing that instead.
Make sure every variable of any size is a gpuArray - construct variables using 'like' syntax to ensure you get matching class and type:

nabla = zeros(..., 'like', diff3pp)

I could continue to investigate, but you're better off doing it yourself with an understanding of the problem. You need to keep trying to vectorize, to do more work on bigger data in fewer commands (see this article on vectorization for instance). But it does seem like maybe it's your GPU that's the problem factor here.

3 Comments
Show 1 older comment Hide 1 older comment

Kevin Miller on 14 Oct 2015

Hi Joss,

Thanks for the advice, it is much appreciated.

1) My first instinct when I was optimizing the original code was to consolidate all of the image operators into a single call of a 3-D convolution function. However, this does not give the right result. This code performs anisotropic diffusion, which means I need to find 26 diffusion coefficients for every voxel. Furthermore, the conduction equation "diffEqn1.m" is non-linear, so it needs to be applied to the gradient value computed in each direction.

2) According to Matlab's documentation arrayfun(..., ...) is the fastest way to allocate arithmatic operations to large arrays.

3) I agree, but it is difficult not to index, since each diffused block of data "diff3pp" needs to be extracted from the center of "diff_temp or diff3pp". I don't understand what you mean about storing the subtractions. Can you elaborate?

4) I'll try this after I have optimized the code in other ways.

5) I'll try this and let you know how it works.

6) After converting all variables to class gpuArray (out of the loop), I don't get a speed up. Why is that?

I'll also try my code on a computer with a better GPU to see if that makes a difference.

Thanks again for your help and any other comments/suggestions are welcome!

Edric Ellis on 15 Oct 2015

A small point on (2) - arrayfun is a benefit only if you're able to amalgamate multiple element-wise operations. For a single operation such as @minus, Joss' statement is entirely correct.

Joss Knight on 15 Oct 2015

Edited: Joss Knight on 15 Oct 2015

Open in MATLAB Online

Glad to help.

If you say so, but keep thinking!
See Edric's point. The only thing extra you get out of arrayfun is scalar dimension expansion, which you're not using here.
You compute diff3pp(2:nx-1,2:ny-1,2:nz-1) 26 times! You should just store it in a variable and re-use it on each line. While trying to work out what was going on I also computed the indexes ( xleft = 2-ishift:nx-1-ishift, xcenter = 2:nx-1, xright = 2+ishift:nx-1+ishift, etc) separately, because you're computing all of those loads of times each too.

#

Here's what I replaced diffstep with:

function val = diffstep(vol, delta_t, ...
    dx, dy, dz, dc, dh, dd, ...
    C, nabla)
invdxsq = 1./(dx^2);
invdysq = 1./(dy^2);
invdzsq = 1./(dz^2);
invdcsq = 1./(dc^2);
invdhsq = 1./(dh^2);
invddsq = 1./(dd^2);
scaling = [invdzsq; invdzsq; ...
           invdxsq; invdxsq; ...
           invdysq; invdysq; ...
           invdcsq; invdhsq; ...
           invdcsq; invdhsq; ...
           invdhsq; invdcsq; ...
           invdhsq; invdcsq; ...
           invddsq; invddsq; ...
           invddsq; invddsq; ...
           invdcsq; invdhsq; ...
           invdcsq; invdhsq; ...
           invdhsq; invdcsq; ...
           invdhsq; invdcsq];
scaling = shiftdim(scaling, -3);
CxnablaScaled = bsxfun(@times, scaling, C.*nabla);
val = vol + delta_t*(sum(CxnablaScaled, 4));

And then I called it with:

          diff_temp = diffstep(diff3pp(2:end-1,2:end-1,2:end-1), delta_t, ...
              dx, dy, dz, dc, dh, dd, ...
              C(:,:,:,1:26), nabla(:,:,:,1:26));

However, it didn't make much difference for me, sadly ( arrayfun is just so darn good...)

6. Probably cos everything of any size already was on the GPU. Don't start putting scalars and small variables onto the device, you waste time copying data across with no computational benefit.

Sign in to comment.

Why does my code only run slightly faster on GPU than CPU?

4 Comments
Show 2 older comments Hide 2 older comments

Answers (1)

3 Comments
Show 1 older comment Hide 1 older comment

Categories

Tags

Community Treasure Hunt

Why does my code only run slightly faster on GPU than CPU?

4 Comments Show 2 older comments Hide 2 older comments

Answers (1)

3 Comments Show 1 older comment Hide 1 older comment

Categories

Tags

See Also

Community Treasure Hunt

4 Comments
Show 2 older comments Hide 2 older comments

3 Comments
Show 1 older comment Hide 1 older comment