Faster interp1 and indexing on GPU

Dear all,
This is my first time using MATLAB on a GPU.
I ran the benchmark code to test my GPU; for double precision, my GPU is around 50 times faster than the CPU.
I changed my input array into a gpuArray. The performance is shown in the figures. test_bi_grlt_pat*.m calls Bi_GLRT_patch1_1.m, which in turn calls Dnoisefun.m (Dnoisefun.m and noisefun.m are similar).
I am doing image processing. Bi_GLRT_patch1_1.m is basically gradient descent on each pixel: Dnoisefun.m calculates the gradient at each pixel, and noisefun.m calculates the value at each pixel.
For CPU: [profiler results figure]
For GPU: [profiler results figure]
As we can see, the GPU is much slower than the CPU. The reasons seem to be: Dnoisefun.m and noisefun.m are called many times; 'interp1' should be faster on the GPU but doesn't appear to be; and the indexing operation 'result(result<0)' is very slow on the GPU.
Any advice on how to improve this?
Furthermore, I wrote a simple script to test how array orientation (row vs. column) affects performance on the GPU and CPU, where Inten and DProb are the x and y data for the interpolation:
gridSize = 1000000;
x = linspace(min(Inten), max(Inten), gridSize);
disp(size(x));
xg = gpuArray(x);

tic
result1 = interp1(Inten, DProb, x, 'linear', 'extrap');
time1 = toc;
disp(time1)

x1 = x';
tic
result2 = interp1(Inten, DProb, x1, 'linear', 'extrap');
time2 = toc;
disp(time2)

tic
result3 = interp1(Inten, DProb, xg, 'linear', 'extrap');
time3 = toc;
disp(time3)

xg1 = xg';
tic
result4 = interp1(Inten, DProb, xg1, 'linear', 'extrap');
time4 = toc;
disp(time4)
The timings are not very consistent across trials. Here are the results of several runs (the first line of each run is the array size printed by disp(size(x))):
test_gpu
1 10000
8.0200e-04
2.8000e-04
3.2500e-04
1.2600e-04
>> clear
>> test_gpu
1 100000
9.7700e-04
8.8300e-04
0.0011
1.6100e-04
>> clear
>> test_gpu
1 1000000
0.0055
0.0048
5.1600e-04
9.3200e-04
>> clear
>> test_gpu
1 1000000
0.0051
0.0046
3.5500e-04
1.1500e-04
>> clear
>> test_gpu
1 1000000
0.0059
0.0043
3.7100e-04
1.1600e-04
>> clear
>> test_gpu
1 1000000
0.0058
0.0046
3.6500e-04
1.1900e-04
>> clear
>> test_gpu
1 1000000
0.0057
0.0047
6.5600e-04
0.0011
Similarly, the indexing performance is not consistent either:
clear
load('DDetectorProb.mat')
gridSize = 1000000;
x = linspace(min(Inten), max(Inten), gridSize);
xs = x;
ban = (min(Inten) + max(Inten))/2;
disp(size(x));
xg = gpuArray(x);
xgs = xg;

tic
xs(x > ban) = 1;
time1 = toc;
disp(time1)

x1 = x';
xs = x1;
tic
xs(x1 > ban) = 1;
time2 = toc;
disp(time2)

tic
xgs(xg > ban) = 1;
time3 = toc;
disp(time3)

xg1 = xg';
xg1s = xg1;
tic
xg1s(xg1 > ban) = 1;
time4 = toc;
disp(time4)
Results:
1 1000000
0.0031
0.0034
0.0010
0.0014
1 1000000
0.0032
0.0030
7.6000e-04
8.7700e-04
1 1000000
0.0032
0.0031
7.2500e-04
0.0021
1 1000000
0.0030
0.0031
7.7100e-04
0.0019

5 Comments

Please provide the inputs Inten and DProb so that I can inspect your code.
Please also call wait(gpuDevice) before each call to tic or toc as per the guidelines for timing GPU code in the documentation here. That way you will be getting correct timings for your code.
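For reference, the timing pattern being suggested looks roughly like the following sketch (Inten, DProb, and the grid size are taken from the question; gputimeit is an alternative that handles the synchronization for you):

```matlab
% Correct way to time GPU code: synchronize before starting and
% before stopping the timer, so queued kernels are actually measured.
dev = gpuDevice;
xg  = gpuArray(linspace(min(Inten), max(Inten), 1000000));

wait(dev);   % ensure any prior GPU work has finished
tic
result = interp1(Inten, DProb, xg, 'linear', 'extrap');
wait(dev);   % ensure the interpolation has actually completed
t = toc;
disp(t)

% Alternatively, gputimeit synchronizes and averages for you:
t2 = gputimeit(@() interp1(Inten, DProb, xg, 'linear', 'extrap'));
disp(t2)
```

Without the wait calls, toc can fire while kernels are still queued, which would explain the inconsistent (and sometimes implausibly fast) GPU timings above.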
Thanks. Just attached it.
Hopefully your response through technical support was sufficient?
Thanks. Yes. Should I post the answers I received, or will you, so that other users who find this post later can benefit?
I'd be interested to see your response from technical support - would you be able to post?
Thanks!


Answers (1)

Instead of indexing, modify your lower boundary slightly and use min and max:
result = min(0.8, max(realmin, result));
The difference is that in your original code, any value that was exactly 0 was left at exactly 0 and negative values were changed to realmin (which is positive), whereas in this revised code, values that are exactly 0 are changed to realmin as well.
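A small sketch of that behavioral difference, using illustrative values rather than your actual data:

```matlab
result = [-3 0 0.5 2];                       % sample values

% Clamped version: no logical indexing, works identically on gpuArray.
clamped = min(0.8, max(realmin, result));
% Negatives AND exact zeros both become realmin; values above 0.8 are capped.

% The original indexing-based version, for comparison:
orig = result;
orig(orig < 0) = realmin;                    % exact zeros are left untouched here
orig = min(0.8, orig);
```

Since min and max are element-wise, the clamped version avoids the scattered read-modify-write pattern that logical indexing triggers on the GPU.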

1 Comment

Which is to say: don't do your own indexing on GPUs if you can avoid it. The architecture of NVIDIA GPUs makes that kind of scattered indexing inefficient.


Asked on 18 Nov 2019
Last commented on 12 Mar 2020
