Solving Ax=b on Multiple GPUs

omegayen on 27 Aug 2015
Commented: Joss Knight on 3 Sep 2015
Hi,
I have looked over the Benchmarking A\b on the GPU example at http://www.mathworks.com/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html. That example compares CPU performance with a single GPU. I am wondering whether, with the right code, we can expect a speedup from multiple GPUs when solving A\b.
For example, I have the following code to solve Ax=b on a single GPU:
tic
gd=gpuDevice(1);
N=13312;
A = gpuArray(rand(N));
B = gpuArray(rand(N,16));
tic;
Gx = A\B;
wait(gd);
time=toc;
x=gather(Gx);
tic;
wait(gd);
tover = toc;
time = max(time - tover, 0);
fprintf('Time on GPU: %f\n',time);
toc
I also have the following code to solve Ax=b on two GPUs using spmd:
tic
spmd
    gd=gpuDevice(labindex);
    srcWkrIdx = 1;
    if labindex == srcWkrIdx
        N=13312;
        A1 = gpuArray(rand(N));
        B1 = gpuArray(rand(N,16));
        data{1}=A1;
        data{2}=B1;
        shared_data = labBroadcast(srcWkrIdx,data);
    else
        shared_data = labBroadcast(srcWkrIdx);
    end
    A=shared_data{:,1};
    B=shared_data{:,2};
    tic;
    Gx = A\B(:,8*labindex-7:8*labindex);
    wait(gd);
    time=toc;
    store=gather(Gx);
    tic;
    wait(gd);
    tover = toc;
    time = max(time - tover, 0);
    fprintf('Time on GPU: %f\n',time);
end
x=[store{1} store{2}];
toc
And here is code to solve Ax=b on two GPUs using parfor:
tic
parfor kk=1:2
    gd=gpuDevice(kk);
    N=13312;
    A = gpuArray(rand(N));
    B = gpuArray(rand(N,16));
    tic;
    Gx = A\B(:,8*kk-7:8*kk);
    wait(gd);
    time=toc;
    store{kk}=gather(Gx);
    tic;
    wait(gd);
    tover = toc;
    time = max(time - tover, 0);
    fprintf('Time on GPU: %f\n',time);
end
x=cell2mat(store);
toc
Note that I actually have 16 different B's, so I am really solving for 16 different x's.
The performance of both versions that run on 2 GPUs is slower than that of the version that runs on a single GPU. I was hoping to see the reverse. Is it possible to write code that solves for x faster with multiple GPUs? If not, is that because communicating data between the GPUs takes additional time? Also, for this case should I use the spmd or the parfor version?
Thanks for any help or suggestions you can provide.

Answers (1)

Joss Knight on 2 Sep 2015
There are some mistakes in your timing here. tic and toc don't work in a nested way; toc just times from the most recent tic. If you want toc to give you the time from a particular tic, you need to capture its output in a variable:
tstart = tic;
% Some code including other calls to tic
toc(tstart)
Secondly, your first timing block doesn't call wait() before calling tic, so it includes the time to copy the data to the device. For proper timing with tic and toc you should call wait before both tic and toc. But instead you can use gputimeit:
gputimeit(@()A\B)
...although you can't use this in an spmd block. Also, use gpuArray.rand or rand(..., 'gpuArray') rather than gpuArray(rand()) to create the data directly on the device.
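As a minimal sketch of that timing pattern (illustrative only, reusing the sizes from the question), the single-GPU benchmark could be written as:
gd = gpuDevice(1);
N = 13312;
A = rand(N,'gpuArray');        % created directly on the GPU, no host-to-device copy
B = rand(N,16,'gpuArray');
wait(gd);                      % make sure creation has finished before starting the clock
tstart = tic;
Gx = A\B;
wait(gd);                      % make sure the solve has finished before stopping the clock
tSolve = toc(tstart);
fprintf('Time on GPU (tic/toc): %f\n', tSolve);
% or let gputimeit handle the synchronisation for you:
tSolve = gputimeit(@() A\B);
fprintf('Time on GPU (gputimeit): %f\n', tSolve);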
I ran your code (with some corrections) and found the same sort of thing you did - communication and other overheads mean that splitting the right-hand-side data in two and running each half on a separate GPU is slower than doing everything on one GPU. But if you experiment with the amount of data, you'll quickly see that backslash takes about the same length of time with a lot more columns of B. Presumably the device is not fully utilised, so you can throw more data at it and it will crunch those numbers in the same amount of time.
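For example, a quick sweep over the number of right-hand-side columns (a sketch, with illustrative column counts) shows how far you can grow B before the solve time starts to increase:
N = 13312;
A = rand(N,'gpuArray');
for nCols = [16 64 256 1024 4096]
    B = rand(N,nCols,'gpuArray');
    t = gputimeit(@() A\B);
    fprintf('%5d columns of B: %f s\n', nCols, t);
end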
It's interesting that it took longer to run mldivide separately on each device. I presume it's because mldivide is a hybrid algorithm using both CPU and GPU, and there will be plenty of communication over the PCI bus which needs to be shared by both cards.
I had to increase the number of columns of B to over 3000 before the benefit of dividing the work in two overtook the overheads. I think the only way to get good efficiency out of a multi-GPU direct solve would be to use a distributed algorithm, something that divides the system matrix and not just the RHS.
Your parfor version looks reasonable since it can take care of the communication for you - however, you'd need to define the system matrix A before the loop so the two GPUs are using the same one.
  2 Comments
omegayen on 2 Sep 2015
Edited: omegayen on 2 Sep 2015
Hi, okay, thank you for your advice. Yes, I was aware that I hadn't fully resolved the timing and the parfor version in my code above. Thanks for the suggestions.
I have updated my code:
1) solve Ax=b on a single GPU
tstart = tic;
gd=gpuDevice(1);
N=13312;
b_col=16;
A = rand(N,'gpuArray');
B = rand(N,b_col,'gpuArray');
wait(gd);
tic
Gx = A\B;
time=toc;
x=gather(Gx);
wait(gd);
tic
wait(gd);
tover = toc;
time = max(time - tover, 0);
fprintf('Time on GPU: %f\n',time);
toc(tstart)
2) solve Ax=b on two GPUs using spmd
tstart = tic;
spmd
    gd=gpuDevice(labindex);
    srcWkrIdx = 1;
    if labindex == srcWkrIdx
        N=13312;
        b_col1=16;
        A1 = rand(N,'gpuArray');
        B1 = rand(N,b_col1,'gpuArray');
        data{1}=A1;
        data{2}=B1;
        data{3}=b_col1;
        shared_data = labBroadcast(srcWkrIdx,data);
    else
        shared_data = labBroadcast(srcWkrIdx);
    end
    A=shared_data{:,1};
    B=shared_data{:,2};
    b_col=shared_data{:,3};
    wait(gd)
    tic;
    Gx = A\B(:,b_col/2*labindex-(b_col/2-1):b_col/2*labindex);
    time=toc;
    store=gather(Gx);
    wait(gd);
    tic;
    wait(gd);
    tover = toc;
    time = max(time - tover, 0);
    fprintf('Time on GPU: %f\n',time);
end
x=[store{1} store{2}];
toc(tstart)
3) solve Ax=b on two GPUs using parfor
tstart = tic;
N=13312;
b_col=16;
A1 = rand(N);
B1 = rand(N,b_col);
parfor kk=1:2
    gd=gpuDevice(kk);
    A=gpuArray(A1);
    B=gpuArray(B1);
    wait(gd);
    tic;
    Gx = A\B(:,b_col/2*kk-(b_col/2-1):b_col/2*kk);
    time=toc;
    store{kk}=gather(Gx);
    wait(gd);
    tic;
    wait(gd);
    tover = toc;
    time = max(time - tover, 0);
    fprintf('Time on GPU: %f\n',time);
end
x=cell2mat(store);
toc(tstart)
I still see faster execution on a single GPU, even when increasing the number of columns of B to over 3000. Perhaps this is due to using different GPUs.
Note that my updated parfor version creates the system matrix A before the loop, but I then have to send it to the memory of both GPUs inside the parfor loop, so with this particular change it actually runs slower than the spmd version. I'm not sure if there is a better approach.
I will see if dividing the system matrix provides further benefits. However, I am not sure exactly how to do this; I was under the impression that you can't use codistributed arrays on GPUs.
Joss Knight on 3 Sep 2015
You can't store gpuArrays in a codistributed array, no. You would have to write any distributed GPU algorithm yourself.
You're still subtracting tover from the timing, but I don't think that's necessary. There may be other factors at play making a difference between your experiments and mine (you included a gather in the timing, for instance; I did not). However, the real issue, I think, is that dividing B is not sufficient to make this algorithm efficient. You are still factorizing A on every GPU (an LU decomposition in this case), and that is most of the computation - the only thing you save time on is the triangular solves. So my instinct is that, without a distributed algorithm, this is never going to work.
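To make that cost split concrete, here is a rough sketch (not from the thread; it assumes lu with the 'vector' option is supported for gpuArray in your release) that times the factorization and the triangular solves separately:
N = 13312;
A = rand(N,'gpuArray');
B = rand(N,16,'gpuArray');
% the factorization is the dominant cost and is repeated on every GPU
tFactor = gputimeit(@() lu(A,'vector'), 3);
% the triangular solves are the only part that splitting B actually divides
[L, U, p] = lu(A,'vector');
tSolve = gputimeit(@() U \ (L \ B(p,:)));
fprintf('Factorization: %f s, triangular solves: %f s\n', tFactor, tSolve);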
