MATLAB Answers

GPU utilization and parallel computation with MATLAB for heavy computation

Asked by caesar on 2 Jul 2019
Latest activity: commented on by caesar on 9 Jul 2019
I have a decent machine with a Core i7 (8 cores), 32 GB of RAM, and an Nvidia GeForce GTX 1080 Ti, running MATLAB R2018b. At the moment I am a bit confused about how to use these resources in the best way to run my Monte Carlo simulation code. The two questions I have now:
1. How can I make all the heavy computation run on the GPU, alongside MATLAB's parallel computing capability, rather than on the CPU, so that I can decide what is best to use? I have read different help topics, and the conclusion I think I have reached is that the data I work with should be in the form of a gpuArray. Am I right, or am I missing something here? Let us assume that I have the following simple code to be run on the GPU:
First_Vector = zeros(2,3);
% First_Vector = zeros(2,3,'gpuArray'); 1
[N,M] = size(First_Vector);
% size already works on a gpuArray (and returns ordinary values), so no change is needed here. 2
Second_Matrix = ones(N,M,2);
% Second_Matrix = ones(N,M,2,'gpuArray'); 3
Test1 = [20 20 20; 30 30 30];
% Test1 = gpuArray(Test1); 4
Test2 = [50 50 50; 60 60 60];
% Test2 = gpuArray(Test2); 5
K = 100;
% the main code
for i = 1:2
    for j = 1:3
        element = Function1(Test1(i,j), K);
        Test1(i,j) = element;
    end
end
Second_Matrix(:,:,1) = Test1;
Test1 = Function2(Test1, Test2);
% End of the main code
%% Function 1
function outcome = Function1(A, K)
outcome = A + K;
end
%% Function 2
function T1 = Function2(T1, T2)
T1 = T1 + T2;
end
Are the commented lines (1-5) enough to run the 'main code' on the GPU?
2. I have tested the following simple code on the GPU and the CPU, and the CPU performance was far better. Is that supposed to be normal?
Thanks in advance.
G = ones(10,10,'gpuArray');
tic
for k = 1:100
    for i = 1:1000
        for j = 1:10
            G(j,:) = G(j,:) + 2;
        end
    end
end
toc

G = ones(10,10);
tic
for k = 1:100
    for i = 1:1000
        for j = 1:10
            G(j,:) = G(j,:) + 2;
        end
    end
end
toc
% Elapsed time is 0.628241 seconds.


1 Answer

Answer by Andrea Picciau on 3 Jul 2019
Edited by Andrea Picciau on 3 Jul 2019
 Accepted Answer

I'll try to answer your questions in order...
  1. Yes! Isn't that great?
  2. Yes, because there are two problems with your code: (a) you're using a lot of for loops instead of vectorised operations, and (b) you're measuring GPU performance incorrectly. To fix (a), you should read this doc page, which explains how to vectorise your code to get the best performance. To fix (b), take a look at my answer to a previous question and use the functions timeit and gputimeit.
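For example, a minimal sketch of timing the same operation on the CPU and on the GPU (the function handle f here is just a stand-in for your real code):
f = @(x) x + 2;                    % stand-in for the operation you want to benchmark
cpuData = ones(1000);              % 1000x1000 input on the CPU
gpuData = gpuArray(cpuData);       % the same input on the GPU
tCpu = timeit(@() f(cpuData))      % runs the function several times, returns a typical time
tGpu = gputimeit(@() f(gpuData))   % like timeit, but synchronises with the GPU so the measurement is fair
Keep in mind that both functions execute the operation several times, so the total wall-clock time will be noticeably longer than the time they report.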

  3 Comments

Thanks, Andrea Picciau, for your answer.
Well, I think there is nothing I can do at the moment to vectorise my code.
I tried to run the above code after turning it into a function in order to test timeit and gputimeit, and there is something that looks odd to me. Using 'gpuArray' and a 100k loop, gputimeit reported 101 seconds, but the actual elapsed time was 11:37 minutes. The same applies to timeit, where the actual time was 10.15 minutes but timeit reported 100 seconds.
I think I am doing something wrong here for sure.
One last thing that might be irrelevant, but I hope you can help with it: as I said, I run my Monte Carlo script on different platforms, such as my own machine with the specs above and a high-performance cluster (the university cluster). I noticed that running my script on 16-core or 8-core nodes makes no significant difference to the running time; the RAM is fixed for both. Furthermore, running the same script on my PC takes slightly less time than on the cluster, based on a couple of tests.
This last bit made me a bit sceptical about the whole thing, as well as more confident that I am doing something wrong. Hopefully someone else has been in this situation before and has an idea.
Thanks again.
A quick comment about what I meant by vectorising your code: I was looking at this bit here
for k = 1:100
    for i = 1:1000
        for j = 1:10
            G(j,:) = G(j,:) + 2;
        end
    end
end
which could really be written as
G = G + 200000;
I imagined you were just trying to benchmark the same operation executed in a for loop, so I wrote a quick script for that. I benchmarked three versions of the same algorithm:
  • the fully vectorised version,
  • a for loop with some vectorisation,
  • a for loop without vectorisation.
My GPU is a Tesla K40c and my processor is an Intel Xeon E5-1650.
Let me show you my script:
numRows = 1000;
cpuData = ones(numRows, numRows);
gpuData = gpuArray(cpuData);
timeit(@() iVectorised(cpuData), 1) % 0.0030 seconds
gputimeit(@() iVectorised(gpuData), 1) % 9.3611e-05 seconds
timeit(@() iForLoop(cpuData), 1) % 0.0145 seconds
gputimeit(@() iForLoop(gpuData), 1) % 0.0011 seconds
timeit(@() iForLoopWithIndexing(cpuData, numRows), 1) % 0.2310 seconds
gputimeit(@() iForLoopWithIndexing(gpuData, numRows), 1) % 12.6261 seconds
%% HELPER FUNCTIONS
function dataOut = iVectorised(dataIn)
% Completely vectorised
dataOut = dataIn + 200;
end

function dataOut = iForLoop(dataIn)
% Partially vectorised, external for loop remains
for k = 1:100
    dataIn = dataIn + 2;
end
dataOut = dataIn;
end

function dataOut = iForLoopWithIndexing(dataIn, numRows)
% Completely non-vectorised, uses indexing
for k = 1:100
    for i = 1:numRows
        dataIn(i,:) = dataIn(i,:) + 2;
    end
end
dataOut = dataIn;
end
What you're observing is the last case (the for loop without vectorisation). The reason it takes so long on the GPU is that indexing gpuArrays is very expensive. In each iteration, you are:
  • moving the index i to the GPU,
  • creating a temporary gpuArray,
  • writing dataIn(i,:) to this temporary GPU array, which means indexing dataIn by row rather than by column (column access is usually faster, because MATLAB stores arrays column-major),
  • scheduling dataIn(i,:) + 2 on the GPU,
  • assigning the output of this operation back to the right elements of dataIn.
To do most of these things, you need to communicate back and forth between the CPU and the GPU, which is going to hurt your performance (note: this is true for any GPU code, not just MATLAB). The vectorised version is highly optimised to avoid this ping-pong between your GPU and your CPU.
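When a computation really is element-wise but you can't restructure the surrounding code, arrayfun on a gpuArray is worth a look: it compiles the element-wise function into a single GPU kernel, so there is no per-iteration traffic. A minimal sketch of the same +200 computation:
gpuData = gpuArray(ones(1000));           % data lives on the GPU
gpuOut = arrayfun(@(x) x + 200, gpuData); % one fused kernel, no row-by-row indexing
result = gather(gpuOut);                  % copy back to the CPU once, at the end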
You also might want to consider larger problems. For example, the data in my script is 1000x1000, which is a reasonable size to start thinking about GPU acceleration.
Putting it all together, I would apply these two golden rules to your Monte Carlo code:
  • Reason in matrix and vector operations, not for loops. Vectorise, vectorise, vectorise.
  • Think about your overheads. Is the extra communication time worth it? Should you use a parallel pool or a GPU? (See the sketch below.)
Optimising parallel applications can be a difficult problem, but the rewards can be very high!
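To illustrate the parallel-pool route, here is a toy sketch: a Monte Carlo estimate of pi stands in for your real simulation, since independent trials like these map naturally onto parfor:
numTrials = 1e6;
hits = zeros(1, numTrials);
parfor t = 1:numTrials
    p = rand(1,2);                     % a random point in the unit square
    hits(t) = (p(1)^2 + p(2)^2) <= 1;  % inside the quarter circle?
end
piEstimate = 4*mean(hits)              % approaches pi as numTrials grows
If your trials are independent like this, a parfor loop over your cores can pay off even when the per-trial code itself can't be vectorised.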
Thanks again, Andrea Picciau. Your example is very thorough and clear. As I mentioned before, I think there is no way for me to vectorise my code, especially since it is very complicated and includes calls to different functions in an iterative fashion.
The annoying thing now is that I can't exploit the resources I have (a powerful PC and a cluster) to the max.
Regards
