gpu memory code optimization

Question

Octavian on 13 Dec 2014

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/166475-gpu-memory-code-optimization

Commented: Octavian on 15 Dec 2014

Dear Wizes,

I would appreciate it if you could crack this. My code includes gpuArray operations inside a for loop; the relevant portion is here:

% allocate gpu memory:
A=GPUArray.eye(x,'single'); B=GPUArray.zeros(y,x,'single'); C=GPUArray.zeros(x,y,'single'); % x>>>y
for n=1:t %for loop begins
... % not relevant, B and C are 'filled' by specific matrix multiplications
D=B*A; % size(D)= (y,x)
E=C*D; % size(E)= (x,x)
A=A-E;
clear E D
...
end

I must mention that all of A,B,C,D,E are different with each iteration in the for loop as they are reused.

The problem is that x is large, and A and E are huge (2 to 7 GB, depending on x), killing my GPU. I made it run, albeit slowly, by breaking up E, i.e. performing the operations of steps 6-7 above (E=C*D; A=A-E) row-wise in A:

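for i=1:size(A,1)
E=C(i,:)*D; % one row of E at a time; D=B*A is computed once before this loop
A(i,:)=A(i,:)-E;
end
clear E D % clear D only after the loop, since every row needs it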

1. This works, but it is very slow. I was wondering if there is a way to calculate the same for blocks of n rows at once rather than one row at a time (with n scaled to what the GPU can take, where x=kn+p and p<n), or to use mtimesx-like bsxfun routines for the matrix multiplication.

2. It would be great if A could be broken into blocks of rows or columns, or processed one row or column at a time; however, this is above my job description, given that A is the right multiplier in step 5 (D=B*A). This would allow me to expand the size of x I can use.
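The block-of-rows idea in point 1 can be sketched as follows. This is only a minimal illustration under stated assumptions: the sizes, the test data, and the names blk, r0 and rows are hypothetical and not from the original code. It runs on ordinary CPU arrays so it can be checked anywhere; the same indexing should carry over to gpuArray inputs, with blk chosen so that a blk-by-x temporary fits in GPU memory.

```matlab
% Illustrative sizes (assumptions); in the real problem x >> y.
x = 10; y = 3;
blk = 4;                                   % rows per block (hypothetical tuning knob)

A = eye(x, 'single');
B = single(reshape(sin(1:y*x), y, x));     % deterministic stand-in data
C = single(reshape(cos(1:x*y), x, y));

A_full = A - C*(B*A);                      % reference: needs a full x-by-x temporary

D = B*A;                                   % y-by-x, computed once from the old A
for r0 = 1:blk:x
    rows = r0:min(r0 + blk - 1, x);        % the last block holds the p < blk leftover rows
    A(rows, :) = A(rows, :) - C(rows, :)*D;  % temporary is only numel(rows)-by-x
end

err = max(abs(A(:) - A_full(:)));          % difference from the unblocked update
```

Because D is formed before any row of A is overwritten, each block sees the same D as the unblocked update, so the result should agree with A-C*D up to rounding.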

Thank you, as always, Octavio

6 Comments

Octavian on 14 Dec 2014

Edited: Octavian on 14 Dec 2014


Dear Matt,

Sorry for my belated response; I am on call and had to run. I found your answer brilliant, and I was ready to work on it when I got back, but I agree I may have oversimplified the loop in trying to make it more palatable. Here is an update, including, I hope, all dependencies. I will start by listing the sizes of all named gpuArrays:

A (x,x)
B (y,x)
C (x,y)
D (y,x)
E (x,x)
F (x,y)

Again, A and E are the elephants in the room.

Here is the loop:

% allocate gpu memory:
A=GPUArray.eye(x,'single'); B=GPUArray.zeros(y,x,'single'); C=GPUArray.zeros(x,y,'single'); % x>>>y
for n=1:t                   %for loop begins
... % B is 'filled' by matrix multiplication of other small-sized inputs
F=A*B';                       % size(F)= (x,y)
... % B and F input goes on to modify C in several steps
D=B*A;                       % size(D)= (y,x)
E=C*D;                       % size(E)= (x,x)
A=A-E;
clear E D
A(ind)=A(ind)+c;        % c=constant
... % C input determines the relevant loop output in several steps
end
% where ind is calculated as follows
% for i=1:x
% ind(i) =(i-1)*x+i
% end
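The index loop in the comments above selects the diagonal of A in column-major (linear) indexing. As a small sketch, with illustrative values for x and c that are assumptions, the same indices can be built without a loop because consecutive diagonal entries are x+1 apart:

```matlab
x = 5;                       % illustrative size (assumption)
c = single(0.5);             % illustrative constant (assumption)

% Loop version, as described in the post
ind_loop = zeros(1, x);
for i = 1:x
    ind_loop(i) = (i-1)*x + i;
end

% Vectorized equivalent: linear indices of the diagonal entries
ind = 1:(x+1):x^2;

A = zeros(x, 'single');
A(ind) = A(ind) + c;         % adds c to every diagonal element of A
```

Since ind has only x entries, this step touches x elements and needs no extra x-by-x storage.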

Importantly, inside the loop only A changes twice per iteration (see above).

Thank you again for your help. If there is no better way, would you mind posting your previous comment again for reference? If there is, I can only be blissfully humbled again,

Octavio

Matt J on 14 Dec 2014

Edited: Matt J on 14 Dec 2014


"would you mind posting your previous comment again for reference"

Octavio,

If you mean the answer that I retracted, I don't think it's going to help you. I don't think it can work except when B and C are constant throughout the loop (which is why I retracted it). For that case, I've reproduced what I had in the Appendix below. What about my other questions? What is the ultimate intended use of the matrix A? And how big, typically, are x and y?

Appendix: For the case where B and C are constant, the originally posted recursion over A (A=A-C*B*A at each step), with A initialized to I, can be expanded to give the following closed-form formula:

A=(I-C*B)^t

Using the binomial theorem, this expands to

   A = I + sum_{k=1}^t  (-1)^k nchoosek(t,k) (C*B)^k
     = I + C*( sum_{k=1}^t  (-1)^k nchoosek(t,k)  (B*C)^(k-1)  )*B
     = I + C*Q*B

where

Q = sum_{k=1}^t (-1)^k nchoosek(t,k) (B*C)^(k-1)

is a polynomial function of the much smaller matrix B*C. Depending on the size of y, it might be possible simply to evaluate Q using polyvalm and then just compute A=I+C*Q*B on the host, rather than the GPU.
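A quick numerical check of the closed form for constant B and C might look like the sketch below. The sizes and test data are assumptions chosen only to keep it small, and Q is accumulated with an explicit loop rather than a library polynomial routine. With Q defined as the sum of (-1)^k nchoosek(t,k) (B*C)^(k-1), the binomial expansion of (I-C*B)^t gives A = I + C*Q*B, which is what the check compares against.

```matlab
x = 20; y = 3; t = 4;                      % illustrative sizes (assumptions)
B = 0.1*reshape(sin(1:y*x), y, x);         % constant across iterations
C = 0.1*reshape(cos(1:x*y), x, y);

% Reference: run the recursion A <- A - C*(B*A) for t iterations
A_ref = eye(x);
for n = 1:t
    A_ref = A_ref - C*(B*A_ref);
end

% Closed form: only y-by-y matrices are raised to powers
M = B*C;                                   % y-by-y
Q = zeros(y);
P = eye(y);                                % holds M^(k-1)
for k = 1:t
    Q = Q + (-1)^k * nchoosek(t, k) * P;
    P = P*M;
end
A_closed = eye(x) + C*Q*B;

err = max(abs(A_ref(:) - A_closed(:)));    % should be at rounding level
```

For t=1 this reduces to Q=-I and A=I-C*B, matching a single step of the recursion.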

Octavian on 14 Dec 2014

Edited: Octavian on 15 Dec 2014


Dear Matt,

1. From the code above, the step F=A*B'; shows that A is needed to generate F, which along with B leads to C in several steps, which in turn leads to the 'useful loop output'. All the other steps involving A are just there to adjust A for the next iteration of F=A*B'.

2. The intended sizes are y=64 or 128 and x=40960 ((64^2)*10) or 163840 ((128^2)*10). As of now, I can use the code as is for y up to 40, and up to 50 if I use the small for loop inside the for loop from my original message

for i=1:size(A,1)
E=C(i,:)*D; % one row of E at a time
A(i,:)=A(i,:)-E;
end
clear E D % D must survive until the loop has finished

where I break E row-wise. I was hoping to 1. speed up this small loop by applying the operations not row by row (as listed), but to blocks of n rows of A at once (with n scaled to what the GPU can take, where x=kn+p and p<n), or by some other operation that increases the speed on the GPU.

2. Better, I wanted to see if your suggestion above could be used to expand to higher dimensions (y=64) by keeping Ix or A_old on the CPU and iterating most of the other loop steps on the GPU, then gathering the update of A at each iteration, adding it to Ix or A_old, then doing F=A*B' on the CPU, continuing the iteration on the CPU, and so forth. Or, as you suggested, finding a formula in smaller matrices that can iterate quickly on the CPU altogether. The issue is that I do not have enough GPU memory to allocate I=eye(40960,'single'). Thanks for your help,

Octavio
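To make the host-resident idea concrete, here is a hedged sketch, not a tested GPU implementation. It uses the fact that A-C*(B*A) acts on each column of A independently, so A can stay in host memory and be updated one column block at a time. The hooks toDev and toHost, the block size blk, and all sizes and data are hypothetical; the hooks are identity functions here so the sketch runs without a GPU, and with Parallel Computing Toolbox one could try toDev=@gpuArray and toHost=@gather instead.

```matlab
% Memory for one single-precision x-by-x matrix at x = 40960, in GiB
gibNeeded = 40960^2 * 4 / 2^30;            % 4 bytes per single

% Transfer hooks (assumption): identity so this runs on the CPU.
toDev  = @(M) M;                           % e.g. @gpuArray on a GPU machine
toHost = @(M) M;                           % e.g. @gather on a GPU machine

x = 12; y = 3; blk = 5;                    % illustrative sizes (assumptions)
A = eye(x, 'single');                      % stays in host memory
B = single(reshape(sin(1:y*x), y, x));
C = single(reshape(cos(1:x*y), x, y));
A_ref = A - C*(B*A);                       % reference full update

Bd = toDev(B);                             % small factors live on the device
Cd = toDev(C);
for c0 = 1:blk:x
    cols = c0:min(c0 + blk - 1, x);        % last block may be narrower
    Ad = toDev(A(:, cols));                % x-by-blk slice of A
    Ad = Ad - Cd*(Bd*Ad);                  % y-by-blk, then x-by-blk temporaries
    A(:, cols) = toHost(Ad);               % write the updated slice back
end

err = max(abs(A(:) - A_ref(:)));
```

Only B, C and one x-by-blk slice need to be on the device at a time, at the cost of transferring all of A once per outer iteration; whether that transfer cost is acceptable would have to be measured.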

Matt J on 15 Dec 2014

Are none of these matrices sparse? I know that the GPU doesn't support sparse matrices, but if they are sparse, maybe the CPU is better?

Octavian on 15 Dec 2014

During the first iterations A stays sparse, but then it fills in pretty quickly. Thanks, Octavio


Answers (0)
