## GPU memory code optimization

### Octavian

on 13 Dec 2014
Dear Wizards,

I would appreciate your help cracking this: my code includes gpuArray operations inside a for loop; the relevant portion is below.
```matlab
% allocate GPU memory:
A = gpuArray.eye(x, 'single');   % (step 2)  x >>> y
B = gpuArray.zeros(y, x, 'single');
C = gpuArray.zeros(x, y, 'single');
for n = 1:t                      % (step 3)  for loop begins
    ...                          % (step 4)  not relevant: B and C are 'filled'
                                 %           by specific matrix multiplications
    D = B*A;                     % (step 5)  size(D) = [y, x]
    E = C*D;                     % (step 6)  size(E) = [x, x]
    A = A - E;                   % (step 7)
    clear E D                    % (step 8)
    ...                          % (step 9)
end                              % (step 10)
```
Note that A, B, C, D, and E all change with each iteration of the loop, as they are reused.
The problem is that x is large, so A and E are huge (2 to 7 GB, depending on x), exhausting my GPU's memory. I made it run, albeit slowly, by breaking E up, performing steps 6-7 row-wise over A:

```matlab
for i = 1:size(A,1)
    E = C(i,:)*D;          % one row of E at a time
    A(i,:) = A(i,:) - E;
    clear E                % D is still needed by later rows
end
```
1. This works, but it is very slow. Is there a way to perform the same computation on blocks of n rows at once, rather than one row at a time (with n scaled to what the GPU can hold, where x = k*n + p and p < n)? Or could mtimesx-like bsxfun routines be used for the matrix multiplication?
2. It would be great if A itself could be broken into blocks of rows or columns, or processed one row or column at a time, but that is above my pay grade, given that A is the right-hand multiplier in step 5. This would allow me to increase the size of x I can use.
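For question 1, a block-row version of steps 6-7 might look like the sketch below. `blockSize` is a hypothetical tuning parameter, not from the original code; it should be chosen so that a blockSize-by-x single-precision block fits in free GPU memory.

```matlab
% Block-row update: compute a blockSize-by-x slice of E at a time
% instead of a single row, then consume it immediately.
blockSize = 1024;                           % hypothetical; tune to free GPU memory
for i = 1:blockSize:size(A,1)
    j = min(i + blockSize - 1, size(A,1));  % last row of this block
    A(i:j,:) = A(i:j,:) - C(i:j,:)*D;       % block of E formed and applied at once
end
```

The last block simply ends at row x, which handles the remainder p in x = k*n + p without a special case.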
Thank you, as always,
Octavio

### Octavian

on 14 Dec 2014
Dear Matt,

1. From the code above, the step F = A*B'; suggests that A is needed to generate F, which along with B leads to C in several steps, which in turn leads to the 'useful loop output'. All the other steps involving A just adjust A for the next iteration of F = A*B'.
2. The intended sizes are y = 64 or 128 and x = 40960 ((64^2)*10) or 163840 ((128^2)*10). As of now, I can use the code as-is for y up to 40, and up to 50 if I use the small for loop from my original message:
```matlab
for i = 1:size(A,1)
    E = C(i,:)*D;
    A(i,:) = A(i,:) - E;
    clear E                % D is still needed by later rows
end
```
where I break E row-wise. I was hoping to:

1. Speed up this small loop by applying the operations not row-wise (as listed) but to blocks of n rows of A at once (with n scaled to what the GPU can hold, where x = k*n + p and p < n), or by some other operation that increases the speed on the GPU.

2. Better, I wanted to see whether your suggestion above could be extended to higher dimensions (y = 64) by keeping Ix or A_old on the CPU and iterating most of the other loop steps on the GPU, then gathering the update to A at each iteration, adding it to Ix or A_old, doing F = A*B' on the CPU, and continuing the iteration from there. Or, as you suggested, finding a smaller-size matrix formula that can iterate quickly on the CPU altogether. The issue is that I do not have enough GPU memory to allocate I = eye(40960, 'single').

Thanks for your help,
Octavio
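A minimal sketch of the CPU/GPU split described in point 2 above. Since C (x-by-y) and D (y-by-x) are small for y = 64, they can stay on the GPU while the huge x-by-x A lives on the CPU; only blocks of E cross the PCIe bus. `blockSize` is a hypothetical tuning parameter.

```matlab
% A stays on the CPU (too large for the GPU); blocks of E = C*D are
% computed on the GPU, gathered, and applied on the CPU.
A = eye(x, 'single');                       % host memory
blockSize = 1024;                           % hypothetical; tune to GPU memory
for i = 1:blockSize:x
    j = min(i + blockSize - 1, x);
    Eblk = gather(C(i:j,:)*D);              % block of E computed on the GPU
    A(i:j,:) = A(i:j,:) - Eblk;             % update applied on the CPU
end
% F = A*B' can then be formed on the CPU, gathering B once per iteration:
% F = A * gather(B)';
```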
### Matt J

on 15 Dec 2014
Are none of these matrices sparse? I know that the GPU doesn't support sparse matrices, but if they are sparse, maybe the CPU is better?
### Octavian

on 15 Dec 2014
During the first iterations A stays sparse, but then it fills in pretty quickly. Thanks, Octavio
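If the early iterations are run with a sparse A on the CPU, the fill-in can be monitored and the representation switched once sparsity no longer pays off. The 0.1 threshold below is an arbitrary illustration, not a recommendation from this thread.

```matlab
% Track fill-in of a sparse A and switch to dense once it densifies.
density = nnz(A) / numel(A);
if density > 0.1        % arbitrary threshold, for illustration only
    A = full(A);        % dense from here on (move to gpuArray if it fits)
end
```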