I would appreciate if you could break this: My code includes gpuArray operations inside a for loop; the relevant portion is here:
- % allocate gpu memory:
- A=GPUArray.eye(x,'single'); B=GPUArray.zeros(y,x,'single'); C=GPUArray.zeros(x,y,'single'); % x>>>y
- for n=1:t %for loop begins
- ... % not relevant, B and C are 'filled' by specific matrix multiplications
- D=B*A; % size(D)= (y,x)
- E=C*D; % size(E)= (x,x)
- clear E D
I must mention that all of A,B,C,D,E are different with each iteration in the for loop as they are reused.
The problem is that x is large, and A and E are huge (2 to 7Gb, depending on x), killing my gpu. I made it run, albeit slowly, by breaking E (performing operations row-wise in A for steps 6-7 above:
for i=1:size (A,1)
clear E D
1. This works, but is very slow, I was wondering if there is a way to calculate the same for blocks of n rows at once, not one row at a time (with n scaled based on what the gpu can take, where x=kn+p, where p<n); or using mtimesx-like bsxfun routines for matrix multiplication.
2. It would be great if A could be broken in blocks of rows or columns, or in one at a time (row-wise or column-wise), however this is above my job description, given that A is the right multiplier in step 5. This would allow me to expand the size of x I can use.
Thank you, as always Octavio