Can A = A + B'*B be sped up somehow? It is seriously bottlenecking my for-loop

3 views (last 30 days)
I have a script where a very large matrix A (square, up to about 10000 x 10000 values) is initialized outside of a for-loop and is then overwritten many times within the loop like this:
A = zeros(6588,6588);
for i =1:1000
% B changes to a new value here and is a 27x6588 matrix
A = A + B' * B;
end
I wanted to remove the loop altogether but I think it’s impossible for me to calculate all instances of transpose(B)*B outside of the loop beforehand since I run out of memory.
Is there anything I could do to this code segment to speed it up? Computationally it’s just a few simple operations but they still take over 70% of my scripts runtime and I can’t figure out a way to improve this. Is it possible?
  1 Comment
Alfonso Nieto-Castanon
Alfonso Nieto-Castanon on 14 Aug 2015
there may be ways to reduce the computation time depending on whether B{i} show some dependencies between iterations (e.g. B{i} = C{i}*B0 with B0 having less rows than columns) or depending on what you are later using A for (e.g. eigenvectors of A may be more easily computed). Could you say a few more details of what these computations are meant to produce?

Sign in to comment.

Accepted Answer

James Tursa
James Tursa on 14 Aug 2015
Edited: James Tursa on 5 Sep 2015
If you have a C compiler available, another option is to call a BLAS routine from within a mex function to do this calculation. It may not save operation time on the B'*B part (not sure if MATLAB is smart enough to call DSYRK instead of DGEMM), but it may save on the time spent in the other parts. That is because A = A + B'*B can be done with one BLAS routine call (this type of update is built into the BLAS library code), as opposed to doing B'*B first into temporary storage, then adding it to A, then free'ing the temporary storage. E.g., the doc for DSYRK shows that it can do the following operation in one call, without forming B' explicitly first:
A := alpha * B^T * B + beta * A
So for your case, alpha = 1 and beta = 1. Also, by utilizing this BLAS call everything can be done in-place (i.e., no need to temporarily form the right-hand-side result first and then assign it to A).
Let me know if you want to explore this route and if so then I can code it up for you.
EDIT: 9/4/2015
SORRY this has taken so long! I wrote the code actually fairly quickly but had problems in debugging. Code worked in 32-bit but not in 64-bit, but identical code I had in another program worked in both 32-bit and 64-bit. Finally I figured out it was because for this new program I was forgetting to include the -largeArrayDims compilation flag in the mex command (needed to get the mwSignedIndex macro to return the correct type). In any event, here is the code (I changed the integers to ptrdiff_t to match blas.h). Hopefully it will shave some execution time off of your runs.
Two things to keep in mind. First, it does all the calculations IN-PLACE! So make sure your accumulator matrix ("A" in your example) is not shared. If you set it with the zeros function prior to the loop, then it will not be shared and you will be OK. Second, it only does one half of the matrix calculations when using two arguments. After you are done with the loop, then call it once with only one argument to fill in the other half. E.g., A calling sequence to accomplish your above example would be:
A = zeros(6588,6588);
for i =1:1000
% B changes to a new value here and is a 27x6588 matrix
dsyrk(A,B); % Does A = A + B'*B in-place, but lower triangle only
end
dsyrk(A); % Fills in the upper triangle from the lower triangle
And here is the mex routine dsyrk.c
/*--------------------------------------------------------------------------------------
* DSYRK does the operation C = C + A' * A in place
*
* Syntax: dsyrk(C,A) --> does C = C + A' * A (only lower triangle part)
* dsyrk(C) --> fills the upper triangle with the lower triangle
*
* C = a real double full N x N matrix
* A = a real double full K x N matrix
*
* The intent is to first initialize C, then call dsyrk(C,A) in a loop for various A
* matrices, then after the loop call dsyrk(C) once to fill in the upper triangle part.
* The code does the operations on C in place, so it is up to the user to make sure
* that C is not shared with any other variable prior to calling dsyrk.
*
* The code uses ptrdiff_t for the integer types because that is what the header file
* blas.h uses for the dsyrk function arguments.
*
* Programmer: James Tursa
*-------------------------------------------------------------------------------------- */
#include "mex.h"
#include "blas.h"
void xFILLPOS(double *Cpr, ptrdiff_t n);
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double alpha = 1.0, beta = 1.0;
double *Apr, *Cpr;
char uplo = 'L';
char trans = 'T';
ptrdiff_t m, n, k, p, lda, ldc;
if( nlhs ) {
mexErrMsgTxt("Too many outputs ... this routine does all operations in-place");
}
if( nrhs == 2 ) {
Cpr = mxGetPr(prhs[0]);
m = mxGetM(prhs[0]);
n = mxGetN(prhs[0]);
Apr = mxGetPr(prhs[1]);
k = mxGetM(prhs[1]);
p = mxGetN(prhs[1]);
if( m != n || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]) ||
mxGetNumberOfDimensions(prhs[0]) != 2 ) {
mexErrMsgTxt("1st input must be real double non-sparse square matrix");
}
if( p != n || !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxIsSparse(prhs[1]) ||
mxGetNumberOfDimensions(prhs[1]) != 2 ) {
mexErrMsgTxt("2nd input must be real double non-sparse matrix compatible with 1st input");
}
lda = k;
ldc = n;
dsyrk( &uplo, &trans, &n, &k, &alpha, Apr, &lda, &beta, Cpr, &ldc );
} else if( nrhs == 1 ) {
Cpr = mxGetPr(prhs[0]);
m = mxGetM(prhs[0]);
n = mxGetN(prhs[0]);
if( m != n || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]) ||
mxGetNumberOfDimensions(prhs[0]) != 2 ) {
mexErrMsgTxt("1st input must be real double non-sparse square matrix");
}
xFILLPOS(Cpr,n);
} else {
mexErrMsgTxt("Need exactly 1 or 2 inputs");
}
}
/*--------------------------------------------------------------------------------------
* Fill the upper triangle with contents of the lower triangle
*-------------------------------------------------------------------------------------- */
void xFILLPOS(double *Cpr, ptrdiff_t n)
{
double *source, *target;
register ptrdiff_t i, j;
source = Cpr + 1;
target = Cpr + n;
for( i=1; i<n; i++ ) {
for( j=i; j<n; j++ ) {
*target = *source;
target += n;
source++;
}
source += i + 1;
target = source + n - 1;
}
}
  6 Comments
James Tursa
James Tursa on 20 Aug 2015
FYI, I haven't forgotten about this. I wrote the function but need to debug it ... should be able to post it in a day or two.
Peta
Peta on 20 Aug 2015
No problems, you are very kind for doing this and I’m not in any extreme hurry so take your time!

Sign in to comment.

More Answers (4)

arich82
arich82 on 14 Aug 2015
Edited: arich82 on 14 Aug 2015
Echoing Walter's response, you need more memory if you want it to go faster. One thing you might consider is using single precision variables instead of double, which effectively doubles the number of variables you can handle, but at the obvious cost of numerical precision.
If you're willing to get more memory or reduce your precision, you can get a 20x speed-up by pre-computing all of your B matrices ahead of time, then reshaping to make the computation just a single (albeit very large) matrix multiplication. You might also get a modest speed-up using James Tursa's venerable mtimesx package from the File Exchange (link here ).
Feel free to play with the function below to get some ideas. Note that I have an obscenely large quantity of RAM to play with, so you might want to reduce k_max to something more reasonable than 1000 before running it:
function symmat_add(k_max)
m = 27;
n = 6588;
%k_max = 1000; % use input variable...
% baseline
rng(0);
tic;
A = zeros(n, n);
for k = 1:k_max
% B changes to a new value here and is a 27x6588 matrix
B = rand(m, n);
A = A + B.'*B;
end
toc;
% mtimesx basic
rng(0);
tic;
B_mtx = rand(m, n, k_max);
A_mtx = sum(mtimesx(B_mtx, 'T', B_mtx), 3);
toc;
% mtimesx BLAS (i.e. SSYRK/DSYRK)
rng(0);
tic;
B_blas = rand(m, n, k_max);
A_blas = sum(mtimesx(B_blas, 'T', B_blas, 'blas'), 3);
toc;
% mtimesx speed
rng(0);
tic;
B_speed = rand(m, n, k_max);
A_speed = mtimesx(B_speed, 'T', B_speed, 'speed');
A_speed = sum(A_speed, 3);
toc;
% permute
rng(0);
tic;
B_perm = rand(m, n, k_max);
B_perm = reshape(permute(B_perm, [1, 3, 2]), m*k_max, n);
A_perm = B_perm.'*B_perm;
toc;
% verify results
norm(A - A_mtx, 'fro')
norm(A - A_blas, 'fro')
norm(A - A_speed, 'fro')
norm(A - A_perm, 'fro')
end % function symmat_add
For k_max = 10, I get:
>> symmat_add(10)
Elapsed time is 1.937278 seconds.
Elapsed time is 1.710109 seconds.
Elapsed time is 1.664435 seconds.
Elapsed time is 1.532890 seconds.
Elapsed time is 0.207303 seconds.
ans = 0
ans = 0
ans = 0
ans = 1.1347e-10
For k_max = 100, I get:
>> symmat_add(100)
Elapsed time is 17.950650 seconds.
Elapsed time is 17.012750 seconds.
Elapsed time is 17.257742 seconds.
Elapsed time is 15.000076 seconds.
Elapsed time is 0.855950 seconds.
ans = 0
ans = 0
ans = 0
ans = 1.4813e-09
For k_max = 1000, there seems to be an issue with mtimesx as it starts using TB's of memory, so I killed it. However, comparing just the baseline and permute portions of the code, I get:
>> symmat_add(1000)
Elapsed time is 171.627770 seconds.
Elapsed time is 6.422042 seconds.
ans = 3.9958e-08
Note that while the permute trick is by far the fastest, it also gives the largest apparent error, and the difference grows with the size of k_max; it may not be wise to mix this with single precision...
In summary, try to get more RAM, pre-compute B, then consider using permute to accelerate the computation. Your 27-by-6588-by-1000 B matrix should only require about 1.4 GB in double precision, which is large but reasonable, especially if you can pre-compute it and just load the resulting A matrix when you need to compute other large variables in your workspace.
I hope this helps.
  1 Comment
Peta
Peta on 15 Aug 2015
Thank you for your great answer. I have tried your suggestions using both random data and real data from my script and I also find that the permute method is the fastest.
However, since people suggested that it could be helpful to precalculate all B iterations before the loop I have now done that (and for B’ ), so maybe the situation is different now regarding what can be achieved with the calculation? This is what I have now:
First I precalculate all instances of B , which is a cell array containing 1073 cells where each cell is 27x6588 using cellfun:
Bcell = cellfun(@(a,b) b*a,D,Apcell,'UniformOutput', false);
% D is here a 1073x1 cell array where each cell is 27x6588
% Apcell is here a 1073x1 cell array where each cell is 27x27
Then I precalculate all transposed versions of B , also using cellfun , so that this doesn’t have to be done every loop iteration:
BcellTrans = cellfun(@transpose,Bcell,'UniformOutput', false);
Then I convert the cell array Bcell into a 27x6588x1073 matrix and its transposed variant BcellTrans into a 6588x27x1073 matrix:
for i = 1073:-1:1
B(:,:,i) = Bcell{i};
Bt(:,:,i) = BcellTrans{i};
end % By the way, is there a faster way to do this without the loop??
And now I can calculate my A , using either the original method or the permute method suggested by arich82 :
%%Original method to calculate A
tic
A = zeros(6588, 6588);
for k = 1:1073
A = A + Bt(:,:,k)*B(:,:,k);
end
toc %This took 247 seconds
%%Permute method to calculate A
tic
B = reshape(permute(B, [1, 3, 2]), size(B,1)*size(B,3), size(B,2));
A_perm = B.'*B;
toc %This took 35 seconds
% verify results
norm(A - A_perm, 'fro')
ans = 14.77
So the permute version is faster, however when I verify the results I get 14.77 even when everything is in doubles which sounds like quite a lot?
Will I have to live with some inaccuracy if I wish to increase the speed or are there other options now that I have the precalculated B and B’ at my disposal?

Sign in to comment.


Walter Roberson
Walter Roberson on 13 Aug 2015
Are the values of B independent of the previous iteration? Could you calculate any one B by knowing an initial value and the number of the iteration that is being done ("i" in your sample code)?
If so and if you have the Parallel Computing Toolbox, you could use parfor for this. Your "A" would be a "reduction variable":
parfor i = 1 : 1000
B = some calculation that does not depend upon the previous B but might depend on "i"
A = A + B' * B;
end
  2 Comments
Peta
Peta on 14 Aug 2015
Yes I have the Parallel Computing Toolbox and I tried doing it in a parfor loop instead, however I don’t think it’s a viable option in my case since parfor increases the memory usage compared to a normal for-loop so I was dangerously close to running out of RAM when I tried it.
The fastest option I have tried so far is keeping the for-loop and making A and B into gpuArrays before doing the calculations. That speeds it up quite a lot but when A and B gets big enough I run out of memory on my GPU so it’s still not ideal.
But shouldn’t it be possible to remove the loop altogether and replace it with something else that is faster without having a completely overwhelming memory usage?
Jan
Jan on 14 Aug 2015
Edited: Jan on 14 Aug 2015
If your memory gets low, install more memory. This is trivial, but efficient. Much more efficient than any smart tricks. Except for the trick to avoid repeated calculations. Are you able to create B'*B directly using the algorithm to create B?
Note that Matlab's JIT acceleration suffers significantly from variables, which change their type and similar things. So please post the relevant part of the code, which reproduces the problem. Perhaps we find an easy way to accelerate the code without changing the method.
Any parallel processing requires more RAM for physical reasons.

Sign in to comment.


Matt J
Matt J on 20 Aug 2015
Edited: Matt J on 20 Aug 2015
I wanted to remove the loop altogether but I think it’s impossible for me to calculate all instances of transpose(B)*B outside of the loop beforehand since I run out of memory.
Maybe you don't have enough memory to hold all instances of B a priori, but you should have enough memory to hold batches of, say, 200 of them at a time. Concatenating 200 generations of B.' column-wise would give you a 6588x5400 matrix, still smaller then A. This would allow you to do the update of A in bigger, vectorized chunks, as follows:
A = zeros(6588,6588);
for n=1:5
for i =1:200
BTcell{1,i}=...%put transpose(B) in here
end
Bt=cell2mat(Bcell);
A = A + Bt * Bt';
end
So, now, the update A = A + B' * B effectively only gets executed 5 times instead of 1000. This is important, because no matter how many rows B has, the computation of B'*B always results in a 6588x6588 memory allocation. So, pre-concatenating more B row data, like above, ensures that the memory allocation is repeated only 5 times versus 1000.
  3 Comments
Matt J
Matt J on 22 Aug 2015
Edited: Matt J on 22 Aug 2015
B is right now a 1073x1 cell array where each cell is 27x6588.
Basically what I'm saying is, instead of partitioning your data into 27x6588 chunks called B , partition instead in bigger N x 6588 chunks called Z where N>>27 is as large as your computer can handle. Then update your A matrix in the same way, but with Z
A = A + Z'*Z;
This way, the total number of updates of A and the accompanying 6588 x 6588 memory allocations will be reduced.
Now, the naive way to create the Z chunks from your current B chunks is to concatenate some of your B data row-wise. But row-wise concatenation is slow, so if there's a way you can either
(a) generate Z chunks directly, or
(b) generate B from scratch in 6588 x 27 form, i.e., with rows and columns interchanged, and then concatenate column-wise, or
(c) transpose the 27 x 6588 data before storing them to your cell and similarly concatenate column-wise,
then I think you will be better off. However, if you build the Z matrices by column-wise concatenation, then the way you update A must get modified to
A=A+Z*Z.';
First, generate all the 27x6588 chunks as you are now, except that instead make them 6588x27, either by transposition
Matt J
Matt J on 22 Aug 2015
Compare:
B=rand(27,6588);
Z=repmat(B,20,1);
A1=zeros(6588);
tic;
for i=1:40
A1=A1+B.'*B;
end
toc
%Elapsed time is 21.234253 seconds.
A2=zeros(6588);
tic
for i=1:2
A2=A2+Z'*Z;
end
toc
%Elapsed time is 1.343334 seconds.
>> Error=max(abs(A1(:)-A2(:)))
Error =
1.1937e-12

Sign in to comment.


Josh Lee
Josh Lee on 3 Oct 2015
Is your data comprised only of real values? If so have you tried using .' instead of '? The first performs only a transposition operation while the second performs a complex conjugation and transposition operations. I'm sure there are greater performance gains to be had by implemented the rather well reasoned solutions mentioned earlier. But every little bit helps, right?
  1 Comment
Matt J
Matt J on 3 Oct 2015
Edited: Matt J on 3 Oct 2015
You won't see a performance difference between .' and ' on purely real data. MATLAB will see that no imaginary part is present and skip anything conjugation-related.

Sign in to comment.

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!