## How can I do matrix multiplication on sets of data in an array without looping?

### John Kinch (view profile)

on 16 Apr 2019
Latest activity Commented on by John Kinch

on 16 Apr 2019

### James Tursa (view profile)

I'm calculating forward kinematics for a robotic maniupulator running at 500 hz. This means that I have millions of sets of joint angles that I'm doing forwad kinematics on and I'm attempting to do this mathematics as quickly as possible. I've been able to do everything quickly outside of a loop except for the final matrix multiplication. My loop takes multiple arrays that are Nx4, it's broken down into 4x4 matrices that are multiplied by eachother to get from base frame to the final end effector frame. Here's a sample of the data (I can't share as much info as I would hope), it's 3 arrays each with 3 sets of 4x4 matrices.
Tc0 = [1,0,0,0; 0,1,0,0; 0,0,1,0; 0,0,0,1; 1,0,0,0; 0,1,0,0; 0,0,1,0; 0,0,0,1; 1,0,0,0; 0,1,0,0; 0,0,1,0; 0,0,0,1];
T01 = [0.74,-0.66,0,0.0012; 0.66,0.74,0.002,0.0007; -0.0013,-0.0015,0.999,0.35; 0,0,0,1; 0.746,-0.66,0,0.0012; 0.66,0.74,0.002,0.0007; -0.0013,-0.0015,0.99,0.35; 0,0,0,1; 0.746,-0.66,0,0.0012; 0.66,0.74,0.002,0.0007; -0.0013,-0.0015,0.99,0.35; 0,0,0,1];
T12 = [0.82,0.57,0,0.003; 0.0009,-0.0012,-0.999,-0.25; -0.57,0.82,-0.0015,-0.0004; 0,0,0,1; 0.82,0.57,0,0.003; 0.0009,-0.0012,-0.999,-0.25; -0.57,0.82,-0.0015,-0.0004; 0,0,0,1; 0.82,0.57,0,0.003; 0.0009,-0.0012,-0.999,-0.25; -0.57,0.82,-0.0015,-0.0004; 0,0,0,1];
q = size(T12,1);
Tc3 = zeros(q,4);
for i=1:4:q
Tc3(i:i+3,1:4) = ((Tc0(i:i+3,1:4)*T01(i:i+3,1:4))*T12(i:i+3,1:4));
end
So that takes about 2 seconds for every 100,000 loops, which isn't bad. But over millions of data points it really adds up. I'd like to avoid using a loop but I can't just multiply the entire arrays together (not that I know of since elementwise multiplication isn't correct).
Any ideas?
Thanks
EDIT: I'm wondering if maybe there would be a way to do this using 3D arrays? There Tc0 = reshape(Tc0,4,4,3); T01=reshape(T01,4,4,3); T12=reshape(T01,4,4,3) and then somehow multiply each immediately with respect to their 3rd dimension index?

### James Tursa (view profile)

on 16 Apr 2019
Edited by James Tursa

### James Tursa (view profile)

on 16 Apr 2019

First, some basic data storage advice. When you have multiple matrices stored in a larger matrix or array, it is usually best to have the individual matrices stored contiguously in memory. This does two important things for you:
(1) Allows the cache to work better for you because you don't have to jump around in memory to piece together the elements to form the smaller matrices you want to work with.
(2) Allows you to use several of the FEX submissions that do nD matrix multiply on the 2D pages of your smaller matrices.
Unfortunately, you don't have this situation with your data. Because you have stacked your individual 4x4 matrices vertically in your larger matrix the individual 4x4 matrix elements are scattered all over in memory and are not contiguous. This defeats what the cache would have helped you with and slows down your 4x4 matrix access times. And the sub-matrix formation also creates a lot of temporary matrices that need data copying. Most of the time will be spent doing this extraneous stuff rather than doing the actual multiply calculations.
I would suggest that you reorder your data to make the 4x4 matrices contiguous in memory. That will make the downstream programming much easier and allow you to use the very fast FEX submissions for the actual multiplies.
E.g., here is a comparison of your looping M-code with 4x4 elements scattered in memory vs the same calculations done in compiled mex C-code using OpenMP parallel processing:
% Some large sample data
n = 1000000;
Tc0 = rand(4*n,4);
T01 = rand(4*n,4);
T12 = rand(4*n,4);
% The loop method with 4x4 elements scattered in memory
disp(' ');
disp('The M-code loops method:')
tic
q = size(T12,1);
Tc3 = zeros(q,4);
for i=1:4:q
Tc3(i:i+3,1:4) = ((Tc0(i:i+3,1:4)*T01(i:i+3,1:4))*T12(i:i+3,1:4));
end
toc
Tc0t = reshape(Tc0.',4,4,[]); Tc0t = permute(Tc0t,[2 1 3]);
T01t = reshape(T01.',4,4,[]); T01t = permute(T01t,[2 1 3]);
T12t = reshape(T12.',4,4,[]); T12t = permute(T12t,[2 1 3]);
disp(' ');
disp('The MTIMESX method:');
fprintf('The compiler used: %s\n',mtimesx('compiler')) % Load mtimesx and print out the compiler used
fprintf('The calculation mode: %s\n',mtimesx('speedomp')) % Put mtimesx into fastest mode
% Do the nD matrix multiply on the 2D pages with compiled parallel mex C-code
tic
Tc3x = mtimesx(mtimesx(Tc0t,T01t),T12t);
toc
% Compare the results
disp(' ');
disp('Are the results equal?');
Tc3m = permute(reshape(Tc3.',4,4,[]),[2 1 3]);
isequal(Tc3m,Tc3x)
And a sample run:
The M-code loops method:
Elapsed time is 4.797704 seconds.
The MTIMESX method:
The compiler used: Microsoft_Visual_C++_2013_Professional_(C)_12.0_cl
The calculation mode: SPEEDOMP
Elapsed time is 0.151637 seconds.
Are the results equal?
ans =
logical
1
So, 4.797704 sec vs 0.151637 sec. That means 97% of your M-code method is spent just doing data copying and access and temorary variable creation etc, and only 3% of your M-code is spent doing the actual multiplies. Not a good ratio for the M-code! In the above example, MTIMESX actually used unrolled parallel loops to do the individual 4x4 matrix multiples (it will do this for sizes up to 5x5). For larger 2D matrix sizes, it will call into the BLAS library that MATLAB uses to do the matrix multiples. But in all cases it does not do any data copying or temporary variable creation to do the work ... it points into the variables directly to get at the 4x4 matrices.
The above example uses an FEX submission called MTIMESX which requires a supported C compiler to be installed on your machine. The author hasn't updated it for recent versions of MATLAB (he really needs to find time to do this!), so you may have an adventure getting it to compile on your machine. This, and other FEX submissions that you could use for this calculation (assuming you reorder your data as suggested) can be found here:
MTIMESX: (mex, needs C-compiler)
MMX: (mex, needs C-compiler)
MULTIPROD: (M-code)

Jon

### Jon (view profile)

on 16 Apr 2019
Very enlightening discussion - thank you.
Interestingly, the final approach I outlined in the comments to my earlier answer in which I utilized a sparse block diagonal matrix, the time for 1e6 individual matrices (as in your example) is 0.8 seconds compare to 4 for the original. I would infer from your discussion that the sparse block diagonal matrices are also arranged more efficiently in memory. So not quite as good as the dedicated library, but if that is hard to get working, due to version issues, it might be something to look at.
James Tursa

### James Tursa (view profile)

on 16 Apr 2019
Yes, a sparse block diagonal formation will have the individual 4x4 matrices contiguous in memory, so that definitely helps.
John Kinch

### John Kinch (view profile)

on 16 Apr 2019
This looks like the answer I was looking for. I tried the mtimesx myself early on but couldn't get it to work with my C compiler (MinGW64) When I run mtimesx I get the error:
Error using mtimesx_build (line 169)
A C/C++ compiler has not been selected with mex -setup
>> mtimesx(T01t,T12t)
... Build routine for mtimesx
... Checking for PC
... Finding path of mtimesx C source code files
... Found file mtimesx.c in \\space\dfsroot\Users\jkinch\Private\My Documents\MATLAB\MTIMESX\mtimesx.c
... Found file mtimesx_RealTimesReal.c in \\space\dfsroot\Users\jkinch\Private\My Documents\MATLAB\MTIMESX\mtimesx_RealTimesReal.c
Error using mtimesx_build (line 169)
A C/C++ compiler has not been selected with mex -setup
When I run mex -setup
>> mex -setup
MEX configured to use 'MinGW64 Compiler (C)' for C language compilation.
Warning: The MATLAB C and Fortran API has changed to support MATLAB
variables with more than 2^32-1 elements. You will be required
to update your code to utilize the new API.
To choose a different language, select one from the following:
mex -setup C++
mex -setup FORTRAN
>>The author hasn't updated it for recent versions of MATLAB (he really needs to find time to do this!), so you may have an adventure getting it to compile on your machine
My current adventure has been fruitless, any suggestions on how to proceed? It seems like a common problem, maybe the author should find time...
Any simple workarounds that you know of? Can I just install a different compiler? Would it be as "simple" as changing to code to run on 64 bit instead of 32?
>>I would suggest that you reorder your data to make the 4x4 matrices contiguous in memory
Your talk of storing the matrices contiguously in memory may have gone completely over my head. What does that mean, and how do I do that if it's a preferable method? Is it just as simple as storing the matrices in 3D arrays?
Cheers,

on 16 Apr 2019
Edited by Jon

### Jon (view profile)

on 16 Apr 2019

I don't know if it would be any more efficient than your loop (you could check with some timing tests) but here is an alternative approach without loops. The idea is to put each of your three 4x4 matrices that are currently stacked in a 12x4 matrices into a block diagonal matrix with each of the 4x4 matrices on the main diagonal. Then you can directly multiply each of these block diagonal matrices to get your result. If you need to get it back into stacked form you can do this through indexing at the end. I'm not sure from the description you gave whether you always have just three stacked matrices or whether that was just for the example. If you typically have many more than three stacked matrices, then this approach will not scale well as you would get huge 4*N x 4*N block diagonal matrices where N is the number of stacked matrices.
Anyhow in case it is helpful here is some code that illustrates the approach
% put stacked matrices into block diagonal matrices and multiply
Tc3diag = blkdiag(Tc0(1:4,:),Tc0(5:8,:),Tc0(9:12,:))*...
blkdiag(T01(1:4,:),T01(5:8,:),T01(9:12,:))*...
blkdiag(T12(1:4,:),T12(5:8,:),T12(9:12,:));
% collapse the output block diagonal matrix into a stacked matrix
Tc3= [Tc3diag(1:4,1:4);Tc3diag(5:8,5:8);Tc3diag(9:12,9:12)];

John Kinch

### John Kinch (view profile)

on 16 Apr 2019
Thanks for the reply! Unfortunetely this is just an example, normal data sets are at least 10^6x4, 250,000 individual matrices to be multiplied
Jon

### Jon (view profile)

on 16 Apr 2019
I was also thinking along the lines of storing the values in a 3d array rather than a stacked 2d array. It still seems that you would have to have a loop, but I made the following script to compare the timing for the two approaches:
% timing comparison test on block matrix multiplication for large matrices
N = 1e6; % number of individual matrices
% looping through stacked matrices (current approach)
Tc0 = rand(4*N,4);
T01 = rand(4*N,4);
T12 = rand(4*N,4);
q = size(T12,1);
Tc3 = zeros(q,4);
disp(' ')
disp('elapsed time for current approach')
tic
for i=1:4:q
Tc3(i:i+3,1:4) = ((Tc0(i:i+3,1:4)*T01(i:i+3,1:4))*T12(i:i+3,1:4));
end
toc
% looping through 3d array (alternate approach)
Tc0 = rand(4,4,N);
T01 = rand(4,4,N);
T12 = rand(4,4,N);
Tc3 = zeros(4,4,N);
disp(' ')
disp('elapsed time for alternate approach')
tic
for i = 1:N
TC3 = Tc0(:,:,i)*T01(:,:,i)*T12(:,:,i);
end
toc
I found that the 3d approach was in fact faster typically it took approximately 60% as long as the current approach. So not a huge savings, but something. I think to use this though you would want to always store the data in the 3d arrays. Otherwise, I think you'd probably use up some of your savings with any reshaping.
By the way, if you needed to do the reshaping, you can do this using
Tc0_3d = permute(reshape(Tc0',4,4,N),[2,1,3])
Note that in the above you first transpose the matrix , and then permute the matrix dimensions to deal with the fact that the reshape acts columnwise.
Jon

### Jon (view profile)

on 16 Apr 2019
Actually, going back to the original idea of keeping the data in block diagonal matrices, this may still be a viable approach if you use sparse block diagonal matrices. In this case I think the memory requirements would be similar to what you are doing already. In the attached code I did a timing test for this approach with N = 1e6 individual blocks. I found that the matrix multiply, which is now just one line of code, takes only 0.8 seconds compared to 2.5 seconds to 4 seconds for the other approaches. This seems to be a substantial speedup. I don't know whether you have the freedom to rewrite the underlying code so that you can just stay with sparse block diagonal matrices. Otherwise going back and forth from stacked to sparse block diagonal would probably eat up any time savings.
%% sparse block matrix multiplication
N = 1e6
% make sparse block diagonal matrices
Tc0blocks = cell(N,1);
T01blocks = cell(N,1);
T12blocks = cell(N,1);
for k = 1:N
Tc0blocks{k} = sparse(rand(4,4));
T01blocks{k}= sparse(rand(4,4));
T12blocks{k} = sparse(rand(4,4));
end
% combine into block diagonal matrices
% They will be sparse because elements are sparse
Tc0 = blkdiag(Tc0blocks{:});
T01 = blkdiag(T01blocks{:});
T12 = blkdiag(T12blocks{:});
disp(' ')
disp('elapsed time for sparse block diagonal matrix approach')
tic
Tc3 = Tc0*T01*T12;
toc
Note in my timing I only count the part where the multiplication if done. The other part is just setting up to do the test