MATLAB Answers

3d matrix multiplication with reduce computation time

15 views (last 30 days)
Abi Waqas
Abi Waqas on 28 Aug 2017
Edited: James Tursa on 30 Aug 2017
Hello Everyone
I have two 3D-matrices A(M,N,I) and B(M,N,I) with complex numbers. Values of M,N and I are large say in the range 500-5000. An easy way to do matrix multiplication is to use for loop for I, but it hugely slows down my rest of the code because I have to do this multiplication thousand of time.
I have been searching the answer in MATLAB answer since two days. The recommended solutions are using following MEX files
1. ' ndfun ': It's very fast and doesn't support complex numbers.
2. ' MMX ': Seems like the version of 'ndfun' but also don't support complex numbers.
3. ' mtimesx ': This is the only solution I found which fits for my situation. It's really fast with real numbers but when I use the complex number the estimated time is greater than simply using for loop.
Can anyone here please recommend the solution in which I can do multiplication of two complex 3D-matrices without using the for the loop but reduces computation time?
I am really looking for a positive response.
Regards abi

  3 Comments

James Tursa
James Tursa on 28 Aug 2017
Can you post the m-code equivalent of what you are doing and the exact sized involved? Also, what MATLAB version and platform are you using? For the size ranges you mention above, the only speedup to be expected is likely to be the 2D slice copies that can be avoided in a mex routine, but maybe newer versions of MATLAB are smart enough to not make these copies.
Abi Waqas
Abi Waqas on 29 Aug 2017
I am using Windows 10, MATLAB 2016 b. I have placed the code below. Please note that the code I place here is but different from the original. I modified it to just to the timings information. The Matrix A and B, W1 and COEFF_S I keep them same for simplification but they are going to change for each iteration (comes from another part of the code). In the end Final and A matrices are same.
If you just use the real number in COFF(1:3) you will see a clear speed up up-to 3x. But when I am using complex numbers then "mtimesx" slows down drastically.
if true
clear ; close all ; clc
W1=zeros(500,500,500) ;
PRODOTTO3 = ((rand(500,500,500)>0.97));
PD3 = sparse(PRODOTTO3(:,:)) ;
COEFF = zeros(round(npolinomi),1) ;
COEFF(1:3) = [0.5+0.5i , 0.12+0.05i,0.001+0.001i] ;
COEFF=sparse(COEFF) ;
COEFF_S = sparse(size(COEFF,1),size(COEFF,1)) ;
tic ;
for l = 1:500
COEFF_S=reshape( COEFF.'*(PD3(:,:)) , size(COEFF,1), []).';
W1(:,:,l)=full(COEFF_S) ;
end
for j = 1 : 30
A = mtimesx(W1,W1) ;
end
toc
tic ;
for l = 1:500 %freq.
COEFF_S=reshape( COEFF.'*(PD3(:,:)) , size(COEFF,1), []).';
for j = 1 : 30
B = COEFF_S*COEFF_S ;
end
Final(:,:,l)=full(B) ;
end
toc

Sign in to comment.

Accepted Answer

James Tursa
James Tursa on 29 Aug 2017
Some testing issues I had:
I couldn't run the posted code because npolinomi was not defined. Looking at the code, I made a guess and set npolinomi = 500. The code then ran, but ran out of memory on my machine. So I dropped all of the 500's to 200's just to get things to run.
Speed observations:
It's appears to be somewhat of apples to oranges comparison here. You have sparse and full matrices in your testing mix, as well as logical variables. The fact that you have sparse matrices involved make a HUGE difference and this was not mentioned in your original post. For matrix multiplies, MTIMESX only has code for full double and full single variables. So you appear to be comparing a full*full matrix multiply (via MTIMESX) along with the W1 building vs a sparse*sparse multiply. The more sparse the matrices are, the better the sparse*sparse formulation will perform for speed. So if the matrices are sparse enough, it would not surprise me if the sparse code was faster. In addition, in my 32-bit R2011a PCWIN version the MTIMESX code runs faster, so it appears the newer MATLAB versions have been improved for sparse matrix multiply speed.
So, back to the original question of how to speed up your code given that there are sparse matrices involved. This is hard to say because the code you show is boiled down to the point that I cannot really tell if there is anything that could be sped up with a mex routine or with other methods. Working with sparse matrices in mex routines is not trivial, particularly if you are trying to combine them or incrementally build them. So I would have to see more detail about your real code before I could make any suggestions. E.g., how is COEFF_S really built? Reshaping a sparse matrix (even transposing a sparse vector) invokes a deep data copy, so maybe something could be changed here to avoid that data copy. Are there any calculations that could be moved outside the loop? Are there any full/sparse conversions that could be avoided? Etc, etc. We can't really advise on any of this based on the boiled down code posted above.

  7 Comments

Show 4 older comments
James Tursa
James Tursa on 29 Aug 2017
Matt has already explained the 4x complex speed difference, so I won't get into that.
I don't understand this comment of yours at all:
"One thing to note here is, in the second case multiplication is done using for loop instead of using 'mtimes' which technically consume more time."
In your code above, the j=1:30 loop isn't needed for anything. You are just repeating the same calculation 30 times. Obviously to speed up this particular case I would simply advise getting rid of the loop entirely and do the code that is inside the loop only once. But I assume this loop is present because it is a boiled down version of your real code where this loop actually does something. Since I haven't seen your real code here is is impossible for me to speculate on how to speed it up. And that is the crux of the issue here ... without seeing what your real code is doing, we don't have enough information to offer advice on speeding it up.
Example, take the following statement:
Final(:,:,l)=full(B)
That assignment can be done directly in a mex routine without the need to physically form full(B) first. But how much difference will this small change make and will other techniques make a bigger difference? Hard to say since I don't know what is really going on with the other parts of the code inside the loops in your real code.
James Tursa
James Tursa on 29 Aug 2017
I need to think about the 131s --> 136s timing result you are seeing with the sparse code ...
James Tursa
James Tursa on 30 Aug 2017
So, after thinking about it and making a few crude tests, I have a hypothesis (which may or may not be true).
The matrix multiply employed by MATLAB (and MTIMESX) for full matrices is contained in a 3rd party BLAS library that is highly optimized for speed, and internally employs multi-threading. E.g., the generic matrix multiply routine is called DGEMM for real double precision matrices. The complex counterpart to this routine is called ZGEMM, but this routine requires the real and imaginary parts of variables to be interleaved in memory like Fortran and C/C++. But in MATLAB, the real and imaginary parts of complex variables are stored in different parts of memory. So MATLAB had to make a choice when doing a complex matrix multiply ... copy the real and imaginary parts into temporary interleaved memory to call ZGEMM, or just call DGEMM on the appropriate real and imaginary parts? MATLAB chose the latter (and in fact that is what MTIMESX does also). Going from a real matrix multiply to a complex matrix multiply then means that DGEMM is called four times instead of once. In both cases the DGEMM code executes in a multi-threaded fashion as long as the sizes are large enough to justify it. So you see the 3x - 4x speed difference. That is all to be expected. On my system the CPU monitor is pegged at about 50% for both of these cases (I assume this represents 1/2 of the logical cores but all of the physical cores in the system).
For sparse real matrix multiply, it does not appear to me (when watching the CPU monitor) that the calculation is multi-threaded. The CPU usage is down at around 6% typically, which is what I would expect for my system is there is no multi-threading going on (50% / 8 cores).
For complex sparse matrix multiply, I suspect that this is not multi-threaded either, but the elements are calculated serially. Only those calculations where both operands are non-zero are done (see note below). My suspicion is that some type of appropriate SSE calculation(s) is done to calculate each individual complex element multiply, and that is where the complex speed increase is coming from. I.e., some SSE instruction(s) can essentially carry out the complex element multiply in about the same time as a single real element multiply. Hence you don't see much difference in the sparse real vs complex matrix multiply total times.
But ... THIS IS ALL JUST A GUESS!
Side Note: There is a mismatch in the underlying sparse matrix multiply algorithm used when there are NaN's involved. E.g.,
>> a
a =
NaN 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0.9747 0
0.9696 0 0 0 0
>> s = sparse(a);
>> a*a
ans =
NaN NaN NaN NaN NaN
NaN 0 0 0 0
NaN 0 0 0 0
NaN 0 0 0.9501 0
NaN 0 0 0 0
>> full(s*s)
ans =
NaN 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0.9501 0
NaN 0 0 0 0
You can see that the NaN's did not propagate appropriately in the sparse matrix multiply case. The sparse algorithm did not do any of the NaN*0 calculations ... it simply assumed the result would be 0 since one of the operands was 0. This, I suppose, could probably be considered a bug in MATLAB. Maybe I will report it ...

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!