MEX implementation of tensor-product multiplications

Dear all,
I am searching for a more efficient way to calculate tensor-product multiplications of arbitrary sizes. First of all, I have already searched the web to get inspired by different solutions, e.g. here Tensorproduct at stackexchange:
1.) External libraries, e.g. Eigen (C++).
2.) A pure MATLAB solution based on reshape and shiftdim.
3.) The MATLAB Tensor Toolbox.
However, so far I have ended up with a MEX-file implementation with OpenMP parallelization.
--
In the following, I will give a short introduction to the math behind the implementation:
Suppose you have a matrix-vector multiplication where the matrix C is constructed as the Kronecker product of a matrix A of size (n x m) and a matrix B of size (p x q). More information can be found here Wikipedia.
kron(A,B)*vec(X) = vec(B*X*A.'), where X has size (q x m).
In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations.
--
So far so good. In the following example, the sizes of the arrays are
matrix_xi = 4x4,
matrix_eta = 4x4,
vector = 16x500000x5,
(new version: vector = 16x2500000)
and may be initialized with ones or zeros. Moreover, rather than calculating only one Kronecker matrix-vector multiplication, I want to calculate e.g. 500000x5 operations with one call (see the size of vector).
--
My current MEX C-file implementation:
/*************************************************
 * CALLING:
 *
 * out = tensorproduct(A,B,vector)
 *
 *************************************************/
#include "mex.h"
#include <omp.h>

#define PP_A      prhs[0]
#define PP_B      prhs[1]
#define PP_vector prhs[2]
#define PP_out    plhs[0]

/* compile-time sizes: A is (n x m), B is (p x q), each page X of vector is (m x q) */
#define n 4
#define m 4
#define p 4
#define q 4

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, j, k, s, l;
    double temp[n*q];
    double *A      = mxGetPr(PP_A);
    double *B      = mxGetPr(PP_B);
    double *vector = mxGetPr(PP_vector);

    /* number of pages; works for 16x2500000 as well as 16x500000x5 */
    l = (int)(mxGetNumberOfElements(PP_vector)/(m*q));

    PP_out = mxCreateDoubleMatrix(l*n*p, 1, mxREAL);  /* zero-initialized */
    double *out = mxGetPr(PP_out);

    #pragma omp parallel for private(i,j,k,s,temp)
    for (k = 0; k < l; k++) {
        /* clear the (n x q) scratch block */
        for (i = 0; i < n; i++)
            for (j = 0; j < q; j++)
                temp[i + j*n] = 0;

        /* temp = A*X, where X is page k of vector, size (m x q) */
        for (s = 0; s < q; s++)
            for (i = 0; i < n; i++)
                for (j = 0; j < m; j++)
                    temp[i + s*n] += A[i + j*n]*vector[j + s*m + k*m*q];

        /* out(:,k) = vec(temp*B.'), size (n x p) */
        for (s = 0; s < n; s++)
            for (i = 0; i < p; i++)
                for (j = 0; j < q; j++)
                    out[i*n + s + k*n*p] += B[i + j*p]*temp[j*n + s];
    }
}
--
The code can be compiled with:
mex FFLAGS='$FFLAGS -fopenmp -Ofast -funroll-loops' ...
CFLAGS='$CFLAGS -fopenmp -Ofast -funroll-loops' ...
LDFLAGS='$LDFLAGS -fopenmp -Ofast -funroll-loops' ...
COPTIMFLAGS='$COPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
DEFINES='$DEFINES -fopenmp -Ofast -funroll-loops' ...
tensorproduct.c
--
Currently on my notebook (Ubuntu 18.04, GCC 7.5.0, 4 cores):
matrix_xi = ones(4,4);
matrix_eta = ones(4,4);
vector = ones(16,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 8.036490 seconds.
--
A quite simple MATLAB implementation without the use of the Kronecker property gives:
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 10.489606 seconds.
--
With sizes n, m, p, q = 7 the difference surprisingly does not grow and relatively even gets worse (here you have to change the constants in the C file and recompile):
matrix_xi = ones(7,7);
matrix_eta = ones(7,7);
vector = ones(49,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 31.436541 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 35.829957 seconds.
Luckily I am still faster than MATLAB.
--
Note that n, m, p, q are only compile-time constants. To make the code usable for arbitrary sizes, I have written a macro to generate many C files with different const n, m, p, q, which are included later with a header and a main C file. This helps the compiler unroll and inline the code.
This is the most efficient approach I have achieved so far. I hope there are some experts here in this forum who can help optimize these few lines. I am also open to other solutions, e.g. MATLAB-based or based on external libraries.
Thank you!

4 Comments

"In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations. "
Which operation? Please define clearly which variable you want to compute and what the input variables are.
So far so good. In following example the sizes of the arrays are
matrix_xi = 4x4,
matrix_eta = 4x4,
vector = 16x500000x5,
What is the relation to A, B, C, X on Wikipedia? They are all matrices, and none is a 3D array. Please use consistent notation to describe the desired calculation.
I refer to the matrix-vector multiplication where the matrix C is built up by kron. Commonly this is called the "vec trick".
C=kron(A,B);
With N ≡ n=m=p=q you can perform a matrix vector product with O(2*N*N*N) instead of O(N*N*N*N).
Here:
C = matrix
B = matrix_xi.';
A = matrix_eta;
vec(X) = vector (500000x5 times).
You can think of it in a simple way by applying matrix-vector multiplications line-wise in each direction. Here I have visualized X in matrix representation in order to apply the tensor-product multiplication line-wise. Without the Kronecker property, X would be handled as a vector.
  • A horizontal arrow represents a matrix vector multiplication with e.g. "matrix_xi" on a small set of X.
  • A vertical arrow represents a matrix vector multiplication with e.g. "matrix_eta" on a small set of X.
  • The naive implementation would be to unfold X in a vector representation and apply "matrix" on whole X.
Are you familiar with tensor products and their properties?
What exactly is unclear? I thought I had defined everything sufficiently.
Regards
Yes, I do know tensors and kron. But your notation is confusing. Actually, your second statement
A = matrix_xi
B = matrix_eta
is wrong. Actually it's
B = matrix_xi.';
A = matrix_eta;
And your vector consists of a set of multiple X matrices.
Each reshape(vector(:,p,q), size(A,2), size(B,1)) is an X.
Your description is very confusing.
I know what you want to compute. For the general reader, however, the description is careless and confusing.
Sorry, I was a little bit inaccurate there. I hope you got it.
Summarizing, I simply want to use the Kronecker property to apply many (500000*5) efficient matrix-vector multiplications in O(npq+qnm) with one call.
Thank you for your help.
Edit: My main concern is a more efficient C implementation.


 Accepted Answer

I propose two other pure-MATLAB methods.
The first simply replaces (:,:) with RESHAPE: the (:,:) indexing makes a data copy, RESHAPE does not. It seems to be the fastest.
The second method uses MATLAB's pagemtimes, introduced in R2020b. (Edit: Jan corrected the release.)
function benchtensor
matrix_xi = rand(7,7);
matrix_eta = rand(7,7);
vector = rand(49,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc % Elapsed time is 19.429296 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 28.08963 seconds.
tic
for i=1:n
vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.30025 seconds.
B = matrix_xi.';
A = matrix_eta;
tic
for i=1:n
X = reshape(vector, size(A,2), size(B,1), []);
CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
end
toc % Elapsed time is 29.296709 seconds.
end

17 Comments

Thank you Bruno! I would like to understand the performance gain of your example:
tic
for i=1:n
vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.45178 seconds.
On my computer it is nearly 3 times faster than:
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 38.523173 seconds.
What's the reason?
I already explained: the (:,:) indexing makes a data copy, RESHAPE does not.
OK, I got it. Does this also work for n~=m~=p~=q? The output would have a different dimension.
Yes it does!
This will help me. Thank you for your help.
Edit: I have accepted your answer. However, perhaps there is also a faster solution by using the Kronecker property.
The Kronecker trick simplifies the two left/right products to a single matrix-vector product for each X.
If you look closely at the flops, they require respectively 4*N^3 vs 2*N^3 flops. The Kronecker form reduces 3 nested loops to 2. The memory access of the Kronecker form might be more consecutive, so the cache will be used more efficiently.
It's all win/win for the Kronecker method. You might adapt your C code to preprocess with the Kronecker product rather than doing the left/right multiplication.
What exactly do you mean? The C code already uses the line-wise multiplication property. Can it be improved in some way?
Your C code carries out
TMP=(X*B) % on the loop
OUT=A*TMP
and not
C=kron(B.',A); % once
OUT=C*vec(X); % on the loop
Yes, this is correct. From my knowledge, the first one is the cheap one (in terms of flops).
Otherwise you would have to perform a 49 x 49 matrix-vector multiplication. My goal is to avoid building and using the matrix C.
No, the Kronecker method is twice as cheap, as I indicated with the flop counts of both methods in my earlier post. You multiply the matrix with a vector. Count again.
Sorry, I stand corrected: the Kronecker method requires 2*N^4 flops.
I corrected an error in my previous post. It should be O(2*N*N*N), aka O(npq+qnm), with
TMP=(X*B) % on the loop
OUT=A*TMP
instead of O(N*N*N*N) aka O(n*m*p*q) with
C=kron(B.',A); % once
OUT=C*vec(X); % on the loop
Edit: Hopefully it is correct now.
Make it O(N*N*N+N*N*N) for
TMP=(X*B) % on the loop
OUT=A*TMP
A matrix product of an (m x n) matrix with an (n x p) matrix requires more or less 2*m*n*p flops.
You mentioned "careless and confusing" ;-). Now I am sure I have corrected the most incorrect parts, I think, also in my previous answers.
Some suggestions you could try (depending on whether GCC sees the optimization or not):
  • Merging k/l loops.
  • Precompute (k*vd+l*vd*vb) in the last nested loop computing OUT.
  • Use memset to initialize temp
@Bruno Luong: pagemtimes was introduced in R2020b, not 2018b.
Note that I merged the k/l loops in the original post as suggested. For the C code you have to use vector = size(49x2500000) instead of vector = size(49x500000x5).
The other two hints did not give me any speed-up.




Asked: on 6 Dec 2020
Edited: on 23 Aug 2021
