Scaling of vectorized code which is limited by memory access

Dear all,
what possibilities are there to make a vectorized code faster, where the speed is limited by memory access?
I have attached two runs on two different machines with the same vectorized MATLAB code. Here var_z_analyse is a huge array of fixed size, which will be accessed/querried by a long list of indices which may vary in size and content for each call. May parfor be a solution in this context?
For example the first line does not benefit from 4 to 18 cores, whereas the last two lines scale very well.
  • first machine has 4 cores on one node
4.png
  • seconde machine has 18 cores on one node
18.png
Regards

11 Comments

What are the sizes of all the variables? One option is to not form A directly, but pass the indexes into a mex routine and then do the u_z calculations there, avoiding the memory copy.
sum() of a .* is dot(), and when you have several of them adjacent there is often a coding involving * (and perhaps a transpose)
Not really, if you use "*" you compute many cross terms that are not required.
SUM if the best option. One can reshape to call one SUM instead of two but I doubt that will help a lot.
Another options is use James's MTIMESX (such function should be part of MATLAB distribution).
But the bottleneck is clearly the first statement that rearrange a huge 3 x 3 x N array.
Some background:
The code is part of an quadtree look up table approach in two (three) dimensions, where i have to find the right quad containing the data in a monomial representation. The query is not based on a tree traverse, but rather based on a hash table approach.
  • the ids of the quads are found e.g. in line 1
  • to evaluate the data from the modal representation (here monomials), i have to build-up different vandermonde matrices with horner sheme, e.g. line 2-7
  • i calculate the two dimentional interpolant via the outerproduct (which is nothing else than Newton interpolation, but highly efficient), e.g. line 8-9
The size of the arrays are, e.g.,
size(a_I)=[1,M] ,
size(A)=[3,3,M] ,
size(var_z_analyse)=[3,3,N] ,
size(Vandermondematrix_x)=[3,1,M] ,
size(Vandermondematrix_y)=[1,3,M] ,
size(outer_product)=[3,3,M] ,
size(u_z)=[1,M] ,
where N is fixed huge, e.g. 6000000, and M is arbitrary huge, e.g. 1000000, during runtime. Also note that 3 is the polynomial degree+1.
Does a mex-file implementation also suffer of M beeing arbitrary and what about the scaling with mex files?
Thank you, i know that these operations are similary, but i dont think it will help me with this issue.
As you already said, the bottleneck is not the outer product operation. The main problem is the data query.
Thank you for the discussion!
Regards
I would try mex coding if I was you, at least it save memory copying, but it make memory access not linear.
Also as I undestand there is about 6 query points falling in the same LUT (ratio N/M) not sure if its worth to group them together by some sorting in a_I, but I still mention it.
I have implemented some stuff, yet without success!
Any suggestions?
tic
for i=1:1000
A_mex=reshape(hash(reshape(var_z_analyse,[9,size(var_z_analyse,3)]),int64(a_I(:))),[3,3,size(a_I(:),1)]);
end
toc
4 cores: Elapsed time is 3.255311 seconds.
16 cores: Elapsed time is 3.634145 seconds.
tic
for i=1:1000
A=var_z_analyse(:,:,a_I(:));
end
toc
4 cores: Elapsed time is 3.054769 seconds.
16 cores: Elapsed time is 3.395311 seconds.
The size of the arrays:
size(a_I)=[1,M] ,
size(A)=[3,3,M] ,
size(var_z_analyse)=[3,3,N] ,
with N=19012 and M=130000.
I have compiled the mex file hash.f90 with gfortran
mex CFLAGS='$CFLAGS -fopenmp' LDFLAGS='$LDFLAGS -fopenmp' COPTIMFLAGS='$COPTIMFLAGS -fopenmp -O2' LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp -O2' DEFINES='$DEFINES -fopenmp' -v hash.F90
on linux ubuntu machine 18.04 with openmp.
#include "fintrf.h"
!==================================================================================================================================
!> Wrapper between MATLAB and Fortan90
!==================================================================================================================================
SUBROUTINE mexFunction(nlhs,plhs,nrhs,prhs)
! MODULES
! IMPLICIT VARIABLE HANDLING
IMPLICIT NONE
!----------------------------------------------------------------------------------------------------------------------------------
! INPUT / OUTPUT VARIABLES
!----------------------------------------------------------------------------------------------------------------------------------
! LOCAL VARIABLES
mwPointer :: mxCreateDoubleMatrix,mxGetPr
mwPointer :: plhs(*),prhs(*)
mwPointer :: mxGetM,mxGetN
mwPointer :: x_pr1,y_pr1,x_pr2
mwPointer :: m1,n1,m2,n2
mwSize :: size1,size2,size3
INTEGER :: nlhs,nrhs,mxIsNumeric
INTEGER,PARAMETER :: numel = 2000000
DOUBLE PRECISION :: x1(numel),y1(numel)
INTEGER :: x2(numel)
!==================================================================================================================================
! Check for proper number of arguments.
IF(nrhs.NE.2)THEN
CALL mexErrMsgIdAndTxt('MATLAB:hash:nInput','Two input required.')
ELSEIF(nlhs.NE.1)THEN
CALL mexErrMsgIdAndTxt('MATLAB:hash:nOutput','One output required.')
END IF
! Get the size of the input arrays.
m1 = mxGetM(prhs(1))
n1 = mxGetN(prhs(1))
size1 = m1*n1
m2 = mxGetM(prhs(2))
n2 = mxGetN(prhs(2))
size2 = m2*n2
! output size
size3 = m1*n2
! Column * row must be smaller than numel
IF(size1.GT.numel)THEN
CALL mexErrMsgIdAndTxt('MATLAB:hash:mSize','Row * column must be <= 1000.')
END IF
! Column * row must be smaller than numel
IF(size2.GT.numel)THEN
CALL mexErrMsgIdAndTxt('MATLAB:hash:mSize','Row * column must be <= 1000.')
END IF
! Check that the array is numeric (not strings).
IF(mxIsNumeric(prhs(1)).EQ.0)THEN
CALL mexErrMsgIdAndTxt('MATLAB:hash:NonNumeric','Input must be a numeric array.')
END IF
! Check that the array is numeric (not strings).
IF(mxIsNumeric(prhs(2)).EQ.0)THEN
CALL mexErrMsgIdAndTxt('MATLAB:hash:NonNumeric','Input must be a numeric array.')
END IF
! Create matrix for the return argument.
plhs(1) = mxCreateDoubleMatrix(m1,n2,0)
x_pr1 = mxGetPr(prhs(1))
x_pr2 = mxGetPr(prhs(2))
y_pr1 = mxGetPr(plhs(1))
CALL mxCopyPtrToReal8(x_pr1,x1,size1)
CALL mxCopyPtrToInteger8(x_pr2,x2,size2)
! CALL the computational subroutine.
CALL hash(y1,x1,x2,m1,n1,m2,n2)
! Load the data into y_pr1, which is the output to MATLAB.
CALL mxCopyReal8ToPtr(y1,y_pr1,size3)
END SUBROUTINE
!==================================================================================================================================
!> Computional Subroutine
!==================================================================================================================================
SUBROUTINE hash(y1,x1,x2,m1,n1,m2,n2)
! MODULES
USE OMP_LIB
! IMPLICIT VARIABLE HANDLING
IMPLICIT NONE
!----------------------------------------------------------------------------------------------------------------------------------
! INPUT / OUTPUT VARIABLES
!----------------------------------------------------------------------------------------------------------------------------------
! LOCAL VARIABLES
mwSize :: m1,n1,m2,n2,i
DOUBLE PRECISION :: x1(m1,n1),y1(m1,n2)
INTEGER :: x2(m2,n2)
!==================================================================================================================================
!$OMP PARALLEL
!$OMP DO
DO i=1,n2
y1(:,i)= x1(:,x2(1,i))
END DO
!$OMP END DO
!$OMP END PARALLEL
END SUBROUTINE
I was thinking to drop completely those memory copying in preparation but replacing the last instruction
sum(sum(outer_product.*A,1),2)...
by the pseudo C-code along this line:
for (k=0; k++; k<N)
{
Ak = &var_z_analyse + 9*(a_I[k]-1);
s = 0;
for (j=0; j++; j<9) s = s + outer_product[k*9+j] * Ak[j];
uz[k] = s;
}
You could also do with fortran mex.
Thank you Bruno,
i have moved the whole stuff to the mex file. Now there is no big memory copying except filling the last array.
Here are some first results with the mex/openmp implementation with one and with four cores:
1 core mex file:
tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 9.974623 seconds.
4 cores mex file:
tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 12.943373 seconds.
vectorized matlab:
tic
for i=1:1000
... matlab code
end
toc
Elapsed time is 13.259077 seconds.
The mex file looks like:
!$OMP PARALLEL DO NUM_THREADS(4)
DO i=1,M
... local operation without memory copying
uz(1,i) = ...
END DO
!$OMP END PARALLEL DO
When is add additional dummy work to the loop
!$OMP PARALLEL DO NUM_THREADS(4)
DO i=1,M
... local operation without memory copying
uz(1,i) = ...
DO j=1,1000
... dummy work
END DO
END DO
!$OMP END PARALLEL DO
the results look like
1 core mex file + dummy work:
tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 51.427996 seconds.
4 cores mex file + dummy work:
tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 18.105696 seconds.
Good thing is now:
  • memory copying is gone
  • openmp + mex + fortran works
Bad thing is now:
  • the work load in my openmp loop is too small
  • the overhead of openmp (open, close, sync, ...) is the crucial part which limits the speed/scaling
Note that the sizes of the arrays were again N=19012 and M=130000, which is the average order of magnitude excepted.
This is a little bit frustrating! :)
More suggestions?
I would say you should optimize the code that carry out the computation, but not sure if you would win much.
I again express my though in C "pseudo" code (warning: I haven't tested at all)
double eta_k_pow[3], eta_k, xii, xi_k, s, ss;
double *Ak;
eta_k_pow[0] = 1;
for (k=0; k++; k<N)
{
Ak = &var_z_analyse + 9*(a_I[k]-1);
eta_k = u_eta[k];
eta_k_pow[1] = eta_k;
eta_k_pow[2] = eta_k*eta_k;
xii = 1;
xi_k = u_xi[k];
s = 0;
for (i=0; i<3; i++)
{
ss = 0;
for (j=0; j<3; j++)
{
ss = ss + eta_k[j] * *Ak;
Ak++;
}
s = s + xii * ss;
xii = xii * xi_k;
}
uz[k] = s;
}
You might even do not write i/j for loops, but write down the explicitly folding formula with the sum of 9 terms.
Thank you Bruno,
i already have optimized the fortran code for single performance. There's nothing more to get.
Regards
Back to the roots. The question is really simple.
We have a fixed list of data B and a list of ids C which may vary in size and content for each call.
A=B(:,:,C(:));
How to handle such stuff in Matlab/C/Fortran with or without openmp in context of speed/scaling for a medium size problem?

Sign in to comment.

 Accepted Answer

Here is my C-mex implementation, on my laptop it accelerates by 20 fold from your original MATLAB code (compiled with MSVS 2017, 1 thread)
N=19012;
M=130000;
u_xi = rand(1,M);
u_eta = rand(1,M);
a_I = ceil(rand(1,M)*N);
var_z_analyse = rand(3,3,N);
tic
for i=1:1000
V_x = zeros(3,1,M);
V_y = zeros(1,3,M);
A = var_z_analyse(:,:,a_I);
V_x(2,1,:) = u_xi(:);
V_x(1,1,:) = 1;
V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
V_y(1,2,:) = u_eta(:);
V_y(1,1,:) = 1;
V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
outer_product = bsxfun(@times,V_x,V_y);
u_z = sum(sum(outer_product.*A,1),2);
end
toc % 21.871430 seconds.
tic
for i=1:1000
u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
end
toc % 1.057965 seconds.
norm(u_z_mex(:) - u_z(:)) / norm(u_z(:)) % 7.5515e-17
The hash.c C-mex code (quick and dirty without checking correctness of the input)
/*************************************************
* CALLING:
* >> u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
* COMPILING:
* >> mex -O hash.c
************************************************/
#include "mex.h"
#include "matrix.h"
#define XI prhs[0]
#define ETA prhs[1]
#define VAR_Z_ANALYSE prhs[2]
#define A_I prhs[3]
#define UZ plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
int i,j,k,M;
double xi_k_pow[3], eta_k, xi_k, etap, s, ss;
double *Ak, *u_eta, *u_xi, *a_I, *uz, *var_z_analyse;
var_z_analyse = (double*)mxGetData(VAR_Z_ANALYSE);
u_eta = (double*)mxGetData(ETA);
u_xi = (double*)mxGetData(XI);
a_I = (double*)mxGetData(A_I);
M = (int)mxGetNumberOfElements(A_I);
UZ = mxCreateDoubleMatrix(1,M,mxREAL);
uz = (double*)mxGetData(UZ);
xi_k_pow[0] = 1;
for (k=M; k--;)
{
Ak = var_z_analyse + 9*((int)(*a_I++)-1);
eta_k = *u_eta++;
xi_k = *u_xi++;
xi_k_pow[1] = xi_k;
xi_k_pow[2] = xi_k*xi_k;
etap = 1.0;
s = 0.0;
for (i=3; i--;)
{
ss = 0.0;
for (j=0; j<3; j++) ss += xi_k_pow[j] * *(Ak++);
s += etap * ss;
etap *= eta_k;
}
*uz++ = s;
}
}

6 Comments

Thank you Bruno,
i will check your implementation!
Regards
The C implemetation seems to be quite faster.
I removed the zeros statement, because you dont need it in the matlab implementation and it makes the code 30% slower.
But anyway here are the results:
N=19012;
M=130000;
u_xi = rand(1,M);
u_eta = rand(1,M);
a_I = ceil(rand(1,M)*N);
var_z_analyse = rand(3,3,N);
tic
for i=1:1000
A = var_z_analyse(:,:,a_I);
V_x(2,1,:) = u_xi(:);
V_x(1,1,:) = 1;
V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
V_y(1,2,:) = u_eta(:);
V_y(1,1,:) = 1;
V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
outer_product = bsxfun(@times,V_x,V_y);
u_z = sum(sum(outer_product.*A,1),2);
end
toc % 14.624736 seconds.
tic
for i=1:1000
u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
end
toc % 1.806414 seconds.
I have a factor of ~8.
I think i have to check the fortran implementation. This is something to build up on.
Thank you Bruno!
"I removed the zeros statement, because you dont need it in the matlab implementation and it makes the code 30% slower. "
Can't agree. Of course for the for-loop, 2nd iteration the V_x and V_y is already allocated you don't see the befefit of zeros(), but if you clear the variable V_x, V_y after using it it will become slower, since the arrays are growed when assign the third row/column .
N=19012;
M=130000;
u_xi = rand(1,M);
u_eta = rand(1,M);
a_I = ceil(rand(1,M)*N);
var_z_analyse = rand(3,3,N);
tic
for i=1:1000
V_x = zeros(3,1,M);
V_y = zeros(1,3,M);
A = var_z_analyse(:,:,a_I);
V_x(2,1,:) = u_xi(:);
V_x(1,1,:) = 1;
V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
V_y(1,2,:) = u_eta(:);
V_y(1,1,:) = 1;
V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
outer_product = bsxfun(@times,V_x,V_y);
u_z = sum(sum(outer_product.*A,1),2);
end
toc % 21.665071 seconds.
tic
for i=1:1000
A = var_z_analyse(:,:,a_I);
V_x(2,1,:) = u_xi(:);
V_x(1,1,:) = 1;
V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
V_y(1,2,:) = u_eta(:);
V_y(1,1,:) = 1;
V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
outer_product = bsxfun(@times,V_x,V_y);
u_z = sum(sum(outer_product.*A,1),2);
clear V_x V_y % CLEAR takes normally very little time
end
toc % 26.225464 seconds.
Horner's evaluation variant, according to my test it's even faster (about 10%)
/*************************************************
* CALLING:
* >> u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
* COMPILING:
* >> mex -O hash.c
************************************************/
#include "mex.h"
#include "matrix.h"
#define XI prhs[0]
#define ETA prhs[1]
#define VAR_Z_ANALYSE prhs[2]
#define A_I prhs[3]
#define UZ plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
int i,j,k,M;
double eta_k, xi_k, s, ss;
double *Ak, *u_eta, *u_xi, *a_I, *uz, *var_z_analyse;
var_z_analyse = (double*)mxGetData(VAR_Z_ANALYSE);
u_eta = (double*)mxGetData(ETA);
u_xi = (double*)mxGetData(XI);
a_I = (double*)mxGetData(A_I);
M = (int)mxGetNumberOfElements(A_I);
UZ = mxCreateDoubleMatrix(1,M,mxREAL);
uz = (double*)mxGetData(UZ);
for (k=M; k--;)
{
Ak = var_z_analyse + 9*(int)(*a_I++);
eta_k = *u_eta++;
xi_k = *u_xi++;
s = 0.0;
for (i=3; i--;)
{
ss = *(--Ak);
for (j=2; j--;) ss = ss*xi_k + *(--Ak);
s = s*eta_k + ss;
}
*uz++ = s;
}
}
Folding for-loop variant code, 10% faster still
/*************************************************
* CALLING:
* >> u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
* COMPILING:
* >> mex -O hash.c
************************************************/
#include "mex.h"
#include "matrix.h"
#define XI prhs[0]
#define ETA prhs[1]
#define VAR_Z_ANALYSE prhs[2]
#define A_I prhs[3]
#define UZ plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
int k, M;
double eta_k, xi_k, s, ss;
double *Ak, *u_eta, *u_xi, *a_I, *uz, *var_z_analyse;
var_z_analyse = (double*)mxGetData(VAR_Z_ANALYSE);
u_eta = (double*)mxGetData(ETA);
u_xi = (double*)mxGetData(XI);
a_I = (double*)mxGetData(A_I);
M = (int)mxGetNumberOfElements(A_I);
UZ = mxCreateDoubleMatrix(1,M,mxREAL);
uz = (double*)mxGetData(UZ);
for (k=M; k--;)
{
Ak = var_z_analyse + 9*(int)(*a_I++);
eta_k = *u_eta++;
xi_k = *u_xi++;
s = *(--Ak);
s = s*xi_k + *(--Ak);
s = s*xi_k + *(--Ak);
ss = *(--Ak);
ss = ss*xi_k + *(--Ak);
ss = ss*xi_k + *(--Ak);
s = s*eta_k + ss;
ss = *(--Ak);
ss = ss*xi_k + *(--Ak);
ss = ss*xi_k + *(--Ak);
*uz++ = s*eta_k + ss;
}
}

Sign in to comment.

More Answers (0)

Categories

Find more on App Building in Help Center and File Exchange

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!