Scaling of vectorized code which is limited by memory access
Show older comments
Dear all,
what possibilities are there to make a vectorized code faster, where the speed is limited by memory access?
I have attached two runs on two different machines with the same vectorized MATLAB code. Here var_z_analyse is a huge array of fixed size, which will be accessed/querried by a long list of indices which may vary in size and content for each call. May parfor be a solution in this context?
For example the first line does not benefit from 4 to 18 cores, whereas the last two lines scale very well.
- first machine has 4 cores on one node

- seconde machine has 18 cores on one node

Regards
11 Comments
James Tursa
on 10 Aug 2019
What are the sizes of all the variables? One option is to not form A directly, but pass the indexes into a mex routine and then do the u_z calculations there, avoiding the memory copy.
Walter Roberson
on 10 Aug 2019
sum() of a .* is dot(), and when you have several of them adjacent there is often a coding involving * (and perhaps a transpose)
Bruno Luong
on 10 Aug 2019
Edited: Bruno Luong
on 10 Aug 2019
Not really, if you use "*" you compute many cross terms that are not required.
SUM if the best option. One can reshape to call one SUM instead of two but I doubt that will help a lot.
Another options is use James's MTIMESX (such function should be part of MATLAB distribution).
But the bottleneck is clearly the first statement that rearrange a huge 3 x 3 x N array.
ConvexHull
on 10 Aug 2019
Edited: ConvexHull
on 10 Aug 2019
Bruno Luong
on 10 Aug 2019
I would try mex coding if I was you, at least it save memory copying, but it make memory access not linear.
Also as I undestand there is about 6 query points falling in the same LUT (ratio N/M) not sure if its worth to group them together by some sorting in a_I, but I still mention it.
ConvexHull
on 10 Aug 2019
Edited: ConvexHull
on 10 Aug 2019
Bruno Luong
on 10 Aug 2019
Edited: Bruno Luong
on 11 Aug 2019
I was thinking to drop completely those memory copying in preparation but replacing the last instruction
sum(sum(outer_product.*A,1),2)...
by the pseudo C-code along this line:
for (k=0; k++; k<N)
{
Ak = &var_z_analyse + 9*(a_I[k]-1);
s = 0;
for (j=0; j++; j<9) s = s + outer_product[k*9+j] * Ak[j];
uz[k] = s;
}
You could also do with fortran mex.
ConvexHull
on 11 Aug 2019
Edited: ConvexHull
on 11 Aug 2019
Bruno Luong
on 11 Aug 2019
Edited: Bruno Luong
on 11 Aug 2019
I would say you should optimize the code that carry out the computation, but not sure if you would win much.
I again express my though in C "pseudo" code (warning: I haven't tested at all)
double eta_k_pow[3], eta_k, xii, xi_k, s, ss;
double *Ak;
eta_k_pow[0] = 1;
for (k=0; k++; k<N)
{
Ak = &var_z_analyse + 9*(a_I[k]-1);
eta_k = u_eta[k];
eta_k_pow[1] = eta_k;
eta_k_pow[2] = eta_k*eta_k;
xii = 1;
xi_k = u_xi[k];
s = 0;
for (i=0; i<3; i++)
{
ss = 0;
for (j=0; j<3; j++)
{
ss = ss + eta_k[j] * *Ak;
Ak++;
}
s = s + xii * ss;
xii = xii * xi_k;
}
uz[k] = s;
}
You might even do not write i/j for loops, but write down the explicitly folding formula with the sum of 9 terms.
ConvexHull
on 11 Aug 2019
ConvexHull
on 11 Aug 2019
Edited: ConvexHull
on 11 Aug 2019
Accepted Answer
More Answers (0)
Categories
Find more on App Building in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!