Vectorize and use parfor in a nested loop with five levels

Asked by Fernando on 28 Jun 2013 at 15:15
Latest activity Commented on by Fernando about 13 hours ago

Hi all,

I'm working on an application that, for now, requires computing a nested loop with five levels. I'm trying to vectorize some parts and to use a parfor for one of the loops. However, I haven't been very successful. Any suggestions would be great. A simplified version of the code looks as follows,

clear all
clc
p1=(1:1:50)';
p2=p1;
np=max(size(p1));
[P1,P2]=meshgrid(p1,p2);
DP=unique(P1-P2);
K=max(size(DP));
ns=1000;
ne=10;
S1=randn(ns,ne);
S2=randn(ns,ne);
S3=randn(ns,ne);
S4=randn(ns,ne);
L1=rand(ns,ne);
L2=rand(ns,ne);
L3=rand(ns,ne);
L4=rand(ns,ne);
a=2;
BR=zeros(np,K,K,K,K);
P=BR;
for i=1:K,
    DP1=DP(i);
    C1=mean(a*S1*DP1>=0,2);
    e1=bsxfun(@times,L1,(1-2*C1));
    for j=1:K,
        DP2=DP(j);
        C2=mean(a*S2*DP2>=0,2);
        e2=bsxfun(@times,L2,(1-2*C2));
        for l=1:K,
            DP3=DP(l);
            C3=mean(a*S3*DP3>=0,2);
            e3=bsxfun(@times,L3,(1-2*C3));
            for m=1:K,
                DP4=DP(m);
                C4=mean(a*S4*DP4>=0,2);
                e4=bsxfun(@times,L4,(1-2*C4));
                  for pp=1:np,
                      pr=p2(pp);
                      dp1=p1-pr;
                      A=reshape(-bsxfun(@times,a.*C4,dp1'),ns,1,np);
                      N=bsxfun(@minus,A,e4);
                      f=exp(N);
                      PR=f./(1+f);
                      rev=zeros(ns,ne,np);
                      for zz=1:np,
                          rev(:,:,zz)=PR(:,:,zz)*p1(zz);
                      end
                      er=reshape(sum(mean(rev,2)),np,1);
                      [mpr,ind]=max(er);
                      BR(pp,m,l,j,i)=p1(ind);
                      P(pp,m,l,j,i)=mpr;
                  end
              end
          end
      end
  end

In this version the problems start when evaluating BR and P in the inner loop. Even though all pp, m, l, j and i are defined, it takes a lot of memory and time to evaluate BR(pp,m,l,j,i)=p1(ind), and I really don't know why. Then, after solving that, I need to improve on vectorizing what might be possible to vectorize, and choose a loop to use the parfor.

Thanks,

0 Comments

Fernando

Products

No products are associated with this question.

1 Answer

Answer by Jan Simon on 28 Jun 2013 at 17:21
Edited by Jan Simon on 30 Jun 2013 at 22:40

Some small improvements:

% Original code:
rev=zeros(ns,ne,np);
for zz=1:np,
  rev(:,:,zz)=PR(:,:,zz)*p1(zz);
end
% Vectorized (is this faster?):
rev = bsxfun(@times, PR, reshape(p1, 1, 1, np));

Calculating sum(mean(rev,2)) contains 2 summations. Perhaps one sum is faster:

er = sum(reshape(rev, np, []), 2) / ne;

Avoir some multiplications by -1:

% Original:
dp1 = p1-pr;
A   = reshape(-bsxfun(@times,a.*C4,dp1'),ns,1,np);
% New
dp1 = pr - p1;
A   = reshape(bsxfun(@times, a.*C4, dp1'), ns, 1, np);

In these lines:

C1 = mean(a*S1*DP1>=0,2);

some simplifications are possible: a * S1 * DP1 >= 0 does not depend on the positivbe constant a. DP1 is a scalar and only its sign is interesting. So the |meanS1 = mean(S1>=0, 2)| can be calculated once and in case of a negative sign of DP1 1 - meanS1 can be calculated fast. DP1==0 must be caught also.

I'd start a parallelization after these mathematical simplifications have been performed. Avoiding calculations is simply more efficient than trying to do them faster.

 

[EDITED] Now I've found the time to run your code on my computer. Beside the fact that 4GB are not enough to run your code (how much RAM do you have?), a simple test revealed the bottleneck: Use the profiler to find out, which lines consumes the most time. You will see, that all my ideas are meaningless, because they improve parts of your code, which require only one percent of the total time. Only vectorizing the creation of rev matters significantly. But the rest is dominated by the very expensive calculation of:

f = exp(N);

So this is the point where improvements are required at first. N is created by BSXFUN and you can exploit, that exp(a - b) equals exp(a) / exp(b), and that exp(a * b) equals exp(a) ^ b. Then the innermost loop becomes:

   exp_e4_inv = exp(e4);
   exp_a_C4   = exp(a * C4);
   for pp = 1:np
      dp1 = p2(pp) - p1;
      Ax  = reshape(bsxfun(@power,exp_a_C4, dp1'), ns, 1, np);
      f   = bsxfun(@times, Ax, exp_e4_inv);
      PR  = f./(1+f);
      rev = bsxfun(@times, PR, reshape(p1, 1, 1, np));
      er  = reshape(sum(reshape(rev, [], 3)) / ne, np, 1);
      [mpr,ind]=max(er);  % No changes here...
      BR(pp,m,l,j,i)=p1(ind);
      P(pp,m,l,j,i)=mpr;
   end

This reduces the total number of elements the exponential function is applied to. For the much smaller problem defined by p1=(1:3)', I get 1.45 seconds runtime instead of 3.02 of the original version. The results differ by rounding errors inside the expected limits from the original. So some basic mathematical simplifications gained the double speed.

Well, double speed is not impressing by such a large problem. But I hesitate to start to parallelize code without considering this. Now this is a good point to come back to the original question: How do we PARFOR this?

5 Comments

Fernando on 29 Jun 2013 at 17:13

Thanks Muthu, I'll work on your suggestion to see if I can improve on performance.

Jan Simon on 30 Jun 2013 at 21:58

See [EDITED]

Fernando about 13 hours ago

Thanks Jan for the edition. I had realized that the costly part was computing the exponential but I had no idea how to improve on that. Once again, thanks! And to answer your question about RAM, I'm running this on a computer cluster so as long as I don't need more than 50gb, it should be OK.

Jan Simon

Contact us