Thread Subject:
Poor PARFOR performance

Subject: Poor PARFOR performance

From: Matt J

Date: 12 Nov, 2010 20:40:05

Message: 1 of 9

I just bought the Parallel Computing Toolbox and am running it on this 64-bit machine:

Dell Precision T7500
Intel(R) Xeon(R) CPU X5680 @ 3.33 GHz, dual 6-core
24 GB RAM

My initial experiments with PARFOR are showing very poor speed performance, worse even than a plain old for-loop. For example,


N=500; M=3000;
Z=rand(M);

V=zeros(M);
matlabpool open 8;
tic;
parfor ii=1:N
    V=V+Z;
end
toc;
matlabpool close
%Elapsed time is 12.297742 seconds.



Compare this to,



V=zeros(M);
tic;
for ii=1:N
    V=V+Z;
end
toc;
%Elapsed time is 10.850052 seconds.


Any idea what's going on?

Subject: Poor PARFOR performance

From: Doug Schwarz

Date: 12 Nov, 2010 20:52:23

Message: 2 of 9

On 11/12/2010 3:40 PM, Matt J wrote:

[snip]

> N=500; M=3000;
> Z=rand(M);
>
> V=zeros(M);
> matlabpool open 8;
> tic;
> parfor ii=1:N,
> V=V+Z;
> end;
> toc;
> matlabpool close
> %Elapsed time is 12.297742 seconds.
>
>
> Compare this to,
>
>
> V=zeros(M);
> tic;
> for ii=1:N,
> V=V+Z;
> end;
> toc;
> %Elapsed time is 10.850052 seconds.
>
>
> Any idea what's going on?

You've added the overhead of parfor, but your loop cannot be run in
parallel because each iteration depends on the previous one.

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.

Subject: Poor PARFOR performance

From: Sean

Date: 12 Nov, 2010 20:53:03

Message: 3 of 9

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <ibk8n5$9g3$1@fred.mathworks.com>...
> I just bought the Parallel Computing Toolbox and am running it on this 64-bit machine:
>
> Dell Precision T7500
> Intel(R) Xeon(R) CPU X5680 @ 3.33Ghz 3.33 GHz Dual 6 Core
> 24 GB RAM
>
> My initial experiments with PARFOR are showing very poor speed performance, worse even than a plain old for-loop. For example,
>
>
> N=500; M=3000;
> Z=rand(M);
>
> V=zeros(M);
> matlabpool open 8;
> tic;
> parfor ii=1:N,
> V=V+Z;
> end;
> toc;
> matlabpool close
> %Elapsed time is 12.297742 seconds.
>
>
>
> Compare this to,
>
>
>
> V=zeros(M);
> tic;
> for ii=1:N,
> V=V+Z;
> end;
> toc;
> %Elapsed time is 10.850052 seconds.
>
>
> Any idea what's going on?

I've never been able to get it to run faster when a lot of memory has to be passed to each worker. It works best when the ratio of computation time to data transferred is large.
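
To put rough numbers on the data volume in the original test, here is a small sketch. It assumes double precision (8 bytes per element) and that Z is copied in full to each of the 8 workers; the second point is an assumption about how the broadcast behaves, not something measured in this thread, and the variable names are made up for illustration.

```matlab
% Rough size estimate for the matrices in the original test.
M = 3000;                                % matrix dimension from the first post
nWorkers = 8;                            % pool size from the first post

bytesPerMatrix = M^2 * 8;                % double precision: 8 bytes per element
mbPerMatrix = bytesPerMatrix / 1e6;      % 72 MB for one 3000x3000 matrix
mbBroadcast = nWorkers * mbPerMatrix;    % 576 MB if every worker gets its own copy of Z

fprintf('One matrix: %g MB, copies for %d workers: %g MB\n', ...
        mbPerMatrix, nWorkers, mbBroadcast);
```

By the same arithmetic, a 50x50 double matrix is only 20 KB.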

If you remember, this thread between us:
http://www.mathworks.de/matlabcentral/newsreader/view_thread/269829
was a good example. The time it took to transfer the two 32^3 uint8 volumes far outweighed the computation time; once again, many thanks for your KronProd software. I don't remember the exact timings, but computing the same number of vectors with multiple labs running took ten or more times as long as it did with a standard for-loop.

Subject: Poor PARFOR performance

From: Matt J

Date: 12 Nov, 2010 21:08:04

Message: 4 of 9

Doug Schwarz <see@sig.for.address.edu> wrote in message <bOhDo.15766$Mk2.14729@newsfe13.iad>...
>
> You've added the overhead of parfor, but your loop cannot be run in
> parallel because each iteration depends on the previous one.
=====

Doug, the PCT manual says otherwise. According to the manual, the toolbox is smart enough to recognize V as a so-called "reduction variable", i.e., that the loop over
V=V+Z is equivalent to

V=zeros(M)+Z+Z+Z+Z+....

which can be decomposed into partial sums, computable in parallel on each worker.
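
The partial-sum idea can be sketched by hand with ordinary for-loops. This only illustrates the decomposition and is not a description of how the toolbox implements reductions internally; nChunks, edges, and partial are made-up names, each chunk stands in for one worker, and the sizes are kept small so the sketch runs quickly.

```matlab
% Manual decomposition of "V = V + Z, repeated N times" into partial sums.
N = 500; M = 50;
nChunks = 8;                                   % stand-in for the number of workers
Z = rand(M);

edges = round(linspace(0, N, nChunks + 1));    % chunk k covers edges(k)+1 .. edges(k+1)
partial = zeros(M, M, nChunks);

for k = 1:nChunks                              % chunks are independent of one another
    P = zeros(M);
    for ii = edges(k)+1 : edges(k+1)
        P = P + Z;                             % accumulate within the chunk
    end
    partial(:, :, k) = P;
end

V = sum(partial, 3);                           % combine the partial sums
```

Because floating-point addition is not associative, a result combined this way can differ from the serial sum in the last few bits.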

Subject: Poor PARFOR performance

From: Matt J

Date: 12 Nov, 2010 21:44:04

Message: 5 of 9

"Sean " <sean.dewolski@nospamplease.umit.maine.edu> wrote in message <ibk9ff$sfd$1@fred.mathworks.com>...
>
>
> I've never been able to have it be faster when there's much memory required to be passed to a worker. It's best for when the computation time to memory consumption is large.
==========

Sean - That does seem to have been part of the problem. So I changed my test parameters to

N=1e7; M=50;

which means that the arrays being broadcast to the labs are only 50x50 and the loop runs for N=1e7 iterations.

Now the times are 8 sec. parallel versus 23 sec. serial, so at least the parallel version is beating the serial one.

I'm fairly surprised that with 8 labs I'm still getting less than a factor of 4 speed-up, though...

Subject: Poor PARFOR performance

From: Steven_Lord

Date: 12 Nov, 2010 22:32:16

Message: 6 of 9



"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message
news:ibk8n5$9g3$1@fred.mathworks.com...
> I just bought the Parallel Computing Toolbox and am running it on this
> 64-bit machine:
>
> Dell Precision T7500
> Intel(R) Xeon(R) CPU X5680 @ 3.33Ghz 3.33 GHz Dual 6 Core
> 24 GB RAM
>
> My initial experiments with PARFOR are showing very poor speed
> performance, worse even than a plain old for-loop. For example,

*snip*

> Any idea what's going on?

I believe you're seeing this behavior in part because plus is fast. Try
doing something more expensive inside your loop so that the overhead of
partitioning the work to the workers doesn't overshadow the benefit you gain
by performing the task in parallel.
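
As a hedged illustration of "something more expensive", here is a minimal sketch of a loop in which each iteration does a substantial amount of work and returns only one number. The choice of eig and the names maxEig and A are arbitrary and not from this thread. If no pool is open, or the toolbox is not installed, the parfor loop may simply run like a regular for-loop.

```matlab
% Compute-heavy iterations with very little data moved per iteration.
N = 64; M = 200;
maxEig = zeros(1, N);                  % sliced output: iteration ii writes only maxEig(ii)

parfor ii = 1:N
    A = rand(M);                       % temporary variable, created inside the iteration
    maxEig(ii) = max(abs(eig(A)));     % eigenvalue computation dominates the cost
end
```

Here the only value passed into the loop is the scalar M, and each iteration sends back a single double.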

--
Steve Lord
slord@mathworks.com
To contact Technical Support use the Contact Us link on
http://www.mathworks.com

Subject: Poor PARFOR performance

From: Matt J

Date: 12 Nov, 2010 23:02:04

Message: 7 of 9

"Steven_Lord" <slord@mathworks.com> wrote in message <ibkf9g$fmr$1@fred.mathworks.com>...
>
> I believe you're seeing this behavior in part because plus is fast. Try
> doing something more expensive inside your loop so that the overhead of
> partitioning the work to the workers doesn't overshadow the benefit you gain
> by performing the task in parallel.
======

Steve, that's more or less what Sean said, but it's not adding up. Even if plus is fast, it should be reasonably expensive to loop over plus for N=1e8 iterations.

But when I run with N=1e8 and M=10, I get

Serial - 20 sec
Parallel - 37 sec

If I were to get linear speed-up with 8 workers, the parallel version should be taking 2.5 sec. Since it's taking 37 sec., that would have to mean that partitioning the computation to the workers is taking an overhead of well over 30 seconds.

Why should it cost so much to broadcast a single 10x10 matrix Z to 8 workers?
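
The arithmetic behind these figures, written out as a small script using the timings reported above (20 sec serial, 37 sec parallel, 8 workers). The variable names are made up, and "excess" here just means time beyond the ideal linear-speed-up figure, whatever its cause.

```matlab
% Timings reported for N=1e8, M=10.
tSerial = 20;                        % seconds
tParallel = 37;                      % seconds
nWorkers = 8;

tIdeal = tSerial / nWorkers;         % 2.5 s with perfect linear speed-up
speedup = tSerial / tParallel;       % about 0.54, i.e. a slowdown
excess = tParallel - tIdeal;         % 34.5 s beyond the ideal time

fprintf('ideal %.1f s, observed speed-up %.2f, excess %.1f s\n', ...
        tIdeal, speedup, excess);
```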

Subject: Poor PARFOR performance

From: James Tursa

Date: 12 Nov, 2010 23:23:05

Message: 8 of 9

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <ibkh1c$6s1$1@fred.mathworks.com>...
> "Steven_Lord" <slord@mathworks.com> wrote in message <ibkf9g$fmr$1@fred.mathworks.com>...
> >
> > I believe you're seeing this behavior in part because plus is fast. Try
> > doing something more expensive inside your loop so that the overhead of
> > partitioning the work to the workers doesn't overshadow the benefit you gain
> > by performing the task in parallel.
> ======
>
> Steve, that's more or less what Sean said, but it's not adding up. Even if plus is fast, it should be reasonably expensive to loop over plus for N=1e8 iterations
>
> But when I run with N=1e8 and M=10, I get
>
> Serial - 20 sec
> Parallel - 37 sec
>
> If I were to get linear speed-up with 8 workers, the parallel version should be taking 2.5 sec. Since it's taking 37 sec., that would have to mean that partitioning the computation to the workers is taking an overhead of well over 30 seconds.
>
> Why should it cost so much to broadcast a single 10x10 matrix Z to 8 workers?

I don't have the parallel toolbox so this is just guesswork on my part. What if you remove the reduction type of calculation from the loop and use something else that doesn't require reduction? Maybe MATLAB is forcing wait states in each thread to make sure the sum is done in the same order each time. Just guessing ...

James Tursa

Subject: Poor PARFOR performance

From: Matt J

Date: 13 Nov, 2010 00:23:06

Message: 9 of 9

"James Tursa" <aclassyguy_with_a_k_not_a_c@hotmail.com> wrote in message <ibki8p$nu8$1@fred.mathworks.com>...
>
> > Why should it cost so much to broadcast a single 10x10 matrix Z to 8 workers?
>
> I don't have the parallel toolbox so this is just guesswork on my part. What if you remove the reduction type of calculation from the loop and use something else that doesn't require reduction? Maybe MATLAB is forcing wait states in each thread to make sure the sum is done in the same order each time. Just guessing ...
=======

Okay, I think I figured it out. I was assuming that Z (because it's so small) would be cached by each of the workers. Apparently, though, it needs to be fetched from RAM in every pass through the loop (and all workers try to do so simultaneously).

I took your advice and ran a test which requires no communication:

 for ii=1:10000, rand(M); end; %M=500

Speed-up is approximately 6.7 from serial to parallel, and that's pretty consistent across different M.

I guess that's pretty decent for 8 workers. (Is it??)
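
One common way to judge such a number is parallel efficiency: the speed-up divided by the number of workers. Below is a small sketch using the figures above (speed-up 6.7, 8 workers). The serial-fraction estimate uses the Karp-Flatt formula, which is brought in here purely for illustration and was not discussed in this thread.

```matlab
% Efficiency of the reported speed-up.
S = 6.7;                                   % measured speed-up, serial to parallel
p = 8;                                     % number of workers

efficiency = S / p;                        % 0.8375, i.e. roughly 84% of ideal
serialFrac = (1/S - 1/p) / (1 - 1/p);      % Karp-Flatt estimate, about 0.028

fprintf('efficiency %.1f%%, estimated serial fraction %.1f%%\n', ...
        100*efficiency, 100*serialFrac);
```

Under these assumptions, the measured speed-up is consistent with roughly 3% of the run time not being parallelized.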

 
