
Thread Subject:
parfor 100 times slower than for on a 4-core machine

Subject: parfor 100 times slower than for on a 4-core machine

From: nav0239

Date: 28 Feb, 2009 01:47:03

Message: 1 of 6

I have read a lot of parfor-related threads but have not seen the issue I am dealing with: trying to use parfor to speed up code on a quad-core machine. I have two problems:

1) Using parfor is 100 times slower than just using 'for' on a single core. All four cores appear to cycle between 0% and 100% load the whole time; is 99% of the time being spent on data transfer?

2) The workers will not see the loaded dll library unless I first run the code once single-threaded (with 'for' instead of parfor).

A skeleton of the code is shown below. Can someone point out where the problem is? Could it be that all the mex function calls and/or the calls to dll functions are evaluated only on the client? Or is it something else? If it helps to identify the problem, I can try to make the skeleton code runnable and post back results.

function parallelTest
  % load dll 'abc' here (loadlibrary), omitted in this skeleton
  for si = 1800 : -1 : 1                % filling backwards preallocates the struct array
    tdata(si).a1 = rand(800,1);
    tdata(si).a2 = rand(800,1);
    tdata(si).b1 = si;
  end
  doParallel(tdata);
end

function doParallel(tdata)
  dlen = length(tdata(1).a1);
  results = cell(1, length(tdata));     % sliced output, one cell per iteration
  parfor ti = 1 : length(tdata)
    tmp = false(dlen,1);                % logical mask for this iteration
    dd1 = calllib('abc', 'abcF', tdata(ti).a1);  % call abcF in abc dll
    dd2 = mexF(tdata(ti).a2);                    % call mex function mexF
    tmp(dd1 > 0 & dd1 > tdata(ti).a2 & dd2 > 0) = true;
    tmp(dd1 < 0 & dd1 < tdata(ti).a2 & dd2 < 0) = false;
    % two more mex function calls
    results{ti} = tmp;                  % keep the result for this iteration
  end
  % some logic collecting results from 'results'
end
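For problem 2, I suppose I could also try loading the dll explicitly on every worker before the parfor loop. A rough, untested sketch (I am assuming the library ships with a header abc.h and that pctRunOnAll is available in my release):

  matlabpool open local 4
  % ask the client and every worker to load the library if it is not loaded yet
  pctRunOnAll('if ~libisloaded(''abc''), loadlibrary(''abc'', ''abc.h''); end')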

Thanks a lot!

Subject: parfor 100 times slower than for on a 4-core machine

From: Jveer

Date: 28 Feb, 2009 02:28:02

Message: 2 of 6

welcome to the club of parfor headaches! lol

my guess is your code is suffering from huge communication overheads. try reducing the number of workers in your matlabpool. i had a similar problem: i decreased the number of local workers from 4 to 2 and my code ran about 30% faster.
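something like this is roughly what i mean (just a sketch, assuming the default 'local' scheduler):

  matlabpool close force local   % shut down the existing 4-worker pool
  matlabpool open local 2        % reopen it with only 2 workers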

Subject: parfor 100 times slower than for on a 4-core machine

From: nav0239

Date: 28 Feb, 2009 02:49:02

Message: 3 of 6

"Jveer " <jveer@jveer.com> wrote in message <goa7fi$d0e$1@fred.mathworks.com>...
> welcome to the club of parfor headaches! lol
>
> my guess is your code is suffering from huge communication overheads. try reducing the number of workers in your matlabpool. i had a similar problem: i decreased the number of local workers from 4 to 2 and my code ran about 30% faster.

That's what I thought earlier. So I have tidied up the code: made tdata a sliced variable, reduced temporary variables, and I am sure there is no loop dependency.
The problem could be the size of tdata, which is normally ~100 MB. As far as I can tell, only about 1/4 of tdata, or 25 MB, needs to be transferred to the workers. I did not try to parallelize at a higher level, because then the workers would need access to the complete ~100 MB of tdata.

I have always used 4 local workers. I will try 3 and see if it helps.

BTW, the parfor loop usually executes in 1-3 seconds using 'for' and a single thread.
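To make sure only that quarter of the data gets shipped, I may also try pulling just the needed fields out of tdata before the loop, roughly like this (untested sketch):

  a1 = {tdata.a1};   % cell arrays holding only the two fields the loop uses
  a2 = {tdata.a2};
  parfor ti = 1 : numel(a1)
    dd1 = calllib('abc', 'abcF', a1{ti});
    dd2 = mexF(a2{ti});
    % ... same masking logic as in the skeleton ...
  end

That way a1 and a2 should be sliced variables, so each worker receives only its share.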

Subject: parfor 100 times slower than for on a 4-core machine

From: Narfi

Date: 28 Feb, 2009 03:55:20

Message: 4 of 6

"nav0239" <yue@mdadotdot.com> wrote in message <goa8mu$66s$1@fred.mathworks.com>...
> "Jveer " <jveer@jveer.com> wrote in message <goa7fi$d0e$1@fred.mathworks.com>...
> > welcome to the club of parfor headaches! lol
> >
> > my guess is your code is suffering from huge communication overheads. trying reducing the number of matlabpool you are using. i had a similar problem: i decreased the number of local pools from 4 to 2 and my code ran abt 30% faster.
>
> That's what I thought earlier. So I have tidied up the code: made tdata a sliced variable, reduced temporary variables, and I am sure there is no loop dependency.
> The problem could be the size of tdata, which is normally ~100 MB. As far as I can tell, only about 1/4 of tdata, or 25 MB, needs to be transferred to the workers. I did not try to parallelize at a higher level, because then the workers would need access to the complete ~100 MB of tdata.
>
> I have always used 4 local workers. I will try 3 and see if it helps.
>
> BTW, the parfor loop usually executes in 1-3 seconds using 'for' and a single thread.

I doubt you will have great success trying to parallelize a loop that runs in 1-3 seconds and needs on the order of 25-100MB of input data.

Having said that, however, I noticed that in your code snippet, you had an outer for loop that called the inner for loop 1800 times. That's 1800 times the amount of computation, but only 4 times the communication requirements (25MB vs. 100MB). I would expect you to be much, much more likely to get the performance improvements you are looking for by trying to parallelize the outer for loop.

Best,

Narfi

Subject: parfor 100 times slower than for on a 4-core machine

From: nav0239

Date: 28 Feb, 2009 06:09:01

Message: 5 of 6

"Narfi" <narfi.stefansson@mathworks.com> wrote in message <goacj8$2gq$1@fred.mathworks.com>...
> "nav0239" <yue@mdadotdot.com> wrote in message <goa8mu$66s$1@fred.mathworks.com>...
> > "Jveer " <jveer@jveer.com> wrote in message <goa7fi$d0e$1@fred.mathworks.com>...
> > > welcome to the club of parfor headaches! lol
> > >
> > > my guess is your code is suffering from huge communication overheads. trying reducing the number of matlabpool you are using. i had a similar problem: i decreased the number of local pools from 4 to 2 and my code ran abt 30% faster.
> >
> > That's what i thought earlier. So I have tidied up the code, say, make tdata a slice variable, reduce temporary variables and I am sure there is no loop dependency.
> > The problem could be the size of tdata, which is normally ~100 MB. As far as I can tell, only about 1/4 of the tdata, or 25 MB, needs to be transfered to the workers. I did not try to make it parallel at a higher level, because the workers would need to access the complete tdata of ~100 MB.
> >
> > I always used 4 local workers. Will try 3 and see if it helps.
> >
> > BTW, the parfor loop usually executes in 1 - 3 seconds using 'for' and single thread.
>
> I doubt you will have great success trying to parallelize a loop that runs in 1-3 seconds and needs on the order of 25-100MB of input data.

If this is the problem, do you have any suggestions for improving it?
>
> Having said that, however, I noticed that in your code snippet, you had an outer for loop that called the inner for loop 1800 times. That's 1800 times the amount of computation, but only 4 times the communication requirements (25MB vs. 100MB). I would expect you to be much, much more likely to get the performance improvements you are looking for by trying to parallelize the outer for loop.

The outer for loop is only for populating the input data. The call to doParallel takes more than 99% of the time, and that is what I am trying to parallelize. This function is called only once and is not inside the outer loop.

In my actual code there are outer loops around the parfor loop that apply different sets of parameters. If parfor were used on those loops instead, the whole tdata would have to be shared among all the workers. I may try that later, however, and see if it improves things.
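If I do try that, I imagine it would look roughly like the sketch below, where paramSets and runOneParamSet are made-up names standing in for my actual parameter sets and the current serial code:

  results = cell(1, numel(paramSets));
  parfor pi = 1 : numel(paramSets)
    % tdata is not indexed by pi, so it is broadcast:
    % the full ~100 MB gets sent to each worker once
    results{pi} = runOneParamSet(tdata, paramSets(pi));
  end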

Thanks!
>
> Best,
>
> Narfi

Subject: parfor 100 times slower than for on a 4-core machine

From: Narfi

Date: 28 Feb, 2009 17:27:01

Message: 6 of 6

"nav0239" <yue@mdadotdot.com> wrote in message <goakdt$gnc$1@fred.mathworks.com>...
> The outer for loop is only for populating the input data. The call to doParallel takes more than 99% of the time, and that is what I am trying to parallelize. This function is called only once and is not inside the outer loop.
>
> In my actual code there are outer loops around the parfor loop that apply different sets of parameters. If parfor were used on those loops instead, the whole tdata would have to be shared among all the workers. I may try that later, however, and see if it improves things.
>
I now see how I misread your code.
