Thread Subject: Codistributed arrays performance

Subject: Codistributed arrays performance

From: Scott

Date: 8 Nov, 2009 18:00:17

Message: 1 of 8

After reading other postings related to this topic, I'm left unsatisfied by the explanation for lagging codistributed performance in mathematical operations. To try and maximize performance in my application I have done the following:

1) Write and performance tune the functions for serial (albeit multi-threaded) Matlab operation.

2) Benchmark serial performance and find the data array size that results in precipitous performance drop-off. Note that this happens long before hitting the memory limitations of my deployment machine.

3) Parallelize using codistributed arrays, managing redistribution explicitly to minimize communication problems.

4) Benchmark parallel performance for data array sizes that exceed the serial performance drop-off size.

5) Be disappointed by the fact that the parallel benchmarks clearly show a significantly reduced performance in elementwise binary operations, e.g. codistributed.times, codistributed.rdivide, codistributed.mtimes, etc.

I have enough previous experience writing MPI in C++ to understand how to avoid communication bottlenecks and using the mpiprofile I was able to reduce the communication overhead to < 4.8% of the total execution time.

The majority of the execution, 61% was taken by the element wise operations, codistributor1d.hElementwiseBinaryOpImpl, which seem to reduce the performance for an identical serial operation by at least 3x.

Can someone explain why these operations, in the absence of communication overhead, are so much slower than an identical serial execution? I would buy the multi-threading argument if I hadn't made sure to find and push the data size beyond what multi-threading seems to handle efficiently before benchmarking.

Subject: Codistributed arrays performance

From: Edric M Ellis

Date: 9 Nov, 2009 09:01:11

Message: 2 of 8

"Scott " <lorentz-spampadded@fastmail.fm> writes:

> 5) Be disappointed by the fact that the parallel benchmarks clearly show a
> significantly reduced performance in elementwise binary operations,
> e.g. codistributed.times, codistributed.rdivide, codistributed.mtimes, etc.
>
> I have enough previous experience writing MPI in C++ to understand how to avoid
> communication bottlenecks and using the mpiprofile I was able to reduce the
> communication overhead to < 4.8% of the total execution time.
>
> The majority of the execution, 61% was taken by the element wise operations,
> codistributor1d.hElementwiseBinaryOpImpl, which seem to reduce the performance
> for an identical serial operation by at least 3x.
>
> Can someone explain why these operations, in the absence of communication
> overhead, are so much slower than an identical serial execution? I would buy the
> multi-threading argument if I hadn't made sure to find and push the data size
> beyond what multi-threading seems to handle efficiently before benchmarking.

The main overhead when doing "embarassingly parallel" low numerical intensity
operations such as "times" or "rdivide" is the time taken to get into and out of
the underlying operation.

As you have seen, "codistributed.times" and so on are implemented using MATLAB
objects. Unfortunately, the overhead of object method dispatch is larger than
the relatively small amount of numerical computation required. We are aware that
this is a problem, and are working to try and increase the performance of
codistributed arrays. For now, the main advantages of codistributed arrays are
that they allow you to work with data sizes that do not fit onto a single
machine, and that the more complex linear algebra routines (such as ldivide) can
show performance benefit.

Cheers,

Edric.

Subject: Codistributed arrays performance

From: Edric M Ellis

Date: 9 Nov, 2009 14:48:37

Message: 3 of 8

Edric M Ellis <eellis@mathworks.com> writes:

> [...] For now, the main advantages of codistributed arrays are
> that they allow you to work with data sizes that do not fit onto a single
> machine, and that the more complex linear algebra routines (such as ldivide) can
> show performance benefit.

Oops, I mean "mldivide" (aka "backslash"), not "ldivide".

Cheers,

Edric.

Subject: Codistributed arrays performance

From: Scott

Date: 9 Nov, 2009 18:12:02

Message: 4 of 8

Ah, I see, good to know. Is parfor then the route to better performance when applicable, or is the parallel toolbox really just for large data sets at this point?

Subject: Codistributed arrays performance

From: Edric M Ellis

Date: 10 Nov, 2009 08:08:45

Message: 5 of 8

"Scott " <lorentz-spampadded@fastmail.fm> writes:

> Ah, I see, good to know. Is parfor then the route to better performance when
> applicable, or is the parallel toolbox really just for large data sets at this
> point?

In general, if a problem can be addressed using parfor, it will almost certainly
be quicker as there are fewer synchronisation points for communication, and the
dynamic scheduling attempts to get better load-balancing.

Cheers,

Edric.

Subject: Codistributed arrays performance

From: Scott

Date: 10 Nov, 2009 20:42:02

Message: 6 of 8

Edric M Ellis <eellis@mathworks.com> wrote in message <ytw3a4mer8y.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
> "Scott " <lorentz-spampadded@fastmail.fm> writes:
>
> > Ah, I see, good to know. Is parfor then the route to better performance when
> > applicable, or is the parallel toolbox really just for large data sets at this
> > point?
>
> In general, if a problem can be addressed using parfor, it will almost certainly
> be quicker as there are fewer synchronisation points for communication, and the
> dynamic scheduling attempts to get better load-balancing.
>
> Cheers,
>
> Edric.

My code is highly vectorized with very few for-loops. Would you expect the parfor performance to exceed that of a vectorized multi-threaded computation for large datasets? Or should I be considering semi-vectorized coding to take greater advantage of parfor? Seems like the array indexing necessary for that would slow it down, but I don't have a good handle on the performance trade-offs.

Subject: Codistributed arrays performance

From: Edric M Ellis

Date: 11 Nov, 2009 08:40:18

Message: 7 of 8

"Scott " <lorentz-spampadded@fastmail.fm> writes:

> Edric M Ellis <eellis@mathworks.com> wrote in message
> <ytw3a4mer8y.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
>> "Scott " <lorentz-spampadded@fastmail.fm> writes:
>>
>> > Ah, I see, good to know. Is parfor then the route to better performance when
>> > applicable, or is the parallel toolbox really just for large data sets at
>> > this point?
>> In general, if a problem can be addressed using parfor, it will almost
>> certainly be quicker as there are fewer synchronisation points for
>> communication, and the dynamic scheduling attempts to get better
>> load-balancing.
>>
>> Cheers,
>>
>> Edric.
>
> My code is highly vectorized with very few for-loops. Would you expect the
> parfor performance to exceed that of a vectorized multi-threaded computation for
> large datasets? Or should I be considering semi-vectorized coding to take
> greater advantage of parfor? Seems like the array indexing necessary for that
> would slow it down, but I don't have a good handle on the performance
> trade-offs.

I'm afraid it's hard to say. The usual principles of using the profiler to work
out where time is being taken should help. Generally, to get speedup with
PARFOR, you need to ensure that the overheads of sending out and getting back
the data doesn't exceed the amount of computation required. Basically, it comes
down to having each loop iteration performing a largish amount of computation
compared to the amount of input and output data needed. A couple of extreme
examples:

y = rand( 1, N );
parfor ii=1:N
  x(ii) = y + 1;
end

In that case, all of x and y have to be sent to/from the workers, but the amount
of computation is trivial. This will be much slower than the obvious "x = y +
1".

parfor ii=1:N
  pause( ii );
end

This example is slightly silly, but should give almost perfect speedup compared
to the "for" version of the same loop, since the amount of data transferred is
zero, and the "work" done takes a long time compared to the PARFOR overheads.

Cheers,

Edric.

Subject: Codistributed arrays performance

From: Scott

Date: 11 Nov, 2009 15:55:22

Message: 8 of 8

Got it. Thanks for all the information, Edric.

Scott

Tags for this Thread

Everyone's Tags:

Add a New Tag:

Separated by commas
Ex.: root locus, bode

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Tag Activity for This Thread
Tag Applied By Date/Time
mpiprofile Scott 8 Nov, 2009 13:04:05
performance Scott 8 Nov, 2009 13:04:05
profile Scott 8 Nov, 2009 13:04:05
codistributed Scott 8 Nov, 2009 13:04:05
parallel Scott 8 Nov, 2009 13:04:04
rssFeed for this Thread

Contact us at files@mathworks.com