Path: news.mathworks.com!not-for-mail
From: <HIDDEN>
Newsgroups: comp.soft-sys.matlab
Subject: Codistributed arrays performance
Date: Sun, 8 Nov 2009 18:00:17 +0000 (UTC)
Organization: University of Oregon
Lines: 17
Message-ID: <hd70vh$n62$1@fred.mathworks.com>
Reply-To: <HIDDEN>


After reading other postings on this topic, I'm left unsatisfied by the explanations for the lagging performance of codistributed arrays in mathematical operations. To try to maximize performance in my application, I have done the following:

1) Write and performance-tune the functions for serial (albeit multi-threaded) MATLAB operation.

2) Benchmark serial performance and find the data array size that results in precipitous performance drop-off. Note that this happens long before hitting the memory limitations of my deployment machine.

3) Parallelize using codistributed arrays, managing redistribution explicitly to minimize communication problems.

4) Benchmark parallel performance for data array sizes that exceed the serial performance drop-off size.

5) Be disappointed that the parallel benchmarks clearly show significantly reduced performance in binary operations: the elementwise codistributed.times and codistributed.rdivide, as well as codistributed.mtimes, etc.
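
The comparison in steps 2 through 5 can be sketched roughly as follows. This is an illustrative sketch, not the application code: the size n, the random test data, and the column-wise distribution are assumptions, and the function names (codistributed.rand, getLocalPart, labBarrier) may differ between Parallel Computing Toolbox releases. It assumes a pool of workers is already open. Timing the same multiply on the local parts gives a rough way to separate the cost of the codistributed dispatch from the arithmetic itself.

```matlab
n = 4000;                                 % hypothetical size, chosen past the serial drop-off

% Serial (multi-threaded) baseline on the client
A = rand(n);
B = rand(n);
tic;
C = A .* B;
tSerial = toc;

spmd
    codist = codistributor1d(2);          % assumption: distribute by columns
    Ad = codistributed.rand(n, n, codist);
    Bd = codistributed.rand(n, n, codist);

    % Elementwise multiply through the codistributed overload
    labBarrier;
    tic;
    Cd = Ad .* Bd;                        % dispatches to codistributed.times
    labBarrier;
    tDist = toc;

    % Same arithmetic on the local parts only, no codistributed dispatch
    Al = getLocalPart(Ad);
    Bl = getLocalPart(Bd);
    labBarrier;
    tic;
    Cl = Al .* Bl;
    labBarrier;
    tLocal = toc;
end

% tDist and tLocal are Composite objects; index by worker to read them
fprintf('serial %.4f s, codistributed %.4f s, local parts %.4f s\n', ...
        tSerial, tDist{1}, tLocal{1});
```

Both distributed operands share one codistributor here, so the elementwise multiply should need no redistribution. Single tic/toc measurements are noisy, so repeating each timing several times would give a fairer comparison.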

I have enough previous experience writing MPI code in C++ to understand how to avoid communication bottlenecks, and using mpiprofile I was able to reduce the communication overhead to < 4.8% of the total execution time.

The majority of the execution time, 61%, was taken by the elementwise operations (codistributor1d.hElementwiseBinaryOpImpl), which appear to run at least 3x slower than the identical serial operation.

Can someone explain why these operations, in the absence of communication overhead, are so much slower than the identical serial execution? I would buy the multi-threading argument if I hadn't made sure, before benchmarking, to push the data size beyond what multi-threading seems to handle efficiently.