<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302</link>
    <title>MATLAB Central Newsreader - Codistributed arrays performance</title>
    <description>Feed for thread: Codistributed arrays performance</description>
    <language>en-us</language>
    <copyright>&#169; 1994-2012 by MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Sun, 08 Nov 2009 18:00:17 -0500</pubDate>
      <title>Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693046</link>
      <author>Scott </author>
      <description>After reading other postings related to this topic, I'm left unsatisfied by the explanation for lagging codistributed performance in mathematical operations. To try and maximize performance in my application I have done the following:&lt;br&gt;
&lt;br&gt;
1) Write and performance tune the functions for serial (albeit multi-threaded) Matlab operation.&lt;br&gt;
&lt;br&gt;
2) Benchmark serial performance and find the data array size that results in precipitous performance drop-off. Note that this happens long before hitting the memory limitations of my deployment machine.&lt;br&gt;
&lt;br&gt;
3) Parallelize using codistributed arrays, managing redistribution explicitly to minimize communication problems.&lt;br&gt;
&lt;br&gt;
4) Benchmark parallel performance for data array sizes that exceed the serial performance drop-off size.&lt;br&gt;
&lt;br&gt;
5) Be disappointed by the fact that the parallel benchmarks clearly show a significantly reduced performance in elementwise binary operations, e.g. codistributed.times, codistributed.rdivide, codistributed.mtimes, etc.&lt;br&gt;
&lt;br&gt;
I have enough previous experience writing MPI in C++ to understand how to avoid communication bottlenecks and using the mpiprofile I was able to reduce the communication overhead to &amp;lt; 4.8% of the total execution time. &lt;br&gt;
&lt;br&gt;
The majority of the execution, 61% was taken by the element wise operations, codistributor1d.hElementwiseBinaryOpImpl, which seem to reduce the performance for an identical serial operation by at least 3x.&lt;br&gt;
&lt;br&gt;
Can someone explain why these operations, in the absence of communication overhead, are so much slower than an identical serial execution? I would buy the multi-threading argument if I hadn't made sure to find and push the data size beyond what multi-threading seems to handle efficiently before benchmarking.</description>
    </item>
    <item>
      <pubDate>Mon, 09 Nov 2009 09:01:11 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693156</link>
      <author>Edric M Ellis</author>
      <description>&quot;Scott &quot; &amp;lt;lorentz-spampadded@fastmail.fm&amp;gt; writes:&lt;br&gt;
&lt;br&gt;
&amp;gt; 5) Be disappointed by the fact that the parallel benchmarks clearly show a&lt;br&gt;
&amp;gt; significantly reduced performance in elementwise binary operations,&lt;br&gt;
&amp;gt; e.g. codistributed.times, codistributed.rdivide, codistributed.mtimes, etc.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; I have enough previous experience writing MPI in C++ to understand how to avoid&lt;br&gt;
&amp;gt; communication bottlenecks and using the mpiprofile I was able to reduce the&lt;br&gt;
&amp;gt; communication overhead to &amp;lt; 4.8% of the total execution time.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; The majority of the execution, 61% was taken by the element wise operations,&lt;br&gt;
&amp;gt; codistributor1d.hElementwiseBinaryOpImpl, which seem to reduce the performance&lt;br&gt;
&amp;gt; for an identical serial operation by at least 3x.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; Can someone explain why these operations, in the absence of communication&lt;br&gt;
&amp;gt; overhead, are so much slower than an identical serial execution? I would buy the&lt;br&gt;
&amp;gt; multi-threading argument if I hadn't made sure to find and push the data size&lt;br&gt;
&amp;gt; beyond what multi-threading seems to handle efficiently before benchmarking.&lt;br&gt;
&lt;br&gt;
The main overhead when doing &quot;embarrassingly parallel&quot; low numerical intensity&lt;br&gt;
operations such as &quot;times&quot; or &quot;rdivide&quot; is the time taken to get into and out of&lt;br&gt;
the underlying operation.&lt;br&gt;
&lt;br&gt;
As you have seen, &quot;codistributed.times&quot; and so on are implemented using MATLAB&lt;br&gt;
objects. Unfortunately, the overhead of object method dispatch is larger than&lt;br&gt;
the relatively small amount of numerical computation required. We are aware that&lt;br&gt;
this is a problem, and are working to try and increase the performance of&lt;br&gt;
codistributed arrays. For now, the main advantages of codistributed arrays are&lt;br&gt;
that they allow you to work with data sizes that do not fit onto a single&lt;br&gt;
machine, and that the more complex linear algebra routines (such as ldivide) can&lt;br&gt;
show performance benefit.&lt;br&gt;
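&lt;br&gt;
As a rough sketch of one possible workaround (not a fix for every case): when&lt;br&gt;
both operands share the same distribution scheme, you can apply the elementwise&lt;br&gt;
operation to the local parts, which are ordinary MATLAB arrays, and then rebuild&lt;br&gt;
the codistributed result. The array size is an arbitrary assumption, an open&lt;br&gt;
matlabpool is assumed, and any gain should be checked with mpiprofile.&lt;br&gt;
&lt;br&gt;
spmd&lt;br&gt;
&amp;nbsp;&amp;nbsp;% Two arrays created with the same default distribution scheme.&lt;br&gt;
&amp;nbsp;&amp;nbsp;A = codistributed.rand( 4000 );&lt;br&gt;
&amp;nbsp;&amp;nbsp;B = codistributed.rand( 4000 );&lt;br&gt;
&amp;nbsp;&amp;nbsp;% Elementwise multiply on ordinary local arrays; the operation itself&lt;br&gt;
&amp;nbsp;&amp;nbsp;% needs no communication.&lt;br&gt;
&amp;nbsp;&amp;nbsp;localC = getLocalPart( A ) .* getLocalPart( B );&lt;br&gt;
&amp;nbsp;&amp;nbsp;% Reassemble a codistributed array using the distribution scheme of A.&lt;br&gt;
&amp;nbsp;&amp;nbsp;C = codistributed.build( localC, getCodistributor( A ) );&lt;br&gt;
end&lt;br&gt;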
&lt;br&gt;
Cheers,&lt;br&gt;
&lt;br&gt;
Edric.</description>
    </item>
    <item>
      <pubDate>Mon, 09 Nov 2009 14:48:37 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693228</link>
      <author>Edric M Ellis</author>
      <description>Edric M Ellis &amp;lt;eellis@mathworks.com&amp;gt; writes:&lt;br&gt;
&lt;br&gt;
&amp;gt; [...] For now, the main advantages of codistributed arrays are&lt;br&gt;
&amp;gt; that they allow you to work with data sizes that do not fit onto a single&lt;br&gt;
&amp;gt; machine, and that the more complex linear algebra routines (such as ldivide) can&lt;br&gt;
&amp;gt; show performance benefit.&lt;br&gt;
&lt;br&gt;
Oops, I mean &quot;mldivide&quot; (aka &quot;backslash&quot;), not &quot;ldivide&quot;. &lt;br&gt;
&lt;br&gt;
Cheers,&lt;br&gt;
&lt;br&gt;
Edric.</description>
    </item>
    <item>
      <pubDate>Mon, 09 Nov 2009 18:12:02 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693273</link>
      <author>Scott </author>
      <description>Ah, I see, good to know. Is parfor then the route to better performance when applicable, or is the parallel toolbox really just for large data sets at this point?</description>
    </item>
    <item>
      <pubDate>Tue, 10 Nov 2009 08:08:45 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693423</link>
      <author>Edric M Ellis</author>
      <description>&quot;Scott &quot; &amp;lt;lorentz-spampadded@fastmail.fm&amp;gt; writes:&lt;br&gt;
&lt;br&gt;
&amp;gt; Ah, I see, good to know. Is parfor then the route to better performance when&lt;br&gt;
&amp;gt; applicable, or is the parallel toolbox really just for large data sets at this&lt;br&gt;
&amp;gt; point?&lt;br&gt;
&lt;br&gt;
In general, if a problem can be addressed using parfor, it will almost certainly&lt;br&gt;
be quicker as there are fewer synchronisation points for communication, and the&lt;br&gt;
dynamic scheduling attempts to get better load-balancing.&lt;br&gt;
&lt;br&gt;
Cheers,&lt;br&gt;
&lt;br&gt;
Edric.</description>
    </item>
    <item>
      <pubDate>Tue, 10 Nov 2009 20:42:02 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693626</link>
      <author>Scott </author>
      <description>Edric M Ellis &amp;lt;eellis@mathworks.com&amp;gt; wrote in message &amp;lt;ytw3a4mer8y.fsf@uk-eellis-deb5-64.mathworks.co.uk&amp;gt;...&lt;br&gt;
&amp;gt; &quot;Scott &quot; &amp;lt;lorentz-spampadded@fastmail.fm&amp;gt; writes:&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; Ah, I see, good to know. Is parfor then the route to better performance when&lt;br&gt;
&amp;gt; &amp;gt; applicable, or is the parallel toolbox really just for large data sets at this&lt;br&gt;
&amp;gt; &amp;gt; point?&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; In general, if a problem can be addressed using parfor, it will almost certainly&lt;br&gt;
&amp;gt; be quicker as there are fewer synchronisation points for communication, and the&lt;br&gt;
&amp;gt; dynamic scheduling attempts to get better load-balancing.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Cheers,&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Edric.&lt;br&gt;
&lt;br&gt;
My code is highly vectorized with very few for-loops. Would you expect the parfor performance to exceed that of a vectorized multi-threaded computation for large datasets? Or should I be considering semi-vectorized coding to take greater advantage of parfor? Seems like the array indexing necessary for that would slow it down, but I don't have a good handle on the performance trade-offs.</description>
    </item>
    <item>
      <pubDate>Wed, 11 Nov 2009 08:40:18 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693757</link>
      <author>Edric M Ellis</author>
      <description>&quot;Scott &quot; &amp;lt;lorentz-spampadded@fastmail.fm&amp;gt; writes:&lt;br&gt;
&lt;br&gt;
&amp;gt; Edric M Ellis &amp;lt;eellis@mathworks.com&amp;gt; wrote in message&lt;br&gt;
&amp;gt; &amp;lt;ytw3a4mer8y.fsf@uk-eellis-deb5-64.mathworks.co.uk&amp;gt;...&lt;br&gt;
&amp;gt;&amp;gt; &quot;Scott &quot; &amp;lt;lorentz-spampadded@fastmail.fm&amp;gt; writes:&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; &amp;gt; Ah, I see, good to know. Is parfor then the route to better performance when&lt;br&gt;
&amp;gt;&amp;gt; &amp;gt; applicable, or is the parallel toolbox really just for large data sets at&lt;br&gt;
&amp;gt;&amp;gt; &amp;gt; this point?&lt;br&gt;
&amp;gt;&amp;gt;  In general, if a problem can be addressed using parfor, it will almost&lt;br&gt;
&amp;gt;&amp;gt; certainly be quicker as there are fewer synchronisation points for&lt;br&gt;
&amp;gt;&amp;gt; communication, and the dynamic scheduling attempts to get better&lt;br&gt;
&amp;gt;&amp;gt; load-balancing.&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; Cheers,&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; Edric.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; My code is highly vectorized with very few for-loops. Would you expect the&lt;br&gt;
&amp;gt; parfor performance to exceed that of a vectorized multi-threaded computation for&lt;br&gt;
&amp;gt; large datasets? Or should I be considering semi-vectorized coding to take&lt;br&gt;
&amp;gt; greater advantage of parfor? Seems like the array indexing necessary for that&lt;br&gt;
&amp;gt; would slow it down, but I don't have a good handle on the performance&lt;br&gt;
&amp;gt; trade-offs.&lt;br&gt;
&lt;br&gt;
I'm afraid it's hard to say. The usual principles of using the profiler to work&lt;br&gt;
out where time is being taken should help. Generally, to get speedup with&lt;br&gt;
PARFOR, you need to ensure that the overheads of sending out and getting back&lt;br&gt;
the data don't exceed the time spent on the computation itself. Basically, it comes&lt;br&gt;
down to having each loop iteration performing a largish amount of computation&lt;br&gt;
compared to the amount of input and output data needed. A couple of extreme&lt;br&gt;
examples:&lt;br&gt;
&lt;br&gt;
y = rand( 1, N );&lt;br&gt;
parfor ii=1:N&lt;br&gt;
&amp;nbsp;&amp;nbsp;x(ii) = y(ii) + 1;&lt;br&gt;
end&lt;br&gt;
&lt;br&gt;
In that case, all of x and y have to be sent to/from the workers, but the amount&lt;br&gt;
of computation is trivial. This will be much slower than the obvious &quot;x = y +&lt;br&gt;
1&quot;.&lt;br&gt;
&lt;br&gt;
parfor ii=1:N&lt;br&gt;
&amp;nbsp;&amp;nbsp;pause( ii );&lt;br&gt;
end&lt;br&gt;
&lt;br&gt;
This example is slightly silly, but should give almost perfect speedup compared&lt;br&gt;
to the &quot;for&quot; version of the same loop, since the amount of data transferred is&lt;br&gt;
zero, and the &quot;work&quot; done takes a long time compared to the PARFOR overheads.&lt;br&gt;
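&lt;br&gt;
A less artificial sketch of the favourable case (the values of N and M are&lt;br&gt;
arbitrary assumptions): each iteration receives only the loop index and the&lt;br&gt;
scalar M, returns a single scalar, and does a comparatively large amount of&lt;br&gt;
work.&lt;br&gt;
&lt;br&gt;
N = 64;&lt;br&gt;
M = 300;&lt;br&gt;
s = zeros( 1, N );&lt;br&gt;
parfor ii=1:N&lt;br&gt;
&amp;nbsp;&amp;nbsp;% Largest singular value of a random M-by-M matrix created on the worker.&lt;br&gt;
&amp;nbsp;&amp;nbsp;s(ii) = max( svd( rand( M ) ) );&lt;br&gt;
end&lt;br&gt;
&lt;br&gt;
Whether this beats the plain &quot;for&quot; loop on a given machine still depends on how&lt;br&gt;
well the serial version is multi-threaded, so it is worth timing both with&lt;br&gt;
tic/toc.&lt;br&gt;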
&lt;br&gt;
Cheers,&lt;br&gt;
&lt;br&gt;
Edric.</description>
    </item>
    <item>
      <pubDate>Wed, 11 Nov 2009 15:55:22 -0500</pubDate>
      <title>Re: Codistributed arrays performance</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/265302#693910</link>
      <author>Scott </author>
      <description>Got it. Thanks for all the information, Edric.&lt;br&gt;
&lt;br&gt;
Scott</description>
    </item>
  </channel>
</rss>

