<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/246906</link>
    <title>MATLAB Central Newsreader - Multithreading: Negative returns?? Puzzling benchmarks</title>
    <description>Feed for thread: Multithreading: Negative returns?? Puzzling benchmarks</description>
    <language>en-us</language>
    <copyright>&amp;copy;1994-2012 by MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Tue, 17 Mar 2009 22:46:01 -0400</pubDate>
      <title>Multithreading: Negative returns?? Puzzling benchmarks</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/246906#635688</link>
      <author>Michael Johnston</author>
      <description>I just got a new computer with dual Xeon 5450s at 3 GHz (8 cores&lt;br&gt;
total). I decided to see how well the multi-threading in BLAS and&lt;br&gt;
LAPACK would work, so I ran a simple test: Multiply two matrices&lt;br&gt;
together a bunch of times, and then do the same thing with the&lt;br&gt;
division operator. Perform this test for a number of threads up to the&lt;br&gt;
number of CPU cores. Then plot the percentage change in the execution&lt;br&gt;
time relative to the single threaded case.&lt;br&gt;
&lt;br&gt;
The result is strange. I certainly expected diminishing returns&lt;br&gt;
from multi-threading as the number of threads increased. But I&lt;br&gt;
never expected to see *negative* returns. While smaller&lt;br&gt;
matrices perform relatively worse, presumably as a result of overhead&lt;br&gt;
from thread creation, my benchmarks indicate that even for reasonably&lt;br&gt;
sized matrices (e.g., 500-by-500) the returns to multi-threading&lt;br&gt;
become negative surprisingly quickly.&lt;br&gt;
&lt;br&gt;
I'm very surprised to see this on a new shared-memory system. Has&lt;br&gt;
anyone else gotten benchmarks like this? I have posted a graph of the&lt;br&gt;
plot, as well as the benchmark code I wrote, on my web site with more&lt;br&gt;
information: &lt;a href=&quot;http://michaelkjohnston.com/perm/mt8bench/&quot;&gt;http://michaelkjohnston.com/perm/mt8bench/&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Any ideas?? Anecdotes? Theories?&lt;br&gt;
&lt;br&gt;
Best regards,&lt;br&gt;
&lt;br&gt;
Michael</description>
    </item>
    <item>
      <pubDate>Wed, 18 Mar 2009 00:46:01 -0400</pubDate>
      <title>Re: Multithreading: Negative returns?? Puzzling benchmarks</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/246906#635711</link>
      <author>Derek O'Connor</author>
      <description>Dear Michael,&lt;br&gt;
&lt;br&gt;
The matrices used in your test above are tiny: 14x14 and 200x200.&lt;br&gt;
&lt;br&gt;
Take a look at these test results on a Dell Precision 690 with dual Xeon 5345s at 2.3 GHz, 8 GB RAM:&lt;br&gt;
&lt;a href=&quot;http://www.derekroconnor.net/Software/Benchmarks.htm&quot;&gt;http://www.derekroconnor.net/Software/Benchmarks.htm&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
These tests show substantial multicore speedups for Matmult and LU Decomp, but very little speedup for SVD or EIG.&lt;br&gt;
&lt;br&gt;
Regards,&lt;br&gt;
&lt;br&gt;
Derek O'Connor</description>
    </item>
    <item>
      <pubDate>Wed, 18 Mar 2009 01:36:05 -0400</pubDate>
      <title>Re: Multithreading: Negative returns?? Puzzling benchmarks</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/246906#635722</link>
      <author>Michael Johnston</author>
      <description>Dear Derek,&lt;br&gt;
&lt;br&gt;
Thanks very much for your reply. That's really helpful! My prior was&lt;br&gt;
that the BLAS+LAPACK libraries would make optimal decisions with&lt;br&gt;
respect to threading -- I think documentation from MathWorks says&lt;br&gt;
something to this effect -- so I was surprised to see run times&lt;br&gt;
actually increase. Perhaps the worst part is that, in all of these&lt;br&gt;
tests, CPU utilization rises steadily as the number of threads&lt;br&gt;
increases until it hits 100%. I tested a new 24-CPU Xeon machine&lt;br&gt;
and found that, for one piece of code, run times were effectively the&lt;br&gt;
same with multi-threading on and off, but that with it on, CPU&lt;br&gt;
utilization was 24x higher. I'll use your code to replicate your&lt;br&gt;
benchmarks on my hardware tomorrow when I get back to work and post an&lt;br&gt;
update.&lt;br&gt;
&lt;br&gt;
Best regards,&lt;br&gt;
&lt;br&gt;
Michael</description>
    </item>
  </channel>
</rss>

