Path: news.mathworks.com!newsfeed-00.mathworks.com!nlpi057.nbdc.sbc.com!prodigy.net!news.glorb.com!news2.glorb.com!postnews.google.com!o2g2000prl.googlegroups.com!not-for-mail
From: Michael Johnston <mkjohnst@gmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Multithreading: Negative returns?? Puzzling benchmarks
Date: Tue, 17 Mar 2009 15:46:01 -0700 (PDT)
Organization: http://groups.google.com
Lines: 26
Message-ID: <3b948ba5-b0c4-4b53-8352-de5641ddd313@o2g2000prl.googlegroups.com>
NNTP-Posting-Host: 140.80.194.3
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Trace: posting.google.com 1237329962 25009 127.0.0.1 (17 Mar 2009 22:46:02 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Tue, 17 Mar 2009 22:46:02 +0000 (UTC)
Cc: mjohnston@bankofcanada.ca
Complaints-To: groups-abuse@google.com
Injection-Info: o2g2000prl.googlegroups.com; posting-host=140.80.194.3; 
	posting-account=dwbQVQkAAACN_1BI7VOnXlWvTWi3ZdU4
User-Agent: G2/1.0
X-HTTP-UserAgent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) 
	AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.48 Safari/525.19,gzip(gfe),gzip(gfe)
Xref: news.mathworks.com comp.soft-sys.matlab:525688


I just got a new computer with dual Xeon 5450s at 3 GHz (8 cores in
total). I decided to see how well the multi-threading in BLAS and
LAPACK would work, so I ran a simple test: multiply two matrices
together a number of times, then do the same thing with the
division operator. I repeated this test for each thread count from 1
up to the number of CPU cores, then plotted the percentage change in
execution time relative to the single-threaded case.
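
To make the metric concrete, here is a minimal sketch of the
percentage-change calculation. It is written in Python purely for
illustration and is not the benchmark code itself; the function name
and the sample timings are made up for the example and are not
measurements.

```python
def percent_change_vs_single(times):
    """Percentage change in run time relative to the single-threaded run.

    times[i] is the elapsed time (seconds) measured with i + 1 threads,
    so times[0] is the single-threaded baseline.
    Negative values mean faster than one thread; positive values mean slower.
    """
    base = times[0]
    if base <= 0:
        raise ValueError("baseline time must be positive")
    return [100.0 * (t - base) / base for t in times]


# Placeholder timings for 1, 2, 3 and 4 threads (hypothetical, not measured).
example_times = [2.0, 1.0, 1.5, 3.0]
print(percent_change_vs_single(example_times))
```

With this definition, a curve that dips below zero and then climbs
back above it is what I mean by the returns to multi-threading
turning negative.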

The result is strange. I certainly expected diminishing returns
from multi-threading as the number of threads increased. But I
never expected to see *negative* returns. While smaller
matrices perform relatively worse, presumably as a result of overhead
from thread creation, my benchmarks indicate that even for reasonably
sized matrices (e.g., 500-by-500) the returns to multi-threading
become negative surprisingly quickly.

I'm very surprised to see this on a new shared-memory system. Has
anyone else gotten benchmarks like this? I have posted the plot, as
well as the benchmark code I wrote, on my web site with more
information: http://michaelkjohnston.com/perm/mt8bench/

Any ideas?? Anecdotes? Theories?

Best regards,

Michael