Path: news.mathworks.com!not-for-mail
From: "Tim Davis" <davis@cise.ufl.edu>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Matlab Vectorisation Speed - How is it done in c++?
Date: Mon, 17 Dec 2007 13:42:47 +0000 (UTC)
Organization: University of Florida
Lines: 118
Message-ID: <fk5ucn$ngr$1@fred.mathworks.com>
References: <eb177713-6655-4454-bbf6-92d2c91bb6a6@s19g2000prg.googlegroups.com> <08100346-586e-41fb-bb41-9d9342d269ed@w40g2000hsb.googlegroups.com>
Reply-To: "Tim Davis" <davis@cise.ufl.edu>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1197898967 24091 172.30.248.37 (17 Dec 2007 13:42:47 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Mon, 17 Dec 2007 13:42:47 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 45902
Xref: news.mathworks.com comp.soft-sys.matlab:442739



sturlamolden <sturlamolden@yahoo.no> wrote in message
<08100346-586e-41fb-bb41-9d9342d269ed@w40g2000hsb.googlegroups.com>...
> On 17 Des, 01:05, Phil Winder
<philipwin...@googlemail.com> wrote:
> 
> > Im currently porting some matlab algorithms to c++ code.
 The test
> > code I have is testing the vector math capabilities and
how fast they
> > can go.  I have found that it can be very, very fast and
I am
> > strugglling to reproduce the speed in c++. How does
matlab do it? And
> > how can it be reproduced in c++?
> 
> Beating the performance of vectorized Matlab code is very
hard, and
> usually not worth the effort.
> 
> Matlab makes calls to optimized C and Fortran libraries
such as blas/
> atlas, lapack and fftw. You cannot duplicate their
efficacies in C++
> for at least two reasons:
> 
> 1. There are issues related to the language syntax that
makes Fortran
> particularly easy to optimize for compilers, such as lack
of pointer
> aliasing. This is particularly important for optimal
allocation of
> registers when the CPU goes into a tight loop.
> 
> 2. A lot of effort have been put into making these
libraries as fast
> as possible. This includes optimal use of cache and branch
prediction.
> Duplicating these efforts on your own is going to take the
rest of
> your life to complete.
> 
> My advice would be this:
> 
> If you want speed in your C++ app, link and call the same
libraries as
> Matlab do. Most of them are available for free. Give the
C++ compiler
> pointer aliasing hints wherever possible.
> 
> In addition:
> 
> Use optimization level 3 on numerical code and level 2 on non-
> numerical code. Process the data in chunks that fit in
your L1 cache.
> Force the CPU to prefetch if you know it will help.
Page-align your
> arrays in RAM.  Memory access is terribly slow, traverse
as few times
> as possible. Never use strided memory access. Manually
unroll tight
> loops. Exploit arithmetic pipelining of four subsequent
operations.
> Avoid divisions, transform to a multiplcation. Exploit
multiple CPUs:
> use MPI or OpenMP, forkjoin with labour-sharing
threadpools, etc. Use
> inline assembly to access SIMD parallel registers.
> 
> Also remember Hoare's statement about optimization, quoted
by D.
> Knuth: "Premature optimization is the root of all evil in
computer
> programming." Profile your code. Direct your optimizations
to the
> important bottlenecks. They are likely to be few. Do as
much as you
> can with the bottlenecks, and never mind the reminding 90%
of your
> code.
> 
> But before you begin: ask yourself if the hard work is
goint to be
> worth the effort.



Regarding (1):  I write in C and I haven't found (1) to be
that much of an issue (although I do worry about it and it's
well worth it for you to mention here).  I think the more
recent versions of gcc are able to work around this issue. 
More serious for C is the abuse of pointers (indirect
addressing, which requires lots of memory traffic).  Memory
traffic is more of a problem than register allocation,
anyway (which you point out too, regarding the stride issue)..

Regarding (2):  Yes, that's definitely true.  One could
write an m-file script that does x=A\b without backslash, in
maybe 50 lines of M (LU factorization if square, QR with
Householder if rectangular).  Backslash itself has maybe
250,000 lines of code (that's a guess, but an educated one
since I wrote about half that).

Some vector operations are trivial (a = b+c) to write in C
or Fortran.  If you write a=b*c where b and c are matrices,
then there's no way you'll match performance in an optimized
BLAS library.

Rule of thumb: if the work is O(n) where n is the size of
the data, then there's a decent chance that simple C or
Fortran code can match (not beat) MATLAB.  If the work is
higher than O(n) than you probably can't beat MATLAB with
simple C.  Matrix add fits in the former category; matrix
multiply doesn't.

You can always call the BLAS / LAPACK yourself, in the dense
case, or use available C code for the sparse case.  Lots of
the code in x=a*b, x=A\b, etc is open source.