Path: news.mathworks.com!not-for-mail
From: "Tim Davis" <davis@cise.ufl.edu>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Matlab Vectorisation Speed - How is it done in c++?
Date: Tue, 18 Dec 2007 11:25:50 +0000 (UTC)
Organization: University of Florida
Lines: 126
Message-ID: <fk8anu$jqe$1@fred.mathworks.com>
References: <eb177713-6655-4454-bbf6-92d2c91bb6a6@s19g2000prg.googlegroups.com>  <825523e3-b124-44a4-b82f-7b01b3495029@f3g2000hsg.googlegroups.com>
Reply-To: "Tim Davis" <davis@cise.ufl.edu>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1197977150 20302 172.30.248.37 (18 Dec 2007 11:25:50 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Tue, 18 Dec 2007 11:25:50 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 45902
Xref: news.mathworks.com comp.soft-sys.matlab:442890



See my replies, interleafed below (for a definition of
interleaf posting, see
http://www.cise.ufl.edu/~davis/Horror_matrices.html#composting
  )

"Steven G. Johnson" <stevenj@alum.mit.edu> wrote in message
<825523e3-b124-44a4-b82f-7b01b3495029@f3g2000hsg.googlegroups.com>...
> On Dec 17, 8:42 am, "Tim Davis" <da...@cise.ufl.edu> wrote:
> > > 1. There are issues related to the language syntax that
> > makes Fortran
> > > particularly easy to optimize for compilers, such as lack
> > of pointer
> > > aliasing. This is particularly important for optimal
> > allocation of
> > > registers when the CPU goes into a tight loop.
> >
> > Regarding (1):  I write in C and I haven't found (1) to be
> > that much of an issue (although I do worry about it and
it's > > well worth it for you to mention here).  I thinkthe
more
> > recent versions of gcc are able to work around this issue.
> > More serious for C is the abuse of pointers (indirect
> > addressing, which requires lots of memory traffic).  Memory
> > traffic is more of a problem than register allocation,
> > anyway (which you point out too, regarding the stride
issue)..
> 
> The old canard about pointer aliasing semantics being
weaker in C than
> in Fortran hasn't been an issue even in principle for
almost 10 years
> now, since the 1999 C standard introduced the "restrict"
keyword.  In
> practice, I've never found it to be a major practical
issue in highly
> optimized code, since for key loops you often want to
partially unroll
> them yourself anyway, and in any case higher-level
memory-access
> patterns are usually more important for performance.
> 
> Regarding the "abuse of pointers" I'm not sure what you're
talking
> about.  Array access in C, properly implemented, requires
no more or
> less pointer indirection than in Fortran or any other
language.

Right - I agree with you completely.

For "abuse of pointers", I mean data structures that use an
unnecessary amount of indirection (pointers to pointers to
pointers to ...).  I mean that "C gives you enough rope to
hang yourself".  Yes, simple arrays require no more or less
indirection than any other language.

> It's a good learning exercise, by the way, to implement a
matrix
> multiply yourself and compare it to a fast BLAS
implementation.  Even
> if you turn off things like SSE2 instructions, it is
probably a factor
> of 6 faster than your first try, for a decent-sized
matrix.  On the
> other hand, matrix multiplication is simple enough that
it's not *too*
> hard to get at least reasonably close to a fast BLAS if
you have some
> notion of what you are doing.  (I had a class once a few
years ago
> where there was a contest to write a dgemm as fast as
possible, and at
> least one student beat the fastest free BLAS at the time
for at least
> one matrix size.)

Yes, that is a good exercise.  It's a lot more difficult
than it looks.

> I once had an old Fortran programmer remark to me, "A
matrix multiply
> is just three loops!  How many possible ways can there be
to implement
> it?"  Recently, I told that story to an old compiler
engineer, and he
> immediately responded "Six ways (3 factorial), and I once
wrote a
> compiler that automatically found the best loop order."  

That's hilarious!

> The correct
> answer (neglecting exotic algorithms like Strassen etc.
that no one
> uses) is closer to n^3 factorial, since the n^3
multiplications all
> commute.  Programming was simpler when floating-point
arithmetic
> dominated the runtime and all you had to worry about was
the operation
> count.

Yup, I would guess n^3 factorial, maybe more because you can
do a flop in so many ways (fused mult-adds, SSE3 or not, etc).

A similar question I sometimes get:

"Gaussian elimination is just a few loops, how many lines of
code can it possibly take?" ... backslash includes probably
250,000 lines of code (C and Fortran; an educated guess,
since I wrote about half of it but haven't seen the other
half).  It can be done in maybe 20 or so lines of code in C
or Fortran, in a naive implementation of Gaussian
elimination with partial pivoting, but then it will be 10 or
20 times slower than x=A\b in the dense case, and quite
literally up to millions of times slower in the sparse case.

Matrix multiply is not quite so extreme, but not far off. 
Readers, if they're curious, should take a look at the ATLAS
or Goto BLAS source code (both are available).  They are
quite lengthy codes - but very fast.

Ditto for FFT (see FFTW for example).  Fast codes are not
(always) short codes; elegant codes are the fast ones, which
are not always short.