<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993</link>
    <title>MATLAB Central Newsreader - Matlab Vectorisation Speed - How is it done in c++?</title>
    <description>Feed for thread: Matlab Vectorisation Speed - How is it done in c++?</description>
    <language>en-us</language>
    <copyright>© 1994-2012 by MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Mon, 17 Dec 2007 00:05:32 -0500</pubDate>
      <title>Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406400</link>
      <author>Phil Winder</author>
      <description>Hi,&lt;br&gt;
I'm currently porting some MATLAB algorithms to C++ code.  The test&lt;br&gt;
code I have is testing the vector math capabilities and how fast they&lt;br&gt;
can go.  I have found that MATLAB can be very, very fast and I am&lt;br&gt;
struggling to reproduce the speed in C++. How does MATLAB do it? And&lt;br&gt;
how can it be reproduced in C++?&lt;br&gt;
&lt;br&gt;
Thanks,&lt;br&gt;
Phil Winder</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 01:36:05 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406402</link>
      <author>sturlamolden</author>
      <description>On 17 Des, 01:05, Phil Winder &amp;lt;philipwin...@googlemail.com&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
&amp;gt; Im currently porting some matlab algorithms to c++ code.  The test&lt;br&gt;
&amp;gt; code I have is testing the vector math capabilities and how fast they&lt;br&gt;
&amp;gt; can go.  I have found that it can be very, very fast and I am&lt;br&gt;
&amp;gt; strugglling to reproduce the speed in c++. How does matlab do it? And&lt;br&gt;
&amp;gt; how can it be reproduced in c++?&lt;br&gt;
&lt;br&gt;
Beating the performance of vectorized Matlab code is very hard, and&lt;br&gt;
usually not worth the effort.&lt;br&gt;
&lt;br&gt;
MATLAB makes calls to optimized C and Fortran libraries such as BLAS/&lt;br&gt;
ATLAS, LAPACK and FFTW. You cannot duplicate their efficiency in C++&lt;br&gt;
for at least two reasons:&lt;br&gt;
&lt;br&gt;
1. There are issues related to the language syntax that make Fortran&lt;br&gt;
particularly easy for compilers to optimize, such as the lack of pointer&lt;br&gt;
aliasing. This is particularly important for optimal allocation of&lt;br&gt;
registers when the CPU goes into a tight loop.&lt;br&gt;
&lt;br&gt;
2. A lot of effort has been put into making these libraries as fast&lt;br&gt;
as possible. This includes optimal use of cache and branch prediction.&lt;br&gt;
Duplicating these efforts on your own is going to take the rest of&lt;br&gt;
your life to complete.&lt;br&gt;
&lt;br&gt;
My advice would be this:&lt;br&gt;
&lt;br&gt;
If you want speed in your C++ app, link and call the same libraries as&lt;br&gt;
MATLAB does. Most of them are available for free. Give the C++ compiler&lt;br&gt;
pointer aliasing hints wherever possible.&lt;br&gt;
&lt;br&gt;
In addition:&lt;br&gt;
&lt;br&gt;
Use optimization level 3 on numerical code and level 2 on non-&lt;br&gt;
numerical code. Process the data in chunks that fit in your L1 cache.&lt;br&gt;
Force the CPU to prefetch if you know it will help. Page-align your&lt;br&gt;
arrays in RAM.  Memory access is terribly slow, so traverse your data as&lt;br&gt;
few times as possible. Never use strided memory access. Manually unroll tight&lt;br&gt;
loops. Exploit arithmetic pipelining of four subsequent operations.&lt;br&gt;
Avoid divisions; transform them to multiplications. Exploit multiple CPUs:&lt;br&gt;
use MPI or OpenMP, fork-join with work-sharing thread pools, etc. Use&lt;br&gt;
inline assembly to access SIMD parallel registers.&lt;br&gt;
&lt;br&gt;
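A minimal sketch of two of the points above (aliasing hints and replacing&lt;br&gt;
division by multiplication), assuming a C99 compiler. C++ has no standard&lt;br&gt;
restrict, though GCC and MSVC offer __restrict__ / __restrict as extensions.&lt;br&gt;
The function name divide_all is made up for illustration:&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```c
/* Hypothetical helper: y[i] = x[i] / d for i = 0 .. n-1.
   restrict (C99) promises the compiler that x and y never overlap,
   so it is free to keep values in registers and vectorize the loop. */
void divide_all(double *restrict y, const double *restrict x,
                double d, unsigned n)
{
    const double inv = 1.0 / d;   /* one division, hoisted out of the loop */
    unsigned i;

    for (i = 0; i != n; ++i)
        y[i] = x[i] * inv;        /* multiply instead of divide */
}
```
&lt;/pre&gt;
Note that x * (1.0 / d) can differ from x / d in the last bit, so check&lt;br&gt;
that this is acceptable for your algorithm before applying it.&lt;br&gt;
&lt;br&gt;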
Also remember Hoare's statement about optimization, quoted by D.&lt;br&gt;
Knuth: &quot;Premature optimization is the root of all evil in computer&lt;br&gt;
programming.&quot; Profile your code. Direct your optimizations to the&lt;br&gt;
important bottlenecks. They are likely to be few. Do as much as you&lt;br&gt;
can with the bottlenecks, and never mind the remaining 90% of your&lt;br&gt;
code.&lt;br&gt;
&lt;br&gt;
But before you begin: ask yourself if the hard work is going to be&lt;br&gt;
worth the effort.</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 10:09:45 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406420</link>
      <author>Phil Winder</author>
      <description>On Dec 17, 1:36 am, sturlamolden &amp;lt;sturlamol...@yahoo.no&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
Great reply. Thanks for the detail.&lt;br&gt;
Would it not also be easier to compile my MATLAB code into a DLL and&lt;br&gt;
link that from my C++ program, thus using MATLAB's optimisations before&lt;br&gt;
I even call it?&lt;br&gt;
&lt;br&gt;
Thanks,&lt;br&gt;
&lt;br&gt;
Phil</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 13:42:47 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406446</link>
      <author>Tim Davis</author>
      <description>sturlamolden &amp;lt;sturlamolden@yahoo.no&amp;gt; wrote in message&lt;br&gt;
&amp;lt;08100346-586e-41fb-bb41-9d9342d269ed@w40g2000hsb.googlegroups.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
Regarding (1):  I write in C and I haven't found (1) to be&lt;br&gt;
that much of an issue (although I do worry about it and it's&lt;br&gt;
well worth it for you to mention here).  I think the more&lt;br&gt;
recent versions of gcc are able to work around this issue. &lt;br&gt;
More serious for C is the abuse of pointers (indirect&lt;br&gt;
addressing, which requires lots of memory traffic).  Memory&lt;br&gt;
traffic is more of a problem than register allocation,&lt;br&gt;
anyway (which you point out too, regarding the stride issue).&lt;br&gt;
&lt;br&gt;
Regarding (2):  Yes, that's definitely true.  One could&lt;br&gt;
write an m-file script that does x=A\b without backslash, in&lt;br&gt;
maybe 50 lines of M (LU factorization if square, QR with&lt;br&gt;
Householder if rectangular).  Backslash itself has maybe&lt;br&gt;
250,000 lines of code (that's a guess, but an educated one&lt;br&gt;
since I wrote about half that).&lt;br&gt;
&lt;br&gt;
Some vector operations are trivial (a = b+c) to write in C&lt;br&gt;
or Fortran.  If you write a=b*c where b and c are matrices,&lt;br&gt;
then there's no way you'll match the performance of an optimized&lt;br&gt;
BLAS library.&lt;br&gt;
&lt;br&gt;
Rule of thumb: if the work is O(n) where n is the size of&lt;br&gt;
the data, then there's a decent chance that simple C or&lt;br&gt;
Fortran code can match (not beat) MATLAB.  If the work is&lt;br&gt;
higher than O(n) then you probably can't beat MATLAB with&lt;br&gt;
simple C.  Matrix add fits in the former category; matrix&lt;br&gt;
multiply doesn't.&lt;br&gt;
&lt;br&gt;
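To make the rule of thumb concrete, here is a rough sketch of the two&lt;br&gt;
categories in simple C. The function names and the column-major layout&lt;br&gt;
are assumptions for the example, not code from MATLAB or any library:&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```c
/* a = b + c, elementwise: O(n) work on O(n) data. One pass over
   memory, so simple code like this has a decent chance of keeping up. */
void vec_add(double *a, const double *b, const double *c, unsigned n)
{
    unsigned i;

    for (i = 0; i != n; ++i)
        a[i] = b[i] + c[i];
}

/* A = B*C for n-by-n column-major matrices: O(n^3) work on O(n^2) data.
   This naive triple loop is correct, but it does no cache blocking,
   which is where an optimized BLAS pulls far ahead. */
void matmul_naive(double *A, const double *B, const double *C, unsigned n)
{
    unsigned i, j, k;

    for (j = 0; j != n; ++j)
        for (i = 0; i != n; ++i) {
            double s = 0.0;
            for (k = 0; k != n; ++k)
                s += B[i + k * n] * C[k + j * n];
            A[i + j * n] = s;
        }
}
```
&lt;/pre&gt;
&lt;br&gt;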
You can always call the BLAS / LAPACK yourself, in the dense&lt;br&gt;
case, or use available C code for the sparse case.  Lots of&lt;br&gt;
the code in x=a*b, x=A\b, etc is open source.</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 13:55:30 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406449</link>
      <author>Steve Amphlett</author>
      <description>Phil Winder &amp;lt;philipwinder@googlemail.com&amp;gt; wrote in message &lt;br&gt;
&amp;lt;eb177713-6655-4454-bbf6-92d2c91bb6a6@s19g2000prg.googlegroups.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
If you can work &quot;in place&quot; you'll get a 10+ speedup:&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; myfunc(x);&lt;br&gt;
&lt;br&gt;
rather than&lt;br&gt;
&lt;br&gt;
x=myfunc(x);&lt;br&gt;
&lt;br&gt;
All that memory allocation and copying is a waste.</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 14:47:25 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406463</link>
      <author>Phil Winder</author>
      <description>On Dec 17, 1:42 pm, &quot;Tim Davis&quot; &amp;lt;da...@cise.ufl.edu&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
Tim: Thanks for the info, I think the way to go is to look into the&lt;br&gt;
open source libraries you talk about. Presumably someone has done all&lt;br&gt;
this before, so I don't think it should be too hard to find the&lt;br&gt;
libraries I am looking for.&lt;br&gt;
Steve: Good point, but I am still looking to move away from Matlab&lt;br&gt;
code.&lt;br&gt;
&lt;br&gt;
Thanks,&lt;br&gt;
&lt;br&gt;
Phil Winder</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 21:22:28 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406543</link>
      <author>Steven G. Johnson</author>
      <description>On Dec 17, 8:42 am, &quot;Tim Davis&quot; &amp;lt;da...@cise.ufl.edu&amp;gt; wrote:&lt;br&gt;
&amp;gt; &amp;gt; 1. There are issues related to the language syntax that&lt;br&gt;
&amp;gt; makes Fortran&lt;br&gt;
&amp;gt; &amp;gt; particularly easy to optimize for compilers, such as lack&lt;br&gt;
&amp;gt; of pointer&lt;br&gt;
&amp;gt; &amp;gt; aliasing. This is particularly important for optimal&lt;br&gt;
&amp;gt; allocation of&lt;br&gt;
&amp;gt; &amp;gt; registers when the CPU goes into a tight loop.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; Regarding (1):  I write in C and I haven't found (1) to be&lt;br&gt;
&amp;gt; that much of an issue (although I do worry about it and it's&lt;br&gt;
&amp;gt; well worth it for you to mention here).  I think the more&lt;br&gt;
&amp;gt; recent versions of gcc are able to work around this issue.&lt;br&gt;
&amp;gt; More serious for C is the abuse of pointers (indirect&lt;br&gt;
&amp;gt; addressing, which requires lots of memory traffic).  Memory&lt;br&gt;
&amp;gt; traffic is more of a problem than register allocation,&lt;br&gt;
&amp;gt; anyway (which you point out too, regarding the stride issue)..&lt;br&gt;
&lt;br&gt;
The old canard about pointer aliasing semantics being weaker in C than&lt;br&gt;
in Fortran hasn't been an issue even in principle for almost 10 years&lt;br&gt;
now, since the 1999 C standard introduced the &quot;restrict&quot; keyword.  In&lt;br&gt;
practice, I've never found it to be a major practical issue in highly&lt;br&gt;
optimized code, since for key loops you often want to partially unroll&lt;br&gt;
them yourself anyway, and in any case higher-level memory-access&lt;br&gt;
patterns are usually more important for performance.&lt;br&gt;
&lt;br&gt;
Regarding the &quot;abuse of pointers&quot; I'm not sure what you're talking&lt;br&gt;
about.  Array access in C, properly implemented, requires no more or&lt;br&gt;
less pointer indirection than in Fortran or any other language.&lt;br&gt;
&lt;br&gt;
It's a good learning exercise, by the way, to implement a matrix&lt;br&gt;
multiply yourself and compare it to a fast BLAS implementation.  Even&lt;br&gt;
if you turn off things like SSE2 instructions, it is probably a factor&lt;br&gt;
of 6 faster than your first try, for a decent-sized matrix.  On the&lt;br&gt;
other hand, matrix multiplication is simple enough that it's not *too*&lt;br&gt;
hard to get at least reasonably close to a fast BLAS if you have some&lt;br&gt;
notion of what you are doing.  (I had a class once a few years ago&lt;br&gt;
where there was a contest to write a dgemm as fast as possible, and at&lt;br&gt;
least one student beat the fastest free BLAS at the time for at least&lt;br&gt;
one matrix size.)&lt;br&gt;
&lt;br&gt;
I once had an old Fortran programmer remark to me, &quot;A matrix multiply&lt;br&gt;
is just three loops!  How many possible ways can there be to implement&lt;br&gt;
it?&quot;  Recently, I told that story to an old compiler engineer, and he&lt;br&gt;
immediately responded &quot;Six ways (3 factorial), and I once wrote a&lt;br&gt;
compiler that automatically found the best loop order.&quot;  The correct&lt;br&gt;
answer (neglecting exotic algorithms like Strassen etc. that no one&lt;br&gt;
uses) is closer to n^3 factorial, since the n^3 multiplications all&lt;br&gt;
commute.  Programming was simpler when floating-point arithmetic&lt;br&gt;
dominated the runtime and all you had to worry about was the operation&lt;br&gt;
count.&lt;br&gt;
&lt;br&gt;
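As an illustration of the loop-order point, here is a sketch of just one&lt;br&gt;
of the six orderings (j-k-i), chosen so that the innermost loop has unit&lt;br&gt;
stride for column-major storage. The name matmul_jki and the layout are&lt;br&gt;
assumptions for the example; a tuned BLAS goes much further with blocking:&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```c
/* C += A*B for n-by-n column-major matrices, loop order j-k-i.
   The caller must zero C first to get a plain product.
   The inner loop walks one column of A and one column of C
   contiguously, unlike the textbook i-j-k order. */
void matmul_jki(double *C, const double *A, const double *B, unsigned n)
{
    unsigned i, j, k;

    for (j = 0; j != n; ++j)
        for (k = 0; k != n; ++k) {
            const double b = B[k + j * n];   /* reused across the inner loop */
            for (i = 0; i != n; ++i)
                C[i + j * n] += A[i + k * n] * b;
        }
}
```
&lt;/pre&gt;
&lt;br&gt;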
Regards,&lt;br&gt;
Steven G. Johnson</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 22:03:40 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406546</link>
      <author>sturlamolden</author>
      <description>On 17 Des, 22:22, &quot;Steven G. Johnson&quot; &amp;lt;stev...@alum.mit.edu&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
&amp;gt; The old canard about pointer aliasing semantics being weaker in C than&lt;br&gt;
&amp;gt; in Fortran hasn't been an issue even in principle for almost 10 years&lt;br&gt;
&amp;gt; now, since the 1999 C standard introduced the &quot;restrict&quot; keyword.  I&lt;br&gt;
&lt;br&gt;
That is true. But notice that most C code is still written as ANSI C.&lt;br&gt;
C++ does not allow the restrict keyword either.&lt;br&gt;
&lt;br&gt;
Microsoft's C compiler does not support ISO C (aka C99). But the most&lt;br&gt;
recent Microsoft C compiler supports the restrict keyword as an&lt;br&gt;
extension to ANSI C. Previously one would have to use compiler switch&lt;br&gt;
'/Oa' or '#pragma optimize(&quot;a&quot;, on)' to assume no aliasing in MSVC. In&lt;br&gt;
GCC one would use the GNU extension __restrict__ to ANSI C, unless&lt;br&gt;
compiling with -std=c99 in which case restrict would be defined. So&lt;br&gt;
GCC would often require non-standard syntax, and MSVC would not allow&lt;br&gt;
control of aliasing at the level of single variables. One would&lt;br&gt;
then end up with C code cluttered with preprocessor conditionals to&lt;br&gt;
allow compilation on more than a single platform.&lt;br&gt;
&lt;br&gt;
A typical pathological case in ANSI C and ISO C++ would be:&lt;br&gt;
&lt;br&gt;
double *c, *a, *b;&lt;br&gt;
int i, n;&lt;br&gt;
/* initialize pointers and n */&lt;br&gt;
for (i=0; i&amp;lt;n;i++)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;*c++ = *a++ + *b++; /* aliasing? */&lt;br&gt;
&lt;br&gt;
Which is easily solved in ISO C:&lt;br&gt;
&lt;br&gt;
typedef double *restrict arrayptr;&lt;br&gt;
arrayptr a, b, c;&lt;br&gt;
int i, n;&lt;br&gt;
/* initialize pointers and n */&lt;br&gt;
for (i=0; i&amp;lt;n;i++)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;*c++ = *a++ + *b++; /* no aliasing */</description>
    </item>
    <item>
      <pubDate>Mon, 17 Dec 2007 22:50:01 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406551</link>
      <author>roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)</author>
      <description>In article &amp;lt;9f100466-2e16-48cd-a4df-5c5cab65def4@l32g2000hse.googlegroups.com&amp;gt;,&lt;br&gt;
sturlamolden  &amp;lt;sturlamolden@yahoo.no&amp;gt; wrote:&lt;br&gt;
&amp;gt;On 17 Des, 22:22, &quot;Steven G. Johnson&quot; &amp;lt;stev...@alum.mit.edu&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; The old canard about pointer aliasing semantics being weaker in C than&lt;br&gt;
&amp;gt;&amp;gt; in Fortran hasn't been an issue even in principle for almost 10 years&lt;br&gt;
&amp;gt;&amp;gt; now, since the 1999 C standard introduced the &quot;restrict&quot; keyword.  I&lt;br&gt;
&lt;br&gt;
&amp;gt;That is true. But notice that most C code is still written as ANSI C. C&lt;br&gt;
&amp;gt;++ does not allow the restrict keyword either.&lt;br&gt;
&lt;br&gt;
&amp;gt;Microsoft's C compiler does not support ISO C (aka C99).&lt;br&gt;
&lt;br&gt;
I believe you are slightly confused about the C standards.&lt;br&gt;
&lt;br&gt;
In 1989, ANSI published ANSI X3.159-1989, &quot;Programming Language - C&quot;.&lt;br&gt;
In 1990, ISO adopted X3.159-1989, mostly just renumbering&lt;br&gt;
some sections of the standard document. C89 and C90 denote&lt;br&gt;
essentially the same language and are spoken of interchangeably&lt;br&gt;
even in the standards-fussy newsgroup comp.lang.c .&lt;br&gt;
&lt;br&gt;
In 1999, ISO published ISO/IEC 9899:1999. In 2000, ANSI adopted&lt;br&gt;
the ISO 1999 standard.&lt;br&gt;
&lt;br&gt;
*Officially* the C89 and C90 standards are &quot;obsolete&quot;, and the 1999&lt;br&gt;
ISO standard (also adopted by ANSI) was &quot;C&quot; [*]. So &quot;ANSI C&quot;&lt;br&gt;
and &quot;ISO C&quot; refer to the same standard, the 1999 standard (plus TCs).&lt;br&gt;
And even before the 1999 standard was published, &quot;ANSI C&quot; and &quot;ISO C&quot;&lt;br&gt;
were so close that people only distinguish them when talking about&lt;br&gt;
the section numbers of the relevant documents.&lt;br&gt;
&lt;br&gt;
When people wish to distinguish between the 1989/1990 standard&lt;br&gt;
and the 1999 standard, to say something such as that most C code&lt;br&gt;
is still written to the 1989/1990 standard, they would normally&lt;br&gt;
refer to C89 (or, less often, C90) and C99.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
[*] &quot;was&quot; because &quot;C&quot; is currently the 1999 standard together&lt;br&gt;
with some technical amendments, Technical Corrigendum 1 (TC1, 2001)&lt;br&gt;
and Technical Corrigendum 2 (TC2, 2004).&lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://www.open-std.org/jtc1/sc22/wg14/www/standards&quot;&gt;http://www.open-std.org/jtc1/sc22/wg14/www/standards&lt;/a&gt;&lt;br&gt;
-- &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;History is a pile of debris&quot;                     -- Laurie Anderson</description>
    </item>
    <item>
      <pubDate>Tue, 18 Dec 2007 11:25:50 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406597</link>
      <author>Tim Davis</author>
      <description>See my replies, interleafed below (for a definition of&lt;br&gt;
interleaf posting, see&lt;br&gt;
&lt;a href=&quot;http://www.cise.ufl.edu/~davis/Horror_matrices.html#composting&quot;&gt;http://www.cise.ufl.edu/~davis/Horror_matrices.html#composting&lt;/a&gt;&lt;br&gt;
&amp;nbsp;&amp;nbsp;)&lt;br&gt;
&lt;br&gt;
&quot;Steven G. Johnson&quot; &amp;lt;stevenj@alum.mit.edu&amp;gt; wrote in message&lt;br&gt;
&amp;lt;825523e3-b124-44a4-b82f-7b01b3495029@f3g2000hsg.googlegroups.com&amp;gt;...&lt;br&gt;
&amp;gt; On Dec 17, 8:42 am, &quot;Tim Davis&quot; &amp;lt;da...@cise.ufl.edu&amp;gt; wrote:&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; 1. There are issues related to the language syntax that makes Fortran&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; particularly easy to optimize for compilers, such as lack of pointer&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; aliasing. This is particularly important for optimal allocation of&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; registers when the CPU goes into a tight loop.&lt;br&gt;
&amp;gt; &amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Regarding (1):  I write in C and I haven't found (1) to be&lt;br&gt;
&amp;gt; &amp;gt; that much of an issue (although I do worry about it and it's&lt;br&gt;
&amp;gt; &amp;gt; well worth it for you to mention here).  I think the more&lt;br&gt;
&amp;gt; &amp;gt; recent versions of gcc are able to work around this issue.&lt;br&gt;
&amp;gt; &amp;gt; More serious for C is the abuse of pointers (indirect&lt;br&gt;
&amp;gt; &amp;gt; addressing, which requires lots of memory traffic).  Memory&lt;br&gt;
&amp;gt; &amp;gt; traffic is more of a problem than register allocation,&lt;br&gt;
&amp;gt; &amp;gt; anyway (which you point out too, regarding the stride issue).&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; The old canard about pointer aliasing semantics being weaker in C than&lt;br&gt;
&amp;gt; in Fortran hasn't been an issue even in principle for almost 10 years&lt;br&gt;
&amp;gt; now, since the 1999 C standard introduced the &quot;restrict&quot; keyword.  In&lt;br&gt;
&amp;gt; practice, I've never found it to be a major practical issue in highly&lt;br&gt;
&amp;gt; optimized code, since for key loops you often want to partially unroll&lt;br&gt;
&amp;gt; them yourself anyway, and in any case higher-level memory-access&lt;br&gt;
&amp;gt; patterns are usually more important for performance.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Regarding the &quot;abuse of pointers&quot; I'm not sure what you're talking&lt;br&gt;
&amp;gt; about.  Array access in C, properly implemented, requires no more or&lt;br&gt;
&amp;gt; less pointer indirection than in Fortran or any other language.&lt;br&gt;
&lt;br&gt;
Right - I agree with you completely.&lt;br&gt;
&lt;br&gt;
For &quot;abuse of pointers&quot;, I mean data structures that use an&lt;br&gt;
unnecessary amount of indirection (pointers to pointers to&lt;br&gt;
pointers to ...).  I mean that &quot;C gives you enough rope to&lt;br&gt;
hang yourself&quot;.  Yes, simple arrays require no more or less&lt;br&gt;
indirection than any other language.&lt;br&gt;
&lt;br&gt;
&amp;gt; It's a good learning exercise, by the way, to implement a matrix&lt;br&gt;
&amp;gt; multiply yourself and compare it to a fast BLAS implementation.  Even&lt;br&gt;
&amp;gt; if you turn off things like SSE2 instructions, it is probably a factor&lt;br&gt;
&amp;gt; of 6 faster than your first try, for a decent-sized matrix.  On the&lt;br&gt;
&amp;gt; other hand, matrix multiplication is simple enough that it's not *too*&lt;br&gt;
&amp;gt; hard to get at least reasonably close to a fast BLAS if you have some&lt;br&gt;
&amp;gt; notion of what you are doing.  (I had a class once a few years ago&lt;br&gt;
&amp;gt; where there was a contest to write a dgemm as fast as possible, and at&lt;br&gt;
&amp;gt; least one student beat the fastest free BLAS at the time for at least&lt;br&gt;
&amp;gt; one matrix size.)&lt;br&gt;
&lt;br&gt;
Yes, that is a good exercise.  It's a lot more difficult&lt;br&gt;
than it looks.&lt;br&gt;
&lt;br&gt;
&amp;gt; I once had an old Fortran programmer remark to me, &quot;A matrix multiply&lt;br&gt;
&amp;gt; is just three loops!  How many possible ways can there be to implement&lt;br&gt;
&amp;gt; it?&quot;  Recently, I told that story to an old compiler engineer, and he&lt;br&gt;
&amp;gt; immediately responded &quot;Six ways (3 factorial), and I once wrote a&lt;br&gt;
&amp;gt; compiler that automatically found the best loop order.&quot;&lt;br&gt;
&lt;br&gt;
That's hilarious!&lt;br&gt;
&lt;br&gt;
&amp;gt; The correct&lt;br&gt;
&amp;gt; answer (neglecting exotic algorithms like Strassen etc. that no one&lt;br&gt;
&amp;gt; uses) is closer to n^3 factorial, since the n^3 multiplications all&lt;br&gt;
&amp;gt; commute.  Programming was simpler when floating-point arithmetic&lt;br&gt;
&amp;gt; dominated the runtime and all you had to worry about was the operation&lt;br&gt;
&amp;gt; count.&lt;br&gt;
&lt;br&gt;
Yup, I would guess n^3 factorial, maybe more because you can&lt;br&gt;
do a flop in so many ways (fused mult-adds, SSE3 or not, etc).&lt;br&gt;
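&lt;br&gt;
To make the &quot;six ways&quot; concrete, here is a minimal sketch of the&lt;br&gt;
textbook triple loop with a selectable loop order. It is in Python for&lt;br&gt;
brevity rather than C or Fortran, and the names matmul and all_orders&lt;br&gt;
are hypothetical, chosen only for this illustration:&lt;br&gt;
&lt;pre&gt;
```python
from itertools import permutations, product


def matmul(a, b, order="ijk"):
    """Multiply nested-list matrices with the three loops nested in `order`."""
    # Loop extents: i indexes rows of a, j columns of b, k the shared dimension.
    extent = {"i": len(a), "j": len(b[0]), "k": len(b)}
    c = [[0.0] * extent["j"] for _ in range(extent["i"])]
    # product() varies its last range fastest, so order[0] is the outermost loop.
    for triple in product(*(range(extent[axis]) for axis in order)):
        at = dict(zip(order, triple))
        i, j, k = at["i"], at["j"], at["k"]
        c[i][j] += a[i][k] * b[k][j]
    return c


# The 3! = 6 possible nestings of the three loops.
all_orders = ["".join(p) for p in permutations("ijk")]
```
&lt;/pre&gt;
All six orders compute the same product; what changes is the order in&lt;br&gt;
which memory is touched, which is where a tuned BLAS spends its effort.&lt;br&gt;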
&lt;br&gt;
A similar question I sometimes get:&lt;br&gt;
&lt;br&gt;
&quot;Gaussian elimination is just a few loops, how many lines of&lt;br&gt;
code can it possibly take?&quot; ... backslash includes probably&lt;br&gt;
250,000 lines of code (C and Fortran; an educated guess,&lt;br&gt;
since I wrote about half of it but haven't seen the other&lt;br&gt;
half).  It can be done in maybe 20 or so lines of code in C&lt;br&gt;
or Fortran, in a naive implementation of Gaussian&lt;br&gt;
elimination with partial pivoting, but then it will be 10 or&lt;br&gt;
20 times slower than x=A\b in the dense case, and quite&lt;br&gt;
literally up to millions of times slower in the sparse case.&lt;br&gt;
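&lt;br&gt;
For reference, a naive dense version along those lines might look like&lt;br&gt;
the sketch below. It is in Python for brevity rather than C or Fortran,&lt;br&gt;
the name solve_dense is hypothetical, and it makes no attempt at&lt;br&gt;
blocking, sparsity, or any of the other things that make backslash fast:&lt;br&gt;
&lt;pre&gt;
```python
def solve_dense(a, b):
    """Solve a x = b by naive Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: pick the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```
&lt;/pre&gt;
For example, solve_dense([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])&lt;br&gt;
solves a small 3-by-3 system whose exact solution is x = (2, 3, -1).&lt;br&gt;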
&lt;br&gt;
Matrix multiply is not quite so extreme, but not far off. &lt;br&gt;
Readers, if they're curious, should take a look at the ATLAS&lt;br&gt;
or Goto BLAS source code (both are available).  They are&lt;br&gt;
quite lengthy codes - but very fast.&lt;br&gt;
&lt;br&gt;
Ditto for FFT (see FFTW for example).  Fast codes are not&lt;br&gt;
(always) short codes; elegant codes are the fast ones, which&lt;br&gt;
are not always short.</description>
    </item>
    <item>
      <pubDate>Tue, 18 Dec 2007 11:55:58 -0500</pubDate>
      <title>Re: Matlab Vectorisation Speed - How is it done in c++?</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/160993#406600</link>
      <author>Tim Davis</author>
      <description>&quot;Tim Davis&quot; &amp;lt;davis@cise.ufl.edu&amp;gt; wrote in message &lt;br&gt;
...&lt;br&gt;
&amp;gt; Ditto for FFT (see FFTW for example).  Fast codes are not&lt;br&gt;
&amp;gt; (always) short codes; elegant codes are the fast ones, which&lt;br&gt;
&amp;gt; are not always short.&lt;br&gt;
&amp;gt; &lt;br&gt;
&lt;br&gt;
Steve - since you and I are clearly on the same page, I was&lt;br&gt;
writing more to the other readers of this thread.  So I&lt;br&gt;
tossed out the example of FFTW as a fast, elegant, but not&lt;br&gt;
short, code.  I know about the FFTW ... but I didn't know&lt;br&gt;
off the top of my head who the authors were.&lt;br&gt;
&lt;br&gt;
Then I looked up the FFTW after I posted my note, just out&lt;br&gt;
of curiosity, and found that you're one of the 2 co-authors.&lt;br&gt;
&lt;br&gt;
So in my reply to you I'm using your own code as an example&lt;br&gt;
... :-D !!</description>
    </item>
  </channel>
</rss>

