Path: news.mathworks.com!not-for-mail
From: "Andy Robb" <ajrobb@hotmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Speeding up sum of squares
Date: Thu, 8 May 2008 14:49:05 +0000 (UTC)
Organization: The MathWorks, Inc.
Lines: 74
Message-ID: <fvv3t1$a44$1@fred.mathworks.com>
References: <fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com> <fvq0qr$7tv$1@fred.mathworks.com> <1v2124dbkuukhhprui7a8ss4vhmtgn3jcg@4ax.com> <fvq421$p13$1@fred.mathworks.com> <fvtkg4$foh$1@fred.mathworks.com> <fvtllj$n9b$1@fred.mathworks.com> <fvtvo2$n92$1@fred.mathworks.com> <fvu0m3$hjm$1@fred.mathworks.com> <fvupsb$4fo$1@fred.mathworks.com>
Reply-To: "Andy Robb" <ajrobb@hotmail.com>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1210258145 10372 172.30.248.37 (8 May 2008 14:49:05 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Thu, 8 May 2008 14:49:05 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1374918
Xref: news.mathworks.com comp.soft-sys.matlab:467408



"Tim Davis" <davis@cise.ufl.edu> wrote in message
<fvupsb$4fo$1@fred.mathworks.com>...
> It's likely to use loop unrolling as well, such as:
> 
> #include "mex.h"
> void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
> mxArray *prhs[])
> {
>     double w, x, y, z;
>     register double sumsquares = 0.0;
>     double *dp;
>     register int i, n;
> 
>     dp = mxGetPr( prhs[0] );
>     n = mxGetNumberOfElements( prhs[0] );
>     for( i=0; i<n; i += 4 )
>     {
>         w = dp [i] ;
>         x = dp [i+1] ;
>         y = dp [i+2] ;
>         z = dp [i+3] ;
>         sumsquares += w * w + x * x + y * y + z * z ;
>     }
>     plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
>     *mxGetPr( plhs[0] ) = sumsquares;
> }
> 
> which only works if n is divisible by 4 (I didn't include
> the clean up).
> 
> On my linux laptop using gcc (MATLAB 7.6, gcc 4.2.3):
> 
> the code above:
> >> tic;f=s3(x);toc
> Elapsed time is 0.524779 seconds.
> 
> >> tic;g=x'*x;toc
> Elapsed time is 0.543561 seconds.
> 
> your code:
> >> tic;e=s1(x);toc
> Elapsed time is 0.888125 seconds.
> 
> Loop unrolling is a common trick.
> 
> Don't forget that when you're timing things you need to run
> the code at least twice; the first time, the code itself
> gets loaded into memory.
> 
> gcc is smart enough not to need "register" statements.  It
> ignores them.  Visual C++ is probably not as smart.

With gcc you might want to try -march=athlon64 -mfpmath=sse
as this seems better able to use the pipeline (use could use
-march=pentium4 instead of athlon64).

Also, bear in mind that once you overflow your cache, main
memory can slow your code down by a factor of 6. Multiply on
modern processors is much faster than main memory and
probably even faster than primary cache. This might go some
way towards explaining why an intermediate array is so slow.

You might also like to try some of the following:

  /* comparison to zero can halve loop overhead */
  while ((n -= 4) >= 0) {
    /* don't bother copying the values - just use them */
    sumsquares += dp[0]*dp[0] + dp[1]*dp[1] + dp[2]*dp[2] +
dp[3]*dp[3];
    dp += 4;
  }

  /* finish up with trailing values */
  for (n = n & 3; --n >= 0;) sumsquares += dp[n] * dp[n];