
Thread Subject:
Speeding up sum of squares

Subject: Speeding up sum of squares

From: richardstartz@comcast.net

Date: 6 May, 2008 16:11:51

Message: 1 of 13

I have a problem where most of the computational time is spent
computing

sum(x.^2)

where x is a vector of length between 100 and 10,000.

It seems I can speed up computation some by using

x'*x

instead.

Any thoughts or suggestions on further speed improvement?
Thanks.
-Dick Startz

Subject: Speeding up sum of squares

From: John D'Errico

Date: 6 May, 2008 16:26:04

Message: 2 of 13

richardstartz@comcast.net wrote in message
<fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com>...
> I have a problem where most of the computational time is spent
> computing
>
> sum(x.^2)
>
> where x is a vector of length between 100 and 10,000.
>
> It seems I can speed up computation some by using
>
> x'*x
>
> instead.
>
> Any thoughts or suggestions on further speed improvement?
> Thanks.
> -Dick Startz

As always on these problems, it's the little things
that sneak up behind you.

Is x real? Is x a column vector or a row vector?

If you are confident that you always know the answers
to these questions, then yes, you can
save some time by using the blas to do this dot
product.

x = rand(5000,1);

timeit(@() sum(x.*x))
ans =
   0.00045792

timeit(@() dot(x,x))
ans =
   0.00057047

timeit(@() x'*x)
ans =
   7.4318e-05

John

Subject: Speeding up sum of squares

From: Bruno Luong

Date: 6 May, 2008 16:29:04

Message: 3 of 13

richardstartz@comcast.net wrote in message
<fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com>...

>
> Any thoughts or suggestions on further speed improvement?

norm(x)^2

Bruno

Subject: Speeding up sum of squares

From: richardstartz@comcast.net

Date: 6 May, 2008 16:52:33

Message: 4 of 13

On Tue, 6 May 2008 16:26:04 +0000 (UTC), "John D'Errico"
<woodchips@rochester.rr.com> wrote:

>richardstartz@comcast.net wrote in message
><fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com>...
>> I have a problem where most of the computational time is spent
>> computing
>>
>> sum(x.^2)
>>
>> where x is a vector of length between 100 and 10,000.
>>
>> It seems I can speed up computation some by using
>>
>> x'*x
>>
>> instead.
>>
>> Any thoughts or suggestions on further speed improvement?
>> Thanks.
>> -Dick Startz
>
>As always on these problems, it's the little things
>that sneak up behind you.
>
>Is x real? Is x a column vector or a row vector?
>
>If you are confident that you always know the answers
>to these questions, then yes, you can
>save some time by using the blas to do this dot
>product.
>
>x = rand(5000,1);
>
>timeit(@() sum(x.*x))
>ans =
> 0.00045792
>
>timeit(@() dot(x,x))
>ans =
> 0.00057047
>
>timeit(@() x'*x)
>ans =
> 7.4318e-05
>
>John

Thank you, John.

I have real column vectors, so it looks like your last suggestion will
speed things up by nearly an order of magnitude. That would be really
great.

If you have a minute, could you explain what it is that @() x'*x does
and why it makes such a difference, or point me to the right stuff to
read?

-Dick Startz

PS If it matters, I'm running 2007B on a dual processor Windows
machine.

PPS What's timeit()?

Subject: Speeding up sum of squares

From: John D'Errico

Date: 6 May, 2008 17:21:05

Message: 5 of 13

richardstartz@comcast.net wrote in message
<1v2124dbkuukhhprui7a8ss4vhmtgn3jcg@4ax.com>...

> Thank you, John.
>
> I have real column vectors, so it looks like your last suggestion will
> speed things up by nearly an order of magnitude. That would be really
> great.
>
> If you have a minute, could you explain what it is that @() x'*x does
> and why it makes such a difference, or point me to the right stuff to
> read?
>
> -Dick Startz
>
> PS If it matters, I'm running 2007B on a dual processor Windows
> machine.
>
> PPS What's timeit()?

timeit is a function from the file exchange,
written by Steve Eddins. It's a better way to
time code fragments than using tic and toc.

The product x'*x uses blas routines to speed
up the inner product. They are clearly much
more highly optimized than is the code
matlab produces for sum(x.^2). Bruno
pointed out that norm(x)^2 also is fairly fast.
I assume that it too is well optimized.
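
A minimal way to compare all three on your own machine (a
sketch; it assumes the File Exchange timeit is on your path,
and the actual numbers will vary with hardware and release):

x = rand(5000,1);            % real column vector, as in your problem
t1 = timeit(@() sum(x.^2));  % elementwise square, then sum
t2 = timeit(@() x'*x);       % BLAS inner product
t3 = timeit(@() norm(x)^2);  % Bruno's suggestion
[t1 t2 t3]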

John

Subject: Speeding up sum of squares

From: richardstartz@comcast.net

Date: 6 May, 2008 17:32:45

Message: 6 of 13

On Tue, 6 May 2008 17:21:05 +0000 (UTC), "John D'Errico"
<woodchips@rochester.rr.com> wrote:

>richardstartz@comcast.net wrote in message
><1v2124dbkuukhhprui7a8ss4vhmtgn3jcg@4ax.com>...
>
>> Thank you, John.
>>
>> I have real column vectors, so it looks like your last suggestion will
>> speed things up by nearly an order of magnitude. That would be really
>> great.
>>
>> If you have a minute, could you explain what it is that @() x'*x does
>> and why it makes such a difference, or point me to the right stuff to
>> read?
>>
>> -Dick Startz
>>
>> PS If it matters, I'm running 2007B on a dual processor Windows
>> machine.
>>
>> PPS What's timeit()?
>
>timeit is a function from the file exchange,
>written by Steve Eddins. It's a better way to
>time code fragments than using tic and toc.
>
>The product x'*x uses blas routines to speed
>up the inner product. They are clearly much
>more highly optimized than is the code
>matlab produces for sum(x.^2). Bruno
>pointed out that norm(x)^2 also is fairly fast.
>I assume that it too is well optimized.
>
>John
>
Thanks again. I got it now.
-Dick Startz

Subject: Speeding up sum of squares

From: Tim Davis

Date: 8 May, 2008 01:20:04

Message: 7 of 13

"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fvq421$p13$1@fred.mathworks.com>...

> The product x'*x uses blas routines to speed
> up the inner product. They are clearly much
> more highly optimized than is the code
> matlab produces for sum(x.^2). Bruno
> pointed out that norm(x)^2 also is fairly fast.
> I assume that it too is well optimized.
>
> John

I'd hazard a guess that it's not a BLAS performance
difference. There really isn't a lot of benefit in
performance of the level-1 BLAS as compared with plain code,
given modern optimizing compilers.

It's probably because MATLAB is computing z=x.^2, and
storing z as a new vector, internally. Then it sums it up.
The creation of z takes more memory traffic (8*n bytes
written and then read back in). I would be quite surprised
if MATLAB were smart enough not to form the vector z,
but when it does x'*x it knows there is no intermediate to form.

Just a guess, but it's backed up with this experiment:

>> x = rand (1e6,1);
>> tic;y=x'*x;toc
Elapsed time is 0.004829 seconds.

>> tic;y=sum(x.^2);toc
Elapsed time is 0.017520 seconds.

>> tic;z=x.^2;y=sum(z);toc
Elapsed time is 0.018008 seconds.

And, try this:


>> clear
>> x=rand(200e6,1);
>> tic;y=x'*x;toc
Elapsed time is 0.538544 seconds.

While the above command was running, the memory usage of
MATLAB didn't go up (I was watching it).

>>
>> tic;y=sum(x.^2);toc
??? Out of memory. Type HELP MEMORY for your options.

It's cool what you can learn about how MATLAB works
internally just by looking at error messages.
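
As a quick check on the size of that temporary (a sketch;
whos just reports what MATLAB allocates for z):

x = rand(1e6,1);
z = x.^2;      % the temporary that the sum(x.^2) form creates internally
whos x z       % each is 8e6 bytes, so the temporary adds ~8 MB of traffic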

Subject: Speeding up sum of squares

From: John D'Errico

Date: 8 May, 2008 01:40:03

Message: 8 of 13

"Tim Davis" <davis@cise.ufl.edu> wrote in message
<fvtkg4$foh$1@fred.mathworks.com>...

> I'd hazard a guess that it's not a BLAS performance
> difference. There really isn't a lot of benefit in
> performance of the level-1 BLAS as compared with plain code,
> given modern optimizing compilers.
>
> It's probably because MATLAB is computing z=x.^2, and
> storing z as a new vector, internally. Then it sums it up.
> The creation of z takes more memory traffic (8*n bytes
> written and then read back in). I would be quite surprised
> if MATLAB were smart enough not to form the vector z,
> but when it does x'*x it knows there is no intermediate to form.

Of course. This makes much sense.

John

Subject: Speeding up sum of squares

From: James Tursa

Date: 8 May, 2008 04:32:02

Message: 9 of 13

"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fvtllj$n9b$1@fred.mathworks.com>...
> "Tim Davis" <davis@cise.ufl.edu> wrote in message
> <fvtkg4$foh$1@fred.mathworks.com>...
>
> > I'd hazard a guess that it's not a BLAS performance
> > difference. There really isn't a lot of benefit in
> > performance of the level-1 BLAS as compared with plain code,
> > given modern optimizing compilers.
> >
> > It's probably because MATLAB is computing z=x.^2, and
> > storing z as a new vector, internally. Then it sums it up.
> > The creation of z takes more memory traffic (8*n bytes
> > written and then read back in). I would be quite surprised
> > if MATLAB were smart enough not to form the vector z,
> > but when it does x'*x it knows there is no intermediate to form.
>
> Of course. This makes much sense.
>
> John

I had thought of the large intermediate array overhead also,
but to check whether this was the only thing going on I made the
following test runs with a 10,000,000-element double array:

(1) Matrix Multiply, BLAS calls in background

>> tic;x'*x;toc
Elapsed time is 0.028451 seconds.

(2) Element-by-element multiply, large intermediate array
first, sum function probably has BLAS calls

>> tic;sum(x.*x);toc
Elapsed time is 0.202210 seconds.

(3) Norm function, BLAS calls in background?

>> tic;norm(x)^2;toc
Elapsed time is 0.088459 seconds.

(4) Mex routine, straightforward

>> tic;sumsquares(x);toc
Elapsed time is 0.165403 seconds.

(5) Mex routine, form large intermediate array first

>> tic;sumsquares2(x);toc
Elapsed time is 0.336897 seconds.

I find two things interesting about these results.

First, the (4) Mex routine (see below), which did a
straightforward sum of squares without forming a large
intermediate array first, was still *much* slower than (1),
indicating that the BLAS calls indeed seem to have a
significant effect on this calculation, as John suggested.
Maybe the only difference is keeping the intermediate
results in registers. I don't know, and I didn't try to
force this in my mex routine.

Second, the difference between (1) and (2), about 0.1738
seconds, was very close to the difference between (4) and
(5), about 0.1715 seconds. This seems to independently
confirm that the extra time in (2) is mainly the cost of the
large intermediate array, as Tim suggested. Probably the sum
function uses BLAS calls as well.

One of the worries I would have with (3) is the potential
loss of precision doing the sqrt and then the square again.
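
(A quick sketch of that check; for well-scaled data the
relative difference is typically only a few eps, but it is
worth verifying on your own data:)

x = rand(10000,1);
s1 = sum(x.^2);
s2 = norm(x)^2;
relerr = abs(s1 - s2)/s1   % typically on the order of eps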

James Tursa

 ------------------------------------------

(4) Mex routine, straightforward

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
mxArray *prhs[])
{
    double sumsquares = 0.0;
    double *dp;
    int i, n;

    dp = mxGetPr( prhs[0] );                 /* pointer to the input data */
    n = mxGetNumberOfElements( prhs[0] );
    for( i=0; i<n; i++ )
    {
        sumsquares += (*dp) * (*dp);         /* accumulate x(i)^2 directly */
        dp++;
    }
    plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
    *mxGetPr( plhs[0] ) = sumsquares;        /* return the scalar result */
}

 ------------------------------------------

(5) Mex routine, form large intermediate array first

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
mxArray *prhs[])
{
    mxArray *mx;
    double sumsquares = 0.0;
    double *dp, *dx;
    int i, n;

    dp = mxGetPr( prhs[0] );
    n = mxGetNumberOfElements( prhs[0] );
    mx = mxCreateDoubleMatrix( n, 1, mxREAL );   /* the large intermediate array */
    dx = mxGetPr( mx );
    for( i=0; i<n; i++ )                         /* first pass: write x(i)^2 to memory */
    {
        *dx++ = (*dp) * (*dp);
        dp++;
    }
    dx = mxGetPr( mx );
    for( i=0; i<n; i++ )                         /* second pass: read it back and sum */
    {
        sumsquares += *dx++;
    }
    plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
    *mxGetPr( plhs[0] ) = sumsquares;
    mxDestroyArray( mx );
}

Subject: Speeding up sum of squares

From: James Tursa

Date: 8 May, 2008 04:48:03

Message: 10 of 13

"James Tursa" <aclassyguywithaknotac@hotmail.com> wrote in
message <fvtvo2$n92$1@fred.mathworks.com>...
> I made the
> following test runs with a 10,000,000 size double array:
>
> (1) Matrix Multiply, BLAS calls in background
>
> >> tic;x'*x;toc
> Elapsed time is 0.028451 seconds.
>

Since it wasn't too difficult I went ahead and forced the
intermediate values to be register variables using Visual
C++ 8.0 (actually the C compiler that comes with it of
course). The results are:

>> tic;sumsquares3(x);toc
Elapsed time is 0.029241 seconds.

This result is almost identical to the x'*x method. So maybe
the only trick behind the scenes here to get the speed
improvement is to use register variables for the sum and
loop indexes.

James Tursa

---------------------------------------------------

(6) Mex routine, straightforward using register variables

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
mxArray *prhs[])
{
    register double sumsquares = 0.0;
    double *dp;
    register int i, n;

    dp = mxGetPr( prhs[0] );
    n = mxGetNumberOfElements( prhs[0] );
    for( i=0; i<n; i++ )
    {
        sumsquares += (*dp) * (*dp);
        dp++;
    }
    plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
    *mxGetPr( plhs[0] ) = sumsquares;
}

Subject: Speeding up sum of squares

From: Tim Davis

Date: 8 May, 2008 11:58:03

Message: 11 of 13

The BLAS is likely to use loop unrolling as well, such as:

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
mxArray *prhs[])
{
    double w, x, y, z;
    register double sumsquares = 0.0;
    double *dp;
    register int i, n;

    dp = mxGetPr( prhs[0] );
    n = mxGetNumberOfElements( prhs[0] );
    for( i=0; i<n; i += 4 )
    {
        w = dp [i] ;
        x = dp [i+1] ;
        y = dp [i+2] ;
        z = dp [i+3] ;
        sumsquares += w * w + x * x + y * y + z * z ;
    }
    plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
    *mxGetPr( plhs[0] ) = sumsquares;
}

which only works if n is divisible by 4 (I didn't include
the cleanup loop for the leftover elements).

On my linux laptop using gcc (MATLAB 7.6, gcc 4.2.3):

the code above:
>> tic;f=s3(x);toc
Elapsed time is 0.524779 seconds.

>> tic;g=x'*x;toc
Elapsed time is 0.543561 seconds.

your code:
>> tic;e=s1(x);toc
Elapsed time is 0.888125 seconds.

Loop unrolling is a common trick.

Don't forget that when you're timing things you need to run
the code at least twice; the first time, the code itself
gets loaded into memory.

gcc is smart enough not to need "register" statements. It
ignores them. Visual C++ is probably not as smart.
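
(On the timing point, that just means discarding the first
measurement; a sketch:)

x = rand(1e6,1);
y = x'*x;             % warm-up run: code and data get loaded/cached
tic; y = x'*x; toc    % time the second (and later) runs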

Subject: Speeding up sum of squares

From: Andy Robb

Date: 8 May, 2008 14:49:05

Message: 12 of 13

"Tim Davis" <davis@cise.ufl.edu> wrote in message
<fvupsb$4fo$1@fred.mathworks.com>...
> It's likely to use loop unrolling as well, such as:
>
> #include "mex.h"
> void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
> mxArray *prhs[])
> {
> double w, x, y, z;
> register double sumsquares = 0.0;
> double *dp;
> register int i, n;
>
> dp = mxGetPr( prhs[0] );
> n = mxGetNumberOfElements( prhs[0] );
> for( i=0; i<n; i += 4 )
> {
> w = dp [i] ;
> x = dp [i+1] ;
> y = dp [i+2] ;
> z = dp [i+3] ;
> sumsquares += w * w + x * x + y * y + z * z ;
> }
> plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
> *mxGetPr( plhs[0] ) = sumsquares;
> }
>
> which only works if n is divisible by 4 (I didn't include
> the clean up).
>
> On my linux laptop using gcc (MATLAB 7.6, gcc 4.2.3):
>
> the code above:
> >> tic;f=s3(x);toc
> Elapsed time is 0.524779 seconds.
>
> >> tic;g=x'*x;toc
> Elapsed time is 0.543561 seconds.
>
> your code:
> >> tic;e=s1(x);toc
> Elapsed time is 0.888125 seconds.
>
> Loop unrolling is a common trick.
>
> Don't forget that when you're timing things you need to run
> the code at least twice; the first time, the code itself
> gets loaded into memory.
>
> gcc is smart enough not to need "register" statements. It
> ignores them. Visual C++ is probably not as smart.

With gcc you might want to try -march=athlon64 -mfpmath=sse,
as this seems better able to use the pipeline (you could use
-march=pentium4 instead of athlon64).
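
(On Linux those flags can be passed through to mex roughly like
this; the exact way to override CFLAGS varies by platform and
MATLAB release, and the file name sumsquares.c is just assumed
here, so treat it as a sketch:)

mex CFLAGS="\$CFLAGS -march=athlon64 -mfpmath=sse" sumsquares.c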

Also, bear in mind that once you overflow your cache, going to
main memory can slow your code down by a factor of 6. A multiply
on modern processors is much faster than a main-memory access,
and probably faster than even a primary-cache access. This might
go some way towards explaining why forming an intermediate array
is so slow.

You might also like to try some of the following:

  /* comparison to zero can halve loop overhead */
  while ((n -= 4) >= 0) {
    /* don't bother copying the values - just use them */
    sumsquares += dp[0]*dp[0] + dp[1]*dp[1] + dp[2]*dp[2] + dp[3]*dp[3];
    dp += 4;
  }

  /* finish up with trailing values */
  for (n = n & 3; --n >= 0;) sumsquares += dp[n] * dp[n];

Subject: Speeding up sum of squares

From: James Tursa

Date: 8 May, 2008 15:30:55

Message: 13 of 13

"Tim Davis" <davis@cise.ufl.edu> wrote in message
<fvupsb$4fo$1@fred.mathworks.com>...
>
> Don't forget that when you're timing things you need to run
> the code at least twice; the first time, the code itself
> gets loaded into memory.
>

Yes, I did that. I posted typical times after several runs.

> gcc is smart enough not to need "register" statements. It
> ignores them. Visual C++ is probably not as smart.

Good point. I should have worded it better. The "register"
keyword doesn't actually force the compiler to do anything
special ... it is just a hint to the compiler. For instance, the
built-in lcc ignores it and gives the same result with or
without the register keyword.

James Tursa
