
Thread Subject:
min ||Ax-b|| for large sparse A

Subject: min ||Ax-b|| for large sparse A

From: Mithun

Date: 22 Jun, 2010 14:42:05

Message: 1 of 21

I've seen this on several threads in MATLAB Central, but haven't found a suitable solution. Here are some properties of A:

size(A) -> 2061233 2058240
sprank(A) -> 2058240
sprank([A b]) -> 2058241
nnz(A) -> 8232960 ; numel(A) = 4.2425e+012

I'm trying to minimize ||Ax-b||_2^2. The only function that works is lsqr; the rest error out with "Out of Memory".

lsqr is great for some A, but not all, and I wanted to experiment with other methods like \ and fminbnd (since I'm experimenting with constraints on x).

Any ideas?
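For reference, a minimal sketch of the kind of lsqr call that does run at this scale (the tolerance and iteration count here are arbitrary choices):

tol = 1e-8;
maxit = 500;
[x, flag, relres, iter] = lsqr(A, b, tol, maxit);   % needs only A*v and A.'*v products, never A.'*A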

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 22 Jun, 2010 17:01:22

Message: 2 of 21

Sometimes the following iterative algorithm works well:

absA=abs(A);
C=absA*sum(absA,2);

for ii=1:numIter

 X=X-A.'*(A*X-b)./C;

end

Subject: min ||Ax-b|| for large sparse A

From: Jacob

Date: 22 Jun, 2010 17:30:27

Message: 3 of 21

Could you provide some information about this method (name, paper, ...)? Also, if A is of size m x n, then sum(A,2) will be of size m x 1, so I don't understand that step.

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvqq92$q3n$1@fred.mathworks.com>...
> Sometimes the following iterative algorithm works well:
>
> absA=abs(A);
> C=absA*sum(absA,2);
>
> for ii=1:numIter
>
> X=X-A.'*(A*X-b)./C;
>
> end

Subject: min ||Ax-b|| for large sparse A

From: Bruno Luong

Date: 22 Jun, 2010 19:28:21

Message: 4 of 21

"Jacob " <mithunjacob_oohay@yahoo.com> wrote in message <hvqrvj$irq$1@fred.mathworks.com>...
> Could you provide some information about this method (name, paper, ...)? Also, if A is of size m x n, then sum(A,2) will be of size m x 1, so I don't understand that step.

Matt's suggestion is nothing more than the simple (as opposed to conjugate) gradient method with empirical preconditioning. My professors told me never to use it except for academic purposes. In any case it certainly can't compete with LSQR.

Bruno

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 22 Jun, 2010 20:37:20

Message: 5 of 21

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvr2sl$odb$1@fred.mathworks.com>...
> "Jacob " <mithunjacob_oohay@yahoo.com> wrote in message <hvqrvj$irq$1@fred.mathworks.com>...
> > Could you provide some information about this method (name, paper, ...)? Also, if A is of size m x n, then sum(A,2) will be of size m x 1, so I don't understand that step.
==============

Sorry, I meant to write

C=absA.' * sum(absA,2); %note the transpose

So, yes sum(A,2) will be of size m x 1, but absA.' will be compatibly sized for this multiplication and C will be nx1



> Matt's suggestion is nothing more than the simple (as opposed to conjugate) gradient method with empirical preconditioning. My professors told me never to use it except for academic purposes. In any case it certainly can't compete with LSQR.
==================

No, the preconditioning is not empirical. This particular choice of preconditioner C will ensure monotone descent of the algorithm without the need for line searches. You can find some information about it in the paper referenced down at the bottom.

How fast it will descend compared to other algorithms depends a lot on the structure of A. One expects it to be faster for more diagonally concentrated A. The MATLAB documentation doesn't say what algorithm LSQR uses, but judging from a peek inside lsqr.m, the computation per iteration for LSQR seems much greater. So, I don't think it can be a priori obvious which algorithm will perform better.

It's also much easier to add box constraints to this algorithm than to others. For example, positivity constraints can be added with the following simple modification. LSQR cannot, apparently, accommodate any constraints.


for ii=1:numIter

 X=X-A.'*(A*X-b)./C;

 X(X<0)=0;

end
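Putting the corrected C and the projection together, a self-contained sketch would be (the zero initialization and the iteration count here are arbitrary choices):

absA = abs(A);
C = absA.' * sum(absA, 2);          % corrected preconditioner, n x 1
X = zeros(size(A,2), 1);            % any starting point satisfying the constraints
numIter = 100;                      % arbitrary
for ii = 1:numIter
    X = X - A.'*(A*X - b)./C;       % preconditioned gradient step
    X(X < 0) = 0;                   % project onto the positivity constraint
end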


A relevant paper is:

@ARTICLE{
        DePa,
        author = {A R {De Pierro}},
        title = {{On} the relation between the {ISRA} and the {EM} algorithm for positron emission tomography},
        journal = {IEEE Tr. Med. Im.},
        volume = 12,
        number = 2,
        pages = {328--33},
        month = jun,
        year = 1993
}

Subject: min ||Ax-b|| for large sparse A

From: Bruno Luong

Date: 22 Jun, 2010 21:15:25

Message: 6 of 21

Sorry Matt, in optimization the poor performance of the gradient method is one of the first lessons students learn in school. This is no surprise to anyone. Try this example: after 10 iterations your algorithm is still far from converging, whereas a simple conjugate gradient converges in 2 iterations.

A=[100 9;
    9 1];
b=[1; 1];

%%
X = [0; 0];
numIter=10
absA=abs(A);
C=absA*sum(absA,2);
for ii=1:numIter
    X=X-A.'*(A*X-b)./C;
end
A*X-b

% Conjugate gradient

niter = size(b,1); % 2
x = [0;0]
r = b - A*x;
w = -r;
z = A*w;
a = (r'*w)/(w'*z);
x = x + a*w;
B = 0;

for i = 1:niter
    r = r - a*z;
    if( norm(r) < 1e-10 )
        break;
    end
    B = (r'*z)/(w'*z);
    w = -r + B*w;
    z = A*w;
    a = (r'*w)/(w'*z);
    x = x + a*w;
end
A*x-b

% Bruno

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 22 Jun, 2010 23:03:22

Message: 7 of 21

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvr95c$lhj$1@fred.mathworks.com>...
> Sorry Matt, in optimization the poor performance of the gradient method is one of the first lessons students learn in school. This is no surprise to anyone. Try this example: after 10 iterations your algorithm is still far from converging, whereas a simple conjugate gradient converges in 2 iterations.
=================

Not entirely an applicable comparison, Bruno, for the following reasons

(1) Your example favors CG mainly because the dimension N of the problem was small enough here to let you run a full N iterations, at which point theory says CG will attain the solution regardless of how well conditioned the problem is. Normally, however, you would have to stop CG after a number of iterations far fewer than N, in which case how close you come to the solution is more sensitive to the conditioning of the problem.

(2) It's not always true that gradient methods perform poorly. That depends on the choice of preconditioner. Maybe CG converged in 2 iterations, but had we used the pre-conditioner inv(A.'*A), the gradient method would have converged in only 1 iteration.

(3) As I said previously, CG does not modify well to handle constraints, whereas the proposed gradient method accommodates them easily. To modify CG for constraints, you have to employ restarts, which can limit its speed to that of gradient methods.


(4) As I said previously, and for reasons related to (2), the gradient method becomes more competitive when the Hessian is more diagonally concentrated. Competitiveness also depends on accuracy requirements. If you repeat your experiment with, say

A=[100 1; 1 15];

you will find that the gradient method will converge to the solution within a 3% error and within less CPU time than CG.
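Returning to point (2): with the (impractically expensive) preconditioner inv(A.'*A), a single preconditioned gradient step lands exactly on the least-squares solution. A one-line sketch on Bruno's own example, assuming full column rank:

A = [100 9; 9 1]; b = [1; 1];           % Bruno's 2 x 2 example
x0 = [0; 0];
x1 = x0 - (A.'*A) \ (A.'*(A*x0 - b))    % equals A\b, i.e. [-0.4211; 4.7895]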

Subject: min ||Ax-b|| for large sparse A

From: Bruno Luong

Date: 23 Jun, 2010 03:04:05

Message: 8 of 21

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvrffq$fvc$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvr95c$lhj$1@fred.mathworks.com>...
> > Sorry Matt, in optimization the poor performance of the gradient method is one of the first lessons students learn in school. This is no surprise to anyone. Try this example: after 10 iterations your algorithm is still far from converging, whereas a simple conjugate gradient converges in 2 iterations.
> =================
>
> Not entirely an applicable comparison, Bruno, for the following reasons
>
> (1) Your example favors CG mainly because the dimension N of the problem was small enough here to let you run a full N iterations, at which point theory says CG will attain the solution regardless of how well conditioned the problem is. Normally, however, you would have to stop CG after a number of iterations far fewer than N, in which case how close you come to the solution is more sensitive to the conditioning of the problem.

I can't see how your algorithm, which already performs poorly in 2D, would have a better chance in larger dimensions. Both the gradient and conjugate gradient methods at iteration #n try to minimize the cost function projected onto the nth Krylov space; the difference is that one method minimizes it exactly, while the other doesn't and requires many iterations to converge.
 
>
> (2) It's not always true that gradient methods perform poorly. That depends on the choice of preconditioner. Maybe CG converged in 2 iterations, but had we used the pre-conditioner inv(A.'*A), the gradient method would have converged in only 1 iteration.

Come on: preconditioning means finding an approximation of inv(A.'*A) that is *fast* to apply. Applying inv(A.'*A) itself is not doable (and you would already have done all the work, btw).

>
> (3) As I said previously, CG does not modify well to handle constraints, whereas the proposed gradient method accommodates them easily. To modify CG for constraints, you have to employ restarts, which can limit its speed to that of gradient methods.
>

Well, you should take a look at pcg with projection. The gradient method is simply poor, and this is no secret; every optimization textbook I have read explains why no one uses the gradient method.
 
>
> (4) As I said previously, and for reasons related to (2), the gradient method becomes more competitive when the Hessian is more diagonally concentrated. Competitiveness also depends on accuracy requirements. If you repeat your experiment with, say
>
> A=[100 1; 1 15];
>
> you will find that the gradient method will converge to the solution within a 3% error and within less CPU time than CG.

Why not choose A = eye(2) and show that your method converges in 1 iteration? Oh, you already did, with the inv(A'*A) preconditioning thing.

Bruno

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 23 Jun, 2010 12:24:05

Message: 9 of 21

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvrtj5$e7m$1@fred.mathworks.com>...

> I can't see how your algorithm, which already performs poorly in 2D, would have a better chance in larger dimensions. Both the gradient and conjugate gradient methods at iteration #n try to minimize the cost function projected onto the nth Krylov space; the difference is that one method minimizes it exactly, while the other doesn't and requires many iterations to converge.
=================

I didn't say the gradient algorithm would perform better. I said CG will perform worse, if it is not allowed to perform N full iterations, which is a realistic assumption, since in larger dimensional problems it is often not practical to iterate N times. That will put the two algorithms on a more level playing field.

As a test of this, you might rerun your code with the following higher dimensional data, which is also a lot less diagonally concentrated than my last example,

A=diag(1:7)+rand(7); A=A+A.';
N=size(A,1);
b=ones(N,1);
numIter=5;
   nIter=numIter; %for both the gradient algorithm and CG

You will see that the two algorithms perform quite similarly, and sometimes (depending on the output of rand(7)), the gradient algorithm performs better than CG.

Here's what I will agree with, however. I would agree that for unconstrained, quadratic minimization, CG will probably always outperform gradient methods WHEN BOTH USE THE SAME PRECONDITIONER. For some reason, though, you have chosen here to race a preconditioned gradient method against unpreconditioned CG and claim that CG will always be better.




> > (2) It's not always true that gradient methods perform poorly. That depends on the choice of preconditioner. Maybe CG converged in 2 iterations, but had we used the pre-conditioner inv(A.'*A), the gradient method would have converged in only 1 iteration.
>
> Come on: preconditioning means finding an approximation of inv(A.'*A) that is *fast* to apply. Applying inv(A.'*A) itself is not doable (and you would already have done all the work, btw).
===============

True, but my only point was that a well-conditioned gradient algorithm can outperform a poorly conditioned conjugate gradient algorithm.



> > (3) As I said previously, CG does not modify well to handle constraints, whereas the proposed gradient method accommodates them easily. To modify CG for constraints, you have to employ restarts, which can limit its speed to that of gradient methods.
> >
>
> Well, you should take a look at pcg with projection. The gradient method is simply poor, and this is no secret; every optimization textbook I have read explains why no one uses the gradient method.
=====================

I don't see how pcg with projection can avoid the need to restart. The projection operation will kill the conjugacy of the sequence if the algorithm hits a constraint boundary. Once you restart, pcg becomes a lot like a normal gradient algorithm again.

In any case, I've never tried pcg with projection myself, but I know people who have done so and who report that PCG gives no advantage over the gradient algorithms used for that application. This is possibly for the above reasons.

Also, for large-scale nonquadratic minimization, the line searches required by CG are often impractical. This has led a lot of people to opt for gradient methods over CG as well.


> Why not choose A = eye(2) and show that your method converges in 1 iteration? Oh, you already did, with the inv(A'*A) preconditioning thing.
============

My example above for N=7 is a lot less diagonally concentrated. But why was this example so bad? Surely you wouldn't disagree that the Hessian often is diagonally concentrated. In any case, I already conceded that if PCG is given the advantage of the same preconditioner, whether it be C or inv(A'*A), it will likely do better.

Subject: min ||Ax-b|| for large sparse A

From: Bruno Luong

Date: 23 Jun, 2010 13:15:20

Message: 10 of 21

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvsud5$hdt$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvrtj5$e7m$1@fred.mathworks.com>...
>
> > I can't see how your algorithm, which already performs poorly in 2D, would have a better chance in larger dimensions. Both the gradient and conjugate gradient methods at iteration #n try to minimize the cost function projected onto the nth Krylov space; the difference is that one method minimizes it exactly, while the other doesn't and requires many iterations to converge.
> =================
>
> I didn't say the gradient algorithm would perform better. I said CG will perform worse, if it is not allowed to perform N full iterations, which is a realistic assumption, since in larger dimensional problems it is often not practical to iterate N times. That will put the two algorithms on a more level playing field.
>
> As a test of this, you might rerun your code with the following higher dimensional data, which is also a lot less diagonally concentrated than my last example,
>
> A=diag(1:7)+rand(7); A=A+A.';
> N=size(A,1);
> b=ones(N,1);
> numIter=5;
> nIter=numIter; %for both the gradient algorithm and CG
>
> You will see that the two algorithms perform quite similarly, and sometimes (depending on the output of rand(7)), the gradient algorithm performs better than CG.
>
> Here's what I will agree with, however. I would agree that for unconstrained, quadratic minimization, CG will probably always outperform gradient methods WHEN BOTH USE THE SAME PRECONDITIONER. For some reason, though, you have chosen here to race a preconditioned gradient method against unpreconditioned CG and claim that CG will always be better.

No, what I said is that the gradient method fails even in the *very* simple (2 x 2) case, with a condition number of about 500 (not large by any standard) and unconstrained. After 10 iterations it provides a current solution of

% For A=[100 9; 9 1]; b=[1; 1]

X =

    0.0098
    0.0115

instead of

>> A\b

ans =

   -0.4211
    4.7895

So this method is good for the trash can (to me). I certainly won't use such a method if it already hits a wall at this level.


>
> True, but my only point was that a well-conditioned gradient algorithm can outperform a poorly conditioned conjugate gradient algorithm.

I'm not arguing about the benefit of preconditioning. What I'm saying is that the gradient method is bad by itself. And if you think the right diagonal preconditioner can do some miracle compensation, the (2 x 2) example shown above proves it won't.

Of course you can always find cases where it works more or less, such as an identity matrix, or diagonal preconditioning on a diagonally dominant matrix, which makes it look like an identity matrix in the end. It's quite narrow IMO.

Bruno

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 23 Jun, 2010 14:47:05

Message: 11 of 21

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvt1d8$3dc$1@fred.mathworks.com>...

>
> No, what I said is that the gradient method fails even in the *very* simple (2 x 2) case, with a condition number of about 500 (not large by any standard) and unconstrained.
>[snip]
> So this method is good for the trash can (to me). I certainly won't use such a method if it already hits a wall at this level.
==============

But what would your alternatives be? I showed you a 7-dimensional, significantly non-diagonal example where CG does just as badly or worse, and with a condition number of only 10 or so....

If other algorithms hit even harder walls in higher dimensions and other more realistic situations, what are you left with?



> Of course you can always find cases where it works more or less, such as an identity matrix, or diagonal preconditioning on a diagonally dominant matrix, which makes it look like an identity matrix in the end. It's quite narrow IMO.
================

It's not. There are quite a few applications where the Hessian is, even if not perfectly diagonal or even diagonally dominant, as concentrated around the diagonal as my N=7 example.

Subject: min ||Ax-b|| for large sparse A

From: Jacob

Date: 23 Jun, 2010 14:58:04

Message: 12 of 21

Your discussion regarding the matter seems great, and I must admit your simple method works very well for my matrix (which is highly diagonally concentrated).

But I need more data about this method --- could you point me to some proofs regarding C?

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvr6u0$m83$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvr2sl$odb$1@fred.mathworks.com>...
> > "Jacob " <mithunjacob_oohay@yahoo.com> wrote in message <hvqrvj$irq$1@fred.mathworks.com>...
> > > Could you provide some information about this method (name, paper, ...)? Also, if A is size m x n, then sum(A,2) will be of size, m x 1 so, I don't understand that step.
> ==============
>
> Sorry, I meant to write
>
> C=absA.' * sum(absA,2); %note the transpose
>
> So, yes sum(A,2) will be of size m x 1, but absA.' will be compatibly sized for this multiplication and C will be nx1
>
>
>
> > Matt's suggestion is no thing more than the simple (as opposed to conjugate) gradient method with empirical preconditioning. My professors told me never use it beside academic purpose. In any case it certainly can't compete with LSQR.
> ==================
>
> No, the preconditioning is not empirical. This particular choice of preconditioner C will ensure monotone descent of the algorithm without the need for line searches. You can find some information about it in the paper referenced down at the bottom.
>
> How fast it will descend compared to other algorithms depends a lot on the structure of A. One expects it to be faster for more diagonally concentrated A. The MATLAB documentation doesn't say what algorithm LSQR uses, but judging from a peek inside lsqr.m, the computation per iteration for LSQR seems much greater. So, I don't think it can be a priori obvious which algorithm will perform better.
>
> It's also much easier to add box constraints to this algorithm than others. For example, positivity constraints can be added with the following simple modification. LSQR cannot, apparently, accomodate any constraints.
>
>
> for ii=1:numIter
>
> X=X-A.'*(A*X-b)./C;
>
> X(X<0)=0;
>
> end
>
>
> A relevant paper is:
>
> @ARTICLE{
> DePa,
> author = {A R {De Pierro}},
> title = {{On} the relation between the {ISRA} and the {EM} algorithm for positron emission tomography},
> journal = {IEEE Tr. Med. Im.},
> volume = 12,
> number = 2,
> pages = {328--33},
> month = jun,
> year = 1993
> }

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 23 Jun, 2010 15:52:04

Message: 13 of 21

"Jacob " <mithunjacob_oohay@yahoo.com> wrote in message <hvt7ds$ctr$1@fred.mathworks.com>...
> Your discussion regarding the matter seems great, and I must admit your simple method works very well for my matrix (which is highly diagonally concentrated).
>
> But I need more data about this method --- could you point me to some proofs regarding C?
===============

I mentioned this paper earlier in the thread. Did you already check it?

@ARTICLE{
        DePa,
        author = {A R {De Pierro}},
        title = {{On} the relation between the {ISRA} and the {EM} algorithm for positron emission tomography},
        journal = {IEEE Tr. Med. Im.},
        volume = 12,
        number = 2,
        pages = {328--33},
        month = jun,
        year = 1993
}

Subject: min ||Ax-b|| for large sparse A

From: Bruno Luong

Date: 23 Jun, 2010 16:24:04

Message: 14 of 21

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvt6p8$hj$1@fred.mathworks.com>...

> But what would your alternatives be? I showed you a 7-dimensional, significantly non-diagonal example where CG does just as badly or worse, and with a condition number of only 10 or so....

That's not what I observe; consistently, CG beats the gradient method, which performs poorly:

>> t
Exact solution X:
    0.2008
    0.1093
    0.0750
    0.0633
    0.0458
    0.0199
    0.0271


Gradient solution X:
    0.0978
    0.0943
    0.0861
    0.0699
    0.0522
    0.0358
    0.0373

Residual 0.347107:

CG solution X:
Residual x:
    0.2008
    0.1094
    0.0750
    0.0633
    0.0458
    0.0199
    0.0271

Residual 0.000587:
>> t
Exact solution X:
    0.3615
    0.1135
    0.0428
    0.0410
    0.0259
    0.0286
    0.0230


Gradient solution X:
    0.0950
    0.0838
    0.0809
    0.0648
    0.0516
    0.0505
    0.0458

Residual 0.474012:

CG solution X:
Residual x:
    0.3613
    0.1144
    0.0418
    0.0416
    0.0256
    0.0287
    0.0230

Residual 0.009334:
>> t
Exact solution X:
    0.2051
    0.1161
    0.0950
    0.0448
    0.0316
    0.0400
    0.0279


Gradient solution X:
    0.1020
    0.0989
    0.0903
    0.0628
    0.0515
    0.0458
    0.0382

Residual 0.352715:

CG solution X:
Residual x:
    0.2050
    0.1163
    0.0948
    0.0449
    0.0316
    0.0400
    0.0279

Residual 0.001528:
>> t
Exact solution X:
    0.2362
    0.1138
    0.0953
    0.0552
    0.0414
    0.0255
    0.0210


Gradient solution X:
    0.0884
    0.1035
    0.0961
    0.0735
    0.0583
    0.0464
    0.0406

Residual 0.360280:

CG solution X:
Residual x:
    0.2362
    0.1138
    0.0951
    0.0556
    0.0411
    0.0256
    0.0210

Residual 0.005527:
>>

% Here is my code

A=diag(1:7)+rand(7); A=A+A.';
N=size(A,1);
b=ones(N,1);

%%
fprintf('Exact solution X:\n')
disp(A\b);

%%
X = zeros(N,1);
numIter=5;
absA=abs(A);
C=absA*sum(absA,2);
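% (A is symmetric in this test, so absA*sum(absA,2) gives the same C as the corrected absA.'*sum(absA,2))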
for ii=1:numIter
    X=X-A.'*(A*X-b)./C;
end
fprintf('\nGradient solution X:\n')
disp(X);
fprintf('Residual %f:\n', norm(A*X-b))

% Conjugate gradient

niter = numIter;
x = zeros(N,1);
r = b - A*x;
w = -r;
z = A*w;
a = (r'*w)/(w'*z);
x = x + a*w;
B = 0;

for i = 1:niter
    r = r - a*z;
    if( norm(r) < 1e-10 )
        break;
    end
    B = (r'*z)/(w'*z);
    w = -r + B*w;
    z = A*w;
    a = (r'*w)/(w'*z);
    x = x + a*w;
end
fprintf('\nCG solution X:\n')

fprintf('Residual x:\n')
disp(x);
fprintf('Residual %f:\n', norm(A*x-b))


> It's not. There are quite a few applications where the Hessian is, even if not perfectly diagonal or even diagonally dominant, as concentrated around the diagonal as my N=7 example.

If you want to convince us that your algorithm works, you need a much more solid argument than showing one or two examples. But so far I see zero examples where your algorithm really works.

Bruno

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 23 Jun, 2010 19:43:04

Message: 15 of 21

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvtaj4$c1n$1@fred.mathworks.com>...
> "Jacob " <mithunjacob_oohay@yahoo.com> wrote in message <hvt7ds$ctr$1@fred.mathworks.com>...
> > Your discussion regarding the matter seems great, and I must admit your simple method works very well for my matrix (which is highly diagonally concentrated).
> >
> > But I need more data about this method --- could you point me to some proofs regarding C?
> ===============
>
> I mentioned this paper earlier in the thread. Did you already check it?

Sorry, I checked that paper, and it's not precisely what you need. I'll give you my own quick derivation. The objective function is

f(x)=||A*x-b||^2/2;

but for a fixed y it can also be written equivalently as follows

f(x)=(x-y).'*(A.'*A)*(x-y)/2 + (A.'*(A*y-b)).'*(x-y) + ||A*y-b||^2/2;

Now consider the alternative function

Q(x)=(x-y).'*diag(C)*(x-y)/2 + (A.'*(A*y-b)).'*(x-y) + ||A*y-b||^2/2;


Subtracting the two functions we obtain

Q(x) - f(x) = (x-y).' * (diag(C) - A.'*A) * (x-y)/2

I now claim that diag(C) - A.'*A is non-negative definite. It is a simple exercise to prove this using the definition of C and the Gerschgorin circle theorem

http://en.wikipedia.org/wiki/Gerschgorin_disk_theorem

From the non-negative definiteness of diag(C) - A.'*A, the last equation tells us that

Q(x) >= f(x)

for all x with equality at x=y. This means that

f(y) - f(x) >= Q(y) - Q(x)

for all x. This is sometimes called the "optimization transfer principle". The above implies that if Q() is minimized at z then

f(y)-f(z) >= Q(y) - Q(z) >=0

and hence

f(z)<=f(y)


Thus, minimizing Q produces a point z that reduces the objective function from the starting point y. The formula for z is easily obtained by setting the gradient of Q to zero and gives

 z=y-A.'*(A*y-b)./C

which is precisely the recursion formula for the proposed algorithm.

This proves that the algorithm is monotone for this choice of preconditioner C. Proof of convergence can be established using the same theory as for gradient algorithms.
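A quick numerical sanity check of the two claims (a sketch only; the test matrix, density, and iteration count below are arbitrary):

% Arbitrary sparse test problem
m = 400; n = 300;
A = sprandn(m, n, 0.02) + sparse(1:n, 1:n, 1, m, n);  % second term just guarantees no all-zero column
b = randn(m, 1);

absA = abs(A);
C = full(absA.' * sum(absA, 2));       % n x 1

% Claim: diag(C) - A.'*A is non-negative definite
E = diag(C) - full(A.'*A);
min(eig((E + E.')/2))                  % should be >= 0, up to round-off

% Claim: the recursion z = y - A.'*(A*y-b)./C decreases f(x) = ||A*x-b||^2/2 monotonically
x = zeros(n, 1);
f = zeros(50, 1);
for ii = 1:50
    x = x - A.'*(A*x - b)./C;
    f(ii) = norm(A*x - b)^2/2;
end
all(diff(f) <= 1e-9)                   % should display 1 (monotone descent)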

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 24 Jun, 2010 03:18:05

Message: 16 of 21

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvtcf4$epi$1@fred.mathworks.com>...
> "Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvt6p8$hj$1@fred.mathworks.com>...
>
> > But what would your alternatives be? I showed you a 7-dimensional, significantly non-diagonal example where CG does just as badly or worse, and with a condition number of only 10 or so....
>
> That's not what I observe; consistently, CG beats the gradient method, which performs poorly:
 ====================

OK. I had a typo in my code. But regardless, try the code below, which also plots the progress of each algorithm versus CPU time. I consistently find the gradient method to have a significantly faster initial convergence rate, for various choices of problem size N. So at the very least, it seems like using the gradient alg initially and switching to CG later would be a worthwhile hybrid.


N=10;
 
q=1./(1:N).^2;
 W=toeplitz(q,q);
 V=diag(sqrt(1:N));

 
 A= V*W*V;
 b=A*(N:-1:1).';

niter=30;
numIter=niter;
xInit=ones(N,1);
%%

X = xInit;
absA=abs(A);
C=absA.'*sum(absA,2);
AtA=A.'*A;
Atb=A'*b;
Z=eye(N)-diag(1./C)*AtA;
d=Atb./C;

tic;
for ii=1:numIter
    %X=X-(AtA*X-Atb)./C;
    X=Z*X+d;
    resMM(ii)=norm(A*X-b);
end
tMM=toc,



% Conjugate gradient

x = xInit;
r = b - A*x;
w = -r;
z = A*w;
a = (r'*w)/(w'*z);
x = x + a*w;
B = 0;

tic,
for i = 1:niter
    r = r - a*z;
    B = (r'*z)/(w'*z);
    w = -r + B*w;
    z = A*w;
    a = (r'*w)/(w'*z);
    x = x + a*w;
    resCG(i)=norm(A*x-b);
end
tCG=toc,


plot((1:numIter)/numIter*tMM,resMM, 'o-', (1:niter)/niter*tCG, resCG,'*--');
legend('Gradient','CG')
xlabel 'CPU Time'
ylabel 'Cost'
figure(gcf)

Subject: min ||Ax-b|| for large sparse A

From: Bruno Luong

Date: 24 Jun, 2010 06:33:25

Message: 17 of 21

"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvuipd$qre$1@fred.mathworks.com>...
>I consistently find the gradient method to have a significantly faster initial convergence rate, for various choices of problem size N. So at the very least, it seems like using the gradient alg initially and switching to CG later would be a worthwhile hybrid.

Sigh... Drawing a hasty conclusion can be dangerous.

It has been proven that for the same matrix, the same rhs, and the same starting point:
1. CG = GM at the first iteration.
2. CG must be superior to GM from the 2nd iteration on.
3. The conjugacy built by CG is global over all iterations, even though only the last direction appears explicitly in the update.

So making a hybrid method will destroy the global conjugacy and degrade the performance of CG.

What you observe is the effect of preconditioning and not the superiority of GM; it is mathematically impossible for GM to provide a faster cost decrease, assuming everything else is equal. That's the fact.

Now we can apply the same diagonal preconditioning you propose for GM to CG, and CG will be superior to GM again; and there might even be better preconditioners out there, such as incomplete SOR or incomplete Cholesky factorization, etc., depending on the type of problem.
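For instance, something along these lines (a sketch only; the tolerance and iteration count are arbitrary), using MATLAB's pcg on the normal equations with that same diagonal scaling:

n = size(A, 2);
absA = abs(A);
C = full(absA.' * sum(absA, 2));          % the diagonal scaling from the gradient method
M = spdiags(C, 0, n, n);                  % used as a diagonal preconditioner for pcg
x = pcg(@(v) A.'*(A*v), A.'*b, 1e-8, 200, M);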

Bruno

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 24 Jun, 2010 14:34:23

Message: 18 of 21

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hvuu7l$2ms$1@fred.mathworks.com>...
> "Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvuipd$qre$1@fred.mathworks.com>...
> >I consistently find the gradient method to have a significantly faster initial convergence rate, for various choices of problem size N. So at the very least, it seems like using the gradient alg initially and switching to CG later would be a worthwhile hybrid.
>
> Sigh... Drawing a hasty conclusion can be dangerous.
>
> It has been proven that for the same matrix, the same rhs, and the same starting point:
> 1. CG = GM at the first iteration.
> 2. CG must be superior to GM from the 2nd iteration on.
> 3. The conjugacy built by CG is global over all iterations, even though only the last direction appears explicitly in the update.
>
> So making a hybrid method will destroy the global conjugacy and degrade the performance of CG.
>
> What you observe is the effect of preconditioning and not the superiority of GM; it is mathematically impossible for GM to provide a faster cost decrease, assuming everything else is equal. That's the fact.
===============

Assuming everything else to be equal is not practical, however. The CPU time of the two methods, in particular, would never be equal. Even though CG may have a faster rate of convergence, the lower computational cost per iteration of the gradient method helps it stay ahead of CG in the race initially. It is that, and not just the preconditioning, which accounts for the faster initial descent of the gradient method in these tests.

This becomes even more of a factor when you move away from unconstrained, quadratic problems. In nonquadratic problems, the line searches required by CG can no longer be done analytically and become far more CPU-time consuming. Also, the faster descent of CG is only asymptotic and doesn't necessarily occur in the initial iterations. (That, incidentally, is when the hybrid algorithm I was talking about becomes a logical thing to do; see the sketch below.) Furthermore, even the asymptotic superiority of CG is only guaranteed by the theory if the solution is an interior point.
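A minimal sketch of the kind of hybrid I mean (assuming A, b, C, and xInit from the test script earlier in the thread; the number of warm-up steps is an arbitrary choice):

x = xInit;
for ii = 1:5                                      % a few cheap preconditioned-gradient steps
    x = x - A.'*(A*x - b)./C;
end
% hand the warm start to CG; here MATLAB's pcg on the normal equations
x = pcg(@(v) A.'*(A*v), A.'*b, 1e-8, 50, [], [], x);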

Subject: min ||Ax-b|| for large sparse A

From: Jacob

Date: 1 Jul, 2010 12:05:08

Message: 19 of 21

Thanks for the proof, but do you have any citable material? Papers/textbooks/etc?


"Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvto48$8k$1@fred.mathworks.com>...
> "Matt J " <mattjacREMOVE@THISieee.spam> wrote in message <hvtaj4$c1n$1@fred.mathworks.com>...
> > "Jacob " <mithunjacob_oohay@yahoo.com> wrote in message <hvt7ds$ctr$1@fred.mathworks.com>...
> > > Your discussion regarding the matter seems great, and I must admit your simple method works very well for my matrix (which is highly diagonalized).
> > >
> > > But I need more data about this method --- could you point me to some proofs regarding C?
> > ===============
> >
> > I mentioned this paper earlier in the thread. Did you already check it?
>
> Sorry, I checked that paper, and it's not precisely what you need. I'll give you my own quick derivation. The objective function is
>
> f(x)=||A*x-b||^2/2;
>
> but for a fixed y it can also be written equivalently as follows
>
> f(x)=(x-y).'*(A.'*A)*(x-y)/2 + A.'*(A*y-b)*(x-y)+ ||A*y-b||^2/2;
>
> Now consider the alternative function
>
> Q(x)=(x-y).'*diag(C)*(x-y)/2 + A.'*(A*y-b)*(x-y)+||A*y-b||^2/2;
>
>
> Subtracting the two functions we obtain
>
> Q(x) - f(x) = (x-y).' * (diag(C) - A.'*A) * (x-y)/2
>
> I now claim that diag(C) - A.'*A is non-negative definite. It is a simple exercise to prove this using the definition of C and the Gerschgorin circle theorem
>
> http://en.wikipedia.org/wiki/Gerschgorin_disk_theorem
>
> From the non-negative definiteness of diag(C) - A.'*A, the last equation tells us that
>
> Q(x) >= f(x)
>
> for all x with equality at x=y. This means that
>
> f(y) - f(x) >= Q(y) - Q(x)
>
> for all x. This is sometimes called the "optimization transfer principle". The above implies that if Q() is minimized at z then
>
> f(y)-f(z) >= Q(y) - Q(z) >=0
>
> and hence
>
> f(z)<=f(y)
>
>
> Thus, minimizing Q produces a point z that reduces the objective function from the starting point y. The formula for z is easily obtained by setting the gradient of Q to zero and gives
>
> z=y-A.'*(A*y-b)./C
>
> which is precisely the recursion formula for the proposed algorithm.
>
> This proves that the algorithm is monotone for this choice of preconditioner C. Proof of convergence can be established using the same theory as for gradient algorithms.

Subject: min ||Ax-b|| for large sparse A

From: Matt J

Date: 2 Jul, 2010 14:53:05

Message: 20 of 21

"Jacob " <foobar@yahoo.com> wrote in message <i0i09k$99m$1@fred.mathworks.com>...
> Thanks for the proof, but do you have any citable material? Papers/textbooks/etc?
==========

See Example 7 in the following article. It's the same algorithm, but the article gives a different derivation than mine.

@ARTICLE{
        Lan00,
        author = {K. Lange and D. R. Hunter and I. Yang},
        title = {{Optimization} transfer using surrogate objective functions},
        journal = {J. Computational and Graphical Stat.},
        volume = 9,
        number = 1,
        pages = {1--20},
        month = mar,
        year = 2000
}

Subject: min ||Ax-b|| for large sparse A

From: PAW

Date: 20 Aug, 2013 14:20:33

Message: 21 of 21

Dear Jacob, Matt and Bruno,

I just read your interesting conversation and decided to create an account and give my point of view since this is an interesting and recurrent question.

I must say I agree fully with Matt: simple methods such as gradient descent may perform better (at least in the first few iterations) than more mature methods such as LSQR.

There is currently little or no theory explaining the practical success of these methods, but the differences are very clear empirically. For theoretical results, you could have a look at recent papers by Nesterov, Nemirovskii, and other big shots in optimization. The conclusion is unanimous: for very large to huge-scale problems, simple methods may outperform the traditional methods found in textbooks by a huge amount.

A striking example is randomized coordinate descent methods, which usually outperform full gradient methods in many scenarios and even have a justification in infinite dimensions, where full gradient methods have no meaningful convergence rates.
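As a minimal sketch of what such a method can look like for min ||Ax-b||^2 (the problem size, density, and number of sweeps below are arbitrary):

% Randomized coordinate descent, maintaining the residual r = A*x - b
m = 2000; n = 1500;
A = sprandn(m, n, 1e-3) + sparse(1:n, 1:n, 1, m, n);   % second term avoids all-zero columns
b = randn(m, 1);

colnorm2 = full(sum(A.^2, 1)).';         % squared column norms
x = zeros(n, 1);
r = A*x - b;
for k = 1:20*n
    j = randi(n);                        % pick a coordinate uniformly at random
    g = A(:,j).' * r;                    % partial derivative of 0.5*||A*x-b||^2 w.r.t. x(j)
    d = -g / colnorm2(j);                % exact minimization along coordinate j
    x(j) = x(j) + d;
    r = r + d * A(:,j);                  % residual update costs O(nnz(A(:,j)))
end
norm(A*x - b)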

So now, as a professor of maths, and contrary to my colleagues, I take some time teaching the details of basic stuff such as gradient methods, since large-scale problems arise more and more often and good methods for solving them are usually very different from what you find in standard linear algebra textbooks.

Best regards,
Pierre
