scale/normalize parameter vector for optimization

In my optimization problem, the parameters naturally vary by several orders of magnitude because they represent interpolation values of a B-spline function F = spapi(k,x,y). For instance, some y(i) may be close to zero while others are orders of magnitude larger. The step tolerance is just a constant and does not account for these differences in scale between the parameters. In my case, the objective function changes differently w.r.t. changes in the "smaller" y's than in the larger y's, making parameter scaling or normalization potentially beneficial.
1. In machine learning literature, I often encounter [0,1] scaling as a common technique. Would this approach be suitable for my problem as well? Or can you suggest more appropriate scaling techniques given the parameters represent interpolation values?
2. This might be a separate question, but it also relates to parameter scaling/transformation. My Jacobian matrix J (hence the Hessian approximation J^T*J) tends to be poorly conditioned. I have considered switching to a different basis for the parameter vector. So far, my parameters y_i are the multipliers for the standard unit vectors e_i: y = sum_i y_i*e_i. I vaguely recall a discussion where a teacher suggested using the normalized eigenvectors of the Hessian as the unit vectors instead: y = sum_i z_i*v_i, where the v_i are the (normalized) eigenvectors of the Hessian and the z_i are the new parameters.
My questions are: In theory, would parameterization in terms of the eigenvectors be effective in improving the conditioning of the problem? If so, is this approach compatible with the presence of bounds and linear constraints on the parameters?
Thank you!

6 Comments

No. The eigenvectors of the Hessian will not help. You will still have a poorly scaled problem. And the Hessian will surely change depending on where you look, but you must use the same eigenvectors as the optimization moves around the parameter space. So what may have been a choice you like at the starting value for the optimization will not be "right" near the optimum.
Anyway, several orders of magnitude are not going to cause a problem. That is, as long as the word several implies a number like 2 or 3 or 4. If several might indicate 10 or 12 powers of 10, then you need to worry.
Finally, will the transformation be compatible with bounds and linear constraints? In a sense, yes, since the transformation you refer to is linear. HOWEVER, bound constraints will no longer exist. All bound constraints will now be forced to become linear inequality constraints.
@John D'Errico
So can we say that for any linear transformation z=D*y, bound constraints on y translate into linear constraints on z if the matrix D is non-diagonal? Hence, if D is diagonal, "bounds remain bounds"?
Sure. y >= lb gives inv(D)*z >= lb and y <= ub gives inv(D)*z <= ub.
If D is purely diagonal, then yes, bound constraints remain bounds. Otherwise, for non-trivial (not diagonal or permutations thereof) linear transformations, bound constraints transform into linear inequality constraints.
If D is essentially anything but diagonal, then bound constraints become linear inequality constraints, subject to a simple caveat. For example, consider this matrix D:
D = eye(4); D = D([3 1 4 2],:)
D = 4×4
     0     0     1     0
     1     0     0     0
     0     0     0     1
     0     1     0     0
Here we see that D is simply a permutation matrix. Bounds will remain bounds, although you would also need to permute the bounds too.
We could probably state the condition for bounds to remain bounds quite simply: if any row of the transformation forms a linear combination of two or more variables, then bounds will no longer remain bounds.
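As a minimal sketch of the bookkeeping (the D, lb, and ub below are made up for illustration), the transformed bounds can be handed to fmincon as linear inequality constraints in its A*z <= b form:

```matlab
% Sketch: with z = D*y we have y = inv(D)*z, so lb <= y <= ub becomes
% two sets of linear inequalities on z.
D  = [2 1; 0 3];             % hypothetical non-diagonal transformation
lb = [-1; -1];  ub = [1; 1]; % hypothetical bounds on y
Dinv = inv(D);
A = [ Dinv;                  %  inv(D)*z <= ub
     -Dinv];                 % -inv(D)*z <= -lb, i.e. inv(D)*z >= lb
b = [ub; -lb];
% fmincon(fun, z0, A, b) then enforces the original bounds on y = inv(D)*z.
```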

@John D'Errico

No. The eigenvectors of the Hessian will not help. You will still have a poorly scaled problem. And the Hessian will surely change depending on where you look, but you must use the same eigenvectors as the optimization moves around the parameter space. So what may have been a choice you like at the starting value for the optimization will not be "right" near the optimum.

Not sure what you mean here. Suppose we use the eigenvectors of the Hessian at the initial guess as the new unit vectors, and the initial guess is a good guess for the solution. Even then, does this not help to improve the conditioning?

You don't understand. The eigenvalues and eigenvectors of the Hessian matrix are continually changing, and will do so dramatically as you move around the parameter space. In some places it is even likely the eigenvalues change sign, indicating there may be regions of simultaneously positive and negative curvature. As such, those eigenvectors are constantly pointing in very different directions as you move. Is it truly likely your initial guess is such a great guess? I doubt that is true. Why not?
BECAUSE you already told us the problem is poorly conditioned.
A characteristic of poorly conditioned problems is that you have a long valley where, if you move along the bottom of that valley, the objective changes relatively little. I.e., there is some linear combination of the parameters where relatively large changes in that direction do little to the objective. And of course, since the objective is nonlinear, that valley almost always lies along a curve. Not even a long straight valley.
Next, does it improve the conditioning? No. Not at all. You will still have the same fundamental valley to chase along, but as I said, that valley is almost never even a long, perfectly straight one.


 Accepted Answer

1. In machine learning literature, I often encounter [0,1] scaling as a common technique. Would this approach be suitable for my problem as well? Or can you suggest more appropriate scaling techniques given the parameters represent interpolation values?
See equilibrate and balance to start.
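For example (a minimal sketch; the ill-conditioned J below is made up for illustration), equilibrate permutes and rescales a matrix so that the scaled matrix has unit-magnitude diagonal entries and bounded off-diagonal entries, which typically lowers its condition number:

```matlab
% equilibrate returns P, R, C such that B = R*P*J*C is better scaled.
n = 4;
J = sqrtm(hilb(n));          % hypothetical ill-conditioned model matrix
[P,R,C] = equilibrate(J);
B = R*P*J*C;
fprintf('cond(J) = %g, cond(B) = %g\n', cond(J), cond(B));
```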
2. This might be a separate question, but it also relates to parameter scaling/transformation. My Jacobian matrix J (hence the Hessian approximation J^T*J) tends to be poorly conditioned. I have considered switching to a different basis for the parameter vector. So far, my parameters y_i are the multipliers for the standard unit vectors e_i: y = sum_i y_i*e_i. I vaguely recall a discussion where a teacher suggested using the normalized eigenvectors of the Hessian as the unit vectors instead: y = sum_i z_i*v_i, where the v_i are the (normalized) eigenvectors of the Hessian and the z_i are the new parameters.
This probably shows what your teacher suggested, for the unconstrained case. Observe the smaller number of iterations needed for convergence when the basis is well chosen, yielding better conditioning of the Hessian.
% Create ill-conditioned matrix J
n = 4;
J = sqrtm(hilb(n));
xt = rand(n,1)
xt = 4×1
    0.7878
    0.0250
    0.9373
    0.7069
y = J*xt;
x0 = zeros(size(J,2),1);
[x,fval,exitflag,output] = fminunc(@(x)sum((J*x-y).^2),x0);
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
output
output = struct with fields:
       iterations: 28
        funcCount: 145
         stepsize: 0.0113
     lssteplength: 1
    firstorderopt: 2.0514e-06
        algorithm: 'quasi-newton'
          message: 'Local minimum found....'
% Hessian (assuming a linear model, so there is no second-order
% derivative term here)
H = J'*J;
H = 1/2*(H + H'); % Make matrix numerically symmetric
fprintf('cond(H) = %g\n', cond(H));
cond(H) = 15513.7
[V,lambda] = eig(H,'vector');
lambda = reshape(lambda,1,[]);
% New basis (each column is a basis vector, a "normalized" eigenvector
% of H)
B = V*diag(1./sqrt(lambda)); % Careful: we assume H is SPD here in order to be able to select such a basis
% model with respect to coefficients in the new basis
JB = J*B;
% fprintf('cond(JB''*JB) = %g\n', cond(JB'*JB)); % NOTE: close to 1
[xB,fval,exitflag,outputPrecond] = fminunc(@(xB)sum((JB*xB-y).^2),x0);
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
xPrecond = B*xB
xPrecond = 4×1
    0.7878
    0.0250
    0.9373
    0.7069
outputPrecond
outputPrecond = struct with fields:
       iterations: 2
        funcCount: 15
         stepsize: 0.3699
     lssteplength: 1
    firstorderopt: 7.9213e-09
        algorithm: 'quasi-newton'
          message: 'Local minimum found....'
One could make the Hessian matrix better conditioned by changing the basis of x, as shown here, or the basis of y. These correspond to the so-called right and left preconditioning of the model matrix J.
For the constrained case, you need to consider some sort of Hessian projected onto the active set. The topic is large; it is better to consult a good optimization book rather than post a question here.

3 Comments

For the constrained case, you need to consider some sort of Hessian projected onto the active set. The topic is large; it is better to consult a good optimization book rather than post a question here.

Nice demo. For the constrained case, why would the procedure be different? Linear constraints A*x<=b become something like (A*B)*z<=b, and bounds lb<=x become, as mentioned by John, linear constraints too: lb<=B*z. Can you comment on how the projection onto the active set may be realized in MATLAB?

Sorry, I won't go into the constrained case; it is too long to explain in a general framework, and the recipe is too case-specific. You should learn it from somewhere else (e.g., books), or just let the engine do the work for you.
Can you recommend a book where such transformations are described? I like Numerical Optimization by Nocedal and Wright because it is designed for practitioners in engineering, though I cannot find material like what you presented in this answer.


More Answers (2)

Matt J on 7 Oct 2024
Edited: Matt J on 7 Oct 2024
The transformation is non-differentiable, and probably not a good idea for Optimization Toolbox optimizers, which are smooth solvers. In machine learning, smoothness doesn't matter so much (a) because stochastic gradient solvers are typically used, which aren't as influenced by non-differentiability, and (b) because precise minimization isn't really a high priority in machine learning.
A common transformation strategy is to take the diagonal of the Hessian, or of its approximation J.'*J, at the initial point x0 and use it to pre-normalize the variables. The idea is to scale things so that the Hessian diagonals at x0 are approximately equal to 1. This strategy is more effective the more diagonally dominant the Hessian is and the better the initial guess x0.
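A minimal sketch of that strategy, on a made-up model where the Hessian approximation is diagonal (so the stipulated diagonal-dominance assumption holds exactly):

```matlab
% Hypothetical least-squares model with badly scaled columns of J.
J  = [1 0; 0 100];               % columns differ by a factor of 100 (made up)
y  = J*[2; 0.03];
H  = J.'*J;                      % Gauss-Newton Hessian approximation
S  = diag(1./sqrt(diag(H)));     % scaling so that diag(S*H*S) == 1
fprintf('cond(H) = %g, cond(S*H*S) = %g\n', cond(H), cond(S*H*S));
% Optimize over u with x = S*u, then map back to the original variables:
u = fminunc(@(u) sum((J*(S*u)-y).^2), zeros(2,1));
x = S*u;
```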

9 Comments

Why is the above transformation non-differentiable? Since the function to be interpolated has a positive domain and is convex, at every iteration we should have min(y) = y(1) and max(y) = y(end).
Your OP mentions nothing that implies either convexity or that the y(i) are monotonic. As far as you've told us, the y(i) are freely varying unknowns. In any case, if you were to rewrite the scaling as y'(i) = (y(i) - y(1))/(y(n) - y(1)), then that indeed would be a differentiable formulation, but it is not clear to me how that improves conditioning. You speculate that you are seeing ill-conditioning as a result of some y(i) having poor scaling relative to other y(j), but the above transformation does not change the relative scaling. It just constrains y'(n)=1, y'(1)=0, and changes all other y(i) by a common factor of 1/(y(n)-y(1)).
Got it. And this strategy ...
The idea is to scale things so that the Hessian diagonals at x0 are approximately equal 1.
... does indeed help to make the problem less ill-conditioned?
Subject to the assumptions I mentioned, yes.
Comparing convergence between no preconditioning, the eigenvector basis, and Matt's unit-diagonal Hessian scaling, I would say the diagonal scaling is not very effective for this test.
It is more relevant when the regression vector x mixes parameters of different "types".
For spline interpolation with a Bernstein basis, I don't think it is required.
% Create ill-conditioned matrix J
n = 4;
J = sqrtm(hilb(n));
xt = rand(n,1)
xt = 4×1
    0.3845
    0.2623
    0.3761
    0.5033
y = J*xt;
x0 = zeros(size(J,2),1);
[x,fval,exitflag,output] = fminunc(@(x)sum((J*x-y).^2),x0);
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
output
output = struct with fields:
       iterations: 19
        funcCount: 115
         stepsize: 0.0013
     lssteplength: 1
    firstorderopt: 3.4981e-07
        algorithm: 'quasi-newton'
          message: 'Local minimum found....'
% Hessian (assuming a linear model, so there is no second-order
% derivative term here)
H = J'*J;
H = 1/2*(H + H'); % Make matrix numerically symmetric
fprintf('cond(H) = %g\n', cond(H));
cond(H) = 15513.7
[V,lambda] = eig(H,'vector');
lambda = reshape(lambda,1,[]);
% New basis
B = V./sqrt(lambda); % Careful: we assume H is SPD here in order to be able to select such a basis
% model with respect to coefficients in the new basis
JB = J*B;
fprintf('cond(JB''*JB) = %g\n', cond(JB'*JB)); % NOTE: close to 1
cond(JB'*JB) = 1
[xB,fval,exitflag,outputPrecond] = fminunc(@(xB)sum((JB*xB-y).^2),x0);
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
xPrecond = B*xB
xPrecond = 4×1
    0.3845
    0.2623
    0.3761
    0.5033
outputPrecond
outputPrecond = struct with fields:
       iterations: 2
        funcCount: 15
         stepsize: 0.1799
     lssteplength: 1
    firstorderopt: 3.5342e-09
        algorithm: 'quasi-newton'
          message: 'Local minimum found....'
% Matt's diagonal scaling
% Hessian is independent of x0 for a linear model
BD = diag(1./sqrt( diag(H)));
JBD = J*BD;
HD1 = JBD'*JBD; % diagonal == 1
[xBD,fval,exitflag,outputdH] = fminunc(@(xBD)sum((JBD*xBD-y).^2),x0);
Local minimum found. Optimization completed because the size of the gradient is less than the value of the optimality tolerance.
xD = BD*xBD
xD = 4×1
    0.3846
    0.2622
    0.3764
    0.5031
outputdH
outputdH = struct with fields:
       iterations: 18
        funcCount: 110
         stepsize: 4.1073e-04
     lssteplength: 1
    firstorderopt: 1.7428e-06
        algorithm: 'quasi-newton'
          message: 'Local minimum found....'
I would say the diagonal scaling is not very effective for this test.
Naturally. Your chosen J strongly violates the assumptions I stipulated.
The Hessian diagonal normalization is equivalent to normalizing each column of the model matrix J to unit norm:
BD = diag(1./vecnorm(J,2,1));
JBD = J*BD;
[xBD,fval,exitflag,outputdH] = fminunc(@(xBD)sum((JBD*xBD-y).^2),x0);
xD = BD*xBD
But the assumption was that J.'*J was diagonally dominant.
I did read that and understood.
In practice I have rarely, if ever, encountered a diagonally dominant Hessian; the Hessian's form is something we undergo rather than choose. The most diagonally dominant example that comes to mind spontaneously is the Laplace/Poisson equation. It usually already has a constant diagonal, and people still derive more sophisticated preconditioning techniques for it (e.g., scaled FFT bases, multigrid methods, ...).
I simply showed the convergence results and how to code such preconditioning using the model matrix J, without forming the Hessian H, on a simple example.
Feel free to give an example where the assumption you stipulate holds and the scaling is effective.


An interesting idea for generic "preconditioning" of B-spline approximation is working with the dual Bernstein basis, described in Lu's paper here.

1 Comment

Nice work, but it addresses the topic of finding a nice B-spline basis, not a basis for the representation of the parameter vector.


Asked: on 6 Oct 2024
Edited: on 8 Oct 2024
