scale/normalize parameter vector for optimization
In my optimization problem, the parameters naturally vary by several orders of magnitude because they represent interpolation values of a B-spline function F = spapi(k,x,y). For instance, some of the y(i) may be close to zero while others are orders of magnitude larger. The step tolerance is just a constant and does not account for these differences in scale between the parameters. In my case, the objective function changes differently w.r.t. changes in the "smaller" y's than in the larger y's, making parameter scaling or normalization potentially beneficial.
1. In machine learning literature, I often encounter min-max scaling to [0,1], y'(i) = (y(i) - min(y)) / (max(y) - min(y)), as a common technique. Would this approach be suitable for my problem as well? Or can you suggest more appropriate scaling techniques, given that the parameters represent interpolation values?
2. This might be a separate question, but it also relates to parameter scaling/transformation. My Jacobian matrix J (hence the Hessian approx J'*J) tends to be poorly conditioned. I have considered switching to a different basis for the parameter vector. So far, my parameters y(i) are multipliers for the standard unit vectors e(i):
y = sum_i y(i)*e(i).
I vaguely recall a discussion where a teacher suggested using the normalized eigenvectors of the Hessian as the basis vectors:
y = sum_i z(i)*v(i),
where the v(i) are the (normalized) eigenvectors of the Hessian and the z(i) are the new parameters.
My questions are: In theory, would parameterization in terms of the eigenvectors be effective in improving the conditioning of the problem? If so, is this approach compatible with the presence of bounds and linear constraints on the parameters?
Thank you!
6 Comments
John D'Errico
on 6 Oct 2024
Edited: John D'Errico
on 6 Oct 2024
No. The eigenvectors of the Hessian will not help. You will still have a poorly scaled problem. And the Hessian will surely change depending on where you look, but you must use the same eigenvectors as the optimization moves around the parameter space. So what may have been a choice you like at the starting value for the optimization will not be "right" near the optimum.
Anyway, several orders of magnitude are not going to cause a problem. That is, as long as the word several implies a number like 2 or 3 or 4. If several might indicate 10 or 12 powers of 10, then you need to worry.
Finally, will the transformation be compatible with bounds and linear constraints? In a sense, yes, since the transformation you refer to is linear. HOWEVER, bound constraints will no longer exist. All bound constraints will now be forced to become linear inequality constraints.
SA-W
on 6 Oct 2024
Torsten
on 6 Oct 2024
Sure. y >= lb gives inv(D)*z >= lb and y <= ub gives inv(D)*z <= ub.
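A quick NumPy sketch of this point (the diagonal D, bounds, and test point below are made up for illustration): for a diagonal D with positive entries, z = D*y turns the bounds lb <= y <= ub into the componentwise bounds D*lb <= z <= D*ub, so they stay simple bounds:

```python
import numpy as np

# Hypothetical diagonal scaling and bounds (illustrative values only)
d = np.array([1.0, 10.0, 100.0])   # diagonal of D, all positive
D = np.diag(d)
lb = np.array([0.0, -1.0, 5.0])
ub = np.array([1.0,  1.0, 9.0])

y = np.array([0.5, 0.0, 7.0])      # a point satisfying lb <= y <= ub
z = D @ y                          # transformed parameters

# y >= lb  <=>  inv(D) z >= lb  <=>  z >= D lb (componentwise, since d > 0)
assert np.all(z >= d * lb) and np.all(z <= d * ub)
print("bounds remain simple bounds under a positive diagonal D")
```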
John D'Errico
on 6 Oct 2024
Edited: John D'Errico
on 7 Oct 2024
If D is purely diagonal, then yes, bound constraints remain bounds. For essentially anything but a diagonal matrix (or a permutation thereof), bound constraints transform into linear inequality constraints. For example, consider this matrix D:
D = eye(4); D = D([3 1 4 2],:)
Here we see that D is simply a permutation matrix. Bounds will remain bounds, although you would also need to permute the bounds too.
We could probably state the condition on whether bounds remain bounds quite simply: if any row of D forms a linear combination of two or more variables, then the bounds no longer remain simple bounds.
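That condition is easy to check numerically; here is a NumPy sketch (bounds chosen arbitrarily) using the permutation matrix from the MATLAB snippet above:

```python
import numpy as np

# D = eye(4); D = D([3 1 4 2],:) from the MATLAB snippet, in 0-based indexing
D = np.eye(4)[[2, 0, 3, 1], :]

lb = np.array([0.0, 1.0, 2.0, 3.0])   # illustrative lower bounds
y = lb + 0.5                          # a point satisfying y >= lb
z = D @ y

# z is just a reordering of y, so the permuted bounds D*lb still bound z
# componentwise -- bounds remain bounds:
assert np.all(z >= D @ lb)

# In contrast, a D that mixes two variables gives inv(D) with off-diagonal
# entries, so y >= lb couples the components of z: a genuine linear
# inequality constraint, not a bound.
D2 = np.array([[1.0, 1.0],
               [0.0, 1.0]])
invD2 = np.linalg.inv(D2)
assert np.any(invD2 - np.diag(np.diag(invD2)))
```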
SA-W
on 7 Oct 2024
John D'Errico
on 7 Oct 2024
You don't understand. The eigenvalues and eigenvectors of the hessian matrix are continually changing, and will be doing so dramatically as you move around the parameter space. In some places, it is even likely the eigenvalues change sign, indicating there may be regions of both simultaneously positive and negative curvature. As such, those eigenvectors are constantly pointing in very different directions as you move. Is it truly likely your initial guess is such a great guess? I doubt that is true. Why not?
BECAUSE you already told us the problem is poorly conditioned.
A characteristic of poorly conditioned problems is that you have a long valley where, if you move along the bottom of that valley, the objective changes relatively little. I.e., there is some linear combination of the parameters where relatively large changes in that direction do little to the objective. And of course, since the objective is nonlinear, that valley almost always lies along a curve. Not even a long straight valley.
Next, does it improve the conditioning? No. Not at all. You will still have the same fundamental valley to chase along, but as I said, almost always, that valley is not even a long, perfectly straight one.
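To see how much those eigenvectors move, here is a NumPy sketch on the Rosenbrock function (a standard curved-valley test problem, not the OP's objective), comparing the "soft" Hessian eigenvector at two points along the valley floor y = x^2:

```python
import numpy as np

# Hessian of the Rosenbrock function f = (1-x)^2 + 100*(y - x^2)^2
def rosen_hess(x, y):
    return np.array([[1200*x**2 - 400*y + 2, -400*x],
                     [-400*x,                 200.0]])

# eigh returns eigenvalues in ascending order; column 0 is the "soft"
# direction of smallest curvature, pointing along the valley.
va = np.linalg.eigh(rosen_hess(-1.0, 1.0))[1][:, 0]
vb = np.linalg.eigh(rosen_hess( 1.0, 1.0))[1][:, 0]

# The soft directions at the two points differ by a large angle, so an
# eigenbasis fixed at the starting point is wrong elsewhere in the valley.
angle = np.degrees(np.arccos(min(1.0, abs(va @ vb))))
print(f"angle between soft directions: {angle:.1f} degrees")
assert angle > 30
```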
Accepted Answer
The transformation y' = (y - min(y)) / (max(y) - min(y)) is non-differentiable, and is probably not a good idea for Optimization Toolbox optimizers, which are smooth solvers. In machine learning, smoothness doesn't matter so much (a) because stochastic gradient solvers are typically used, which aren't as influenced by non-differentiability, and (b) because precise minimization isn't really a high priority in machine learning.
A common transformation strategy would be to take the diagonal of the Hessian, or of its approximation J'*J, at the initial point x0 and use it to pre-normalize the variables. The idea is to scale things so that the Hessian diagonals at x0 are approximately equal to 1. This strategy is more effective the more diagonally dominant the Hessian is, and the better the initial guess x0.
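A minimal NumPy sketch of this diagonal pre-normalization (the model matrix J below is fabricated with deliberately mismatched column scales):

```python
import numpy as np

rng = np.random.default_rng(0)
# Fabricated linear model with badly scaled columns (illustration only)
J = rng.standard_normal((20, 3)) * np.array([1.0, 1e3, 1e-3])

H = J.T @ J                        # Gauss-Newton Hessian approximation
s = 1.0 / np.sqrt(np.diag(H))      # scale factors from the Hessian diagonal
Js = J * s                         # re-scaled model, i.e. J @ diag(s)
Hs = Js.T @ Js

assert np.allclose(np.diag(Hs), 1.0)           # diagonals are now ~1
assert np.linalg.cond(Hs) < np.linalg.cond(H)  # conditioning improves here
print(f"cond before: {np.linalg.cond(H):.3g}, after: {np.linalg.cond(Hs):.3g}")
```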
9 Comments
SA-W
on 7 Oct 2024
Your OP mentions nothing that implies either convexity or that the y(i) are monotonic. As far as you've told us, the y(i) are freely varying unknowns. In any case, if you were to rewrite it as
y'(i) = (y(i) - y(1)) / (y(n) - y(1)),
then that would indeed be a differentiable formulation, but it is not clear to me how that improves conditioning. You speculate that you are seeing ill-conditioning as a result of some y(i) having poor scaling relative to other y(j), but the above transformation does not change the relative scaling. It just constrains y'(n)=1 and y'(1)=0, and changes all other y(i) by a common factor of 1/(y(n)-y(1)).
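The "relative scaling is unchanged" point is easy to verify numerically; a NumPy sketch with made-up y values:

```python
import numpy as np

# Made-up interpolation values with poor relative scaling
y = np.array([1e-6, 2.0, 5.0, 1e5])
c = y[-1] - y[0]                  # the common factor y(n) - y(1)
yp = (y - y[0]) / c               # differentiable rescaling to [0, 1]

assert np.isclose(yp[0], 0.0) and np.isclose(yp[-1], 1.0)
# Differences between components are all divided by the same constant c,
# so the relative scaling between the parameters is untouched:
assert np.allclose(np.diff(yp) * c, np.diff(y))
```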
SA-W
on 7 Oct 2024
Matt J
on 7 Oct 2024
Subject to the assumptions I mentioned, yes.
Comparing convergence between no preconditioning, the eigenvector basis, and Matt's unit-diagonal Hessian scaling, I would say the diagonal scaling is not very effective for this test.
It is more relevant when the regression vector x comprises a mix of parameters of different "types".
For spline interpolation with a Bernstein basis, I don't think it is required.
% Create an ill-conditioned matrix J
n = 4;
J = sqrtm(hilb(n));
xt = rand(n,1)
y = J*xt;
x0 = zeros(size(J,2),1);
[x,fval,exitflag,output] = fminunc(@(x)sum((J*x-y).^2),x0);
output
% Hessian (the model is linear, so there is no second-order derivative
% term here)
H = J'*J;
H = 1/2*(H + H'); % Make the matrix numerically symmetric
fprintf('cond(H) = %g\n', cond(H));
[V,lambda] = eig(H,'vector');
lambda = reshape(lambda,1,[]);
% New basis
B = V./sqrt(lambda); % Careful: we assume H is SPD here in order to be able to select such a basis
% Model w.r.t. the coefficients in the new basis
JB = J*B;
fprintf('cond(JB''*JB) = %g\n', cond(JB'*JB)); % NOTE: close to 1
[xB,fval,exitflag,outputPrecond] = fminunc(@(xB)sum((JB*xB-y).^2),x0);
xPrecond = B*xB
outputPrecond
% Matt's diagonal scaling
% The Hessian is independent of x0 for a linear model
BD = diag(1./sqrt(diag(H)));
JBD = J*BD;
HD1 = JBD'*JBD; % diagonal == 1
[xBD,fval,exitflag,outputdH] = fminunc(@(xBD)sum((JBD*xBD-y).^2),x0);
xD = BD*xBD
outputdH
Matt J
on 7 Oct 2024
I would say the diagonal scaling is not very effective for this test.
Naturally. Your chosen J strongly violates the assumptions I stipulated.
Bruno Luong
on 8 Oct 2024
The Hessian diagonal normalization is equivalent to normalizing each column of the model matrix J to unit norm:
BD = diag(1./vecnorm(J,2,1));
JBD = J*BD;
[xBD,fval,exitflag,outputdH] = fminunc(@(xBD)sum((JBD*xBD-y).^2),x0);
xD = BD*xBD
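This equivalence follows because diag(J'*J)(i) = ||J(:,i)||^2; a quick NumPy check with an arbitrary J:

```python
import numpy as np

rng = np.random.default_rng(1)
# Arbitrary model matrix, just for the check
J = rng.standard_normal((10, 4)) * np.array([1.0, 5.0, 0.1, 50.0])

BD_hess = np.diag(1.0 / np.sqrt(np.diag(J.T @ J)))   # Hessian-diagonal scaling
BD_cols = np.diag(1.0 / np.linalg.norm(J, axis=0))   # unit-column-norm scaling

# The two scalings are the same matrix:
assert np.allclose(BD_hess, BD_cols)
```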
Matt J
on 8 Oct 2024
But the assumption was that J.'*J was diagonally dominant.
Bruno Luong
on 8 Oct 2024
Edited: Bruno Luong
on 8 Oct 2024
I did read that and understood it.
In practice I have rarely, if ever, encountered a diagonally dominant Hessian; the Hessian is something we undergo rather than choose. The most diagonally dominant case that comes to mind spontaneously is the Laplace/Poisson equation. It usually already has a constant diagonal, and people still derive more sophisticated preconditioning techniques (e.g., scaled FFT bases, multigrid techniques, ...).
I simply showed the convergence results, and how to code such preconditioning using the model matrix J without forming the Hessian H, on a simple example.
Feel free to give an example where the assumption you stipulate is true and effective.
Bruno Luong
on 8 Oct 2024
An interesting idea for a generic "preconditioning" of B-spline approximation is working with the dual Bernstein basis, described in the Lu paper here