Unconstrained minimization is the problem of finding a vector x that is a local minimum to a scalar function f(x):
$$\underset{x}{\mathrm{min}}f(x)$$
The term unconstrained means that no restriction is placed on the range of x.
fminunc
trust-region
AlgorithmMany of the methods used in Optimization Toolbox™ solvers are based on trust regions, a simple yet powerful concept in optimization.
To understand the trust-region approach to optimization, consider the unconstrained minimization problem, minimize f(x), where the function takes vector arguments and returns scalars. Suppose you are at a point x in n-space and you want to improve, i.e., move to a point with a lower function value. The basic idea is to approximate f with a simpler function q, which reasonably reflects the behavior of function f in a neighborhood N around the point x. This neighborhood is the trust region. A trial step s is computed by minimizing (or approximately minimizing) over N. This is the trust-region subproblem,
$$\underset{s}{\mathrm{min}}\left\{q(s),\text{}s\in N\right\}.$$ | (6-1) |
The current point is updated to be x + s if f(x + s) < f(x); otherwise, the current point remains unchanged and N, the region of trust, is shrunk and the trial step computation is repeated.
The key questions in defining a specific trust-region approach to minimizing f(x) are how to choose and compute the approximation q (defined at the current point x), how to choose and modify the trust region N, and how accurately to solve the trust-region subproblem. This section focuses on the unconstrained problem. Later sections discuss additional complications due to the presence of constraints on the variables.
In the standard trust-region method ([48]), the quadratic approximation q is defined by the first two terms of the Taylor approximation to F at x; the neighborhood N is usually spherical or ellipsoidal in shape. Mathematically the trust-region subproblem is typically stated
$$\mathrm{min}\left\{\frac{1}{2}{s}^{T}Hs+{s}^{T}g\text{suchthat}\Vert Ds\Vert \le \Delta \right\},$$ | (6-2) |
where g is the gradient of f at the current point x, H is the Hessian matrix (the symmetric matrix of second derivatives), D is a diagonal scaling matrix, Δ is a positive scalar, and ∥ . ∥ is the 2-norm. Good algorithms exist for solving Equation 6-2 (see [48]); such algorithms typically involve the computation of a full eigensystem and a Newton process applied to the secular equation
$$\frac{1}{\Delta}-\frac{1}{\Vert s\Vert}=0.$$
Such algorithms provide an accurate solution to Equation 6-2. However, they require time proportional to several factorizations of H. Therefore, for large-scale problems a different approach is needed. Several approximation and heuristic strategies, based on Equation 6-2, have been proposed in the literature ([42] and [50]). The approximation approach followed in Optimization Toolbox solvers is to restrict the trust-region subproblem to a two-dimensional subspace S ([39] and [42]). Once the subspace S has been computed, the work to solve Equation 6-2 is trivial even if full eigenvalue/eigenvector information is needed (since in the subspace, the problem is only two-dimensional). The dominant work has now shifted to the determination of the subspace.
The two-dimensional subspace S is determined with the aid of a preconditioned conjugate gradient process described below. The solver defines S as the linear space spanned by s_{1} and s_{2}, where s_{1} is in the direction of the gradient g, and s_{2} is either an approximate Newton direction, i.e., a solution to
$$H\cdot {s}_{2}=-g,$$ | (6-3) |
or a direction of negative curvature,
$${s}_{2}^{T}\cdot H\cdot {s}_{2}<0.$$ | (6-4) |
The philosophy behind this choice of S is to force global convergence (via the steepest descent direction or negative curvature direction) and achieve fast local convergence (via the Newton step, when it exists).
A sketch of unconstrained minimization using trust-region ideas is now easy to give:
Formulate the two-dimensional trust-region subproblem.
Solve Equation 6-2 to determine the trial step s.
If f(x + s) < f(x), then x = x + s.
Adjust Δ.
These four steps are repeated until convergence. The trust-region dimension Δ is adjusted according to standard rules. In particular, it is decreased if the trial step is not accepted, i.e., f(x + s) ≥ f(x). See [46] and [49] for a discussion of this aspect.
Optimization Toolbox solvers treat a few important special cases of f with specialized functions: nonlinear least-squares, quadratic functions, and linear least-squares. However, the underlying algorithmic ideas are the same as for the general case. These special cases are discussed in later sections.
A popular way to solve large symmetric positive definite systems of linear equations Hp = –g is the method of Preconditioned Conjugate Gradients (PCG). This iterative approach requires the ability to calculate matrix-vector products of the form H·v where v is an arbitrary vector. The symmetric positive definite matrix M is a preconditioner for H. That is, M = C^{2}, where C^{–1}HC^{–1} is a well-conditioned matrix or a matrix with clustered eigenvalues.
In a minimization context, you can assume that the Hessian matrix H is symmetric. However, H is guaranteed to be positive definite only in the neighborhood of a strong minimizer. Algorithm PCG exits when a direction of negative (or zero) curvature is encountered, i.e., d^{T}Hd ≤ 0. The PCG output direction, p, is either a direction of negative curvature or an approximate (tol controls how approximate) solution to the Newton system Hp = –g. In either case p is used to help define the two-dimensional subspace used in the trust-region approach discussed in Trust-Region Methods for Nonlinear Minimization.
fminunc
quasi-newton
AlgorithmAlthough a wide spectrum of methods exists for unconstrained optimization, methods can be broadly categorized in terms of the derivative information that is, or is not, used. Search methods that use only function evaluations (e.g., the simplex search of Nelder and Mead [30]) are most suitable for problems that are not smooth or have a number of discontinuities. Gradient methods are generally more efficient when the function to be minimized is continuous in its first derivative. Higher order methods, such as Newton's method, are only really suitable when the second-order information is readily and easily calculated, because calculation of second-order information, using numerical differentiation, is computationally expensive.
Gradient methods use information about the slope of the function to dictate a direction of search where the minimum is thought to lie. The simplest of these is the method of steepest descent in which a search is performed in a direction, –∇f(x), where ∇f(x) is the gradient of the objective function. This method is very inefficient when the function to be minimized has long narrow valleys as, for example, is the case for Rosenbrock's function
$$f(x)=100{\left({x}_{2}-{x}_{1}^{2}\right)}^{2}+{(1-{x}_{1})}^{2}.$$ | (6-5) |
The minimum of this function is at x = [1,1], where f(x) = 0. A contour map of this function is shown in the figure below, along with the solution path to the minimum for a steepest descent implementation starting at the point [-1.9,2]. The optimization was terminated after 1000 iterations, still a considerable distance from the minimum. The black areas are where the method is continually zigzagging from one side of the valley to another. Note that toward the center of the plot, a number of larger steps are taken when a point lands exactly at the center of the valley.
Figure 6-1. Steepest Descent Method on Rosenbrock's Function
This function, also known as the banana function, is notorious in unconstrained examples because of the way the curvature bends around the origin. Rosenbrock's function is used throughout this section to illustrate the use of a variety of optimization techniques. The contours have been plotted in exponential increments because of the steepness of the slope surrounding the U-shaped valley.
For an animated version of this figure, enter bandem
at
the MATLAB^{®} command line.
Of the methods that use gradient information, the most favored are the quasi-Newton methods. These methods build up curvature information at each iteration to formulate a quadratic model problem of the form
$$\underset{x}{\mathrm{min}}\frac{1}{2}{x}^{T}Hx+{c}^{T}x+b,$$ | (6-6) |
where the Hessian matrix, H, is a positive definite symmetric matrix, c is a constant vector, and b is a constant. The optimal solution for this problem occurs when the partial derivatives of x go to zero, i.e.,
$$\nabla f({x}^{*})=H{x}^{*}+\text{\hspace{0.17em}}c=0.$$ | (6-7) |
The optimal solution point, x*, can be written as
$${x}^{*}=-{H}^{-1}c.$$ | (6-8) |
Newton-type methods (as opposed to quasi-Newton methods) calculate H directly and proceed in a direction of descent to locate the minimum after a number of iterations. Calculating H numerically involves a large amount of computation. Quasi-Newton methods avoid this by using the observed behavior of f(x) and ∇f(x) to build up curvature information to make an approximation to H using an appropriate updating technique.
A large number of Hessian updating methods have been developed. However, the formula of Broyden [3], Fletcher [12], Goldfarb [20], and Shanno [37] (BFGS) is thought to be the most effective for use in a general purpose method.
The formula given by BFGS is
$${H}_{k+1}={H}_{k}+\frac{{q}_{k}{q}_{k}^{T}}{{q}_{k}^{T}{s}_{k}}-\frac{{H}_{k}{s}_{k}{s}_{k}^{T}{H}_{k}^{T}}{{s}_{k}^{T}{H}_{k}{s}_{k}},$$ | (6-9) |
where
$$\begin{array}{c}{s}_{k}={x}_{k+1}-{x}_{k},\\ {q}_{k}=\nabla f\left({x}_{k+1}\right)-\nabla f\left({x}_{k}\right).\end{array}$$
As a starting point, H_{0} can be set to any symmetric positive definite matrix, for example, the identity matrix I. To avoid the inversion of the Hessian H, you can derive an updating method that avoids the direct inversion of H by using a formula that makes an approximation of the inverse Hessian H^{–1} at each update. A well-known procedure is the DFP formula of Davidon [7], Fletcher, and Powell [14]. This uses the same formula as the BFGS method (Equation 6-9) except that q_{k} is substituted for s_{k}.
The gradient information is either supplied through analytically calculated gradients, or derived by partial derivatives using a numerical differentiation method via finite differences. This involves perturbing each of the design variables, x, in turn and calculating the rate of change in the objective function.
At each major iteration, k, a line search is performed in the direction
$$d=-{H}_{k}^{-1}\cdot \nabla f\left({x}_{k}\right).$$ | (6-10) |
The quasi-Newton method is illustrated by the solution path on Rosenbrock's function in Figure 6-2, BFGS Method on Rosenbrock's Function. The method is able to follow the shape of the valley and converges to the minimum after 140 function evaluations using only finite difference gradients.
Figure 6-2. BFGS Method on Rosenbrock's Function
For an animated version of this figure, enter bandem
at
the MATLAB command line.
Line search is a search method that is used as part of a larger optimization algorithm. At each step of the main algorithm, the line-search method searches along the line containing the current point, x_{k}, parallel to the search direction, which is a vector determined by the main algorithm. That is, the method finds the next iterate x_{k+1} of the form
$${x}_{k+1}={x}_{k}+{\alpha}^{*}{d}_{k},$$ | (6-11) |
where x_{k} denotes the current iterate, d_{k} is the search direction, and α* is a scalar step length parameter.
The line search method attempts to decrease the objective function along the line x_{k} + α*d_{k} by repeatedly minimizing polynomial interpolation models of the objective function. The line search procedure has two main steps:
The bracketing phase determines the range of points on the line $${x}_{k+1}={x}_{k}+{\alpha}^{*}{d}_{k}$$ to be searched. The bracket corresponds to an interval specifying the range of values of α.
The sectioning step divides the bracket into subintervals, on which the minimum of the objective function is approximated by polynomial interpolation.
The resulting step length α satisfies the Wolfe conditions:
$$f\left({x}_{k}+\alpha {d}_{k}\right)\le f\left({x}_{k}\right)+{c}_{1}\alpha \nabla {f}_{k}^{T}{d}_{k},$$ | (6-12) |
$$\nabla f{\left({x}_{k}+\alpha {d}_{k}\right)}^{T}{d}_{k}\ge {c}_{2}\nabla {f}_{k}^{T}{d}_{k},$$ | (6-13) |
where c_{1} and c_{2} are constants with 0 < c_{1} < c_{2} < 1.
The first condition (Equation 6-12) requires that α_{k} sufficiently decreases the objective function. The second condition (Equation 6-13) ensures that the step length is not too small. Points that satisfy both conditions (Equation 6-12 and Equation 6-13) are called acceptable points.
The line search method is an implementation of the algorithm described in Section 2-6 of [13]. See also [31] for more information about line search.
Many of the optimization functions determine the direction of
search by updating the Hessian matrix at each iteration, using the
BFGS method (Equation 6-9). The function fminunc
also provides an option to use
the DFP method given in Quasi-Newton Methods (set HessUpdate
to 'dfp'
in options
to
select the DFP method). The Hessian, H, is always
maintained to be positive definite so that the direction of search, d,
is always in a descent direction. This means that for some arbitrarily
small step α in the direction d,
the objective function decreases in magnitude. You achieve positive
definiteness of H by ensuring that H is
initialized to be positive definite and thereafter $${q}_{k}^{T}{s}_{k}$$ (from Equation 6-14) is always positive. The term $${q}_{k}^{T}{s}_{k}$$ is a product of the line search
step length parameter α_{k} and
a combination of the search direction d with past
and present gradient evaluations,
$${q}_{k}^{T}{s}_{k}={\alpha}_{k}\left(\nabla f{\left({x}_{k+1}\right)}^{T}d-\nabla f{\left({x}_{k}\right)}^{T}d\right).$$ | (6-14) |
You always achieve the condition that $${q}_{k}^{T}{s}_{k}$$ is positive by performing a sufficiently accurate line search. This is because the search direction, d, is a descent direction, so that α_{k} and the negative gradient –∇f(x_{k})^{T}d are always positive. Thus, the possible negative term –∇f(x_{k+1})^{T}d can be made as small in magnitude as required by increasing the accuracy of the line search.
fminsearch
uses the Nelder-Mead simplex
algorithm as described in Lagarias et al. [57]. This algorithm uses a simplex of n + 1 points for n-dimensional
vectors x. The algorithm first makes a simplex
around the initial guess x_{0} by
adding 5% of each component x_{0}(i)
to x_{0}, and using these n vectors
as elements of the simplex in addition to x_{0}.
(It uses 0.00025 as component i if x_{0}(i) = 0.) Then, the
algorithm modifies the simplex repeatedly according to the following
procedure.
Note:
The keywords for the |
Let x(i) denote the list of points in the current simplex, i = 1,...,n+1.
Order the points in the simplex from lowest function value f(x(1)) to highest f(x(n+1)). At each step in the iteration, the algorithm discards the current worst point x(n+1), and accepts another point into the simplex. [Or, in the case of step 7 below, it changes all n points with values above f(x(1))].
Generate the reflected point
r = 2m – x(n+1),
where
m = Σx(i)/n, i = 1...n,
and calculate f(r).
If f(x(1)) ≤ f(r) < f(x(n)), accept r and terminate this iteration. Reflect
If f(r) < f(x(1)), calculate the expansion point s
s = m + 2(m – x(n+1)),
and calculate f(s).
If f(s) < f(r), accept s and terminate the iteration. Expand
Otherwise, accept r and terminate the iteration. Reflect
If f(r) ≥ f(x(n)), perform a contraction between m and the better of x(n+1) and r:
If f(r) < f(x(n+1)) (i.e., r is better than x(n+1)), calculate
c = m + (r – m)/2
and calculate f(c). If f(c) < f(r), accept c and terminate the iteration. Contract outside Otherwise, continue with Step 7 (Shrink).
If f(r) ≥ f(x(n+1)), calculate
cc = m + (x(n+1) – m)/2
and calculate f(cc). If f(cc) < f(x(n+1)), accept cc and terminate the iteration. Contract inside Otherwise, continue with Step 7 (Shrink).
Calculate the n points
v(i) = x(1) + (x(i) – x(1))/2
and calculate f(v(i)), i = 2,...,n+1. The simplex at the next iteration is x(1), v(2),...,v(n+1). Shrink
The following figure shows the points that fminsearch
might
calculate in the procedure, along with each possible new simplex.
The original simplex has a bold outline. The iterations proceed until
they meet a stopping criterion.