## Bayesian Optimization Algorithm

### Algorithm Outline

The Bayesian optimization algorithm attempts to minimize a scalar
objective function *f*(*x*) for *x* in
a bounded domain. The function can be deterministic or stochastic,
meaning it can return different results when evaluated at the same
point *x*. The components of *x* can
be continuous reals, integers, or categorical, meaning a discrete
set of names.

**Note**

Throughout this discussion, D represents the number of components
of *x*.

The key elements in the minimization are:

A Gaussian process model of

*f*(*x*).A Bayesian update procedure for modifying the Gaussian process model at each new evaluation of

*f*(*x*).An

*acquisition function**a*(*x*) (based on the Gaussian process model of*f*) that you maximize to determine the next point*x*for evaluation. For details, see Acquisition Function Types and Acquisition Function Maximization.

Algorithm outline:

Evaluate

*y*=_{i}*f*(*x*) for_{i}`NumSeedPoints`

points*x*, taken at random within the variable bounds._{i}`NumSeedPoints`

is a`bayesopt`

setting. If there are evaluation errors, take more random points until there are`NumSeedPoints`

successful evaluations. The probability distribution of each component is either uniform or log-scaled, depending on the`Transform`

value in`optimizableVariable`

.

Then repeat the following steps:

Update the Gaussian process model of

*f*(*x*) to obtain a posterior distribution over functions*Q*(*f*|*x*,_{i}*y*for_{i}*i*= 1,...,*t*). (Internally,`bayesopt`

uses`fitrgp`

to fit a Gaussian process model to the data.)Find the new point

*x*that maximizes the acquisition function*a*(*x*).

The algorithm stops after reaching any of the following:

A fixed number of iterations (default 30).

A fixed time (default is no time limit).

A stopping criterion that you supply in Bayesian Optimization Output Functions or Bayesian Optimization Plot Functions.

For the algorithmic differences in parallel, see Parallel Bayesian Algorithm.

### Gaussian Process Regression for Fitting the Model

The underlying probabilistic model for the objective function *f* is a
Gaussian process prior with added Gaussian noise in the observations. So the prior
distribution on *f*(*x*) is a Gaussian process with mean *μ*(*x*;*θ*) and covariance kernel function *k*(*x*,*x′*;*θ*). Here, *θ* is a vector of kernel parameters. For
the particular kernel function `bayesopt`

uses, see Kernel Function.

In a bit more detail, denote a set of points *X* = *x _{i}* with
associated objective function values

*F*=

*f*. The prior’s joint distribution of the function values

_{i}*F*is multivariate normal, with mean

*μ*(

*X*) and covariance matrix

*K*(

*X*,

*X*), where

*K*=

_{ij}*k*(

*x*,

_{i}*x*).

_{j}Without loss of generality, the prior mean is given as `0`

.

Also, the observations are assumed to have added Gaussian noise
with variance *σ*^{2}.
So the prior distribution has covariance *K*(*X*,*X*;*θ*)
+ *σ*^{2}*I*.

Fitting a Gaussian process regression model to observations
consists of finding values for the noise variance *σ*^{2} and
kernel parameters *θ*. This fitting is a computationally
intensive process performed by `fitrgp`

.

For details on fitting a Gaussian process to observations, see Gaussian Process Regression.

#### Kernel Function

The kernel function *k*(*x*,*x′*;*θ*) can
significantly affect the quality of a Gaussian process regression. `bayesopt`

uses
the ARD Matérn 5/2 kernel defined in Kernel (Covariance) Function Options.

See Snoek, Larochelle, and Adams [3].

### Acquisition Function Types

Six choices of acquisition functions are available for `bayesopt`

.
There are three basic types, with `expected-improvement`

also
modified by `per-second`

or `plus`

:

`'expected-improvement-per-second-plus'`

(default)`'expected-improvement'`

`'expected-improvement-plus'`

`'expected-improvement-per-second'`

`'lower-confidence-bound'`

`'probability-of-improvement'`

The acquisition functions evaluate the “goodness”
of a point *x* based on the posterior distribution
function *Q*. When there are coupled constraints,
including the Error constraint (see Objective Function Errors), all acquisition functions modify
their estimate of “goodness” following a suggestion
of Gelbart, Snoek, and Adams [2]. Multiply the “goodness” by an estimate of
the probability that the constraints are satisfied, to arrive at the
acquisition function.

#### Expected Improvement

The `'expected-improvement'`

family of acquisition
functions evaluates the expected amount of improvement in the objective
function, ignoring values that cause an increase in the objective.
In other words, define

*x*_{best}as the location of the lowest posterior mean.*μ*(_{Q}*x*_{best}) as the lowest value of the posterior mean.

Then the expected improvement

$$EI\left(x,Q\right)={E}_{Q}\left[\mathrm{max}\left(0,{\mu}_{Q}\left({x}_{\text{best}}\right)-f\left(x\right)\right)\right].$$

#### Probability of Improvement

The `'probability-of-improvement'`

acquisition
function makes a similar, but simpler, calculation as `'expected-improvement'`

.
In both cases, `bayesopt`

first calculates *x*_{best} and *μ _{Q}*(

*x*

_{best}). Then for

`'probability-of-improvement'`

, `bayesopt`

calculates
the probability *PI*that a new point

*x*leads to a better objective function value, modified by a “margin” parameter

*m*:

$$PI\left(x,Q\right)={P}_{Q}\left(f\left(x\right)<{\mu}_{Q}\left({x}_{\text{best}}\right)-m\right).$$

`bayesopt`

takes *m* as
the estimated noise standard deviation. `bayesopt`

evaluates
this probability as

$$PI=\Phi \left({\nu}_{Q}\left(x\right)\right),$$

where

$${\nu}_{Q}\left(x\right)=\frac{{\mu}_{Q}\left({x}_{\text{best}}\right)-m-{\mu}_{Q}\left(x\right)}{{\sigma}_{Q}\left(x\right)}.$$

Here Φ(·) is
the unit normal CDF, and *σ _{Q}* is
the posterior standard deviation of the Gaussian process at

*x*.

#### Lower Confidence Bound

The `'lower-confidence-bound'`

acquisition
function looks at the curve *G* two standard deviations
below the posterior mean at each point:

$$G\left(x\right)={\mu}_{Q}\left(x\right)-2{\sigma}_{Q}\left(x\right).$$

*G*(*x*) is the 2*σ _{Q}* lower
confidence envelope of the objective function model.

`bayesopt`

then
maximizes the negative of *G*:

$$LCB=2{\sigma}_{Q}\left(x\right)-{\mu}_{Q}\left(x\right).$$

#### Per Second

Sometimes, the time to evaluate the objective function can depend on the region. For example,
many Support Vector Machine calculations vary in timing a good deal over certain
ranges of points. If so, `bayesopt`

can obtain better
improvement per second by using time-weighting in its acquisition function. The
cost-weighted acquisition functions have the phrase
`per-second`

in their names.

These acquisition functions work as follows. During the objective
function evaluations, `bayesopt`

maintains another
Bayesian model of objective function evaluation time as a function
of position *x*. The expected improvement per second
that the acquisition function uses is

$$EIpS\left(x\right)=\frac{E{I}_{Q}\left(x\right)}{{\mu}_{S}\left(x\right)},$$

where *μ _{S}*(

*x*) is the posterior mean of the timing Gaussian process model.

#### Plus

To escape a local objective function minimum, the acquisition
functions with `plus`

in their names modify their
behavior when they estimate that they are *overexploiting* an
area. To understand overexploiting, let *σ _{F}*(

*x*) be the standard deviation of the posterior objective function at

*x*. Let

*σ*be the posterior standard deviation of the additive noise, so that

*σ _{Q}*

^{2}(

*x*) =

*σ*

_{F}^{2}(

*x*) +

*σ*

^{2}.

Define *t _{σ}* to
be the value of the

`ExplorationRatio`

option, a
positive number. The `bayesopt`

`plus`

acquisition
functions, after each iteration, evaluate whether the next point *x*satisfies

*σ _{F}*(

*x*) <

*t*

_{σ}*σ*.

If so, the algorithm declares that *x* is overexploiting.
Then the acquisition function modifies its Kernel Function by multiplying *θ* by
the number of iterations, as suggested by Bull [1]. This modification raises the variance *σ _{Q}* for
points in between observations. It then generates a new point based
on the new fitted kernel function. If the new point

*x*is again overexploiting, the acquisition function multiplies

*θ*by an additional factor of 10 and tries again. It continues in this way up to five times, trying to generate a point

*x*that is not overexploiting. The algorithm accepts the new

*x*as the next point.

`ExplorationRatio`

therefore controls a tradeoff
between exploring new points for a better global solution, versus
concentrating near points that have already been examined.

### Acquisition Function Maximization

Internally, `bayesopt`

maximizes an acquisition
function using the following general steps:

For algorithms starting with

`'expected-improvement'`

and for`'probability-of-improvement'`

,`bayesopt`

estimates the smallest feasible mean of the posterior distribution*μ*(_{Q}*x*_{best}) by sampling several thousand points within the variable bounds, taking several of the best (low mean value) feasible points, and improving them using local search, to find the ostensible best feasible point. Feasible means that the point satisfies constraints (see Constraints in Bayesian Optimization).For all algorithms,

`bayesopt`

samples several thousand points within the variable bounds, takes several of the best (high acquisition function) feasible points, and improves them using local search, to find the ostensible best feasible point. The acquisition function value depends on the modeled posterior distribution, not a sample of the objective function, and so it can be calculated quickly.

## References

[1] Bull, A. D. *Convergence rates of efficient global
optimization algorithms*. https://arxiv.org/abs/1101.3501v3, 2011.

[2] Gelbart, M., J. Snoek, R. P. Adams. *Bayesian Optimization
with Unknown Constraints*. https://arxiv.org/abs/1403.5607, 2014.

[3] Snoek, J., H. Larochelle, R. P. Adams. *Practical Bayesian
Optimization of Machine Learning Algorithms*. https://arxiv.org/abs/1206.2944, 2012.

## See Also

`bayesopt`

| `BayesianOptimization`