## Recursive Algorithms for Online Parameter Estimation

The recursive estimation algorithms in the System Identification Toolbox™ can be separated into two categories:

Infinite-history algorithms — These algorithms aim to minimize the error between the observed and predicted outputs for all time steps from the beginning of the simulation. The System Identification Toolbox supports infinite-history estimation in:

Recursive command-line estimators for the least-squares linear regression, AR, ARX, ARMA, ARMAX, OE, and BJ model structures

Simulink

^{®}Recursive Least Squares Estimator and Recursive Polynomial Model Estimator blocks

Finite-history algorithms — These algorithms aim to minimize the error between the observed and predicted outputs for a finite number of past time steps. The toolbox supports finite-history estimation for linear-in-parameters models:

Recursive command-line estimators for the least-squares linear regression, AR, ARX, and OE model structures

Simulink Recursive Least Squares Estimator block

Simulink Recursive Polynomial Model Estimator block, for AR, ARX, and OE structures only

Finite-history algorithms are typically easier to tune than the infinite-history algorithms when the parameters have rapid and potentially large variations over time.

### Recursive Infinite-History Estimation

#### General Form of Infinite-History Recursive Estimation

The general form of the infinite-history recursive estimation algorithm is as follows:

$$\widehat{\theta}\left(t\right)=\widehat{\theta}\left(t-1\right)+K\left(t\right)\left(y\left(t\right)-\widehat{y}\left(t\right)\right)$$

$$\widehat{\theta}\left(t\right)$$ is the parameter estimate at time *t*.
*y(t)* is the observed output at time
*t*, and $$\widehat{y}\left(t\right)$$ is the prediction of *y(t)* based on
observations up to time *t-1*. The gain,
*K(t)*, determines how much the current prediction error $$y\left(t\right)-\widehat{y}\left(t\right)$$ affects the update of the parameter estimate. The estimation
algorithms minimize the prediction-error term $$y\left(t\right)-\widehat{y}\left(t\right)$$.

The gain has the following form:

$$K\left(t\right)=Q\left(t\right)\psi \left(t\right)$$

The recursive algorithms supported by the System Identification Toolbox product differ based on different approaches for choosing the form
of *Q(t)* and computing $$\psi \left(t\right)$$. Here, $$\psi \left(t\right)$$ represents the gradient of the predicted model output $$\widehat{y}\left(t|\theta \right)$$ with respect to the parameters $$\theta $$.

The simplest way to visualize the role of the gradient $$\psi \left(t\right)$$ of the parameters, is to consider models with a linear-regression form:

$$y\left(t\right)={\psi}^{T}\left(t\right){\theta}_{0}\left(t\right)+e\left(t\right)$$

In this equation, $$\psi \left(t\right)$$ is the *regression vector* that is computed
based on previous values of measured inputs and outputs. $${\theta}_{0}\left(t\right)$$ represents the true parameters. *e(t)* is
the noise source (*innovations*), which is assumed to be
white noise. The specific form of $$\psi \left(t\right)$$ depends on the structure of the polynomial model.

For linear regression equations, the predicted output is given by the following equation:

$$\widehat{y}\left(t\right)={\psi}^{T}\left(t\right)\widehat{\theta}\left(t-1\right)$$

For models that do not have the linear regression form, it is not possible to compute exactly the predicted output and the gradient $$\psi \left(t\right)$$ for the current parameter estimate $$\widehat{\theta}\left(t-1\right)$$. To learn how you can compute approximation for $$\psi \left(t\right)$$ and $$\widehat{\theta}\left(t-1\right)$$ for general model structures, see the section on recursive prediction-error methods in [1].

#### Types of Infinite-History Recursive Estimation Algorithms

The System Identification Toolbox software provides the following infinite-history recursive estimation algorithms for online estimation:

The forgetting factor and Kalman Filter formulations are more computationally intensive than gradient and unnormalized gradient methods. However, they typically have better convergence properties.

**Forgetting Factor. **The following set of equations summarizes the *forgetting
factor* adaptation algorithm:

$$\widehat{\theta}\left(t\right)=\widehat{\theta}\left(t-1\right)+K\left(t\right)\left(y\left(t\right)-\widehat{y}\left(t\right)\right)$$

$$\widehat{y}\left(t\right)={\psi}^{T}\left(t\right)\widehat{\theta}\left(t-1\right)$$

$$K\left(t\right)=Q\left(t\right)\psi \left(t\right)$$

$$Q\left(t\right)=\frac{P\left(t-1\right)}{\lambda +{\psi}^{T}\left(t\right)P\left(t-1\right)\psi \left(t\right)}$$

$$P\left(t\right)={\scriptscriptstyle \frac{1}{\lambda}}\left(P\left(t-1\right)-\frac{P\left(t-1\right)\psi \left(t\right)\psi {\left(t\right)}^{T}P\left(t-1\right)}{\lambda +\psi {\left(t\right)}^{T}P\left(t-1\right)\psi \left(t\right)}\right)$$

The software ensures *P(t)* is a positive-definite matrix
by using a square-root algorithm to update it [2]. The software computes `P`

assuming that the residuals
(difference between estimated and measured outputs) are white noise, and the
variance of these residuals is 1.
*R _{2}*

`/2`

*
`P`

is approximately equal to the covariance matrix of
the estimated parameters, where *R*is the true variance of the residuals.

_{2}*Q(t)* is obtained by minimizing the following function
at time *t*:

$$\sum}_{k=1}^{t}{\lambda}^{t-k}{(y(k)-\widehat{y}(k))}^{2$$

See section 11.2 in [1] for details.

This approach discounts old measurements exponentially such that an
observation that is $$\tau $$ samples old carries a weight that is equal to $${\lambda}^{\tau}$$ times the weight of the most recent observation. $$\tau ={\scriptscriptstyle \raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$1-\lambda $}\right.}$$ represents the *memory horizon* of this
algorithm. Measurements older than $$\tau ={\scriptscriptstyle \raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$1-\lambda $}\right.}$$ typically carry a weight that is less than about 0.3.

$$\lambda $$ is called the forgetting factor and typically has a
positive value between `0.98`

and `0.995`

.
Set $$\lambda =1$$ to estimate time-invariant (constant) parameters. Set $$\lambda <1$$ to estimate time-varying
parameters.

**Note**

The forgetting factor algorithm for $$\lambda $$ = 1 is equivalent to the Kalman filter algorithm with
*R*_{1}=0 and
*R*_{2}=1. For more
information about the Kalman filter algorithm, see Kalman Filter.

**Kalman Filter. **The following set of equations summarizes the *Kalman
filter* adaptation algorithm:

$$\widehat{\theta}\left(t\right)=\widehat{\theta}\left(t-1\right)+K\left(t\right)\left(y\left(t\right)-\widehat{y}\left(t\right)\right)$$

$$\widehat{y}\left(t\right)={\psi}^{T}\left(t\right)\widehat{\theta}\left(t-1\right)$$

$$K\left(t\right)=Q\left(t\right)\psi \left(t\right)$$

$$Q\left(t\right)=\frac{P\left(t-1\right)}{{R}_{2}+{\psi}^{T}\left(t\right)P\left(t-1\right)\psi \left(t\right)}$$

$$P\left(t\right)=P\left(t-1\right)+{R}_{1}-\frac{P\left(t-1\right)\psi \left(t\right)\psi {\left(t\right)}^{T}P\left(t-1\right)}{{R}_{2}+\psi {\left(t\right)}^{T}P\left(t-1\right)\psi \left(t\right)}$$

The software ensures *P(t)* is a positive-definite matrix
by using a square-root algorithm to update it [2]. The software computes `P`

assuming that the residuals
(difference between estimated and measured outputs) are white noise, and the
variance of these residuals is 1.
*R _{2}**

`P`

is
the covariance matrix of the estimated parameters, and
*R*/

_{1}*R*is the covariance matrix of the parameter changes. Where,

_{2}*R*is the covariance matrix of parameter changes that you specify.

_{1}This formulation assumes the linear-regression form of the model:

$$y\left(t\right)={\psi}^{T}\left(t\right){\theta}_{0}\left(t\right)+e\left(t\right)$$

*Q(t)* is computed using a Kalman filter.

This formulation also assumes that the true parameters $${\theta}_{0}\left(t\right)$$ are described by a random walk:

$${\theta}_{0}\left(t\right)={\theta}_{0}\left(t-1\right)+w\left(t\right)$$

*w(t)* is Gaussian white noise with the following
covariance matrix, or *drift matrix*
*R*_{1}:

$$Ew\left(t\right){w}^{T}\left(t\right)={R}_{1}$$

*R*_{2} is the variance of the
innovations *e(t)* in the following equation:

$$y\left(t\right)={\psi}^{T}\left(t\right){\theta}_{0}\left(t\right)+e\left(t\right)$$

The Kalman filter algorithm is entirely specified by the sequence of data
*y(t)*, the gradient $$\psi \left(t\right)$$, *R*_{1},
*R*_{2}, and the initial
conditions $$\theta \left(t=0\right)$$ (initial guess of the parameters) and $$P\left(t=0\right)$$ (covariance matrix that indicates parameters
errors).

**Note**

It is assumed that *R*_{1} and
*P*(t = 0) matrices are scaled such that
*R*_{2} = 1. This scaling
does not affect the parameter estimates.

**Normalized and Unnormalized Gradient. **In the linear regression case, the gradient methods are also known as the
*least mean squares* (LMS) methods.

The following set of equations summarizes the *unnormalized
gradient* and *normalized gradient*
adaptation algorithm:

$$\widehat{y}\left(t\right)={\psi}^{T}\left(t\right)\widehat{\theta}\left(t-1\right)$$

$$K\left(t\right)=Q\left(t\right)\psi \left(t\right)$$

In the unnormalized gradient approach, *Q(t)* is given
by:

$$Q(t)=\gamma $$

In the normalized gradient approach, *Q(t)* is given
by:

$$Q\left(t\right)=\frac{\gamma}{{\left|\psi \left(t\right)\right|}^{2}+Bias}$$

The normalized gradient algorithm scales the adaptation gain,
*γ*, at each step by the square of the two-norm of the
gradient vector. If the gradient is close to zero, this can cause jumps in
the estimated parameters. To prevent these jumps, a bias term is introduced
in the scaling factor.

These choices of *Q(t)* for the gradient algorithms
update the parameters in the negative gradient direction, where the gradient
is computed with respect to the parameters. See pg. 372 in [1] for details.

### Recursive Finite-History Estimation

The finite-history estimation methods find parameter estimates
*θ*(*t*) by minimizing

$$\sum _{k=t-N+1}^{t}{(y(k)-\widehat{y}(k|\theta ))}^{2}},$$

where *y*(*k*) is the observed output at time
*k*, and $$\widehat{y}(k|\theta )$$ is the predicted output at time *k*. This
approach is also known as sliding-window estimation. Finite-history estimation
approaches minimize prediction errors for the last *N* time steps.
In contrast, infinite-history estimation methods minimize prediction errors starting
from the beginning of the simulation.

The System Identification Toolbox supports finite-history estimation for the linear-in-parameters models
(AR and ARX) where predicted output has the form $$\widehat{y}(k|\theta )=\Psi (k)\theta (k-1)$$. The software constructs and maintains a buffer of regressors
*ψ*(*k*) and observed outputs
*y*(*k*) for *k* = *t*-*N*+1,
*t*-*N*+2, … , *t*-2,
*t*-1, *t*. These buffers contain the necessary matrices for the underlying
linear regression problem of minimizing $${\Vert {\Psi}_{buffer}\theta -{y}_{buffer}\Vert}_{2}^{2}$$ over *θ*. The software solves this linear
regression problem using QR factoring with column pivoting.

## References

[1] Ljung, L. *System Identification: Theory for the
User*. Upper Saddle River, NJ: Prentice-Hall PTR, 1999.

[2] Carlson, N.A. "Fast triangular formulation of the square
root filter." *AIAA Journal*, Vol. 11, Number 9, 1973, pp.
1259-1265.

[3] Zhang, Q. "Some
Implementation Aspects of Sliding Window Least Squares Algorithms." *IFAC
Proceedings*. Vol. 33, Issue 15, 2000, pp. 763-768.

## See Also

Recursive Least Squares Estimator | Recursive Polynomial Model Estimator | `recursiveAR`

| `recursiveARMA`

| `recursiveARX`

| `recursiveARMAX`

| `recursiveOE`

| `recursiveBJ`

| `recursiveLS`