*Presample data* comes from time points before the beginning of
the observation period. In Econometrics
Toolbox™, you can specify your own presample data or use generated presample data.

In a conditional mean model, the distribution of the innovation
$${\epsilon}_{t}$$ is conditional on historical
information. Historical information includes past responses, $${y}_{1},{y}_{2},\dots ,{y}_{t-1}$$, past innovations, $${\epsilon}_{1},{\epsilon}_{2},\dots ,{\epsilon}_{t-1}$$, and, if you include them in the model, past and present exogenous
covariates, $${x}_{1},{x}_{2},\dots ,{x}_{t-1},{x}_{t}$$.

The number of past responses and innovations that a current innovation depends on is determined by the degree of the AR or MA operators, and any differencing. For example, in an AR(2) model, each innovation depends on the two previous responses,

$${\epsilon}_{t}={y}_{t}-c-{\varphi}_{1}{y}_{t-1}-{\varphi}_{2}{y}_{t-2}.$$

In ARIMAX models, the current innovation also depends on the
*current value* of the exogenous covariate (unlike distributed
lag models). For example, in an ARX(2) model with one exogenous covariate, each
innovation depends on the previous two responses and the current value of the covariate,

$${\epsilon}_{t}={y}_{t}-c-{\varphi}_{1}{y}_{t-1}-{\varphi}_{2}{y}_{t-2}-\beta {x}_{t},$$

where $$\beta$$ is the regression coefficient of the covariate.

In general, the likelihood contribution of the first few innovations is conditional on historical information that might not be observable. How do you estimate the parameters without all the data? In the ARX(2) example, $${\epsilon}_{2}$$ explicitly depends on $${y}_{1},$$ $${y}_{0},$$ and $${x}_{2},$$ and $${\epsilon}_{1}$$ explicitly depends on $${y}_{0},$$ $${y}_{-1},$$ and $${x}_{1}$$. Implicitly, $${\epsilon}_{2}$$ depends on $${x}_{1}$$ and $${x}_{0},$$ and $${\epsilon}_{1}$$ depends on $${x}_{0}$$ and $${x}_{-1}.$$ However, you cannot observe $${y}_{0},$$ $${y}_{-1},$$ $${x}_{0},$$ and $${x}_{-1}.$$
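The dependence on unobserved values can be made concrete with a small sketch. The Python function below is illustrative only and is not Toolbox code: `arx2_innovations`, its arguments, and the coefficient name `beta` are hypothetical, and it assumes the covariate enters the response equation with coefficient `beta`, so that it is subtracted when solving for the innovation. The function cannot compute the first two innovations unless the caller supplies the two presample responses.

```python
def arx2_innovations(y, x, c, phi1, phi2, beta, y_pre):
    """Innovations of an ARX(2) model with one covariate (illustrative sketch).

    y     : observed responses y_1, ..., y_T
    x     : covariate values x_1, ..., x_T (same length as y)
    y_pre : the two presample responses [y_{-1}, y_0]
    """
    if len(y_pre) != 2:
        raise ValueError("an ARX(2) model needs exactly two presample responses")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Prepend the presample so that every observation has two lags available.
    padded = list(y_pre) + list(y)
    innovations = []
    for t in range(len(y)):
        current = padded[t + 2]   # y_t
        lag1 = padded[t + 1]      # y_{t-1}
        lag2 = padded[t]          # y_{t-2}
        innovations.append(current - c - phi1 * lag1 - phi2 * lag2 - beta * x[t])
    return innovations


# Made-up numbers, with the presample responses set to zero.
eps = arx2_innovations(
    y=[1.0, 2.0, 3.0], x=[0.5, 0.5, 0.5],
    c=0.1, phi1=0.5, phi2=0.2, beta=1.0, y_pre=[0.0, 0.0],
)
```

Changing `y_pre` changes only the first two innovations directly, which is why the choice of presample matters most in short samples.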

The amount of presample data that you need to initialize a model depends on the
degree of the model. The property `P` of an `arima` model specifies the number of
presample responses and exogenous data that you need to initialize the AR portion of a
conditional mean model. For example, `P = 2` in an ARX(2) model. Therefore, you need
two responses and two data points from *each* exogenous covariate series to
initialize the model.

One option is to use the first `P` observations from the response and
exogenous covariate series as your presample, and then fit your model to the remaining
data. This results in some loss of sample size. If you plan to compare multiple
potential models, be aware that you can only use likelihood-based measures of fit
(including the likelihood ratio test and information criteria) to compare models fit to
the same data (of the same sample size). If you specify your own presample data, then
you must use the largest required number of presample responses across all models that
you want to compare.
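As a minimal sketch of this bookkeeping, the Python helper below reserves the same number of leading observations for every candidate model, so that all candidates are fit to an identical estimation sample. The function name `split_presample` and the dictionary keys are hypothetical and are not part of the Toolbox.

```python
def split_presample(y, x_columns, p_values):
    """Reserve the first max(p_values) observations as presample (sketch).

    y         : response series
    x_columns : list of covariate series, each the same length as y
    p_values  : presample lengths required by the candidate models
    """
    p_max = max(p_values)
    if p_max >= len(y):
        raise ValueError("not enough observations left for estimation")
    if any(len(col) != len(y) for col in x_columns):
        raise ValueError("every covariate series must match the length of y")

    presample = {
        "y0": y[:p_max],
        "x0": [col[:p_max] for col in x_columns],
    }
    estimation = {
        "y": y[p_max:],
        "x": [col[p_max:] for col in x_columns],
    }
    return presample, estimation


# Three candidate models that need 1, 2, and 3 presample responses.
y = [float(v) for v in range(10)]
x_columns = [[10.0 + v for v in range(10)]]
presample, estimation = split_presample(y, x_columns, p_values=[1, 2, 3])
```

A model that needs fewer than `p_max` presample responses would use only the most recent ones, that is, the tail of `presample["y0"]`.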

The property `Q` of an `arima` model
specifies the number of presample innovations needed to initialize the MA portion of a
conditional mean model. You can get presample innovations by dividing your data into two
parts. Fit a model to the first part, and infer the innovations. Then, use the inferred
innovations as presample innovations for estimating the second part of the data.
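The two-part procedure can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Toolbox code: the fitting step is skipped and the MA coefficients are treated as already known, the helper name `ma_innovations` is hypothetical, and the first part is initialized with zero innovations.

```python
def ma_innovations(y, c, thetas, e_pre):
    """Recursively infer innovations of y_t = c + e_t + sum_j theta_j * e_{t-j}.

    thetas : MA coefficients [theta_1, ..., theta_q]
    e_pre  : presample innovations [e_{1-q}, ..., e_0], oldest first
    """
    q = len(thetas)
    if len(e_pre) != q:
        raise ValueError("an MA(q) model needs exactly q presample innovations")

    e = list(e_pre)
    for y_t in y:
        # e[-1] is the most recent innovation, e[-2] the one before, and so on.
        ma_part = sum(thetas[j] * e[-1 - j] for j in range(q))
        e.append(y_t - c - ma_part)
    return e[q:]  # drop the presample values


# Made-up MA(1) data, split into two parts.
y = [1.0, 0.5, -0.2, 0.8, 0.3, -0.4]
c, thetas = 0.0, [0.5]
q = len(thetas)
first, second = y[:3], y[3:]

# Infer innovations on the first part, starting from zero presample innovations.
e_first = ma_innovations(first, c, thetas, [0.0] * q)

# The last q inferred innovations become the presample for the second part.
e0 = e_first[-q:]
e_second = ma_innovations(second, c, thetas, e0)
```

In practice the coefficients used for the first part come from fitting a model to that part, so the presample innovations inherit its estimation error.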

For a model with both an autoregressive and moving average component, you can specify both presample responses and innovations, one or the other, or neither.

By default, `estimate` generates presample response and
innovation data automatically. The software:

- Generates presample responses by backward forecasting.
- Sets presample innovations to zero.
- Does *not* generate presample exogenous data. One option is to backward forecast each exogenous series to generate a presample during data preprocessing.
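One simple way to backward forecast a series is to reverse it in time and forecast forward from the reversed series. The Python sketch below does this for an AR model with known coefficients. It relies on the assumption that the series is stationary, so the same coefficients describe it in reversed time. The helper name `backcast_ar` is hypothetical, and the procedure that `estimate` uses internally might differ.

```python
def backcast_ar(y, c, phis, n_pre):
    """Backward forecast n_pre presample values for an AR(p) series (sketch).

    y     : observed series y_1, ..., y_T in chronological order
    phis  : AR coefficients [phi_1, ..., phi_p], assumed known
    n_pre : number of presample values to generate
    Returns the presample in chronological order, ending with y_0.
    """
    p = len(phis)
    if len(y) < p:
        raise ValueError("need at least p observations to backcast")

    # Reverse time so that the earliest observation becomes the latest one.
    rev = list(reversed(y))
    for _ in range(n_pre):
        # A one-step forecast on the reversed series is a one-step backcast.
        nxt = c + sum(phis[j] * rev[-1 - j] for j in range(p))
        rev.append(nxt)

    backcasts = rev[len(y):]          # [y_0, y_{-1}, ...]
    return list(reversed(backcasts))  # chronological order


# Made-up AR(1) example: generate two presample values.
y_pre = backcast_ar([1.0, 2.0, 3.0], c=0.0, phis=[0.5], n_pre=2)
```

The same helper can be applied to each exogenous series during preprocessing, provided an AR model has been fit to that series first.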