MATLAB Examples

Compare Robust Regression Techniques

This example compares the results of several regression techniques that are and are not robust to influential outliers.

Influential outliers are extreme response or predictor observations that influence parameter estimates and inferences of a regression analysis. Responses that are influential outliers typically occur at the extremes of the domain. For example, you can imagine an instrument that might measure a response poorly or erratically at extreme levels of temperature.

With enough evidence, you can remove influential outliers from the data. However, if this is not possible, you can use regression techniques that are robust to outliers.
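
For example, one common way to gather such evidence (supplementary to this example, not part of it) is to inspect Cook's distance after an ordinary least-squares fit; observations whose distance exceeds roughly three times the mean distance are often flagged as influential. A minimal sketch on simulated data:

% Flag potentially influential observations using Cook's distance
% (supplementary illustration; the 3x-mean threshold is a rule of thumb).
mdl = fitlm(randn(50,1),randn(50,1));
cooksD = mdl.Diagnostics.CooksDistance;
idx = find(cooksD > 3*mean(cooksD))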

Simulate Data

Generate a random sample of 100 responses from the linear model

$$y_t = 1 + 2x_t + \varepsilon_t$$

where:

  • $x$ is a vector of evenly spaced values from 0 through 2.
  • $\varepsilon_t\sim N(0,0.5^2).$
rng('default');
n = 100;
x = linspace(0,2,n)';
b0 = 1;
b1 = 2;
sigma = 0.5;
e = randn(n,1);
y = b0 + b1*x + sigma*e;

Create influential outliers by inflating all responses corresponding to $x < 0.25$ by a factor of 3.

y(x < 0.25) = y(x < 0.25)*3;

Plot the data. Hold the plot for further graphs.

figure;
plot(x,y,'o');
h = gca;
xlim = h.XLim';
hl = legend('Data');
xlabel('x');
ylabel('y');
title('Regression Techniques Comparison')
hold on;

Estimate Linear Model

Estimate the coefficients and error variance using simple linear regression. Plot the regression line.

LSMdl = fitlm(x,y)
plot(xlim,[ones(2,1) xlim]*LSMdl.Coefficients.Estimate,'LineWidth',2);
hl.String{2} = 'Least Squares';
LSMdl = 


Linear regression model:
    y ~ 1 + x1

Estimated Coefficients:
                   Estimate      SE       tStat       pValue  
                   ________    _______    ______    __________

    (Intercept)     2.6814     0.28433    9.4304    2.0859e-15
    x1             0.78974     0.24562    3.2153     0.0017653


Number of observations: 100, Error degrees of freedom: 98
Root Mean Squared Error: 1.43
R-squared: 0.0954,  Adjusted R-Squared 0.0862
F-statistic vs. constant model: 10.3, p-value = 0.00177

LSMdl is a fitted LinearModel model object. The estimated intercept is higher, and the estimated slope is lower, than the true values of 1 and 2, respectively. That is, the regression line might make poor predictions for $x < 1$ or $x > 1.6$.
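
To quantify the bias, you can compare the fitted line to the true line $y = 1 + 2x$ at a few points; this quick check is supplementary to the example:

% Compare fitted and true mean responses at a few points (supplementary).
xChk = [0 1 2]';
fitVals = [ones(3,1) xChk]*LSMdl.Coefficients.Estimate;
trueVals = b0 + b1*xChk;
disp(table(xChk,fitVals,trueVals))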

Estimate Bayesian Linear Regression Model with Diffuse Prior

Create a Bayesian linear regression model with a diffuse joint prior for the regression coefficients and error variance. Specify one predictor for the model.

PriorDiffuseMdl = bayeslm(1);

PriorDiffuseMdl is a diffuseblm model object that characterizes the joint prior distribution.

Estimate the posterior of the Bayesian linear regression model. Plot the regression line.

PosteriorDiffuseMdl = estimate(PriorDiffuseMdl,x,y);
plot(xlim,[ones(2,1) xlim]*PosteriorDiffuseMdl.Mu,'--','LineWidth',2);
hl.String{3} = 'Bayesian Diffuse';
Method: Analytic posterior distributions
Number of observations: 100
Number of predictors:   2
 
           |  Mean     Std         CI95        Positive      Distribution     
------------------------------------------------------------------------------
 Intercept | 2.6814  0.2873  [ 2.117,  3.246]    1.000   t (2.68, 0.28^2, 98) 
 Beta      | 0.7897  0.2482  [ 0.302,  1.277]    0.999   t (0.79, 0.25^2, 98) 
 Sigma2    | 2.0943  0.3055  [ 1.580,  2.773]    1.000   IG(49.00, 0.0099)    
 

PosteriorDiffuseMdl is a conjugateblm model object that characterizes the joint posterior distribution of the linear model parameters. The estimates of a Bayesian linear regression model with a diffuse prior are generally almost equal to those of a simple linear regression model. Both models represent a naive approach to influential outliers; that is, both techniques treat outliers like any other observation.
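
You can confirm the agreement directly by displaying the posterior means next to the least-squares estimates; this side-by-side check is supplementary to the example:

% Posterior means (column 1) versus least-squares estimates (column 2).
[PosteriorDiffuseMdl.Mu LSMdl.Coefficients.Estimate]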

Estimate Regression Model with ARIMA Errors

Create a regression model with ARIMA errors that has no lagged terms, and specify that the errors follow a t distribution with 3 degrees of freedom. This specification is effectively a linear regression model with t-distributed errors.

regMdl = regARIMA(0,0,0);
regMdl.Distribution = struct('Name','t','DoF',3);

regMdl is a regARIMA model object. It is a template for estimation.

Estimate the regression model with ARIMA errors. Plot the regression line.

estRegMdl = estimate(regMdl,y,'X',x);
plot(xlim,[ones(2,1) xlim]*[estRegMdl.Intercept; estRegMdl.Beta],...
    'LineWidth',2);
hl.String{4} = 'regARIMA';
 
    Regression with ARIMA(0,0,0) Error Model:
    ------------------------------------------
    Conditional Probability Distribution: t

                                  Standard          t     
     Parameter       Value          Error       Statistic 
    -----------   -----------   ------------   -----------
    Intercept        1.46133      0.153999        9.48921
        Beta1        1.65313      0.129391        12.7763
     Variance       0.931156      0.171602        5.42626
          DoF              3         Fixed          Fixed

estRegMdl is a regARIMA model object containing the estimation results. Because the t distribution has heavier tails than the normal distribution, the fitted model attributes more variability to the influential outliers than to the other observations. Hence, the regression line appears to be a better predictive model than the least-squares and diffuse Bayesian lines.
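
To see how the fitted model treats the outliers, you can compute the regression residuals by hand and flag the extreme ones; this rough check is supplementary to the example:

% Compute regression residuals and the fraction lying more than two
% innovation standard deviations from the line (supplementary check).
res = y - [ones(n,1) x]*[estRegMdl.Intercept; estRegMdl.Beta];
outlierFrac = mean(abs(res) > 2*sqrt(estRegMdl.Variance))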

Implement Quantile Regression Using Bag of Regression Trees

Grow a bag of 100 regression trees. Specify 20 for the minimum leaf size.

QRMdl = TreeBagger(100,x,y,'Method','regression','MinLeafSize',20);

QRMdl is a fitted TreeBagger model object.

Predict median responses for all observed $x$ values, that is, implement quantile regression. Plot the predictions.

qrPred = quantilePredict(QRMdl,x);
plot(x,qrPred,'LineWidth',2);
hl.String{5} = 'Quantile';
hold off;

The regression line appears to be slightly influenced by the outliers at the beginning of the sample, but then quickly follows the regARIMA model line.

You can adjust the behavior of the line by specifying various values for MinLeafSize when you train the bag of regression trees. With lower MinLeafSize values, the line tends to follow the data more closely.
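
For example, you can retrain with a smaller minimum leaf size and compare the predictions; the value 5 here is an arbitrary choice for illustration:

% Retrain with a smaller minimum leaf size (illustrative value) and
% predict median responses again for comparison.
QRMdlSmall = TreeBagger(100,x,y,'Method','regression','MinLeafSize',5);
qrPredSmall = quantilePredict(QRMdlSmall,x);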

Implement Robust Bayesian Linear Regression

Consider a Bayesian linear regression model containing one predictor and t-distributed disturbances with a profiled degrees-of-freedom parameter $\nu$. That is, let:

  • $\lambda_j\sim IG(\nu/2,2/\nu)$.
  • $\varepsilon_j\vert\lambda_j\sim N(0,\lambda_j\sigma^2)$.
  • $f(\beta,\sigma^2)\propto \frac{1}{\sigma^2}$.

These assumptions imply:

  • $\varepsilon_j\sim t(0,\sigma^2,\nu)$
  • $\lambda_j\vert\varepsilon_j\sim IG\left(\frac{\nu + 1}{2},\frac{2}{\nu + \varepsilon_j^2/\sigma^2}\right)$

$\lambda$ is a vector of latent scale parameters that attributes low precision to observations far from the regression line. $\nu$ is a hyperparameter that controls the influence of $\lambda$ on the observations.

Specify a grid of values for $\nu$. A value of 1 corresponds to the Cauchy distribution, 2.1 means that the mean and variance of the disturbances are well defined, and 100 means that the distribution is approximately normal.

nu = [0.01 0.1 1 2.1 5 10 100];
numNu = numel(nu);

For this problem, the Gibbs sampler is well-suited to estimate the coefficients because you can simulate the parameters of a Bayesian linear regression model conditioned on $\lambda$, and then simulate $\lambda$ from its conditional distribution.

Implement this Gibbs sampler.

  1. Draw parameters from the posterior distribution of $\beta,\sigma^2\vert y,x,\lambda$. To do this, deflate the observations by $\sqrt{\lambda_j}$, create a diffuse prior model with two regression coefficients, and then draw a set of parameters from the posterior. The first regression coefficient corresponds to the intercept, so specify that bayeslm should not include an intercept term itself.
  2. Compute residuals.
  3. Draw values from the conditional posterior of $\lambda$.

For each value of $\nu$, run the Gibbs sampler for 20000 iterations and apply a burn-in period of 5000. Preallocate variables for the posterior draws, and initialize $\lambda$ to a vector of ones.

rng(1);
m = 20000;
burnin = 5000;
lambda = ones(n,m + 1,numNu); % Preallocation
estBeta = zeros(2,m + 1,numNu);
estSigma2 = zeros(1,m + 1,numNu);

% Create diffuse prior model.
PriorMdl = bayeslm(2,'Model','diffuse','Intercept',false);

for p = 1:numNu
    for j = 1:m
        % Scale observations.
        yDef = y./sqrt(lambda(:,j,p));
        xDef = [ones(n,1) x]./sqrt(lambda(:,j,p));

        % Simulate observations from conditional posterior of beta and
        % sigma2 given lambda and the data.
        [estBeta(:,j + 1,p),estSigma2(1,j + 1,p)] = simulate(PriorMdl,xDef,yDef);

        % Estimate residuals.
        ep = y - [ones(n,1) x]*estBeta(:,j + 1,p);

        % Specify shape and scale using conditional posterior of lambda
        % given beta, sigma2, and the data
        sp = (nu(p) + 1)/2;
        sc =  2./(nu(p) + ep.^2/estSigma2(1,j + 1,p));

        % Draw from conditional posterior of lambda given beta, sigma2,
        % and the data
        lambda(:,j + 1,p) = 1./gamrnd(sp,sc);
    end
end

Estimate posterior means for $\beta$ and $\lambda$ by averaging the draws after the burn-in period.

postEstBeta = squeeze(mean(estBeta(:,(burnin+1):end,:),2));
postLambda = squeeze(mean(lambda(:,(burnin+1):end,:),2));
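
A compact way to inspect the results, supplementary to the example, is to display the posterior coefficient means beneath the corresponding values of $\nu$:

% Display nu (row 1) above the posterior means of the intercept (row 2)
% and slope (row 3) for each prior setting (supplementary display).
disp([nu; postEstBeta])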

For each $\nu$, plot the data and regression lines.

figure;
plot(x,y,'o');
h = gca;
xlim = h.XLim';
ylim = h.YLim;
hl = legend('Data');
hold on;
for p = 1:numNu
    plotY = [ones(2,1) xlim]*postEstBeta(:,p);
    plot(xlim,plotY,'LineWidth',2);
    hl.String{p+1} = sprintf('nu = %g',nu(p));
end
xlabel('x');
ylabel('y');
title('Robust Bayesian Linear Regression')

Low values of $\nu$ attribute high variability to the influential outliers, so those regression lines resemble the regARIMA line. As $\nu$ increases, the lines behave more like the lines of the naive approaches.