Code covered by the BSD License  

Highlights from
regout

image thumbnail
from regout by Antonio Trujillo-Ortiz
Outlier test on an analysis of regression based on the externally studentized residual, R-student.

regout(D,alpha)
function [regout] = regout(D,alpha)
% REGOUT Outlier Test on an Analysis of Regression Based on R-Student.
%  Among the considerations in the use of analysis of regression, outliers or bad
%  values can seriously disturb the least-squares fit. They falls far from the line
%  implied by the rest of the data. If these points are really outliers, then the
%  estimate of the intercept may be incorrect and the residual mean square may be an
%  inflated estimate of the true variance. There are methods for scaled residuals
%  useful in finding observations that are outliers. One of them is the externally
%  studentized residual, usually called R-student. It is based on the fact that the
%  MSRes is an internally generated estimate of the variance obtained from fitting the
%  model to all n observations and it is necessary to get an estimation based on a
%  data set with the ith observation removed,
%
%                  s2_(i) = ((n-p)*MSRes - e_i^2/(1-h_ii))/(n-p-1).
%
%  Where e_i is the ith residual, h_ii is the ith diagonal element of the hat matrix H,
%  p is the number of regression coefficients (including the intercept) and n the number
%  of data.
%
%  As a result, an appropriate test for the a mean shift model is,
%
%                             t_i = e_i/sqrt(s2_(i)*(1-h_ii))
%
%  wich is the externally studentized residual or R-student. This statistic follows the
%  a Student t-distribution. But one could use a Bonferroni-type approach and compare all
%  n values of t_i to t_(alpha/2*n),n-p-1 to provide guidance regarding outliers. 
%
%  We agree with Montgomery et al. (2001) in the sense that...'it is our view that a formal
%  approach is usually not necessary and that only relatively crude cutoff values need be
%  considered and that, in general, a diagnostic view as opposed to a strict statistical
%  hypothesis-testing view is best'. Futhermore, detection of outliers often needs to be 
%  considered simultaneously with detection of influential observations, as you can find in
%  the Cookdist m-file in this same author page [ http://www.mathworks.com/matlabcentral/
%  fileexchange/loadFile.do?objectId=8716 ]
%  approach
%
%  Syntax: regout(D,alpha) 
%      
%  Inputs:
%       D - matrix data (=[X Y]) (last column must be the Y-dependent variable).
%           (X-independent variable entry can be for a simple [X], multiple [X1,X2,X3,...Xp] 
%           or polynomial [X,X^2,X^3,...,X^p] regression model).
%   alpha - significance (default = 0.05).
%
%  Outputs:
%       A complete summary (table and/or plot) of the outliers diagnostic test.
%
%  From the example 3.1 of Montgomery et al. (2001), to fit a multiple regression model with
%  two predictors, we are interested to test if there is some outliers observations by using
%  an externally studentized residual R-student with a significance value of 0.05.
%
%               -------------------   -------------------
%                 X1    X2     Y        X1    X2     Y
%               -------------------   -------------------
%                  7   560   16.68       6   462   19.75
%                  3   220   11.50       9   448   24.00
%                  3   340   12.03      10   776   29.00
%                  4    80   14.88       6   200   15.35
%                  6   150   13.75       7   132   19.00
%                  7   330   18.11       3    36    9.50
%                  2   110    8.00      17   770   35.10
%                  7   210   17.83      10   140   17.90
%                 30  1460   79.24      26   810   52.32
%                  5   605   21.50       9   450   18.75
%                 16   688   40.33       8   635   19.83
%                 10   215   21.00       4   150   10.75
%                  4    255  13.50    -------------------
%               -------------------
%
%  Data matrix must be:
%  D=[7 560 16.68;3 220 11.5;3 340 12.03;4 80 14.88;6 150 13.75;7 330 18.11;2 110 8;7 210 17.83;
%  30 1460 79.24;5 605 21.5;16 688 40.33;10 215 21;4 255 13.5;6 462 19.75;9 448 24;10 776 29;
%  6 200 15.35;7 132 19;3 36 9.5;17 770 35.1;10 140 17.9;26 810 52.32;9 450 18.75;8 635 19.83;
%  4 150 10.75];
%
%  Calling on Matlab the function: 
%             outreg(D)
%
%  Answer is:
%
%  Outliers diagnostic test of a regression analysis based on R-student.
%  ----------------------------------------------------------------------------
%   Observation               Y                   e                 R-student              
%  ----------------------------------------------------------------------------
%        9                 79.2400             7.4197                4.3108
%  ----------------------------------------------------------------------------
%  (t_(alpha/2*n),n-p-1 = 3.5272, alpha = 0.05, p = 3, n = 25)
%
%  Created by A. Trujillo-Ortiz, R. Hernandez-Walls and F.A. Trujillo-Perez
%             Facultad de Ciencias Marinas
%             Universidad Autonoma de Baja California
%             Apdo. Postal 453
%             Ensenada, Baja California
%             Mexico.
%             atrujo@uabc.mx
%             And the special collaboration of the post-graduate students of the 2005:2
%             Multivariate Statistics Course: D.A. Paz-Garcia, H.E. Chavez-Romo, 
%             K. Xolaltenco-Coyotl, and A. Montiel-Boehringer.
%  Copyright (C) October 25, 2005.
%
%  To cite this file, this would be an appropriate format:
%  Trujillo-Ortiz, A., R. Hernandez-Walls, F.A. Trujillo-Perez, D.A. Paz-Garcia, H.E. Chavez-Romo,
%         K. Xolaltenco-Coyotl and A. Montiel-Boehringer. (2005). REGOUT:Outlier Test on an Analysis
%         of Regression Based on R-Student. A MATLAB file. [WWW document]. URL http://www.mathworks.com/
%         matlabcentral/fileexchange/loadFile.do?objectId=8896
%
%  References:
%  Montgomery, D. C., Peck, E. A. and Vining, G. G. (2001), Introduction to Linear Regression
%          Analysis. 3rd Ed., New-York:John Wiley & Sons, Inc.
%

if nargin < 2,
   alpha = 0.05; %(default)
else
    alpha;
end; 

if (alpha <= 0 | alpha >= 1)
   fprintf('Warning: significance level must be between 0 and 1\n');
   return;
end;

[r c] = size(D);

n = r; %number of data

Y = D(:,c); %response vector

X = [ones(n,1) D(:,1:c-1)]; %design matrix

p = size(X,2); %number of parameters (including intercept)

b = inv(X'*X)*(X'*Y); %least squares parameters estimation

Ye = X*b; %expected response value

e = Y-Ye; %residual term

SSRes = e'*e;  %residual sum of squares

[rb,cb] = size(b);

v2 = n-rb; %residual degrees of freedom

MSRes = SSRes/v2; %residual mean square

Rse = sqrt(MSRes); %standard error term

H = X*inv(X'*X)*X'; %hat matrix

hii = diag(H); %leverage of the i-th observation

s2i = (((n-p)*MSRes)/(n-p-1))-(e.^2./(1-hii))/(n-p-1); %internal scaling

ti = e./sqrt(s2i.*(1-hii));

tc = tinv(1-(alpha/(2*n)),n-p-1); %R-student (externally studentized residual)

o = any(ti>tc,2); %detection of outliers

ii = [];
for i = 1:n,
    ii = [ii;i];
end;

disp('  ');
if sum(o)==0;
    disp('No outliers observation are identified.');
    disp(['(t_(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
    plot(Ye,ti,'b*');
    xlabel('Predicted value (Y_i)'); 
    ylabel('R-student (t_i)'); 
    ob = (['(t(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
    title({'Plot of residuals versus R-student value to detect presumed outliers.','' num2str(ob) ''})
    legend(['No presumed outlier(s) are detected'],0);
else
    disp('Outliers diagnostic test of a regression analysis based on R-student.');
    disp('----------------------------------------------------------------------------');
    disp(' Observation               Y                   e                 R-student              ');
    disp('----------------------------------------------------------------------------');
    fprintf('      %i                 %.4f             %.4f                %.4f\n',[ii(o) Y(o) e(o) ti(o)]');
    disp('----------------------------------------------------------------------------');
    disp(['(t_(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
    
    hold on
    plot(Ye,ti,'b*');
    xlabel('Predicted value (Y_i)'); 
    ylabel('R-student (t_i)'); 
    ob = (['(t(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
    title({'Plot of residuals versus R-student value to detect presumed outliers.','' num2str(ob) ''})
    
    id = [Ye(o) ti(o)];
    if sum(o)>=1.0,
        plot(id(:,1),id(:,2),'sb');
        legend(['+',['   ' 2] ' = Presumed Outlier(s) '],0);
    else
    end;
end;
hold off

return;

Contact us at files@mathworks.com