function [regout] = regout(D,alpha)
% REGOUT Outlier Test on an Analysis of Regression Based on R-Student.
% Among the considerations in the use of analysis of regression, outliers or bad
% values can seriously disturb the least-squares fit. They falls far from the line
% implied by the rest of the data. If these points are really outliers, then the
% estimate of the intercept may be incorrect and the residual mean square may be an
% inflated estimate of the true variance. There are methods for scaled residuals
% useful in finding observations that are outliers. One of them is the externally
% studentized residual, usually called R-student. It is based on the fact that the
% MSRes is an internally generated estimate of the variance obtained from fitting the
% model to all n observations and it is necessary to get an estimation based on a
% data set with the ith observation removed,
%
% s2_(i) = ((n-p)*MSRes - e_i^2/(1-h_ii))/(n-p-1).
%
% Where e_i is the ith residual, h_ii is the ith diagonal element of the hat matrix H,
% p is the number of regression coefficients (including the intercept) and n the number
% of data.
%
% As a result, an appropriate test for the a mean shift model is,
%
% t_i = e_i/sqrt(s2_(i)*(1-h_ii))
%
% wich is the externally studentized residual or R-student. This statistic follows the
% a Student t-distribution. But one could use a Bonferroni-type approach and compare all
% n values of t_i to t_(alpha/2*n),n-p-1 to provide guidance regarding outliers.
%
% We agree with Montgomery et al. (2001) in the sense that...'it is our view that a formal
% approach is usually not necessary and that only relatively crude cutoff values need be
% considered and that, in general, a diagnostic view as opposed to a strict statistical
% hypothesis-testing view is best'. Futhermore, detection of outliers often needs to be
% considered simultaneously with detection of influential observations, as you can find in
% the Cookdist m-file in this same author page [ http://www.mathworks.com/matlabcentral/
% fileexchange/loadFile.do?objectId=8716 ]
% approach
%
% Syntax: regout(D,alpha)
%
% Inputs:
% D - matrix data (=[X Y]) (last column must be the Y-dependent variable).
% (X-independent variable entry can be for a simple [X], multiple [X1,X2,X3,...Xp]
% or polynomial [X,X^2,X^3,...,X^p] regression model).
% alpha - significance (default = 0.05).
%
% Outputs:
% A complete summary (table and/or plot) of the outliers diagnostic test.
%
% From the example 3.1 of Montgomery et al. (2001), to fit a multiple regression model with
% two predictors, we are interested to test if there is some outliers observations by using
% an externally studentized residual R-student with a significance value of 0.05.
%
% ------------------- -------------------
% X1 X2 Y X1 X2 Y
% ------------------- -------------------
% 7 560 16.68 6 462 19.75
% 3 220 11.50 9 448 24.00
% 3 340 12.03 10 776 29.00
% 4 80 14.88 6 200 15.35
% 6 150 13.75 7 132 19.00
% 7 330 18.11 3 36 9.50
% 2 110 8.00 17 770 35.10
% 7 210 17.83 10 140 17.90
% 30 1460 79.24 26 810 52.32
% 5 605 21.50 9 450 18.75
% 16 688 40.33 8 635 19.83
% 10 215 21.00 4 150 10.75
% 4 255 13.50 -------------------
% -------------------
%
% Data matrix must be:
% D=[7 560 16.68;3 220 11.5;3 340 12.03;4 80 14.88;6 150 13.75;7 330 18.11;2 110 8;7 210 17.83;
% 30 1460 79.24;5 605 21.5;16 688 40.33;10 215 21;4 255 13.5;6 462 19.75;9 448 24;10 776 29;
% 6 200 15.35;7 132 19;3 36 9.5;17 770 35.1;10 140 17.9;26 810 52.32;9 450 18.75;8 635 19.83;
% 4 150 10.75];
%
% Calling on Matlab the function:
% outreg(D)
%
% Answer is:
%
% Outliers diagnostic test of a regression analysis based on R-student.
% ----------------------------------------------------------------------------
% Observation Y e R-student
% ----------------------------------------------------------------------------
% 9 79.2400 7.4197 4.3108
% ----------------------------------------------------------------------------
% (t_(alpha/2*n),n-p-1 = 3.5272, alpha = 0.05, p = 3, n = 25)
%
% Created by A. Trujillo-Ortiz, R. Hernandez-Walls and F.A. Trujillo-Perez
% Facultad de Ciencias Marinas
% Universidad Autonoma de Baja California
% Apdo. Postal 453
% Ensenada, Baja California
% Mexico.
% atrujo@uabc.mx
% And the special collaboration of the post-graduate students of the 2005:2
% Multivariate Statistics Course: D.A. Paz-Garcia, H.E. Chavez-Romo,
% K. Xolaltenco-Coyotl, and A. Montiel-Boehringer.
% Copyright (C) October 25, 2005.
%
% To cite this file, this would be an appropriate format:
% Trujillo-Ortiz, A., R. Hernandez-Walls, F.A. Trujillo-Perez, D.A. Paz-Garcia, H.E. Chavez-Romo,
% K. Xolaltenco-Coyotl and A. Montiel-Boehringer. (2005). REGOUT:Outlier Test on an Analysis
% of Regression Based on R-Student. A MATLAB file. [WWW document]. URL http://www.mathworks.com/
% matlabcentral/fileexchange/loadFile.do?objectId=8896
%
% References:
% Montgomery, D. C., Peck, E. A. and Vining, G. G. (2001), Introduction to Linear Regression
% Analysis. 3rd Ed., New-York:John Wiley & Sons, Inc.
%
if nargin < 2,
alpha = 0.05; %(default)
else
alpha;
end;
if (alpha <= 0 | alpha >= 1)
fprintf('Warning: significance level must be between 0 and 1\n');
return;
end;
[r c] = size(D);
n = r; %number of data
Y = D(:,c); %response vector
X = [ones(n,1) D(:,1:c-1)]; %design matrix
p = size(X,2); %number of parameters (including intercept)
b = inv(X'*X)*(X'*Y); %least squares parameters estimation
Ye = X*b; %expected response value
e = Y-Ye; %residual term
SSRes = e'*e; %residual sum of squares
[rb,cb] = size(b);
v2 = n-rb; %residual degrees of freedom
MSRes = SSRes/v2; %residual mean square
Rse = sqrt(MSRes); %standard error term
H = X*inv(X'*X)*X'; %hat matrix
hii = diag(H); %leverage of the i-th observation
s2i = (((n-p)*MSRes)/(n-p-1))-(e.^2./(1-hii))/(n-p-1); %internal scaling
ti = e./sqrt(s2i.*(1-hii));
tc = tinv(1-(alpha/(2*n)),n-p-1); %R-student (externally studentized residual)
o = any(ti>tc,2); %detection of outliers
ii = [];
for i = 1:n,
ii = [ii;i];
end;
disp(' ');
if sum(o)==0;
disp('No outliers observation are identified.');
disp(['(t_(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
plot(Ye,ti,'b*');
xlabel('Predicted value (Y_i)');
ylabel('R-student (t_i)');
ob = (['(t(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
title({'Plot of residuals versus R-student value to detect presumed outliers.','' num2str(ob) ''})
legend(['No presumed outlier(s) are detected'],0);
else
disp('Outliers diagnostic test of a regression analysis based on R-student.');
disp('----------------------------------------------------------------------------');
disp(' Observation Y e R-student ');
disp('----------------------------------------------------------------------------');
fprintf(' %i %.4f %.4f %.4f\n',[ii(o) Y(o) e(o) ti(o)]');
disp('----------------------------------------------------------------------------');
disp(['(t_(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
hold on
plot(Ye,ti,'b*');
xlabel('Predicted value (Y_i)');
ylabel('R-student (t_i)');
ob = (['(t(alpha/2*n),n-p-1 = ',num2str(tc) ', ' 'alpha = ',num2str(alpha) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
title({'Plot of residuals versus R-student value to detect presumed outliers.','' num2str(ob) ''})
id = [Ye(o) ti(o)];
if sum(o)>=1.0,
plot(id(:,1),id(:,2),'sb');
legend(['+',[' ' 2] ' = Presumed Outlier(s) '],0);
else
end;
end;
hold off
return;