Code covered by the BSD License  

Highlights from
Cookdist

image thumbnail
from Cookdist by Antonio Trujillo-Ortiz
Cook's distance influence index.

Cookdist(D)
function Cookdist(D)
% COOKDIST Cook's Distance Influence Index.
%  This quantity measures how much the entire regression function changes when
%  the i-th observation is deleted. Should be comparable to F_p,n-p: if the 'p-value'
%  of  D_i is 50 percent or more, then the i-th point is likely influential:
%  investigate this point further. Cook's distance (D_i) [Cook, 1977,1979] is an
%  influence measure based on the difference between the regression parameter
%  estimates b and what they become if the i-th data point is removed, b_-1.
%  Let y_(i) be the value opf the fitted response corresponding to the ith observation
%  removed. Then e_(i)=y_i - y_(i) is called the deleted residual. This is easily
%  computed if the leverage h_ii of the observation is known,
%
%                             e_(i) = e_i/(1 - h_ii).
%
%  According to the i-th studentized residual,
%
%                         r_i = e_i/sqrt(MSRes*(1-h_ii)),
%
%   D_i = (r_i^2/p)*(h_ii/(1-h_ii)) = (e_i^2/p*MSRes*(1-h_ii))*(h_ii/(1-h_ii))
%
%                    = e_i^2*(h_ii/(1-h_ii)^2)*(1/(p*MSRes)).
%
%  The usual criterion is that a point is influential if D_i exceeds the median of
%  the F_p,n-p distribution, where p is the number of regression coefficients (including
%  the intercept) and n the number of data.
%
%  **NOTE.- One should be careful. This procedure it is not a conclusive test to detect
%  any outliers on regression models, but unusual observations by its very high leverage
%  and high influence values. For such a case you should to check it under the appropriate 
%  assumptions.**
%
%  Syntax: cookdist(D) 
%      
%  Inputs:
%       D - matrix data (=[X Y]) (last column must be the Y-dependent variable).
%           (X-independent variable entry can be for a simple [X], multiple [X1,X2,X3,...Xp] 
%           or polynomial [X,X^2,X^3,...,X^p] regression model).
%
%  Outputs:
%       A complete summary (table and/or plot) of the Cook's influence index. For the graph,
%       the cross-hair can be positioned with the mouse at the selected location.  
%
%  From the example 4.1 of Jennrich (1995), we are interested to investigate if there is some
%  effect on the fitted response of removing the i-th observation or influence.
%
%              -------------------   -------------------
%                 X1    X2     Y        X1    X2     Y
%              -------------------   -------------------
%                 68    60    75        71    86    70
%                 49    94    63        95    94    96
%                 60    91    57        61    94    76
%                 68    81    88        72    94    75
%                 97    80    88        87    79    85
%                 82    92    79        40    50    40
%                 59    74    82        66    92    74
%                 50    89    73        58    82    70
%                 73    96    90        58    94    75
%                 39    87    62        77    78    72
%              -------------------   -------------------
%
%  Data matrix must be:
%  D=[68 60 75;49 94 63;60 91 57;68 81 88;97 80 88;82 92 79;59 74 82;50 89 73;73 96 90;
%  39 87 62;71 86 70;95 94 96;61 94 76;72 94 75;87 79 85;40 50 40;66 92 74;58 82 70;
%  58 94 75;77 78 72];
%
%  Calling on Matlab the function: 
%             cookdist(D)
%
%  Answer is:
%
%  Exploring the influence of observation by the Cook's distance to list those
%  particulary suspicious.
%  ----------------------------------------------------------------------------
%   Observation                Y                   e                 Cook              
%  ----------------------------------------------------------------------------
%        16                 40.0000            -10.3917             1.414
%  ----------------------------------------------------------------------------
%  (Fp,n-p(0.5) = 0.82121, p = 3, n = 20)
%
%  Created by A. Trujillo-Ortiz, R. Hernandez-Walls and F.A. Trujillo-Perez
%             Facultad de Ciencias Marinas
%             Universidad Autonoma de Baja California
%             Apdo. Postal 453
%             Ensenada, Baja California
%             Mexico.
%             atrujo@uabc.mx
%             And the special collaboration of the post-graduate students of the 2005:2
%             Multivariate Statistics Course: D.A. Paz-Garcia, H.E. Chavez-Romo, 
%             K. Xolaltenco-Coyotl, and A. Montiel-Boehringer.
%  Copyright (C) October 6, 2005.
%
%  To cite this file, this would be an appropriate format:
%  Trujillo-Ortiz, A., R. Hernandez-Walls, F.A. Trujillo-Perez, D.A. Paz-Garcia, H.E. Chavez-Romo,
%         K. Xolaltenco-Coyotl and A. Montiel-Boehringer. (2005). COOKDIST:Cook's Distance Influence Index. 
%         A MATLAB file. [WWW document]. URL http://www.mathworks.com/matlabcentral/fileexchange/
%         loadFile.do?objectId=8716
%
%  References:
%  Cook, R. D. (1977), Detection of influential observations in linear regression.
%          Technometrics, 19:15-18.
%  Cook, R. D. (1979), Influential observations in linear regression. Journal of
%          the American Statistical Association, 74:169-174.
%  Jennrich, R. I. (1995), An Introduction to Computational Statistics:Regression Analysis.
%          Inglewood Cliffs, NJ: Prentice Hall.    
%
%  -------------------------------
%  Modified 10/15/05 to address suggestions by Urs Schwartz.
%  -------------------------------
%

[r c] = size(D);

n = r; %number of data

Y = D(:,c); %response vector

X = [ones(n,1) D(:,1:c-1)]; %design matrix

p = size(X,2); %number of parameters

b = inv(X'*X)*(X'*Y); %least squares parameters estimation

Ye = X*b; %expected response value

e = Y-Ye; %residual term

SSRes = e'*e;  %residual sum of squares

[rb,cb] = size(b);

v2 = n-rb; %residual degrees of freedom

MSRes = SSRes/v2; %residual mean square

Rse = sqrt(MSRes); %standard error term

H = X*inv(X'*X)*X'; %hat matrix

hii = diag(H); %leverage of the i-th observation

ri = e./(Rse*sqrt(1-hii)); %Studentized residual

Di = diag((ri.^2/p)*[(hii./(1-hii))]'); %Cook's distance

F50 = finv(0.5,p,n-p); %mean of the F-distribution

d = [];
for i=1:n,
    d=[d;i];
end;

in = any(Di>F50,2);
I = [Di(in)];
O = d(in);
y = Y(in);
ee = e(in);

disp('  ')
if sum(in)==0;
    disp('No influence of the i-th observation is identified.');
    disp(['(Fp,n-p(0.5) = ',num2str(F50) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
else (sum(in)>0);
    disp('Exploring the influence of observation by the Cook''s distance to list those');
    disp('particulary suspicious.');
    disp('----------------------------------------------------------------------------');
    disp(' Observation                Y                   e                 Cook              ');
    disp('----------------------------------------------------------------------------');
    fprintf('      %i                 %.4f            %.4f             %.3f\n',[O y ee I]');
    disp('----------------------------------------------------------------------------');
    disp(['(Fp,n-p(0.5) = ',num2str(F50) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
end;

hold on
stem(Di,'*b')
xlabel('Observation number');
ylabel('Cook''s distance');
st = (['(Fp,n-p(0.5) = ',num2str(F50) ', ' 'p = ',num2str(p) ', ' 'n = ',num2str(n) ')']);
title({'Cook''s distance plot.','' num2str(st) ''})

id = [I O];
if sum(in)>=1.0,
    plot(id(:,2),id(:,1),'sb');
    legend(['+',['   ' 2] ' = Influential observation(s) '],0);
else
end;
hold off

return;

Contact us at files@mathworks.com