Code covered by the BSD License  

Highlights from
AnDarksamtest

from AnDarksamtest by Antonio Trujillo-Ortiz
Anderson-Darling k-sample procedure to test whether k sampled populations are identical.

AnDarksamtest(X,alpha)
function [AnDarksamtest] = AnDarksamtest(X,alpha)
%ANDARKSAMTEST Anderson-Darling k-sample procedure to test whether
% k sampled populations are identical
% 
% Anderson and Darling (1952,1954) introduced a goodness-of-fit statistic
%
%                         _inf            2
%                        /  {Fm(x) - Fo(x)}
%              ADm = m  /  ----------------- dFo(x)
%                 -inf_/    Fo(x){1 - Fo(x)}
%  
% to test the hypothesis that a random sample Fm(x) comes from a continuous
% population with a specified distribution function Fo(x). It is a 
% modification of the Kolmogorov-Smirnov (K-S) test and gives more weight
% to the tails than the K-S test.
%
% The corresponding two-sample version,
%
%                              _inf            2
%                  mn         /  {Fm(x) - Gn(x)}
%          ADmn = ----       /  ----------------- dHn(x)
%                   N  -inf_/    Hn(x){1 - Hn(x)}
%
% was proposed by Darling (1957) and studied in detail by Pettitt (1976). 
% Where Gn(x) is the empirical distribution function of the second 
% independent sample G(x) and Hn(x) = {mFm(x) + nGn(x)}/N, with N = m+n, is
% the empirical distribution function of the pooled sample. It is used to
% test the hypothesis that F = G.
%
% The Anderson-Darling k-sample test was introduced by Scholz and Stephens
% (1987) as a generalization of the two-sample Anderson-Darling test. It is
% a nonparametric statistical procedure, i.e., a rank test, and, thus, 
% requires no assumptions other than that the samples are true independent 
% random samples from their respective continuous populations (although 
% provisions for tied observations are made). It tests the hypothesis that
% the populations from which two or more independent samples of data were
% drawn are identical. This test can be used to decide whether data from
% different sources may be combined, because they are judged to come from
% one common distribution, i.e., the null hypothesis Ho of same population
% distributions cannot be rejected. In its opposite use, it can be seen as
% a generalization of a one-way ANOVA  for which the k-sample Kruskal-Wallis
% test (1952, 1953) is the most commonly used rank test.
%
% It is an omnibus test because of its effectiveness against all alternatives
% to the null hypothesis Ho's (all k populations being equal). For example,
% it is effective for changes in scale while locations are matched, which is
% a weakness of the Kruskal-Wallis test.
%
% The Anderson-Darling k-sample procedure assumes that i-th sample has a 
% continuous distribution function and we are interested in testing the null
% hypothesis that all sampled populations have the same distribution, without
% specifying the nature of that common distribution,
%
%                        Ho: F1 = F2 = ... = Fk
%
% without specifying the nature of that common distribution F. The statistic
% is,
%
%                _k _          _               2
%                \           /  {Fi(x) - Hn(x)}
%        ADK  =  /_ _   ni  /  ----------------- dHn(x)
%                i=1   Bn _/    Hn(x){1 - Hn(x)}
%
% where Bn = {x E R:Hn(x) < 1}.
%
% The computational formula for ADK* adjusted for ties is,
%
%                     _k _          _L _                2
%               n-1   \        1    \      {nFij - niHj}
%        ADK  = ----  /_ _  [ ----  /_ _  ---------------- ]
%               n^2   i=1      ni   j=1   Hj(n-Hj) - nhj/4
%               
%
% and the corresponding not adjusted for ties is
%
%                     _k _          L-1_               2
%                1    \        1    \       {nFij - niHj}
%        ADKn = ----  /_ _  [ ----  /_ _  ---------------- ]
%                n    i=1      ni   j=1        Hj(n-Hj)
%
% where:
%  k = number of groups (samples); i=1,2,...,k
%  ni = data values in the ith group; j=1,2,...ni
%  n = number total of observations (data); n=n1+n2+...+nk
%  xij = data in the i group and j observation within that group
%  z(j) = distinct values of all combined data ordered in ascendent way
%         denoted z(1),z(2),...,z(L)
%  L = lergest unique value, where it will be less than n with tied values
%  hj = number of values in the pooled sample equal to z(j)
%  Hj = number of values in the combined samples less than zj plus one half
%       the number of values in the combined samples equal to zj
%  Fij = number of values in the ith group which are less than zj plus one
%        half the number of values in this group which are equal to zj
%  
% Under Ho and assuming continuity of the common F distribution the mean of
% ADK is k-1 and the variance* is,
%                            
%                          an^3 + bn^2 + cn + d 
%               var(ADK) = --------------------- 
%                             (n-1)(n-2)(n-3)
%
% where,
%               a = (4g - 6)(k-1) + (10 - 6g)S
%               b = (2g -4)k^2 + 8Tk + (2g - 14T -4)S - 8T + 4g -6
%               c = (6T + 2g -2)k^2 + (4T - 4g + 6)k + (2T -6)S + 4T
%               d = (2T +6)k^2 - 4Tk
%                
% with,
%                   _k _               _n-1_
%                   \      1           \      1
%               S = /_ _  ---- ,   T = /_ _  ----  
%                          ni           i=1   i
%
% and
%                         _n-2_   _n-1_
%                         \       \        1
%                     g = /_ _    /_ _   ------  
%                         i=1     j=i+1  (n-i)j
%
% The observed k-sample Anderson-Darling statistic (ADK) is standardized
% using its exact sample mean and standard deviation to removes some of its
% dependence on the sample size. Then,
%
%                             ADK - (k-1)
%                      ADKs = ------------
%                               std(ADK)
%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% *NOTE.-In some literature as in DOD-MIL-HDBK-17-1E_Vol 1, Ch. 8, Polymer
% Matrix Composites (1997) you can find this statistic with an equivalent
% mathematical expressions as:
%
%                         _k _          _L _                2
%                 n-1     \        1    \      {nFij - niHj}
%        ADK  = --------  /_ _  [ ----  /_ _  ---------------- ] ,
%               n^2(k-1)   i=1      ni   j=1   Hj(n-Hj) - nhj/4
%
%
%                              an^3 + bn^2 + cn + d 
%                 var(ADK) = ------------------------ ,
%                             (n-1)(n-2)(n-3)(k-1)^2
%
% and
%
%                                  ADK - 1
%                          ADKs = --------- .
%                                  std(ADK)
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% The approximate P-value of the observed ADK statistic can be calculated
% using a spline interpolation method. For the interested users, we are
% including, as a comment, the mathematical procedure to get the ADK 
% critical value.
%
% The interpolation coefficients for each alpa-value,
%
%               alpha = [0.25,0.10 0.05 0.025 0.01]
% are:
%                  bo = [0.675,1.281,1.645,1.96,2.326]
%                  b1 = [-0.245,0.25,0.678,1.149,1.822]
%                  b2 = [-0.105,-0.305,-0.362,-0.391,-0.396]
%
% and the general formula interpolation is,
%
%                  qnt = bo + b1/sqrt(k-1) + b2/(k-1)
%
% We give the Anderson-Darling k-sample procedure with and without 
% adjustment for ties.
%
% Finally, we compare the P-value with the desired significance level alpha
% to facilitate a decision about the null hypothesis Ho. 
%
% Syntax: function AnDarksamtest(X,alpha) 
%      
% Inputs:
%      X - data matrix (Size of matrix must be n-by-2; data=column 1,
%          sample=column 2) 
%  alpha - significance level (default = 0.05)
%
% Output:
%        - Complete Anderson-Darling k-sample test
%
% Example: From the example given by Scholz and Stephens (1987, p.922). It
% is interested to test if the four sets of eight measurements each of the
% smoothness of a certain type of paper obtained by four independent 
% laboratories are homogeneous or identical to consider them
% unstructurated. We use an alpha-value = 0.05. Data are,
%
%  -----------------------------------------------------------------
%    Laboratory                      Smoothness
%  -----------------------------------------------------------------
%        1          38.7  41.5  43.8  44.5  45.5  46.0  47.7  58.0
%        2          39.2  39.3  39.7  41.4  41.8  42.9  43.3  45.8
%        3          34.0  35.0  39.0  40.0  43.0  43.0  44.0  45.0
%        4          34.0  34.8  34.8  35.4  37.2  37.8  41.2  42.8
%  -----------------------------------------------------------------
%
% Data vector is:
%  X=[38.7 1;41.5 1;43.8 1;44.5 1;45.5 1;46.0 1;47.7 1;58.0 1;
%  39.2 2;39.3 2;39.7 2;41.4 2;41.8 2;42.9 2;43.3 2;45.8 2;
%  34.0 3;35.0 3;39.0 3;40.0 3;43.0 3;43.0 3;44.0 3;45.0 3; 
%  34.0 4;34.8 4;34.8 4;35.4 4;37.2 4;37.8 4;41.2 4;42.8 4];
%
% Calling on Matlab the function: 
%            AnDarksamtest(X)
%
% Answer is:
%
% K-sample Anderson-Darling Test
% ----------------------------------------------------------------------------
% Number of samples: 4
% Sample sizes: 8 8 8 8
% Total number of observations: 32
% Number of ties (identical values): 3
% Mean of the Anderson-Darling rank statistic: 3
% Standard deviation of the Anderson-Darling rank statistic: 1.2037664
% ----------------------------------------------------------------------------
% Not adjusted for ties.
% ----------------------------------------------------------------------------
% Anderson-Darling rank statistic: 8.3558723
% Standardized Anderson-Darling rank statistic: 4.4492622
% Probability associated to the Anderson-Darling rank statistic = 0.0022547
% 
% With a given significance = 0.050
% The samples were drawn from different populations: data may be considered
% structurated with respect to the random of fixed effect in question.
% ----------------------------------------------------------------------------
% Adjusted for ties.
% ----------------------------------------------------------------------------
% Anderson-Darling rank statistic: 8.3926093
% Standardized Anderson-Darling rank statistic: 4.4797806
% Probability associated to the Anderson-Darling rank statistic = 0.0021700
% 
% With a given significance = 0.050
% The samples were drawn from different populations: data may be considered
% structurated with respect to the random of fixed effect in question.
% ----------------------------------------------------------------------------
%
% Created by A. Trujillo-Ortiz, R. Hernandez-Walls, K. Barba-Rojo, 
%             L. Cupul-Magana and R.C. Zavala-Garcia
%             Facultad de Ciencias Marinas
%             Universidad Autonoma de Baja California
%             Apdo. Postal 453
%             Ensenada, Baja California
%             Mexico.
%             atrujo@uabc.mx
%
% Copyright. November 1, 2007.
%
% ---Special thanks are given to Michael Singer, Department of Earth & 
%    Planetary Science, Institute for Computational Earth System Science,
%    University of California at Berkely, for encouraging us to create
%    this m-file--
% ---We are grateful to Fritz Scholz and Michael Stephens, the authors of
%    of this test for their valuable suggestions to improve the clarity of
%    the presentation.
%
%  To cite this file, this would be an appropriate format:
%  Trujillo-Ortiz, A., R. Hernandez-Walls, K. Barba-Rojo, L. Cupul-Magana
%    and R.C. Zavala-Garcia. (2007). AnDarksamtest:Anderson-Darling k-sample
%    procedure to test the hypothesis that the populations of the drawned
%    groups are identical. A MATLAB file. [WWW document]. URL http://
%    www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=17451
%
% References:
% Anderson, T.W. and Darling, D.A. (1952), Asymptotic Theory of Certain
%     Goodness of Fit Criteria Based on Stochastic Processes. Annals of
%     Mathematical Statistics, 23:193-212.
% Anderson, T.W. and Darling, D.A. (1954), A Test of Goodness of Fit. 
%     Journal of the American Statistical Association, 49:765-769.
% Darling, D.A. (1957), The Kolmogorov-Smirnov, Cramer-von Mises
%     Tests. Annals of Mathematical Statistics, 28:823-838.
% Kruskal, W.H. and Wallis, W.A. (1952), Use of ranks in one-criterion
%     variance analysis. Journal of the American Statistical Association,
%     47(260):583621.
% Kruskal, W.H. and Wallis, W.A. (1953), Errata to Use of ranks in one-
%     criterion variance analysis. Journal of the American Statistical 
%     Association, 48:907-911.
% MIL-HDBK-17-1E (1997), The Composite Materials Handbook, Volume 1.
%     Polymer Matrix Composites: Guidelines for Characterization of
%     Structural Materials. Chapter 8, Department of Defense, 
%     Washington, D.C.
% Pettitt, A.N. (1976), A Two-Sample Anderson-Darling Rank Statistic.
%     Biometrika, 63:161-168.
% Scholz, F.W. and Stephens, M.A. (1987), K-Sample Anderson-Darling Tests.
%     Journal of the American Statistical Association, 82:918-924.
%

switch nargin
    case{2}
        if isempty(X) == false && isempty(alpha) == false
            if (alpha <= 0 || alpha >= 1)
                fprintf('Warning: Significance level error; must be 0 < alpha < 1 \n');
                return;
            end
        end
    case{1}
        alpha = 0.05;
    otherwise 
        error('Requires at least one input argument.');
end

X1 = X(:,1); %data vector
X2 = X(:,2); %grouping vector
k = max(X2); %number of samples (groups)
N = length(X1); %total number of data

if (k < 2),
    error('There must have at leat two samples.');
end

[c,v] = hist(sort(X1),sort(X1));
nc = find(c~=0);
c = [v(nc) c(nc)'];
hj = c(:,2); %number of values in combined samples equal to zj
hjn = c(1:end-1,2);
O = sort(X1);
zj = unique(O); %distinct values in the combined data set ordered
L = length(zj); 
zjn = zj(1:end-1); %distinct values in the combined data set ordered 
                   %for adjust for ties
Ln = length(zjn);

n=[];fij=0;Fij=0;Fijn=0;NN=0;DD=0;ADK1=0;ADK1n=0;
for i=1:k,
    r = find(X2 == i); 
    ni = length(r); %number of observations in sample i
    n = [n ni];
    if (any(ni==0)),
        error('One or more samples have no observations.');
    end
    rx = X1(r);
    [x1,x2] = ndgrid(rx,zj);
    [x1n,x2n] = ndgrid(rx,zjn);
    fij = sum(x1==x2);
    fijn = sum(x1n==x2n);
    Fij = [cumsum(fij)-(fij*.5)]'; %number of values in the ith group
                                   %which are less than zj plus one
                                   %half the number of values in this
                                   %group which are equal to zj. It is
                                   %adjusted for ties.
    
    Fijn = [cumsum(fijn)]'; %number of values in the ith group
                            %which are less than zj plus one
                            %half the number of values in this
                            %group which are equal to zj. It is
                            %not adjusted for ties.

    Hj = [cumsum(hj)-(hj*.5)]; %number of values in the combined samples less 
                           %than zj plus one half the number of values in
                           %the combined samples equal to zj. It is         
                           %adjusted for ties.

    Hjn = [cumsum(hjn)]; %number of values in the combined samples less 
                           %than zj plus one half the number of values in
                           %the combined samples equal to zj. It is not
                           %adjusted for ties.

    NN = (N*Fij-ni*Hj).^2;
    DD = (Hj.*(N-Hj))-(N*hj./4);
    ADK1 = ADK1 + (1/ni*sum(hj.*(NN./DD)));
    
    NNn = (N*Fijn-ni*Hjn).^2;
    DDn = Hjn.*(N-Hjn);
    ADK1n = ADK1n + (1/ni*sum(hjn.*(NNn./DDn)));
end

%k-sample Anderson-Darling observed statistic
ADK = ((N-1)/N^2)*ADK1; %adjusted for ties

ADKn = 1/N*ADK1n; %not adjusted for ties

%Exact sample variance of the k-sample Anderson-Darling
S = sum(1./n);
i = 1:N-1;
T = sum(1./i);

g=0;
for i = 1:N-2,
    g = g + (1./(N-i))*sum(1./((i+1):(N-1)));
end

a=0;b=0;c=0;d=0;
a = (4*g-6)*(k-1)+(10-6*g)*S;
b = (2*g-4)*k^2+8*T*k+(2*g-14*T-4)*S-8*T+4*g-6;
c = (6*T+2*g-2)*k^2+(4*T-4*g+6)*k+(2*T-6)*S+4*T;
d = (2*T+6)*k^2-4*T*k;

vADK = ((a*N^3)+(b*N^2)+(c*N)+d)/((N-1)*(N-2)*(N-3));

ADKs = (ADK-(k-1))/sqrt(vADK); %standardized k-sample Anderson-Darling 
                               %statistic

ADKsn = (ADKn-(k-1))/sqrt(vADK); %standardized k-sample Anderson-Darling 
                                 %statistic not adjusted for ties

%k-sample Anderson-Darling P-value calculation by an extrapolate-interpolate
%procedure
bo = [0.675,1.281,1.645,1.96,2.326];
b1 = [-0.245,0.25,0.678,1.149,1.822];
b2 = [-0.105,-0.305,-0.362,-0.391,-0.396];
qnt = bo+b1/sqrt(k-1)+b2/(k-1);
c0 = [1.0986123,2.1972246,2.9444390,3.6635616,4.5951199];

%not adjusted for ties
ext=0;
if (ADKs < qnt(1))|(ADKs >qnt(5)),
    ext = 1;
end
x = zeros(4,1);
yx = zeros(4,1);
if (ADKs <= qnt(3)),
    for i = 1:4,
        x(i) = qnt(i);
        yx(i) = c0(i);
    end
else
    for i = 1:4,
        x(i) = qnt(i+1);
        yx(i) = c0(i+1);
    end
end
y = yx;

Y = interp1(x,y,ADKs,'spline');
P = 1/(1+exp(Y));

%adjusted for ties
ext=0;
if (ADKsn < qnt(1))|(ADKsn >qnt(5)),
    ext = 1;
end
x = zeros(4,1);
yx = zeros(4,1);
if (ADKsn <= qnt(3)),
    for i = 1:4,
        x(i) = qnt(i);
        yx(i) = c0(i);
    end
else
    for i = 1:4,
        x(i) = qnt(i+1);
        yx(i) = c0(i+1);
    end
end
y = yx;

Y = interp1(x,y,ADKsn,'spline');
Pn = 1/(1+exp(Y));

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Optional k-sample Anderson-Darling critical value calculation procedure
% zc=norminv(1-alpha);
% 
% if (alpha == 0.25),
%     ADKc = 1 + sqrt(vADK)*(zc+(0.25/sqrt(k-1)) - (-0.105/(k-1)));
% elseif (alpha == 0.10),
%     ADKc = 1 + sqrt(vADK)*(zc+(-0.245/sqrt(k-1)) - (-0.305/(k-1)));
% elseif (alpha == 0.05),
%     ADKc = 1 + sqrt(vADK)*(zc+(0.678/sqrt(k-1)) - (0.362/(k-1)));
% elseif (alpha == 0.025),
%     ADKc = 1 + sqrt(vADK)*(zc+(1.149/sqrt(k-1)) - (-0.391/(k-1)));
% else (alpha == 0.01),
%     ADKc = 1 + sqrt(vADK)*(zc+(1.822/sqrt(k-1)) - (-0.396/(k-1)));
% end
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

disp(' ')
disp('K-sample Anderson-Darling Test')
disp('----------------------------------------------------------------------------')
fprintf('Number of samples: %i\n', k);
fprintf(['Sample sizes: ', repmat('%d ', 1, length(n)-1), '%d\n'], n);
fprintf('Total number of observations: %i\n', N);
fprintf('Number of ties (identical values): %i\n', N-L);
fprintf('Mean of the Anderson-Darling rank statistic: %i\n', k-1);
fprintf('Standard deviation of the Anderson-Darling rank statistic: %3.7f\n', sqrt(vADK));
disp('----------------------------------------------------------------------------')
disp('Not adjusted for ties.')
disp('----------------------------------------------------------------------------')
fprintf('Anderson-Darling rank statistic: %3.7f\n', ADKn);
fprintf('Standardized Anderson-Darling rank statistic: %3.7f\n', ADKsn);
fprintf('Probability associated to the Anderson-Darling rank statistic = %3.7f\n', Pn);
disp(' ')
fprintf('With a given significance = %3.3f\n', alpha);

if Pn >= alpha;
    disp('The populations from which the k-samples of data were drawn are identical:');
    disp('natural groupings have no significant effect (unstructurated).');
else
    disp('The samples were drawn from different populations: data may be considered'); 
    disp('structurated with respect to the random of fixed effect in question.');
end
disp('----------------------------------------------------------------------------')
disp('Adjusted for ties.')
disp('----------------------------------------------------------------------------')
fprintf('Anderson-Darling rank statistic: %3.7f\n', ADK);
fprintf('Standardized Anderson-Darling rank statistic: %3.7f\n', ADKs);
fprintf('Probability associated to the Anderson-Darling rank statistic = %3.7f\n', P);
disp(' ')
fprintf('With a given significance = %3.3f\n', alpha);
if P >= alpha;
    disp('The populations from which the k-samples of data were drawn are identical:');
    disp('natural groupings have no significant effect (unstructurated).');
else
    disp('The samples were drawn from different populations: data may be considered'); 
    disp('structurated with respect to the random of fixed effect in question.');
end
disp('----------------------------------------------------------------------------')

return,

Contact us at files@mathworks.com