Anderson-Darling goodness-of-fit test?

Ines on 16 May 2012
Commented: William Rose on 16 Jun 2022 at 2:26
I want to assess the GOF of a given dataset to a variety of different distributions (e.g. exponential, lognormal, normal, etc.) using the Anderson-Darling test statistic. Unfortunately, this doesn't seem to have been implemented in Matlab, so I was wondering whether some of you might have a function that does exactly this?
William Rose on 16 Jun 2022 at 2:26
As I think you know, the AD test tests the null hypothesis that a set of numbers is drawn from a specified distribution, where specified means that all parameters are specified. Matlab's adtest will automatically use the maximum likelihood estimate for the parameters, if you pass one of the standard distribution names without parameters. If you use a nonstandard distribution, such as Nakagami, you have to create a probability distribution object with specified parameters, and then pass the name of the PDO to adtest. Here's how it works:
x=rand(1,100); %generate 100 uniformly distributed random numbers on [0,1)
dist=makedist('Nakagami','mu',1,'omega',1); %create a Nakagami distribution with mu=1, omega=1
[h,p]=adtest(x,'Distribution',dist) %test x against the specified distribution
h = logical
   1
p = 6.0000e-06
h=1 means that the null hypothesis, which is that the values in x come from distribution dist, is rejected. The p value is very small. This is what we expect, since the values in x were not from the specified Nakagami distribution.
Now let's make randoms that ARE from the specified distribution, and run the AD test.
dist2=makedist('Gamma','a',2,'b',3); %gamma distribution with a=shape=2, b=scale=3
y=random(dist2,[1,50]); %randoms with specified distribution
[h,p]=adtest(y,'Distribution',dist2) %test y against the distribution it came from
h = logical
   0
p = 0.3271
h=0 means the null hypothesis, that the values in y come from the specified distribution, is accepted. The p value is high, as expected.
In your case, you probably want to estimate the parameters, because you don't know them for your experimental wind data. You can do this as follows. We will create some data with a known distribution and then pretend we don't know the parameters. Then we will fit the distribution to the data, and use the distribution with the fitted parameters for the AD test.
dist3=makedist('Beta','a',1.5,'b',0.5); %define a distribution
x=random(dist3,[100,1]); %create randoms with specified distribution
dist4=fitdist(x,'Beta'); %fit a beta distribution to the randoms
[h,p]=adtest(x,'Distribution',dist4) %test x against the fitted beta distribution
h = logical
   0
p = 0.9046
dist5=fitdist(x,'Nakagami'); %fit a Nakagami distribution to the randoms
[h,p]=adtest(x,'Distribution',dist5) %test x against the fitted Nakagami distribution
h = logical
   1
p = 5.5220e-04
When we fit a beta distribution to the beta-distributed random numbers, the null hypothesis is accepted, and the p value is high, as expected.
When we fit a Nakagami distribution to the same random numbers, the null hypothesis (which is that the data is from the fitted Nakagami distribution) is rejected, and the p value is low, as expected.

Mike Croucher on 11 Apr 2022
The statistics toolbox has an adtest function. It can test against a specific distribution with known parameters, or a more general test against any of the following distributions with unknown parameters: 'norm', 'exp', 'ev', 'logn', 'weibull'
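As a minimal sketch of that composite test (the data here is illustrative; the distribution name strings are the ones adtest accepts):

```matlab
x = exprnd(2, [100,1]);                    %illustrative exponential data, mean 2
[h,p] = adtest(x, 'Distribution', 'exp')   %composite test, parameters estimated from x
%h is usually 0 here (null not rejected), since x really is exponential
```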
William Rose on 13 Apr 2022
@Mike Croucher, thank you for explaining that, and thank you for the answer.

William Rose on 10 Apr 2022
Calculating the AD statistic is straightforward, and the attached function does it. Figuring out the p-value associated with a given AD statistic is not straightforward. The attached function does it, but beware. This 2018 paper gives a formula for the p-value (Table 3). They cite a textbook from 1986, which I don't find online. They say this formula is supposed to work no matter what distribution is being tested. However, I have tested their Table 3 formula with Monte Carlo simulation*. The tests show that the Table 3 formula gives very wrong answers if the data is uniform and the tested distribution is also uniform. The formula also gives very wrong answers if the data is exponential and the tested distribution is also exponential. The formula gives reasonable answers if the data is normal and the tested distribution is normal.
By "very wrong answers", I mean that if you generate 10,000 sets of 100 uniformly distributed random numbers, and test versus a uniform distribution, about 1% of the p-values should be <.01, about 5% should be <.05, and about 10% should be <.10. But the actual number of p-values that are less, at each level, is much greater than expected. In other words, there are too many low p-values. Thus, when testing data that is actually uniformly (or exponentially) distributed, one would reject the hypothesis "this data came from a uniform (or exponential) distribution" far more often than one should. The p-values for normal data tested versus a normal distribution appear to be approximately correct, at least at p=0.01, 0.05, and 0.10, with data sets of 20, 50, and 100 points per set.
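The Monte Carlo check described above can be sketched as follows. This assumes the attached AndersonDarling.m returns the AD statistic and a p-value for a data vector and a distribution name; check the function's actual signature before running.

```matlab
%Monte Carlo calibration check: if the p-values are correct, about 1%, 5%,
%and 10% of them should fall below 0.01, 0.05, and 0.10 respectively.
M = 10000; N = 100;                 %number of data sets, points per set
p = zeros(M,1);
for i = 1:M
    x = rand(N,1);                  %data that really is uniform on [0,1)
    [~, p(i)] = AndersonDarling(x, 'uniform');   %assumed signature
end
fprintf('fraction p<.01: %.3f  p<.05: %.3f  p<.10: %.3f\n', ...
    mean(p<.01), mean(p<.05), mean(p<.10))
```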
The function that computes the A-D statistic and the associated p-value is attached: AndersonDarling.m. Note the warnings above. I am also attaching four scripts which test the function. See the comments in the function and in each script.
The 2018 paper linked above has references to other papers and books which may have formulas that give better results for non-normal distributions.
*I did the Monte Carlo testing with 10,000 data sets of normally distributed random numbers, with 20 data points in each data set. Then I did it again with 50 and with 100 points in each data set. Then I did it all again with uniformly and exponentially distributed random numbers, for 90,000 data sets in all.
William Rose on 14 Apr 2022
@Debbie Green, you're welcome.
Despite my bias in favor of my own scripts and functions :), I think Matlab's adtest() is superior. It is fast. It computes a critical value for the AD test, for a generalized Pareto distribution with known parameters, analytically. (I don't know how, but it does.) Here is an example of how you can use it and evaluate it. You can obviously modify this to address your interest: "observing the change in the Anderson-Darling statistic as I change the theta parameter in the GPD fit":
M=10000; N=20; %M=number of data sets, N=number of values in each set
x=random('Generalized Pareto',2,1,0,[M,N]); %generate randoms with specified distribution
pd=makedist('Generalized Pareto',2,1,0); %create a probability distribution object
h=zeros(M,1); p=h; adstat=h; cv=h; %allocate arrays
%next loop runs the adtest M times on the M data sets
for i=1:M
    [h(i),p(i),adstat(i),cv(i)]=adtest(x(i,:),'Distribution',pd);
end
histogram(p)
disp([sum(h), sum(p<.05), sum(adstat>cv(1))])
469 469 469
Inspection of vector cv (critical value) shows that all the values are identical. This confirms that adtest() computes the critical value analytically, rather than by Monte Carlo. If adtest() were computing the CV by Monte Carlo, the CV would be a little different every time.
The histogram of p values appears flat, as we expect it to be. That is reassuring.
The histogram of adstat does not look crazy.
The display of the sums of h, of p<.05, and of adstat>cv(1) shows that adtest() is working as expected, and that h is true if the null hypothesis is rejected. Since the null hypothesis is true in this case, and since alpha=.05 by default, we expect 500 rejections (since there were M=10000 tests). The standard deviation of the number of rejections (using the binomial approximation) is sqrt(M*alpha*(1-alpha)) = sqrt(10000*.05*.95) ≈ 21.8, so the observed value of 469 rejections is less than 1.5 sigma away from the expected value. Therefore the observed value is consistent with expectations.
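The arithmetic behind that last check, spelled out:

```matlab
%expected rejections and binomial standard deviation for M tests at alpha=0.05
M = 10000; alpha = 0.05;            %number of tests, significance level
expected = M*alpha                  %expected rejections = 500
sigma = sqrt(M*alpha*(1-alpha))     %binomial std. dev., about 21.8
z = (expected - 469)/sigma          %about 1.42 standard deviations
```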