Matlab PDF testing for integers only

10 views (last 30 days)
ThB on 4 Mar 2017
Answered: Jeff Miller on 1 Mar 2018
I am currently trying to do a statistical analysis on a dataset which I have. The data ("Speed") contains approximately 8000 data points, integers only. I also created a duplicate of the data ("SpeedWeibull") without any zero entries for a weibull distribution. I then implement the following code to get a number of different distributions:
%Fit Weibull Rayleigh and Generalized extreme value Distribution
pdWeibull = fitdist(SpeedWeibull,'Weibull');
pdRayleigh = fitdist(Speed,'Rayleigh');
pdGEV= fitdist(Speed,'GeneralizedExtremeValue');
%Estimate distribution parameters
phatWeibull = mle(SpeedWeibull,'distribution','Weibull');
phatGEV = mle(Speed,'distribution','GeneralizedExtremeValue');
phatRayleigh = mle(Speed,'distribution','Rayleigh');
This is the visually best fitting distribution: However, when I then go ahead and try to test the goodness of fit, I get terrible results despite the distributions fitting reasonably well! I noticed that when I generate random data according to one of my distributions and then run the tests, I get very good agreement. This seems to be due to the fact that the actual data only contains integers. Even when I change my chi test so it now has 23 bins for example, I still get poor agreement. Could somebody help with this?
%chi squared test
[chiWeibull,p1] = chi2gof(SpeedWeibull,'CDF',pdWeibull);
[chiRayleigh, p2] = chi2gof(Speed,'CDF',pdRayleigh,'NBins',23);
[chiGEV, p3] = chi2gof(Speed,'CDF',pdGEV);
%anderson test
%Kolmogorov-Smirnov test
kolmWeibull = kstest(SpeedWeibull,'CDF',pdWeibull);
kolmRayleigh = kstest(Speed,'CDF',pdRayleigh);
kolmGEV = kstest(Speed,'CDF',pdGEV);
Joseph Hall on 1 Mar 2018
That was only an example as an option for the Distribution Fitter App. But it is not an option for the other functions and the Distribution Fitter App does not return a probability distribution object like distfit() does, so it does not solve the issue. I have found that chi2gof() has an option, 'ctrs', by which you can input a vector of the centers of each bin. This can produce an acceptable solution by choosing integer values to be the centers. This is not possible using other functions, adtest(), kstest(), lillietest(), etc. Chi-square goodness-of-fit has it's own drawbacks and so it would be nice to be able to us these others with discrete integer data as well.

Joseph Hall on 1 Mar 2018
Here is an example and solution for chi2go():
mu = 200;
sigma = 10;
rows = 5000;
columns = 1;
x_float = normrnd(mu,sigma,rows,columns);
pd_float = fitdist(x_float,'Normal');
[h,p] = chi2gof(x_float,'CDF',pd_float);
disp(pd_float); disp([{'reject null?','p-val'};{h,p}]);
NormalDistribution
Normal distribution
mu = 199.973 [199.697, 200.249]
sigma = 9.95756 [9.76616, 10.1567]
'reject null?' 'p-val'
[ 0] [0.5399]
figure; histogram(x_float,'Normalization','pdf')
hold on; plot([min(x_float):1:max(x_float)],pdf(pd_float,[min(x_float):1:max(x_float)]))
legend('histogram pdf','fitdist pdf')
title('Floating Point Distribution Samples and Fit - mu=200, sigma=10')
% here you can see that after rounding the variable to integer values, chi2go() rejects the null,
% indicating that the distribution is not normal, even though it clearly is normal
x_integer = round(x_float);
pd_integer = fitdist(x_integer,'Normal');
[h,p] = chi2gof(x_integer,'CDF',pd_integer);
disp(pd_integer); disp([{'reject null?','p-val'};{h,p}]);
NormalDistribution
Normal distribution
mu = 199.967 [199.691, 200.243]
sigma = 9.95697 [9.76558, 10.1561]
'reject null?' 'p-val'
[ 1] [1.7958e-04]
figure; histogram(x_integer,'Normalization','pdf')
hold on; plot([min(x_integer):1:max(x_integer)],pdf(pd_integer,[min(x_integer):1:max(x_integer)]))
legend('histogram pdf','fitdist pdf')
title('Rounded (Integer) Distribution Samples and Fit - mu=200, sigma=10')
% here is how to make chi2go() not falsely reject the null hypothesis
[h, p]= chi2gof(x_integer,'CDF',pd_integer,'ctrs',[min(x_integer):1:max(x_integer)])
disp(pd_integer); disp([{'reject null?','p-val'};{h,p}]);
NormalDistribution
Normal distribution
mu = 199.967 [199.691, 200.243]
sigma = 9.95697 [9.76558, 10.1561]
'reject null?' 'p-val'
[ 0] [0.9394]
% You can see that the p-value above is very different from the p-val when using x_float and
% pd_float. This is due to the sensitivity of the chi-square-goodness-of-fit method to binning
% If you choose the same bins, then both pd_float and pd_integer give similar p-values:
[h,p] = chi2gof(x_float,'CDF',pd_float,'ctrs',[min(x_integer):1:max(x_integer)])
disp(pd_float); disp([{'reject null?','p-val'};{h,p}]);
NormalDistribution
Normal distribution
mu = 199.973 [199.697, 200.249]
sigma = 9.95756 [9.76616, 10.1567]
'reject null?' 'p-val'
[ 0] [0.9394]

Jeff Miller on 1 Mar 2018
Well, if adtest, etc, won't let you specify the points at which to examine the distributions (analogous to setting the bins), then the only option that I can see is to "unround" or "smear" the integers.
Suppose you have k data points with the integer value N. Replace them with k equally spaced values representing the range from N-0.5 to N+0.5. I think the replacement values are N-0.5+1/(2k)+(i-1)/k, for i=1:k (but check that).
This is a reasonable approximation iff the distribution's PDF is approximately flat in the range around each integer. You could skew the replacement values slightly to reflect the predicted distribution's nonflat PDF around each integer, but that would bias the tests to accept the distribution (kind of a best-case scenario for that predicted distribution).