# Matlab PDF testing for integers only

10 views (last 30 days)

Show older comments

I am currently trying to do a statistical analysis on a dataset which I have. The data ("Speed") contains approximately 8000 data points, integers only. I also created a duplicate of the data ("SpeedWeibull") without any zero entries for a weibull distribution. I then implement the following code to get a number of different distributions:

%Fit Weibull Rayleigh and Generalized extreme value Distribution

pdWeibull = fitdist(SpeedWeibull,'Weibull');

pdRayleigh = fitdist(Speed,'Rayleigh');

pdGEV= fitdist(Speed,'GeneralizedExtremeValue');

%Estimate distribution parameters

phatWeibull = mle(SpeedWeibull,'distribution','Weibull');

phatGEV = mle(Speed,'distribution','GeneralizedExtremeValue');

phatRayleigh = mle(Speed,'distribution','Rayleigh');

This is the visually best fitting distribution:

However, when I then go ahead and try to test the goodness of fit, I get terrible results despite the distributions fitting reasonably well! I noticed that when I generate random data according to one of my distributions and then run the tests, I get very good agreement. This seems to be due to the fact that the actual data only contains integers. Even when I change my chi test so it now has 23 bins for example, I still get poor agreement. Could somebody help with this?

%chi squared test

[chiWeibull,p1] = chi2gof(SpeedWeibull,'CDF',pdWeibull);

[chiRayleigh, p2] = chi2gof(Speed,'CDF',pdRayleigh,'NBins',23);

[chiGEV, p3] = chi2gof(Speed,'CDF',pdGEV);

%anderson test

[adWeibull,ad1] = adtest(SpeedWeibull,'Distribution',pdWeibull);

[adRayleigh, ad2] = adtest(Speed,'Distribution',pdRayleigh);

[adGEV, ad3] = adtest(Speed,'Distribution',pdGEV);

%Kolmogorov-Smirnov test

kolmWeibull = kstest(SpeedWeibull,'CDF',pdWeibull);

kolmRayleigh = kstest(Speed,'CDF',pdRayleigh);

kolmGEV = kstest(Speed,'CDF',pdGEV);

##### 4 Comments

Joseph Hall
on 1 Mar 2018

### Answers (2)

Joseph Hall
on 1 Mar 2018

Here is an example and solution for chi2go():

mu = 200;

sigma = 10;

rows = 5000;

columns = 1;

x_float = normrnd(mu,sigma,rows,columns);

pd_float = fitdist(x_float,'Normal');

[h,p] = chi2gof(x_float,'CDF',pd_float);

disp(pd_float); disp([{'reject null?','p-val'};{h,p}]);

NormalDistribution

Normal distribution

mu = 199.973 [199.697, 200.249]

sigma = 9.95756 [9.76616, 10.1567]

'reject null?' 'p-val'

[ 0] [0.5399]

figure; histogram(x_float,'Normalization','pdf')

hold on; plot([min(x_float):1:max(x_float)],pdf(pd_float,[min(x_float):1:max(x_float)]))

legend('histogram pdf','fitdist pdf')

title('Floating Point Distribution Samples and Fit - mu=200, sigma=10')

% here you can see that after rounding the variable to integer values, chi2go() rejects the null,

% indicating that the distribution is not normal, even though it clearly is normal

x_integer = round(x_float);

pd_integer = fitdist(x_integer,'Normal');

[h,p] = chi2gof(x_integer,'CDF',pd_integer);

disp(pd_integer); disp([{'reject null?','p-val'};{h,p}]);

NormalDistribution

Normal distribution

mu = 199.967 [199.691, 200.243]

sigma = 9.95697 [9.76558, 10.1561]

'reject null?' 'p-val'

[ 1] [1.7958e-04]

figure; histogram(x_integer,'Normalization','pdf')

hold on; plot([min(x_integer):1:max(x_integer)],pdf(pd_integer,[min(x_integer):1:max(x_integer)]))

legend('histogram pdf','fitdist pdf')

title('Rounded (Integer) Distribution Samples and Fit - mu=200, sigma=10')

% here is how to make chi2go() not falsely reject the null hypothesis

[h, p]= chi2gof(x_integer,'CDF',pd_integer,'ctrs',[min(x_integer):1:max(x_integer)])

disp(pd_integer); disp([{'reject null?','p-val'};{h,p}]);

NormalDistribution

Normal distribution

mu = 199.967 [199.691, 200.243]

sigma = 9.95697 [9.76558, 10.1561]

'reject null?' 'p-val'

[ 0] [0.9394]

% You can see that the p-value above is very different from the p-val when using x_float and

% pd_float. This is due to the sensitivity of the chi-square-goodness-of-fit method to binning

% If you choose the same bins, then both pd_float and pd_integer give similar p-values:

[h,p] = chi2gof(x_float,'CDF',pd_float,'ctrs',[min(x_integer):1:max(x_integer)])

disp(pd_float); disp([{'reject null?','p-val'};{h,p}]);

NormalDistribution

Normal distribution

mu = 199.973 [199.697, 200.249]

sigma = 9.95756 [9.76616, 10.1567]

'reject null?' 'p-val'

[ 0] [0.9394]

##### 0 Comments

Jeff Miller
on 1 Mar 2018

Thanks for your clarification.

Well, if adtest, etc, won't let you specify the points at which to examine the distributions (analogous to setting the bins), then the only option that I can see is to "unround" or "smear" the integers.

Suppose you have k data points with the integer value N. Replace them with k equally spaced values representing the range from N-0.5 to N+0.5. I think the replacement values are N-0.5+1/(2k)+(i-1)/k, for i=1:k (but check that).

This is a reasonable approximation iff the distribution's PDF is approximately flat in the range around each integer. You could skew the replacement values slightly to reflect the predicted distribution's nonflat PDF around each integer, but that would bias the tests to accept the distribution (kind of a best-case scenario for that predicted distribution).

##### 0 Comments

### See Also

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!