Matlab PDF testing for integers only

Question

ThB on 4 Mar 2017

1
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/328083-matlab-pdf-testing-for-integers-only

Answered: Jeff Miller on 1 Mar 2018

I am currently trying to do a statistical analysis on a dataset which I have. The data ("Speed") contains approximately 8000 data points, integers only. I also created a duplicate of the data ("SpeedWeibull") without any zero entries for a weibull distribution. I then implement the following code to get a number of different distributions:

%Fit Weibull Rayleigh and Generalized extreme value Distribution
pdWeibull = fitdist(SpeedWeibull,'Weibull');
pdRayleigh = fitdist(Speed,'Rayleigh');
pdGEV= fitdist(Speed,'GeneralizedExtremeValue');
%Estimate distribution parameters
phatWeibull = mle(SpeedWeibull,'distribution','Weibull');
phatGEV = mle(Speed,'distribution','GeneralizedExtremeValue');
phatRayleigh = mle(Speed,'distribution','Rayleigh');

This is the visually best fitting distribution:

However, when I then go ahead and try to test the goodness of fit, I get terrible results despite the distributions fitting reasonably well! I noticed that when I generate random data according to one of my distributions and then run the tests, I get very good agreement. This seems to be due to the fact that the actual data only contains integers. Even when I change my chi test so it now has 23 bins for example, I still get poor agreement. Could somebody help with this?

%chi squared test
[chiWeibull,p1] = chi2gof(SpeedWeibull,'CDF',pdWeibull);
[chiRayleigh, p2] = chi2gof(Speed,'CDF',pdRayleigh,'NBins',23);
[chiGEV, p3] = chi2gof(Speed,'CDF',pdGEV);
%anderson test
[adWeibull,ad1] = adtest(SpeedWeibull,'Distribution',pdWeibull);
[adRayleigh, ad2] = adtest(Speed,'Distribution',pdRayleigh);
[adGEV, ad3] = adtest(Speed,'Distribution',pdGEV);
%Kolmogorov-Smirnov test
kolmWeibull = kstest(SpeedWeibull,'CDF',pdWeibull);
kolmRayleigh = kstest(Speed,'CDF',pdRayleigh);
kolmGEV = kstest(Speed,'CDF',pdGEV);

4 Comments
Show 2 older commentsHide 2 older comments

Jeff Miller on 28 Feb 2018

Why isn't it a satisfactory work around to choose the option "Bins centered on integers"?

Joseph Hall on 1 Mar 2018

That was only an example as an option for the Distribution Fitter App. But it is not an option for the other functions and the Distribution Fitter App does not return a probability distribution object like distfit() does, so it does not solve the issue. I have found that chi2gof() has an option, 'ctrs', by which you can input a vector of the centers of each bin. This can produce an acceptable solution by choosing integer values to be the centers. This is not possible using other functions, adtest(), kstest(), lillietest(), etc. Chi-square goodness-of-fit has it's own drawbacks and so it would be nice to be able to us these others with discrete integer data as well.

Sign in to comment.

Sign in to answer this question.

Answer 1

Joseph Hall on 1 Mar 2018

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/328083-matlab-pdf-testing-for-integers-only#answer_307855

Open in MATLAB Online

Here is an example and solution for chi2go():

mu = 200;
sigma = 10;
rows = 5000;
columns = 1;
x_float = normrnd(mu,sigma,rows,columns);
pd_float = fitdist(x_float,'Normal');
[h,p] = chi2gof(x_float,'CDF',pd_float);
disp(pd_float); disp([{'reject null?','p-val'};{h,p}]);
  NormalDistribution
    Normal distribution
         mu = 199.973   [199.697, 200.249]
      sigma = 9.95756   [9.76616, 10.1567]
      'reject null?'    'p-val' 
      [           0]    [0.5399]
figure; histogram(x_float,'Normalization','pdf')
hold on; plot([min(x_float):1:max(x_float)],pdf(pd_float,[min(x_float):1:max(x_float)]))
legend('histogram pdf','fitdist pdf')
title('Floating Point Distribution Samples and Fit - mu=200, sigma=10')
% here you can see that after rounding the variable to integer values, chi2go() rejects the null, 
% indicating that the distribution is not normal, even though it clearly is normal
x_integer = round(x_float);
pd_integer = fitdist(x_integer,'Normal');
[h,p] = chi2gof(x_integer,'CDF',pd_integer);
disp(pd_integer); disp([{'reject null?','p-val'};{h,p}]);
NormalDistribution
Normal distribution
     mu = 199.967   [199.691, 200.243]
  sigma = 9.95697   [9.76558, 10.1561]
    'reject null?'    'p-val'     
    [           1]    [1.7958e-04]
figure; histogram(x_integer,'Normalization','pdf')
hold on; plot([min(x_integer):1:max(x_integer)],pdf(pd_integer,[min(x_integer):1:max(x_integer)]))
legend('histogram pdf','fitdist pdf')
title('Rounded (Integer) Distribution Samples and Fit - mu=200, sigma=10')
   % here is how to make chi2go() not falsely reject the null hypothesis
    [h, p]= chi2gof(x_integer,'CDF',pd_integer,'ctrs',[min(x_integer):1:max(x_integer)])
    disp(pd_integer); disp([{'reject null?','p-val'};{h,p}]);
      NormalDistribution
      Normal distribution
           mu = 199.967   [199.691, 200.243]
        sigma = 9.95697   [9.76558, 10.1561]
        'reject null?'    'p-val' 
        [           0]    [0.9394]
    % You can see that the p-value above is very different from the p-val when using x_float and
    % pd_float. This is due to the sensitivity of the chi-square-goodness-of-fit method to binning
    % If you choose the same bins, then both pd_float and pd_integer give similar p-values:
    [h,p] = chi2gof(x_float,'CDF',pd_float,'ctrs',[min(x_integer):1:max(x_integer)])
    disp(pd_float); disp([{'reject null?','p-val'};{h,p}]);
      NormalDistribution
      Normal distribution
           mu = 199.973   [199.697, 200.249]
        sigma = 9.95756   [9.76616, 10.1567]
        'reject null?'    'p-val' 
        [           0]    [0.9394]

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Answer 2

Jeff Miller on 1 Mar 2018

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/328083-matlab-pdf-testing-for-integers-only#answer_307879

Thanks for your clarification.

Well, if adtest, etc, won't let you specify the points at which to examine the distributions (analogous to setting the bins), then the only option that I can see is to "unround" or "smear" the integers.

Suppose you have k data points with the integer value N. Replace them with k equally spaced values representing the range from N-0.5 to N+0.5. I think the replacement values are N-0.5+1/(2k)+(i-1)/k, for i=1:k (but check that).

This is a reasonable approximation iff the distribution's PDF is approximately flat in the range around each integer. You could skew the replacement values slightly to reflect the predicted distribution's nonflat PDF around each integer, but that would bias the tests to accept the distribution (kind of a best-case scenario for that predicted distribution).

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Matlab PDF testing for integers only

4 Comments
Show 2 older commentsHide 2 older comments

Answers (2)

0 Comments
Show -2 older commentsHide -2 older comments

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Community Treasure Hunt

Matlab PDF testing for integers only

4 Comments Show 2 older commentsHide 2 older comments

Answers (2)

0 Comments Show -2 older commentsHide -2 older comments

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Community Treasure Hunt

4 Comments
Show 2 older commentsHide 2 older comments

0 Comments
Show -2 older commentsHide -2 older comments

0 Comments
Show -2 older commentsHide -2 older comments