Goodness of fit for a Zipf distribution
Show older comments
I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.
From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');
plot(f,'k') % plot model curve
However, this returns
rho = 28.05 (-Inf, Inf)
and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.
My questions:
1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?
2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.
3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?
Many thanks for any help!
Accepted Answer
More Answers (1)
Walter Roberson
on 30 Mar 2023
0 votes
The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.
4 Comments
z8080
on 30 Mar 2023
The discussion in Wolfram Alpha says that rho must be positive, but your call to fit() does not restrict to positives.
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)', 'Lower', .1, 'upper', 100);
f
The confidence interval is estimated from the standard deviation, so it can be quite wrong.
But let's try estimate the residue:
syms rho
residue = sum( (x.^(-(rho + 1)) ./ zeta(rho + 1) - y).^2 );
figure
fplot(residue, [10 100])
limit(residue, rho, inf)
double(ans)
So you have an asymptope and the result you are getting back from fit is basically just the place that fit gives up because the numeric difference is within tolerance of 0.
Walter Roberson
on 31 Mar 2023
One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.
Categories
Find more on Noncentral t Distribution in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!




