Goodness of fit for a Zipf distribution
20 views (last 30 days)
Show older comments
I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.
From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');
plot(f,'k') % plot model curve
However, this returns
rho = 28.05 (-Inf, Inf)
and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.
My questions:
1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?
2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.
3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?
Many thanks for any help!
2 Comments
Torsten
on 30 Mar 2023
I don't understand why you use curve fitting when you have to do distribution fitting.
Read about the difference and the adequate functions:
Accepted Answer
Torsten
on 31 Mar 2023
Edited: Torsten
on 2 Apr 2023
I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):
If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);
by
fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));
x = [1:6]';
y = [80 40 20 10 5 2]';
X = [];
for i=1:numel(x)
X = [X,x(i)*ones(1,y(i))];
end
hold on
histogram(X,'Normalization','pdf')
%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
s = 1.5:0.001:3;
fs = fun(s);
[~,index] = max(fs);
smax = s(index)
zipf = 1./x.^smax / zeta(smax);
%zipf = 1./x.^smax / sum(1./x.^(smax));
plot(x,zipf)
hold off
grid on
11 Comments
Torsten
on 4 Apr 2023
Edited: Torsten
on 4 Apr 2023
Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.
To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).
After this, zipf must be defined as
zipf = 1./x.^alpha / sum(1./x.^alpha);
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
epdf = freq/sum(freq);
% Define the Zipf distribution
alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with
zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure(1);
bar(x, epdf); % or freq\N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Probability');
legend('Observed', 'Zipf');
hold off
fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));
alpha = 0.5:0.001:3;
falpha = arrayfun(@(alpha)fun(alpha),alpha);
figure(2)
plot(alpha,falpha)
[~,index] = max(falpha);
alphamax = alpha(index)
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* sum(freq);
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
fprintf('p-value = %.4f\n', p_value);
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
More Answers (1)
Walter Roberson
on 30 Mar 2023
The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.
4 Comments
Walter Roberson
on 31 Mar 2023
One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.
See Also
Categories
Find more on Hypothesis Tests in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!