Goodness of fit for a Zipf distribution

I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.
From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');
plot(f,'k') % plot model curve
However, this returns
rho = 28.05 (-Inf, Inf)
and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.
My questions:
1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?
2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.
3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?
Many thanks for any help!

2 Comments

I don't understand why you use curve fitting when you have to do distribution fitting.
Read about the difference and the adequate functions:
Thanks for pointing this out: indeed, it seems I was oblivious to the difference between the two, and it's really distribution fitting that I need, since my x axis is merely rank.
The documentation you pointed to suggests to me it's fitdist that I should use, however that doesn't have the distribution I need (Zipf) among its possible distname values. Any more advice?


 Accepted Answer

I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the parameter s of the zeta distribution, which has p_s(k) = (1/k^s) / zeta(s) for k = 1, 2, 3, ... I don't know whether this is the discrete probability distribution we are talking about, or whether the distribution you have in mind has finite support and is the Zipf distribution, i.e. p_s(k) = (1/k^s) / sum_{i=1}^{N} 1/i^s for k = 1, 2, ..., N:
If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);
by
fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));
x = [1:6]';
y = [80 40 20 10 5 2]';
X = [];
for i=1:numel(x)
X = [X,x(i)*ones(1,y(i))];
end
hold on
histogram(X,'Normalization','pdf')
%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
s = 1.5:0.001:3;
fs = fun(s);
[~,index] = max(fs);
smax = s(index)
smax = 2.1160
zipf = 1./x.^smax / zeta(smax);
%zipf = 1./x.^smax / sum(1./x.^(smax));
plot(x,zipf)
hold off
grid on

11 Comments

z8080 on 2 Apr 2023 (edited)
Thanks a lot for this helpful answer. A few points that I don't get, if I may, as a follow-up:
1) My data points are discrete, but I'm still not sure which of the two expressions you suggested (Zipf or zeta) I'd need to choose. Using one or the other when running the code, their respective plots look very similar to me.
2) As I change values in my example x vector, smax stays the same regardless of how good a fit the Zipf curve is. Which variable/parameter tells me how well the Zipf distribution fits the values in x?
3) Can fitdist not help here?
Torsten on 2 Apr 2023 (edited)
1) My data points are discrete, but I'm still not sure which of the two expressions you suggested (Zipf or zeta) I'd need to choose. Using one or the other when running the code, their respective plots look very similar to me.
If all of the positive integers can be taken as values for the rank, choose zeta, if only ranks up to a maximum value can be taken, use Zipf with N being this maximal rank.
2) As I change values in my example x vector, smax stays the same regardless of how good a fit the Zipf curve is. Which variable/parameter tells me how well the Zipf distribution fits the values in x?
If you change x, you also have to change the maximum likelihood function I gave you.
There is no direct parameter that can tell you how well the fitting was done. You might want to make a graph of the empirical and the theoretical cumulative distribution functions to get at least a visual impression.
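A minimal sketch of that visual check (my own illustration, not from the thread, using the example data and the smax = 2.116 estimated above; the finite-support Zipf normalisation is assumed):

```matlab
% Compare the empirical CDF of the rank data with the theoretical Zipf CDF
x = (1:6)';                          % ranks
y = [80 40 20 10 5 2]';              % observed frequencies
s = 2.116;                           % fitted parameter from the answer above
ecdf_vals = cumsum(y) / sum(y);      % empirical CDF at each rank
pmf = 1 ./ x.^s / sum(1 ./ x.^s);    % finite-support Zipf PMF on ranks 1..6
tcdf_vals = cumsum(pmf);             % theoretical CDF
stairs(x, ecdf_vals, 'b'), hold on
stairs(x, tcdf_vals, 'r--'), hold off
xlabel('Rank'), ylabel('Cumulative probability')
legend('Empirical CDF', 'Zipf CDF', 'Location', 'southeast')
```

If the fit is good, the two step functions should track each other closely; large vertical gaps point to ranks where the model misses.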
3) Can fitdist not help here?
How should fitdist help if the probability distribution you want cannot be selected? Maybe your distribution is a special case of one of the distributions that can be chosen - I did not look into it more deeply. But fitdist does the same as I did: define the maximum likelihood function with your data and determine the parameter(s) for which it is maximized. As far as I know, fitdist also does not return a parameter on how well the fitting was done. Maybe confidence intervals for the parameters - I'm not sure.
1. and 3.: Got it.
2.: Ah yes, apologies, I hadn't realised the x values were hard-coded into the maximum likelihood function. My only problem is, I need to do this programmatically, for thousands of vectors, and I'm not sure how to automate it if this isn't a simple matter of providing x as an input and getting an output. Also, there's a 157 in your code which doesn't match the example values I had provided, but I assume that was a mistake and should have been 80, as in my code.
Torsten on 2 Apr 2023 (edited)
Also, there's a 157 in your code which doesn't match the example values I had provided, but I assume that was a mistake and should have been 80, as in my code.
No, this is not an error; it's the sum of the y-values. Did you look up how to define the maximum likelihood function for a discrete distribution with continuous parameters to be fitted?
Sorry, maybe I'm just stupid, but from the documentation page of mle, if that's what you yourself used to come up with that expression for fun, it wasn't clear to me how I can obtain that expression programmatically, given an x input. Also, the @ operator in Matlab scares me :-(
Sorry, maybe I'm just stupid, but from the documentation page of mle, if that's what you yourself used to come up with that expression for fun, it wasn't clear to me how I can obtain that expression programmatically, given an x input.
This is statistics, not MATLAB. You have realizations of 157 independent random variables (in your case coming from a Zipf distribution with unknown parameter s). Statistics tells you that the maximum likelihood estimator for s is the value that maximizes the product of the probabilities of the outcomes of your empirical data. Your data are that you got 80 times 1, 40 times 2, ...
Now write down the probability that you get a 1 from the distribution, multiply it 80 times with itself, write down the probability that you get a 2 from the distribution, multiply it 40 times with itself and so on. Then multiply all the single products and you will get the maximum likelihood function to be optimized.
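That construction can be automated for an arbitrary ranks/counts pair by working with the log-likelihood instead of the raw product (which underflows for large counts). A hedged sketch of mine - the variable names are illustrative, and the finite-support (Zipf) normaliser is assumed:

```matlab
% Programmatic maximum likelihood for a finite-support Zipf distribution.
% log p_s(k) = -s*log(k) - log(sum(x.^(-s))), so the negative log-likelihood
% for counts y at ranks x is:
x = (1:6)';                  % ranks (any length)
y = [80 40 20 10 5 2]';      % observed counts
negll = @(s) s * (y' * log(x)) + sum(y) * log(sum(x.^(-s)));
smax = fminbnd(negll, 0.1, 10)   % minimise over a plausible range of s
```

For the zeta (infinite-support) version, replace `sum(x.^(-s))` with `zeta(s)`; that variant should land near the smax = 2.116 found by the grid search above. Because this takes x and y as plain inputs, it can be looped over thousands of vectors without rebuilding the likelihood by hand each time.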
Many thanks for explaining. I think I understand how it works in principle, but I'm still not sure (other than by attempting some hacky & probably buggy code with lots of for loops) how to obtain that expression given an input vector x that will have a different number of elements each time. I wonder if there is some other (higher-level) way of obtaining the parameters of the fitted distribution (`s`, if that's the only one).
Also, ultimately I need to numerically assess how well Zipf fits to each of my vectors, as those numbers are then used further downstream. Assessing goodness of fit visually via the plot for each one, as you suggested, is unfortunately not an option. s merely tells us the shape of the Zipf if I understand well, but not how well it fits the data per se.
Torsten on 2 Apr 2023 (edited)
I think I understand how it works in principle, but I'm still not sure (other than by attempting some hacky & probably buggy code with lots of for loops) how to obtain that expression given an input vector x that will have a different number of elements each time.
Yes, make an attempt with hacky and probably buggy code. You know what function should come out from the example above. If you have problems, include your code, and I will try to unhack and debug.
I wonder if there is some other (higher-level) way of obtaining the parameters of the fitted distribution (`s`, if that's the only one).
This is the usual way. Often the logarithm is taken of the maximum likelihood function in order to make numerics easier. Look it up under Wikipedia.
Also, ultimately I need to numerically assess how well Zipf fits to each of my vectors, as those numbers are then used further downstream. Assessing goodness of fit visually via the plot for each one, as you suggested, is unfortunately not an option. s merely tells us the shape of the Zipf if I understand well, but not how well it fits the data per se.
The visualization of your empirical distribution together with the theoretical approximation should be the starting point. I understand that this cannot be used in computations, but if one has problems interpreting theoretical results, a plot often helps.
But with theoretically estimating goodness of fit for distributions, I reach my statistical limits. You mentioned Anderson-Darling - this could be a point to start.
You may find the discussion and the links given here of interest for your application:
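Since the standard Kolmogorov-Smirnov critical values don't apply to a discrete distribution with a fitted parameter, a parametric bootstrap of the KS statistic is the usual workaround. A rough sketch of mine (not from the thread; `randsample` needs the Statistics and Machine Learning Toolbox):

```matlab
% Parametric-bootstrap KS test for a fitted finite-support Zipf distribution
x = (1:6)'; y = [80 40 20 10 5 2]'; n = sum(y);
s = 2.116;                                    % fitted parameter
pmf = 1 ./ x.^s / sum(1 ./ x.^s);             % fitted Zipf PMF
ks = @(counts) max(abs(cumsum(counts)/sum(counts) - cumsum(pmf)));
d_obs = ks(y);                                % observed KS statistic
nboot = 1000; d_boot = zeros(nboot, 1);
for b = 1:nboot
    sample = randsample(x, n, true, pmf);     % draw n ranks from fitted Zipf
    counts = histcounts(sample, 0.5:1:6.5)';  % tabulate into ranks 1..6
    d_boot(b) = ks(counts);
end
p_value = mean(d_boot >= d_obs)               % bootstrap p-value
```

Strictly, one should re-estimate s on every bootstrap sample (as in the Clauset-Shalizi-Newman procedure for power-law fits) rather than reusing the fitted value; the sketch above skips that refit for brevity, which makes the test somewhat conservative.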
Thanks once again for your very thorough help.
Here's my attempt so far (for some reason code highlighting doesn't work):
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
% Define the Zipf distribution
alpha = 1.5; % Shape parameter, 1.5 is apparently a good all-round value to start with
N = sum(freq); % Total number of observations
k = 1:length(x); % Rank of each observation
zipf_dist = N ./ (k.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure;
bar(x, freq); % or freq/N for probabilities
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Frequency');
legend('Observed', 'Zipf');
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* N;
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
Chi-squared statistic = 170591.8168
fprintf('p-value = %.4f\n', p_value);
p-value = 0.0000
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
Conclusion: The data is not from a Zipf distribution.
As a sanity check, I used a textbook Zipf rank distribution, which unfortunately doesn't pass the statistical test. If that doesn't, nothing will. Clearly something is wrong, but I don't know what. Using the Kolmogorov-Smirnov test, or the Anderson-Darling test with a custom-built (non-normal) distribution in place of the chi-squared test, does not change this.
Thoughts?
Why is "freq" non-integer for x = 3,6,7,8,9? I thought it was the absolute frequency with which x is chosen.
To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).
After this, zipf must be defined as
zipf = 1./x.^alpha / sum(1./x.^alpha);
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
epdf = freq/sum(freq);
% Define the Zipf distribution
alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with
zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure(1);
bar(x, epdf);
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Probability');
legend('Observed', 'Zipf');
hold off
fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));
alpha = 0.5:0.001:3;
falpha = arrayfun(@(alpha)fun(alpha),alpha);
figure(2)
plot(alpha,falpha)
[~,index] = max(falpha);
alphamax = alpha(index)
alphamax = 1
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* sum(freq);
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
Chi-squared statistic = 0.0000
fprintf('p-value = %.4f\n', p_value);
p-value = 1.0000
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
Conclusion: The data is from a Zipf distribution.


More Answers (1)

The zeta() function is 0 at all negative even integers, so zeta(rho+1) is 0 for every odd negative rho smaller than -1. You divide by that, so your function blows up repeatedly when rho is negative.
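Those zeros can be checked directly (zeta here is the same Symbolic Math Toolbox function already used in the question's code):

```matlab
% zeta vanishes at the negative even integers, so the PMF's denominator
% zeta(rho + 1) is zero at rho = -3, -5, ...
zeta(-2)       % should return 0
zeta(-4)       % should return 0
zeta(-3 + 1)   % i.e. zeta(rho + 1) with rho = -3: dividing by this is the problem
```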

4 Comments

It sounds like the wrong formula was used for the Zipf distribution then? It anyway does not match the one shown in the Wikipedia article for the Zipf...
The discussion in Wolfram Alpha says that rho must be positive, but your call to fit() does not restrict to positives.
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)', 'Lower', .1, 'upper', 100);
Warning: Start point not provided, choosing random start point.
f
f =
     General model:
     f(x) = x.^(-(rho + 1)) ./ zeta(rho + 1)
     Coefficients (with 95% confidence bounds):
       rho = 61.5  (-Inf, Inf)
The confidence interval is estimated from the standard deviation, so it can be quite wrong.
But let's try estimate the residue:
syms rho
residue = sum( (x.^(-(rho + 1)) ./ zeta(rho + 1) - y).^2 );
figure
fplot(residue, [10 100])
limit(residue, rho, inf)
ans = 
8370
double(ans)
ans = 8370
So you have an asymptote, and the result you get back from fit is basically just the place where fit gives up because the numeric difference is within tolerance of 0.
z8080 on 31 Mar 2023 (edited)
Thanks, but since the task I'm aiming for is fairly simple (see how well a Zipf distribution fits a bunch of numbers), surely there's a straightforward way to achieve that in Matlab? Edge cases such as the one you demonstrated suggest, if anything, that what I want cannot be achieved, which I doubt.
In sum, if restricting rho to positives doesn't get the job done, what does get it done?
One possibility is that the particular example numbers you posted are not well fit by a Zipf distribution.


Release: R2023a
Asked: 30 Mar 2023
Edited: 4 Apr 2023
