Compare probability density functions of 2 vectors

I have two vectors v and w of different legths and with positive elements. Each element of both vectors can be considered to be drawn according to some probability density function. I would need to check that the pdf of both is (more or less) the same, how could I do this in Matlab? I was thinking to plot the cdf of both vectors and see how much they differed, but maybe there's a better way to do that.

2 Comments

Did you already check whether the means/medians are the same (e.g., ttest2 or ranksum) and also check the variances (e.g., vartest2, ansaribradley)? If you can reject equality of either of those then of course you can conclude the vectors come from populations with different pdfs.
You can also use kstest2 to check the full pdfs, but keep in mind that this test has very low power unless the pdfs are very different or the vectors are very long. (That means the data will rarely allow you to conclude the pdf's are different even when they are. Power would be much higher for the tests of means and variances.)

Sign in to comment.

 Accepted Answer

Use histcounts to compute pdfs then compare them

2 Comments

If seems cdf comparison is better than pdf. There os a better separation. Both are normalized to 1 (maximum difference).
As Jeff Miller has suggested you can also derive mean, median, standard deviation, high order moments, skewness, kurtosis, etc
Ntest = 100;
for i = 1:Ntest
x1=rand(1,500);
x2=rand(1,400);
x3=rand(1,600)*1.2;
[dpdf12(i),dcdf12(i)] = diffpdf(x1,x2);
[dpdf13(i),dcdf13(i)] = diffpdf(x1,x3);
end
i = 1:Ntest;
subplot(2,1,1)
plot(i, dpdf12, 'b', i, dpdf13, 'r')
ylabel('dpdf')
legend('x1-x2','x1-x3')
subplot(2,1,2)
plot(i, dcdf12, 'b', i, dcdf13, 'r')
ylabel('dcdf')
legend('x1-x2','x1-x3')
function [dpdf,dcdf] = diffpdf(x1,x2)
x = [x1,x2];
[a,b] = bounds(x);
edges = linspace(a,b,33);
pdf1 = histcounts(x1,edges,'Normalization','pdf');
pdf2 = histcounts(x2,edges,'Normalization','pdf');
dx = mean(diff(edges));
dpdf = norm(pdf1-pdf2,1)*dx/2; % divide by 2 so as max(dpdf) is 1
cdf1 = cumsum(pdf1)*dx; % histcounts(x1,edges,'Normalization','cdf');
cdf2 = cumsum(pdf2)*dx; % histcounts(x2,edges,'Normalization','cdf');
dcdf = norm(cdf1-cdf2, Inf);
end
Another exmple to show the dindicator when comppare RAND and RANDN
Ntest = 100;
for i = 1:Ntest
x1=rand(1,500);
x2=rand(1,400);
x3=randn(1,600);
[dpdf12(i),dcdf12(i)] = diffpdf(x1,x2);
[dpdf13(i),dcdf13(i)] = diffpdf(x1,x3);
end
i = 1:Ntest;
subplot(2,1,1)
plot(i, dpdf12, 'b', i, dpdf13, 'r')
ylabel('dpdf')
legend('x1-x2','x1-x3')
subplot(2,1,2)
plot(i, dcdf12, 'b', i, dcdf13, 'r')
ylabel('dcdf')
legend('x1-x2','x1-x3')
function [dpdf,dcdf] = diffpdf(x1,x2)
x = [x1,x2];
[a,b] = bounds(x);
edges = linspace(a,b,33);
pdf1 = histcounts(x1,edges,'Normalization','pdf');
pdf2 = histcounts(x2,edges,'Normalization','pdf');
dx = mean(diff(edges));
dpdf = norm(pdf1-pdf2,1)*dx/2; % divide by 2 so as max(dpdf) is 1
cdf1 = cumsum(pdf1)*dx; % histcounts(x1,edges,'Normalization','cdf');
cdf2 = cumsum(pdf2)*dx; % histcounts(x2,edges,'Normalization','cdf');
dcdf = norm(cdf1-cdf2, Inf);
end

Sign in to comment.

More Answers (1)

[Edit: fix error in my code below.]
x1=rand(1,50);
x2=rand(1,40);
[h,p]=kstest2(x1,x2);
if h
fprintf('prob(x1,x2) from same distribution=%.3f.',p);
fprintf('Accept the null hypothesis that they are are from same distribution.');
else
fprintf('prob(x1,x2) from same distribution=%.3f.',p);
fprintf('Reject the null hypothesis that they are are from same distribution.');
end
prob(x1,x2) from same distribution=0.503.
Accept the null hypothesis that they are are from same distribution.
OK

10 Comments

The two-sample K.-S. test does what @Bruno Luong suggested: it computes the CDFs of the two samples, then compares them. The comparison is done by finding the max vertical difference between the two CDFs. The null hypothesis is that the two samples are drawn from the same distribution. The max difference between the CDFs is used to estimate the probability that the null hypothesis is true. If the probability is low, the null hypothesis is rejected.
How would you sugges to do if we want to compare multi-variate random populations?
The cdf is not defined but l1 norm of the diffference of pdf would be a good criteria?
You are thinking at a higher level than me!
This seems to be the answer, or at least an answer. The authors are two of the four authors of the classic text, Numerical Recipes. I never had a class with Professor Press, although I knew people who did. I had one of the other authors, Prof. Vetterling, for a course on waves. I had some interesting discussions with him about the book.
According to kstest2, h = true "if the test rejects the null hypothesis," where "the null hypothesis that the data in vectors x1 and x2 are from the same continuous distribution."
In statistical tests, does rejection of the null hypothesis imply acceptance of the alternative hypothesis?
[Edit - I misunderstood @Paul's point.]
You and @T.C. are right, I will fix my code above, which was wrong.
Wait, the if and the else print the same thing. In one of them you should reject the null hypothesis?
I'm not sure how to interpret kstest2 result, shouldn' h be always TRUE? p seem to vary quite a bit from run to run.
Ntest = 100;
for i = 1:Ntest
x1=rand(1,500);
x2=rand(1,400);
x = [x1,x2];
[a,b] = bounds(x);
edges = linspace(a,b,33);
pdf1 = histcounts(x1,edges,'Normalization','pdf');
pdf2 = histcounts(x2,edges,'Normalization','pdf');
dx = mean(diff(edges));
pdfp(i) = norm(pdf1-pdf2,1)*dx;
cdf1 = histcounts(x1,edges,'Normalization','cdf');
cdf2 = histcounts(x2,edges,'Normalization','cdf');
cdfp(i) = norm(cdf1-cdf2, Inf);
[h(i),p(i)]=kstest2(x1,x2);
end
subplot(2,2,1)
plot(pdfp)
ylabel('pdfp')
subplot(2,2,2)
plot(cdfp)
ylabel('cdfp')
subplot(2,2,3)
plot(h)
ylabel('h')
subplot(2,2,4)
plot(p)
ylabel('p')
The first part of my comment above was intended to point out that the if test in the code in the answer is inconsistent with the doc page kstest2. If I'm reading that doc correctly, that code should be:
if h
% fprintf('prob(x1,x2) from same distribution=%.3f.',p);
% fprintf('Accept the null hypothesis that they are are from same distribution.');
fprintf('REJECT the null hypothesis that x1 and x2 are from same distribution.');
else
% fprintf('prob(x1,x2) from same distribution=%.3f.',p);
% fprintf('Accept the null hypothesis that they are are from same distribution.');
% Maybe this should say: "Accept the alternative hypothesis ... "
fprintf('Cannot reject the null hypothesis that x1 and x2 are from same distribution.');
end
Disclaimer: I don't have any expertise in statistical testing, just going by what the doc page states.
By "p seem to vary quite a bit from run to run." do you mean the following?
N = 1e2;
p = nan(1,N);
rng(100);
for ii = 1:N
x1=rand(1,500);
x2=rand(1,400);
[h,p(ii)]=kstest2(x1,x2);
end
plot(p)
If so, maybe @Jeff Miller's answer to this question regarding variability of the p-value would be of interest, albeit for a different test (maybe the same concept applies here?).
@Paul "By "p seem to vary quite a bit from run to run." do you mean the following?"
Yes your question seems arrive at the samme time than I edit my post.
Hi William,
After the edit, I still think the logic in the code is not correct. According to kstest2, if h = 1, the null hypothesis should be rejected. Consider what happens if we change the distribution for x2
x1=rand(1,50);
%x2=rand(1,40);
x2=randn(1,40);
[h,p]=kstest2(x1,x2);
if h
fprintf('prob(x1,x2) from same distribution=%.3f.',p);
fprintf('Accept the null hypothesis that they are are from same distribution.');
else
fprintf('prob(x1,x2) from same distribution=%.3f.',p);
fprintf('Reject the null hypothesis that they are are from same distribution.');
end
prob(x1,x2) from same distribution=0.000.
Accept the null hypothesis that they are are from same distribution.

Sign in to comment.

Products

Release

R2024a

Asked:

on 8 Nov 2024

Edited:

on 10 Nov 2024

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!