Quantifying the similarity between data sets

Hi, I implemented an algorithm that tracks a particle in space and time. I applied it to two experiments and obtained two data sets, A=[X,Y] and B=[X,Y], of 8399 coordinate points each. The experiments were exactly the same. I plotted A and B and there are clear differences between them, but overall the points lie within similar limits. Of course, they are never going to be exactly the same because of errors in the tracking algorithm. Still, given a certain criterion, is there any method that quantifies the difference between the data sets, so that I can say "ok, they are close enough" or "no, there is too much difference between them"?
P.S. I attached the data set I am currently analysing. Thank you.

Answers (2)

Image Analyst on 14 Jul 2017
  2 Comments
Daniel Mella on 16 Jul 2017
Thanks for your answer.
I tried it, but it is not what I am looking for. I need a way to quantify how similar or different my plots are.
I have been thinking of applying an FFT to A and B using the pwelch function and then calculating the cross-correlation between the spectra. I think that will give me the similarity in X and Y.
Image Analyst on 16 Jul 2017
Methods like SIFT and SURF first identify a set of "salient points" and then use point matching algorithms to find subsets of points that seem to align fairly well. If you don't like the ones in the Computer Vision System Toolbox, you can use some other one: https://www.google.com/#q=point+matching+algorithm
Or look into how "optical flow" (also in the CVSToolbox) works.



Star Strider on 16 Jul 2017
I can’t find anything online that addresses your problem, and there may be no consensus. Some exploration of your data reveals that the x-coordinates in both are (essentially) identically distributed, and the y-coordinates in both are (essentially) identically distributed. The x- and y-coordinates have different distributions, and none of them is normally distributed.
One approach therefore could be to do a Wilcoxon rank sum test (equivalent to the Mann-Whitney U test) separately on the x-coordinates of the two data sets and on the y-coordinates of the two data sets. This tests the null hypothesis that the medians are the same, against the alternative hypothesis that they are different.
AB = load('data_sets.mat');
A = AB.A;
B = AB.B;
[p1,h1,stats1] = ranksum(A(:,1),B(:,1));
[p2,h2,stats2] = ranksum(A(:,2),B(:,2));
Neither test rejects the null hypothesis, so there is no evidence that the medians differ in either the x- or the y-coordinates.
To demonstrate that the distributions of the x- and y-coordinates are not different would require a different test, such as a chi-square goodness-of-fit test of one x-coordinate distribution against the other, and similarly for the y-coordinates. (Use histogram or histcounts to generate the distributions.) You would have to write that code yourself, and then use the appropriate chi squared distribution function to calculate the p-values based on your calculated chi-square statistics and degrees-of-freedom.
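As a complement to the chi-square idea, a two-sample Kolmogorov-Smirnov test compares the two empirical distributions directly, without binning. This is a different technique from the one described above, shown only as a minimal sketch. It assumes the Statistics and Machine Learning Toolbox (for kstest2) and uses synthetic stand-in data, so replace A and B with the real arrays.

```matlab
% Synthetic stand-ins for the real data
% (assumption: A and B are N-by-2 [X,Y] arrays).
rng(0);                                  % reproducible random numbers
A = rand(8399,2);
B = rand(8399,2);

% Two-sample Kolmogorov-Smirnov test, one coordinate at a time.
% h = 1 rejects the null hypothesis that both samples come from
% the same continuous distribution (default 5% significance level).
[hX,pX,ksX] = kstest2(A(:,1),B(:,1));
[hY,pY,ksY] = kstest2(A(:,2),B(:,2));

fprintf('X: h = %d, p = %.3g, KS statistic = %.3g\n', hX, pX, ksX);
fprintf('Y: h = %d, p = %.3g, KS statistic = %.3g\n', hY, pY, ksY);
```

The KS statistic is the largest vertical gap between the two empirical cumulative distribution functions, so it lies between 0 and 1 and can serve as a "how far apart" number in its own right, independently of the p-value.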
Since a definitive discussion on this does not seem to exist, or at least has evaded my search for it, this is the best I can come up with.
  3 Comments
Star Strider on 17 Jul 2017
My pleasure,
I experimented with the chi-square idea in the interim:
Xedges = linspace(min([A(:,1);B(:,1)]),max([A(:,1);B(:,1)]), 20);   % 20 shared edges -> 19 bins
Yedges = linspace(min([A(:,2);B(:,2)]),max([A(:,2);B(:,2)]), 20);
HXA = histcounts(A(:,1),Xedges);                                    % bin counts
HXB = histcounts(B(:,1),Xedges);
HYA = histcounts(A(:,2),Yedges);
HYB = histcounts(B(:,2),Yedges);
FXA = HXA/sum(HXA)+sqrt(eps);                                       % relative frequencies
FXB = HXB/sum(HXB)+sqrt(eps);
FYA = HYA/sum(HYA)+sqrt(eps);
FYB = HYB/sum(HYB)+sqrt(eps);
Chi2_X = sum(HXB)*sum((FXB(:)-FXA(:)).^2./FXA(:));                  % A is the reference ("expected") distribution
Chi2_Y = sum(HYB)*sum((FYB(:)-FYA(:)).^2./FYA(:));
df = numel(FXA)-1;                                                  % number of bins minus one
P1 = chi2cdf(Chi2_X, df, 'upper');                                  % p-value = upper-tail probability
P2 = chi2cdf(Chi2_Y, df, 'upper');
I believe this is correct, although I’ve not written code to calculate chi-square statistics in a while. Because the statistic is built from relative frequencies, it is scaled by the number of points, and the p-value is the upper tail of the chi-square distribution. Adding ‘sqrt(eps)’ prevents Inf and NaN values in the chi-square calculations, since some of the bins have zero counts.
Unfortunately, the p-values are vanishingly small, so the null hypothesis that the two sets of coordinates come from the same distribution is rejected. Keep in mind that with 8399 points in each data set, the test is sensitive to even small differences.
I would be hesitant to use pwelch on random spatial data. You might want to experiment with the fft2 function instead, and the image processing functions.
Yours appears to be a relatively new problem. I am not certain how to approach it, and the literature search I did turned up no relevant results.
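One way to read the suggestion to use image processing functions is to turn each point cloud into an occupancy "image" and compare the two images. The sketch below is an illustration of that idea, not an established method for this problem: it bins the points with histcounts2 on a shared grid and reports the Pearson correlation between the two maps. The data are synthetic stand-ins and the bin count nBins is a hypothetical choice.

```matlab
% Synthetic stand-ins for the real data
% (assumption: A and B are N-by-2 [X,Y] arrays).
rng(0);
A = randn(8399,2);
B = randn(8399,2);

% Shared bin edges so that both occupancy maps cover the same grid.
nBins = 20;                              % hypothetical choice, tune for your data
Xedges = linspace(min([A(:,1);B(:,1)]), max([A(:,1);B(:,1)]), nBins+1);
Yedges = linspace(min([A(:,2);B(:,2)]), max([A(:,2);B(:,2)]), nBins+1);

% 2-D histograms: each cell counts the points that fall inside it.
NA = histcounts2(A(:,1), A(:,2), Xedges, Yedges);
NB = histcounts2(B(:,1), B(:,2), Xedges, Yedges);

% Pearson correlation between the two maps. Values near 1 mean the
% maps have the same shape up to a scale factor and an offset.
R = corrcoef(NA(:), NB(:));
r = R(1,2);
fprintf('Correlation between occupancy maps: %.3f\n', r);
```

The result depends on the bin size: very fine grids are dominated by counting noise, very coarse grids hide real differences, so it is worth checking how r changes over a few values of nBins.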
Kafayat Olayinka on 29 May 2020
Can you show us how to plot this and what it'll look like? Thanks
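A minimal sketch of one way to plot the binned distributions: grouped bars of the relative frequencies of the x-coordinates of A and B on shared bin edges. The data here are synthetic stand-ins, so the figure will not look like the one for the real data, and the axis labels are only suggestions. Repeat with column 2 for the y-coordinates.

```matlab
% Synthetic stand-ins for the real data
% (assumption: A and B are N-by-2 [X,Y] arrays).
rng(0);
A = rand(8399,2);
B = rand(8399,2);

% Shared edges (20 edges -> 19 bins) and the matching bin centers.
Xedges = linspace(min([A(:,1);B(:,1)]), max([A(:,1);B(:,1)]), 20);
Xcenters = Xedges(1:end-1) + diff(Xedges)/2;

% Relative frequency of each bin for both data sets.
FXA = histcounts(A(:,1), Xedges, 'Normalization', 'probability');
FXB = histcounts(B(:,1), Xedges, 'Normalization', 'probability');

% Grouped bar chart: one pair of bars per bin.
figure
bar(Xcenters, [FXA(:) FXB(:)])
legend('A', 'B')
xlabel('x-coordinate')
ylabel('Relative frequency')
title('Binned x-coordinate distributions of A and B')
```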
