pdist gives NaN but there are no missing values in the input array

Hi Everyone,
I am using the pdist function (pdist2 in the script below) to calculate pairwise Jaccard distances between pairs of observations. Sometimes, seemingly at random, the calculated value is NaN instead of a distance. The input matrix, however, does not have any missing values. It seems to me that the correct answer in these places should be 0, since the two observations have nothing in common (I calculate a similarity measure as 1-pdist). The same piece of code works fine on later versions of the data, but going back in time (when observations should be less similar) the NaNs start appearing. The script that I am using to calculate these measures is below. I have also attached one of the input csv files that results in NaNs, as well as the matrix that the results are supposed to go into (the first two columns are the pairs of observations, the third column is a successfully calculated similarity measure for a different time period, and the fourth column is supposed to be filled with the new similarity values).
Any help would be highly(!!) appreciated.
% Read data
clearvars -except jaccard_dyadic
A = readmatrix('190529ML_lawyer_4');
% Variables used to create the matrix
args = A(:,1);
lawgs = A(:,2);
% Indicator matrix: entry (arg, lawg) is 1 if that pair appears in the data
arg_lawg = zeros(max(args), max(lawgs));
for i = 1:size(A,1)
    arg_lawg(args(i), lawgs(i)) = 1;
end
% Loop calculating the Jaccard similarity measure (works on other iterations of the data), filling in the larger matrix
for i = 1:(find(jaccard_dyadic(:,1)==0, 1, 'first')-1)
    jaccard_dyadic(i,4) = 1 - pdist2(arg_lawg(jaccard_dyadic(i,1),:), arg_lawg(jaccard_dyadic(i,2),:), 'jaccard');
end
  2 Comments
Walter Roberson on 2 Jun 2019
>> find(isnan(jaccard_dyadic))
ans =
300007
300009
300011
300019
300022
300023
300027
300035
300041
300052
300063
300064
300068
300072
300074
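As an aside, the values above are linear (column-major) indices into jaccard_dyadic. A minimal sketch of how such indices can be mapped back to row and column subscripts with ind2sub follows. The matrix M here is a small hypothetical stand-in, not the actual data from the question.

```matlab
% Hypothetical stand-in for jaccard_dyadic: 5 rows, 4 columns
M = zeros(5, 4);
M(2, 4) = NaN;
M(5, 4) = NaN;

% find returns linear indices, counted down the columns
idx = find(isnan(M));

% Convert the linear indices to (row, column) subscripts
[r, c] = ind2sub(size(M), idx);

% Show each linear index next to its row and column
disp([idx r c]);
```

Applied to the real matrix, the same call would show which pairs of observations (rows) produced a NaN and confirm that they all sit in the same column.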
John Kirk on 3 Jun 2019
Thanks for the response Walter. Have you been able to tell why these NaNs pop up to begin with? From my understanding, these should have been zeros, but I am now wondering whether I messed up somewhere.

Accepted Answer

Walter Roberson on 3 Jun 2019
The documentation describes the 'jaccard' distance as: "One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ."
Now if there are no nonzero coordinates in either of the two rows, then the number of differences is 0 and the number of nonzero coordinates is 0, so you are computing 0/0, which is NaN.
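The 0/0 case can be reproduced in base MATLAB without calling pdist2. The sketch below follows the documented definition quoted above. It is not the toolbox's actual source, and treating the similarity of two all-zero rows as 0 is an assumption about the desired convention, not something pdist2 does for you.

```matlab
% Two all-zero rows: neither has any nonzero coordinates
x = [0 0 0 0];
y = [0 0 0 0];

% Coordinates where at least one of the two rows is nonzero
nz = (x ~= 0) | (y ~= 0);

% Number of those coordinates where the rows differ
nDiff = sum((x ~= y) & nz);

% Jaccard distance per the documented definition: here 0/0, which is NaN
d = nDiff / sum(nz);

% Guard (assumed convention): call two empty rows "nothing in common"
if sum(nz) == 0
    sim = 0;
else
    sim = 1 - d;
end
```

In the script from the question, the same convention could be applied after the loop with jaccard_dyadic(isnan(jaccard_dyadic(:,4)),4) = 0, provided a similarity of 0 is really what is wanted for pairs where both rows are empty.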
  1 Comment
John Kirk on 3 Jun 2019
Oh of course. Should have thought of that myself. Thanks for the clarification. Much appreciated!
