Finding partial overlaps between sets

on 10 Jan 2013

Thomas Arildsen (view profile)

Let us say I have some equal-sized sets of indices. I want to determine how much these sets overlap in the sense that I want to know how many of the sets contain a given index. I don't know what this is called, but I think of it as a kind of "soft intersection". I imagine doing it in one of the following ways, but I would like suggestions as to whether this is reasonable and possible better ways.
1. The indices in the sets can be in (1:N). I have K sets. I could create K length-N indicator arrays and add these K arrays together. The resulting length-N array would contain occurrence counts of the indices of their positions in the array.
2. Alternatively, I could create the union of the K sets. Then I could use find() to find occurrences of each of the members of the union set in a concatenation of the K index sets. The sizes of the results of find() would be occurrence counts of the indices in the sets.

Thomas Arildsen

Thomas Arildsen (view profile)

on 10 Jan 2013
I should add that the indices in each of the K sets are unique in that set, so the sets do not contain any duplicates.

Thomas Arildsen (view profile)

on 10 Jan 2013

OK, that didn't take very long to try... I implemented no. 1 as follows:
function [ idxUnion, idxCounts ] = overlap1( idxSets )
%OVERLAP1 quantifies overlap between sets.
% The function counts in how many of the input sets each index occurs.
% The sets are specified as columns of the input matrix.
K = size(idxSets,2);
N = max(idxSets(:));
indicator = zeros(K,N);
for k = 1:K
indicator(k,idxSets(:,k)) = 1;
end
counts = sum(indicator,1);
idxUnion = find(counts ~= 0);
idxCounts = counts(idxUnion);
end
I implemented no. 2 as follows:
function [ idxUnion, idxCounts ] = overlap2( idxSets )
%OVERLAP2 quantifies overlap between sets.
% The function counts in how many of the input sets each index occurs.
% The sets are specified as columns of the input matrix.
K = size(idxSets,2);
N = max(idxSets(:));
idxUnion = unique(idxSets)';
idxCounts = zeros(size(idxUnion));
for ii = 1:length(idxCounts)
idxCounts(ii) = length(find(idxUnion(ii) == idxSets));
end
end
I test them as follows:
function [ result ] = overlap_test()
%OVERLAY_TEST Tests the overlay1 and overlay2 functions for errors.
% This function runs overlay1 and overlay2 with a random test input and
% performs a couple of sanity checks on the output to validate these
% functions.
K = 100;
M = 100;
N = 1e4;
testSets = zeros(M,K);
for k = 1:K
testSets(:,k) = randperm(N,M)';
end
% Test overlap1 union
tic
[idxUnion1,idxCnts1] = overlap1(testSets);
t1 = toc;
disp(sprintf('overlap1 took %f s\n',t1))
result(1,1) = isempty(setdiff(idxUnion1,unique(testSets)));
% Test overlap2 union
tic
[idxUnion2,idxCnts2] = overlap2(testSets);
t2 = toc;
disp(sprintf('overlap2 took %f s\n',t2))
result(2,1) = isempty(setdiff(idxUnion2,unique(testSets)));
% Test agreement on counts
try
result(3,1) = all(idxCnts1 == idxCnts2);
catch anError
warning(anError.identifier,anError.message);
result(3,1) = 0;
end
end
No. 1 seems clearly faster.