Asked by Thomas Arildsen
on 10 Jan 2013

Let us say I have some equal-sized sets of indices. I want to determine how much these sets overlap in the sense that I want to know how many of the sets contain a given index. I don't know what this is called, but I think of it as a kind of "soft intersection". I imagine doing it in one of the following ways, but I would like suggestions as to whether this is reasonable and possible better ways.

- The indices in the sets can be in (1:N). I have K sets. I could create K length-N indicator arrays and add these K arrays together. The resulting length-N array would contain occurrence counts of the indices of their positions in the array.
- Alternatively, I could create the union of the K sets. Then I could use find() to find occurrences of each of the members of the union set in a concatenation of the K index sets. The sizes of the results of find() would be occurrence counts of the indices in the sets.

Answer by Thomas Arildsen
on 10 Jan 2013

Accepted Answer

OK, that didn't take very long to try... I implemented no. 1 as follows:

function [ idxUnion, idxCounts ] = overlap1( idxSets )

%OVERLAP1 quantifies overlap between sets.

% The function counts in how many of the input sets each index occurs.

% The sets are specified as columns of the input matrix.

K = size(idxSets,2);

N = max(idxSets(:));

indicator = zeros(K,N);

for k = 1:K

indicator(k,idxSets(:,k)) = 1;

end

counts = sum(indicator,1);

idxUnion = find(counts ~= 0);

idxCounts = counts(idxUnion);

end

I implemented no. 2 as follows:

function [ idxUnion, idxCounts ] = overlap2( idxSets )

%OVERLAP2 quantifies overlap between sets.

% The function counts in how many of the input sets each index occurs.

% The sets are specified as columns of the input matrix.

K = size(idxSets,2);

N = max(idxSets(:));

idxUnion = unique(idxSets)';

idxCounts = zeros(size(idxUnion));

for ii = 1:length(idxCounts)

idxCounts(ii) = length(find(idxUnion(ii) == idxSets));

end

end

I test them as follows:

function [ result ] = overlap_test()

%OVERLAY_TEST Tests the overlay1 and overlay2 functions for errors.

% This function runs overlay1 and overlay2 with a random test input and

% performs a couple of sanity checks on the output to validate these

% functions.

K = 100;

M = 100;

N = 1e4;

testSets = zeros(M,K);

for k = 1:K

testSets(:,k) = randperm(N,M)';

end

% Test overlap1 union

tic

[idxUnion1,idxCnts1] = overlap1(testSets);

t1 = toc;

disp(sprintf('overlap1 took %f s\n',t1))

result(1,1) = isempty(setdiff(idxUnion1,unique(testSets)));

% Test overlap2 union

tic

[idxUnion2,idxCnts2] = overlap2(testSets);

t2 = toc;

disp(sprintf('overlap2 took %f s\n',t2))

result(2,1) = isempty(setdiff(idxUnion2,unique(testSets)));

% Test agreement on counts

try

result(3,1) = all(idxCnts1 == idxCnts2);

catch anError

warning(anError.identifier,anError.message);

result(3,1) = 0;

end

end

No. 1 seems clearly faster.

Sign in to comment.

Opportunities for recent engineering grads.

Apply Today
## 1 Comment

## Thomas Arildsen (view profile)

Direct link to this comment:https://www.mathworks.com/matlabcentral/answers/58421-finding-partial-overlaps-between-sets#comment_121830

Sign in to comment.