Matlab find unique column-combinations in matrix and respective index

I have a large matrix with many rows and a limited (but larger than 1) number of columns containing values between 0 and 9, and I would like to find an efficient way to identify the unique row-wise combinations and their indices, to then build sums (somewhat like a pivot logic). Here is an example of what I am trying to achieve:
a =
1 2 3
2 2 3
3 2 1
1 2 3
3 2 1
uniqueCombs =
1 2 3
2 2 3
3 2 1
numOccurrences =
2
1
2
indices =
[1;4]
[2]
[3;5]
From matrix a, I want to first identify the unique combinations (row-wise), then count the number of occurrences and identify the row indices of each combination.
I have achieved this by generating strings with num2str and strcat, but this method appears to be very slow. Along the same lines, I have tried to find a way to form a new unique number by concatenating the values horizontally, but Matlab does not seem to support this (e.g. from [1;2;3] build 123). Sums won't work because different combinations can share the same sum, so they could no longer be told apart. Any suggestions on how to best achieve this? Thanks!

 Accepted Answer

More or less the same as Jan's, using accumarray instead of splitapply (I'm still old school!):
A = [ 1 2 3
2 2 3
3 2 1
1 2 3
3 2 1];
[B, ~, ib] = unique(A, 'rows');
numOccurrences = accumarray(ib, 1); % count how many rows fall into each group
indices = accumarray(ib, find(ib), [], @(rows){rows}); % the find(ib) simply generates (1:size(A,1))'
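As a side note on the idea from the question of building one number per row (e.g. 123 from 1 2 3): since all values are single digits between 0 and 9, this can be sketched with a matrix product. This is only an illustration under that assumption, not a replacement for unique(A, 'rows'); the names weights, keys, firstRow and group are made up for the example, and the keys stay exact in double precision only for up to 15 columns.

```matlab
A = [1 2 3; 2 2 3; 3 2 1; 1 2 3; 3 2 1];

% Assumption: every entry of A is an integer between 0 and 9.
nCols   = size(A, 2);
weights = 10 .^ (nCols-1:-1:0).';   % column vector [100; 10; 1]
keys    = A * weights;              % one number per row: [123; 223; 321; 123; 321]

% unique on a plain vector; group maps each row of A to its unique key
[~, firstRow, group] = unique(keys);
uniqueCombs    = A(firstRow, :);        % one representative row per combination
numOccurrences = accumarray(group, 1);  % rows per combination
```

For digit data the keys sort in the same order as the rows do, so uniqueCombs should match what unique(A, 'rows') returns.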

4 Comments

This answer works just as well - many thanks guys!
@Benvaulter: Please compare the speeds of the two methods for calculating indices with your real data:
[B, iB, iA] = unique(A, 'rows');
tic
G = unique(iA);
n = length(G);
indices1 = cell(1, n);
for k = 1:n
    indices1{k} = find(iA == G(k));
end
toc
[B, ~, ib] = unique(A, 'rows');
tic
indices2 = accumarray(ib, find(ib), [], @(rows){rows});
toc
I'm curious about the timings. If Guillaume's method is faster, use it and accept his answer. Thanks.
I suspect that accumarray will be faster as it is built-in compiled code whereas splitapply is m code, but I haven't conducted any test.
Note: for the indices,
indices = accumarray(ib, (1:numel(ib))', [], @(rows){rows});
is probably slightly faster, just not as concise.
@Guillaume: I compare this with cellfun: in older versions, Matlab contained the C sources for this Mex function. There, calling a function handle is very expensive, because control has to go back to the Matlab level. Therefore the built-in methods selected by strings are much faster: 'length', 'isclass' etc.
So using a compiled Mex function is not a real benefit in that case, because mexCallMATLAB has some overhead. This might apply to accumarray as well. I guess that your accumarray approach is faster than the loop, but I know that it looks very cryptic ;-)
But now I can leave the speculations and run a test: With
A = randi([1, 100], 1e5, 3); % Test data
my loop takes 14.75 seconds, your accumarray approach takes 0.44 seconds. The results differ in the order of the indices. So perhaps this is wanted:
[B, iB, iA] = unique(A, 'rows');
indices = accumarray(iA, (1:numel(iA)).', [], @(r){sort(r)});
The result is clear: @Benvaulter, please unaccept my answer and select Guillaume's, and of course use it also to save time and energy.
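To check that both approaches return the same index lists once the order inside each group is fixed, a comparison along these lines could be used. It is only a sketch: the smaller test size and the names indicesLoop and indicesAcc are chosen for the example, and the loop result is stored as a column cell so that its shape matches the accumarray output.

```matlab
A = randi([1, 100], 1e4, 3);     % smaller test data so the loop finishes quickly
[~, ~, iA] = unique(A, 'rows');
n = max(iA);                     % number of unique rows

% Loop version: find returns the row numbers in ascending order
indicesLoop = cell(n, 1);
for k = 1:n
    indicesLoop{k} = find(iA == k);
end

% accumarray version, sorted inside each group to fix the order
indicesAcc = accumarray(iA, (1:numel(iA)).', [], @(r){sort(r)});

sameResult = isequal(indicesLoop, indicesAcc);
```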


More Answers (1)

A = [ 1 2 3; ...
2 2 3; ...
3 2 1; ...
1 2 3; ...
3 2 1];
[B, iB, iA] = unique(A, 'rows');
G = unique(iA);
numOccurrences = splitapply(@numel, iA, iA); % count the rows per group; the grouping vector needs one entry per data row
I cannot test a method to obtain the indices list as wanted. I assume this works with splitapply also. A simple loop approach at least:
n = length(G);
indices = cell(1, n);
for k = 1:n
    indices{k} = find(iA == G(k));
end
[EDITED] Code is tested now. Use the much faster solution of Guillaume for production work.
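For completeness, the index lists can probably be obtained with splitapply as well, by wrapping each group in a cell. This is a minimal sketch and not timed against the other versions; rowNumbers is just a helper name for the example, and splitapply requires R2015b or later.

```matlab
A = [1 2 3; 2 2 3; 3 2 1; 1 2 3; 3 2 1];
[B, ~, iA] = unique(A, 'rows');

rowNumbers = (1:size(A, 1)).';   % one row number per row of A

% The grouping vector iA has one entry per element of rowNumbers
numOccurrences = splitapply(@numel, rowNumbers, iA);
indices        = splitapply(@(rows) {rows}, rowNumbers, iA);   % cell array, one cell per unique row
```

For the example matrix, indices{1} holds rows 1 and 4, indices{2} holds row 2, and indices{3} holds rows 3 and 5.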

Asked: on 22 Mar 2017

Edited: Jan on 23 Mar 2017
