4.5

4.5 | 4 ratings Rate this file 73 downloads (last 30 days) File Size: 2.51 KB File ID: #23333

Determine and count unique values of an array

by Anthony Kendall

 

17 Mar 2009 (Updated 21 Mar 2009)

Code covered by BSD License  

Very fast function to determine and count unique values of numeric, logical, char, cell arrays.

Download Now | Watch this File

File Information
Description

This function determines the unique values of an N-D array, and counts instances of those values using MATLAB's accumarray function, or in cases where accumarray is inappropriate the more traditional sort-diff method is used.

Its primary use is to very quickly count the number of instances of unique values within an array.

However, if only returning unique values (and not counts), it is slightly faster than MATLAB's built-in 'unique' function for arrays of intermediate to large sizes. This is particularly true for integer-valued arrays (not necessarily just integer type). For float arrays, the speed increase is due mainly to fewer input validation tests and other options. Unlike 'unique' it does not have 'rows', 'first', or 'last' options.

It returns the unique values in a sorted vector, and counts of those values are in the same order.

- Usage:
>> [uniques] = count_unique(largeArray);
>> [uniques,numUnique] = count_unique(largeArray);

- Example (a rather trivial example):
>> a = repmat((1:1000),1000,1); %build an unsorted array of values
>> [uniques,numUnique] = count_unique(a);
>> size(uniques)
ans =
        1000 1
>> numUnique(1:10)
ans =
        1000
        1000
        1000
        1000
        1000
        1000
        1000
        1000
        1000
        1000

- Example (a less trivial example):
>> a = floor(rand(1e6,1)*1e6);
>> [uniques,numUnique] = count_unique(a);
>>uniques(1:10),numUnique(1:10)
ans =
     0
     1
     4
     5
     7
     9
    11
    12
    15
    19

ans =
     2
     2
     2
     1
     2
     3
     1
     1
     1
     1

For returning only unique values:
- Speed Comparison (integer valued):
>> a = floor(rand(1e7,1)*1e6);
>> tic;[uniques] = count_unique(a);toc %count_unique
Elapsed time is 1.891197 seconds.
>>tic;[uniques] = unique(a);toc %built-in MATLAB
Elapsed time is 2.829160 seconds.

- Speed Comparison (decimal valued):
>>a = rand(1e7,1)*1e6;
>>tic;[uniques] = count_unique(a,'float');toc
Elapsed time is 2.995239 seconds.
>>tic;[uniques] = unique(a);toc
Elapsed time is 3.002850 seconds.

MATLAB release MATLAB 7.7 (R2008b)
Other requirements Requires 'accumarray' function, present in recent MATLAB versions.
Zip File Content  
Other Files count_unique.m,
license.txt
Tags for This File  
Everyone's Tags
Tags I've Applied
Add New Tags Please login to tag files.
Comments and Ratings (7)
18 Mar 2009 John D'Errico

I'm confused. Even the author admits this is only slightly faster than the existing unique, and that the main reason is probably because it does not bother to check the arguments. He also admits that it has considerably less capability than the existing unique.

Anyone truly in need of the maximum speed is better off with a direct call to accumarray anyway, which will further reduce any overhead.

Why does this exist? Perhaps some time comparisons might be of value, and some examples that show why this is a better choice than the existing alternatives.

The code does look to be carefully written at a glance, and the help seems reasonable, were I to rate this.

18 Mar 2009 Anthony Kendall

John,
The code is primarily used to count the number of each unique instance within an array, but as a side benefit is slightly faster than the existing unique function (and quite a bit faster for integer-valued arrays). I'll modify the description to make this more obvious.

Thanks for your comments.

18 Mar 2009 John D'Errico

Ok. That was quick. This is looking better now. As a potential downloader, the description gives me some reason to use it, especially if I'm looking to squeeze some speed out of a tight piece of code.

A look inside the code is always interesting. I like to see when the author has taken care in the code. Friendly code worries about things like the presence of nans, infs, different variable classes, etc. Also lots of comments. White space. Intelligently named, mnemonic variable names. These things make any piece of code an easy read, and easy to debug in the future.

Good help, examples. etc. All well done. Thanks for the changes.

21 Mar 2009 Bruno Luong

Inside the code, the author uses ACCUMARRAY on the values of the input array. This is a poor choice because it might create a large chunk of memory and slow down greatly.

Here is a code to check this effect. On my computer the worse time is 2.5 s, and unique can accomplish in less than 1 ms.

power=1:64;
time=zeros(length(power),2);

for k=1:length(power)
    
    twoelem=[0 2.^power(k)-1];
    disp(twoelem)
    
    tic
    count_unique(twoelem);
    time(k,1) = toc;
    
    tic
    unique(twoelem);
    time(k,2) = toc;
end
fprintf('worse time = %g s\n', max(time,[],1))
plot(power,time);
ylabel('second');
title('Time for array of two elements')
legend('count unique','unique')

21 Mar 2009 Anthony Kendall

Bruno,
Thanks for making this point. This can be addressed by specifying the 'float' option in count_unique. However, I've added also added a check that when the maximum value of the input array is much greater than its number of elements, accumarray is not used.

I ran your example on the unmodified code, and got results similar to yours. Now, on the modified code (without specifying the 'float' option, which does not use accumarray), the run times are both trivial in your example. The change increased run times for large array integer values slightly, and the description above is modified to reflect that.

Thanks again.

08 Apr 2009 Ilya  
08 Apr 2009 dave matthews

it would be nice to extend this to the n-d case, or at least the 2-d case, i.e., find and list unique pairs?

x = [1 2; 1 3; 1 2];
then, uniques = [1 2], [1 3];
and, numuniques = [2 1];

thanks

Please login to add a comment or rating.
Updates
18 Mar 2009

An improvement to make code more general.

18 Mar 2009

Added more thorough description and examples, in response to user comment.

18 Mar 2009

Further modifying description, correcting an error.

21 Mar 2009

Added check for arrays with large values but few elements, which are not well-suited to accumarray, at suggestion of user.

Tag Activity for this File
Tag Applied By Date/Time
unique Anthony Kendall 18 Mar 2009 11:06:14
count Anthony Kendall 18 Mar 2009 11:06:14
accumarray Anthony Kendall 22 Mar 2009 21:25:53
 

MATLAB Central Terms of Use

NOTICE: Any content you submit to MATLAB Central, including personal information, is not subject to the protections which may be afforded information collected under other sections of The MathWorks, Inc. Web site. You are entirely responsible for all content that you upload, post, e-mail, transmit or otherwise make available via MATLAB Central. The MathWorks does not control the content posted by visitors to MATLAB Central and, does not guarantee the accuracy, integrity, or quality of such content. Under no circumstances will The MathWorks be liable in any way for any content not authored by The MathWorks, or any loss or damage of any kind incurred as a result of the use of any content posted, e-mailed, transmitted or otherwise made available via MATLAB Central. Read the complete Terms prior to use.

Contact us at files@mathworks.com