Code covered by the BSD License  

4.7 | 18 ratings · 330 Downloads (last 30 days) · File Size: 33.5 KB · File ID: #24616 · Version: 1.9

kmeans clustering

by Mo Chen

 

01 Jul 2009 (Updated 04 Feb 2012)

Fully vectorized kmeans algorithm. Fast yet simple (10 lines)


File Information
Description

This is a very fast implementation of the original kmeans clustering algorithm without any fancy acceleration techniques such as kd-tree indexing or the triangle inequality (as far as I can tell, it is actually the fastest MATLAB implementation).

This code is as vectorized as possible, yet it is very compact (only 10 lines of code). It is 10 to 100 times faster than the kmeans function in MATLAB.

The package also includes a function for plotting the data with labels.

Sample code:
>> load data; spread(x,y)
>> label = litekmeans(x,3); spread(x,label)
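For the curious, the heart of a fully vectorized kmeans iteration can be sketched roughly as follows (an illustration only, assuming a d x n data matrix X and a d x k matrix of current centers; this is not necessarily the author's exact code):

n = size(X,2); k = size(center,2);
[~,label] = max(bsxfun(@minus,center'*X,0.5*sum(center.^2,1)'),[],1); % assign each point to its nearest center
E = sparse(1:n,label,1,n,k,n);                % n x k cluster indicator matrix
center = X*E*spdiags(1./sum(E,1)',0,k,k);     % recompute centers as column-wise means (assumes no empty cluster)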

A detailed explanation of this algorithm can be found in the following blog post:
http://statinfer.wordpress.com/2011/12/12/efficient-matlab-ii-kmeans-clustering-algorithm/

Acknowledgements

This file inspired Kmeans, K Means++, and Wavelet Based Image Segmentation.

MATLAB release: MATLAB 7.11 (R2010b)
Comments and Ratings (47)
03 Jun 2015 David Inouye

There is a small bug for newer versions of MATLAB because the behavior of UNIQUE has changed so that the indices are always returned as column vectors. See demo below.

Though I'm not a guru on speed, one suggested change would be to add the following line after calls to UNIQUE:
label = label(:)';

Demo code:
clc; lb=[3 1 1 5]; szb=size(lb); [~,~,la] = unique(lb); sza=size(la);
fprintf('<<<<<< %s >>>>>>\nLabels before: %s, Labels after: %s\nSize before:%s, Size after:%s\n', version, mat2str(lb), mat2str(la), mat2str(szb), mat2str(sza));

Output on different versions of MATLAB:
<<<<<< 8.3.0.532 (R2014a) >>>>>>
Labels before: [3 1 1 5], Labels after: [2;1;1;3]
Size before:[1 4], Size after:[4 1]

<<<<<< 7.12.0.635 (R2011a) >>>>>>
Labels before: [3 1 1 5], Labels after: [2 1 1 3]
Size before:[1 4], Size after:[1 4]
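Applied to litekmeans, the suggested fix would presumably go right after the UNIQUE call (hypothetical placement; the exact line depends on the version of the file):

[~,~,label] = unique(label);   % relabel clusters to consecutive integers
label = label(:)';             % force a row vector regardless of the MATLAB version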

16 Mar 2015 selvakumar s

How can I view the segmented result? Please help me!

30 Dec 2014 Dipali


How do I run k-means with an arbitrary dataset file as input?

16 Sep 2014 Ashutosh Kumar Upadhyay  
26 Jun 2014 Antoni J. Canós

The algorithm randomly fails, probably due to the random initialization.

For instance, the vector of 1D data X = [9.13, 2.68, 2.33, 9.41] with k = 2 is sometimes clustered (labeled) as [1 1 1 1], instead of the correct values [1 2 2 1] or [2 1 1 2].
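A common workaround (not part of this submission) is to run several random restarts and keep the labeling with the smallest within-cluster sum of squares, for example:

% hypothetical multi-restart wrapper around litekmeans
X = [9.13 2.68 2.33 9.41]; k = 2; bestE = inf;
for t = 1:10
    lb = litekmeans(X,k);
    c = zeros(size(X,1),max(lb));
    for j = unique(lb(:))', c(:,j) = mean(X(:,lb==j),2); end  % cluster means
    E = sum(sum((X - c(:,lb)).^2));                           % within-cluster sum of squares
    if E < bestE, bestE = E; label = lb; end
end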

19 May 2014 Lakshmi

Hi, I'm new to MATLAB coding. Please tell me how to run this code.

07 May 2014 Chathurika  
14 Mar 2014 Navjot Jassi

How should I run this program?

05 Dec 2013 John


Works fine, almost always. However, besides the needed transpose mentioned by Matei, I noticed another problem. Suppose I want 9 clusters from a dataset with 100 elements. Quite often I only get 7 or 8 groups. The same thing happens with other cluster numbers, though I haven't seen it when asking for fewer than 5.

30 Nov 2013 mohit rai

sent file

23 Oct 2013 Brian Wang  
16 Oct 2013 Milka


Hey Mate,

Did you already implement the replication feature? (Judging from your code, no, but maybe I don't understand it correctly yet ;-))

And why can the spread function only take an input matrix X with 2 or 3 rows, while in your kmeans algorithm the first dimension can be of any size?

14 Sep 2013 Matei Tene

@Zoe: I would rather just transpose instead of reshaping, i.e. in line 14

last = label';

06 Aug 2013 Bojan


Cuts my calculation down from 4 h to 19 min, so I'm very pleased. I had to reshape label as per Zoé's suggestion below to make it work with R2013. Otherwise very good!

28 Jun 2013 rohith


Is the number of seeds equal to the number of clusters?
I need labels for all data points, whereas this is giving a label vector whose dimension equals the number of seeds.

04 May 2013 Ramani


Please help me with the EM estimator for parameter estimation.

03 May 2013 Zoé


Seems to be broken in R2013a due to a column/row vector mix-up. As a quick workaround I have added

label = reshape(label, length(label), 1);

after line 15 in litekmeans.m.

03 May 2013 Zoé


 
03 Jul 2012 leila


Thank you for sharing. Does the code support 3d data?

16 May 2012 Abdullah Alomari

Not working.

25 Apr 2012 RS


Hi,
I am looking for an implemented kmeans clustering algorithm to segment a full-body 3D CT matrix into 4 classes in MATLAB. This kmeans output will then be used as input to a Potts model segmentation.
Is there anyone who can help me with this or give me some suggestions?

Thanks in advance
Rachelle

20 Mar 2012 Mo Chen


It is simply not meant for images.

19 Mar 2012 Angie Orozco Sanchez

Does it work for RGB images?

01 Jan 2012 Jonathan C. Lansey  
06 Apr 2011 Tim Benham


Good, fast implementation but it should be possible to pass in the cluster centers. Also arguments are not checked (try k=1) and standard help is not provided.

12 Mar 2011 Mo Chen


Hi YIMING,
It is normal behavior. That line of code removes empty clusters, which do not occur very often when k is small.
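As a small illustration of what that line does (not taken from the submission itself):

label = [1 1 4 4 4 2];        % cluster 3 ended up empty
[~,~,label] = unique(label);  % relabels to consecutive integers: [1 1 3 3 3 2]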

08 Jan 2011 YIMING


Hey Michael,

I ran into a problem when running the following line of your code:

[~,~,label] = unique(label);

As is known, the character '~' is not supported for representing unused output variables in older releases of MATLAB. However, if I change the '~' to some other (unused) variable names, then this line seems unnecessary --- label is not changed after executing it.

Any comments on this?

25 Oct 2010 subrajeet

Even if we choose the cluster centers initially, there is still no guarantee that we will get the same clusters every time.
There is a paper on this and a specific clustering technique to solve the problem, but I am unable to code it properly. Can you have a look at this algorithm, called Quality Threshold (QT) clustering?

Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115.
http://genome.cshlp.org/content/9/11/1106.full
http://genome.cshlp.org/content/9/11/1106/F5.large.jpg

24 Oct 2010 Mo Chen


Hi subrajeet,
If you want deterministic behavior from kmeans, you can use a deterministic method to initialize the clusters. For example, you can first choose the sample that is farthest from the center of the whole data set as a cluster center, then iteratively choose the sample that has the largest sum of distances from the already chosen centers.

However, you are not guaranteed to obtain a good result on every data set.
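A minimal sketch of that deterministic farthest-first seeding might look like this (assuming a d x n data matrix X and a desired number of clusters k; illustration only, not part of the submission):

% pick k seeds deterministically: farthest from the data mean first, then
% greedily the sample with the largest summed squared distance to the chosen seeds
n = size(X,2);
[~,idx] = max(sum(bsxfun(@minus,X,mean(X,2)).^2,1));
seeds = idx;
for i = 2:k
    D = zeros(1,n);
    for j = seeds
        D = D + sum(bsxfun(@minus,X,X(:,j)).^2,1);  % squared distance to one chosen seed
    end
    D(seeds) = -inf;                                % never re-pick an existing seed
    [~,idx] = max(D);
    seeds = [seeds idx];
end
center = X(:,seeds);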

24 Oct 2010 subrajeet

Thanks for the posting, Michael. I have a small query: is there any clustering technique by which the cluster results will remain the same every time we run the code?
Please kindly tell.

26 Jul 2010 Babak Taati

This is a really nice function.

As Fen Xie pointed out, it sometimes returns empty clusters (especially if you set k, the desired number of clusters, too high compared to the size of your data).

The solution is easy though. You only need two extra lines of code to get rid of the empty clusters:

% 1- discard empty clusters
centers = centers(:, unique(labels));

% 2- re-assign labels
[~,labels] = max(bsxfun(@minus,centers'*X,0.5*sum(centers.^2)'));

19 Jul 2010 Ehsan Ardestani

Hi Michael,

Thanks a lot for the nice code. It was very useful for me.

20 Mar 2010 Mo Chen


Hi Yen, thanks, I've updated the file to fix the problem.

20 Mar 2010 Anton Yen

Hmm, the site submitted an empty comment for me when I left the page. Nice.

Anyway, thank you for the highly optimized code. I can tell that a lot of thought went into minimizing the memory footprint and the number of computations (e.g., sparse matrices, coefficient placement outside of summations, computing xc-0.5c^2 instead of the full squared Euclidean distance x^2-2xc+c^2). I'd like to point out a bug, though -- litekmeans.m, line 9: your sum command isn't equipped to handle data sets that are 1-by-N. Change it to 0.5*sum(center.^2,1)' to force summation over rows.
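For readers wondering why that trick is valid: since x^2 does not depend on the center, minimizing the squared distance x^2-2xc+c^2 over centers is the same as maximizing xc-0.5c^2. A quick sanity check (illustrative only, using the same bsxfun idiom as the code):

X = rand(3,5); center = rand(3,2);
[~,byTrick] = max(bsxfun(@minus,center'*X,0.5*sum(center.^2,1)'),[],1);
D = bsxfun(@plus,sum(X.^2,1),sum(center.^2,1)') - 2*(center'*X);  % full squared distances, k x n
[~,byDist] = min(D,[],1);
isequal(byTrick,byDist)   % expected: true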

20 Mar 2010 Mo Chen


Hi Miao,
I just implemented the classical kmeans; there is no improvement on the algorithmic side. The speed gain comes from a coding trick, namely vectorization. As for whether it is suitable for large k, I guess you have to try it. You might get a result with fewer clusters than k, because some clusters may become empty during the iterations.

20 Mar 2010 Anton Yen  
12 Mar 2010 miao chen

And is it suitable for problems where the number of clusters is large, e.g. k = 256?

11 Mar 2010 miao chen

Hi Michael, I tried some other kmeans algorithms and found that your method is indeed fast. Why do you say it could be faster than other fast kmeans methods (such as kd-tree and linear-time algorithms)? What is the computational complexity of your method, given that the classical one is O(nk)? Thanks a lot!

14 Feb 2010 Sev Bay

Hey Michael,

Thanks for sharing.
I wonder about the complexity of this thing. Do you know it?

16 Oct 2009 Mo Chen


The results of the kmeans algorithm can differ with different initializations. Actually, the kmeans function in MATLAB is not a standard kmeans algorithm: it tries to reach a lower energy by switching data points between clusters after the standard kmeans procedure has converged.
One purpose of litekmeans is to be simple (only 10 lines of code), so I did not add extra code to handle empty clusters. It just discards a cluster if it becomes empty. You can modify the code yourself if you want extra functionality.

13 Oct 2009 Fen Xie

This method produces empty clusters constantly; be careful dealing with these exceptions.

13 Oct 2009 Fen Xie

Sorry, I have compared the results of your program and MATLAB's built-in function, and the two results are not the same. What does that mean?

08 Oct 2009 Mo Chen


Yes, just call litekmeans.m to get the clustering results. There is no simple way to visualize data with more than 3 dimensions; scatterd.m can only handle 2D or 3D data.

05 Oct 2009 Onur Kalabak

Thank you for the share. I have two questions. Do we have to use both of the functions to cluster? I have a 13x7000 matrix which I want to cluster. Should I simply pass the matrix to litekmeans.m? And how can I plot the result as displayed in the picture?

Thanks

05 Oct 2009 Onur Kalabak  
10 Aug 2009 Mo Chen


To Sven:
Sorry for the inconvenience. Here's the answer to your questions.
The function takes two parameters. The first one is a d x n data matrix, each column of which is assumed to be a d-dimensional sample vector. The second parameter is the number of clusters. The output is a 1 x n vector, each element of which is the label of the corresponding input sample.
The function handles data of arbitrary dimensions.
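In other words, usage for a data set like the 13x7000 matrix mentioned above would presumably look like this (illustrative, based on the interface described):

X = rand(13,7000);        % 13-dimensional data, 7000 samples
label = litekmeans(X,4);  % cluster into (at most) 4 groups
size(label)               % expected: [1 7000]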

06 Aug 2009 Sven


This provided a simple solution to the problem I had.
A couple of notes:
- Reduction of code is good, reduction of comments is not. "litekmeans" is without any header line or comments describing what this (very useful) function does, what kind of input it takes, what it produces etc. Things like the dimension to arrange the input points are necessary to use the function properly, but are not explained in a header. Whether or not the function handles 1D, 2D, 3D, etc data isn't written anywhere.
- Providing data as a data.mat file is useful for functions requiring custom made data, but it tends to obscure the format of the data. It would be simpler and easier to understand by providing equally simple sample code that makes the data.
- Providing brief sample code on the file exchange is fine, but it should be included as a header to the file, so that people can find it when they need it, rather than when they first download it.
Otherwise, thanks! Like I said, it solved my immediate problem.

Updates
01 Jul 2009 1.2

update the files and description

20 Mar 2010 1.3

fix a bug for 1d data

30 Sep 2010 1.4

remove empty clusters according to suggestion

30 Sep 2010 1.5

remove empty clusters according to suggestion

07 Jan 2012 1.7

Cleaning up

04 Feb 2012 1.9

tuning
