Code covered by the BSD License  


4.7 | 17 ratings | 303 downloads (last 30 days) | File size: 33.5 KB | File ID: #24616

kmeans clustering

by Mo Chen

01 Jul 2009 (Updated )

Fully vectorized kmeans algorithm. Fast yet simple (10 lines)


File Information
Description

This is a very fast implementation of the original kmeans clustering algorithm, without any fancy acceleration techniques such as kd-tree indexing or the triangle inequality. (It is, as far as I can tell, the fastest MATLAB implementation available.)

The code is as vectorized as possible, yet very compact (only 10 lines). It is 10–100 times faster than the built-in kmeans function in MATLAB.

The package also includes a function for plotting the data with labels.

Sample code:
>> load data;spread(x,y)
>> label = litekmeans(x,3);spread(x,label)

A detailed explanation of this algorithm can be found in the following blog post:
http://statinfer.wordpress.com/2011/12/12/efficient-matlab-ii-kmeans-clustering-algorithm/
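
To give a feel for the vectorization, here is a minimal sketch of a fully vectorized kmeans iteration in the same spirit (a sketch only, not the exact contents of litekmeans.m; the function name kmeans_sketch is made up for illustration):

function label = kmeans_sketch(X, k)
% X: d-by-n data matrix, one sample per column; k: number of clusters
% label: 1-by-n vector of cluster assignments
n = size(X,2);
label = ceil(k*rand(1,n));                   % random initial assignment
last = zeros(1,n);
while any(label ~= last)
    [u,~,label] = unique(label);             % relabel to 1..k', dropping empty clusters
    label = label(:)';                       % force a row vector (unique may return a column)
    E = sparse(1:n, label, 1, n, numel(u), n);                      % n-by-k' indicator matrix
    center = X*(E*spdiags(1./sum(E,1)', 0, numel(u), numel(u)));    % d-by-k' cluster means
    last = label;
    % nearest-center assignment: argmin ||x-c||^2 = argmax (c'*x - 0.5*||c||^2)
    [~, label] = max(bsxfun(@minus, center'*X, 0.5*sum(center.^2,1)'), [], 1);
end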

Acknowledgements

This file inspired Kmeans and K Means++.

MATLAB release: MATLAB 7.11 (R2010b)
Comments and Ratings (43)
26 Jun 2014 Antoni J. Canós

The algorithm randomly fails, probably due to the random initialization.

For instance, the 1D data vector X=[9.13,2.68,2.33,9.41] with k=2 is sometimes clustered (labeled) as [1 1 1 1] instead of the correct values [1 2 2 1] or [2 1 1 2].
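
One simple workaround (a sketch only, not part of the package; it assumes the litekmeans interface described above, with a d x n data matrix, and the helper names nrep and bestlabel are made up) is to rerun the clustering with several random initializations and keep the labeling with the smallest within-cluster sum of squares:

X = [9.13 2.68 2.33 9.41];                 % the 1-D example above, as a 1 x 4 matrix
k = 2;  nrep = 10;  best = inf;
for r = 1:nrep
    label = litekmeans(X, k);
    cost = 0;
    for j = unique(label(:))'              % loop over the clusters actually returned
        d = bsxfun(@minus, X(:,label==j), mean(X(:,label==j), 2));
        cost = cost + sum(d(:).^2);        % within-cluster sum of squares
    end
    if cost < best, best = cost; bestlabel = label; end
end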

19 May 2014 Lakshmi

Hi, I'm new to MATLAB coding. Please tell me how to run this code.

07 May 2014 Chathurika  
14 Mar 2014 Navjot Jassi

How should I run this program?

05 Dec 2013 John

Works fine, almost always. However, besides the needed transpose mentioned by Matei, I noticed another problem. Suppose I want 9 clusters from a dataset with 100 elements. Quite often I only get 7 or 8 groups. The same thing happens with other cluster numbers, though I haven't seen it when asking for fewer than 5.
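
A simple workaround (a sketch only, using nothing but the published interface; X is assumed to be the d x n data matrix, and the 20-attempt limit is arbitrary) is to redraw the random initialization until all k clusters come back non-empty:

k = 9;
for tries = 1:20                          % give up after a few attempts
    label = litekmeans(X, k);
    if numel(unique(label)) == k, break; end
end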

30 Nov 2013 mohit rai

sent file

23 Oct 2013 Brian Wang  
16 Oct 2013 Milka

Hey Mate,

Did you already implement the replication feature? (Judging from your code, no, but maybe I don't understand it correctly yet ;-))

And why can the spread function only take an input matrix X with 2 or 3 rows, while your kmeans algorithm allows any size in the first dimension?

14 Sep 2013 Matei Tene

@Zoe: I would rather just transpose instead of reshaping, i.e. in line 14

last = label';

06 Aug 2013 Bojan

Cuts my calculation down from 4 h to 19 min, so I'm very pleased. I had to reshape label as per the suggestion by Zoé below to make it work with 2013. Otherwise very good!

28 Jun 2013 rohith

Is the number of seeds equal to the number of clusters?
I need labels for all data points, whereas this gives a label vector whose dimension equals the number of seeds.

04 May 2013 Ramani

Please help me with the EM estimator for parameter estimation.

03 May 2013 Zoé

Seems to be broken in R2013a due to a column/row vector mix-up. As a quick workaround I have added

label = reshape(label, length(label), 1);

after line 15 in litekmeans.m.

03 May 2013 Zoé  
03 Jul 2012 leila

Thank you for sharing. Does the code support 3d data?

16 May 2012 Abdullah Alomari

not working ..

25 Apr 2012 RS

Hi,
I am looking for an implemented kmeans clustering algorithm to segment a full body 3D CT matrix into 4 classes in Matlab. This kmeans output will then be used as input to Potts model segmentation.
Is there anyone who can help med with this or give me some suggestions.

Thanks in advance
Rachelle

20 Mar 2012 Mo Chen

It is simply not meant for images.

19 Mar 2012 Angie Orozco Sanchez

Does it work for RGB images?

01 Jan 2012 Jonathan C. Lansey  
06 Apr 2011 Tim Benham

Good, fast implementation but it should be possible to pass in the cluster centers. Also arguments are not checked (try k=1) and standard help is not provided.

12 Mar 2011 Mo Chen

Hi YIMING,
This is normal behavior. That line of code removes empty clusters, which does not happen very often when k is small.
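
A small illustration of what that line does (a sketch only): when some cluster disappears, the remaining labels are remapped onto 1..k'; when no cluster is empty, the labels are left unchanged, which is why the line can look like a no-op.

label = [1 1 3 3 4];          % cluster 2 has become empty
[~,~,label] = unique(label);  % label is now 1 1 2 2 3 (possibly as a column in newer releases)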

08 Jan 2011 YIMING

Hey Michael,

I ran into a problem when running the following line of your code:

[~,~,label] = unique(label);

As far as I know, the character '~' is not supported for discarding unused outputs in older releases of MATLAB. However, if I change the '~' to some other (unused) variable names, then this line seems unnecessary: label is not changed after executing it.

Any comments on this?

25 Oct 2010 subrajeet

Even if we choose the cluster centers initially, there is still no guarantee that we will get the same clusters every time.
There is a paper on this and a specific clustering technique to solve the problem, but I am unable to code it properly. Could you have a look at this algorithm, called Quality Threshold (QT) clustering?

Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115.
http://genome.cshlp.org/content/9/11/1106.full
http://genome.cshlp.org/content/9/11/1106/F5.large.jpg

24 Oct 2010 Mo Chen

Hi subrajeet,
If you want deterministic behavior from kmeans, you can use a deterministic method to initialize the clusters. For example, you can first choose the sample that is farthest from the center of the whole data set as a cluster center, then iteratively choose the sample that has the largest sum of distances from the already chosen centers.

However, you are not guaranteed to obtain a good result on every data set.
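
A sketch of that seeding strategy (illustration only; the variable names are made up, X is assumed to be the d x n data matrix and k the number of clusters):

m = mean(X, 2);                                   % center of the whole data set
[~, idx] = max(sum(bsxfun(@minus, X, m).^2, 1));  % farthest sample from the center
seeds = idx;
for j = 2:k
    D = zeros(1, size(X,2));
    for s = seeds                                 % sum of distances to chosen seeds
        D = D + sqrt(sum(bsxfun(@minus, X, X(:,s)).^2, 1));
    end
    D(seeds) = -inf;                              % never pick the same sample twice
    [~, idx] = max(D);
    seeds(end+1) = idx;                           %#ok<AGROW>
end
center = X(:, seeds);                             % deterministic initial centers

Note that litekmeans as posted does not accept initial centers, so using these seeds requires a small modification (e.g., initializing label by assigning each sample to its nearest seed instead of using ceil(k*rand(1,n))).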

24 Oct 2010 subrajeet

Thanks for the posting, Michael. I have a small query: is there any clustering technique for which the cluster results remain the same every time we run the code?
Please kindly tell me.

26 Jul 2010 Babak Taati

This is a really nice function.

As Fen Xie pointed out, it sometimes returns empty clusters (especially if you set k, the desired number of clusters, too high compared to the size of your data).

The solution is easy though. You only need two extra lines of code to get rid of the empty clusters:

% 1- discard empty clusters
centers = centers(:, unique(labels));

% 2- re-assign labels
[~,labels] = max(bsxfun(@minus,centers'*X,0.5*sum(centers.^2)'));

19 Jul 2010 Ehsan Ardestani

Hi Michael,

Thanks a lot for the nice code. It was very useful for me.

20 Mar 2010 Mo Chen

Hi Yen, thanks. I've updated the file to fix the problem.

20 Mar 2010 Anton Yen

Hmm, the site submitted an empty comment for me when I left the page. Nice.

Anyways, thank you for the highly optimized code. I can tell that a lot of thought went into minimizing the memory footprint and the number of computations (e.g., sparse matrices, coefficient placement outside of summations, computing xc-0.5c^2 instead of the full Euclidean distance x^2-2xc+c^2). I'd like to point out a bug, though -- litekmeans.m, line 9: your sum command isn't equipped to handle data sets that are 1-by-N. Change it to: 0.5*sum(center.^2,1)' to force summation over rows.
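
For readers wondering why that reduced form is valid: for a fixed sample x, argmin_c ||x-c||^2 = argmin_c (||x||^2 - 2*c'*x + ||c||^2) = argmax_c (c'*x - 0.5*||c||^2), because ||x||^2 is the same for every candidate center. A quick numerical check (a sketch only, on random data):

X = rand(5, 100);  center = rand(5, 3);          % 5-D data, 3 candidate centers
% full squared Euclidean distances, 3 x 100
D = bsxfun(@plus, sum(center.^2,1)', sum(X.^2,1)) - 2*(center'*X);
[~, full_label] = min(D, [], 1);
% reduced form used above
[~, lite_label] = max(bsxfun(@minus, center'*X, 0.5*sum(center.^2,1)'), [], 1);
isequal(full_label, lite_label)                  % true (up to ties, which have probability zero here)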

20 Mar 2010 Mo Chen

Hi, Miao,
I just implemented the classical kmeans; there is no improvement at the algorithmic level. The speed gain comes from a coding trick, namely vectorization. As for whether it is suitable for large k, I guess you have to try it. You might get a result with fewer than k clusters, because some clusters may become empty during the iterations.

20 Mar 2010 Anton Yen  
12 Mar 2010 miao chen

And is it suitable for problems where the number of clusters is large, e.g. k=256?

11 Mar 2010 miao chen

Hi, Michael, I tried some other kmeans algorithms and found that your method is indeed fast. Why do you say it could be faster than other fast kmeans methods (such as kd-tree and linear-time algorithms)? What is the computational complexity of your method, given that the classical one is O(nk)? Thanks a lot!

14 Feb 2010 Sev Bay

Hey Michael,

Thanks for sharing.
I am wondering about the complexity of this implementation. Do you know what it is?

16 Oct 2009 Mo Chen

The results of the kmeans algorithm can differ with different initializations. Actually, the kmeans function in MATLAB is not the standard kmeans algorithm: it tries to reach a lower energy by moving data points between clusters after the standard kmeans procedure has converged.
One purpose of litekmeans is to be simple (only 10 lines of code), so I did not add extra code to handle empty clusters; a cluster is simply discarded if it becomes empty. You can modify the code yourself if you want extra functionality.

13 Oct 2009 Fen Xie

This method constantly produces empty clusters; be careful when dealing with these exceptions.

13 Oct 2009 Fen Xie

Sorry, I compared the results of your program with those of the built-in MATLAB function, and the two are not the same. What does that mean?

08 Oct 2009 Mo Chen

Yes, just call litekmeans.m to get the clustering results. There is no simple way to visualize data with more than 3 dimensions; scatterd.m can only handle 2D or 3D data.

05 Oct 2009 Onur Kalabak

Thank you for the share. I have two questions. Do we have to use both of the functions to cluster? I have a 13x7000 matrix that I want to cluster. Should I simply pass the matrix to litekmeans.m? And how can I plot the result as displayed in the picture?

Thanks

05 Oct 2009 Onur Kalabak  
10 Aug 2009 Mo Chen

To Sven:
Sorry for the inconvenience. Here's the answer to your questions.
The function takes two parameters. The first is a d x n data matrix, each column of which is assumed to be a d-dimensional sample vector. The second parameter is the number of clusters. The output is a 1 x n vector whose elements are the labels of the corresponding input samples.
The function handles data of arbitrary dimension.
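
For example (a sketch only; it assumes litekmeans.m and the plotting helper are on the path under the names used in the sample code above):

X = randn(2, 500);            % 2 x 500: five hundred 2-D samples, one per column
label = litekmeans(X, 3);     % 1 x 500 vector of cluster labels
spread(X, label);             % plot the labelled 2-D data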

06 Aug 2009 Sven

This gave a simple implementation to the problem I had.
A couple of notes:
- Reduction of code is good, reduction of comments is not. "litekmeans" has no header line or comments describing what this (very useful) function does, what kind of input it takes, what it produces, etc. Things like the dimension along which to arrange the input points are necessary to use the function properly, but are not explained in a header. Whether or not the function handles 1D, 2D, 3D, etc. data isn't written anywhere.
- Providing data as a data.mat file is useful for functions requiring custom-made data, but it tends to obscure the format of the data. It would be simpler and easier to understand if equally simple sample code that generates the data were provided.
- Providing brief sample code on the file exchange is fine, but it should be included as a header to the file, so that people can find it when they need it, rather than when they first download it.
Otherwise, thanks! Like I said, it solved my immediate problem.

Updates
01 Jul 2009

update the files and description

20 Mar 2010

fix a bug for 1d data

30 Sep 2010

remove empty clusters according to suggestion

07 Jan 2012

Cleaning up

04 Feb 2012

tuning
