File Exchange

image thumbnail

Kmeans Clustering

version 2.0 (3.31 KB) by

Super fast and terse kmeans clustering.

4.29412
23 Ratings

193 Downloads

Updated

View License

This is a super duper fast implementation of the kmeans clustering algorithm. The code is fully vectorized and extremely succinct. It is much much faster than the Matlab builtin kmeans function. The kmeans++ seeding algorithm is also included (kseeds.m) for good initialization. Therefore, this package is not only for coolness, it is indeed practical.
Please try the demo script in the package.

Detail explanation of this algorithm can be found in following blog post:
http://statinfer.wordpress.com/2011/12/12/efficient-matlab-ii-kmeans-clustering-algorithm/

This function is now a part of the PRML toolbox (http://www.mathworks.com/matlabcentral/fileexchange/55826-pattern-recognition-and-machine-learning-toolbox).

Comments and Ratings (57)

Stephane CHO

Not working on R2015a

I can see visually the clusters are separated ,but how do I get to know the members of the clusters (i.e index of the main dataset )

Mo Chen

Mo Chen (view profile)

@John, this is a new syntax of Matlab >=R2016b.

SR

SR (view profile)

Same Problem as John. Running the demo throws an error:

Error using -
Matrix dimensions must agree.

Error in kseeds (line 15)
D = min(D,sum((X-mu(:,i-1)).^2,1));

Error in kmeans_demo (line 9)
y = kmeans(X,kseeds(X,k));

Does not work in MATLAB 7.12.0 (R2011a).
This tool does: https://www.mathworks.com/matlabcentral/fileexchange/28804-k-means++

John

John (view profile)

I'm having trouble running kmeans_demo. In kmeans_demo, the line y = kmeans(X,kseeds(X,k)); calls kseed with X(2,5000) and k=3. In kseeds, the line D = min(D,sum((X-mu(:,i-1)).^2,1)); the term sum((X-mu(:,i-1)).^2 doesn't evaluated with X as a 2x5000 array and mu a 2x1 array.

Yoshi Black

Bruno Costa

Chathurika

HOW TO CREATE A DATABASE IN MATLAB?

David Inouye

There is a small bug for newer versions of MATLAB because the behavior of UNIQUE has changed so that the indices are always returned as column vectors. See demo below.

Though I'm not a guru on speed, one suggested change would be to add the following line after calls to UNIQUE:
label = label(:)';

Demo code:
clc; lb=[3 1 1 5]; szb=size(lb); [~,~,la] = unique(lb); sza=size(la);
fprintf('<<<<<< %s >>>>>>\nLabels before: %s, Labels after: %s\nSize before:%s, Size after:%s\n', version, mat2str(lb), mat2str(la), mat2str(szb), mat2str(sza));

Output on different versions of MATLAB:
<<<<<< 8.3.0.532 (R2014a) >>>>>>
Labels before: [3 1 1 5], Labels after: [2;1;1;3]
Size before:[1 4], Size after:[4 1]

<<<<<< 7.12.0.635 (R2011a) >>>>>>
Labels before: [3 1 1 5], Labels after: [2 1 1 3]
Size before:[1 4], Size after:[1 4]

selvakumar s

How to view the segmented result! pls help me!

Dipali

Dipali (view profile)

how to run k-means for a input as a any dataset file

The algorithm randomly fails, probably due to the random initialization.

For instance, the vector of 1D data X=[9.13,2.68,2.33,9.41] with k=2 is sometimes clustered (labeled) as [1 1 1 1], instead of the right values [1 2 2 1] or [2 1 1 2];

Lakshmi

Hi, i'm new to Matlab coding. Pls tell how to run this code...

Chathurika

Navjot Jassi

how should i run this program ??

John

John (view profile)

Works fine, almost always. However, besides the needed transpose mentioned by Matei, I noticed another problem. Suppose I want 9 clusters from a dataset with 100 elements. Quite often I only get 7 or 8 groups. The same thing happens with other cluster numbers, though I haven't seen it for fewer than wanting 5.

mohit rai

sent file

Brian Wang

Milka

Milka (view profile)

Hey Mate,

did you already implement the Replication Feature? (From judging your code no, but maybe I don't get the stuff correctly yet ;-))

And why the spread function can only get an input matrix X of 2 or 3 rows, while in your kmeans algorithm it can have any size in the first dimension?

Matei Tene

@Zoe: I would rather just transpose instead of reshaping, i.e. in line 14

last = label';

Bojan

Bojan (view profile)

Cuts down my calculation from 4h down to 19min, so very pleased. Had to reshape label as per suggestion by Zoe below to make it work with 2013. Otherwise very good!

rohith

rohith (view profile)

are no. of seeds equal to no.of clusters??
and i need labels for all data points.Whereas this is giving label whose dimension is equal to no. of seeds

Ramani

Ramani (view profile)

please help me in Em estimator for parameter estimation

Zoé

Zoé (view profile)

Seems to be broken in R2013a due too a column/row vector mix-up. As a quick workaround I have added

label = reshape(label, length(label), 1);

after line 15 in litekmeans.m.

Zoé

Zoé (view profile)

leila

leila (view profile)

Thank you for sharing. Does the code support 3d data?

not working ..

RS

RS (view profile)

Hi,
I am looking for an implemented kmeans clustering algorithm to segment a full body 3D CT matrix into 4 classes in Matlab. This kmeans output will then be used as input to Potts model segmentation.
Is there anyone who can help med with this or give me some suggestions.

Thanks in advance
Rachelle

Mo Chen

Mo Chen (view profile)

It is simply not for image.

does it works for rgb images?

Tim Benham

Good, fast implementation but it should be possible to pass in the cluster centers. Also arguments are not checked (try k=1) and standard help is not provided.

Mo Chen

Mo Chen (view profile)

Hi, YIMING
It is normal behavior. This line of code remove empty clusters which does not happen very often when k is small.

YIMING

YIMING (view profile)

Hey Michael,

I met problem when compiling the following sentence of your code,

[~,~,label] = unique(label);

As is known, the character '~' is not supported for representing non-used variables in recent releases of Matlab. However, if I change the '~' to some other variable names (not to be used), then this sentence seems unnecessary --- label is not changed after executing this sentence.

Any comments on this?

subrajeet

Even if we choose the cluster centers initially still there is no guarantee that we will get the same clusters every time.
There is one paperon this and a specific clustering technique to solve this problem But I am unable to code it properly. Can you have a look on this algorithm called Quality Threshold (QT) based clustering.

Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression
% data: Identification and analysis of coexpressed genes. Genome Research
% 9, 1106–1115.
%
% http://genome.cshlp.org/content/9/11/1106.full
% http://genome.cshlp.org/content/9/11/1106/F5.large.jpg

Mo Chen

Mo Chen (view profile)

Hi, surajeet,
If you want deterministic behavior of kmeans, you can use some deterministic method to initialize the clusters. For example, you can first choose the sample which is fastest from the center of the whole data set as a cluster center, then iteratively choose sample which has the largest sum of distances from the chosen centers.

However, you are not guarantee to obtain a good result on any data set.

subrajeet

Thanks for the posting Michael. I have a small query is thery any clustering technique by which the cluster results will remain same everytime we run the code.
Please kindly tell.

Babak Taati

This is a really nice function.

As Fen Xie pointed out, it sometimes returns empty clusters (esp if you set k, the desired number of clusters too high compare to the size of you data).

The solution is easy though. You only need two extra lines of code to get rid of the empty clusters:

% 1- dicsard empty clusters
centers = centers(:, unique(labels));

% 2- re-assign labels
[~,labels] = max(bsxfun(@minus,centers'*X,0.5*sum(centers.^2)'));

Hi Michael,

Thanks a lot for the nice code. It was very usefult for me.

Mo Chen

Mo Chen (view profile)

Hi Yen, thanks, I've update the file that fixed the problem.

a.y@n.m

Hmm, the site submitted an empty comment for me when I left the page. Nice.

Anyways, thank you for the highly optimized code. I can tell that a lot of thought went into minimizing the memory footprint and the number of computations (e.g., sparse matrices, coefficient placement outside of summations, computing xc-0.5c^2 instead of the full Euclidean distance x^2-2xc+c^2). I'd like to point out a bug, though -- litekmeans.m, line 9: your sum command isn't equipped to handle data sets that are 1-by-N. Change it to: 0.5*sum(center.^2,1)' to force summation over rows.

Mo Chen

Mo Chen (view profile)

Hi, Miao,
I just implement the classical kmeans. there is no improvement in the algorithm aspect. The speed gain is obtained by coding tricks which is vectorization. For the question of whether this one is suitable for large k, I guess you have to try it. You might get a result with the number of clusters less than k, because some clusters might become empty during the iterations.

a.y@n.m

miao chen

And is it suitable for problems that the number of clusters is large, e.g. k=256?

miao chen

Hi, Michael, I tried some other kmeans algorithms and found that your method indeed fast in speed.Why you say that it could be faster than other fast kmeans method(such as kd-tree and linear time algorithm), what is the computation complexity of your method, while the classical is O(nk)? Thanks a lot!

Sev Bay

Hey Michael,

Thanks for sharing.
I wonder the complexity of this thing. Do you know?

Mo Chen

Mo Chen (view profile)

The results of kmeans algorithm can be different with different initializations. Actually the kmeans function in matlab is not a standard kmeans algorithm. It tries to get smaller energy by switching data points in different clusters after the standard kmeans procedure converged.
One purpose of the litekmeans is to be simple (only 10 lines of code), therefore I did not add extra code to handle empty cluster. It just discard the cluster if the cluster becomes empty. You can modify the code yourself if you want extra functionality.

Fen Xie

this method produces empty clusters constantly, be careful dealing with these exceptions~

Fen Xie

Sorry, I have compared the results of your program and the embedded program of matlab, the two results doesn't show the same, so what does it mean??

Mo Chen

Mo Chen (view profile)

Yes, just call the litekmeans.m to get the clustering results. You cannot get a visualization in a simple way for the data whose dimensions are more than 3. The scatterd.m can only handle data of 2d or 3d.

Onur Kalabak

Thank you for the share. I have two questions. Do we have to use both of the functions to cluster? I have 13x7000 matrix which I want to cluster. Should I just simply apply the matrice to litekmeans.m? And how can I plot the result as displayed in the picture?

Thanks

Onur Kalabak

Mo Chen

Mo Chen (view profile)

To Sven:
Sorry for the inconvenience. Here's the answer to your questions.
The function takes two parameters. The first one is a d x n data matrix, of which each column is assumed to be a sample vector of d dimension. The second parameter is the number of clusters. The output is a 1 x n vector, of which each element is the label of the corresponding input sample vector.
The function handles data of arbitrary dimensions.

Sven

Sven (view profile)

This gave a simple implementation to the problem I had.
A couple of notes:
- Reduction of code is good, reduction of comments is not. "litekmeans" is without any header line or comments describing what this (very useful) function does, what kind of input it takes, what it produces etc. Things like the dimension to arrange the input points are necessary to use the function properly, but are not explained in a header. Whether or not the function handles 1D, 2D, 3D, etc data isn't written anywhere.
- Providing data as a data.mat file is useful for functions requiring custom made data, but it tends to obscure the format of the data. It would be simpler and easier to understand by providing equally simple sample code that makes the data.
- Providing brief sample code on the file exchange is fine, but it should be included as a header to the file, so that people can find it when they need it, rather than when they first download it.
Otherwise, thanks! Like I said, it solved my immediate problem.

Updates

2.0

finalize, I'm done with kmeans.

2.0

tweak and require Matlab R2016b or later

1.9

update tags

1.9

fix incompatibility issue with newer version of Matlab, due to the stupid API change of function unique()

1.9

tuning

1.7

Cleaning up

1.5

remove empty clusters according to suggestion

1.4

remove empty clusters according to suggestion

1.3

fix a bug for 1d data

1.2

update the files and description

MATLAB Release
MATLAB 9.1 (R2016b)

Download apps, toolboxes, and other File Exchange content using Add-On Explorer in MATLAB.

» Watch video