Code covered by the BSD License  


4.9 | 7 ratings · 126 Downloads (last 30 days) · File Size: 1.74 KB · File ID: #28804

k-means++

by Laurent S

23 Sep 2010 (Updated 11 Feb 2013)

Cluster multivariate data using the k-means++ algorithm.


File Information
Description

An efficient implementation of the k-means++ algorithm for clustering multivariate data. It has been shown that the expected value of the total intra-cluster distance achieved by this algorithm is O(log k)-competitive with the optimal clustering. Additionally, k-means++ usually converges in far fewer iterations than vanilla k-means.

Acknowledgements

Kmeans Clustering inspired this file.

MATLAB release MATLAB 8.0 (R2012b)
Comments and Ratings (29)
27 Jan 2014 mirko-stifler

I have this code:

info = imfinfo('im.png');
X = im2double(imread('im','png'));
X = imadjust(X);
k = 4;

How do I use kmeanspp with it? I want to output the resulting segmented image, for example:

X = kmeanspp(X,k);

How can I do this?

Thanks. I am an Italian student.

10 Jan 2014 asmae

Dear Laurent,
When I use this algorithm several times on the same input, I get different results. Is this expected?

13 Dec 2013 Qinfan

Thanks! Very easy to use.

28 Nov 2013 Ajit

Hi Laurent,

I'm having a bit of trouble understanding the bsxfun lines. Is there any documentation or explanation of why you maximize this expression and how it relates to the distance? I can't quite understand where the C' comes from.

Thanks.

22 Oct 2013 tsan toso

Hi Laurent,

First of all, thanks for the code. I am actually trying to use kmeans to bin my 1D data; the C output could be used as the bin centers. However, if I want the function to output bin edges, is there a straightforward way to do it?

Thanks.

15 Sep 2013 Laurent S

@Matei Tene: dot(C,C) is actually implemented as sum(C.*conj(C)), so both are equally fast.

14 Sep 2013 Matei Tene

Is dot(C,C) faster than sum(C.^2)? I see both being used, in kmeans++ and in litekmeans.

25 Jul 2013 Zoé

Thank you for sharing this. This implementation is faster than litekmeans.

20 Jun 2013 Francisco Seixas

Hi there,

How can I implement the 'replicates' option of the original kmeans function using this?

Thanks

08 Feb 2013 Laurent S

@Xiaobo Li: Thank you for your comment, I updated the code so that it works with 1D datasets. The update should appear on the File Exchange shortly.

08 Feb 2013 Xiaobo Li

Hi, Laurent,
Thank you for your files. I tried your code and it works if I carefully choose the parameters. However, I cannot choose how many clusters to apply. As I understand it, the parameter k should be the number of clusters; unfortunately, it is not.

In my case, X is a [400*1] matrix, which means that my data is 1D and there are 400 data points. If I let k=1, the code works well, with the resulting C containing 22 unique centers. However, if I set k to another number, say 4, C consists of 4 identical columns, each the same as in the case of k=1! Is this intentional? If so, I don't understand the purpose of the parameter 'k'.

Hope to hear your answers! Anyway, thank you very much for your efforts and for sharing!

Best regards,
Xiaobo

02 Jan 2012 Laurent S

@Micky: Good catch :-) Yes, using the distance instead of squared distance shows consistently better results on the datasets used by the original authors. If you happen to find a better heuristic, I would be happy to hear about it of course!

28 Dec 2011 Micky

Thank you very much, Laurent.

You seem to weight the samples by their distance from the nearest existing centroid, rather than by the squared distance, although the article seems to recommend the squared distance for k-means (rather than k-medians). Is this on purpose?

21 Dec 2011 Laurent S

@Andreas: Thanks! To answer your question: you could try using PCA on the cluster centers to discover which clusters are the most important (although it depends on how you define 'important').

21 Dec 2011 Andreas

Works very nicely, even on large data sets (50K samples in 128 dimensions); thanks.

A question: suppose I have a good clustering. How can one know which dimensions are less useful than others? Most grateful for any suggestions.

07 Oct 2011 SD

I am trying to get initial centroids using your kmeans++ MATLAB code, but I am not able to understand the output your function returns. For example, if I need ten clusters for the given data, I don't know how to extract this information from [L, C].

24 Aug 2011 musi

Hi, thanks for this.
I am just confused about how to choose the next centroid.
Suppose we have chosen c1 (the first centroid) uniformly at random, and I compute a probability for each point relative to c1, say {p1, p2, ..., pn}.

What is the criterion for choosing the next point? E.g., should I choose the one with maximum probability? Minimum probability? Or accept points at random, as in:

for all points i in 1..n
    if rand(0.0, 1.0) < p(i)
        c2 = x(i)

Thanks

09 May 2011 Laurent S

@Renato: it could be that it is difficult to discern clusters in your dataset. Try visualizing the k-means++ initialization for a simple 2D Gaussian mixture and you should see a marked improvement over randomly selected cluster centers. For more experiments, see [D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Technical Report 2006-13, Stanford InfoLab, 2006].

09 May 2011 Laurent S

@Cassie: I removed the dependency on randi for your convenience. The submission should be updated shortly.

09 May 2011 Renato Amorim

Dear Laurent,

I'm running some timing experiments on how long the algorithm takes to generate the initial seeds, and have extracted the code below from your submission.
(Please note my datasets are Entity x Feature, so there are some small adjustments.)

L = ones(size(X,1),1);             % every point starts assigned to the first center
C = X(randi(size(X,1)),:);         % first center: a row of X chosen uniformly at random

for i = 2:k
    D = X - C(L,:);                                 % offset of each point from its nearest center
    D = sqrt(dot(D',D'));                           % Euclidean distance of each point (1-by-n)
    C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);  % sample the next center proportional to D
    [~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));  % reassign points to nearest center
end

But if I compare the accuracy of kmeans (Statistics Toolbox) using random seeds versus the seeds generated above, I get about the same results (over 100 runs).
Any thoughts?
I have standardized my dataset as x_ij = (x_ij - mean(x_j)) / range(x_j).
Many thanks!!

06 May 2011 Cassie

Dear Laurent,
My MATLAB version does not support the randi() function either. May I know if there is another way to replace this function? Thanks.

24 Mar 2011 Laurent S

@Jia: the code you are referring to is not the code in this submission.

24 Mar 2011 Jia

I don't know why I run into an error like this each time:

??? Error: File: litekmeans.m Line: 6 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> cluster at 3
f=litekmeans(X,3);

when running:

>> load data;
>> scatterd(X,y)
>> f=litekmeans(X,3);
>> scatterd(X,f)

The code broke at the line:

[~,~,label] = unique(label); % remove empty clusters

Could you help me with this?

23 Mar 2011 faizal

I ran this program but it fails at this line:

C = X(:,randi(size(X,2)));

Is there any problem with the randi syntax?

23 Dec 2010 Nurhasanah Ivansyah

I have code like this:

clc; clear all
info = imfinfo('patient1_T1.png');
X = im2double(imread('patient1_T1','png'));
X = imadjust(X);
k = 4;
[L,C] = kmeans(X,k);
imagNew = reshape(X,512,512);
figure; imshow(imagNew,[]), title('image labeled by kmeans++ cluster');

but I don't know how to label every pixel in the image using the kmeans++ result and produce the segmented image.
Help me, and thanks!

22 Dec 2010 Laurent S

@Nurhasanah Ivansyah: See http://en.wikipedia.org/wiki/Image_segmentation#Clustering_methods. One way to do image segmentation is to organize your image into (x,y,intensity) datapoints, and then to apply a clustering algorithm.

22 Dec 2010 Nurhasanah Ivansyah

It's excellent, but I don't know how to apply it to a 512x512 image. Can you explain how to apply it to image classification or segmentation? Help me, and thanks!

Updates
17 Nov 2010

Even faster, even less code and also fixed a few small bugs.

07 May 2011

Removed dependency on randi for R2008a or lower (thanks Cassie).

17 May 2011

Small bugfix.

07 Aug 2011

Improved handling of overclustering (thanks Sid S) and added a screenshot.

11 Feb 2013

Fixed bug with 1D datasets (thanks Xiaobo Li).
