Code covered by the BSD License

### Highlights from k-means++

4.875
4.9 | 8 ratings Rate this file 99 Downloads (last 30 days) File Size: 1.74 KB File ID: #28804 Version: 1.7

# k-means++

### Laurent S (view profile)

23 Sep 2010 (Updated )

Cluster multivariate data using the k-means++ algorithm.

File Information
Description

An efficient implementation of the k-means++ algorithm for clustering multivariate data. It has been shown that this algorithm has an upper bound for the expected value of the total intra-cluster distance which is log(k) competitive. Additionally, k-means++ usually converges in far fewer than vanilla k-means.

Acknowledgements

Kmeans Clustering inspired this file.

This file inspired Kmeans With Variance Partitioning Initialization and Sparsified K Means.

MATLAB release MATLAB 8.0 (R2012b)
14 May 2015 arun mukundan

### arun mukundan (view profile)

Dear Laurent,

I have one question. In you initialization of cluster centres ( in the __for i = 2:k__ loop ), why do you assign the points to their clusters?

27 Jan 2014 mirko-stifler

### mirko-stifler (view profile)

I have code:
nfo = imfinfo('im.png');
k = 4;

and kmeanspp??
I want to output the resulting image

for example:
X=kmeanspp(X,k);

how can I do?

thanks
I am an Italian student

Comment only
10 Jan 2014 asmae

### asmae (view profile)

dear laurent,
when i use this algorithm several time for the same input, i get different output results ! it's this logic ?!

Comment only
13 Dec 2013 Qinfan

### Qinfan (view profile)

Thanks! Very easy to use.

28 Nov 2013 Ajit

### Ajit (view profile)

Hi Laurent,

I'm having a bit of trouble understanding the bsxfun lines. Is there any documentation or explanation of why you are maximizing this and how it relates to the distance, can't quite understand where the C' comes from?

Thanks.

Comment only
22 Oct 2013 tsan toso

### tsan toso (view profile)

Hi Laurent,

First of all thanks for the code. I actually trying to use kmeans to bin my 1D data, the C output could be used as the bin centers. However, if I want the function to output edges is there a non-trivial way to do it?

Thanks.

Comment only
15 Sep 2013 Laurent S

### Laurent S (view profile)

@Matei Tene: dot(C,C) is actually implemented as sum(C.*conj(C)), making both are equally fast.

Comment only
14 Sep 2013 Matei Tene

### Matei Tene (view profile)

Is dot(C,C) faster than sum(C.^2)? I see this being used in both in kmeans++ and in litekmeans.

25 Jul 2013 Zoé

### Zoé (view profile)

Thank you for sharing this. This implementation is faster than litekmeans.

Comment only
25 Jul 2013 Zoé

### Zoé (view profile)

11 Jul 2013 Pascal SZACHERSKI

### Pascal SZACHERSKI (view profile)

20 Jun 2013 Francisco Seixas

### Francisco Seixas (view profile)

Hi there,

How can I implement the 'replicates' option of the original kmeans funtion using this?

Thanks

Comment only
08 Feb 2013 Laurent S

### Laurent S (view profile)

@Xiaobo Li: Thank you for your comment, I updated the code so that it works with 1D datasets. The update should appear on the File Exchange shortly.

Comment only
08 Feb 2013 Xiaobo Li

### Xiaobo Li (view profile)

Hi, Laurent,
Thank you for your files. I try your codes and it works if I carefully choose the parameter. However, I cannot choose how many clusters to apply. In my perception, the parameter k should be the number of clusters, however, unfortunately, it is not.

In my case, X is [400*1] matrix, which means that my data is 1D and there are 400 data point. If I let k=1, the code works well, with the resulting C contains 22 unique centers. However, if I let k to be other number, say 4, it is the identical 4 columns with each column the same as in the case of k=1! Is it your initial purpose to do it? If it is the case, I don't understand the purpose of parameter 'k'.

Best regards,
Xiaobo

Comment only
02 Jan 2012 Laurent S

### Laurent S (view profile)

@Micky: Good catch :-) Yes, using the distance instead of squared distance shows consistently better results on the datasets used by the original authors. If you happen to find a better heuristic, I would be happy to hear about it of course!

Comment only
28 Dec 2011 Micky

### Micky (view profile)

Thank you very much, Laurent.

You seem to weigh the samples by their distance from the nearest existing centeroid, rather than by the square distance, although the article seems to recommend the square distance for k-means (rather than k-medians). Is this on purpose?

Comment only
21 Dec 2011 Laurent S

### Laurent S (view profile)

@Andreas: Thanks! To answer your question: you could try using PCA on the cluster centers to discover which clusters are the most important (although it depends on how you define 'important').

Comment only
21 Dec 2011 Andreas

### Andreas (view profile)

Works very nicely, even on large data sets (50K samples in 128 dimensions) - thanks.

A question - suppose I have a good clustering - how can one know which dimensions are less useful than others? Most grateful for any suggestions -

Comment only
07 Oct 2011 SD

### SD (view profile)

I am trying to get initial centriod using your kmeans++ matlab code ,but i am not able to understand the output
your function returns like for e.g. I need ten clusters for the given data . I don't know how to get this information
from [L , C].

Comment only
24 Aug 2011 musi

### musi (view profile)

Hi thanks for it.
I am just confused how we will choose the next centroid.
Let's suppose we have chosen c1 (first centroid) uniformly at random. I calculate prob of each point with c1 and it is {p1,p2,....pn}.

What is the criteria to choose the next point? e.g. shld I choose the the one with max prob? min prob? or just with random prob (see next)

for all point i \in n
if(rand (0.0,1.0) < p(i))
c2= p(i)

Thanks

Comment only
09 May 2011 Laurent S

### Laurent S (view profile)

@Renato: it could be that it is difficult to discern clusters in your dataset. Try visualizing the k-means++ initialization for a simple 2D Gaussian mixture and you should see a marked improvement over randomly selected cluster centra. For more experiments, see [D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Technical Report 2006-13, Stanford InfoLab, 2006.].

Comment only
09 May 2011 Laurent S

### Laurent S (view profile)

@Cassie: I removed the dependency on randi for your convenience. The submission should be updated shortly.

Comment only
09 May 2011 Renato Amorim

### Renato Amorim (view profile)

Dear Laurent,

I'm trying to make some experiments regarding the amount of time the algorithm takes to generate the initial seeds and have then extracted the below code from your submission:
(Please note my datasets are Entity x Feature, so there are some small adjustments)

L = [];
C = X(randi(size(X,1)),:);

L = ones(size(X,1),1);
for i = 2:k
D = X-C(L,:);
D = sqrt(dot(D',D'));
C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
[~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));
end

But if I compare the accuracy of KMeans (statistics toolbox) using random seeds or the above generated seeds, I get about the same (in 100 runs).
Any thoughts?
I have standardized my dataset as x_ij = (x_ij - mean(x_j))/range(x_j)
Many thanks!!

06 May 2011 Cassie

### Cassie (view profile)

Dear Laurent,
My matlab version does not support randi() function either. may I know if there is another way to replace this function? Thanks

Comment only
24 Mar 2011 Laurent S

### Laurent S (view profile)

@Jia: the code you are referring to is not the code in this submission.

Comment only
24 Mar 2011 Jia

### Jia (view profile)

I don't know why I run into an error like this each time:??? Error: File: litekmeans.m Line: 6 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> cluster at 3
f=litekmeans(X,3);
using
scatterd(X,y)
f=litekmeans(X,3);
scatterd(X,f)

the code broke at the line
[~,~,label] = unique(label); % remove empty clusters
could you help me with this?

23 Mar 2011 faizal

### faizal (view profile)

i ran this program but fail at this line:
C = X(:,randi(size(X,2)));

randi <any problem with this syntax?

Comment only
23 Dec 2010 Nurhasanah Ivansyah

### Nurhasanah Ivansyah (view profile)

I have code like this:
clc;clear all
info = imfinfo('patient1_T1.png');
k = 4;
[L,C] = kmeans(X,k);
imagNew = reshape(X,512,512);
figure;imshow(imagNew,[]), title('image labeled by kmeans++ cluster ');

but i don't know how to display Label Every Pixel in the image using result kmeans++ and create image that segment
help me and thank...!

Comment only
22 Dec 2010 Laurent S

### Laurent S (view profile)

@Nurhasanah Ivansyah: See http://en.wikipedia.org/wiki/Image_segmentation#Clustering_methods. One way to do image segmentation is to organize your image into (x,y,intensity) datapoints, and then to apply a clustering algorithm.

Comment only
22 Dec 2010 Nurhasanah Ivansyah

### Nurhasanah Ivansyah (view profile)

it's excellent...but i don't know how to apply it to image(512x512). can you explain how to apply it in classification or segmentation image? help me n thanks...

17 Nov 2010 1.3

Even faster, even less code and also fixed a few small bugs.

07 May 2011 1.4

Removed dependency on randi for R2008a or lower (thanks Cassie).

17 May 2011 1.5

Small bugfix.

07 Aug 2011 1.6

Improved handling of overclustering (thanks Sid S) and added a screenshot.

11 Feb 2013 1.7

Fixed bug with 1D datasets (thanks Xiaobo Li).