File Exchange

k-means++

version 1.7.0.0 (1.74 KB) by Laurent S

Laurent S (view profile)

Cluster multivariate data using the k-means++ algorithm.

Updated 11 Feb 2013

An efficient implementation of the k-means++ algorithm for clustering multivariate data. It has been shown that this algorithm has an upper bound for the expected value of the total intra-cluster distance which is log(k) competitive. Additionally, k-means++ usually converges in far fewer than vanilla k-means.

Shuchao He

semaa ibrahim

semaa ibrahim (view profile)

Dear Laurent
How can I get the output ???
I'm using with image

Guy Maoz

HONGCHENG LIU

Zhiying Xu

MOHIT

Min Shi

can qiu

xiaoying wang

Shamna p

Shamna p (view profile)

Dear Laurent,

Thank you for sharing the code, its works faster than kmeans. But I ma getting different results for for same data in each run. If the seed points are initialized optimally it should give same result for all runs right? I selected number of cluster using evalclusters().I compared the sse of Kmeansapp and normal Kmeans( k means using random selection of keypoints ,some times I get normal kmeans have better sse. May I know is there any way I can get the result optimally and consistently

lzl

arun mukundan

arun mukundan (view profile)

Dear Laurent,

I have one question. In you initialization of cluster centres ( in the __for i = 2:k__ loop ), why do you assign the points to their clusters?

mirko-stifler

mirko-stifler (view profile)

I have code:
nfo = imfinfo('im.png');
k = 4;

and kmeanspp??
I want to output the resulting image

for example:
X=kmeanspp(X,k);

how can I do?

thanks
I am an Italian student

asmae

asmae (view profile)

dear laurent,
when i use this algorithm several time for the same input, i get different output results ! it's this logic ?!

Qinfan

Qinfan (view profile)

Thanks! Very easy to use.

Ajit

Ajit (view profile)

Hi Laurent,

I'm having a bit of trouble understanding the bsxfun lines. Is there any documentation or explanation of why you are maximizing this and how it relates to the distance, can't quite understand where the C' comes from?

Thanks.

tsan toso

tsan toso (view profile)

Hi Laurent,

First of all thanks for the code. I actually trying to use kmeans to bin my 1D data, the C output could be used as the bin centers. However, if I want the function to output edges is there a non-trivial way to do it?

Thanks.

Laurent S

Laurent S (view profile)

@Matei Tene: dot(C,C) is actually implemented as sum(C.*conj(C)), making both are equally fast.

Matei Tene

Matei Tene (view profile)

Is dot(C,C) faster than sum(C.^2)? I see this being used in both in kmeans++ and in litekmeans.

Zoé

Zoé (view profile)

Thank you for sharing this. This implementation is faster than litekmeans.

Zoé

Zoé (view profile)

Pascal SZACHERSKI

Francisco Seixas

Francisco Seixas (view profile)

Hi there,

How can I implement the 'replicates' option of the original kmeans funtion using this?

Thanks

Laurent S

Laurent S (view profile)

@Xiaobo Li: Thank you for your comment, I updated the code so that it works with 1D datasets. The update should appear on the File Exchange shortly.

Xiaobo Li

Xiaobo Li (view profile)

Hi, Laurent,
Thank you for your files. I try your codes and it works if I carefully choose the parameter. However, I cannot choose how many clusters to apply. In my perception, the parameter k should be the number of clusters, however, unfortunately, it is not.

In my case, X is [400*1] matrix, which means that my data is 1D and there are 400 data point. If I let k=1, the code works well, with the resulting C contains 22 unique centers. However, if I let k to be other number, say 4, it is the identical 4 columns with each column the same as in the case of k=1! Is it your initial purpose to do it? If it is the case, I don't understand the purpose of parameter 'k'.

Best regards,
Xiaobo

Laurent S

Laurent S (view profile)

@Micky: Good catch :-) Yes, using the distance instead of squared distance shows consistently better results on the datasets used by the original authors. If you happen to find a better heuristic, I would be happy to hear about it of course!

Micky

Micky (view profile)

Thank you very much, Laurent.

You seem to weigh the samples by their distance from the nearest existing centeroid, rather than by the square distance, although the article seems to recommend the square distance for k-means (rather than k-medians). Is this on purpose?

Laurent S

Laurent S (view profile)

@Andreas: Thanks! To answer your question: you could try using PCA on the cluster centers to discover which clusters are the most important (although it depends on how you define 'important').

Andreas

Andreas (view profile)

Works very nicely, even on large data sets (50K samples in 128 dimensions) - thanks.

A question - suppose I have a good clustering - how can one know which dimensions are less useful than others? Most grateful for any suggestions -

SD

SD (view profile)

I am trying to get initial centriod using your kmeans++ matlab code ,but i am not able to understand the output
your function returns like for e.g. I need ten clusters for the given data . I don't know how to get this information
from [L , C].

musi

musi (view profile)

Hi thanks for it.
I am just confused how we will choose the next centroid.
Let's suppose we have chosen c1 (first centroid) uniformly at random. I calculate prob of each point with c1 and it is {p1,p2,....pn}.

What is the criteria to choose the next point? e.g. shld I choose the the one with max prob? min prob? or just with random prob (see next)

for all point i \in n
if(rand (0.0,1.0) < p(i))
c2= p(i)

Thanks

Laurent S

Laurent S (view profile)

@Renato: it could be that it is difficult to discern clusters in your dataset. Try visualizing the k-means++ initialization for a simple 2D Gaussian mixture and you should see a marked improvement over randomly selected cluster centra. For more experiments, see [D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Technical Report 2006-13, Stanford InfoLab, 2006.].

Laurent S

Laurent S (view profile)

@Cassie: I removed the dependency on randi for your convenience. The submission should be updated shortly.

Renato Amorim

Renato Amorim (view profile)

Dear Laurent,

I'm trying to make some experiments regarding the amount of time the algorithm takes to generate the initial seeds and have then extracted the below code from your submission:
(Please note my datasets are Entity x Feature, so there are some small adjustments)

L = [];
C = X(randi(size(X,1)),:);

L = ones(size(X,1),1);
for i = 2:k
D = X-C(L,:);
D = sqrt(dot(D',D'));
C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
[~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));
end

But if I compare the accuracy of KMeans (statistics toolbox) using random seeds or the above generated seeds, I get about the same (in 100 runs).
Any thoughts?
I have standardized my dataset as x_ij = (x_ij - mean(x_j))/range(x_j)
Many thanks!!

Cassie

Cassie (view profile)

Dear Laurent,
My matlab version does not support randi() function either. may I know if there is another way to replace this function? Thanks

Laurent S

Laurent S (view profile)

@Jia: the code you are referring to is not the code in this submission.

Jia

Jia (view profile)

I don't know why I run into an error like this each time:??? Error: File: litekmeans.m Line: 6 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> cluster at 3
f=litekmeans(X,3);
using
scatterd(X,y)
f=litekmeans(X,3);
scatterd(X,f)

the code broke at the line
[~,~,label] = unique(label); % remove empty clusters
could you help me with this?

faizal

faizal (view profile)

i ran this program but fail at this line:
C = X(:,randi(size(X,2)));

randi <any problem with this syntax?

Nurhasanah Ivansyah

Nurhasanah Ivansyah (view profile)

I have code like this:
clc;clear all
info = imfinfo('patient1_T1.png');
k = 4;
[L,C] = kmeans(X,k);
imagNew = reshape(X,512,512);
figure;imshow(imagNew,[]), title('image labeled by kmeans++ cluster ');

but i don't know how to display Label Every Pixel in the image using result kmeans++ and create image that segment
help me and thank...!

Laurent S

Laurent S (view profile)

@Nurhasanah Ivansyah: See http://en.wikipedia.org/wiki/Image_segmentation#Clustering_methods. One way to do image segmentation is to organize your image into (x,y,intensity) datapoints, and then to apply a clustering algorithm.

Nurhasanah Ivansyah

Nurhasanah Ivansyah (view profile)

it's excellent...but i don't know how to apply it to image(512x512). can you explain how to apply it in classification or segmentation image? help me n thanks...

 8 Feb 2013 1.7.0.0 Fixed bug with 1D datasets (thanks Xiaobo Li). 7 Aug 2011 1.6.0.0 Improved handling of overclustering (thanks Sid S) and added a screenshot. 17 May 2011 1.5.0.0 Small bugfix. 7 May 2011 1.4.0.0 Removed dependency on randi for R2008a or lower (thanks Cassie). 17 Nov 2010 1.3.0.0 Even faster, even less code and also fixed a few small bugs.
MATLAB Release Compatibility
Created with R2012b
Compatible with any release
Platform Compatibility
Windows macOS Linux