File Exchange

image thumbnail


version (1.74 KB) by Laurent S
Cluster multivariate data using the k-means++ algorithm.


Updated 11 Feb 2013

View License

An efficient implementation of the k-means++ algorithm for clustering multivariate data. It has been shown that this algorithm has an upper bound for the expected value of the total intra-cluster distance which is log(k) competitive. Additionally, k-means++ usually converges in far fewer than vanilla k-means.

Comments and Ratings (41)

Shuchao He

Dear Laurent
How can I get the output ???
I'm using with image

Guy Maoz


Zhiying Xu


MOHIT (view profile)

Min Shi

can qiu

xiaoying wang

Shamna p

Dear Laurent,

Thank you for sharing the code, its works faster than kmeans. But I ma getting different results for for same data in each run. If the seed points are initialized optimally it should give same result for all runs right? I selected number of cluster using evalclusters().I compared the sse of Kmeansapp and normal Kmeans( k means using random selection of keypoints ,some times I get normal kmeans have better sse. May I know is there any way I can get the result optimally and consistently


lzl (view profile)

arun mukundan

Dear Laurent,

I have one question. In you initialization of cluster centres ( in the __for i = 2:k__ loop ), why do you assign the points to their clusters?


I have code:
nfo = imfinfo('im.png');
X = im2double(imread('im','png'));
X = imadjust(X);
k = 4;

and kmeanspp??
I want to output the resulting image

for example:

how can I do?

I am an Italian student


asmae (view profile)

dear laurent,
when i use this algorithm several time for the same input, i get different output results ! it's this logic ?!


Qinfan (view profile)

Thanks! Very easy to use.


Ajit (view profile)

Hi Laurent,

I'm having a bit of trouble understanding the bsxfun lines. Is there any documentation or explanation of why you are maximizing this and how it relates to the distance, can't quite understand where the C' comes from?


tsan toso

Hi Laurent,

First of all thanks for the code. I actually trying to use kmeans to bin my 1D data, the C output could be used as the bin centers. However, if I want the function to output edges is there a non-trivial way to do it?


Laurent S

@Matei Tene: dot(C,C) is actually implemented as sum(C.*conj(C)), making both are equally fast.

Matei Tene

Is dot(C,C) faster than sum(C.^2)? I see this being used in both in kmeans++ and in litekmeans.


Zoé (view profile)

Thank you for sharing this. This implementation is faster than litekmeans.


Zoé (view profile)

Hi there,

How can I implement the 'replicates' option of the original kmeans funtion using this?


Laurent S

@Xiaobo Li: Thank you for your comment, I updated the code so that it works with 1D datasets. The update should appear on the File Exchange shortly.

Xiaobo Li

Hi, Laurent,
Thank you for your files. I try your codes and it works if I carefully choose the parameter. However, I cannot choose how many clusters to apply. In my perception, the parameter k should be the number of clusters, however, unfortunately, it is not.

In my case, X is [400*1] matrix, which means that my data is 1D and there are 400 data point. If I let k=1, the code works well, with the resulting C contains 22 unique centers. However, if I let k to be other number, say 4, it is the identical 4 columns with each column the same as in the case of k=1! Is it your initial purpose to do it? If it is the case, I don't understand the purpose of parameter 'k'.

Hope to know your answers! Anyway, thank you very much for your efforts and sharing!

Best regards,

Laurent S

@Micky: Good catch :-) Yes, using the distance instead of squared distance shows consistently better results on the datasets used by the original authors. If you happen to find a better heuristic, I would be happy to hear about it of course!


Micky (view profile)

Thank you very much, Laurent.

You seem to weigh the samples by their distance from the nearest existing centeroid, rather than by the square distance, although the article seems to recommend the square distance for k-means (rather than k-medians). Is this on purpose?

Laurent S

@Andreas: Thanks! To answer your question: you could try using PCA on the cluster centers to discover which clusters are the most important (although it depends on how you define 'important').


Works very nicely, even on large data sets (50K samples in 128 dimensions) - thanks.

A question - suppose I have a good clustering - how can one know which dimensions are less useful than others? Most grateful for any suggestions -


SD (view profile)

I am trying to get initial centriod using your kmeans++ matlab code ,but i am not able to understand the output
your function returns like for e.g. I need ten clusters for the given data . I don't know how to get this information
from [L , C].


musi (view profile)

Hi thanks for it.
I am just confused how we will choose the next centroid.
Let's suppose we have chosen c1 (first centroid) uniformly at random. I calculate prob of each point with c1 and it is {p1,p2,}.

What is the criteria to choose the next point? e.g. shld I choose the the one with max prob? min prob? or just with random prob (see next)

for all point i \in n
if(rand (0.0,1.0) < p(i))
c2= p(i)


Laurent S

@Renato: it could be that it is difficult to discern clusters in your dataset. Try visualizing the k-means++ initialization for a simple 2D Gaussian mixture and you should see a marked improvement over randomly selected cluster centra. For more experiments, see [D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Technical Report 2006-13, Stanford InfoLab, 2006.].

Laurent S

@Cassie: I removed the dependency on randi for your convenience. The submission should be updated shortly.

Dear Laurent,

I'm trying to make some experiments regarding the amount of time the algorithm takes to generate the initial seeds and have then extracted the below code from your submission:
(Please note my datasets are Entity x Feature, so there are some small adjustments)

L = [];
C = X(randi(size(X,1)),:);

L = ones(size(X,1),1);
for i = 2:k
D = X-C(L,:);
D = sqrt(dot(D',D'));
C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
[~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));

But if I compare the accuracy of KMeans (statistics toolbox) using random seeds or the above generated seeds, I get about the same (in 100 runs).
Any thoughts?
I have standardized my dataset as x_ij = (x_ij - mean(x_j))/range(x_j)
Many thanks!!


Cassie (view profile)

Dear Laurent,
My matlab version does not support randi() function either. may I know if there is another way to replace this function? Thanks

Laurent S

@Jia: the code you are referring to is not the code in this submission.


Jia (view profile)

I don't know why I run into an error like this each time:??? Error: File: litekmeans.m Line: 6 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> cluster at 3
>>load data;

the code broke at the line
[~,~,label] = unique(label); % remove empty clusters
could you help me with this?


faizal (view profile)

i ran this program but fail at this line:
C = X(:,randi(size(X,2)));

randi <any problem with this syntax?

I have code like this:
clc;clear all
info = imfinfo('patient1_T1.png');
X = im2double(imread('patient1_T1','png'));
X = imadjust(X);
k = 4;
[L,C] = kmeans(X,k);
imagNew = reshape(X,512,512);
figure;imshow(imagNew,[]), title('image labeled by kmeans++ cluster ');

but i don't know how to display Label Every Pixel in the image using result kmeans++ and create image that segment
help me and thank...!

Laurent S

@Nurhasanah Ivansyah: See One way to do image segmentation is to organize your image into (x,y,intensity) datapoints, and then to apply a clustering algorithm.

it's excellent...but i don't know how to apply it to image(512x512). can you explain how to apply it in classification or segmentation image? help me n thanks...


Fixed bug with 1D datasets (thanks Xiaobo Li).

Improved handling of overclustering (thanks Sid S) and added a screenshot.

Small bugfix.

Removed dependency on randi for R2008a or lower (thanks Cassie).

Even faster, even less code and also fixed a few small bugs.

MATLAB Release Compatibility
Created with R2012b
Compatible with any release
Platform Compatibility
Windows macOS Linux

Inspired by: Kmeans Clustering

Inspired: kmeans_varpar(X,k), Sparsified K-Means

Discover Live Editor

Create scripts with code, output, and formatted text in a single executable document.

Learn About Live Editor