An efficient implementation of the k-means++ algorithm for clustering multivariate data. It has been shown that this algorithm has an upper bound for the expected value of the total intra-cluster distance which is log(k) competitive. Additionally, k-means++ usually converges in far fewer than vanilla k-means.
I'm having a bit of trouble understanding the bsxfun lines. Is there any documentation or explanation of why you are maximizing this and how it relates to the distance, can't quite understand where the C' comes from?
First of all thanks for the code. I actually trying to use kmeans to bin my 1D data, the C output could be used as the bin centers. However, if I want the function to output edges is there a non-trivial way to do it?
Hi, Laurent,
Thank you for your files. I try your codes and it works if I carefully choose the parameter. However, I cannot choose how many clusters to apply. In my perception, the parameter k should be the number of clusters, however, unfortunately, it is not.
In my case, X is [400*1] matrix, which means that my data is 1D and there are 400 data point. If I let k=1, the code works well, with the resulting C contains 22 unique centers. However, if I let k to be other number, say 4, it is the identical 4 columns with each column the same as in the case of k=1! Is it your initial purpose to do it? If it is the case, I don't understand the purpose of parameter 'k'.
Hope to know your answers! Anyway, thank you very much for your efforts and sharing!
@Micky: Good catch :-) Yes, using the distance instead of squared distance shows consistently better results on the datasets used by the original authors. If you happen to find a better heuristic, I would be happy to hear about it of course!
You seem to weigh the samples by their distance from the nearest existing centeroid, rather than by the square distance, although the article seems to recommend the square distance for k-means (rather than k-medians). Is this on purpose?
@Andreas: Thanks! To answer your question: you could try using PCA on the cluster centers to discover which clusters are the most important (although it depends on how you define 'important').
I am trying to get initial centriod using your kmeans++ matlab code ,but i am not able to understand the output
your function returns like for e.g. I need ten clusters for the given data . I don't know how to get this information
from [L , C].
Hi thanks for it.
I am just confused how we will choose the next centroid.
Let's suppose we have chosen c1 (first centroid) uniformly at random. I calculate prob of each point with c1 and it is {p1,p2,....pn}.
What is the criteria to choose the next point? e.g. shld I choose the the one with max prob? min prob? or just with random prob (see next)
for all point i \in n
if(rand (0.0,1.0) < p(i))
c2= p(i)
@Renato: it could be that it is difficult to discern clusters in your dataset. Try visualizing the k-means++ initialization for a simple 2D Gaussian mixture and you should see a marked improvement over randomly selected cluster centra. For more experiments, see [D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Technical Report 2006-13, Stanford InfoLab, 2006.].
I'm trying to make some experiments regarding the amount of time the algorithm takes to generate the initial seeds and have then extracted the below code from your submission:
(Please note my datasets are Entity x Feature, so there are some small adjustments)
L = [];
C = X(randi(size(X,1)),:);
L = ones(size(X,1),1);
for i = 2:k
D = X-C(L,:);
D = sqrt(dot(D',D'));
C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
[~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));
end
But if I compare the accuracy of KMeans (statistics toolbox) using random seeds or the above generated seeds, I get about the same (in 100 runs).
Any thoughts?
I have standardized my dataset as x_ij = (x_ij - mean(x_j))/range(x_j)
Many thanks!!
I don't know why I run into an error like this each time:??? Error: File: litekmeans.m Line: 6 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.
Error in ==> cluster at 3
f=litekmeans(X,3);
using
>>load data;
scatterd(X,y)
f=litekmeans(X,3);
scatterd(X,f)
the code broke at the line
[~,~,label] = unique(label); % remove empty clusters
could you help me with this?
I have code like this:
clc;clear all
info = imfinfo('patient1_T1.png');
X = im2double(imread('patient1_T1','png'));
X = imadjust(X);
k = 4;
[L,C] = kmeans(X,k);
imagNew = reshape(X,512,512);
figure;imshow(imagNew,[]), title('image labeled by kmeans++ cluster ');
but i don't know how to display Label Every Pixel in the image using result kmeans++ and create image that segment
help me and thank...!
@Nurhasanah Ivansyah: See http://en.wikipedia.org/wiki/Image_segmentation#Clustering_methods. One way to do image segmentation is to organize your image into (x,y,intensity) datapoints, and then to apply a clustering algorithm.
it's excellent...but i don't know how to apply it to image(512x512). can you explain how to apply it in classification or segmentation image? help me n thanks...
Updates
17 Nov 2010
Even faster, even less code and also fixed a few small bugs.
07 May 2011
Removed dependency on randi for R2008a or lower (thanks Cassie).
17 May 2011
Small bugfix.
07 Aug 2011
Improved handling of overclustering (thanks Sid S) and added a screenshot.