Code covered by the BSD License  

Highlights from
k-means++

4.66667

4.7 | 3 ratings Rate this file 113 Downloads (last 30 days) File Size: 1.71 KB File ID: #28804
image thumbnail

k-means++

by Laurent S

 

23 Sep 2010 (Updated 07 Aug 2011)

Cluster multivariate data using the k-means++ algorithm.

| Watch this File

File Information
Description

An efficient implementation of the k-means++ algorithm for clustering multivariate data. It has been shown that this algorithm has an upper bound for the expected value of the total intra-cluster distance which is log(k) competitive. Additionally, k-means++ usually converges in far fewer than vanilla k-means.

Acknowledgements

The author wishes to acknowledge the following in the creation of this submission:
kmeans clustering

MATLAB release MATLAB 7.12 (2011a)
Tags for This File  
Everyone's Tags
Tags I've Applied
Add New Tags Please login to tag files.
Comments and Ratings (16)
22 Dec 2010 Nurhasanah Ivansyah

it's excellent...but i don't know how to apply it to image(512x512). can you explain how to apply it in classification or segmentation image? help me n thanks...

22 Dec 2010 Laurent S

@Nurhasanah Ivansyah: See http://en.wikipedia.org/wiki/Image_segmentation#Clustering_methods. One way to do image segmentation is to organize your image into (x,y,intensity) datapoints, and then to apply a clustering algorithm.

23 Dec 2010 Nurhasanah Ivansyah

I have code like this:
clc;clear all
info = imfinfo('patient1_T1.png');
 X = im2double(imread('patient1_T1','png'));
X = imadjust(X);
k = 4;
[L,C] = kmeans(X,k);
imagNew = reshape(X,512,512);
figure;imshow(imagNew,[]), title('image labeled by kmeans++ cluster ');

but i don't know how to display Label Every Pixel in the image using result kmeans++ and create image that segment
help me and thank...!

23 Mar 2011 faizal

i ran this program but fail at this line:
C = X(:,randi(size(X,2)));

randi <any problem with this syntax?

24 Mar 2011 Jia

I don't know why I run into an error like this each time:??? Error: File: litekmeans.m Line: 6 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> cluster at 3
f=litekmeans(X,3);
using
>>load data;
scatterd(X,y)
f=litekmeans(X,3);
scatterd(X,f)

the code broke at the line
    [~,~,label] = unique(label); % remove empty clusters
could you help me with this?

24 Mar 2011 Laurent S

@Jia: the code you are referring to is not the code in this submission.

06 May 2011 Cassie

Dear Laurent,
My matlab version does not support randi() function either. may I know if there is another way to replace this function? Thanks

09 May 2011 Renato Amorim

Dear Laurent,

I'm trying to make some experiments regarding the amount of time the algorithm takes to generate the initial seeds and have then extracted the below code from your submission:
(Please note my datasets are Entity x Feature, so there are some small adjustments)

L = [];
C = X(randi(size(X,1)),:);

L = ones(size(X,1),1);
for i = 2:k
    D = X-C(L,:);
    D = sqrt(dot(D',D'));
    C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
    [~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));
end

But if I compare the accuracy of KMeans (statistics toolbox) using random seeds or the above generated seeds, I get about the same (in 100 runs).
Any thoughts?
I have standardized my dataset as x_ij = (x_ij - mean(x_j))/range(x_j)
Many thanks!!

09 May 2011 Laurent S

@Cassie: I removed the dependency on randi for your convenience. The submission should be updated shortly.

09 May 2011 Laurent S

@Renato: it could be that it is difficult to discern clusters in your dataset. Try visualizing the k-means++ initialization for a simple 2D Gaussian mixture and you should see a marked improvement over randomly selected cluster centra. For more experiments, see [D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Technical Report 2006-13, Stanford InfoLab, 2006.].

24 Aug 2011 musi

Hi thanks for it.
I am just confused how we will choose the next centroid.
Let's suppose we have chosen c1 (first centroid) uniformly at random. I calculate prob of each point with c1 and it is {p1,p2,....pn}.

What is the criteria to choose the next point? e.g. shld I choose the the one with max prob? min prob? or just with random prob (see next)

for all point i \in n
if(rand (0.0,1.0) < p(i))
c2= p(i)

Thanks

07 Oct 2011 SD

I am trying to get initial centriod using your kmeans++ matlab code ,but i am not able to understand the output
your function returns like for e.g. I need ten clusters for the given data . I don't know how to get this information
from [L , C].

21 Dec 2011 Andreas

Works very nicely, even on large data sets (50K samples in 128 dimensions) - thanks.

A question - suppose I have a good clustering - how can one know which dimensions are less useful than others? Most grateful for any suggestions -

21 Dec 2011 Laurent S

@Andreas: Thanks! To answer your question: you could try using PCA on the cluster centers to discover which clusters are the most important (although it depends on how you define 'important').

28 Dec 2011 Micky

Thank you very much, Laurent.

You seem to weigh the samples by their distance from the nearest existing centeroid, rather than by the square distance, although the article seems to recommend the square distance for k-means (rather than k-medians). Is this on purpose?

02 Jan 2012 Laurent S

@Micky: Good catch :-) Yes, using the distance instead of squared distance shows consistently better results on the datasets used by the original authors. If you happen to find a better heuristic, I would be happy to hear about it of course!

Please login to add a comment or rating.
Updates
17 Nov 2010

Even faster, even less code and also fixed a few small bugs.

07 May 2011

Removed dependency on randi for R2008a or lower (thanks Cassie).

17 May 2011

Small bugfix.

07 Aug 2011

Improved handling of overclustering (thanks Sid S) and added a screenshot.

Tag Activity for this File
Tag Applied By Date/Time
data mining Laurent S 23 Sep 2010 09:55:07
clustering Laurent S 23 Sep 2010 09:55:07
kmeans Laurent S 23 Sep 2010 09:55:07
data mining Biju Bajracharya 28 Feb 2012 10:11:54
clustering Biju Bajracharya 28 Feb 2012 10:12:00
clustering Abhishek Kumar 28 Mar 2012 18:39:20
clustering Abhishek Kumar 28 Mar 2012 18:39:21
clustering Abhishek Kumar 28 Mar 2012 18:39:23
clustering Abhishek Kumar 28 Mar 2012 18:39:24
kmeans Abhishek Kumar 28 Mar 2012 18:39:26
kmeans Abhishek Kumar 28 Mar 2012 18:39:27
kmeans Phuong Nguyen 31 Mar 2012 23:25:26

Contact us at files@mathworks.com