idx = kmeans(X,k)
idx = kmeans(X,k,Name,Value)
[idx,C] = kmeans(___)
[idx,C,sumd] = kmeans(___)
[idx,C,sumd,D] = kmeans(___)
clustering to partition the observations of the
n-by-p data matrix
idx = kmeans(
k clusters, and returns an n-by-1 vector
idx) containing cluster indices of each observation. Rows of
X correspond to points and columns correspond to variables.
kmeans uses the squared Euclidean distance metric and the k-means++
algorithm for cluster center initialization.
Cluster data using k-means clustering, then plot the cluster regions.
Load Fisher's iris data set. Use the petal lengths and widths as predictors.
load fisheriris X = meas(:,3:4); figure; plot(X(:,1),X(:,2),'k*','MarkerSize',5); title 'Fisher''s Iris Data'; xlabel 'Petal Lengths (cm)'; ylabel 'Petal Widths (cm)';
The larger cluster seems to be split into a lower variance region and a higher variance region. This might indicate that the larger cluster is two, overlapping clusters.
Cluster the data. Specify k = 3 clusters.
rng(1); % For reproducibility [idx,C] = kmeans(X,3);
kmeans uses the k-means++ algorithm for centroid initialization and squared Euclidean distance by default. It is good practice to search for lower, local minima by setting the
'Replicates' name-value pair argument.
idx is a vector of predicted cluster indices corresponding to the observations in
C is a 3-by-2 matrix containing the final centroid locations.
kmeans to compute the distance from each centroid to points on a grid. To do this, pass the centroids (
C) and points on a grid to
kmeans, and implement one iteration of the algorithm.
x1 = min(X(:,1)):0.01:max(X(:,1)); x2 = min(X(:,2)):0.01:max(X(:,2)); [x1G,x2G] = meshgrid(x1,x2); XGrid = [x1G(:),x2G(:)]; % Defines a fine grid on the plot idx2Region = kmeans(XGrid,3,'MaxIter',1,'Start',C);
Warning: Failed to converge in 1 iterations.
% Assigns each node in the grid to the closest centroid
kmeans displays a warning stating that the algorithm did not converge, which you should expect since the software only implemented one iteration.
Plot the cluster regions.
figure; gscatter(XGrid(:,1),XGrid(:,2),idx2Region,... [0,0.75,0.75;0.75,0,0.75;0.75,0.75,0],'..'); hold on; plot(X(:,1),X(:,2),'k*','MarkerSize',5); title 'Fisher''s Iris Data'; xlabel 'Petal Lengths (cm)'; ylabel 'Petal Widths (cm)'; legend('Region 1','Region 2','Region 3','Data','Location','SouthEast'); hold off;
Randomly generate the sample data.
rng default; % For reproducibility X = [randn(100,2)*0.75+ones(100,2); randn(100,2)*0.5-ones(100,2)]; figure; plot(X(:,1),X(:,2),'.'); title 'Randomly Generated Data';
There appears to be two clusters in the data.
Partition the data into two clusters, and choose the best arrangement out of five initializations. Display the final output.
opts = statset('Display','final'); [idx,C] = kmeans(X,2,'Distance','cityblock',... 'Replicates',5,'Options',opts);
Replicate 1, 3 iterations, total sum of distances = 201.533. Replicate 2, 5 iterations, total sum of distances = 201.533. Replicate 3, 3 iterations, total sum of distances = 201.533. Replicate 4, 3 iterations, total sum of distances = 201.533. Replicate 5, 2 iterations, total sum of distances = 201.533. Best total sum of distances = 201.533
By default, the software initializes the replicates separately using k-means++.
Plot the clusters and the cluster centroids.
figure; plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12) hold on plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12) plot(C(:,1),C(:,2),'kx',... 'MarkerSize',15,'LineWidth',3) legend('Cluster 1','Cluster 2','Centroids',... 'Location','NW') title 'Cluster Assignments and Centroids' hold off
You can determine how well separated the clusters are by passing
Clustering large data sets might take time, particularly if you use online updates (set by default). If you have a Parallel Computing Toolbox ™ license and you set the options for parallel computing, then
kmeans runs each clustering task (or replicate) in parallel. And, if
Replicates>1, then parallel computing decreases time to convergence.
Randomly generate a large data set from a Gaussian mixture model.
Mu = bsxfun(@times,ones(20,30),(1:20)'); % Gaussian mixture mean rn30 = randn(30,30); Sigma = rn30'*rn30; % Symmetric and positive-definite covariance Mdl = gmdistribution(Mu,Sigma); % Define the Gaussian mixture distribution rng(1); % For reproducibility X = random(Mdl,10000);
Mdl is a 30-dimensional
gmdistribution model with 20 components.
X is a 10000-by-30 matrix of data generated from
Specify the options for parallel computing.
stream = RandStream('mlfg6331_64'); % Random number stream options = statset('UseParallel',1,'UseSubstreams',1,... 'Streams',stream);
The input argument
RandStream specifies to use the multiplicative lagged Fibonacci generator algorithm.
options is a structure array with fields that specify options for controlling estimation.
Cluster the data using k-means clustering. Specify that there are k = 20 clusters in the data and increase the number of iterations. Typically, the objective function contains local minima. Specify 10 replicates to help find a lower, local minimum.
tic; % Start stopwatch timer [idx,C,sumd,D] = kmeans(X,20,'Options',options,'MaxIter',10000,... 'Display','final','Replicates',10);
Starting parallel pool (parpool) using the 'local' profile ... connected to 6 workers. Replicate 5, 72 iterations, total sum of distances = 7.73161e+06. Replicate 1, 64 iterations, total sum of distances = 7.72988e+06. Replicate 3, 68 iterations, total sum of distances = 7.72576e+06. Replicate 4, 84 iterations, total sum of distances = 7.72696e+06. Replicate 6, 82 iterations, total sum of distances = 7.73006e+06. Replicate 7, 40 iterations, total sum of distances = 7.73451e+06. Replicate 2, 194 iterations, total sum of distances = 7.72953e+06. Replicate 9, 105 iterations, total sum of distances = 7.72064e+06. Replicate 10, 125 iterations, total sum of distances = 7.72816e+06. Replicate 8, 70 iterations, total sum of distances = 7.73188e+06. Best total sum of distances = 7.72064e+06
toc % Terminate stopwatch timer
Elapsed time is 61.915955 seconds.
The Command Window indicates that six workers are available. The number of workers might vary on your system. The Command Window displays the number of iterations and the terminal objective function value for each replicate. The output arguments contain the results of replicate 9 because it has the lowest total sum of distances.
Data, specified as a numeric matrix. The rows of
to observations, and the columns correspond to variables.
X is a numeric vector, then
it as an n-by-1 data matrix, regardless of its
k— Number of clusters
Number of clusters in the data, specified as a positive integer.
comma-separated pairs of
the argument name and
Value is the corresponding value.
Name must appear inside quotes. You can specify several name and value
pair arguments in any order as
'Distance','cosine','Replicates',10,'Options',statset('UseParallel',1)specifies the cosine distance,
10replicate clusters at different starting values, and to use parallel computing.
'Display'— Level of output to display
Level of output to display in the Command Window, specified
as the comma-separated pair consisting of
one of the following options:
'final' — Displays results
of the final iteration
'iter' — Displays results
of each iteration
'off' — Displays nothing
'Distance'— Distance metric
Distance metric, in
p-dimensional space, used for
minimization, specified as the comma-separated pair consisting of
kmeans computes centroid clusters differently for
the different, supported distance metrics. This table summarizes the
available distance metrics. In the formulae, x is an
observation (that is, a row of
c is a centroid (a row vector).
Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster.
Sum of absolute differences, i.e., the L1 distance. Each centroid is the component-wise median of the points in that cluster.
One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.
One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.
This metric is only suitable for binary data.
It is the proportion of bits that differ. Each centroid is the component-wise median of points in that cluster.
where I is the indicator function.
'EmptyAction'— Action to take if cluster loses all member observations
Action to take if a cluster loses all its member observations,
specified as the comma-separated pair consisting of
one of the following options.
Treat an empty cluster as an error.
Create a new cluster consisting of the one point furthest from its centroid (default).
'MaxIter'— Maximum number of iterations
100(default) | positive integer
Maximum number of iterations, specified as the comma-separated
pair consisting of
'MaxIter' and a positive integer.
'OnlinePhase'— Online update flag
Online update flag, specified as the comma-separated pair consisting
kmeans performs an online update phase in
addition to a batch update phase. The online phase can be time consuming
for large data sets, but guarantees a solution that is a local minimum
of the distance criterion. In other words, the software finds a partition
of the data in which moving any single point to a different cluster
increases the total sum of distances.
'Options'— Options for controlling iterative algorithm for minimizing fitting criteria
(default) | structure array returned by
Options for controlling the iterative algorithm for minimizing
the fitting criteria, specified as the comma-separated pair consisting
'Options' and a structure array returned by
statset. These options require Parallel
This table summarizes the available options.
In that case, use a cell array the same size
as the parallel pool. If a parallel pool is not
|Set to |
To ensure more predictable
parpool and explicitly
create a parallel pool before invoking
'Replicates'— Number of times to repeat clustering using new initial cluster centroid positions
1(default) | positive integer
Number of times to repeat clustering using new initial cluster
centroid positions, specified as the comma-separated pair consisting
'Replicates' and an integer.
the solution with the lowest
You can set
'Replicates' implicitly by supplying
a 3-D array as the value for the
'Start'— Method for choosing initial cluster centroid positions
'uniform'| numeric matrix | numeric array
Method for choosing initial cluster centroid positions (or seeds),
specified as the comma-separated pair consisting of
'uniform', a numeric matrix, or a numeric array.
This table summarizes the available options for choosing
|Perform a preliminary clustering phase on a
random 10% subsample of |
The software treats
NaNs as missing data,
and removes any row of
X containing at least one
Removing rows of
X reduces the sample size.
idx— Cluster indices
Cluster indices, returned as a numeric column vector.
as many rows as
X, and each row indicates the cluster
assignment of the corresponding observation.
C— Cluster centroid locations
Cluster centroid locations, returned as a numeric matrix.
k-by-p matrix, where row j is
the centroid of cluster j.
sumd— Within-cluster sums of point-to-centroid distances
Within-cluster sums of point-to-centroid distances, returned
as a numeric column vector.
sumd is a
vector, where element j is the sum of point-to-centroid
distances within cluster j.
D— Distances from each point to every centroid
Distances from each point to every centroid, returned as a numeric
D is an n-by-
where element (j,m) is the distance
from observation j to centroid m.
k-means clustering, or Lloyd’s algorithm , is an iterative, data-partitioning algorithm that assigns n observations to exactly one of k clusters defined by centroids, where k is chosen before the algorithm starts.
The algorithm proceeds as follows:
Choose k initial cluster centers
(centroid). For example, choose
k observations at random (by using
'Start','sample') or use the k-means ++ algorithm for cluster
center initialization (the default).
Compute point-to-cluster-centroid distances of all observations to each centroid.
There are two ways to proceed (specified by
Batch update — Assign each observation to the cluster with the closest centroid.
Online update — Individually assign observations to a different centroid if the reassignment decreases the sum of the within-cluster, sum-of-squares point-to-cluster-centroid distances.
For more details, see Algorithms.
Compute the average of the observations in each cluster to obtain k new centroid locations.
Repeat steps 2 through 4 until cluster assignments do not change, or the maximum number of iterations is reached.
The k-means++ algorithm uses an heuristic to find centroid seeds for k-means clustering. According to Arthur and Vassilvitskii , k-means++ improves the running time of Lloyd’s algorithm, and the quality of the final solution.
The k-means++ algorithm chooses seeds as follows, assuming the number of clusters is k.
Select an observation uniformly at random from the data set, X. The chosen observation is the first centroid, and is denoted c1.
Compute distances from each observation to c1. Denote the distance between cj and the observation m as .
Select the next centroid, c2 at random from X with probability
To choose center j:
Compute the distances from each observation to each centroid, and assign each observation to its closest centroid.
For m = 1,...,n and p = 1,...,j – 1, select centroid j at random from X with probability
where Cp is the set of all observations closest to centroid cp and xm belongs to Cp.
That is, select each subsequent center with a probability proportional to the distance from itself to the closest center that you already chose.
Repeat step 4 until k centroids are chosen.
Arthur and Vassilvitskii  demonstrate, using a simulation study for several cluster orientations, that k-means++ achieves faster convergence to a lower sum of within-cluster, sum-of-squares point-to-cluster-centroid distances than Lloyd’s algorithm.
kmeans uses a two-phase iterative
algorithm to minimize the sum of point-to-centroid distances, summed
This first phase uses batch updates, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. This phase occasionally does not converge to solution that is a local minimum. That is, a partition of the data where moving any single point to a different cluster increases the total sum of distances. This is more likely for small data sets. The batch phase is fast, but potentially only approximates a solution as a starting point for the second phase.
This second phase uses online updates, where points are individually reassigned if doing so reduces the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this phase consists of one pass though all the points. This phase converges to a local minimum, although there might be other local minima with lower total sum of distances. In general, finding the global minimum is solved by an exhaustive choice of starting points, but using several replicates with random starting points typically results in a solution that is a global minimum.
If you enable the
1, then each worker selects seeds and clusters in parallel.
 Arthur, David, and Sergi Vassilvitskii. “K-means++: The Advantages of Careful Seeding.” SODA ‘07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.
 Lloyd, Stuart P. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory. Vol. 28, 1982, pp. 129–137.
 Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.
 Spath, H. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt. New York: Halsted Press, 1985.
This function supports tall arrays for out-of-memory data with some limitations.
Only random sample initialization is supported. Supported syntaxes:
idx = kmeans(X,k) performs classic
[idx,C] = kmeans(X,k) also returns
k cluster centroid locations.
[idx,C,sumd] = kmeans(X,k) additionally
k within-cluster sums of point-to-centroid
[___] = kmeans(___,Name,Value) specifies
additional name-value pair options using any of the other syntaxes.
Valid options are:
'Start' — Method used to
choose the initial cluster centroid positions. Value can be:
'plus' (default) — Select
X using a variant of the kmeans++ algorithm
adapted for tall data.
'sample' — Select
X at random.
Numeric matrix — A k-by-p matrix to explicitly specify starting locations.
'Options' — An options structure
created using the
statset function. For tall
kmeans uses the fields listed here and
ignores all other fields in the options structure:
'Display' — Level of display.
'MaxIter' — Maximum number
of iterations. Default is
'TolFun' — Convergency tolerance
for the within-cluster sums of point-to-centroid distances. Default
1e-4. This option field only works with tall
For more information, see Tall Arrays (MATLAB).
Usage notes and limitations:
Start method uses random
selections, the initial centroid cluster positions might not match MATLAB®.
If the number of rows in
X is fixed,
code generation does not remove rows of
The cluster centroid locations in
have a different order than in MATLAB. In this case, the cluster
idx have corresponding differences.
If you provide
Display, its value
If you provide
Streams, it must
be empty and
UseSubstreams must be
When you set the
UseParallel option to
Some computations can execute in parallel even when
large data sets, when
1, consider setting the
UseParallel option to
parfor to create
loops that run in parallel on supported shared-memory multicore
platforms. Loops that run in parallel can be faster than loops
that run on a single thread. If your compiler does not support
the Open Multiprocessing (OpenMP) application interface or you
disable OpenMP library, MATLAB
Coder™ treats the
for-loops. To find supported compilers,
see Supported Compilers.
To run in parallel, set the
'UseParallel' option to
'UseParallel' field of the options structure to
statset and specify the
'Options' name-value pair argument in the call to this function.
For more information, see the
'Options' name-value pair argument.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).