# kmeans

k-means clustering

## Syntax

• `idx = kmeans(X,k)` example
• `idx = kmeans(X,k,Name,Value)` example
• ```[idx,C] = kmeans(___)``` example
• ```[idx,C,sumd] = kmeans(___)``` example
• ```[idx,C,sumd,D] = kmeans(___)``` example

## Description

example

````idx = kmeans(X,k)` performs k-means clustering to partition the observations of the n-by-p data matrix `X` into `k` clusters, and returns an n-by-1 vector (`idx`) containing cluster indices of each observation. Rows of `X` correspond to points and columns correspond to variables.By default, `kmeans` uses the squared Euclidean distance measure and the k-means++ algorithm for cluster center initialization.```

example

````idx = kmeans(X,k,Name,Value)` returns the cluster indices with additional options specified by one or more `Name,Value` pair arguments.For example, specify the cosine distance, the number of times to repeat the clustering using new initial values, or to use parallel computing.```

example

``````[idx,C] = kmeans(___)``` returns the `k` cluster centroid locations in the `k`-by-p matrix `C`.```

example

``````[idx,C,sumd] = kmeans(___)``` returns the within-cluster sums of point-to-centroid distances in the `k`-by-1 vector `sumd`.```

example

``````[idx,C,sumd,D] = kmeans(___)``` returns distances from each point to every centroid in the n-by-`k` matrix `D`.```

## Examples

collapse all

### Train a k-Means Clustering Algorithm

Cluster data using k-means clustering, then plot the cluster regions.

Load Fisher's iris data set. Use the petal lengths and widths as predictors.

```load fisheriris X = meas(:,3:4); figure; plot(X(:,1),X(:,2),'k*','MarkerSize',5); title 'Fisher''s Iris Data'; xlabel 'Petal Lengths (cm)'; ylabel 'Petal Widths (cm)'; ```

The larger cluster seems to be split into a lower variance region and a higher variance region. This might indicate that the larger cluster is two, overlapping clusters.

Cluster the data. Specify k = 3 clusters.

```rng(1); % For reproducibility [idx,C] = kmeans(X,3); ```

`kmeans` uses the k-means++ algorithm for centroid initialization and squared Euclidean distance by default. It is good practice to search for lower, local minima by setting the `'Replicates'` name-value pair argument.

`idx` is a vector of predicted cluster indices corrresponding to the observations in `X`. `C` is a 3-by-2 matrix containing the final centroid locations.

Use `kmeans` to compute the distance from each centroid to points on a grid. To do this, pass the centroids (`C`) and points on a grid to `kmeans`, and implement one iteration of the algorithm.

```x1 = min(X(:,1)):0.01:max(X(:,1)); x2 = min(X(:,2)):0.01:max(X(:,2)); [x1G,x2G] = meshgrid(x1,x2); XGrid = [x1G(:),x2G(:)]; % Defines a fine grid on the plot idx2Region = kmeans(XGrid,3,'MaxIter',1,'Start',C);... % Assigns each node in the grid to the closest centroid ```
```Warning: Failed to converge in 1 iterations. ```

`kmeans` displays a warning stating that the algorithm did not converge, which you should expect since the software only implemented one iteration.

Plot the cluster regions.

```figure; gscatter(XGrid(:,1),XGrid(:,2),idx2Region,... [0,0.75,0.75;0.75,0,0.75;0.75,0.75,0],'..'); hold on; plot(X(:,1),X(:,2),'k*','MarkerSize',5); title 'Fisher''s Iris Data'; xlabel 'Petal Lengths (cm)'; ylabel 'Petal Widths (cm)'; legend('Region 1','Region 2','Region 3','Data','Location','Best'); hold off; ```

### Partition Data into Two Clusters

Randomly generate the sample data.

```rng default; % For reproducibility X = [randn(100,2)*0.75+ones(100,2); randn(100,2)*0.5-ones(100,2)]; figure; plot(X(:,1),X(:,2),'.'); title 'Randomly Generated Data'; ```

There appears to be two clusters in the data.

Partition the data into two clusters, and choose the best arrangement out of five intializations. Display the final output.

```opts = statset('Display','final'); [idx,C] = kmeans(X,2,'Distance','cityblock',... 'Replicates',5,'Options',opts); ```
```Replicate 1, 4 iterations, total sum of distances = 201.533. Replicate 2, 6 iterations, total sum of distances = 201.533. Replicate 3, 4 iterations, total sum of distances = 201.533. Replicate 4, 4 iterations, total sum of distances = 201.533. Replicate 5, 3 iterations, total sum of distances = 201.533. Best total sum of distances = 201.533 ```

By default, the software initializes the replicates separatly using k-means++.

Plot the clusters and the cluster centroids.

```figure; plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12) hold on plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12) plot(C(:,1),C(:,2),'kx',... 'MarkerSize',15,'LineWidth',3) legend('Cluster 1','Cluster 2','Centroids',... 'Location','NW') title 'Cluster Assignments and Centroids' hold off ```

You can determine how well separated the clusters are by passing `idx` to `silhouette`.

### Cluster Data Using Parallel Computing

Clustering large data sets might take time, particularly if you use online updates (set by default). If you have a Parallel Computing Toolbox™ license and you invoke a pool of workers, then `kmeans` runs each clustering task (or replicate) in parallel. Therefore, if `Replicates` > 1, then the parallel computing decreases time to convergence.

Randomly generate a large data set from a Gaussian mixture model.

```Mu = bsxfun(@times,ones(20,30),(1:20)'); % Gaussian mixture mean rn30 = randn(30,30); Sigma = rn30'*rn30; % Symmetric and positive-definite covariance Mdl = gmdistribution(Mu,Sigma); rng(1); % For reproducibility X = random(Mdl,10000); ```

`Mdl` is a 30-dimensional `gmdistribution` model with 20 components. `X` is a `10000`-by-`30` matrix of data generated from `Mdl`.

Invoke a parallel pool of workers. Specify options for parallel computing.

```pool = parpool; % Invokes workers stream = RandStream('mlfg6331_64'); % Random number stream options = statset('UseParallel',1,'UseSubstreams',1,... 'Streams',stream);```
`Starting parallel pool (parpool) using the 'local' profile ... connected to 2 workers.`

The input argument `'mlfg6331_64'` of `RandStream` specifies to use the multiplicative lagged Fibonacci generator algorithm. `options` is a structure array containing fields that specify options for controlling estimation.

The Command Window indicates that two workers are available. The number of workers might vary on your system.

Cluster the data using k-means clustering. Specify that there are k = 20 clusters in the data and increase the number of iterations. Typically, the objective function contains local minima. Specify 10 replicates to help find a lower, local minimum.

```tic; % Start stopwatch timer [idx,C,sumd,D] = kmeans(X,20,'Options',options,'MaxIter',10000,... 'Display','final','Replicates',10); toc % Terminate stopwatch timer```
```Replicate 4, 121 iterations, total sum of distances = 7.58059e+06. Replicate 7, 234 iterations, total sum of distances = 7.5904e+06. Replicate 3, 146 iterations, total sum of distances = 7.59086e+06. Replicate 2, 179 iterations, total sum of distances = 7.57758e+06. Replicate 6, 118 iterations, total sum of distances = 7.58614e+06. Replicate 5, 88 iterations, total sum of distances = 7.59462e+06. Replicate 1, 99 iterations, total sum of distances = 7.57765e+06. Replicate 9, 147 iterations, total sum of distances = 7.57639e+06. Replicate 10, 107 iterations, total sum of distances = 7.60079e+06. Replicate 8, 144 iterations, total sum of distances = 7.58117e+06. Best total sum of distances = 7.57639e+06 Elapsed time is 123.857736 seconds.```

The Command Window displays the number of iterations and the terminal objective function value for each replicate. The output arguments contain the results of replicate `9` because it has the lowest total sum of distances.

## Input Arguments

collapse all

### `X` — Datanumeric matrix

Data, specified as a numeric matrix. The rows of `X` correspond to observations, and the columns correspond to variables.

If `X` is a numeric vector, then `kmeans` treats it as an n-by-1 data matrix, regardless of its orientation.

Data Types: `single` | `double`

### `k` — Number of clusterspositive integer

Number of clusters in the data, specified as a positive integer.

Data Types: `single` | `double`

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'Distance','cosine','Replicates',10,'Options',statset('UseParallel',1)` specifies the cosine distance, `10` replicate clusters at different starting values, and to use parallel computing.

### `'Display'` — Level of output to display`'off'` (default) | `'final'` | `'iter'`

Level of output to display in the Command Window, specified as the comma-separated pair consisting of `'Display'` and a string. Available options are:

• `'final'` — Displays results of the final iteration

• `'iter'` — Displays results of each iteration

• `'off'` — Displays nothing

Example: `'Display','final'`

Data Types: `char`

### `'Distance'` — Distance measure`'sqeuclidean'` (default) | `'cityblock'` | `'cosine'` | `'correlation'` | `'hamming'`

Distance measure, in `p`-dimensional space, used for minimization, specified as the comma-separated pair consisting of `'Distance'` and a string.

`kmeans` computes centroid clusters differently for the different, supported distance measures. This table summarizes the available distance measures. In the formulae, x is an observation (that is, a row of `X`) and c is a centroid (a row vector).

Distance MeasureDescriptionFormula
`'sqeuclidean'`

Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster.

$d\left(x,c\right)=\left(x-c\right)\left(x-c{\right)}^{\prime }$

`'cityblock'`

Sum of absolute differences, i.e., the L1 distance. Each centroid is the component-wise median of the points in that cluster.

$d\left(x,c\right)=\sum _{j=1}^{p}|{x}_{j}-{c}_{j}|$

`'cosine'`

One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.

$d\left(x,c\right)=1-\frac{xc\prime }{\sqrt{\left(x{x}^{\prime }\right)\left(cc\prime \right)}}$

`'correlation'`

One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.

$d\left(x,c\right)=1-\frac{\left(x-\stackrel{\to }{\overline{x}}\right){\left(c-\stackrel{\to }{\overline{c}}\right)}^{\prime }}{\sqrt{\left(x-\stackrel{\to }{\overline{x}}\right){\left(x-\stackrel{\to }{\overline{x}}\right)}^{\prime }}\sqrt{\left(c-\stackrel{\to }{\overline{c}}\right){\left(c-\stackrel{\to }{\overline{c}}\right)}^{\prime }}},$

where

• $\stackrel{\to }{\overline{x}}=\frac{1}{p}\left(\sum _{j=1}^{p}{x}_{j}\right){\stackrel{\to }{1}}_{p}$

• $\stackrel{\to }{\overline{c}}=\frac{1}{p}\left(\sum _{j=1}^{p}{c}_{j}\right){\stackrel{\to }{1}}_{p}$

• ${\stackrel{\to }{1}}_{p}$ is a row vector of p ones.

`'hamming'`

This measure is only suitable for binary data.

It is the proportion of bits that differ. Each centroid is the component-wise median of points in that cluster.

$d\left(x,y\right)=\frac{1}{p}\sum _{j=1}^{p}I\left\{{x}_{j}\ne {y}_{j}\right\},$

where I is the indicator function.

Example: `'Distance','cityblock'`

Data Types: `char`

### `'EmptyAction'` — Action to take if cluster loses all member observations`'singleton'` (default) | `'error'` | `'drop'`

Action to take if a cluster loses all its member observations, specified as the comma-separated pair consisting of `'EmptyAction'` and a string. This table summarizes the available options.

ValueDescription
`'error'`

Treat an empty cluster as an error.

`'drop'`

Remove any clusters that become empty. `kmeans` sets the corresponding return values in `C` and `D` to `NaN`.

`'singleton'`

Create a new cluster consisting of the one point furthest from its centroid (default).

Example: `'EmptyAction','error'`

Data Types: `char`

### `'MaxIter'` — Maximum number of iterations`100` (default) | positive integer

Maximum number of iterations, specified as the comma-separated pair consisting of `'MaxIter'` and a positive integer.

Example: `'MaxIter',1000`

Data Types: `double` | `single`

### `'OnlinePhase'` — Online update flag`'on'` (default) | `'off'`

Online update flag, specified as the comma-separated pair consisting of `'OnlinePhase'` and `'on'` or `'off'`.

If `OnlinePhase` is `on` (the default), then `kmeans` performs an online update phase in addition to a batch update phase. The online phase can be time consuming for large data sets, but guarantees a solution that is a local minimum of the distance criterion. In other words, the software finds a partition of the data in which moving any single point to a different cluster increases the total sum of distances.

Example: `'OnlinePhase','off'`

Data Types: `char`

### `'Options'` — Options for controlling iterative algorithm for minimizing fitting criteria`[]` (default) | structure array returned by `statset`

Options for controlling the iterative algorithm for minimizing the fitting criteria, specified as the comma-separated pair consisting of `'Options'` and a structure array returned by `statset`. These options require Parallel Computing Toolbox™.

This table summarizes the available options.

OptionDescription
`'Streams'`A `RandStream` object or cell array of such objects. If you do not specify `Streams`, `kmeans` uses the default stream or streams. If you specify `Streams`, use a single object except when:
• You have an open parallel pool

• `UseParallel` is `true`.

• `UseSubstreams` is `false`.

In that case, use a cell array the same size as the parallel pool. If a parallel pool is not open, then `Streams` must supply a single random number stream.

`'UseParallel'`
• If `true`, `Replicates` > 1, and if a parallel pool of workers from the Parallel Computing Toolbox is open, then the software implements k-means on each replicate in parallel.

• If the Parallel Computing Toolbox is not installed, or a parallel pool of workers is not open, computation occurs in serial mode. Default is `default`, meaning serial computation.

`'UseSubstreams'`Set to `true` to compute in parallel in a reproducible fashion. Default is `false`. To compute reproducibly, set `Streams` to a type allowing substreams: `'mlfg6331_64'` or `'mrg32k3a'`.

To ensure more predictable results, use `parpool` and explicitly create a parallel pool before invoking `kmeans` and setting `'Options',statset('UseParallel',1)`.

Example: `'Options',statset('UseParallel',1)`

Data Types: `struct`

### `'Replicates'` — Number of times to repeat clustering using new initial cluster centroid positions`1` (default) | positive integer

Number of times to repeat clustering using new initial cluster centroid positions, specified as the comma-separated pair consisting of `'Replicates'` and an integer. `kmeans` returns the solution with the lowest `sumd`.

You can set `'Replicates'` implicitly by supplying a 3-D array as the value for the `'Start'` name-value pair argument.

Example: `'Replicates',5`

Data Types: `double` | `single`

### `'Start'` — Method for choosing initial cluster centroid positions`'plus'` (default) | `'cluster'` | `'sample'` | `'uniform'` | numeric matrix | numeric array

Method for choosing initial cluster centroid positions (or seeds), specified as the comma-separated pair consisting of `'Start'` and a string, a numeric matrix, or a numeric array. This table summarizes the available options for choosing seeds.

ValueDescription
`'cluster'`Perform a preliminary clustering phase on a random 10% subsample of `X`. This preliminary phase is itself initialized using `'sample'`.
`'plus'` (default)Select `k` seeds by implementing the k-means++ algorithm for cluster center initialization.
`'sample'`Select `k` observations from `X` at random.
`'uniform'`Select `k` points uniformly at random from the range of `X`. Not valid with the Hamming distance.
numeric matrix`k`-by-p matrix of centroid starting locations. The rows of `Start` correspond to seeds. The software infers `k` from the first dimension of `Start`, so you can pass in `[]` for `k`.
numeric array`k`-by-p-r array of centroid starting locations. The rows of each page correspond to seeds. The third dimension invokes replication of the clustering routine. Page j contains the set of seeds for replicate j. The software infers the number of replicates (specified by the `'Replicates'` name-value pair argument) from the size of the third dimension.

Example: `'Start','sample'`

Data Types: `char` | `double` | `single`

 Note:   The software treats `NaN`s as missing data, and removes any row of `X` containing at least one `NaN`. Removing rows of `X` reduces the sample size.

## Output Arguments

collapse all

### `idx` — Cluster indicesnumeric column vector

Cluster indices, returned as a numeric column vector. `idx` has as many rows as `X`, and each row indicates the cluster assignment of the corresponding observation.

### `C` — Cluster centroid locationsnumeric matrix

Cluster centroid locations, returned as a numeric matrix. `C` is a `k`-by-p matrix, where row j is the centroid of cluster j.

### `sumd` — Within-cluster sums of point-to-centroid distancesnumeric column vector

Within-cluster sums of point-to-centroid distances, returned as a numeric column vector. `sumd` is a `k`-by-1 vector, where element j is the sum of point-to-centroid distances within cluster j.

### `D` — Distances from each point to every centroidnumeric matrix

Distances from each point to every centroid, returned as a numeric matrix. `D` is an n-by-`k` matrix, where element (j,m) is the distance from observation j to centroid m.

collapse all

### k-Means Clustering

k-means clustering, or Lloyd's algorithm [2], is an iterative, data-partitioning algorithm that assigns n observations to exactly one of k clusters defined by centroids, where k is chosen before the algorithm starts.

The algorithm proceeds as follows:

1. Choose k initial cluster centers (centroid). For example, choose k observations at random (by using `'Start','sample'`) or use the k-means ++ algorithm for cluster center initialization (the default).

2. Compute point-to-cluster-centroid distances of all observations to each centroid.

3. There are two ways to proceed (specified by `OnlinePhase`):

• Batch update — Assign each observation to the cluster with the closest centroid.

• Online update — Individually assign observations to a different centroid if the reassignment decreases the sum of the within-cluster, sum-of-squares point-to-cluster-centroid distances.

For more details, see Algorithms.

4. Compute the average of the observations in each cluster to obtain k new centroid locations.

5. Repeat steps 2 through 4 until cluster assignments do not change, or the maximum number of iterations is reached.

### k-means++ Algorithm

The k-means++ algorithm uses an heuristic to find centroid seeds for k-means clustering. According to Arthur and Vassilvitskii [1], k-means++ improves the running time of Lloyd's algorithm, and the quality of the final solution.

The k-means++ algorithm chooses seeds as follows, assuming the number of clusters is k.

1. Select an observation uniformly at random from the data set, X. The chosen observation is the first centroid, and is denoted c1.

2. Compute distances from each observation to c1. Denote the distance observation m is from cj $d\left({x}_{m},{c}_{j}\right)$.

3. Select the next centroid, c2 at random from X with probability

$\frac{{d}^{2}\left({x}_{m},{c}_{1}\right)}{\sum _{j=1}^{n}{d}^{2}\left({x}_{j},{c}_{1}\right)}.$

4. To choose center j:

1. Compute the distances from each observation to each centroid, and assign each observation to its closest centroid.

2. For m = 1,...,n and p = 1,...,j – 1, select centroid j at random from X with probability

$\frac{{d}^{2}\left({x}_{m},{c}_{p}\right)}{\sum _{\left\{h;{x}_{h}\in {C}_{p}\right\}}^{}{d}^{2}\left({x}_{h},{c}_{p}\right)},$

where Cp is the set of all observations closest to centroid cp and xm belongs to Cp.

That is, select each subsequent center with a probability proportional to the distance from itself to the closest center that you already chose.

5. Repeat step 4 until k centroids are chosen.

Arthur and Vassilvitskii [1] demonstrate, using a simulation study for several cluster orientations, that k-means++ achieves faster convergence to a lower sum of within-cluster, sum-of-squares point-to-cluster-centroid distances than Lloyd's algorithm.

### Algorithms

• `kmeans` uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all `k` clusters.

1. This first phase uses batch updates, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. This phase occasionally does not converge to solution that is a local minimum. That is, a partition of the data where moving any single point to a different cluster increases the total sum of distances. This is more likely for small data sets. The batch phase is fast, but potentially only approximates a solution as a starting point for the second phase.

2. This second phase uses online updates, where points are individually reassigned if doing so reduces the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this phase consists of one pass though all the points. This phase converges to a local minimum, although there might be other local minima with lower total sum of distances. In general, finding the global minimum is solved by an exhaustive choice of starting points, but using several replicates with random starting points typically results in a solution that is a global minimum.

• If `Replicates` = r > 1 and `Start` is `plus` (the default), then the software selects r possibly different sets of seeds according to the k-means++ algorithm.

• If you enable the `UseParallel` option in `Options` and `Replicates` > 1, then each worker selects seeds and clusters in parallel.

## References

[1] Arthur, David, and Sergi Vassilvitskii. "K-means++: The Advantages of Careful Seeding." SODA ‘07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.

[2] Lloyd, Stuart P. "Least Squares Quantization in PCM." IEEE Transactions on Information Theory. Vol. 28, 1982, pp. 129–137.

[3] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.

[4] Spath, H. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt. New York: Halsted Press, 1985.