
*k*-means clustering

`idx = kmeans(X,k)`

`idx = kmeans(X,k,Name,Value)`

```
[idx,C] = kmeans(___)
```

```
[idx,C,sumd] = kmeans(___)
```

```
[idx,C,sumd,D] = kmeans(___)
```

`idx = kmeans(X,k)` performs *k*-means clustering to partition the observations of the *n*-by-*p* data matrix `X` into `k` clusters, and returns an *n*-by-1 vector (`idx`) containing cluster indices of each observation. Rows of `X` correspond to points and columns correspond to variables.

By default, `kmeans` uses the squared Euclidean distance metric and the *k*-means++ algorithm for cluster center initialization.
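The k-means++ seeding strategy can be sketched as follows. This is an illustrative Python/numpy sketch of the algorithm, not MATLAB code or the actual `kmeans` implementation: the first seed is drawn uniformly at random from the data, and each subsequent seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import numpy as np

def kmeanspp_seeds(X, k, rng=np.random.default_rng(0)):
    """Pick k initial centroids from the rows of X by k-means++ seeding.

    Illustrative sketch only (not MATLAB code). Each new seed is drawn
    with probability proportional to its squared distance to the nearest
    previously chosen seed (the "D^2 weighting" of Arthur and Vassilvitskii).
    """
    n = X.shape[0]
    centers = [X[rng.integers(n)]]        # first seed: uniform at random
    for _ in range(1, k):
        # squared distance from every point to its nearest existing seed
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()             # D^2 weighting
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Because distant points are favored, the seeds tend to spread across the data, which typically speeds up convergence and improves the final partition compared with purely uniform seeding.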

`idx = kmeans(X,k,Name,Value)` returns the cluster indices with additional options specified by one or more `Name,Value` pair arguments.

For example, specify the cosine distance, the number of times to repeat the clustering using new initial values, or whether to use parallel computing.
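As an aside on the cosine option, the assignment step under cosine distance can be sketched in Python/numpy (an illustrative sketch, not MATLAB code; the function name `cosine_assign` is made up for this example): each point goes to the centroid minimizing one minus the cosine of the angle between them.

```python
import numpy as np

def cosine_assign(X, C):
    """Assign each row of X to the centroid in C with the smallest cosine
    distance, 1 - cos(angle between the vectors). Illustrative sketch of
    the 'cosine' distance option; not the MATLAB implementation."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    d = 1.0 - Xn @ Cn.T                                 # n-by-k cosine distances
    return np.argmin(d, axis=1)
```

Note that under cosine distance only the direction of each observation matters, so points that are scalar multiples of each other are treated as identical.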

`kmeans` uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all `k` clusters.

The first phase uses *batch updates*, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. This phase occasionally does not converge to a solution that is a local minimum, that is, a partition of the data where moving any single point to a different cluster increases the total sum of distances. This is more likely for small data sets. The batch phase is fast, but potentially only approximates a solution as a starting point for the second phase.

The second phase uses *online updates*, where points are individually reassigned if doing so reduces the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this phase consists of one pass through all the points. This phase converges to a local minimum, although there might be other local minima with a lower total sum of distances. In general, finding the global minimum would require an exhaustive choice of starting points, but using several replicates with random starting points typically results in a solution that is a global minimum.
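The two phases can be sketched as follows. This is an illustrative Python/numpy sketch of the batch-then-online scheme under squared Euclidean distance, not the MATLAB implementation (the helper name `kmeans_two_phase` is made up for this example):

```python
import numpy as np

def kmeans_two_phase(X, C):
    """Two-phase k-means sketch (not MATLAB code). X is n-by-p data,
    C is an initial k-by-p centroid matrix. Returns (idx, C)."""
    def assign(X, C):
        # index of the nearest centroid for every point (squared Euclidean)
        return np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)

    # Phase 1: batch updates -- reassign all points at once, then recompute
    # each centroid as the mean of its members, until assignments stabilize.
    idx = assign(X, C)
    for _ in range(100):                  # iteration cap for safety
        C = np.array([X[idx == j].mean(axis=0) if np.any(idx == j) else C[j]
                      for j in range(len(C))])
        new_idx = assign(X, C)
        if np.array_equal(new_idx, idx):
            break
        idx = new_idx

    # Phase 2: online updates -- move one point at a time, accepting a move
    # only if it strictly reduces that point's distance, and refresh the two
    # affected centroids after each accepted move.
    for _ in range(100):                  # pass cap for safety
        improved = False
        for i in range(len(X)):
            d2 = ((C - X[i]) ** 2).sum(-1)
            j = int(np.argmin(d2))
            if j != idx[i] and d2[j] < d2[idx[i]]:
                old = int(idx[i])
                idx[i] = j
                for g in (old, j):        # recompute the two changed clusters
                    if np.any(idx == g):  # (skip a cluster left empty)
                        C[g] = X[idx == g].mean(axis=0)
                improved = True
        if not improved:
            break
    return idx, C
```

Each accepted online move strictly lowers the objective, so the second phase can escape non-minimal partitions that the faster batch phase sometimes leaves behind.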

If `Replicates` = *r* > 1 and `Start` is `plus` (the default), then the software selects *r* possibly different sets of seeds according to the *k*-means++ algorithm.

If you enable the `UseParallel` option in `Options` and `Replicates` > 1, then each worker selects seeds and clusters in parallel.
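The replicate selection can be sketched as a serial loop in Python/numpy. This is an illustrative sketch, not the MATLAB implementation: the helper name `kmeans_replicates` is made up, it uses plain uniform seeding rather than the default k-means++ seeding, and it compares the total (rather than per-cluster) within-cluster sum of squared distances; with `UseParallel`, the replicates would run on workers instead of in this loop.

```python
import numpy as np

def kmeans_replicates(X, k, r, seed=0):
    """Run the clustering r times from different random seed sets and keep
    the solution with the lowest total within-cluster sum of squared
    distances. Illustrative sketch only (not MATLAB code)."""
    rng = np.random.default_rng(seed)

    def lloyd(C):
        # plain batch k-means started from the given seed set
        for _ in range(100):
            idx = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
            C = np.array([X[idx == j].mean(axis=0) if np.any(idx == j) else C[j]
                          for j in range(k)])
        idx = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        sumd = ((X - C[idx]) ** 2).sum()   # total within-cluster sum
        return idx, C, sumd

    best = None
    for _ in range(r):                     # one replicate per seed set
        seeds = X[rng.choice(len(X), size=k, replace=False)]
        idx, C, sumd = lloyd(seeds)
        if best is None or sumd < best[2]:
            best = (idx, C, sumd)
    return best
```

Because each replicate starts from an independent seed set, taking the lowest-cost solution makes it much more likely that the returned partition is the global minimum rather than a poor local one.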

[1] Arthur, David, and Sergei Vassilvitskii. “K-means++: The Advantages of Careful Seeding.” SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.

[2] Lloyd, Stuart P. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory. Vol. 28, 1982, pp. 129–137.

[3] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.

[4] Spath, H. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt. New York: Halsted Press, 1985.

`clusterdata` | `gmdistribution` | `linkage` | `parpool` | `silhouette` | `statset`
