This topic provides a brief overview of the available clustering methods in Statistics and Machine Learning Toolbox™.

*Cluster analysis*, also called *segmentation
analysis* or *taxonomy analysis*, is a common
unsupervised learning method. Unsupervised learning is used to draw inferences from
data sets consisting of input data without labeled responses. For example, you can
use cluster analysis for exploratory data analysis to find hidden patterns or
groupings in unlabeled data.

Cluster analysis creates groups, or *clusters*, of data.
Objects that belong to the same cluster are similar to one another and distinct from
objects that belong to different clusters. To quantify "similar" and "distinct," you
can use a dissimilarity measure (or distance metric) that is specific to the domain of your application and
your data set. Also, depending on your application, you might consider scaling (or
standardizing) the variables in your data to give them equal importance during
clustering.

Statistics and Machine Learning Toolbox provides functionality for these clustering methods:

Hierarchical clustering groups data over a variety of scales by creating a
cluster tree, or *dendrogram*. The tree is not a single set
of clusters, but rather a multilevel hierarchy, where clusters at one level
combine to form clusters at the next level. This multilevel hierarchy allows you
to choose the level, or scale, of clustering that is most appropriate for your
application. Hierarchical clustering assigns every point in your data to a
cluster.

Use `clusterdata` to perform hierarchical clustering on input data. `clusterdata` incorporates the `pdist`, `linkage`, and `cluster` functions, which you can use separately for more detailed analysis. The `dendrogram` function plots the cluster tree. For more information, see Introduction to Hierarchical Clustering.
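
For example, this minimal sketch (using synthetic data; the two-group data and the cluster count are illustrative) builds a cluster tree step by step and then repeats the analysis with the one-step `clusterdata` call:

```matlab
rng(1)                               % for reproducibility
X = [randn(20,2); randn(20,2)+4];    % two well-separated synthetic groups

D = pdist(X);                        % pairwise distances between observations
Z = linkage(D);                      % build the cluster tree (default linkage)
dendrogram(Z)                        % plot the multilevel hierarchy
T = cluster(Z,'MaxClust',2);         % cut the tree into two clusters

% Equivalent one-step call that wraps pdist, linkage, and cluster:
T2 = clusterdata(X,'MaxClust',2);
```

Working with `pdist`, `linkage`, and `cluster` separately lets you inspect the distances and the tree before committing to a cutoff.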

*k*-means clustering and *k*-medoids
clustering partition data into *k* mutually exclusive clusters.
These clustering methods require that you specify the number of clusters
*k*. Both *k*-means and
*k*-medoids clustering assign every point in your data to a
cluster; however, unlike hierarchical clustering, these methods operate on
actual observations (rather than dissimilarity measures), and create a single
level of clusters. Therefore, *k*-means or
*k*-medoids clustering is often more suitable than hierarchical
clustering for large amounts of data.

Use `kmeans` and `kmedoids` to implement *k*-means clustering and *k*-medoids clustering, respectively. For more information, see Introduction to *k*-Means Clustering and *k*-Medoids Clustering.
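
As a minimal sketch (synthetic data; `k = 2` is illustrative), both functions take the observation matrix and the number of clusters directly:

```matlab
rng(1)                               % for reproducibility
X = [randn(50,2); randn(50,2)+3];    % two synthetic groups

[idx,C]  = kmeans(X,2);              % idx: cluster index per row, C: centroids
[idx2,M] = kmedoids(X,2);            % M: medoids, which are actual observations
```

A practical difference: `kmeans` centroids are means that need not coincide with any observation, while `kmedoids` medoids are rows of `X`, which makes *k*-medoids more robust to outliers.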

DBSCAN is a density-based algorithm that identifies arbitrarily shaped
clusters and outliers (noise) in data. During clustering, DBSCAN identifies
points that do not belong to any cluster, which makes this method useful for
density-based outlier detection. Unlike *k*-means and
*k*-medoids clustering, DBSCAN does not require prior
knowledge of the number of clusters.

Use `dbscan` to perform clustering on an input data matrix or on pairwise distances between observations. For more information, see Introduction to DBSCAN.
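
For example, in this sketch (synthetic data; the neighborhood radius `epsilon` and minimum-point count are illustrative and normally require tuning), `dbscan` finds the two groups and flags the isolated point as noise:

```matlab
rng(1)                                       % for reproducibility
X = [randn(50,2); randn(50,2)+5; 10 10];     % two groups plus one outlier

epsilon = 1;                                 % neighborhood radius
minpts  = 5;                                 % minimum neighbors for a core point
idx = dbscan(X,epsilon,minpts);              % cluster indices; -1 marks noise
```

Note that the number of clusters is an output of the algorithm, not an input, and the `-1` labels give you outlier detection for free.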

A Gaussian mixture model (GMM) forms clusters as a mixture of multivariate
normal density components. For a given observation, the GMM assigns posterior
probabilities to each component density (or cluster). Each posterior
probability represents the probability that the observation belongs to the
corresponding cluster. A GMM can perform *hard* clustering by
selecting the component that maximizes the posterior probability as the assigned
cluster for the observation. You can also use a GMM to perform
*soft*, or *fuzzy*, clustering by
assigning the observation to multiple clusters based on the scores or posterior
probabilities of the observation for the clusters. A GMM can be a more
appropriate method than *k*-means clustering when clusters have
different sizes and different correlation structures within them.

Use `fitgmdist` to fit a `gmdistribution` object to your data. You can also use `gmdistribution` to create a GMM object by specifying the distribution parameters. When you have a fitted GMM, you can cluster query data by using the `cluster` function. For more information, see Cluster Using Gaussian Mixture Model.
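
As a minimal sketch (synthetic data drawn from two normal components; the component count is illustrative), this fits a two-component GMM and then produces both hard and soft cluster assignments:

```matlab
rng(1)                                           % for reproducibility
X = [mvnrnd([0 0],[1 .5; .5 1],100); ...         % correlated component
     mvnrnd([4 4],eye(2),100)];                  % spherical component

gm  = fitgmdist(X,2);       % fit a two-component Gaussian mixture model
idx = cluster(gm,X);        % hard clustering: most probable component per row
P   = posterior(gm,X);      % soft clustering: posterior probability per component
```

The rows of `P` sum to 1; thresholding or ranking them is how you perform fuzzy assignment to multiple clusters.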

*k*-nearest neighbor search finds the *k*
closest points in your data to a query point or set of query points. In
contrast, radius search finds all points in your data that are within a
specified distance from a query point or set of query points. The results of
these methods depend on the distance metric that you
specify.

Use the `knnsearch` function to find *k*-nearest neighbors or the `rangesearch` function to find all neighbors within a specified distance of your input data. You can also create a searcher object using a training data set, and pass the object and query data sets to the object functions (`knnsearch` and `rangesearch`). For more information, see Classification Using Nearest Neighbors.
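
For example, this sketch (synthetic data; the neighbor count and search radius are illustrative) shows both the direct calls and the searcher-object workflow, which avoids rebuilding the search structure for repeated queries:

```matlab
rng(1)                                % for reproducibility
X = randn(100,2);                     % reference (training) data
Y = [0 0; 2 2];                       % query points

idx  = knnsearch(X,Y,'K',3);          % 3 nearest neighbors of each query point
nbrs = rangesearch(X,Y,0.5);          % all neighbors within distance 0.5

% Searcher-object workflow: build once, query many times
Mdl  = createns(X);                   % creates a KDTreeSearcher for this data
idx2 = knnsearch(Mdl,Y,'K',3);        % same result, reusing the search tree
```
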

Spectral clustering is a graph-based algorithm for finding
*k* arbitrarily shaped clusters in data. The technique
involves representing the data in a low dimension. In the low dimension,
clusters in the data are more widely separated, enabling you to use algorithms
such as *k*-means or *k*-medoids clustering.
This low dimension is based on eigenvectors of a Laplacian matrix. A Laplacian
matrix is one way of representing a similarity graph that models the local
neighborhood relationships between data points as an undirected graph.

Use `spectralcluster` to perform spectral clustering on an input data matrix or on a similarity matrix of a similarity graph. `spectralcluster` requires that you specify the number of clusters. However, the algorithm for spectral clustering also provides a way to estimate the number of clusters in your data. For more information, see Partition Data Using Spectral Clustering.
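
As a minimal sketch (synthetic data; two concentric rings, a geometry where plain *k*-means typically fails), `spectralcluster` separates the rings because the Laplacian embedding pulls the two neighborhoods apart:

```matlab
rng(1)                                        % for reproducibility
theta = linspace(0,2*pi,100)';
X = [cos(theta) sin(theta); ...               % outer ring
     0.3*[cos(theta) sin(theta)]];            % inner ring

idx = spectralcluster(X,2);                   % two arbitrarily shaped clusters
```
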

This table compares the features of available clustering methods in Statistics and Machine Learning Toolbox.

Method | Basis of Algorithm | Input to Algorithm | Requires Specified Number of Clusters | Cluster Shapes Identified | Useful for Outlier Detection |
---|---|---|---|---|---|
Hierarchical Clustering | Distance between objects | Pairwise distances between observations | No | Arbitrarily shaped clusters, depending on the specified `'Linkage'` algorithm | No |
k-Means Clustering and k-Medoids Clustering | Distance between objects and centroids | Actual observations | Yes | Spheroidal clusters with equal diagonal covariance | No |
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Density of regions in the data | Actual observations or pairwise distances between observations | No | Arbitrarily shaped clusters | Yes |
Gaussian Mixture Models | Mixture of Gaussian distributions | Actual observations | Yes | Spheroidal clusters with different covariance structures | Yes |
Nearest Neighbors | Distance between objects | Actual observations | No | Arbitrarily shaped clusters | Yes, depending on the specified number of neighbors |
Spectral Clustering | Graph representing connections between data points | Actual observations or similarity matrix | Yes, but the algorithm also provides a way to estimate the number of clusters | Arbitrarily shaped clusters | No |