*Cluster analysis*, also called *segmentation
analysis* or *taxonomy analysis*,
creates groups, or *clusters*, of data. Clusters
are formed in such a way that objects in the same cluster are very
similar and objects in different clusters are very distinct. Measures
of similarity depend on the application.

Hierarchical clustering groups
data over a variety of scales by creating a cluster tree or *dendrogram*.
The tree is not a single set of clusters, but rather a multilevel
hierarchy, where clusters at one level are joined as clusters at the
next level. This allows you to decide the level or scale of clustering
that is most appropriate for your application. The Statistics and Machine Learning Toolbox™ function `clusterdata`
performs all of the necessary
steps for you. It incorporates the `pdist`, `linkage`, and `cluster`
functions,
which may be used separately for more detailed analysis. The `dendrogram`
function plots the cluster tree.
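
The two routes described above can be sketched as follows. This is a minimal illustration, not a recommended workflow: the synthetic data, the choice of average linkage, and the limit of two clusters are all assumptions made for the example.

```matlab
% Synthetic data (an assumption for illustration): two well-separated 2-D groups
rng(1);                                 % fix the seed so the run is repeatable
X = [randn(20,2); randn(20,2) + 8];

% Step-by-step route, using the three functions separately
Y = pdist(X);                           % pairwise Euclidean distances (default metric)
Z = linkage(Y, 'average');              % build the cluster tree with average linkage
T = cluster(Z, 'maxclust', 2);          % cut the tree into at most 2 clusters

% One-call route: clusterdata runs the same three steps internally
T2 = clusterdata(X, 'Linkage', 'average', 'MaxClust', 2);

% Plot the cluster tree (by default, at most 30 leaf nodes are shown)
dendrogram(Z);
```

Both `T` and `T2` hold one cluster index per row of `X`. The numbering of the clusters is arbitrary, so the two vectors may label the same groups with different integers.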

*k*-means clustering is a
partitioning method. The function `kmeans`
partitions
data into *k* mutually exclusive clusters and returns
the index of the cluster to which it has assigned each observation.
Unlike hierarchical clustering, *k*-means clustering
operates on actual observations (rather than the larger set of pairwise dissimilarity
measures, which grows quadratically with the number of observations), and creates a single level of clusters. These differences
mean that *k*-means clustering is often more suitable
than hierarchical clustering for large amounts of data.
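
A short sketch of the call described above, under stated assumptions: the data are synthetic, `k = 2` is chosen by hand, and the `'Replicates'` option is added here only to reduce sensitivity to the random starting centroids.

```matlab
% Synthetic data (an assumption for illustration): two 2-D groups
rng(1);                                 % fix the seed so the run is repeatable
X = [randn(50,2); randn(50,2) + 6];

k = 2;                                  % number of clusters, chosen by hand here

% idx(i) is the cluster index assigned to observation X(i,:)
% C(j,:) is the centroid of cluster j
[idx, C] = kmeans(X, k, 'Replicates', 5);
```

Because `kmeans` works directly on the rows of `X`, no pairwise distance vector has to be computed or stored first.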

Gaussian mixture models form clusters by
representing the probability density function of observed variables
as a mixture of multivariate normal densities. Mixture models of the `gmdistribution`
class
use an expectation-maximization
(EM) algorithm to fit data, which assigns posterior probabilities
to each component density with respect to each observation. Clusters
are assigned by selecting the component that maximizes the posterior
probability. Clustering using Gaussian mixture models is sometimes
considered a soft clustering method, because the posterior probabilities for
each point indicate that each data point has some probability of belonging
to each cluster. Like *k*-means clustering, Gaussian
mixture modeling uses an iterative algorithm that converges to a local
optimum. Gaussian mixture modeling may be more appropriate than *k*-means
clustering when clusters have different sizes and different correlation structures within
them.
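
The relationship between posterior probabilities and cluster assignment can be sketched as follows. The text above names only the `gmdistribution` class; using `fitgmdist` to create the model is an assumption of this example, as are the synthetic data, the two components, and the `'Replicates'` setting.

```matlab
% Synthetic data (an assumption for illustration): one elongated group
% and one round group, so the clusters differ in shape
rng(1);                                 % fix the seed so the run is repeatable
X = [randn(100,2) * [2 0; 0 0.5]; randn(100,2) + 6];

% Fit a two-component mixture with EM; gm is a gmdistribution object
gm = fitgmdist(X, 2, 'Replicates', 5);

% Soft assignment: P(i,j) is the posterior probability that
% observation i belongs to component j
P = posterior(gm, X);

% Hard assignment: the component with the largest posterior probability
idx = cluster(gm, X);
[~, idxMax] = max(P, [], 2);            % same rule applied by hand to P
```

Each row of `P` sums to 1, and `idx` is expected to match `idxMax`, which is the sense in which the hard clusters are derived from the soft ones.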
