Cluster Analysis

Machine learning method for finding and visualizing natural groupings and patterns in data

Cluster analysis applies one or more clustering algorithms to find hidden patterns or groupings in a dataset. Clustering algorithms form groups, or clusters, such that data within a cluster are more similar to one another than to data in any other cluster. The similarity measure on which the clusters are modeled can be Euclidean distance, probabilistic distance, or another metric.
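As a rough illustration of a similarity measure, the sketch below computes Euclidean distance and, as one example of "another metric", Manhattan distance. It then checks that points inside a group are on average closer to each other than to points in a different group. The toy data and the helper names (euclidean, manhattan, mean_distance) are assumptions made for this example only.

```python
import math
from itertools import combinations, product

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (one alternative metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def mean_distance(pairs, dist):
    """Average of dist(a, b) over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy data: two well-separated groups in 2-D.
group_1 = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
group_2 = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]

# Average distance between points of the same group ...
within = mean_distance(combinations(group_1, 2), euclidean)
# ... versus average distance between points of different groups.
between = mean_distance(product(group_1, group_2), euclidean)

print("within-group mean distance: ", within)
print("between-group mean distance:", between)
```

A smaller distance corresponds to a higher similarity, so a good grouping keeps the within-group value well below the between-group value.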

Cluster analysis is an unsupervised learning method and an important task in exploratory data analysis. Popular clustering algorithms include:

  • Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree
  • k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster
  • Gaussian mixture models: model clusters as a mixture of multivariate normal density components
  • Self-organizing maps: use neural networks that learn the topology and distribution of the data
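To make the k-means entry in the list above more concrete, here is a minimal sketch of the standard iterative scheme (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. The function name kmeans, its parameters, and the toy data are assumptions made for this example; this is not the implementation used by any particular toolbox.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Partition points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct points drawn from the data.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the initial centroids, which is why the sketch fixes a random seed; practical implementations typically run several initializations and keep the best one.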

These algorithms differ mainly in how they measure similarity and in how they represent cluster structure: as a tree, as centroids, as probability densities, or as a neural map.
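As a contrast with centroid-based partitioning, the sketch below shows one simple form of hierarchical clustering: agglomerative clustering with single linkage, where the distance between two clusters is taken to be the distance between their closest members. The recorded sequence of merges corresponds to the cluster tree. Single linkage is only one of several possible linkage rules, and the function name single_linkage and the toy data are assumptions made for this example.

```python
import math
from itertools import product

def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns (clusters, merges), where merges lists each merge as
    (left_cluster, right_cluster, distance) in the order performed.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b)
                        for a, b in product(clusters[i], clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters, merges

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]

clusters, merges = single_linkage(points, num_clusters=2)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Stopping at a different num_clusters cuts the same tree at a different level, which is what makes the hierarchy "multilevel".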

Cluster analysis is used in bioinformatics for sequence analysis and genetic clustering; in data mining for sequence and pattern mining; in medical imaging for image segmentation; and in computer vision for object recognition.

For more details on cluster analysis algorithms, see Statistics and Machine Learning Toolbox™ and Neural Network Toolbox™.

See also: machine learning, unsupervised learning, AdaBoost, data analysis, mathematical modeling