Calinski-Harabasz criterion clustering evaluation object
CalinskiHarabaszEvaluation is an object consisting of sample data,
clustering data, and Calinski-Harabasz criterion values used to evaluate the optimal
number of clusters. Create a Calinski-Harabasz criterion clustering evaluation object
by calling evalclusters with the 'CalinskiHarabasz' criterion. The call can also take
additional options specified by one or more name-value pair arguments.
eva = evalclusters(
Specify optional comma-separated pairs of Name,Value arguments.
Name is the argument name and
Value is the corresponding value.
Name must appear inside quotes. You can specify several name-value
pair arguments in any order.
Example: 'KList',[1:6] specifies to test 1, 2, 3, 4, 5, and 6 clusters to find the optimal number.
Clustering algorithm used to cluster the input data, stored
as a valid clustering algorithm name or function handle. If the clustering
solutions are provided in the input,
Name of the criterion used for clustering evaluation, stored as a valid criterion name.
Criterion values corresponding to each proposed number of clusters
Distance metric used for clustering data, stored as a valid distance metric name.
List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values.
Logical flag for excluded data, stored as a column vector of
logical values. If
Number of observations in the data matrix
Optimal number of clusters, stored as a positive integer value.
Optimal clustering solution corresponding to
Data used for clustering, stored as a matrix of numerical values.
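The stored values described above can be read from the object with dot notation. The following is a minimal sketch, assuming the Statistics and Machine Learning Toolbox and its fisheriris sample data are available. It uses only the property names that appear in the example output on this page (InspectedK, CriterionValues, OptimalK, OptimalY); the variable names kTested, vrc, bestK, and idx are illustrative.

```matlab
% Assumes Statistics and Machine Learning Toolbox (evalclusters, fisheriris).
load fisheriris                      % provides the 150-by-4 matrix meas
rng('default');                      % for reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6);

kTested = eva.InspectedK;            % cluster counts that were evaluated
vrc     = eva.CriterionValues;       % one criterion value per entry of kTested
bestK   = eva.OptimalK;              % suggested number of clusters
idx     = eva.OptimalY;              % cluster index of each observation for bestK
```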
Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.
Load the sample data.
The data contains length and width measurements from the sepals and petals of three species of iris flowers.
Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using kmeans.
rng('default'); % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',[1:6])
eva =
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
Plot the Calinski-Harabasz criterion values for each number of clusters tested.
The plot shows that the highest Calinski-Harabasz value occurs at three clusters, suggesting that the optimal number of clusters is three.
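One way to draw such a plot is sketched below. It plots the stored criterion values against the inspected cluster counts with the general-purpose plot function, which is an assumption about how to produce the figure rather than the exact command used for this page. The axis label text is illustrative.

```matlab
% Assumes Statistics and Machine Learning Toolbox (evalclusters, fisheriris).
load fisheriris
rng('default');                      % for reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6);

figure;
plot(eva.InspectedK, eva.CriterionValues, '-o');   % NaN entries are left unplotted
xlabel('Number of clusters');
ylabel('Calinski-Harabasz value');
```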
Create a grouped scatter plot to examine the relationship between petal length and width. Group the data by suggested clusters.
PetalLength = meas(:,3);
PetalWidth = meas(:,4);
ClusterGroup = eva.OptimalY;
figure;
gscatter(PetalLength,PetalWidth,ClusterGroup,'rbg','xod');
The plot shows cluster 3 in the lower-left corner, completely separated from the other two clusters. Cluster 3 contains flowers with the smallest petal widths and lengths. Cluster 1 is in the upper-right corner, and contains flowers with the largest petal widths and lengths. Cluster 2 is near the center of the plot, and contains flowers with measurements between these two extremes.
The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as
$$VRC_k = \frac{SS_B}{SS_W} \times \frac{N - k}{k - 1},$$
where $SS_B$ is the overall between-cluster variance, $SS_W$ is the overall within-cluster variance, $k$ is the number of clusters, and $N$ is the number of observations.
The overall between-cluster variance $SS_B$ is defined as
$$SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2,$$
where $k$ is the number of clusters, $n_i$ is the number of observations in cluster $i$, $m_i$ is the centroid of cluster $i$, $m$ is the overall mean of the sample data, and $\lVert m_i - m \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
The overall within-cluster variance $SS_W$ is defined as
$$SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2,$$
where $k$ is the number of clusters, $x$ is a data point, $c_i$ is the $i$th cluster, $m_i$ is the centroid of cluster $i$, and $\lVert x - m_i \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
Well-defined clusters have a large between-cluster variance ($SS_B$) and a small within-cluster variance ($SS_W$). The larger the $VRC_k$ ratio, the better the data partition. To determine the optimal number of clusters, maximize $VRC_k$ with respect to $k$. The optimal number of clusters is the solution with the highest Calinski-Harabasz index value.
The Calinski-Harabasz criterion is best suited for k-means clustering solutions with squared Euclidean distances.
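To make the definitions above concrete, the following sketch computes the index by hand for a fixed partition, using only base MATLAB. The data matrix X and the assignment vector labels are made-up toy values, and the code assumes the cluster labels are the integers 1 through k. It illustrates the formula and is not the implementation used by evalclusters.

```matlab
% Toy data: two well-separated groups in 2-D (made-up values for illustration).
X = [0 0; 0 1; 1 0; 1 1; 10 10; 10 11; 11 10; 11 11];
labels = [1; 1; 1; 1; 2; 2; 2; 2];   % assumed cluster assignment, labels 1..k

N = size(X,1);                 % number of observations
k = numel(unique(labels));     % number of clusters
m = mean(X,1);                 % overall mean of the sample data

SSB = 0;                       % overall between-cluster variance
SSW = 0;                       % overall within-cluster variance
for i = 1:k
    Xi = X(labels == i,:);     % observations in cluster i
    ni = size(Xi,1);           % number of observations in cluster i
    mi = mean(Xi,1);           % centroid of cluster i
    SSB = SSB + ni * sum((mi - m).^2);
    D = Xi - repmat(mi,ni,1);  % deviations from the cluster centroid
    SSW = SSW + sum(D(:).^2);
end

VRC = (SSB / SSW) * (N - k) / (k - 1)   % displays 600 for this toy data
```

For this toy data, the between-cluster term is 400 and the within-cluster term is 4, so the ratio is 100 and the scaling factor (N - k)/(k - 1) is 6. Note that the factor k - 1 in the denominator makes the index undefined for a single cluster, which is consistent with the NaN reported for one cluster in the example output.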
 Calinski, T., and J. Harabasz. “A dendrite method for cluster analysis.” Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.