Calinski-Harabasz criterion clustering evaluation object
CalinskiHarabaszEvaluation is an object consisting of sample data, clustering data, and Calinski-Harabasz
criterion values used to evaluate the optimal number of clusters.
Create a Calinski-Harabasz criterion clustering evaluation object using evalclusters.
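eva = evalclusters(x,clust,'CalinskiHarabasz') creates a Calinski-Harabasz criterion clustering evaluation object.
eva = evalclusters(x,clust,'CalinskiHarabasz',Name,Value) creates
a Calinski-Harabasz criterion clustering evaluation object using additional
options specified by one or more name-value pair arguments.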
x— Input data
Input data, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.
clust— Clustering algorithm
'kmeans' | 'linkage' | 'gmdistribution' | matrix of clustering solutions | function handle
Clustering algorithm, specified as one of the following.
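|'kmeans'||Cluster the data in x using the kmeans clustering algorithm.|
|'linkage'||Cluster the data in x using the clusterdata agglomerative clustering algorithm.|
|'gmdistribution'||Cluster the data in x using the gmdistribution Gaussian mixture distribution algorithm.|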
If the criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can specify a clustering algorithm
using a function handle. The function must be of the form
C = clustfun(DATA,K), where DATA is the data to be clustered, and
K is the number of clusters. The output of clustfun must be one of the following:
A vector of integers representing the cluster index for each observation in DATA. There must be K unique values in this vector.
A numeric n-by-K matrix of scores for n observations and K classes. In this case, the cluster index for each observation is determined by taking the largest score value in each row.
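As a minimal sketch of the function-handle form, the following wraps kmeans in an anonymous function that matches C = clustfun(DATA,K) and returns a vector of cluster indices. The name myClust is a placeholder chosen for illustration, and the fisheriris sample data (which provides meas) and the kmeans options shown are assumptions rather than requirements.

```matlab
load fisheriris;            % provides meas, a 150-by-4 numeric matrix
rng('default');             % for reproducibility of the kmeans starts

% Hypothetical wrapper: returns a vector of cluster indices for K clusters.
myClust = @(DATA,K) kmeans(DATA,K,'EmptyAction','singleton','Replicates',5);

% KList is required because clust is a function handle.
eva = evalclusters(meas,myClust,'CalinskiHarabasz','KList',1:6);
disp(eva.OptimalK)
```

Any function with the same two-input, one-output shape could be substituted for the kmeans call.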
If the criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can also specify
clust as an n-by-K matrix containing the
proposed clustering solutions. n is the number
of observations in the sample data, and K is the
number of proposed clustering solutions. Column j contains
the cluster indices for each of the n points in
the jth clustering solution.
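The matrix form can be sketched as follows, under the assumption that each candidate solution is produced by kmeans on the fisheriris sample data. The variable names and kmeans options are illustrative only.

```matlab
load fisheriris;            % provides meas, a 150-by-4 numeric matrix
rng('default');             % for reproducibility

% Column j holds the cluster indices of the jth proposed solution.
clust = zeros(size(meas,1),6);
for j = 1:6
    clust(:,j) = kmeans(meas,j,'EmptyAction','singleton','Replicates',5);
end

% KList is not passed here, because clust already contains the solutions.
eva = evalclusters(meas,clust,'CalinskiHarabasz');
```

This form is useful when the clustering was computed elsewhere and only the evaluation step is needed.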
Specify optional comma-separated pairs of Name,Value arguments.
Name is the argument name and
Value is the corresponding value.
Name must appear inside single quotes (' ').
You can specify several name and value pair
arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'KList',[1:6] specifies to test 1, 2, 3, 4, 5, and 6 clusters to find the optimal number.
'KList'— List of number of clusters to evaluate
List of number of clusters to evaluate, specified as the comma-separated
pair consisting of 'KList' and a vector of positive
integer values. You must specify KList when clust is
a clustering algorithm name or a function handle. When criterion is 'gap', clust must
be a character vector or a function handle, and you must specify KList.
Clustering algorithm used to cluster the input data, stored
as a valid clustering algorithm name or function handle. If the clustering
solutions are provided in the input, ClusteringFunction is empty.
Name of the criterion used for clustering evaluation, stored as a valid criterion name.
Criterion values corresponding to each proposed number of clusters in InspectedK, stored as a vector of numerical values.
Distance measure used for clustering data, stored as a valid distance measure name.
List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values.
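Logical flag for excluded data, stored as a column vector of
logical values. If Missing equals true, then the corresponding value in the data matrix x is not used in the clustering solution.
Number of observations in the data matrix X, minus the number of missing (NaN) values in X.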
Optimal number of clusters, stored as a positive integer value.
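Optimal clustering solution corresponding to OptimalK, stored as a column vector of positive integer values. If the clustering solutions are provided in the input, OptimalY is empty.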
Data used for clustering, stored as a matrix of numerical values.
|addK||Evaluate additional numbers of clusters|
|compact||Compact clustering evaluation object|
|plot||Plot clustering evaluation object criterion values|
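A brief sketch of how these methods might be used together is shown below. It assumes the fisheriris sample data and an initial evaluation over one to three clusters, and the variable names are illustrative.

```matlab
load fisheriris;            % provides meas, a 150-by-4 numeric matrix
rng('default');             % for reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:3);

eva = addK(eva,4:6);        % evaluate additional numbers of clusters
c   = compact(eva);         % compact version of the evaluation object
figure;
plot(eva);                  % criterion values versus number of clusters
```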
The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as
$$VRC_k = \frac{SS_B}{SS_W} \times \frac{N-k}{k-1},$$
where $SS_B$ is the overall between-cluster variance, $SS_W$ is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
The overall between-cluster variance $SS_B$ is defined as
$$SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2,$$
where k is the number of clusters, $n_i$ is the number of observations in cluster i, $m_i$ is the centroid of cluster i, m is the overall mean of the sample data, and $\lVert m_i - m \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
The overall within-cluster variance $SS_W$ is defined as
$$SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2,$$
where k is the number of clusters, x is a data point, $c_i$ is the ith cluster, $m_i$ is the centroid of cluster i, and $\lVert x - m_i \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
Well-defined clusters have a large between-cluster variance ($SS_B$) and a small within-cluster variance ($SS_W$). The larger the $VRC_k$ ratio, the better the data partition. To determine the optimal number of clusters, maximize $VRC_k$ with respect to k. The optimal number of clusters is the solution with the highest Calinski-Harabasz index value.
The Calinski-Harabasz criterion is best suited for k-means clustering solutions with squared Euclidean distances.
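The definitions above can be checked by hand on a small data set. The sketch below is illustrative only: the toy data X and the labels idx are made up, the labels are assumed to run from 1 to k, and the between-cluster term weights each centroid's squared distance by the number of observations in that cluster.

```matlab
% Toy data: two tight, well-separated groups of two points each.
X   = [0 0; 0 2; 10 0; 10 2];
idx = [1; 1; 2; 2];              % assumed cluster index for each row of X

N = size(X,1);                   % number of observations
k = numel(unique(idx));          % number of clusters
m = mean(X,1);                   % overall mean of the sample data

SSB = 0;
SSW = 0;
for i = 1:k
    Xi  = X(idx == i,:);         % observations in cluster i
    ni  = size(Xi,1);            % number of observations in cluster i
    mi  = mean(Xi,1);            % centroid of cluster i
    SSB = SSB + ni * sum((mi - m).^2);
    SSW = SSW + sum(sum(bsxfun(@minus,Xi,mi).^2));
end

VRC = (SSB / SSW) * ((N - k) / (k - 1));
fprintf('SSB = %g, SSW = %g, VRC = %g\n', SSB, SSW, VRC);
% prints: SSB = 100, SSW = 4, VRC = 50
```

Here each centroid lies 5 units from the overall mean, so the between-cluster term is 2(25) + 2(25) = 100, while every point lies 1 unit from its own centroid, giving a within-cluster term of 4.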
Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.
Load the sample data.
load fisheriris;
The data contains length and width measurements from the sepals and petals of three species of iris flowers.
Evaluate the optimal number of clusters using the Calinski-Harabasz
criterion. Cluster the data using kmeans.
rng('default'); % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',[1:6])
eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [1x6 double]
           OptimalK: 3

The OptimalK value indicates that, based
on the Calinski-Harabasz criterion, the optimal number of clusters
is three.
Plot the Calinski-Harabasz criterion values for each number of clusters tested.
figure; plot(eva);
The plot shows that the highest Calinski-Harabasz value occurs at three clusters, suggesting that the optimal number of clusters is three.
Create a grouped scatter plot to examine the relationship between petal length and width. Group the data by suggested clusters.
PetalLength = meas(:,3);
PetalWidth = meas(:,4);
ClusterGroup = eva.OptimalY;
figure;
gscatter(PetalLength,PetalWidth,ClusterGroup,'rbg','xod');
The plot shows cluster 1 in the lower-left corner, completely separated from the other two clusters. Cluster 1 contains flowers with the smallest petal widths and lengths. Cluster 3 is in the upper-right corner, and contains flowers with the largest petal widths and lengths. Cluster 2 is near the center of the plot, and contains flowers with measurements between these two extremes.
 Calinski, T., and J. Harabasz. "A dendrite method for cluster analysis." Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.