Package: clustering.evaluation
Superclasses: clustering.evaluation.ClusterCriterion
Calinski-Harabasz criterion clustering evaluation object
clustering.evaluation.CalinskiHarabaszEvaluation is an object consisting of sample data, clustering data, and Calinski-Harabasz criterion values used to evaluate the optimal number of clusters. Create a Calinski-Harabasz criterion clustering evaluation object using evalclusters.
eva = evalclusters(x,clust,'CalinskiHarabasz') creates a Calinski-Harabasz criterion clustering evaluation object.
eva = evalclusters(x,clust,'CalinskiHarabasz',Name,Value) creates a Calinski-Harabasz criterion clustering evaluation object using additional options specified by one or more name-value pair arguments.
addK | Evaluate additional numbers of clusters
compact | Compact clustering evaluation object
plot | Plot clustering evaluation object criterion values
The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as
$$VR{C}_{k}=\frac{S{S}_{B}}{S{S}_{W}}\times \frac{\left(N-k\right)}{\left(k-1\right)},$$
where SS_{B} is the overall between-cluster variance, SS_{W} is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
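As a worked instance of the formula, the following plain-Python sketch evaluates VRC_{k} from values of SS_{B}, SS_{W}, N, and k that are already known. The function name `variance_ratio` and the input numbers are illustrative assumptions, not part of the evaluation object; the sketch only restates the arithmetic of the definition.

```python
def variance_ratio(ss_b, ss_w, n, k):
    """Return VRC_k = (SS_B / SS_W) * (N - k) / (k - 1).

    Hypothetical helper for illustration; the ratio is undefined for
    k < 2 (division by k - 1) and for SS_W == 0.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if ss_w == 0:
        raise ValueError("within-cluster variance must be nonzero")
    return (ss_b / ss_w) * (n - k) / (k - 1)


# Illustrative inputs: SS_B = 16, SS_W = 4, N = 4 observations, k = 2 clusters.
vrc = variance_ratio(ss_b=16.0, ss_w=4.0, n=4, k=2)
print(vrc)  # (16 / 4) * (4 - 2) / (2 - 1) = 8.0
```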
The overall between-cluster variance SS_{B} is defined as
$$S{S}_{B}={\displaystyle \sum _{i=1}^{k}{n}_{i}{\Vert {m}_{i}-m\Vert}^{2}},$$
where k is the number of clusters, n_{i} is the number of observations in cluster i, m_{i} is the centroid of cluster i, m is the overall mean of the sample data, and $$\Vert {m}_{i}-m\Vert $$ is the L^{2} norm (Euclidean distance) between the two vectors.
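The definition of SS_{B} can be sketched in plain Python as follows. The helper names `centroid` and `ss_between` and the four-point toy data set are assumptions made for illustration; the code is a direct transcription of the sum above, not the toolbox implementation.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ss_between(points, labels):
    """Sum over clusters of n_i * ||m_i - m||^2."""
    overall = centroid(points)  # m, the overall mean of the sample data
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        m_i = centroid(members)  # centroid of cluster i
        sq_dist = sum((a - b) ** 2 for a, b in zip(m_i, overall))
        total += len(members) * sq_dist  # n_i * squared Euclidean distance
    return total


# Toy data: two clusters of two points each (illustrative only).
points = [(0, 0), (2, 0), (0, 4), (2, 4)]
labels = [0, 0, 1, 1]
ssb = ss_between(points, labels)
print(ssb)  # 2 * 4 + 2 * 4 = 16.0
```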
The overall within-cluster variance SS_{W} is defined as
$$S{S}_{W}={\displaystyle \sum _{i=1}^{k}{\displaystyle \sum _{x\in {c}_{i}}{\Vert x-{m}_{i}\Vert}^{2}}},$$
where k is the number of clusters, x is a data point, c_{i} is the ith cluster, m_{i} is the centroid of cluster i, and $$\Vert x-{m}_{i}\Vert $$ is the L^{2} norm (Euclidean distance) between the two vectors.
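A matching plain-Python sketch of SS_{W} is shown below. The function name `ss_within` and the toy data are illustrative assumptions; the code sums the squared Euclidean distance from each point to the centroid of its own cluster.

```python
def ss_within(points, labels):
    """Sum over clusters i and points x in c_i of ||x - m_i||^2."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        n_i = len(members)
        m_i = [sum(col) / n_i for col in zip(*members)]  # centroid of cluster i
        for x in members:
            total += sum((a - b) ** 2 for a, b in zip(x, m_i))
    return total


# Toy data: every point lies at distance 1 from its cluster centroid.
points = [(0, 0), (2, 0), (0, 4), (2, 4)]
labels = [0, 0, 1, 1]
ssw = ss_within(points, labels)
print(ssw)  # 4 points * 1^2 = 4.0
```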
Well-defined clusters have a large between-cluster variance (SS_{B}) and a small within-cluster variance (SS_{W}). The larger the VRC_{k} ratio, the better the data partition. To determine the optimal number of clusters, maximize VRC_{k} with respect to k: the optimal number of clusters is the value of k whose clustering solution has the highest Calinski-Harabasz index value.
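The selection rule can be illustrated with a small plain-Python sketch on one-dimensional data, where the squared L^{2} norm reduces to a squared difference. The function name `calinski_harabasz_1d`, the six data values, and the hand-written candidate labelings (standing in for the output of a clustering algorithm at each k) are all assumptions for illustration.

```python
def calinski_harabasz_1d(values, labels):
    """VRC_k for scalar data: (SS_B / SS_W) * (N - k) / (k - 1)."""
    n = len(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    k = len(groups)
    overall = sum(values) / n
    ss_b = 0.0
    ss_w = 0.0
    for members in groups.values():
        m_i = sum(members) / len(members)
        ss_b += len(members) * (m_i - overall) ** 2
        ss_w += sum((x - m_i) ** 2 for x in members)
    return (ss_b / ss_w) * (n - k) / (k - 1)


# Three visibly separated pairs of points (illustrative data).
values = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

# Hypothetical labelings for k = 2, 3, 4, written by hand for this sketch.
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

scores = {k: calinski_harabasz_1d(values, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)  # k with the highest index value
print(best_k)  # 3
```

In this toy case the index is roughly 11.8 for k = 2, 400 for k = 3, and 267 for k = 4, so k = 3 is selected, matching the three visible groups.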
The Calinski-Harabasz criterion is best suited for k-means clustering solutions with squared Euclidean distances.