Package: clustering.evaluation
Superclasses: clustering.evaluation.ClusterCriterion
Gap criterion clustering evaluation object
clustering.evaluation.GapEvaluation is an object consisting of sample data, clustering data, and gap criterion values used to evaluate the optimal number of clusters. Create a gap criterion clustering evaluation object using evalclusters.
eva = evalclusters(x,clust,'Gap') creates a gap criterion clustering evaluation object.
eva = evalclusters(x,clust,'Gap',Name,Value) creates a gap criterion clustering evaluation object using additional options specified by one or more name-value pair arguments.
increaseB | Increase reference data sets
addK | Evaluate additional numbers of clusters
compact | Compact clustering evaluation object
plot | Plot clustering evaluation object criterion values
A common graphical approach to cluster evaluation involves plotting an error measurement versus several proposed numbers of clusters, and locating the "elbow" of this plot. The "elbow" occurs at the most dramatic decrease in error measurement. The gap criterion formalizes this approach by estimating the "elbow" location as the number of clusters with the largest gap value. Therefore, under the gap criterion, the optimal number of clusters occurs at the solution with the largest local or global gap value within a tolerance range.
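The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not the toolbox's implementation: the function name `optimal_k` and its arguments are hypothetical, and the sketch assumes that the "tolerance range" is one standard error of the largest gap value, which the text above does not state. It returns the smallest number of clusters whose gap value falls within that tolerance of the global maximum.

```python
def optimal_k(k_values, gaps, std_errs):
    """Return the smallest k whose gap value lies within one standard
    error of the largest gap value (assumed tolerance, see lead-in)."""
    # Index of the global maximum gap value.
    best = max(range(len(gaps)), key=gaps.__getitem__)
    # Assumed tolerance: one standard error of the maximum gap value.
    threshold = gaps[best] - std_errs[best]
    # k_values is assumed to be sorted in increasing order.
    for k, gap in zip(k_values, gaps):
        if gap >= threshold:
            return k


# Made-up gap values for k = 1..4, for illustration only.
print(optimal_k([1, 2, 3, 4], [0.1, 0.9, 0.95, 0.7], [0.05, 0.05, 0.1, 0.05]))  # 2
```

In this made-up example the largest gap value occurs at k = 3, but k = 2 is already within the tolerance, so the smaller solution is preferred.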
The gap value is defined as
$$\mathrm{Gap}_n(k) = E_n^*\{\log(W_k)\} - \log(W_k),$$
where n is the sample size, k is the number of clusters being evaluated, and W_k is the pooled within-cluster dispersion measurement
$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,$$
where n_r is the number of data points in cluster r, and D_r is the sum of the pairwise distances for all points in cluster r.
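As a rough illustration of the dispersion formula, the following Python sketch computes W_k for a given assignment of points to clusters. It is not the toolbox's code: the name `pooled_dispersion` is hypothetical, the default distance is Euclidean (any other metric can be passed in), and the sketch assumes that D_r sums over ordered pairs, so each unordered pair of points is counted twice, which is what the factor 1/(2 n_r) compensates for.

```python
from collections import defaultdict
from math import dist


def pooled_dispersion(points, labels, distance=dist):
    """Return W_k = sum over clusters r of D_r / (2 * n_r).

    Assumption: D_r sums the distance over all ordered pairs of points
    in cluster r, so each unordered pair contributes twice.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    w_k = 0.0
    for members in clusters.values():
        n_r = len(members)
        # D_r: sum of pairwise distances over ordered pairs (p, q).
        d_r = sum(distance(p, q) for p in members for q in members)
        w_k += d_r / (2 * n_r)
    return w_k


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
print(pooled_dispersion(points, [0, 0, 1, 1]))  # 3.0
```

Here the first cluster contributes (2 + 2) / (2 * 2) = 1 and the second contributes (4 + 4) / (2 * 2) = 2.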
The expected value $$E_n^*\{\log(W_k)\}$$ is determined by Monte Carlo sampling from a reference distribution, and log(W_k) is computed from the sample data.
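The Monte Carlo estimate can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the toolbox's implementation: all function names are hypothetical, the reference distribution is assumed to be uniform over the bounding box of the data, the distance is Euclidean, and a small Lloyd's k-means with farthest-point initialization stands in for whatever clustering algorithm is being evaluated.

```python
import math
import random


def log_dispersion(points, labels):
    """log(W_k), with W_k = sum_r D_r / (2 n_r) over ordered pairs."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    w_k = sum(
        sum(math.dist(p, q) for p in c for q in c) / (2 * len(c))
        for c in clusters.values()
    )
    return math.log(w_k)  # requires W_k > 0


def kmeans_labels(points, k, iters=20):
    """Stand-in clusterer (assumption): Lloyd's k-means, farthest-point init."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def gap_value(points, k, n_ref=20, seed=0):
    """Estimate Gap_n(k) = E*_n{log(W_k)} - log(W_k) by Monte Carlo."""
    rng = random.Random(seed)
    log_w = log_dispersion(points, kmeans_labels(points, k))
    # Assumed reference distribution: uniform over the data's bounding box.
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    ref_logs = []
    for _ in range(n_ref):
        ref = [
            tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points
        ]
        # The clustering algorithm is re-run on every reference data set.
        ref_logs.append(log_dispersion(ref, kmeans_labels(ref, k)))
    return sum(ref_logs) / n_ref - log_w


blob_a = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.25, 0.25)]
blob_b = [(x + 10.0, y + 10.0) for x, y in blob_a]
for k in (1, 2, 3):
    print(k, gap_value(blob_a + blob_b, k))
```

Each call clusters `n_ref` reference data sets in addition to the sample data, so the cost grows with both the number of reference sets and the number of candidate values of k.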
The gap value is defined even for clustering solutions that contain only one cluster, and can be used with any distance metric. However, the gap criterion is more computationally expensive than other cluster evaluation criteria, because the clustering algorithm must be applied to the reference data for each proposed clustering solution.