Package: clustering.evaluation
Superclasses: clustering.evaluation.ClusterCriterion
Gap criterion clustering evaluation object
clustering.evaluation.GapEvaluation is an object consisting of sample data, clustering data, and gap criterion values used to evaluate the optimal number of clusters. Create a gap criterion clustering evaluation object using evalclusters.
eva = evalclusters(x,clust,'Gap') creates a gap criterion clustering evaluation object.
eva = evalclusters(x,clust,'Gap',Name,Value) creates a gap criterion clustering evaluation object using additional options specified by one or more name-value pair arguments.

Number of data sets generated from the reference distribution, stored as a positive integer value. 

Clustering algorithm used to cluster the input data, stored
as a valid clustering algorithm name string or function handle. If
the clustering solutions are provided in the input, 

Name of the criterion used for clustering evaluation, stored as a valid criterion name string. 

Criterion values corresponding to each proposed number of clusters
in 

Distance measure used for clustering data, stored as a valid distance measure name string. 

Expectation of the natural logarithm of W based on the generated reference data, stored as a vector of scalar values. W is the within-cluster dispersion computed using the distance measurement 

List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values. 

Natural logarithm of W based on the input data, stored as a vector of scalar values. W is the within-cluster dispersion computed using the distance measurement 

Logical flag for excluded data, stored as a column vector of
logical values. If 

Number of observations in the data matrix 

Optimal number of clusters, stored as a positive integer value. 

Optimal clustering solution corresponding to 

Reference data generation method, stored as a valid reference distribution name string. 

Standard error of the natural logarithm of W with
respect to the reference data for each number of clusters in 

Method for determining the optimal number of clusters, stored as a valid search method name string. 

Standard deviation of the natural logarithm of W with
respect to the reference data for each number of clusters in 

Data used for clustering, stored as a matrix of numerical values. 
increaseB  Increase reference data sets 
addK  Evaluate additional numbers of clusters 
compact  Compact clustering evaluation object 
plot  Plot clustering evaluation object criterion values 
A common graphical approach to cluster evaluation involves plotting an error measurement versus several proposed numbers of clusters, and locating the "elbow" of this plot. The "elbow" occurs at the most dramatic decrease in error measurement. The gap criterion formalizes this approach by estimating the "elbow" location as the number of clusters with the largest gap value. Therefore, under the gap criterion, the optimal number of clusters occurs at the solution with the largest local or global gap value within a tolerance range.
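The selection step can be sketched in Python under stated assumptions. The two rules below, a global one-standard-error rule and a first-local-maximum rule, are assumptions modeled on the one-standard-error rule in Tibshirani et al. [1]; this page does not spell out the exact rules the object applies. The function name select_k, its parameters, and the method labels are hypothetical and are not part of evalclusters. The gap and standard error values are made up for illustration.

```python
def select_k(k_list, gap, se, method="global_max_se"):
    """Pick a number of clusters from gap values and their standard errors.

    k_list is assumed to be sorted in ascending order; gap[i] and se[i]
    belong to k_list[i]. Both rules are illustrative assumptions.
    """
    if method == "global_max_se":
        # Smallest k whose gap is within one standard error of the largest gap.
        i_max = max(range(len(gap)), key=lambda i: gap[i])
        threshold = gap[i_max] - se[i_max]
        for k, g in zip(k_list, gap):
            if g >= threshold:
                return k
    elif method == "first_max_se":
        # Smallest k with gap(k) >= gap(k+1) - se(k+1), i.e. the first local
        # maximum up to a one-standard-error tolerance.
        for i in range(len(gap) - 1):
            if gap[i] >= gap[i + 1] - se[i + 1]:
                return k_list[i]
        # Assumption: fall back to the largest k if no k satisfies the rule.
        return k_list[-1]
    raise ValueError(f"unknown method: {method!r}")


# Made-up values in which the local and global rules disagree.
k_list = [1, 2, 3, 4]
gap = [0.30, 0.28, 0.50, 0.70]
se = [0.05, 0.05, 0.05, 0.05]

k_global = select_k(k_list, gap, se, "global_max_se")
k_first = select_k(k_list, gap, se, "first_max_se")
print(k_global, k_first)
```

With these values the local rule stops at the first k whose gap is not clearly exceeded by the next one, while the global rule returns the smallest k close to the overall maximum, so the two can give different answers on the same gap curve.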
The gap value is defined as
$$\mathrm{Gap}_{n}\left(k\right)={E}_{n}^{*}\left\{\mathrm{log}\left({W}_{k}\right)\right\}-\mathrm{log}\left({W}_{k}\right),$$
where n is the sample size, k is the number of clusters being evaluated, and W_{k} is the pooled within-cluster dispersion measurement
$${W}_{k}={\displaystyle \sum _{r=1}^{k}\frac{1}{2{n}_{r}}{D}_{r},}$$
where n_{r} is the number of data points in cluster r, and D_{r} is the sum of the pairwise distances for all points in cluster r.
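As a minimal Python sketch of the dispersion formula, the function below computes W_{k} from points and cluster labels. It assumes that D_{r} sums the distance over all ordered pairs of points in cluster r, as in Tibshirani et al. [1], so each unordered pair is counted twice; that reading is what makes the 1/(2 n_{r}) factor work out. The names pooled_dispersion and sq_euclidean are hypothetical, and squared Euclidean distance is only one possible choice.

```python
from itertools import product


def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pooled_dispersion(points, labels, dist=sq_euclidean):
    """Return W_k = sum over clusters r of D_r / (2 * n_r).

    D_r is taken over all ordered pairs in cluster r (assumption), so every
    unordered pair contributes twice and self-pairs contribute zero.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    w_k = 0.0
    for members in clusters.values():
        n_r = len(members)
        d_r = sum(dist(a, b) for a, b in product(members, repeat=2))
        w_k += d_r / (2 * n_r)
    return w_k


# Two clusters of two points each.
points = [(0, 0), (2, 0), (10, 0), (10, 4)]
labels = [0, 0, 1, 1]
print(pooled_dispersion(points, labels))  # 10.0
```

With squared Euclidean distance this equals the pooled within-cluster sum of squares around the cluster means: here 2 for the first cluster and 8 for the second.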
The expected value $${E}_{n}^{*}\left\{\mathrm{log}\left({W}_{k}\right)\right\}$$ is determined by Monte Carlo sampling from a reference distribution, and log(W_{k}) is computed from the sample data.
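The combination of these two quantities can be sketched in Python as follows. The sketch assumes that log(W_{k}) has already been computed once on the sample data and once on each of B reference data sets, each clustered into k clusters; generating the reference data and clustering it are left out. The standard error formula, sd * sqrt(1 + 1/B), is the one given in Tibshirani et al. [1] and is an assumption here rather than a statement about how the object computes its values. The function name gap_statistic and the numbers are hypothetical.

```python
import math
from statistics import pstdev


def gap_statistic(log_w, ref_log_w):
    """Return (gap, expected_log_w, se) for one number of clusters.

    log_w      -- log(W_k) computed on the sample data
    ref_log_w  -- log(W_k) values from B reference data sets
    """
    b = len(ref_log_w)
    # Monte Carlo estimate of the expectation of log(W_k) under the reference.
    expected_log_w = sum(ref_log_w) / b
    # Population standard deviation (divides by B) across reference sets.
    sd = pstdev(ref_log_w)
    # Standard error inflated for simulation error, following [1] (assumption).
    se = sd * math.sqrt(1 + 1 / b)
    gap = expected_log_w - log_w
    return gap, expected_log_w, se


# Made-up values: one sample log(W_k) and B = 5 reference values.
gap, expected, se = gap_statistic(2.0, [2.5, 2.7, 2.6, 2.8, 2.4])
print(round(gap, 4), round(expected, 4), round(se, 4))
```

A larger gap means the sample data is clustered more tightly than reference data with no cluster structure would be at the same k.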
The gap value is defined even for clustering solutions that contain only one cluster, and can be used with any distance metric. However, the gap criterion is more computationally expensive than other cluster evaluation criteria, because the clustering algorithm must be applied to the reference data for each proposed clustering solution.
[1] Tibshirani, R., G. Walther, and T. Hastie. "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B. Vol. 63, Part 2, 2001, pp. 411–423.