Gap criterion clustering evaluation object
GapEvaluation is an object consisting of sample data, clustering
data, and gap criterion values used to evaluate the optimal number of clusters. Create a
gap criterion clustering evaluation object using
comma-separated pairs of
the argument name and
Value is the corresponding value.
Name must appear inside quotes. You can specify several name and value
pair arguments in any order as
'KList',[1:5],'Distance','cityblock'specifies to test 1, 2, 3, 4, and 5 clusters using the city block distance metric.
'Distance'— Distance metric
'cityblock'| function | ...
Distance metric used for computing the criterion values, specified
as the comma-separated pair consisting of
'Distance' and one of the following.
|Squared Euclidean distance|
|Sum of absolute differences|
|One minus the cosine of the included angle between points (treated as vectors)|
|One minus the sample correlation between points (treated as sequences of values)|
For detailed information about each distance metric, see
You can also specify a function for the distance metric by using a function handle (MATLAB). The distance function must be of the form
d2 = distfun(XI,XJ),
XIis a 1-by-n vector corresponding to a single row of the input matrix
XJis an m2-by-n matrix corresponding to multiple rows of
distfunmust return an m2-by-1 vector of distances
d2, whose kth element is the distance between
Distance only accepts a function handle if the
clust accepts a function
handle as the distance metric. For example, the
kmeans clustering algorithm does not accept a
function handle as the distance metric. Therefore, if you use the
kmeans algorithm and then specify a function
Distance, the software errors.
evalclusters uses the distance metric
Distance to cluster the
Distance is either
'Euclidean', then the clustering
algorithm uses Euclidean distance and Ward linkage.
Distance is any other metric, then
the clustering algorithm uses the specified distance metric
and average linkage.
In all other cases, the distance metric specified for
Distance must match the distance
metric used in the clustering algorithm to obtain meaningful
Number of data sets generated from the reference distribution, stored as a positive integer value.
Clustering algorithm used to cluster the input data, stored
as a valid clustering algorithm name or function handle. If the clustering
solutions are provided in the input,
Name of the criterion used for clustering evaluation, stored as a valid criterion name.
Criterion values corresponding to each proposed number of clusters
Distance metric used for clustering data, stored as a valid distance metric name.
Expectation of the natural logarithm of W based on the
generated reference data, stored as a vector of scalar values.
W is the within-cluster dispersion computed using the
List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values.
Natural logarithm of W based on the input data, stored
as a vector of scalar values. W is the within-cluster
dispersion computed using the distance metric
Logical flag for excluded data, stored as a column vector of
logical values. If
Number of observations in the data matrix
Optimal number of clusters, stored as a positive integer value.
Optimal clustering solution corresponding to
Reference data generation method, stored as a valid reference distribution name.
Standard error of the natural logarithm of W with
respect to the reference data for each number of clusters in
Method for determining the optimal number of clusters, stored as a valid search method name.
Standard deviation of the natural logarithm of W with
respect to the reference data for each number of clusters in
Data used for clustering, stored as a matrix of numerical values.
|increaseB||Increase reference data sets|
Evaluate the optimal number of clusters using the gap clustering evaluation criterion.
Load the sample data.
The data contains sepal and petal measurements from three species of iris flowers.
Evaluate the number of clusters based on the gap criterion values. Cluster the data using
rng('default'); % For reproducibility eva = evalclusters(meas,'kmeans','gap','KList',[1:6])
eva = GapEvaluation with properties: NumObservations: 150 InspectedK: [1 2 3 4 5 6] CriterionValues: [0.0720 0.5928 0.8762 1.0114 1.0534 1.0720] OptimalK: 5
OptimalK value indicates that, based on the gap criterion, the optimal number of clusters is five.
Plot the gap criterion values for each number of clusters tested.
Based on the plot, the maximum value of the gap criterion occurs at six clusters. However, the value at five clusters is within one standard error of the maximum, so the suggested optimal number of clusters is five.
Create a grouped scatter plot to examine the relationship between petal length and width. Group the data by suggested clusters.
figure PetalLength = meas(:,3); PetalWidth = meas(:,4); ClusterGroup = eva.OptimalY; gscatter(PetalLength,PetalWidth,ClusterGroup,'rbgkc','xod^*');
The plot shows cluster 4 in the lower-left corner, completely separated from the other four clusters. Cluster 4 contains flowers with the smallest petal widths and lengths. Cluster 2 is in the upper-right corner and contains flowers with the largest petal widths and lengths. Cluster 5 is next to cluster 2 and contains flowers with similar petal widths as the flowers in cluster 2, but smaller petal lengths than the flowers in cluster 2. Clusters 1 and 3 are near the center of the plot and contain flowers with measurements between the extremes.
A common graphical approach to cluster evaluation involves plotting an error measurement versus several proposed numbers of clusters, and locating the “elbow” of this plot. The “elbow” occurs at the most dramatic decrease in error measurement. The gap criterion formalizes this approach by estimating the “elbow” location as the number of clusters with the largest gap value. Therefore, under the gap criterion, the optimal number of clusters occurs at the solution with the largest local or global gap value within a tolerance range.
The gap value is defined as
where n is the sample size, k is the number of clusters being evaluated, and Wk is the pooled within-cluster dispersion measurement
where nr is the number of data points in cluster r, and Dr is the sum of the pairwise distances for all points in cluster r.
The expected value is determined by Monte Carlo sampling from a reference
log(Wk) is computed
from the sample data.
The gap value is defined even for clustering solutions that contain only one cluster, and can be used with any distance metric. However, the gap criterion is more computationally expensive than other cluster evaluation criteria, because the clustering algorithm must be applied to the reference data for each proposed clustering solution.
 Tibshirani, R., G. Walther, and T. Hastie. “Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B. Vol. 63, Part 2, 2001, pp. 411–423.