SilhouetteEvaluation

Silhouette criterion clustering evaluation object

Description

SilhouetteEvaluation is an object consisting of sample data (X), clustering data (OptimalY), and silhouette criterion values (CriterionValues) used to evaluate the optimal number of data clusters (OptimalK). The silhouette value for each point (observation in X) is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. For more information, see Silhouette Value and Criterion.

Creation

Create a silhouette criterion clustering evaluation object by using the evalclusters function and specifying the criterion as "silhouette".

You can then use compact to create a compact version of the silhouette criterion clustering evaluation object. The function removes the contents of the properties X, OptimalY, and Missing.
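
The following is a minimal sketch of both creation steps, assuming hypothetical two-group sample data generated with randn (the data and variable names are illustrative, not part of the original page).

```matlab
rng("default")                       % for reproducibility
X = [randn(50,2); randn(50,2) + 4];  % hypothetical two-group sample data

% Create the evaluation object with the silhouette criterion.
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:4);

% Create a compact copy. Per the description above, the properties
% X, OptimalY, and Missing should be empty in the compact object.
compactEvaluation = compact(evaluation);
disp(isempty(compactEvaluation.X))
disp(isempty(compactEvaluation.OptimalY))
disp(isempty(compactEvaluation.Missing))
```

The full object keeps the sample data, so a compact copy can be useful when only the criterion values and OptimalK are needed.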

Properties

Clustering Evaluation Properties

`ClusteringFunction` — Clustering algorithm
`'kmeans'` | `'linkage'` | `'gmdistribution'` | function handle | `[]`

This property is read-only.

Clustering algorithm used to cluster the sample data, returned as 'kmeans', 'linkage', 'gmdistribution', or a function handle. If you specify the clustering solutions as an input argument to evalclusters when you create the clustering evaluation object, then ClusteringFunction is empty.

Value	Description
`'kmeans'`	Cluster the data in `X` using the `kmeans` clustering algorithm, with `EmptyAction` set to `"singleton"` and `Replicates` set to `5`.
`'linkage'`	Cluster the data in `X` using the `clusterdata` agglomerative clustering algorithm, with `Linkage` set to `"ward"`.
`'gmdistribution'`	Cluster the data in `X` using the `gmdistribution` Gaussian mixture distribution algorithm, with `SharedCov` set to `true` and `Replicates` set to `5`.

Data Types: double | char | function_handle
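
As an illustration of the empty case, the sketch below passes precomputed clustering solutions to evalclusters instead of a clustering algorithm. The sample data, the variable names, and the choice of 2 to 4 clusters are assumptions made for this example.

```matlab
rng("default")                        % for reproducibility
X = [randn(40,2); randn(40,2) + 5];   % hypothetical sample data

% Precompute one clustering solution per column for 2, 3, and 4 clusters.
kValues = 2:4;
clust = zeros(size(X,1),numel(kValues));
for j = 1:numel(kValues)
    clust(:,j) = kmeans(X,kValues(j),"Replicates",5);
end

% Evaluate the supplied solutions rather than letting evalclusters cluster X.
evaluation = evalclusters(X,clust,"silhouette");
disp(isempty(evaluation.ClusteringFunction))
```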

`ClusterPriors` — Prior probabilities for each cluster
`'empirical'` | `'equal'`

This property is read-only.

Prior probabilities for each cluster, returned as 'empirical' or 'equal'.

Value	Description
`'empirical'`	Compute the silhouette criterion value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the criterion value proportionally based on its size.
`'equal'`	Compute the silhouette criterion value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Regardless of its size, each cluster contributes equally to the criterion value.
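
A short sketch of how the two settings might be compared, assuming deliberately unbalanced hypothetical data and the ClusterPriors name-value argument of evalclusters. Because kmeans is rerun for each call, the two objects need not be based on identical clustering solutions.

```matlab
rng("default")                          % for reproducibility
X = [randn(150,2); randn(30,2) + 4];    % hypothetical unbalanced groups

evalEmpirical = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "ClusterPriors","empirical");
evalEqual = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "ClusterPriors","equal");

% Show the stored setting and the criterion values side by side.
disp(evalEmpirical.ClusterPriors)
disp(evalEqual.ClusterPriors)
disp([evalEmpirical.CriterionValues; evalEqual.CriterionValues])
```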

`ClusterSilhouettes` — Average silhouette values
cell array of numeric vectors

This property is read-only.

Average silhouette values corresponding to each proposed number of clusters in InspectedK, returned as a cell array of numeric vectors. For each proposed number of clusters k, the vector ClusterSilhouettes{k} contains the average silhouette value for each cluster.

For example, suppose evaluation is a silhouette criterion clustering evaluation object and evaluation.InspectedK is 1:5. Then, evaluation.ClusterSilhouettes{4}(3) is the average silhouette value for the points in the third cluster of the clustering solution with four total clusters.

Data Types: cell
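
The indexing described above can be sketched as follows. The three-group sample data is hypothetical, and InspectedK is chosen as 1:5 so that the cell index matches the number of clusters.

```matlab
rng("default")                                             % for reproducibility
X = [randn(40,2); randn(40,2) + 5; randn(40,2) + [5 -5]];  % hypothetical data

evaluation = evalclusters(X,"kmeans","silhouette","KList",1:5);

% Per-cluster average silhouette values for the three-cluster solution.
perCluster = evaluation.ClusterSilhouettes{3};
disp(perCluster)

% Average silhouette value for the second cluster of that solution.
disp(evaluation.ClusterSilhouettes{3}(2))
```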

`CriterionName` — Name of criterion
`'Silhouette'`

This property is read-only.

Name of the criterion used for clustering evaluation, returned as 'Silhouette'.

`CriterionValues` — Criterion values
numeric vector

This property is read-only.

Criterion values, returned as a numeric vector. Each value corresponds to a proposed number of clusters in InspectedK.

Data Types: double

`Distance` — Distance metric
`'sqEuclidean'` | `'Euclidean'` | `'cityblock'` | function handle | numeric vector | ...

This property is read-only.

Distance metric used for clustering data and computing the criterion values, returned as one of the values in this table, a function handle, or a numeric vector returned by the function pdist.

Value	Description
`'sqEuclidean'`	Squared Euclidean distance
`'Euclidean'`	Euclidean distance
`'cityblock'`	Sum of absolute differences
`'cosine'`	One minus the cosine of the included angle between points (treated as vectors)
`'correlation'`	One minus the sample correlation between points (treated as sequences of values)
`'Hamming'`	Percentage of coordinates that differ
`'Jaccard'`	Percentage of nonzero coordinates that differ

Data Types: single | double | char | function_handle
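
A minimal sketch of selecting a non-default metric, assuming the Distance name-value argument of evalclusters and hypothetical sample data. The chosen metric must also be one that the clustering algorithm accepts, which is the case for kmeans with 'cityblock'.

```matlab
rng("default")                          % for reproducibility
X = [randn(40,2); randn(40,2) + 5];     % hypothetical sample data

% Cluster and score using the sum of absolute differences.
evaluation = evalclusters(X,"kmeans","silhouette","KList",2:5, ...
    "Distance","cityblock");
disp(evaluation.Distance)
```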

`InspectedK` — List of number of proposed clusters
positive integer vector

This property is read-only.

List of the number of proposed clusters for which to compute criterion values, returned as a positive integer vector.

Data Types: double

`OptimalK` — Optimal number of clusters
positive integer scalar

This property is read-only.

Optimal number of clusters, returned as a positive integer scalar.

Data Types: double
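
The relationship between CriterionValues, InspectedK, and OptimalK can be sketched as below, under the assumption (stated in Silhouette Value and Criterion) that the optimal solution is the one with the highest criterion value. The sample data is hypothetical.

```matlab
rng("default")                                             % for reproducibility
X = [randn(40,2); randn(40,2) + 5; randn(40,2) + [5 -5]];  % hypothetical data

evaluation = evalclusters(X,"kmeans","silhouette","KList",1:6);

% max skips NaN entries, such as the value reported for one cluster.
[bestValue,bestIndex] = max(evaluation.CriterionValues);
kBest = evaluation.InspectedK(bestIndex);

disp(kBest == evaluation.OptimalK)
```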

`OptimalY` — Optimal clustering solution
positive integer column vector | `[]`

This property is read-only.

Optimal clustering solution corresponding to OptimalK, returned as a positive integer column vector. Each row of OptimalY represents the cluster index of the corresponding observation (or row) in X. If you specify the clustering solutions as an input argument to evalclusters when you create the clustering evaluation object, or if the clustering evaluation object is compact (see compact), then OptimalY is empty.

Data Types: double

Sample Data Properties

`Missing` — Excluded data
logical column vector | `[]`

This property is read-only.

Excluded data, returned as a logical column vector. If an element of Missing is true, then the corresponding observation (or row) in the data matrix X is not used in the clustering solutions. If the clustering evaluation object is compact (see compact), then Missing is empty.

Data Types: double | logical

`NumObservations` — Number of observations
positive integer scalar

This property is read-only.

Number of observations in the data matrix X, ignoring observations with missing (NaN) values, returned as a positive integer scalar.

Data Types: double
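
To see how Missing and NumObservations relate, the sketch below injects NaN values into two rows of hypothetical sample data. Based on the property descriptions above, those two rows should be flagged in Missing and left out of the observation count.

```matlab
rng("default")                          % for reproducibility
X = [randn(30,2); randn(30,2) + 5];     % hypothetical sample data, 60 rows
X(4,1) = NaN;                           % inject a missing value in row 4
X(50,2) = NaN;                          % and another in row 50

evaluation = evalclusters(X,"kmeans","silhouette","KList",2:3);

disp(find(evaluation.Missing)')         % rows excluded from clustering
disp(evaluation.NumObservations)        % rows actually used
```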

`X` — Data used for clustering
numeric matrix | `[]`

This property is read-only.

Data used for clustering, returned as a numeric matrix. Rows correspond to observations, and columns correspond to variables. If the clustering evaluation object is compact (see compact), then X is empty.

Data Types: single | double

Object Functions

`addK`	Evaluate additional numbers of clusters
`compact`	Compact clustering evaluation object
`plot`	Plot clustering evaluation object criterion values
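
As a brief illustration of addK, the sketch below first evaluates one to three clusters and then extends the same object to four through six clusters. The sample data is hypothetical.

```matlab
rng("default")                                             % for reproducibility
X = [randn(40,2); randn(40,2) + 5; randn(40,2) + [5 -5]];  % hypothetical data

evaluation = evalclusters(X,"kmeans","silhouette","KList",1:3);

% Evaluate additional numbers of clusters without starting over.
evaluation = addK(evaluation,4:6);
disp(evaluation.InspectedK)
```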

Examples

Evaluate Clustering Solution Using Silhouette Criterion

Evaluate the optimal number of clusters using the silhouette clustering evaluation criterion.

Generate sample data containing random numbers from three multivariate distributions with different parameter values.

rng("default") % For reproducibility
n = 200;

mu1 = [2 2];
sigma1 = [0.9 -0.0255; -0.0255 0.9];

mu2 = [5 5];
sigma2 = [0.5 0; 0 0.3];

mu3 = [-2 -2];
sigma3 = [1 0; 0 0.9];

X = [mvnrnd(mu1,sigma1,n); ...
     mvnrnd(mu2,sigma2,n); ...
     mvnrnd(mu3,sigma3,n)];

Evaluate the optimal number of clusters using the silhouette criterion. Cluster the data using kmeans.

evaluation = evalclusters(X,"kmeans","silhouette","KList",1:6)

evaluation = 
  SilhouetteEvaluation with properties:

    NumObservations: 600
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 0.8055 0.8551 0.7155 0.6071 0.6232]
           OptimalK: 3

The OptimalK value indicates that, based on the silhouette criterion, the optimal number of clusters is three.

Plot the silhouette criterion values for each number of clusters tested.

plot(evaluation)

The plot shows that the highest silhouette value occurs at three clusters, suggesting that the optimal number of clusters is three.

Create a grouped scatter plot to visually examine the suggested clusters.

clusters = evaluation.OptimalY;
gscatter(X(:,1),X(:,2),clusters,[],"xod")

The plot shows three distinct clusters within the data: cluster 1 in the lower-left corner, cluster 2 in the upper-right corner, and cluster 3 near the center of the plot.

More About

Silhouette Value and Criterion

The silhouette value for each point is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters.

The silhouette value s_i for the ith point is defined as

$s_{i} = \frac{(b_{i} - a_{i})}{\max (a_{i}, b_{i})},$

where a_i is the average distance from the ith point to the other points in the same cluster as i, and b_i is the smallest, over all other clusters, of the average distance from the ith point to the points in that cluster. If the ith point is the only point in its cluster, then the silhouette value s_i is set to 1.
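
The definition can be checked numerically with a sketch like the one below. It is an illustrative reimplementation under stated assumptions, not the software's internal code: it uses hypothetical sample data, squared Euclidean distances from pdist2, and the silhouette function (whose default distance is squared Euclidean) as the reference.

```matlab
rng("default")                                             % for reproducibility
X = [randn(20,2); randn(20,2) + 5; randn(20,2) + [5 -5]];  % hypothetical data
idx = kmeans(X,3,"Replicates",5);                          % cluster indices

D = pdist2(X,X,"squaredeuclidean");   % pairwise squared Euclidean distances
n = size(X,1);
numClusters = max(idx);
s = zeros(n,1);

for i = 1:n
    sameCluster = (idx == idx(i));
    sameCluster(i) = false;                   % exclude the point itself
    if ~any(sameCluster)
        s(i) = 1;                             % singleton convention described above
        continue
    end
    a = mean(D(i,sameCluster));               % average distance within own cluster
    b = Inf;
    for k = 1:numClusters
        if k ~= idx(i)
            b = min(b,mean(D(i,idx == k)));   % closest other cluster on average
        end
    end
    s(i) = (b - a)/max(a,b);
end

sReference = silhouette(X,idx);               % reference values, no plot requested
maxDifference = max(abs(s - sReference));
disp(maxDifference)
```

The two sets of values should agree up to floating-point rounding.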

The silhouette values range from –1 to 1. A high silhouette value indicates that the point is well matched to its own cluster, and poorly matched to other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. You can use silhouette values as a clustering evaluation criterion with any distance metric.

The ClusterPriors value determines the silhouette criterion computation. If the value is 'empirical', then the software computes the silhouette criterion value for a clustering solution by averaging the silhouette values for all points. Each cluster contributes to the criterion value proportionally based on its size. If the ClusterPriors value is 'equal', then the software computes the silhouette criterion value for a clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Regardless of its size, each cluster contributes equally to the criterion value. The optimal number of clusters corresponds to the solution with the highest silhouette criterion value.
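
The two averaging schemes can be sketched by hand from per-point silhouette values. This example assumes hypothetical unbalanced data and a single kmeans solution; it illustrates the arithmetic described above and is not the software's internal implementation.

```matlab
rng("default")                          % for reproducibility
X = [randn(150,2); randn(30,2) + 4];    % hypothetical unbalanced groups
idx = kmeans(X,2,"Replicates",5);

s = silhouette(X,idx);                  % per-point silhouette values, no plot

% 'empirical': average over all points, so large clusters weigh more.
empiricalValue = mean(s);

% 'equal': average within each cluster, then across clusters.
clusterMeans = accumarray(idx,s,[],@mean);
equalValue = mean(clusterMeans);

fprintf("empirical: %.4f, equal: %.4f\n",empiricalValue,equalValue)
```

When the clusters have the same size, the two values coincide; the more unbalanced the clusters, the more the values can differ.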

References

[1] Kaufman, L., and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: John Wiley & Sons, Inc., 1990.

[2] Rousseeuw, P. J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” Journal of Computational and Applied Mathematics. Vol. 20, No. 1, 1987, pp. 53–65.

Version History

Introduced in R2013b
