# SilhouetteEvaluation

Silhouette criterion clustering evaluation object

## Description

`SilhouetteEvaluation` is an object consisting of sample data (`X`), clustering data (`OptimalY`), and silhouette criterion values (`CriterionValues`) used to evaluate the optimal number of data clusters (`OptimalK`). The silhouette value for each point (observation in `X`) is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. For more information, see Silhouette Value and Criterion.

## Creation

Create a silhouette criterion clustering evaluation object by using the `evalclusters` function and specifying the criterion as `"silhouette"`.

You can then use `compact` to create a compact version of the silhouette criterion clustering evaluation object. The function removes the contents of the properties `X`, `OptimalY`, and `Missing`.
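The creation steps can be sketched as follows. This is a minimal illustration, not a definitive workflow: the data in `X` is hypothetical (two synthetic groups), and the variable names `evaluation` and `compactEvaluation` are chosen only for this example.

```matlab
% Hypothetical data: two well-separated 2-D groups (illustration only).
rng("default")
X = [randn(50,2); randn(50,2) + 6];

% Create the evaluation object with the silhouette criterion.
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:4);

% Create a compact version of the object.
compactEvaluation = compact(evaluation);

% X, OptimalY, and Missing are expected to be empty in the compact object.
isempty(compactEvaluation.X)
isempty(compactEvaluation.OptimalY)
isempty(compactEvaluation.Missing)
```

A compact object is useful when you need only the criterion values and the selected number of clusters, not the stored data.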

## Properties

### Clustering Evaluation Properties

`ClusteringFunction` — Clustering algorithm

`'kmeans'` | `'linkage'` | `'gmdistribution'` | function handle | `[]`

This property is read-only.

Clustering algorithm used to cluster the sample data, returned as `'kmeans'`, `'linkage'`, `'gmdistribution'`, or a function handle. If you specify the clustering solutions as an input argument to `evalclusters` when you create the clustering evaluation object, then `ClusteringFunction` is empty.

Value | Description |
---|---|
`'kmeans'` | Cluster the data in `X` using the `kmeans` clustering algorithm, with `EmptyAction` set to `"singleton"` and `Replicates` set to `5`. |
`'linkage'` | Cluster the data in `X` using the `clusterdata` agglomerative clustering algorithm, with `Linkage` set to `"ward"`. |
`'gmdistribution'` | Cluster the data in `X` using the `gmdistribution` Gaussian mixture distribution algorithm, with `SharedCov` set to `true` and `Replicates` set to `5`. |

**Data Types:** `double` | `char` | `function_handle`
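As a sketch of the empty case described above, the following example passes precomputed clustering solutions to `evalclusters` instead of an algorithm name. The data and the matrix name `clust` are hypothetical. Each column of `clust` is assumed to hold one clustering solution.

```matlab
% Hypothetical data: two synthetic 2-D groups (illustration only).
rng("default")
X = [randn(40,2); randn(40,2) + 5];

% Build a matrix of clustering solutions, one solution per column.
clust = zeros(size(X,1),2);
clust(:,1) = kmeans(X,2);   % solution with two clusters
clust(:,2) = kmeans(X,3);   % solution with three clusters

% Evaluate the supplied solutions with the silhouette criterion.
evaluation = evalclusters(X,clust,"silhouette");

% No clustering algorithm was run by evalclusters, so this is expected to be empty.
isempty(evaluation.ClusteringFunction)
```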

`ClusterPriors` — Prior probabilities for each cluster

`'empirical'` | `'equal'`

This property is read-only.

Prior probabilities for each cluster, returned as `'empirical'` or `'equal'`.

Value | Description |
---|---|
`'empirical'` | Compute the silhouette criterion value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the criterion value proportionally based on its size. |
`'equal'` | Compute the silhouette criterion value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Regardless of its size, each cluster contributes equally to the criterion value. |
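The following sketch shows one way to request equal priors through the `ClusterPriors` name-value argument of `evalclusters`. The unbalanced data set is hypothetical and serves only to make the two weighting schemes differ.

```matlab
% Hypothetical unbalanced data: one large group and one small group.
rng("default")
X = [randn(150,2); randn(30,2) + 5];

% Evaluate without specifying ClusterPriors, then with equal priors.
evalDefault = evalclusters(X,"kmeans","silhouette","KList",2:4);
evalEqual = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "ClusterPriors","equal");

% Inspect the stored setting and the resulting criterion values.
evalDefault.ClusterPriors
evalEqual.ClusterPriors
[evalDefault.CriterionValues; evalEqual.CriterionValues]
```

With clusters of very different sizes, the two settings can rank the candidate solutions differently, so it is worth checking both when group sizes are unbalanced.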

`ClusterSilhouettes` — Average silhouette values

cell array of numeric vectors

This property is read-only.

Average silhouette values corresponding to each proposed number of clusters in `InspectedK`, returned as a cell array of numeric vectors. For each proposed number of clusters `k`, the vector `ClusterSilhouettes{k}` contains the average silhouette value for each cluster.

For example, suppose `evaluation` is a silhouette criterion clustering evaluation object and `evaluation.InspectedK` is `1:5`. Then, `evaluation.ClusterSilhouettes{4}(3)` is the average silhouette value for the points in the third cluster of the clustering solution with four total clusters.

**Data Types:** `cell`
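The indexing described above can be tried on a small example. The data here is hypothetical, and `perCluster` is a name used only in this sketch.

```matlab
% Hypothetical data: three synthetic 2-D groups (illustration only).
rng("default")
X = [randn(40,2); randn(40,2) + 5; randn(40,2) - 5];

% Inspect 1 to 5 clusters so that the cell index matches the number of clusters.
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:5);

% Average silhouette value of each cluster in the four-cluster solution.
perCluster = evaluation.ClusterSilhouettes{4};
numel(perCluster)   % one entry per cluster
perCluster(3)       % average silhouette value of the third cluster
```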

`CriterionName` — Name of criterion

`'Silhouette'`

This property is read-only.

Name of the criterion used for clustering evaluation, returned as `'Silhouette'`.

`CriterionValues` — Criterion values

numeric vector

This property is read-only.

Criterion values, returned as a numeric vector. Each value corresponds to a proposed number of clusters in `InspectedK`.

**Data Types:** `double`

`Distance` — Distance metric

`'sqEuclidean'` | `'Euclidean'` | `'cityblock'` | function handle | numeric vector | ...

This property is read-only.

Distance metric used for clustering data and computing the criterion values, returned as one of the values in this table, a function handle, or a numeric vector returned by the function `pdist`.

Value | Description |
---|---|
`'sqEuclidean'` | Squared Euclidean distance |
`'Euclidean'` | Euclidean distance |
`'cityblock'` | Sum of absolute differences |
`'cosine'` | One minus the cosine of the included angle between points (treated as vectors) |
`'correlation'` | One minus the sample correlation between points (treated as sequences of values) |
`'Hamming'` | Percentage of coordinates that differ |
`'Jaccard'` | Percentage of nonzero coordinates that differ |

**Data Types:** `single` | `double` | `char` | `function_handle`
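As a brief sketch, a distance metric from the table can be selected with the `Distance` name-value argument of `evalclusters`. The data is hypothetical, and the choice of `"cityblock"` is only an example of a metric that `kmeans` also supports.

```matlab
% Hypothetical data: two synthetic 2-D groups (illustration only).
rng("default")
X = [randn(50,2); randn(50,2) + 4];

% Cluster and evaluate using the city block (sum of absolute differences) metric.
evaluation = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "Distance","cityblock");

% The metric used is stored in the Distance property.
evaluation.Distance
```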

`InspectedK` — List of number of proposed clusters

positive integer vector

This property is read-only.

List of the number of proposed clusters for which to compute criterion values, returned as a positive integer vector.

**Data Types:** `double`

`OptimalK` — Optimal number of clusters

positive integer scalar

This property is read-only.

Optimal number of clusters, returned as a positive integer scalar.

**Data Types:** `double`

`OptimalY` — Optimal clustering solution

positive integer column vector | `[]`

This property is read-only.

Optimal clustering solution corresponding to `OptimalK`, returned as a positive integer column vector. Each row of `OptimalY` represents the cluster index of the corresponding observation (or row) in `X`. If you specify the clustering solutions as an input argument to `evalclusters` when you create the clustering evaluation object, or if the clustering evaluation object is compact (see `compact`), then `OptimalY` is empty.

**Data Types:** `double`

### Sample Data Properties

`Missing` — Excluded data

logical column vector | `[]`

This property is read-only.

Excluded data, returned as a logical column vector. If an element of `Missing` is `true`, then the corresponding observation (or row) in the data matrix `X` is not used in the clustering solutions. If the clustering evaluation object is compact (see `compact`), then `Missing` is empty.

**Data Types:** `double` | `logical`

`NumObservations` — Number of observations

positive integer scalar

This property is read-only.

Number of observations in the data matrix `X`, ignoring observations with missing (`NaN`) values, returned as a positive integer scalar.

**Data Types:** `double`

`X` — Data used for clustering

numeric matrix | `[]`

This property is read-only.

Data used for clustering, returned as a numeric matrix. Rows correspond to observations, and columns correspond to variables. If the clustering evaluation object is compact (see `compact`), then `X` is empty.

**Data Types:** `single` | `double`

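## Object Functions

- `addK`: Evaluate additional numbers of clusters
- `compact`: Compact clustering evaluation object
- `plot`: Plot clustering evaluation object criterion values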

## Examples

### Evaluate Clustering Solution Using Silhouette Criterion

Evaluate the optimal number of clusters using the silhouette clustering evaluation criterion.

Generate sample data containing random numbers from three multivariate distributions with different parameter values.

```
rng("default") % For reproducibility
n = 200;
mu1 = [2 2];
sigma1 = [0.9 -0.0255; -0.0255 0.9];
mu2 = [5 5];
sigma2 = [0.5 0; 0 0.3];
mu3 = [-2 -2];
sigma3 = [1 0; 0 0.9];
X = [mvnrnd(mu1,sigma1,n); ...
     mvnrnd(mu2,sigma2,n); ...
     mvnrnd(mu3,sigma3,n)];
```

Evaluate the optimal number of clusters using the silhouette criterion. Cluster the data using `kmeans`.

```
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:6)
```

```
evaluation = 
  SilhouetteEvaluation with properties:

    NumObservations: 600
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 0.8055 0.8551 0.7155 0.6071 0.6232]
           OptimalK: 3
```

The `OptimalK` value indicates that, based on the silhouette criterion, the optimal number of clusters is three.

Plot the silhouette criterion values for each number of clusters tested.

```
plot(evaluation)
```

The plot shows that the highest silhouette value occurs at three clusters, suggesting that the optimal number of clusters is three.

Create a grouped scatter plot to visually examine the suggested clusters.

```
clusters = evaluation.OptimalY;
gscatter(X(:,1),X(:,2),clusters,[],"xod")
```

The plot shows three distinct clusters within the data: cluster 1 in the lower-left corner, cluster 2 in the upper-right corner, and cluster 3 near the center of the plot.

## More About

### Silhouette Value and Criterion

The silhouette value for each point is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters.

The silhouette value $s_i$ for the *i*th point is defined as

$$s_i = \frac{b_i - a_i}{\max\left(a_i, b_i\right)},$$

where $a_i$ is the average distance from the *i*th point to the other points in the same cluster as *i*, and $b_i$ is the minimum average distance from the *i*th point to points in a different cluster, minimized over the clusters. If the *i*th point is the only point in its cluster, then the silhouette value $s_i$ is set to 1.

The silhouette values range from –1 to 1. A high silhouette value indicates that the point is well matched to its own cluster, and poorly matched to other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. You can use silhouette values as a clustering evaluation criterion with any distance metric.
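To make the definition concrete, the following sketch computes silhouette values by hand for a tiny data set and compares them with the output of the `silhouette` function. The data, the cluster assignment `idx`, and the choice of Euclidean distance are assumptions made for illustration. The toy data contains no single-point clusters, so that special case is not handled here.

```matlab
% Hypothetical toy data: five 1-D points assigned to two clusters.
X = [0; 1; 2; 10; 11];
idx = [1; 1; 1; 2; 2];

D = abs(X - X');        % pairwise Euclidean distances for 1-D data
n = numel(X);
s = zeros(n,1);
for i = 1:n
    same = (idx == idx(i));
    same(i) = false;                     % exclude the point itself
    a = mean(D(i,same));                 % average distance within own cluster
    others = setdiff(unique(idx),idx(i));
    % Minimum, over the other clusters, of the average distance to that cluster.
    b = min(arrayfun(@(c) mean(D(i,idx == c)),others));
    s(i) = (b - a)/max(a,b);
end
s

% Largest absolute difference from the built-in computation.
max(abs(s - silhouette(X,idx,"Euclidean")))
```

For the last point, the average distance to its own cluster is 1 and the average distance to the other cluster is 10, which gives a silhouette value of (10 - 1)/10 = 0.9.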

The `ClusterPriors` value determines the silhouette criterion computation. If the value is `'empirical'`, then the software computes the silhouette criterion value for a clustering solution by averaging the silhouette values for all points. Each cluster contributes to the criterion value proportionally based on its size. If the `ClusterPriors` value is `'equal'`, then the software computes the silhouette criterion value for a clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Regardless of its size, each cluster contributes equally to the criterion value. The optimal number of clusters corresponds to the solution with the highest silhouette criterion value.
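The two averaging schemes can be sketched directly from a vector of silhouette values. This example is an illustration of the description above, not the internal implementation. The toy data, the cluster assignment `idx`, and the variable names are hypothetical.

```matlab
% Hypothetical toy data: five 1-D points in two clusters of unequal size.
X = [0; 1; 2; 10; 11];
idx = [1; 1; 1; 2; 2];

% Silhouette value of each point, using Euclidean distance.
s = silhouette(X,idx,"Euclidean");

% 'empirical': average over all points, so larger clusters weigh more.
empiricalValue = mean(s)

% 'equal': average within each cluster, then average the cluster means.
clusterMeans = splitapply(@mean,s,idx);
equalValue = mean(clusterMeans)
```

Here the smaller cluster has the higher average silhouette value, so weighting the clusters equally yields a slightly higher criterion value than averaging over all points.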

## References

[1] Kaufman, L., and P. J. Rousseeuw. *Finding Groups in Data: An Introduction to Cluster Analysis*. Hoboken, NJ: John Wiley & Sons, Inc., 1990.

[2] Rousseeuw, P. J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” *Journal of Computational and Applied Mathematics*. Vol. 20, No. 1, 1987, pp. 53–65.

## Version History

**Introduced in R2013b**
