Silhouette criterion clustering evaluation object
SilhouetteEvaluation is an object consisting of sample data, clustering data, and silhouette
criterion values used to evaluate the optimal number of data clusters.
Create a silhouette criterion clustering evaluation object using evalclusters.
x— Input data
Input data, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.
clust— Clustering algorithm
'gmdistribution'| matrix of clustering solutions | function handle
Clustering algorithm, specified as one of the following.
|Cluster the data in |
|Cluster the data in |
|Cluster the data in |
If the criterion is 'silhouette', you can specify a clustering algorithm
using a function handle. The function must be of the form
C = clustfun(DATA,K), where DATA is the data to be clustered, and
K is the number of clusters. The output of clustfun must be one of
the following:
A vector of integers representing the cluster index for each observation in DATA. There must be K unique values in this vector.
A numeric n-by-K matrix of scores for n observations and K classes. In this case, the cluster index for each observation is determined by taking the largest score value in each row.
If the criterion is 'silhouette', you can also specify
an n-by-K matrix containing the
proposed clustering solutions. n is the number
of observations in the sample data, and K is the
number of proposed clustering solutions. Column j contains
the cluster indices for each of the n points in
the jth clustering solution.
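The two alternative forms of clust described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the toy data, the variable names (clustfun, solutions, E1, E2), and the choice of kmeans with three replicates are assumptions made for the example.

```matlab
rng('default');                          % for reproducibility
X = [randn(50,2); randn(50,2) + 4];      % toy data: two separated groups (assumed)

% Function-handle form: C = clustfun(DATA,K) returns one cluster index per row.
clustfun = @(DATA,K) kmeans(DATA, K, 'Replicates', 3);
E1 = evalclusters(X, clustfun, 'silhouette', 'KList', 2:4);

% Matrix form: column j holds the cluster indices of the jth proposed solution.
solutions = zeros(size(X,1), 3);
for k = 2:4
    solutions(:,k-1) = kmeans(X, k, 'Replicates', 3);
end
E2 = evalclusters(X, solutions, 'silhouette');

disp([E1.OptimalK, E2.OptimalK])
```

With the function-handle form, evalclusters calls clustfun once per value in 'KList'. With the matrix form, the clustering has already been done, so no 'KList' is passed.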
Specify optional comma-separated pairs of Name,Value arguments.
Name is the argument name and Value is the corresponding value.
Name must appear inside single quotes (' ').
You can specify several name and value pair
arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'KList',[1:5],'Distance','cityblock' specifies to test 1, 2, 3, 4, and 5 clusters using the sum of absolute differences distance measure.
'ClusterPriors'— Prior probabilities for each cluster
Prior probabilities for each cluster, specified as the comma-separated
pair consisting of 'ClusterPriors' and one of the following.
|Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the overall silhouette value proportionally to its size.|
|Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Each cluster contributes equally to the overall silhouette value, regardless of its size.|
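The difference between the two averaging schemes is easiest to see on a small example. The sketch below uses hypothetical silhouette values and cluster indices (s and idx are made up for illustration) and computes both overall values by hand in base MATLAB.

```matlab
% Hypothetical silhouette values and cluster indices for six points.
s   = [0.9; 0.8; 0.7; 0.6; 0.2; 0.4];
idx = [1;   1;   1;   1;   2;   2];

% Size-weighted: average over all points, so larger clusters count more.
overallBySize = mean(s);                         % 3.6 / 6 = 0.6

% Equal-weighted: average within each cluster, then across clusters.
clusterMeans = accumarray(idx, s, [], @mean);    % [0.75; 0.30]
overallEqual = mean(clusterMeans);               % 0.525

fprintf('size-weighted: %.3f, equal-weighted: %.3f\n', overallBySize, overallEqual);
```

Here the small, poorly separated cluster 2 pulls the equal-weighted value down further than the size-weighted one, because it counts as much as the larger cluster 1.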
'Distance'— Distance metric
'cityblock'| vector | function | ...
Distance metric used for computing the criterion values, specified
as the comma-separated pair consisting of 'Distance' and
one of the following.
|Squared Euclidean distance|
|Sum of absolute differences|
|One minus the cosine of the included angle between points (treated as vectors)|
|One minus the sample correlation between points (treated as sequences of values)|
|Percentage of coordinates that differ. This option is only valid for the silhouette criterion.|
|Percentage of nonzero coordinates that differ. This option is only valid for the silhouette criterion.|
For detailed information about each distance metric, see
You can also specify a function for the distance metric using
a function handle. The distance
function must be of the form
d2 = distfun(XI,XJ), where
XI is a 1-by-n vector
corresponding to a single row of the input matrix X, and
XJ is an m2-by-n matrix
corresponding to multiple rows of X. distfun must
return an m2-by-1 vector of distances
d2, whose kth element
is the distance between the observations XI and XJ(k,:).
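As an illustration of the required signature, the following sketch defines a custom distance as an anonymous function. The Chebyshev (maximum coordinate difference) metric and the sample inputs are assumptions chosen for the example; any function with the same input and output shapes would fit the form described above.

```matlab
% Hypothetical custom metric: Chebyshev (maximum coordinate difference).
% XI is 1-by-n, XJ is m2-by-n, and the result is m2-by-1.
distfun = @(XI,XJ) max(abs(bsxfun(@minus, XJ, XI)), [], 2);

XI = [0 0];                  % a single observation
XJ = [1 2; 3 1; 0 0];        % three observations to compare against
d2 = distfun(XI, XJ);        % d2(k) is the distance between XI and XJ(k,:)

disp(d2')                    % 2 3 0
```

bsxfun is used so that the subtraction of the row XI from every row of XJ also works in releases without implicit expansion.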
If the criterion is 'silhouette', you can also specify
Distance as the output vector
created by the function pdist.
When clust is a character vector representing
a built-in clustering algorithm, evalclusters uses
the distance metric specified for
Distance to cluster
the data, except for the following:
If clust is 'linkage' and Distance is either 'sqEuclidean' or 'Euclidean',
then the clustering algorithm uses Euclidean distance and Ward linkage.
If clust is 'linkage' and Distance is
any other metric, then the clustering algorithm uses the specified
distance metric and average linkage.
In all other cases, the distance metric specified for Distance must
match the distance metric used in the clustering algorithm to obtain
meaningful results.
Distance only accepts a function handle if
the clustering algorithm
clust accepts a function
handle as the distance metric. For example, the kmeans
algorithm does not accept a function handle as the distance metric.
Therefore, if you use the
kmeans algorithm and then
specify a function handle for
Distance, the software returns an error.
'KList'— List of number of clusters to evaluate
List of number of clusters to evaluate, specified as the comma-separated
pair consisting of
'KList' and a vector of positive
integer values. You must specify KList when clust is
a clustering algorithm name or a function handle. When criterion is 'gap', clust must
be a character vector or a function handle, and you must specify KList.
Clustering algorithm used to cluster the input data, stored
as a valid clustering algorithm name or function handle. If the clustering
solutions are provided in the input,
Prior probabilities for each cluster, stored as valid prior probability name.
Silhouette values corresponding to each proposed number of clusters
Name of the criterion used for clustering evaluation, stored as a valid criterion name.
Criterion values corresponding to each proposed number of clusters
Distance measure used for clustering data, stored as a valid distance measure name.
List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values.
Logical flag for excluded data, stored as a column vector of
logical values. If
Number of observations in the data matrix
Optimal number of clusters, stored as a positive integer value.
Optimal clustering solution corresponding to
Data used for clustering, stored as a matrix of numerical values.
|addK|Evaluate additional numbers of clusters|
|compact|Compact clustering evaluation object|
|plot|Plot clustering evaluation object criterion values|
The silhouette value for each point is a measure of how similar
that point is to points in its own cluster, when compared to points
in other clusters. The silhouette value for the ith point,
Si, is defined as
Si = (bi - ai) / max(ai, bi)
where ai is the average distance from the ith
point to the other points in the same cluster as i, and
bi is the minimum average distance from the ith
point to points in a different cluster, minimized over clusters.
The silhouette value ranges from -1 to +1.
A high silhouette value indicates that
point i is well matched
to its own cluster and poorly matched to neighboring clusters. If
most points have a high silhouette value, then the clustering solution
is appropriate. If many points have a low or negative silhouette value,
then the clustering solution may have either too many or too few clusters.
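The definition above can be checked by hand on a tiny data set. The sketch below is a direct, unoptimized transcription of the formula in base MATLAB. The one-dimensional toy data, the absolute-difference distance, and the variable names are assumptions made for illustration, and the loop assumes every cluster has at least two points.

```matlab
% Toy 1-D data in two clusters (values are illustrative only).
x   = [0; 1; 2; 10; 11];
idx = [1; 1; 1; 2;  2];

D = abs(bsxfun(@minus, x, x'));          % pairwise absolute distances
n = numel(x);
clusters = unique(idx);
S = zeros(n,1);
for i = 1:n
    own = (idx == idx(i));
    own(i) = false;                      % exclude the point itself
    a = mean(D(i, own));                 % average distance within own cluster
    b = inf;
    for c = clusters'
        if c ~= idx(i)
            b = min(b, mean(D(i, idx == c)));   % closest other cluster
        end
    end
    S(i) = (b - a) / max(a, b);          % silhouette value of point i
end
disp(S')
```

For the first point, a = 1.5 and b = 10.5, so its silhouette value is 9/10.5, or about 0.857. All five values are close to 1, which matches the clear separation between the two groups.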
The silhouette clustering evaluation criterion can be used with any distance metric.
Evaluate the optimal number of clusters using the silhouette clustering evaluation criterion.
Generate sample data containing random numbers from three multivariate distributions with different parameter values.
rng('default');  % For reproducibility
mu1 = [2 2];
sigma1 = [0.9 -0.0255; -0.0255 0.9];
mu2 = [5 5];
sigma2 = [0.5 0 ; 0 0.3];
mu3 = [-2, -2];
sigma3 = [1 0 ; 0 0.9];
N = 200;
X = [mvnrnd(mu1,sigma1,N);...
     mvnrnd(mu2,sigma2,N);...
     mvnrnd(mu3,sigma3,N)];
Evaluate the optimal number of clusters using the silhouette criterion. Cluster the data using kmeans.
E = evalclusters(X,'kmeans','silhouette','klist',[1:6])
E =
  SilhouetteEvaluation with properties:
    NumObservations: 600
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 0.8055 0.8551 0.7155 0.6071 0.6232]
           OptimalK: 3
The OptimalK value indicates that, based on the silhouette criterion, the optimal number of clusters is three.
Plot the silhouette criterion values for each number of clusters tested.
The plot shows that the highest silhouette value occurs at three clusters, suggesting that the optimal number of clusters is three.
Create a grouped scatter plot to visually examine the suggested clusters.
The plot shows three distinct clusters within the data: Cluster 1 is in the lower-left corner, cluster 2 is near the center of the plot, and cluster 3 is in the upper-right corner.
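The two plots described above can be reproduced with a sketch along the following lines. It regenerates the same sample data so that it runs on its own. The use of gscatter and of the OptimalY property (the cluster indices of the optimal solution) follows the Statistics and Machine Learning Toolbox as I understand it rather than anything shown in this text, so treat those names as assumptions.

```matlab
rng('default');  % for reproducibility
X = [mvnrnd([2 2],   [0.9 -0.0255; -0.0255 0.9], 200); ...
     mvnrnd([5 5],   [0.5 0; 0 0.3],             200); ...
     mvnrnd([-2 -2], [1 0; 0 0.9],               200)];
E = evalclusters(X, 'kmeans', 'silhouette', 'KList', 1:6);

% Silhouette criterion value versus number of clusters.
figure;
plot(E)

% Grouped scatter plot of the data, colored by the optimal clustering solution.
figure;
gscatter(X(:,1), X(:,2), E.OptimalY)
```

The numeric labels that kmeans assigns to the clusters are arbitrary, so which group is called cluster 1, 2, or 3 in the scatter plot can differ between runs with a different random seed.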