silhouette
Silhouette plot
Syntax
Description
silhouette(X,clust,Distance,DistParameter) accepts one or more additional distance metric parameter values when you specify Distance as a custom distance function handle @distfun that accepts the additional parameter values.
Examples
Create Silhouette Plot
Create silhouette plots from clustered data using different distance metrics.
Generate random sample data.
rng('default') % For reproducibility
X = [randn(10,2)+3;randn(10,2)-3];
Create a scatter plot of the data.
scatter(X(:,1),X(:,2));
title('Randomly Generated Data');
The scatter plot shows that the data appears to be split into two clusters of equal size.
Partition the data into two clusters using kmeans with the default squared Euclidean distance metric.
clust = kmeans(X,2);
clust contains the cluster indices of the data.
Create a silhouette plot from the clustered data using the default squared Euclidean distance metric.
silhouette(X,clust)
The silhouette plot shows that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.8 or greater), indicating that the clusters are well separated.
Create a silhouette plot from the clustered data using the Euclidean distance metric.
silhouette(X,clust,'Euclidean')
The silhouette plot shows that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.6 or greater), indicating that the clusters are well separated.
Compute Silhouette Values
Compute the silhouette values from clustered data.
Generate random sample data.
rng('default') % For reproducibility
X = [randn(10,2)+1;randn(10,2)-1];
Cluster the data in X based on the sum of absolute differences in distance by using kmeans.
clust = kmeans(X,2,'distance','cityblock');
clust contains the cluster indices of the data.
Compute the silhouette values from the clustered data. Specify the distance metric as 'cityblock' to indicate that the kmeans clustering is based on the sum of absolute differences.
s = silhouette(X,clust,'cityblock')
s = 20×1
0.0816
0.5848
0.1906
0.2781
0.3954
0.4050
0.0897
0.5416
0.6203
0.6664
⋮
Find Silhouette Values Using Custom Distance Metric
Find silhouette values from clustered data using a custom chi-square distance metric. Verify that the chi-square distance metric is equivalent to the Euclidean distance metric, but with an optional scaling parameter.
Generate random sample data.
rng('default'); % For reproducibility
X = [randn(10,2)+3;randn(10,2)-3];
Cluster the data in X using kmeans with the default squared Euclidean distance metric.
clust = kmeans(X,2);
Find silhouette values and create a silhouette plot from the clustered data using the Euclidean distance metric.
[s,h] = silhouette(X,clust,'Euclidean')
s = 20×1
0.6472
0.7241
0.5682
0.7658
0.7864
0.6397
0.7253
0.7783
0.7054
0.7442
⋮
h =
  Figure (1) with properties:

      Number: 1
        Name: ''
       Color: [1 1 1]
    Position: [348 376 583 437]
       Units: 'pixels'

  Show all properties
The chi-square distance between J-dimensional points x and z is
$$\chi (x,z)=\sqrt{\sum _{j=1}^{J}{w}_{j}{({x}_{j}-{z}_{j})}^{2}},$$
where $${w}_{j}$$ is the weight associated with dimension j.
Set weights for each dimension and specify the chi-square distance function. The distance function must:
Take as input arguments the n-by-p input data matrix X, one row of X (for example, x), and a scaling (or weight) parameter w.
Calculate the distance from x to each row of X.
Return a vector of length n. Each element of the vector is the distance between the observation corresponding to x and the observations corresponding to each row of X.
w = [0.4; 0.6]; % Set arbitrary weights for illustration
chiSqrDist = @(x,Z,w)sqrt(((x-Z).^2)*w);
Find silhouette values from the clustered data using the custom distance metric chiSqrDist.
s1 = silhouette(X,clust,chiSqrDist,w)
s1 = 20×1
0.6288
0.7239
0.6244
0.7696
0.7957
0.6688
0.7386
0.7865
0.7223
0.7572
⋮
Set the weight for both dimensions to 1 to use chiSqrDist as the Euclidean distance metric. Find silhouette values and verify that they are the same as the values in s.
w2 = [1; 1];
s2 = silhouette(X,clust,chiSqrDist,w2);
AreValuesEqual = isequal(s2,s)
AreValuesEqual = logical
   1
The silhouette values are the same in s and s2.
Input Arguments
X — Input data
numeric matrix
Input data, specified as a numeric matrix of size n-by-p. Rows correspond to points, and columns correspond to coordinates.
Data Types: single | double
clust — Cluster assignment
categorical variable | numeric vector | character matrix | string array | cell array of character vectors
Cluster assignment, specified as a categorical variable, numeric vector, character matrix, string array, or cell array of character vectors containing a cluster name for each point in X.
silhouette treats NaNs and empty values in clust as missing values and ignores the corresponding rows of X.
Data Types: single | double | char | string | cell | categorical
Distance — Distance metric
'sqEuclidean' (default) | 'Euclidean' | 'cityblock' | function handle | vector of pairwise distances | ...
Distance metric, specified as a character vector, string scalar, or function handle, as described in this table.
Metric | Description
'Euclidean' | Euclidean distance
'sqEuclidean' | Squared Euclidean distance (default)
'cityblock' | Sum of absolute differences
'cosine' | One minus the cosine of the included angle between points (treated as vectors)
'correlation' | One minus the sample correlation between points (treated as sequences of values)
'Hamming' | Percentage of coordinates that differ
'Jaccard' | Percentage of nonzero coordinates that differ
Vector | A numeric row vector of pairwise distances, in the form created by the pdist function. X is not used in this case, and can safely be set to [].
@distfun | Custom distance function handle. A distance function has the form function D = distfun(X0,X,
For more information, see Distance Metrics.
Example: 'cosine'
Data Types: char | string | function_handle | single | double
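The vector form described in the table can be illustrated with a short sketch. It assumes the cityblock metric and the sample data from the examples above, and it checks that passing a pdist vector with X set to [] gives the same silhouette values as naming the metric directly. The variable names sFromVector, sFromMetric, and maxDiff are illustrative only.

```matlab
rng('default') % For reproducibility
X = [randn(10,2)+1; randn(10,2)-1];
clust = kmeans(X,2,'Distance','cityblock');

D = pdist(X,'cityblock');               % row vector of pairwise distances
sFromVector = silhouette([],clust,D);   % X is not used here, so pass []
sFromMetric = silhouette(X,clust,'cityblock');

% Largest absolute difference between the two results
maxDiff = max(abs(sFromVector - sFromMetric))
```

This form can be convenient when the pairwise distances are already available, for example from an earlier call to pdist.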
DistParameter — Distance metric parameter value
positive scalar | numeric vector | numeric matrix
Distance metric parameter value, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify a custom distance function handle @distfun that accepts one or more parameter values in addition to the input parameters X0 and X.
Example: silhouette(X,clust,distfun,p1,p2) where p1 and p2 are additional distance metric parameter values for @distfun
Data Types: single | double
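As a minimal sketch of a custom distance function with one extra parameter, the following defines a Minkowski-style distance whose exponent q is passed through DistParameter. The handle name minkDist and the exponent values are hypothetical and chosen only for illustration. The sketch assumes the handle receives one point as a row vector and the data matrix as its first two inputs, and it relies on implicit expansion (R2016b or later) to subtract the row from every row of the matrix.

```matlab
rng('default') % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = kmeans(X,2);

% Hypothetical Minkowski-style distance with exponent q as the extra parameter.
% x0 is a single point (row vector), Xall is the data matrix; returns a column vector.
minkDist = @(x0,Xall,q) sum(abs(Xall - x0).^q,2).^(1/q);

s3 = silhouette(X,clust,minkDist,3);   % exponent q = 3
s2 = silhouette(X,clust,minkDist,2);   % q = 2 reduces to the Euclidean distance

% Compare the q = 2 result against the built-in Euclidean metric
maxDiff = max(abs(s2 - silhouette(X,clust,'Euclidean')))
```

With q = 2 the values should agree with the built-in 'Euclidean' metric up to floating-point rounding, which is a quick way to sanity-check a custom handle before using other parameter values.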
Output Arguments
s — Silhouette values
n-by-1 vector of values ranging from –1 to 1
Silhouette values, returned as an n-by-1 vector of values ranging from –1 to 1. A silhouette value measures how similar a point is to points in its own cluster, when compared to points in other clusters. A high silhouette value indicates that a point is well matched to its own cluster, and poorly matched to other clusters.
Data Types: single | double
h — Figure handle
scalar
Figure handle, returned as a scalar. You can use the figure handle to query and modify figure properties. For more information, see Figure Properties.
More About
Silhouette Value
The silhouette value for each point is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters.
The silhouette value s_{i} for the ith point is defined as
$${s}_{i}=\frac{{b}_{i}-{a}_{i}}{\mathrm{max}\left({a}_{i},{b}_{i}\right)},$$
where a_{i} is the average distance from the ith point to the other points in the same cluster as i, and b_{i} is the minimum average distance from the ith point to points in a different cluster, minimized over the clusters. If the ith point is the only point in its cluster, then the silhouette value s_{i} is set to 1.
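The definition can be checked numerically with a short sketch. It computes a_{i} and b_{i} directly from a matrix of pairwise Euclidean distances and compares the result with the output of silhouette. The cluster labels are assigned by hand (an assumption made to keep the example deterministic), and the names sManual, sBuiltin, and maxDiff are illustrative.

```matlab
rng('default') % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = [ones(10,1); 2*ones(10,1)];     % hand-assigned labels (assumption)

D = squareform(pdist(X,'euclidean'));   % n-by-n matrix of pairwise distances
n = size(X,1);
labels = unique(clust);
sManual = zeros(n,1);
for i = 1:n
    own = (clust == clust(i));
    own(i) = false;                     % exclude the point itself
    if ~any(own)
        sManual(i) = 1;                 % only point in its cluster
        continue
    end
    a = mean(D(i,own));                 % average distance within own cluster
    others = labels(labels ~= clust(i));
    b = min(arrayfun(@(c) mean(D(i,clust == c)), others)); % nearest other cluster
    sManual(i) = (b - a)/max(a,b);
end

sBuiltin = silhouette(X,clust,'Euclidean');
maxDiff = max(abs(sManual - sBuiltin))
```

The two results should agree up to floating-point rounding. The loop is written for readability rather than speed.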
The silhouette values range from –1 to 1. A high silhouette value indicates that the point is well matched to its own cluster, and poorly matched to other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. You can use silhouette values as a clustering evaluation criterion with any distance metric.
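One common way to apply this criterion is to compare the mean silhouette value across several candidate numbers of clusters. The following is a sketch under stated assumptions, not a prescribed procedure: the candidate range 2:5, the use of five kmeans replicates, and the variable names kValues, meanSil, and bestK are all illustrative choices.

```matlab
rng('default') % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];

kValues = 2:5;                          % hypothetical candidate cluster counts
meanSil = zeros(size(kValues));
for j = 1:numel(kValues)
    idx = kmeans(X,kValues(j),'Replicates',5);
    meanSil(j) = mean(silhouette(X,idx));   % default squared Euclidean metric
end

% Candidate with the highest mean silhouette value
[~,best] = max(meanSil);
bestK = kValues(best)
```

A higher mean suggests that more points are well matched to their assigned clusters, but the mean can hide individual points with low or negative values, so inspecting the silhouette plot is still worthwhile.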
References
[1] Kaufman L., and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: John Wiley & Sons, Inc., 1990.
Version History
Introduced before R2006a