Evaluate clustering solutions

`eva = evalclusters(x,clust,criterion)`

`eva = evalclusters(x,clust,criterion,Name,Value)`

`eva = evalclusters(x,clust,criterion,Name,Value)` creates a clustering evaluation object using additional options specified by one or more name-value pair arguments.

Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.

Load the sample data.

`load fisheriris;`

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using `kmeans`.

    rng('default');  % For reproducibility
    eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',[1:6])

    eva = 
      CalinskiHarabaszEvaluation with properties:

        NumObservations: 150
             InspectedK: [1 2 3 4 5 6]
        CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
               OptimalK: 3

The `OptimalK` value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
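To look more closely at the result, you can query the evaluation object or plot its criterion values. The following is a minimal sketch that assumes Statistics and Machine Learning Toolbox is available; the variable names `bestK` and `idx` are illustrative and not part of the original example.

```matlab
load fisheriris
rng('default');                          % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6);

bestK = eva.OptimalK;                    % Suggested number of clusters
idx   = eva.OptimalY;                    % Cluster index of each observation for bestK clusters
plot(eva)                                % Criterion value versus number of clusters
```

The plot makes it easier to see whether the criterion has a clear peak or whether several values of K score almost equally well.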

Use an input matrix of proposed clustering solutions to evaluate the optimal number of clusters.

Load the sample data.

`load fisheriris;`

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use `kmeans` to create an input matrix of proposed clustering solutions for the measurements in `meas`, using 1, 2, 3, 4, 5, and 6 clusters.

    clust = zeros(size(meas,1),6);
    for i = 1:6
        clust(:,i) = kmeans(meas,i,'EmptyAction','singleton',...
            'Replicates',5);
    end

Each row of `clust` corresponds to one observation (one row of `meas`). Each of the six columns corresponds to a clustering solution containing 1 to 6 clusters.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion.

`eva = evalclusters(meas,clust,'CalinskiHarabasz')`

    eva = 
      CalinskiHarabaszEvaluation with properties:

        NumObservations: 150
             InspectedK: [1 2 3 4 5 6]
        CriterionValues: [NaN 513.9245 561.6278 530.7658 495.5415 470.4474]
               OptimalK: 3

The `OptimalK` value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Use a function handle to specify the clustering algorithm, then evaluate the optimal number of clusters.

Load the sample data.

`load fisheriris;`

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use a function handle to specify the clustering algorithm.

    myfunc = @(X,K)(kmeans(X,K,'EmptyAction','singleton',...
        'Replicates',5));

Evaluate the optimal number of clusters for the data in `meas` using the Calinski-Harabasz criterion.

    eva = evalclusters(meas,myfunc,'CalinskiHarabasz',...
        'KList',[1:6])

    eva = 
      CalinskiHarabaszEvaluation with properties:

        NumObservations: 150
             InspectedK: [1 2 3 4 5 6]
        CriterionValues: [NaN 513.9245 561.6278 530.4871 495.5415 473.8506]
               OptimalK: 3

The `OptimalK` value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

`x`: Input data

matrix

Input data, specified as an *N*-by-*P* matrix. *N* is the number of observations, and *P* is the number of variables.

**Data Types:** `single` | `double`

`clust`: Clustering algorithm

`'kmeans'` | `'linkage'` | `'gmdistribution'` | matrix of clustering solutions | function handle

Clustering algorithm, specified as one of the following.

| Value | Description |
| --- | --- |
| `'kmeans'` | Cluster the data in `x` using the `kmeans` clustering algorithm, with `'EmptyAction'` set to `'singleton'` and `'Replicates'` set to `5`. |
| `'linkage'` | Cluster the data in `x` using the `clusterdata` agglomerative clustering algorithm, with `'Linkage'` set to `'ward'`. |
| `'gmdistribution'` | Cluster the data in `x` using the `gmdistribution` Gaussian mixture distribution algorithm, with `'SharedCov'` set to `true` and `'Replicates'` set to `5`. |

If `criterion` is `'CalinskiHarabasz'`, `'DaviesBouldin'`, or `'silhouette'`, you can specify a clustering algorithm using a function handle. The function must be of the form `C = clustfun(DATA,K)`, where `DATA` is the data to be clustered, and `K` is the number of clusters. The output of `clustfun` must be one of the following:

- A vector of integers representing the cluster index for each observation in `DATA`. There must be `K` unique values in this vector.
- A numeric *n*-by-*K* matrix of scores for *n* observations and *K* classes. In this case, the cluster index for each observation is determined by taking the largest score value in each row.
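The two accepted output forms can be sketched as follows. This is an illustration only: the handles `idxfun` and `scorefun` are hypothetical names, and the choice of average linkage and of a regularized Gaussian mixture fit is an assumption made for the example, not something prescribed by `evalclusters`.

```matlab
load fisheriris
rng('default');                          % For reproducibility

% Form 1: a vector of cluster indices (hierarchical clustering, average linkage).
idxfun = @(X,K) cluster(linkage(X,'average'),'MaxClust',K);

% Form 2: an n-by-K score matrix (posterior probabilities of a Gaussian mixture).
scorefun = @(X,K) posterior( ...
    fitgmdist(X,K,'RegularizationValue',0.01,'Replicates',3),X);

evaIdx   = evalclusters(meas,idxfun,'silhouette','KList',2:4);
evaScore = evalclusters(meas,scorefun,'silhouette','KList',2:4);
```

With the score form, each observation is assigned to the column holding its largest score, so a mixture component that never wins a row yields fewer than `K` distinct clusters.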

If `criterion` is `'CalinskiHarabasz'`, `'DaviesBouldin'`, or `'silhouette'`, you can also specify `clust` as an *n*-by-*K* matrix containing the proposed clustering solutions. *n* is the number of observations in the sample data, and *K* is the number of proposed clustering solutions. Column *j* contains the cluster indices for each of the *n* points in the *j*th clustering solution.
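A matrix of proposed solutions does not have to come from `kmeans`. The sketch below builds one from a single hierarchical cluster tree; the Ward linkage, the range `ks`, and the Davies-Bouldin criterion are arbitrary choices for illustration.

```matlab
load fisheriris
Z  = linkage(meas,'ward');               % Build the cluster tree once
ks = 2:6;                                % Numbers of clusters to propose
clust = zeros(size(meas,1),numel(ks));
for j = 1:numel(ks)
    clust(:,j) = cluster(Z,'MaxClust',ks(j));   % Column j: solution with ks(j) clusters
end

eva = evalclusters(meas,clust,'DaviesBouldin');
```

Because the solutions are supplied directly, no `'KList'` argument is passed here.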

`criterion`: Clustering evaluation criterion

`'CalinskiHarabasz'` | `'DaviesBouldin'` | `'gap'` | `'silhouette'`

Clustering evaluation criterion, specified as one of the following.

| Value | Description |
| --- | --- |
| `'CalinskiHarabasz'` | Create a `CalinskiHarabaszEvaluation` clustering evaluation object containing Calinski-Harabasz index values. |
| `'DaviesBouldin'` | Create a `DaviesBouldinEvaluation` cluster evaluation object containing Davies-Bouldin index values. |
| `'gap'` | Create a `GapEvaluation` cluster evaluation object containing gap criterion values. |
| `'silhouette'` | Create a `SilhouetteEvaluation` cluster evaluation object containing silhouette values. |
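Different criteria can suggest different numbers of clusters for the same data, so it can be useful to compare them side by side. A minimal sketch, assuming the `fisheriris` sample data and `kmeans` clustering; the variable names are illustrative.

```matlab
load fisheriris
rng('default');                          % For reproducibility
criteria = {'CalinskiHarabasz','DaviesBouldin','silhouette'};
optimalK = zeros(1,numel(criteria));
for i = 1:numel(criteria)
    eva = evalclusters(meas,'kmeans',criteria{i},'KList',2:6);
    optimalK(i) = eva.OptimalK;          % Number of clusters suggested by this criterion
    fprintf('%-18s OptimalK = %d\n',criteria{i},optimalK(i));
end
```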

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

**Example:** `'KList',[1:5],'Distance','cityblock'` specifies to test 1, 2, 3, 4, and 5 clusters using the sum of absolute differences distance measure.

`'KList'`: List of number of clusters to evaluate

vector

List of number of clusters to evaluate, specified as the comma-separated pair consisting of `'KList'` and a vector of positive integer values. You must specify `KList` when `clust` is a clustering algorithm name or a function handle. When `criterion` is `'gap'`, `clust` must be a character vector or a function handle, and you must specify `KList`.

**Example:** `'KList',[1:6]`

`'Distance'`: Distance metric

`'sqEuclidean'` (default) | `'Euclidean'` | `'cityblock'` | vector | function | ...

Distance metric used for computing the criterion values, specified as the comma-separated pair consisting of `'Distance'` and one of the following.

| Value | Description |
| --- | --- |
| `'sqEuclidean'` | Squared Euclidean distance |
| `'Euclidean'` | Euclidean distance |
| `'cityblock'` | Sum of absolute differences |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors) |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values) |
| `'Hamming'` | Percentage of coordinates that differ. This option is only valid for the `'silhouette'` criterion. |
| `'Jaccard'` | Percentage of nonzero coordinates that differ. This option is only valid for the `'silhouette'` criterion. |

For detailed information about each distance metric, see `pdist`.

You can also specify a function for the distance metric using a function handle. The distance function must be of the form `d2 = distfun(XI,XJ)`, where `XI` is a 1-by-*n* vector corresponding to a single row of the input matrix `X`, and `XJ` is an $m_2$-by-*n* matrix corresponding to multiple rows of `X`. `distfun` must return an $m_2$-by-1 vector of distances `d2`, whose *k*th element is the distance between `XI` and `XJ(k,:)`.

If `criterion` is `'silhouette'`, you can also specify `Distance` as the output vector created by the function `pdist`.
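Both the function-handle form and the `pdist` vector form can be sketched with the silhouette criterion and a precomputed matrix of solutions. The handle `distfun` below is a hypothetical city block metric written for this illustration, and clustering with `kmeans` under `'cityblock'` is an assumption chosen so that the clustering and evaluation metrics agree.

```matlab
load fisheriris
rng('default');                          % For reproducibility

% Hypothetical custom metric: city block distance from row XI to each row of XJ.
distfun = @(XI,XJ) sum(abs(bsxfun(@minus,XJ,XI)),2);

% Proposed solutions with 2 to 5 clusters, one column per solution.
ks = 2:5;
clust = zeros(size(meas,1),numel(ks));
for j = 1:numel(ks)
    clust(:,j) = kmeans(meas,ks(j),'Distance','cityblock','Replicates',5);
end

% Distance given as a function handle.
evaFun = evalclusters(meas,clust,'silhouette','Distance',distfun);

% Distance given as a vector of pairwise distances created by pdist.
D = pdist(meas,'cityblock');
evaVec = evalclusters(meas,clust,'silhouette','Distance',D);
```

Passing a matrix of solutions sidesteps the restriction that the clustering algorithm itself must accept the chosen distance form.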

When `clust` is a character vector representing a built-in clustering algorithm, `evalclusters` uses the distance metric specified for `Distance` to cluster the data, except for the following:

- If `clust` is `'linkage'`, and `Distance` is either `'sqEuclidean'` or `'Euclidean'`, then the clustering algorithm uses Euclidean distance and Ward linkage.
- If `clust` is `'linkage'` and `Distance` is any other metric, then the clustering algorithm uses the specified distance metric and average linkage.
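The two `'linkage'` cases above can be sketched as follows, assuming the `fisheriris` sample data and the silhouette criterion; the variable names `evaWard` and `evaAvg` are illustrative.

```matlab
load fisheriris

% Default 'sqEuclidean' distance: clustering uses Euclidean distance and Ward linkage.
evaWard = evalclusters(meas,'linkage','silhouette','KList',2:6);

% Any other metric: clustering uses that metric with average linkage.
evaAvg = evalclusters(meas,'linkage','silhouette','KList',2:6, ...
    'Distance','cityblock');
```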

In all other cases, the distance metric specified for `Distance` must match the distance metric used in the clustering algorithm to obtain meaningful results.

`Distance` only accepts a function handle if the clustering algorithm `clust` accepts a function handle as the distance metric. For example, the `kmeans` clustering algorithm does not accept a function handle as the distance metric. Therefore, if you use the `kmeans` algorithm and then specify a function handle for `Distance`, the software errors.

**Example:** `'Distance','Euclidean'`

`'ClusterPriors'`: Prior probabilities for each cluster

`'empirical'` (default) | `'equal'`

Prior probabilities for each cluster, specified as the comma-separated pair consisting of `'ClusterPriors'` and one of the following.

| Value | Description |
| --- | --- |
| `'empirical'` | Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the overall silhouette value proportionally to its size. |
| `'equal'` | Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Each cluster contributes equally to the overall silhouette value, regardless of its size. |

**Example:** `'ClusterPriors','empirical'`
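The difference between the two averaging schemes can be reproduced by hand from per-point silhouette values. This is a sketch of the arithmetic described in the table, not a view into how `evalclusters` computes it internally; the three-cluster `kmeans` solution and all variable names are assumptions for illustration.

```matlab
load fisheriris
rng('default');                          % For reproducibility
idx = kmeans(meas,3,'Replicates',5);     % One clustering solution
s   = silhouette(meas,idx);              % Silhouette value of each point

% 'empirical': average over all points, so larger clusters weigh more.
empiricalValue = mean(s);

% 'equal': average within each cluster, then average the cluster means.
clusterMeans = accumarray(idx,s,[],@mean);
equalValue   = mean(clusterMeans);
```

The two values coincide when all clusters have the same size and can diverge when cluster sizes are unbalanced.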

`'B'`: Number of reference data sets

`100` (default) | positive integer value

Number of reference data sets generated from the reference distribution `ReferenceDistribution`, specified as the comma-separated pair consisting of `'B'` and a positive integer value.

**Example:** `'B',150`

`'ReferenceDistribution'`: Reference data generation method

`'PCA'` (default) | `'uniform'`

Reference data generation method, specified as the comma-separated pair consisting of `'ReferenceDistribution'` and one of the following.

| Value | Description |
| --- | --- |
| `'PCA'` | Generate reference data from a uniform distribution over a box aligned with the principal components of the data matrix `x`. |
| `'uniform'` | Generate reference data uniformly over the range of each feature in the data matrix `x`. |

**Example:** `'ReferenceDistribution','uniform'`
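A gap evaluation that sets both `'B'` and `'ReferenceDistribution'` might look like the following sketch. The value `50` for `'B'` is an arbitrary choice to shorten the run time, and the `fisheriris` data is assumed as in the earlier examples.

```matlab
load fisheriris
rng('default');                          % For reproducibility
eva = evalclusters(meas,'kmeans','gap','KList',1:6, ...
    'B',50,'ReferenceDistribution','uniform');

gapValues = eva.CriterionValues;         % Gap value for each K in KList
bestK     = eva.OptimalK;                % Number of clusters selected by the search method
```

A smaller `'B'` runs faster but gives a noisier estimate of the gap values.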

`'SearchMethod'`: Method for selecting optimal number of clusters

`'globalMaxSE'` (default) | `'firstMaxSE'`

Method for selecting the optimal number of clusters, specified as the comma-separated pair consisting of `'SearchMethod'` and one of the following.

| Value | Description |
| --- | --- |
| `'globalMaxSE'` | Evaluate each proposed number of clusters in `KList` and select the smallest number of clusters satisfying $\text{Gap}(K) \ge \mathrm{GAPMAX} - \text{SE}(\mathrm{GAPMAX})$, where $K$ is the number of clusters, $\text{Gap}(K)$ is the gap value for the clustering solution with $K$ clusters, $\mathrm{GAPMAX}$ is the largest gap value, and $\text{SE}(\mathrm{GAPMAX})$ is the standard error corresponding to the largest gap value. |
| `'firstMaxSE'` | Evaluate each proposed number of clusters in `KList` and select the smallest number of clusters satisfying $\text{Gap}(K) \ge \text{Gap}(K+1) - \text{SE}(K+1)$, where $K$ is the number of clusters, $\text{Gap}(K)$ is the gap value for the clustering solution with $K$ clusters, and $\text{SE}(K+1)$ is the standard error of the clustering solution with $K+1$ clusters. |

**Example:** `'SearchMethod','globalMaxSE'`
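The two selection rules can be applied by hand to a small set of numbers. The gap values and standard errors below are made up for illustration and do not come from any data set; the sketch only restates the inequalities in the table and assumes consecutive values of K.

```matlab
klist = 1:6;                                   % Proposed numbers of clusters
gap   = [0.10 0.35 0.40 0.38 0.55 0.52];       % Hypothetical gap values
se    = [0.02 0.03 0.04 0.04 0.06 0.05];       % Hypothetical standard errors

% 'globalMaxSE': smallest K with Gap(K) >= GAPMAX - SE(GAPMAX).
[gapMax,iMax] = max(gap);
kGlobal = klist(find(gap >= gapMax - se(iMax),1,'first'));

% 'firstMaxSE': smallest K with Gap(K) >= Gap(K+1) - SE(K+1).
ok = gap(1:end-1) >= gap(2:end) - se(2:end);
kFirst = klist(find(ok,1,'first'));

fprintf('globalMaxSE selects K = %d, firstMaxSE selects K = %d\n',kGlobal,kFirst);
```

With these numbers the global rule selects 5 clusters, because only the fifth gap value reaches the threshold 0.55 - 0.06 = 0.49, while the first-maximum rule stops at 3 clusters, where 0.40 is at least 0.38 - 0.04.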

`eva`: Clustering evaluation data

clustering evaluation object

Clustering evaluation data, returned as a clustering evaluation object.

`clustering.evaluation.CalinskiHarabaszEvaluation` | `clustering.evaluation.DaviesBouldinEvaluation` | `clustering.evaluation.GapEvaluation` | `clustering.evaluation.SilhouetteEvaluation`
