clustering.evaluation.DaviesBouldinEvaluation class

Package: clustering.evaluation
Superclasses: clustering.evaluation.ClusterCriterion

Davies-Bouldin criterion clustering evaluation object

Description

clustering.evaluation.DaviesBouldinEvaluation is an object consisting of sample data, clustering data, and Davies-Bouldin criterion values used to evaluate the optimal number of clusters. Create a Davies-Bouldin criterion clustering evaluation object using evalclusters.

Construction

eva = evalclusters(x,clust,'DaviesBouldin') creates a Davies-Bouldin criterion clustering evaluation object.

eva = evalclusters(x,clust,'DaviesBouldin',Name,Value) creates a Davies-Bouldin criterion clustering evaluation object using additional options specified by one or more name-value pair arguments.

Input Arguments


x — Input data
matrix

Input data, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.

Data Types: single | double

clust — Clustering algorithm
'kmeans' | 'linkage' | 'gmdistribution' | matrix of clustering solutions | function handle

Clustering algorithm, specified as one of the following.

'kmeans'          Cluster the data in x using the kmeans clustering algorithm, with 'EmptyAction' set to 'singleton' and 'Replicates' set to 5.
'linkage'         Cluster the data in x using the clusterdata agglomerative clustering algorithm, with 'Linkage' set to 'ward'.
'gmdistribution'  Cluster the data in x using the gmdistribution Gaussian mixture distribution algorithm, with 'SharedCov' set to true and 'Replicates' set to 5.

If Criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can specify a clustering algorithm using the function_handle (@) operator. The function must be of the form C = clustfun(DATA,K), where DATA is the data to be clustered and K is the number of clusters. The output of clustfun must be one of the following:

  • A vector of integers representing the cluster index for each observation in DATA. There must be K unique values in this vector.

  • A numeric n-by-K matrix of scores for n observations and K classes. In this case, the cluster index for each observation is the column that holds the largest score in that observation's row.

If Criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can also specify clust as an n-by-K matrix containing the proposed clustering solutions. n is the number of observations in the sample data, and K is the number of proposed clustering solutions. Column j contains the cluster indices for each of the n points in the jth clustering solution.
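The function-handle and solution-matrix forms of clust described above can be sketched as follows. This is a minimal illustration, not a canonical recipe: it assumes Statistics Toolbox functions kmeans and evalclusters are available, and the names data, myClust, kValues, solutions, and scores are hypothetical, as is the random sample data.

```matlab
rng('default');                        % for reproducibility
data = [randn(50,2); randn(50,2)+4];   % hypothetical sample data: two groups

% Function-handle form: C = clustfun(DATA,K) returns one cluster index per row.
myClust = @(DATA,K) kmeans(DATA,K,'EmptyAction','singleton','Replicates',5);
evaFcn = evalclusters(data,myClust,'DaviesBouldin','KList',2:4);

% Matrix form: column j holds the cluster indices of the jth proposed solution.
kValues = 2:4;
solutions = zeros(size(data,1),numel(kValues));
for j = 1:numel(kValues)
    solutions(:,j) = kmeans(data,kValues(j),'EmptyAction','singleton','Replicates',5);
end
evaMat = evalclusters(data,solutions,'DaviesBouldin');   % no KList needed here

% Score-matrix output: the column with the largest score in each row
% gives the cluster index for that observation.
scores = [0.1 0.7 0.2; 0.8 0.1 0.1; 0.3 0.3 0.4];   % hypothetical 3-by-3 scores
[~,idx] = max(scores,[],2);                          % idx is [2; 1; 3]
```

The matrix form is useful when the clustering solutions come from an algorithm that evalclusters does not call itself; in that case the evaluation object has no clustering function to record.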

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'KList',[1:5] specifies to test 1, 2, 3, 4, and 5 clusters to find the optimal number.

'KList' — List of number of clusters to evaluate
vector

List of number of clusters to evaluate, specified as the comma-separated pair consisting of 'KList' and a vector of positive integer values. You must specify KList when clust is a clustering algorithm name string or a function handle. When criterion is 'gap', clust must be a string or a function handle, and you must specify KList.

Example: 'KList',[1:6]

Properties

ClusteringFunction

Clustering algorithm used to cluster the input data, stored as a valid clustering algorithm name string or function handle. If the clustering solutions are provided in the input, ClusteringFunction is empty.

CriterionName

Name of the criterion used for clustering evaluation, stored as a valid criterion name string.

CriterionValues

Criterion values corresponding to each proposed number of clusters in InspectedK, stored as a vector of numerical values.

InspectedK

List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values.

Missing

Logical flag for excluded data, stored as a column vector of logical values. If Missing equals true, then the corresponding value in the data matrix x is not used in the clustering solution.

NumObservations

Number of observations in the data matrix X, minus the number of missing (NaN) values in X, stored as a positive integer value.

OptimalK

Optimal number of clusters, stored as a positive integer value.

OptimalY

Optimal clustering solution corresponding to OptimalK, stored as a column vector of positive integer values. If the clustering solutions are provided in the input, OptimalY is empty.

X

Data used for clustering, stored as a matrix of numerical values.
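As a rough sketch of how these properties might be read after an evaluation, the snippet below builds a small data set with one missing value and then queries the object. The sample data and the variable names (data, eva, bestK, labels) are hypothetical, and the snippet assumes that a row containing NaN is treated as a missing observation, as the Missing and NumObservations descriptions above suggest.

```matlab
rng('default');                          % for reproducibility
data = [randn(30,2); randn(30,2)+5];     % hypothetical sample data, 60 rows
data(3,1) = NaN;                         % make observation 3 incomplete

eva = evalclusters(data,'kmeans','DaviesBouldin','KList',2:4);

isRow3Missing = eva.Missing(3);   % logical flag for the incomplete row
nUsed = eva.NumObservations;      % observations left after excluding missing rows
bestK = eva.OptimalK;             % suggested number of clusters
labels = eva.OptimalY;            % cluster indices for the suggested solution
values = eva.CriterionValues;     % one criterion value per entry of eva.InspectedK
```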

Methods

Inherited Methods

addK      Evaluate additional numbers of clusters
compact   Compact clustering evaluation object
plot      Plot clustering evaluation object criterion values
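The inherited methods might be used along the following lines. This is only a sketch: the sample data and the names data, eva, and evaCompact are hypothetical, and plot is left out because the Examples section already demonstrates it.

```matlab
rng('default');                                        % for reproducibility
data = [randn(40,2); randn(40,2)+4; randn(40,2)-4];    % hypothetical sample data

% Start with a short list of candidate cluster counts.
eva = evalclusters(data,'kmeans','DaviesBouldin','KList',2:3);

% Extend the evaluation to more cluster counts without starting over.
eva = addK(eva,4:5);

% Create a compact version of the evaluation object.
evaCompact = compact(eva);
```

Because addK returns the updated object, the result must be assigned back (here to eva) for the additional criterion values to be kept.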

Definitions

Davies-Bouldin Criterion

The Davies-Bouldin criterion is based on a ratio of within-cluster and between-cluster distances. The Davies-Bouldin index is defined as

$$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\left\{D_{i,j}\right\},$$

where k is the number of clusters and $D_{i,j}$ is the within-to-between cluster distance ratio for the ith and jth clusters. In mathematical terms,

$$D_{i,j} = \frac{\bar{d}_i + \bar{d}_j}{d_{i,j}}.$$

$\bar{d}_i$ is the average distance between each point in the ith cluster and the centroid of the ith cluster. $\bar{d}_j$ is the average distance between each point in the jth cluster and the centroid of the jth cluster. $d_{i,j}$ is the Euclidean distance between the centroids of the ith and jth clusters.

The maximum value of $D_{i,j}$ over all j other than i represents the worst-case within-to-between cluster ratio for cluster i. The optimal clustering solution has the smallest Davies-Bouldin index value.
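To make the definition concrete, the index can be computed by hand for a small labeled data set. The sketch below is an illustration of the formula only, not the toolbox's implementation; the toy data and all variable names are hypothetical. It assumes at least two clusters, labels numbered 1 through k, and MATLAB R2016b or later for the implicit expansion in the subtraction.

```matlab
% Hypothetical toy data: two tight, well-separated clusters in the plane.
X = [0 0; 0 2; 10 0; 10 2];
labels = [1; 1; 2; 2];

k = max(labels);                      % number of clusters (assumes labels 1..k)
centroids = zeros(k,size(X,2));
dbar = zeros(k,1);
for i = 1:k
    members = X(labels == i,:);
    centroids(i,:) = mean(members,1);
    % Average Euclidean distance from each member to its own centroid.
    dbar(i) = mean(sqrt(sum((members - centroids(i,:)).^2,2)));
end

% Within-to-between ratio for every pair of distinct clusters.
D = zeros(k);
for i = 1:k
    for j = 1:k
        if i ~= j
            dij = norm(centroids(i,:) - centroids(j,:));
            D(i,j) = (dbar(i) + dbar(j))/dij;
        end
    end
end

% Worst-case ratio per cluster, averaged over all clusters.
DB = mean(max(D,[],2));               % 0.2 for this toy data
```

Here each cluster has an average within-cluster distance of 1 and the centroids are 10 apart, so both worst-case ratios are (1 + 1)/10 = 0.2 and the index is 0.2.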

Examples


Evaluate the Clustering Solution Using Davies-Bouldin Criterion

Evaluate the optimal number of clusters using the Davies-Bouldin clustering evaluation criterion.

Generate sample data containing random numbers from three multivariate normal distributions with different parameter values.

rng('default');  % For reproducibility
mu1 = [2 2];
sigma1 = [0.9 -0.0255; -0.0255 0.9];

mu2 = [5 5];
sigma2 = [0.5 0 ; 0 0.3];

mu3 = [-2, -2];
sigma3 = [1 0 ; 0 0.9];
    
N = 200;

X = [mvnrnd(mu1,sigma1,N);...
     mvnrnd(mu2,sigma2,N);...
     mvnrnd(mu3,sigma3,N)];

Evaluate the optimal number of clusters using the Davies-Bouldin criterion. Cluster the data using kmeans.

E = evalclusters(X,'kmeans','DaviesBouldin','klist',[1:6])
E = 

  DaviesBouldinEvaluation with properties:

    NumObservations: 600
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 0.4663 0.4454 0.8300 0.7283 0.9199]
           OptimalK: 3

The OptimalK value indicates that, based on the Davies-Bouldin criterion, the optimal number of clusters is three.
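The reported OptimalK is consistent with picking the smallest criterion value among the inspected cluster counts. The snippet below reproduces that selection from the values printed above; it is a sketch of the selection rule only, and the variable names are hypothetical. It relies on the fact that min skips NaN entries by default, which matters because the criterion is NaN for a single cluster.

```matlab
% Values copied from the display of E above.
inspectedK = [1 2 3 4 5 6];
criterionValues = [NaN 0.4663 0.4454 0.8300 0.7283 0.9199];

[bestValue,idx] = min(criterionValues);   % NaN entries are ignored
bestK = inspectedK(idx);                  % 3, matching E.OptimalK
```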

Plot the Davies-Bouldin criterion values for each number of clusters tested.

figure;
plot(E)

The plot shows that the lowest Davies-Bouldin value occurs at three clusters, suggesting that the optimal number of clusters is three.

Create a grouped scatter plot to visually examine the suggested clusters.

figure;
gscatter(X(:,1),X(:,2),E.OptimalY,'rbg','xod')

The plot shows three distinct clusters within the data: Cluster 1 is in the lower-left corner, cluster 2 is near the center of the plot, and cluster 3 is in the upper-right corner.

References

[1] Davies, D. L., and D. W. Bouldin. "A Cluster Separation Measure." IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. PAMI-1, No. 2, 1979, pp. 224–227.
