
clusterdata

Agglomerative clusters from data

Syntax

T = clusterdata(X,cutoff)
T = clusterdata(X,Name,Value)

Description

T = clusterdata(X,cutoff) returns cluster indices (T) for each observation (row) of the data (X), using cutoff as the threshold for cutting the hierarchical cluster tree.

T = clusterdata(X,Name,Value) clusters with additional options specified by one or more Name,Value pair arguments.

Input Arguments

X

Matrix with two or more rows. The rows represent observations, the columns represent categories or dimensions.

cutoff

When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is an integer ≥ 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.
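For illustration, a minimal sketch of the two interpretations (the data matrix here is an assumed example, not from this page):

rng default
X = rand(30,3);             % assumed example data
T1 = clusterdata(X,1.2);    % 0 < cutoff < 2: inconsistency coefficient threshold
T2 = clusterdata(X,3);      % integer cutoff >= 2: at most 3 clusters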

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'criterion'

Either 'inconsistent' or 'distance'.

'cutoff'

Cutoff for inconsistent or distance measure, a positive scalar. When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is an integer ≥ 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.

'depth'

Depth for computing inconsistent values, a positive integer.
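For example, a hedged sketch combining 'criterion', 'cutoff', and 'depth' (the data and parameter values are assumptions for illustration):

X = rand(30,3);
T = clusterdata(X,'criterion','inconsistent','cutoff',1.2,'depth',3);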

'distance'

Any of the distance metric names allowed by pdist:

'euclidean'

Euclidean distance (default).

'squaredeuclidean'

Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)

'seuclidean'

Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, S = nanstd(X).

'mahalanobis'

Mahalanobis distance using the sample covariance of X, C = nancov(X).

'cityblock'

City block distance.

'minkowski'

Minkowski distance. The default exponent is 2. To use a different exponent P, specify P after 'minkowski', where P is a positive scalar value: 'minkowski',P

'chebychev'

Chebychev distance (maximum coordinate difference).

'cosine'

One minus the cosine of the included angle between points (treated as vectors).

'correlation'

One minus the sample correlation between points (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

'spearman'

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

@distfun

Custom distance function handle. A distance function has the form

function D2 = DISTFUN(ZI,ZJ)
% calculation of distance
...
where

  • ZI is a 1-by-n vector containing a single observation.

  • ZJ is an m2-by-n matrix containing multiple observations. distfun must accept a matrix ZJ with an arbitrary number of observations.

  • D2 is an m2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).

If your data is not sparse, it is generally faster to use a built-in distance metric than a function handle. A sketch of a custom distance function appears below.

For more information on these distance metrics, see Distance Metrics.
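As an illustration of the function-handle form above, this sketch implements the city block distance by hand (for illustration only; the built-in 'cityblock' is faster). Save it as distfun.m on the MATLAB path:

function D2 = distfun(ZI,ZJ)
% ZI is 1-by-n (a single observation); ZJ is m2-by-n.
% D2(k) is the city block distance between ZI and ZJ(k,:).
D2 = sum(abs(bsxfun(@minus,ZJ,ZI)),2);

Then pass the handle to clusterdata:

T = clusterdata(X,'distance',@distfun,'maxclust',3);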

'linkage'

Any of the linkage methods allowed by the linkage function:

  • 'average'

  • 'centroid'

  • 'complete'

  • 'median'

  • 'single'

  • 'ward'

  • 'weighted'

For details, see the definitions in the linkage function reference page.
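Different linkage methods can assign the same data to different clusters. A hedged sketch with assumed example data:

X = rand(30,3);
Tsingle   = clusterdata(X,'linkage','single','maxclust',3);
Tcomplete = clusterdata(X,'linkage','complete','maxclust',3);
crosstab(Tsingle,Tcomplete)   % cross-tabulate; the two need not agree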

'maxclust'

Maximum number of clusters to form, a positive integer.

'savememory'

Either 'on' or 'off'. When applicable, the 'on' setting causes clusterdata to construct clusters without computing the distance matrix. savememory is applicable when:

  • 'linkage' is 'centroid', 'median', or 'ward'

  • 'distance' is 'euclidean' (default)

When savememory is 'on', linkage run time is proportional to the number of dimensions (number of columns of X). When savememory is 'off', linkage memory requirement is proportional to N^2, where N is the number of observations. So choosing the best (least-time) setting for savememory depends on the problem dimensions, number of observations, and available memory. The default savememory setting is a rough approximation of an optimal setting.

Default: 'on' when X has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix; otherwise 'off'
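To gauge whether the distance matrix fits in memory, note that pdist stores N(N-1)/2 distances in double precision. A rough back-of-the-envelope sketch:

N = 20000;                      % number of observations
bytes = N*(N-1)/2 * 8;          % 8 bytes per double
fprintf('%.2f GB\n',bytes/1e9)  % prints about 1.60 GB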

Output Arguments

T

T is a vector of length m, where m is the number of observations in X, containing a cluster number for each observation.

  • When 0 < cutoff < 2, T = clusterdata(X,cutoff) is equivalent to:

    Y = pdist(X,'euclidean'); 
    Z = linkage(Y,'single'); 
    T = cluster(Z,'cutoff',cutoff); 
  • When cutoff is an integer ≥ 2, T = clusterdata(X,cutoff) is equivalent to:

    Y = pdist(X,'euclidean'); 
    Z = linkage(Y,'single'); 
    T = cluster(Z,'maxclust',cutoff); 
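A quick check of the second equivalence (a sketch with assumed example data):

X = rand(30,3);
T1 = clusterdata(X,3);
Y = pdist(X,'euclidean');
Z = linkage(Y,'single');
T2 = cluster(Z,'maxclust',3);
isequal(T1,T2)   % expected: logical 1 (true)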

Examples


This example shows how to create a hierarchical cluster tree from sample data, and visualize the clusters using a 3-dimensional scatter plot.

Generate sample data matrices containing random numbers from the standard uniform distribution.

rng('default');  % For reproducibility
X = [gallery('uniformdata',[10 3],12);...
    gallery('uniformdata',[10 3],13)+1.2;...
    gallery('uniformdata',[10 3],14)+2.5];

Compute the distances between items and create a hierarchical cluster tree from the sample data. List all of the items in cluster 2.

T = clusterdata(X,'maxclust',3);
find(T==2)
ans =

    11
    12
    13
    14
    15
    16
    17
    18
    19
    20

Plot the data with each cluster shown in a different color.

scatter3(X(:,1),X(:,2),X(:,3),100,T,'filled')

This example shows how to create a hierarchical cluster tree using Ward's linkage, and visualize the clusters using a 3-dimensional scatter plot.

Create a 20,000-by-3 matrix of sample data generated from the standard uniform distribution.

rng default;  % For reproducibility
X = rand(20000,3);

Create a hierarchical cluster tree from the sample data using Ward's linkage. Set 'savememory' to 'on' to construct clusters without computing the distance matrix.

c = clusterdata(X,'linkage','ward','savememory','on','maxclust',4);

Plot the data with each cluster shown in a different color.

scatter3(X(:,1),X(:,2),X(:,3),10,c)

Tips

  • The centroid and median methods can produce a cluster tree that is not monotonic. This occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps. To avoid this, use another method.

    For example, suppose cluster 1 and cluster 3 are joined into a new cluster, and the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3. The result is a nonmonotonic tree; a minimal reproduction appears after this list.

  • To work with the hierarchical tree itself, construct it directly with linkage (as in the equivalent code above) and provide its output Z to other functions, including dendrogram to display the tree, cluster to assign points to clusters, inconsistent to compute inconsistency measures, and cophenet to compute the cophenetic correlation coefficient. The output T of clusterdata contains only the cluster assignments.
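Here is a minimal sketch (assumed example points, not from this page) that reproduces a nonmonotonic tree with centroid linkage; three points forming a near-equilateral triangle suffice:

X = [0 0; 1 0; 0.5 0.9];   % near-equilateral triangle
Z = linkage(X,'centroid');
Z(:,3)                     % merge heights: 1.0, then 0.9 (an inversion)
dendrogram(Z)              % the path to the root takes a downward step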

Introduced before R2006a
