clusterdata - Agglomerative clusters from data

Syntax

T = clusterdata(X,cutoff)
T = clusterdata(X,Name,Value)

Description

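T = clusterdata(X,cutoff) returns a vector T of cluster numbers, one for each observation (row) of X. clusterdata computes pairwise distances with pdist, builds a hierarchical cluster tree with linkage, and partitions the tree with cluster, using cutoff as the threshold.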

T = clusterdata(X,Name,Value) clusters with additional options specified by one or more Name,Value pair arguments.

Input Arguments

X

Matrix with two or more rows. Rows represent observations; columns represent categories or dimensions.

cutoff

When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is an integer ≥ 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.
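The two interpretations of cutoff can be sketched as follows. The sample data, the random seed, and the threshold 0.9 are illustrative assumptions and do not come from this page.

```matlab
% Illustrative data (assumed): two well-separated groups of 2-D points
rng(1);                                  % fix the seed for repeatability
X = [rand(10,2); rand(10,2) + 3];

% 0 < cutoff < 2: cutoff is a threshold on the inconsistency coefficient
Tinc = clusterdata(X,0.9);

% Integer cutoff >= 2: cutoff is the maximum number of clusters
Tmax = clusterdata(X,2);

% The inconsistency coefficients behind the first call can be inspected
% directly; inconsistent returns them in the fourth column of its output.
Z = linkage(pdist(X,'euclidean'),'single');
I = inconsistent(Z);
coeffs = I(:,4);
```

Inspecting coeffs before choosing a threshold can help, because a suitable inconsistency cutoff depends on the data.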

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments, where Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'criterion'

Either 'inconsistent' or 'distance'.

'cutoff'

Cutoff for inconsistent or distance measure, a positive scalar. When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is an integer ≥ 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.

'depth'

Depth for computing inconsistent values, a positive integer.
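As a rough sketch of how 'criterion', 'cutoff', and 'depth' combine, the calls below cut the same tree in two ways. The data and the numeric values 1.5, 0.9, and 3 are assumptions chosen for illustration only.

```matlab
% Illustrative data (assumed): two groups inside unit squares, offset by 3
rng(1);
X = [rand(10,2); rand(10,2) + 3];

% Distance criterion: cut the tree where the linkage distance reaches 1.5.
% Points within a unit square are less than 1.5 apart, and the two squares
% are more than 1.5 apart, so this threshold falls between the groups.
Tdist = clusterdata(X,'criterion','distance','cutoff',1.5);

% Inconsistency criterion: compare each link with the links up to
% 3 levels below it in the tree.
Tinc = clusterdata(X,'criterion','inconsistent','cutoff',0.9,'depth',3);
```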

'distance'

Any of the distance metric names allowed by pdist (follow the 'minkowski' option by the value of the exponent p):

Each metric name below is followed by its description.
'euclidean'

Euclidean distance (default).

'seuclidean'

Standardized Euclidean distance. Each coordinate difference between rows in X is scaled by dividing by the corresponding element of the standard deviation S=nanstd(X). To specify another value for S, use D=pdist(X,'seuclidean',S).

'cityblock'

City block metric.

'minkowski'

Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',P), where P is a positive scalar exponent.

'chebychev'

Chebychev distance (maximum coordinate difference).

'mahalanobis'

Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist(X,'mahalanobis',C), where the matrix C is symmetric and positive definite.

'cosine'

One minus the cosine of the included angle between points (treated as vectors).

'correlation'

One minus the sample correlation between points (treated as sequences of values).

'spearman'

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

custom distance function

A distance function specified using @:
D = pdist(X,@distfun)

A distance function must be of the form

d2 = distfun(XI,XJ)

taking as arguments a 1-by-n vector XI, corresponding to a single row of X, and an m2-by-n matrix XJ, corresponding to multiple rows of X. distfun must accept a matrix XJ with an arbitrary number of rows. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).
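A minimal sketch of a function with this signature is shown below. The weighted city block distance and the weight vector w are hypothetical, chosen only to illustrate the required input and output shapes.

```matlab
% Illustrative data (assumed): 20 observations in 3 dimensions
rng(1);
X = [rand(10,3); rand(10,3) + 2];

% Hypothetical weights, one per column of X
w = [1 0.5 2];

% XI is 1-by-n and XJ is m2-by-n; the result is an m2-by-1 column vector
% whose kth element is the weighted distance between XI and XJ(k,:).
distfun = @(XI,XJ) abs(bsxfun(@minus,XJ,XI)) * w';

Y = pdist(X,distfun);           % pairwise distances from the custom function
Z = linkage(Y,'average');       % hierarchical cluster tree
T = cluster(Z,'maxclust',2);    % cut the tree into at most two clusters
```

Calling distfun(X(1,:),X(2:4,:)) on its own is a quick way to confirm that the function returns a 3-by-1 vector before passing it to pdist.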

'linkage'

Any of the linkage methods allowed by the linkage function:

  • 'average'

  • 'centroid'

  • 'complete'

  • 'median'

  • 'single'

  • 'ward'

  • 'weighted'

For details, see the definitions in the linkage function reference page.

'maxclust'

Maximum number of clusters to form, a positive integer.
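Several name-value pairs can be combined in one call. The sketch below uses assumed sample data and an arbitrary choice of metric and linkage method.

```matlab
% Illustrative data (assumed): three groups of 10 observations each
rng(1);
X = [rand(10,3); rand(10,3) + 2; rand(10,3) + 4];

% City block distance, average linkage, at most three clusters
T = clusterdata(X,'distance','cityblock','linkage','average','maxclust',3);

% Number of observations assigned to each cluster
counts = accumarray(T,1);
```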

'savememory'

A string, either 'on' or 'off'. When applicable, the 'on' setting causes clusterdata to construct clusters without computing the distance matrix. savememory is applicable when both of the following hold:

  • linkage is 'centroid', 'median', or 'ward'

  • distance is 'euclidean' (default)

When savememory is 'on', linkage run time is proportional to the number of dimensions (number of columns of X). When savememory is 'off', the linkage memory requirement is proportional to N^2, where N is the number of observations. So choosing the best (least-time) setting for savememory depends on the problem dimensions, number of observations, and available memory. The default savememory setting is a rough approximation of an optimal setting.

Default: 'on' when X has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix; otherwise 'off'
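To get a feel for the N^2 memory growth, the following back-of-the-envelope estimate counts the storage for the pairwise distance vector that pdist returns, which has N(N-1)/2 double-precision entries. The value of N is illustrative, and the estimate ignores any additional working memory.

```matlab
N = 20000;                      % number of observations (illustrative)
numPairs = N*(N-1)/2;           % entries in the pairwise distance vector
bytesNeeded = 8*numPairs;       % 8 bytes per double-precision value
fprintf('%.2f GB\n', bytesNeeded/1e9);   % prints 1.60 GB
```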

Output Arguments

T

T is a vector of size m, where m is the number of rows of X, containing a cluster number for each observation.

  • When 0 < cutoff < 2, T = clusterdata(X,cutoff) is equivalent to:

    Y = pdist(X,'euclidean');
    Z = linkage(Y,'single');
    T = cluster(Z,'cutoff',cutoff);

  • When cutoff is an integer ≥ 2, T = clusterdata(X,cutoff) is equivalent to:

    Y = pdist(X,'euclidean');
    Z = linkage(Y,'single');
    T = cluster(Z,'maxclust',cutoff);
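The stated equivalence can be checked with a short sketch such as the one below. The random data and the choice of three clusters are assumptions for illustration.

```matlab
% Illustrative data (assumed): 30 observations in 3 dimensions
rng(1);
X = rand(30,3);

% One-step clustering with an integer cutoff
T1 = clusterdata(X,3);

% The same steps written out explicitly
Y = pdist(X,'euclidean');
Z = linkage(Y,'single');
T2 = cluster(Z,'maxclust',3);

% True if both approaches assign identical cluster numbers
sameResult = isequal(T1,T2);
```

Writing the steps out explicitly is useful when you want to reuse Z, for example to draw a dendrogram or to try several cutoffs without recomputing the tree.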

Examples

The example first creates a sample data set of random numbers. It then uses clusterdata to compute the distances between items in the data set and create a hierarchical cluster tree from the data set. Finally, the clusterdata function groups the items in the data set into three clusters. The example uses the find function to list all the items in cluster 2, and the scatter3 function to plot the data with each cluster shown in a different color.

X = [gallery('uniformdata',[10 3],12);...
gallery('uniformdata',[10 3],13)+1.2;...
gallery('uniformdata',[10 3],14)+2.5];
T = clusterdata(X,'maxclust',3); 
find(T==2)
ans =
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
scatter3(X(:,1),X(:,2),X(:,3),100,T,'filled')

 

Create a hierarchical cluster tree for a data set with 20,000 observations using Ward's linkage. If you set savememory to 'off', you can get an out-of-memory error if your machine does not have enough memory to hold the distance matrix.

X = rand(20000,3);
c = clusterdata(X,'linkage','ward','savememory','on',...
    'maxclust',4);
scatter3(X(:,1),X(:,2),X(:,3),10,c)

See Also

cluster | inconsistent | kmeans | linkage | pdist

  


© 1984-2012 The MathWorks, Inc.