T = clusterdata(X,cutoff)
T = clusterdata(X,Name,Value)

Description

T = clusterdata(X,cutoff) returns
the cluster index T of each observation (row)
of the data X, using cutoff as the threshold for
cutting the hierarchical tree.

T = clusterdata(X,Name,Value) clusters
with additional options specified by one or more Name,Value pair
arguments.

Input Arguments

X

Matrix with two or more rows. The rows represent observations,
and the columns represent categories or dimensions.

cutoff

When 0 < cutoff < 2, clusterdata forms
clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is
an integer ≥ 2, clusterdata interprets cutoff as
the maximum number of clusters to keep in the hierarchical tree generated
by linkage.
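
The two interpretations of cutoff can be tried side by side. The following is a minimal sketch; the sample data, its size, and the threshold values are illustrative assumptions rather than values from this page.

```matlab
rng default;              % for reproducibility
X = rand(30,2);           % hypothetical data: 30 observations, 2 dimensions

% Integer cutoff >= 2: interpreted as the maximum number of clusters.
Tmax = clusterdata(X,3);

% Cutoff between 0 and 2: interpreted as a threshold on inconsistent values.
Tinc = clusterdata(X,1.15);

numClustersMax = numel(unique(Tmax));   % no more than 3
numClustersInc = numel(unique(Tinc));   % depends on the data
```

With an integer cutoff the number of clusters is bounded in advance; with a fractional cutoff it is determined by the data.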

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments.
Name is the argument
name and Value is the corresponding
value. Name must appear
inside single quotes (' ').
You can specify several name and value pair
arguments in any order as Name1,Value1,...,NameN,ValueN.

'criterion'

Either 'inconsistent' or 'distance'.

'cutoff'

Cutoff for inconsistent or distance measure, a positive scalar.
When 0 < cutoff < 2, clusterdata forms
clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is
an integer ≥ 2, clusterdata interprets cutoff as
the maximum number of clusters to keep in the hierarchical tree generated
by linkage.
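
As an illustration of combining 'criterion' and 'cutoff', the sketch below cuts the tree at a fixed distance. The six points and the threshold of 1 are made up for this example: two tight groups lie far apart, so a distance cutoff between the within-group and between-group spacing should separate them.

```matlab
% Hypothetical data: two tight groups of three points each.
X = [0 0; 0.1 0; 0 0.1; 5 5; 5.1 5; 5 5.1];

% Cut the tree where the linkage distance reaches 1.
T = clusterdata(X,'criterion','distance','cutoff',1);

numClusters = numel(unique(T));
```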

'depth'

Depth for computing inconsistent values, a positive integer.

'distance'

Any of the distance metric names allowed by pdist (follow the 'minkowski' option
by the value of the exponent p):

Metric

Description

'euclidean'

Euclidean distance (default).

'seuclidean'

Standardized Euclidean distance. Each coordinate difference
between rows in X is scaled by dividing by the corresponding element
of the standard deviation S=nanstd(X).
To specify another value for S, use D=pdist(X,'seuclidean',S).

'cityblock'

City block metric.

'minkowski'

Minkowski distance. The default exponent is 2. To specify
a different exponent, use D = pdist(X,'minkowski',P),
where P is a scalar positive value of the exponent.

'mahalanobis'

Mahalanobis distance, using the sample covariance of X as
computed by nancov. To compute
the distance with a different covariance, use D = pdist(X,'mahalanobis',C),
where the matrix C is symmetric and positive definite.

'cosine'

One minus the cosine of the included angle between points
(treated as vectors).

'correlation'

One minus the sample correlation between points (treated
as sequences of values).

'spearman'

One minus the sample Spearman's rank correlation between
observations (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates
that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage
of nonzero coordinates that differ.

custom distance function

A distance function specified using @, for example D = pdist(X,@distfun).

A distance function must be of the form

d2 = distfun(XI,XJ)

taking as arguments a 1-by-n vector XI,
corresponding to a single row of X, and an m2-by-n matrix XJ,
corresponding to multiple rows of X. distfun must
accept a matrix XJ with an arbitrary number of
rows. distfun must return an m2-by-1
vector of distances d2, whose kth
element is the distance between XI and XJ(k,:).
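
A minimal sketch of a custom distance function that follows this contract is shown below. The handle name cityblockFun and the sample data are hypothetical; the function reimplements the city block metric by hand so its result can be compared with the built-in 'cityblock' option.

```matlab
% Hypothetical data: two tight groups of three points each.
X = [0 0; 0.1 0; 0 0.1; 5 5; 5.1 5; 5 5.1];

% XI is 1-by-n, XJ is m2-by-n; the result is an m2-by-1 vector of distances.
cityblockFun = @(XI,XJ) sum(abs(bsxfun(@minus,XJ,XI)),2);

Tcustom  = clusterdata(X,'distance',cityblockFun,'maxclust',2);
Tbuiltin = clusterdata(X,'distance','cityblock','maxclust',2);
```

bsxfun is used here so that the subtraction works for any number of rows in XJ, including on releases without implicit expansion.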

'linkage'

Any of the linkage methods allowed by the linkage function:

'average'

'centroid'

'complete'

'median'

'single'

'ward'

'weighted'

For details, see the definitions in the linkage function
reference page.

'maxclust'

Maximum number of clusters to form, a positive integer.

'savememory'

A string, either 'on' or 'off'.
When applicable, the 'on' setting causes clusterdata to
construct clusters without computing the distance matrix. savememory is
applicable when:

linkage is 'centroid', 'median',
or 'ward'

distance is 'euclidean' (default)

When savememory is 'on', linkage run
time is proportional to the number of dimensions (number of columns
of X). When savememory is 'off', linkage memory
requirement is proportional to N^2,
where N is the number of observations. So choosing
the best (least-time) setting for savememory depends
on the problem dimensions, number of observations, and available memory.
The default savememory setting is a rough approximation
of an optimal setting.

Default: 'on' when X has 20
columns or fewer, or the computer does not have enough memory to store
the distance matrix; otherwise 'off'

Output Arguments

T

T is a vector of length m, where m is the number of
observations (rows) in X, containing
a cluster number for each observation.

When 0 < cutoff < 2, T = clusterdata(X,cutoff) is equivalent to:

Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'cutoff',cutoff);

When cutoff is an integer ≥ 2, T = clusterdata(X,cutoff) is equivalent to:

Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'maxclust',cutoff);
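
The stated equivalence can be checked on sample data. The sketch below makes some assumptions: the data and the choice k = 4 are arbitrary, and the comparison relabels clusters in order of first appearance so that it tests the grouping rather than the specific cluster numbers.

```matlab
rng default;                    % for reproducibility
X = rand(50,3);                 % hypothetical sample data
k = 4;                          % integer cutoff, so it acts as 'maxclust'

T1 = clusterdata(X,k);

Y  = pdist(X,'euclidean');
Z  = linkage(Y,'single');
T2 = cluster(Z,'maxclust',k);

% Relabel in order of first appearance, then compare the groupings.
[~,~,r1] = unique(T1,'stable');
[~,~,r2] = unique(T2,'stable');
sameGrouping = isequal(r1,r2);
```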

Examples

This example shows how to create a hierarchical cluster tree using Ward's linkage, and visualize the clusters using a 3-dimensional scatter plot.

Create a 20,000-by-3 matrix of sample data generated from the standard uniform distribution.

rng default; % For reproducibility
X = rand(20000,3);

Cluster the sample data using Ward's linkage, keeping at most four clusters. Set 'savememory' to 'on' to construct clusters without computing the distance matrix.

c = clusterdata(X,'linkage','ward','savememory','on','maxclust',4);

Plot the data with each cluster shown in a different color.
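
The plotting command did not survive on this page. One way to draw such a plot, assuming scatter3 is used, is sketched below; the marker size of 10 is an arbitrary choice. The data and clustering steps are repeated so that the block runs on its own.

```matlab
rng default;                    % for reproducibility
X = rand(20000,3);
c = clusterdata(X,'linkage','ward','savememory','on','maxclust',4);

% Color each point by its cluster index; 10 is an assumed marker size.
scatter3(X(:,1),X(:,2),X(:,3),10,c)
```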

The centroid and median methods
can produce a cluster tree that is not monotonic. This occurs when
the distance from the union of two clusters, r and s,
to a third cluster is less than the distance between r and s.
In this case, in a dendrogram drawn with the default orientation,
the path from a leaf to the root node takes some downward steps.
To avoid this, use another method. The following image shows a nonmonotonic
cluster tree.

In this case, cluster 1 and cluster 3 are joined into a new cluster,
while the distance between this new cluster and cluster 2 is less
than the distance between cluster 1 and cluster 3. This leads to a
nonmonotonic tree.
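
The situation described above can be reproduced with three points. This is a minimal sketch with made-up coordinates: points 1 and 3 are 1 unit apart and are joined first, and their centroid (0.5, 0) is only 0.9 units from point 2, so the second merge is lower than the first.

```matlab
% Hypothetical data: rows are points 1, 2, and 3.
X = [0 0; 0.5 0.9; 1 0];

% Centroid linkage on the raw data (Euclidean distance).
Z = linkage(X,'centroid');

% The third column of Z holds the merge heights; a decrease from one
% merge to the next indicates a nonmonotonic tree.
isNonmonotonic = any(diff(Z(:,3)) < 0);
```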

The output T contains only cluster indices, not the hierarchical tree.
To display the tree with dendrogram, assign points to clusters with cluster,
compute inconsistent measures with inconsistent, or compute the cophenetic
correlation coefficient with cophenet, build the tree Z with linkage and
pass Z to those functions.