Z = linkage(X) Z = linkage(X,method) Z = linkage(X,method,metric) Z = linkage(X,method,pdist_inputs) Z = linkage(X,method,metric,'savememory',value) Z = linkage(Y) Z = linkage(Y,method)

Description

Z = linkage(X) returns
a matrix Z that encodes a tree of hierarchical
clusters of the rows of the real matrix X.

Z = linkage(X,method) creates
the tree using the specified method, where method describes
how to measure the distance between clusters.

Z = linkage(X,method,metric) performs
clustering using the distance measure metric to
compute distances between the rows of X.

Z = linkage(X,method,pdist_inputs) passes
parameters to the pdist function,
which is the function that computes the distance between rows of X.

Z = linkage(X,method,metric,'savememory',value) uses
a memory-saving algorithm when value is 'true',
and uses the standard algorithm when value is 'false'.

Z = linkage(Y) uses
a vector representation Y of a distance matrix. Y can
be a distance matrix as computed by pdist,
or a more general dissimilarity matrix conforming to the output format
of pdist.

Z = linkage(Y,method) creates
the tree using the specified method, where method describes
how to measure the distance between clusters.

Input Arguments

X

Matrix with two or more rows. The rows represent observations,
the columns represent categories or dimensions.

method

Algorithm for computing distance between clusters.

Method

Description

'average'

Unweighted average distance (UPGMA)

'centroid'

Centroid distance (UPGMC), appropriate for Euclidean
distances only

'complete'

Furthest distance

'median'

Weighted center of mass distance (WPGMC), appropriate
for Euclidean distances only

'single'

Shortest distance

'ward'

Inner squared distance (minimum variance algorithm),
appropriate for Euclidean distances only

'weighted'

Weighted average distance (WPGMA)

Default: 'single'

metric

Any distance metric that the pdist function
accepts.

Metric

Description

'euclidean'

Euclidean distance (default).

'seuclidean'

Standardized Euclidean distance. Each coordinate difference
between rows in X is scaled by dividing by the corresponding element
of the standard deviation S=nanstd(X).
To specify another value for S, use D=pdist(X,'seuclidean',S).

'cityblock'

City block metric.

'minkowski'

Minkowski distance. The default exponent is 2. To specify
a different exponent, use D = pdist(X,'minkowski',P),
where P is a scalar positive value of the exponent.

Mahalanobis distance, using the sample covariance of X as
computed by nancov. To compute
the distance with a different covariance, use D = pdist(X,'mahalanobis',C),
where the matrix C is symmetric and positive definite.

'cosine'

One minus the cosine of the included angle between points
(treated as vectors).

'correlation'

One minus the sample correlation between points (treated
as sequences of values).

'spearman'

One minus the sample Spearman's rank correlation between
observations (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates
that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage
of nonzero coordinates that differ.

custom distance function

A distance function specified using @: D
= pdist(X,@distfun)

A distance function must
be of form

d2 = distfun(XI,XJ)

taking
as arguments a 1-by-n vector XI,
corresponding to a single row of X, and an m2-by-n matrix XJ,
corresponding to multiple rows of X. distfun must
accept a matrix XJ with an arbitrary number of
rows. distfun must return an m2-by-1
vector of distances d2, whose kth
element is the distance between XI and XJ(k,:).

Default: 'euclidean'

pdist_inputs

A cell array of parameters accepted by the pdist function. For example, to set the metric to minkowski and
use an exponent of 5, set pdist_inputs to {'minkowski',5}.

savememory

A string, either 'on' or 'off'.
When applicable, the 'on' setting causes linkage to
construct clusters without computing the distance matrix. savememory is
applicable when:

linkage is 'centroid', 'median',
or 'ward'

distance is 'euclidean' (default)

When savememory is 'on', linkage run
time is proportional to the number of dimensions (number of columns
of X). When savememory is 'off', linkage memory
requirement is proportional to N^{2},
where N is the number of observations. So choosing
the best (least-time) setting for savememory depends
on the problem dimensions, number of observations, and available memory.
The default savememory setting is a rough approximation
of an optimal setting.

Default: 'on' when X has 20
columns or fewer, or the computer does not have enough memory to store
the distance matrix; otherwise 'off'

Y

A vector of distances with the same format as the output of
the pdist function:

A row vector of length m(m–1)/2,
corresponding to pairs of observations in a matrix X with m rows

Distances arranged in the order (2,1), (3,1), ...,
(m,1), (3,2), ..., (m,2), ...,
(m,m–1))

Y can be a more general dissimilarity matrix
conforming to the output format of pdist.

Output Arguments

Z

Z is a (m – 1)-by-3
matrix, where m is the number of observations in
the original data. Columns 1 and 2 of Z contain
cluster indices linked in pairs to form a binary tree. The leaf nodes
are numbered from 1 to m. Leaf nodes are the singleton
clusters from which all higher clusters are built. Each newly-formed
cluster, corresponding to row Z(I,:), is assigned
the index m+I. Z(I,1:2) contains
the indices of the two component clusters that form cluster m+I.
There are m-1 higher clusters which correspond
to the interior nodes of the clustering tree. Z(I,3) contains
the linkage distances between the two clusters merged in row Z(I,:).

For example, suppose there are 30 initial nodes and at step
12 cluster 5 and cluster 7 are combined. Suppose their distance at
that time is 1.5. Then Z(12,:) will be [5,
7, 1.5]. The newly formed cluster will have index 12 + 30 = 42.
If cluster 42 appears in a later row, it means the cluster created
at step 12 is being combined into some larger cluster.

The following notation is used to describe the linkages used
by the various methods:

Cluster r is formed from clusters p and q.

n_{r} is
the number of objects in cluster r.

x_{ri} is
the ith object in cluster r.

Single linkage,
also called nearest neighbor, uses the smallest
distance between objects in the two clusters:

Complete linkage, also called furthest
neighbor, uses the largest distance between objects in
the two clusters:

Average linkage uses the average distance between
all pairs of objects in any two clusters:

Centroid linkage uses the Euclidean distance
between the centroids of the two clusters:

where

Median linkage uses the Euclidean
distance between weighted centroids of the two clusters,

where and are weighted centroids for the
clusters r and s. If cluster r was
created by combining clusters p and q, is defined recursively as

Ward's linkage uses the incremental
sum of squares; that is, the increase in the total within-cluster
sum of squares as a result of joining two clusters. The within-cluster
sum of squares is defined as the sum of the squares of the distances
between all objects in the cluster and the centroid of the cluster.
The sum of squares measure is equivalent to the following distance
measure d(r,s),
which is the formula linkage uses:

where

is Euclidean distance

and are the centroids of clusters r and s

n_{r} and n_{s} are
the number of elements in clusters r and s

In some references the Ward linkage does not use the factor
of 2 multiplying n_{r}n_{s}.
The linkage function uses this factor so the
distance between two singleton clusters is the same as the Euclidean
distance.

Weighted average
linkage uses a recursive definition for the distance between
two clusters. If cluster r was created by combining
clusters p and q, the distance
between r and another cluster s is
defined as the average of the distance between p and s and
the distance between q and s:

Computing linkage(Y) can be slow
when Y is a vector representation of the distance
matrix. For the 'centroid', 'median',
and 'ward' methods, linkage checks
whether Y is a Euclidean distance. Avoid this time-consuming
check by passing in X instead of Y.

The centroid and median methods
can produce a cluster tree that is not monotonic. This occurs when
the distance from the union of two clusters, r and s,
to a third cluster is less than the distance between r and s.
In this case, in a dendrogram drawn with the default orientation,
the path from a leaf to the root node takes some downward steps.
To avoid this, use another method. The following image shows a nonmonotonic
cluster tree.

In
this case, cluster 1 and cluster 3 are joined into a new cluster,
while the distance between this new cluster and cluster 2 is less
than the distance between cluster 1 and cluster 3. This leads to a
nonmonotonic tree.

You can provide the output Z to
other functions including dendrogram to
display the tree, cluster to
assign points to clusters, inconsistent to
compute inconsistent measures, and cophenet to
compute the cophenetic correlation coefficient.