Skip to Main Content Skip to Search
Product Documentation

linkage - Agglomerative hierarchical cluster tree

Syntax

Z = linkage(X)
Z = linkage(X,method)
Z = linkage(X,method,metric)
Z = linkage(X,method,pdist_inputs)
Z = linkage(X,method,metric,'savememory',value)
Z = linkage(Y)
Z = linkage(Y,method)

Description

Z = linkage(X) returns a matrix Z that encodes a tree of hierarchical clusters of the rows of the real matrix X.

Z = linkage(X,method) creates the tree using the specified method, where method describes how to measure the distance between clusters.

Z = linkage(X,method,metric) performs clustering using the distance measure metric to compute distances between the rows of X.

Z = linkage(X,method,pdist_inputs) passes parameters to the pdist function, which is the function that computes the distance between rows of X.

Z = linkage(X,method,metric,'savememory',value) uses a memory-saving algorithm when value is 'true', and uses the standard algorithm when value is 'false'.

Z = linkage(Y) uses a vector representation Y of a distance matrix. Y can be a distance matrix as computed by pdist, or a more general dissimilarity matrix conforming to the output format of pdist.

Z = linkage(Y,method) creates the tree using the specified method, where method describes how to measure the distance between clusters.

Tips

Input Arguments

X

Matrix with two or more rows. The rows represent observations, the columns represent categories or dimensions.

method

Algorithm for computing distance between clusters.

MethodDescription
'average'

Unweighted average distance (UPGMA)

'centroid'

Centroid distance (UPGMC), appropriate for Euclidean distances only

'complete'

Furthest distance

'median'

Weighted center of mass distance (WPGMC), appropriate for Euclidean distances only

'single'

Shortest distance

'ward'

Inner squared distance (minimum variance algorithm), appropriate for Euclidean distances only

'weighted'

Weighted average distance (WPGMA)

Default: 'single'

metric

Any distance metric that the pdist function accepts.

MetricDescription
'euclidean'

Euclidean distance (default).

'seuclidean'

Standardized Euclidean distance. Each coordinate difference between rows in X is scaled by dividing by the corresponding element of the standard deviation S=nanstd(X). To specify another value for S, use D=pdist(X,'seuclidean',S).

'cityblock'

City block metric.

'minkowski'

Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',P), where P is a scalar positive value of the exponent.

'chebychev'

Chebychev distance (maximum coordinate difference).

'mahalanobis'

Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist(X,'mahalanobis',C), where the matrix C is symmetric and positive definite.

'cosine'

One minus the cosine of the included angle between points (treated as vectors).

'correlation'

One minus the sample correlation between points (treated as sequences of values).

'spearman'

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

custom distance function

A distance function specified using @:
D = pdist(X,@distfun)

A distance function must be of form

d2 = distfun(XI,XJ)

taking as arguments a 1-by-n vector XI, corresponding to a single row of X, and an m2-by-n matrix XJ, corresponding to multiple rows of X. distfun must accept a matrix XJ with an arbitrary number of rows. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).

Default: 'euclidean'

pdist_inputs

A cell array of parameters accepted by the pdist function. For example, to set the metric to minkowski and use an exponent of 5, set pdist_inputs to {'minkowski',5}.

savememory

A string, either 'on' or 'off'. When applicable, the 'on' setting causes linkage to construct clusters without computing the distance matrix. savememory is applicable when:

  • linkage is 'centroid', 'median', or 'ward'

  • distance is 'euclidean' (default)

When savememory is 'on', linkage run time is proportional to the number of dimensions (number of columns of X). When savememory is 'off', linkage memory requirement is proportional to N2, where N is the number of observations. So choosing the best (least-time) setting for savememory depends on the problem dimensions, number of observations, and available memory. The default savememory setting is a rough approximation of an optimal setting.

Default: 'on' when X has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix; otherwise 'off'

Y

A vector of distances with the same format as the output of the pdist function:

  • A row vector of length m(m–1)/2, corresponding to pairs of observations in a matrix X with m rows

  • Distances arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1))

Y can be a more general dissimilarity matrix conforming to the output format of pdist.

Output Arguments

Z

Z is a (m – 1)-by-3 matrix, where m is the number of observations in the original data. Columns 1 and 2 of Z contain cluster indices linked in pairs to form a binary tree. The leaf nodes are numbered from 1 to m. Leaf nodes are the singleton clusters from which all higher clusters are built. Each newly-formed cluster, corresponding to row Z(I,:), is assigned the index m+I. Z(I,1:2) contains the indices of the two component clusters that form cluster m+I. There are m-1 higher clusters which correspond to the interior nodes of the clustering tree. Z(I,3) contains the linkage distances between the two clusters merged in row Z(I,:).

For example, suppose there are 30 initial nodes and at step 12 cluster 5 and cluster 7 are combined. Suppose their distance at that time is 1.5. Then Z(12,:) will be [5, 7, 1.5]. The newly formed cluster will have index 12 + 30 = 42. If cluster 42 appears in a later row, it means the cluster created at step 12 is being combined into some larger cluster.

Definitions

Linkages

The following notation is used to describe the linkages used by the various methods:

Examples

Compute four clusters of the Fisher iris data using Ward linkage and ignoring species information, and see how the cluster assignments correspond to the three species.

load fisheriris
Z = linkage(meas,'ward','euclidean');
c = cluster(Z,'maxclust',4);
crosstab(c,species)
firstfive = Z(1:5,:) % first 5 rows of Z
dendrogram(Z)

ans =
     0    25     1
     0    24    14
     0     1    35
    50     0     0

firstfive =
  102.0000  143.0000         0
    8.0000   40.0000    0.1000
    1.0000   18.0000    0.1000
   10.0000   35.0000    0.1000
  129.0000  133.0000    0.1000

 

Create a hierarchical cluster tree for a data with 20000 observations using Ward's linkage. If you set savememory to 'off', you can get an out-of-memory error if your machine doesn't have enough memory to hold the distance matrix. Cluster the data into four groups and plot the result.

X = rand(20000,3);
Z = linkage(X,'ward','euclidean','savememory','on');
c = cluster(Z,'maxclust',4);
scatter3(X(:,1),X(:,2),X(:,3),10,c)

See Also

cluster | clusterdata | cophenet | dendrogram | inconsistent | kmeans | pdist | silhouette | squareform

How To

  


 © 1984-2012- The MathWorks, Inc.    -   Site Help   -   Patents   -   Trademarks   -   Privacy Policy   -   Preventing Piracy   -   RSS