Z = linkage(X)
Z = linkage(X,method)
Z = linkage(X,method,metric)
Z = linkage(X,method,pdist_inputs)
Z = linkage(X,method,metric,'savememory',value)
Z = linkage(Y)
Z = linkage(Y,method)
Z = linkage(X) returns a matrix Z that encodes a tree of hierarchical clusters of the rows of the real matrix X.
Z = linkage(X,method) creates the tree using the specified method, where method describes how to measure the distance between clusters.
Z = linkage(X,method,metric) performs clustering using the distance measure metric to compute distances between the rows of X.
Z = linkage(X,method,pdist_inputs) passes parameters to the pdist function, which is the function that computes the distance between rows of X.
Z = linkage(X,method,metric,'savememory',value) uses a memory-saving algorithm when value is 'on', and uses the standard algorithm when value is 'off'.
Z = linkage(Y) uses a vector representation Y of a distance matrix. Y can be a distance matrix as computed by pdist, or a more general dissimilarity matrix conforming to the output format of pdist.
Z = linkage(Y,method) creates the tree using the specified method, where method describes how to measure the distance between clusters.
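As a minimal sketch of how the X and Y calling forms relate, the following compares a tree built from the data with one built from the pdist vector of the same data. The three points are made up for illustration, and the comparison assumes the default Euclidean metric.

```matlab
% Three points in the plane; rows are observations (illustrative data).
X = [0 0; 0 1; 5 5];

% Cluster directly from the data (Euclidean distance by default).
Z1 = linkage(X,'average');

% Cluster from the pdist vector of the same data instead.
Y  = pdist(X);
Z2 = linkage(Y,'average');

% Largest difference between the two trees; expected to be zero up to round-off.
maxDiff = max(abs(Z1(:) - Z2(:)))
```

Each row of Z describes one merge: the first two columns hold the indices of the clusters that were joined, and the third column holds the distance between them.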
Computing linkage(Y) can be slow when Y is a vector representation of the distance matrix. For the 'centroid', 'median', and 'ward' methods, linkage checks whether Y is a Euclidean distance. Avoid this time-consuming check by passing in X instead of Y.
The centroid and median methods can produce a cluster tree that is not monotonic. This occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps. To avoid this, use another method. The following image shows a nonmonotonic cluster tree.

In this case, cluster 1 and cluster 3 are joined into a new cluster, while the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3. This leads to a nonmonotonic tree.
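One way to check a tree for this behavior is to look for a decrease in the merge heights stored in the third column of Z. The sketch below uses three made-up points forming a nearly equilateral triangle: the closest pair is 2 apart, and the centroid of that pair lies 1.8 from the third point, so centroid linkage is expected to produce a second merge that is lower than the first.

```matlab
% Nearly equilateral triangle: points 1 and 2 are the closest pair (distance 2).
X = [0 0; 2 0; 1 1.8];

Z = linkage(X,'centroid');

% Merge heights are in the third column of Z. A negative step means a
% later merge happened at a smaller distance than an earlier one.
heights = Z(:,3);
isNonmonotonic = any(diff(heights) < 0)
```

The same test applied to a 'single', 'complete', or 'average' tree should never report a decrease, which is why switching methods avoids the problem.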
You can provide the output Z to other functions including dendrogram to display the tree, cluster to assign points to clusters, inconsistent to compute inconsistent measures, and cophenet to compute the cophenetic correlation coefficient.
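The sketch below passes one tree to each of those functions. The data are random and the seed is fixed only so the run is repeatable; the choice of 'average' linkage and of two clusters is an assumption made for illustration.

```matlab
rng(1);                            % fixed seed, only so the sketch is repeatable
X = [randn(10,2); randn(10,2)+5];  % two groups of 10 points, offset by 5

Y = pdist(X);                      % pairwise Euclidean distances
Z = linkage(Y,'average');

T = cluster(Z,'maxclust',2);       % cluster index for each observation
I = inconsistent(Z);               % inconsistency statistics, one row per link
c = cophenet(Z,Y);                 % cophenetic correlation coefficient
dendrogram(Z)                      % plot the tree
```

cophenet needs the original distance vector Y as well as Z, which is why the distances are computed separately here rather than inside linkage.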
X
Matrix with two or more rows. The rows represent observations, the columns represent categories or dimensions.
method
Algorithm for computing distance between clusters.
Default: 'single'
metric
Any distance metric that the pdist function accepts.
Default: 'euclidean'
pdist_inputs
A cell array of parameters accepted by the pdist function. For example, to set the metric to minkowski and use an exponent of 5, set pdist_inputs to {'minkowski',5}.
savememory
A string, either 'on' or 'off'. When applicable, the 'on' setting causes linkage to construct clusters without computing the distance matrix. savememory is applicable when method is 'centroid', 'median', or 'ward' and metric is 'euclidean' (the default).
When savememory is 'on', linkage run time is proportional to the number of dimensions (number of columns of X). When savememory is 'off', linkage memory requirement is proportional to N^2, where N is the number of observations. So choosing the best (least-time) setting for savememory depends on the problem dimensions, number of observations, and available memory. The default savememory setting is a rough approximation of an optimal setting.
Default: 'on' when X has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix; otherwise 'off'
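Y
A vector of distances with the same format as the output of the pdist function: a row vector of length m(m-1)/2, corresponding to pairs of observations in a matrix X with m rows, with distances arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m-1).
Y can be a more general dissimilarity matrix conforming to the output format of pdist.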
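The two less obvious inputs, pdist_inputs and a general dissimilarity vector Y, can be sketched as follows. The data matrix X and the dissimilarity matrix D are made up for illustration; squareform is used here to convert a square, symmetric matrix with a zero diagonal into the vector layout that pdist produces.

```matlab
X = [1 2; 2 1; 6 5; 7 7];   % illustrative data, one observation per row

% Minkowski distance with exponent 5, passed through to pdist.
Z1 = linkage(X,'average',{'minkowski',5});

% Equivalent two-step form: compute the distances first, then link.
Y1 = pdist(X,'minkowski',5);
Z2 = linkage(Y1,'average');

% A hypothetical symmetric dissimilarity matrix that did not come from pdist.
D  = [0 2 6; 2 0 5; 6 5 0];
Y2 = squareform(D);         % entries (2,1), (3,1), (3,2) as a row vector
Z3 = linkage(Y2,'complete');
```

With 'complete' linkage the second merge in Z3 joins item 3 to the pair {1,2} at the larger of the two dissimilarities involved, which is 6 for this D.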
The following notation is used to describe the linkages used by the various methods:
Cluster r is formed from clusters p and q.
$n_r$ is the number of objects in cluster r.
$x_{ri}$ is the ith object in cluster r.
Single linkage, also called nearest neighbor, uses the smallest distance between objects in the two clusters:
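$$d(r,s) = \min\bigl(\operatorname{dist}(x_{ri}, x_{sj})\bigr), \quad i \in (1,\ldots,n_r),\; j \in (1,\ldots,n_s)$$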
Complete linkage, also called furthest neighbor, uses the largest distance between objects in the two clusters:
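$$d(r,s) = \max\bigl(\operatorname{dist}(x_{ri}, x_{sj})\bigr), \quad i \in (1,\ldots,n_r),\; j \in (1,\ldots,n_s)$$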
Average linkage uses the average distance between all pairs of objects in any two clusters:
$$d(r,s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} \operatorname{dist}(x_{ri}, x_{sj})$$
Centroid linkage uses the Euclidean distance between the centroids of the two clusters:
$$d(r,s) = \lVert \bar{x}_r - \bar{x}_s \rVert_2$$
where
$$\bar{x}_r = \frac{1}{n_r} \sum_{i=1}^{n_r} x_{ri}$$
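Median linkage uses the Euclidean distance between weighted centroids of the two clusters:
$$d(r,s) = \lVert \tilde{x}_r - \tilde{x}_s \rVert_2$$
where $\tilde{x}_r$ and $\tilde{x}_s$ are weighted centroids for the clusters r and s. If cluster r was created by combining clusters p and q, $\tilde{x}_r$ is defined recursively as
$$\tilde{x}_r = \frac{1}{2}\left(\tilde{x}_p + \tilde{x}_q\right)$$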
Ward's linkage uses the incremental sum of squares; that is, the increase in the total within-cluster sum of squares as a result of joining two clusters. The within-cluster sum of squares is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster. The sum of squares measure is equivalent to the following distance measure d(r,s), which is the formula linkage uses:
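$$d(r,s) = \sqrt{\frac{2 n_r n_s}{n_r + n_s}}\; \lVert \bar{x}_r - \bar{x}_s \rVert_2$$
where
$\lVert \cdot \rVert_2$ is Euclidean distance
$\bar{x}_r$ and $\bar{x}_s$ are the centroids of clusters r and s
$n_r$ and $n_s$ are the number of elements in clusters r and s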
In some references the Ward linkage does not use the factor of 2 multiplying $n_r n_s$. The linkage function uses this factor so the distance between two singleton clusters is the same as the Euclidean distance.
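The singleton case can be checked directly. With $n_r = n_s = 1$ the scale factor is $\sqrt{2 \cdot 1 \cdot 1 / (1 + 1)} = 1$, so the Ward distance between two single observations reduces to their Euclidean distance. The sketch below uses two made-up points that are 5 apart.

```matlab
X = [0 0; 3 4];                  % two observations, Euclidean distance 5
Z = linkage(X,'ward');

wardHeight = Z(1,3);             % height of the only merge
euclid     = norm(X(1,:) - X(2,:));
[wardHeight euclid]              % the two entries should agree
```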
Weighted average linkage uses a recursive definition for the distance between two clusters. If cluster r was created by combining clusters p and q, the distance between r and another cluster s is defined as the average of the distance between p and s and the distance between q and s:
$$d(r,s) = \frac{d(p,s) + d(q,s)}{2}$$
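For a small data set the definitions can be verified by hand. In the sketch below (three made-up points), observations 1 and 2 merge first under every method shown, so the height of the second merge is the distance between the pair {1,2} and observation 3 under each rule. Because the merged clusters are singletons, weighted average and average linkage coincide in this case.

```matlab
X = [0 0; 0 1; 5 5];
d13 = norm(X(1,:) - X(3,:));   % sqrt(50)
d23 = norm(X(2,:) - X(3,:));   % sqrt(41)

Zs = linkage(X,'single');
Zc = linkage(X,'complete');
Za = linkage(X,'average');
Zw = linkage(X,'weighted');

% Height of the second merge under each method (left column) next to
% the value computed by hand from the definitions (right column).
compare = [Zs(2,3) min(d13,d23);
           Zc(2,3) max(d13,d23);
           Za(2,3) mean([d13 d23]);
           Zw(2,3) (d13 + d23)/2]
```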
Compute four clusters of the Fisher iris data using Ward linkage and ignoring species information, and see how the cluster assignments correspond to the three species.
load fisheriris
Z = linkage(meas,'ward','euclidean');
c = cluster(Z,'maxclust',4);
crosstab(c,species)
firstfive = Z(1:5,:) % first 5 rows of Z
dendrogram(Z)
ans =
0 25 1
0 24 14
0 1 35
50 0 0
firstfive =
102.0000 143.0000 0
8.0000 40.0000 0.1000
1.0000 18.0000 0.1000
10.0000 35.0000 0.1000
129.0000 133.0000 0.1000

Create a hierarchical cluster tree for a data set with 20,000 observations using Ward's linkage. If you set savememory to 'off', you can get an out-of-memory error if your machine does not have enough memory to hold the distance matrix. Cluster the data into four groups and plot the result.
X = rand(20000,3);
Z = linkage(X,'ward','euclidean','savememory','on');
c = cluster(Z,'maxclust',4);
scatter3(X(:,1),X(:,2),X(:,3),10,c)

cluster | clusterdata | cophenet | dendrogram | inconsistent | kmeans | pdist | silhouette | squareform
© 1984-2012 The MathWorks, Inc.