Agglomerative hierarchical cluster tree
Z = linkage(X)
Z = linkage(X,method)
Z = linkage(X,method,metric)
Z = linkage(X,method,pdist_inputs)
Z = linkage(X,method,metric,'savememory',value)
Z = linkage(Y)
Z = linkage(Y,method)
Z = linkage(X) returns a matrix Z that encodes a tree of hierarchical clusters of the rows of the real matrix X.
X: Matrix with two or more rows. The rows represent observations; the columns represent categories or dimensions.
method: Algorithm for computing the distance between clusters.
metric: Any distance metric that the pdist function accepts.
pdist_inputs: A cell array of parameters accepted by the pdist function.
value: A string, either 'on' or 'off', for the 'savememory' option.
Y: A vector of distances with the same format as the output of the pdist function.
For example, suppose there are 30 initial nodes, and at step
12 cluster 5 and cluster 7 are combined. Suppose their distance at
that time is 1.5. Then Z(12,:) is [5 7 1.5]. The newly formed cluster has index 12 + 30 = 42. If cluster 42 appears in a later row, that row describes the step at which this cluster is combined into a still larger cluster.
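The following sketch (hypothetical data, not part of the original example) shows how to read the rows of Z:

X = [1 2; 1.1 2; 5 6; 5 6.5];   % 4 hypothetical observations
Z = linkage(X)                   % 3-by-3 matrix; default method 'single', metric 'euclidean'
% Row k of Z merges the clusters with indices Z(k,1) and Z(k,2) at
% height Z(k,3); the new cluster is assigned index 4 + k, because the
% original observations are numbered 1 through 4.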
Load the sample data.
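The Fisher iris measurements are available in the fisheriris data set:

load fisheriris % loads the meas matrix and the species labels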
Compute four clusters of the Fisher iris data using Ward linkage and ignoring species information.
Z = linkage(meas,'ward','euclidean');
c = cluster(Z,'maxclust',4);
See how the cluster assignments correspond to the three species.
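For example, use crosstab to tabulate the cluster assignments against the species labels:

crosstab(c,species)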
ans =

     0    25     1
     0    24    14
     0     1    35
    50     0     0
Display the first five rows of Z.
firstfive = Z(1:5,:)
firstfive =

  102.0000  143.0000         0
    8.0000   40.0000    0.1000
    1.0000   18.0000    0.1000
   10.0000   35.0000    0.1000
  129.0000  133.0000    0.1000
Create a dendrogram plot of Z.
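dendrogram(Z)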
Randomly generate the sample data with 20000 observations.
rng default % For reproducibility
X = rand(20000,3);
Create a hierarchical cluster tree using Ward's linkage.
Z = linkage(X,'ward','euclidean','savememory','on');
If you set 'savememory' to 'off', you can get an out-of-memory error if your machine doesn't have enough memory to hold the distance matrix.
Cluster data into four groups and plot the result.
c = cluster(Z,'maxclust',4);
scatter3(X(:,1),X(:,2),X(:,3),10,c)
The following notation is used to describe the linkages used by the various methods:
Cluster r is formed from clusters p and q.
n_r is the number of objects in cluster r.
x_ri is the ith object in cluster r.
Single linkage, also called nearest neighbor, uses the smallest distance between objects in the two clusters:

d(r,s) = \min\bigl(\operatorname{dist}(x_{ri}, x_{sj})\bigr), \quad i = 1, \ldots, n_r, \; j = 1, \ldots, n_s
Complete linkage, also called furthest neighbor, uses the largest distance between objects in the two clusters:

d(r,s) = \max\bigl(\operatorname{dist}(x_{ri}, x_{sj})\bigr), \quad i = 1, \ldots, n_r, \; j = 1, \ldots, n_s
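As a sketch (hypothetical data), the single and complete linkage distances between two fixed groups of points can be computed directly from their pairwise distances:

A = rand(5,2);            % hypothetical cluster r
B = rand(7,2) + 3;        % hypothetical cluster s
D = pdist2(A,B);          % all pairwise distances between the two groups
dSingle   = min(D(:))     % single linkage: nearest pair
dComplete = max(D(:))     % complete linkage: farthest pair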
Median linkage uses the Euclidean distance between weighted centroids of the two clusters:

d(r,s) = \| \tilde{x}_r - \tilde{x}_s \|_2,

where \tilde{x}_r and \tilde{x}_s are the weighted centroids for clusters r and s. If cluster r was created by combining clusters p and q, \tilde{x}_r is defined recursively as

\tilde{x}_r = \frac{1}{2} \bigl( \tilde{x}_p + \tilde{x}_q \bigr).
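A minimal sketch of this recursion, with hypothetical centroids:

xp = [0 0];               % weighted centroid of cluster p
xq = [4 2];               % weighted centroid of cluster q
xr = (xp + xq)/2;         % weighted centroid of the merged cluster r
xs = [10 10];             % weighted centroid of another cluster s
d_rs = norm(xr - xs)      % median linkage distance d(r,s)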
Ward's linkage uses the incremental sum of squares; that is, the increase in the total within-cluster sum of squares as a result of joining two clusters. The within-cluster sum of squares is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster. The sum of squares measure is equivalent to the following distance measure d(r,s), which is the formula linkage uses:

d(r,s) = \sqrt{ \frac{2 n_r n_s}{n_r + n_s} } \, \| \bar{x}_r - \bar{x}_s \|_2,

where
\| \cdot \|_2 is the Euclidean distance,
\bar{x}_r and \bar{x}_s are the centroids of clusters r and s, and
n_r and n_s are the number of elements in clusters r and s.

In some references the Ward linkage does not use the factor of 2 multiplying n_r n_s. The linkage function uses this factor so that the distance between two singleton clusters is the same as the Euclidean distance.
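For example, with the factor of 2 the Ward distance between two singleton clusters reduces to their Euclidean distance (hypothetical points):

x = [0 0];  y = [3 4];
nr = 1;  ns = 1;
dWard = sqrt(2*nr*ns/(nr + ns)) * norm(x - y)   % returns 5, equal to norm(x - y)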
Weighted average linkage uses a recursive definition for the distance between two clusters. If cluster r was created by combining clusters p and q, the distance between r and another cluster s is defined as the average of the distance between p and s and the distance between q and s:

d(r,s) = \frac{ d(p,s) + d(q,s) }{2}
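A minimal sketch of the recursion, with hypothetical distances:

d_ps = 2.0;               % distance between clusters p and s
d_qs = 4.0;               % distance between clusters q and s
d_rs = (d_ps + d_qs)/2    % weighted average linkage distance d(r,s)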
Computing linkage(Y) can be slow when Y is a vector representation of the distance matrix. For the 'centroid', 'median', and 'ward' methods, linkage checks whether Y contains Euclidean distances. Avoid this time-consuming check by passing in X instead of Y.
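For example (hypothetical data):

X = rand(1000,3);
Z1 = linkage(X,'ward');   % works directly on X; no check of Y needed
Y = pdist(X);
Z2 = linkage(Y,'ward');   % slower: linkage verifies that Y contains Euclidean distances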
The 'centroid' and 'median' methods can produce a cluster tree that is not monotonic. This occurs when
the distance from the union of two clusters, r and s,
to a third cluster is less than the distance between r and s.
In this case, in a dendrogram drawn with the default orientation,
the path from a leaf to the root node takes some downward steps.
To avoid this, use another method. The following image shows a nonmonotonic cluster tree.
In this case, cluster 1 and cluster 3 are joined into a new cluster, while the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3. This leads to a nonmonotonic tree.
You can provide the output Z to other functions, including:
dendrogram to display the tree,
cluster to assign points to clusters,
inconsistent to compute inconsistent measures, and
cophenet to compute the cophenetic correlation coefficient.
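For example, a short sketch that passes Z to each of these functions (hypothetical data):

X = rand(30,2);
Y = pdist(X);
Z = linkage(Y,'average');
dendrogram(Z)                  % display the tree
T = cluster(Z,'maxclust',3);   % assign points to clusters
I = inconsistent(Z);           % inconsistent measures
c = cophenet(Z,Y);             % cophenetic correlation coefficient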