# Documentation

### This is machine translation

Translated by
Mouseover text to see original. Click the button below to return to the English verison of the page.

Agglomerative hierarchical cluster tree

## Syntax

```Z = linkage(X) Z = linkage(X,method) Z = linkage(X,method,metric) Z = linkage(X,method,pdist_inputs) Z = linkage(X,method,metric,'savememory',value) Z = linkage(Y) Z = linkage(Y,method) ```

## Description

`Z = linkage(X)` returns a matrix `Z` that encodes a tree of hierarchical clusters of the rows of the real matrix `X`.

`Z = linkage(X,method)` creates the tree using the specified `method`, where `method` describes how to measure the distance between clusters.

`Z = linkage(X,method,metric)` performs clustering using the distance measure `metric` to compute distances between the rows of `X`.

`Z = linkage(X,method,pdist_inputs)` passes parameters to the `pdist` function, which is the function that computes the distance between rows of `X`.

`Z = linkage(X,method,metric,'savememory',value)` uses a memory-saving algorithm when `value` is `'true'`, and uses the standard algorithm when `value` is `'false'`.

`Z = linkage(Y)` uses a vector representation `Y` of a distance matrix. `Y` can be a distance matrix as computed by `pdist`, or a more general dissimilarity matrix conforming to the output format of `pdist`.

`Z = linkage(Y,method)` creates the tree using the specified `method`, where `method` describes how to measure the distance between clusters.

## Input Arguments

`X`

Matrix with two or more rows. The rows represent observations, the columns represent categories or dimensions.

`method`

Algorithm for computing distance between clusters.

MethodDescription
`'average'`

Unweighted average distance (UPGMA)

`'centroid'`

Centroid distance (UPGMC), appropriate for Euclidean distances only

`'complete'`

Furthest distance

`'median'`

Weighted center of mass distance (WPGMC), appropriate for Euclidean distances only

`'single'`

Shortest distance

`'ward'`

Inner squared distance (minimum variance algorithm), appropriate for Euclidean distances only

`'weighted'`

Weighted average distance (WPGMA)

Default: `'single'`

`metric`

Any distance metric that the `pdist` function accepts.

Use `pdist_inputs` instead of `metric` to specify an additional input argument, `DistParameter` of `pdist`, for `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

ValueDescription
`'euclidean'`

Euclidean distance (default).

`'squaredeuclidean'`

Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)

`'seuclidean'`

Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, ```S = nanstd(X)```. Use `DistParameter` to specify another value for `S`.

`'mahalanobis'`

Mahalanobis distance using the sample covariance of `X`, `C = nancov(X)`. Use `DistParameter` to specify another value for `C`, where the matrix `C` is symmetric and positive definite.

`'cityblock'`

City block distance.

`'minkowski'`

Minkowski distance. The default exponent is 2. Use `DistParameter` to specify a different exponent `P`, where `P` is a positive scalar value of the exponent.

`'chebychev'`

Chebychev distance (maximum coordinate difference).

`'cosine'`

One minus the cosine of the included angle between points (treated as vectors).

`'correlation'`

One minus the sample correlation between points (treated as sequences of values).

`'hamming'`

Hamming distance, which is the percentage of coordinates that differ.

`'jaccard'`

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

`'spearman'`

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

`@distfun`

Custom distance function handle. A distance function has the form

```function D2 = DISTFUN(ZI,ZJ) % calculation of distance ...```
where

• `ZI` is a `1`-by-`n` vector containing a single observation.

• `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `XJ` with an arbitrary number of observations.

• `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`.

If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.

Default: `'euclidean'`

`pdist_inputs`

Distance metric and distance metric option, specified as a cell array of the comma-separated pair consisting of the two input arguments `Distance` and `DistParameter` of the function `pdist`. For example, to use `'minkowski'` with an exponent of 5, set `pdist_inputs` to `{'minkowski',5}`. This argument is valid only when you use `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

`savememory`

Either `'on'` or `'off'`. When applicable, the `'on'` setting causes `linkage` to construct clusters without computing the distance matrix. `savememory` is applicable when:

• `linkage` is `'centroid'`, `'median'`, or `'ward'`

• `distance` is `'euclidean'` (default)

When `savememory` is `'on'`, `linkage` run time is proportional to the number of dimensions (number of columns of `X`). When `savememory` is `'off'`, `linkage` memory requirement is proportional to `N2`, where `N` is the number of observations. So choosing the best (least-time) setting for `savememory` depends on the problem dimensions, number of observations, and available memory. The default `savememory` setting is a rough approximation of an optimal setting.

Default: `'on'` when `X` has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix; otherwise `'off'`

`Y`

A vector of distances with the same format as the output of the `pdist` function:

• A row vector of length m(m–1)/2, corresponding to pairs of observations in a matrix `X` with m rows

• Distances arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1))

`Y` can be a more general dissimilarity matrix conforming to the output format of `pdist`.

## Output Arguments

 `Z` `Z` is a (`m` – 1)-by-3 matrix, where `m` is the number of observations in the original data. Columns 1 and 2 of `Z` contain cluster indices linked in pairs to form a binary tree. The leaf nodes are numbered from 1 to `m`. Leaf nodes are the singleton clusters from which all higher clusters are built. Each newly-formed cluster, corresponding to row `Z(I,:)`, is assigned the index `m`+`I`. `Z(I,1:2)` contains the indices of the two component clusters that form cluster `m`+`I`. There are `m`-1 higher clusters which correspond to the interior nodes of the clustering tree. `Z(I,3)` contains the linkage distances between the two clusters merged in row `Z(I,:)`. For example, suppose there are 30 initial nodes and at step 12 cluster 5 and cluster 7 are combined. Suppose their distance at that time is 1.5. Then `Z(12,:)` will be ```[5, 7, 1.5]```. The newly formed cluster will have index 12 + 30 = 42. If cluster 42 appears in a later row, it means the cluster created at step 12 is being combined into some larger cluster.

## Examples

collapse all

```load fisheriris ```

Compute four clusters of the Fisher iris data using Ward linkage and ignoring species information.

```Z = linkage(meas,'ward','euclidean'); c = cluster(Z,'maxclust',4); ```

See how the cluster assignments correspond to the three species.

```crosstab(c,species) ```
```ans = 0 25 1 0 24 14 0 1 35 50 0 0 ```

Display the first five rows of Z.

```firstfive = Z(1:5,:) ```
```firstfive = 102.0000 143.0000 0 8.0000 40.0000 0.1000 1.0000 18.0000 0.1000 10.0000 35.0000 0.1000 129.0000 133.0000 0.1000 ```

Create a dendrogram plot of `Z` .

```dendrogram(Z) ```

Randomly generate the sample data with 20000 observations.

```rng default; % For reproducibility X = rand(20000,3); ```

Create a hierarchical cluster tree using Ward's linkage.

```Z = linkage(X,'ward','euclidean','savememory','on'); ```

If you set `savememory` to `'off'` , you can get an out-of-memory error if your machine doesn't have enough memory to hold the distance matrix.

Cluster data into four groups and plot the result.

```c = cluster(Z,'maxclust',4); scatter3(X(:,1),X(:,2),X(:,3),10,c) ```

collapse all

The following notation is used to describe the linkages used by the various methods:

• Cluster `r` is formed from clusters `p` and `q`.

• `n`r is the number of objects in cluster r.

• `x`ri is the `i`th object in cluster `r`.

• Single linkage, also called nearest neighbor, uses the smallest distance between objects in the two clusters:

`$d\left(r,s\right)=\mathrm{min}\left(dist\left({x}_{ri},{x}_{sj}\right)\right),i\in \left(i,...,{n}_{r}\right),j\in \left(1,...,{n}_{s}\right)$`
• Complete linkage, also called furthest neighbor, uses the largest distance between objects in the two clusters:

`$d\left(r,s\right)=\mathrm{max}\left(dist\left({x}_{ri},{x}_{sj}\right)\right),i\in \left(1,...,{n}_{r}\right),j\in \left(1,...,{n}_{s}\right)$`
• Average linkage uses the average distance between all pairs of objects in any two clusters:

`$d\left(r,s\right)=\frac{1}{{n}_{r}{n}_{s}}\sum _{i=1}^{{n}_{r}}\sum _{j=1}^{{n}_{s}}dist\left({x}_{ri},{x}_{sj}\right)$`
• Centroid linkage uses the Euclidean distance between the centroids of the two clusters:

`$d\left(r,s\right)={‖{\overline{x}}_{r}-{\overline{x}}_{s}‖}_{2}$`

where

`${\overline{x}}_{r}=\frac{1}{{n}_{r}}\sum _{i=1}^{{n}_{r}}{x}_{ri}$`
• Median linkage uses the Euclidean distance between weighted centroids of the two clusters,

`$d\left(r,s\right)={‖{\stackrel{˜}{x}}_{r}-{\stackrel{˜}{x}}_{s}‖}_{2}$`

where ${\stackrel{˜}{x}}_{r}$ and ${\stackrel{˜}{x}}_{s}$ are weighted centroids for the clusters r and s. If cluster r was created by combining clusters p and q, ${\stackrel{˜}{x}}_{r}$ is defined recursively as

`${\stackrel{˜}{x}}_{r}=\frac{1}{2}\left({\stackrel{˜}{x}}_{p}+{\stackrel{˜}{x}}_{q}\right)$`
• Ward's linkage uses the incremental sum of squares; that is, the increase in the total within-cluster sum of squares as a result of joining two clusters. The within-cluster sum of squares is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster. The sum of squares measure is equivalent to the following distance measure d(r,s), which is the formula `linkage` uses:

`$d\left(r,s\right)=\sqrt{\frac{2{n}_{r}{n}_{s}}{\left({n}_{r}+{n}_{s}\right)}}{‖{\overline{x}}_{r}-{\overline{x}}_{s}‖}_{2},$`

where

• ${‖‖}_{2}$ is Euclidean distance

• ${\overline{x}}_{r}$ and ${\overline{x}}_{s}$ are the centroids of clusters r and s

• nr and ns are the number of elements in clusters r and s

In some references the Ward linkage does not use the factor of 2 multiplying nrns. The `linkage` function uses this factor so the distance between two singleton clusters is the same as the Euclidean distance.

• Weighted average linkage uses a recursive definition for the distance between two clusters. If cluster r was created by combining clusters p and q, the distance between r and another cluster s is defined as the average of the distance between p and s and the distance between q and s:

`$d\left(r,s\right)=\frac{\left(d\left(p,s\right)+d\left(q,s\right)\right)}{2}$`

## Tips

• Computing `linkage(Y)` can be slow when `Y` is a vector representation of the distance matrix. For the `'centroid'`, `'median'`, and `'ward'` methods, `linkage` checks whether `Y` is a Euclidean distance. Avoid this time-consuming check by passing in `X` instead of `Y`.

• The `centroid` and `median` methods can produce a cluster tree that is not monotonic. This occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps. To avoid this, use another method. The following image shows a nonmonotonic cluster tree.

In this case, cluster 1 and cluster 3 are joined into a new cluster, while the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3. This leads to a nonmonotonic tree.

• You can provide the output `Z` to other functions including `dendrogram` to display the tree, `cluster` to assign points to clusters, `inconsistent` to compute inconsistent measures, and `cophenet` to compute the cophenetic correlation coefficient.