hierarchical agglomerative clustering: distance matrix

2 views (last 30 days)
Hello,
Please help me with the following:
In hierarcical clustering, initially every patern is considered a cluster (singleton clusters).
As the process continues, the most similar patterns are merged to form a new cluster.
This similarity is expressed through different methods such as "single", "complete" and others. These methods are used as input in :
Z = LINKAGE(X,METHOD)
Apart from entering the matrix "X" that contains the patterns, the LINKAGE function can take the
"Y" matrix that can be distance matrix as computed by PDIST, so we have:
Z = LINKAGE(Y,METHOD)
Instead of merging the most similar patterns as in "X" , or the most similar distances of the patterns as in "Y", is there a way to merge the two patterns that refer to the lowest pair of values of "Y".
For example, finding the two elements of "Y" with the lowest values between all elements of "Y" and merge them.
When merging two elements of "Y" that does not necessarely mean that refer to lowest values of the whole "Y", but to those values that are the most similar (they can have random values between the lowest and highest values of "Y").
My attempt will be like this:
Y=pdist(X);
Y_sorted=sort(Y);
Y_sorted_square=squareform(Y_sorted);
Z = LINKAGE(Y_sorted_square,"single");
but this does not lead to the result described above.
Thank you very much.
Best,
Pavlos
  3 Comments
pavlos
pavlos on 6 Aug 2014
Consider matrix X to be 100x10.
Each row contains a pattern to be clustered and each column refers to the element of the respective row-vector.
I am interested to cluster the 100 patterns into 4 clusters using hierarchical clustering:
Using the following code, the data are clustered with the "single" linkage:
Y=pdist(X,'euclidean');
Z=linkage(Dist,'single');
cluster_lab=cluster(Z,'maxclust',4);
Column vector "cluster_lab" contains the cluster labels of each of the 100 patterns of X.
I an interested in modifying the merging process by selecting in each merge step those two patterns who corresponds at each step to the lowest values of matrix "Y".
Using the above code, the patterns that are selected to be merged, are those who are more similar to each other, neglecting their corresponding values of "Y".
This means that there is a possibility that the two selected patterns may have for example the largest values at "Y" but are selected to be merged because these values are the closest (the patterns are the most similar ones).
My preference is to select those two patterns that refer to the lowest values of "Y" not only on the beginning of the process but also in each step of the hierarchical process.
That`s why I tried to sort the distance matrix in my previous post.
Tom Lane
Tom Lane on 8 Aug 2014
I don't find this to be the case. Can you provide an example where it happens? Here's a demonstration that for one set of data, the first two merges correspond to the pairs with lowest distance:
>> load hald
>> d = pdist(ingredients);
>> l = linkage(d,'single');
>> l(1:2,:)
ans =
12.0000 13.0000 2.4495
3.0000 6.0000 2.4495
>> s = squareform(d);
>> m = min(min(s+100*eye(13)));
>> [i,j] = find(s==m)
i =
6
3
13
12
j =
3
6
12
13
Eventually merges of two nodes will involve a node that is the result of a merge that has already taken place. Is that where you see something you do not expect?

Sign in to comment.

Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!