use a custom distance with the kmeans

Hello everyone.
I'm not very good with matlab so I ask you for help. For a university project I need to be able to group users who are furthest away from each other within a rectangular area. I am using kmeans and I have two possibilities: the first is to create a custom function, but I have read that I should use kMedoids; the second is to pass it the custom distance matrix.
At the moment I am following the second path but I do not understand how to do it. I am attaching the code of the one done so far.
N = 10
x=rand(N,1)*5
y=rand(N,1)*2.5
figure
scatter(x,y)
M = [x,y]
num_medoids = 1;
eucli_dis = pdist(M);
eucli_dis = squareform(eucli_dis);
inv_eucli_dis = 1./eucli_dis;
for ii = 1:10
text(M(ii,1),M(ii,2),num2str(ii));
end
gscatter(M(:,1),M(:,2))

6 Comments

kmedoids does accept a custom distance function.
idx = kmedoids(X , k, 'Distance', @mycustomDistfnc)
But when i made a custom “reverse euclidean” distance function it tells me there are not enough inputs. I would only pass him the matrix M. where is that wrong?
How are you coding the custom distance function?
I have a suspicion that your function wants three inputs, one of which is the number of clusters (k), but that you are not passing that in. See paramfun
so in practice I should pass the x coordinate, the y coordinate and finally the number of clusters? however in my calculations where I measure the inverse Euclidean distance I don't use k (number of clusters)
What is the code for your custom distance function?
What is your code for your call to kmedoids ?
the code for the custom distance function:
function inv_eucli_dis = customeucli (x,y)
M = [x,y]
eucli_dis = 1+sqrt(sum(bsxfun(@minus,M,reshape(M',1,size(M,2),size(M,1))).^2,2)); %calculation of the Euclidean distance
inv_eucli_dis = 1/eucli_dis; %inverse distance calculation
dist_mat = squeeze(inv_eucli_dis); %result representation in matrix form
code for kmedoids:
[idx,C,sumd,D] = Ckmedoids (M,num_medoids, 'distance','customeucli');

Sign in to comment.

 Accepted Answer

Here is code to randomly lay down points and draw thin black lines between a pair of points if they are far away from each other and draw a green line between endpoints of a pair if the points are close together:
N = 10
x=rand(N,1)*5
y=rand(N,1)*2.5
plot(x, y, 'b.', 'MarkerSize', 30);
grid on;
xy = [x(:), y(:)];
distances = pdist2(xy, xy)
% Zero out lower triangle because it's a repeat of the upper triangle
distances = triu(distances)
nonZeroIndexes = distances > 0;
medianDistance = median(distances(nonZeroIndexes))
thresholdValue = medianDistance/2; % Whatever you want.
% Find pairs that are far apart.
[rows, columns] = find(distances > thresholdValue);
hold on;
% Plot pairs that are far apart.
for k = 1 : length(rows)
index1 = columns(k);
index2 = rows(k);
xp = [x(index1), x(index2)];
yp = [y(index1), y(index2)];
plot(xp, yp, 'k-', 'LineWidth', 1, 'MarkerSize', 30);
end
% Find pairs that are close together.
[rows2, columns2] = find((distances > 0) & (distances <= thresholdValue));
hold on;
% Plot pairs that are close together.
for k = 1 : length(rows2)
index1 = columns2(k);
index2 = rows2(k);
xp = [x(index1), x(index2)];
yp = [y(index1), y(index2)];
plot(xp, yp, 'r-', 'LineWidth', 2, 'MarkerSize', 30);
end
title('Black lines are far away, red lines are close')

7 Comments

ok, it may be a solution to set a certain threshold value, but my problem is to group them into groups, that is the clusters.
obviously to group the farthest points I have to exploit the inverse Euclidean distance
You have not shown your data. With uniformly randomly distributed points, there are no groups or clusters really, or just a random number of clusters with random number of members like I showed above. In my posted example there are 5 clusters: 2 with 3 members, one with 2 members, and 2 clusters with only one member. Of course that would all be different with a different set of starting locations.
What does your actual data look like?
I think you need to look at my dbscan demo, which uses the built-in dbscan method, which is roughly related to what I did but more sophisticated. Unlike kmeans, etc. it clusters based on closeness and not on some predetermined number of clusters.
I have no starting data to analyze. I have to generate random positions of N users within a geographic area. Since the project is on beamforming, I have to make sure that two users close to each other are not reached by the same beam. That's why I have to group the users furthest away from each other.
And I've given you several ways to do that. For example pdist2() and thresholding will get two groups - those that are closer than some distance and those that are farther away than some distance. SVM will also do that. It will give you the best groups in aggregate that are farthest away though it's possible some from group1 are closer to some points in group2 that you'd prefer. Did you see my other answer below with diagrams from SVM and dbscan?
yes I've seen them. however, I wanted to know for sure if it could also be solved with kmeans by using a custom distance function or by passing it an inverse distance matrix. that's all
I wanted to know for sure if it could also be solved with kmeans by using a custom distance function or by passing it an inverse distance matrix.
No, it cannot be solve that way "for sure". Unless there are exactly two choices in two dimensions, then The Voting Paradox shows that there is no possible algorithm that can reliably generate optimal outcomes for all nodes.

Sign in to comment.

More Answers (2)

Not sure why you think there are clusters. I'd just use pdist2() and then threshold to find points that are farther apart than some distance. Something like
xy = [x(:), y(:)];
distances = pdist2(xy, xy);
thresholdValue = .3; % Whatever you want.
[rows, columns] = find(distances > thresholdValue);
Image Analyst
Image Analyst on 14 Jan 2022
Edited: Image Analyst on 14 Jan 2022
You might also consider SVM. It tries to find a dividing line between two groups such that the gap between the two groups is widest so the two groups are farthest apart. See
or use dbscan (demo attached):
It tries to find all points that can be connected with a distance less than what you specify:
A point that is found to be in a cluster with more than a certain number of close neighbors is called a "core point". It can also be part of the cluster if any points are within that distance. So for example, a dumbbell shape could have core points in the ends, connected points in the middle, and the whole thing being one single cluster. If it's an isolated point not closer to any other point than your specified distance, it's not a core point. (Like point N above.)
In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding these points in an ε radius contain at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly-reachable. You can consider it as essentially being in a clluster all by itself.

Categories

Find more on Statistics and Machine Learning Toolbox in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!