how to find feature distribution in kmeans clustering

Question

Dhruvin Naik on 10 Feb 2022


Direct link to this question

https://www.mathworks.com/matlabcentral/answers/1647290-how-to-find-feature-distribution-in-kmeans-clustering

Commented: Image Analyst on 15 Feb 2022

I am trying to do k-means clustering on my data. The data holds information for each of 56 students: features such as scores for each subject and other metrics such as performance parameters. There are 39 features per student, so the data matrix is 56-by-39. I used k-means clustering to group the students into two clusters. The attached figure shows the result, with the data plotted along the principal components. I want to know how the features are distributed across these clusters. For example, score1 is high (above a certain value) in cluster 1 and low in cluster 2, while score2 is low in cluster 1 and high in cluster 2. Is there a way to find out how the features are distributed in the two clusters? I want to find the features that contribute to each k-means cluster.

I used the idx = kmeans(X,k) function in MATLAB.


Answer 1

Image Analyst on 10 Feb 2022


Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/1647290-how-to-find-feature-distribution-in-kmeans-clustering#answer_893435

Edited: Image Analyst on 10 Feb 2022

You can call pca() to get the loadings (coefficients) and the scores. Each column of the coefficient matrix represents one PC: the first column is PC1, and its 39 values are the weights of the 39 original features in that PC. You can also ask pca() for the percentage of the total variance explained by each principal component, for example PC1 explains 60% of the variation and PC2 explains 30%.
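As a rough illustration of those outputs, here is a minimal sketch. It assumes the Statistics and Machine Learning Toolbox is available and uses random numbers as a hypothetical stand-in for the 56-by-39 student matrix, so the printed values mean nothing by themselves.

```matlab
% Hypothetical stand-in for the 56-by-39 student data (random numbers).
rng(0);                          % fixed seed so the run is repeatable
X = randn(56, 39);

% pca centers the columns of X by default.
[coeff, score, latent, tsquared, explained] = pca(X);

% coeff(:, j)  : weights of the 39 original features in PC j (the loadings)
% score(i, j)  : coordinate of student i along PC j
% explained(j) : percentage of the total variance captured by PC j
disp(size(coeff));               % one row per original feature
disp(size(score));               % one row per student
disp(explained(1:3)');           % variance share of the first three PCs
disp(cumsum(explained(1:3))');   % cumulative share of the first three PCs
```

If the features are on very different scales (exam scores versus days missed, say), passing zscore(X) instead of X is a common choice, though whether it suits this data set is a judgment call.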

I'm not sure why you're doing kmeans on PCs in the first place. It seems odd to me. All the PCs are uncorrelated by construction, so plotting any one of them against another just looks like a random shotgun blast, much like your plot does. There is only very weak correlation, as expected. So why cluster on them? If anything, you'd run kmeans on the original data, not on the principal components.
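One possible way to see which features separate the clusters when kmeans is run on the original data is to compare the cluster centroids feature by feature. The sketch below is only an illustration under stated assumptions: the data are synthetic (random numbers with the first five features shifted for half of the rows), the features are standardized with zscore before clustering, and the variable names are made up for the example.

```matlab
rng(1);                                  % fixed seed so the run is repeatable

% Hypothetical data: 56 students, 39 features.
% The first 5 features are shifted upward for the first 28 students.
X = randn(56, 39);
X(1:28, 1:5) = X(1:28, 1:5) + 3;

Z = zscore(X);                           % put all features on a common scale
[idx, C] = kmeans(Z, 2, 'Replicates', 10);

% C(c, f) is the mean of standardized feature f within cluster c.
centroidGap = C(1, :) - C(2, :);         % positive: higher in cluster 1
[~, order] = sort(abs(centroidGap), 'descend');

% Features whose standardized means differ most between the two clusters.
topFeatures = order(1:5);
disp(topFeatures);
disp(centroidGap(topFeatures));

% The same comparison in the original units of each feature.
meanByCluster = [mean(X(idx == 1, :), 1); mean(X(idx == 2, :), 1)];
disp(meanByCluster(:, topFeatures));
```

A large positive entry in centroidGap suggests that the feature tends to be high in cluster 1 and low in cluster 2, and a large negative entry suggests the reverse. For a per-feature view of the full distribution rather than just the mean, something like boxplot(X(:, f), idx) for a chosen feature f would be a natural follow-up.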


Image Analyst on 11 Feb 2022

Think of PCs as being like a rotation of the axes. Let's say you had a dumbbell-shaped collection of points that was slanted at 45 degrees when you plotted feature2 value vs. feature1 value. PC1 would run at a 45 degree angle along the axis of the dumbbell. PC2 would run perpendicular to that. So now each point has a new coordinate in the PC2 vs. PC1 coordinate system. Each point was classified in feature space. The fact that it now has additional coordinates in a new, tilted coordinate system does not mean the points would be classified differently, so the distribution would be the same. Some points will be in one range of PC1 values (like the left ball of the dumbbell), and the other class's points will be in a different range of PC1 values (like the right ball of the dumbbell).

I guess I'm not sure what you mean when you say you want to "know how the features are distributed". You can colorize the classes and plot them in PC space if you want, so that the major axes of the scatterplot now run along the x, y, and z axes (PC1, PC2, and PC3), whereas before they might not have (they might have been slanted when plotted vs. feature1 value, feature2 value, and feature3 value).
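The dumbbell picture can be tried out numerically. This is a minimal sketch with made-up 2-D data (two blobs along a 45 degree line), assuming the Statistics and Machine Learning Toolbox for kmeans, pca, pdist, and gscatter. It checks that moving to PC coordinates leaves every pairwise distance unchanged, and it plots the same cluster labels in both coordinate systems.

```matlab
rng(2);                                  % fixed seed so the run is repeatable

% Hypothetical dumbbell: two blobs along a 45 degree line.
blobA = 0.5 * randn(100, 2) + [-3, -3];
blobB = 0.5 * randn(100, 2) + [ 3,  3];
X = [blobA; blobB];                      % columns: feature1, feature2

idx = kmeans(X, 2, 'Replicates', 5);     % cluster in the original feature space

[coeff, score] = pca(X);                 % center and rotate into PC coordinates

% The change of coordinates does not alter any pairwise distance.
maxDiff = max(abs(pdist(X) - pdist(score)));
fprintf('largest change in pairwise distance: %.2e\n', maxDiff);

% PC1 runs along the dumbbell axis, close to [1 1]/sqrt(2) up to sign.
disp(coeff(:, 1)');

% Same labels, drawn in both coordinate systems.
subplot(1, 2, 1);
gscatter(X(:, 1), X(:, 2), idx);
xlabel('feature1'); ylabel('feature2');
subplot(1, 2, 2);
gscatter(score(:, 1), score(:, 2), idx);
xlabel('PC1'); ylabel('PC2');
```

The distance check only holds when all PCs are kept. Dropping the later PCs is a projection, not a pure rotation, so distances (and possibly cluster assignments) can change.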

Dhruvin Naik on 15 Feb 2022

I ran PCA on each of the two clusters and got the principal components for both. Can you please tell me how I should compare the principal components from the two clusters and map them back to the original features, so that I can tell whether a given feature is more dominant in cluster one or in cluster two?

Image Analyst on 15 Feb 2022

The coefficients (the first output returned by pca()) give you that: they are the relative weights of the original variables that are used when building each PC from the original variable values.
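To make the coefficient matrix easier to read, the original features can be ranked by the size of their weight in a given PC. The following is a minimal sketch, not a definitive recipe: the data are random stand-ins, the feature names are hypothetical, and standardizing with zscore first is an assumption that may or may not suit the real data.

```matlab
rng(3);                                  % fixed seed so the run is repeatable
X = randn(56, 39);                       % hypothetical stand-in for the student data
featureNames = "feature" + (1:39);       % hypothetical names, 1-by-39 string array

[coeff, ~, ~, ~, explained] = pca(zscore(X));

% Rank the original features by the absolute value of their weight in PC1.
[~, order] = sort(abs(coeff(:, 1)), 'descend');
top = order(1:5);

T = table(featureNames(top)', coeff(top, 1), ...
    'VariableNames', {'Feature', 'WeightInPC1'});
disp(T);
fprintf('PC1 explains %.1f%% of the variance\n', explained(1));
```

The weights describe how each PC is built from the original features. They do not say by themselves which cluster has the higher values of a feature, so they are best read alongside the per-cluster feature values.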
