How to find the feature distribution in k-means clustering

I am trying to do k-means clustering on my data. The data consists of information on 56 students, with features such as the score in each subject and other metrics like a performance parameter. There are 39 features per student, so the data matrix is 56-by-39. I used k-means clustering to group the students into two clusters. I have attached the result of the clustering in the figure below, where the data is plotted along the principal components. I want to know how the features are distributed across these clusters. For example, score1 is high (above a certain value) in cluster 1 and low in cluster 2, while score2 is low in cluster 1 and high in cluster 2. Is there a way to find out how the features are distributed in these two clusters? I want to find the features that contribute to each k-means cluster.
I used the idx = kmeans(X,k) function in MATLAB.

Answers (1)

Image Analyst on 10 Feb 2022
Edited: Image Analyst on 10 Feb 2022
You can call pca() to get the loadings (coefficients) and the scores. Each column of the coefficient matrix corresponds to one PC, so the first column represents PC1, and the 39 values in that column are the weights of the 39 original features in PC1. You can also ask pca() for the percentage of the total variance explained by each PC, like PC1 explains 60% of the variation and PC2 explains 30%. Note that this percentage is per PC, not per original feature. To see how much an original feature matters (a score, or a performance metric like days of class missed or whatever), look at how heavily it is weighted in the PCs that explain most of the variance.
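As a rough sketch of that pca() call: the matrix X below is random stand-in data, not the student data from the question, and the variable names are only for illustration. zscore() and pca() need the Statistics and Machine Learning Toolbox, the same one kmeans() comes from.

```matlab
% Hypothetical stand-in for the 56-by-39 student matrix (random numbers,
% NOT the data from the question).
rng(1);                      % repeatable random example
X = randn(56, 39);

% Standardize first so features on large scales do not dominate the PCs.
Z = zscore(X);

% coeff     : 39-by-p loadings, one column per principal component
% score     : 56-by-p coordinates of each student along each PC
% explained : p-by-1 percentage of the total variance captured by each PC
[coeff, score, ~, ~, explained] = pca(Z);

% Cumulative percentage of variance covered by the leading PCs.
cumExplained = cumsum(explained);
fprintf('PC1 + PC2 cover %.1f%% of the variance\n', cumExplained(2));

% Weights of the 39 original features in PC1 (first column of coeff).
pc1Weights = coeff(:, 1);
```

The entries of explained always add up to 100, so cumExplained is a quick way to judge how many PCs are worth looking at.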
I'm not sure why you're doing kmeans on PCs in the first place. Seems weird to me. The PCs are uncorrelated with each other by construction, so plotting any one of them against another would just look like a random shotgun blast, kind of like yours does. There is only very weak correlation, as expected. So why do clustering on them? If anything you'd do kmeans on the original data, not the principal components.
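One possible way to get at the original question along those lines, sketched on made-up data: run kmeans() on the standardized original features and compare the two cluster centroids feature by feature. The shifted features, the 'Replicates' setting, and all variable names here are assumptions for the demo, not something taken from the poster's data.

```matlab
% Hypothetical data: 56 students, 39 features. For the first 28 students,
% features 1-3 are shifted up and features 4-6 are shifted down, so the
% clustering has some group structure to find.
rng(1);
X = randn(56, 39);
X(1:28, 1:3) = X(1:28, 1:3) + 4;
X(1:28, 4:6) = X(1:28, 4:6) - 4;

% Standardize so every feature is on a comparable scale.
Z = zscore(X);

% Cluster on the original (standardized) features, not on the PCs.
% C is k-by-39: one centroid row per cluster, in z-score units.
k = 2;
[idx, C] = kmeans(Z, k, 'Replicates', 10);

% Gap between the two centroids for each feature; a large positive value
% means the feature is higher in cluster 1, a large negative value means
% it is higher in cluster 2.
centroidGap = C(1, :) - C(2, :);
[~, order] = sort(abs(centroidGap), 'descend');

% Per-cluster means in the original units, for easier reading.
clusterMeans = zeros(k, size(X, 2));
for c = 1:k
    clusterMeans(c, :) = mean(X(idx == c, :), 1);
end

% Show the five features that separate the clusters the most.
top = order(1:5);
disp(table(top(:), clusterMeans(1, top)', clusterMeans(2, top)', ...
    centroidGap(top)', 'VariableNames', ...
    {'Feature', 'MeanCluster1', 'MeanCluster2', 'GapInZUnits'}));
```

Which cluster gets label 1 or 2 is arbitrary and can change between runs, so read the sign of the gap together with the cluster means. A boxplot or histogram of a single feature grouped by idx would show the full distribution rather than just the mean.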
Comments
Dhruvin Naik on 15 Feb 2022
I did the PCA on the two clusters separately and got the principal components for each cluster. Can you please tell me how I should compare the principal components from the two clusters and map them back to the original features, so that I can tell whether a given feature is more dominant in cluster one or cluster two?
Image Analyst on 15 Feb 2022
The coefficients (the first returned variable from pca()) give you that. They are the relative weights applied to the original variable values when each PC is computed from them.
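To make "relative weights" concrete, here is a small sketch on random stand-in data (again not the poster's data, and the variable names are only illustrative). It rebuilds the PC scores by hand from the centered feature values and the coefficients, then lists the features with the largest weight in PC1.

```matlab
% Hypothetical stand-in for the 56-by-39 student matrix.
rng(1);
X = randn(56, 39);

% mu is the 1-by-39 vector of feature means that pca() subtracts.
[coeff, score, ~, ~, ~, mu] = pca(X);

% Rebuild the scores by hand: centered data times the weights.
% (X - mu relies on implicit expansion, available in R2016b or later.)
rebuilt = (X - mu) * coeff;
fprintf('Largest mismatch: %.2g\n', max(abs(score(:) - rebuilt(:))));

% Features with the largest absolute weight in PC1.
[~, order] = sort(abs(coeff(:, 1)), 'descend');
top5 = order(1:5);
disp(table(top5, coeff(top5, 1), ...
    'VariableNames', {'Feature', 'PC1Weight'}));
```

The overall sign of a PC is arbitrary, so when comparing coefficients from two separate pca() runs (for example one per cluster), it is safer to compare magnitudes than signs.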
