kmeans clustering of matrices

Hi All,
I have a 12-by-190 cell array. Each cell contains a complex matrix of size n-by-550 (each row is an observation of 550 variables; the number of observations varies from cell to cell, but the variables are the same for every matrix). I need to classify these matrices using kmeans, and I am trying to cluster the whole data set (i.e., 12*190*n*550) rather than working with each matrix separately.
Any idea how I can do that? Is there any method better than kmeans for clustering these data? Any input would be appreciated.

11 Comments

I think you'll need to give us a lot more context, for us to be able to help. K-means clustering is typically (always?) used on a number of observations ("n"), each of which has a number of features/variables ("p").
So, I would understand how to apply K-means to one of your n-by-550 arrays, because you have n observations and p=550 features. (I guess I would handle the complex components as two separate features, so maybe n-by-1100?)
But I have no idea what to do with the 12-by-190 cell. What do those 12 and 190 represent? Do you just want to make one large (but 2-dimensional) array, by concatenating each individual matrix? ...
[M{1,1}; M{1,2}; ...
such that you have a (sum of all the individual n values from the 12-by-190 smaller matrices)-by-550 matrix, with lots and lots of observations, but still 550 features?
You seem to want to cluster matrices, not observations ... but I don't really know what that means.
Please give us more context and detail.
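For what it's worth, the concatenation described above, together with the idea of treating the real and imaginary parts as separate features, could be sketched roughly as follows. This is only a sketch on made-up data: the cell array `M`, its 2-by-3 size, the 4 columns, and the names `Xreal`, `nPerCell`, and `cellIdx` are all assumptions for illustration, not anything from the original data.

```matlab
% Made-up stand-in for the 12-by-190 cell array: a 2-by-3 cell array of
% complex matrices with 4 columns and a varying number of rows.
M = cell(2, 3);
for k = 1:numel(M)
    n = k + 2;                                   % row count differs per cell
    M{k} = complex(k * ones(n, 4), -k * ones(n, 4));
end

% Stack all matrices vertically: (sum of all n)-by-p, still complex.
X = vertcat(M{:});

% Treat real and imaginary parts as separate real features: (sum of n)-by-2p.
Xreal = [real(X), imag(X)];

% Remember which cell each row came from (linear index into M), so that
% per-row results can be mapped back to the matrices later.
nPerCell = cellfun(@(A) size(A, 1), M(:));
cellIdx  = repelem((1:numel(M)).', nPerCell);
```

`Xreal` is then an ordinary real observations-by-features array, and `cellIdx` keeps the link between each row and its source matrix.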
Susan on 4 Jun 2021
Edited: Susan on 4 Jun 2021
Thank you so much for your response.
You're right. "I just want to make one large (but 2-dimensional) array, by concatenating each individual matrix such that I have a (sum of all the individual n values from the 12-by-190 smaller matrices)-by-550 matrix, with lots and lots of observations, but still 550 features".
As you mentioned, I want to cluster matrices, not observations. For each matrix in the 12-by-190 = 2280 cell array I have one label from 0 to 10. (In most examples that I've seen so far, a label is assigned to either a scalar or a vector, but here a label is assigned to a matrix.) Each of these cells is the output of one experiment, and we obtained a bunch of them by changing some parameters, so I think we can consider them, in some sense, as observations. So I have 2280 cells, each containing an n-by-p matrix, and a 1-by-2280 vector that contains the labels.
My aim is to see whether the matrices with the same label can be clustered together or not.
And later, when I have an unseen input matrix (n-by-550), I want to find which cluster this matrix belongs to and somehow predict the label.
Moreover, I'm interested in figuring out which of these p features are more impactful and which ones I can get rid of.
Please let me know if you need more details to be able to help. Thanks!
Wish you had mentioned the labels earlier. :-)
OK, so each matrix is the result of an experiment. And each experiment results in n measurements of 550 features. (The value of n can vary for each experiment.) Each experiment also results in a label.
Then, given a new matrix (with unknown label), you want to assign the correct label.
The major stumbling block (at least in my mind) here is that your measured variables are features of the observations, not of the matrices. If you want to predict the label of an unseen matrix, you need features of the matrices. Presumably you can build features of the matrices from the features of the observations, but I'm not sure how that would work. (Specifically, I don't see how k-means helps.)
I think I would try to simplify this, to really sort out the specifics of how to do this. For example:
  • imagine you have the same n for all matrices (and imagine it is small, like 5)
  • instead of 550 features, suppose you only have 3
  • instead of 12x190 matrices, just fix that number to something like 10
  • instead of 11 labels, maybe just 2 or 3
Then really think through what you really mean by "some matrices are more similar to each other, and therefore should have the same label". That thinking might help you see the proper mathematical method for getting there.
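To make the simplified thought experiment concrete, here is one possible sketch. It is an assumption on my part, not an established recipe for this problem: each matrix is summarized by its column-wise mean and standard deviation, which gives one fixed-length feature vector per matrix, and those vectors are then clustered with kmeans and compared against the known labels. All data, sizes, and variable names below are invented, and kmeans and crosstab require the Statistics and Machine Learning Toolbox.

```matlab
% Toy setup following the simplification above: 10 matrices, n = 5 rows,
% 3 real-valued features, 2 labels. All data here is made up.
rng(1);                                   % repeatable random numbers
numMat = 10; n = 5; p = 3;
labels = [zeros(1, 5), ones(1, 5)];       % one made-up label per matrix
M = cell(1, numMat);
for k = 1:numMat
    M{k} = randn(n, p) + 5 * labels(k);   % label 1 matrices are shifted by 5
end

% One feature vector per matrix: column means followed by column std devs.
F = zeros(numMat, 2 * p);
for k = 1:numMat
    F(k, :) = [mean(M{k}, 1), std(M{k}, 0, 1)];
end

% Cluster the matrices (not the individual observations).
idx = kmeans(F, 2, 'Replicates', 5);

% Rows of tbl are labels, columns are clusters; a clean split shows up as
% one nonzero entry per row.
tbl = crosstab(labels(:), idx);
disp(tbl)
```

Whether mean and standard deviation are meaningful summaries depends entirely on what "similar matrices" should mean for the real data, which is exactly the question the toy version is meant to force.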
Susan on 4 Jun 2021
Edited: Susan on 4 Jun 2021
Thank you so much for your detailed response and for helping me think this through.
I need your thoughts on something else, please.
Each matrix is the result of an experiment, and each experiment also results in a label. Each matrix has n observations through time (n rows) and p features (p columns).
If I assume that the label of each matrix can be assigned to each column of that matrix, then for a specific experiment I'll have an n-by-550 matrix as input data and a 550-by-1 vector that contains the labels.
Given a new column (n-by-1, which changes over time) with an unknown label, how can I assign the correct label?
Many thanks in advance
the cyclist on 4 Jun 2021
Edited: the cyclist on 4 Jun 2021
Sorry, but this makes no sense to me.
Suppose your experiment is on humans, and instead of 550 features, you have just 3 features: Height, Hair color, and Eye color. You do the experiment with this person, and assign label 4.
You are saying that "4" is assigned to height, hair color, and eye color.
Now, you add a new feature: Body temperature. And you want a different label for some reason?
I don't follow your logic.
And I'm still hung up on the data being complex numbers: "Each cell contains a complex matrix". So is each of the n-by-550 numbers complex? What do the real parts represent? What do the imaginary parts represent? I don't know that I've heard of kmeans being applied to complex numbers, though maybe it could.
Aside from that, why did you choose kmeans as your classification algorithm? Did you try the Classification Learner app and learn from it that kmeans was the most accurate? If so, it exports the code for you.
Susan on 5 Jun 2021
Edited: Susan on 5 Jun 2021
@the cyclist Thanks again for your response. Yup, I agree with you that in your example, labeling each feature separately doesn't make any sense. In my data set, however, each column (of an n-by-550 matrix) represents the environment. By changing some parameters, I get a new representation of the environment in each column. All of the columns in a matrix follow the same pattern (when I plot each column vs. time, since each column is time-varying, like a sin() function), but with different amplitudes. Maybe calling each column a feature is not correct, but I can't think of anything else, since each column is based on a unique setup and gives me a new representation. Please let me know if that makes sense.
Susan on 5 Jun 2021
Edited: Susan on 5 Jun 2021
@Image Analyst Thanks for your response. Yes, each of the n-by-550 numbers is complex. The real and imaginary parts don't represent anything specific separately, but the complex number represents the signal. I don't think that kmeans can be applied to complex numbers either. As @the cyclist suggested, I am going to consider an n-by-1100 real matrix and then apply kmeans.
The reason I'd like to apply kmeans is that I'm wondering whether the matrices that have the same label will be grouped into the same cluster. Unfortunately, I have had difficulty using the learner app, and I've sought help but with no success so far. The issue I keep encountering is that the app doesn't select the right input and label from the workspace, and I cannot fix it. Any help with this problem and the learner app would be greatly appreciated.
kmeans() does reject complex matrices.
OK, so you're just going to consider the real part of the complex numbers. So, how many clusters do you believe there to be? What did you put in for k (if you put in anything)? Do you think there are 3 clusters? 6? 100? Or no idea?
@Image Analyst There would be 19 clusters.


 Accepted Answer

Walter Roberson on 5 Jun 2021
k-means is not the right technology for situations in which you have labels, except when the labels have numeric values that can be made commensurate with the numeric coordinates. For example, if you can say that having a label differ by no more than 1 is 10.28 times as important as having column 3 differ by 1, then you might be able to use k-means by adding the numeric value of the label as an additional coordinate. But this is not the usual case.
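As a rough illustration of the "label as an additional coordinate" idea, the following sketch appends a scaled numeric label to made-up data before calling kmeans. The data, the two-class labels, and the way the weight `w` is applied are assumptions for illustration only; how the factor 10.28 should translate into a scale is a modeling choice in its own right, because with the default squared Euclidean distance a unit difference in the label contributes `w^2` to the distance. kmeans requires the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                   % repeatable random numbers
X = [randn(20, 3); randn(20, 3) + 1];     % made-up observations, 3 coordinates
labels = [zeros(20, 1); ones(20, 1)];     % made-up numeric labels

w = 10.28;                                % hypothetical weight for the label
Xaug = [X, w * labels];                   % label becomes a fourth coordinate

% With a weight this large the label coordinate dominates the distance,
% so the clusters are expected to follow the labels closely.
idx = kmeans(Xaug, 2, 'Replicates', 5);
```

Note that this only works for data whose labels are already known, so it does not by itself help with assigning a label to an unseen matrix.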
When you have matrices of numbers and a label associated with each matrix, Deep Learning or (Shallow) Neural Network techniques are more appropriate. Consider that if you have a matrix of data and a label, and the matrices are all the same size, then that situation could be treated the same way as if the matrix of data were an "image".

5 Comments

Thanks for your response, @Walter Roberson.
In the problem that I deal with, given one column of a matrix, I'd like to estimate the label.
Each label is assigned to a matrix; however, all the columns of a matrix represent the same thing, here an office environment. All columns follow the same pattern but have different amplitudes.
What is the best approach for this problem?
Many thanks in advance
Split the matrices into columns (you can use num2cell) and assign the same label to each column. Then you have reduced the situation to having a lot of different columns with associated labels, having discarded the "matrix".
Or equivalently, concatenate all of the matrices together into one giant matrix, replicating each of the labels according to the number of columns in the matrix, so that you end up with a large matrix and a corresponding array of labels, with one label for each column.
If the relationship between rows inside a matrix can be discarded, and what matters is one column at a time and the associated label, then you should remove the layer that breaks it all up into separate matrices, as that is holding you back from making progress.
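A minimal sketch of the two equivalent approaches just described, on made-up data. It assumes every matrix has the same number of rows, which horizontal concatenation requires; the cell array `M`, the label values, and the names `X`, `colLabels`, and `cols` are hypothetical.

```matlab
% Made-up data: 3 matrices that all have n = 4 rows (an assumption; the
% horizontal concatenation below requires equal row counts).
M = {rand(4, 2), rand(4, 3), rand(4, 5)};
labels = [0, 7, 3];                        % one made-up label per matrix

numCols   = cellfun(@(A) size(A, 2), M);   % columns contributed by each matrix
X         = horzcat(M{:});                 % 4-by-10: every column is one sample
colLabels = repelem(labels, numCols);      % 1-by-10: one label per column

% Equivalent view as a cell array of individual columns.
cols = num2cell(X, 1);                     % 1-by-10 cell, each cell is 4-by-1
```

Many fitting functions expect observations in rows rather than columns, so `X.'` and `colLabels.'` may be the forms to pass on. If the matrices have different numbers of rows, the columns would first need to be brought to a common length, which is outside the scope of this sketch.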
@Walter Roberson Thanks for your response. Appreciate it.
As you suggested, I split the matrices into columns and assigned the same label to each column. Then I randomly selected 30% of the columns and their associated labels and trained the network.
Now, the issue is that whenever I run the code, I get a different accuracy on the test data. I think one reason is that the training set is selected randomly. Assuming I have 100 columns in total, every time I select 30 columns randomly for training.
Any idea how I can solve this issue? Does k-fold cross validation help here?
Many thanks!
Yes! This is expected, and is a fundamental challenge of this kind of learning: to determine the best subset of data to train on for the highest accuracy and lowest over-training.
k-fold cross validation is indeed one of the techniques that is used. It will reduce the variation you see, but do expect that there will still be some variation depending on the random choice.
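For reference, a k-fold cross-validation loop might look roughly like the sketch below. It runs on made-up data, and a k-nearest-neighbor classifier (fitcknn) stands in for the network purely to keep the example short; the sizes, the two classes, the number of folds, and the seed are all assumptions. cvpartition and fitcknn come from the Statistics and Machine Learning Toolbox.

```matlab
rng(42);                                  % fix the seed so the split is repeatable
numObs = 100; numFeat = 6;                % made-up sizes
labels = repmat((0:1).', numObs / 2, 1);  % two made-up classes, 50 samples each
X = randn(numObs, numFeat) + 3 * labels;  % rows are samples (transposed columns)

c = cvpartition(labels, 'KFold', 5);      % stratified 5-fold partition
acc = zeros(c.NumTestSets, 1);
for i = 1:c.NumTestSets
    trainIdx = training(c, i);            % logical index of training samples
    testIdx  = test(c, i);                % logical index of held-out samples
    mdl  = fitcknn(X(trainIdx, :), labels(trainIdx), 'NumNeighbors', 5);
    pred = predict(mdl, X(testIdx, :));
    acc(i) = mean(pred == labels(testIdx));
end
fprintf('Accuracy: %.3f +/- %.3f\n', mean(acc), std(acc));
```

Reporting the mean and spread over folds gives a steadier estimate than a single random 30% split, and fixing the seed with rng makes a given run repeatable. A different seed will still give somewhat different numbers.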
Thanks!


Asked:

on 4 Jun 2021

Commented:

on 7 Jun 2021
