No a priori knowledge of the number of bins, or the distance between bins, is required. This approach relies on the relative difference between (sorted) elements of the data, and works well when the difference between clusters is bigger than the difference between elements within a cluster.
CLUSTERS = clusterData(DATA);
Operates column-by-column. An optional input allows you to specify the sensitivity of each columnwise clustering. Additional outputs also specify the indices of the cluster each row of data, and the bounds used to separate them.
Each column may have a different interpretation. For instance, an Mx4 array of data may represent x-data in the first column, y- in the second, z- in the third, and t- in the fourth. Returns a Px1 cell array, CLUSTERS, specifying the data points in each of the P clusters detected.
The final clustering utilizes all columns.
NOTE: This submission incorporates, expands, and replaces my earlier submission ezCluster.
clusterData works on vectors, not on matrices. Well, sort of. If you input a non-vector matrix, it clusters each column, and then clusters based on the columnwise clustering. What do your data represent?
how I can cluster a dataset of 1700X400 matrix.?
Shall I directly run this code upon my dataset?
please can you indicate me how can i obtain the number of clusters?
Ah thanks for the catch Brett, just got around to run the code.
I didn’t spend a lot of time trying to understand your data, but I did manage to cluster them in less than 1 second, using clusterData. I noticed that your column 2 isn’t fully filled out. I think that’s why you’re seeing the long delay when you include column 2. If you were to exclude the pairs with missing values, it would process a lot faster. (In fact, I’m not sure how I treated missing variables. Maybe as NaNs.)
Let me know if the clustering you get with
[clusters,clusterInds,clusterBounds] = clusterData(Binningbydensity(1:3216,:));
works for you. (Those are the rows without missing column-two values.)
If I use your suggested method would it just group data together based on densities and not consider the relative distance of the data between each other? For example let’s just say the data ranges from 1 to 10. The observations of 1 are the same as 10. Observations in between are markedly different, would your function then just put 1 & 10 in the same bin?
For my purposes, I would just want to group bins that are adjacent of the same/similar density together.
I also included a web link for my data just to give you an idea of what kind of data I am dealing with. I provided 2 cols, each is a different random variable.
Another question is that for the dataset on the 2nd column it seem to run for a particularly long time, the data are just integers centering around 1 with dispersion to as far as 7, any work around for that?
@tsan: Hi Tsan,
It's difficult to comment without seeing your data, but it sounds like you could just create and analyze a vector of densities. ClusterData will spit out the indices for the groupings. (You may need to tweak the sensitivity.)
Great code, I got a question on how I could use the code for my purposes:
How would you recommend I could use the code if I am looking to bin a sample of data together based on its density (Number of Occurrence/ Length of edge). And the length of the edges are determined by if the adjacent data groups have similar density. (Similar density are grouped together, but if the neighboring bin is 40% more or less in density, it would require another bin).
It seems like what your code is doing is grouping data based on how close they are to each other.
Haha. My data set is supposed to give me an array of numbers, but sometimes I got a singleton. That's how I found out. By the way, excellent submission!
thank you for your submission!
Hmmm. Well, that's clearly a "bug" in the sense that I could have dealt with that case more gracefully, but then--well, let's just say that I never anticipated that anyone would try to cluster a single scalar. :)
It seems like the program get stuck (running forever) when I try to cluster a singleton, say clusterData(3).
Han, did you find some problem with the submission that led you to rate this so poorly? Do you have any comments to share that might help me understand why it merits a two-star rating?
Very cool submission. I was searching different options to kind 'k' automatically in the k-means. This submission does it nicely.
PLEASE NOTE that this code uses tildes for argument placeholders. As such, it will not work without modification on releases prior to R2009b. Feel free to edit the code, or upgrade to a newer MATLAB!!!
Modified the help to correct a doc bug. Higher sensitivity results in fewer clusters, not more. (No code change.)