File Exchange

clusterData

version 1.1.0.1 (3.43 KB) by

Clusters an MxN array of data into an unspecified number (P) of bins.

Rating: 4.375 (8 ratings), 17 downloads

No a priori knowledge of the number of bins, or the distance between bins, is required. This approach relies on the relative difference between (sorted) elements of the data, and works well when the difference between clusters is bigger than the difference between elements within a cluster.
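
The idea of splitting sorted data wherever neighbouring elements differ by an unusually large amount can be sketched as follows. This is only an illustration and not the clusterData source code: the sample data, the `factor` value, and the rule of comparing each gap to a multiple of the median gap are all assumptions made for the example.

```matlab
% Illustrative sketch only; NOT the clusterData implementation.
data = [1.0 1.2 0.9 5.1 5.3 4.8 9.7 10.1 9.9];   % made-up 1-D sample

[sorted, order] = sort(data(:));   % sort values, remember original positions
gaps = diff(sorted);               % differences between neighbouring sorted values

% Hypothetical rule: a gap starts a new cluster when it is much larger
% than the typical gap. The factor of 3 is an arbitrary assumption.
factor = 3;
isBreak = gaps > factor * median(gaps);

% Cluster label of each sorted element: 1 + number of breaks before it.
sortedLabels = [1; 1 + cumsum(isBreak)];

% Map the labels back to the original ordering of the data.
labels = zeros(size(sortedLabels));
labels(order) = sortedLabels;

nClusters = max(labels);           % number of clusters found
```

With well-separated groups such as these, the large gaps stand out clearly. When within-cluster and between-cluster differences are of similar size, any threshold of this kind becomes unreliable, which matches the caveat in the description.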

SYNTAX:
CLUSTERS = clusterData(DATA);
Operates column-by-column. An optional input allows you to specify the sensitivity of each columnwise clustering. Additional outputs specify the index of the cluster to which each row of data belongs, and the bounds used to separate the clusters.

Each column may have a different interpretation. For instance, an Mx4 array of data may represent x-data in the first column, y- in the second, z- in the third, and t- in the fourth. Returns a Px1 cell array, CLUSTERS, specifying the data points in each of the P clusters detected.

The final clustering utilizes all columns.
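
One plausible way to combine columnwise results into a final clustering is to put rows in the same final cluster only when they share a cluster index in every column. The sketch below assumes that rule; it is not taken from the clusterData source, and the data and `columnLabels` values are invented for illustration.

```matlab
% Illustrative sketch only; the combination rule is an assumption.
data = [0.1 10; 0.2 11; 0.3 50; 5.0 51; 5.1 52];   % made-up Mx2 data

% columnLabels(i,j) = cluster index of row i within column j
% (hypothetical values, as if each column had been clustered separately).
columnLabels = [1 1; 1 1; 1 2; 2 2; 2 2];

% Rows with identical label combinations share a final cluster.
[combos, ~, finalLabels] = unique(columnLabels, 'rows');
nFinal = size(combos, 1);

% Collect the rows of each final cluster into a Px1 cell array.
clusters = arrayfun(@(k) data(finalLabels == k, :), (1:nFinal)', ...
    'UniformOutput', false);
```

Here the third row agrees with the first two in column 1 but not in column 2, so it ends up in a cluster of its own.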

NOTE: This submission incorporates, expands, and replaces my earlier submission ezCluster.

Comments and Ratings (21)

Brett Shoelson

@sayanti:
clusterData works on vectors, not on matrices. Well, sort of. If you input a non-vector matrix, it clusters each column, and then clusters based on the columnwise clustering. What do your data represent?
Brett

How can I cluster a dataset stored as a 1700x400 matrix?
Can I run this code directly on my dataset?

Fritz

Brett Shoelson

@nadjoua:
numel(clusters)?

nadjoua

Could you please tell me how I can obtain the number of clusters?
thanks

tsan toso

Ah, thanks for the catch, Brett. I just got around to running the code.

Brett Shoelson

Hi Tsan,

I didn’t spend a lot of time trying to understand your data, but I did manage to cluster them in less than 1 second, using clusterData. I noticed that your column 2 isn’t fully filled out. I think that’s why you’re seeing the long delay when you include column 2. If you were to exclude the pairs with missing values, it would process a lot faster. (In fact, I’m not sure how I treated missing variables. Maybe as NaNs.)

Let me know if the clustering you get with

[clusters,clusterInds,clusterBounds] = clusterData(Binningbydensity(1:3216,:));

works for you. (Those are the rows without missing column-two values.)

Cheers,
Brett
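
Dropping rows with missing values before clustering, as suggested above, might look like the following. The sketch assumes missing entries are stored as NaN, which the comment itself leaves uncertain; the data are invented, and the call to clusterData is left commented out because it needs the File Exchange submission on the path.

```matlab
% Illustrative sketch; assumes missing values are represented as NaN.
data = [1 2; 3 NaN; 5 6; NaN NaN; 7 8];   % made-up data with gaps

keep = ~any(isnan(data), 2);   % true for rows with no NaN in any column
cleanData = data(keep, :);     % only the complete rows remain

% clusters = clusterData(cleanData);   % requires the clusterData submission
```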

tsan toso

Hi Brett,

If I use your suggested method, would it just group data together based on densities and not consider the relative distance of the data points from each other? For example, let's say the data range from 1 to 10, the densities at 1 and at 10 are the same, and the observations in between are markedly different. Would your function then just put 1 and 10 in the same bin?

For my purposes, I just want to group adjacent bins of the same or similar density together.

I also included a web link for my data just to give you an idea of what kind of data I am dealing with. I provided 2 cols, each is a different random variable.

https://docs.google.com/spreadsheet/ccc?key=0Anv9v54gTjMedGtiRW5fanFRUFBOcW4xUTJ4NHFWbFE&usp=drive_web#gid=0

Another question: for the dataset in the 2nd column, it seems to run for a particularly long time. The data are just integers centered around 1, with dispersion as far as 7. Is there any workaround for that?

Thanks.

Brett Shoelson

@tsan: Hi Tsan,
It's difficult to comment without seeing your data, but it sounds like you could just create and analyze a vector of densities. clusterData will spit out the indices for the groupings. (You may need to tweak the sensitivity.)
Cheers,
Brett

tsan toso

Hi Brett,

Great code. I have a question about how I could use it for my purposes:

How would you recommend I use the code if I want to bin a sample of data based on its density (number of occurrences / length of edge)? The lengths of the edges are determined by whether adjacent data groups have similar density. (Groups of similar density are binned together, but if the neighboring bin is 40% more or less dense, it requires another bin.)

It seems like what your code is doing is grouping data based on how close they are to each other.

Thanks.

Hoi Wong

Haha. My data set is supposed to give me an array of numbers, but sometimes I got a singleton. That's how I found out. By the way, excellent submission!

Hoi Wong

Xiong

thank you for your submission!

Brett Shoelson

@Hoi,
Hmmm. Well, that's clearly a "bug" in the sense that I could have dealt with that case more gracefully, but then--well, let's just say that I never anticipated that anyone would try to cluster a single scalar. :)

Hoi Wong

It seems like the program gets stuck (running forever) when I try to cluster a singleton, say clusterData(3).

Brett Shoelson

Han, did you find some problem with the submission that led you to rate this so poorly? Do you have any comments to share that might help me understand why it merits a two-star rating?
Thanks,
Brett

Han

Deanna

Joel

Excellent submission

Venkat R

Very cool submission. I was searching for different options to find 'k' automatically in k-means. This submission does it nicely.

Brett Shoelson

PLEASE NOTE that this code uses tildes for argument placeholders. As such, it will not work without modification on releases prior to R2009b. Feel free to edit the code, or upgrade to a newer MATLAB!!!
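
For anyone editing the code for an older release, the usual change is to replace each tilde placeholder with a throwaway variable. The sketch below uses `sort` purely as a stand-in; the variable name `unused` is arbitrary, and the actual lines that need changing in clusterData are not shown here.

```matlab
% Illustrative sketch of replacing a tilde placeholder.
values = [3 1 2];

% R2009b and later: ignore the first output with a tilde.
% [~, order] = sort(values);

% Earlier releases: capture the unwanted output in a throwaway variable.
[unused, order] = sort(values);
clear unused   % optional: discard the throwaway variable
```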

Updates

1.1.0.1

Updated license

1.1

Modified the help to correct a doc bug. Higher sensitivity results in fewer clusters, not more. (No code change.)

MATLAB Release
MATLAB 7.13 (R2011b)
Acknowledgements

Inspired: Data clustering using Bat Algorithm
