The gap statistic is a method for estimating the most likely number of clusters in a partition clustering, e.g. k-means clustering (but consider more robust clustering methods). The method was introduced by Trevor Hastie, Robert Tibshirani, and Guenther Walther, all of Stanford University.
I posted this here since I hadn't found any gap statistic implementation to validate my code against, so feel free to report bugs and suggest improvements.
Note that the Statistics Toolbox has implemented the gap statistic as a class in the clustering.evaluation package since R2013b: http://www.mathworks.com/help/stats/clustering.evaluation.gapevaluationclass.html
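For readers who want the gist of the method: the gap statistic compares the log within-cluster dispersion of the data against the same quantity computed on reference datasets drawn from a null distribution, and picks the smallest k whose gap is within one standard error of the next one. A minimal Python/NumPy sketch using scikit-learn's KMeans (an illustration of the technique, not the MATLAB code shared here; the cluster range and number of reference draws are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

def log_wk(X, k, seed=0):
    """log of the within-cluster dispersion W_k (sum of squared
    distances to the cluster centroids), as used by the gap statistic."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, ks=range(1, 6), n_refs=10, seed=0):
    """Estimate the number of clusters via the gap statistic.
    Reference datasets are drawn uniformly over the observed
    range of each feature (the paper's simplest null)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in ks:
        # log dispersion on n_refs uniform reference datasets
        ref = np.array([log_wk(rng.uniform(lo, hi, X.shape), k)
                        for _ in range(n_refs)])
        gaps.append(ref.mean() - log_wk(X, k))
        # standard error, corrected for simulation noise
        sks.append(ref.std() * np.sqrt(1 + 1.0 / n_refs))
    # Tibshirani et al.: smallest k with gap(k) >= gap(k+1) - s(k+1)
    for i, k in enumerate(list(ks)[:-1]):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return k
    return list(ks)[-1]
```

On two well-separated blobs this criterion should recover k = 2.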
I have a question: when I run the test.m file, it takes so long that I suspect something is wrong, but I don't know what the problem is.
Hi, I used this method to compute the optimal number of clusters, but when I run the program I get, for example, 4 clusters, and when I run it again I get 3 clusters on the same signal without any changes... How is this possible? My signal is 10 seconds (5000 samples) of an ECG signal, from which only the QRS complexes are cut (usually 50 samples each).
Thank you for both your comments.
Regarding the xpos/ypos bug, you were right and I have fixed it.
Regarding the distribution: according to the paper, the reference distribution is uniform. I am not sure what difference it makes in this case, but the code now follows the paper.
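For anyone following this thread, the two reference distributions being discussed differ only in how the null samples are drawn. A small Python/NumPy sketch of both (my own illustration; the function names are made up):

```python
import numpy as np

def uniform_reference(X, rng):
    """Reference sample as in Tibshirani et al.: each feature is
    drawn uniformly over the observed range of that feature."""
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def normal_reference(X, rng):
    """What the earlier version of the code effectively did:
    a Gaussian matched to each feature's mean and standard deviation."""
    return rng.normal(X.mean(axis=0), X.std(axis=0), size=X.shape)
```

In practice the uniform null is the paper's simplest choice; the Gaussian null concentrates reference points near the data mean, which can shift the reference dispersions and hence the gap values.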
Hi, thank you for sharing the code.
My only comment is that the reference distribution you use is a multivariate normal distribution, not the uniform distribution described in the paper. I am not sure whether it makes any difference in the result.
Out of curiosity, what is the following code for? I ran into the issue of xpos being empty and noticed this line when going through the code. Am I correct in assuming that this is supposed to be xpos instead of ypos?
ypos = max(num_clusters);
My original thought process in instances like this was to set xpos = NaN and then take the mode of the opt_index matrix after 10+ iterations.
Thank you for your comments.
These are two different problems:
1. The estimation of the dispersion relies (in my code) on the clustering produced by k-means, which is not the most robust method in the world. So if you reduce the problem to 10 samples per cluster, the failure comes from the classifier; you should substitute k-means with something more robust (e.g. k-medoids or spectral clustering).
2. From what I have seen, the gap statistic works in about 80% of cases, again because you rely on k-means or other randomized steps. That's why I run 10 iterations and then show the mean. If you have any idea how to improve these practical aspects (the random seeding and k-means), you are welcome to suggest it, but I implemented the method exactly as described in the Hastie paper.
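Along the lines of the mode idea mentioned above, one cheap stabilization is to repeat the randomized estimate and report the most frequent answer rather than the mean, since the mean can land on a k that no single run ever produced. A short Python sketch (illustrative only; estimate_k stands in for one full gap-statistic run):

```python
from collections import Counter

def stable_estimate(estimate_k, n_runs=10):
    """Run a randomized cluster-count estimator n_runs times and
    return the most frequent answer (the mode). Unlike averaging,
    the mode is always a value some run actually produced."""
    runs = [estimate_k() for _ in range(n_runs)]
    return Counter(runs).most_common(1)[0][0]
```

Failed runs (the empty-xpos case) could return NaN and simply be excluded before taking the mode.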
Hi Crimi, it took me some time to process the code.
So I simply changed the test data to
test_datas = rand(10,2)*1;
test_data2 = rand(10,2)*1+10;
test_datas = cat(1,test_datas, test_data2);
and it turned out:
Warning: There is a NaN or Inf among the results
??? Attempted to access xpos(1); index out of bounds because numel(xpos)=0.
Error in ==> gap_statistics at 54
opt_k = num_clusters(xpos(1));
Error in ==> test_gap_statistics at 38
[ opt_index(ii), max_gap(ii)] = gap_statistics(test_datas, num_clusters, num_reference_bootstraps, compactness_as_dispersion);
So I tried again with the original number of test samples, which is 200, and I got:
"The max gap is for 3 cluster/s.."
This means the estimated number of clusters is 3? Shouldn't it be 2 based on the test data? Please correct me if I'm wrong.
Thank you very much for your valuable time! Very much appreciated!
Better code and communication of some validation would help.