The gap statistic is a method for estimating the most plausible number of clusters in a partition clustering, e.g. k-means clustering (though consider more robust clustering methods). It was introduced by Trevor Hastie, Robert Tibshirani, and Guenther Walther, all of Stanford University.
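For readers who want to see the mechanics, here is a minimal Python sketch of the gap statistic as described in the Tibshirani/Walther/Hastie paper — this is not the submitted MATLAB code, and the simple argmax selection at the end is my own simplification (the paper instead picks the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}):

```python
import numpy as np

def dispersion(X, labels, centroids):
    """Within-cluster sum of squared distances to centroids (W_k in the paper)."""
    return sum(((X[labels == j] - centroids[j]) ** 2).sum()
               for j in range(len(centroids)))

def kmeans(X, k, n_iter=30, n_restarts=5, seed=0):
    """Plain Lloyd's algorithm with a few random restarts (best dispersion wins)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        c = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
            c = np.array([X[labels == j].mean(0) if (labels == j).any() else c[j]
                          for j in range(k)])
        w = dispersion(X, labels, c)
        if best is None or w < best[0]:
            best = (w, labels, c)
    return best[1], best[2]

def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    """Gap(k) = mean_b log(W*_kb) - log(W_k); each reference set is drawn
    uniformly over the bounding box of the data, as the paper prescribes."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = np.log(dispersion(X, *kmeans(X, k, seed=seed)))
        ref_log_w = []
        for b in range(n_refs):
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_log_w.append(np.log(dispersion(ref, *kmeans(ref, k, seed=seed + b))))
        gaps.append(np.mean(ref_log_w) - log_wk)
    # simplified rule: take the k with the largest gap (argmax index + 1)
    return np.array(gaps)
```

On two well-separated clusters, the returned gap curve typically peaks at k = 2.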

I posted this here because I had not found any other gap statistic implementation to validate my code against, so feel free to report bugs and suggest improvements.

Note that the Statistics Toolbox has implemented the gap statistic as a class in the clustering.evaluation package since R2013b: http://www.mathworks.com/help/stats/clustering.evaluation.gapevaluationclass.html

Hi, I used this method to compute the optimal number of clusters, but when I run the program I get, for example, 4 clusters, and when I run it again on the same signal, without any changes, I get 3 clusters... How is this possible? My signal is 10 seconds (5000 samples) of an ECG signal, from which only the QRS complexes are cut out (usually 50 samples each).

Hi
Thank you for both your comments.
Regarding the xpos/ypos bug, you were right and I have fixed it.
Regarding the distribution: according to the paper the reference distribution is uniform. I am not sure what difference it makes in this case, but the code now follows the paper.

Hi, thank you for sharing the code.
My only comment is that the reference distribution you use is a multivariate normal distribution, not the uniform distribution described in the paper. I am not sure whether this makes any difference in the result.
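For concreteness, here is what the paper's uniform reference-set choices look like — a Python illustration of methods (a) and (b) from Tibshirani, Walther & Hastie, not the MATLAB submission itself (the function names are mine):

```python
import numpy as np

def reference_uniform(X, rng):
    """Method (a): sample each feature uniformly over the observed
    range of that feature (the bounding box of the data)."""
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def reference_uniform_pca(X, rng):
    """Method (b): rotate into the principal-axis frame via SVD, sample
    uniformly over the ranges there, then rotate back; this respects
    the shape of the data cloud."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    Z = (X - mu) @ Vt.T                      # data in principal-axis frame
    Zref = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return Zref @ Vt + mu                    # back to the original frame
```

Method (a) is simpler and is what most implementations use; method (b) matters when the data lie along an elongated, rotated subspace.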

Out of curiosity, what is the following code for? I ran into the issue of xpos = [] and noticed this when going through the code. Am I correct in assuming that this is supposed to be xpos instead of ypos?

if (isempty(ypos))
    ypos = max(num_clusters);
end

My original thought process in instances like this was to set xpos = NaN and then take the mode of the opt_index matrix after 10+ iterations.
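That fallback might look like the following Python sketch (illustrative only; `estimate_k` is a hypothetical wrapper around a single gap-statistic run that returns NaN when no answer is found):

```python
import math
from collections import Counter

def stable_k(estimate_k, n_runs=10):
    """Run a randomized cluster-count estimator several times and return
    the mode of the answers, ignoring runs that failed (returned NaN).
    `estimate_k(seed)` stands in for one gap-statistic run."""
    votes = [estimate_k(seed) for seed in range(n_runs)]
    votes = [k for k in votes if not (isinstance(k, float) and math.isnan(k))]
    if not votes:
        raise ValueError("all runs failed")
    return Counter(votes).most_common(1)[0][0]
```

The mode is arguably more robust than the mean here, since averaging cluster counts like 3 and 4 yields a non-integer answer.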

Hi
Thank you for your comments,
These are two different problems:
1. The estimation of the dispersion relies (in my code) on the clustering produced by k-means, which is not the most robust method in the world. So if you simplify the problem down to 10 samples per cluster, the issue lies in the classifier; you should substitute k-means with something more robust (e.g. k-medoids or spectral clustering).
2. From what I have seen, the gap statistic works in about 80% of cases, again because you rely on k-means or on other randomization. That's why I run 10 iterations and then show the mean. If you have any idea how to improve these practical aspects (random seeding and k-means), I am happy to improve it, but I implemented it exactly as described in the Hastie paper.
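As a concrete illustration of the k-medoids substitution mentioned above, here is a naive PAM-style sketch in Python (not part of the MATLAB submission; it assumes a precomputed pairwise-distance matrix D):

```python
import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """Naive PAM-style k-medoids on a precomputed distance matrix D.
    Centers are always data points, which makes the method more robust
    to outliers than k-means."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)      # assign to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                # the member minimizing total distance to its cluster mates
                new[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):           # converged
            break
        medoids = new
    return D[:, medoids].argmin(axis=1), medoids
```

Swapping this in for k-means inside the dispersion estimate (with W_k computed from distances to medoids) would be one way to stabilize the results on small clusters.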