I have trouble understanding this line in the k-means++ implementation:
idx(i) = find(cumsum(D)/sum(D)>rand,1);
This doesn't seem to follow the description of the algorithm in the paper, where each data point x is supposed to be chosen at random with probability p(x) = sqrdist(x,C)/sum(sqrdist(X,C)).
In your implementation, which uses a cumulative sum, it looks to me as if the higher the index of a data point, the higher the probability that it will be selected. That doesn't make sense to me. For example, the last data point always has cumsum(D)/sum(D) = 1, which seems very biased.
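
One way to check this concern empirically might be the small sketch below. It uses made-up squared distances (the values in D, the seed, and the names p, c, nTrials, counts and freq are my own assumptions, not taken from the implementation) and repeats the selection many times with the same find(cumsum(D)/sum(D)>rand,1) expression, then prints the target p(x), the cumulative value, and the observed selection frequency side by side.

```matlab
% Hypothetical squared distances for four points (made-up values for illustration).
D = [1; 2; 3; 4];

% Target probabilities from the paper: p(x) = D(x) / sum(D).
p = D / sum(D);

% Cumulative values used by the line in question; the last entry is always 1.
c = cumsum(D) / sum(D);

% Repeat the selection many times and count how often each index is chosen.
rng(0);                      % fixed seed so the run is repeatable
nTrials = 100000;
counts = zeros(size(D));
for t = 1:nTrials
    k = find(c > rand, 1);   % same expression as in the implementation
    counts(k) = counts(k) + 1;
end
freq = counts / nTrials;

% Columns: target p(x), cumulative value, observed selection frequency.
disp([p, c, freq]);
```

If the third column tracks the first rather than the second, then an index is being chosen with probability equal to the gap between consecutive cumulative values, c(k) - c(k-1) = p(k), and not with the cumulative value itself, in which case the bias I describe would not be real.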