Random sampling using k-means cluster without replacement
Version 1.0.0 (1.36 KB) by Dipankar
Prepares a training dataset from labelled (categorical) data so that as many clusters as possible within each class are well represented
Updated 19 Aug 2021
Random sampling without replacement does not ensure that every cluster in a sample dataset is picked up. For example, in the iris dataset, a random sample drawn from one particular class may miss some important clusters. Across all three flower species, the number of missed clusters may be too high, and a training dataset prepared from such a random selection may yield a poor classification/regression model.
The MATLAB function randperm() is generally used to create a random series of numbers without replacement. However, this function does not ensure that index values are drawn from all possible clusters. To identify the clusters, one may use k-means clustering and then extract a certain percentage of values from each cluster to prepare the training dataset.
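The difference between the two sampling strategies can be illustrated with a small sketch. The cluster labels here are hypothetical toy values rather than k-means output, and the rule of drawing at least one row per cluster is an assumption added for the illustration, not part of the submission.

```matlab
% Hypothetical toy labels: 100 points in 3 clusters, the third one rare
rng(0);                                   % fix the seed for repeatability
labels = [ones(60,1); 2*ones(37,1); 3*ones(3,1)];
n = numel(labels);
per = 0.1;                                % sample 10% of the rows

% Plain sampling without replacement over all rows:
% nothing guarantees that the rare cluster is represented
plainIdx = randperm(n, round(per*n));
plainCounts = accumarray(labels(plainIdx), 1, [3 1]);   % rows drawn per cluster

% Per-cluster sampling without replacement
stratIdx = [];
for c = 1:3
    members = find(labels == c);                  % rows belonging to cluster c
    nPick = max(1, round(per*numel(members)));    % assumption: at least one row
    pick = members(randperm(numel(members), nPick));
    stratIdx = [stratIdx; pick];
end
stratCounts = accumarray(labels(stratIdx), 1, [3 1]);

disp([plainCounts stratCounts]);          % column 1: plain, column 2: per cluster
```

With the per-cluster draw, every cluster contributes to the sample by construction, whereas the plain draw depends on chance.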
close all; clear; clc
%% Load the input dataset
load fisheriris % assumed data source: provides meas (150x4); needs Statistics and Machine Learning Toolbox
iris = meas;
iris(:,5) = [repmat(1,50,1); repmat(2,50,1); repmat(3,50,1)]; % last column holds the categorical values 1, 2, 3
k = 3; % number of class labels, i.e., 'setosa', 'versicolor', 'virginica'
per = 0.7; % fraction of samples drawn from each cluster, i.e., 70% from each of the
% clusters of the 3 classes ('setosa', 'versicolor', 'virginica')
% k-means will create floor(length(class)/7) clusters in each class, i.e.,
% within 'setosa', 'versicolor', and 'virginica'
class_col = 5; % column number of the class label
[s,td] = split_data_kmean(iris,class_col,per,k); % s: sampled row indices, td: training dataset
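Once the sampled indices s are available, the rows that were not selected can be recovered as their complement. The sketch below uses hypothetical stand-in values for the dataset and for s. Using the remaining rows as a test set is an assumption, since the description only covers the training dataset.

```matlab
% Hypothetical stand-ins for the dataset and the sampled indices
data = magic(6);                  % toy dataset with 6 rows
s = [2; 5; 6];                    % pretend these rows were sampled for training
td = data(s,:);                   % training rows

% Rows not selected for training (assumed here to serve as held-out data)
test_idx = setdiff((1:size(data,1))', s);
test_data = data(test_idx,:);
```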
Step 1: Input data (for iris, a dataset of 5 columns and 150 records); number of labelled classes = k (for iris, the value is 3); fraction of data taken from each cluster = per; column number of the class label = class_col (for iris, the 5th column)
Step 2: For c = 1 to k /* 3 classes: 1 -> 'setosa', 2 -> 'versicolor', 3 -> 'virginica' */
Step 3: Find the row indices of class c (c = 1 implies setosa) -> i
Step 4: ub = max(i); lb = min(i);
Find the range from lb to ub: rng = lb:1:ub;
Step 5: Find the length of data(i,:), then take the floor of length/7 to get the number of clusters f
Step 6: Apply k-means clustering to data(i,:) with f clusters
idx: cluster indices, C: cluster centres
Step 7: Find the range of cluster indices, mn to mx
Step 8: For each cluster j = mn:mx
Find the indices of each cluster within the class
t1 = find(idx(:)==j);
t1 = t1 + (lb-1); /* maps back to dataset rows; assumes the rows of class c are contiguous */
Step 9: Random sampling without repetition from the clusters of the c-th class
trn = randsample(t1,round(per*length(t1)));
Step 10: Extract those rows from the dataset as training_points
Step 11: Append the rows to the training dataset td (initially empty).
td = [td;training_points];
Step 12: Append the indices to the index set s (initially empty).
s = [s;trn];
Step 13: End of the inner for loop (clusters)
Step 14: End of the outer for loop (classes)
Step 15: End of algorithm
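The steps above can be sketched as a short script. This is an illustration under stated assumptions, not the submission's own split_data_kmean code: it loads fisheriris from the Statistics and Machine Learning Toolbox, clusters only the feature columns (the class column is constant within a class), maps cluster members back through the class row indices instead of adding lb-1, and swaps randsample for randperm because randsample treats a one-element population as the range 1:n.

```matlab
% Sketch of Steps 1-15; requires Statistics and Machine Learning Toolbox
load fisheriris                           % provides meas (150x4), 50 rows per species
data = [meas, repelem((1:3)', 50)];       % column 5 holds class labels 1, 2, 3
class_col = 5;  k = 3;  per = 0.7;
rng(1);                                   % seed added here for repeatability

s  = [];                                  % sampled row indices
td = [];                                  % training rows
for c = 1:k                               % Step 2: loop over classes
    rows = find(data(:,class_col) == c);  % Step 3: rows of class c
    f = floor(numel(rows)/7);             % Step 5: number of clusters in this class
    idx = kmeans(data(rows,1:class_col-1), f);   % Step 6: cluster the features
    for j = 1:f                           % Steps 7-8: loop over clusters
        t1 = rows(idx == j);              % dataset rows of cluster j in class c
        nPick = round(per*numel(t1));
        if nPick == 0, continue; end      % nothing to draw from this cluster
        trn = t1(randperm(numel(t1), nPick));    % Step 9: without replacement
        td = [td; data(trn,:)];           % Steps 10-11: append training rows
        s  = [s; trn];                    % Step 12: append indices
    end
end
```

Because each cluster is sampled separately and clusters do not overlap, s contains no repeated row index and td equals data(s,:).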
Dipankar (2023). Random sampling using k-means cluster without replacement (https://www.mathworks.com/matlabcentral/fileexchange/97894-random-sampling-using-k-means-cluster-without-replacement), MATLAB Central File Exchange. Retrieved .
MATLAB Release Compatibility
Created with R2021a
Compatible with any release
Platform Compatibility: Windows, macOS, Linux