Split dataset into three different size sets without overlapping

I am working on image processing in MATLAB. I need to split a large dataset into three non-overlapping subsets (25%, 25% and 50%). The dataset (say, 1,000 images) has 10 classes of 100 images each. From class 1, for example, 25% of the images should go into the training set, another 25% into the validation set and the remaining 50% into the test set, and likewise for every other class. There should be no repetition: if an image from a class has been stored in one subset, it must not be stored in any other subset. How do I do that in MATLAB?
My code is as follows:
load('data.mat')
for i = 1:size(data, 1)
    for j = 1:78
        if mod(i,2)==0
            trainingset(i/2,j) = data(i,j);
        else
            remainset((i-1)/2+1,j) = data(i,j);
        end
    end
end
for i = 1:size(remainset, 1)
    for j = 1:78
        if mod(i,2)==0
            testset(i/2,j) = remainset(i,j);
        else
            validationset((i-1)/2+1,j) = remainset(i,j);
        end
    end
end
Although it somehow works, I am looking for a better algorithm as some parts of data are lost.
  2 Comments
david jones on 3 Sep 2016
I need to split the data into three subsets, but 'datasample' only calculates indices for one subset. If I use it again to calculate the indices of the other subsets, it is likely to produce duplicate indices in different subsets. I can use randperm, but the same issue exists. I need to split the dataset into three subsets, each containing a fixed percentage of every class of data. A simple sampling method such as
1:250,1:250,1:500
does not work, because each subset then contains members of only some of the classes. Example: subset 1 should have 25% of class 1, 25% of class 2, 25% of class 3, ..., 25% of class n. Subset 2 should have 25% of class 1, 25% of class 2, 25% of class 3, ..., 25% of class n. Subset 3 should have 50% of class 1, 50% of class 2, 50% of class 3, ..., 50% of class n.
The intersection of subset 1, subset 2 and subset 3 must be empty, and the union of the subsets must cover the whole dataset.
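The two conditions in the last sentence can be checked directly on index vectors. The following is only a minimal sketch: idx1, idx2, idx3 and n are hypothetical names and values chosen for illustration, holding the row numbers assigned to each subset of an n-row dataset.

```matlab
% Hypothetical example: 8 rows split 25% / 25% / 50%
n = 8;
idx1 = [3 7];
idx2 = [1 6];
idx3 = [2 4 5 8];

% No row may appear in more than one subset
all_idx = [idx1 idx2 idx3];
is_disjoint = numel(unique(all_idx)) == numel(all_idx);

% Together the subsets must contain every row exactly once
covers_all = isequal(sort(all_idx), 1:n);
```

If both is_disjoint and covers_all are true, the split satisfies the requirement.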


Answers (1)

Frank B. on 8 May 2018
Here is a quick answer using datasample, for a single vector named data. Loop over your classes or use indexes if they have to be shared.
load('data.mat')
% Declaring data division ratio
% 25% for training, 25% for validation, 50% for test
dataset_div = [0.25 0.25 0.5];
% Number of samples in each set (floor keeps the counts integer,
% as datasample requires; the test set takes whatever is left)
n = length(data);
nb_train = floor((dataset_div(1)/sum(dataset_div))*n);
nb_valid = floor((dataset_div(2)/sum(dataset_div))*n);
nb_test = n - nb_train - nb_valid;
% Splitting data into 3 non-overlapping vectors
% (datasample is part of the Statistics and Machine Learning Toolbox)
% Training data
[data_train,idx_sample] = datasample(data,nb_train,'Replace',false);
% Removing used values
idx_left = 1:length(data);
idx_left(idx_sample) = [];
val_left = data(idx_left);
% Validation data
[data_valid,idx_sample] = datasample(val_left,nb_valid,'Replace',false);
% Removing used values (index into val_left, not data, because
% idx_sample refers to positions in val_left)
idx_left = 1:length(val_left);
idx_left(idx_sample) = [];
val_left = val_left(idx_left);
% Test data: everything that remains
data_test = val_left;
Cheers
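To make the "loop over your classes" suggestion concrete, here is a minimal sketch of a per-class split. It uses randperm instead of datasample, which is a swapped-in technique and not the code from the answer above. The labels vector and the ratios variable are assumptions made for illustration: labels is taken to hold the class of every row, and is generated here as 10 classes of 100 samples to match the question.

```matlab
% Hypothetical example data: 10 classes, 100 samples each
labels = repelem((1:10)', 100);   % class label of every row (assumed)
ratios = [0.25 0.25 0.5];         % training, validation, test

idx_train = [];
idx_valid = [];
idx_test  = [];
classes = unique(labels);
for c = 1:numel(classes)
    % Rows belonging to this class, in random order
    idx_class = find(labels == classes(c));
    idx_class = idx_class(randperm(numel(idx_class)));

    % Subset sizes for this class; the test set takes the remainder
    n_class = numel(idx_class);
    n_train = floor(ratios(1) * n_class);
    n_valid = floor(ratios(2) * n_class);

    % Consecutive slices of one permutation cannot overlap
    idx_train = [idx_train; idx_class(1:n_train)];
    idx_valid = [idx_valid; idx_class(n_train+1:n_train+n_valid)];
    idx_test  = [idx_test;  idx_class(n_train+n_valid+1:end)];
end
```

Assuming the samples are stored one per row, the subsets would then be data(idx_train,:), data(idx_valid,:) and data(idx_test,:). Because each class is shuffled once and then cut into consecutive slices, no row can land in two subsets, and every row lands in exactly one.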
