How to partition data in a very specific way
I am designing a neural network to classify subjects into two classes, and I am having some trouble preparing the data.
I have been looking for a while for the proper way to partition the data that will be fed to the network, but I haven't found it.
Concretely, I need to hold out 30% of the observations as the test set. The remaining observations are then split 50/50 into training and validation sets (I know these are not the traditional ratios, but I've been asked to do it this way). Because the training and validation sets will change between runs, I will use cross-validation for this second partition to ensure data independence.
Initially I used the crossvalind function, but I noticed that it doesn't take class proportions into account. Later I tried cvpartition, since one of its syntaxes supports stratified k-fold cross-validation, but I don't know how to form groups with a specific ratio.
This is how I currently divide the data into test and "other" sets (the latter is then divided into training and validation sets):
INDICES = crossvalind('Kfold',size(data,1),10); % Dividing data into 10 groups...
testInd = (INDICES == 1 | INDICES == 2 | INDICES == 3); otherInd = ~testInd; % and grouping 3 of them in test (30%)
testSet = data(testInd,:); otherSet = data(otherInd,:);
testTarg = targets(testInd); otherTarg = targets(otherInd);
And this is how I form the training and validation sets from the remaining data:
CVO = cvpartition(otherTarg,'k',K);
for i=1:K
trainIdx = CVO.training(i); valIdx = CVO.test(i); % index by the loop variable i
trainPos = find(trainIdx); valPos = find(valIdx);
trainSet = otherSet(trainPos,:); valSet = otherSet(valPos,:);
trainTarg = otherTarg(trainPos); valTarg = otherTarg(valPos);
end
As far as I can tell, the test and "other" sets don't have proportional classes, and the training and validation sets do not have the required amount of data (half the data of the "other" group each). At this point, I wonder whether there is a function that does what I want, or whether I can do it with the functions I already know but am not using correctly.
Thank you in advance for your attention.
Answers (3)
Alejandro De Felipe
on 23 Jul 2017
Greg Heath
on 24 Jul 2017
The NN Toolbox can be used to obtain many sufficiently independent estimations of error by replacing stratification with double randomization.
1. Random data division
2. Random weight initialization
This is much easier to use than n-fold cross-validation.
Greg
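A minimal sketch of the double randomization described above, under stated assumptions: the data, the hidden layer size (10) and the number of trials (10) are placeholders that do not come from this thread, and the 0.35/0.35/0.30 ratios are one way to express a 30% test set with the remainder split evenly. Note that with dividerand the test set is redrawn on every trial, so this is not the fixed, stratified test set asked for in the question.

```matlab
% Hypothetical stand-in data: 200 observations, 5 features, two classes.
rng(0);                                % reproducible example
x = rand(5, 200);                      % toolbox convention: observations as columns
labels = [ones(1,100), 2*ones(1,100)]; % class index of each observation
t = full(ind2vec(labels));             % 2-by-200 one-hot target matrix

numTrials = 10;                        % assumed number of repetitions
testErr = zeros(1, numTrials);
for trial = 1:numTrials
    net = patternnet(10);              % assumed hidden layer size
    net.divideFcn = 'dividerand';      % 1. random data division on each call to train
    net.divideParam.trainRatio = 0.35;
    net.divideParam.valRatio   = 0.35;
    net.divideParam.testRatio  = 0.30;
    net.trainParam.showWindow  = false;

    % 2. a fresh net gets random initial weights when train configures it
    [net, tr] = train(net, x, t);

    % Misclassification rate on this trial's test indices
    y = net(x(:, tr.testInd));
    [~, predicted] = max(y, [], 1);
    testErr(trial) = mean(predicted ~= labels(tr.testInd));
end
fprintf('Mean test error: %.3f (std %.3f)\n', mean(testErr), std(testErr));
```

The spread of testErr across trials gives a rough idea of how much the estimate depends on the particular split and initialization.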
Greg Heath
on 27 Jul 2017
There are a variety of ways to divide your data. See the help and doc descriptions of
divideblock, divideind and divideint
Hope this helps.
Thank you for formally accepting my answer
Greg
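If the stratified 30% test set and the 50/50 training/validation split from the question are required, one possible approach is to build the indices with cvpartition and pass them to the network through divideind. The following is only a sketch: the data, the class balance and the hidden layer size are hypothetical. It relies on cvpartition stratifying when it is given the class labels, and on a 2-fold partition producing two halves of nearly equal size.

```matlab
rng(1);                                   % reproducible split
data    = rand(200, 5);                   % hypothetical: 200 observations as rows
targets = [ones(120,1); 2*ones(80,1)];    % hypothetical 60/40 class labels

% Stratified 30% hold-out: passing the labels makes cvpartition stratify.
cvTest   = cvpartition(targets, 'HoldOut', 0.3);
testPos  = find(test(cvTest));
otherPos = find(training(cvTest));

% Stratified 2-fold partition of the remainder: each fold is a 50/50 split.
cvHalf = cvpartition(targets(otherPos), 'KFold', 2);

x = data';                                % toolbox expects observations as columns
t = full(ind2vec(targets'));              % 2-by-200 one-hot target matrix
valPerf = zeros(1, cvHalf.NumTestSets);
for fold = 1:cvHalf.NumTestSets
    % Map fold membership back to row positions in the full data set
    trainPos = otherPos(training(cvHalf, fold));
    valPos   = otherPos(test(cvHalf, fold));

    net = patternnet(10);                 % assumed hidden layer size
    net.divideFcn = 'divideind';          % use the explicit indices below
    net.divideParam.trainInd = trainPos';
    net.divideParam.valInd   = valPos';
    net.divideParam.testInd  = testPos';  % same held-out test set in every fold
    net.trainParam.showWindow = false;
    [net, tr] = train(net, x, t);
    valPerf(fold) = tr.best_vperf;        % validation performance at the best epoch
end
```

Swapping the roles of the two folds, as the loop does, means every non-test observation is used once for training and once for validation.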