cvpartition grouped data for stratified hold-out classification?

Hi,
My ultimate objective is to build a machine learning classifier that takes student academic results and their school as input features and an alphabetic grade as the output.
I have tried using cvpartition to partition a 100 x N array into stratified 70% training and 30% hold-out testing for machine learning classification. The stratification is important because I want to maintain a similar class distribution for both the training and test sets as for the whole dataset.
cvpartition works when each row (sample) in my array is independent of all other rows (samples). For instance, suppose the data are 100 students picked at random from 100 different schools. The N variables are their average academic performance across all the subjects they study and their final alphabetic grade (A+, A, A-, B+, B, B-, C, D or F). In this situation, cvpartition readily partitions my original data for hold-out testing so that none of the test data is used in training.
However, I now have a 500 x (N+1) array in which the rows (samples) are grouped, and the new variable is the student's school. In this new dataset, 5 students are picked at random from each of 100 different schools, and I want to train a learner that classifies students based also on their school. If I use cvpartition here, it will bias my results, because data from the same school (but for different students) can end up in both the training set and the test set.
So, I want to create stratified 70% training and 30% hold-out testing datasets where none of the GROUPED data in testing was used for training, and I want to repeat this training/testing loop say 1000 times with random sets of training and hold-out testing data. I want to ensure that in each loop, student data from the schools used for testing are not from the schools used for training.
Is there a MATLAB command that allows us to easily cvpartition grouped data in the manner I require? For instance, does the command dividerand help in this case?
Thank you in advance.

Answers (1)

ahmed nebli on 2 Sep 2018
Edited: ahmed nebli on 2 Sep 2018
I think you should split your data into training and testing sets manually, without using cvpartition at all, because, as you said, cvpartition can end up testing on data from schools that were also used for training.
To do so, create a function that splits your data and saves the training and testing sets to a .mat file.
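One possible way to do such a manual split is sketched below. It is only an illustration under stated assumptions, not a built-in grouped equivalent of cvpartition: the data are synthetic, the grade list is shortened, and the names school, gradeCode and schoolLabel are made up for the example. The idea is to partition the schools rather than the students, using each school's most common grade as a stand-in label for stratification, and then map the chosen schools back to student rows.

```matlab
% Sketch: stratified 70/30 hold-out at the school level (synthetic data).
rng(0);                                       % reproducible example
nSchools = 100;  nPerSchool = 5;              % sizes taken from the question
gradeNames = {'A','B','C','D','F'};           % shortened grade list (assumption)
school = repelem((1:nSchools)', nPerSchool);  % school ID for each student row
gradeCode = randi(numel(gradeNames), nSchools*nPerSchool, 1);   % random grades
grade = categorical(gradeCode, 1:numel(gradeNames), gradeNames); % response

% One label per school, used only for stratification: its most common grade.
[schoolIds, ~, schoolIdx] = unique(school);
schoolLabel = accumarray(schoolIdx, gradeCode, [], @mode);

% Partition the schools, not the students; cvpartition stratifies on the label.
c = cvpartition(schoolLabel, 'HoldOut', 0.3);

nRepeats = 5;                                 % the question asks for 1000
for r = 1:nRepeats
    % Map the selected schools back to logical row masks over the students.
    trainRows = ismember(school, schoolIds(training(c)));
    testRows  = ismember(school, schoolIds(test(c)));
    % ... train on grade(trainRows) etc., evaluate on the testRows rows ...
    c = repartition(c);                       % new random split, same settings
end
```

Because the stratification acts on one label per school, the student-level grade distribution in each set is only approximately preserved. Whether the most common grade is a sensible school-level label depends on the data, so treat that choice as an assumption to revisit.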
  1 Comment
Cuong Quang on 3 Sep 2018
Hi Ahmed,
Yes, as you say, there are more manual methods available.
What I'm wondering is whether there is the equivalent of a single cvpartition command for grouped data?
Thanks Cuong
