Thread Subject:
crossvalind -- size of training/testing set?

Subject: crossvalind -- size of training/testing set?

From: Sophia Yuditskaya

Date: 1 Feb, 2009 22:40:18

Message: 1 of 11

Hi,

I am calling crossvalind as follows:

[train, test] = crossvalind('holdOut', groups);

What proportion of the original data is put into training vs testing sets? I'm assuming it's 50% each ... but instead I'd like to use 25% of the data for training and 75% for testing. How do I specify this? I've tried

[train, test] = crossvalind('holdOut', groups, 0.25);

but I get an OutOfMemoryError.

Any help would be appreciated.

Thanks,

Sophia

Subject: crossvalind -- size of training/testing set?

From: Peter Perkins

Date: 2 Feb, 2009 16:38:25

Message: 2 of 11

Sophia Yuditskaya wrote:

> [train, test] = crossvalind('holdOut', groups, 0.25);
>
> but I get an OutOfMemoryError.

Sophia, that appears to be the correct syntax. You haven't said what's in your variable "groups".

Subject: crossvalind -- size of training/testing set?

From: Greg Heath

Date: 2 Feb, 2009 18:44:03

Message: 3 of 11

Carefully read

doc crossvalind

or

help crossvalind

because, according to

http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/crossvalind.html

it looks like the proper syntax is

P = 0.75
[train, test] = crossvalind('HoldOut', groups, N, P);

If you still get OutOfMemoryError, it looks like you might have to
reduce N.

Unfortunately you have not given us N or the number of groups.
More unfortunately, crossvalind does not allow the specification
of a validation set for determining training parameters.

In general, the data set should have a 3-way train/validate/test
split. See the comp.ai.neural.net FAQ and archives. Also see many
of my posts in both CSSM and CANN regarding how to choose Ntrn,
Nval and Ntst. Search Google Groups with

greg-heath validation

What you will find is that the ACTUAL subset SIZES are important;
NOT their FRACTION of the total data set.

Hope this helps.

Greg
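
For reference, a minimal sketch of the 3-way train/validate/test split described above, using only base MATLAB. The subset sizes Ntrn, Nval and Ntst are hypothetical placeholders, not recommendations, and N = 5000 is the dataset size reported later in this thread.

```matlab
% Hypothetical sizes, chosen only for illustration
N    = 5000;               % dataset size reported later in the thread
Ntrn = 1250;               % training observations (assumed)
Nval = 1250;               % validation observations (assumed)
Ntst = N - Ntrn - Nval;    % remaining observations go to testing

idx = randperm(N);         % random ordering of the observation indices

trnIdx = idx(1:Ntrn);                  % indices of the training subset
valIdx = idx(Ntrn+1 : Ntrn+Nval);      % indices of the validation subset
tstIdx = idx(Ntrn+Nval+1 : end);       % indices of the test subset
```

Working with explicit counts rather than fractions keeps the actual subset sizes visible, which is the quantity the post above says matters.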

Subject: crossvalind -- size of training/testing set?

From: Johan Carlson

Date: 2 Feb, 2009 19:58:02

Message: 4 of 11

Greg is right! It's the size of the subset, but also to some extent how many of them you can form that matters. The reference below gives some interesting insight, although it's a bit "stiff".

/JC


J. Shao, "Linear Model Selection by Cross-Validation," J. Am. Stat. Assoc., vol. 88, no. 422, pp. 486-494, June 1993.

Subject: crossvalind -- size of training/testing set?

From: Sophia

Date: 2 Feb, 2009 20:43:01

Message: 5 of 11

Thanks for your responses. I can give you the dataset size, but a bigger question that I have in this context is -- why does it work fine with the same dataset size using "[train, test] = crossvalind('holdOut', groups);", while explicitly specifying the training set size seems to require a whole lot more memory?

I will double check the documentation, but I couldn't seem to find any info regarding which data subdivision P corresponds to -- is P the proportion of data going to training, or to testing?

The dataset size is 5000.

Thanks,

Sophia

Subject: crossvalind -- size of training/testing set?

From: Lucio Andrade-Cetto

Date: 2 Feb, 2009 22:02:20

Message: 6 of 11

[train, test] = crossvalind('holdOut', groups, 0.25);
puts 75% into training and holds out 25%. If you omit the third input, P defaults to 0.5 and 50% is held out.
You should definitely not get an out-of-memory problem; please contact support so they can help you diagnose it.
You may also send me your variable "groups" if you want.
Lucio Cetto, TMW.
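
To make the direction of P concrete, here is a sketch that assumes the Bioinformatics Toolbox is installed and that `groups` holds one class label per observation (the labels below are made up). Because P is the held-out fraction, asking for 25% training and 75% testing means passing 0.75.

```matlab
% Hypothetical grouping variable: 5000 observations in two made-up classes
groups = [ones(2500, 1); 2 * ones(2500, 1)];

P = 0.75;                                   % fraction held out for testing
[train, test] = crossvalind('HoldOut', groups, P);

% train and test are logical index vectors over the observations
fprintf('training fraction: %.2f\n', mean(train));
fprintf('testing fraction:  %.2f\n', mean(test));
```

Printing the two fractions is a quick way to confirm which side of the split P controls.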

Subject: crossvalind -- size of training/testing set?

From: Greg Heath

Date: 3 Feb, 2009 18:11:45

Message: 7 of 11

On Feb 2, 5:02 pm, "Lucio Andrade-Cetto" <lce...@nospam.mathworks.com> wrote:
> [train, test] = crossvalind('holdOut', groups, 0.25);
> puts 75% into the training and holds 25%,

You are correct. However, for some reason, the OP wanted to hold out
7%.

The documentation says 'HoldOut'. Is the method name case-insensitive?

Hope this helps.

Greg

Subject: crossvalind -- size of training/testing set?

From: Greg Heath

Date: 3 Feb, 2009 18:25:00

Message: 8 of 11

On Feb 2, 3:43 pm, "Sophia " <scyud...@mit.edu> wrote:
> Thanks for your responses. I can give you the dataset size, but a bigger question that I have in this context is -- why does it work fine with the same dataset size using "[train, test] = crossvalind('holdOut', groups);", while explicitly specifying the training set size seems to require a whole lot more memory?

Do both 'holdOut' and 'HoldOut' work?


> I will double check the documentation, but I couldn't seem to find any info regarding which data subdivision P corresponds to -- is P the proportion of data going to training, or to testing?

When Method = 'HoldOut', P = the proportion held out.

> The dataset size is 5000.

How many classes and how many input variables? As you
can see from my previous posts

greg-heath pretraining advice
greg-heath Neq Nw

unless you are using overtraining mitigation, the minimum
size of Ntrn is determined by the number of inputs, hidden
nodes and classes.

Hope this helps.

Greg

Subject: crossvalind -- size of training/testing set?

From: Greg Heath

Date: 3 Feb, 2009 18:27:05

Message: 9 of 11

On Feb 3, 1:11 pm, Greg Heath <he...@alumni.brown.edu> wrote:
> On Feb 2, 5:02 pm, "Lucio Andrade-Cetto" <lce...@nospam.mathworks.com>
> wrote:
>
> > [train, test] = crossvalind('holdOut', groups, 0.25);
> > puts 75% into the training and holds 25%,
>
> You are correct. However, for some reason, the OP wanted to hold out
> 7%.

Sorry, ...typo.... 75%

Greg

Subject: crossvalind -- size of training/testing set?

From: Ting Su

Date: 4 Feb, 2009 05:40:55

Message: 10 of 11

Sophia,
I have a few points to make.
1. As Greg pointed out, you should use [train, test] =
crossvalind('holdOut', groups, 0.25) to get 75% for testing.

2. However, you should not get an out-of-memory error for a dataset of size
5000. Please check whether you have provided the 'groups' variable
correctly. The second input of crossvalind can be either a positive integer
specifying the number of observations, or a grouping variable (in which case
crossvalind performs stratified cross-validation or holdout). A grouping
variable specifies the class label for each observation. It can be a numeric
vector, a logical vector, a cell vector of strings, or a character matrix
with each row representing a group label.

3. You may want to try 'cvpartition' in the Statistics Toolbox to do the
holdout. It's newer than crossvalind.

Ting Su
-The Mathworks
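
A sketch of the cvpartition route from point 3, assuming the Statistics Toolbox is available. The `groups` vector is a made-up numeric grouping variable of the kind described in point 2, and 0.75 is the fraction of observations placed in the test set.

```matlab
% Hypothetical grouping variable: 5000 observations in two made-up classes
groups = [ones(2500, 1); 2 * ones(2500, 1)];

% Stratified holdout: 0.75 is the fraction of observations held out for testing
c = cvpartition(groups, 'HoldOut', 0.75);

trainIdx = training(c);    % logical vector, true for training observations
testIdx  = test(c);        % logical vector, true for test observations

disp(c.TrainSize);         % number of training observations
disp(c.TestSize);          % number of test observations
```

The TrainSize and TestSize properties report the actual subset sizes, so the split can be checked without counting the logical vectors by hand.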

Subject: crossvalind -- size of training/testing set?

From: Greg Heath

Date: 6 Feb, 2009 18:55:44

Message: 11 of 11

On Feb 4, 12:40 am, "Ting Su" <Ting...@mathworks.com> wrote:
> Sophia,
> I have a few points to make.
> 1. As greg pointed out, you should use [train, test] =
> crossvalind('holdOut', groups, 0.25) to get 75% for testing.

NO!

The third input is the fraction held out for testing, so 0.25 holds out 25% for testing; use 0.75 to get 75% for testing.

Hope this helps.

Greg
