From: <HIDDEN>
Newsgroups: comp.soft-sys.matlab
Subject: Re: crossvalind -- size of training/testing set?
Date: Mon, 2 Feb 2009 19:58:02 +0000 (UTC)
Organization: Lulea University of Technology
Lines: 68
Message-ID: <gm7j8a$a2u$1@fred.mathworks.com>
References: <gm58ci$hrf$1@fred.mathworks.com> <b70e2def-294a-4bce-adc0-cd2681cd6014@k36g2000pri.googlegroups.com>
Reply-To: <HIDDEN>


Greg Heath <heath@alumni.brown.edu> wrote in message <b70e2def-294a-4bce-adc0-cd2681cd6014@k36g2000pri.googlegroups.com>...
> On Feb 1, 5:40 pm, "Sophia Yuditskaya" <scyud...@mit.edu> wrote:
> > Hi,
> >
> > I am calling crossvalind as follows:
> >
> > [train, test] = crossvalind('holdOut', groups);
> >
> > What proportion of the original data is put into training vs testing sets? I'm assuming it's 50% each ... but instead I'd like to use 25% of the data for training and 75% for testing. How do I specify this? I've tried
> >
> > [train, test] = crossvalind('holdOut', groups, 0.25);
> >
> > but I get an OutOfMemoryError.
> >
> > Any help would be appreciated.
> 
> Carefully read
> 
> doc crossvalind
> 
> or
> 
> help crossvalind
> 
> because, according to
> 
> http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/crossvalind.html
> 
> it looks like the proper syntax is
> 
> P = 0.75
> [train, test] = crossvalind('HoldOut', groups, N, P);
> 
> If you still get OutOfMemoryError, it looks like you might have to
> reduce N.
> 
> Unfortunately you have not given us N or the number of groups.
> More unfortunately, crossvalind does not allow the specification
> of a validation set for determining training parameters.
> 
> In general, the data set should have a 3-way train/validate/test
> split. See the comp.ai.neural.net FAQ and archives. Also see many
> of my posts in both CSSM and CANN regarding how to choose Ntrn,
> Nval and Ntst. Search Google Groups with
> 
> greg-heath validation
> 
> What you will find is that the ACTUAL subset SIZES are important;
> NOT their FRACTION of the total data set.
> 
> Hope this helps.
> 
> Greg


Greg is right! What matters is the actual size of each subset and, to some extent, how many such subsets you can form. The reference at the end of this post gives some interesting insight, although it is a rather dense read.
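
On the original question: as far as I can tell from the documentation, the hold-out form is [train, test] = crossvalind('HoldOut', N, P), where N is the number of observations (or a grouping vector) and P is the fraction held out for testing, 0.5 if omitted. So P = 0.75 should give roughly 25% training and 75% testing. To illustrate the point about subset sizes and how many subsets you form, here is a minimal sketch of repeated random hold-out splits with a fixed test-set size. It uses only base MATLAB (randperm), not crossvalind, and the names and values nObs, nTest and nSplits are my own assumptions for illustration.

```matlab
% Sketch: repeated random hold-out splits with a fixed test-set SIZE.
% nObs, nTest and nSplits are illustrative assumptions, not values from the thread.
nObs    = 200;    % total number of observations (assumed)
nTest   = 150;    % observations held out for testing (75% of nObs here)
nSplits = 10;     % number of independent random splits to form

testMask = false(nObs, nSplits);      % column k marks the test set of split k
for k = 1:nSplits
    idx = randperm(nObs);             % random ordering of 1..nObs
    testMask(idx(1:nTest), k) = true; % first nTest shuffled indices are held out
end
trainMask = ~testMask;                % everything not held out is used for training

% Every split has exactly nTest test and nObs - nTest training observations.
nTestPerSplit  = sum(testMask, 1);
nTrainPerSplit = sum(trainMask, 1);
```

Column k of trainMask and testMask can be used as logical indices into the data, in the same way as the vectors returned by crossvalind. Averaging the test error over the nSplits columns is one way to use several subsets of the same size.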

/JC


  author  = {J. Shao},
  title   = {Linear Model Selection by Cross-Validation},
  journal = {J. Am. Stat. Assoc.},
  year    = {1993},
  volume  = {88},
  number  = {422},
  pages   = {486--494},
  month   = {June},