Path: news.mathworks.com!not-for-mail
From: "Sophia " <scyudits@mit.edu>
Newsgroups: comp.soft-sys.matlab
Subject: Re: crossvalind -- size of training/testing set?
Date: Mon, 2 Feb 2009 20:43:01 +0000 (UTC)
Organization: The MathWorks, Inc.
Lines: 80
Message-ID: <gm7lsl$di9$1@fred.mathworks.com>
References: <gm58ci$hrf$1@fred.mathworks.com> <b70e2def-294a-4bce-adc0-cd2681cd6014@k36g2000pri.googlegroups.com> <gm7j8a$a2u$1@fred.mathworks.com>
Reply-To: "Sophia " <scyudits@mit.edu>


Thanks for your responses. I can give you the dataset size, but my bigger question is this: why does "[train, test] = crossvalind('holdOut', groups);" work fine on this dataset, while explicitly specifying the split proportion as a third argument seems to require far more memory?
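
In the meantime, as a possible workaround, I suppose the split could be done by hand without crossvalind. A minimal sketch, assuming a plain random split is acceptable (as I understand it, crossvalind with a grouping vector takes group membership into account, and this does not); N and trainFrac are just illustrative names:

```matlab
N = 5000;            % number of observations in my dataset
trainFrac = 0.25;    % desired fraction of the data used for training

idx = randperm(N);                 % random permutation of 1..N
nTrain = round(trainFrac * N);     % size of the training set

train = false(N, 1);               % logical index vectors, like those crossvalind returns
train(idx(1:nTrain)) = true;       % mark a random quarter of the rows as training
test = ~train;                     % everything not in training goes to testing
```

This only needs a few vectors of length N, so memory should not be an issue at this size.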

I will double-check the documentation, but I couldn't find anything saying which subset P refers to: is P the proportion of the data that goes to training, or to testing?

The dataset size is 5000.
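
I suppose I could also settle the question about P empirically by counting the true entries in each index vector. A sketch, assuming the three-argument form crossvalind('HoldOut', N, P) from the doc page, and passing the number of observations instead of my groups vector:

```matlab
N = 5000;    % dataset size
P = 0.75;    % proportion passed as the third argument

% Requires the Bioinformatics Toolbox.
[train, test] = crossvalind('HoldOut', N, P);

% Whichever fraction comes out close to P is the subset that P controls.
fprintf('training fraction: %.3f\n', sum(train) / N);
fprintf('testing fraction:  %.3f\n', sum(test) / N);
```

If P is the held-out (testing) proportion, as the description of HoldOut on the doc page seems to suggest, the testing fraction should come out near 0.75.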

Thanks,

Sophia


"Johan Carlson" <Johan.E.Carlson@gmail.com> wrote in message <gm7j8a$a2u$1@fred.mathworks.com>...
> Greg Heath <heath@alumni.brown.edu> wrote in message <b70e2def-294a-4bce-adc0-cd2681cd6014@k36g2000pri.googlegroups.com>...
> > On Feb 1, 5:40 pm, "Sophia Yuditskaya" <scyud...@mit.edu> wrote:
> > > Hi,
> > >
> > > I am calling crossvalind as follows:
> > >
> > > [train, test] = crossvalind('holdOut', groups);
> > >
> > > What proportion of the original data is put into training vs testing sets? I'm assuming it's 50% each ... but instead I'd like to use 25% of the data for training and 75% for testing. How do I specify this? I've tried
> > >
> > > [train, test] = crossvalind('holdOut', groups, 0.25);
> > >
> > > but I get an OutOfMemoryError.
> > >
> > > Any help would be appreciated.
> > 
> > Carefully read
> > 
> > doc crossvalind
> > 
> > or
> > 
> > help crossvalind
> > 
> > because, according to
> > 
> > http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/crossvalind.html
> > 
> > it looks like the proper syntax is
> > 
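> > P = 0.75;  % proportion of the data held out for testing
> > [train, test] = crossvalind('HoldOut', groups, P);  % groups takes the place of N, the number of observations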
> > 
> > If you still get OutOfMemoryError, it looks like you might have to
> > reduce N.
> > 
> > Unfortunately you have not given us N or the number of groups.
> > More unfortunately, crossvalind does not allow the specification
> > of a validation set for determining training parameters.
> > 
> > In general, the data set should have a 3-way train/validate/test
> > split. See the comp.ai.neural.net FAQ and archives. Also see many
> > of my posts in both CSSM and CANN regarding how to choose Ntrn,
> > Nval and Ntst. Search Google Groups with
> > 
> > greg-heath validation
> > 
> > What you will find is that the ACTUAL subset SIZES are important;
> > NOT their FRACTION of the total data set.
> > 
> > Hope this helps.
> > 
> > Greg
> 
> 
> Greg is right! What matters is the size of each subset and, to some extent, how many such subsets you can form. The reference below gives some interesting insight, although it's a bit "stiff".
> 
> /JC
> 
> 
> author  = {J. Shao},
> title   = {Linear Model Selection by Cross-Validation},
> journal = {J. Am. Stat. Assoc.},
> year    = {1993},
> volume  = {88},
> number  = {422},
> pages   = {486--494},
> month   = {June},