Path: news.mathworks.com!not-for-mail
From: "Ting Su" <Ting.Su@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: crossvalind -- size of training/testing set?
Date: Wed, 4 Feb 2009 00:40:55 -0500
Organization: The MathWorks, Inc.
Lines: 117
Message-ID: <gmb9pb$e93$1@fred.mathworks.com>
References: <gm58ci$hrf$1@fred.mathworks.com> <b70e2def-294a-4bce-adc0-cd2681cd6014@k36g2000pri.googlegroups.com> <gm7j8a$a2u$1@fred.mathworks.com> <gm7lsl$di9$1@fred.mathworks.com>
Reply-To: "Ting Su" <Ting.Su@mathworks.com>
NNTP-Posting-Host: sut.dhcp.mathworks.com
X-Trace: fred.mathworks.com 1233726059 14627 172.31.57.42 (4 Feb 2009 05:40:59 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Wed, 4 Feb 2009 05:40:59 +0000 (UTC)
Xref: news.mathworks.com comp.soft-sys.matlab:515882


Sophia,
I have a few points to make.
1. As Greg pointed out, P should be 0.75: P is the proportion of the data 
held out for testing. Use [train, test] = 
crossvalind('holdOut', groups, 0.75) to get 75% for testing.

2. However, you should not get an out-of-memory error for a dataset of size 
5000. Please check whether you have provided the 'groups' variable 
correctly. The 2nd input of crossvalind can be either a positive integer 
specifying the number of observations, or a grouping variable (in this case, 
crossvalind performs stratified cross-validation or holdout). A grouping 
variable specifies the class label for each observation. It can be a numeric 
vector, a logical vector, a cell vector of strings, or a character matrix 
with each row representing a group label.

3. You may want to try 'cvpartition' in the Statistics Toolbox to do the 
holdout. It's newer than crossvalind.
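
To make points 1-3 concrete, here is a rough sketch of both calls side by 
side. The 'groups' vector below is made up purely for illustration (5000 
observations in two classes); substitute your own labels. The crossvalind 
outputs are named trainCV/testCV only so that they do not shadow the 
test() method of cvpartition. Both functions pick the subsets at random, so 
treat the sizes as approximately 25%/75% rather than exact.

```matlab
% Hypothetical labels: 5000 observations, two classes (illustration only).
groups = [ones(3000, 1); 2 * ones(2000, 1)];

% Sanity check for point 2: one label per observation.
assert(isvector(groups) && numel(groups) == 5000);

% crossvalind (Bioinformatics Toolbox): the 3rd input is the proportion
% held out for testing, so 0.75 leaves roughly 25% for training.
[trainCV, testCV] = crossvalind('HoldOut', groups, 0.75);
fprintf('crossvalind: %d train, %d test\n', sum(trainCV), sum(testCV));

% cvpartition (Statistics Toolbox): stratified holdout with the same split.
c = cvpartition(groups, 'HoldOut', 0.75);
trainIdx = training(c);   % logical index of the training observations
testIdx  = test(c);       % logical index of the test observations
fprintf('cvpartition: %d train, %d test\n', sum(trainIdx), sum(testIdx));
```

If the size check fails or 'groups' turns out not to be a vector of 
labels, that is the first thing I would look at for the memory error.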

Ting Su
-The Mathworks

"Sophia " <scyudits@mit.edu> wrote in message 
news:gm7lsl$di9$1@fred.mathworks.com...
> Thanks for your responses. I can give you the dataset size, but a bigger 
> question that I have in this context is -- why does it work fine with the 
> same dataset size using "[train, test] = crossvalind('holdOut', groups);", 
> while explicitly specifying the training set size seems to require a whole 
> lot more memory?
>
> I will double check the documentation, but I couldn't seem to find any 
> info regarding which data subdivision P corresponds to -- is P the 
> proportion of data going to training, or to testing?
>
> The dataset size is 5000.
>
> Thanks,
>
> Sophia
>
>
> "Johan Carlson" <Johan.E.Carlson@gmail.com> wrote in message 
> <gm7j8a$a2u$1@fred.mathworks.com>...
>> Greg Heath <heath@alumni.brown.edu> wrote in message 
>> <b70e2def-294a-4bce-adc0-cd2681cd6014@k36g2000pri.googlegroups.com>...
>> > On Feb 1, 5:40 pm, "Sophia Yuditskaya" <scyud...@mit.edu> wrote:
>> > > Hi,
>> > >
>> > > I am calling crossvalind as follows:
>> > >
>> > > [train, test] = crossvalind('holdOut', groups);
>> > >
>> > > What proportion of the original data is put into training vs testing 
>> > > sets? I'm assuming it's 50% each ... but instead I'd like to use 25% 
>> > > of the data for training and 75% for testing. How do I specify this? 
>> > > I've tried
>> > >
>> > > [train, test] = crossvalind('holdOut', groups, 0.25);
>> > >
>> > > but I get an OutOfMemoryError.
>> > >
>> > > Any help would be appreciated.
>> >
>> > Carefully read
>> >
>> > doc crossvalind
>> >
>> > or
>> >
>> > help crossvalind
>> >
>> > because, according to
>> >
>> > http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/crossvalind.html
>> >
>> > it looks like the proper syntax is
>> >
>> > P = 0.75
>> > [train, test] = crossvalind('HoldOut', groups, N, P);
>> >
>> > If you still get OutOfMemoryError, it looks like you might have to
>> > reduce N.
>> >
>> > Unfortunately you have not given us N or the number of groups.
>> > More unfortunately, crossvalind does not allow the specification
>> > of a validation set for determining training parameters.
>> >
>> > In general, the data set should have a 3-way train/validate/test
>> > split. See the comp.ai.neural.net FAQ and archives. Also see many
>> > of my posts in both CSSM and CANN regarding how to choose Ntrn,
>> > Nval and Ntst. Search Google Groups with
>> >
>> > greg-heath validation
>> >
>> > What you will find is that the ACTUAL subset SIZES are important;
>> > NOT their FRACTION of the total data set.
>> >
>> > Hope this helps.
>> >
>> > Greg
>>
>>
>> Greg is right! It's the size of the subset, but also to some extent how 
>> many of them you can form that matters. The reference below gives some 
>> interesting insight, although it's a bit "stiff".
>>
>> /JC
>>
>>
>> AUTHOR =       {J. Shao},
>>   TITLE =        {Linear Model Order Selection by Cross-Validation},
>>   JOURNAL =      {J. Am. Stat. Assoc.},
>>   YEAR =         {1993},
>>   volume =       {88},
>>   number =       {422},
>>   pages =        {486--494},
>>   month =        {June},