From: Greg Heath <heath@alumni.brown.edu>
Newsgroups: comp.soft-sys.matlab
Subject: Re: small data set
Date: Wed, 7 May 2008 09:08:26 -0700 (PDT)
Organization: http://groups.google.com
Lines: 120
Message-ID: <7b9a4920-fe33-4218-82b2-ff043c46c40c@59g2000hsb.googlegroups.com>
References: <fvc63i$qfc$1@fred.mathworks.com> <afd7b073-23ba-496a-865f-b5655a22c64e@f36g2000hsa.googlegroups.com> 


-----SNIP
> > > > What are the dimensions of your input and output?
> > > > Exactly what type of problem do you have and what
> > > > exactly do you want the neural net to do?
>
-----SNIP
> > > I have data from 25 people. 20 of them have lung cancer and
> > > 5 don't. I have 6 different characteristic for each person.
> > > (so the array is 25X6)
>
> > > the tasks are:to produce two classifiers
> > > 1st: to classify between a constant value - 2 outputs)
> > > 2nd: to classify the stage of cancer 0,1,2,3 or 4 so - 5
> > > outputs)
>
> > > I tried to use SVM, Linear regresion, Backpropagation and
> > > RBF Neural Nets and KNN.
>
> > > I tried to reshuffle my data using Leave One Out Cross
> > > Validation (LOOCV) so keeping each time one for testing and
> > > 24 for training.
>
> > > hope I gave you the picture..?
>
> > What kind of error rates are you getting for each method?
> > What are the largest error rates that you would accept?
>
> > When you plot the desired {0,1} classification vs each
> > of the inputs does there appear to be predictive capability?
> > What are the corresponding correlation coefficients?
>
> the best results I can get till now are:

I assume this is the first classifier:
{0,1} => {no cancer, cancer}

> using 1st: 3 of the 5 characteristics

I thought there were 6.
Did using more inputs decrease the performance?

>       2nd: 2-fold cross validation (using all the
>            combinations )

What does "using all the combinations" mean?

2-fold XVAL ==> 13 training and 12 testing, then switch.

I thought you originally said you were using LOO.

> and at the end getting the average
>            error rate)
>       3rd: KNN classification giving 75% correct
>            cp.Correct.rate

How depressing.

I just got 80% on your data without using the
computer.

What does the cp in cp.Correct.rate mean?

What are the class-conditional error rates, i.e.,
the separate error rates for the 20 with cancer
and the 5 without?
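
A minimal Python sketch of why the overall rate alone says little. It
assumes only the 20/5 class sizes from the thread; which class is coded 1
is an assumption, and the "classifier" is simply the no-computer baseline
of always predicting the larger class.

```python
# Class sizes taken from the thread: 20 in one class, 5 in the other.
# Coding the larger class as 1 is an assumption made for illustration.
targets = [1] * 20 + [0] * 5

# "No computer" baseline: always predict the majority class.
predictions = [1] * len(targets)

def rate_correct(targets, predictions, cls=None):
    """Fraction correct overall, or within class `cls` if given."""
    pairs = [(t, p) for t, p in zip(targets, predictions)
             if cls is None or t == cls]
    return sum(t == p for t, p in pairs) / len(pairs)

overall = rate_correct(targets, predictions)          # 20 of 25
majority = rate_correct(targets, predictions, cls=1)  # 20 of 20
minority = rate_correct(targets, predictions, cls=0)  # 0 of 5

print(f"overall  {overall:.2f}")
print(f"majority {majority:.2f}")
print(f"minority {minority:.2f}")
```

The overall figure looks respectable only because the smaller class is
never predicted at all.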

> and RBF neural network giving 68%
>            cp.Correct.rate

What about the MLP using NEWFF?

How are you compensating for the 4:1 imbalance?

> this error can be acceptable but because of the small data
> set

No, it is not.

I would consider less than 80% correct on the 5-member minority class
completely unacceptable!

> i have available i am not confident if these results can
> be assumed reliable

Bootstrapping and many trials of 10-fold XVAL will yield enough
replications so that you can estimate confidence levels.
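
A rough Python sketch of the repeated-XVAL idea, not a definitive recipe.
The names kfold_indices, repeated_kfold and the fit_predict callback are
hypothetical, and the labels and the majority-vote classifier are
placeholders for the real data and the real model.

```python
import random
import statistics

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_kfold(targets, fit_predict, k=10, trials=100, seed=0):
    """Return one overall correct rate per trial of k-fold XVAL.

    fit_predict(train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns predicted labels for test_idx.
    """
    rng = random.Random(seed)
    n = len(targets)
    rates = []
    for _ in range(trials):
        correct = 0
        for test_idx in kfold_indices(n, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            preds = fit_predict(train_idx, test_idx)
            correct += sum(p == targets[i] for p, i in zip(preds, test_idx))
        rates.append(correct / n)
    return rates

# Placeholder labels and a placeholder classifier (majority vote of the
# training labels); substitute the real data and the real model.
targets = [1] * 20 + [0] * 5

def majority_vote(train_idx, test_idx):
    ones = sum(targets[i] for i in train_idx)
    label = 1 if 2 * ones >= len(train_idx) else 0
    return [label] * len(test_idx)

rates = repeated_kfold(targets, majority_vote, k=10, trials=100)
print(statistics.mean(rates), statistics.stdev(rates))
```

Each trial reshuffles the data, so the list of rates gives a sample whose
spread can be used for confidence estimates. The placeholder classifier is
too trivial to show any spread; a real model will.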

> and if the method of reshuffling the data is acceptable.

What reshuffling? You said 2-fold XVAL ... how many trials?

> -can you please explain me which plot you ask to do?

You need to look at, at least, the 6*5/2 = 15 color-coded pairwise
2-D projections of the 6-D data to see how the 5 differ from
the 20. Also, looking at dominant PCA projections "may" help.
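
The 15 pairs can be enumerated as follows (a short Python sketch; the
plotting itself is left as a comment because it depends on the tool used).

```python
from itertools import combinations

n_inputs = 6  # six characteristics per person, per the thread

# Every unordered pair of input columns gives one 2-D scatter plot.
pairs = list(combinations(range(n_inputs), 2))

print(len(pairs))  # number of pairwise projections, 6*5/2
for i, j in pairs:
    # Plot column i against column j here, coloring points by class.
    print(f"input {i + 1} vs input {j + 1}")
```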

For the RBF and MLP you need to either create 15 extra copies
of the 5 (3 duplicates of each, so both classes have 20) or use
some other method to balance 0/1 training.
Search the archive of comp.ai.neural-nets and CSSM using

greg-heath unbalanced

sorting by date will help separate the earlier useful posts from
the later referrals.
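
A minimal Python sketch of the duplication approach, assuming two classes
with 20 and 5 members. The function name balance_by_duplication is
hypothetical, and the rows are placeholder indices standing in for the
25x6 data.

```python
def balance_by_duplication(rows, labels):
    """Repeat minority-class rows until both classes are the same size.

    Assumes two classes and that the majority size is a whole multiple
    of the minority size (20 and 5 here); otherwise the classes end up
    only approximately balanced.
    """
    classes = sorted(set(labels))
    groups = {c: [r for r, l in zip(rows, labels) if l == c] for c in classes}
    minority, majority = sorted(classes, key=lambda c: len(groups[c]))
    copies = len(groups[majority]) // len(groups[minority])
    out_rows = groups[majority] + groups[minority] * copies
    out_labels = ([majority] * len(groups[majority])
                  + [minority] * (len(groups[minority]) * copies))
    return out_rows, out_labels

# Placeholder rows (just person indices) standing in for the 25x6 data.
rows = list(range(25))
labels = [1] * 20 + [0] * 5

bal_rows, bal_labels = balance_by_duplication(rows, labels)
extra = len(bal_rows) - len(rows)
print(extra, bal_labels.count(0), bal_labels.count(1))
```

When this is combined with cross-validation, duplicate inside each
training fold only. Copying before the split puts copies of a test case
into the training set.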

If you get 0.5+eps on the 20 cancer cases (target 1) and 0.5-eps on
the 5 non-cancer cases (target 0), thresholding at 0.5 gives a 100%
correct classification rate with a mean-square error of about 0.25.
If you change the sign of eps you will get essentially the same MSE
with a 0% correct classification rate.
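
A small Python check of this point. The coding (target 1 for the 20,
target 0 for the 5), the 0.5 threshold and the value of eps are all
assumptions chosen for illustration.

```python
def mse_and_rate(targets, outputs, threshold=0.5):
    """Mean-square error and fraction correct after thresholding."""
    n = len(targets)
    mse = sum((t - y) ** 2 for t, y in zip(targets, outputs)) / n
    correct = sum((y > threshold) == (t == 1) for t, y in zip(targets, outputs))
    return mse, correct / n

eps = 0.01                    # illustrative value
targets = [1] * 20 + [0] * 5  # assumed coding: the 20 are target 1

# Outputs just on the right side of 0.5 for every case.
right = [0.5 + eps if t == 1 else 0.5 - eps for t in targets]
# Same magnitudes with the sign of eps flipped: just on the wrong side.
wrong = [0.5 - eps if t == 1 else 0.5 + eps for t in targets]

mse_right, rate_right = mse_and_rate(targets, right)
mse_wrong, rate_wrong = mse_and_rate(targets, wrong)
print(mse_right, rate_right)
print(mse_wrong, rate_wrong)
```

Both MSEs sit close to 0.25 while the classification rates are at opposite
extremes, so MSE by itself cannot rank the two nets.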

Nevertheless, it is useful to record the overall and class-conditional
MSEs alongside the overall and class-conditional classification rates.
The class-conditional figures are the more important ones!

Hope this helps.

Greg