Thread Subject: small data set

Subject: small data set

From: giannis

Date: 01 May, 2008 10:30:10

Message: 1 of 13

Hello.

I am doing statistical research using KNN, neural nets and
SVM. The problem is the very small data set (25 specimens).

I am using cross-validation to resample the data, but I am
not sure whether my results can be accurate with such a small
data set.

Can you please suggest a method that makes the best possible
use of such a small data set?
Thank you in advance.

Subject: Re: small data set

From: Greg Heath

Date: 01 May, 2008 11:22:44

Message: 2 of 13

On May 1, 6:30 am, "giannis " <fanzi...@yahoo.co.uk> wrote:
> Hello.
>
> I am doing a statistical research using KNN, neural nets and
> SVM.. The problem is the very small data set (25 speciments).
>
> I am using cross validation to resample the data but I am
> not sure if my results can be accurate with such a small
> data set.
>
> can you please suggest any method to use as best as possible
> such a small data set?
> thank you in advance

Bootstrapping

Search the mathworks website.

Hope this helps.

Greg
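
As a rough illustration of the bootstrapping suggested above, here is a minimal Python sketch using only the standard library. The 25 values are made up for illustration and the function name bootstrap_ci is hypothetical. It resamples the observations with replacement and reports a percentile interval for the mean; it is one simple variant, not necessarily the one any particular toolbox implements.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate resamples n observations with replacement.
    reps = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical measurements for 25 specimens (illustrative values only).
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 6.2,
          3.9, 4.8, 5.5, 5.2, 4.6, 6.1, 4.3, 5.7, 4.9, 5.4,
          4.0, 5.8, 4.5, 5.6, 5.0]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, "
      f"95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

The interval describes how uncertain the estimate is; it does not add information that the 25 observations do not already contain.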

Subject: Re: small data set

From: John D'Errico

Date: 01 May, 2008 11:50:05

Message: 3 of 13

"giannis " <fanzio12@yahoo.co.uk> wrote in message
<fvc63i$qfc$1@fred.mathworks.com>...
> Hello.
>
> I am doing a statistical research using KNN, neural nets and
> SVM.. The problem is the very small data set (25 speciments).
>
> I am using cross validation to resample the data but I am
> not sure if my results can be accurate with such a small
> data set.
>
> can you please suggest any method to use as best as possible
> such a small data set?
> thank you in advance

The fact is, you only have 25 data points.

No matter how hard you squeeze that rock,
the only blood you will get from the rock
will be that amount you leave behind from
your own hands.

Only pharmaceutical companies know the
secret methodologies used to manufacture
information where none actually exists.

John

Subject: Re: small data set

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 01 May, 2008 17:49:42

Message: 4 of 13

In article <fvcapd$oc8$1@fred.mathworks.com>,
John D'Errico <woodchips@rochester.rr.com> wrote:

>The fact is, you only have 25 data points.

>No matter how hard you squeeze that rock,
>the only blood you will get from the rock
>will be that amount you leave behind from
>your own hands.

>Only pharmaceutical companies know the
>secret methodologies used to manufacture
>information where none actually exists.

Aye. We're getting amazingly good here at manufacturing new
-data- from old, but manufacturing new -information- is still
eluding us.

Though much more often, the problem here is in manufacturing useful
information from *too much* data.
--
  "If there were no falsehood in the world, there would be no
  doubt; if there were no doubt, there would be no inquiry; if no
  inquiry, no wisdom, no knowledge, no genius."
                                              -- Walter Savage Landor

Subject: Re: small data set

From: giannis

Date: 03 May, 2008 18:25:04

Message: 5 of 13

Hello,

Thank you for your reply.
The fact is that I am working on a medical application, so my data
are medical in nature.
It would be very interesting if I could produce "new" data
from the old ones and test the results.
It would be a huge help if you could help me with this in
any way.

regards

 



Subject: Re: small data set

From: carlos lopez

Date: 04 May, 2008 20:30:22

Message: 6 of 13

I agree with Mr. Greg Heath; your last resort is
bootstrapping. That might increase your confidence in the
statistical result you have, i.e. let you "fully" trust the
standard deviation estimate or the like... but if the highly
believable estimate itself is not good enough for you, there is no
further solution!
So all of the comments have value; if (for example) the
standard deviation estimate is very precise but too
large for your needs, you will need extra data. There is no way to
avoid that... or at least none that I am aware of!
Regards
Carlos

Subject: Re: small data set

From: Greg Heath

Date: 05 May, 2008 10:58:03

Message: 7 of 13

On May 1, 7:22 am, Greg Heath <he...@alumni.brown.edu> wrote:
> On May 1, 6:30 am, "giannis " <fanzi...@yahoo.co.uk> wrote:
>
> > Hello.
>
> > I am doing a statistical research using KNN, neural nets and
> > SVM.. The problem is the very small data set (25 speciments).
>
> > I am using cross validation to resample the data but I am
> > not sure if my results can be accurate with such a small
> > data set.
>
> > can you please suggest any method to use as best as possible
> > such a small data set?
> > thank you in advance
>
> Bootstrapping
>
> Search the mathworks website.

If you have prior information on the form of the probability
distribution function, you can use the 25 observations to
estimate the parameters and then generate more "data".
The danger is that, even in one dimension, 25 observations
will not give you precise parameter estimates.
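
To make the point about imprecise parameter estimates concrete, here is a minimal Python sketch under a loudly stated assumption: that the data are one-dimensional and normally distributed. The function name and the 25 observations are hypothetical. It fits the two parameters, generates extra "data", and reports the standard error of the estimated mean, which shrinks only as 1/sqrt(n).

```python
import math
import random
import statistics

def fit_normal_and_generate(data, n_new, seed=0):
    """Fit a normal distribution to data and draw n_new synthetic values."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)   # sample standard deviation (n - 1)
    rng = random.Random(seed)
    synthetic = [rng.gauss(mu, sigma) for _ in range(n_new)]
    # Standard error of the estimated mean: sigma / sqrt(n).
    se_mean = sigma / math.sqrt(len(data))
    return mu, sigma, se_mean, synthetic

# Hypothetical 1-D observations (assumed normal, for illustration only).
rng = random.Random(1)
observed = [rng.gauss(10.0, 2.0) for _ in range(25)]
mu, sigma, se_mean, synthetic = fit_normal_and_generate(observed, 100)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, "
      f"standard error of mu = {se_mean:.2f}")
```

The 100 synthetic values depend only on the two fitted numbers, so any error in mu or sigma is carried into all of them.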

If you don't have such prior information you can test
hypotheses as to which distribution the data might be
from. However, with only 25 observations the testing will
be far from definitive. You may test several distributions
and find that you can reject all except one. However, that does
not guarantee that the remaining one is the correct distribution.

...suddenly I have the feeling that the data is not
1-dimensional!

What are the dimensions of your input and output?
Exactly what type of problem do you have and what
exactly do you want the neural net to do?

Hope this helps.

Greg

Subject: Re: small data set

From: giannis

Date: 05 May, 2008 21:59:04

Message: 8 of 13

Hello Greg,

thank you for all your help.

I have data from 25 people. 20 of them have lung cancer and
5 don't. I have 6 different characteristics for each person
(so the array is 25x6).

The tasks are to produce two classifiers:
1st: to classify between a constant value (2 outputs)
2nd: to classify the stage of cancer, 0, 1, 2, 3 or 4 (5
outputs)

I tried SVM, linear regression, backpropagation and
RBF neural nets, and KNN.

I tried to reshuffle my data using leave-one-out cross-validation
(LOOCV), each time keeping one case for testing and
24 for training.

I hope I gave you the picture?
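
The leave-one-out procedure described above can be sketched as follows. This is a minimal Python illustration, not the code actually used in the project: the data are a tiny made-up 2-D set, and a 1-nearest-neighbour rule stands in for the KNN classifier.

```python
import math

def loocv_accuracy(X, y, predict):
    """Leave-one-out CV: hold out each case once, train on the rest."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        if predict(X_train, y_train, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

def nearest_neighbour(X_train, y_train, x):
    """1-NN with Euclidean distance (a stand-in for the KNN classifier)."""
    best = min(range(len(X_train)), key=lambda j: math.dist(X_train[j], x))
    return y_train[best]

# Tiny made-up data set: two well-separated groups in 2-D.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.2)]
y = [0, 0, 0, 1, 1, 1]
acc = loocv_accuracy(X, y, nearest_neighbour)
print(acc)
```

With 25 cases the loop would run 25 times, each time training on 24 and testing on the one left out.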
 



Subject: Re: small data set

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 06 May, 2008 05:22:12

Message: 9 of 13

In article <fvnvv8$jgg$1@fred.mathworks.com>,
giannis <fanzio12@yahoo.co.uk> wrote:

>I have data from 25 people. 20 of them have lung cancer and
>5 don't. I have 6 different characteristic for each person.
>(so the array is 25X6)

>the tasks are:to produce two classifiers
>1st: to classify between a constant value - 2 outputs)
>2nd: to classify the stage of cancer 0,1,2,3 or 4 so - 5
>outputs)

Is this a scientific investigation, or a class exercise of some
sort? If it is a scientific investigation, then it is the sort
of thing that my group does routinely and we may be able to help you.

--
  "The whole history of civilization is strewn with creeds and
  institutions which were invaluable at first, and deadly
  afterwards." -- Walter Bagehot

Subject: Re: small data set

From: Greg Heath

Date: 06 May, 2008 07:43:15

Message: 10 of 13

Corrected for the heinous sin of top-posting.

On May 5, 5:59 pm, "giannis " <fanzi...@yahoo.co.uk> wrote:
>
> Hello Greg,
>
> thank you for all your help.
>
> I have data from 25 people. 20 of them have lung cancer and
> 5 don't. I have 6 different characteristic for each person.
> (so the array is 25X6)
>
> the tasks are:to produce two classifiers
> 1st: to classify between a constant value - 2 outputs)
> 2nd: to classify the stage of cancer 0,1,2,3 or 4 so - 5
> outputs)
>
> I tried to use SVM, Linear regresion, Backpropagation and
> RBF Neural Nets and KNN.
>
> I tried to reshuffle my data using Leave One Out Cross
> Validation (LOOCV) so keeping each time one for testing and
> 24 for training.
>
> hope I gave you the picture..?

What kind of error rates are you getting for each method?
What are the largest error rates that you would accept?

When you plot the desired {0,1} classification vs each
of the inputs does there appear to be predictive capability?
What are the corresponding correlation coefficients?
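
A minimal Python sketch of the correlation check asked about above: the Pearson correlation between the {0,1} class label and one input at a time. The labels and feature values are invented for illustration; with real data there would be one coefficient per input column.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up example: 8 cases, a 0/1 label and two candidate inputs.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
feature_a = [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1]  # tracks the label
feature_b = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]  # unrelated to it

r_a = pearson(feature_a, labels)
r_b = pearson(feature_b, labels)
print(f"r(feature_a, label) = {r_a:.3f}")
print(f"r(feature_b, label) = {r_b:.3f}")
```

A coefficient near zero suggests that the input, taken alone and linearly, says little about the class; it does not rule out nonlinear or joint effects.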

Hope this helps.

Greg

Subject: Re: small data set

From: giannis

Date: 07 May, 2008 08:04:03

Message: 11 of 13

Hello,

thank you for your interest,
this is part of my master's project, and I will be grateful if
you can help me in any way.
thank you
giannis


Subject: Re: small data set

From: giannis

Date: 07 May, 2008 08:25:06

Message: 12 of 13




hello Greg,

the best results I can get till now are:

using 1st: 3 of the 5 characteristics
      2nd: 2-fold cross validation (using all the
           combinations and at the end getting the average
           error rate)
      3rd: KNN classification giving 75% correct
           cp.Correct.rate and RBF neural network giving 68%
           cp.Correct.rate

This error may be acceptable, but because of the small data
set I have available I am not confident that these results can
be assumed reliable or that the method of reshuffling the
data is acceptable.

Can you please explain which plot you are asking me to make?

thank you

giannis

Subject: Re: small data set

From: Greg Heath

Date: 07 May, 2008 16:08:26

Message: 13 of 13

-----SNIP
> > > > What are the dimensions of your input and output?
> > > > Exactly what type of problem do you have and what
> > > > exactly do you want the neural net to do?
>
-----SNIP
> > > I have data from 25 people. 20 of them have lung cancer and
> > > 5 don't. I have 6 different characteristic for each person.
> > > (so the array is 25X6)
>
> > > the tasks are:to produce two classifiers
> > > 1st: to classify between a constant value - 2 outputs)
> > > 2nd: to classify the stage of cancer 0,1,2,3 or 4 so - 5
> > > outputs)
>
> > > I tried to use SVM, Linear regresion, Backpropagation and
> > > RBF Neural Nets and KNN.
>
> > > I tried to reshuffle my data using Leave One Out Cross
> > > Validation (LOOCV) so keeping each time one for testing and
> > > 24 for training.
>
> > > hope I gave you the picture..?
>
> > What kind of error rates are you getting for each method?
> > What are the largest error rates that you would accept?
>
> > When you plot the desired {0,1} classification vs each
> > of the inputs does there appear to be predictive capability?
> > What are the corresponding correlation coefficients?
>
> the best results I can get till now are:

I assume this is the first classifier:
{0,1} => {no cancer, cancer}

> using 1st: 3 of the 5 characteristics

I thought there were 6.
Did using more inputs decrease the performance?

> 2nd: 2-fold cross validation (using all the
> combinations )

What does "using all the combinations" mean?

2-fold XVAL ==> 13 training and 12 testing, then switch.

I thought you originally said you were using LOO.

> and at the end getting the average
> error rate)
> 3rd: KNN classification giving 75% correct
> cp.Correct.rate

How depressing.

I just got 80% on your data without using the
computer.

what does the cp in cp.Correct.rate mean?

What are the class conditional error rates, i.e.,
What are the separate error rates for the 20 negatives
and 5 positives?
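
One way to reach 80% without a computer, presumably the one meant here, is to assign every case to the larger class: 20 of 25 is 80%. The Python sketch below shows why the class-conditional rates matter. The helper name is hypothetical, and the class names "majority" and "minority" are neutral placeholders for the groups of 20 and 5.

```python
def class_conditional_rates(y_true, y_pred):
    """Return overall accuracy and a dict of per-class accuracies."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return overall, per_class

# 20 cases in the larger class, 5 in the smaller one.
y_true = ["majority"] * 20 + ["minority"] * 5
# A "classifier" that ignores the inputs and always answers "majority".
y_pred = ["majority"] * 25

overall, per_class = class_conditional_rates(y_true, y_pred)
print(overall)     # 0.8
print(per_class)   # {'majority': 1.0, 'minority': 0.0}
```

The overall rate looks respectable while one class is never recognized, which is why a 75% overall rate on this data is worse than the trivial rule.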

> and RBF neural network giving 68%
> cp.Correct.rate

What about the MLP using NEWFF?

How are you compensating for the 4:1 imbalance?

> this error can be acceptable but because of the small data
> set

No it is not.

I would consider less than 80% for the 5 positives as completely
unacceptable!

> i have available i am not confident if these results can
> be assumed reliable

Bootstrapping and many trials of 10-fold XVAL will yield enough
replications so that you can estimate confidence levels.
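
A minimal Python sketch of the "many trials of 10-fold XVAL" idea: the index bookkeeping only, with the classifier left out. The function name and the choice of 20 trials are assumptions for illustration.

```python
import random

def repeated_kfold(n, k=10, trials=20, seed=0):
    """Yield (trial, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for trial in range(trials):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition per trial
        folds = [idx[f::k] for f in range(k)]  # fold sizes differ by at most 1
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            yield trial, f, train_idx, test_idx

splits = list(repeated_kfold(25))
print(len(splits))  # 20 trials x 10 folds = 200
```

Averaging the error within each trial gives 20 replications of the cross-validated error rate, from which a spread can be estimated. With only 25 cases each test fold holds 2 or 3 cases, so the individual fold errors are very coarse.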

> and if the method of reshuffling the data is acceptable.

 What reshuffling? You said 2-fold XVAL ... how many trials?

> -can you please explain me which plot you ask to do?

You need to look at, at least, the 6*5/2 = 15 color-coded pairwise
2-D projections of the 6-D data to see how the 5 differ from
the 20. Also, looking at the dominant PCA projections "may" help.
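
The count of pairwise projections can be checked by enumerating them. A minimal Python sketch, with the 6 inputs numbered 0 to 5 (the numbering is an assumption; each pair would be one scatter plot, colored by class):

```python
from itertools import combinations

n_features = 6
pairs = list(combinations(range(n_features), 2))
print(len(pairs))  # 15, i.e. 6*5/2
for i, j in pairs:
    print(f"plot input {i} against input {j}")
```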

For the RBF and MLP you need to either create 15 extra copies
of the 5 or use some other method to balance 0/1 training.
Search the archive of comp.ai.neural-nets and CSSM using

greg-heath unbalanced

sorting by date will help separate the earlier useful posts from
the later referrals.
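
A minimal Python sketch of balancing by replication, reading "15 extra copies of the 5" as 3 extra copies of each of the 5 cases, so that both classes end up with 20 rows. The function name, the rows and the 0/1 labels are all hypothetical.

```python
from collections import Counter

def oversample_minority(rows, labels):
    """Replicate minority-class rows to match the majority-class count.

    Assumes exactly two classes and a majority count that is a whole
    multiple of the minority count (as with 20 and 5).
    """
    counts = Counter(labels)
    (_major, n_major), (minor, n_minor) = counts.most_common(2)
    extra = n_major // n_minor - 1           # extra copies of each minority row
    minority_rows = [r for r, l in zip(rows, labels) if l == minor]
    new_rows = list(rows) + minority_rows * extra
    new_labels = list(labels) + [minor] * (len(minority_rows) * extra)
    return new_rows, new_labels

# Made-up data: 25 one-feature rows, 20 in class 1 and 5 in class 0.
rows = [[float(i)] for i in range(25)]
labels = [1] * 20 + [0] * 5
bal_rows, bal_labels = oversample_minority(rows, labels)
print(Counter(bal_labels))
```

When this is combined with cross-validation, the copies should be made inside each training fold only; otherwise copies of a test case end up in the training set.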

If you get 0.5-eps on the 20 negatives and 0.5+eps on the 5
positives (targets 0 and 1, small eps > 0) you will get a 100%
correct classification rate and a mean-square error of about 0.25.
If you change the sign of eps you will get nearly the same MSE
with a 0% correct classification rate.
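
The point that MSE and classification rate can disagree can be checked numerically. The Python sketch below fixes one convention as an assumption: 20 cases with target 0, 5 cases with target 1, and a decision threshold of 0.5. Which sign of eps counts as correct depends on how the targets are coded, so the two cases are named by the side of the threshold the outputs land on.

```python
def mse_and_accuracy(targets, outputs, threshold=0.5):
    """Mean-square error and fraction correct for 0/1 targets."""
    n = len(targets)
    mse = sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n
    acc = sum((o > threshold) == (t == 1)
              for t, o in zip(targets, outputs)) / n
    return mse, acc

eps = 0.01
targets = [0] * 20 + [1] * 5
# Outputs barely on the correct side of the threshold ...
right_side = [0.5 - eps] * 20 + [0.5 + eps] * 5
# ... and barely on the wrong side.
wrong_side = [0.5 + eps] * 20 + [0.5 - eps] * 5

mse_right, acc_right = mse_and_accuracy(targets, right_side)
mse_wrong, acc_wrong = mse_and_accuracy(targets, wrong_side)
print(f"right side: MSE = {mse_right:.4f}, accuracy = {acc_right:.0%}")
print(f"wrong side: MSE = {mse_wrong:.4f}, accuracy = {acc_wrong:.0%}")
```

Both MSE values are (0.5 -/+ eps)^2, which is close to 0.25 for small eps, while the classification rate swings from 100% to 0%.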

Nevertheless, it is useful to record the overall and class-conditional
MSEs.

Record the class-conditional as well as the overall performance.
The former are more important!

Hope this helps.

Greg
