From:  Greg Heath <heath@alumni.brown.edu>
Newsgroups: comp.soft-sys.matlab
Subject: Re: HELP!!!
Date: Mon, 12 Nov 2007 13:13:41 -0800
Organization: http://groups.google.com
Lines: 45
Message-ID: <1194902021.040892.300920@19g2000hsx.googlegroups.com>
References: <fha8l9$f1l$1@fred.mathworks.com>
In-Reply-To: <fhaf65$q6u$1@canopus.cc.umanitoba.ca>


On Nov 12, 4:05 pm, rober...@ibd.nrc-cnrc.gc.ca (Walter Roberson)
wrote:
> In article <1194899672.262350.297...@22g2000hsm.googlegroups.com>,
> Greg Heath  <he...@alumni.brown.edu> wrote:
>
> >In general, unsupervised clustering of data that mixes
> >several classes yields a suboptimal cluster-based
> >classifier.  However, cluster-based classification can be
> >improved significantly if supervised clustering, which
> >uses the class labels, is used.
>
> *If*, that is, the class labels are correct. Which turns
> out to be a problem in practice. It is unfortunately not "rare"
> for us to receive datasets in which samples have been misclassified.
>
> The "Gold Standard" is classification by a trained, experienced human
> expert, but even experts make mistakes or are misled by the data
> subset that they examine to classify by (e.g., the visual shape of a
> cell). We have found that, for some datasets, our unsupervised
> classification methods have an accuracy significantly exceeding the
> "Gold Standard".
>
> A related issue that we deal with a lot is that when the datasets
> contain large amounts of data (e.g., most any of the modern medical
> "scanners" such as CT, MRS, MRI, infra-red), humans have a lot of
> difficulty in perceiving the abstract multidimensional patterns
> needed in order to create class labels in the first place. Spectral
> noise certainly doesn't help!
>
> Supervised classification is great if you already know exactly
> what you are looking for, but it is not very good at figuring out
> new relationships. If you have your eye on peaks in the oxygen
> flow, you are likely to completely miss the much better correlation
> with (say) the calcium concentration information...
> --

That is why I have always recommended (search on greg-heath
pretraining advice) that unsupervised methods such as unsupervised
clustering and principal component analysis be used, before
supervised learning, in order to torture the data until they confess.
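
To make the pretraining idea concrete, here is a minimal sketch in
Python/NumPy (not MATLAB, and not code from this thread). Everything in
it is an assumption made for illustration: the synthetic two-class
data, the 10 deliberately flipped labels, the helper names pca_project
and kmeans, and the choice of plain k-means on the first two principal
components. It projects the data with PCA, clusters without looking at
the labels, then cross-tabulates clusters against the supplied labels
so that samples whose label disagrees with their cluster's majority
can be flagged for a second look.

```python
import numpy as np

def pca_project(X, n_components):
    """Center X and project it onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (assignments, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centers

# Synthetic stand-in data (hypothetical): two well-separated classes in 5-D.
rng = np.random.default_rng(42)
n_per_class = 100
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, 5)),
               rng.normal(6.0, 1.0, (n_per_class, 5))])
true_labels = np.repeat([0, 1], n_per_class)

# Simulate an imperfect "gold standard" by flipping 10 of the labels.
flipped = rng.choice(len(X), size=10, replace=False)
given_labels = true_labels.copy()
given_labels[flipped] ^= 1

# Unsupervised steps: PCA, then clustering. No labels are used here.
Z, explained = pca_project(X, 2)
assign, _ = kmeans(Z, k=2)

# Cross-tabulate clusters (rows) against the supplied labels (columns).
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (assign, given_labels), 1)

# Samples whose label disagrees with their cluster's majority label.
majority = table.argmax(axis=1)
suspects = np.flatnonzero(given_labels != majority[assign])

print("variance explained by first 2 PCs:", explained.round(3))
print("cluster-by-label table:")
print(table)
print("samples worth re-checking:", suspects)
```

On real data the clusters will rarely be this clean, so the flagged
samples are only candidates for review, not proven labeling errors.
The number of clusters and the number of retained components are also
choices that would have to be explored rather than fixed at 2.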

Hope this helps.

Greg