From: Peter Perkins <Peter.PerkinsRemoveThis@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: eigenvalues of the covarience matrix (princomp)
Date: Thu, 15 Nov 2007 12:55:59 -0500
Organization: The MathWorks, Inc.
Message-ID: <fhi17f$dhh$1@fred.mathworks.com>
References: <fhhkfc$9oa$1@fred.mathworks.com> <16367855.1195143889907.JavaMail.jakarta@nitrogen.mathforum.org>
In-Reply-To: <16367855.1195143889907.JavaMail.jakarta@nitrogen.mathforum.org>



yakir gagnon wrote:
>>> and princomp(zscore( X )) is a CORRECT PCA...
>> There is absolutely no point in doing this 
> 
> why? doing princomp(X) or princomp(zscore(X)) yields two different answers. and zscore(X) = zscore(zscore(X))

Yes, princomp(X) and princomp(zscore(X)) do give different results.  All 
I meant was that princomp(X./repmat(std(X),size(X,1),1)) and 
princomp(zscore(X)) will give the same results, because princomp already 
centers the data to have zero mean, and so the centering step in zscore 
is redundant.  On the other hand, since it's easier to type zscore(X) 
than X./repmat(std(X),size(X,1),1), choosing the former does no harm.
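
As a rough check of the redundancy argument, here is a minimal sketch 
that uses only cov and corrcoef rather than PRINCOMP itself.  The data 
are made up for illustration, and the scaling uses std(X), the N-1 
standard deviation that zscore divides by by default.

```matlab
% Hypothetical data: 50 observations of 3 variables on very different
% scales and with nonzero column means.
X = randn(50,3) * diag([1 10 100]) + repmat([5 -2 40], 50, 1);
n = size(X,1);

% Scale only: divide each column by its sample standard deviation.
Xs = X ./ repmat(std(X), n, 1);

% Center and scale by hand, which is what zscore(X) does by default.
Xz = (X - repmat(mean(X), n, 1)) ./ repmat(std(X), n, 1);

% The covariance matrix ignores column means, so both versions share
% it, and it equals the correlation matrix of the original data.
Cs = cov(Xs);
Cz = cov(Xz);
R  = corrcoef(X);

max(abs(Cs(:) - Cz(:)))   % zero up to rounding error
max(abs(Cs(:) - R(:)))    % zero up to rounding error
```

Since the two covariance matrices agree, their eigenvalues and 
eigenvectors agree as well, which is all the "same results" claim needs.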


>> (as opposed to what you've 
>> called "correlation PCA"), since PRINCOMP already
>> centers the data.
> 
> here you say 'centre the data' which makes me confused since I thought you were talking about the zscoring (in which case I thought it was called standardizing), but I might be wrong.

ZSCORE centers each column to have zero mean, and normalizes each column 
to have unit variance.  "Standardized" is kind of an ambiguous term; the 
best description of what ZSCORE does is its own source code, which you 
can read with "type zscore".

PRINCOMP always centers the data to have zero mean before doing 
anything.  There's limited use in doing PCA on non-centered data, 
because the first component will typically describe the mean of the 
data, and that's not what most people want out of PCA (some would argue 
with that).
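
To see what "the first component describes the mean" looks like, here is 
a small sketch on made-up data whose mean is far from the origin.  It 
takes the eigenvectors of the raw second-moment matrix, which is one way 
to do an uncentered analysis, and is not what PRINCOMP computes.

```matlab
% Hypothetical data: unit-variance noise around the point (10,10).
X = randn(200,2) + repmat([10 10], 200, 1);
n = size(X,1);

% Uncentered analysis: eigenvectors of the raw second-moment matrix.
[V, D] = eig(X' * X / (n - 1));
[dmax, k] = max(diag(D));   % index of the largest eigenvalue
v1 = V(:,k);                % leading direction, unit length

% Cosine of the angle between that direction and the mean vector.
m = mean(X)';
abs(v1' * m) / norm(m)      % close to 1 for data like this
```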


> so why would I choose to do a so called "correlation PCA"? what is it good for?

There are a lot of differing opinions on this.  My own opinion is that 
doing PCA on unstandardized variables implies that you think that the 
scales on which the different variables are measured are somehow 
"natural" and "comparable", in the sense that variation of some absolute 
magnitude in one variable is no more or less important than the same 
amount of absolute variation in another variable.  Doing PCA on 
standardized variables (scaling each column by the inverse of its sample 
std dev) implies that you think that the scales of the different 
variables are an artifact of the units in which you measured them, and 
that you need to rescale in order to make the variation in the different 
variables "comparable".  The classic example is doing PCA on things like 
body measurements.  Should your PCA results differ if you choose to 
measure weight in grams vs. stones?  Probably they shouldn't.
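
A quick sketch of the units question, on made-up height and weight data 
(the numbers and variable names are purely illustrative).  It compares 
the eigenvalues of the covariance and correlation matrices when weight 
is recorded in kilograms versus grams.

```matlab
% Hypothetical body measurements: height in cm, weight in kg.
height_cm = 170 + 10*randn(100,1);
weight_kg = 70 + 0.5*(height_cm - 170) + 8*randn(100,1);

X_kg = [height_cm, weight_kg];
X_g  = [height_cm, 1000*weight_kg];   % same data, weight in grams

% Covariance PCA: the eigenvalues depend on the units, and in grams
% the weight column dominates the total variance.
eig(cov(X_kg))
eig(cov(X_g))

% Correlation PCA: rescaling a column by a positive constant leaves
% the correlation matrix, and so its eigenvalues, unchanged.
eig(corrcoef(X_kg))
eig(corrcoef(X_g))
```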

Whether or not you center the data before doing PCA affects these 
arguments too.

I would not describe either as "correct", but would apply a method as 
appropriate to circumstances.  Again some would argue with that.

Hope this helps.

- Peter Perkins
   The MathWorks, Inc.