
Thread Subject:
PCA dimensionality reduction / component identifications

Subject: PCA dimensionality reduction / component identifications

From: Pierrre Gogin

Date: 16 Jul, 2010 16:02:06

Message: 1 of 18

Hi everybody,

I have a question regarding the reduction of dimensions with PCA (the princomp command in MATLAB). The reduction itself is not the problem, but I could not figure out how to identify which of the original features are actually important components and which are not.
Small example:
I create a feature space of 100 observations and 3 features:
vec1 = rand(1,100)*2;
vec2 = rand(1,100)*20;
vec3 = rand(1,100)*12;
A = [vec1; vec2; vec3];

Obviously the 2nd feature has a higher variance than the 3rd, etc. So from this generated data I would expect that vec2 contributes most to describing my dataset. I followed the “normal” approach to do the PCA:
[COEFF2,SCORE2, e] = princomp(A)

From the output I get the eigenvectors and eigenvalues, telling me how much each dimension of the transformed (!) feature space contributes to the representation of the dataset. What I want to know is with which percentage each of my three features (vec1, vec2, vec3) from the original (!) distribution contributes to my dataset, without prior knowledge of how they were built. From the output of MATLAB I can’t tell. Does somebody have an idea how to get this information?

Thanks in advance
Pierre

Subject: PCA dimensionality reduction / component identifications

From: Peter Perkins

Date: 16 Jul, 2010 18:10:19

Message: 2 of 18

On 7/16/2010 12:02 PM, Pierrre Gogin wrote:
> I create a feature space of 100 observations and 3 features:
> vec1 = rand(1,100)*2;
> vec2 = rand(1,100)*20;
> vec3 = rand(1,100)*12;
> A = [vec1; vec2; vec3];
>
> Obviously the 2nd feature has a higher variance than the 3rd,
> etc. So from this generated data I would expect that vec2
> contributes most to describing my dataset. I followed the
> “normal” approach to do the PCA:
> [COEFF2,SCORE2, e] = princomp(A)


Pierre, I bet this (note the transpose) will be a bit more obvious to
figure out:

 >> [COEFF2,~,e] = princomp(A')
COEFF2 =
   -0.00042265 -0.0045291 0.99999
       0.99997 0.0081763 0.00045967
    -0.0081783 0.99996 0.0045255
e =
        26.508
        10.351
       0.33826

Which is to say, the first PC [-0.00042265 0.99997 -0.0081783]' picks
out the second feature, and accounts for 26/37=71% of the total
variance, and so on. PRINCOMP, like all functions in the Statistics
Toolbox, is column-oriented.
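
A quick way to read those percentages off directly, using the e returned above (just a one-line sketch):

 >> 100*e/sum(e)   % percent of total variance along each PC: roughly 71, 28, and 1 here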

Hope this helps.

Subject: PCA dimensionality reduction / component identifications

From: Pierrre Gogin

Date: 16 Jul, 2010 20:13:21

Message: 3 of 18

Hi Peter,

thanks for your answer. Unfortunately it did not really answer my question (see below)
>
>
> Pierre, I bet this (note the transpose) will be a bit more obvious to
> figure out:
I had the transpose in my code; you forgot to copy it, so I got the same results (at least in format; with rand() they will of course be a bit different each time).

>
> Which is to say, the first PC [-0.00042265 0.99997 -0.0081783]' picks
> out the second feature, and accounts for 26/37=71% of the total
> variance, and so on.


So here's the point. The first PC accounts for 71% of the total variance. I agree, but I guess that it is the first PC of the transformed (!) feature space. The question I'm interested in is: how much of the total variance do vec1, vec2 and vec3 each account for?
PCA is a linear transformation, so going backwards should be possible, but I just don't know how.

Subject: PCA dimensionality reduction / component identifications

From: Peter Perkins

Date: 16 Jul, 2010 21:10:15

Message: 4 of 18

On 7/16/2010 4:13 PM, Pierrre Gogin wrote:
> So here's the point. The first PC accounts for 71% of the total variance.
> I agree, but I guess that it is the first PC of the transformed (!) feature
> space. The question I'm interested in is: how much of the total variance do
> vec1, vec2 and vec3 each account for? PCA is a linear transformation, so
> going backwards should be possible, but I just don't know how.

Is this what you're asking?

 >> [COEFF2,~, e] = princomp(A')
COEFF2 =
     -0.017612 -0.024113 0.99955
       0.99486 0.099277 0.019924
     -0.099713 0.99477 0.022241
e =
        33.805
        10.657
       0.33649
 >> sum(e)
ans =
        44.799
 >> S = cov(A')
S =
       0.35288 -0.61112 -0.18879
      -0.61112 33.564 -2.3009
      -0.18879 -2.3009 10.882
 >> sum(diag(S))
ans =
        44.799
 >> diag(S)/sum(diag(S))
ans =
     0.0078769
       0.74921
       0.24291
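
The same fractions come straight from the per-column variances, since diag(cov(X)) is just var(X); a one-line check on the same data, nothing more:

 >> var(A') / sum(var(A'))   % same fractions as diag(S)/sum(diag(S)) above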

Subject: PCA dimensionality reduction / component identifications

From: Pierrre Gogin

Date: 17 Jul, 2010 15:15:21

Message: 5 of 18

Hi Peter,

Thanks a lot, that looks very much like what I'm looking for. I was confused, because actually the PCA is not really necessary to get the information I'm looking for. It is sufficient to evaluate the main diagonal of the covariance matrix. The princomp command in MATLAB does not do much more than that, I guess. Only if you also compute the second output do you get the input matrix rotated by the matrix containing all the eigenvalues. Correct?


Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message <i1qhrn$kio$1@fred.mathworks.com>...
> On 7/16/2010 4:13 PM, Pierrre Gogin wrote:
> > So here's the point. The first PC accounts 71% of the total variance. I
> > agree but I guess that it is the first PC of the transformed () feature
> > space. The question I'm interested in is: How much % ver1, ver2 and ver3
> > accounts for the total variance. Pca is a linear transformation, so the
> > backwards should be possible but I just don't know how.
>
> Is this what you're asking?
>
> >> [COEFF2,~, e] = princomp(A')
> COEFF2 =
> -0.017612 -0.024113 0.99955
> 0.99486 0.099277 0.019924
> -0.099713 0.99477 0.022241
> e =
> 33.805
> 10.657
> 0.33649
> >> sum(e)
> ans =
> 44.799
> >> S = cov(A')
> S =
> 0.35288 -0.61112 -0.18879
> -0.61112 33.564 -2.3009
> -0.18879 -2.3009 10.882
> >> sum(diag(S))
> ans =
> 44.799
> >> diag(S)/sum(diag(S))
> ans =
> 0.0078769
> 0.74921
> 0.24291

Subject: PCA dimensionality reduction / component identifications

From: Peter Perkins

Date: 17 Jul, 2010 16:38:24

Message: 6 of 18

On 7/17/2010 11:15 AM, Pierrre Gogin wrote:
> Thanks a lot, that looks very much like what I'm looking for. I was
> confused, because actually the PCA is not really necessary to get the
> information I'm looking for. It is sufficient to evaluate the
> main diagonal of the covariance matrix. The princomp command in MATLAB
> does not do much more than that, I guess.

PRINCOMP relies on SVD. I would not describe that as "not much more".
The principal component coefficients, i.e., the first output of
PRINCOMP, are certainly not derived just from the diagonal of the cov
matrix, nor are the eigenvalues.

> Only if you also compute the second output do you get the input matrix
> rotated by the matrix containing all the eigenvalues. Correct?

No, the rotation matrix, i.e. the PC coefs, contains the eigen
_vectors_. That second output of rotated data is known as the "scores".
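
If it helps to see that concretely, here is a small check using the three-output call from the first post (with the transpose), i.e. [COEFF2,SCORE2,e] = princomp(A'):

 >> X0 = bsxfun(@minus, A', mean(A',1));   % princomp centers the data first
 >> max(max(abs(X0*COEFF2 - SCORE2)))      % essentially zero: the scores are the centered data rotated by the PC coefficients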

Subject: PCA dimensionality reduction / component identifications

From: Pierrre Gogin

Date: 17 Jul, 2010 22:10:06

Message: 7 of 18

Hi Peter,
So I think your proposal does not solve my question, because it computes the magnitudes of the variances and not the eigenvalues of each dimension.
I searched a bit more and found that this question has been discussed before, but without a satisfying result, e.g.:
http://www.mathworks.in/matlabcentral/newsreader/view_thread/156868
or
http://www.mathworks.de/matlabcentral/newsreader/view_thread/148851

All of these people want the same thing: associating the eigenvalues with their original dataset.
One user proposed to use the first column of u in [u,s,v] = svd(A) (see the 2nd link), but those are the singular values, not the eigenvalues.

Subject: PCA dimensionality reduction / component identifications

From: Peter Perkins

Date: 19 Jul, 2010 14:57:40

Message: 8 of 18

On 7/17/2010 6:10 PM, Pierrre Gogin wrote:
> I searched a bit more and found that this question has been discussed
> before, but without a satisfying result, e.g.:
> http://www.mathworks.in/matlabcentral/newsreader/view_thread/156868
> or http://www.mathworks.de/matlabcentral/newsreader/view_thread/148851
>
> All of these people want the same thing: associating the eigenvalues
> with their original dataset.

Pierre, you have not defined what you mean by that. In the output of
PRINCOMP, each eigenvalue corresponds to one of the coordinate axes in
the rotated space. But what does, say, the first eigenvalue correspond
to in the original data? Two answers might be:

1) It doesn't correspond to any one thing. The eigenvalues that
PRINCOMP returns are the eigenvalues of the covariance matrix of the
original data, that's all. There's no correspondence to any particular
variable, because the PCs, and therefore their eigenvalues, are a
rotation of the original space.
2) It corresponds to the linear combination of the original variables
that defines the first PC.

It may be that you want to look at the variances of your original
variables (as I already suggested), or you may want to look at the PC
coefficients (ditto) to try and identify a small subset of the original
variables that accounts for a suitable proportion of the total variance.
But if you are expecting to be able to match eigenvalues to your
original variables, 1:1, I don't know what to say. I may just be
misunderstanding what you're asking.

> One user proposed to use the first column of u
> in [u,s,v] = svd(A) (See 2nd link), but these are the singular values
> not the eigenvalues

I don't see that in either thread. One person in the first thread
decided that what he needed to do to _compare two PCA solutions_ was to
try and find a correspondence between PCs in those _two solutions_ by
taking dot products between PCs. In any case, the first column of the
first output of SVD is the dominant left singular vector, which in this
case corresponds to the scores along the first PC. The second output of
SVD is the singular values, which are the sqrts of the eigenvalues of
cov(X). See the code in PRINCOMP.
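
Roughly, the relationship looks like this (a sketch of what PRINCOMP does internally, not a substitute for reading its code):

 >> X0 = bsxfun(@minus, A', mean(A',1));   % center the columns first
 >> [U,S,V] = svd(X0, 'econ');
 >> diag(S).^2 / (size(X0,1)-1)            % matches the eigenvalue output e of princomp(A')
 >> U(:,1)*S(1,1)                          % scores along the first PC (up to sign)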

Subject: PCA dimensionality reduction / component identifications

From: Rob Campbell

Date: 19 Jul, 2010 18:05:10

Message: 9 of 18

>know with which percentage each of my three features (vec1, vec2, vec3) from the
>original (!) distribution contributes to my dataset, without prior knowledge
>of how they were built. From the output of MATLAB I can’t tell. Does somebody have an
>idea how to get this information?

So you want to know how much each of the original dimensions contributes to a given eigenvector? If you can calculate that, then you can obtain this information across any number of PCs.

Subject: PCA dimensionality reduction / component identifications

From: Philip Mewes

Date: 20 Jul, 2010 14:39:04

Message: 10 of 18


> So you want to know how much each of the original dimensions contributes to a given eigenvector?

Not exactly. I want to know how much each of my original dimensions contributes to the complete set of dimensions. From a list of sorted (!) eigenvalues I guess I want to be able to see it.

Subject: PCA dimensionality reduction / component identifications

From: Rob Campbell

Date: 20 Jul, 2010 15:30:22

Message: 11 of 18


> > So you want to know how much each of the original dimensions contributes to a given eigenvector?
>
> Not exactly. I want to know how much each of my original dimensions contributes to the complete set of dimensions. From a list of sorted (!) eigenvalues I guess I want to be able to see it.

You're saying you want to know how much each of the original dimensions contributes to the complete set of eigenvectors? I don't understand what you're asking for.

I was trying to break down the question: If you can calculate the contribution of your original dimensions to one PC then you can easily do this for any number of PCs. Isn't this what you want to know?

Can you explain what you want to know? Maybe the reason that nobody can help you is that there's a better way of addressing your underlying question.

Subject: PCA dimensionality reduction / component identifications

From: Rob Campbell

Date: 20 Jul, 2010 15:35:20

Message: 12 of 18

>The reduction itself is not the problem, but I could not figure out how to identify which
>of the original features are actually important components and which are not.

I see, so this is what you want to know? Why not look at the directions of the eigenvectors? Plot the eigenvectors: the resulting "shapes" will tell you which features of your data they explain.
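
For example, something along these lines (a rough sketch, assuming the COEFF2 output from the earlier princomp(A') call):

figure
for j = 1:3
    subplot(3,1,j)
    bar(COEFF2(:,j))                        % loadings of PC j on vec1, vec2, vec3
    title(sprintf('PC %d coefficients', j))
end

A loading close to 1 (or -1) on a single original feature means that PC essentially "is" that feature; small loadings mean the feature contributes little to that direction.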

Subject: PCA dimensionality reduction / component identifications

From: Philip Mewes

Date: 20 Jul, 2010 15:36:04

Message: 13 of 18


> or you may want to look at the PC
> coefficients (ditto) to try and identify a small subset of the original
> variables that accounts for a suitable proportion of the total variance.

That's exactly what I want to do, but I haven't understood yet how PCA can help me with this. If you have a look at my initial post, I introduced 3 vectors (vec1, vec2, vec3), each of which contains 100 observations. Let's say that because of computational cost I would like to drop some of these vectors. Let's also assume that I don't care how many vectors are dropped (1 or 2), but it does matter how much (in %) the remaining variables represent of my initial set of variables. How would I do that?

Do I need to evaluate how much each of the original dimensions contributes to a given eigenvector, like Rob proposed in a previous post?

For me it is especially important that I have to do this PCA step only once, and that as a result I know which vector (vec1, vec2, vec3) I can drop in the future. I assume, and I know, that the characteristics (var, mean, std, etc.) will always be more or less the same for upcoming vectors. So for a new set of vectors, very similar to the one I did the PCA with, I want to say which of them is important and which is not.

I think that to get this knowledge it is not sufficient to look at the variances, because they do not take into account the correlations between the vectors.

Subject: PCA dimensionality reduction / component identifications

From: Philip Mewes

Date: 20 Jul, 2010 15:47:05

Message: 14 of 18

Hi Rob,

thanks for your help. I just replied to Peter's post explaining my motivation. Does that help you understand my problem?
I can also say a bit more about the application: it's about image processing. Say you have an image and you want to classify this image into two classes:
1.: Image contains a car
2.: Image contains no car

So you extract from your image a large number of features that might help you with this classification step. Some of the features might be very useful, some of them are redundant, and this is exactly what I want to find out; I think in principle PCA is the way to go. I want to find this out so that for future images I can skip some of the feature extraction steps, because I know that those features are not important.

Does that make it a bit clearer?

"Rob Campbell" <matlab@robertREMOVEcampbell.removethis.co.uk> wrote in message <i24fee$gur$1@fred.mathworks.com>...
>
> > > So you want to know how much each of the original dimensions contributes to a given eigenvector?
> >
> > Not exactly. I want to know how much each of my original dimensions contributes to the complete set of dimensions. From a list of sorted (!) eigenvalues I guess I want to be able to see it.
>
> You're saying you want to know how much each of the original dimensions contributes to the complete set of eigenvectors? I don't understand what you're asking for.
>
> I was trying to break down the question: If you can calculate the contribution of your original dimensions to one PC then you can easily do this for any number of PCs. Isn't this what you want to know?
>
> Can you explain what you want to know? Maybe the reason that nobody can help you is that there's a better way of addressing your underlying question.

Subject: PCA dimensionality reduction / component identifications

From: Rob Campbell

Date: 20 Jul, 2010 16:11:23

Message: 15 of 18

Generally people will:
1. Do PCA.
2. Choose a suitable number of dimensions based upon the eigenvalues.
3. Reconstruct original observations using this reduced space (see help pcares; a small sketch follows below).
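
Step 3 might look something like this (just a sketch; pcares is part of the Statistics Toolbox):

[residuals, reconstructed] = pcares(A', 2);   % keep 2 PCs and rebuild the 100-by-3 data from them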

But it seems that you want to know which of the original dimensions are relatively unimportant so that you don't have to spend time acquiring them in the future. Right?

> Do I need to evaluate how much each of the original dimensions contributes to a
> given eigenvector, like Rob proposed in a previous post?
If you want to do what you state then it seems that you might want to know this. I think you would want to decide how many PCs you need to describe your data. Then work out how much each of the original dimensions contributes to each of these PCs. Add up these numbers and try excluding those dimensions which explain the least. I think you need a statistical test to determine whether or not removing a dimension results in a significantly worse description.
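
One way to add those numbers up (a sketch only, reusing the COEFF2 and e outputs from the earlier princomp(A') call and assuming k PCs are judged sufficient):

k = 2;                                    % number of PCs retained
contrib = (COEFF2(:,1:k).^2) * e(1:k);    % variance each original variable contributes to those k PCs
contrib / sum(e(1:k))                     % share of the retained variance attributable to each variable

The dimensions with the smallest shares would be the first candidates to exclude, subject to the kind of test mentioned above.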


Do you have any key descriptive statistics which you calculate on your data? Perhaps a standard deviation or maybe your data fall into different clusters? If so, one possibility is to calculate this statistic on the full space then drop dimensions in a systematic manner and re-calculate. In other words, a step-wise regression.

Subject: PCA dimensionality reduction / component identifications

From: Philip Mewes

Date: 20 Jul, 2010 16:12:05

Message: 16 of 18

"Rob Campbell" <matlab@robertREMOVEcampbell.removethis.co.uk> wrote in message <i24fno$6s3$1@fred.mathworks.com>...
> >The reduction itself is not the problem, but I could not figure out how to identify which
> >of the original features are actually important components and which are not.
>
> I see, so this is what you want to know? Why not look at the directions of the eigenvectors? Plot the eigenvectors: the resulting "shapes" will tell you which features of your data they explain.

If I do [vec, ~, eigenv] = pca(A), and A is a matrix that contains one of my vectors in each row, then vec contains the eigenvectors as the rows of a matrix. Are those eigenvectors still related to my original data?

Subject: PCA dimensionality reduction / component identifications

From: Rob Campbell

Date: 20 Jul, 2010 16:18:04

Message: 17 of 18

Ah! So you have two groups and you can produce a training set where you know whether or not each image contains a car? In that case, you have a supervised classification problem. PCA isn't the right way to go. Why not conduct a discriminant analysis? This will produce a single direction in your space which best separates the car from non-car images. You can plot your data as two histograms along this axis. The direction of the vector will tell you the basis upon which the discrimination was made: each of your original variables will have a "weighting" and you can use the magnitude of each weighting to decide whether or not it is significant. You can calculate confidence intervals for these weightings (maybe using a permutation test) to help you determine significance.

You want:
help classify
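
Very roughly, something like this (a sketch only; X and hasCar are made-up names for a feature matrix with one image per row and its 0/1 car labels, and you should check help classify for the exact output order):

[predicted, err, ~, ~, coef] = classify(X, X, hasCar);   % linear discriminant, resubstitution
w = coef(1,2).linear;                                    % the discriminating direction
bar(abs(w))                                              % large weights point to the informative original features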

Subject: PCA dimensionality reduction / component identifications

From: Greg Heath

Date: 21 Jul, 2010 04:58:48

Message: 18 of 18

On Jul 16, 12:02 pm, "Pierrre Gogin" <pierre.go...@freemail.de> wrote:
> Hi everybody,
>
> I have a question regarding the reduction of dimensions with PCA (the princomp command in MATLAB). The reduction itself is not the problem, but I could not figure out how to identify which of the original features are actually important components and which are not.
> Small example:
> I create a feature space of 100 observations and 3 features:
> vec1 = rand(1,100)*2;
> vec2 = rand(1,100)*20;
> vec3 = rand(1,100)*12;
> A = [vec1; vec2; vec3];
>
> Obviously the 2nd feature has a higher variance than the 3rd, etc. So from this generated data I would expect that vec2 contributes most to describing my dataset.

Contributing the most to what? That's almost like saying
the volume of a rectangular box V = L*W*H is

V = (12 in)*(1 ft)*((1/3) yard) = 4 in-ft-yards

and the length of 12 is contributing the most to the volume.

For many variable comparisons it is usually prudent to

1. Use standardized variables, to be independent of
   scaling and the choice of origin location (a small
   sketch follows this list).
2. Define a characteristic for making comparisons.
3. Use a quantitative measure of the amount of that
   characteristic created by an arbitrary variable
   subset.
4. Be cognizant of the fact that if the variables
   are correlated, the contribution of a variable will
   depend on what other variables are present. For
   example, when all variables are present, removing
   x2 might decrease the measure the most. However,
   when no variables are present, the measure might
   be increased the most when x3 is added.
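
For item 1, standardizing before the PCA is one line in MATLAB (a sketch, reusing the A from the original post):

Az = zscore(A');                  % each feature scaled to zero mean and unit variance
[Cz, Sz, ez] = princomp(Az);      % ez now reflects correlation structure rather than raw scale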


> I followed the “normal” approach to do the PCA:
> [COEFF2,SCORE2, e] = princomp(A)
>
> From the output I get the eigenvectors and eigenvalues, telling me how much each dimension of the transformed (!) feature space contributes to the representation of the dataset.

Representation of what characteristic? The eigenvalues
of the covariance matrix indicate scale-dependent spread.
Is spread the characteristic that is most important?
What happens if you rescale the data?
What about the classification of data from two
parallel cigar-shaped distributions? ... The spread
is largest along the length of the cigars. However,
the direction of largest class separation can be in
the direction perpendicular to the maximum spread
direction.

>I want to know with which percentage each of my three features (vec1, vec2, vec3) from the original (!) distribution contributes to my dataset, without prior knowledge of how they were built. From the output of MATLAB I can’t tell. Does somebody have an idea how to get this information?

As implied above, the meaning of "contributing to the
dataset" has to be defined.

If you are only interested in the directions
of largest spread, beware that they may be
useless if you are trying to separate classes
of different data types.

Begin by defining an output as a function of the input
variables and a goodness measure for the output.

The rest depends on your particular problem and
may be very difficult. However, you'll seldom
achieve your goal without a well-defined goal,
a well-planned approach, and a good start.

Hope this helps.

Greg
