Issues in PCA transformation

While performing a PCA transformation of a testing set based on a training set in MATLAB R2019b on a Windows 10 machine, I found that the matrix after the PCA transformation differs across 3 conditions that I originally expected to be identical. I started testing the behavior of the PCA transformation after noticing that the classification accuracy of an SVM was not exactly 50% when the testing set consisted of two identical copies of the same data, one copy labeled 1 and the other labeled 0. That raised my suspicion, and I later traced the problem to the PCA transformation I was performing.
% generate training set, normalizing each observation
X = rand(100,50000);
train = zeros(100,50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end
% generate testing set, normalizing each observation
x = rand(10,50000);
test = zeros(10,50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end
% compute PCA coefficients from the training set
[coeff,~,latent] = pca(train);
% record the component counts at which differences occur
allDiff = [];  % counts where one, two, and three differ
tempDiff = []; % counts where two and three differ
% loop over the number of PCA components included in the transformation
for count = 1:size(coeff,2)
    matrix = coeff(:,1:count);
    % cases
    one = test*matrix;
    temp = [test;test]*matrix;
    two = temp(1:10,:);
    three = temp(11:20,:);
    % check whether the three expectedly identical matrices are indeed identical
    if ~isequal(one,two,three) % difference among one, two, and three
        allDiff = [allDiff,count];
    end
    if ~isequal(two,three) % difference between two and three
        tempDiff = [tempDiff,count];
    end
end
The differences recorded in allDiff start at 2 PCA components, while those in tempDiff start at 20 PCA components. Occasionally, some component counts return identical matrices for one, two, and three.
Is this issue related to rounding error in matrix multiplication? And more importantly, which is the correct matrix after the PCA transformation? (I guess it is one.) Thanks.
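One quick way to check whether summation order alone can produce differences at this scale: floating-point addition of IEEE doubles is not associative, so the same dot product accumulated in a different order can change in its last bits. A minimal sketch (the vector length is arbitrary):

```matlab
% Floating-point addition order matters: the same numbers summed in a
% different order can give a result that differs in the last bit.
a = 0.1; b = 0.2; c = 0.3;
isequal((a + b) + c, a + (b + c))   % returns false in IEEE double arithmetic

% The same effect at the scale seen above: summing a long vector
% forwards vs. backwards typically differs by around 1e-13 or less.
rng(0);
v = randn(1, 50000);
sum(v) - sum(fliplr(v))             % small but generally nonzero
```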

Answers (1)

Image Analyst on 10 Aug 2020
Since you're using random numbers, why do you think that exactly 50% of your points should fall into each of two classes? Your numbers are continuously valued. It's not like they're in two distinct, well separated clusters. So of course there may not be exactly 50% in each class.

10 Comments

Sorry for the confusion. I originally ran the classification on my own data set, found the accuracy was not 50%, and kept troubleshooting to find the source of the error. That process eventually led me to the PCA transformation. The code I posted is just a demonstration of how the PCA transformation (i.e., the matrix multiplication with the PCA coefficients) can produce discrepant results as the number of rows changes; it is not the original code I used for the classification. After the original post, I further checked that yet another result appears if I perform the PCA transformation row by row:
% generate training set, normalizing each observation
X = rand(100,50000);
train = zeros(100,50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end
% generate testing set, normalizing each observation
x = rand(10,50000);
test = zeros(10,50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end
% compute PCA coefficients from the training set
[coeff,~,latent] = pca(train);
% save for analysis
allDiff = [];      % counts where one, two, and three differ
tempDiff = [];     % counts where two and three differ
absoluteDiff = []; % counts where the row-by-row result differs as well
% loop over the number of PCA components included in the transformation
for count = 1:size(coeff,2)
    matrix = coeff(:,1:count);
    % cases
    zero = [];
    for row = 1:10
        zero = [zero; test(row,:)*matrix]; % PCA transformation row by row
    end
    one = test*matrix;      % transformation of the whole test set at once
    temp = [test;test]*matrix;
    two = temp(1:10,:);     % transformation of a doubled batch: first half
    three = temp(11:20,:);  % transformation of a doubled batch: second half
    % check whether the four expectedly identical matrices are indeed identical
    if ~isequal(zero,one,two,three)
        absoluteDiff = [absoluteDiff,count];
    end
    if ~isequal(one,two,three)
        allDiff = [allDiff,count];
    end
    if ~isequal(two,three)
        tempDiff = [tempDiff,count];
    end
end
Thanks.
Again, I don't see why you expect an exact 50% classification. Why should it be that?
PCA basically finds a new coordinate system, a rotated one that aligns better with the shape of your data scatterplot than your original coordinate system. It basically decouples the coordinate system to give a new one that is orthogonal to your data.
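That rotated coordinate system can be seen in a small sketch: generate an elongated 2-D cloud at a known angle and check that pca() recovers axes aligned with it (the angle and spreads here are made-up illustration values):

```matlab
% Sketch: PCA on stick-shaped 2-D data recovers a rotated coordinate system.
rng(1);
theta = pi/6;                                     % arbitrary rotation angle
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
data = [randn(500,1)*5, randn(500,1)*0.2] * R';   % elongated cloud, rotated by theta
coeff = pca(data);                                % columns are the new orthonormal axes
% coeff(:,1) points (up to sign) along the long axis of the cloud,
% i.e. approximately [cos(theta); sin(theta)].
disp(coeff)
```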
The classification that I have done is something like the following:
% generate simulated training set
X = rand(100,50000);
train = zeros(100,50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end
train_label = [ones(1,100), zeros(1,100)]; % label one copy 1 and the other 0
% generate simulated testing set
x = rand(10,50000);
test = zeros(10,50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end
test_label = [ones(1,10), zeros(1,10)]; % label one copy 1 and the other 0
% compute PCA coefficients from the training set
[coeff,~,latent] = pca(train);
pca_count = 20;
matrix = coeff(:,1:pca_count);
% PCA transformation of two stacked copies of each set
transformed_train = [train;train]*matrix;
transformed_test = [test;test]*matrix;
% train an SVM model and predict on the test set
Mdl = fitcsvm(transformed_train, train_label);
prediction = predict(Mdl, transformed_test);
% check accuracy (prediction is a column vector, test_label a row vector)
accuracy = mean(prediction == test_label');
As I use exactly two copies of the training data, one copy labeled 1 and the other 0, and similarly for the test data, I expect the classification accuracy to be exactly 0.5. Sorry again for the confusion caused by the inappropriate earlier example.
The problem (accuracy not equal to 0.5) was later resolved by computing the PCA transformation for each half of the data separately (i.e., transformed_train = [train*matrix; train*matrix], and similarly for test), which led me to suspect a rounding issue. The presence of rounding error was further supported by yet another discrepancy when the PCA transformation is done row by row.
I'm not sure you understand what you're doing. You're basically taking 50,000 random numbers between 0 and 1 and calling half of them class 1 and half of them class 2. Why do you think there should be any meaningful principal components in that? It's basically a shotgun blast with no discernible structure to it. So your PCs are going to be random noise at random angles plus there is no guarantee that in the test set you'll have exactly 50% be in each class.
The actual classification is done on a real dataset rather than random numbers; I was just trying to show a simulated situation. Sorry that I have not delivered that clearly.
I think the problem still holds even if the principal components and the trained SVM model are meaningless: because of the design of the training set (two copies of the same data, one in class 1 and one in class 2), the accuracy should always be 50% (the prediction must be correct for one copy and wrong for the other). The confusion might be due to my poor variable naming in the code above: test has 10 observations, while test_label has 20; later in the code, two copies of test are stacked together for 20 observations in total.
The 50% accuracy is achieved on my test dataset when the PCA transformation is conducted separately for each copy (i.e., [test*coeff; test*coeff]) instead of together (i.e., [test;test]*coeff). I later found that the PCA transformation result also differs from both of these cases if the transformation is done row by row. Therefore I was wondering whether a rounding issue exists during matrix multiplication when the matrix dimensions are large. Thanks.
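Whether the stacked and unstacked products actually differ depends on how the underlying BLAS library blocks each multiplication, so the effect is library-, shape-, and machine-dependent; a sketch of the comparison (with no guarantee either way, and made-up sizes matching the example above):

```matlab
% Compare the same rows multiplied alone vs. stacked in a taller matrix.
% Depending on the BLAS kernels chosen for each shape, the two results may
% agree bitwise or differ at roughly the 1e-14 level.
rng(2);
A = randn(10, 50000);
M = randn(50000, 20);
single_copy = A * M;
stacked     = [A; A] * M;
max(abs(single_copy - stacked(1:10,:)), [], 'all')  % 0 if identical, tiny otherwise
```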
Show us a scatterplot of your points. I seriously doubt that rounding is the problem. You know that if you have three different sets of points that are similar but not identical, you will have three different sets of PCs, right? This is just because your data is different, not because of any rounding in the smallest decimal places. For example, if you have a scatter pattern of points in a stick-like shape along the x-axis with small deviations along the y-axis and got the 2 PC vectors, then a different but similar set of random points in the same general location would give you different PCs, right? They would be similar because your data is similar, but not identical because your data is not identical; it has nothing to do with any rounding/truncation inaccuracy out at the final decimal places.
The PC set is identical across all the different methods of matrix multiplication, since the same PCA coeff is used in every transformation. After the transformation, the differences in each component among the different methods (row-by-row multiplication, multiplication with one copy, multiplication with two stacked copies taking the first or the second half) are at the scale of 1e-14 or less.
The observation presented in the plots is one where the 4 methods differ. The y-axis is the value and the x-axis is the PC; the values across the 4 methods stack on top of one another. [plot not shown]
A closer look at the first component: [plot not shown]
Sorry for not including these details earlier. Once again, thank you for your help.
Either I don't know what you're doing or you don't, because something doesn't make sense to me. With the pca() function, you pass it data and it gives you back the data in the new PC coordinate system. It figures out what the transform is, not you. So I don't understand it when you say "the same PCA coeff is used in all PCA transformations." If you start with different sets of data, you will not end up with the same coefficients. They may be close but they will not be the same. It almost sounds like you're transforming your data, like rotating your coordinate system, getting new coordinates, and expecting/hoping that pca() will give you the same transform you used.
What I am trying to do here is transform the testing data into the PC space of the training data. Therefore, the input to the pca() function is always the same training data, with coeff as the output. A number n of PCs in coeff is selected such that it explains a given level of variance. I then use this coeff to transform the testing data into the PC space of the training data by different methods (row by row, standalone, or two copies stacked together as [testing; testing]) through matrix multiplication (i.e., data*coeff(:,1:n)). The results of these different methods of matrix multiplication differ at a scale of 1e-14 or less, which I now believe is a floating-point arithmetic issue.
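Given differences at that scale, comparing the transformed matrices with a tolerance is more appropriate than a bitwise isequal; a sketch (the 1e-10 tolerance is an arbitrary choice, well above the observed ~1e-14 noise):

```matlab
% Compare PCA-transformed matrices up to floating-point noise instead of bitwise.
tol = 1e-10;  % arbitrary tolerance for this sketch
same = @(P, Q) max(abs(P(:) - Q(:))) <= tol * max(1, max(abs(P(:))));
% Hypothetical usage with the matrices from the earlier loop:
% if same(one, two) && same(two, three)
%     % treat the transformations as equal
% end
```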



Asked: on 10 Aug 2020
Commented: on 14 Aug 2020
