Issues in PCA transformation

While performing a PCA transformation of a testing set based on a training set in MATLAB R2019b on a Windows 10 machine, I found that the matrix after the PCA transformation differs across 3 conditions that I originally expected to be identical. I started testing the behavior of the PCA transformation after noticing that the classification accuracy of an SVM was not exactly 50% when the testing set consisted of two identical copies of the same data, one copy labeled 1 and the other labeled 0. That raised my suspicion, and I later traced the problem to the PCA transformation I was performing.
% generate training set, normalizing each observation
X = rand(100,50000);
train = zeros(100,50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end
% generate testing set, normalizing each observation
x = rand(10,50000);
test = zeros(10,50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end
% compute PCA coefficients from the training set
[coeff,~,latent] = pca(train);
% record the component counts at which differences occur
allDiff = [];  % counts where one, two, and three differ
tempDiff = []; % counts where two and three differ
% loop over the number of PCA components included in the transformation
for count = 1:size(coeff,2)
    matrix = coeff(:,1:count);
    % cases
    one = test*matrix;
    temp = [test;test]*matrix;
    two = temp(1:10,:);
    three = temp(11:20,:);
    % check whether the three expectedly identical matrices are indeed identical
    if ~isequal(one,two,three) % difference among one, two, and three
        allDiff = [allDiff,count];
    end
    if ~isequal(two,three) % difference between two and three
        tempDiff = [tempDiff,count];
    end
end
The differences recorded in allDiff start at 2 PCA components, while those in tempDiff start at 20 PCA components. Occasionally, some component counts return identical matrices for one, two, and three.
Is this issue related to rounding error in matrix multiplication? And more importantly, which is the correct matrix after the PCA transformation? (I guess it is one.) Thanks.
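One quick way to check whether summation order alone can produce differences at this scale: floating-point addition of IEEE doubles is not associative, so the same dot product accumulated in a different order can change in its last bits. A minimal sketch (the vector length is arbitrary):

```matlab
% Floating-point addition order matters: the same numbers summed in a
% different order can give a result that differs in the last bit.
a = 0.1; b = 0.2; c = 0.3;
isequal((a + b) + c, a + (b + c))   % returns false in IEEE double arithmetic

% The same effect at the scale seen above: summing a long vector
% forwards vs. backwards typically differs by around 1e-13 or less.
rng(0);
v = randn(1, 50000);
sum(v) - sum(fliplr(v))             % small but generally nonzero
```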

Answers (1)

Image Analyst on 10 Aug 2020
Since you're using random numbers, why do you think that exactly 50% of your points should fall into each of two classes? Your numbers are continuously valued. It's not like they're in two distinct, well separated clusters. So of course there may not be exactly 50% in each class.

10 Comments

Sorry for the confusion. I originally ran the classification on my own data set, found the accuracy was not 50%, and kept troubleshooting to find the source of the error. That process eventually led me to the PCA transformation. The code I posted is just a demonstration of how the PCA transformation (i.e., the matrix multiplication with the PCA coefficients) can produce discrepant results as the number of rows changes; it is not the original code I used for the classification. After the original post, I further checked that yet another result appears if I perform the PCA transformation row by row:
% generate training set, normalizing each observation
X = rand(100,50000);
train = zeros(100,50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end
% generate testing set, normalizing each observation
x = rand(10,50000);
test = zeros(10,50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end
% compute PCA coefficients from the training set
[coeff,~,latent] = pca(train);
% save for analysis
allDiff = [];      % counts where one, two, and three differ
tempDiff = [];     % counts where two and three differ
absoluteDiff = []; % counts where the row-by-row result differs as well
% loop over the number of PCA components included in the transformation
for count = 1:size(coeff,2)
    matrix = coeff(:,1:count);
    % cases
    zero = [];
    for row = 1:10
        zero = [zero; test(row,:)*matrix]; % PCA transformation row by row
    end
    one = test*matrix;      % transformation of the whole test set at once
    temp = [test;test]*matrix;
    two = temp(1:10,:);     % transformation of a doubled batch: first half
    three = temp(11:20,:);  % transformation of a doubled batch: second half
    % check whether the four expectedly identical matrices are indeed identical
    if ~isequal(zero,one,two,three)
        absoluteDiff = [absoluteDiff,count];
    end
    if ~isequal(one,two,three)
        allDiff = [allDiff,count];
    end
    if ~isequal(two,three)
        tempDiff = [tempDiff,count];
    end
end
Thanks.
Again, I don't see why you expect an exact 50% classification. Why should it be that?
PCA basically finds a new coordinate system, a rotated one that aligns better with the shape of your data scatterplot than your original coordinate system. It basically decouples the coordinate system to give a new one that is orthogonal to your data.
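That rotated coordinate system can be seen in a small sketch: generate an elongated 2-D cloud at a known angle and check that pca() recovers axes aligned with it (the angle and spreads here are made-up illustration values):

```matlab
% Sketch: PCA on stick-shaped 2-D data recovers a rotated coordinate system.
rng(1);
theta = pi/6;                                     % arbitrary rotation angle
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
data = [randn(500,1)*5, randn(500,1)*0.2] * R';   % elongated cloud, rotated by theta
coeff = pca(data);                                % columns are the new orthonormal axes
% coeff(:,1) points (up to sign) along the long axis of the cloud,
% i.e. approximately [cos(theta); sin(theta)].
disp(coeff)
```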
The classification that I have done is something like the following:
% generate simulated training set
X = rand(100,50000);
train = zeros(100,50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end
train_label = [ones(1,100), zeros(1,100)]; % label one copy 1 and the other 0
% generate simulated testing set
x = rand(10,50000);
test = zeros(10,50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end
test_label = [ones(1,10), zeros(1,10)]; % label one copy 1 and the other 0
% compute PCA coefficients from the training set
[coeff,~,latent] = pca(train);
pca_count = 20;
matrix = coeff(:,1:pca_count);
% PCA transformation of two stacked copies of each set
transformed_train = [train;train]*matrix;
transformed_test = [test;test]*matrix;
% train an SVM model and predict on the test set
Mdl = fitcsvm(transformed_train, train_label);
prediction = predict(Mdl, transformed_test);
% check accuracy (prediction is a column vector, test_label a row vector)
accuracy = mean(prediction == test_label');
As I use exactly two copies of the training data, one copy labeled 1 and the other 0, and similarly for the test data, I expect the classification accuracy to be exactly 0.5. Sorry again for the confusion caused by the inappropriate earlier example.
The problem (accuracy not equal to 0.5) was later resolved by computing the PCA transformation for each half of the data separately (i.e., transformed_train = [train*matrix; train*matrix], and similarly for test), which led me to suspect a rounding issue. The presence of rounding error was further supported by yet another discrepancy when the PCA transformation is done row by row.
I'm not sure you understand what you're doing. You're basically taking 50,000 random numbers between 0 and 1 and calling half of them class 1 and half of them class 2. Why do you think there should be any meaningful principal components in that? It's basically a shotgun blast with no discernible structure to it. So your PCs are going to be random noise at random angles plus there is no guarantee that in the test set you'll have exactly 50% be in each class.
The actual classification is done on a real dataset rather than random numbers; I was just trying to show a simulated situation. Sorry that I have not delivered that clearly.
I think the problem still holds even if the principal components and the trained SVM model are meaningless: because of the design of the training set (two copies of the same data, one in class 1 and one in class 2), the accuracy should always be 50% (the prediction must be correct for one copy and wrong for the other). The confusion might be due to my poor variable naming in the code above: test has 10 observations, while test_label has 20; later in the code, two copies of test are stacked together for 20 observations in total.
The 50% accuracy is achieved on my test dataset when the PCA transformation is conducted separately for each copy (i.e., [test*coeff; test*coeff]) instead of together (i.e., [test;test]*coeff). I later found that the PCA transformation result also differs from both of these cases if the transformation is done row by row. Therefore I was wondering whether a rounding issue exists during matrix multiplication when the matrix dimensions are large. Thanks.
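Whether the stacked and unstacked products actually differ depends on how the underlying BLAS library blocks each multiplication, so the effect is library-, shape-, and machine-dependent; a sketch of the comparison (with no guarantee either way, and made-up sizes matching the example above):

```matlab
% Compare the same rows multiplied alone vs. stacked in a taller matrix.
% Depending on the BLAS kernels chosen for each shape, the two results may
% agree bitwise or differ at roughly the 1e-14 level.
rng(2);
A = randn(10, 50000);
M = randn(50000, 20);
single_copy = A * M;
stacked     = [A; A] * M;
max(abs(single_copy - stacked(1:10,:)), [], 'all')  % 0 if identical, tiny otherwise
```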
Show us a scatterplot of your points. I seriously doubt that rounding is the problem. You know that if you have three different sets of points that are similar but not identical, you will have three different sets of PCs, right? This is just because your data is different, not because of any rounding in the smallest decimal places. For example, if you have a scatter pattern of points in a stick-like shape along the x-axis with small deviations along the y-axis and got the 2 PC vectors, then a different but similar set of random points in the same general location would give you different PCs, right? They would be similar because your data is similar, but not identical because your data is not identical; it has nothing to do with any rounding/truncation inaccuracy out at the final decimal places.
The PC set is identical across all the different methods of matrix multiplication, since the same PCA coeff is used in every transformation. After the transformation, the differences in each component among the different methods (row-by-row multiplication, multiplication with one copy, multiplication with two stacked copies taking the first or the second half) are at the scale of 1e-14 or less.
The observation presented in the plots is one where the 4 methods differ. The y-axis is the value and the x-axis is the PC; the values across the 4 methods stack on top of one another. [plot not shown]
A closer look at the first component: [plot not shown]
Sorry for not including these details earlier. Once again, thank you for your help.
Either I don't know what you're doing or you don't, because something doesn't make sense to me. With the pca() function, you pass it data and it gives you back the data in the new PC coordinate system. It figures out what the transform is, not you. So I don't understand it when you say "the same PCA coeff is used in all PCA transformations." If you start with different sets of data, you will not end up with the same coefficients. They may be close but they will not be the same. It almost sounds like you're transforming your data, like rotating your coordinate system, getting new coordinates, and expecting/hoping that pca() will give you the same transform you used.
What I am trying to do here is transform the testing data into the PC space of the training data. Therefore, the input to the pca() function is always the same training data, with coeff as the output. A number n of PCs in coeff is selected such that it explains a given level of variance. I then use this coeff to transform the testing data into the PC space of the training data by different methods (row by row, standalone, or two copies stacked together as [testing; testing]) through matrix multiplication (i.e., data*coeff(:,1:n)). The results of these different methods of matrix multiplication differ at a scale of 1e-14 or less, which I now believe is a floating-point arithmetic issue.
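Given differences at that scale, comparing the transformed matrices with a tolerance is more appropriate than a bitwise isequal; a sketch (the 1e-10 tolerance is an arbitrary choice, well above the observed ~1e-14 noise):

```matlab
% Compare PCA-transformed matrices up to floating-point noise instead of bitwise.
tol = 1e-10;  % arbitrary tolerance for this sketch
same = @(P, Q) max(abs(P(:) - Q(:))) <= tol * max(1, max(abs(P(:))));
% Hypothetical usage with the matrices from the earlier loop:
% if same(one, two) && same(two, three)
%     % treat the transformations as equal
% end
```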



Asked: on 10 Aug 2020
Commented: on 14 Aug 2020
