Issues in PCA transformation
While applying a PCA transformation (computed from a training set) to a testing set in MATLAB R2019b on a Windows 10 machine, I found that the matrix after the PCA transformation differs across three conditions that I expected to be identical. I started testing the behavior of the PCA transformation after noticing that SVM classification accuracy was not exactly 50% when the testing set consisted of two identical copies of the same data, one labeled 1 and the other labeled 0. That raised my suspicion, and I later traced the problem to the PCA transformation I was applying.
% Generate training set, normalized per observation
X = rand(100, 50000);
train = zeros(100, 50000);
for i = 1:100
    train(i,:) = normalize(X(i,:));
end

% Generate testing set, normalized per observation
x = rand(10, 50000);
test = zeros(10, 50000);
for i = 1:10
    test(i,:) = normalize(x(i,:));
end

% Compute PCA coefficients based on the training set
[coeff, ~, latent] = pca(train);

% Record whether differences exist
allDiff  = []; % record component counts where one, two, and three differ
tempDiff = []; % record component counts where two and three differ

% Vary the number of PCA components included in the transformation
for count = 1:size(coeff, 2)
    matrix = coeff(:, 1:count);
    % Cases
    one   = test * matrix;
    temp  = [test; test] * matrix;
    two   = temp(1:10, :);
    three = temp(11:20, :);
    % Check whether the three supposedly identical matrices really are identical
    if isequal(one, two, three) == false % difference among one, two, and three
        allDiff = [allDiff, count];
    end
    if isequal(two, three) == false % difference between two and three
        tempDiff = [tempDiff, count];
    end
end
The differences recorded in allDiff start at 2 PCA components, while those in tempDiff start at 20 components. Occasionally, some component counts return identical matrices for one, two, and three.
Is this issue related to rounding error in the matrix multiplication? And, more importantly, which is the correct matrix after the PCA transformation? (I guess it is one.) Thanks.
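For reference, the discrepancy can be reproduced in a minimal form with no PCA involved at all. This is a hedged sketch (the variable names are made up): multiplying the same rows on their own versus inside a taller stacked matrix may not give bitwise-identical results, because the underlying multithreaded BLAS routines can partition the larger product differently, changing the order of floating-point accumulation.

```matlab
% Minimal sketch: the same rows multiplied alone vs. stacked can differ
% at machine precision, because the BLAS library may split the larger
% product across threads/blocks differently.
A = rand(10, 50000);
M = rand(50000, 20);

alone   = A * M;          % rows multiplied on their own
stacked = [A; A] * M;     % same rows inside a taller matrix

% Typically not bitwise identical, but equal to within rounding error:
maxErr = max(abs(alone - stacked(1:10,:)), [], 'all');
fprintf('max |difference| = %g\n', maxErr);
```

If maxErr is on the order of 1e-12 or smaller here, the differences seen in allDiff and tempDiff are consistent with ordinary floating-point rounding rather than a bug in pca.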
Answers (1)
Image Analyst
on 10 Aug 2020
Since you're using random numbers, why do you think that exactly 50% of your points should fall into each of two classes? Your numbers are continuously valued. It's not like they're in two distinct, well separated clusters. So of course there may not be exactly 50% in each class.
10 Comments
Tze Hei Cho
on 11 Aug 2020
Image Analyst
on 11 Aug 2020
Again, I don't see why you expect an exact 50% classification. Why should it be that?
PCA basically finds a new coordinate system: a rotated one that aligns better with the shape of your data scatterplot than your original coordinate system does. In effect, it decorrelates your data, giving you a new set of mutually orthogonal axes aligned with the directions of greatest variance.
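The rotated-coordinate-system view can be checked directly in a small sketch (the data here is illustrative, not from the question): the coeff matrix returned by pca is orthonormal, so the scores are just the mean-centered data expressed in the new axes.

```matlab
% Sketch: PCA as a rotation of the coordinate system.
rng(0);
% Correlated 2-D data, so there is a genuine dominant direction.
data = randn(500, 2) * [2 0.5; 0.5 1];

[coeff, score] = pca(data);

% coeff is orthonormal: coeff' * coeff is (numerically) the identity,
% i.e. the transform is a pure rotation/reflection of the axes.
disp(coeff' * coeff);

% score is simply the mean-centered data projected onto those axes:
centered = data - mean(data, 1);
disp(max(abs(score - centered * coeff), [], 'all'));  % ~ machine epsilon
```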
Tze Hei Cho
on 11 Aug 2020
Image Analyst
on 11 Aug 2020
I'm not sure you understand what you're doing. You're basically taking 50,000 random numbers between 0 and 1 and calling half of them class 1 and half of them class 2. Why do you think there should be any meaningful principal components in that? It's basically a shotgun blast with no discernible structure to it. So your PCs are going to be random noise at random angles plus there is no guarantee that in the test set you'll have exactly 50% be in each class.
Tze Hei Cho
on 12 Aug 2020
Tze Hei Cho
on 13 Aug 2020
Image Analyst
on 13 Aug 2020
Show us a scatterplot of your points. I seriously doubt that rounding is the problem. You know that if you have three different sets of points that are similar but not identical, you will have three different sets of PCs, right? That is simply because your data is different, not because of any rounding in the smallest decimal places. For example, if you have a scatter pattern of points in a stick-like shape along the x-axis with small deviations along the y-axis and got the 2 PC vectors, then a different but similar set of random points in the same general location would give you different PCs, right? They would not be the same, but that's because your data is different, not because of any rounding/truncation inaccuracy way out at the final decimal places. They would be similar because your data is similar, but not exactly the same because your data is not exactly the same.
Tze Hei Cho
on 14 Aug 2020
Image Analyst
on 14 Aug 2020
Either I don't know what you're doing or you don't, because something doesn't make sense to me. With the pca() function, you pass it data and it gives you back the data in the new PC coordinate system. It figures out what the transform is, not you. So I don't understand it when you say "the same PCA coeff is used in all pca transformations." If you start with different sets of data, you will not end up with the same coefficients. They may be close, but they will not be the same. It almost sounds like you're transforming your data, like rotating your coordinate system, getting new coordinates, and expecting/hoping that pca() will give you the same transform you used.
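For completeness, when a training-set PCA really is meant to be reused on new data, the usual pattern is to center the test set with the training mean before projecting. This is a hedged sketch (the variable names are illustrative); it uses the sixth output of pca, which is the per-variable mean that pca subtracted:

```matlab
% Sketch: applying training-set PCA coefficients to a test set.
train = rand(100, 50);
test  = rand(10, 50);

% mu is the training-set mean that pca subtracted internally.
[coeff, scoreTrain, ~, ~, ~, mu] = pca(train);

% Project the test set into the training PC space using the SAME mean:
scoreTest = (test - mu) * coeff;

% Sanity check: scoreTrain equals (train - mu) * coeff up to rounding.
disp(max(abs(scoreTrain - (train - mu) * coeff), [], 'all'));
```

Note that the code in the question projects with test * matrix and never subtracts mu, so its scores live in a space shifted relative to the training scores; that is a separate issue from the rounding question.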
Tze Hei Cho
on 14 Aug 2020
