Hello everybody,
I am trying to perform PCA on a large dataset (1000 x 1290240). I heard about iterative PCA:
1 - Calculate the mean values per column.
2 - Calculate the covariance matrix:
# Calculate all cross-products.
# Save those cross-products in a variable.
# Repeat 1-2 until the end of the file.
# Divide by the number of rows minus 1 to get the covariance.
I tried the cross function, cross(A,B), but what should the second argument be? A is my sub-dataset. Can someone help me solve this problem?
If you can load the dataset into memory, the rest should be easy. Here is how you compute coefficients and scores for wide data (many variables and just a few observations):
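X = rand(100,1000000);
X = bsxfun(@minus,X,mean(X)); % center each column
[U,D] = eig(X*X'); % eigendecomposition of the small 100-by-100 matrix X*X'
rank(D) % displays ans = 99
invS = diag(D).^(-1/2); % inverse square roots of the eigenvalues
[~,imax] = max(invS); % largest inverse belongs to the smallest eigenvalue
invS(imax) = 0; % drop that component
coeff = X'*U*diag(invS); % principal component coefficients
score = X*coeff; % principal component scores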
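Note that for 100 observations you can have at most 99 principal components after centering X. Centering removes one degree of freedom, so the smallest eigenvalue of X*X' is zero up to round-off and its inverse square root is meaningless. This is why you need to set the inverse of the smallest eigenvalue to zero. You may need to set more inverse values to zero for your data.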
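If the full matrix does not fit in memory, the same idea can probably be applied block by block, because X*X' is just the sum of Xb*Xb' over blocks of columns, and the column means needed for centering are local to each block. Below is a rough sketch of that, not something tested on your data. The sizes n, p and blk, the in-memory X standing in for a file on disk, and the relative tolerance relTol are all placeholders I made up for illustration. Instead of zeroing only the single smallest eigenvalue, it drops every eigenvalue below the tolerance, which is one way to handle the "more inverse values" case.

```matlab
% Sketch only: PCA of wide data from the n-by-n matrix X*X', built block-wise.
rng(0);                         % reproducible demo data
n = 100; p = 5000;              % hypothetical sizes: n observations, p variables
blk = 1000;                     % hypothetical number of columns per block
X = rand(n, p);                 % stand-in for data that would be read from disk
relTol = 1e-10;                 % assumed relative cutoff for "zero" eigenvalues

% Pass 1: accumulate G = Xc*Xc' one block of columns at a time.
G = zeros(n);
for j = 1:blk:p
    cols = j:min(j+blk-1, p);
    Xb = X(:, cols);                        % "read" one block of columns
    Xb = bsxfun(@minus, Xb, mean(Xb, 1));   % center the columns of this block
    G = G + Xb*Xb';
end
G = (G + G')/2;                 % remove round-off asymmetry before eig

% Eigendecomposition of the small matrix, largest eigenvalues first.
[U, D] = eig(G);
[d, order] = sort(diag(D), 'descend');
U = U(:, order);

% Keep only the eigenvalues that are clearly above round-off.
keep = d > relTol*max(d);
U = U(:, keep);
d = d(keep);
k = nnz(keep);

% Scores follow from G alone: X*coeff = G*U*diag(d.^(-1/2)) = U*diag(sqrt(d)).
score = bsxfun(@times, U, sqrt(d)');

% Pass 2: coefficients, again one block of columns (rows of coeff) at a time.
coeff = zeros(p, k);
for j = 1:blk:p
    cols = j:min(j+blk-1, p);
    Xb = X(:, cols);
    Xb = bsxfun(@minus, Xb, mean(Xb, 1));
    coeff(cols, :) = bsxfun(@times, Xb'*U, 1./sqrt(d)');
end
```

For a dataset of your size, coeff has as many rows as there are variables, so you would likely keep only the first few columns or write each block of rows to disk instead of preallocating the whole thing. The variances of the components are d/(n-1) if you need them.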