- https://www.mathworks.com/help/stats/pca.html#:~:text=Description-,on,-Default.%20pca
- https://www.mathworks.com/help/matlab/ref/cov.html#:~:text=is%20defined%20as-,cov,),-where%20%CE%BCA

# question about how the function pca() calculates the covariance matrix internally

I was puzzled by how the output of pca() differs with and without mean centering. I am using MATLAB R2024a.

pca.m uses the internal function c = ncnancov(x,Rows,centered) which seems to provide the covariance matrix of x

however,

1) it uses the formula for the population covariance, i.e. it calculates x'*x/n not x'*x/(n-1) - what is the rationale behind that?

2) it does not mean center x. This is surprising because without mean centering x the formula x'*x/n (or x'*x/(n-1) for that matter) does NOT provide the covariance matrix
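The two points above can be checked numerically. The following is only a minimal sketch on synthetic data (the matrix X and its column means are assumptions standing in for the real data set); it uses base MATLAB only and shows that x'*x/n matches a covariance matrix only after centering, and otherwise differs from the population covariance by the outer product of the column means.

```matlab
rng(0);                            % reproducible synthetic data (assumption, not the thread's data)
X  = randn(20, 3) + [5 -2 1];      % 20 observations, 3 variables, nonzero column means
n  = size(X, 1);
mu = mean(X, 1);                   % 1-by-3 row of column means
Xc = X - mu;                       % mean-centered copy

S_sample = (Xc' * Xc) / (n - 1);   % sample covariance (divisor n-1)
S_pop    = (Xc' * Xc) / n;         % population covariance (divisor n)
M_raw    = (X' * X) / n;           % uncentered second-moment matrix

err_sample = norm(S_sample - cov(X), 'fro');            % cov(X) uses n-1
err_pop    = norm(S_pop - cov(X, 1), 'fro');            % cov(X,1) uses n
err_raw    = norm(M_raw - (S_pop + mu' * mu), 'fro');   % M_raw = S_pop + mu'*mu
```

All three error norms should be at the level of floating-point rounding, which is consistent with the claim that the uncentered product is not a covariance matrix unless the column means are zero.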

The second point causes the call [coeff,score,latent]=pca(D, 'Algorithm','eig','Centered','off') to produce different coeff and latent from the call [coeff,score,latent]=pca(D, 'Algorithm','eig'). The scores will obviously be different, but coeff and latent should not be affected by mean centering, as can be shown by comparing the output of:

load('Data_Table8p1.mat');

Dm = D-mean(D);

[coeff,eigValues] = eig(cov(D));

[eigValues, idx] = sort(diag(eigValues), 'descend'); % sort

coeff = coeff(:, idx);

score = D/coeff'; % get scores of the original (not mean centered) data

with:

[coeff_m,eigValues_m] = eig(cov(Dm));

[eigValues_m, idx] = sort(diag(eigValues_m), 'descend'); % sort

coeff_m = coeff_m(:, idx);

score_m = Dm/coeff_m'; % get scores of mean centered data

Probably I am missing something, but the internal function ncnancov() as used in pca is unclear to me. Any explanation is much appreciated!
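A side note on the score lines in the snippets above: for a symmetric matrix such as a covariance matrix, eig returns orthonormal eigenvectors, so D/coeff' should agree with the more common form D*coeff. The sketch below checks this on synthetic data (the matrix D here is an assumption, not the contents of Data_Table8p1.mat).

```matlab
rng(3);                                  % reproducible synthetic data (assumption)
D = randn(15, 3);                        % 15 observations, 3 variables

[V, L]   = eig(cov(D));                  % eigenvectors of the symmetric covariance matrix
[~, idx] = sort(diag(L), 'descend');     % order by decreasing eigenvalue
V        = V(:, idx);

orthErr  = norm(V' * V - eye(3), 'fro'); % close to 0 when V is orthonormal
scoreDiv = D / V';                       % form used in the question
scoreMul = D * V;                        % equivalent form for orthonormal V
scoreErr = norm(scoreDiv - scoreMul, 'fro');
```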

### Answers (1)

Divyam
on 18 Jul 2024 at 12:15

Hi Florian, the "pca" and "cov" functions perform mean centering by default, as mentioned in their documentation pages (linked at the top of this page).

The example in the question yields the same coefficients because "cov" mean-centers its input by default, so cov(D) and cov(Dm) give the same matrix and "coeff" matches "coeff_m". To illustrate this, I wrote code that computes the covariance without mean centering and ran it on your data; the coefficients are different in this scenario. The code is added below for your reference:

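% Not using the "cov" function, so no mean centering is applied

N = size(D, 1); % number of observations (rows)

cov_matrix = (1/(N-1)) * (D' * D); % uncentered product with the sample divisor N-1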

[coeffFinal, eigValuesFinal] = eig(cov_matrix);

[eigValuesFinal, idx] = sort(diag(eigValuesFinal), 'descend');

coeffFinal = coeffFinal(:, idx);

Here is the output of the code:
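As a rough cross-check that does not depend on the original data file, the comparison can be sketched on synthetic data (the matrix D below and its column means are assumptions). It illustrates that cov gives the same matrix for raw and centered input, while the uncentered product D'*D/(N-1) has a different spectrum.

```matlab
rng(1);                                   % reproducible synthetic data (assumption)
D  = randn(30, 4) + [10 0 -3 2];          % nonzero column means
Dm = D - mean(D, 1);                      % explicitly centered copy
N  = size(D, 1);

% cov centers internally, so both calls agree up to rounding
covDiff = norm(cov(D) - cov(Dm), 'fro');

% Eigenvalues with and without centering, sorted in descending order
latentCentered   = sort(eig(cov(D)), 'descend');
latentUncentered = sort(eig((D' * D) / (N - 1)), 'descend');

% The uncentered matrix equals cov(D) + N/(N-1) * mu'*mu, so its
% largest eigenvalue cannot be smaller than the centered one
gap = latentUncentered(1) - latentCentered(1);
```

With means this far from zero, the leading uncentered eigenvalue is dominated by the mean vector rather than by the variance of the data.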

##### 4 Comments

Florian Meirer
on 19 Jul 2024 at 16:52

Edited: Florian Meirer
on 19 Jul 2024 at 17:10

Divyam
on 22 Jul 2024 at 4:05

Hi @Florian Meirer, the data used for PCA is very small and sparse (as is evident in your plot), so using the population covariance matrix is not helpful here. You are correct to use a sample covariance matrix. For this specific case, running "pca" with mean centering will unequivocally lead to correct results. In the code you will find that when mean centering is turned on, the sample covariance matrix is used to compute the results, which is exactly what you are doing in your non-"pca" code.

% In "ncnancov"

% Line 542

d = d + centered; % Here d becomes 1 when mean centering is on

% Line 551

c = x'*x/(n-d) % This becomes the result of sample covariance matrix
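The effect of the two quoted lines can be mimicked in a few lines of base MATLAB. This is only a sketch under stated assumptions, not the actual ncnancov source: quotedCov is a hypothetical helper, d is assumed to start at 0, and the input is assumed to have been centered by the caller when centering is on (the thread implies this but does not show it).

```matlab
% Hypothetical stand-in for the quoted lines; NOT the real ncnancov.
% Assumption: d starts at 0, so the divisor is n-1 when centered is true, else n.
quotedCov = @(x, centered) (x' * x) / (size(x, 1) - double(centered));

rng(2);                          % reproducible synthetic data (assumption)
D  = randn(25, 3) + [4 -1 7];
Dm = D - mean(D, 1);             % centering done by the caller (assumption)

cOn  = quotedCov(Dm, true);      % centered input, divisor n-1
cOff = quotedCov(D, false);      % raw input, divisor n

errOn  = norm(cOn - cov(D), 'fro');                   % matches the sample covariance
errOff = norm(cOff - (D' * D) / size(D, 1), 'fro');   % plain x'*x/n
```

Under these assumptions, the centered path reproduces cov(D), while the uncentered path returns x'*x/n, which matches what the question observed for 'Centered','off'.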
