Remove rows with NaNs from 2 variables to do a corrcoef and a polyfit

Hi all! Since polyfit and corrcoef do not remove NaNs themselves, I am trying to remove the NaNs first and then do the correlation.
The situation is the following: I have two variables, one observed and the other forecasted. Both of them have 17 columns (one per forecast time) and approximately 9500 rows.
What I know is that I need to look for a NaN in at least one of them and then remove that row from both.
I tried this (my variables are rh_media_3 and hr_ref2):
for i = 1:size(rh_media_3,2)
    tmp = [rh_media_3(:,i) hr_ref2(:,i)];
    %rh_valid = rh_media_3(~any(isnan(tmp),2),:);
    %hr_ref2_valid = hr_ref2(~any(isnan(tmp),2),:);
    rowsWithNaN = any(isnan(tmp),2);          % rows with a NaN in either column i
    rh_media_3 = rh_media_3(~rowsWithNaN,:);  % drop those rows from both arrays
    hr_ref2 = hr_ref2(~rowsWithNaN,:);
end
for i = 1:17
    tmp = polyfit(rh_media_3(:,i),hr_ref2(:,i),1);   % linear fit per forecast time
    pendiente_hr(i) = tmp(1);                        % slope
    temp = corrcoef(rh_media_3(:,i),hr_ref2(:,i));
    coef_lineal_hr(i) = temp(2,1);                   % correlation coefficient
end
No matter what I do I get coefficients greater than 1, which cannot be right. Can anyone help, please?

Accepted Answer

dpb on 20 Jul 2013
Edited: dpb on 21 Jul 2013
To compute the R-square you've got to compute the two SS terms -- SSe and SSt, the SS error and SS total, respectively.
If you have a fit of y = p(x) from
p = polyfit(x,y,1);
then
yhat = polyval(p,x);          % fit results
ye  = y - yhat;               % residuals (error)
SSe = sum(ye.^2);             % SS error
SSt = (length(y)-1)*var(y);   % SS total
Rsq = 1 - SSe/SSt;            % R-square
Try that instead of comparing the slope and expecting it to be some specific value. (Substitute your appropriate variables for x, y, and p, obviously.)
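Putting that together with the NaN removal, a minimal sketch (assuming the rh_media_3 and hr_ref2 variables from the question; the pairwise masking here is one illustrative way to do it, not the only one):
% Per-column R-square after removing rows that are NaN in either column.
nCols = size(rh_media_3,2);
Rsq   = nan(1,nCols);
for i = 1:nCols
    x  = rh_media_3(:,i);
    y  = hr_ref2(:,i);
    ok = ~isnan(x) & ~isnan(y);      % keep rows valid in both columns
    p    = polyfit(x(ok),y(ok),1);
    yhat = polyval(p,x(ok));
    SSe  = sum((y(ok)-yhat).^2);     % SS error
    SSt  = (sum(ok)-1)*var(y(ok));   % SS total
    Rsq(i) = 1 - SSe/SSt;
end
Note this masks NaNs per column pair, so each forecast time keeps as many rows as possible, rather than deleting a row from every column because of a NaN in one.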
  2 Comments
Tamara Schonholz on 20 Jul 2013
OK, I've tried that, but this works for polyfit. My actual idea is to know how well my predicted data fit the observed data. So I thought that checking the slope of the fitted curve would be a good approach, and I used polyval of order 1 and corrcoef, but I get these values:
For polyval:
Columns 1 through 8
1.0509 1.0172 0.9196 0.8857 0.6510 0.5994 0.6138 0.5386
Columns 9 through 16
0.4974 0.6112 0.5794 0.5496 0.5305 0.5352 0.5463 0.4906
Column 17
0.4343
For corrcoef:
Columns 1 through 8
0.7036 0.6623 0.6100 0.5926 0.7073 0.7714 0.7745 0.7273
Columns 9 through 16
0.6135 0.6178 0.5770 0.5496 0.6900 0.7430 0.7403 0.7015
Column 17
0.5801
And though they needn't be exactly the same, they should be somewhat similar. Then I thought it may have to do with the data.
What do you suggest? Thanks so much for your help.
dpb on 20 Jul 2013
Edited: dpb on 21 Jul 2013
Well, if the predicted were identical to the observed and linear, the slope would be precisely unity, yes. But anything that modifies those conditions will cause that correlation to go away to a greater or lesser degree.
You could, for example, have a perfect prediction but the form of the relationship be a parabola, and the slope of a linear fit be identically zero if the observations were taken at symmetric points in x.
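A quick toy illustration of that point (synthetic data, just a sketch):
% A perfect parabolic relationship sampled at symmetric points in x
% gives a linear-fit slope of (essentially) zero.
x = (-5:5)';
y = x.^2;               % perfect, but quadratic, relationship
p = polyfit(x,y,1);
p(1)                    % slope ~0 despite a perfect (nonlinear) relationship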
As far as interpreting your data above, I don't have a clue what I am looking at for what you label as "polyval of order 1". If they are the slopes of the line, then it seems somewhat peculiar that they decrease nearly uniformly for the higher columns. One would begin to wonder why that is and start looking for systematic errors, other causes, or non-included explanatory variables, perhaps. As I've said repeatedly, plotting the data would undoubtedly reveal any gross anomalies that aren't apparent just from the numbers.
As for the CC's, those do seem to me to be "somehow similar". They range from about 0.6 to 0.8 w/ one or two a little low; hardly an unusual event for experimental data. What it tells me is that there's either quite a lot of scatter in one set of data or the other or the linearity isn't very good. Again, a plot would reveal much methinks.
As for what else one might suggest, one needs far more information on what is measured, the model and such to have any real specific answer to that.
I would be interested to know what a plot looks like and what the R-sq values turn out to be if it is reasonably linear, altho I'm guessing you'll discover that a linear model just isn't very appropriate for the data.
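If it helps, a minimal plotting sketch (again assuming the variables from the question) might look like:
% Scatter of forecast vs. observed with the fitted line, per forecast time.
for i = 1:size(rh_media_3,2)
    x  = rh_media_3(:,i);
    y  = hr_ref2(:,i);
    ok = ~isnan(x) & ~isnan(y);
    p  = polyfit(x(ok),y(ok),1);
    xx = linspace(min(x(ok)),max(x(ok)),2);   % endpoints for the fit line
    figure
    plot(x(ok),y(ok),'.', xx,polyval(p,xx),'-')
    xlabel('observed'), ylabel('forecast')
    title(sprintf('forecast time %d: slope = %.3f',i,p(1)))
end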


More Answers (1)

dpb on 19 Jul 2013
If the concern is over the slopes of the linear fit, there's nothing that limits their magnitude; that'll be wholly dependent upon how the response variable (hr_ref2) varies w/ the independent one (rh_media_3).
If the forecast response is preferentially biased high against the observed value, that's exactly what one would expect.
Plotting the data and the fitted line would probably reveal a lot.
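As a toy illustration of that bias effect (synthetic numbers, not the asker's data):
% A forecast that runs systematically 20% high yields a fitted slope > 1.
obs  = linspace(0,100,500)';            % synthetic observed values
fcst = 1.2*obs + 5*randn(size(obs));    % biased forecast plus noise
p = polyfit(obs,fcst,1);
p(1)                                    % slope ~1.2, legitimately > 1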
  2 Comments
Tamara Schonholz on 19 Jul 2013
Well, the slope of it says what percentage of variance is explained by the curve; that is why it cannot be greater than 1.
dpb on 20 Jul 2013
Edited: dpb on 20 Jul 2013
Say what? The slope of what says anything at all about the variance contributions? The R-square value does (sorta') say that, but certainly not the slope -- all it does is reflect the slope (duh) of the least-squares fit line of the two vectors.
Plot it and see; I'm sure it'll be obvious for your data that the slope is >1 owing to the way the data fall out...now why that is so if it isn't what you think should be is another question entirely.
See Answer section for how to compute R-sq from coefficients and data.
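To make the slope-vs-R-square distinction concrete, a small synthetic sketch:
% Slope and R-square measure different things: a nearly perfect linear
% relation with slope 3 has a slope well above 1 but an R-square near 1.
x = (1:100)';
y = 3*x + randn(size(x));
p = polyfit(x,y,1);          % p(1) ~ 3
r = corrcoef(x,y);
Rsq = r(1,2)^2;              % ~1: almost all variance explained
fprintf('slope = %.3f, R^2 = %.4f\n', p(1), Rsq)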

