Asked by Tamara Schonholz
on 19 Jul 2013

Hi all! Since polyfit and corrcoef do not remove NaNs, I am trying to remove the NaNs first and then do the correlation.

The situation is the following: I have two variables, one observed and one forecasted. Both have 17 columns (one per forecast time) and approximately 9500 rows.

What I know is that I need to check for NaNs in at least one of them and then remove that row.

I tried this (my variables are rh_media_3 and hr_ref2):

```matlab
for i = 1:size(rh_media_3,2)
    tmp = [rh_media_3(:,i) hr_ref2(:,i)];
    %rh_valid = rh_media_3(~any(isnan(tmp),2),:);
    %hr_ref2_valid = hr_ref2(~any(isnan(tmp),2),:);
    rowsWithNaN = any(isnan(tmp),2);
    rh_media_3 = rh_media_3(~rowsWithNaN,:);
    hr_ref2 = hr_ref2(~rowsWithNaN,:);
end
```
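(As a side note, the column-by-column loop can be collapsed into one vectorized step; a minimal sketch, assuming rh_media_3 and hr_ref2 have the same number of rows:)

```matlab
% Flag rows containing a NaN anywhere in either matrix, then drop them
bad = any(isnan(rh_media_3),2) | any(isnan(hr_ref2),2);
rh_media_3 = rh_media_3(~bad,:);
hr_ref2    = hr_ref2(~bad,:);
```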

```matlab
for i = 1:17
    tmp = polyfit(rh_media_3(:,i),hr_ref2(:,i),1);
    pendiente_hr(i) = tmp(1);
    temp = corrcoef(rh_media_3(:,i),hr_ref2(:,i));
    coef_lineal_hr(i) = temp(2,1);
end
```

No matter what I do I get coefficients greater than 1, which cannot be right. Can anyone help, please?


Answer by dpb
on 20 Jul 2013

Edited by dpb
on 21 Jul 2013

Accepted answer

To compute the R-square you've got to compute the two SS terms -- SSe and SSt, the SSerror and SStotal, respectively.

If you have a fit of y=p(x), from

```matlab
p = polyfit(x,y,1);
```

then

```matlab
yhat = polyval(p,x);           % fit results
ye   = y - yhat;               % residuals (error)
SSe  = sum(ye.^2);             % SS error
SSt  = (length(y)-1)*var(y);   % SS total
Rsq  = 1 - SSe/SSt;            % R-square
```

Try that instead of comparing the slope and expecting it to be some specific value. (Substitute your appropriate variables for x, y, and p, obviously.)
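Put together for the per-column setup in the question (a sketch using the question's variable names; the loop bound assumes the NaN rows have already been removed):

```matlab
Rsq = zeros(1,size(rh_media_3,2));
for i = 1:size(rh_media_3,2)
    x = rh_media_3(:,i);
    y = hr_ref2(:,i);
    p = polyfit(x,y,1);           % linear fit for this column
    yhat = polyval(p,x);          % fitted values
    SSe = sum((y-yhat).^2);       % SS error
    SSt = (length(y)-1)*var(y);   % SS total
    Rsq(i) = 1 - SSe/SSt;         % R-square per forecast time
end
```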

Tamara Schonholz
on 20 Jul 2013

OK, I've tried that, but that works for polyfit. My actual idea is to know how well my predicted data fit the observed data. So I thought that checking the slope of the fitted curve would be a great idea, and so I used polyval of order 1 and corrcoef, but I get these values.

For polyval:

```
Columns 1 through 8
  1.0509  1.0172  0.9196  0.8857  0.6510  0.5994  0.6138  0.5386
Columns 9 through 16
  0.4974  0.6112  0.5794  0.5496  0.5305  0.5352  0.5463  0.4906
Column 17
  0.4343
```

For corrcoef:

```
Columns 1 through 8
  0.7036  0.6623  0.6100  0.5926  0.7073  0.7714  0.7745  0.7273
Columns 9 through 16
  0.6135  0.6178  0.5770  0.5496  0.6900  0.7430  0.7403  0.7015
Column 17
  0.5801
```

And though they mustn't be exactly the same, they should be somehow similar. Then I thought it may have to do with the data. What do you suggest? Thanks so much for your help.

dpb
on 20 Jul 2013

Well, if the predicted were identical to the observed *and* linear, the slope would be precisely unity, yes. But anything that modifies those conditions will cause that correlation to go away to a greater or lesser degree.

You could, for example, have a perfect prediction but the form of the relationship be a parabola, and the slope of a linear fit would be identically zero if the observations were taken at symmetric points in x.
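A tiny illustration of that point (made-up symmetric data, not the question's variables):

```matlab
x = (-5:5)';          % observations at symmetric points in x
y = x.^2;             % a perfect parabolic relationship
p = polyfit(x,y,1);   % p(1), the slope, is identically zero here
```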

As far as interpreting your data above, I don't have a clue what I am looking at for what you label as "polyval of order 1". If they are the slopes of the lines, then it seems somewhat peculiar that they decrease nearly uniformly for the higher columns. One would begin to wonder why that is and start looking for systematic errors, other causes, or non-included explanatory variables, perhaps. As I've said repeatedly, plotting the data would undoubtedly reveal any gross anomalies that aren't apparent from the numbers alone.

As for the CC's, those do seem to me to be "somehow similar". They range from about 0.6 to 0.8 w/ one or two a little low; hardly an unusual event for experimental data. What it tells me is that there's either quite a lot of scatter in one set of data or the other, or the linearity isn't very good. Again, a plot would reveal much, methinks.

As for what else one might suggest, one needs far more information on what is measured, the model and such to have any real specific answer to that.

I would be interested to know what a plot looks like and what the R-sq values turn out to be if it is reasonably linear, although I'm guessing you'll discover that a linear model just isn't very appropriate for the data.

Answer by dpb
on 19 Jul 2013

If the concern is over the slopes of the linear fit, nothing limits their magnitude; that'll be wholly dependent upon the way the response variable (hr_ref2) varies w/ the independent one (rh_media_3).

If the forecast response is preferentially biased high against the observed value, that's exactly what one would expect.

Plotting the data and the fitted line would probably reveal a lot.

Tamara Schonholz
on 19 Jul 2013

dpb
on 20 Jul 2013

Say what? The slope of what says anything at all about the variance contributions? The R-square value does (sorta') say that, but certainly not the slope--all it does is reflect the slope (duh) of the least-squares fit line of the two vectors.

Plot it and see; I'm sure it'll be obvious for your data that the slope is >1 owing to the way the data fall out...now *why* that is so if it isn't what you think should be is another question entirely.
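For reference, in a simple linear fit the slope and the correlation coefficient are tied together as slope = r*std(y)/std(x), so a slope above 1 only says the spread of y is large relative to x; a sketch with made-up data (not the question's variables):

```matlab
x = randn(1000,1);
y = 0.6*x + 2*randn(1000,1);          % made-up data: y noisier than x
p = polyfit(x,y,1);                   % fitted slope is p(1)
r = corrcoef(x,y);                    % correlation matrix
slope_from_r = r(2,1)*std(y)/std(x);  % matches p(1) up to rounding
```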

See Answer section for how to compute R-sq from coefficients and data.

