How to calculate R^2 using 1 - (SSR/SST) for a fitted normal distribution?

Hello, I have used the fitlm function to find R^2 (see below), to see how good of a fit the normal distribution is to the actual data. The answer is 0.9172.
How can I manually calculate R^2?
R^2 = 1 - (SSR/SST), or in other words 1 - sum((predicted - actual).^2) / sum((actual - mean(actual)).^2). I am having a hard time getting the correct answer.
Table = readtable("practice3.xlsx");
actual_values = Table.values;
actual_values = sort(actual_values);
normalfit = fitdist(actual_values,'Normal'); % fit the normal distribution to the data
cdfplot(actual_values); % Plot the empirical CDF
x = 0:2310;
hold on
plot(x, cdf(normalfit, x), 'Color', 'r') % plot the normal distribution
hold off
grid on
nonExceedanceProb = sum(actual_values'<=actual_values,2)/numel(actual_values);
Table.nonExceedanceProb=nonExceedanceProb;
mdl=fitlm(cdf(normalfit, actual_values),Table.nonExceedanceProb);
mdl.Rsquared.Ordinary % R^2
ans = 0.9172
mdl.SSR
ans = 0.7567
mdl.SST
ans = 0.8250
% How can I manually calculate R^2 (or SSR and SST)?
% SSR = sum(((predicted data - actual data).^2))
% TSS = sum((actual data - mean(actual data)).^2)
% Rsquared = 1 - SSR/TSS

 Accepted Answer

In my opinion, it does not make sense to fit a linear function to the value pairs (cdf(normalfit, actual_values),Table.nonExceedanceProb) as you do above.
In principle, the blue points below should lie on the red line. This would mean that the empirical cdf is perfectly reproduced by the normal distribution.
So if you really want to compare the two distributions, you should consider the distance of the blue points (achieved quality of fit) to the red line (perfect fit).
Table = readtable("practice3.xlsx");
actual_values = Table.values;
actual_values = sort(actual_values);
normalfit = fitdist(actual_values,'Normal'); % fit the normal distribution to the data
nonExceedanceProb = sum(actual_values'<=actual_values,2)/numel(actual_values);
hold on
plot(nonExceedanceProb,cdf(normalfit, actual_values),'o')
plot([0 1],[0 1])
xlabel('P(empirical)')
ylabel('P(normal)')
hold off
grid on
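
One way to put a number on the distance between the blue points and the 1:1 line is sketched below. It is only an illustration under stated assumptions: the data are synthetic stand-ins for practice3.xlsx, the normal CDF is written out with erfc so that no toolbox is needed, and the two summary measures (maximum and root-mean-square vertical deviation) are choices made here, not something prescribed in the thread.

```matlab
% Synthetic stand-in for Table.values (assumption: the real file is not available here)
rng(0);
actual_values = sort(1000 + 400*randn(200,1));
n = numel(actual_values);

% Normal fit via sample mean and standard deviation
% (assumed to match what fitdist(...,'Normal') estimates)
mu = mean(actual_values);
sigma = std(actual_values);

% Normal CDF at the data points, written with erfc (base MATLAB)
p_normal = 0.5*erfc(-(actual_values - mu)/(sigma*sqrt(2)));

% Empirical non-exceedance probability, same expression as in the thread
p_empirical = sum(actual_values' <= actual_values, 2)/n;

% Vertical distance of each point from the perfect-fit line y = x
d = p_normal - p_empirical;
max_abs_dev = max(abs(d));      % largest deviation
rms_dev = sqrt(mean(d.^2));     % root-mean-square deviation

fprintf('max |deviation| = %.4f, RMS deviation = %.4f\n', max_abs_dev, rms_dev);
```

Smaller values mean the points lie closer to the line. The maximum deviation is similar in spirit to the Kolmogorov-Smirnov statistic (see kstest in the Statistics and Machine Learning Toolbox), although the two are not computed identically.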

12 Comments

Ok, that makes sense. A linear function should not be fitted to the value pairs because they do not follow a linear pattern. So how would I compare the blue points to the red line in the original plot? I'm able to do it for the linear plot as seen here, and I got 0.8408. But I think I understand that it would be more appropriate to compare the blue points to the red line in the second plot (the last one). I am getting stuck on what to input for "predicted".
Table = readtable("practice3.xlsx");
actual_values = Table.values;
actual_values = sort(actual_values);
normalfit = fitdist(actual_values,'Normal'); % fit the normal distribution to the data
nonExceedanceProb = sum(actual_values'<=actual_values,2)/numel(actual_values);
hold on
plot(nonExceedanceProb,cdf(normalfit, actual_values),'o')
plot([0 1],[0 1])
xlabel('P(empirical)')
ylabel('P(normal)')
hold off
grid on
predicted = cdf(normalfit, actual_values);
SSR = sum(((predicted - nonExceedanceProb).^2));
TSS = sum((nonExceedanceProb - mean(nonExceedanceProb)).^2);
Rsquared = 1 - SSR/TSS
Rsquared = 0.8408
cdfplot(actual_values);
x = 0:2310;
hold on
plot(x, cdf(normalfit, x), 'Color', 'r')
hold off
grid on
%So how would I input the red data into my R^2 formula?
%predicted_2 = ?????
%SSR = sum(((predicted_2 - nonExceedanceProb).^2));
%TSS = sum((nonExceedanceProb - mean(nonExceedanceProb)).^2);
%Rsquared = 1 - SSR/TSS
I can't understand why you calculate an R^2 for the problem above. You have two independent curves - an empirical cdf and normal cdf, both based on a common data vector "actual_values". You don't make any regression here. So in my opinion, an R^2 is inappropriate.
I think here you can get a good start on how it can be seen which distribution fits your data best:
R^2 is not mentioned.
I intended to fit a normal distribution to the data. The plot is meant to display a visual goodness of fit between empirical data and the distribution, and now I am trying to quantitatively assess the goodness of fit by computing R^2. (Which I will repeat for gamma, weibull, and other fitted distributions to see which distribution fits the data the best).
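
If the same number has to be computed for several candidate distributions, the loop below is a minimal sketch of how that could be organised. Assumptions: the data are synthetic and positive (so that Gamma and Weibull can be fitted), the Statistics and Machine Learning Toolbox is installed for fitdist and cdf, and the R^2 is the 1 - SS_res/SS_tot form already used in this thread. Whether R^2 is a suitable measure for this purpose is a separate question.

```matlab
% Synthetic positive data standing in for Table.values (assumption)
rng(0);
actual_values = sort(1000 + 200*randn(200,1));

% Observed non-exceedance probabilities, same expression as in the thread
yi = sum(actual_values' <= actual_values, 2)/numel(actual_values);
SS_tot = sum((yi - mean(yi)).^2);

% Candidate distributions (names accepted by fitdist)
dists = {'Normal', 'Gamma', 'Weibull'};
Rsq = zeros(size(dists));

for k = 1:numel(dists)
    pd = fitdist(actual_values, dists{k});   % fit the distribution
    fi = cdf(pd, actual_values);             % predicted probabilities
    Rsq(k) = 1 - sum((yi - fi).^2)/SS_tot;   % 1 - SS_res/SS_tot
    fprintf('%-8s R^2 = %.4f\n', dists{k}, Rsq(k));
end
```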
As you can see from the link I gave, there are measures to compare goodness of fit for distributions, but R^2 is definitely no such measure.
But just speaking in terms of the code, since I am instructed to calculate an R^2, do you know how I can calculate R^2 for these two lines anyways?
SSR = sum(((predicted data - actual data).^2));
TSS = sum((actual data - mean(actual data)).^2);
Rsquared = 1 - SSR/TSS
I am stuck on how to properly assign predicted data and actual data.
I have the same problem. Actual data are only your array "actual_values". Both the empirical CDF as well as the normal CDF are independently deduced curves from these data. So both of them can be seen as some kind of "prediction data".
I cannot help you further because an R^2 used in regression statistics is inappropriate here.
Here is another discussion of the problem:
It explains how an R^2 can be derived by comparing the empirical and continuous PDFs, not the CDFs.
Ok, thank you for your help, I may be missing something here so I will go back and make sure I am creating the appropriate plots.
I don't think the empirical CDF is a prediction curve. The x-values are the "actual_values" and the y-values are what percent of the actual_values fall below a certain x-value.
I've been advised by my professor to compute nonexceedance probability of the actual data and from there calculate R^2 between the observed nonexceedance probabilities and the predicted probabilities (predicted by normal fit).
Although it is very doubtful whether this makes sense, this should be what your professor has in mind (and what you already did above):
Table = readtable("practice3.xlsx");
actual_values = Table.values;
actual_values = sort(actual_values);
normalfit = fitdist(actual_values,'Normal'); % fit the normal distribution to the data
yi = sum(actual_values'<=actual_values,2)/numel(actual_values);
fi = cdf(normalfit, actual_values);
ybar = mean(yi);
SS_res = sum((yi-fi).^2);
SS_tot = sum((yi-ybar).^2);
Rsquared = 1 - SS_res/SS_tot
Rsquared = 0.8408
And of course an empirical CDF is a prediction curve because empirical data are transformed to probabilities for events to happen.
My assignment has now been graded. He said it was good, and that it was one way of calculating one type of R^2. He added that another kind of R^2 could be computed as follows:
Rsquared = corr(actual_values,normalfit)^2;
So you mean:
Table = readtable("practice3.xlsx");
actual_values = Table.values;
actual_values = sort(actual_values);
normalfit = fitdist(actual_values,'Normal'); % fit the normal distribution to the data
yi = sum(actual_values'<=actual_values,2)/numel(actual_values);
fi = cdf(normalfit, actual_values);
Rsquared1 = corr(yi,fi)^2
Rsquared1 = 0.9172
ybar = mean(yi);
SS_res = sum((yi-fi).^2);
SS_tot = sum((yi-ybar).^2);
Rsquared2 = 1 - SS_res/SS_tot
Rsquared2 = 0.8408
?
Yes, he said Rsquared1 is the "Pearson correlation coefficient" and Rsquared2 is the "coefficient of determination".
corr(yi,fi) is the Pearson correlation coefficient - I don't know why he wanted to square it.
Anyway: congratulations that you finished your assignment successfully.
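
For readers wondering where the two values come from, the sketch below reproduces both by hand. It rests on assumptions: synthetic data replace practice3.xlsx, the normal CDF is written with erfc, and polyfit stands in for fitlm (both fit a least-squares line with intercept). Note that in a fitlm model, SSR is the regression sum of squares and SSE is the residual sum of squares, so the values posted in the question give R^2 = SSR/SST = 0.7567/0.8250, which is about 0.9172, not 1 - SSR/SST.

```matlab
% Synthetic stand-in for Table.values (assumption)
rng(0);
actual_values = sort(1000 + 400*randn(200,1));
mu = mean(actual_values);
sigma = std(actual_values);

fi = 0.5*erfc(-(actual_values - mu)/(sigma*sqrt(2)));               % normal CDF (predicted)
yi = sum(actual_values' <= actual_values, 2)/numel(actual_values);  % empirical (observed)

% Least-squares line yi ~ b0 + b1*fi, the model that fitlm(fi, yi) fits
b = polyfit(fi, yi, 1);
yhat = polyval(b, fi);

SSE = sum((yi - yhat).^2);        % residual sum of squares
SSR = sum((yhat - mean(yi)).^2);  % regression sum of squares
SST = sum((yi - mean(yi)).^2);    % total sum of squares

R2_fit = 1 - SSE/SST;             % equals SSR/SST for a fit with intercept
c = corrcoef(yi, fi);
R2_corr = c(1,2)^2;               % squared Pearson correlation, same value as R2_fit

% R^2 measured against the 1:1 line instead of the fitted line
R2_identity = 1 - sum((yi - fi).^2)/SST;

fprintf('R2_fit = %.4f, R2_corr = %.4f, R2_identity = %.4f\n', R2_fit, R2_corr, R2_identity);
```

The squared correlation equals the R^2 of the fitted line, which is why Rsquared1 matches the fitlm result. R2_identity can never exceed R2_fit, because the fitted line minimises the residual sum of squares over all straight lines, including y = x.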

More Answers (0)

Release

R2022b

Asked:

on 15 Feb 2023

Commented:

on 21 Feb 2023
