Line of Best Fit through Scattered Data

I need to find the line of best fit through my scatterplot. I have attached my text file, and my code is the following.
clear all
fid = fopen( 'oo20.txt');
data = textscan(fid, '%f%f', 'Delimiter', '|', 'TreatAsEmpty','~');
fclose(fid);
GalList.year = data{1};
D = data{2};
X1 = GalList.year;
Y1 = D;
scatter(X1,Y1);
ylim([0 20])
  3 Comments
jgillis16 on 11 Oct 2015
Thanks for the reminder, Star.
X1 represents the year of each plotted event, while Y1 represents the distance at which the event occurs. I just wanted to see whether I could derive some sort of fit from the data, rather than stopping at a messy scatterplot.
Star Strider on 11 Oct 2015
I just peeked at it and it seems reasonable to delete this one:
~|3.4028886
Does it have any specific significance?


Accepted Answer

Star Strider on 11 Oct 2015
I don’t see anything strange about the data, but the regression is failing. The data are both column vectors. With the full set of data, the parameters I estimate are both zero for the biparametric regression, and for the uniparametric (origin intercept) regression, the single parameter is zero. Trying it with polyfit results in both parameters being NaN, so it’s not my code. (I deleted the polyfit calls in the posted code.) It’s not obvious to me what problems there may be, but with three attempts failing, something is very wrong somewhere.
Of interest, everything works fine with a random sample of 280 data pairs, giving an intercept of 31 BCE (so that’s when astronomy began!), and a slope of +0.022 (is that Galaxies Discovered/Year?). Any more than 280 breaks the code for some reason.
I don’t believe the duplicated years should cause problems, since linear regression is usually robust to such. If you have any insights as to what the problem may be with your full data set, please share them. You know them better than I do, and what they should look like.
I plotted a linear fit tonight. Are there other models, more descriptive of whatever you’re observing, that you’d like to try?
This is the end of my day, so I’ll come back to this in the morning.
My code:
fidi = fopen('jgillis16 oo20.txt', 'rt');
D = textscan(fidi, '%f|%f', 'CollectOutput',1, 'TreatAsEmpty','~');
fclose(fidi); % Close the file once it has been read
X1 = D{1}(:,1);
Y1 = D{1}(:,2);
RandRows = randi(length(X1), 280, 1); % 280 row indices drawn at random (with replacement)
X1 = X1(RandRows); % Hypothesis: Works With Random Subset -> Accepted
Y1 = Y1(RandRows);
DesignMtx = [ones(size(X1)) X1]; % MODEL: DesignMtx*B2 = Y1
B2 = DesignMtx\Y1; % Linear Biparametric Regression: Estimate Parameters [intercept; slope]
Yhat2 = DesignMtx*B2; % Linear Biparametric Regression: Generate Line
B1 = X1\Y1; % Linear Uniparametric Regression: Estimate Parameter (slope through origin)
Yhat1 = X1*B1; % Linear Uniparametric Regression: Generate Line
XTX = (DesignMtx'*DesignMtx); % X'X (not used below)
figure(1)
scatter(X1, Y1, 'bp')
hold on
plot(X1, Yhat2, '-r')
hold off
grid
xlabel('Year')
ylabel('Distance (Parsecs)')
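One possible explanation for the failure on the full data set, not confirmed anywhere in this thread: with 'TreatAsEmpty','~', textscan returns NaN for the `~` entry, and a single NaN in X1 or Y1 would be consistent with polyfit returning NaN for both parameters. A random subset of 280 rows could simply miss that row. The sketch below uses made-up values (only the `~|3.4028886` row comes from the thread) and drops incomplete rows before the same backslash regression. The names `ok`, `Xc` and `Yc` are illustrative.

```matlab
% Synthetic stand-in for the real file (values are made up for illustration)
X1 = [1990; 1995; NaN; 2005; 2010];   % NaN mimics the '~' entry read with 'TreatAsEmpty'
Y1 = [3.1; 4.0; 3.4028886; 6.2; 7.1];

ok = ~isnan(X1) & ~isnan(Y1);         % keep only rows where both values are present
Xc = X1(ok);
Yc = Y1(ok);

DesignMtx = [ones(size(Xc)) Xc];      % columns: intercept term, year
B2 = DesignMtx \ Yc;                  % least-squares estimate [intercept; slope]
Yhat2 = DesignMtx * B2;               % fitted values on the cleaned data
```

If this is indeed the cause, running `nnz(isnan(X1))` on the full data should report at least one missing year.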
  3 Comments
jgillis16 on 12 Oct 2015
Yes, it's nearly impossible trying to get anything out of this. I am going to try a different approach to this set of data. Thanks regardless, Star!
Star Strider on 12 Oct 2015
My pleasure!
Extrapolating back to the x-intercept, the first galaxy discovered was in January 1409. It was undoubtedly the Milky Way, because it has a distance of zero (we’re in it).
Having fun with the numbers...


More Answers (2)

Image Analyst on 11 Oct 2015
There are other more sophisticated methods, but try polyfit() and polyval(). See attached demo.
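The attached demo is not reproduced here. As a rough sketch of the polyfit() and polyval() approach, using synthetic data rather than the poster's file, a first-degree fit could look like this. Centering the years is my own addition, made because raw year values near 2000 can leave the fit poorly conditioned.

```matlab
% Synthetic data for illustration only (not the poster's file)
x = (2000:2009)';
y = 0.5*(x - 2000) + 2;     % exact line: slope 0.5, value 2 at year 2000

xc = x - mean(x);           % centered years (an assumption of this sketch)
p = polyfit(xc, y, 1);      % p(1) = slope, p(2) = fitted value at mean(x)
yfit = polyval(p, xc);      % evaluate the fitted line at the data points
```

Calling `plot(x, y, 'bp', x, yfit, '-r')` afterwards would overlay the fitted line on the points. As with any such fit, rows containing NaN would need to be removed first.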

Matt J on 11 Oct 2015
Edited: Matt J on 11 Oct 2015
The following attempts to fit the data to an equation of the form A*X1+B*Y1=C:
X1=X1(:);
Y1=Y1(:);
e=-ones(length(X1),1);
[~,~,V]=svd( [X1,Y1,e], 0);
ABC=V(:,end);
A=ABC(1);
B=ABC(2);
C=ABC(3);
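To relate this to the slope/intercept form used elsewhere in the thread: the last column of V minimizes the sum of squares of A*X1+B*Y1-C subject to [A;B;C] having unit norm, and when B is not close to zero the line can be rewritten as Y1 = (-A/B)*X1 + C/B. A small sketch on synthetic, NaN-free data (the variables `slope` and `intercept` are my own names):

```matlab
% Synthetic, NaN-free data on the line y = 2*x + 1 (illustration only)
X1 = (0:5)';
Y1 = 2*X1 + 1;

e = -ones(length(X1),1);
[~,~,V] = svd([X1, Y1, e], 0);
ABC = V(:,end);             % right singular vector for the smallest singular value
A = ABC(1);
B = ABC(2);
C = ABC(3);

slope = -A/B;               % meaningful only when B is not (near) zero
intercept = C/B;
```

Because [A;B;C] is only determined up to sign and scale, the ratios are what carry the information. A near-vertical line shows up as B close to zero, a case that the slope/intercept form cannot represent.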
