How do I force lsqcurvefit to overfit data?
Hi there, I am currently trying to illustrate overfitting.
parabel = @(p,x) p(1) * (x-p(2)).^2 +p(3)
polynom = @(p,x) p(1)*x.^7 + p(2)*x.^6 + p(3)*x.^5 + p(4)*x.^4 +p(4)
x = [0:0.2:2];
y = parabel([1,0,0],x) + rand(1,length(x));
%draw true model
tempX = -10:0.1:10;
plot(tempX, parabel([1,0,0],tempX), ':'); hold on
%draw model points
plot(x,y,'x');
%estimate polynomial model
p = lsqcurvefit(polynom,[0,0,0,0,1],x,y);
plot(tempX,polynom(p,tempX));
legend('true model','data','fitted model')
axis([-10,10,-3,10])
To do so, I created a data set from a quadratic function and added some noise to it. Then I tried to overfit the data by fitting a much higher-order polynomial model. Unfortunately, the algorithm seems to suppress the noise and gives me an almost smooth function. The overfitting only becomes visible when the model extrapolates; within the data range the effect is almost invisible.
[Attached images: overview.jpg, zoom.jpg]
The point I want to make is that a higher-order model might look better yet fail on other data sets, even when the data ranges are the same. Do you have any suggestions on how to force the algorithm to overfit within the data range?
Thank you!
Answers (1)
John D'Errico on 15 Dec 2016
Edited: John D'Errico on 15 Dec 2016
No. You misunderstand the concept of overfitting. You cannot stamp your feet and insist that lsqcurvefit try harder. Overfitting is not something that comes from the fitting algorithm, but an issue of model choice.
Worse, why are you using lsqcurvefit to estimate a model that is fully linear in the parameters? Teach people to use the appropriate tool for the problem.
The example that you used is a poor one anyway, with relatively low noise compared to the signal. (And why in god's name would you add uniform, therefore biased, noise to create an example? Especially if you are trying to teach someone about modeling?)
I'm also disturbed by your choice of model, with these terms in it:
... + p(4)*x.^4 + p(4)
Note that p(4) is used both for the constant term and for the 4th-order coefficient. Expect arbitrarily meaningless results from that model! Yes, it may fit, but not for any rational, predictable reason, beyond the fact that the model has enough degrees of freedom to describe some rather simple behavior.
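For reference, a corrected version of that model would give the constant term its own parameter (a sketch; note that the coefficient vector, and therefore the initial guess, then needs five elements):

```matlab
% Corrected model: p(5) is now the constant term, so all slots are distinct.
polynom = @(p,x) p(1)*x.^7 + p(2)*x.^6 + p(3)*x.^5 + p(4)*x.^4 + p(5);
p = lsqcurvefit(polynom, [0,0,0,0,1], x, y);  % five-element starting guess
```

Even so, this remains an odd basis (x^7 down to x^4 plus a constant, skipping the middle powers), which is part of why the fitted combination behaves strangely.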
Of course you get crapola when you extrapolate ANY sufficiently high order polynomial model beyond the support of the data. This is completely expected.
My point in all of this is, IF you really wanted to give an example of overfitting, you are going about it in completely the wrong way.
x = -5:5;
xint = linspace(min(x),max(x),250);
y = x.^3 + x.^2 + 2*x + 3 + randn(size(x))*5;
p3 = polyfit(x,y,3);
p10 = polyfit(x,y,10);
plot(x,y,'bo',xint,polyval(p3,xint),'-g',xint,polyval(p10,xint),'--r')

Thus, in green is an appropriate model. (By the way, the green fitted curve looks to be pretty darn close to the original curve.) In red, the overfitted example. See that it does crap when interpolating. It follows all of the noise. Of course it will be even more obscene when extrapolating, but then what do you expect from ANY polynomial, and certainly from a high order one?
This is an example of overfitting. See that I never had to use anything beyond polyfit. Also see that lsqcurvefit was not needed, but that it should generate the same results. Overfitting has nothing to do with the fitting tool, ONLY with the model chosen.
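To see why lsqcurvefit is unnecessary here, note that a polynomial fit is an ordinary linear least-squares problem, so polyfit and a direct backslash solve on a design matrix agree. A quick sketch (re-using the same kind of data as above; the matrix A is my own construction):

```matlab
% A cubic fit is linear in its coefficients, so polyfit and a plain
% linear least-squares solve produce the same answer.
x = (-5:5)';
y = x.^3 + x.^2 + 2*x + 3 + randn(size(x))*5;
p3 = polyfit(x, y, 3);                % polyfit's solution, descending powers
A  = [x.^3, x.^2, x, ones(size(x))];  % design matrix in the same order
b3 = (A \ y)';                        % direct least-squares solve
% max(abs(p3 - b3)) is down at numerical-noise level
```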
Let me come at this from the other direction. (I spent almost 30 years working with scientists, engineers, analysts, etc., for a major corporation. They would come to me with a modeling problem, and my job was to help them solve it.)
Suppose you came to me with the data in the plot I generated. First of all, I would endeavor to find out whether the bumps in the curve were real. Are they noise? Or are they important behavior that you desperately need to model? Things like that would change my approach. If those bumps were real, though, the very first thing I'd ask is why you did not generate MORE data! A curve that predicts real bumps is NOT overfitting the problem. It is just doing the job it was designed to do.
2 Comments
John D'Errico on 15 Dec 2016
Edited: John D'Errico on 15 Dec 2016
Your original question was a case where the chosen polynomial model had too few parameters. With only 4 parameters there, even though they were terribly wrong terms to use in that model, the solver was able to find a linear combination of them that is, in fact, not a poor approximation to the underlying function. It still did obscene things away from the support of the data, as expected. Given the exact coefficients from that fit, an interesting thing to show might be how the chosen linear combination manages to approximate those lower-order terms.
A regularizer changes the problem. You can view it as effectively adding data points (information) to the problem, while still using the same underlying tool as a solver. Anyway, lsqcurvefit has no regularization capabilities.
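To illustrate what a regularizer does (a sketch only — lsqcurvefit itself offers no such option, and the penalty weight lambda here is a hand-picked assumption), compare an unpenalized degree-10 fit with a ridge-penalized one built from the normal equations:

```matlab
% Ridge regression vs. ordinary least squares on a degree-10 polynomial.
x = (-5:5)';
y = x.^3 + x.^2 + 2*x + 3 + randn(size(x))*5;
A = x.^(10:-1:0);      % degree-10 design matrix (implicit expansion, R2016b+)
lambda = 1;            % ridge penalty weight (assumed, hand-picked)
pOLS   = A \ y;                             % unpenalized: interpolates the noise
pRidge = (A'*A + lambda*eye(11)) \ (A'*y);  % penalized: shrinks the coefficients
% norm(pRidge) < norm(pOLS): the penalty suppresses the wild wiggles
```

The penalty term lambda*eye(11) acts like extra pseudo-observations pulling every coefficient toward zero, which is the sense in which a regularizer "adds information" to the problem.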
The example you chose in your comment is indeed an overfit. It shows the classic overfitting behavior, like fitting every bump, while doing nasty stuff in between the points.