How do I force lsqcurvefit to overfit data?
Hi there, I am currently trying to illustrate overfitting.
parabel = @(p,x) p(1) * (x-p(2)).^2 +p(3)
polynom = @(p,x) p(1)*x.^7 + p(2)*x.^6 + p(3)*x.^5 + p(4)*x.^4 +p(4)
x = [0:0.2:2];
y = parabel([1,0,0],x) + rand(1,length(x));
%draw true model
tempX = -10:0.1:10;
plot(tempX, parabel([1,0,0],tempX), ':'); hold on
%draw model points
plot(x,y,'x');
%estimate polynomial model
p = lsqcurvefit(polynom,[0,0,0,0,1],x,y);
plot(tempX,polynom(p,tempX));
legend('true model','data','fitted model')
axis([-10,10,-3,10])
To do so, I created a data set from a quadratic function and added some noise to it. Then I tried to overfit the data by fitting a much higher-order polynomial model. Unfortunately, the algorithm seems to suppress the noise and gives me an almost smooth function. The overfitting only becomes visible when the model extrapolates; within the data range the effect is almost invisible.
[Attached images: overview.jpg, zoom.jpg]
The point I want to make is that a higher-order model might look better yet fail on other data sets, even when the data ranges are the same. Do you have any suggestions on how to force the algorithm to overfit within the data range?
Thank you!
Answers (1)
John D'Errico on 15 Dec 2016
Edited: John D'Errico on 15 Dec 2016
No. You misunderstand the concept of overfitting. You cannot stamp your feet and insist that lsqcurvefit try harder. Overfitting is not something that comes from the fitting algorithm, but an issue of model choice.
Worse, why are you using lsqcurvefit to estimate a model that is fully linear in the parameters? Teach people to use the appropriate tool for the problem.
The example that you used is a poor one anyway, with relatively low noise compared to the signal. (And why in god's name would you add uniform, therefore biased, noise to create an example? Especially if you are trying to teach someone about modeling?)
I'm also disturbed by your choice of model, with these terms in it:
... + p(4)*x.^4 + p(4)
Note that p(4) is used both for the constant term and for the 4th-order coefficient. Expect arbitrarily meaningless results from that model! Yes, it may fit, but not for any rational, predictable reason, beyond the fact that the model has enough degrees of freedom to describe some rather simple behavior.
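For reference, a corrected version of that model would give the constant term its own parameter (a sketch; note that the coefficient vector, and therefore the initial guess, then needs five elements):

```matlab
% Corrected model: p(5) is now the constant term, so all slots are distinct.
polynom = @(p,x) p(1)*x.^7 + p(2)*x.^6 + p(3)*x.^5 + p(4)*x.^4 + p(5);
p = lsqcurvefit(polynom, [0,0,0,0,1], x, y);  % five-element starting guess
```

Even so, this remains an odd basis (x^7 down to x^4 plus a constant, skipping the middle powers), which is part of why the fitted combination behaves strangely.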
Of course you get crapola when you extrapolate ANY sufficiently high order polynomial model beyond the support of the data. This is completely expected.
My point in all of this is, IF you really wanted to give an example of overfitting, you are going about it in completely the wrong way.
x = -5:5;
xint = linspace(min(x),max(x),250);
y = x.^3 + x.^2 + 2*x + 3 + randn(size(x))*5;
p3 = polyfit(x,y,3);
p10 = polyfit(x,y,10);
plot(x,y,'bo',xint,polyval(p3,xint),'-g',xint,polyval(p10,xint),'--r')

Thus, in green is an appropriate model. (By the way, the green fitted curve looks to be pretty darn close to the original curve.) In red, the overfitted example. See that it does crap when interpolating. It follows all of the noise. Of course it will be even more obscene when extrapolating, but then what do you expect from ANY polynomial, and certainly from a high order one?
This is an example of overfitting. See that I never had to use anything beyond polyfit. Also see that lsqcurvefit was not needed, but that it should generate the same results. Overfitting has nothing to do with the fitting tool, ONLY with the model chosen.
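To see why lsqcurvefit is unnecessary here, note that a polynomial fit is an ordinary linear least-squares problem, so polyfit and a direct backslash solve on a design matrix agree. A quick sketch (re-using the same kind of data as above; the matrix A is my own construction):

```matlab
% A cubic fit is linear in its coefficients, so polyfit and a plain
% linear least-squares solve produce the same answer.
x = (-5:5)';
y = x.^3 + x.^2 + 2*x + 3 + randn(size(x))*5;
p3 = polyfit(x, y, 3);                % polyfit's solution, descending powers
A  = [x.^3, x.^2, x, ones(size(x))];  % design matrix in the same order
b3 = (A \ y)';                        % direct least-squares solve
% max(abs(p3 - b3)) is down at numerical-noise level
```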
Let me come at this from the other direction. (I spent almost 30 years working with scientists, engineers, analysts, etc., for a major corporation. They would come to me with a modeling problem, and my job was to help them solve it.)
Suppose you came to me with the data in the plot I generated. First of all, I would endeavor to find out whether the bumps in the curve were real. Are they noise? Or are they important behavior that you desperately need to model? Things like that would change my approach. If those bumps were real, though, the very first thing I'd ask is why you did not generate MORE data! A curve that predicts real bumps is NOT overfitting the problem. It is just doing the job it was designed to do.
2 Comments
John D'Errico on 15 Dec 2016
Edited: John D'Errico on 15 Dec 2016
Your original question was a case where the chosen polynomial model had too few parameters. With only 4 parameters there, even though they were terribly wrong terms to use in that model, the solver was able to find a linear combination of them that is, in fact, not a poor approximation to the underlying function. It still did obscene things away from the support of the data, as expected. Given the exact coefficients from that fit, an interesting thing to show might be how the chosen linear combination manages to approximate those lower-order terms.
A regularizer changes the problem. You can view it as effectively adding data points (information) to the problem, while still using the same underlying tool as a solver. Anyway, lsqcurvefit has no regularization capabilities.
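To illustrate what a regularizer does (a sketch only — lsqcurvefit itself offers no such option, and the penalty weight lambda here is a hand-picked assumption), compare an unpenalized degree-10 fit with a ridge-penalized one built from the normal equations:

```matlab
% Ridge regression vs. ordinary least squares on a degree-10 polynomial.
x = (-5:5)';
y = x.^3 + x.^2 + 2*x + 3 + randn(size(x))*5;
A = x.^(10:-1:0);      % degree-10 design matrix (implicit expansion, R2016b+)
lambda = 1;            % ridge penalty weight (assumed, hand-picked)
pOLS   = A \ y;                             % unpenalized: interpolates the noise
pRidge = (A'*A + lambda*eye(11)) \ (A'*y);  % penalized: shrinks the coefficients
% norm(pRidge) < norm(pOLS): the penalty suppresses the wild wiggles
```

The penalty term lambda*eye(11) acts like extra pseudo-observations pulling every coefficient toward zero, which is the sense in which a regularizer "adds information" to the problem.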
The example you chose in your comment is indeed an overfit. It shows the classic overfitting behavior, like fitting every bump, while doing nasty stuff in between the points.