Criteria for judging overfitting
I'm making a model using neural network fitting in MATLAB. I can check the training, validation, and test R values. However, I observe that the model has a high training R value but low validation and test R values.
Can I conclude that overfitting has occurred? How large does the difference in R values need to be before it counts as overfitting?
AMIT POTE on 30 Jun 2022
Hey @정민 이.
There is no rule of thumb that a particular difference in R values indicates overfitting. Typically, if the R value for the training set is noticeably higher than for the validation and test sets, it is likely that your model is overfitting. To confirm this, you can use other metrics, such as the validation loss, to check how your model performs on unseen data.
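The comparison described above can be made concrete with a small sketch. This is Python rather than MATLAB, and the degree-12 polynomial is just a hypothetical stand-in for an over-flexible network; the synthetic data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy data: a smooth signal plus noise.
x = rng.uniform(-1, 1, 40)
y = np.sin(2 * x) + rng.normal(0, 0.3, x.size)

# Simple train / validation split (MATLAB's fitting tool does this for you).
x_train, y_train = x[:25], y[:25]
x_val, y_val = x[25:], y[25:]

def pearson_r(targets, predictions):
    """Correlation between targets and predictions (the 'R' value)."""
    return np.corrcoef(targets, predictions)[0, 1]

# A deliberately over-flexible model: a degree-12 polynomial on 25 points.
coeffs = np.polyfit(x_train, y_train, 12)
r_train = pearson_r(y_train, np.polyval(coeffs, x_train))
r_val = pearson_r(y_val, np.polyval(coeffs, x_val))

print(f"training R:   {r_train:.3f}")
print(f"validation R: {r_val:.3f}")
# A training R well above the validation R is the symptom described above.
```

There is no magic threshold for the gap, which is exactly the point of the answers in this thread; the sketch only shows how the symptom appears.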
John D'Errico on 30 Jun 2022
If there were some clear and simple rule, then the code would be written to recognize that, and alert you to the problem. But the real world is never so clear and simple, else we might all be doing something more interesting. (Certainly true for me.)
You should recognize that in virtually any case, a model will have better capability to fit the training data than it will have to predict validation data. Surely you cannot expect it to go the other way? And while it would be nice if the model does exactly as well on the training data as the validation data, life is never perfect. So it is perfectly normal for the model to fit the training data a little better. The question is, how much better? And that really has no exact answer. So what can you do?
Very often all of this indicates that your data may be noisier than you think, i.e., it has a lower signal-to-noise ratio than expected. And you don't want your model to be chasing noise in the data.
A simple idea is to reduce the complexity of your model, by just a bit. One would expect this to reduce the ability of your model to represent the training data. But if it is chasing noise, then it really costs you nothing. If you do reduce the model complexity, and it has no effective impact on the ability of your model to predict the validation set, then you are going in the right direction. Continue to do so, reducing the complexity of your model, until just before it starts to significantly impact the ability of the model to predict the validation set. Somewhere around that point should be the sweet spot. At this point, you might hope that the model is predicting the training set just a little better than it is predicting the validation set. That will be a good place to live.
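The sweep described above can be sketched as follows. Again this is Python with a polynomial degree standing in for network size; the synthetic data and the 0.01 tolerance are illustrative assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic noisy data and a simple train / validation split.
x = rng.uniform(-1, 1, 60)
y = np.sin(2 * x) + rng.normal(0, 0.3, x.size)
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

def r_value(targets, predictions):
    return np.corrcoef(targets, predictions)[0, 1]

results = {}
# Walk down from a complex model to simpler ones, watching how the
# validation R responds (here, degree plays the role of model complexity).
for degree in range(14, 0, -1):
    c = np.polyfit(x_train, y_train, degree)
    results[degree] = (r_value(y_train, np.polyval(c, x_train)),
                       r_value(y_val, np.polyval(c, x_val)))

# The sweet spot: the simplest model whose validation R is still within a
# small tolerance of the best validation R seen anywhere in the sweep.
best_val = max(rv for _, rv in results.values())
sweet = min(d for d, (_, rv) in results.items() if rv >= best_val - 0.01)
print(f"chosen complexity: degree {sweet}")
```

In practice you would retrain the network at each candidate size rather than refit a polynomial, but the stopping criterion is the same: keep simplifying until just before validation performance starts to suffer.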
In the end, the best solution is to GET BETTER DATA. And always you want MORE data.