
Over- and underfitting


In this topic, you will learn which behaviors of a model are called overfitting and underfitting. We will discuss why they happen and mention some ways to counter them.

Train and validation split

We usually talk only about the train and test data: we train our model on the train data and use test data to, well, test it. However, this approach has two major drawbacks:

  1. Test data often comes unlabeled if you are taking part in a competition. That means you are not really able to tell how well your model performs before you submit it.

  2. Most models and/or solutions have so-called hyperparameters, that is, parameters that cannot be learned during training. That means you have to set them yourself, based on your experience.

In order to deal with both of these problems, we can split our train data into two parts: the actual train data and the so-called validation data. This allows us to train our model on the "new" train subset and run tests and tune hyperparameters on the validation subset.

There are a number of ways to validate a model. For the sake of simplicity, we will stick to the method described above.

[Figure: Splitting the dataset into training, validation, and test sets]
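
To make this concrete, here is a minimal sketch of such a split using scikit-learn's train_test_split. The toy data and the 20% validation share are just example choices, not something fixed by this topic:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the original labeled train set
X = np.random.rand(1000, 5)  # 1000 samples, 5 features
y = np.random.rand(1000)     # 1000 targets

# Set aside 20% of the train data as the validation subset;
# random_state makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape)  # (800, 5) (200, 5)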

Underfitting

OK, we now have a model and two subsets: the training one and the validation one. Now we start training our model and monitor the error on both subsets. It doesn't really matter what exactly the error is; what matters is that it grows as the difference between the expected outcome and the outcome produced by our model grows. Since the very idea of training is to reduce the error on the train subset, we expect this error to go down. However, sometimes, no matter how much time we spend training the model, the train error stays high.

We call this situation underfitting. It means that our model is simply not expressive enough to capture the patterns in the data. Let's look at an example.

Say we have some data similar to what is shown below, and we want to fit a linear regression to it. It is clear that this idea is doomed: a linear model is simply not able to recover such a complex dependence. You can see that it doesn't really matter how we "place" the line produced by the linear model: the train error will still remain quite big.

[Figure: A simple model underfits the train set]

When we face underfitting, we usually need to make our model more complex. We could either switch to another model (for example, an ensemble may work) or make some changes to ours (for instance, by using not only the features we have but also their squares, cubes, and so on), as in the sketch below.
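
Here is a rough sketch of the second idea in scikit-learn. The sine-shaped toy data and the degree of 5 are assumptions made up for the illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Toy data with a clearly non-linear dependence
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# A plain linear model underfits: its train error stays large
linear = LinearRegression().fit(X, y)
print(mean_squared_error(y, linear.predict(X)))

# Feeding the model powers of the feature (x, x^2, ..., x^5)
# makes it expressive enough to follow the curve
poly = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False), LinearRegression()
).fit(X, y)
print(mean_squared_error(y, poly.predict(X)))  # noticeably smaller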

Overfitting

We might also find ourselves in the opposite situation: after the model is trained, the training error is lower than ever. However, we also need to check the error on the validation subset. If it turns out to be relatively high, we say that overfitting has occurred. That means that our model has simply "memorized" the whole training subset instead of making any generalizations, which is why it performs so well on the training set but fails on validation.

In order to illustrate what has happened, let's have a look at the image below. Here we have the same data as before, but this time the model is able to recover complex dependencies. And that is how we get this absolutely implausible curve.

[Figure: An overfit model with low train error but large validation and test errors]
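
Here is a hedged sketch of what such a model could look like in code: a high-degree polynomial fit on a handful of points (all the numbers below are arbitrary choices for the illustration). The train error ends up near zero while the validation error explodes:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Deliberately few, noisy samples
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# An overly flexible model: a degree-15 polynomial on 15 train points
model = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False), LinearRegression()
).fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))  # near zero
print("val MSE:  ", mean_squared_error(y_val, model.predict(X_val)))      # much larger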

You usually detect overfitting by comparing the train and the validation errors. The validation error is often somewhat higher, since the model is not trained on the validation subset, but it should not differ drastically. However, this is not the only way: when all the features of the data have the same scale, we can also detect overfitting by looking at the coefficients of the model. If they are inadequately big, the model might be overfitting.

This observation gives us one of the ways (along with switching to another model) to counter overfitting, called regularization. We will consider it in detail in future topics. For now, you can just assume that we somehow "penalize" our model for having big coefficients, which might reduce overfitting. There are other ways too, of course: for example, we could remove some of the features or tune the hyperparameters.
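
Regularization itself is a topic for later, but as a rough preview, here is the overly flexible model from above with and without an L2 penalty (Ridge in scikit-learn; the penalty strength alpha=1.0 is an arbitrary example value). Because the polynomial features are standardized to the same scale first, comparing coefficient sizes is fair:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

# The same flexible model, without and with a penalty on big coefficients
plain = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False),
    StandardScaler(),
    LinearRegression(),
).fit(X, y)
ridge = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(X, y)

# Inspect the largest absolute coefficient of each final estimator
print("plain:", np.abs(plain[-1].coef_).max())  # typically very large
print("ridge:", np.abs(ridge[-1].coef_).max())  # much smaller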

Conclusion

In this topic, you've learned that it is a good idea to use three sets of data: the train set, used to train the model; the validation set, used to tune hyperparameters and check the model's performance; and the test set, used for the final evaluation of the model.

We have also learned that we could face two possible problems when training our model: underfitting, when our model is too simple to capture the patterns in the data, and overfitting, when our model simply memorizes the whole train set.

In general, it is often handy to keep Occam's razor in mind: don't make your model more complicated than required. If you see that something doesn't really boost the results, you might want to give it up and try something else.
