
Cross-validation


We hope you already have an idea of how models are trained. However, training on the available data alone is usually not enough because of the hyperparameters. To tune them, you need to test your model after it has been trained. In other words, you need to split the data into two parts: a training part and a test part.

How to train your model

However, having these two parts is not enough. If you fine-tune the hyperparameters on one particular sample, they will give good results only on that sample, while on another sample the performance may be poor. And if the same sample is used both for tuning and for the final evaluation, the score will not be objective.

In the best-case scenario, you need three samples:

  1. A training subset used during training;

  2. A validation subset to validate your model and tune the hyperparameters;

  3. A test subset that is never fed to your model until the final evaluation.

Figure: splitting the data into training, validation, and test subsets

This method works well when you have a large amount of data. When you split the data, make sure the feature distribution is preserved across the subsets.

If you are taking part in a competition, you usually have the training and the test subsets. The training subset includes the targets for your data, while the test subset does not. So, you do not need to worry about the test subset; all you need is to split the training subset into the actual training and validation parts. The usual ratio is 70% to 30% or 80% to 20%.

In real life, however, you don't have a ready-made test subset, so you need to create it yourself. Decide on its size at the very start; a common reference value is around 10% of the data.
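With scikit-learn, such a hold-out split can be done with train_test_split. The snippet below is only a minimal sketch: the synthetic dataset, the variable names, and the 80/10/10 ratio are assumptions for illustration.

```python
# A minimal sketch of an 80/10/10 hold-out split (assumed ratios) with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

# Carve out a 10% test subset first; stratify=y preserves the class distribution.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y
)

# Split the rest into training and validation parts (1/9 of the remaining 90% is 10% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1/9, random_state=42, stratify=y_rest
)
```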

However, this approach has two major drawbacks:

  1. The quality of your model during validation is measured on a small part of the dataset. It may not be representative, especially if the classes are unbalanced. Say, 90% of the dataset belongs to the first class and 10% to the second one. If you randomly set aside 10% of the dataset, you could end up with samples of the first class only;

  2. The measured quality may differ from the real one, since part of the dataset is withheld from training.

To resolve these issues, you can take another approach called cross-validation.

K-fold cross-validation

Instead of dividing the dataset into three parts, let's divide it into two: the training and the test ones. Then, instead of coming up with a fixed validation subset, use a floating one. For this, let's divide the training set into k parts (or folds). Each of them will serve as the validation subset in turn.

The whole process will now look like this: train a model using the k-th fold as the validation subset. After that, train a new model with the same set of hyperparameters, using the (k-1)-th fold as the validation one and returning the previous fold to the training subset (as displayed in the figure below). The process is repeated k times. The resulting quality is defined as the mean quality over all folds.

Figure: splitting the training set into training and validation subsets for cross-validation

Generate the fold split for cross-validation in a way that allows you to reproduce it every time. Otherwise, an apparent quality improvement may simply be the result of a different fold distribution.
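Here is a minimal sketch of K-Fold cross-validation with scikit-learn; the synthetic dataset, the logistic regression model, and k = 5 are assumptions for illustration. Fixing random_state is exactly what makes the fold split reproducible.

```python
# A minimal sketch of 5-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

# shuffle=True with a fixed random_state makes the fold split reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The resulting quality is the mean score over all k folds.
print(scores, scores.mean())
```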

This approach is called K-Fold, and it resolves the two issues we've pointed out above. It has one disadvantage: you need to train k models instead of just one, which may take quite a while. However, you can turn this into an advantage, since the k trained models can later be used as an ensemble.

Of course, you may argue that K-Fold alone does not solve the unbalanced-class issue. For that, there is the stratified K-Fold, a variation of K-Fold in which every fold preserves the original class ratio.
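A sketch of the stratified variant is shown below; the imbalanced synthetic dataset (roughly 90% vs 10%, mirroring the example above) is an assumption for illustration.

```python
# A sketch of stratified K-Fold: every fold keeps roughly the original class ratio.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Placeholder imbalanced data: about 90% of class 0 and 10% of class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X, y):
    # Each validation fold has roughly the same 90/10 class counts.
    print(np.bincount(y[val_idx]))
```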

Another variant is Leave-One-Out Cross-Validation. This is an extreme form of cross-validation in which k equals the number of objects in your dataset. Yes, it will take some time, but if your dataset is small, it is the best you can do.
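A sketch of Leave-One-Out Cross-Validation is below; the tiny synthetic dataset and the model are, again, assumptions for illustration.

```python
# A sketch of leave-one-out cross-validation on a small dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=42)  # placeholder small dataset

# k equals the number of objects: 50 models are trained, each validated on one sample.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())
```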

Conclusion

In this topic, we've discussed the following points:

  • Why you need not only the test and training splits but also the validation one. The validation subset helps you fine-tune the hyperparameters;

  • What K-Fold is and why you should use it. The K-Fold method gives a more reliable quality estimate by letting every part of the training data serve as a validation subset in turn;

  • What the stratified K-Fold and Leave-One-Out Cross-Validation are. They are special variants of K-Fold applicable in certain scenarios: the stratified K-Fold helps when classes are unbalanced, and Leave-One-Out Cross-Validation is useful for small datasets.

This is it! Let's get to practice.
