
Default train-validation loop


The train-validation loop is the process of iteratively training a model on the training data and evaluating its performance on previously unseen data.

This topic explains what happens during training and validation, how input data is processed, and how the batch size, the number of epochs, and the learning rate influence the results. It also explains how to interpret metric values when evaluating model performance.

Preliminary concepts of the train-validation loop

As a preliminary step, let us take a look at the subtypes of gradient descent (GD) and their differences.

Batch gradient descent uses all the training data in a single iteration of the algorithm. The gradient of the loss function is computed for each sample in the data. Then, an average of the gradients is calculated and used for updating parameters. Batch GD is very time-consuming and computationally expensive if the dataset is large.

Illustration of batch gradient descent

Stochastic gradient descent (SGD) uses a single sample in every iteration of the algorithm, so each update is much cheaper than in batch GD. The main disadvantage of SGD is that a gradient computed from just one sample is noisy: the loss does not necessarily decrease at every step, and the parameters may keep fluctuating around a minimum instead of settling into it.

Illustration of stochastic gradient descent

Mini-batch gradient descent is a compromise between batch gradient descent and SGD. It uses a mini-batch of a configured size, which is smaller than the actual dataset, to compute the gradient at each step. In each iteration, the model is trained on a different group of samples until all dataset samples have been used. This approach is considered the most efficient of the three.

Illustration of mini-batch gradient descent
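
The three variants differ only in how many samples go into each gradient estimate. The sketch below illustrates this on a hypothetical one-parameter model `y = w * x` with a squared-error loss; the toy data, the `gradient` helper, and the mini-batch size of 4 are assumptions made for illustration.

```python
import random

def gradient(w, batch):
    """Average gradient of the squared error (w*x - y)**2 with respect to w over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy dataset that follows y = 3x exactly (an assumption for illustration).
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]
w = 0.0
rng = random.Random(0)

full_grad = gradient(w, data)                   # batch GD: every sample contributes
single_grad = gradient(w, [rng.choice(data)])   # SGD: one randomly chosen sample
mini_grad = gradient(w, rng.sample(data, 4))    # mini-batch GD: 4 randomly chosen samples

print(full_grad, single_grad, mini_grad)
```

The single-sample and mini-batch values are estimates of the full-batch gradient: they are cheaper to compute but vary with the samples that happen to be drawn.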

The main configurable hyperparameters in the train-validation loop are the mini-batch size (also known as the "batch size"), the number of steps, and the number of epochs.

A batch is a subset of the entire dataset. For instance, consider a collection of 1000 images used in training: processing all of these images in a single update would be slow and memory-hungry. To avoid that, the dataset is processed in batches of a configurable size. If the batch size is set to 50, the 1000-image dataset is partitioned into 20 batches of 50 images each. Larger batches give a more accurate estimate of the gradient over the whole dataset but require more memory and more computation per update. Conversely, smaller batches need less memory and make each update cheaper, but their gradient estimates are noisier, which can make training less stable.

Batch sizes are usually powers of 2: 1, 2, 4, 8, 16, etc. A common approach is to select the largest batch size that fits into memory, but this is not always a good idea: Masters and Luschi (2018) report that batch sizes of 32 or below generalized better than larger ones in their experiments. Plus, a personal computer using only a CPU may struggle to run a training loop effectively with a batch size larger than 16.

Batch size and learning rate are usually tuned together. One rule is square-root scaling: when the batch size is multiplied by k, the learning rate is multiplied by the square root of k, which keeps the variance of the weight updates roughly constant. Another common practice is linear scaling: when the batch size is multiplied by k, the learning rate is also multiplied by k.
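
As a rough illustration of the two scaling rules, the sketch below computes the adjusted learning rate for a new batch size. The function name `scale_learning_rate` and the starting values (a learning rate of 0.01 at batch size 32) are assumptions, not recommendations.

```python
import math

def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a learning rate after the batch size changes.

    rule="linear": multiply the learning rate by the batch-size factor.
    rule="sqrt":   multiply the learning rate by the square root of that factor.
    """
    factor = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * factor
    if rule == "sqrt":
        return base_lr * math.sqrt(factor)
    raise ValueError(f"unknown rule: {rule!r}")

# Hypothetical starting point: lr = 0.01 at batch size 32, moving to batch size 128 (factor 4).
linear_lr = scale_learning_rate(0.01, 32, 128, rule="linear")
sqrt_lr = scale_learning_rate(0.01, 32, 128, rule="sqrt")
print(linear_lr, sqrt_lr)
```

Either result is best treated as a starting point for further tuning rather than a final value.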

Steps and epochs

Another concept related to the train-validation loop is steps. A step is a single iteration in which the model processes one batch and updates its parameters. For example, if there are 20 batches, the model will require 20 steps to complete one pass over the entire dataset.

After the model has processed all the batches in the dataset, it has completed one epoch, that is, one full pass through the dataset. Increasing the number of epochs lets the model learn for longer, which often improves performance. However, too many epochs can cause overfitting, a scenario where the model becomes so tailored to the training data that it performs poorly on unseen data. Separately, training on the data in the same order in every epoch can introduce undesirable biases, so dataloaders (covered in the next section) typically reshuffle the samples at each epoch to make sure that learning does not depend on the sequence of data points.

A common way to choose the number of epochs is to set it to a high value, like 100, and rely on early stopping. Early stopping interrupts training when a predefined trigger fires, e.g., the validation loss starts to increase after decreasing over several consecutive epochs. The model from the epoch with the lowest loss before the increase is then kept as the "best" one.
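
One common way to implement the trigger is a "patience" counter: training stops once the loss has failed to improve for a fixed number of consecutive epochs. The sketch below applies this rule to a made-up list of per-epoch losses; the `patience` parameter, the function name, and the loss values are assumptions for illustration, and other triggers are possible.

```python
def run_early_stopping(val_losses, patience=3):
    """Scan per-epoch losses and stop after `patience` epochs without a new best.

    Returns (best_epoch, best_loss, stopped_epoch), with epochs counted from 0.
    """
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            # New best: a real loop would save the model weights here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # trigger fired: stop training

    return best_epoch, best_loss, stopped_epoch

# Made-up losses: they improve until epoch 3 and then drift upward.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.40]
print(run_early_stopping(losses, patience=3))  # → (3, 0.5, 6)
```

Note the trade-off: with a patience of 3, the loop stops at epoch 6 and never sees the later value of 0.40, so a larger patience tolerates more fluctuation at the cost of longer training.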

Illustration of early stopping

Thus, we can summarize the differences between batch, mini-batch, and stochastic GD as follows, where $n$ is the number of training samples:

Method	Number of samples in each gradient step	Number of updates per epoch
Batch GD	Entire dataset	$1$
Stochastic GD	One sample of the dataset	$n$
Mini-batch GD	A subset of the dataset (with a predetermined mini-batch size)	$\frac{n}{\text{batch size}}$
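
The last column of the table can be checked with a few lines of arithmetic. The sketch below assumes that an incomplete final batch still counts as one update, hence the rounding up; the helper name `updates_per_epoch` is hypothetical.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one epoch, counting a partial last batch."""
    return math.ceil(n_samples / batch_size)

n = 1000  # dataset size from the 1000-image example
print(updates_per_epoch(n, batch_size=n))   # batch GD
print(updates_per_epoch(n, batch_size=1))   # stochastic GD
print(updates_per_epoch(n, batch_size=50))  # mini-batch GD
print(updates_per_epoch(n, batch_size=32))  # batch size that does not divide n evenly
```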

Dataloaders

A dataloader streamlines feeding data into the model for training and validation, and it can also prepare and transform raw data into the format the model requires. Dataloaders are more versatile than simple batching tools because they can also apply data augmentation, a technique that increases the diversity of the training data without collecting new data by creating transformed variations of existing samples. For instance, if there is a dataset for a model designed to recognize cats, applying slight rotations or zooms to the cat images would help the model learn to recognize cats from various angles and distances, which can improve its accuracy.

Dataloaders can also shuffle the data, randomizing the order of samples at the beginning of each epoch. Another way dataloaders make training more efficient is parallel loading: while the model is training on one batch, the next batch is already being prepared. This speeds up training by reducing the time the model spends waiting for data.
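
To make the idea concrete, here is a minimal pure-Python sketch of a dataloader. `SimpleDataLoader` is a hypothetical class written for this illustration, not the API of any library; it covers per-epoch shuffling, batching, and an optional per-sample `transform` (a stand-in for augmentation), and it leaves out parallel loading.

```python
import random

class SimpleDataLoader:
    """Hypothetical minimal dataloader: reshuffles on every pass and yields mini-batches."""

    def __init__(self, samples, batch_size, shuffle=True, transform=None, seed=0):
        self.samples = list(samples)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.transform = transform
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(range(len(self.samples)))
        if self.shuffle:
            self.rng.shuffle(order)  # a new sample order for each epoch
        for start in range(0, len(order), self.batch_size):
            batch = [self.samples[i] for i in order[start:start + self.batch_size]]
            if self.transform is not None:
                batch = [self.transform(sample) for sample in batch]
            yield batch

# Ten toy "samples"; multiplying by 10 stands in for a real preprocessing step.
loader = SimpleDataLoader(range(10), batch_size=4, transform=lambda x: x * 10)
epoch_1 = list(loader)  # iterating once corresponds to one epoch
epoch_2 = list(loader)  # iterating again reshuffles the samples
print(epoch_1)
print(epoch_2)
```

Each pass over `loader` visits every sample exactly once, with the last batch smaller than the others when the dataset size is not a multiple of the batch size.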

Training of a model

The training of the model itself can be described with the following sequence of steps:

1. Split the data into train, validation, and test sets.
2. Initialize the hyperparameters (e.g., set the weight initialization scheme, the learning rate, the batch size, the number of epochs, etc.).
3. Training loop: For each epoch (a full pass through the entire training dataset), the following steps are performed:
   - The training data is divided into batches. For each batch:
     - Perform the forward pass on the batch to get the predictions.
     - Calculate the loss (the difference between the predictions and the true labels).
     - Perform backpropagation (compute the gradients of the loss function with respect to the weights).
     - Update the weights based on the gradients computed during backpropagation.
4. Validation: After each epoch, evaluate the performance on the validation set. This step involves:
   - Performing the forward pass on the validation set.
   - Computing the loss and other metrics (such as accuracy, precision, recall, etc.) to assess the model's performance on the validation set.
5. Based on the model's performance on the validation set, adjust the hyperparameters. This could be done manually or through automated processes like learning rate schedules, early stopping, etc.
6. The model with the best performance on the validation set is often chosen as the final model. If performance on the validation set starts to deteriorate (while performance on the training set continues to improve), this may be a sign of overfitting, and training might need to be stopped.
7. Once the model is selected, it is evaluated on the test set to estimate the generalization performance.
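
The sequence above can be sketched end to end in plain Python. To keep the example self-contained, it fits a hypothetical two-parameter linear model `y = w * x + b` with hand-written gradients instead of a neural network with backpropagation; the synthetic data, the hyperparameter values, and the patience-based stopping rule are all assumptions for illustration.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(train_set, val_set, lr=0.05, batch_size=4, max_epochs=200, patience=10, seed=0):
    rng = random.Random(seed)
    train_set = list(train_set)   # local copy, so shuffling does not reorder the caller's list
    w, b = 0.0, 0.0               # initial parameters
    best = {"epoch": None, "val_loss": float("inf"), "w": w, "b": b}
    stale_epochs = 0
    history = []                  # (train_loss, val_loss) per epoch

    for epoch in range(max_epochs):
        rng.shuffle(train_set)    # new sample order for every epoch
        for start in range(0, len(train_set), batch_size):
            batch = train_set[start:start + batch_size]
            # Forward pass: prediction errors on the batch.
            errors = [w * x + b - y for x, y in batch]
            # Gradients of the batch MSE with respect to w and b.
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Parameter update.
            w -= lr * grad_w
            b -= lr * grad_b

        # Validation: loss only, no parameter updates.
        val_loss = mse(w, b, val_set)
        history.append((mse(w, b, train_set), val_loss))

        if val_loss < best["val_loss"]:
            best = {"epoch": epoch, "val_loss": val_loss, "w": w, "b": b}
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break             # early stopping

    return best, history

# Synthetic, noise-free data following y = 2x + 1, split into train/validation/test sets.
split_rng = random.Random(42)
data = [(i / 20, 2.0 * (i / 20) + 1.0) for i in range(40)]
split_rng.shuffle(data)
train_set, val_set, test_set = data[:28], data[28:34], data[34:]

best, history = train(train_set, val_set)
test_loss = mse(best["w"], best["b"], test_set)  # final estimate on held-out data
print(best["epoch"], best["w"], best["b"], test_loss)
```

The test set is used only once, after the best parameters have been chosen on the validation set, which mirrors the last step of the list.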

Validation and evaluation of the model

During training, typically after each epoch, the model is run on the validation set, a separate set of data that is part of neither the training set nor the test set. To evaluate the model's performance, its predictions on the validation data are compared to the actual labels. The differences are then measured using metrics that depend on the type of task.

Similar to the training step, the model calculates the loss between its predictions and the actual results on the validation data. However, unlike the training phase, the model does not adjust its weights but uses the loss as a performance indicator. If the training loss decreases (meaning the model is learning) while the validation loss increases, this discrepancy may indicate that the model is overfitting.

Graphs for training and validation loss

Ideally, both training and validation losses should be gradually decreasing throughout epochs.

Graphs for training and validation loss

Evaluating deep learning models is not that different from regular ML evaluation. For instance, in classification tasks, you can look at common metrics like accuracy, recall, precision, and F1. You can also plot the ROC curve and compute the AUC, build the confusion matrix, etc. The pros and cons of these metrics remain the same as in classic ML contexts.
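
For a binary classifier, these metrics follow directly from the four confusion-matrix counts. The sketch below computes them by hand on made-up labels; the function name `classification_metrics` and the example predictions are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels, from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up validation labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics equal 0.75.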

There are quite a few neural-network-specific and domain-dependent ways to interpret a trained model, though. For example, you can visualize the architecture (if PyTorch is used, this can be done via the torchviz package):

A basic MLP visualization with torchviz

Another useful tool is gradient plots, which help to detect vanishing or exploding gradients. Typically, in every training iteration, the average gradients per layer are stored and plotted later (one could also go more granular and make a separate plot for every layer). If the average gradients in the early layers stay close to 0 with no fluctuations, you have likely encountered vanishing gradients. In such cases, it might be helpful to look at the learning rate, the optimizer, the weight initialization scheme, etc.:

An example of a gradient plot
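
The effect can be reproduced without a deep learning framework. The sketch below backpropagates through a hypothetical chain of six one-unit sigmoid layers and prints the gradient magnitude per layer; the architecture, the weights of 1.0, and the input and target values are assumptions chosen to make the shrinking pattern visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_gradients(weights, x, y):
    """Backpropagate the squared error through a chain of one-unit sigmoid layers.

    Returns d(loss)/d(weight) for every layer, first layer first.
    """
    # Forward pass: keep every activation for the backward pass.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: apply the chain rule from the last layer to the first.
    grad_activation = 2.0 * (activations[-1] - y)
    grads = [0.0] * len(weights)
    for layer in reversed(range(len(weights))):
        out = activations[layer + 1]
        grad_pre = grad_activation * out * (1.0 - out)  # sigmoid derivative is out * (1 - out)
        grads[layer] = grad_pre * activations[layer]
        grad_activation = grad_pre * weights[layer]
    return grads

grads = layer_gradients(weights=[1.0] * 6, x=1.0, y=0.0)
magnitudes = [abs(g) for g in grads]
for layer, magnitude in enumerate(magnitudes, start=1):
    print(f"layer {layer}: |gradient| = {magnitude:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, each step back through the chain shrinks the gradient, so the first layer receives a far smaller signal than the last one. In a real network, the same per-layer values would be collected at every training iteration and plotted.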

One can also visualize feature maps for computer vision applications to see what patterns the network is detecting in each layer (more on that here). Smoother loss surfaces also tend to yield better performance, so visualizing the loss surface (refer to this source as an example) can hint at whether a loss-smoothing technique (such as batch normalization or skip connections) should be added. However, this is beyond the scope of this topic.

Conclusion

The train-validation loop in machine learning is an iterative process that can be tuned in various ways, for example by changing the batch size and the number of epochs. Dataloaders preprocess the data for training and evaluation and group it into batches. The model then receives the data, learns from it, and improves its predictions through backpropagation and optimization techniques. The validation step tests the model's ability to generalize what it has learned to unseen data.
