Cross-validation improves upon simple train-test splits by using all data points for both testing and training. Techniques like k-fold cross-validation offer more robust evaluations through varied data combinations. In this topic, we will explore some of the most popular techniques for cross-validation in scikit-learn.
The setup
For demonstration purposes, a dataset of 100 random samples will be used:
Groups correspond to samples that are correlated in some manner. For instance, consider a bank's customer data for loan approval. In this scenario, all records belonging to a single customer (identified by their ID) would form a group, so a prediction for a new customer would be based on data from many other customers. Alternatively, annual income could define the groups, with customers bucketed by their income level. The class is the target variable we aim to predict: in the loan approval example, the customer ID defines the group, and whether or not to approve the loan is the target. Groups may influence the predicted labels, but this depends on whether meaningful groups exist and on if and how the model incorporates them (decision trees, for instance, do not account for potential groups, while various clustering methods do).
The dataset consists of three classes with 35, 50, and 15 samples, respectively; the class sizes were chosen to show how the cross-validation techniques behave under class imbalance. The dataset also contains four groups with 26, 40, 6, and 28 samples each, chosen to demonstrate group imbalance.
import numpy as np
rng = np.random.RandomState(42)
n_points = 100
X = rng.randn(n_points, 10)
percentiles_classes = [0.35, 0.50, 0.15]  # 35, 50, and 15 samples for each class
y = np.hstack([[ii] * int(n_points * perc) for ii, perc in enumerate(percentiles_classes)])
groups = np.concatenate(
    (
        np.zeros(26, dtype=int),
        np.ones(40, dtype=int),
        np.full(6, 2, dtype=int),
        np.full(28, 3, dtype=int),
    )
)

KFold and StratifiedKFold
KFold splits the data into a specified number of folds. One fold is then held out as the validation set while the rest are used for training, and the process is repeated so that every fold serves as the validation set exactly once. This results in a more robust evaluation than a single train-test split.
An issue is already apparent: in the third cross-validation iteration (counting from 0), the third class (yellow) is not represented in the training set at all and only appears in the test set. This problem can be somewhat mitigated by enabling shuffling:
Shuffling does not guarantee that the class distribution of the original dataset is preserved in each fold, so you can still end up with class imbalance. StratifiedKFold mitigates this problem.
StratifiedKFold returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set, as seen in the plot below. StratifiedKFold is the right choice for imbalanced datasets, or whenever the class proportions should be preserved in every fold.
Both KFold and StratifiedKFold accept the same set of parameters:
n_splits (default: 5, int) — the number of folds; has to be greater than or equal to 2;
shuffle (default: False, bool) — whether to shuffle the samples before splitting. Usually the data should be shuffled, unless you are working with time series, data that has a meaningful order (or grouping), or a dataset that is large and already randomized;
random_state (default: None, RandomState instance/int) — ensures reproducibility when shuffling is enabled.
Creating the cross-validation generator is straightforward:
from sklearn.model_selection import KFold, StratifiedKFold
kf = KFold(n_splits=4)  # 4-fold generator
skf = StratifiedKFold(n_splits=4)  # stratified 4-fold generator

for train_index, test_index in kf.split(X):
    print(f"train index: {train_index}/ test index: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # same scheme for skf; however, skf.split() also requires the target, y, to be passed
.split() is the main method of the cross-validation generator, which yields the train and the test sample indexes.
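To see the difference on this dataset, here is a small sketch that reuses the skf generator from above, adds a shuffled KFold, and uses np.bincount to count how many samples of each class land in every test fold; StratifiedKFold keeps the counts close to the 35/50/15 class proportions, while a shuffled KFold gives no such guarantee:

kf_shuffled = KFold(n_splits=4, shuffle=True, random_state=42)

for name, cv in [("KFold (shuffled)", kf_shuffled), ("StratifiedKFold", skf)]:
    print(name)
    for train_index, test_index in cv.split(X, y):
        # class counts in the current test fold (classes 0, 1, 2)
        print(f"  test fold class counts: {np.bincount(y[test_index])}")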
RepeatedKFold
RepeatedKFold takes KFold one step further by repeating the KFold process n_repeats times, with a different random division of the data into folds on each repetition. The final model performance is the average over all folds of all repetitions, giving a more stable evaluation (RepeatedKFold reduces the variance of the estimated error, leading to more reliable results). RepeatedKFold takes almost the same arguments as KFold, but instead of shuffle there is the n_repeats parameter:
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=42)
for train_index, test_index in rkf.split(X):
    print(f"train index: {train_index}/ test index: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
RepeatedKFold is not suitable for large datasets due to the computational cost of repeating the KFold process n_repeats times. There is also a stratified variant, RepeatedStratifiedKFold, which repeats the stratified k-fold n_repeats times.
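To make the averaging concrete, here is a minimal sketch that scores a model on every fold of every repetition of the stratified variant and averages the results (LogisticRegression is used purely for illustration and is not part of the setup above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=42)
log_clf = LogisticRegression(max_iter=1000)

scores = []
for train_index, test_index in rskf.split(X, y):
    # fit on the training part of the current fold and score on its test part
    log_clf.fit(X[train_index], y[train_index])
    scores.append(log_clf.score(X[test_index], y[test_index]))

print(f"number of evaluations: {len(scores)}")  # n_splits * n_repeats = 8
print(f"average score: {np.mean(scores)}")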
GroupKFold
GroupKFold ensures that the same group never appears in both the training and the test set, so the evaluation measures how well the model generalizes to unseen groups instead of letting it exploit group-specific information. This time, each split keeps every group entirely in either the train or the test set:
GroupKFold only accepts a single parameter, n_splits:
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=4)
for i, (train_index, test_index) in enumerate(gkf.split(X, y, groups)):
    print(f"fold {i}:")
    print(f"train: index={train_index}, group={groups[train_index]}")
    print(f"test: index={test_index}, group={groups[test_index]}")
Let's say there is a medical dataset with multiple records per patient, and the task is to predict whether a certain drug will work for a patient given their complete record history. With regular KFold, some of a patient's records may land in the training set while the rest land in the validation set. Information about that patient then leaks into the validation set, so the validation score overestimates how well the model will perform on a completely new patient. GroupKFold ensures that all of a single patient's records end up either in the training set or in the validation set, which gives a more reliable evaluation when the samples within a group are related.
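As a quick sanity check on the toy dataset from the setup, the following sketch confirms that no group ever appears in both the training and the test indices produced by the gkf generator above:

for i, (train_index, test_index) in enumerate(gkf.split(X, y, groups)):
    train_groups = set(groups[train_index])
    test_groups = set(groups[test_index])
    # the intersection is always empty: a group is never split across train and test
    print(f"fold {i}: train groups={train_groups}, test groups={test_groups}, overlap={train_groups & test_groups}")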
LeaveOneOut (LOO)
LOO is a variant of KFold where k is equal to the number of samples in the dataset. During the cross-validation process, one sample is left out of the training set and the model is trained on the remaining data. The single left-out observation is then used to test the model's performance. This is repeated so that each observation in the dataset is used exactly once as the test set.
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print(f"train index: {train_index}/ test index: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
LeaveOneOut does not accept any arguments. LOO is a good choice when data is scarce. However, it has two downsides: as the number of data points grows it becomes computationally expensive (one model must be trained per sample), and it can produce a high-variance estimate of model performance because the training sets are nearly identical to one another.
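A quick sketch of the first downside: the number of models LOO has to train equals the number of samples, which get_n_splits() makes explicit on the 100-sample dataset from the setup:

print(loo.get_n_splits(X))                # 100 — one model per sample
print(KFold(n_splits=4).get_n_splits(X))  # 4 — far cheaper on large datasets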
TimeSeriesSplit
Time series data is characterized by correlation between observations that are near in time. Traditional cross-validation techniques such as KFold ignore this ordering, so a model can end up being trained on future observations and evaluated on past ones, which leaks information. In TimeSeriesSplit, the data is split so that the test set always comes after the training set in time:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=4)

for train_index, test_index in tscv.split(X):
    print(f"train index: {train_index}/ test index: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Besides n_splits, TimeSeriesSplit accepts the following parameters:
max_train_size (default: None, int) — the maximum size of a single training set; when left at the default, the training set can grow without an upper bound;
test_size (default: None, which corresponds to n_samples // (n_splits + 1), int) — limits the size of the test split in each fold;
gap (default: 0, int) — the number of samples to exclude from the end of each training split before the test split starts. When kept at 0, nothing is excluded.
When there is a strong seasonality in the data, TimeSeriesSplit might not capture it well, leading to biased results.
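To illustrate these parameters, here is a small sketch with arbitrarily chosen values: each test window is capped at 10 samples and a gap of 5 samples is left between every training split and its test split:

tscv_gap = TimeSeriesSplit(n_splits=4, test_size=10, gap=5)

for train_index, test_index in tscv_gap.split(X):
    # the 5 samples immediately before each test window are excluded from training
    print(f"train: {train_index.min()}-{train_index.max()} / test: {test_index.min()}-{test_index.max()}")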
cross_val_score
scikit-learn provides a utility function (cross_val_score()) that can be used to estimate the performance of the model with cross-validation. cross_val_score() accepts the following parameters:
estimator — a scikit-learn estimator object, such as sklearn.ensemble.RandomForestClassifier for a random forest classifier;
X, y — the features and targets, respectively;
cv (default: None, which corresponds to a 5-fold KFold; int/iterable/cross-validation object) — the cross-validation strategy;
scoring (default: the estimator's scorer) — the scoring metric (such as accuracy, F1, ROC AUC, etc.).
If the cv parameter is left at its default or passed as an int and the estimator is a classifier, StratifiedKFold is used automatically instead of the regular KFold. Below is an example usage of cross_val_score with RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
skf = StratifiedKFold(n_splits=4)
clf = RandomForestClassifier()
scores_skf = cross_val_score(clf, X, y, cv=skf)
avg_skf_score = np.mean(scores_skf)
print(f'CV scores: {scores_skf}/ average cv score: {avg_skf_score}')
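The scoring parameter switches the evaluation metric; as a hypothetical extension of the example above (not part of the original snippet), the same folds can be scored with macro-averaged F1 instead of the classifier's default accuracy:

scores_f1 = cross_val_score(clf, X, y, cv=skf, scoring="f1_macro")
print(f'F1 scores: {scores_f1}/ average F1 score: {np.mean(scores_f1)}')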
Conclusion

As a result, you are now familiar with the main cross-validation techniques available in scikit-learn, their possible use cases, and their limitations.