GridSearchCV and RandomizedSearchCV are two popular techniques for hyperparameter tuning. GridSearchCV exhaustively evaluates every combination in a predefined grid of values, while RandomizedSearchCV samples a fixed number of hyperparameter configurations at random.
In this topic, we will delve into the two main hyperparameter search methods available in scikit-learn.
Hyperparameter search techniques
In scikit-learn, there are three main ways to perform the optimal hyperparameter search: grid search, randomized search, and successive halving (for both grid search and randomized search).
GridSearchCV finds the optimal set of hyperparameters by brute force, evaluating the performance of every possible hyperparameter combination from the set of passed values. Because the grid search is exhaustive, the optimal combination within the predetermined grid is guaranteed to be found. Two issues arise here: the search can only be as good as the grid one specifies (a poorly chosen grid may leave it stuck around a local optimum), and a large grid search is computationally expensive.
An alternative is RandomizedSearchCV, where hyperparameter configurations are drawn at random from the specified distributions (or discrete sets), which allows a broader exploration of the hyperparameter space for the same computational budget.
Scikit-learn also provides a successive halving variant of the previous two searches, which is more efficient. Successive halving, provided with a set of possible values, successively throws out low-performing configurations until only one is left. Successive halving for random search goes as follows: draw a large set of possible configurations with random sampling, train the models with constrained resources (e.g., a smaller training subset), keep only the top 1/factor of the combinations based on performance (factor is predetermined and defaults to 3 in scikit-learn), and continue the training with increased resources (e.g., a larger training sample) until only one combination is left. The resources increase from one iteration to the next.
Since the halving search is still an experimental feature, the focus of the next sections will be on the grid and random searches, but one can enable the halving search in scikit-learn as follows:
from sklearn.experimental import enable_halving_search_cv
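For completeness, here is a minimal, hedged sketch of how a halving random search could be set up once the feature is enabled (the estimator and the parameter range below are illustrative assumptions, not part of the original example):

from sklearn.experimental import enable_halving_search_cv  # enables the experimental classes
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import randint

halving_search = HalvingRandomSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": randint(2, 30)},  # illustrative range
    factor=3,  # keep roughly the top 1/3 of candidates at each iteration
    random_state=42,
)
# halving_search.fit(X_train, y_train)  # fit like any other search object

GridSearchCV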
Let's see how grid search could be applied to find the optimal hyperparameters for the K-Neighbors regressor on the California housing dataset. The preliminary steps:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, we define the grid of possible hyperparameter values and instantiate the GridSearchCV() object:
param_grid = {
"n_neighbors": [3, 5, 10, 13],
"weights": ["uniform", "distance"],
"algorithm": ["auto", "ball_tree", "kd_tree"],
}
grid_search = GridSearchCV(
estimator=KNeighborsRegressor(), param_grid=param_grid, n_jobs=-1
)
grid_search.fit(X_train, y_train)

The main parameters that can be passed to GridSearchCV are:
- estimator — any valid scikit-learn estimator, e.g., a Pipeline (we will look at the usage of Pipeline a bit further down the line);
- param_grid (dict / list of dictionaries) — a dictionary where the keys correspond to the hyperparameter names and the values to the possible hyperparameter values;
- scoring (default: None; str / callable / list / tuple / dict) — controls how the performance should be evaluated during cross-validation. By default, the scoring of the passed estimator object is used, but it can be set to an appropriate custom scoring function (see the short sketch below);
- cv (default: None; accepts an int to specify the number of folds, a cross-validation generator, or an iterable) — specifies the cross-validation strategy. By default, 5-fold cross-validation is performed, but when the passed estimator is a classifier, stratified 5-fold cross-validation is used (to preserve the class balance).
n_jobs is set to -1, which enables the search to run in parallel on all available cores. For single-process execution, keep n_jobs at its default value (None).
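As a brief sketch of setting scoring and cv explicitly (the metric and the fold count here are illustrative choices, not part of the original example):

# a grid search that evaluates folds with (negated) mean absolute error
# and uses 10-fold cross-validation instead of the default 5
grid_search_mae = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # one of scikit-learn's built-in scoring strings
    cv=10,
    n_jobs=-1,
)
# grid_search_mae.fit(X_train, y_train)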
Now, we can access the attributes of the GridSearchCV:
best_params = grid_search.best_params_ # the optimal hyperparameter combination
best_score = grid_search.best_score_ # mean cross-val score of the best_estimator
best_model = grid_search.best_estimator_ # the estimator with the highest score
test_score = grid_search.score(X_test, y_test)

GridSearchCV has a refit parameter, which is True by default and automatically refits grid_search.best_estimator_ on the whole training set.
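Because of this default refit, the fitted search object itself can be used for prediction, for example:

# with refit=True (the default), the search object delegates predict() to best_estimator_
y_pred = grid_search.predict(X_test)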
There is another important attribute, .cv_results_, which returns the dictionary containing the details of the grid search on each cross-validation iteration, and can be loaded into a dataframe for further inspection:
import pandas as pd
df = pd.DataFrame(grid_search.cv_results_)
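For instance, the best-ranked configurations can be inspected by sorting on the rank_test_score column (rank_test_score, mean_test_score, std_test_score, and params are standard cv_results_ keys):

# show the top hyperparameter combinations by mean cross-validation score
top = df.sort_values("rank_test_score")
print(top[["params", "mean_test_score", "std_test_score"]].head())

RandomizedSearchCV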
The main difference between GridSearchCV and RandomizedSearchCV is that the latter accepts distributions as components of the hyperparameter grid and lets you specify the number of configurations to be evaluated; the other parameters remain the same as for GridSearchCV:
- param_distributions (dict / list of dictionaries) — a dictionary with hyperparameter names as keys and distributions or lists of values to consider. Distributions must provide an rvs method for sampling (e.g., scipy.stats distributions). .rvs() generates random variates from a given distribution, with the distribution's shape parameters and size (specifies how many random variates should be sampled; if not specified, a single random number is returned) being its main arguments; a short sampling sketch follows this list;
- n_iter (default: 10, int) — the number of hyperparameter configurations to sample.
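As a quick, standalone illustration of sampling with .rvs() (the particular distribution and arguments here are just an assumed example):

from scipy.stats import randint

# randint(2, 10) is a discrete uniform distribution over the integers 2..9 (the upper bound is exclusive)
samples = randint(2, 10).rvs(size=5, random_state=40)
print(samples)  # five random integers drawn from the distribution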
Here, we will use a Pipeline to combine the scaling step with the model and run a randomized search. Note a small trick: usually, you can just pass the distribution itself under the specified key, without calling .rvs() explicitly (then a single value is drawn from the distribution on each iteration of the search, so n_iter, not .rvs()'s size, controls the number of samples). In the snippet below, the candidate values are predetermined instead, since randint.rvs() draws 5 random integers from the discrete uniform distribution over [2, 10) with a fixed random_state.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {
"kneighborsregressor__n_neighbors": randint.rvs(
2, 10, loc=0, size=5, random_state=40
),
"kneighborsregressor__weights": ["uniform", "distance"],
"kneighborsregressor__algorithm": ["auto", "ball_tree", "kd_tree"],
}
random_search = RandomizedSearchCV(
estimator=pipeline,
param_distributions=param_grid,
n_iter=10,
random_state=40,
n_jobs=-1,
)
# We continue to work with the California housing dataset,
# so we are not loading any new data
random_search.fit(X_train, y_train)

The attributes largely remain the same as with GridSearchCV. Note that the hyperparameter grid's keys (e.g., n_neighbors) are prefixed with the estimator's name from the pipeline followed by a double underscore (e.g., kneighborsregressor__n_neighbors); the step names can be accessed via the .named_steps attribute:
pipeline.named_steps

or, alternatively, pipeline.get_params().keys().
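As with the grid search, the usual attributes are available on the fitted random_search object, with the pipeline-prefixed keys appearing in the results, for example:

print(random_search.best_params_)  # contains keys such as 'kneighborsregressor__n_neighbors'
print(random_search.best_score_)  # mean cross-validation score of the best pipeline
test_score = random_search.score(X_test, y_test)  # performance of the refitted pipeline on the test set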
A comparison between GridSearchCV and RandomizedSearchCV
In this section, we will consider the differences between the two approaches and look at some scenarios when one search might be more suitable than the other.
- When provided with a large number of possible hyperparameter values, GridSearchCV becomes very inefficient. In such a setting, RandomizedSearchCV is the better choice, as it randomly samples a subset of combinations, and the best hyperparameters might be found in fewer iterations.
- When the possible optimal hyperparameter values are unknown, RandomizedSearchCV explores the specified ranges of the search space more widely and more efficiently, possibly finding a combination that a predetermined grid would miss (random search does not keep re-evaluating the same few grid values of each hyperparameter, so it effectively tries many more distinct values of the important ones).
- RandomizedSearchCV can also be used as a preliminary step to narrow down the ranges and then perform a more focused search with GridSearchCV; a minimal sketch of this two-stage approach follows below.
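A minimal sketch of such a two-stage search, reusing the pipeline defined above (the wide distribution, n_iter, and the width of the refined grid are illustrative assumptions):

from scipy.stats import randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# stage 1: a wide randomized search over a broad range of neighbor counts
wide_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions={"kneighborsregressor__n_neighbors": randint(2, 100)},
    n_iter=20,
    random_state=40,
    n_jobs=-1,
).fit(X_train, y_train)

best_k = wide_search.best_params_["kneighborsregressor__n_neighbors"]

# stage 2: a focused grid search around the best value found so far
narrow_grid = {"kneighborsregressor__n_neighbors": [max(1, best_k - 2), best_k, best_k + 2]}
focused_search = GridSearchCV(estimator=pipeline, param_grid=narrow_grid, n_jobs=-1)
focused_search.fit(X_train, y_train)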
Conclusion
In this topic, we went over hyperparameter search. To summarize, the main takeaways are:
- the two main methods for hyperparameter tuning in scikit-learn: GridSearchCV and RandomizedSearchCV;
- the differences between GridSearchCV and RandomizedSearchCV;
- how to choose the most fitting search method for a particular scenario.