As you already know, almost any model has a set of hyperparameters we can adjust to alter the model’s performance. Their effect is often dramatic, which makes selecting the correct hyperparameter set crucial. For this reason, it is a good idea to know how to tune them.
There are several valid ways to do this. You might already be familiar with grid search, the most basic and straightforward method for tuning hyperparameters, which essentially involves trying out different sets in a predefined manner. However, this approach is labor-intensive and often inconvenient, and it can lead to suboptimal results. For this reason, we have the optuna library at our disposal. It allows us to partly automate the process, making hyperparameter tuning easier.
Quick start
The most comprehensive and up-to-date way to learn about the library is to check the official documentation. Here, we are going to provide some excerpts from it.
To install the library, one could simply use pip install optuna or conda install optuna, depending on the package manager you use.
Defining an Objective
An objective is the basic building block of optuna, if you will. It determines the actions we take for every single hyperparameter set we try. It is defined as a function that receives a trial object as input and outputs some sort of metric as a result. There are no restrictions on the metric used, other than it should be either "the bigger, the better" or "the smaller, the better", so that optuna knows in which direction to optimize. Inside the objective, we define all the values that represent the hyperparameters. Here is a quick example:
import optuna
def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2
In trial.suggest_float we state that we have a hyperparameter x that can vary over the range from -10 to 10. There are other ways to set hyperparameters (for example, optuna allows you to sample them from a "categorical" set), and you can find those in the documentation.
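For illustration, here is a short sketch of a few other suggestion methods; the hyperparameter names (n_estimators, booster, lr) are just placeholders:

def objective(trial):
    # An integer hyperparameter sampled from [50, 300]
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    # A value drawn from a fixed ("categorical") set of options
    booster = trial.suggest_categorical('booster', ['gbtree', 'dart'])
    # A float sampled on a log scale, handy for learning rates
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    ...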
Of course, the example above is a simple one. A typical ML objective usually includes model training and evaluation inside it, as shown in the following sections.
Defining a study
After we have our objective up and running, we can now define a study and set up some parameters:
import optuna
def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction='minimize', sampler=sampler)
study.optimize(objective, n_trials=100)
study.best_params # E.g. {'x': 2.002108042}
Let’s dive into what is actually happening here.
- We create a sampler object. Its job is to 'generate' the hyperparameters based on the restrictions we defined in our objective. Here we create a TPESampler instance. It is a pretty popular sampler, and we are going to study it more closely below.
- We define a study. This is the main object we work with: it stores all the data about the different trials we run. At this point, we also define the direction of the metric used in the objective (whether it should be minimized or maximized).
- We start the hyperparameter optimization process. The number of trials defines the total number of hyperparameter sets we are going to explore.
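Once the optimization has finished, the study object exposes the whole history for inspection; for instance:

study.best_value           # The best objective value observed
study.best_trial.number    # The index of the trial that achieved it
study.trials_dataframe()   # The full history as a pandas DataFrame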
Of course, the optuna library is far more feature-rich than this, and we won't be able to cover everything here.
TPE Sampler
TPE (which stands for Tree-structured Parzen Estimator) is one of the popular samplers used in optuna. It doesn't require a predefined parameter grid and works with any parameter types, making it easy to use. Let's have a closer look at how it actually works.
Unlike what we usually do, let's assume that we've already had multiple trials in the past. We are now standing at trial $t$ and want to decide which hyperparameters we should use for the next trial. What we have is data about the prior trials:

- which hyperparameter sets we used (we denote one set as $x$);
- what the corresponding scores were (we denote one score as $y$).
This means that one could relatively easily estimate the PDF that models the distribution of the tried sets, denoted as $g(x)$. We do not do this analytically; rather, we do pretty much the same thing that happens when one plots a histplot (strictly speaking, this process is called Parzen window estimation, and that is where the name comes from). So what we have now is a function that indicates how frequently different sets $x$ have been used.

Now, to determine a new hyperparameter set, we could just randomly sample from $g(x)$. However, doing this completely blindly is a bad idea, since we would not even use the information about how good the prior trials actually were. To take that into account, we are going to introduce two more things. We first set ourselves a threshold $y^*$ that determines when a trial is called "successful". Since usually a lower $y$ corresponds to a better trial, we can write $y < y^*$ to express that. Having the trials binarized, we can now build another function that we call $\ell(x)$. It is pretty much the same thing as $g(x)$, but we only take into account the trials that were "successful".
Having these two functions at our disposal, we can now introduce the final ingredient of this algorithm, which is the expected improvement (EI). By definition, one can compute it as:

$$\mathrm{EI}(x) \propto \frac{\ell(x)}{g(x)}$$

A bigger EI corresponds to $x$ that:

- is likely under the $\ell(x)$ distribution;
- is unlikely under the $g(x)$ distribution.
So the finalized version of the sampler looks like this:

- Sample several candidate hyperparameter sets from $\ell(x)$ only;
- Calculate the EI value for each of the candidates;
- Use the $x$ with the biggest EI for the next trial;
- After completing the trial, recalculate $\ell(x)$ and $g(x)$.
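To make the loop above concrete, here is a minimal sketch of a single such step, using scipy's Gaussian KDE as the Parzen window estimator; the toy history and the 25% success quantile are assumptions made purely for illustration:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# A made-up history of 30 past trials of a single hyperparameter x
xs = rng.uniform(-10, 10, size=30)
ys = (xs - 2) ** 2                       # lower score = better trial

y_star = np.quantile(ys, 0.25)           # the success threshold y*
l = gaussian_kde(xs[ys < y_star])        # l(x): density of the "successful" x
g = gaussian_kde(xs)                     # g(x): density of all tried x

candidates = l.resample(16, seed=1).ravel()  # sample candidates from l only
ei = l(candidates) / g(candidates)           # EI is proportional to l / g
next_x = candidates[np.argmax(ei)]           # candidate with the biggest EI wins

The real TPE implementation is more elaborate (it handles many hyperparameters, categorical values, priors over the search space, and so on), but the core loop is the same.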
Let us now reflect a little bit on what we said above.
- We can see two hyperparameters of this sampler: the success threshold $y^*$ and the number of candidate $x$-s we sample each time to choose from.
- During the very first trials, we need to "ramp up" before we can actually make real use of this algorithm, which means it is more effective when you have a large number of trials. In the beginning, these are more like "blind shots", since we have no idea yet about which hyperparameters produce which results. By the way, optuna counts these trials as regular trials as well.
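In optuna itself, these knobs correspond to TPESampler arguments: n_startup_trials sets the number of initial "blind shot" trials, and n_ei_candidates sets how many candidate sets are scored by EI at each step (the success threshold is governed by the gamma argument, which we leave at its default here). For example, with arbitrary values:

sampler = optuna.samplers.TPESampler(
    n_startup_trials=15,  # random trials before TPE kicks in
    n_ei_candidates=32,   # candidate sets scored by EI at each step
    seed=42,              # for reproducibility
)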
Pruning
Pruning is a method to cut off potentially "unsuccessful" runs at an early stage. Suppose we are training a gradient boosting model, which consists of fitting a lot of small and simple estimators one by one. After every estimator added, we can calculate the resulting score. The basic idea of pruning is this: "if the current trial's intermediate result is worse than the median of intermediate results of previous trials at the same step", the trial should be eliminated to save resources. The quoted rule describes the MedianPruner, a simple pruning strategy optuna offers.
In order to avoid trials being killed too early, optuna allows a tunable number of so-called warmup steps during which a trial cannot be eliminated regardless of the score. This is useful when your model needs some time to build up before it starts producing decent results. However, keep in mind that a large grace period leads to higher resource consumption.
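In the MedianPruner, this grace period is exposed as the n_warmup_steps argument, and n_startup_trials plays a similar role at the level of whole trials; the numbers below are arbitrary:

pruner = optuna.pruners.MedianPruner(
    n_startup_trials=5,  # never prune during the first 5 trials
    n_warmup_steps=5,    # never prune during the first 5 steps of any trial
)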
Using a pruner with your model can be tricky, since the model has to expose some sort of interface to access it at different stages of training. This is usually easy to do for neural networks, since you typically write the training loop yourself and can implement the feature there. For other models, you sometimes get a callback interface, which allows you to pass a function to be called after each optimization step the model makes, but it might not be as obvious to use.
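As one illustration of such a callback interface, here is a hedged sketch using the ready-made LightGBM integration shipped with optuna; dtrain and dvalid are assumed to be prepared lgb.Dataset objects:

import lightgbm as lgb
import optuna.integration

def objective(trial):
    params = {
        'objective': 'binary',
        'learning_rate': trial.suggest_float('lr', 1e-3, 0.3, log=True),
    }
    # Reports the validation metric to the trial after each boosting round
    # and raises TrialPruned when the pruner decides to stop the run
    pruning_cb = optuna.integration.LightGBMPruningCallback(trial, 'binary_logloss')
    evals = {}
    booster = lgb.train(params, dtrain, valid_sets=[dvalid],
                        callbacks=[pruning_cb, lgb.record_evaluation(evals)])
    return evals['valid_0']['binary_logloss'][-1]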
Here is a basic example of how you could use pruning with a simple two-layer neural network. We start by defining the callback function:
def optuna_callback(trial, loss, epoch):
    trial.report(loss, epoch)
    if trial.should_prune():
        raise optuna.TrialPruned()
After that, you need to implement the callback option into your training loop:
# A default training loop (assumes a train_loader is defined elsewhere)
def train_model(model, optimizer, criterion, callback, n_epochs=3):
    for epoch in range(n_epochs):
        model.train()
        running_loss = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        average_loss = running_loss / len(train_loader)
        callback(average_loss, epoch)  # Callback based on the epoch's average loss
    return average_loss
Then you bind these two in the objective:
from functools import partial

import torch.optim as optim

def objective(trial):
    # "Put" the trial inside the callback
    callback = partial(optuna_callback, trial)
    # Define the hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    model = ...      # Your model here
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = ...  # Your loss function here
    return train_model(model, optimizer, criterion, callback)
Finally, you initialize a study:
pruner = optuna.pruners.MedianPruner()
study = optuna.create_study(direction="minimize", pruner=pruner)
study.optimize(objective, n_trials=10)

Conclusion
In this topic, we covered some basics of the optuna library. We showed you some trivial usage examples, explained how one of the popular samplers, TPESampler, works, and discussed what pruning is and how one can use it to speed up the trials.