
Multi-armed bandits


The multi-armed bandit problem is a classic reinforcement learning (RL) problem that models the exploration-exploitation trade-off in sequential decision-making. It involves an agent trying to maximize cumulative rewards by selecting from a set of actions (arms) with unknown reward distributions, balancing exploiting the currently known best action and exploring to find potentially better ones.

In this topic, we will look at the multi-armed bandit problem and consider the main approaches.

The problem setup

The multi-armed bandit problem is a sequential decision-making problem that models a scenario where an agent (or algorithm) must choose among a set of K actions (or arms) over a sequence of time steps. Each action or arm has an associated reward distribution, which is initially unknown to the agent. The goal of the agent is to maximize the cumulative reward obtained over the time horizon by adaptively choosing the arms based on the rewards observed so far.

Formally, let's denote the set of arms as A = \{1, 2, \ldots, K\}. For each arm i \in A, there is an associated reward distribution with an unknown expected value or mean reward denoted by \mu_i. At each time step t, the agent selects an arm a_t \in A and receives a reward r_t drawn from the reward distribution of the chosen arm.
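To make this setup concrete, the following sketch simulates a K-armed Bernoulli bandit. The class name, arm means, and seed are illustrative assumptions; the point is that the agent can only pull arms and observe sampled rewards, never the true means \mu_i.

```python
import numpy as np

class BernoulliBandit:
    """A simulated K-armed bandit with Bernoulli (0/1) rewards, for illustration only."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # true mean rewards mu_i, hidden from the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        """Draw a reward of 0 or 1 from the chosen arm's distribution."""
        return int(self.rng.random() < self.means[arm])

# The agent only observes rewards like these, never the means themselves.
bandit = BernoulliBandit(means=[0.2, 0.5, 0.7])
print([bandit.pull(2) for _ in range(5)])       # e.g. [1, 1, 0, 1, 1]
```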

The agent's objective is to maximize the expected cumulative reward over a time horizon of T steps, which can be expressed as:

\max E\left[ \sum_{t=1}^{T} r_t \right]

where the expectation is taken over the reward distributions and the agent's policy for selecting arms.

A key challenge in the multi-armed bandit problem is the exploration-exploitation trade-off: the agent must balance exploiting the currently known best arm (based on observed rewards) and exploring other arms to gather more information about their reward distributions, as one of the unexplored arms may have a higher expected reward.

The performance of a multi-armed bandit algorithm is often measured by its cumulative regret, which is the expected difference between the rewards obtained by an optimal policy (that always selects the best arm) and the rewards obtained by the algorithm. The goal is to minimize the cumulative regret over the time horizon.

\text{Cumulative regret} = \sum_{t=1}^{T} (\mu_* - \mu_{a_t})

where \mu_* is the expected reward of the optimal arm (the arm with the highest mean reward), and \mu_{a_t} is the expected reward of the arm chosen by the algorithm at time step t.
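As a small worked example (with made-up mean rewards, only computable in simulation where the true means are known), the cumulative regret of a short sequence of choices can be evaluated directly:

```python
import numpy as np

# Hypothetical true mean rewards; mu_* is the best of them.
true_means = np.array([0.2, 0.5, 0.7])
mu_star = true_means.max()

# Arms a_t chosen by some algorithm over T = 5 steps.
chosen_arms = [0, 2, 1, 2, 2]

cumulative_regret = sum(mu_star - true_means[a] for a in chosen_arms)
print(cumulative_regret)   # 0.5 + 0.0 + 0.2 + 0.0 + 0.0 = 0.7 (up to floating-point rounding)
```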

Various algorithms have been developed to address the multi-armed bandit problem, such as Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling, each with its own approach to balancing exploration and exploitation, which we will look at in the upcoming sections.

Epsilon-greedy

The Epsilon-Greedy algorithm is one of the simplest and most widely used algorithms for the multi-armed bandit problem. It follows a straightforward strategy to balance exploration and exploitation: with a small probability \epsilon (epsilon), the algorithm explores by selecting a random arm, and with probability 1 - \epsilon, it exploits by selecting the arm with the highest estimated expected reward based on the observed rewards so far.

Let Q(a) represent the estimated expected reward for arm a based on the observed rewards up to the current time step t. The Epsilon-Greedy algorithm can be defined as follows:

  1. Initialize Q(a) = 0 for all arms a \in A;

  2. For each time step t:
    a) With probability \epsilon, select a random arm a_t from A (exploration),
    b) With probability 1 - \epsilon, select the arm a_t = \argmax_a Q(a) (exploitation);

  3. Observe the reward r_t for the selected arm a_t;

  4. Update the estimated expected reward Q(a_t) based on the observed reward r_t.

The estimated expected reward Q(a) is typically computed as the average of the rewards observed for arm a up to the current time step. If n_a is the number of times arm a has been selected, and the observed rewards for arm a are r_1, r_2, \ldots, r_{n_a}, then:

Q(a) = \frac{1}{n_a} \sum_{i=1}^{n_a} r_i
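Putting the steps together, here is a minimal sketch of the Epsilon-Greedy loop in Python. The pull function, arm means, horizon, and \epsilon value are assumptions for illustration; the update uses the incremental form of the average above, Q(a_t) \leftarrow Q(a_t) + (r_t - Q(a_t)) / n_{a_t}.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(pull, K, T, epsilon=0.1):
    """Minimal Epsilon-Greedy loop; `pull(a)` is assumed to return a reward for arm a."""
    Q = np.zeros(K)   # estimated expected reward per arm
    n = np.zeros(K)   # number of times each arm has been pulled
    for t in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(K))      # explore: pick a random arm
        else:
            a = int(np.argmax(Q))         # exploit: pick the best estimate so far
        r = pull(a)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]         # incremental average update
    return Q, n

# Example usage with simulated Bernoulli arms (assumed means):
true_means = [0.2, 0.5, 0.7]
Q, n = epsilon_greedy(lambda a: float(rng.random() < true_means[a]), K=3, T=1000)
print(Q, n)   # estimates should concentrate on the arm with mean 0.7
```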

The key parameter in the Epsilon-Greedy algorithm is \epsilon, which controls the exploration-exploitation trade-off. A higher value of \epsilon promotes more exploration, as the algorithm will select random arms more frequently. Conversely, a lower value of \epsilon promotes more exploitation, as the algorithm will tend to select the arm with the highest estimated expected reward more often.

A common practice is to decay the exploration rate \epsilon over time, starting with a higher value to promote initial exploration and gradually decreasing it to favor exploitation as more information is gathered about the reward distributions of the arms.
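One possible (purely illustrative) schedule divides an initial exploration rate by a factor that grows with the time step:

```python
# Illustrative decay schedule: epsilon_t = epsilon_0 / (1 + decay_rate * t)
epsilon_0, decay_rate = 1.0, 0.01
epsilons = [epsilon_0 / (1 + decay_rate * t) for t in range(1000)]
print(epsilons[0], epsilons[99], epsilons[999])   # starts at 1.0 and shrinks toward 0 as t grows
```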

While the Epsilon-Greedy algorithm is simple and easy to implement, its performance can be sensitive to the choice of the exploration rate \epsilon. Setting \epsilon too low can lead to insufficient exploration and potentially converging to a suboptimal arm, while setting \epsilon too high can result in excessive exploration and slow convergence to the optimal arm.

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) algorithm is a more sophisticated approach to balancing exploration and exploitation in the multi-armed bandit problem. Instead of using a fixed exploration rate like in Epsilon-Greedy, UCB constructs an upper confidence bound on the expected reward of each arm and selects the arm with the highest upper bound at each time step.

The key idea behind UCB is to adjust the exploration-exploitation trade-off based on the number of times each arm has been played so far. Arms that have been played fewer times will have a higher degree of uncertainty in their estimated rewards, and UCB will favor exploring those arms more by assigning them a higher upper confidence bound.

For each arm a \in A, let X_a be a random variable representing the reward distribution of arm a, with an unknown expected value \mu_a. At time step t, the UCB algorithm selects the arm a_t that maximizes the following upper confidence bound:

a_t = \argmax_a \Big( Q(a) + c \cdot \sqrt{\frac{\log(t)}{n_a}} \Big)

where:

  • Q(a) is the estimated expected reward for arm a, typically computed as the average of observed rewards for that arm.

  • n_a is the number of times arm a has been played up to time step t.

  • c is an exploration constant that determines the degree of exploration.

  • \sqrt{\frac{\log(t)}{n_a}} is the exploration term, which decreases as arm a is played more times.

The first term, Q(a), represents the exploitation component, favoring arms with higher estimated rewards. The second term, c \cdot \sqrt{\frac{\log(t)}{n_a}}, is the exploration component, which is higher for arms that have been played fewer times (smaller n_a).

The exploration constant c controls the degree of exploration: a larger value of c promotes more exploration, while a smaller value favors exploitation more. A common choice for c is \sqrt{2}, which provides theoretical guarantees on the regret bound for UCB.

After selecting arm a_t, the UCB algorithm observes the reward r_t and updates the estimated expected reward Q(a_t) based on the new observation.
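The selection rule and update can be sketched as follows. The pull function, arm means, and horizon are illustrative assumptions; each arm is played once first so that n_a > 0 before the confidence bound is computed.

```python
import numpy as np

rng = np.random.default_rng(1)

def ucb(pull, K, T, c=np.sqrt(2)):
    """Minimal UCB loop; `pull(a)` is assumed to return a reward for arm a."""
    Q = np.zeros(K)   # estimated expected reward per arm
    n = np.zeros(K)   # number of times each arm has been played
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                     # play each arm once so n[a] > 0
        else:
            bounds = Q + c * np.sqrt(np.log(t) / n)
            a = int(np.argmax(bounds))    # arm with the highest upper confidence bound
        r = pull(a)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]         # incremental average update
    return Q, n

# Example usage with simulated Bernoulli arms (assumed means):
true_means = [0.2, 0.5, 0.7]
Q, n = ucb(lambda a: float(rng.random() < true_means[a]), K=3, T=1000)
print(Q, n)   # the arm with mean 0.7 should accumulate most of the pulls
```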

The UCB algorithm strikes a balance between exploration and exploitation by adaptively adjusting the exploration-exploitation trade-off based on the number of times each arm has been played. It has been shown to have strong theoretical guarantees on the cumulative regret and is widely used in practice for multi-armed bandit problems.

Thompson sampling

Unlike Epsilon-Greedy and UCB, which use heuristics to balance exploration and exploitation, Thompson Sampling treats the problem in a probabilistic framework and naturally balances exploration and exploitation through its sampling process.

In Thompson Sampling, for each arm a, a prior distribution is maintained over the possible values of the expected reward \mu_a. This prior distribution represents the initial belief about the reward distribution of the arm before any observations are made.

At each time step t, Thompson Sampling proceeds as follows:

  1. For each arm a, sample a reward value \theta_a from the posterior distribution of \mu_a, which is the updated distribution after incorporating the observed rewards for that arm so far.

  2. Select the arm a_t with the highest sampled reward value \theta_a.

  3. Observe the actual reward r_t for the selected arm a_t.

  4. Update the posterior distribution of \mu_{a_t} based on the new observed reward r_t.

The key idea behind Thompson Sampling is that arms with higher uncertainty in their posterior distributions (due to fewer observations) are more likely to occasionally sample higher reward values, leading to their selection and further exploration. At the same time, arms with higher expected rewards based on past observations are more likely to sample higher reward values more consistently, leading to their exploitation.

A common choice for the prior and posterior distributions in Thompson Sampling is the beta distribution, which is a conjugate prior for the Bernoulli distribution (commonly used to model binary rewards). If the rewards for arm a are binary, and the prior is a \text{Beta}(\alpha, \beta) distribution, then after observing n_a rewards with s_a successes (rewards of 1), the posterior distribution of \mu_a is:

\text{Posterior}(\mu_a) = \text{Beta}(\alpha + s_a, \beta + n_a - s_a)

The sampled reward value \theta_a is then drawn from this updated posterior distribution.
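Here is a minimal sketch of Beta-Bernoulli Thompson Sampling in Python. The pull function and arm means are illustrative assumptions; each arm starts with a \text{Beta}(1, 1) (uniform) prior, and the posterior parameters are updated with the observed successes and failures.

```python
import numpy as np

rng = np.random.default_rng(2)

def thompson_sampling(pull, K, T):
    """Minimal Beta-Bernoulli Thompson Sampling; `pull(a)` is assumed to return 0 or 1."""
    alpha = np.ones(K)   # Beta(1, 1) prior for each arm's mean reward
    beta = np.ones(K)
    for t in range(T):
        theta = rng.beta(alpha, beta)     # one posterior sample per arm
        a = int(np.argmax(theta))         # play the arm with the highest sample
        r = pull(a)                       # observe a binary reward
        alpha[a] += r                     # posterior update: add a success
        beta[a] += 1 - r                  # posterior update: add a failure
    return alpha, beta

# Example usage with simulated Bernoulli arms (assumed means):
true_means = [0.2, 0.5, 0.7]
alpha, beta = thompson_sampling(lambda a: int(rng.random() < true_means[a]), K=3, T=1000)
print(alpha / (alpha + beta))   # posterior mean estimates of each arm's reward
```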

Thompson Sampling has been shown to perform well in practice and has strong theoretical guarantees on the cumulative regret. It naturally balances exploration and exploitation without the need for tuning parameters like the exploration rate in Epsilon-Greedy or the exploration constant in UCB.

Conclusion

You are now familiar with the basic setting of the multi-armed bandit problem and the theoretical foundations of the Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling algorithms.
