
Multi-armed bandits


The multi-armed bandit problem is a classic reinforcement learning (RL) problem that models the exploration-exploitation trade-off in sequential decision-making. It involves an agent trying to maximize cumulative rewards by selecting from a set of actions (arms) with unknown reward distributions, balancing exploiting the currently known best action and exploring to find potentially better ones.

In this topic, we will look at the multi-armed bandit problem and consider the main approaches.

The problem setup

The multi-armed bandit problem is a sequential decision-making problem that models a scenario where an agent (or algorithm) must choose among a set of K actions (or arms) over a sequence of time steps. Each action or arm has an associated reward distribution, which is initially unknown to the agent. The goal of the agent is to maximize the cumulative reward obtained over the time horizon by adaptively choosing the arms based on the rewards observed so far.

Formally, let's denote the set of arms as A = \{1, 2, \ldots, K\}. For each arm i \in A, there is an associated reward distribution with an unknown expected value or mean reward denoted by \mu_i. At each time step t, the agent selects an arm a_t \in A and receives a reward r_t drawn from the reward distribution of the chosen arm.
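To make this setup concrete, the following sketch simulates a K-armed Bernoulli bandit. The class name, arm means, and seed are illustrative assumptions; the point is that the agent can only pull arms and observe sampled rewards, never the true means \mu_i.

```python
import numpy as np

class BernoulliBandit:
    """A simulated K-armed bandit with Bernoulli (0/1) rewards, for illustration only."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # true mean rewards mu_i, hidden from the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        """Draw a reward of 0 or 1 from the chosen arm's distribution."""
        return int(self.rng.random() < self.means[arm])

# The agent only observes rewards like these, never the means themselves.
bandit = BernoulliBandit(means=[0.2, 0.5, 0.7])
print([bandit.pull(2) for _ in range(5)])       # e.g. [1, 1, 0, 1, 1]
```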

The agent's objective is to maximize the expected cumulative reward over a time horizon of T steps, which can be expressed as:

\max E\left[ \sum_{t=1}^{T} r_t \right]

where the expectation is taken over the reward distributions and the agent's policy for selecting arms.

A key challenge in the multi-armed bandit problem is the exploration-exploitation trade-off: the agent must balance exploiting the currently known best arm (based on observed rewards) and exploring other arms to gather more information about their reward distributions, as one of the unexplored arms may have a higher expected reward.

The performance of a multi-armed bandit algorithm is often measured by its cumulative regret, which is the expected difference between the rewards obtained by an optimal policy (that always selects the best arm) and the rewards obtained by the algorithm. The goal is to minimize the cumulative regret over the time horizon.

\text{Cumulative regret} = \sum_{t=1}^{T} (\mu_* - \mu_{a_t})

where \mu_* is the expected reward of the optimal arm (the arm with the highest mean reward), and \mu_{a_t} is the expected reward of the arm chosen by the algorithm at time step t.
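As a small worked example (with made-up mean rewards, only computable in simulation where the true means are known), the cumulative regret of a short sequence of choices can be evaluated directly:

```python
import numpy as np

# Hypothetical true mean rewards; mu_* is the best of them.
true_means = np.array([0.2, 0.5, 0.7])
mu_star = true_means.max()

# Arms a_t chosen by some algorithm over T = 5 steps.
chosen_arms = [0, 2, 1, 2, 2]

cumulative_regret = sum(mu_star - true_means[a] for a in chosen_arms)
print(cumulative_regret)   # 0.5 + 0.0 + 0.2 + 0.0 + 0.0 = 0.7 (up to floating-point rounding)
```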

Various algorithms have been developed to address the multi-armed bandit problem, such as Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling, each with its own approach to balancing exploration and exploitation, which we will look at in the upcoming sections.

Epsilon-greedy

The Epsilon-Greedy algorithm is one of the simplest and most widely used algorithms for the multi-armed bandit problem. It follows a straightforward strategy to balance exploration and exploitation: with a small probability \epsilon (epsilon), the algorithm explores by selecting a random arm, and with probability 1 - \epsilon, it exploits by selecting the arm with the highest estimated expected reward based on the observed rewards so far.

Let Q(a) represent the estimated expected reward for arm a based on the observed rewards up to the current time step t. The Epsilon-Greedy algorithm can be defined as follows:

  1. Initialize Q(a) = 0 for all arms a \in A;

  2. For each time step t:
    a) With probability \epsilon, select a random arm a_t from A (exploration),
    b) With probability 1 - \epsilon, select the arm a_t = \argmax_a Q(a) (exploitation);

  3. Observe the reward r_t for the selected arm a_t;

  4. Update the estimated expected reward Q(a_t) based on the observed reward r_t.

The estimated expected reward Q(a) is typically computed as the average of the rewards observed for arm a up to the current time step. If n_a is the number of times arm a has been selected, and the observed rewards for arm a are r_1, r_2, \ldots, r_{n_a}, then:

Q(a) = \frac{1}{n_a} \sum_{i=1}^{n_a} r_i
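Putting the steps together, here is a minimal sketch of the Epsilon-Greedy loop in Python. The pull function, arm means, horizon, and \epsilon value are assumptions for illustration; the update uses the incremental form of the average above, Q(a_t) \leftarrow Q(a_t) + (r_t - Q(a_t)) / n_{a_t}.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(pull, K, T, epsilon=0.1):
    """Minimal Epsilon-Greedy loop; `pull(a)` is assumed to return a reward for arm a."""
    Q = np.zeros(K)   # estimated expected reward per arm
    n = np.zeros(K)   # number of times each arm has been pulled
    for t in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(K))      # explore: pick a random arm
        else:
            a = int(np.argmax(Q))         # exploit: pick the best estimate so far
        r = pull(a)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]         # incremental average update
    return Q, n

# Example usage with simulated Bernoulli arms (assumed means):
true_means = [0.2, 0.5, 0.7]
Q, n = epsilon_greedy(lambda a: float(rng.random() < true_means[a]), K=3, T=1000)
print(Q, n)   # estimates should concentrate on the arm with mean 0.7
```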

The key parameter in the Epsilon-Greedy algorithm is \epsilon, which controls the exploration-exploitation trade-off. A higher value of \epsilon promotes more exploration, as the algorithm will select random arms more frequently. Conversely, a lower value of \epsilon promotes more exploitation, as the algorithm will tend to select the arm with the highest estimated expected reward more often.

A common practice is to decay the exploration rate \epsilon over time, starting with a higher value to promote initial exploration and gradually decreasing it to favor exploitation as more information is gathered about the reward distributions of the arms.
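One possible (purely illustrative) schedule divides an initial exploration rate by a factor that grows with the time step:

```python
# Illustrative decay schedule: epsilon_t = epsilon_0 / (1 + decay_rate * t)
epsilon_0, decay_rate = 1.0, 0.01
epsilons = [epsilon_0 / (1 + decay_rate * t) for t in range(1000)]
print(epsilons[0], epsilons[99], epsilons[999])   # starts at 1.0 and shrinks toward 0 as t grows
```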

While the Epsilon-Greedy algorithm is simple and easy to implement, its performance can be sensitive to the choice of the exploration rate \epsilon. Setting \epsilon too low can lead to insufficient exploration and potentially converging to a suboptimal arm, while setting \epsilon too high can result in excessive exploration and slow convergence to the optimal arm.

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) algorithm is a more sophisticated approach to balancing exploration and exploitation in the multi-armed bandit problem. Instead of using a fixed exploration rate like in Epsilon-Greedy, UCB constructs an upper confidence bound on the expected reward of each arm and selects the arm with the highest upper bound at each time step.

The key idea behind UCB is to adjust the exploration-exploitation trade-off based on the number of times each arm has been played so far. Arms that have been played fewer times will have a higher degree of uncertainty in their estimated rewards, and UCB will favor exploring those arms more by assigning them a higher upper confidence bound.

For each arm a \in A, let X_a be a random variable representing the reward distribution of arm a, with an unknown expected value \mu_a. At time step t, the UCB algorithm selects the arm a_t that maximizes the following upper confidence bound:

a_t = \argmax_a \Big( Q(a) + c \cdot \sqrt{\frac{\log(t)}{n_a}} \Big)

where:

  • Q(a) is the estimated expected reward for arm a, typically computed as the average of observed rewards for that arm.

  • n_a is the number of times arm a has been played up to time step t.

  • c is an exploration constant that determines the degree of exploration.

  • \sqrt{\frac{\log(t)}{n_a}} is the exploration term, which decreases as arm a is played more times.

The first term, Q(a), represents the exploitation component, favoring arms with higher estimated rewards. The second term, c \cdot \sqrt{\frac{\log(t)}{n_a}}, is the exploration component, which is higher for arms that have been played fewer times (smaller n_a).

The exploration constant c controls the degree of exploration: a larger value of c promotes more exploration, while a smaller value favors exploitation more. A common choice for c is \sqrt{2}, which provides theoretical guarantees on the regret bound for UCB.

After selecting arm a_t, the UCB algorithm observes the reward r_t and updates the estimated expected reward Q(a_t) based on the new observation.
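The selection rule and update can be sketched as follows. The pull function, arm means, and horizon are illustrative assumptions; each arm is played once first so that n_a > 0 before the confidence bound is computed.

```python
import numpy as np

rng = np.random.default_rng(1)

def ucb(pull, K, T, c=np.sqrt(2)):
    """Minimal UCB loop; `pull(a)` is assumed to return a reward for arm a."""
    Q = np.zeros(K)   # estimated expected reward per arm
    n = np.zeros(K)   # number of times each arm has been played
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                     # play each arm once so n[a] > 0
        else:
            bounds = Q + c * np.sqrt(np.log(t) / n)
            a = int(np.argmax(bounds))    # arm with the highest upper confidence bound
        r = pull(a)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]         # incremental average update
    return Q, n

# Example usage with simulated Bernoulli arms (assumed means):
true_means = [0.2, 0.5, 0.7]
Q, n = ucb(lambda a: float(rng.random() < true_means[a]), K=3, T=1000)
print(Q, n)   # the arm with mean 0.7 should accumulate most of the pulls
```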

The UCB algorithm strikes a balance between exploration and exploitation by adaptively adjusting the exploration-exploitation trade-off based on the number of times each arm has been played. It has been shown to have strong theoretical guarantees on the cumulative regret and is widely used in practice for multi-armed bandit problems.

Thompson sampling

Unlike Epsilon-Greedy and UCB, which use heuristics to balance exploration and exploitation, Thompson Sampling treats the problem in a probabilistic framework and naturally balances exploration and exploitation through its sampling process.

In Thompson Sampling, for each arm a, a prior distribution is maintained over the possible values of the expected reward \mu_a. This prior distribution represents the initial belief about the reward distribution of the arm before any observations are made.

At each time step t, Thompson Sampling proceeds as follows:

  1. For each arm a, sample a reward value \theta_a from the posterior distribution of \mu_a, which is the updated distribution after incorporating the observed rewards for that arm so far.

  2. Select the arm a_t with the highest sampled reward value \theta_a.

  3. Observe the actual reward r_t for the selected arm a_t.

  4. Update the posterior distribution of \mu_{a_t} based on the new observed reward r_t.

The key idea behind Thompson Sampling is that arms with higher uncertainty in their posterior distributions (due to fewer observations) are more likely to occasionally sample higher reward values, leading to their selection and further exploration. At the same time, arms with higher expected rewards based on past observations are more likely to sample higher reward values more consistently, leading to their exploitation.

A common choice for the prior and posterior distributions in Thompson Sampling is the beta distribution, which is a conjugate prior for the Bernoulli distribution (commonly used to model binary rewards). If the rewards for arm a are binary, and the prior is a \text{Beta}(\alpha, \beta) distribution, then after observing n_a rewards with s_a successes (rewards of 1), the posterior distribution of \mu_a is:

\text{Posterior}(\mu_a) = \text{Beta}(\alpha + s_a, \beta + n_a - s_a)

The sampled reward value \theta_a is then drawn from this updated posterior distribution.
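Here is a minimal sketch of Beta-Bernoulli Thompson Sampling in Python. The pull function and arm means are illustrative assumptions; each arm starts with a \text{Beta}(1, 1) (uniform) prior, and the posterior parameters are updated with the observed successes and failures.

```python
import numpy as np

rng = np.random.default_rng(2)

def thompson_sampling(pull, K, T):
    """Minimal Beta-Bernoulli Thompson Sampling; `pull(a)` is assumed to return 0 or 1."""
    alpha = np.ones(K)   # Beta(1, 1) prior for each arm's mean reward
    beta = np.ones(K)
    for t in range(T):
        theta = rng.beta(alpha, beta)     # one posterior sample per arm
        a = int(np.argmax(theta))         # play the arm with the highest sample
        r = pull(a)                       # observe a binary reward
        alpha[a] += r                     # posterior update: add a success
        beta[a] += 1 - r                  # posterior update: add a failure
    return alpha, beta

# Example usage with simulated Bernoulli arms (assumed means):
true_means = [0.2, 0.5, 0.7]
alpha, beta = thompson_sampling(lambda a: int(rng.random() < true_means[a]), K=3, T=1000)
print(alpha / (alpha + beta))   # posterior mean estimates of each arm's reward
```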

Thompson Sampling has been shown to perform well in practice and has strong theoretical guarantees on the cumulative regret. It naturally balances exploration and exploitation without the need for tuning parameters like the exploration rate in Epsilon-Greedy or the exploration constant in UCB.

Conclusion

You are now familiar with the basic setting of the multi-armed bandit problem and the theoretical foundations of the Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling algorithms.
