
Introduction to policy optimization

Policy optimization is a class of reinforcement learning algorithms that directly optimize the policy function by adjusting its parameters to maximize the expected cumulative reward. Unlike value-based methods, these algorithms search the policy space directly rather than deriving a policy from learned value estimates, which makes them well-suited for continuous action spaces and high-dimensional environments.

In this topic, we will look at the basic formulation of the policy gradient and consider some common policy optimization algorithms.

The policy gradient

The policy gradient provides a way to directly optimize the policy parameters to maximize the expected return.

The policy gradient theorem is the foundation for a class of algorithms known as policy gradient methods. It states that the gradient of the expected return $J(\theta)$ with respect to the policy parameters $\theta$ can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, Q^{\pi_\theta}(s_t,a_t) \right]$$

Where

  • $\mathbb{E}_{\pi_\theta}$ is the expectation under policy $\pi_\theta$;

  • $\pi_\theta(a_t|s_t)$ is the probability of taking action $a_t$ in state $s_t$ at time step $t$ under policy $\pi$ parameterized by $\theta$;

  • $a_t$ is the action taken at time step $t$;

  • $s_t$ is the state at time step $t$;

  • $Q^{\pi_\theta}(s_t,a_t)$ is the action-value function under the current policy $\pi_\theta$ at time step $t$.

The policy gradient theorem provides a way to adapt the policy toward actions that yield higher returns, based on the agent's experiences, without needing to know the underlying dynamics of the environment. This can be thought of as "do more of what works well, and less of what doesn't". The policy parameters $\theta$ can then be updated in the direction of the gradient to improve the expected return:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.
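
To make this concrete, below is a minimal REINFORCE-style sketch of the update in Python with PyTorch. The network sizes, state and action dimensions, and learning rate are illustrative assumptions, not part of the formulation above; the batch of states, actions, and return estimates is assumed to have been collected by running the current policy.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative policy network: maps a 4-dimensional state to logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # lr plays the role of alpha

def policy_gradient_step(states, actions, returns):
    """One vanilla policy gradient update.

    states:  (T, 4) tensor of states visited under the current policy
    actions: (T,) tensor of the actions taken
    returns: (T,) tensor of Monte Carlo estimates of Q(s_t, a_t)
    """
    dist = Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)          # log pi_theta(a_t | s_t)
    # Gradient ascent on J(theta): minimize the negative surrogate objective,
    # whose gradient matches the policy gradient theorem above.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```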

One limitation of the basic policy gradient method is that it suffers from high variance in the gradient estimates, leading to unstable learning. To mitigate this issue, various techniques are employed, such as actor-critic methods and baseline subtraction.

Actor-critic methods

The actor-critic method combines the ideas of policy gradient methods (the actor) and value function approximation (the critic). The actor learns the policy that maps states to actions, while the critic estimates the value function, which is then used to improve the actor's policy updates.

The actor-critic architecture consists of two components: the actor and the critic. The actor is the policy $\pi_\theta(a_t|s_t)$ parameterized by $\theta$, which determines the probability distribution over actions given a state $s$. The critic estimates the value function $V^\pi(s)$ or the action-value function $Q^\pi(s,a)$ under the current policy $\pi$. Below, we will outline the actor-critic method in the classical form, although there are a few other variants available.

With the actor-critic method, the policy gradient can be approximated as:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, A_w(s_t,a_t) \right]$$

Where $A_w(s_t,a_t)$ is the advantage function estimated by the critic (the subscript $w$ denotes the critic's parameters), defined as:

$$A_w(s_t,a_t) = Q_w(s_t,a_t) - V_w(s_t)$$

The advantage function represents the relative advantage of taking action $a_t$ in state $s_t$, compared to the average performance of the current policy in that state. The remaining notation breaks down as follows:

  • $Q_w(s_t,a_t)$ is the expected return starting from state $s_t$, taking action $a_t$, and then following policy $\pi$;

  • $V_w(s_t)$ is the expected return starting from state $s_t$ and following policy $\pi$.

The actor and critic components are typically trained simultaneously: the critic learns an increasingly accurate estimate of the value function, and the actor updates the policy based on the critic's value estimates.
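
The sketch below illustrates one online actor-critic update in Python with PyTorch. It uses a common variant in which the one-step TD error serves as the advantage estimate instead of learning $Q_w$ and $V_w$ separately; the network sizes, discount factor, and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative actor (policy) and critic (state-value) networks.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

def actor_critic_step(state, action, reward, next_state, done):
    """One update from a single transition (state, action, reward, next_state, done)."""
    value = critic(state).squeeze(-1)
    next_value = critic(next_state).squeeze(-1).detach()

    # One-step TD target; the TD error acts as the advantage estimate A(s_t, a_t).
    td_target = reward + gamma * next_value * (1.0 - done)
    advantage = (td_target - value).detach()

    # Critic update: regress V(s_t) toward the TD target.
    critic_loss = (td_target - value).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient weighted by the advantage.
    log_prob = Categorical(logits=actor(state)).log_prob(action)
    actor_loss = -log_prob * advantage
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```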

Baselines

Another common approach to addressing the learning instability caused by high variance in the gradient estimates is to use baselines. In this section, we will briefly look at two common choices (although there are many more than two):

  • baseline subtraction, and

  • state value function baseline.

The basic idea behind baseline subtraction is to subtract a baseline value from the return, without changing the expected value of the gradient estimator. This can significantly reduce the variance of the gradient estimates, leading to more stable learning.

The policy gradient theorem with baseline subtraction can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, (G_t - b(s_t)) \right]$$

Where

  • $G_t$ is the return from time step $t$;

  • $b(s_t)$ is the baseline function, which can be any function that does not depend on the action taken in state $s_t$.
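
The toy simulation below illustrates why this works. It samples placeholder values standing in for the per-timestep score terms $\nabla_\theta \log \pi_\theta(a_t|s_t)$ (treated as scalars) and noisy returns $G_t$, then compares the gradient estimator with and without a simple average-return baseline; the numbers and distributions are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Placeholder "score" terms standing in for grad log pi_theta(a_t|s_t) (scalars here),
# and noisy, high-magnitude returns G_t that are correlated with them.
scores = rng.normal(size=n)
returns = 10.0 + 2.0 * scores + rng.normal(size=n)

no_baseline = scores * returns                        # raw per-sample gradient terms
with_baseline = scores * (returns - returns.mean())   # average-return baseline b

print(no_baseline.mean(), with_baseline.mean())  # both estimate the same gradient (about 2.0)
print(no_baseline.var(), with_baseline.var())    # variance is roughly an order of magnitude lower
```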

Another choice for the baseline function is the state value function $V^\pi(s)$, which represents the expected return from state $s$ under the current policy $\pi$. The state value function can be written as:

$$V^{\pi}(s) = \mathbb{E}_{\pi} [G_t \mid S_t = s] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]$$

where

  • $G_t$ is the discounted return from time step $t$;

  • $\gamma$ is the discount factor;

  • $R_t$ is the reward at time step $t$.

Then, using $V^{\pi_\theta}(s_t)$ as the baseline, known as the state value function baseline, the policy gradient theorem becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, (G_t - V^{\pi_\theta}(s_t)) \right]$$

Using the value function as a baseline can significantly reduce the variance of the gradient estimates, especially in problems with high rewards or long episodes.
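
In practice, $V^{\pi_\theta}$ is usually approximated by a second, learned network. Below is a minimal sketch of a REINFORCE update with such a learned state-value baseline, again in PyTorch; the network sizes, learning rates, and the assumption of discrete actions are illustrative and not prescribed by the formulas above.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative policy and state-value (baseline) networks.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_fn = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

def update_with_value_baseline(states, actions, returns):
    """One update from a batch of (s_t, a_t, G_t) collected under the current policy."""
    values = value_fn(states).squeeze(-1)

    # Policy update: weight each log-probability by (G_t - V(s_t)).
    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    policy_loss = -(log_probs * (returns - values.detach())).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Baseline update: regress V(s_t) toward the observed returns G_t.
    value_loss = (returns - values).pow(2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```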

Conclusion

Policy optimization offers ways to directly improve an agent's behavior in complex environments. We've looked at several key approaches, including policy gradients and actor-critic methods, as well as techniques for dealing with the high variance often seen in gradient estimates. These methods are important for achieving stable and efficient learning in reinforcement learning systems.
