Policy optimization is a class of reinforcement learning algorithms that directly optimize the policy function by adjusting its parameters to maximize the expected cumulative reward. Because these algorithms search the policy space directly, rather than deriving a policy from learned value estimates, they are well-suited for continuous action spaces and high-dimensional environments.
In this topic, we will look at the basic formulation of the policy gradient and consider some common policy optimization algorithms.
The policy gradient
The policy gradient provides a way to directly optimize the policy parameters to maximize the expected return.
The policy gradient theorem is the foundation for a class of algorithms known as policy gradient methods. It states that the gradient of the expected return $J(\theta)$ with respect to the policy parameters $\theta$ can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]$$

Where
$\mathbb{E}_{\pi_\theta}[\cdot]$ is the expectation under policy $\pi_\theta$;
$\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ at time step $t$ under policy $\pi$ parameterized by $\theta$;
$a_t$ is the action taken at time step $t$;
$s_t$ is the state at time step $t$;
$Q^{\pi_\theta}(s_t, a_t)$ is the action-value function under the current policy at time step $t$.
The policy gradient theorem provides a way to adapt the policy toward actions that yield higher returns, based on the agent's experiences, without needing to know the underlying dynamics of the environment. This can be thought of as "do more of what works well, and less of what doesn't". The policy parameters can then be updated in the direction of the gradient to improve the expected return:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.
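To make this concrete, here is a minimal REINFORCE-style sketch of the update in PyTorch. The environment interface (a Gymnasium-style `env` with a discrete action space), the network sizes, and the hyperparameters are illustrative assumptions rather than part of the theorem; the discounted return $G_t$ is used as a Monte Carlo estimate of $Q^{\pi_\theta}(s_t, a_t)$.

```python
import torch
import torch.nn as nn

# Policy network: maps a 4-dimensional observation to logits over 2 actions
# (sizes chosen for a CartPole-like task; purely illustrative).
policy = nn.Sequential(
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def run_episode(env):
    """Collect one episode, returning per-step log-probabilities and rewards."""
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    """One gradient-ascent step: theta <- theta + alpha * grad J(theta)."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # discounted return G_t for each step
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Minimizing the negative objective is equivalent to gradient ascent on J.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A training loop would simply alternate `run_episode` and `reinforce_update`; minimizing the negative weighted log-likelihood with a standard optimizer corresponds to taking a gradient ascent step on $J(\theta)$.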
One limitation of the basic policy gradient method is that it suffers from high variance in the gradient estimates, leading to unstable learning. To mitigate this issue, various techniques are employed, such as actor-critic methods and baseline subtraction.
Actor-critic methods
The actor-critic method combines the ideas of policy gradient methods (the actor) and value function approximation (the critic). The actor learns the policy that maps states to actions, while the critic estimates the value function, which is then used to improve the actor's policy updates.
The actor-critic architecture consists of two components: the actor and the critic. The actor is the policy $\pi_\theta(a \mid s)$, parameterized by $\theta$, which determines the probability distribution over actions $a$ given a state $s$. The critic estimates the value function $V^{\pi_\theta}(s)$ or the action-value function $Q^{\pi_\theta}(s, a)$ under the current policy $\pi_\theta$. Below, we will outline the actor-critic method in the classical form, although there are a few other variants available.
The policy gradient theorem with the actor-critic method can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]$$

Where $A^{\pi_\theta}(s_t, a_t)$ is the advantage function, defined as:

$$A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$$
The advantage function represents the relative advantage of taking action $a_t$ in state $s_t$, compared to the average performance of the current policy in that state. The remaining notation breaks down as follows:
$Q^{\pi_\theta}(s_t, a_t)$ is the expected return starting from state $s_t$, taking action $a_t$, and then following policy $\pi_\theta$;
$V^{\pi_\theta}(s_t)$ is the expected return starting from state $s_t$ and following policy $\pi_\theta$.
The actor and critic components are typically trained simultaneously: the critic learns an increasingly accurate estimate of the value function, and the actor updates the policy based on the critic's value estimates.
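As a sketch of how this simultaneous training might look in code, the snippet below performs one actor-critic update per transition, using the TD error $r_t + \gamma V(s_{t+1}) - V(s_t)$ as a sample estimate of the advantage; this is a common variant of the classical scheme. The network sizes, learning rates, and the transition arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Actor outputs action logits; critic outputs a scalar state value V(s).
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_step(state, action, reward, next_state, done):
    """One simultaneous update of the actor and the critic for a transition."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    value = critic(state).squeeze()
    with torch.no_grad():
        next_value = 0.0 if done else critic(next_state).squeeze()
        td_target = reward + gamma * next_value
    # The TD error serves as a sample estimate of the advantage A(s_t, a_t).
    advantage = (td_target - value).detach()

    # Critic update: move V(s_t) toward the TD target.
    critic_loss = (td_target - value).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient step weighted by the advantage estimate.
    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -dist.log_prob(torch.as_tensor(action)) * advantage
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```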
The baselines
Another common approach to addressing the learning instability caused by high variance in the gradient estimates is the use of baselines. In this section, we will briefly look at two common baselines (although many others exist):
baseline subtraction, and
state value function baseline.
The basic idea behind baseline subtraction is to subtract a baseline value from the cumulative reward, without changing the expected value of the gradient estimator. This can significantly reduce the variance of the gradient estimates, leading to more stable learning.
The policy gradient theorem with baseline subtraction can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\right]$$

Where
$G_t$ is the return from time step $t$;
$b(s_t)$ is the baseline function, which can be any function that does not depend on the action $a_t$ taken in state $s_t$.
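As a quick illustration of why this works, the short numerical check below uses a two-armed bandit with a softmax policy (an illustrative toy setup, not part of the theorem): subtracting an action-independent baseline leaves the average gradient estimate essentially unchanged while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])              # logits of a softmax policy
rewards = np.array([1.0, 3.0])             # deterministic reward for each arm
probs = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(action):
    """Gradient of log pi(action) with respect to the softmax logits."""
    return np.eye(2)[action] - probs

baseline = rewards.mean()                  # any action-independent value works
plain, subtracted = [], []
for _ in range(100_000):
    a = rng.choice(2, p=probs)
    plain.append(grad_log_pi(a) * rewards[a])
    subtracted.append(grad_log_pi(a) * (rewards[a] - baseline))

for name, samples in [("no baseline", plain), ("with baseline", subtracted)]:
    samples = np.array(samples)
    print(f"{name}: mean {samples.mean(axis=0).round(3)}, "
          f"variance {samples.var(axis=0).round(3)}")
```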
Another choice for the baseline function is the state value function $V^{\pi_\theta}(s_t)$, which represents the expected return from state $s_t$ under the current policy $\pi_\theta$. The state value function can be given as

$$V^{\pi_\theta}(s_t) = \mathbb{E}_{\pi_\theta}\left[G_t \mid s_t\right] = \mathbb{E}_{\pi_\theta}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t\right]$$

where
$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted return from time step $t$;
$\gamma$ is the discount factor;
$r_t$ is the reward at time step $t$.
Then, we can introduce the baseline, known as the state value function baseline, and the policy gradient theorem becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - V^{\pi_\theta}(s_t)\big)\right]$$
Using the value function as a baseline can significantly reduce the variance of the gradient estimates, especially in problems with high rewards or long episodes.
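Building on the earlier REINFORCE sketch, the snippet below shows the $G_t - V^{\pi_\theta}(s_t)$ weighting in code. The `value_net`, its sizes, and the function arguments are illustrative assumptions; it presumes the per-step observations are collected alongside the log-probabilities and returns, and that `policy_optimizer` is the policy's optimizer from the earlier sketch.

```python
import numpy as np
import torch
import torch.nn as nn

# Learned state-value baseline V(s); input/hidden sizes are illustrative.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update_with_baseline(policy_optimizer, log_probs, states, returns):
    """Policy gradient step weighted by (G_t - V(s_t)), plus a baseline fit."""
    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    values = value_net(states).squeeze(-1)

    # Baseline-subtracted weights, detached so the policy gradient does not
    # flow through the value network.
    advantages = (returns - values).detach()
    policy_loss = -(torch.stack(log_probs) * advantages).sum()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    # Fit the baseline by regressing V(s_t) toward the observed returns G_t.
    value_loss = (returns - values).pow(2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```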
Conclusion
Policy optimization offers ways to directly improve an agent's behavior in complex environments. We've looked at several key approaches, including policy gradients and actor-critic methods, as well as techniques for dealing with the high variance often seen in gradient estimates. These methods are important for achieving stable and efficient learning in reinforcement learning systems.