
Introduction to reinforcement learning


Reinforcement learning (RL) is a sub-branch of machine learning where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties for its actions. The goal is to learn a policy that maximizes the cumulative reward over time.

In this topic, we will look at some of the fundamentals of RL.

The RL setting

The agent and the environment are central in RL:

[Figure: the agent-environment interaction loop]

The environment represents the space where the agent resides and interacts. A state (denoted by s) is a complete description of the environment at a given point in time. At each interaction step, the agent perceives a (potentially partial) observation (denoted by o) of the world's state and then decides on an action (denoted by a) to take. The environment changes in response to the agent's actions, but it may also evolve on its own. After each step, the agent receives a reward signal from the environment: a numerical value indicating how desirable the current world state is. The reward function, denoted R(s, a, s'), maps a state s, an action a, and the resulting next state s' to a numerical reward value. The agent's objective is to maximize its cumulative reward, termed the return. RL methods provide the agent with strategies for learning behaviors that accomplish this goal.
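As a minimal sketch of this interaction loop, consider the following Python snippet. The `Environment` class and the random action choice are made up purely for illustration and do not correspond to any particular RL library:

```python
import random


class Environment:
    """A toy environment: the state is a step counter, the episode ends after 5 steps."""

    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # reward signal R(s, a, s')
        done = self.t >= 5                    # has a terminal state been reached?
        return self.t, reward, done           # next observation, reward, done flag


env = Environment()
obs = env.reset()
total_return = 0.0
done = False
while not done:
    action = random.choice([0, 1])        # the agent picks an action
    obs, reward, done = env.step(action)  # the environment transitions and emits a reward
    total_return += reward                # accumulate the (undiscounted) return
print("return:", total_return)
```

Here the "agent" is just a random action picker; the RL methods discussed below are different ways of replacing it with a learned policy.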

Main components

In this section, we will briefly define the main components of RL: policies, value functions, and trajectories.

A policy is a function that maps states to the actions the agent should take. The goal of RL algorithms is to learn an optimal policy that maximizes the expected cumulative reward (return) over time.

Policies are classified into two main categories: deterministic and stochastic. A deterministic policy is a function \pi(s) that maps a given state s to a single action a. A stochastic policy \pi(a|s) represents a probability distribution over actions a given the state s.
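To make the distinction concrete, here is a small illustrative sketch; the states, actions, and probabilities are invented for the example:

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    # pi(s): each state maps to exactly one action
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    # pi(a|s): a probability distribution over actions, sampled at decision time
    p_right = 0.8 if state >= 0 else 0.2
    return random.choices(ACTIONS, weights=[1 - p_right, p_right])[0]

print(deterministic_policy(3))  # always "right" for state 3
print(stochastic_policy(3))     # "right" with probability 0.8, "left" otherwise
```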

The value function is used to evaluate the quality of a policy. The state-value function V(s) represents the expected return when starting from state s and following policy \pi thereafter. It is defined as:

V(s) = E[G_t \mid S_t = s],

where G_t is the cumulative discounted reward from time step t onward, and the expectation is taken over the policy \pi and the dynamics of the environment. The return G_t can be defined as:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

where \gamma \in [0, 1] is the discount factor that determines the importance of future rewards relative to immediate rewards.
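For instance, with a made-up reward sequence r_1 = 1, r_2 = 0, r_3 = 2 and \gamma = 0.9, the return from t = 0 is G_0 = 1 + 0.9 \cdot 0 + 0.81 \cdot 2 = 2.62. The same computation as a short sketch:

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = sum_k gamma^k * r_{t+k+1}
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```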

Similarly, the action-value function Q(s, a) represents the expected return when starting from state s, taking action a, and following policy \pi thereafter:

Q(s, a) = E[G_t \mid S_t = s, A_t = a]

The optimal state-value function V^*(s) and the optimal action-value function Q^*(s, a) are defined as the maximum value functions over all possible policies:

V^*(s) = \max_\pi V^\pi(s); \qquad Q^*(s, a) = \max_\pi Q^\pi(s, a)

The optimal policy \pi^*(s) can be derived from the optimal action-value function Q^*(s, a) by selecting the action that maximizes the expected return:

\pi^*(s) = \arg\max_a Q^*(s, a)

RL algorithms aim to learn the optimal policy \pi^* or its corresponding optimal value functions V^* or Q^*.
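In the tabular case, extracting a greedy policy from an action-value function reduces to an argmax per state. A minimal sketch, using a made-up Q-table:

```python
# Hypothetical Q-table: Q[state][action] -> estimated return
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_policy(state):
    # pi*(s) = argmax_a Q*(s, a)
    return max(Q[state], key=Q[state].get)

print(greedy_policy("s0"))  # "right"
print(greedy_policy("s1"))  # "left"
```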

The interaction between the agent and the environment is often represented as a trajectory. A trajectory is a sequence of states, actions, and rewards experienced by the agent as it navigates through the environment. A trajectory \tau can be formally defined as:

\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_T, a_T, r_{T+1})

where:

  • s_t is the state of the environment at time step t;

  • a_t is the action taken by the agent at time step t;

  • r_{t+1} is the reward received by the agent after transitioning from state s_t to s_{t+1} by taking action a_t.

The length of a trajectory, denoted by T, can be finite or infinite, depending on the problem setting. In episodic tasks, the trajectory ends in a well-defined terminal state, and the agent's goal is to maximize the cumulative reward within each episode. In continuing tasks, the trajectory can be infinite, and the agent aims to maximize the discounted sum of future rewards.
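In code, a trajectory is often just a list of (state, action, reward) transitions collected while rolling out an episode. The sketch below uses a made-up chain environment for illustration:

```python
import random

def collect_trajectory(max_steps=5):
    """Roll out one episode in a toy chain environment and record the trajectory.

    States are integers 0..max_steps; action 1 yields reward 1, action 0 yields 0.
    The environment is invented purely for this example.
    """
    trajectory = []          # list of (s_t, a_t, r_{t+1}) transitions
    state = 0                # s_0
    for _ in range(max_steps):
        action = random.choice([0, 1])
        reward = float(action)   # r_{t+1}
        next_state = state + 1   # s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

print(collect_trajectory())  # e.g. [(0, 1, 1.0), (1, 0, 0.0), ...]
```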

It is also important to distinguish between on-policy and off-policy methods. In on-policy methods, the agent learns value functions or policies by following the same policy that is being evaluated or improved: it learns from the trajectories generated by the policy it currently uses to make decisions. The main advantage of on-policy methods is that they can learn and improve the policy directly from the experience generated by that policy. However, they may converge slowly, especially in environments with large state spaces or when the policy changes drastically during learning. Off-policy methods allow the agent to learn value functions or policies from trajectories generated by a different policy than the one being evaluated or improved, such as a behavior policy or an exploratory policy.
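Classic tabular examples of this distinction are SARSA (on-policy) and Q-learning (off-policy). The sketch below shows only their update rules; the Q-table, step size alpha, and discount gamma are placeholder values chosen for illustration:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)  # Q[(state, action)] -> value estimate

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the target uses the action a_next actually chosen by the current policy.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, actions):
    # Off-policy: the target uses the greedy action, regardless of what the behavior policy did.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy usage with made-up transitions
sarsa_update("s0", "right", 1.0, "s1", "left")
q_learning_update("s0", "right", 1.0, "s1", ["left", "right"])
print(dict(Q))
```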

The Bellman equation

The Bellman equation relates the value functions of successive states and actions. It provides a recursive relationship that allows the agent to estimate the expected return (value) of a state or a state-action pair from the immediate reward and the discounted value of the next state or state-action pair.

The Bellman equation for the state-value function V(s) is given by:

V(s) = \sum_a \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V(s') \right]

where:

  • s and s' are the current and next states, respectively;

  • a is the action taken in state s;

  • \pi(a|s) is the policy, which specifies the probability of taking action a in state s;

  • R(s, a) is the expected immediate reward for taking action a in state s;

  • P(s'|s, a) is the transition probability of reaching state s' from state s after taking action a;

  • \gamma \in [0, 1] is the discount factor.

Similarly, the Bellman equation for the action-value function Q(s, a) is given by:

Q(s, a) = \sum_{s'} P(s'|s, a) \left[ R(s, a) + \gamma \sum_{a'} \pi(a'|s') Q(s', a') \right]

The Bellman equations express a relationship between the value of a state or a state-action pair and the values of its successor states or state-action pairs. They capture the intuition that the value of a state or a state-action pair should be equal to the immediate reward plus the discounted value of the next state or state-action pair, averaged over all possible transitions.
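The Bellman equation for V can be turned directly into an iterative policy-evaluation procedure: repeatedly apply the right-hand side until the values stop changing. Below is a sketch on a tiny two-state MDP with a uniform random policy; all states, transition probabilities, and rewards are invented for the example:

```python
# Tiny made-up MDP: states s0, s1; actions a0, a1.
# P[(s, a)] is a list of (next_state, probability); R[(s, a)] is the expected reward.
states = ["s0", "s1"]
actions = ["a0", "a1"]
P = {
    ("s0", "a0"): [("s0", 0.5), ("s1", 0.5)],
    ("s0", "a1"): [("s1", 1.0)],
    ("s1", "a0"): [("s0", 1.0)],
    ("s1", "a1"): [("s1", 1.0)],
}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0, ("s1", "a0"): 0.0, ("s1", "a1"): 2.0}
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform random policy
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(100):  # repeatedly apply the Bellman equation until V stabilizes
    V = {
        s: sum(
            pi[s][a] * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
            for a in actions
        )
        for s in states
    }
print(V)  # approximate V(s) for the uniform random policy
```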

Limitations of RL

There are a few fundamental challenges in RL. We will consider three of them: the credit assignment problem, the curse of dimensionality, and the exploration-exploitation trade-off.

The credit assignment problem arises when the agent needs to attribute credit or blame for the cumulative reward to individual actions or states in a sequence. In many RL problems, the reward signal is delayed, and the agent's actions can have long-term consequences that are not immediately apparent. This makes it difficult to determine which actions or states contributed to the observed reward or punishment, and to what extent. The credit assignment problem becomes more challenging as the length of the decision sequences and the complexity of the environment increase. Techniques such as temporal difference learning and eligibility traces have been developed to address this issue by propagating credit or blame over time.

Another significant challenge in RL is the curse of dimensionality, which refers to the exponential growth of the state and action spaces as the number of variables or dimensions increases. In complex environments with high-dimensional state and action spaces, it becomes computationally intractable to accurately represent and learn the value functions or policies. This is because the number of possible states and actions grows exponentially with the number of dimensions, making it infeasible to explore and learn the optimal behavior for all possible situations. The curse of dimensionality can severely hinder the scalability and efficiency of RL algorithms. Techniques such as function approximation, hierarchical reinforcement learning, and deep reinforcement learning have been developed to mitigate this issue by leveraging generalization and abstracting the state and action spaces.

The exploration-exploitation trade-off arises from the need to balance between exploring new actions and states to gather more information about the environment, and exploiting the agent's current knowledge to maximize its rewards. Excessive exploration can lead to inefficient behavior and lower cumulative rewards, while relying solely on exploitation can prevent the agent from discovering new and potentially better policies. Effective exploration strategies, such as epsilon-greedy policies, upper confidence bound algorithms, and intrinsic motivation techniques, are crucial for striking the right balance between exploration and exploitation, enabling the agent to learn efficiently while also maximizing its cumulative rewards over time.
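As one concrete example of such a strategy, an epsilon-greedy policy can be sketched as follows (the Q-values below are made up):

```python
import random

def epsilon_greedy(Q_row, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the best one."""
    if random.random() < epsilon:
        return random.choice(list(Q_row))  # explore
    return max(Q_row, key=Q_row.get)       # exploit

Q_row = {"left": 0.2, "right": 0.6}  # hypothetical Q(s, .) for one state
print(epsilon_greedy(Q_row))         # usually "right", occasionally "left"
```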

Conclusion

You are now familiar with some of the fundamental concepts of reinforcement learning: the agent-environment interaction, policies, value functions, trajectories, the Bellman equations, and the main challenges RL algorithms face.
