In this topic, you'll learn about the Q-learning algorithm, its components, action selection strategies such as epsilon-greedy, Q-value updates, and the difference between off-policy and on-policy learning.
By the end, you'll understand how Q-learning works and its real-world uses.
Q-learning algorithm
Q-learning is a value-based RL algorithm that helps an agent learn the best action to take in any given state. The main idea is to maintain a table, called a Q-table, which stores the expected long-term reward (Q-value) for taking each action in each state.
The Q-learning process begins by initializing the Q-table with arbitrary values, most commonly all zeros. As the agent interacts with the environment, it updates these Q-values based on the rewards it receives. The goal is to find the best policy, which tells the agent what action to take in each state to maximize long-term reward.
Here's a simple example of a Q-table for a robot navigating a 3x3 maze:
State  Up    Down  Left  Right
S1     0.0   0.0   0.0   0.0
S2     0.0   0.0   0.0   0.0
S3     0.0   0.0   0.0   0.0
S4     0.0   0.0   0.0   0.0
S5     0.0   0.0   0.0   0.0
S6     0.0   0.0   0.0   0.0
S7     0.0   0.0   0.0   0.0
S8     0.0   0.0   0.0   0.0
S9     0.0   0.0   0.0   0.0
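In code, such a table is often represented as a nested dictionary mapping each state to its action values. The snippet below is a minimal sketch of that idea; the state and action names simply mirror the maze example above:
import random

# Q-table for the 3x3 maze: every state starts with a Q-value of 0.0 for each action.
ACTIONS = ["Up", "Down", "Left", "Right"]
STATES = [f"S{i}" for i in range(1, 10)]

q_table = {state: {action: 0.0 for action in ACTIONS} for state in STATES}

print(q_table["S1"])  # {'Up': 0.0, 'Down': 0.0, 'Left': 0.0, 'Right': 0.0}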
The Q-learning algorithm follows these steps:
1. Set up the Q-table.
2. Choose an action based on the current policy.
3. Perform the action and observe the reward and new state.
4. Update the Q-value for the state-action pair using the Q-learning update rule.
5. Repeat steps 2-4 until the goal is reached or a maximum number of iterations is met.
This process allows the agent to improve its decision-making over time, eventually finding the best policy.
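To make these steps concrete, here is a minimal sketch of the whole loop for the 3x3 maze above. The maze layout (start at S1, goal at S9), the reward of 1 for reaching the goal, and the hyperparameter values are illustrative assumptions; the action-selection rule used in step 2 (epsilon-greedy) is explained in the next section:
import random

ACTIONS = ["Up", "Down", "Left", "Right"]

def step(state, action):
    # Deterministic 3x3 grid: states are numbered 1-9 (S1-S9), laid out row by row.
    row, col = divmod(state - 1, 3)
    if action == "Up" and row > 0:
        row -= 1
    elif action == "Down" and row < 2:
        row += 1
    elif action == "Left" and col > 0:
        col -= 1
    elif action == "Right" and col < 2:
        col += 1
    next_state = row * 3 + col + 1
    reward = 1.0 if next_state == 9 else 0.0  # reward only for reaching the goal (S9)
    return next_state, reward

# Step 1: set up the Q-table with zeros.
q_table = {s: {a: 0.0 for a in ACTIONS} for s in range(1, 10)}

alpha, gamma, epsilon = 0.1, 0.9, 1.0

for episode in range(500):
    state = 1
    for _ in range(100):  # safety cap on steps per episode
        # Step 2: choose an action based on the current policy (epsilon-greedy).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(q_table[state], key=q_table[state].get)
        # Step 3: perform the action and observe the reward and new state.
        next_state, reward = step(state, action)
        # Step 4: update the Q-value using the Q-learning update rule.
        best_next = max(q_table[next_state].values())
        q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])
        state = next_state
        # Step 5: repeat until the goal is reached.
        if state == 9:
            break
    epsilon = max(0.1, epsilon * 0.99)  # explore less as training progresses

# After training, the greedy policy should lead from S1 toward S9.
print({s: max(q_table[s], key=q_table[s].get) for s in range(1, 10)})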
Action selection and exploration
A key challenge in Q-learning is balancing exploration (trying new actions to gather information) and exploitation (using known information to make the best decision). This is where the epsilon-greedy strategy comes in.
The epsilon-greedy strategy uses a parameter ε (epsilon) that determines the probability of exploring versus exploiting:
With probability ε, the agent chooses a random action (exploration).
With probability 1-ε, the agent chooses the action with the highest Q-value (exploitation).
Typically, ε starts high (e.g., 1.0) and decreases over time. For example, it might start at 1.0 and decrease to 0.1 over 1000 iterations. This allows the agent to explore more at the beginning and gradually focus on using what it has learned.
Here's a simple implementation of the epsilon-greedy strategy:
import random

def epsilon_greedy(q_table, state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.choice(list(q_table[state].keys()))  # Explore
    else:
        return max(q_table[state], key=q_table[state].get)  # Exploit
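This function can be combined with a decaying epsilon. The sketch below reuses epsilon_greedy and the random import from the snippet above and assumes the linear schedule mentioned earlier (1.0 down to 0.1 over 1000 iterations); the tiny Q-table and the schedule values are purely illustrative:
# Linear decay of epsilon from 1.0 to 0.1 over 1000 iterations (illustrative values).
q_table = {"S1": {"Up": 0.0, "Down": 0.2, "Left": 0.0, "Right": 0.5}}
epsilon_start, epsilon_end, total_steps = 1.0, 0.1, 1000

for t in range(total_steps):
    epsilon = epsilon_start - (epsilon_start - epsilon_end) * t / (total_steps - 1)
    action = epsilon_greedy(q_table, "S1", epsilon)
    # Early on (epsilon near 1.0) the choice is mostly random;
    # later (epsilon near 0.1) it is mostly "Right", the highest-valued action.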
Q-value updates
The core of Q-learning is how it updates the Q-values. After each action, the agent updates the Q-value for the state-action pair it just experienced using the Q-learning update rule:
Q(s,a) = Q(s,a) + α * (r + γ * max(Q(s',a')) - Q(s,a))
Where:
Q(s,a) is the current Q-value for state s and action a.
α (alpha) is the learning rate (0 < α ≤ 1).
r is the reward received for taking action a in state s.
γ (gamma) is the discount factor (0 ≤ γ ≤ 1).
max(Q(s',a')) is the maximum Q-value over all actions a' in the next state s'.
Let's look at a simple example:
Current Q(s,a) = 0.5
α = 0.1
r = 1
γ = 0.9
max(Q(s',a')) = 0.7
Applying the formula:
Q(s,a) = 0.5 + 0.1 * (1 + 0.9 * 0.7 - 0.5) = 0.5 + 0.1 * 1.13 = 0.613
We use the maximum Q-value for the next state because Q-learning assumes the agent will always choose the best action in the future, even if it might explore in reality.
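The same calculation can be written as a small helper function. This is just a sketch; the function name and default parameter values are illustrative:
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    # Q-learning update rule: Q(s,a) + α * (r + γ * max(Q(s',a')) - Q(s,a))
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Reproduces the worked example above.
print(round(q_update(0.5, reward=1, max_q_next=0.7), 3))  # 0.613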
Off-policy vs. on-policy learning
Q-learning is an example of off-policy learning, which means it learns about the best policy while following a different, exploratory policy. This contrasts with on-policy learning, where the agent learns about the policy it's currently following.
The main difference is in how they update their value estimates:
Off-policy methods (like Q-learning) learn about a policy that differs from the one generating the experience: the update bootstraps from the best next action even if the agent doesn't take it, which lets them reuse exploratory experience efficiently.
On-policy methods (like SARSA) learn about the policy they are actually following, including its exploratory actions, which can make them more stable in some environments.
This difference affects the update rule. In Q-learning, we use max(Q(s',a')) assuming the best action will be taken next. In SARSA (an on-policy method), we use the Q-value of the actual next action taken.
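The contrast is easiest to see in the bootstrap target each method uses. The sketch below is illustrative; the function names and the example Q-values are not from any specific library:
def q_learning_target(q_table, next_state, reward, gamma=0.9):
    # Off-policy: bootstrap from the best action in the next state,
    # regardless of which action the agent will actually take.
    return reward + gamma * max(q_table[next_state].values())

def sarsa_target(q_table, next_state, next_action, reward, gamma=0.9):
    # On-policy: bootstrap from the action the agent actually takes next,
    # which may be an exploratory one.
    return reward + gamma * q_table[next_state][next_action]

q_table = {"S2": {"Up": 0.7, "Down": 0.2, "Left": 0.0, "Right": 0.4}}
print(q_learning_target(q_table, "S2", reward=1.0))                 # uses max over actions: 1 + 0.9 * 0.7
print(sarsa_target(q_table, "S2", next_action="Down", reward=1.0))  # uses the action taken: 1 + 0.9 * 0.2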
Conclusion
Q-learning is a powerful reinforcement learning technique that helps agents learn the best behaviors through trial and error. Its ability to balance exploration and exploitation, along with its off-policy nature, makes it useful for solving many decision-making problems.
Key points to remember:
Q-learning uses a Q-table to store and update action values.
The epsilon-greedy strategy balances exploration and exploitation.
Q-values are updated using the Q-learning update rule.
Q-learning is an off-policy method, learning about the best policy while following an exploratory one.
Now that you understand Q-learning, you're ready to try some hands-on exercises and see it in action!