Introduction
In the previous part, we introduced the concept of function approximation in reinforcement learning (RL), which is essential for handling large or continuous state and action spaces. We discussed different types of function approximators, including linear and non-linear methods, and how they can be integrated into RL algorithms. Finally, we provided an introduction to Deep Reinforcement Learning (Deep RL), which combines deep learning techniques with RL to tackle complex problems.
In this part, we will continue our discussion of Deep RL and explore how to use function approximation for control tasks in RL. We will first cover semi-gradient SARSA, and then continue with Deep Q-Networks (DQN) and its variants.
Control with Function Approximation
Recall: Setup for Control with Function Approximation
We consider the control problem with a parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{\star}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a finite-dimensional weight vector.
Further, instead of using training examples of the form $S_t \mapsto U_t$, we now use examples of the form $(S_t, A_t) \mapsto U_t$. The update target $U_t$ can be any approximation of $q_{\pi}(S_t, A_t)$, including the usual update values such as the full Monte Carlo (MC) return $G_t$ or any of the $n$-step SARSA returns $G_{t:t+n}$.
Episodic Semi-Gradient SARSA
Definition: Episodic Semi-Gradient SARSA
The general gradient-descent update for action-value prediction is, $$ \mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), $$ Thus, the update for the one-step SARSA method is, $$ \mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [R_{t + 1} + \gamma \hat{q}(S_{t + 1}, A_{t + 1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t). $$ To form (general) control methods, we need to couple action-value prediction with policy improvement and action selection.
For each possible action $a$ available in the next state $S_{t + 1}$, we need to compute $\hat{q}(S_{t + 1}, a, \mathbf{w}_t)$ and then find the greedy action, $$ A^{\star}_{t + 1} \coloneqq \arg\max_a \hat{q}(S_{t + 1}, a, \mathbf{w}_t). $$ Policy improvement can then be done by changing the estimation policy to a soft approximation of the greedy policy, e.g., an $\epsilon$-greedy policy.
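To make the action-selection step concrete, here is a minimal sketch of $\epsilon$-greedy selection over an approximate action-value function. The callable `q_hat(s, a, w)` and the finite action list are illustrative assumptions, not part of any specific implementation.

```python
import numpy as np

def epsilon_greedy_action(q_hat, s, actions, w, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]      # explore
    values = [q_hat(s, a, w) for a in actions]          # evaluate q-hat for every action
    return actions[int(np.argmax(values))]              # exploit: A* = argmax_a q-hat
```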
Algorithm: Semi-Gradient SARSA for estimating $\hat{q} \approx q_{\star}$
Input: a differentiable action-value function parameterization $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \rightarrow \mathbb{R}$
Algorithm parameters: step-size $\alpha > 0$, small $\epsilon > 0$
Initialize value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
Loop for each episode:
- $S, A \leftarrow$ initial state and action of episode (e.g., $\epsilon$-greedy)
- Loop for each step of episode:
  - Take action $A$, observe $R, S^{\prime}$
  - If $S^{\prime}$ is terminal:
    - $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
    - Go to next episode
  - Choose $A^{\prime}$ as a function of $\hat{q}(S^{\prime}, \cdot, \mathbf{w})$ (e.g., $\epsilon$-greedy)
  - $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma \hat{q}(S^{\prime}, A^{\prime}, \mathbf{w}) - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
  - $S \leftarrow S^{\prime}$
  - $A \leftarrow A^{\prime}$
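The following is a minimal sketch of the algorithm above with a linear parameterization $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^{\top} \mathbf{x}(s, a)$. The feature function `x(s, a)`, the feature dimension `d`, and the Gym-style `env.reset()` / `env.step()` interface are assumptions made for illustration.

```python
import numpy as np

def semi_gradient_sarsa(env, x, n_actions, d, alpha=0.01, gamma=0.99,
                        epsilon=0.1, episodes=500, seed=0):
    """Episodic semi-gradient SARSA with a linear q-hat(s, a, w) = w . x(s, a)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)

    def q_hat(s, a):                         # linear action-value approximation
        return w @ x(s, a)

    def policy(s):                           # epsilon-greedy w.r.t. the current q-hat
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s, _ = env.reset()
        a = policy(s)
        while True:
            s_next, r, terminated, truncated, _ = env.step(a)
            if terminated:                   # terminal transition: target is just R
                w += alpha * (r - q_hat(s, a)) * x(s, a)
                break
            a_next = policy(s_next)
            td_error = r + gamma * q_hat(s_next, a_next) - q_hat(s, a)
            w += alpha * td_error * x(s, a)  # gradient of the linear q-hat is x(s, a)
            if truncated:                    # episode cut off: stop, but still bootstrap
                break
            s, a = s_next, a_next
    return w
```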
Definition: Semi-Gradient $n$-step SARSA
We can generalize and obtain an $n$-step version of the episodic semi-gradient SARSA algorithm by using the $n$-step SARSA target $G_{t:t+n}$ instead of the one-step SARSA target $R_{t + 1} + \gamma \hat{q}(S_{t + 1}, A_{t + 1}, \mathbf{w}_t)$.
The $n$-step update equation is, $$ \mathbf{w}_{t + n} \coloneqq \mathbf{w}_t + \alpha [G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t + n - 1})] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t + n - 1}), $$ where, $$ G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n \hat{q}(S_{t + n}, A_{t + n}, \mathbf{w}_{t + n - 1}) $$ for $t + n < T$; as usual, $G_{t:t+n} \coloneqq G_t$ if $t + n \geq T$.
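As a small illustration, the target $G_{t:t+n}$ can be assembled from the $n$ stored rewards plus one bootstrapped value. The helper below is a sketch: the argument `q_boot` stands for $\hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$ and should be passed as $0$ if the episode terminates within the $n$ steps.

```python
def n_step_return(rewards, q_boot, gamma):
    """G_{t:t+n} = sum_{k=0}^{n-1} gamma^k R_{t+k+1} + gamma^n * q_boot."""
    G = 0.0
    for k, r in enumerate(rewards):          # rewards = [R_{t+1}, ..., R_{t+n}]
        G += (gamma ** k) * r
    return G + (gamma ** len(rewards)) * q_boot
```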
Deep RL & Deep Q-Networks (DQN)
As we discussed last time, many of the ideas we have seen so far can easily be transferred to the setting where deep neural networks are used as function approximators.
Definition: Deep Q-Networks (DQN)
Deep Q-Networks (DQN) is a value-based method that combines Q-learning with deep neural networks; it was developed by DeepMind (Mnih et al., 2015).
They trained the DQN to play 49 different Atari 2600 games by interacting with the game environment, using only the raw pixel inputs as observations and the game score as the reward. The innovations of DQN are threefold.
Intuition: Experience Replay
Experience replay is a technique where the agent’s experiences (state, action, reward, next state) are stored in a replay buffer (memory). After the game emulator executes action $A_t$ in state $S_t$ and yields reward $R_{t + 1}$ and next state $S_{t + 1}$, the experience tuple $(S_t, A_t, R_{t + 1}, S_{t + 1})$ is stored in the replay buffer.
At each time step in the training phase, a mini-batch of experiences is randomly sampled from the replay buffer and used to update the DQN. Instead of $S_{t + 1}$ becoming the next $S_t$ for the next update, as it would in the usual form of Q-learning, a new unconnected experience is drawn from the replay buffer.
Note
Since Q-learning is an off-policy method, it can learn from data generated by a different policy than the one currently being learned. Thus, experience replay can be used to break the correlation between consecutive experiences and improve data efficiency.
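A replay buffer can be as simple as a bounded queue with uniform sampling. The sketch below is one possible minimal implementation; the capacity and the tuple layout are illustrative choices, not the exact ones from the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are evicted first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform, without replacement
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```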
Intuition: Duplicate Network
We have previously discussed that Q-learning can suffer from an inherent overestimation bias. To mitigate this issue, DQN uses a duplicate network (also called a target network) to generate the target values for the Q-learning updates. The duplicate network has the same architecture as the main DQN, but its weights are frozen and only periodically updated to match the main network. This helps stabilize the learning process by providing a more consistent target for the Q-learning updates.
Letting $\tilde{q}$ denote the output of this duplicate network, the DQN update equation is, $$ \mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [R_{t + 1} + \gamma \max_a \tilde{q}(S_{t + 1}, a, \mathbf{w}^-_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), $$ where $\mathbf{w}^-_t$ are the weights of the duplicate network at time step $t$.
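In code, the duplicate network is simply a frozen copy of the main network that is refreshed every $C$ steps. The PyTorch sketch below illustrates this; the network sizes are assumed for illustration and are not the original DQN architecture.

```python
import copy
import torch

# Illustrative network: 4-dimensional states, 2 actions (both are assumptions).
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)

# The duplicate (target) network is a frozen copy of the main network.
target_net = copy.deepcopy(policy_net)
target_net.eval()

def td_targets(rewards, next_states, dones, gamma=0.99):
    """Compute R + gamma * max_a q~(S', a, w^-) using the frozen weights w^-."""
    with torch.no_grad():                                   # no gradient flows through w^-
        next_q = target_net(next_states).max(dim=1).values  # max_a q~(S', a, w^-)
    return rewards + gamma * next_q * (1.0 - dones)         # no bootstrap at terminal states

# Every C gradient steps, refresh the frozen copy:
# target_net.load_state_dict(policy_net.state_dict())
```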
Intuition: Reward Clipping
To handle the varying reward scales across different Atari games, DQN employs reward clipping. All positive rewards are clipped to +1, all negative rewards are clipped to -1, and zero rewards remain unchanged. This normalization helps to stabilize the learning process and allows the same hyperparameters to be used across different games.
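For scalar rewards, this clipping amounts to taking the sign of the raw game reward:

```python
import numpy as np

def clip_reward(r):
    """Map every positive reward to +1, every negative reward to -1, keep 0 unchanged."""
    return float(np.sign(r))
```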
Algorithm: Deep Q-Network (DQN)
Initialize network $\hat{q}$
Initialize target network $\tilde{q}$
Initialize replay buffer $\mathcal{D}$
Initialize the Agent to interact with the Environment
while not converged do
- $\epsilon \leftarrow$ new epsilon according to the $\epsilon$-decay schedule
- Choose an action $a$ from state $s$ using the policy $\epsilon$-greedy($\hat{q}$)
- Agent takes action $a$, observes reward $r$ and next state $s^{\prime}$
- Store transition $(s, a, r, s^{\prime}, \text{done})$ in the experience replay memory $\mathcal{D}$
- If enough experiences are in $\mathcal{D}$ then
  - Sample a random mini-batch of $N$ transitions from $\mathcal{D}$
  - for every transition $(s_i, a_i, r_i, s^{\prime}_i, \text{done}_i)$ in the mini-batch do
    - if $\text{done}_i$ then $y_i = r_i$
    - else $y_i = r_i + \gamma \max_a \tilde{q}(s^{\prime}_i, a, \mathbf{w}^-)$
  - end
  - Calculate the loss $\mathcal{L} = \frac{1}{N} \sum_{i=1}^N (\hat{q}(s_i, a_i, \mathbf{w}) - y_i)^2$
  - Update $\hat{q}$ using the SGD algorithm by minimizing the loss $\mathcal{L}$
  - Every $C$ steps, copy the weights from $\hat{q}$ to $\tilde{q}$
end
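Putting the pieces together, a single mini-batch update can be sketched as follows. The function assumes the mini-batch has already been sampled (e.g., with the `ReplayBuffer` sketch above) and converted to tensors, and that `policy_net` / `target_net` are set up as in the duplicate-network sketch; it is a simplified illustration rather than the exact training procedure of the original paper.

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN gradient step on a sampled mini-batch of transitions."""
    states, actions, rewards, next_states, dones = batch    # tensors; actions are int64

    # q-hat(s_i, a_i, w): pick out the value of the action actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_i = r_i                                      if done_i
    #     = r_i + gamma * max_a q~(s'_i, a, w^-)     otherwise
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.mse_loss(q_values, targets)                    # (1/N) sum_i (q-hat - y_i)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```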