Part 11 - Reinforcement Learning: Deep RL and Control with Function Approximation

Introduction

In the previous part, we introduced the concept of function approximation in reinforcement learning (RL), which is essential for handling large or continuous state and action spaces. We discussed different types of function approximators, including linear and non-linear methods, and how they can be integrated into RL algorithms. Finally, we provided an introduction to Deep Reinforcement Learning (Deep RL), which combines deep learning techniques with RL to tackle complex problems.

In this part, we will continue our discussion on Deep RL and explore how to use function approximation for control tasks in RL. We will cover key algorithms such as semi-gradient SARSA then continue with Deep Q-Networks (DQN) and its variants.

Control with Function Approximation

Recall (Setup for Control with Function Approximation)

We consider the control problem with a parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{\star}(s, a)$ , where $\mathbf{w} \in \mathbb{R}^d$ is a finite-dimensional weight vector.

Further, instead of using training examples of the form $S_t \mapsto U_t$ , we now use examples of the form $(S_t, A_t) \mapsto U_t$ . The update target $U_t$ can be any approximation of $q_{\pi}(S_t, A_t)$ , including the usual update values such as the full MC return $(G_t)$ or any of the $n$ -step Sarsa retuns $(G_{t:t+n})$ .

Episodic Semi-Gradient SARSA

Definition 1 (Episodic Semi-Gradient SARSA)

The general gradient-descent update for action-value prediction is,

\mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),

Thus, the update for the one-step SARSA method is,

\mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [R_{t + 1} + \gamma \hat{q}(S_{t + 1}, A_{t + 1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t).

To form (general) control methods, we need to couple action-value prediction with policy improvement and action selection.

For each possible action $a$ available in the next state $S_{t + 1}$ , we need to compute $\hat{q}(S_{t + 1}, a, \mathbf{w}_t)$ , and then find the greedy action,

A^{\star}_{t + 1} \coloneqq \arg\max_a \hat{q}(S_{t + 1}, a, \mathbf{w}_t).

Thus, policy improvement is can be done by changing the estimation policy to a soft approximation of the greedy policy, i.e., $\epsilon$ -greedy policy.

Algorithm (Semi-Gradient SARSA for estimating

\hat{q} \approx q_{\star}

)

Input: a differentiable action-value function parameterization $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \rightarrow \mathbb{R}$

Algorithm parameters: step-size $\alpha > 0$ , small $\epsilon > 0$

Loop for each episode:

$S, A \leftarrow$ initial state and action of episode (e.g., $\epsilon$ -greedy)
Loop for each step of episode:
- Take action $A$ , observe $R, S^{\prime}$
- If $S^{\prime}$ $S^{'}$ is terminal:
  - $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
  - Go to next episode
- Choose $A^{\prime}$ as a function of $\hat{q}(S^{\prime}, \cdot, \mathbf{w})$ (e.g., $\epsilon$ -greedy)
- $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma \hat{q}(S^{\prime}, A^{\prime}, \mathbf{w}) - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
- $S \leftarrow S^{\prime}$
- $A \leftarrow A^{\prime}$

Definition 2 (Semi-Gradient

n

-step SARSA)

We can generalize and obtain an $n$ -step version of the episodic semi-gradient SARSA algorithm by using the $n$ -step SARSA target $G_{t:t+n}$ instead of the one-step SARSA target $R_{t + 1} + \gamma \hat{q}(S_{t + 1}, A_{t + 1}, \mathbf{w}_t)$ .

The $n$ -step update equation is,

\mathbf{w}_{t + n} \coloneqq \mathbf{w}_t + \alpha [G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t + n - 1})] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t + n - 1}),

where,

G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n \hat{q}(S_{t + n}, A_{t + n}, \mathbf{w}_{t + n - 1}),

Deep RL & Deep Q-Networks (DQN)

As we discussed last time, many ideas that we have seen so far, can easily be transferred when using deep neural networks as function approximators.

Definition 3 (Deep Q-Networks (DQN))

Deep Q-Networks (DQN) is a value-based method that combines Q-learning with deep neural networks, developed by DeepMind (Mnih et al., 2015) [^1].

They let the DQN learn to play 49 different Atari 2600 games by interacting with the game environment, using only the raw pixel inputs and the game score as rewards. The innovations of DQN are threefold.

Intuition (Experience Replay)

Experience replay is a technique where the agent’s experiences (state, action, reward, next state) are stored in a replay buffer (memory). After the game emulator executes action $A_t$ in state $S_t$ and yields reward $R_{t + 1}$ and next state $S_{t + 1}$ , the experience tuple $(S_t, A_t, R_{t + 1}, S_{t + 1})$ is stored in the replay buffer.

At each time step in the training phase, a mini-batch of experiences is randomly sampled from the replay buffer and used to update the DQN. Instead of $S_{t + 1}$ becoming the next $S_t$ for the next update, as it would in the usual form of Q-learning, a new unconnected experience is drawn from the replay buffer.

Note

Since Q-learning is an off-policy method, it can learn from data generated by a different policy than the one currently being learned. Thus, experience replay can be used to break the correlation between consecutive experiences and improve data efficiency.

Intuition (Duplicate Network)

We have previously discussed that Q-learning can suffers from an inherent overestimation bias. To mitigate this issue, DQN uses a duplicate network (also called target network) to generate the target values for the Q-learning updates. The duplicate network has the same architecture as the main DQN but with frozen weights that are periodically updated to match the main network. This helps to stabilize the learning process by providing a more consistent target for the Q-learning updates.

Letting $\tilde{q}$ denote the output of this duplicate network, the DQN update equation is,

\mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [R_{t + 1} + \gamma \max_a \tilde{q}(S_{t + 1}, a, \mathbf{w}^-_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),

where $\mathbf{w}^-_t$ are the weights of the duplicate network at time step $t$ .

Intuition (Reward Clipping)

To handle the varying reward scales across different Atari games, DQN employs reward clipping. All positive rewards are clipped to +1, all negative rewards are clipped to -1, and zero rewards remain unchanged. This normalization helps to stabilize the learning process and allows the same hyperparameters to be used across different games.

Algorithm (Deep Q-Network (DQN))

Initialize network $\hat{q}$

Initialize target network $\tilde{q}$

Initialize replay buffer $\mathcal{D}$

Initialize the Agent to interact with the Environment

while not converged do

$\epsilon \leftarrow$ setting new epsilon with $\epsilon$ -decay
Choose an action $a$ from state $s$ using policy $\epsilon$ -greedy( $\hat{q})$
Agent takes action $a$ , observe reward $r$ , and next state $s^{\prime}$
Store transition $(s, a, r, s^{\prime}, \text{done})$ in the experience replay memory $\mathcal{D}$
If enough experiences in $\mathcal{D}$ then
- Sample random mini-batch of $N$ transitions from $\mathcal{D}$
- for every transition $(s_i, a_i, r_i, s^{\prime}_i, \text{done}_i)$ $(s_{i}, a_{i}, r_{i}, s_{i}^{'}, done_{i})$ in the mini-batch do
  - if $\text{done}_i$ $done_{i}$ then
    - $y_i = r_i$
  - else
    - $y_i = r_i + \gamma \max_a \tilde{q}(s^{\prime}_i, a, \mathbf{w}^-)$
  - end
- end
- Calculate the loss $\mathcal{L} = \frac{1}{N} \sum_{i=1}^N (\hat{q}(s_i, a_i, \mathbf{w}) - y_i)^2$
- Update $\hat{q}$ using the SGD algorithm by minimizing the loss $\mathcal{L}$
- Every $C$ steps, copy weights from $\hat{q}$ to $\tilde{q}$
end

end