Part 9 - Reinforcement Learning: Temporal-Difference Learning II

Introduction

In the previous part, we introduced Temporal-Difference (TD) learning methods, which combine ideas from both Monte Carlo (MC) methods and Dynamic Programming (DP). We covered both the policy evaluation problem, estimating the value function $v_{\pi}$ for a given policy $\pi$ , and the control problem, using a variation of the generalized policy iteration (GPI) algorithm. We introduced two classes of TD control methods, one for on-policy (SARSA) and one for off-policy (Q-learning).

In this part, we will cover some additional TD methods, including Expected SARSA, which is an on-policy method that uses the expected value of the next state-action pair rather than a sample, and Double Q-learning, which is an off-policy method that addresses the overestimation bias in Q-learning. Finally, we will discuss the general framwork known as $n$ -step bootstrapping, which unifies TD and MC methods.

Expected SARSA

Recall (Q-Learning Update Rule)

Recall the update rule for Q-learning,

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \max_{a} Q(S_{t + 1}, a) - Q(S_t, A_t)]

Instead of the $\max$ operator over the next state-action pair, we can use the expected value with respect to the probability of each action under our current policy $\pi$ .

Definition 1 (Expected SARSA Update Rule)

The update rule for Expected SARSA is given by,

\begin{align*} Q(S_t, A_t) & \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \mathbb{E}_{\pi}[Q(S_{t + 1}, A_{t + 1}) | S_{t + 1}] - Q(S_t, A_t)] \newline & = Q(S_t, A_t) + \alpha \left[R_{t + 1} + \gamma \sum_{a} \pi(a | S_{t + 1}) Q(S_{t + 1}, a) - Q(S_t, A_t)\right] \end{align*}

Expected SARSA will be more complex computationally than regular SARSA, but it will eliminate the variance due to the random selection of $A_{t + 1}$ , why?

Solution

The expected value is computed over all possible actions at the next state, rather than relying on a single sampled action. This reduces the variance in the updates, leading to more stable learning.

Maximization Bias and Double Learning

So far, all the control algorithms we have discussed involve some sort of maximization in the construction of their target policies.

In Q-learning, the target policy is the greedy policy with respect to the current action-value function $Q$ .
In SARSA, the policy is (often) $\epsilon$ -greedy with respect to $Q$ , which also involves a maximization step.

These maximization steps can lead to significant positive biases in the action-value estimates.

Definition 2 (Double Learning)

Double learning is a technique where we divide the plays (experiences) into two sets and use them to learn two independent action-value function estimates, $Q_1$ and $Q_2$ .

We use one estimate, say $Q_1$ , to determine the greedy action (maximization), and the other estimate, $Q_2$ , to compute the target value,

Q_2(s, \underset{a}{\arg\max} Q_1(s, a))

We can repeat this process with the roles of $Q_1$ and $Q_2$ swapped to yield a second unbiased estimate,

Q_1(s, \underset{a}{\arg\max} Q_2(s, a))

Although we learn two estimates, we only ever update one of them each play.

Note

Double learning will require double the memory, but does not increase the amount of computation per step.

Intuition (Double Q-Learning)

Let’s apply double learning to Q-learning, we will call this Double Q-learning.

Firstly, we divide the time steps into two sets with equal probability (e.g., by flipping a coin), if the result is heads, we update $Q_1$ , otherwise we update $Q_2$ . The behavior policy can use both action-value estimates, e.g., an $\epsilon$ -greedy polic for Double Q-learning could be based on the average (or just sum) of the two action-value estimates, $Q_1 + Q_2$ .

Further, one can also double learning for SARSA and Expected SARSA, by using the two action-value estimates to compute the expected value in the update rule.

Algorithm (Double Q-Learning for estimating

Q_1 \approx Q_2 \approx q_{\star}

)

Algorithm parameters: step-size $\alpha \in (0, 1]$ , small $\epsilon > 0$

Initialize $Q_1(s, a)$ and $Q_2(s, a)$ , for all $s \in \mathcal{S}^+, a \in \mathcal{A}(s)$ , arbitrarily except that $Q(\text{terminal}, \cdot) = 0$

Loop for each episode:

Initialize $S$
Loop for each step of episode:
- Choose $A$ from $S$ using the policy $\epsilon$ -greedy with respect to $Q_1 + Q_2$ .
- Take action $A$ , observe $R, S^{\prime}$
- With probability 0.5:
  - $Q_1(S, A) \leftarrow Q_1(S, A) + \alpha [R + \gamma Q_2(S^{\prime}, \underset{a}{\arg\max} Q_1(S^{\prime}, a)) - Q_1(S, A)]$
- Otherwise:
  - $Q_2(S, A) \leftarrow Q_2(S, A) + \alpha [R + \gamma Q_1(S^{\prime}, \underset{a}{\arg\max} Q_2(S^{\prime}, a)) - Q_2(S, A)]$
- $S \leftarrow S^{\prime}$
until $S$ is terminal

$n$ -Step Bootstrapping

In many practical situations, neither MC or one-step TD are always the best choice.

$n$ -step TD (or $n$ -step bootstrapping) methods generalize both MC and one-step TD methods.

As per usual, we will first cover the prediction problema nd then the control problem.

$n$ -Step TD Prediction

Intuition (

n

-Step Return)

Consider estimating $v_{\pi}$ from sample episodes generated by following policy $\pi$ .

MC will perform an update for each state based on the entire sequence of observed rewards from that state until the end of the episode, i.e., the return $G_t$ for that state $S_t$ .

The update of one-step TD is based on just the one next reward, bootstrapping from the value of the state one step later.

Thus, if we create an intermediate method, where we perform an update based on an intermediate number of rewards (more than one, but less than the entire episode), we can potentially get the best of both worlds.

A two-step update would be based on the first two rewards and the estimated value of the state two steps later, a three-step update would be based on the first three rewards and the estimated value of the state three steps later, and so on.

Definition 3 (

n

-Step Return)

The $n$ -step return $G_{t:t+n}$ is defined as,

G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n V_{t + n - 1}(S_{t + n})

where $V_{t + n - 1}$ is the estimate of the value function used at time $t + n - 1$ .

However, we need to wait for future rewards to become available!

The natural state-value update when using $n$ -step returns is,

V_{t + n}(S_t) \coloneqq V_{t + n - 1}(S_t) + \alpha [G_{t:t+n} - V_{t + n - 1}(S_t)]

while the values of all other states remain unchanged,

V_{t + n}(s) = V_{t + n - 1}(s), \quad s \neq S_t

Note that no changes are made during the first $n - 1$ time steps of the episode, since we do not yet have enough rewards to make an $n$ -step update.

Definition 4 (

n

-step SARSA Update Rule)

How can $n$ -step methods be used for control?

The main idea is to simply switch states for actions (i.e., use action-value functions rather than state-value functions) and then use an $\epsilon$ -greedy policy with respect to the current action-value function estimate.

We (re)define the $n$ -step return for action-value functions as,

G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n Q_{t + n - 1}(S_{t + n}, A_{t + n})

where $Q_{t + n - 1}$ is the estimate of the action-value function used at time $t + n - 1$ . The natural action-value update when using $n$ -step returns is,

Q_{t + n}(S_t, A_t) \coloneqq Q_{t + n - 1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t + n - 1}(S_t, A_t)]

while the values of all other state-action pairs remain unchanged,

Q_{t + n}(s, a) = Q_{t + n - 1}(s, a), \quad (s, a) \neq (S_t, A_t)

Definition 5 (Algorithm:

n

-step Expected SARSA Update Rule)

Similiar to $n$ -step SARSA, but with the exception that its last element is a branch over all action possibilities weighted by their probabilities under the current policy $\pi$ ,

G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n \bar{V}_{t + n - 1}(S_{t + n})

where $\bar{V}_t(s)$ is the expected approximate value of state $s$ , using the estimated action values at time $t$ and the policy $\pi$ ,

\bar{V}_t(s) \coloneqq \sum_a \pi(a | s) Q_t(s, a)