Part 9 - Reinforcement Learning: Temporal-Difference Learning II

Introduction

In the previous part, we introduced Temporal-Difference (TD) learning methods, which combine ideas from both Monte Carlo (MC) methods and Dynamic Programming (DP). We covered both the policy evaluation problem, estimating the value function vπv_{\pi} for a given policy π\pi, and the control problem, using a variation of the generalized policy iteration (GPI) algorithm. We introduced two classes of TD control methods, one for on-policy (SARSA) and one for off-policy (Q-learning).

In this part, we will cover some additional TD methods, including Expected SARSA, which is an on-policy method that uses the expected value of the next state-action pair rather than a sample, and Double Q-learning, which is an off-policy method that addresses the overestimation bias in Q-learning. Finally, we will discuss the general framwork known as nn-step bootstrapping, which unifies TD and MC methods.

Expected SARSA

Recall (Q-Learning Update Rule)

Recall the update rule for Q-learning,

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \max_{a} Q(S_{t + 1}, a) - Q(S_t, A_t)]

Instead of the max\max operator over the next state-action pair, we can use the expected value with respect to the probability of each action under our current policy π\pi.

Definition 1 (Expected SARSA Update Rule)

The update rule for Expected SARSA is given by,

Q(St,At)Q(St,At)+α[Rt+1+γEπ[Q(St+1,At+1)St+1]Q(St,At)]=Q(St,At)+α[Rt+1+γaπ(aSt+1)Q(St+1,a)Q(St,At)]\begin{align*} Q(S_t, A_t) & \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \mathbb{E}_{\pi}[Q(S_{t + 1}, A_{t + 1}) | S_{t + 1}] - Q(S_t, A_t)] \newline & = Q(S_t, A_t) + \alpha \left[R_{t + 1} + \gamma \sum_{a} \pi(a | S_{t + 1}) Q(S_{t + 1}, a) - Q(S_t, A_t)\right] \end{align*}

Expected SARSA will be more complex computationally than regular SARSA, but it will eliminate the variance due to the random selection of At+1A_{t + 1}, why?

Solution

The expected value is computed over all possible actions at the next state, rather than relying on a single sampled action. This reduces the variance in the updates, leading to more stable learning.

Maximization Bias and Double Learning

So far, all the control algorithms we have discussed involve some sort of maximization in the construction of their target policies.

  • In Q-learning, the target policy is the greedy policy with respect to the current action-value function QQ.
  • In SARSA, the policy is (often) ϵ\epsilon-greedy with respect to QQ, which also involves a maximization step.

These maximization steps can lead to significant positive biases in the action-value estimates.

Definition 2 (Double Learning)

Double learning is a technique where we divide the plays (experiences) into two sets and use them to learn two independent action-value function estimates, Q1Q_1 and Q2Q_2.

We use one estimate, say Q1Q_1, to determine the greedy action (maximization), and the other estimate, Q2Q_2, to compute the target value,

Q2(s,argmaxaQ1(s,a))Q_2(s, \underset{a}{\arg\max} Q_1(s, a))

We can repeat this process with the roles of Q1Q_1 and Q2Q_2 swapped to yield a second unbiased estimate,

Q1(s,argmaxaQ2(s,a))Q_1(s, \underset{a}{\arg\max} Q_2(s, a))

Although we learn two estimates, we only ever update one of them each play.

Note

Double learning will require double the memory, but does not increase the amount of computation per step.

Intuition (Double Q-Learning)

Let’s apply double learning to Q-learning, we will call this Double Q-learning.

Firstly, we divide the time steps into two sets with equal probability (e.g., by flipping a coin), if the result is heads, we update Q1Q_1, otherwise we update Q2Q_2. The behavior policy can use both action-value estimates, e.g., an ϵ\epsilon-greedy polic for Double Q-learning could be based on the average (or just sum) of the two action-value estimates, Q1+Q2Q_1 + Q_2.

Further, one can also double learning for SARSA and Expected SARSA, by using the two action-value estimates to compute the expected value in the update rule.

Algorithm (Double Q-Learning for estimating Q1Q2qQ_1 \approx Q_2 \approx q_{\star})

Algorithm parameters: step-size α(0,1]\alpha \in (0, 1], small ϵ>0\epsilon > 0

Initialize Q1(s,a)Q_1(s, a) and Q2(s,a)Q_2(s, a), for all sS+,aA(s)s \in \mathcal{S}^+, a \in \mathcal{A}(s), arbitrarily except that Q(terminal,)=0Q(\text{terminal}, \cdot) = 0

Loop for each episode:

  • Initialize SS
  • Loop for each step of episode:
    • Choose AA from SS using the policy ϵ\epsilon-greedy with respect to Q1+Q2Q_1 + Q_2.
    • Take action AA, observe R,SR, S^{\prime}
    • With probability 0.5:
      • Q1(S,A)Q1(S,A)+α[R+γQ2(S,argmaxaQ1(S,a))Q1(S,A)]Q_1(S, A) \leftarrow Q_1(S, A) + \alpha [R + \gamma Q_2(S^{\prime}, \underset{a}{\arg\max} Q_1(S^{\prime}, a)) - Q_1(S, A)]
    • Otherwise:
      • Q2(S,A)Q2(S,A)+α[R+γQ1(S,argmaxaQ2(S,a))Q2(S,A)]Q_2(S, A) \leftarrow Q_2(S, A) + \alpha [R + \gamma Q_1(S^{\prime}, \underset{a}{\arg\max} Q_2(S^{\prime}, a)) - Q_2(S, A)]
    • SSS \leftarrow S^{\prime}
  • until SS is terminal

nn-Step Bootstrapping

In many practical situations, neither MC or one-step TD are always the best choice.

nn-step TD (or nn-step bootstrapping) methods generalize both MC and one-step TD methods.

As per usual, we will first cover the prediction problema nd then the control problem.

nn-Step TD Prediction

Intuition (nn-Step Return)

Consider estimating vπv_{\pi} from sample episodes generated by following policy π\pi.

MC will perform an update for each state based on the entire sequence of observed rewards from that state until the end of the episode, i.e., the return GtG_t for that state StS_t.

The update of one-step TD is based on just the one next reward, bootstrapping from the value of the state one step later.

Thus, if we create an intermediate method, where we perform an update based on an intermediate number of rewards (more than one, but less than the entire episode), we can potentially get the best of both worlds.

A two-step update would be based on the first two rewards and the estimated value of the state two steps later, a three-step update would be based on the first three rewards and the estimated value of the state three steps later, and so on.

Definition 3 (nn-Step Return)

The nn-step return Gt:t+nG_{t:t+n} is defined as,

Gt:t+nRt+1+γRt+2+γ2Rt+3++γn1Rt+n+γnVt+n1(St+n)G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n V_{t + n - 1}(S_{t + n})

where Vt+n1V_{t + n - 1} is the estimate of the value function used at time t+n1t + n - 1.

However, we need to wait for future rewards to become available!

The natural state-value update when using nn-step returns is,

Vt+n(St)Vt+n1(St)+α[Gt:t+nVt+n1(St)]V_{t + n}(S_t) \coloneqq V_{t + n - 1}(S_t) + \alpha [G_{t:t+n} - V_{t + n - 1}(S_t)]

while the values of all other states remain unchanged,

Vt+n(s)=Vt+n1(s),sStV_{t + n}(s) = V_{t + n - 1}(s), \quad s \neq S_t

Note that no changes are made during the first n1n - 1 time steps of the episode, since we do not yet have enough rewards to make an nn-step update.

Definition 4 (nn-step SARSA Update Rule)

How can nn-step methods be used for control?

The main idea is to simply switch states for actions (i.e., use action-value functions rather than state-value functions) and then use an ϵ\epsilon-greedy policy with respect to the current action-value function estimate.

We (re)define the nn-step return for action-value functions as,

Gt:t+nRt+1+γRt+2+γ2Rt+3++γn1Rt+n+γnQt+n1(St+n,At+n)G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n Q_{t + n - 1}(S_{t + n}, A_{t + n})

where Qt+n1Q_{t + n - 1} is the estimate of the action-value function used at time t+n1t + n - 1. The natural action-value update when using nn-step returns is,

Qt+n(St,At)Qt+n1(St,At)+α[Gt:t+nQt+n1(St,At)]Q_{t + n}(S_t, A_t) \coloneqq Q_{t + n - 1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t + n - 1}(S_t, A_t)]

while the values of all other state-action pairs remain unchanged,

Qt+n(s,a)=Qt+n1(s,a),(s,a)(St,At)Q_{t + n}(s, a) = Q_{t + n - 1}(s, a), \quad (s, a) \neq (S_t, A_t)
Definition 5 (Algorithm: nn-step Expected SARSA Update Rule)

Similiar to nn-step SARSA, but with the exception that its last element is a branch over all action possibilities weighted by their probabilities under the current policy π\pi,

Gt:t+nRt+1+γRt+2+γ2Rt+3++γn1Rt+n+γnVˉt+n1(St+n)G_{t:t+n} \coloneqq R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \ldots + \gamma^{n - 1} R_{t + n} + \gamma^n \bar{V}_{t + n - 1}(S_{t + n})

where Vˉt(s)\bar{V}_t(s) is the expected approximate value of state ss, using the estimated action values at time tt and the policy π\pi,

Vˉt(s)aπ(as)Qt(s,a)\bar{V}_t(s) \coloneqq \sum_a \pi(a | s) Q_t(s, a)