Part 8 - Reinforcement Learning: Temporal-Difference Learning I

Introduction

In the previous part, we saw how Monte Carlo (MC) methods can be used to learn value functions and optimal policies from experience, without requiring a model of the environment (the dynamics p\mathcal{p}). However, MC methods have some limitations, such as sparse rewards, i.e., estimated returns are only available at the end of an episode.

In this part, we will introduce Temporal-Difference (TD) learning methods, which essentially combine ideas from both MC methods and Dynamic Programming (DP). Like MC methods, TD methods can learn directly from experience without a explicit model of the dynamics p\mathcal{p}. However, like DP, TD methods update estimates based in part on other learned estimates, without waiting for the final outcome (i.e., they can learn from incomplete episodes).

We will first cover the policy evaluation problem, estmating the value function vπv_{\pi} for a given policy π\pi. Then, we will cover the control problem, as per the other methods, we will see a variation of the generalized policy iteration (GPI) algorithm.

TD Prediction

As we have seen previously, MC methods wait until the end of the episode (to obtain the return GtG_t) to update the value function estimate V(St)V(S_t), e.g., Constant-α\alpha MC uses the update rule,

V(St)V(St)+α[GtV(St)]V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]

where GtG_t is the return following time step tt.

The idea behind TD methods is to instead update our value function estimate V(St)V(S_t) based on the immediate reward Rt+1R_{t + 1} and the estimate of the value function at the next state V(St+1)V(S_{t + 1}), why?

Solution

We know from the Bellman equation that the value of the current state must equal the (discounted) value of the expected next state, plus reward along the way.

Thus, we can use the following update rule, called the TD(0), or one-step TD,

V(St)V(St)+α[Rt+1+γV(St+1)V(St)]V(S_t) \leftarrow V(S_t) + \alpha [R_{t + 1} + \gamma V(S_{t + 1}) - V(S_t)]
Note

Note that the target that we aim for in the MC update is the return GtG_t, while in TD(0) it is the one-step lookahead Rt+1+γV(St+1)R_{t + 1} + \gamma V(S_{t + 1}).

We will discuss this difference more later on.

Algorithm ((Tabular) TD(0) for estimating vπv_{\pi})

Input: a policy π\pi to be evaluated.

Algorithm parameters: step-size α(0,1]\alpha \in (0, 1]

Initialize: V(s)V(s), for all sS+s \in \mathcal{S}^+, arbitrarily except that V(terminal)=0V(\text{terminal}) = 0

Loop for each episode:

  • Initialize SS
  • Loop for each step of episode:
    • AA \leftarrow action given by π\pi for SS
    • Take action AA, observe R,SR, S^{\prime}
    • V(S)V(S)+α[R+γV(S)V(S)]V(S) \leftarrow V(S) + \alpha [R + \gamma V(S^{\prime}) - V(S)]
    • SSS \leftarrow S^{\prime}
  • until SS is terminal
Note

Since the TD(0) update rule bases its update on an existing estimate V(St+1)V(S_{t + 1}), it is said to be a bootstrap method, just like DP methods (which bases their updates for vk+1v_{k + 1} on vkv_k).

Intuition (Relationship between TD, MC, and DP)

We have previously seen that,

vπ(s)Eπ[GtSt=s]=Eπ[Rt+1+γGt+1St=s]=Eπ[Rt+1+γvπ(St+1)St=s]\begin{align*} v_{\pi}(s) & \coloneqq \mathbb{E}_{\pi}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi}[R_{t + 1} + \gamma G_{t + 1} | S_t = s] \newline & = \mathbb{E}_{\pi}[R_{t + 1} + \gamma v_{\pi}(S_{t + 1}) | S_t = s] \newline \end{align*}

MC methods use an estimate of the first line as a target, more specifically, a sample return is used in place of the real expected return.

DP methods use an estimate of the last line as a target, more specifically, since vπ(St+1)v_{\pi}(S_{t + 1}) is unknown, they use the current estimate V(St+1)V(S_{t + 1}) in its place.

Thus, TD(0) can be seen as a combination of both MC and DP methods, it samples the expected value of the last line and uses the current estimate V(St+1)V(S_{t + 1}) in place of the unknown vπ(St+1)v_{\pi}(S_{t + 1}).

Definition 1 (TD Error)

The difference between the estimated value of StS_t and the better estimate Rt+1+γV(St+1)R_{t + 1} + \gamma V(S_{t + 1}),

δtRt+1+γV(St+1)V(St)\delta_t \coloneqq R_{t + 1} + \gamma V(S_{t + 1}) - V(S_t)

is called the TD error in V(St)V(S_t), available at time t+1t + 1.

Intuition (Advantages of TD Prediction)

TD advantages over DP: Does not require a model of the environment (the dynamics p\mathcal{p}), rewards, and next-state probability distributions.

TD advantages over MC: Is naturally implemented in an online setting, fully incremental fashion (i.e., it updates estimates based on each step of experience, without waiting for the end of the episode).

In other words, TD learns a guess (V(S)V(S)) from a guess (V(St+1)V(S_{t + 1})), but still converges!

Batch Updating TD

We have seen that the TD(0) update rule can update the value function estimate immediately after each time step. However, this might lead to a very fluctuating estimate, especially if the step-size α\alpha is large.

Instead, consider a batch (of finite amount) of experience, e.g., 10 episodes or 100 time steps. We can now present this experience repeatedly until our method converges.

Given an approximate value function VV, the TD (or MC) increments can be computed for every time step tt, but the value function is changed only once, by the entire sum of all the increments, this is called batch updating.

Then, all the available experience is processed again, with the new value function to produce a new overall increment and so on, until the value function converges.

Thus, the updates are made only after processing each complete batch of training data (instead of immediate updates).

Batch TD VS. Batch MC

Under batch training, constant-α\alpha MC converges to values V(s)V(s) that are sample averages of the actual returns experienced after visiting each state ss.

These are optimal estimates in the sense that they minimize the mean square error (MSE) from the actual returins in the training set.

However, since TD methods model it as a Markov process, they converge to different estimates.

If our process is Markov, we expect that TD will produce a lower error on future data, even though MC is better on the existing (training) data.

TD Control

As we have seen previously, we can use the generalized policy iteration (GPI) framework to solve the control problem.

We will introduce two classes of TD control methods, one for on-policy (SARSA) and one for off-policy (Q-learning).

The first step as per usual, is to learn an action-value function rather than a state-value function, since we can directly derive optimal policies from action values.

On-Policy TD Control: SARSA

Consider the transitions from state-action pair to state-action pair with intermediate rewards, we can update our action-value function estimate Q(St,At)Q(S_t, A_t) based on the observed reward Rt+1R_{t + 1} and the estimate of the value function at the next state-action pair Q(St+1,At+1)Q(S_{t + 1}, A_{t + 1}),

Q(St,At)Q(St,At)+α[Rt+1+γQ(St+1,At+1)Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma Q(S_{t + 1}, A_{t + 1}) - Q(S_t, A_t)]

As in all on-policy methods, we continually estimate qπq_{\pi} for the behavior policy π\pi, and at the same time change π\pi toward greediness with respect to the current estimate of qπq_{\pi}.

Algorithm (SARSA (On-Policy TD Control) for estimating QqQ \approx q_{\star})

Algorithm parameters: step-size α(0,1]\alpha \in (0, 1], small ϵ>0\epsilon > 0 Initialize Q(s,a)Q(s, a), for all sS+,aA(s)s \in \mathcal{S}^+, a \in \mathcal{A}(s), arbitrarily except that Q(terminal,)=0Q(\text{terminal}, \cdot) = 0

Loop for each episode:

  • Initialize SS
  • Choose AA from SS using policy derived from QQ (e.g., ϵ\epsilon-greedy)
  • Loop for each step of episode:
    • Take action AA, observe R,SR, S^{\prime}
    • Choose AA^{\prime} from SS^{\prime} using policy derived from QQ (e.g., ϵ\epsilon-greedy)
    • Q(S,A)Q(S,A)+α[R+γQ(S,A)Q(S,A)]Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S^{\prime}, A^{\prime}) - Q(S, A)]
    • SS;AAS \leftarrow S^{\prime}; A \leftarrow A^{\prime}
  • until SS is terminal

Off-Policy TD Control: Q-Learning

By looking at the TD update rule for SARSA, we can see another possibility.

The learned action-value function QQ directly approximates qq_{\star} the optimal action-value function, independent of the followed policy,

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)optimal action value at St+1Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \underbrace{\max_{a} Q(S_{t + 1}, a)}_{\text{optimal action value at } S_{t + 1}} - Q(S_t, A_t)]

The policy still determines which state-action pairs are visited and updated. However, it is still an off-policy method, since the update is based on the greedy action at the next state St+1S_{t + 1}, rather than the action actually taken At+1A_{t + 1}.

Algorithm (Q-Learning (Off-Policy TD Control) for estimating ππ\pi \approx \pi_{\star})

Algorithm parameters: step-size α(0,1]\alpha \in (0, 1], small ϵ>0\epsilon > 0 Initialize Q(s,a)Q(s, a), for all sS+,aA(s)s \in \mathcal{S}^+, a \in \mathcal{A}(s), arbitrarily except that Q(terminal,)=0Q(\text{terminal}, \cdot) = 0 Loop for each episode:

  • Initialize SS
  • Loop for each step of episode:
    • Choose AA from SS using policy derived from QQ (e.g., ϵ\epsilon-greedy)
    • Take action AA, observe R,SR, S^{\prime}
    • Q(S,A)Q(S,A)+α[R+γmaxaQ(S,a)Q(S,A)]Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a} Q(S^{\prime}, a) - Q(S, A)]
    • SSS \leftarrow S^{\prime}
  • until SS is terminal