Introduction
In the previous part, we saw how Monte Carlo (MC) methods can be used to learn value functions and optimal policies from experience, without requiring a model of the environment (the dynamics $\mathcal{p}$). However, MC methods have an important limitation: the return $G_t$ is only available at the end of an episode, so value estimates can only be updated once the episode terminates.
In this part, we will introduce Temporal-Difference (TD) learning methods, which essentially combine ideas from both MC methods and Dynamic Programming (DP). Like MC methods, TD methods can learn directly from experience without an explicit model of the dynamics $\mathcal{p}$. However, like DP, TD methods update estimates based in part on other learned estimates, without waiting for the final outcome (i.e., they can learn from incomplete episodes).
We will first cover the policy evaluation (prediction) problem, estimating the value function $v_{\pi}$ for a given policy $\pi$. Then, we will cover the control problem, where, as with the other methods, we will use a variation of generalized policy iteration (GPI).
TD Prediction
As we have seen previously, MC methods wait until the end of the episode (to obtain the return $G_t$) to update the value function estimate $V(S_t)$, e.g., Constant-$\alpha$ MC uses the update rule,
$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] $$ where $G_t$ is the return following time step $t$.
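For reference, here is a minimal Python sketch of this constant-$\alpha$ MC update applied at the end of a single episode; the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the dictionary-based value table are assumptions made for illustration.

```python
from collections import defaultdict

def constant_alpha_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Constant-alpha (every-visit) MC: move V(S_t) toward the sampled return G_t.

    V       -- dict-like table mapping state -> value estimate (e.g., defaultdict(float))
    episode -- list of (state, reward) pairs, reward being R_{t+1}
    """
    G = 0.0
    # Walk backwards so each return G_t = R_{t+1} + gamma * G_{t+1} is built incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])  # V(S_t) <- V(S_t) + alpha [G_t - V(S_t)]
    return V

# Example usage on a hypothetical 3-step episode:
V = defaultdict(float)
constant_alpha_mc_update(V, [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
```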
The idea behind TD methods is to instead update our value function estimate $V(S_t)$ based on the immediate reward $R_{t + 1}$ and the estimate of the value function at the next state, $V(S_{t + 1})$. Why is this justified?
Solution
We know from the Bellman equation that the value of the current state must equal the expected immediate reward plus the discounted value of the next state.
Thus, we can use the following update rule, called TD(0), or one-step TD, $$ V(S_t) \leftarrow V(S_t) + \alpha [R_{t + 1} + \gamma V(S_{t + 1}) - V(S_t)] $$
Note
Note that the target that we aim for in the MC update is the return $G_t$, while in TD(0) it is the one-step lookahead $R_{t + 1} + \gamma V(S_{t + 1})$.
We will discuss this difference more later on.
Algorithm: (Tabular) TD(0) for estimating $v_{\pi}$
Input: a policy $\pi$ to be evaluated.
Algorithm parameters: step-size $\alpha \in (0, 1]$
Initialize: $V(s)$, for all $s \in \mathcal{S}^+$, arbitrarily except that $V(\text{terminal}) = 0$
Loop for each episode:
- Initialize $S$
- Loop for each step of episode:
- $A \leftarrow$ action given by $\pi$ for $S$
- Take action $A$, observe $R, S^{\prime}$
- $V(S) \leftarrow V(S) + \alpha [R + \gamma V(S^{\prime}) - V(S)]$
- $S \leftarrow S^{\prime}$
- until $S$ is terminal
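A minimal Python sketch of this tabular TD(0) procedure is given below; the environment interface (`reset()` returning an initial state, `step(action)` returning the next state, reward, and a done flag) and the `policy(state)` function are hypothetical assumptions, not part of any specific library.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): estimate v_pi for the given policy from experience."""
    V = defaultdict(float)  # V(s) = 0 for unseen states; terminal states stay at 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # A <- action given by pi for S
            next_state, reward, done = env.step(action)  # take A, observe R, S'
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])      # TD(0) update
            state = next_state                           # S <- S'
    return V
```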
Note
Since the TD(0) update rule bases its update on an existing estimate $V(S_{t + 1})$, it is said to be a bootstrapping method, just like DP methods (which base their updates for $v_{k + 1}$ on $v_k$).
Intuition: Relationship between TD, MC, and DP
We have previously seen that, $$ \begin{align*} v_{\pi}(s) & \coloneqq \mathbb{E}_{\pi}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi}[R_{t + 1} + \gamma G_{t + 1} | S_t = s] \newline & = \mathbb{E}_{\pi}[R_{t + 1} + \gamma v_{\pi}(S_{t + 1}) | S_t = s] \end{align*} $$ MC methods use an estimate of the first line as a target: a sample return is used in place of the true expected return.
DP methods use an estimate of the last line as a target: the expectation is computed from the model, but since $v_{\pi}(S_{t + 1})$ is unknown, the current estimate $V(S_{t + 1})$ is used in its place.
Thus, TD(0) can be seen as a combination of the two: like MC, it samples the expectation in the last line, and like DP, it uses the current estimate $V(S_{t + 1})$ in place of the unknown $v_{\pi}(S_{t + 1})$.
Definition: TD Error
The difference between the estimated value of $S_t$ and the better estimate $R_{t + 1} + \gamma V(S_{t + 1})$, $$ \delta_t \coloneqq R_{t + 1} + \gamma V(S_{t + 1}) - V(S_t) $$ is called the TD error in $V(S_t)$, available at time $t + 1$.
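For instance, with illustrative numbers assumed only for this example: if $V(S_t) = 0.5$, $V(S_{t + 1}) = 0.8$, $R_{t + 1} = 1$, and $\gamma = 0.9$, then $$ \delta_t = 1 + 0.9 \times 0.8 - 0.5 = 1.22, $$ so the TD(0) update would nudge $V(S_t)$ upward by $\alpha \delta_t$.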
Intuition: Advantages of TD Prediction
TD advantages over DP: TD methods do not require a model of the environment, i.e., of its reward and next-state probability distributions (the dynamics $\mathcal{p}$).
TD advantages over MC: TD methods are naturally implemented in an online, fully incremental fashion (i.e., they update estimates after each step of experience, without waiting for the end of the episode).
In other words, TD learns a guess ($V(S_t)$) from a guess ($V(S_{t + 1})$), but still converges!
Batch Updating TD
We have seen that the TD(0) update rule can update the value function estimate immediately after each time step. However, this can lead to a noisy, fluctuating estimate, especially if the step-size $\alpha$ is large.
Instead, consider a finite batch of experience, e.g., 10 episodes or 100 time steps. We can then present this experience repeatedly until the method converges.
Given an approximate value function $V$, the TD (or MC) increments can be computed for every time step $t$, but the value function is changed only once, by the sum of all the increments; this is called batch updating.
Then, all the available experience is processed again with the new value function to produce a new overall increment, and so on, until the value function converges.
Thus, updates are applied only after processing the complete batch of training data, rather than after each individual step.
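As a sketch of batch updating with TD(0), assume the batch of experience is stored as a list of episodes, each a list of `(state, reward, next_state, done)` transitions collected under the policy being evaluated; this data layout is an assumption for illustration.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, num_sweeps=1000, tol=1e-6):
    """Batch TD(0): sweep the fixed batch repeatedly, accumulating all
    increments and applying them only once per sweep."""
    V = defaultdict(float)
    for _ in range(num_sweeps):
        increments = defaultdict(float)
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + gamma * (0.0 if done else V[next_state])
                increments[state] += alpha * (target - V[state])
        # Apply the summed increments only after processing the whole batch.
        for state, delta in increments.items():
            V[state] += delta
        if increments and max(abs(d) for d in increments.values()) < tol:
            break  # value function has (approximately) converged
    return V
```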
Batch TD vs. Batch MC
Under batch training, constant-$\alpha$ MC converges to values $V(s)$ that are sample averages of the actual returns experienced after visiting each state $s$.
These are optimal estimates in the sense that they minimize the mean squared error (MSE) with respect to the actual returns in the training set.
Batch TD(0), on the other hand, converges to different estimates, because it implicitly models the experience as a Markov process rather than fitting the sampled returns directly.
If our process is Markov, we expect that TD will produce a lower error on future data, even though MC is better on the existing (training) data.
TD Control
As we have seen previously, we can use the generalized policy iteration (GPI) framework to solve the control problem.
We will introduce one method from each of the two classes of TD control: on-policy (SARSA) and off-policy (Q-learning).
The first step, as usual, is to learn an action-value function rather than a state-value function, since we can directly derive optimal policies from action values.
On-Policy TD Control: SARSA
Consider transitions from state-action pair to state-action pair, with intermediate rewards. We can update our action-value estimate $Q(S_t, A_t)$ based on the observed reward $R_{t + 1}$ and the estimate of the action value at the next state-action pair, $Q(S_{t + 1}, A_{t + 1})$,
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma Q(S_{t + 1}, A_{t + 1}) - Q(S_t, A_t)] $$
As in all on-policy methods, we continually estimate $q_{\pi}$ for the behavior policy $\pi$, and at the same time change $\pi$ toward greediness with respect to the current estimate of $q_{\pi}$.
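A small sketch of $\epsilon$-greedy action selection with respect to the current estimate $Q$ is shown below; the tabular `Q` dictionary keyed by `(state, action)` pairs and the explicit `actions` list are assumptions for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```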
Algorithm: SARSA (On-Policy TD Control) for estimating $Q \approx q_{\star}$
Algorithm parameters: step-size $\alpha \in (0, 1]$, small $\epsilon > 0$
Initialize: $Q(s, a)$, for all $s \in \mathcal{S}^+, a \in \mathcal{A}(s)$, arbitrarily except that $Q(\text{terminal}, \cdot) = 0$
Loop for each episode:
- Initialize $S$
- Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
- Loop for each step of episode:
- Take action $A$, observe $R, S^{\prime}$
- Choose $A^{\prime}$ from $S^{\prime}$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
- $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S^{\prime}, A^{\prime}) - Q(S, A)]$
- $S \leftarrow S^{\prime}; A \leftarrow A^{\prime}$
- until $S$ is terminal
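A minimal Python sketch of SARSA under the same hypothetical environment interface as before (`reset()`, `step(action)` returning next state, reward, and a done flag, plus a fixed list of actions); these assumptions are for illustration only.

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: estimate Q while following an epsilon-greedy policy."""
    Q = defaultdict(float)  # Q(s, a) = 0 by default, including terminal states

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # SARSA target uses the action the policy will actually take next.
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action  # S <- S'; A <- A'
    return Q
```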
Off-Policy TD Control: Q-Learning
Looking at the SARSA update rule, we can see another possibility: instead of bootstrapping from the action actually taken at the next state, we can bootstrap from the greedy (maximizing) action.
The learned action-value function $Q$ then directly approximates $q_{\star}$, the optimal action-value function, independent of the policy being followed,
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \underbrace{\max_{a} Q(S_{t + 1}, a)}_{\text{optimal action value at } S_{t + 1}} - Q(S_t, A_t)] $$
The behavior policy still determines which state-action pairs are visited and updated. Nevertheless, Q-learning is an off-policy method, since the update is based on the greedy action at the next state $S_{t + 1}$, rather than the action actually taken, $A_{t + 1}$.
Algorithm: Q-Learning (Off-Policy TD Control) for estimating $\pi \approx \pi_{\star}$
Algorithm parameters: step-size $\alpha \in (0, 1]$, small $\epsilon > 0$
Initialize: $Q(s, a)$, for all $s \in \mathcal{S}^+, a \in \mathcal{A}(s)$, arbitrarily except that $Q(\text{terminal}, \cdot) = 0$
Loop for each episode:
- Initialize $S$
- Loop for each step of episode:
- Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
- Take action $A$, observe $R, S^{\prime}$
- $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a} Q(S^{\prime}, a) - Q(S, A)]$
- $S \leftarrow S^{\prime}$
- until $S$ is terminal
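A corresponding sketch of Q-learning, under the same assumed environment interface; note that the only substantive change from SARSA is the target, which uses $\max_{a} Q(S^{\prime}, a)$ instead of $Q(S^{\prime}, A^{\prime})$.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap
    from the greedy (max) action value at the next state."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning target: greedy action value at the next state.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```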