Part 10 - Reinforcement Learning: Function Approximation and Introduction to Deep RL

Introduction

In the previous part, we covered some additional Temporal-Difference (TD) learning methods, including Expected SARSA, which is an on-policy method that uses the expected value of the next state-action pair rather than a sample, and Double Q-learning, which is an off-policy method that addresses the overestimation bias in Q-learning. Finally, we discussed the general framework known as $n$ -step bootstrapping, which unifies TD and Monte Carlo (MC) methods.

In this part, we will introduce the concept of function approximation in reinforcement learning (RL), which is essential for handling large or continuous state and action spaces. We will discuss different types of function approximators, including linear and non-linear methods, and how they can be integrated into RL algorithms. Finally, we will provide an introduction to Deep Reinforcement Learning (Deep RL), which combines deep learning techniques with RL to tackle complex problems.

Function Approximation in Reinforcement Learning

In many (real-world) tasks, the state (and action) space is combinatorial and enormous, thus the memory, time, and, data requirements become infeasible.

Intuition (Function Approximation)

If we instead can find a good approximate solution using limited computational resources, while not losing (much) generalization, we can solve much larger problems.

But, how can experience with a limited subset of the state space be used to produce a good approximation for the entire state space?

This is what function approximation aims to solve.

Definition 1 (Value Function Approximation)

We represent the state value function approximation as a parameterized function with weight vector $\mathbf{w} \in \mathbb{R}^d$ ,

\hat{v}(s, \mathbf{w}) \approx v_{\pi}(s),

$\hat{v}$ might be a linear function but it can also be a function computed by a neural network.

Typically, the number of (shared) weights (the dimensionality of $\mathbf{w}$ ) is much smaller than the number of states, i.e., $d \ll |\mathcal{S}|$ . When a single state is updated, the change generalizes from that state to affect the values of many other states.

Definition 2 (Approximate Prediction of Value Functions)

We denote the individual update (target) by the notation $s \mapsto u$ ,

MC Updatew: $S_t \mapsto G_t$
TD(0) Update: $S_t \mapsto R_{t + 1} + \gamma \hat{v}(S_{t + 1}, \mathbf{w})$
$n$ -step TD Update: $S_t \mapsto G_{t:t+n}$
DP Update: $s \mapsto \mathbb{E}_{\pi}[R_{t + 1} + \gamma \hat{v}(S_{t + 1}, \mathbf{w}) | S_t = s]$

The update $s \mapsto u$ means that the estimated value for state $s$ should be “more like” the updated target $u$ .

Thus, we can treat this as a supervised learning problem! However, there is a fundamental difference, in classical supervised learning, we have a stating training dataset. While in RL, learning occurs online (i.e., when the agent interacts with the environment), and the targets are incremental (i.e., non i.i.d. data) and non-stationary (i.e., the target changes as the value function changes).

Definition 3 (The Prediction Objective (

\bar{VE}

))

For a supervised learning task, we need some notion of error or loss, a common choice is the mean squared value error ( $\bar{VE}$ ),

\bar{VE}(\mathbf{w}) \coloneqq \sum_{s \in \mathcal{S}} \mu(s) [v_{\pi}(s) - \hat{v}(s, \mathbf{w})]^2,

where $\mu(s)$ is the weighting factor such that $\mu(s) \geq 0, \sum_s \mu(s) = 1$ , is chosen to be the fraction of the time spent in state $s$ under policy $\pi$ (also called the on-policy distribution).

Another common choice is the square root of $\bar{VE}$ , called the root mean squared value error (RMSVE).

Note

Note that, in continuing tasks, the on-policy distribution is the stationary distribution under policy $\pi$ .

Definition 4 (On-Policy Distribution in Episodic Tasks)

Let $h(s)$ be the probability that an episode starts in state $s$ . Further, let $\eta(s)$ be the number of time steps spent, on average, in state $s$ in a single episode,

\eta(s) \coloneqq h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a | \bar{s}) p(s | \bar{s}, a), \ \forall s \in \mathcal{S}.

This system of equations can be solved for the expected number of visits $\eta(s)$ to each state $s$ .

The on-policy distribution is then obtained by normalization,

\mu(s) \coloneqq \frac{\eta(s)}{\sum_{\bar{s}} \eta(\bar{s})}, \ \forall s \in \mathcal{S}.

Note

Note that, we assumed no discounting, i.e., $\gamma = 1$ here.

Intuition (Minimizing the Prediction Objective)

The next logical step would be to find a global optimum, in terms of the weights $\mathbf{w}$ , that minimizes the prediction objective $\bar{VE}(\mathbf{w})$ . This is possible if we have a nice simple function, such as linear, convex, or differentiable. However, this is rarely possible for more complex models, such as neural networks.

Thus, we instead will suffice with finding a local optimum.

Definition 5 (Stochastic Gradient Descent (SGD))

Firstly, the weightt vector is a column vector with a fixed number of real valued elements,

\mathbf{w} \coloneqq \begin{bmatrix} w_1 \newline w_2 \newline \vdots \newline w_d \end{bmatrix}

Secondly, the function $\hat{v}(s, \mathbf{w})$ is a differentiable function with respect to $\mathbf{w} \forall s \in \mathcal{S}$ .

Since we will be updating $\mathbf{w}$ at each of a series of discrete time steps $t = 0, 1, 2, \ldots$ , we will denote the weight vector at time $t$ by $\mathbf{w}_t$ . Further, assume at time step $t$ , the agent observes a new example $S_t \mapsto v_{\pi}(S_t)$ consisting of a state $S_t$ and its (true) value under the policy.

Lastly, for simplicity, assume that the states appearing in the examples have the same distribution $\mu(s)$ .

Thus, the SGD update rule is given by,

\begin{align*} \mathbf{w}_{t + 1} & \coloneqq \mathbf{w}_t + \frac{1}{2} \alpha \nabla [v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t)]^2 \newline & = \mathbf{w}_t + \alpha [v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t)] \nabla \hat{v}(S_t, \mathbf{w}_t) \end{align*}

where $\alpha > 0$ is the step-size parameter (or learning rate).

Definition 6 (SGD with Approximate Targets)

However, unlike the classical supervised learning setting, in RL we do not have acces to the true value $v_{\pi}(S_t)$ . Instead, we have to use an approximate target $U_t \in \mathbb{R}$ of the $t$ -th training example $S_t \mapsto U_t$ , e.g., a noise-corrupted version of the true value or a bootstrapped estimate.

In this setting, we simply substitute $U_t$ for $v_{\pi}(S_t)$ in the SGD update rule,

\mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [U_t - \hat{v}(S_t, \mathbf{w}_t)] \nabla \hat{v}(S_t, \mathbf{w}_t).

Warning

If $U_t$ is an unbiased estimate, i.e., $\mathbb{E}[U_t | S_t] = v_{\pi}(S_t) \ \forall t$ , then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under some certain conditions.

Algorithm (Gradient Monte Carlo Algorithm for estimating

\hat{v} \approx v_{\pi}

)

Input: the policy $\pi$ to be evaluated

Input: a differentiable function $\hat{v} : \mathcal{S} \times \mathbb{R}^d \rightarrow \mathbb{R}$

Algorithm parameter: step-size $\alpha > 0$

Initialize value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$ )

Loop forever (for each episode):

Generate an episode $S_0, A_0, R_1, S_1, A_1, \ldots, R_T, S_T$ using policy $\pi$
Loop for each step of episode $t = 0, 1, \ldots, T - 1$ $t = 0, 1, \dots, T - 1$ :
- $\mathbf{w} \leftarrow \mathbf{w} + \alpha [G_t - \hat{v}(S_t, \mathbf{w})] \nabla \hat{v}(S_t, \mathbf{w})$

Note

The MC target $U_t \coloneqq G_t$ is an unbiased estimate of $v_{\pi}(S_t)$ 11Why? The answer is, by definition of the value function.. Thus, it is guaranteed to converge to a local optimum under some certain conditions.

Definition 7 (Semi-Gradient Methods)

When we are dealing with bootstrapping targets, such as the $n$ -step TD target $G_{t:t+n}$ or the DP target $\sum_{a, s^{\prime}, r} \pi(a | S_t) \mathcal{p}(s^{\prime}, r | S_t, a) [r + \gamma \hat{v}(s^{\prime}, \mathbf{w})]$ , all depend on the current value of the weight vector $\mathbf{w}_t$ . Thus, the targets will be biased and they will not produce a true gradient descent method.

We instead call these methods semi-gradient methods.

Algorithm (Semi-Gradient TD(0) for estimating

\hat{v} \approx v_{\pi}

)

Input: the policy $\pi$ to be evaluated

Input: a differentiable function $\hat{v} : \mathcal{S} \times \mathbb{R}^d \rightarrow \mathbb{R}$ such that $\hat{v}(\text{terminal}, \cdot) = 0$

Algorithm parameters: step-size $\alpha > 0$

Initialize value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$ )

Loop for each episode:

Initialize $S$
Loop for each step of episode:
- Choose $A \sim \pi(\cdot | S)$
- Take action $A$ , observe $R, S^{\prime}$
- $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma \hat{v}(S^{\prime}, \mathbf{w}) - \hat{v}(S, \mathbf{w})] \nabla \hat{v}(S, \mathbf{w})$
- $S \leftarrow S^{\prime}$
until $S$ is terminal

Generalizing Function Approximation

Definition 8 (State Aggregation)

A simple form of generalizing function approximation is state aggregation, where states are grouped together, with one estimated value (one element in the weight vector $\mathbf{w}$ ) for each group. The value of a state is estimated as its group’s component, and when the state is updated, that component alone is updated.

Thus, this is a special case of SGD in which the gradient $\nabla \hat{v}(s, \mathbf{w})$ is 1 for $S_t$ ’s group’s component and 0 for all other components.

Definition 9 (Linear Methods)

A more general and powerful form of function approximation is linear methods, where the value of a state is estimated as a linear combination of features. For every state $s$ , there is a vector $\mathbf{x}(s) \coloneqq \begin{bmatrix} x_1(s) & x_2(s) & \ldots & x_d(s) \end{bmatrix}^T$ with the same number of components as $\mathbf{w}$ .

Linear methods approximate the state-value function by,

\hat{v}(s, \mathbf{w}) \coloneqq \mathbf{w}^T \mathbf{x}(s) = \sum_{i = 1}^d w_i x_i(s),

The real valued vector $\mathbf{x}(s)$ is called the feature vector representing state $s$ .

The gradient of the linear approximate value function with respect to $\mathbf{w}$ is simply the feature vector,

\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)

Thus, the SGD update is,

\mathbf{w}_{t + 1} \coloneqq \mathbf{w}_t + \alpha [U_t - \hat{v}(S_t, \mathbf{w}_t)] \mathbf{x}(S_t).

Further, local optima are guaranteed to be global optima 22Why? The answer is, the prediction objective $\bar{VE}(\mathbf{w})$ is a quadratic function in terms of $\mathbf{w}$ , thus it is convex..

Note

The performance of linear methods depends critically on how the states are represented in terms of features.

Choosing appropriate features is an important way of adding prior domain knowledge to RL systems. There are several ways to construct the relevant features, but hand-crafted features can be brittle and non-transferable to other tasks.

Definition 10 (Nonlinear function approximation with neural networks)

Neural networks are widely used for nonlinear function approximation [^1] in RL. They can for example use the TD error/target to learn value functions. We use backpropagation to estimate the partial derivative of an objective function with respect to each weight in the network.

However, they can suffer from overfitting, especially for deep networks with many layers and parameters.

Summary (Parametric Approximation Methods)

So far, we have discussed the parametric approach to approximating value functions, let’s summarize the main points.

A learning algorithm adjusts the parameters of a functional form intended to approximate the value function.
Each update $s \mapsto g$ is a training example used by the learning algorithm to change the parameters with the aim of reducing the approximation error.
After the update, the training example can be discarded (although, it might be saved to be used again).
When an approximate value of a state (sometimes called the query state) is needed, the function is simply evaluated at that state using the latest parameters produced by the learning algorithm.

Non-Parametric (Memory-Based) Function Approximation

Definition 11 (Non-Parametric (Memory-Based) Function Approximation)

In the parametric setting, we update the parameters each time we receive a new training example, and then discard the example. However, in the non-parametric (memory-based) setting, we store the training examples without explicitly updating a parameterized function.

Whenever a query state’s value is needed, a set of examples is retrieved from memory and used to compute a value estimate for the query state. It is also sometimes called, lazy learning, because processing training examples is postponed until the system is queried to provide an output.

Definition 12 (Local-Learning Methods)

Local-learning methods are a class of methods where we approximate a value function only locally in the neighborhood of the current query state. The closer a training example’s state is to the query state, the more relevant it is considered to be. Distance can be defined in many different ways, e.g., Euclidean distance, Manhattan distance, or domain-specific distance measures.

An example is the nearest neighbor method, it simply finds the example in memory whose state is closest to the query state and returns that example’s valuea s the approximate value of the query state.