Introduction
In the previous part, we continued our discussion on Deep Reinforcement Learning (Deep RL) and explored how to use function approximation for control tasks in RL. We covered key algorithms such as semi-gradient SARSA and Deep Q-Networks (DQN) along with its variants. In this part, we will shift our focus to policy gradient methods and actor-critic algorithms. We will discuss the fundamental concepts behind policy gradients, explore various algorithms such as REINFORCE and Proximal Policy Optimization (PPO), and understand how actor-critic methods combine value-based and policy-based approaches to achieve efficient learning in complex environments.
Policy Gradient Methods
So far, we have focused on methods that learn state- or action-value functions.
Here, we instead consider methods that learn a parameterized policy that can select actions without consulting a value function.
Notation: Policy Gradient Methods
We use the notation $\boldsymbol{\theta} \in \mathbb{R}^{d^{\prime}}$ for the policy’s parameter vector. Thus, $\pi(a \mid s, \boldsymbol{\theta}) = P(A_t = a \mid S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta})$ denotes the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$ with parameter vector $\boldsymbol{\theta}$.
If a method uses a learned value function as well, then the value function’s weight vector is denoted $\mathbf{w} \in \mathbb{R}^d$ as usual, as in $\hat{v}(s, \mathbf{w})$.
Intuition: Policy Gradient
We consider learning the policy parameter based on the gradient (ascent) of some scalar performance measure $J(\boldsymbol{\theta})$, $$ \boldsymbol{\theta}_{t + 1} \coloneqq \boldsymbol{\theta}_t + \alpha \nabla J(\boldsymbol{\theta}_t), $$ where $\nabla J(\boldsymbol{\theta})$ is the gradient of $J(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ (or an estimate of it) and $\alpha > 0$ is a step-size parameter.
Such methods are called policy gradient methods. They can also learn an approximate value function.
Further, methods that learn approximations to both policy and value functions are often called actor-critic methods, where actor refers to the learned policy, and critic refers to the learned value function, usually a state-value function.
Intuition: Pros & Cons of Policy-based RL
Advantages:
- Good convergence properties.
- Extend more naturally to high-dimensional or continuous action spaces.
- Can learn stochastic policies, and can still represent deterministic ones if needed.
- A way of injecting prior knowledge about the desired form of the policy.
- Sometimes the policy is simple while the values and models are complex, e.g., a rich domain where the optimal action is always “go left”.
- The action probabilities change smoothly as a function of the parameter (with $\epsilon$-greedy the action probabilities may change dramatically).
Disadvantages:
- Susceptible to local optima (especially with non-linear models).
- The obtained knowledge is specific to the policy and does not always generalize well.
- Ignores a lot of information in the data (when used in isolation).
Definition: Policy Objective Functions
Given a policy $\pi_{\boldsymbol{\theta}}(a \mid s)$, with parameters $\boldsymbol{\theta}$, we want to find the best $\boldsymbol{\theta}$.
But how do we measure the quality of a policy $\pi_{\boldsymbol{\theta}}$?
In episodic environments we can use the start value, $$ J_0(\boldsymbol{\theta}) \coloneqq v_{\pi_{\boldsymbol{\theta}}}(s_0). $$ In continuing environments we can use the average value, $$ J_{avV}(\boldsymbol{\theta}) \coloneqq \sum_s \mu_{\pi_{\boldsymbol{\theta}}}(s) v_{\pi_{\boldsymbol{\theta}}}(s), $$ where $\mu_{\pi_{\boldsymbol{\theta}}}(s) = p(S_t = s \mid \pi_{\boldsymbol{\theta}})$ is the probability of being in state $s$ in the long run, i.e., the on-policy distribution under policy $\pi_{\boldsymbol{\theta}}$.
Alternatively, we can use the average reward per time step, $$ J_{avR}(\boldsymbol{\theta}) \coloneqq \sum_s \mu_{\pi_{\boldsymbol{\theta}}}(s) \sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \sum_{r} r p(r \mid s, a). $$
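To make these objectives concrete, below is a minimal sketch (in Python, with NumPy) that evaluates $J_{avR}$ for a tiny hypothetical two-state, two-action MDP; all transition probabilities, rewards, and policy values are made-up illustrative numbers, and the on-policy distribution $\mu_{\pi_{\boldsymbol{\theta}}}$ is obtained by solving the stationary-distribution equations of the Markov chain induced by the policy.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; numbers chosen purely for illustration.
# P[a, s, s2] = p(s2 | s, a); R[s, a] = expected reward E[r | s, a].
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],     # pi[s, a]: a fixed stochastic policy
               [0.4, 0.6]])

# Markov chain induced by the policy: P_pi[s, s2] = sum_a pi(a|s) p(s2|s, a)
P_pi = np.einsum('sa,asz->sz', pi, P)

# Stationary (on-policy) distribution: solve mu^T P_pi = mu^T with sum(mu) = 1
A = np.vstack([P_pi.T - np.eye(2), np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
mu = np.linalg.lstsq(A, b, rcond=None)[0]

# Average reward per time step: J_avR = sum_s mu(s) sum_a pi(a|s) E[r | s, a]
J_avR = float(mu @ np.sum(pi * R, axis=1))
print(mu, J_avR)
```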
Intuition: Policy Optimization
Policy-based RL is an optimization problem, i.e., find $\boldsymbol{\theta}^{\star} = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$. Some approaches do not use the gradient at all, e.g., hill climbing and genetic algorithms.
However, we will focus on stochastic gradient ascent, which is often quite efficient (and easy to use with deep nets).
Definition: Policy Gradient
Let $J(\boldsymbol{\theta})$ be any policy objective function. Policy gradient algorithms search for a (local) maximum of $J(\boldsymbol{\theta})$ by ascending the gradient of $J$ with respect to $\boldsymbol{\theta}$, $$ \Delta \boldsymbol{\theta} = \alpha \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), $$ where $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ is the policy gradient, $$ \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \coloneqq \begin{bmatrix} \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_1} \newline \vdots \newline \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_{d^{\prime}}} \end{bmatrix}. $$
Note
However, we face a challenge: $J(\boldsymbol{\theta})$ depends on both the action selections and the distribution of states in which those selections are made.
Given a state, the effect of the policy parameter on the actions, and thus on the reward, can be computed in a relatively straightforward way.
But the effect of the policy on the state distribution is a function of the environment and is typically unknown.
Thus, how can we estimate the gradient of the performance measure with respect to the policy parameters when the gradient depends on the unknown effect of policy changes on the state distribution?
Theorem: Policy Gradient Theorem
The policy gradient theorem answers this question.
It provides an analytical expression for $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ that does not involve the derivative of the state distribution, $$ \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta}), $$ where any constant of proportionality can be absorbed into the step-size parameter $\alpha$.
Further, in the episodic case this holds for $\gamma = 1$; for general $\gamma$, $\mu(s)$ must be replaced by a more complicated weighting that accounts for discounting at different time steps $t$.
Definition: Naïve Policy Gradient Algorithm
The RHS of the policy gradient theorem is a sum over states weighted by how often the states occur under the policy $\pi$. Thus, $$ \begin{align*} \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) & \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta}) \newline & = \mathbb{E}_{\pi} \left[\sum_a q_{\pi}(S_t, a) \nabla_{\boldsymbol{\theta}} \pi(a \mid S_t, \boldsymbol{\theta})\right] \newline \end{align*} $$ We could stop here and instantiate our stochastic gradient ascent algorithm (called an all-actions method) as, $$ \boldsymbol{\theta}_{t + 1} \coloneqq \boldsymbol{\theta}_t + \alpha \sum_a \hat{q}(S_t, a, \mathbf{w}) \nabla_{\boldsymbol{\theta}} \pi(a \mid S_t, \boldsymbol{\theta}_t), $$ where $\hat{q}$ is some learned approximation of $q_{\pi}$.
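As a concrete illustration (not part of the formal statement above), here is a minimal sketch of one all-actions update, assuming a tabular softmax policy with one preference parameter per state-action pair and a learned action-value table `q_hat`; the names and the tabular assumption are choices made only for this sketch.

```python
import numpy as np

def softmax(h):
    h = h - h.max()              # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def all_actions_update(theta, s, q_hat, alpha):
    """One all-actions step: theta <- theta + alpha * sum_a q_hat(s, a) * grad pi(a | s, theta),
    for a tabular softmax policy with one preference theta[s, a] per state-action pair."""
    pi = softmax(theta[s])
    # For tabular softmax: d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s)),
    # and the gradient is zero for all rows theta[s', :] with s' != s.
    grad_pi = pi[:, None] * (np.eye(len(pi)) - pi[None, :])   # shape (A, A): rows index a, columns index b
    theta[s] += alpha * (q_hat[s] @ grad_pi)
    return theta
```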
This all-actions approach is promising, but we want to focus on methods whose update at time $t$ involves only the single action $A_t$ actually taken.
Intuition: REINFORCE Monte-Carlo Policy Gradient
The idea is to replace the sum over actions (the possible values of the random variable $A_t$) by an expectation under $\pi$, and then to sample that expectation, $$ \begin{align*} \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) & \propto \mathbb{E}_{\pi} \left[\sum_a \pi(a \mid S_t, \boldsymbol{\theta}) q_{\pi}(S_t, a) \frac{\nabla_{\boldsymbol{\theta}} \pi(a \mid S_t, \boldsymbol{\theta})}{\pi(a \mid S_t, \boldsymbol{\theta})}\right] \newline & = \mathbb{E}_{\pi} \left[q_{\pi}(S_t, A_t) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta})}{\pi(A_t \mid S_t, \boldsymbol{\theta})}\right] \newline & = \mathbb{E}_{\pi} \left[G_t \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta})}{\pi(A_t \mid S_t, \boldsymbol{\theta})}\right] \newline \end{align*} $$ where the last step uses $\mathbb{E}_{\pi}[G_t \mid S_t, A_t] = q_{\pi}(S_t, A_t)$. The final expression, read in words, is a quantity that can be sampled on each time step and whose expectation is proportional to the gradient!
Then, the REINFORCE update becomes (which is very intuitive!), $$ \boldsymbol{\theta}_{t + 1} \coloneqq \boldsymbol{\theta}_t + \alpha G_t \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)}. $$ It increases $\boldsymbol{\theta}_t$ in the direction of $\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta}_t)$ proportional to the return and inversely proportional to the action probability.
The former makes sense because it causes the parameter to move most in the directions that favor actions yielding the highest return.
The latter makes sense, otherwise actions that are selected frequently are at an advantage (the updates will be more often in their direction) and might win out even if they do not yield the highest return.
Algorithm: REINFORCE: Monte-Carlo Policy Gradient
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Algorithm parameters: step-size $\alpha > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d^{\prime}}$ (e.g., to $\mathbf{0}$)
Loop forever (for each episode):
- Generate an episode $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T - 1}, A_{T - 1}, R_T$, following $\pi(\cdot \mid \cdot, \boldsymbol{\theta})$
- Loop for each step of the episode, $t = 0, 1, \ldots, T - 1$:
  - $G \leftarrow \sum_{k = t + 1}^T \gamma^{k - t - 1} R_k$
  - $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \gamma^t G \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$
We note that $\frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta})}{\pi(A_t \mid S_t, \boldsymbol{\theta})} = \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$, sometimes called the eligibility vector.
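Below is a minimal sketch of the boxed algorithm, assuming a tabular softmax policy (a special case of the softmax parameterization discussed next, for which the eligibility vector has the simple closed form used in the code) and a Gymnasium-style environment with discrete states and actions; all function and variable names here are illustrative rather than canonical.

```python
import numpy as np

def softmax(h):
    h = h - h.max()                                   # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def reinforce(env, num_states, num_actions, alpha=0.01, gamma=0.99, num_episodes=1000):
    """REINFORCE with a tabular softmax policy: theta is a (num_states, num_actions) table."""
    theta = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        # Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, _ = env.reset()
        done = False
        while not done:
            a = np.random.choice(num_actions, p=softmax(theta[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s, done = s_next, terminated or truncated
        # Compute returns G_t = sum_{k=t+1}^T gamma^{k-t-1} R_k by backward recursion
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # Policy gradient updates, one per time step
        for t, (s_t, a_t, G_t) in enumerate(zip(states, actions, returns)):
            pi = softmax(theta[s_t])
            grad_ln_pi = -pi                          # grad ln pi(A_t | S_t, theta) for tabular softmax:
            grad_ln_pi[a_t] += 1.0                    # ... component b equals 1[b = A_t] - pi(b | S_t)
            theta[s_t] += alpha * (gamma ** t) * G_t * grad_ln_pi
    return theta
```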
Note: Softmax Policy
In policy gradient methods, the policy can be parameterized in any way, as long as $\nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta})$ exists and is finite $\forall s, a, \boldsymbol{\theta}$.
The policy can be defined for example according to an exponential softmax distribution, $$ \pi(a \mid s, \boldsymbol{\theta}) \coloneqq \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_b e^{h(s, b, \boldsymbol{\theta})}}, $$ where $h(s, a, \boldsymbol{\theta}) \in \mathbb{R}$ is a parameterized preference function.
Further, the gradient of the log probability is, $$ \nabla_{\boldsymbol{\theta}} \ln \pi(a \mid s, \boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} h(s, a, \boldsymbol{\theta}) - \sum_b \pi(b \mid s, \boldsymbol{\theta}) \nabla_{\boldsymbol{\theta}} h(s, b, \boldsymbol{\theta}). $$ Action preferences ($h$ functions) can be parameterized arbitrarily. For example, they might be computed by a deep neural network, where $\boldsymbol{\theta}$ is the vector of all the connection weights of the network.
They can also be linear, using feature vectors $\mathbf{x}(s, a) \in \mathbb{R}^{d^{\prime}}$, $$ h(s, a, \boldsymbol{\theta}) \coloneqq \boldsymbol{\theta}^T \mathbf{x}(s, a). $$
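For the linear-preference case, the policy and its eligibility vector can be computed as in the minimal sketch below, where `x` is a hypothetical `(num_actions, d')` matrix holding the feature vectors $\mathbf{x}(s, a)$ of the current state.

```python
import numpy as np

def linear_softmax_pi(theta, x):
    """pi(a | s, theta) for linear preferences h(s, a, theta) = theta^T x(s, a).
    x has shape (num_actions, d): one feature vector x(s, a) per action in state s."""
    h = x @ theta
    h = h - h.max()              # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def eligibility_vector(theta, x, a):
    """grad_theta ln pi(a | s, theta) = x(s, a) - sum_b pi(b | s, theta) x(s, b)."""
    pi = linear_softmax_pi(theta, x)
    return x[a] - pi @ x
```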
Intuition: REINFORCE with Baseline
The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$, $$ \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a (q_{\pi}(s, a) - b(s)) \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta}). $$ The baseline can be any function, even a random variable, as long as it does not vary with $a$. The equation remains valid because the subtracted quantity is zero, $$ \begin{align*} \sum_a b(s) \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta}) & = b(s) \nabla_{\boldsymbol{\theta}} \sum_a \pi(a \mid s, \boldsymbol{\theta}) \newline & = b(s) \nabla_{\boldsymbol{\theta}} 1 \newline & = 0. \newline \end{align*} $$
Definition: Update Rule for REINFORCE with Baseline
Thus, a new version of the REINFORCE algorithm, with baseline $b(s)$, has the update, $$ \boldsymbol{\theta}_{t + 1} \coloneqq \boldsymbol{\theta}_t + \alpha (G_t - b(S_t)) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)}. $$ The baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance.
A good choice of $b(S_t)$ is an estimate of the state value $\hat{v}(S_t, \mathbf{w})$, i.e., use a Monte-Carlo method to learn the state-value weights $\mathbf{w}$.
Algorithm: REINFORCE with Baseline: Monte-Carlo Policy Gradient
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Input: a differentiable state-value function parameterization $\hat{v}(s, \mathbf{w})$
Algorithm parameters: step-sizes $\alpha_{\boldsymbol{\theta}} > 0$, $\alpha_{\mathbf{w}} > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d^{\prime}}$ and state-value weights $\mathbf{w} \in \mathbb{R}^d$ (e.g., to $\mathbf{0}$)
Loop forever (for each episode):
- Generate an episode $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T - 1}, A_{T - 1}, R_T$, following $\pi(\cdot \mid \cdot, \boldsymbol{\theta})$
- Loop for each step of the episode, $t = 0, 1, \ldots, T - 1$:
  - $G \leftarrow \sum_{k = t + 1}^T \gamma^{k - t - 1} R_k$
  - $\delta \leftarrow G - \hat{v}(S_t, \mathbf{w})$
  - $\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$
  - $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_{\boldsymbol{\theta}} \gamma^t \delta \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$
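A minimal sketch of the per-episode updates of this algorithm, again assuming a tabular softmax policy and a tabular state-value baseline; the `episode` argument is a hypothetical list of $(S_t, A_t, R_{t+1})$ tuples generated elsewhere by following $\pi(\cdot \mid \cdot, \boldsymbol{\theta})$.

```python
import numpy as np

def softmax(h):
    h = h - h.max()                               # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def reinforce_with_baseline_episode(theta, w, episode, alpha_theta, alpha_w, gamma):
    """Apply one episode's updates. `episode` is a list of (S_t, A_t, R_{t+1}) tuples;
    theta is a (num_states, num_actions) table, w a (num_states,) table of state values."""
    T = len(episode)
    returns, G = [0.0] * T, 0.0
    for t in reversed(range(T)):                  # G_t = sum_{k=t+1}^T gamma^{k-t-1} R_k
        G = episode[t][2] + gamma * G
        returns[t] = G
    for t, (s, a, _) in enumerate(episode):
        delta = returns[t] - w[s]                 # delta <- G - v_hat(S_t, w)
        w[s] += alpha_w * delta                   # grad_w v_hat(S_t, w) is 1 for S_t, 0 elsewhere
        pi = softmax(theta[s])
        grad_ln_pi = -pi
        grad_ln_pi[a] += 1.0                      # grad ln pi(A_t | S_t, theta) for tabular softmax
        theta[s] += alpha_theta * (gamma ** t) * delta * grad_ln_pi
    return theta, w
```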
Actor-Critic Methods
Intuition: Actor-Critic Methods
A way to combine value-based and policy-based methods is to use an actor-critic architecture.
The actor learns the policy and the critic learns the value function. The overall policy-gradient method with value-function estimation is called an actor-critic method.
We consider one-step actor-critic methods, the analog of one-step TD methods such as TD(0), SARSA, and Q-learning.
One-step actor-critic methods replace the full return of the REINFORCE algorithm by a one-step return (and use a learned state-value function as a baseline).
Definition: One-step Actor-Critic
The one-step actor-critic update is, $$ \begin{align*} \boldsymbol{\theta}_{t + 1} & \coloneqq \boldsymbol{\theta}_t + \alpha \left(G_{t:t+1} - \hat{v}(S_t, \mathbf{w}_t)\right) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)} \newline & = \boldsymbol{\theta}_t + \alpha \left(R_{t + 1} + \gamma \hat{v}(S_{t + 1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\right) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)} \newline & = \boldsymbol{\theta}_t + \alpha \delta_t \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)}, \newline \end{align*} $$ where $\delta_t \coloneqq R_{t + 1} + \gamma \hat{v}(S_{t + 1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$ is the one-step TD error. We note that this update is fully online and incremental.
The critic uses TD errors $\delta$ to criticize the actor’s action choices.
Positive $\delta$ means that the action was “good” because it led to a state with a better-than-expected value. Negative $\delta$ means that the action was “bad” because it led to a state with a worse-than-expected value.
Algorithm: One-step Actor-Critic (episodic)
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Input: a differentiable state-value function parameterization $\hat{v}(s, \mathbf{w})$
Parameters: step-sizes $\alpha_{\boldsymbol{\theta}} > 0$, $\alpha_{\mathbf{w}} > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d^{\prime}}$ and state-value weights $\mathbf{w} \in \mathbb{R}^d$ (e.g., to $\mathbf{0}$)
Loop forever (for each episode):
- Initialize $S$ (first state of episode)
- $I \leftarrow 1$
- Loop while $S$ is not terminal (for each time step):
  - $A \sim \pi(\cdot \mid S, \boldsymbol{\theta})$
  - Take action $A$, observe $S^{\prime}, R$
  - $\delta \leftarrow R + \gamma \hat{v}(S^{\prime}, \mathbf{w}) - \hat{v}(S, \mathbf{w})$ (if $S^{\prime}$ is terminal, then $\hat{v}(S^{\prime}, \mathbf{w}) = 0$)
  - $\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$
  - $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_{\boldsymbol{\theta}} I \delta \nabla_{\boldsymbol{\theta}} \ln \pi(A \mid S, \boldsymbol{\theta})$
  - $I \leftarrow \gamma I$
  - $S \leftarrow S^{\prime}$
The generalization to $n$-step methods is straightforward, using the $n$-step return $G_{t:t+n}$ instead of the one-step return $G_{t:t+1}$.
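A minimal sketch of the boxed one-step (episodic) actor-critic, under the same illustrative assumptions as the earlier sketches (tabular softmax actor, tabular state-value critic, Gymnasium-style environment):

```python
import numpy as np

def softmax(h):
    h = h - h.max()                               # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def one_step_actor_critic(env, num_states, num_actions,
                          alpha_theta=0.01, alpha_w=0.1, gamma=0.99, num_episodes=1000):
    """One-step actor-critic with a tabular softmax actor and a tabular state-value critic."""
    theta = np.zeros((num_states, num_actions))   # actor parameters
    w = np.zeros(num_states)                      # critic weights
    for _ in range(num_episodes):
        s, _ = env.reset()
        I, done = 1.0, False
        while not done:
            pi = softmax(theta[s])
            a = np.random.choice(num_actions, p=pi)
            s_next, r, terminated, truncated, _ = env.step(a)
            v_next = 0.0 if terminated else w[s_next]      # v_hat(S', w) = 0 if S' is terminal
            delta = r + gamma * v_next - w[s]              # one-step TD error
            w[s] += alpha_w * delta                        # critic update
            grad_ln_pi = -pi
            grad_ln_pi[a] += 1.0                           # grad ln pi(A | S, theta), tabular softmax
            theta[s] += alpha_theta * I * delta * grad_ln_pi
            I *= gamma
            s, done = s_next, terminated or truncated
    return theta, w
```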