Introduction to Dynamic Programming
Dynamic programming (DP) is a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment (the dynamics $\mathcal{p}$) as a Markov Decision Process (MDP).
We will only consider finite MDPs, i.e., environments where the sets of states, actions, and rewards ($\mathcal{S}, \mathcal{A}, \mathcal{R}$) all have a finite number of elements and the dynamics are given by a set of probabilities $\mathcal{p}(s^{\prime}, r | s, a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s), r \in \mathcal{R}, s^{\prime} \in \mathcal{S}^+$.
Policy Evaluation
Firstly, recall that we can easily obtain optimal policies once we have found the optimal value functions, $v_{\star}$ or $q_{\star}$, which satisfy the Bellman optimality equations,
Theorem: Bellman Optimality Equations for $v_{\star}$ and $q_{\star}$
$$ \begin{align*} v_{\star}(s) & = \max_a \mathbb{E}[R_{t+1} + \gamma v_{\star}(S_{t+1}) | S_t = s, A_t = a] \newline & = \max_a \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\star}(s^{\prime})], \newline \newline q_{\star}(s, a) & = \mathbb{E}[R_{t+1} + \gamma \max_{a^{\prime}} q_{\star}(S_{t+1}, a^{\prime}) | S_t = s, A_t = a] \newline & = \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \max_{a^{\prime}} q_{\star}(s^{\prime}, a^{\prime})], \newline \end{align*} $$
Now consider that we compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$, $$ \begin{align*} v_{\pi}(s) & = \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi, \mathcal{p}}[R_{t+1} + \gamma G_{t+1} | S_t = s] \newline & = \mathbb{E}_{\pi, \mathcal{p}}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \newline & = \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})] \end{align*} $$
Note
With completely known environment dynamics $\mathcal{p}$, the above equation is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns, the values $v_{\pi}(s)$ for $s \in \mathcal{S}$. Thus, it can be solved directly by standard methods for solving linear equations (e.g., Gaussian elimination). However, this is often computationally expensive and impractical for large state spaces.
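Before turning to iterative methods, here is a minimal sketch of the direct solution, assuming a hypothetical tabular representation in which the policy and dynamics have already been marginalized into a transition matrix `P_pi` and an expected-reward vector `r_pi` (these names and this representation are illustrative, not from the text):

```python
import numpy as np

def evaluate_policy_exactly(P_pi, r_pi, gamma):
    """Solve the Bellman equation v = r_pi + gamma * P_pi @ v for v_pi.

    P_pi[s, s2] : probability of moving from s to s2 under pi,
                  i.e., sum_a pi(a | s) * sum_r p(s2, r | s, a).
    r_pi[s]     : expected one-step reward from s under pi.
    """
    n = P_pi.shape[0]
    # (I - gamma * P_pi) v = r_pi is a system of |S| linear equations.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```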
Iterative Policy Evaluation
Thus, we instead use iterative methods that are more computationally efficient.
Definition: Iterative Policy Evaluation
Initially approximate the state values arbitrarily, e.g., $v_0(s) = 0$ for all $s \in \mathcal{S}$.
Then, update the approximations by using the Bellman equation for $v_{\pi}$, $$ \begin{align*} v_{k + 1}(s) & \coloneqq \mathbb{E}_{\pi, \mathcal{p}}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s] \newline & = \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_k(s^{\prime})] \end{align*} $$
Note
The sequence $\{v_k\}$ can be shown to converge to $v_{\pi}$ as $k \to \infty$, under the same conditions that guarantee the existence of $v_{\pi}$ (either $\gamma < 1$ or eventual termination from all states under $\pi$).
Algorithm: Iterative Policy Evaluation
Input $\pi$, the policy to be evaluated.
Algorithm parameter: a small threshold $\theta > 0$ determining accuracy of estimation.
Initialize $V(s)$, for all $s \in \mathcal{S}^+$, arbitrarily, except that $V(\mathrm{terminal}) = 0$.
Loop:
- $\Delta \leftarrow 0$
- Loop for each $s \in \mathcal{S}$,
- $v \leftarrow V(s)$
- $V(s) \leftarrow \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$
- $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
- Until $\Delta < \theta$
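A minimal Python sketch of this algorithm, assuming a hypothetical tabular model in which `p[s][a]` is a list of `(prob, next_state, reward)` triples and `pi[s][a]` gives $\pi(a | s)$ (these names and the model format are illustrative assumptions; terminal states can be modeled as absorbing states with zero reward so their value stays 0):

```python
def iterative_policy_evaluation(p, pi, gamma=0.9, theta=1e-8):
    """Iteratively approximate v_pi for a given policy pi."""
    n_states = len(p)
    V = [0.0] * n_states  # v_0(s) = 0 for all s
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Expected update over actions and (s', r) outcomes.
            V[s] = sum(
                pi[s][a] * sum(prob * (r + gamma * V[s2])
                               for prob, s2, r in p[s][a])
                for a in range(len(p[s]))
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```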
Policy Improvement
But how do we use our newly computed value function $v_{\pi}$ to find a better policy?
Consider a one-step lookahead, i.e., we select $a \in \mathcal{A}(s)$ that maximizes the expected return by taking action $a$ in state $s$ and thereafter following policy $\pi$.
The value of this one-step lookahead is,
$$ \begin{align*} q_{\pi}(s, a) & = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a] \newline & = \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})] \newline \end{align*} $$
If $q_{\pi}(s, a)$ is greater than $v_{\pi}(s)$, then we may expect that taking action $a$ in state $s$ and thereafter following policy $\pi$ will yield a greater return than simply following policy $\pi$ from the start, thus giving an overall better policy.
This is a special case of a general result called the policy improvement theorem.
Policy Improvement Theorem
Theorem: Policy Improvement Theorem
Let $\pi$ and $\pi^{\prime}$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$, $$ q_{\pi}(s, \pi^{\prime}(s)) \geq v_{\pi}(s) $$ Then, the policy $\pi^{\prime}$ must be as good as or better than policy $\pi$, i.e., it must obtain greater or equal return from all states $s \in \mathcal{S}$, $$ v_{\pi^{\prime}}(s) \geq v_{\pi}(s) $$
We will now prove this theorem.
Proof: Policy Improvement Theorem
$$ \begin{align*} v_{\pi}(s) & \leq q_{\pi}(s, \pi^{\prime}(s)) \newline & = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = \pi^{\prime}(s)] \newline & = \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \newline & \leq \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi^{\prime}(S_{t+1})) | S_t = s] \newline & = \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma \mathbb{E}_{\pi^{\prime}}[R_{t+2} + \gamma v_{\pi}(S_{t+2}) | S_{t+1}] | S_t = s] \newline & = \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s] \newline & \leq \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_{\pi}(S_{t+3}) | S_t = s] \newline & \leq \ldots \newline & = v_{\pi^{\prime}}(s) \newline \end{align*} $$
Now, we can consider changes at all states, selecting at each state the action that appears best according to $q_{\pi}(s, a)$, i.e., the greedy policy with respect to $q_{\pi}$,
$$ \begin{align*} \pi^{\prime}(s) & \coloneqq \underset{a \in \mathcal{A}(s)}{\arg\max} \ q_{\pi}(s, a) \newline & = \underset{a \in \mathcal{A}(s)}{\arg\max} \ \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a] \newline & = \underset{a \in \mathcal{A}(s)}{\arg\max} \ \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})] \newline \end{align*} $$
By construction, the greedy policy meets the conditions of the policy improvement theorem, thus we can conclude that the greedy policy must be as good as or better than the original policy $\pi$.
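A minimal sketch of this greedy improvement step, reusing the hypothetical model format `p[s][a] = [(prob, next_state, reward), ...]` assumed in the policy evaluation sketch above:

```python
def greedy_policy(p, V, gamma=0.9):
    """Return the deterministic greedy policy pi'(s) = argmax_a q_pi(s, a)."""
    def q(s, a):
        # One-step lookahead: q_pi(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])

    return [max(range(len(p[s])), key=lambda a: q(s, a)) for s in range(len(p))]
```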
Suppose that the new greedy policy $\pi^{\prime}$ is as good as, but not better than, the old policy $\pi$. Then $v_{\pi^{\prime}}(s) = v_{\pi}(s)$ for all $s \in \mathcal{S}$, and since $\pi^{\prime}$ is greedy with respect to $v_{\pi}$, $$ \begin{align*} v_{\pi^{\prime}}(s) & = \underset{a}{\max} \ \mathbb{E}[R_{t+1} + \gamma v_{\pi^{\prime}}(S_{t+1}) | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi^{\prime}}(s^{\prime})] \newline \end{align*} $$
This is the same as the Bellman optimality equation for $v_{\star}$, thus $v_{\pi} = v_{\pi^{\prime}} = v_{\star}$, and both $\pi$ and $\pi^{\prime}$ are optimal policies.
Policy Iteration
We can now combine policy evaluation and policy improvement to form the policy iteration algorithm.
Once a policy $\pi$ has been improved using $v_{\pi}$ to yield a better policy $\pi^{\prime}$, we can then compute $v_{\pi^{\prime}}$ and improve it again to yield an even better policy $\pi^{\prime\prime}$, and so on.
Each policy is guaranteed to be a strict improvement over the previous one, unless the policy is already optimal, in which case the policy improvement step will yield the same policy again.
Because we assume a finite MDP, only a finite number of policies exist, thus this process must eventually terminate with an optimal policy.
Algorithm: Policy Iteration
1. Initialization
    - $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
2. Policy Evaluation
    - Loop:
        - $\Delta \leftarrow 0$
        - Loop for each $s \in \mathcal{S}$,
            - $v \leftarrow V(s)$
            - $V(s) \leftarrow \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, \pi(s)) [r + \gamma V(s^{\prime})]$
            - $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    - Until $\Delta < \theta$ (a small positive number determining accuracy of estimation)
3. Policy Improvement
    - policy-stable $\leftarrow$ true
    - For each $s \in \mathcal{S}$,
        - old-action $\leftarrow \pi(s)$
        - $\pi(s) \leftarrow \underset{a}{\arg\max} \ \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$
        - If old-action $\neq \pi(s)$, then policy-stable $\leftarrow$ false
    - If policy-stable, then stop and return $V \approx v_{\star}$ and $\pi \approx \pi_{\star}$; else go to step 2 (Policy Evaluation)
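Putting the two steps together, a minimal policy iteration sketch might look as follows; it reuses `iterative_policy_evaluation` and `greedy_policy` from the earlier sketches, and the model format `p[s][a] = [(prob, next_state, reward), ...]` remains an assumption:

```python
def policy_iteration(p, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until stable."""
    n_states = len(p)
    policy = [0] * n_states  # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: view the deterministic policy as pi(a | s).
        pi = [[1.0 if a == policy[s] else 0.0 for a in range(len(p[s]))]
              for s in range(n_states)]
        V = iterative_policy_evaluation(p, pi, gamma, theta)
        # Policy improvement: act greedily with respect to V.
        new_policy = greedy_policy(p, V, gamma)
        if new_policy == policy:  # policy-stable
            return V, policy
        policy = new_policy
```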
Value Iteration
The policy evaluation step in policy iteration is an iterative procedure that can be computationally expensive.
We can instead truncate the policy evaluation step, i.e., stop it before it has fully converged to $v_{\pi}$.
Note that we can truncate in several ways without losing the guarantee of convergence to the optimal policy.
One extreme case is value iteration: stop the policy evaluation after just one sweep (one update of each state), and then immediately perform the policy improvement step.
We can further combine the truncated policy evaluation step and the policy improvement step into a single update, $$ \begin{align*} v_{k + 1}(s) & \coloneqq \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s, A_t = a] \newline & = \max_a \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_k(s^{\prime})] \newline \end{align*} $$
Or in other words, we turn the Bellman optimality equation for $v_{\star}$ into an update rule.
Algorithm: Value Iteration
Algorithm parameter: a small threshold $\theta > 0$ determining accuracy of estimation.
Initialize $V(s)$, for all $s \in \mathcal{S}^+$, arbitrarily, except that $V(\mathrm{terminal}) = 0$.
Loop:
- $\Delta \leftarrow 0$
- Loop for each $s \in \mathcal{S}$,
- $v \leftarrow V(s)$
- $V(s) \leftarrow \max_a \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$
- $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
- Until $\Delta < \theta$
Output a deterministic policy $\pi \approx \pi_{\star}$ such that, $$ \pi(s) = \underset{a}{\arg\max} \ \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})] $$
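A minimal value iteration sketch under the same assumed model format; the final policy extraction reuses `greedy_policy` from the policy improvement sketch above:

```python
def value_iteration(p, gamma=0.9, theta=1e-8):
    """Turn the Bellman optimality equation into an update rule."""
    n_states = len(p)
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Max over actions of the one-step lookahead (updated in place).
            V[s] = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
                for a in range(len(p[s]))
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V, greedy_policy(p, V, gamma)
```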
Variants
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. However, faster convergence is often obtained by interposing multiple policy evaluation sweeps between each policy improvement sweep.
In general, the entire class of truncated policy iteration algorithms can be thought of as a sequence of sweeps, some of which use policy evaluation updates and some of which use value iteration updates. Note that, because the only difference between the two types of updates is whether we use a $\max$, we can even intermix the two types of updates within a single sweep.
Note
All of these algorithms converge to an optimal policy for discounted, finite MDPs.
Asynchronous DP
Performing sweeps over the entire state set can be computationally expensive.
Thus, asynchronous DP algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be updated several times before the values of others are updated even once!
To converge correctly, however, an asynchronous algorithm must continue to update the values of all states; it cannot ignore any state after some point in the computation.
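As an illustration, here is a minimal sketch of an asynchronous, in-place variant of the value iteration update; the fixed update budget and random state selection are purely illustrative assumptions (any selection scheme works, as long as every state continues to be updated):

```python
import random

def asynchronous_value_updates(p, gamma=0.9, n_updates=100_000, seed=0):
    """Update one state at a time, in arbitrary order, in place."""
    rng = random.Random(seed)
    V = [0.0] * len(p)
    for _ in range(n_updates):
        s = rng.randrange(len(p))  # pick any state; the order is unconstrained
        # Use whatever values of the other states are currently available.
        V[s] = max(
            sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
            for a in range(len(p[s]))
        )
    return V
```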
Generalized Policy Iteration (GPI)
Policy iteration consists of two simultaneous, interacting processes:
- One making the value function consistent with the current policy (policy evaluation), and,
- The other making the policy greedy with respect to the current value function (policy improvement).
Value iteration, truncated policy iteration, and asynchronous DP are all variants of this general scheme, differing only in the granularity and ordering of the two processes.
As long as both processes continue to update all states, the ultimate result is typically the same: convergence to the optimal value function and an optimal policy.
Efficiency of Dynamic Programming
DP may not be practical for very large problems, but compared to other methods for solving MDPs, DP methods are actually quite efficient.
The time that DP methods take to find an optimal policy is polynomial in the number of states and actions, even though the total number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$.
However, as always, we face the curse of dimensionality: the number of states often grows exponentially with the number of state variables. This can limit the applicability of DP, but it is an inherent difficulty of the problem itself, not of the solution method.
On problems with large state spaces, asynchronous DP methods (and other variations of GPI) are often preferred. To complete even one sweep of a synchronous method requires computation and memory for every state, which may be infeasible.