Introduction
In the previous part, we saw how dynamic programming (DP) methods can compute optimal policies given a perfect model of the environment (the dynamics $p$) formulated as a Markov Decision Process (MDP). However, in many real-world scenarios we do not have access to a complete model of the environment. Instead, we often have to learn from experience, which is where Monte Carlo (MC) methods come into play.
MC methods learn from experience: they sample sequences of states, actions, and rewards from actual or simulated interaction with the environment. The fundamental idea behind MC methods is to estimate the value of a state (or state-action pair) as the average return observed after visiting that state (or state-action pair).
Learning from (actual or simulated) experience requires no prior knowledge of the dynamics; nevertheless, MC methods can still attain optimal policies (we will see how later).
Monte Carlo Methods
We will define MC methods in the context of episodic tasks, where each episode ends in a terminal state. Further, MC methods sample and average the returns of each state(-action pair) to estimate its value.
We will adapt the idea of generalized policy iteration (GPI) to MC methods.
Note
With DP we compute value functions from knowledge of the MDP (the dynamics $p$).
With MC methods we learn value functions from sample returns without knowledge of the MDP.
Prediction
Our goal with MC methods is to learn the state-value function $v_{\pi}$ for a given policy $\pi$. More specifically, we estimate it from experience, i.e., we average the returns observed after visits to each state $s \in \mathcal{S}$. As more returns are observed, the average should converge to the expected return, which is $v_{\pi}(s)$.
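Concretely, if $G_1(s), G_2(s), \ldots, G_n(s)$ denote the returns observed following the (counted) visits to $s$, the MC estimate is simply the sample mean, $$ V_n(s) \coloneqq \frac{1}{n} \sum_{i = 1}^{n} G_i(s) \longrightarrow \mathbb{E}_{\pi}[G_t \mid S_t = s] = v_{\pi}(s) \quad \text{as } n \to \infty. $$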
We will cover two types of MC methods for policy evaluation: first-visit MC and every-visit MC.
First-Visit MC and Every-Visit MC
The first-visit MC method estimates $v_{\pi}(s)$ as the average of the returns following the first visit to state $s$ in each episode.
Algorithm: First-Visit MC Prediction
Input: a policy $\pi$ to be evaluated.
Initialize:
- $V(s)$, arbitrarily for all $s \in \mathcal{S}$
- $\mathrm{Returns}(s) \leftarrow []$, an empty list for all $s \in \mathcal{S}$
Loop forever (for each episode):
- Generate an episode following $\pi$: $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T - 1, T - 2, \ldots, 0$:
- $G \leftarrow \gamma G + R_{t + 1}$
- Unless $S_t$ appears in $S_0, \ldots, S_{t - 1}$:
- Append $G$ to $\mathrm{Returns}(S_t)$
- $V(S_t) \leftarrow$ average of $\mathrm{Returns}(S_t)$
For every-visit MC, we simply remove the condition that the return is only appended for the first visit to state $s$ in each episode.
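As a rough illustration, here is a minimal Python sketch of this procedure. It assumes a hypothetical `generate_episode(policy)` helper that returns one episode as a list of `(state, action, reward)` tuples (with the reward being $R_{t+1}$ for step $t$); setting `first_visit=False` gives the every-visit variant.

```python
from collections import defaultdict

def mc_prediction(generate_episode, policy, num_episodes, gamma=1.0, first_visit=True):
    """MC prediction: estimate V(s) by averaging sampled returns.

    `generate_episode(policy)` is an assumed helper that returns a list of
    (state, action, reward) tuples, where reward is R_{t+1} for step t.
    """
    returns = defaultdict(list)   # Returns(s): list of observed returns
    V = defaultdict(float)        # V(s): current estimate

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Work backwards through the episode, accumulating the return.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit: only record G if the state does not occur earlier in the episode.
            if first_visit and state in states[:t]:
                continue
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V
```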
We will not formally prove the convergence of these methods, but for first-visit MC the idea is the following.
Note
Each return is an independent, identically distributed (i.i.d.) estimate of $v_{\pi}(s)$ with finite variance (since we have finite returns in an episodic task).
Thus, by the law of large numbers, the average of these returns converges to the expected value $v_{\pi}(s)$ as the number of returns (i.e., visits to state $s$) approaches infinity.
Action Values
We stated that one of the strengths of MC methods is that they do not require a model of the dynamics $p$.
This makes it particularly useful to estimate action values $q_{\pi}(s, a)$ rather than state values $v_{\pi}(s)$, since a policy can be derived directly from action values (by acting greedily), whereas deriving it from state values alone would require a model for a one-step lookahead.
Essentially everything is the same as before, but we now speak of visits to a state-action pair.
A state-action pair $(s, a)$ is said to be visited in an episode if ever the state $s$ is visited and action $a$ is taken in it.
However, we now face a new problem: many state-action pairs may never be visited (why?), and so we cannot estimate their values.
Solution
It is possible that, under a certain policy $\pi$, some state-action pairs are never visited, simply because $\pi$ never selects those actions in those states.
With no returns to average, the MC estimates of those actions will not improve with experience, and thus we cannot use them to improve the policy.
The solution: exploration!
Control
One way to solve this is to specify that episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start.
This guarantees that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes.
We call this the assumption of exploring starts.
The overall idea is to use the generalized policy iteration (GPI) we have seen in the case of DP.
We will maintain both an approximate policy and an approximate value function (or action-value function), and we will use each to improve the other.
Monte Carlo Control with Exploring Starts
As in the case of DP, we improve the policy by making it greedy with respect to the current action-value function estimate.
Definition: Policy Improvement for MC Control
For each $s \in \mathcal{S}$ and for any action-value function $q$, we deterministically choose an action with maximum action value, $$ \pi(s) \coloneqq \underset{a \in \mathcal{A}(s)}{\arg\max} \ q(s, a) $$
Proof: Policy Improvement for MC Control
The policy improvement theorem then applies to $\pi_k$ and $\pi_{k + 1}$ (the greedy policy with respect to $q_{\pi_k}$), because, $$ \begin{align*} q_{\pi_k}(s, \pi_{k + 1}(s)) & = q_{\pi_k}\left(s, \underset{a \in \mathcal{A}(s)}{\arg\max} \ q_{\pi_k}(s, a)\right) \newline & = \max_{a \in \mathcal{A}(s)} q_{\pi_k}(s, a) \newline & \geq q_{\pi_k}(s, \pi_k(s)) \newline & \geq v_{\pi_k}(s) \end{align*} $$ Thus, we are assured that each $\pi_{k + 1}$ is uniformly better than $\pi_k$, or just as good as $\pi_k$, in which case they are both optimal policies (by the definition of optimal policies).
Thus, the overall process converges to the optimal policy and optimal (action-)value function.
However, we have made two unlikely assumptions to obtain the MC convergence result.
- The episodes assume exploring starts, i.e., each state-action pair has a nonzero probability of being selected as the start of an episode.
- Policy evaluation could be done with an infinite number of episodes (we saw this problem with DP as well).
Intuition
One idea is to give up on completing policy evaluation before returning to policy improvement.
As we have seen previously, an extreme case is value iteration, where only a single iteration of policy evaluation is performed between each step of policy improvement.
Algorithm: Monte Carlo Control with Exploring Starts
Initialize:
- $\pi(s) \in \mathcal{A}(s)$, arbitrarily for all $s \in \mathcal{S}$
- $Q(s, a)$, arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- $\mathrm{Returns}(s, a) \leftarrow []$, an empty list for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
Loop forever (for each episode):
- Choose $S_0 \in \mathcal{S}, A_0 \in \mathcal{A}(S_0)$ randomly such that all pairs have probability $> 0$ (exploring starts)
- Generate an episode from $S_0, A_0$ following $\pi$: $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T - 1, T - 2, \ldots, 0$:
- $G \leftarrow \gamma G + R_{t + 1}$
- Unless $(S_t, A_t)$ appears in $(S_0, A_0), \ldots, (S_{t - 1}, A_{t - 1})$:
- Append $G$ to $\mathrm{Returns}(S_t, A_t)$
- $Q(S_t, A_t) \leftarrow \mathrm{average}(\mathrm{Returns}(S_t, A_t))$
- $\pi(S_t) \leftarrow \underset{a \in \mathcal{A}(S_t)}{\arg\max} Q(S_t, a)$
Again, if we remove the condition that the return is only appended for the first visit to the state-action pair in each episode, we get the every-visit version of this algorithm.
Further, note that the last improvement step is in the spirit of value iteration: we update the policy at every step of every episode rather than waiting for policy evaluation to complete.
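Below is a hedged Python sketch of the algorithm above. The environment interface is assumed, not prescribed: `states` is a list of states, `actions(s)` returns the available actions, and `generate_episode(s0, a0, policy)` is a hypothetical helper producing one episode as `(state, action, reward)` tuples starting from the given state-action pair and then following `policy`.

```python
import random
from collections import defaultdict

def mc_control_es(states, actions, generate_episode, num_episodes, gamma=1.0):
    """MC control with exploring starts (first-visit version).

    Assumes `generate_episode(s0, a0, policy)` returns a list of
    (state, action, reward) tuples starting with (s0, a0, R_1).
    """
    Q = defaultdict(float)                                 # Q(s, a) estimates
    returns = defaultdict(list)                            # Returns(s, a)
    pi = {s: random.choice(actions(s)) for s in states}    # arbitrary deterministic policy

    for _ in range(num_episodes):
        # Exploring starts: every (s, a) pair has a nonzero start probability.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode(s0, a0, lambda s: pi[s])
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) in pairs[:t]:            # first-visit check
                continue
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            # Greedy policy improvement at this state.
            pi[s] = max(actions(s), key=lambda act: Q[(s, act)])
    return pi, Q
```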
On-Policy vs Off-Policy
How can we avoid the unlikely assumption of exploring starts?
We will now introduce two approaches to ensure all actions are selected infinitely often.
- On-policy: Evaluate or improve the policy that is used to make decisions.
- Off-policy: Evaluate or improve a policy different from that used to generate the data/experience.
In on-policy methods, we say that our policy is soft.
Definition: Soft Policy
A policy $\pi$ is soft if $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. In on-policy control, the policy starts soft but is gradually shifted closer to a deterministic optimal policy.
Here, we use $\epsilon$-greedy policies, where with probability $1 - \epsilon$ we select the greedy action (i.e., maximal estimated action value), and with probability $\epsilon$ we select an action uniformly at random from the set of available actions.
Exercise
What is the probability given to selecting the greedy action?
Solution
We understand that with probability $1 - \epsilon$ we select the greedy action, but we also need to consider the probability of selecting the greedy action when we select an action uniformly at random, thus, $$ \begin{align*} P(\text{greedy action}) & = P(\text{greedy action} | \text{greedy}) P(\text{greedy}) + P(\text{greedy action} | \text{random}) P(\text{random}) \newline & = (1 - \epsilon) + \frac{\epsilon}{|\mathcal{A}(s)|} \newline & = 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} \end{align*} $$
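As a small illustration of this selection rule (and of the probability just derived), here is a minimal Python sketch of $\epsilon$-greedy action selection over an assumed action-value dictionary `Q` keyed by `(state, action)`; the function name is purely illustrative.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick the greedy action w.p. 1 - eps, otherwise a uniformly random action.

    Overall, the greedy action ends up chosen with probability
    1 - eps + eps / |A(s)|, as derived above.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```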
Algorithm: On-Policy First-Visit MC Control
Algorithm parameter: small $\epsilon > 0$
Initialize:
- $\pi \leftarrow$ an arbitrary $\epsilon$-soft policy
- $Q(s, a) \in \mathbb{R}$, arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- $\mathrm{Returns}(s, a) \leftarrow []$, an empty list for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
Loop forever (for each episode):
- Generate an episode following $\pi$: $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T - 1, T - 2, \ldots, 0$:
- $G \leftarrow \gamma G + R_{t + 1}$
- Unless $(S_t, A_t)$ appears in $(S_0, A_0), \ldots, (S_{t - 1}, A_{t - 1})$:
- Append $G$ to $\mathrm{Returns}(S_t, A_t)$
- $Q(S_t, A_t) \leftarrow \mathrm{average}(\mathrm{Returns}(S_t, A_t))$
- $A^{\star} \leftarrow \underset{a \in \mathcal{A}(S_t)}{\arg\max} Q(S_t, a)$
- For all $a \in \mathcal{A}(S_t)$:
- $$\pi(a|S_t) \leftarrow \begin{cases} 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(S_t)|}, & \text{if } a = A^{\star} \newline \frac{\epsilon}{|\mathcal{A}(S_t)|}, & \text{otherwise}\end{cases}$$
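A hedged Python sketch of this algorithm follows, under the same assumed interface as before (`states`, `actions(s)`, and a hypothetical `generate_episode(policy)` that samples actions from the callable `policy(s)`). The $\epsilon$-soft policy is kept implicitly as the current greedy action plus $\epsilon$-greedy sampling, which is equivalent to the explicit $\pi(a|S_t)$ update in the pseudocode.

```python
import random
from collections import defaultdict

def on_policy_mc_control(states, actions, generate_episode, num_episodes,
                         gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an eps-greedy (hence eps-soft) policy.

    `generate_episode(policy)` is an assumed helper returning a list of
    (state, action, reward) tuples produced by sampling actions from `policy(s)`.
    """
    Q = defaultdict(float)
    returns = defaultdict(list)
    greedy = {s: random.choice(actions(s)) for s in states}  # current greedy choices

    def policy(s):
        # eps-greedy sampling: greedy action has probability 1 - eps + eps/|A(s)|.
        if random.random() < epsilon:
            return random.choice(actions(s))
        return greedy[s]

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) in pairs[:t]:            # first-visit check
                continue
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            greedy[s] = max(actions(s), key=lambda act: Q[(s, act)])
    return Q, greedy
```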
Analysis of $\epsilon$-greedy policy for MC Control
We will now analyze the policy improvement step of the on-policy MC control algorithm.
Definition: Policy Improvement for On-Policy MC Control
Any $\epsilon$-greedy policy $\pi^{\prime}$ with respect to $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$.
Proof: Policy Improvement for On-Policy MC Control
Again, this is assured by the policy improvement theorem, $$ \begin{align*} q_{\pi}(s, \pi^{\prime}(s)) & = \sum_{a \in \mathcal{A}(s)} \pi^{\prime}(a|s) q_{\pi}(s, a) \newline & = \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}(s)} q_{\pi}(s, a) \newline & \geq \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}(s)} \frac{\pi(a|s) - \frac{\epsilon}{|\mathcal{A}(s)|}}{1 - \epsilon} q_{\pi}(s, a) \newline \end{align*} $$ The inequality holds because the last sum is a weighted average of the $q_{\pi}(s, a)$ with nonnegative weights summing to 1, and such an average cannot exceed the largest value averaged, $\max_{a \in \mathcal{A}(s)} q_{\pi}(s, a)$. Simplifying, $$ \begin{align*} q_{\pi}(s, \pi^{\prime}(s)) & \geq \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) - \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) + \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_{\pi}(s, a) \newline & = v_{\pi}(s) \end{align*} $$ Thus, $\pi^{\prime} \geq \pi$ (i.e., $v_{\pi^{\prime}}(s) \geq v_{\pi}(s)$ for all $s \in \mathcal{S}$).
Note
We can only achieve the best policy among the $\epsilon$-soft policies, but on the other hand, we have eliminated the need for exploring starts.
Off-Policy Prediction via Importance Sampling
Control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions.
How can they learn about the optimal policy while behaving according to an exploratory policy?
Thus, we will now introduce off-policy methods.
In off-policy methods, we have two policies:
- Target policy: the policy that is learned about and that becomes the optimal policy.
- Behavior policy: the policy that is more exploratory and is used to generate behavior.
We say that learning is from data “off” the target policy, and the overall process is called off-policy learning.
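To make the two roles concrete, here is a minimal, purely illustrative Python sketch in which the behavior policy is $\epsilon$-greedy (so every action keeps a nonzero probability) while the target policy is greedy with respect to the same action values; both function names are assumptions for illustration.

```python
import random

def behavior_policy(Q, state, actions, epsilon=0.1):
    """Exploratory behavior policy b: eps-greedy, so every action has nonzero probability."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def target_policy(Q, state, actions):
    """Target policy pi: greedy with respect to Q; the policy we want to learn about."""
    return max(actions, key=lambda a: Q[(state, a)])
```

Episodes are generated by the behavior policy, yet we want value estimates for the target policy; correcting for this mismatch is precisely the role of importance sampling.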