Part 7 - Reinforcement Learning: Monte Carlo Methods

Introduction

In the previous part, we saw how dynamic programming (DP) methods can be used to compute optimal policies given a perfect model of the environment (the dynamics p\mathcal{p}), as a Markov Decision Process (MDP). However, in many real-world scenarios, we do not have access to the complete model of the environment. Instead, we often have to learn from experience, which is where Monte Carlo (MC) methods come into play.

MC methods are instances of methods that learn from experience, they will sample sequences of states, actions, and rewards from actual or simulated interaction with the environment. The fundamental idea behind MC methods is to estimate the value of a state (or state-action pair) based on the average return observed after visiting that state (or state-action pair).

As we can understand, learning from (actual or simulated) experience requires no prior knowledge of the dynamics, however, it can still attain optimal policies (we will see how later).

Monte Carlo Methods

We will define MC methods in the context of episodic tasks, where each episode ends in a terminal state. Further, MC methods sample and average the returns of each state(-action pair) to estimate its value.

We wil adapt the idea of general policy iteration (GPI) to MC methods.

Note

With DP we compute value functions from knowledge of the MDP (the dynamics p\mathcal{p}).

With MC methods we learn value functions from sample returns without knowledge of the MDP.

Prediction

So, we have stated that our goal with MC methods is to learn the state-value function vπv_{\pi} for a given policy π\pi. More specifically, estimate it from experience, i.e., average the returns observed after visists to each state sSs \in \mathcal{S}. As more returns are observed, the average should converge to the expected return, which is vπ(s)v_{\pi}(s).

We will cover two types of MC methods for policy evaluation: first-visit MC and every-visit MC.

First-Visit MC and Every-Visit MC

The first-visit MC method estimates vπ(s)v_{\pi}(s) as the average of the returns following the first visit to state ss in each episode.

Algorithm (First-Visit MC Prediction)

Input: a policy π\pi to be evaluated.

Initialize:

  • V(s)V(s), arbitrarily for all sSs \in \mathcal{S}
  • Returns(s)[]\mathrm{Returns}(s) \leftarrow [], an empty list for all sSs \in \mathcal{S}

Loop forever (for each episode):

  • Generate an episode following π\pi: S0,A0,R1,,ST1,AT1,RTS_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T
  • G0G \leftarrow 0
  • Loop for each step of episode, t=T1,T2,,0t = T - 1, T - 2, \ldots, 0:
    • GγG+Rt+1G \leftarrow \gamma G + R_{t + 1}
    • Unless StS_t appears in S0,,St1S_0, \ldots, S_{t - 1}:
      • Append GG to Returns(St)\mathrm{Returns}(S_t)
      • V(St)V(S_t) \leftarrow average of Returns(St)\mathrm{Returns}(S_t)

For every-visit MC, we simply remove the condition that the return is only appended for the first visit to state ss in each episode.

We will not formally prove the convergence of these methods, but for first-visit MC the idea is the following.

Note

Each return is an independent, identically distributed (i.i.d.) estimate of vπ(s)v_{\pi}(s) with finite variance (since we have finite returns in an episodic task).

Thus, by the law of large numbers, the average of these returns converges to the expected value vπ(s)v_{\pi}(s) as the number of returns (i.e., visits to state ss) approaches infinity.

Action Values

We stated that one of the strengths of MC methods are that they do not require a model of the dynamics p\mathcal{p}.

Then, it is particularly useful to estimate action values qπ(s,a)q_{\pi}(s, a), rather than state values vπ(s)v_{\pi}(s), since we can directly derive optimal policies from action values.

Essentially everything is the same but now we talk about visits to a state-action pair.

A state-action pair (s,a)(s, a) is said to be visited in an episode if ever the state ss is visited and action aa is taken in it.

However, we have to deal with a new problem, many state-action pairs may never be visited (why?), thus we cannot estimate their values.

Solution

It is possible that, under a certain policy π\pi, some state-action pairs are never visited, simply because the policy π\pi never selects those actions in those states.

With no returns to average, the MC estimates of the other actions will not improve with experience, and thus we cannot improve the policy.

The solution, exploration!

Control

One way to solve this is by specifying that the episodes start in a state-actiopn pair, and that every pair has a nonzero probability of being selected as the start.

This will guarentee that all state-action pairs will be visited in an infinite number of times in the limit of an infinite number of episodes.

We call this setup and assumption of exploring starts.

The overall idea is to use the generalized policy iteration (GPI) we have seen in the case of DP.

We will maintain both an approximate policy and an approximate value function (or action-value function), and we will use each to improve the other.

Monte Carlo Control with Exploring Starts

As in the case for DP, for policy improvement, we improve the policy by making it greedy with respect to the current action-value function estimate.

Definition 1 (Policy Improvement for MC Control)

For each sSs \in \mathcal{S} and for any action-value function qq, we deterministically choose an action with maximum action value,

π(s)argmaxaA(s) q(s,a)\pi(s) \coloneqq \underset{a \in \mathcal{A}(s)}{\arg\max} \ q(s, a)
Proof (Policy Improvement for MC Control)

The policy improvement theorem then applies to πk\pi_k and πk+1\pi_{k + 1} (the greedy policy with respect to qπkq_{\pi_k}), because,

qπk(s,πk+1(s))=qπk(s,argmaxaA(s) qπk(s,a))=maxaA(s)qπk(s,a)qπk(s,πk(s))vπk(s)\begin{align*} q_{\pi_k}(s, \pi_{k + 1}(s)) & = q_{\pi_k}\left(s, \underset{a \in \mathcal{A}(s)}{\arg\max} \ q_{\pi_k}(s, a)\right) \newline & = \max_{a \in \mathcal{A}(s)} q_{\pi_k}(s, a) \newline & \geq q_{\pi_k}(s, \pi_k(s)) \newline & \geq v_{\pi_k}(s) \end{align*}

Thus, we are assured that each πk+1\pi_{k + 1} is uniformly better than πk\pi_k, or just as good as πk\pi_k, in which case they are both optimal policies (by the definition of optimal policies).

Thus, the overall process converges to the optimal policy and optimal (action-)value function.

However, we have made two unlikely assumptions to obtain the MC convergence result.

  1. The episodes assume exploring starts, i.e., each state-action pair has a nonzero probability of being selected as the start of an episode.
  2. Policy evaluation could be done with an infinite number of episodes (we saw this problem with DP as well).
Intuition

An idea we can try, we may give up trying to complete policy evaluation before returning to policy improvement.

As we have seen previously, an extreme case is value iteration, where we interleave policy evaluation and policy improvement at every step.

Algorithm (Monte Carlo Control with Exploring Starts)

Initialize:

  • π(s)A(s)\pi(s) \in \mathcal{A}(s), arbitrarily for all sSs \in \mathcal{S}
  • Q(s,a)Q(s, a), arbitrarily for all sS,aA(s)s \in \mathcal{S}, a \in \mathcal{A}(s)
  • Returns(s,a)[]\mathrm{Returns}(s, a) \leftarrow [], an empty list for all sS,aA(s)s \in \mathcal{S}, a \in \mathcal{A}(s)

Loop forever (for each episode):

  • Choose S0S,A0A(S0)S_0 \in \mathcal{S}, A_0 \in \mathcal{A}(S_0) randomly such that all pairs have probability >0> 0 (exploring starts)
  • Generate an episode from S0,A0S_0, A_0 following π\pi: S0,A0,R1,,ST1,AT1,RTS_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T
  • G0G \leftarrow 0
  • Loop for each step of episode, t=T1,T2,,0t = T - 1, T - 2, \ldots, 0:
    • GγG+Rt+1G \leftarrow \gamma G + R_{t + 1}
    • Unless (St,At)(S_t, A_t) appears in (S0,A0),,(St1,At1)(S_0, A_0), \ldots, (S_{t - 1}, A_{t - 1}):
      • Append GG to Returns(St,At)\mathrm{Returns}(S_t, A_t)
      • Q(St,At)average(Returns(St,At))Q(S_t, A_t) \leftarrow \mathrm{average}(\mathrm{Returns}(S_t, A_t))
      • π(St)argmaxaA(St)Q(St,a)\pi(S_t) \leftarrow \underset{a \in \mathcal{A}(S_t)}{\arg\max} Q(S_t, a)

Again, if we remove the condition that the return is only appended for the first visit to the state-action pair in each episode, we get the every-visit version of this algorithm.

Further, note that the last step is value iteration, since we update the policy at every step of every episode.

On-Policy vs Off-Policy

How can we avoid the unlikely assumption of exploring starts?

We will now introduce two approaches to ensure all actions are selected infinitely often.

  1. On-policy: Evaluate or improve the policy that is used to make decisions.
  2. Off-policy: Evaluate or improve a policy different from that used to generate the data/experience.

In on-policy methods, we say that our policy is soft.

Definition 2 (Soft Policy)

π(as)>0\pi(a|s) > 0 for all sS,aA(s)s \in \mathcal{S}, a \in \mathcal{A}(s). but gradually will shift closer to a deterministic optimal policy.

Here, we use ϵ\epsilon-greedy policies, where with probability 1ϵ1 - \epsilon we select the greedy action (i.e., maximal estimated action value), and with probability ϵ\epsilon we select an action uniformly at random from the set of available actions.

Exercise 1

What is the probability given to selecting the greedy action?

Solution

We understand that with probability 1ϵ1 - \epsilon we select the greedy action, but we also need to consider the probability of selecting the greedy action when we select an action uniformly at random, thus,

P(greedy action)=P(greedy actiongreedy)P(greedy)+P(greedy actionrandom)P(random)=(1ϵ)+ϵA(s)=1ϵ+ϵA(s)\begin{align*} P(\text{greedy action}) & = P(\text{greedy action} | \text{greedy}) P(\text{greedy}) + P(\text{greedy action} | \text{random}) P(\text{random}) \newline & = (1 - \epsilon) + \frac{\epsilon}{|\mathcal{A}(s)|} \newline & = 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} \end{align*}
Algorithm (On-Policy First-Visit MC Control)

Algorithm parameter: small ϵ>0\epsilon > 0 Initialize:

  • π\pi \leftarrow an arbitrary ϵ\epsilon-soft policy
  • Q(s,a)RQ(s, a) \in \mathbb{R}, arbitrarily for all sS,aA(s)s \in \mathcal{S}, a \in \mathcal{A}(s)
  • Returns(s,a)[]\mathrm{Returns}(s, a) \leftarrow [], an empty list for all sS,aA(s)s \in \mathcal{S}, a \in \mathcal{A}(s) Repeat forever (for each episode):
  • Generate an episode following π\pi: S0,A0,R1,,ST1,AT1,RTS_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T
  • G0G \leftarrow 0
  • Loop for each step of episode, t=T1,T2,,0t = T - 1, T - 2, \ldots, 0:
    • GγG+Rt+1G \leftarrow \gamma G + R_{t + 1}
    • Unless (St,At)(S_t, A_t) appears in (S0,A0),,(St1,At1)(S_0, A_0), \ldots, (S_{t - 1}, A_{t - 1}):
      • Append GG to Returns(St,At)\mathrm{Returns}(S_t, A_t)
      • Q(St,At)average(Returns(St,At))Q(S_t, A_t) \leftarrow \mathrm{average}(\mathrm{Returns}(S_t, A_t))
      • AargmaxaA(St)Q(St,a)A^{\star} \leftarrow \underset{a \in \mathcal{A}(S_t)}{\arg\max} Q(S_t, a)
      • For all aA(St)a \in \mathcal{A}(S_t):
        • π(aSt){1ϵ+ϵA(St),if a=AϵA(St),otherwise\pi(a|S_t) \leftarrow \begin{cases} 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(S_t)|}, & \text{if } a = A^{\star} \newline \frac{\epsilon}{|\mathcal{A}(S_t)|}, & \text{otherwise}\end{cases}

Analysis of ϵ\epsilon-greedy policy for MC Control

We will now analyze the policy improvement step of the on-policy MC control algorithm.

Definition 3 (Policy Improvement for On-Policy MC Control)

Any ϵ\epsilon-greedy policy π\pi^{\prime} with respect to qπq_{\pi} is an improvement over any ϵ\epsilon-greedy policy π\pi.

Proof (Policy Improvement for On-Policy MC Control)

Again, this is assured by the policy improvement theorem,

qπ(s,π(s))=aA(s)π(as)qπ(s,a)=ϵA(s)aA(s)qπ(s,a)+(1ϵ)maxaA(s)qπ(s,a)ϵA(s)aA(s)qπ(s,a)+(1ϵ)aA(s)π(as)ϵA(s)1ϵqπ(s,a)\begin{align*} q_{\pi}(s, \pi^{\prime}(s)) & = \sum_{a \in \mathcal{A}(s)} \pi^{\prime}(a|s) q_{\pi}(s, a) \newline & = \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}(s)} q_{\pi}(s, a) \newline & \geq \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}(s)} \frac{\pi(a|s) - \frac{\epsilon}{|\mathcal{A}(s)|}}{1 - \epsilon} q_{\pi}(s, a) \newline \end{align*}

Notice that the sum is a weighted average with nonnegative weights summing to 1 1since aA(s)π(as)=1\sum_{a \in \mathcal{A}(s)} \pi(a|s) = 1, thus,

qπ(s,π(s))ϵA(s)aA(s)qπ(s,a)ϵA(s)aA(s)qπ(s,a)+aA(s)π(as)qπ(s,a)=vπ(s)\begin{align*} q_{\pi}(s, \pi^{\prime}(s)) & \geq \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) - \frac{\epsilon}{|\mathcal{A}(s)|} \sum_{a \in \mathcal{A}(s)} q_{\pi}(s, a) + \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_{\pi}(s, a) \newline & = v_{\pi}(s) \end{align*}

Thus, ππ\pi^{\prime} \geq \pi (i.e., vπ(s)vπ(s)v_{\pi^{\prime}}(s) \geq v_{\pi}(s) for all sSs \in \mathcal{S}).

Note

We can only achieve the bewst policy among the ϵ\epsilon-greedy policies, but on the other hand, we have eliminated the need for exploring starts.

Off-Policy Prediction via Importance Sampling

(Online) control methods face a dilemma, since they seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions.

How can they learn about the optimal policy while behaving according to an exploratory policy?

Thus, we will now introduce off-policy methods.

In off-policy methods, we have two policies,

  1. Target policy: the policy that is learned about and that becomes the optimal policy.
  2. Behavior policy: the policy that is more exploratory and is used to generate behavior.

We say that learning is from data “off” the target policy, and the overall process is called off-policy learning.