Part 5 - Reinforcement Learning: Basics

Introduction to Reinforcement Learning

So far, we have been focusing on the multi-armed bandit problem. Reinforcement learning (RL) and multi-armed bandits are closely related, however, we will see that RL has one major difference.

The Agent-Environment Interaction

The core concepts of a RL system are,

  • An agent (learner, decision maker), that contains,
    • A state that represents the current situation of the agent.
    • Following a policy to decide what action to take.
    • (Probably) a value function to estimate how good a state (or action) is.
    • (Optionally) a model of the environment.
  • An environment: the thing that our agent interacts with, comprising everything outside the agent.
  • A reward signal: special numerical values that the agent seeks to maximize over time through its choice of actions (we will discuss this more later).

The agent and environment interact at each of a sequence of discrete time steps, t=1,2,3,t = 1, 2, 3, \ldots.

At each time step tt, the agent receives some representation of the environment’s state, StSS_t \in \mathcal{S}, and on that basis selects an action, AtA(s)A_t \in \mathcal{A}(s).

One time step later, the agent receives a numerical reward Rt+1RRR_{t+1} \in \mathcal{R} \subset \mathbb{R} and finds itself in a new state, St+1S_{t+1}.

The environment and agent together thereby give rise to a sequence or trajectory that begins like this,

S0,A0,R1,S1,A1,R2,S2,A2,R3,S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots

In a finite setting, the sets of states, actions, and rewards (S,A,R\mathcal{S}, \mathcal{A}, \mathcal{R}) all have a finite number of elements.

Formalizing the RL Interface

To formalize the RL interface, we will introduce a mathematical formulation of the agent-environment interaction.

It is formulated via a Markov Decision Process (MDP), which will describe an environment and its dynamics.

Markov Property

Definition 1 (Markov Property)

“The future is independent of the past, given the present.”, or in other words,

P(St=s,Rt=rSt1,At1,Rt1,,R1,S0,A0)=P(St=s,Rt=rSt1,At1)P(S_t = s, R_t = r | S_{t-1}, A_{t-1}, R_{t-1}, \ldots, R_1, S_0, A_0) = P(S_t = s, R_t = r | S_{t-1}, A_{t-1})

The probability of each possible value for StS_t and RtR_t depends on the immediately preceding state St1S_{t-1} and action At1A_{t-1}, and not on any of the earlier ones.

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a tuple (S,A,p,γ)(\mathcal{S}, \mathcal{A}, \mathcal{p}, \gamma), where,

  • S\mathcal{S} is the set of all possible states.
  • A\mathcal{A} is the set of all possible actions.
  • p(s,rs,a)\mathcal{p}(s^{\prime}, r | s, a) is the joint probability of next state sSs^{\prime} \in \mathcal{S} and reward rRr \in \mathcal{R}, given the current state sSs \in \mathcal{S} and action aAa \in \mathcal{A}.
  • γ[0,1]\gamma \in [0, 1] is the discount factor to trade off longer-term rewards against immediate ones.

The rewards and discount together define the goal. Further, p\mathcal{p} defines the dynamics of the environment,

p(s,rs,a)=P(St=s,Rt=rSt1=s,At1=a)\mathcal{p}(s^{\prime}, r | s, a) = P(S_t = s^{\prime}, R_t = r | S_{t-1} = s, A_{t-1} = a)
NotesSrRp(s,rs,a)=1,sS,aA\sum_{s^{\prime} \in \mathcal{S}} \sum_{r \in \mathcal{R}} \mathcal{p}(s^{\prime}, r | s, a) = 1, \quad \forall s \in \mathcal{S}, a \in \mathcal{A}

Why?

Solution

Because the sum of probabilities of all possible next states and rewards given the current state and action must equal 1. We are never in a state where our action and state lead to an impossible outcome.

Other Useful Transition Probabilities

We can derive some other useful transition probabilities from p\mathcal{p}.

  • The state-transition probabilities,
p(ss,a)=P(St=sSt1=s,At1=a)=rRp(s,rs,a)\mathcal{p}(s^{\prime} | s, a) = P(S_t = s^{\prime} | S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} \mathcal{p}(s^{\prime}, r | s, a)
  • The expected rewards for state-action pairs,
r(s,a)=E[RtSt1=s,At1=a]=rRrsSp(s,rs,a)r(s, a) = \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s^{\prime} \in \mathcal{S}} \mathcal{p}(s^{\prime}, r | s, a)
  • The expected rewards for state-action-next-state triples,
r(s,a,s)=E[RtSt1=s,At1=a,St=s]=rRrp(s,rs,a)p(ss,a)r(s, a, s^{\prime}) = \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s^{\prime}] = \frac{\sum_{r \in \mathcal{R}} r \mathcal{p}(s^{\prime}, r | s, a)}{\mathcal{p}(s^{\prime} | s, a)}

Where the third equation comes from the definition of conditional expectation.

Goals and Rewards

The agent’s goal is to maximize not the immediate reward, but the cumulative reward in the long run.

Proposition 1 (Reward Hypothesis)

All goals and purposes imposed on an agent can be described as the act of maximization of the expected value of the cumulative sum of a received scalar signal.

Rewards are provided in such a way that maximizing them (cumulatively) will achieve the desired outcome.

The reward signal specifies the agent what should be achieved, however, not how to achieve it. The “how” is specified by the RL algorithm/policy.

Returns, Episodic Tasks, and Continuing Tasks

Definition 2 (Episodic Tasks)

When the agent-environment interaction breaks naturally into subsequences, each of which is called an episode (e.g., a game of chess, a robot navigating a maze).

Definition 3 (Return)

Consider the sequence of rewards received after time step tt,

Rt+1,Rt+2,Rt+3,R_{t+1}, R_{t+2}, R_{t+3}, \ldots

The return quantity is the sum of the rewards, where TT is the final time step of the episode,

GtRt+1+Rt+2+Rt+3++RTG_t \coloneqq R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_T

Thus, the goal is to maximize the expected return, E[Gt]\mathbb{E}[G_t].

Definition 4 (Continuing Tasks)

When the agent-environment interaction continues indefinitely, without end (e.g., a robot that must keep walking without falling over).

Definition 5 (Discounted Return)

We can’t define return in the same way as in episodic tasks, because the sum of rewards may diverge to infinity (T=T = \infty). Instead, we define the discounted return,

GtRt+1+γRt+2+γ2Rt+3+γ3Rt+4+=k=0γkRt+k+1G_t \coloneqq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where γ[0,1)\gamma \in [0, 1) is the discount factor.

The discount factor determines the present value of future rewards. Why?

Solution

A reward received kk time steps in the future is worth only γk1\gamma^{k - 1} times what it would be worth if it were received immediately.

If γ<1\gamma < 1, the infinite sum above has a finite value, as long as the reward sequence {Rk}\{R_k\}, is bounded.

If γ=0\gamma = 0, the agent is “myopic” (short-sighted), caring only about immediate rewards.

As γ\gamma approaches 1, the agent becomes more far-sighted, valuing future rewards more strongly.

Unification of Episodic and Continuing RL

We can unify episodic and continuing tasks in two ways.

  1. Consider episode termination to be the entering of a special absorbing state, that only transitions to itself and that only receives zero reward.

  2. We can rewrite the discounted return for continuing tasks as,

Gtk=t+1Tγkt1RkG_t \coloneqq \sum_{k=t + 1}^T \gamma^{k - t - 1} R_k

with the possibility that T=T = \infty or γ=1\gamma = 1, but not both.

Recursive Relationship of Returns

One can express the return recursively,

GtRt+1+γRt+2+γ2Rt+3+γ3Rt+4+=Rt+1+γ(Rt+2+γRt+3+γ2Rt+4+)Gt+1=Rt+1+γGt+1\begin{align*} G_t & \coloneqq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots \newline & = R_{t+1} + \gamma \underbrace{(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \ldots)}_{G_{t+1}} \newline & = R_{t+1} + \gamma G_{t+1} \newline \end{align*}

This works for all time steps t<Tt < T, even if termination occurs at t+1t + 1, provided we define GT=0G_T = 0.

Exercise 1

Compute GtG_t if the reward is constant +1.

Solution

If Rt=1R_t = 1 for all tt, then,

Gt=1+γ+γ2+γ3+=k=0γk=11γ,for γ<1\begin{align*} G_t & = 1 + \gamma + \gamma^2 + \gamma^3 + \ldots \newline & = \sum_{k=0}^{\infty} \gamma^k \newline & = \frac{1}{1 - \gamma}, \quad \text{for }\gamma < 1 \newline \end{align*}

Policies and Value Functions

Definition 6 (Policy)

A policy, π\pi, is a mapping from states to probabilities of selecting each possible action. Thus, π(as)\pi(a | s) is the probability that action aa is taken when in state ss.

π(as)=P(At=aSt=s)\pi(a | s) = P(A_t = a | S_t = s)
Definition 7 (Value Function)

A value function of states (or of state-action pairs) estimates how good it is for the agent to be in a given state (or, how good it is to perform a given action in a given state).

vπ(s)Eπ,p[GtSt=s]=Eπ,p[k=0γkRt+k+1|St=s], sSqπ(s,a)Eπ,p[GtSt=s,At=a]=Eπ,p[k=0γkRt+k+1|St=s,At=a], sS,aA(s)\begin{align*} v_{\pi}(s) & \coloneqq \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi, \mathcal{p}}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \middle| S_t = s\right], \ \forall s \in \mathcal{S} \newline q_{\pi}(s, a) & \coloneqq \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s, A_t = a] \newline & = \mathbb{E}_{\pi, \mathcal{p}}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \middle| S_t = s, A_t = a\right], \ \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \end{align*}
Note

The value functions vπ(s)v_{\pi}(s) and qπ(s,a)q_{\pi}(s, a) can be estimated from experience.

For exam ple, if an agent follows policy π\pi and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state’s value, vπ(s)v_{\pi}(s), as the number of time that state is encountered approaches infinity.

If separate averages are kept for each action taken in each state, then these averages will similarly converge to the action-values, qπ(s,a)q_{\pi}(s, a).

Such estimation methods are called Monte Carlo methods, because they involve averaging over many random samples of actual returns.

In the case that our state space is (too) large or continuous, we can approximate vπ(s)v_{\pi}(s) or qπ(s,a)q_{\pi}(s, a) with some parameterized function (e.g., a neural network). Or, adjust the parameter to better match the observed returns.

Bellman Equation for Value Functions

Theorem 1 (Bellman Equation for vπv_{\pi})

The value of a state under a policy can be decomposed into two parts: the immediate reward plus the discounted value of the next state,

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)],sSv_{\pi}(s) = \sum_{a} \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})], \quad \forall s \in \mathcal{S}

where the sums are over all possible actions, next states, and rewards.

In other words, The value of the current state must equal the (discounted) value of the expected next state, plus reward along the way.

Proof (Bellman Equation for vπv_{\pi})vπ(s)=Eπ,p[GtSt=s]=Eπ,p[Rt+1+γGt+1St=s]=aπ(as)s,rp(s,rs,a)[r+γEπ[Gt+1St+1=s]]=aπ(as)s,rp(s,rs,a)[r+γvπ(s)],sS\begin{align*} v_{\pi}(s) & = \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi, \mathcal{p}}[R_{t+1} + \gamma G_{t+1} | S_t = s] \newline & = \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s^{\prime}]] \newline & = \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})], \quad \forall s \in \mathcal{S} \newline \end{align*}

Optimal Policies and Optimal Value Functions

Definition 8 (Policy Ordering)

We say that policy π\pi is better than or equal to policy π\pi^{\prime} (ππ\pi \geq \pi^{\prime}) if and only if,

vπ(s)vπ(s),sSv_{\pi}(s) \geq v_{\pi^{\prime}}(s), \quad \forall s \in \mathcal{S}

There is always at least one policy that is better than or equal to all other policies, we denote all such policies (there may be more than one) as π\pi_{\star}, and call any of them an optimal policy.

All of the optimal policies share the same optimal value function vv_{\star}. Why?

Solution

Because if there were two different optimal value functions, then one of the corresponding policies would have to be better than the other, contradicting the definition of optimal policies.

Optimal policies also share the same optimal action-value function, qq_{\star}. Why?

Solution

Same reasoning as above.

Definition 9 (Optimal Value Functions)

The optimal state-value function, vv_{\star}, is defined as,

v(s)maxπ vπ(s),sSv_{\star}(s) \coloneqq \underset{\pi}{\max} \ v_{\pi}(s), \quad \forall s \in \mathcal{S}
Definition 10 (Optimal Action-Value Function)

The optimal action-value function, qq_{\star}, is defined as,

q(s,a)maxπ qπ(s,a),sS,aA(s)=E[Rt+1+γv(St+1)St=s,At=a]\begin{align*} q_{\star}(s, a) \coloneqq \underset{\pi}{\max} \ q_{\pi}(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s) & = \mathbb{E}[R_{t+1} + \gamma v_{\star}(S_{t+1}) | S_t = s, A_t = a] \newline \end{align*}

Why can we rewrite it to the second line?

Solution

Because the optimal action-value function can be expressed in terms of the immediate reward and the optimal state-value function of the next state, as the agent will always act optimally thereafter.

Bellman Optimality Equations

Theorem 2 (Bellman Optimality Equation for vv_{\star})

The Bellman equation for optimal state values is written without reference to any policy. Thus, the Bellman optimality equation expresses the value of a state under an optimal policy must equal the expected return for the best action from that state,

v(s)=maxas,rp(s,rs,a)[r+γv(s)],sSv_{\star}(s) = \max_a \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\star}(s^{\prime})], \quad \forall s \in \mathcal{S}

where the sums are over all possible next states and rewards.

Proof (Bellman Optimality Equation for vv_{\star})v(s)=maxaA(s)q(s,a)=maxa Eπ,p[GtSt=s,At=a]=maxa Eπ,p[Rt+1+γGt+1St=s,At=a]=maxa Ep[Rt+1+γv(St+1)St=s,At=a]=maxa s,rp(s,rs,a)[r+γv(s)],s\begin{align*} v_{\star}(s) & = \max_{a \in \mathcal{A}(s)} q_{\star}(s, a) \newline & = \underset{a}{\max} \ \mathbb{E}_{\pi_{\star}, \mathcal{p}}[G_t | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \mathbb{E}_{\pi_{\star}, \mathcal{p}}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \mathbb{E}_{\mathcal{p}}[R_{t+1} + \gamma v_{\star}(S_{t+1}) | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\star}(s^{\prime})], \quad \forall s \end{align*}
Theorem 3 (Bellman Optimality Equation for qq_{\star})

The Bellman optimality equation for action values is,

q(s,a)=s,rp(s,rs,a)[r+γmaxaq(s,a)],sS,aA(s)q_{\star}(s, a) = \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \max_{a^{\prime}} q_{\star}(s^{\prime}, a^{\prime})], \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s)

where the sums are over all possible next states and rewards.

Proof (Bellman Optimality Equation for qq_{\star})q(s,a)=Eπ,p[Rt+1+γ maxa q(St+1,a)St=s,At=a]=s,rp(s,rs,a)[r+γ maxa q(s,a)],sS,aA(s)\begin{align*} q_{\star}(s, a) & = \mathbb{E}_{\pi_{\star}, \mathcal{p}}[R_{t+1} + \gamma \ \underset{a^{\prime}}{\max} \ q_{\star}(S_{t+1}, a^{\prime}) | S_t = s, A_t = a] \newline & = \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \ \underset{a^{\prime}}{\max} \ q_{\star}(s^{\prime}, a^{\prime})], \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \newline \end{align*}

Once one has vv_{\star} (or qq_{\star}), it is relatively easy to determine an optimal policy π\pi_{\star}, why?

Solution

For each state ss, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy.

Further, one-step search or greedy search based on immediate consideration (i.e., evaluating short-term consequences based on vv_{\star}) is sufficient for an optimal policy in long-term.

vv_{\star} already takes into account the reward consequences of all possible future behavior.

With qq_{\star}, for any state ss, it can simply find any action that maximizes q(s,a)q_{\star}(s, a). Further, with qq_{\star}, nothing more is needed to be known about the dynamics of the environment.

Solving Bellman Optimality Equation

Explicitly solving the Bellman optimality equation might be akin to an exhaustive search over all possiblities.

It relies on at least these three assumptions,

  1. The dynamics of the environment, p\mathcal{p}, are accurately known.
  2. Computational resources are sufficient to complete the calculations.
  3. The states have the Markov property.

For example, for the game of backgammon, there are 1020\sim 10^{20} possible states, and computing vv_{\star} or qq_{\star} is impossible in practice.

Further, the tabular case (i.e., using tables with one entry for each state or state-action pair) can be very memory inefficient.

Therefore, approximate solutions are more common in practice.