Part 5 - Reinforcement Learning: Basics

Introduction to Reinforcement Learning

So far, we have been focusing on the multi-armed bandit problem. Reinforcement learning (RL) and multi-armed bandits are closely related, however, we will see that RL has one major difference.

The Agent-Environment Interaction

The core concepts of a RL system are,

An agent (learner, decision maker), that contains,
- A state that represents the current situation of the agent.
- Following a policy to decide what action to take.
- (Probably) a value function to estimate how good a state (or action) is.
- (Optionally) a model of the environment.
An environment: the thing that our agent interacts with, comprising everything outside the agent.
A reward signal: special numerical values that the agent seeks to maximize over time through its choice of actions (we will discuss this more later).

The agent and environment interact at each of a sequence of discrete time steps, $t = 1, 2, 3, \ldots$ .

At each time step $t$ , the agent receives some representation of the environment’s state, $S_t \in \mathcal{S}$ , and on that basis selects an action, $A_t \in \mathcal{A}(s)$ .

One time step later, the agent receives a numerical reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and finds itself in a new state, $S_{t+1}$ .

The environment and agent together thereby give rise to a sequence or trajectory that begins like this,

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots

In a finite setting, the sets of states, actions, and rewards ( $\mathcal{S}, \mathcal{A}, \mathcal{R}$ ) all have a finite number of elements.

Formalizing the RL Interface

To formalize the RL interface, we will introduce a mathematical formulation of the agent-environment interaction.

It is formulated via a Markov Decision Process (MDP), which will describe an environment and its dynamics.

Markov Property

Definition 1 (Markov Property)

“The future is independent of the past, given the present.”, or in other words,

P(S_t = s, R_t = r | S_{t-1}, A_{t-1}, R_{t-1}, \ldots, R_1, S_0, A_0) = P(S_t = s, R_t = r | S_{t-1}, A_{t-1})

The probability of each possible value for $S_t$ and $R_t$ depends on the immediately preceding state $S_{t-1}$ and action $A_{t-1}$ , and not on any of the earlier ones.

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{p}, \gamma)$ , where,

$\mathcal{S}$ is the set of all possible states.
$\mathcal{A}$ is the set of all possible actions.
$\mathcal{p}(s^{\prime}, r | s, a)$ is the joint probability of next state $s^{\prime} \in \mathcal{S}$ and reward $r \in \mathcal{R}$ , given the current state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$ .
$\gamma \in [0, 1]$ is the discount factor to trade off longer-term rewards against immediate ones.

The rewards and discount together define the goal. Further, $\mathcal{p}$ defines the dynamics of the environment,

\mathcal{p}(s^{\prime}, r | s, a) = P(S_t = s^{\prime}, R_t = r | S_{t-1} = s, A_{t-1} = a)

Note

\sum_{s^{\prime} \in \mathcal{S}} \sum_{r \in \mathcal{R}} \mathcal{p}(s^{\prime}, r | s, a) = 1, \quad \forall s \in \mathcal{S}, a \in \mathcal{A}

Why?

Solution

Because the sum of probabilities of all possible next states and rewards given the current state and action must equal 1. We are never in a state where our action and state lead to an impossible outcome.

Other Useful Transition Probabilities

We can derive some other useful transition probabilities from $\mathcal{p}$ .

The state-transition probabilities,

\mathcal{p}(s^{\prime} | s, a) = P(S_t = s^{\prime} | S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} \mathcal{p}(s^{\prime}, r | s, a)

The expected rewards for state-action pairs,

r(s, a) = \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s^{\prime} \in \mathcal{S}} \mathcal{p}(s^{\prime}, r | s, a)

The expected rewards for state-action-next-state triples,

r(s, a, s^{\prime}) = \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s^{\prime}] = \frac{\sum_{r \in \mathcal{R}} r \mathcal{p}(s^{\prime}, r | s, a)}{\mathcal{p}(s^{\prime} | s, a)}

Where the third equation comes from the definition of conditional expectation.

Goals and Rewards

The agent’s goal is to maximize not the immediate reward, but the cumulative reward in the long run.

Proposition 1 (Reward Hypothesis)

All goals and purposes imposed on an agent can be described as the act of maximization of the expected value of the cumulative sum of a received scalar signal.

Rewards are provided in such a way that maximizing them (cumulatively) will achieve the desired outcome.

The reward signal specifies the agent what should be achieved, however, not how to achieve it. The “how” is specified by the RL algorithm/policy.

Returns, Episodic Tasks, and Continuing Tasks

Definition 2 (Episodic Tasks)

When the agent-environment interaction breaks naturally into subsequences, each of which is called an episode (e.g., a game of chess, a robot navigating a maze).

Definition 3 (Return)

Consider the sequence of rewards received after time step $t$ ,

R_{t+1}, R_{t+2}, R_{t+3}, \ldots

The return quantity is the sum of the rewards, where $T$ is the final time step of the episode,

G_t \coloneqq R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_T

Thus, the goal is to maximize the expected return, $\mathbb{E}[G_t]$ .

Definition 4 (Continuing Tasks)

When the agent-environment interaction continues indefinitely, without end (e.g., a robot that must keep walking without falling over).

Definition 5 (Discounted Return)

We can’t define return in the same way as in episodic tasks, because the sum of rewards may diverge to infinity ( $T = \infty$ ). Instead, we define the discounted return,

G_t \coloneqq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where $\gamma \in [0, 1)$ is the discount factor.

The discount factor determines the present value of future rewards. Why?

Solution

A reward received $k$ time steps in the future is worth only $\gamma^{k - 1}$ times what it would be worth if it were received immediately.

If $\gamma < 1$ , the infinite sum above has a finite value, as long as the reward sequence $\{R_k\}$ , is bounded.

If $\gamma = 0$ , the agent is “myopic” (short-sighted), caring only about immediate rewards.

As $\gamma$ approaches 1, the agent becomes more far-sighted, valuing future rewards more strongly.

Unification of Episodic and Continuing RL

We can unify episodic and continuing tasks in two ways.

Consider episode termination to be the entering of a special absorbing state, that only transitions to itself and that only receives zero reward.
We can rewrite the discounted return for continuing tasks as,

G_t \coloneqq \sum_{k=t + 1}^T \gamma^{k - t - 1} R_k

with the possibility that $T = \infty$ or $\gamma = 1$ , but not both.

Recursive Relationship of Returns

One can express the return recursively,

\begin{align*} G_t & \coloneqq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots \newline & = R_{t+1} + \gamma \underbrace{(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \ldots)}_{G_{t+1}} \newline & = R_{t+1} + \gamma G_{t+1} \newline \end{align*}

This works for all time steps $t < T$ , even if termination occurs at $t + 1$ , provided we define $G_T = 0$ .

Exercise 1

Compute $G_t$ if the reward is constant +1.

Solution

If $R_t = 1$ for all $t$ , then,

\begin{align*} G_t & = 1 + \gamma + \gamma^2 + \gamma^3 + \ldots \newline & = \sum_{k=0}^{\infty} \gamma^k \newline & = \frac{1}{1 - \gamma}, \quad \text{for }\gamma < 1 \newline \end{align*}

Policies and Value Functions

Definition 6 (Policy)

A policy, $\pi$ , is a mapping from states to probabilities of selecting each possible action. Thus, $\pi(a | s)$ is the probability that action $a$ is taken when in state $s$ .

\pi(a | s) = P(A_t = a | S_t = s)

Definition 7 (Value Function)

A value function of states (or of state-action pairs) estimates how good it is for the agent to be in a given state (or, how good it is to perform a given action in a given state).

\begin{align*} v_{\pi}(s) & \coloneqq \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi, \mathcal{p}}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \middle| S_t = s\right], \ \forall s \in \mathcal{S} \newline q_{\pi}(s, a) & \coloneqq \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s, A_t = a] \newline & = \mathbb{E}_{\pi, \mathcal{p}}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \middle| S_t = s, A_t = a\right], \ \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \end{align*}

Note

The value functions $v_{\pi}(s)$ and $q_{\pi}(s, a)$ can be estimated from experience.

For exam ple, if an agent follows policy $\pi$ and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state’s value, $v_{\pi}(s)$ , as the number of time that state is encountered approaches infinity.

If separate averages are kept for each action taken in each state, then these averages will similarly converge to the action-values, $q_{\pi}(s, a)$ .

Such estimation methods are called Monte Carlo methods, because they involve averaging over many random samples of actual returns.

In the case that our state space is (too) large or continuous, we can approximate $v_{\pi}(s)$ or $q_{\pi}(s, a)$ with some parameterized function (e.g., a neural network). Or, adjust the parameter to better match the observed returns.

Bellman Equation for Value Functions

Theorem 1 (Bellman Equation for

v_{\pi}

)

The value of a state under a policy can be decomposed into two parts: the immediate reward plus the discounted value of the next state,

v_{\pi}(s) = \sum_{a} \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})], \quad \forall s \in \mathcal{S}

where the sums are over all possible actions, next states, and rewards.

In other words, The value of the current state must equal the (discounted) value of the expected next state, plus reward along the way.

Proof (Bellman Equation for

v_{\pi}

)

\begin{align*} v_{\pi}(s) & = \mathbb{E}_{\pi, \mathcal{p}}[G_t | S_t = s] \newline & = \mathbb{E}_{\pi, \mathcal{p}}[R_{t+1} + \gamma G_{t+1} | S_t = s] \newline & = \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s^{\prime}]] \newline & = \sum_a \pi(a | s) \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})], \quad \forall s \in \mathcal{S} \newline \end{align*}

Optimal Policies and Optimal Value Functions

Definition 8 (Policy Ordering)

We say that policy $\pi$ is better than or equal to policy $\pi^{\prime}$ ( $\pi \geq \pi^{\prime}$ ) if and only if,

v_{\pi}(s) \geq v_{\pi^{\prime}}(s), \quad \forall s \in \mathcal{S}

There is always at least one policy that is better than or equal to all other policies, we denote all such policies (there may be more than one) as $\pi_{\star}$ , and call any of them an optimal policy.

All of the optimal policies share the same optimal value function $v_{\star}$ . Why?

Solution

Because if there were two different optimal value functions, then one of the corresponding policies would have to be better than the other, contradicting the definition of optimal policies.

Optimal policies also share the same optimal action-value function, $q_{\star}$ . Why?

Solution

Same reasoning as above.

Definition 9 (Optimal Value Functions)

The optimal state-value function, $v_{\star}$ , is defined as,

v_{\star}(s) \coloneqq \underset{\pi}{\max} \ v_{\pi}(s), \quad \forall s \in \mathcal{S}

Definition 10 (Optimal Action-Value Function)

The optimal action-value function, $q_{\star}$ , is defined as,

\begin{align*} q_{\star}(s, a) \coloneqq \underset{\pi}{\max} \ q_{\pi}(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s) & = \mathbb{E}[R_{t+1} + \gamma v_{\star}(S_{t+1}) | S_t = s, A_t = a] \newline \end{align*}

Why can we rewrite it to the second line?

Solution

Because the optimal action-value function can be expressed in terms of the immediate reward and the optimal state-value function of the next state, as the agent will always act optimally thereafter.

Bellman Optimality Equations

Theorem 2 (Bellman Optimality Equation for

v_{\star}

)

The Bellman equation for optimal state values is written without reference to any policy. Thus, the Bellman optimality equation expresses the value of a state under an optimal policy must equal the expected return for the best action from that state,

v_{\star}(s) = \max_a \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\star}(s^{\prime})], \quad \forall s \in \mathcal{S}

where the sums are over all possible next states and rewards.

Proof (Bellman Optimality Equation for

v_{\star}

)

\begin{align*} v_{\star}(s) & = \max_{a \in \mathcal{A}(s)} q_{\star}(s, a) \newline & = \underset{a}{\max} \ \mathbb{E}_{\pi_{\star}, \mathcal{p}}[G_t | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \mathbb{E}_{\pi_{\star}, \mathcal{p}}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \mathbb{E}_{\mathcal{p}}[R_{t+1} + \gamma v_{\star}(S_{t+1}) | S_t = s, A_t = a] \newline & = \underset{a}{\max} \ \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma v_{\star}(s^{\prime})], \quad \forall s \end{align*}

Theorem 3 (Bellman Optimality Equation for

q_{\star}

)

The Bellman optimality equation for action values is,

q_{\star}(s, a) = \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \max_{a^{\prime}} q_{\star}(s^{\prime}, a^{\prime})], \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s)

where the sums are over all possible next states and rewards.

Proof (Bellman Optimality Equation for

q_{\star}

)

\begin{align*} q_{\star}(s, a) & = \mathbb{E}_{\pi_{\star}, \mathcal{p}}[R_{t+1} + \gamma \ \underset{a^{\prime}}{\max} \ q_{\star}(S_{t+1}, a^{\prime}) | S_t = s, A_t = a] \newline & = \sum_{s^{\prime}, r} \mathcal{p}(s^{\prime}, r | s, a) [r + \gamma \ \underset{a^{\prime}}{\max} \ q_{\star}(s^{\prime}, a^{\prime})], \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \newline \end{align*}

Once one has $v_{\star}$ (or $q_{\star}$ ), it is relatively easy to determine an optimal policy $\pi_{\star}$ , why?

Solution

For each state $s$ , there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy.

Further, one-step search or greedy search based on immediate consideration (i.e., evaluating short-term consequences based on $v_{\star}$ ) is sufficient for an optimal policy in long-term.

$v_{\star}$ already takes into account the reward consequences of all possible future behavior.

With $q_{\star}$ , for any state $s$ , it can simply find any action that maximizes $q_{\star}(s, a)$ . Further, with $q_{\star}$ , nothing more is needed to be known about the dynamics of the environment.

Solving Bellman Optimality Equation

Explicitly solving the Bellman optimality equation might be akin to an exhaustive search over all possiblities.

It relies on at least these three assumptions,

The dynamics of the environment, $\mathcal{p}$ , are accurately known.
Computational resources are sufficient to complete the calculations.
The states have the Markov property.

For example, for the game of backgammon, there are $\sim 10^{20}$ possible states, and computing $v_{\star}$ or $q_{\star}$ is impossible in practice.

Further, the tabular case (i.e., using tables with one entry for each state or state-action pair) can be very memory inefficient.

Therefore, approximate solutions are more common in practice.