Reinforcement Learning: An Introduction


\( \newcommand{\states}{\mathcal{S}} \newcommand{\actions}{\mathcal{A}} \newcommand{\observations}{\mathcal{O}} \newcommand{\rewards}{\mathcal{R}} \newcommand{\traces}{\mathbf{e}} \newcommand{\transition}{P} \newcommand{\reals}{\mathbb{R}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\complexs}{\mathbb{C}} \newcommand{\field}{\mathbb{F}} \newcommand{\numfield}{\mathbb{F}} \newcommand{\expected}{\mathbb{E}} \newcommand{\var}{\mathbb{V}} \newcommand{\by}{\times} \newcommand{\partialderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\defineq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\defeq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\eye}{\Imat} \newcommand{\hadamard}{\odot} \newcommand{\trans}{\top} \newcommand{\inv}{{-1}} \newcommand{\argmax}{\operatorname{argmax}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\avec}{\mathbf{a}} \newcommand{\bvec}{\mathbf{b}} \newcommand{\cvec}{\mathbf{c}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\evec}{\mathbf{e}} \newcommand{\fvec}{\mathbf{f}} \newcommand{\gvec}{\mathbf{g}} \newcommand{\hvec}{\mathbf{h}} \newcommand{\ivec}{\mathbf{i}} \newcommand{\jvec}{\mathbf{j}} \newcommand{\kvec}{\mathbf{k}} \newcommand{\lvec}{\mathbf{l}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\nvec}{\mathbf{n}} \newcommand{\ovec}{\mathbf{o}} \newcommand{\pvec}{\mathbf{p}} \newcommand{\qvec}{\mathbf{q}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\tvec}{\mathbf{t}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\wvec}{\mathbf{w}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\yvec}{\mathbf{y}} \newcommand{\zvec}{\mathbf{z}} \newcommand{\Amat}{\mathbf{A}} \newcommand{\Bmat}{\mathbf{B}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\Dmat}{\mathbf{D}} \newcommand{\Emat}{\mathbf{E}} \newcommand{\Fmat}{\mathbf{F}} \newcommand{\Gmat}{\mathbf{G}} \newcommand{\Hmat}{\mathbf{H}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Jmat}{\mathbf{J}} \newcommand{\Kmat}{\mathbf{K}} \newcommand{\Lmat}{\mathbf{L}} \newcommand{\Mmat}{\mathbf{M}} \newcommand{\Nmat}{\mathbf{N}} \newcommand{\Omat}{\mathbf{O}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\Qmat}{\mathbf{Q}} \newcommand{\Rmat}{\mathbf{R}} \newcommand{\Smat}{\mathbf{S}} \newcommand{\Tmat}{\mathbf{T}} \newcommand{\Umat}{\mathbf{U}} \newcommand{\Vmat}{\mathbf{V}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Xmat}{\mathbf{X}} \newcommand{\Ymat}{\mathbf{Y}} \newcommand{\Zmat}{\mathbf{Z}} \newcommand{\Sigmamat}{\boldsymbol{\Sigma}} \newcommand{\identity}{\Imat} \newcommand{\epsilonvec}{\boldsymbol{\epsilon}} \newcommand{\thetavec}{\boldsymbol{\theta}} \newcommand{\phivec}{\boldsymbol{\phi}} \newcommand{\muvec}{\boldsymbol{\mu}} \newcommand{\sigmavec}{\boldsymbol{\sigma}} \newcommand{\jacobian}{\mathbf{J}} \newcommand{\ind}{\perp!!!!\perp} \newcommand{\bigoh}{\text{O}} \)

tags
Reinforcement Learning
source
http://www.incompleteideas.net/book/the-book-2nd.html

Introduction

We learn by interacting with the world around us. This idea has been recognized at least since Hermann von Helmholtz and his notion of unconscious inference.

Reinforcement Learning

Chapter 2

Finite Markov Decision Processes

This chapter introduces the formal problem of finite Markov decision processes (finite MDPs). You can think of MDPs as the mathematically idealized form of the reinforcement learning problem (though MDPs themselves came much earlier in the development of mathematical models for decision processes and stochastic processes).

The Agent-Environment Interface

The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is the environment. These two systems interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to the rewards, special observations that the agent seeks to maximize over time. More specifically, the agent and environment interact at each of a sequence of discrete time steps, \(t=0,1,2,3,\ldots\). At each step \(t\), the agent receives some representation of the environment’s state, \(S_t \in \states\), and on that basis selects an action, \(A_t \in \actions(S_t)\). At step \(t+1\) the agent receives a numerical reward, \(R_{t+1} \in \rewards \subset \reals\), and a new state \(S_{t+1}\). The assumption is that there exists some relationship between the agent’s action and the environment’s reaction. This gives rise to a trajectory

\[S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots\].

If the MDP is finite, the sets of states, actions, and rewards (\(\states\), \(\actions\), and \(\rewards\)) all have a finite number of elements. The random variables \(R_t\) and \(S_t\) then have well-defined discrete probability distributions dependent only on the preceding state and action:

\[p(s’, r | s, a) \defeq \Pr\{S_t=s’, R_t=r | S_{t-1}=s, A_{t-1} = a\} \quad \triangleright \text{ for all } s’,s\in\states, r\in \rewards, a\in\actions(s) \].

The function \(p: \states\times\rewards\times\states\times\actions \rightarrow [0,1]\) defines the dynamics of the MDP.

Goals and Rewards

The core premise on which reinforcement learning stands is known as the reward hypothesis:

That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Why is the reward hypothesis a hypothesis and not an assumption or axiom?

Returns and Episodes

The agent seeks to maximize the expected return, where the return, denoted \(G_t\), is defined as some specific function of the reward sequence. In the simplest case the return is

\[G_t \defeq R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_T\]

where \(T\) is thought of as the “final” time step, or end of an episode. Each episode ends in a terminal state.

We can also think of continuing tasks where there are no terminal states. Because there is no final time step, the above sum can diverge to infinity. Discounting is one way of keeping the return finite:

\[G_t \defeq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad \text{for } 0\le\gamma<1\].

We can define the return recursively

\[G_t \defeq R_{t+1} + \gamma G_{t+1} \quad \text{where } G_T = 0\].

Because the discounting decays the rewards geometrically, if the reward is a constant +1 the return is simply a geometric series:

\[G_t = \sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma}\].

Unified Notation for Episodic and Continuing Tasks

We can define a return for both episodic and continuing tasks

\begin{equation} \label{eqn:RLI:return} G_t \defeq \sum_{k=t+1}^T \gamma^{k-t-1}R_k \end{equation}

where we include the possibility that \(T=\infty\) or \(\gamma = 1\) (but not both).

Policies and Value Functions

  • value function: an estimate of the expected return; it can be thought of as a measure of “how good” it is for the agent to be in a given state (or to take a given action in that state)
  • policy: a mapping from states to probabilities of selecting each possible action (\(\pi(a|s)\)).

The value function of a state s while following policy \(\pi\) is

\begin{equation} \label{eqn:RLI:valuefunc} v_{\pi}(s) \defeq \expected_{\pi}\left[G_{t} | S_{t}=s\right]=\expected_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} | S_{t}=s\right], \text { for all } s \in \states. \end{equation}

The value of a terminal state (if any exist) is always zero. We say the function \(v_\pi\) is the state-value function for policy π. Similarly, we define the action-value function for policy π as

\begin{equation} \label{eqn:RLI:actionvaluefunc} q_{\pi}(s, a) \defeq \expected_{\pi}\left[G_{t} | S_{t}=s, A_{t}=a\right]=\expected_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} | S_{t}=s, A_{t}=a\right]. \end{equation}

A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy recursive relationships (similar to equation \ref{eqn:RLI:return})! For any policy \(\pi\) and any state \(s\), the following consistency condition holds:

\begin{align} v_{\pi}(s) &\defeq \mathbb{E}_{\pi}\left[G_{t} | S_{t}=s\right] \nonumber \\ &=\expected_{\pi}\left[R_{t+1}+\gamma G_{t+1} | S_{t}=s\right] \nonumber \\ &=\sum_{a} \pi(a | s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right)\left[r+\gamma \expected_{\pi}\left[G_{t+1} | S_{t+1}=s^{\prime}\right]\right] \nonumber\\ &=\sum_{a} \pi(a | s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r | s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right], \quad \text { for all } s \in \mathcal{S} \label{eqn:RLI:bellman} \end{align}

The recursive form defined in equation \ref{eqn:RLI:bellman} is known as the Bellman equation for \(v_\pi\). It expresses the relationship between the value of a state and the values of its successor states. The value function \(v_\pi\) is the unique solution to its Bellman equation.

Optimal Policies and Optimal Value Functions

Value functions define a partial ordering over policies. A policy π is defined to be better than or equal to another policy π’ if its expected return is greater than or equal to that of π’ for all states. More formally, \(\pi\geq\pi’\) iff \(v_\pi(s) \geq v_{\pi’}(s)\) for all \(s\in\states\). We will denote all optimal policies by \(\pi_\star\). All optimal policies share the same value function

\begin{equation} v_\star (s) \defeq \max_\pi v_\pi(s) \quad \text{for all } s\in\states. \end{equation}

We can also define the optimal action-value function

\begin{equation} q_\star (s,a)\defeq \max_\pi q_\pi(s,a) \quad \text{for all } s\in\states, a\in\actions. \end{equation}

We can also define a special Bellman equation for the optimal value function called the Bellman optimality equation

\begin{align} v_\star(s) &= \max_a \sum_{s’,r} p(s’, r | s, a) \left[r+\gamma v_\star(s’)\right] \\ q_\star(s,a) &= \sum_{s’,r} p(s’,r|s,a)\left[r+\gamma \max_{a’}q_\star (s’, a’)\right] \end{align}

For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy. While we can explicitly solve the Bellman optimality equation, doing so relies on a number of assumptions that are rarely true in real-world problems:

  1. we accurately know the dynamics of the environment
  2. we have enough computational resources to complete the computation of the solution
  3. the environment satisfies the Markov property

Many reinforcement learning methods can be clearly understood as approximately solving the Bellman equation, using actual experienced transitions in place of knowledge of the expected transitions.

Optimality and Approximation

We are forced to settle for an approximate solution to the Bellman equation. But this is acceptable. The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states. This is one key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs.

Dynamic Programming

Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP. These algorithms are of limited use to the full RL problem as they require a perfect model and have an immense up-front computational cost. These algorithms are still interesting theoretically, and provide a nice foundation to begin thinking about learning optimal policies. In this chapter, we will limit ourselves to finite MDPs. A key idea of DP, and of RL generally, is the use of value functions to organize and structure the search for good policies.

Policy Evaluation (prediction)

Policy Evaluation: compute the state-value function \(v_\pi\) for an arbitrary policy \(\pi\). Also known as the prediction problem. Recall from Chapter 3:

\begin{align} v_\pi(s) &\defeq \expected_\pi[G_t | S_t=s] \nonumber\\ &= \sum_{a}\pi(a|s)\sum_{s’, r} p(s’, r|s, a)[r+\gamma v_\pi(s’)] \label{eqn:RLI:dp:valuefunc} \end{align}

If the environment’s dynamics are completely known, then \ref{eqn:RLI:dp:valuefunc} is a system of \(|\states|\) simultaneous linear equations in \(|\states|\) unknowns. This calculation is straightforward but tedious. For our purposes, iterative solution methods are most suitable. If \(v_0\) is chosen arbitrarily (except for any terminal states which must be 0), then we can iteratively solve for the value function

\begin{align*} v_{k+1}(s) &\defeq \expected_\pi[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s] \\ & = \sum_{a}\pi(a|s)\sum_{s’, r} p(s’, r|s, a)[r+\gamma v_{k}(s’)] \label{eqn:RLI:dp:iterative} \end{align*}

The sequence \(\{v_k\}\) can be shown to converge to \(v_\pi\) as \(k\rightarrow\infty\) under the same conditions that guarantee the existence of \(v_\pi\). This algorithm is called iterative policy evaluation. The updates can also be done in place, overwriting each value as it is recomputed during a sweep.
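As a concrete sketch of the iterative scheme above, here is a minimal in-place implementation for a tabular MDP. The dictionary-based representation of the dynamics and the function names are assumptions made for illustration, not anything prescribed by the book.

```python
def iterative_policy_evaluation(states, actions, dynamics, pi, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation for a tabular MDP (a sketch).

    dynamics[(s, a)] -> list of (s_next, r, prob) triples   (assumed format)
    pi[s][a]         -> probability of selecting a in state s
    actions(s)       -> sequence of actions available in s (empty for terminal states)
    """
    V = {s: 0.0 for s in states}  # terminal states keep value 0
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman expectation backup using the current estimates of successor values
            V[s] = sum(pi[s][a] * prob * (r + gamma * V[s_next])
                       for a in actions(s)
                       for s_next, r, prob in dynamics[(s, a)])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:  # stop once the largest change in a sweep is below the threshold
            return V
```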

Policy Improvement

Once we have a state value function \(v_\pi\), we would like to use it to find a better policy (if one exists). The most straightforward method is to construct an action value function

\[q_\pi(s,a) \defeq \expected[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t =s, A_t = a].\]

If \(q_\pi(s,a) > v_\pi(s)\) then you can improve the policy \(\pi\) by selecting action \(a\) in state \(s\) and then following \(\pi\) elsewhere. This is a special case of the policy improvement theorem. Let π and π’ be any pair of deterministic policies such that

\[q_\pi(s, \pi’(s)) \ge v_\pi(s) \quad \triangleright \quad \forall s \in \states\].

Then the policy π’ must be as good as, or better than, π, meaning \(v_{\pi’}(s) \ge v_\pi(s)\) for all \(s\in\states\).

Instead of focusing on changing the policy at a single state, we can improve the policy over all states simultaneously by acting greedily with respect to \(q_\pi\): \(\pi’(s) \defeq \argmax_a q_\pi(s,a)\) for all \(s\in\states\).

Policy Iteration

Once a policy π has been improved using \(v_\pi\) to yield a better policy π’, we can then compute \(v_{\pi’}\) and improve it again to yield an even better policy! We can create a chain of evaluations and improvements continuing until we get to a policy which is no longer changing. Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. This is called policy iteration.

\begin{algorithmic} \State 1. Initialization \State $V(s) \in \reals$ and $\pi(s) \in \actions(s)$ arbitrarily for $s\in\states$ \State \State 2. Policy Evaluation \Repeat \State $\Delta \leftarrow 0$ \ForAll{$s \in \states$} \State $v \leftarrow V(s)$ \State $V(s) \leftarrow \sum_{s’,r}p(s’,r|s,\pi(s)) [r+\gamma V(s’)]$ \State $\Delta \leftarrow \max(\Delta, |v-V(s)|)$ \EndFor \Until{$\Delta < \theta$} \State \State 3. Policy Improvement \State $policy\_stable \leftarrow true$ \ForAll{$s\in\states$} \State $old\_action\leftarrow\pi(s)$ \State $\pi(s) \leftarrow \argmax_a \sum_{s’,r}p(s’,r|s,a) [r+\gamma V(s’)]$ \State If $old\_action \neq \pi(s)$, then $policy\_stable \leftarrow false$ \EndFor \State If $policy\_stable$, then stop and return $V\approx v_\star$ and $\pi \approx \pi_\star$; else go to 2. \end{algorithmic}

\caption{Policy Iteration (using iterative policy evaluation) for estimating $\pi \approx \pi^*$}

Value Iteration

Instead of having a full policy evaluation step in-between policy improvement steps, we can truncate policy evaluation to a single sweep through the state-space without changing the convergence properties of the algorithm. This algorithm is known as value iteration and can be written simply as

\[v_{k+1}(s) \defeq \max_a \sum_{s’,r} p(s’, r|s,a)[r+\gamma v_k(s’)] \quad \triangleright \quad \forall s\in\states\].

Value iteration effectively combines one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep. In general, the entire class of truncated policy iteration algorithms can be thought of as sequences of sweeps.

\begin{algorithmic} \State Algorithm parameter: a small threshold $\theta > 0$ determining accuracy of estimation. \State Initialize $V(s)$, for all $s\in\states^+$, arbitrarily except that $V(terminal)=0$ \Repeat \State $\Delta \leftarrow 0$ \ForAll{$s\in\states$} \State $v\leftarrow V(s)$ \State $V(s) \leftarrow \max_a \sum_{s’,r} p(s’, r|s,a)[r+\gamma V(s’)]$ \State $\Delta \leftarrow \max(\Delta, |v-V(s)|)$ \EndFor \Until{$\Delta < \theta$} \State Output a deterministic policy, $\pi \approx\pi_\star$, such that \State $\pi(s) = \argmax_a \sum_{s’,r} p(s’, r|s,a)[r+\gamma V(s’)]$ \end{algorithmic}

\caption{Value Iteration, for estimating $\pi\approx\pi^\star$}

Asynchronous Dynamic Programming

Asynchronous DP algorithms: in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. The only restriction for convergence is that an asynchronous algorithm must continue to update the values of all states.

Generalized Policy Iteration

Generalized policy iteration (GPI): the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes.

Efficiency of Dynamic Programming

If \(n\) and \(k\) denote the number of states and actions, a DP method takes a number of computational operations that is less than some polynomial function of \(n\) and \(k\) (i.e. DP methods run in polynomial time). DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables.

Monte Carlo Methods

Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. These methods are a way to solve the reinforcement learning problem based on averaging sample returns. In this section we only cover Monte Carlo methods for episodic tasks, forgoing continuing tasks entirely.

Monte Carlo Prediction

A Monte Carlo method estimates the value of a particular state by averaging the returns observed after visits to that state.

  • first-visit MC estimates \(v_\pi(s)\) as the average of the returns following first visits to \(s\).
  • every-visit MC averages the returns following all visits to \(s\).

Both of the above methods are well known and converge to \(v_\pi(s)\), albeit with slightly different theoretical properties. It is quite easy to see why first-visit MC converges (by the law of large numbers, since each return is an independent, identically distributed estimate of \(v_\pi(s)\)). Every-visit MC also converges quadratically, but the argument is a bit more involved (Singh and Sutton 1996).

\begin{algorithmic} \State Input: a policy $\pi$ to be evaluated \State Initialize: $V(s)\in\reals$ arbitrarily and $Returns(s) \leftarrow$ an empty list, for all $s\in\states$ \ForAll{$e\in$Episodes} \State Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T$ \State $G \leftarrow 0$ \ForAll{$t \in \{T-1, T-2, \ldots, 0\}$} \State $G\leftarrow \gamma G + R_{t+1}$ \If{$S_t \notin \{S_0, S_1, \ldots, S_{t-1}\}$} \State Append $G$ to $Returns(S_t)$ \State $V(S_t) \leftarrow average(Returns(S_t))$ \EndIf \EndFor \EndFor \end{algorithmic}

\caption{First-visit MC prediction, for estimating $V \approx v_\pi$}

Monte Carlo Estimation of Action Values

Without a model of the environment, state values alone are not sufficient to determine a policy. Instead, one must estimate action-value functions (as discussed in section \ref{sec:RLI:mdps}). The evaluation algorithm is very similar to that for state-value functions. The one modification to the original algorithm is the need for the agent to explore (even when given a deterministic policy) in order to estimate the return for each action from every state.

Monte Carlo Control

\begin{algorithmic} \State Initialize: \State $\quad\pi(s) \in \actions(s)$ (arbitrarily) \State $\quad Q(s,a) \in \reals$ (arbitrarily) \State $\quad Returns(s,a) \leftarrow $ empty list. \For{each episode} \State Choose $S_0 \in \states, A_0 \in \actions(S_0)$ randomly from all states and actions \State Generate an episode from $S_0, A_0$, following $\pi$: $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$ \State $G \leftarrow 0$ \For{$t = \{T-1, T-2, \ldots, 0\}$} \State $G \leftarrow \gamma G + R_{t+1}$ \If{$(S_t, A_t) \notin \{(S_0, A_0), \ldots, (S_{t-1}, A_{t-1})\}$} \State Append $G$ to $Returns(S_t, A_t)$ \State $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ \State $\pi(S_t) \leftarrow \argmax_a Q(S_t, a)$ \EndIf \EndFor \EndFor \end{algorithmic}

\caption{Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_*$}

Monte Carlo Control without Exploring Starts

When the assumption of “exploring starts” cannot be met, the easiest way to ensure that all actions are chosen infinitely often is for the agent to keep selecting all of them with nonzero probability. This section details an on-policy version of the above algorithm which accomplishes this; the next section introduces an off-policy method.

In on-policy control the policy is generally soft (i.e. \(\pi(a | s) > 0 \quad \forall s \in \states \text{ and } a \in \actions\) ) and gradually shifted to the deterministic optimal policy.

This assumes there is an optimal deterministic policy, which depending on the problem setting may not be true. But this assumption is reasonable in the MDP setting as stated.

The on-policy method in this setting uses what are called $ε$-greedy policies. Effectively they assign probability \(1-\epsilon + \frac{\epsilon}{|\actions(s)|}\) to the action with the highest estimated value and \(\frac{\epsilon}{|\actions(s)|}\) to each of the remaining actions. While not shown here, one can find the sketch of a proof showing policy improvement works for $ε$-greedy policies in a similar way to deterministic policies w/ exploring starts in (Sutton and Barto 2018) page 101.
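The probabilities above are easy to write down directly; here is a minimal sketch of $ε$-greedy action selection over a dictionary of estimated action values (the data structure and names are illustrative assumptions).

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Epsilon-greedy distribution: the greedy action gets 1 - eps + eps/|A|,
    every other action gets eps/|A|.  q_values maps action -> estimated value."""
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    probs = {a: epsilon / n for a in q_values}
    probs[greedy] += 1.0 - epsilon
    return probs

def epsilon_greedy_action(q_values, epsilon):
    probs = epsilon_greedy_probs(q_values, epsilon)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]
```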

Off-policy Prediction via Importance Sampling

This section introduces the notion of learning from data generated by a policy different from the policy of interest (e.g. the optimal policy). These learning strategies are called off-policy.

  • target policy \(\pi\): The policy of interest.
  • behavior policy \(b\): The policy used to act in the environment.

In order to use experience generated from \(b\) to learn about policy \(\pi\), the behavior policy must cover the target policy over the state-action space (i.e. \(\pi(a|s) > 0\) implies \(b(a|s) > 0\)). The core of the approach is correcting the mismatch between the two distributions through the importance sampling ratio \(\rho(a, s) = \frac{\pi(a|s)}{b(a|s)}\), which over a trajectory becomes \(\rho_{t:T-1} = \prod_{k=t}^{T-1} \rho(A_k, S_k)\). Importance sampling can then be applied to the expectation over returns (taken under the behavior policy) to recover the target policy’s value function

\[ \expected_b [\rho_{t:T-1} G_t | S_t = s] = v_{\pi}(s). \]

There are several practical problems when applying importance sampling ratios directly. One of particular concern is that ordinary importance sampling can have infinite variance (depending on the problem), which leads to unstable, slow learning. Other estimators have been developed to handle the infinite-variance issue (e.g. weighted importance sampling, resampling).
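To make the distinction between the two estimators concrete, here is a minimal sketch of ordinary and weighted importance sampling for the value of a single start state, assuming episodes generated under the behavior policy b and dictionary-style policies; the data format and names are assumptions for illustration.

```python
def off_policy_mc_estimate(episodes, pi, b, gamma=1.0, weighted=True):
    """Estimate v_pi(S_0) from episodes that all start in the same state S_0.

    episodes : list of episodes, each a list of (S_t, A_t, R_{t+1}) tuples
    pi, b    : policies as nested dicts, pi[s][a] -> probability of a in s
    """
    num, den = 0.0, 0.0
    for episode in episodes:
        G, rho = 0.0, 1.0
        for t, (s, a, r_next) in enumerate(episode):
            rho *= pi[s][a] / b[s][a]        # importance ratio over the whole trajectory
            G += (gamma ** t) * r_next       # discounted return from time 0
        num += rho * G
        den += rho if weighted else 1.0      # WIS normalizes by the ratios, OIS by the count
    return num / den if den != 0 else 0.0
```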

Incremental Implementation

The episode-by-episode incremental version of the Monte Carlo prediction methods is straightforward, taking the form:

\[ V_{n+1} = V_n + \frac{1}{n} [G_n - V_n]. \]

These can also be defined with ordinary importance sampling (OIS) and weighted importance sampling (WIS); see (Sutton and Barto 2018) page 109.

Off-policy Monte Carlo Control

This is a straightforward extension of the off-policy Monte Carlo prediction algorithms presented above to GPI.

Discounting-aware Importance Sampling

Per-decision Importance Sampling

Temporal-Difference Learning

TD Prediction

Advantages of TD Prediction Methods

Optimality of TD(0)

Sarsa: On-policy TD Control

Q-learning: Off-policy TD Control

Expected Sarsa

Maximization Bias and Double Learning

Games, Afterstates, and Other Special Cases

Chapter 7

Planning and Learning with Tabular Methods

This chapter develops a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model.

  • model-free methods: These methods are those discussed in the first 6 chapters of the book, specifically Monte Carlo methods, TD methods, and similar methods not requiring a model of the environment’s dynamics.
  • model-based methods: These methods rely on planning as the main component for training value functions and policies. Dynamic programming can be considered model-based in this context.

Models and Planning

model: A model is a representation of the environment’s dynamics. This can be a non-parametric model such as an experience replay buffer, or a parametric model generating sample transitions. You can also represent a model through predictions in the world (not discussed directly here), or through options in an option-model.

From the book: A model of the environment is anything that an agent can use to predict how the environment will respond to its actions.

  • distribution models: produce a description of all possibilities and their probabilities (type of model assumed by dynamic programming)
  • sample models: produce just one of the possibilities sampled according to the probabilities.

Distribution models are stronger than sample models as they can be used to produce samples, but these types of models are hard to learn/obtain.

We say the model is used to simulate the environment and produce simulated experience.

planning: In this book, the term refers to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.

  • State-space planning: a search through the state space for an optimal policy or an optimal path to a goal.
  • Plan-space planning: a search through the space of plans. These types of methods include evolutionary methods and “partial-order planning”

The view taken in this book is that all state-space planning methods share a common structure, which is also present in the learning methods. There are two basic ideas to this view:

  1. all state-space planning methods involve computing value functions as a key intermediate step toward improving the policy
  2. they compute value functions by updates applied to simulated experience.

Dyna: Integrated Planning, Acting, and Learning

There are at least two roles for real experience:

  • Model-learning: it can be used to improve the model
  • Direct RL: it can be used to directly improve the value function and policy using any model-free method.

When using simulated experience (from the model) to learn, this is called indirect RL.

Dyna-Q includes all of the above processes occurring continually. The planning method is the random-sample one-step tabular Q-planning method (simply sampling a transition from the model for a randomly chosen state-action pair; see page 161 of (Sutton and Barto 2018)). The model-learning method is table-based and assumes the environment is deterministic (making this model-learning approach of little use beyond tabular domains).
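Below is a minimal sketch of the pieces of a single Dyna-Q step: direct RL, table-based model learning, and n random-sample one-step tabular Q-planning updates. The data structures (a dict-of-values Q table, a deterministic table model, a fixed action set) and the step-size and discount defaults are assumptions for illustration.

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha=0.1, gamma=0.95, n_planning=5):
    """One Dyna-Q step after observing the real transition (s, a, r, s_next).

    Q      : mapping (s, a) -> value, e.g. defaultdict(float)
    model  : dict mapping (s, a) -> (r, s_next), a deterministic table model
    actions: sequence of actions (assumed the same in every state)
    """
    # direct RL: one-step tabular Q-learning on the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    # model learning: remember the last observed outcome (deterministic environment assumed)
    model[(s, a)] = (r, s_next)
    # planning: n random-sample one-step tabular Q-planning updates on remembered pairs
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions) - Q[(ps, pa)])

# typical setup (hypothetical): Q = defaultdict(float); model = {}
```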

They show an example of a maze comparing Q-learning (Dyna-Q with \(n=0\) planning steps) against Dyna-Q with \(n=\{5, 50\}\) planning steps. It quite clearly shows the model-based methods learning faster than the model-free method, but this result only generalizes to tabular settings: with a deterministic tabular environment, an accurate model can be built straightforwardly from experience. Unfortunately, such models are hard to learn in more general settings.

When the Model Is Wrong

When the model is incorrect, the planning process is likely to compute a suboptimal policy. This is shown in the blocking maze experiment. Another method, Dyna-Q+, introduces a heuristic (an exploration bonus based on how long ago each state-action pair was last tried) to detect when the environment might have changed, or when the agent should go and explore parts of the environment again. This is again difficult in larger environments, where it is hard for model-based learning to achieve the fidelity needed to perform well.

Specifically, the Dyna-Q+ algorithm keeps track, for each state-action pair, of how many time steps have elapsed since the pair was last tried in a real interaction with the environment. The planning updates then use a reward of \(r + \kappa\sqrt{\tau}\) for these samples, for some small \(\kappa\). This injects a bias into the agent’s policy toward exploring parts of the state space not seen in a while.
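A sketch of that bookkeeping, with the \(\kappa\) value and the last_tried dictionary as illustrative assumptions:

```python
import math

def dyna_q_plus_reward(r, state_action, last_tried, current_step, kappa=1e-3):
    """Dyna-Q+ planning reward: the modelled reward plus kappa * sqrt(tau), where tau
    is the number of real time steps since state_action was last tried for real.
    last_tried maps (s, a) -> time step of the most recent real visit (0 if never)."""
    tau = current_step - last_tried.get(state_action, 0)
    return r + kappa * math.sqrt(tau)
```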

This is fine with a finite state space, but in a large state space this bias could cause the agent to behave poorly. It also assumes any change to the environment is good (worth re-exploring), which is a bold assumption.

Prioritized Sweeping

The main idea is that we want to plan backward from any state whose value has changed. This idea can be termed backward focusing of planning computations. Instead of uniformly sampling across all transitions whose values have changed, we can prioritize according to the urgency of the change (i.e. apply the updates with the most impact first). This is implemented using a priority queue over state-action pairs.

This idea is easily extended to stochastic environments, and we can use similar ideas from Dyna-Q+ for non-stationary environments.
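A minimal sketch of the queue-driven planning loop, assuming the same deterministic table model as in the Dyna-Q sketch above and a predecessors table; heapq is a min-heap, so priorities are stored negated. Everything here is illustrative rather than a definitive implementation.

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, pqueue, actions,
                                  alpha=0.1, gamma=0.95, theta=1e-4, n_updates=10):
    """Process up to n_updates entries from a priority queue of state-action pairs.

    pqueue       : heap of (-priority, (s, a)) entries
    model        : (s, a) -> (r, s_next), a deterministic table model
    predecessors : s -> iterable of (s_bar, a_bar) pairs predicted to lead to s
    """
    for _ in range(n_updates):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        # queue any predecessor whose value change would be large enough to matter
        for s_bar, a_bar in predecessors.get(s, ()):
            r_bar, _ = model[(s_bar, a_bar)]
            priority = abs(r_bar + gamma * max(Q[(s, b)] for b in actions) - Q[(s_bar, a_bar)])
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (s_bar, a_bar)))
```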

Expected vs. Sample Updates

Expected updates take considerably more computation than sample updates. For instance, consider the expected update

\[ Q(s, a) \leftarrow \sum_{s^{\prime}, r} \hat{p}\left(s^{\prime}, r | s, a\right)\left[r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right] \]

and the corresponding sample update

\[ Q(s, a) \leftarrow Q(s, a)+\alpha\left[R+\gamma \max _{a^{\prime}} Q\left(S^{\prime}, a^{\prime}\right)-Q(s, a)\right]. \]

The expected update considers all possible events, taking the expectation over them to get a cleaner update. The sample update only considers a single sample from the environment, and approximates the expectation implicitly over repeated visits to the state’s local neighborhood. The computational difference is significant. If a correct estimate of the transition dynamics is available, and roughly \(|\states \times \rewards|\) times more computation is available, then the expected update would produce the better estimate.

They present a nice plot which illustrates why sample updates are often preferable. In summary, the amount of computation needed to perform an expected update as above becomes very large as the number of states and actions increases. Sample updates almost certainly win in real-world applications.

Trajectory Sampling

This contrasts with the model-based methods presented earlier (specifically Dyna-Q), where a uniform sampling of the state space was performed. Uniform sampling implicitly assigns equal computation to all states, which is undesirable. One alternative is trajectory sampling, where the agent simulates trajectories by following the current policy and updates the states and actions encountered along the way. This is appealing, but it also requires a model which can simulate such trajectories (which is a really difficult ask).

Another approach that I can think of is through an ER buffer, where we select states and actions according to the distribution induced by the behavior. It is difficult to imagine ever being able to sample according to the uniform distribution in large-scale problems, but given the toy examples the on-policy distribution seems to be the correct thing for faster learning.

Real-time Dynamic Programming

This is an example of an asynchronous-DP method. Here the update order is dictated by the order states are visited in real or simulated trajectories.

Properties:

  • RTDP is guaranteed to find a policy that is optimal on the relevant states without visiting every state infinitely often, or even without visiting some states at all (in undiscounted episodic tasks for MDPs with absorbing goal states that generate zero rewards). This convergence happens with probability one when each episode begins in a state randomly chosen from the set of start states and ends at a goal state.
  • As the value function approaches the optimal value function, the policy used by the agent to generate trajectories approaches an optimal policy because it is always greedy wrt the current value function.

Planning at Decision Time

Background planning: Planning to gradually improve the value function or policy on the basis of simulated experience obtained from a model. Selecting actions is then a matter of comparing the current state’s action values. Planning is not focused on the current state.

Decision-time planning: When planning is begun and completed after encountering each new state \(S_t\), as a computation whose output is the selection of a single action \(A_t\). Planning focuses on a particular state (i.e. the current state or those already visited).

For each state encountered, a large tree of possible continuations is considered. The approximate value function is applied to the leaf nodes and then backed up toward the current state at the root.

Rollout Algorithms

These algorithms are based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. They estimate action values for a given policy by averaging the returns of many simulated trajectories that start with each possible action and then follow the given policy. The term “rollout” comes from backgammon, where Tesauro used related ideas in his self-learning backgammon agent (Tesauro 1994).

The goal of a rollout algorithm is to improve on the rollout policy, not to find the one-step optimal policy.

Monte Carlo Tree Search (MCTS) is an example of a rollout algorithm enhanced by the addition of a means for accumulating value estimates obtained from the Monte Carlo simulations in order to successively direct simulations toward more highly-rewarding trajectories. Each iteration consists of four steps:

  1. Selection Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node.
  2. Expansion On some iterations, the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.
  3. Simulation From the selected node, or from one of the added child nodes, simulation of a complete episode is run with actions selected by the rollout policy.
  4. Backup The return generated by the simulated episode is backed up to update, or initialize, the action values attached to the edges of the tree traversed by the tree policy. No values are saved for the states and actions visited by the rollout policy beyond the tree.

On-policy Prediction with Approximation

This chapter begins the study of function approximation in reinforcement learning, where the estimate of the value function \(v_\pi\) (and, later, the policy \(\pi\)) is a parameterized function with weight vector \(\wvec\in\reals^d\), written \(\hat{v}(s,\wvec) \approx v_\pi(s)\). Typically the number of weights is much smaller than the number of states (\(d \ll |\states|\)).

Value-function Approximation

We will refer to an individual update by the notation \(s \mapsto u\) where \(s\) is the state updated, and \(u\) is the update target. For example,

  • the Monte Carlo update for value prediction: \(S_t \mapsto G_t\),
  • the TD(0) update: \(S_t \mapsto R_{t+1} + \gamma\hat{v}(S_{t+1}, \wvec_t)\),
  • n-step TD update: \(S_t\mapsto G_{t:t+n}\).
  • DP policy-evaluation update: \(s\mapsto \expected_{\pi}[R_{t+1} + \gamma \hat{v}(S_{t+1}, \wvec_t) | S_t = s]\).

The Prediction Objective (\(\overline{VE}\))

The most obvious objective is the Mean Squared Value Error \(\overline{VE}\), which is constructed as you would expect, except weighted by the state distribution \(\mu(s)\) induced by the agent’s behavior

\begin{equation} \label{eqn:RLI:VE} \overline{VE}(\wvec) \defeq \sum_{s\in\states} \mu(s) \left[ v_{\pi}(s) - \hat{v}(s, \wvec) \right]^2 \end{equation}

The square root of this measure gives a rough measure of how much the approximate values differ from the true values and is often used in plots.

In an episodic task, the on-policy distribution depends on how the initial states of the episodes are chosen. Let \(h(s)\) be the starting state distribution,

\begin{align*} \eta(s) &= h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_{a} \pi(a|\bar{s}) p(s|\bar{s}, a) \\ \mu(s) &= \frac{\eta(s)}{\sum_{s’} \eta(s’)}, \text{ for all } s\in\states \end{align*}

When there is discounting, \(\gamma\) can be included in the second term of \(\eta(s)\).

Stochastic-gradient and Semi-gradient Methods

This is a pretty straightforward derivation of gradient Monte Carlo and semi-gradient TD(0).

\begin{algorithmic} \Require the policy $\pi$ to be evaluated \Require a differentiable function $\hat{v}: \states \times \reals^d \rightarrow \reals$ \\ \Comment step size $\alpha > 0$\\ \Comment Initialize value-function weights $\wvec\in\reals^d$ arbitrarily \Loop{forever (for each episode)} \State Generate an episode using $\pi$ \For{each step of episode, $t = 0, 1, \ldots, T-1$} \State $\wvec \leftarrow \wvec + \alpha [G_t - \hat{v}(S_t,\wvec)]\nabla \hat{v}(S_t, \wvec)$ \EndFor \EndLoop \end{algorithmic}

\caption{Gradient Monte Carlo Algorithm}

\begin{algorithmic} \Require the policy $\pi$ to be evaluated \Require a differentiable function $\hat{v}: \states \times \reals^d \rightarrow \reals$ \\ \Comment step size $\alpha > 0$\\ \Comment Initialize value-function weights $\wvec\in\reals^d$ arbitrarily \Loop{forever (for each episode)} \State Initialize $S$ \For{each step of episode, until $S$ is terminal} \State Choose $A \sim \pi(\cdot|S)$ \State Take action $A$, observe $R, S’$ \State $\wvec \leftarrow \wvec + \alpha [R + \gamma\hat{v}(S’, \wvec) - \hat{v}(S,\wvec)]\nabla \hat{v}(S, \wvec)$ \State $S \leftarrow S’$ \EndFor \EndLoop \end{algorithmic}

\caption{Semi-gradient TD(0) Algorithm}

Linear Methods

One special case of function approximation is where the approximate value is a linear function of the weight vector.

  • \(\hat{v}(s, \wvec) = \wvec^\trans \xvec(s)\)
  • \(\nabla\hat{v}(s, \wvec) = \xvec(s)\)

\begin{equation}\label{eqn:RLI:linear-update} \wvec_{t+1} = \wvec_t + \alpha [U_t - \hat{v}(S_t, \wvec_t)]\xvec(S_t) \end{equation}
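With the TD(0) target \(U_t = R_{t+1} + \gamma\hat{v}(S_{t+1},\wvec_t)\), a single linear semi-gradient step is only a few lines of numpy; this is a sketch, with the function name and defaults as assumptions.

```python
import numpy as np

def linear_semi_gradient_td0_step(w, x_t, r_next, x_next, alpha=0.01, gamma=0.99,
                                  terminal=False):
    """One linear semi-gradient TD(0) update: w <- w + alpha * delta * x(S_t),
    using the fact that grad v_hat(s, w) = x(s) in the linear case."""
    v_t = w @ x_t
    v_next = 0.0 if terminal else w @ x_next   # bootstrap target is 0 at termination
    delta = r_next + gamma * v_next - v_t      # TD error
    return w + alpha * delta * x_t
```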

Under linear function approximation, the gradient Monte Carlo algorithm presented above converges to the global optimum of the \(\overline{VE}\) objective under the usual stochastic-approximation conditions on the step size.

Semi-gradient TD(0) under linear function approximation also converges, but to the TD fixed point \(\wvec_{TD}\), which in general is not the minimum of \(\overline{VE}\). See the box below for more details.

We first need to rewrite the above update \ref{eqn:RLI:linear-update}

\[\wvec_{t+1} = \wvec_t + \alpha\left( R_{t+1} \xvec_t - \xvec_t(\xvec_t - \gamma\xvec_{t+1})^\trans \wvec_t \right)\]

Once the system reaches a steady state, the expected next weight vector can be written

\[\expected[\wvec_{t+1}|\wvec_t] = \wvec_t + \alpha(\bvec - \Amat\wvec_t)\]

where

\[\bvec \defeq \expected[R_{t+1}\xvec_t] \in \reals^d \text{ and } \Amat\defeq \expected\left[ \xvec_t(\xvec_t - \gamma \xvec_{t+1})^\trans\right]\in\reals^{d\times d} \]

This system must then converge to (if it converges at all) \(\wvec_{TD} \defeq \Amat^\inv \bvec\).

What sorts of properties are required for convergence? By rewriting the expected next weight vector

\[\expected[\wvec_{t+1}|\wvec_t] = (\eye - \alpha\Amat)\wvec_t + \alpha\bvec\]

we can see that \(\Amat\) is the only important component for convergence. Consider the special case where \(\Amat\) is a diagonal matrix. If any of the diagonal elements are negative, then the corresponding diagonal element of \(\eye - \alpha\Amat\) will be greater than one, and the corresponding component of \(\wvec_t\) will be amplified, leading to divergence. In general, we can say \(\wvec_t\) will converge whenever \(\Amat\) is positive definite (i.e. \(y^\trans\Amat y > 0\) for any \(y\neq 0\)). The rest of the proof follows once we know \(\Amat\) is positive definite.

We can rewrite this matrix:

\begin{align*} \Amat &= \sum_s \mu(s) \sum_a \pi(a|s) \sum_{r,s’} p(r, s’|s, a)\xvec(s)(\xvec(s) - \gamma \xvec(s’))^\trans \\ &= \sum_s \mu(s)\xvec(s) \left( \xvec(s) - \gamma \sum_{s’} p(s’|s)\xvec(s’) \right)^\trans \\ &= \Xmat^\trans \Dmat (\eye - \gamma \Pmat)\Xmat \end{align*}

From here, it is clear that the inner matrix \(\Dmat(\eye - \gamma \Pmat)\) is key to determining the positive definiteness of \(\Amat\). It has been shown previously that a matrix of this form is positive definite if all of its columns sum to a nonnegative number; see (Sutton 1988) for a detailed proof. Also note that \(\muvec\), the \(|\states|\)-vector of the distribution \(\mu(s)\), is a stationary distribution, thus \(\muvec = \Pmat^\trans\muvec\). The column sums of the inner matrix are then:

\begin{align*} \mathbf{1}^\trans \Dmat(\eye - \gamma \Pmat) &= \muvec^\trans(\eye - \gamma \Pmat) \\ &= \muvec^\trans - \gamma \muvec^\trans \Pmat \\ &= (1-\gamma)\muvec^\trans \end{align*}

All the components of \((1-\gamma)\muvec^\trans\) are positive, and therefore \(\Amat\) is positive definite, and TD(0) is stable.

It has also been shown that, at the TD fixed point, \(\overline{VE}\) is within a bounded expansion of the lowest possible error:

\[\overline{VE}(\wvec_{TD}) \leq \frac{1}{1-\gamma} \min_{\wvec} \overline{VE}(\wvec)\]

\begin{algorithmic} \Require the policy $\pi$ to be evaluated \Require a differentiable function $\hat{v}: \states \times \reals^d \rightarrow \reals$ \Require All store and access operations ($S_t$ and $R_t$) can take their index mod $n+1$ \\ \Comment step size $\alpha > 0$\\ \Comment Initialize value-function weights $\wvec\in\reals^d$ arbitrarily \For{each episode} \State Initialize and store $S_0 \neq \text{terminal}$ \State $T \leftarrow \infty$ \State $t \leftarrow 0$ \Repeat \If{$t < T$} \State Take an action according to $\pi(\cdot|S_t)$ \State Observe and store $R_{t+1}$ and $S_{t+1}$ \State If $S_{t+1}$ is terminal, then $T \leftarrow t+1$ \EndIf \State $\tau \leftarrow t - n + 1$ \Comment $\tau$ is the time whose state’s estimate is being updated \If{$\tau \geq 0$} \State $G \leftarrow \sum_{i=\tau+1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i$ \State If $\tau + n < T$, then: $G \leftarrow G+\gamma^{n} \hat{v}(S_{\tau + n}, \wvec)$ \State $\wvec \leftarrow \wvec + \alpha [G - \hat{v}(S_\tau,\wvec)]\nabla \hat{v}(S_\tau, \wvec)$ \EndIf \State $t \leftarrow t + 1$ \Until{$\tau = T-1$} \EndFor \end{algorithmic}

\caption{n-step Semi-gradient TD Algorithm}

Feature Construction for Linear Methods

This section is exactly what you would expect. It lays out a number of ways to construct features for a reinforcement learning agent.

  • Polynomials

    This is straightforward, and takes advantage of the Taylor series formulation. This formulation can represent highly-complex interactions among the problem’s state dimensions.

    Suppose each state s corresponds to k numbers, \(\{s_1, s_2, \ldots, s_k\}\), with each \(s_i \in \reals\). For a k-dimensional state space, each order-n polynomial-basis feature \(x_i\) can be written as:

    \[x_i(s) = \prod_{j=1}^k s_j^{c_{i,j}}, \quad \text{where } c_{i,j} \in \{0,1,\ldots,n\}\]

    Just like a Taylor series, including more polynomial terms gives a more accurate approximation, at the cost of the number of order-n features, which grows exponentially with \(k\).

  • Fourier Basis

    This feature creation technique is based on the Fourier series. Any even function (one symmetric about the origin) can be represented with only the cosine terms of the basis, and any odd function with only the sine terms. We focus on the cosine basis here, setting the period \(\tau=2\) so the features are defined over the half-period interval \([0,1]\). The one-dimensional order-n Fourier cosine basis is

    \[x_i(s) = \cos(i\pi s), \quad s\in[0,1],\ i = 0, \ldots, n\]

    This also applies in the multi-dimensional state case, though it is not discussed in these notes (see (Sutton and Barto 2018) page 213); a small sketch of the multi-dimensional construction appears after this list.

    If using a Fourier cosine feature creation method, it has been found useful to define a learning rate for each feature separately. (Konidaris, Osentoski, and Thomas 2011) suggest setting the learning rate parameter for feature \(x_i\) to \(\alpha_i = \alpha/\sqrt{(c^i_1)^2 + \ldots + (c^i_k)^2}\).

  • Coarse Coding

    Coarse coding is a simple idea. The basic idea is to have a bunch of overlapping “areas” (or “receptive fields”), which correspond to features in a large binary feature vector. When a state is within a certain region, the feature is said to be present and has an active unit within the feature vector.

    Properties:

    • If we train at one state, a point in the space, then the weights of all overlapping fields will be affected by a learning update. The approximate value function will be affected at all states within the union of those fields, with a greater effect the more fields a state has “in common” with the trained point.
    • If the RFs are small, the generalization will be over a short distance.
    • If the RFs are large, the generalization will be over a large distance.
    • The shape of the field will determine the nature of the generalization.
    • Features with large receptive fields give broad generalization, but might also seem to limit the learned function to a coarse approximation. This is not necessarily the case as initial generalization from one point to another is controlled by the size and shape of RFs, but acuity is controlled by the total number of overlapping features.
    • Tile Coding

      This is a simple form of coarse coding which automatically creates a set of tilings, each of which tiles the space through state aggregation. The power comes from offsetting the tilings, which creates a number of partially overlapping RFs that can be used in learning.

      I won’t go into details here, but this is obviously one of Rich’s favorites, as the explanation is long and much more complete compared to the other feature creation techniques. If you look at the results, the Fourier basis performs surprisingly similarly to tile coding with 50 tilings in the simple example.

    • Radial Basis Functions

      Radial basis functions (RBFs) are a natural instantiation of coarse coding with continuous-valued features. Rather than a feature being 0 or 1, it can be anything in the interval \([0,1]\), representing the closeness of the state to the center of a receptive field via a Gaussian function. A feature value of 1 means the state lies at the “center” or “prototype” of the Gaussian. This is similar to a kernel representation, although the kernel prototypes here are not necessarily chosen through a principled algorithm. More on kernel-based function approximation below.
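As promised above, here is a small sketch of the multi-dimensional order-n Fourier cosine construction, assuming the state has already been rescaled to \([0,1]^k\); the function name and return convention are illustrative.

```python
import itertools
import numpy as np

def fourier_cosine_features(s, order):
    """Order-n Fourier cosine features for a state s in [0, 1]^k.

    One feature per integer coefficient vector c in {0, ..., order}^k:
    x_c(s) = cos(pi * c . s), giving (order + 1)**k features in total.
    Returns the feature vector and the coefficient vectors (useful for the
    per-feature step sizes suggested by Konidaris et al., alpha / ||c||,
    with the all-zero coefficient vector handled separately)."""
    s = np.asarray(s, dtype=float)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(s))))
    return np.cos(np.pi * coeffs @ s), coeffs
```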

Selecting Step-Size Parameters Manually

An ok rule of thumb is:

\[\alpha \defeq (\tau \expected[\xvec^\trans \xvec])^\inv\].

In coarse coding this simply resolves to \[\alpha = \frac{1}{\tau \times \text{number of active features (tilings)}}\]

Nonlinear Function Approximation: Artificial Neural Networks

This is a straightforward treatment of ANNs, although this book is quite biased towards linear methods, as exemplified by the sheer lack of detail in this section. There are many other books that cover this much more clearly, specifically for RL (François-Lavet et al. 2018).

Least-Squares TD

Least-squares TD is a straightforward algorithm which solves for the fixed point weights (i.e. \(\wvec_{TD} = \Amat^\inv \bvec\)) directly rather than through an iterative process. As a reminder

\[\bvec \defeq \expected[R_{t+1}\xvec_t] \in \reals^d \quad \text{ and } \quad \Amat \defeq \expected\left[ \xvec_t(\xvec_t - \gamma \xvec_{t+1})^\trans\right] \in \reals^{d\times d} \]

which can be estimated relatively straightforwardly given a trajectory of data:

\begin{equation} \widehat{\bvec}_{t} = \frac{1}{t}\sum_{k=0}^{t-1} R_{k+1} \xvec_{k} \quad \text{ and } \quad \widehat{\Amat}_{t} = \frac{1}{t}\sum_{k=0}^{t-1} \xvec_{k}\left(\xvec_{k}-\gamma \xvec_{k+1}\right)^{\top}+\varepsilon \mathbf{I} \end{equation}

We can track just the sums, without taking the average, since the \(1/t\) factors cancel when forming the weight vector.

\[\wvec_t \defeq \widehat{\Amat}_t^\inv \widehat{\bvec}_t \]

The major computational component of this algorithm is the inverse of the matrix \(\widehat{\Amat}\). This can be sidestepped through the Sherman-Morrison formula (see more details in the box).

\begin{equation} \widehat{\Amat}_t^\inv = \widehat{\Amat}_{t-1}^\inv - \frac{\widehat{\Amat}_{t-1}^\inv \xvec_t(\xvec_t - \gamma \xvec_{t+1})^\trans \widehat{\Amat}_{t-1}^\inv} {1 + (\xvec_t - \gamma\xvec_{t+1})^\trans \widehat{\Amat}_{t-1}^\inv \xvec_t} \end{equation}
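Putting the pieces together, one step of incremental LSTD maintains the running inverse directly; this is a sketch in numpy, with the initialization and naming conventions as assumptions.

```python
import numpy as np

def lstd_step(A_inv, b, x_t, r_next, x_next, gamma=0.99):
    """One incremental LSTD update of the running inverse of A_hat and of b_hat,
    using the Sherman-Morrison identity; returns the updated pair.
    A_inv is typically initialized to (1/epsilon) * I for a small epsilon > 0."""
    d = x_t - gamma * x_next                 # the (x_t - gamma * x_{t+1}) vector
    v = A_inv.T @ d                          # so that v^T = d^T A_inv
    A_inv = A_inv - np.outer(A_inv @ x_t, v) / (1.0 + v @ x_t)
    b = b + r_next * x_t
    return A_inv, b

# the weights at any time are then w = A_inv @ b
```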

\begin{theorem} Suppose $\Amat \in \reals^{n \times n}$ is an /invertible square matrix/ and $\uvec, \vvec \in \reals^n$ are column vectors. Then $\Amat + \uvec\vvec^\trans$ is invertible iff $1 + \vvec^\trans\Amat^\inv\uvec \neq 0$. Specifically,

\[(\Amat + \uvec\vvec^\trans)^\inv = \Amat^\inv - \frac{A^\inv \uvec\vvec^\trans \Amat^\inv}{1+\vvec^\trans\Amat^\inv\uvec}\]

\end{theorem}

\begin{proof}

We prove this in the backward direction by verifying the defining property of the inverse. Specifically ($\Xmat = (\Amat + \uvec\vvec^\trans)$),

\begin{align*} \Xmat\Xmat^\inv &= (\Amat + \uvec\vvec^\trans)\left(\Amat^\inv - \frac{A^\inv \uvec\vvec^\trans \Amat^\inv}{1+\vvec^\trans\Amat^\inv\uvec} \right) \\ &= \Amat\Amat^\inv + \uvec\vvec^\trans \Amat^\inv - \frac{\Amat\Amat^\inv\uvec\vvec^\trans\Amat^\inv + \uvec\vvec^\trans\Amat^\inv\uvec\vvec^\trans\Amat^\inv} {1 + \vvec^\trans\Amat^\inv\uvec} \\ &= \eye + \uvec\vvec^\trans \Amat^\inv - \frac{\uvec(1 + \vvec^\trans\Amat^\inv\uvec)\vvec^\trans\Amat^\inv} {1 + \vvec^\trans\Amat^\inv\uvec} \\ &= \eye + \uvec\vvec^\trans \Amat^\inv - \uvec\vvec^\trans \Amat^\inv \hspace{1cm} \triangleright 1 + \vvec^\trans \Amat^\inv \uvec \in \reals, \neq 0 \\ &= \eye \end{align*}

Conversely, if $1 + \vvec^\trans \Amat^\inv \uvec = 0$, then letting $x=\Amat^\inv\uvec$ (which is nonzero, since otherwise $1 + \vvec^\trans\Amat^\inv\uvec = 1$), we see that $(\Amat + \uvec\vvec^\trans)x = \uvec(1 + \vvec^\trans\Amat^\inv\uvec) = 0$, and therefore $(\Amat + \uvec\vvec^\trans)$ has a non-trivial kernel and is not invertible.

\href{https://en.wikipedia.org/wiki/Sherman–Morrison_formula}{wikipedia}

\end{proof}

Memory-based Function Approximation

This approach does not update a parametric function towards some objective. Instead it stores examples of transitions/states and their corresponding value estimate. When a new state arrives the value is estimated through a set of examples which are similar to the state. This is sometimes called lazy learning.

Some approaches:

  • Nearest neighbor
  • Weighted average
  • Locally weighted regression

Kernel-based Function Approximation

This is the typical Kernel regression formulation.

Looking Deeper at On-policy Learning: Interest and Emphasis

Instead of treating every state uniformly, we can weight states according to an interest \(I_t\), indicating the degree to which we are interested in accurately valuing the state at time \(t\). The emphasis \(M_t\) then scales the learning update:

\begin{align*} \wvec_{t+n} &\defeq \wvec_{t+n-1} + \alpha M_t[G_{t:t+n} - \hat{v}(S_t, \wvec_{t+n-1})]\nabla \hat{v}(S_t, \wvec_{t+n-1}), 0\leq t < T \\ M_t &= I_t + \gamma^n M_{t-n}, \quad \text{with } M_t = 0 \text{ for } t < 0 \end{align*}

Chapter 10

Chapter 11

TODO Eligibility Traces

Policy Gradient Methods

This chapter considers policy gradient methods, or methods which search for the policy directly rather than through action-value methods. Policy gradient methods directly parameterize a policy to select actions (either deterministically or through some stochastic process).

The notation is standard in this chapter:

  • \(\thetavec \in \reals^{d’}\) for the policy’s parameter vector,
  • \(\pi(a|s,\thetavec) = \text{Pr}\{A_t=a | S_t=s, \thetavec_t=\thetavec\}\) for the probability that action \(a\) is taken at time \(t\) given the parameterization and the state of the environment,
  • \(\wvec \in \reals^d\) is the value function parameterization where \(\hat{v}(s,\wvec)\) is the estimated value.

The methods considered here look to maximize some performance measure \(J(\thetavec)\) with respect to the policy parameter through gradient ascent.

\[\Delta\thetavec = \alpha \nabla J(\thetavec_t) \approx \alpha \hat{\nabla J(\thetavec_t)}\]

where \(\hat{\nabla J(\thetavec_t)}\) is the stochastic estimate of the gradient wrt the objective and policy parameterization.

  • Policy gradient method: any method which directly parameterizes a policy and learns through maximizing some objective function.
  • actor-critic method: A policy gradient method which also learns an estimate of the value function.

The only requirement for a policy gradient method is \(\pi(a|s, \thetavec)\) must be continuous and differentiable wrt its parameters over all \(a\in\actions, s\in\states\). We also never want the policy to become deterministic!

Policy approximation and its advantages

The most natural parameterization is to form parameterized numerical preferences \(h(s,a,\thetavec)\in\reals\). The policy is then defined through the soft-max distribution, called the soft-max in action preferences:

\[\pi(a|s, \thetavec) = \frac{e^{h(s,a,\thetavec)}}{\sum_{a’} e^{h(s,a’,\thetavec)}}\].

Some advantages:

  • The approximate policy can approach a deterministic policy (as opposed to something like ε-greedy).
  • Enables the selection of actions with arbitrary probabilities. In problems with significant function approximation, the best approximate policy may be stochastic (for example Poker).
  • The policy may be a simpler function to approximate as opposed to the value function.
  • The selection of a policy parameterization could be a good way of injecting prior knowledge about the desired form of the policy.

The Policy Gradient Theorem

With a continuous policy parameterization the action probabilities change smoothly as a function of the learned parameters! Because of this, much stronger theoretical guarantees are available for policy-gradient methods than for action-value methods.

Episodic case: The performance measure: \[J(\thetavec) = v_{\pi_{\thetavec}}(s_0)\]

The challenge in policy gradients is that the performance depends on both the action selections and the distribution of states in which those selections are made. Unfortunately, both of these are affected by the policy parameter, which means a straightforward gradient estimate requires intimate knowledge of the state distribution induced by the current policy.

The policy gradient theorem for the episodic case establishes an expression for the gradient which does not depend on the state distribution.

\[\nabla J(\thetavec) \propto \sum_s \mu(s) \sum_a q_{\pi}(s,a) \nabla \pi(a|s,\thetavec)\]

The constant of proportionality is the average length of an episode.

REINFORCE

All-actions Method: The rhs of the policy gradient theorem is a sum over states weighted by how often the states occur under the target policy \(\pi\); if \(\pi\) is followed, then states will be encountered in these proportions, therefore

\[\nabla J(\thetavec) \propto \mathbb{E}_\pi \left[\sum_a q_{\pi}(S_t, a) \nabla \pi(a|S_t, \thetavec) \right]\]

We could stop here and instantiate stochastic gradient ascent with an approximate action-value function (the all-actions method). Alternatively, we can continue the derivation: replace the sum over actions by sampling \(A_t\sim\pi\) (multiplying and dividing by \(\pi(a|S_t,\thetavec)\)), and replace \(q_\pi(S_t,A_t)\) with the return \(G_t\), since \(\expected_\pi[G_t|S_t,A_t] = q_\pi(S_t,A_t)\). This results in

\[\nabla J(\thetavec) \propto \mathbb{E}_\pi \left[ G_t \frac{\nabla \pi(A_t | S_t, \thetavec)}{\pi(A_t | S_t, \thetavec)} \right]\]

yielding the REINFORCE with Monte Carlo update (Williams 1992):

\[\Delta \thetavec_t = \alpha G_t \frac{\nabla \pi(A_t | S_t, \thetavec)}{\pi(A_t | S_t, \thetavec)} \]

We can also compare the return against an arbitrary baseline \(b(S_t)\), resulting in the update

\[\Delta \thetavec_t = \alpha (G_t - b(S_t)) \frac{\nabla \pi(A_t | S_t, \thetavec)}{\pi(A_t | S_t, \thetavec)} \]

This baseline can be learned or set.
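To make the update concrete, here is a minimal sketch of REINFORCE with baseline for a linear soft-max-in-preferences policy, where \(h(s,a,\thetavec) = \thetavec^\trans \xvec(s,a)\). The feature layout, shapes, and step size are assumptions made for illustration.

```python
import numpy as np

def softmax_probs(theta, action_feats):
    """Action probabilities from linear preferences h(s, a, theta) = theta . x(s, a).
    action_feats has shape (num_actions, d), one row of features x(s, a) per action."""
    prefs = action_feats @ theta
    prefs -= prefs.max()                     # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_with_baseline(theta, episode_feats, actions_taken, returns, baselines,
                            alpha=1e-3):
    """One Monte Carlo REINFORCE pass over a completed episode.

    episode_feats[t] : (num_actions, d) feature matrix for state S_t
    actions_taken[t] : index of A_t
    returns[t]       : G_t computed from the episode
    baselines[t]     : b(S_t), e.g. a learned state-value estimate
    """
    for feats, a, G, b in zip(episode_feats, actions_taken, returns, baselines):
        pi = softmax_probs(theta, feats)
        # grad log pi(a|s) for the linear soft-max: x(s,a) - sum_b pi(b|s) x(s,b)
        grad_log_pi = feats[a] - pi @ feats
        theta = theta + alpha * (G - b) * grad_log_pi
    return theta
```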

Actor-critic

We can use the same bootstrapping methods as in TD(0), Sarsa(0), and related algorithms to estimate the return, using this estimate as a critic in the policy gradient framework.

\begin{align*} \Delta \thetavec_t &= \alpha (G_{t:t+1} - \hat{v}(S_t, \wvec)) \frac{\nabla \pi(A_t | S_t, \thetavec)}{\pi(A_t | S_t, \thetavec)} \\ &= \alpha (R_{t+1} + \gamma \hat{v}(S_{t+1}, \wvec) - \hat{v}(S_t, \wvec)) \frac{\nabla \pi(A_t | S_t, \thetavec)}{\pi(A_t | S_t, \thetavec)} \\ &= \alpha \delta_t \frac{\nabla \pi(A_t | S_t, \thetavec)}{\pi(A_t | S_t, \thetavec)} \end{align*}

The generalization to n-step and λ-return methods is straightforward and not covered in these notes; see page 332 of (Sutton and Barto 2018).
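A one-step actor-critic update can be sketched in the same style: the critic is a linear value function trained with semi-gradient TD(0), and its TD error drives the actor. Shapes, names, and step sizes are again illustrative assumptions.

```python
import numpy as np

def actor_critic_step(theta, w, action_feats, x_s, a, r_next, x_s_next,
                      alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99, terminal=False):
    """One-step actor-critic with a linear critic v_hat(s, w) = w . x(s) and a
    linear soft-max-in-preferences actor.

    action_feats : (num_actions, d') matrix whose rows are x(S_t, a)
    x_s, x_s_next: critic feature vectors for S_t and S_{t+1}
    """
    # critic: one-step TD error and semi-gradient TD(0) update
    v_next = 0.0 if terminal else w @ x_s_next
    delta = r_next + gamma * v_next - w @ x_s
    w = w + alpha_w * delta * x_s
    # actor: grad log pi for the linear soft-max is x(s,a) - sum_b pi(b|s) x(s,b)
    prefs = action_feats @ theta
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    grad_log_pi = action_feats[a] - pi @ action_feats
    theta = theta + alpha_theta * delta * grad_log_pi
    return theta, w
```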

Policy Gradient for Continuing Problems

The performance metric for the continuing problem is in terms of the average rate of reward per time step:

\[J(\thetavec) \defineq r(\pi) \defineq \lim_{h\rightarrow\infty} \frac{1}{h}\sum_{t=1}^{h} \expected [R_t| S_0, A_{0:t-1} \sim \pi] \]

As in the episodic case, the policy gradient theorem results in a gradient equation which is devoid of the gradient of the stationary distribution, and is in fact the same as above.

Policy Parameterization for Continuous Actions

We can parameterize the policy according to the parameters of a distribution, or other such approaches. This is pretty straightforward.

Chapter 14

Chapter 15

Chapter 16

References

François-Lavet, Vincent, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. 2018. “An Introduction to Deep Reinforcement Learning.” Foundations and Trends in Machine Learning.
Konidaris, George, Sarah Osentoski, and Philip Thomas. 2011. “Value Function Approximation in Reinforcement Learning Using the Fourier Basis.” In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Singh, Satinder P., and Richard S. Sutton. 1996. “Reinforcement Learning with Replacing Eligibility Traces.” Machine Learning.
Sutton, Richard S. 1988. “Learning to Predict by the Methods of Temporal Differences.” Machine Learning.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Second edition. Adaptive Computation and Machine Learning Series. Cambridge, Massachusetts: The MIT Press.
Tesauro, Gerald. 1994. “TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play.” Neural Computation.
Williams, Ronald J. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” In Reinforcement Learning. The Springer International Series in Engineering and Computer Science. Springer, Boston, MA.