white2015developing: Developing a predictive approach to knowledge

tags
Reinforcement Learning, General Value Functions
source
thesis
authors
White, A.
year
2015

This is a thesis from Adam White developing what predictive knowledge is and how it can specify world knowledge. He develops a system, Horde, which is a knowledge representation built from value-function predictions. The main premise of this knowledge representation is that value-function predictions can encompass all forms of predictive knowledge.

Sensorimotor data streams and robots

This chapter describes learning predictions from and about low-level data produced by robots. It describes two robots capable of generating data suitable for this task, which are used in later chapters.

Learning about sensorimotor data

The data is a stream of instantaneous, uninterpreted sensor readings generated by a mobile robot. The agent wants to predict the sensorimotor data itself, not external variables such as the robot's coordinates.

Some stats on the Critterbot

  • The critterbot can produce 19 million sensor readings an hour
  • 150 million readings a day (8-hour cycles)
  • 222 gigabytes of raw sensor data a year
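As a rough back-of-the-envelope check of these figures (my own arithmetic, not from the thesis):

```python
# Rough consistency check of the Critterbot data-rate figures above.
readings_per_hour = 19_000_000
readings_per_day = readings_per_hour * 8      # 152 million, ~150 million as stated
readings_per_year = readings_per_day * 365    # ~55 billion readings
bytes_per_year = 222 * 10**9                  # 222 GB of raw data a year
print(bytes_per_year / readings_per_year)     # ~4 bytes per reading on average
```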

The data collected is very complex and can be arbitrarily different under different conditions and policies.

General Value Functions

This chapter lays out the foundations of General Value Functions (GVFs). We will discuss this in detail, as the formulation is useful beyond the thesis.

The learning system receives a stream of sensorimotor data. Its objective is to estimate a scalar signal \(G_t \in \mathbb{R}\), the target. The agent makes a prediction \(V_t \in \mathbb{R}\) of the target at each time step \(t=1,2,3,\ldots\). We can measure how accurate a prediction is through simple metrics such as the error \(V_t - G_t\), assuming we have access to the exact target. The main crux here is that the target can be thought of as specifying a question about the agent's interaction with the world, and the prediction as the agent's answer to that question.

We define the target in terms of a Monte Carlo return, and later reformulate it in the typical TD-target framework. The cumulant (playing the role of the reward) is \(Z_t \in \mathbb{R}\). The target accumulates future values of the cumulant (much like a Monte Carlo return accumulates the reward). As in typical RL frameworks, we define a termination signal (discount factor) \(\gamma_t \in [0,1]\). The target is then built in the typical fashion

\[G_t = Z_{t+1} + \gamma_{t+1}Z_{t+2} + \gamma_{t+1}\gamma_{t+2}Z_{t+3} + \ldots = \sum_{k=0}^{\infty}\left(\prod_{j=1}^k \gamma_{t+j}\right) Z_{t+k+1}\].
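As a small illustration of how such a target could be computed from a recorded trajectory of cumulants and termination signals (the function name and the truncation to a finite trajectory are my own):

```python
def gvf_return(cumulants, gammas):
    """Monte Carlo GVF target G_t for a recorded (finite) trajectory.

    cumulants[k] holds Z_{t+k+1} and gammas[k] holds gamma_{t+k+1}.
    Uses the recursion G_t = Z_{t+1} + gamma_{t+1} * G_{t+1}, which
    unrolls to the sum above.
    """
    g = 0.0
    for z, gamma in zip(reversed(cumulants), reversed(gammas)):
        g = z + gamma * g
    return g

# e.g. a constant-termination (gamma = 0.9) prediction of some sensor signal
print(gvf_return([1.0, 0.5, 0.0, 2.0], [0.9] * 4))
```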

With these components we can finally define a GVF as a function \(v : \mathcal{S} \rightarrow \mathbb{R}\) with three auxiliary functional inputs:

  • a target policy \(\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]\),
  • a termination (or continuation) function \(\gamma: \mathcal{S} \rightarrow [0,1]\),
  • and a cumulant function \(z: \mathcal{S} \rightarrow \mathbb{R}\).

The GVF specifies the expected value of the target, with actions selected according to the target policy:

\[v(s; \pi, \gamma, z) \triangleq \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] \]

This can also be defined in terms of state-action value functions, as expected.
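To make the "question" reading concrete, here is a minimal sketch of a GVF specified purely by its three auxiliary functions (the class, action label, sensor name, and example values are my own illustration, not the thesis's notation):

```python
from dataclasses import dataclass
from typing import Any, Callable

State, Action = Any, Any

@dataclass
class GVFQuestion:
    """A GVF question: how much cumulant will accumulate if the agent
    follows pi, with termination governed by gamma?"""
    pi: Callable[[State, Action], float]  # target policy pi(a | s)
    gamma: Callable[[State], float]       # continuation/termination function
    z: Callable[[State], float]           # cumulant function

# Example question: "how much light will the robot sense over a ~10-step
# horizon if it always drives forward?" (hypothetical action and sensor names)
light_ahead = GVFQuestion(
    pi=lambda s, a: 1.0 if a == "forward" else 0.0,
    gamma=lambda s: 0.9,                  # constant gamma, ~1/(1-0.9) = 10 steps
    z=lambda s: s["light_sensor"],
)
```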

Learning GVFs

The thesis uses the usual approximation notation \(\hat{v}(s, \mathbf{w}) \approx v(s; \pi, \gamma, z)\), where \(\mathbf{w} \in \mathbb{R}^n\). It primarily uses linear function approximation, \(V_t \triangleq \hat{v}(S_t, \mathbf{w}_t) = \mathbf{x}_t^\top \mathbf{w}_t\). Because of this formulation, we can learn any GVF using typical RL algorithms (TD(\(\lambda\)), GTD(\(\lambda\)), GQ(\(\lambda\)), \(\ldots\)).
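As a sketch, an on-policy TD(\(\lambda\)) update for one such linear GVF might look like the following (off-policy questions would instead use GTD(\(\lambda\)) or GQ(\(\lambda\)) with importance-sampling corrections; the function and argument names are my own):

```python
import numpy as np

def td_lambda_gvf_update(w, e, x, x_next, z_next, gamma, gamma_next, lam, alpha):
    """One linear TD(lambda) step for a GVF with v_hat(s, w) = x(s)^T w.

    w          : weight vector (n,)
    e          : accumulating eligibility trace (n,)
    x, x_next  : feature vectors for S_t and S_{t+1}
    z_next     : cumulant Z_{t+1}
    gamma      : continuation at S_t; gamma_next at S_{t+1}
    """
    delta = z_next + gamma_next * x_next.dot(w) - x.dot(w)  # TD error
    e = gamma * lam * e + x                                  # decaying trace
    w = w + alpha * delta * e
    return w, e
```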

Another important quality is that of independence of span (van Hasselt and Sutton 2015). We will not discuss this in detail, but at a basic level the TD algorithm (with some minor corrections) emerges from an analytical study looking for algorithms that can predict independent of span.

A Horde of Demons

Parallel GVF learning: many independent demons, each learning one GVF, are updated from the same stream of experience. Pretty straightforward; a sketch follows below.
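A minimal sketch of the Horde idea, reusing the hypothetical `td_lambda_gvf_update` above: each demon owns one GVF question and its own weights, and every demon updates from the same shared sensorimotor transition. (Real Horde demons learn off-policy about many target policies via GTD(\(\lambda\)); this on-policy sketch omits those corrections.)

```python
import numpy as np

class Demon:
    """One GVF learner: a question (gamma, z) plus its own weights and trace."""
    def __init__(self, n_features, gamma_fn, z_fn, lam=0.9, alpha=0.1):
        self.w = np.zeros(n_features)
        self.e = np.zeros(n_features)
        self.gamma_fn, self.z_fn = gamma_fn, z_fn
        self.lam, self.alpha = lam, alpha

    def update(self, s, x, s_next, x_next):
        self.w, self.e = td_lambda_gvf_update(
            self.w, self.e, x, x_next, self.z_fn(s_next),
            self.gamma_fn(s), self.gamma_fn(s_next), self.lam, self.alpha)

def horde_step(demons, s, x, s_next, x_next):
    # All demons learn in parallel from the same transition.
    for demon in demons:
        demon.update(s, x, s_next, x_next)
```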

Papers

Nexting

This is a term used in psychology to refer to the natural inclination of people and animals to continually predict what will happen next. This chapter goes into detail about the nexting paper (Modayil, White, and Sutton 2014). We will not discuss this in detail.

Experiments with GVFs on robots

This chapter provides more empirical evidence of real-time GVF learning on a mobile robot, across a wide variety of signals and question functions.

Estimating off-policy progress

This chapter describes RUPEE (Recent Unsigned Projected Error Estimate), an online measure of how much an off-policy demon has learned.

Adapting the behavior policy

This chapter describes how the behavior policy can be adapted based on the agent's demons.

References

Hasselt, Hado van, and Richard S. Sutton. 2015. “Learning to Predict Independent of Span.” arXiv:1508.04582 [Cs]. https://arxiv.org/abs/1508.04582.
Modayil, Joseph, Adam White, and Richard S Sutton. 2014. “Multi-Timescale Nexting in a Reinforcement Learning Robot.” Adaptive Behavior.
