sutton2011horde: Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction
- tags: Reinforcement Learning, General Value Functions
- source: paper
- authors: Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., & Precup, D.
- year: 2011
This paper focuses on building world knowledge through value-function predictions on the sensorimotor stream. This new class of sensorimotor predictions is termed General Value Functions (GVFs), and the authors posit that it can encapsulate a model of the world when trained with gradient temporal-difference methods. They also posit that the architecture, called the Horde, is a step towards a real-time architecture for efficient learning.
The Problem of Expressive and Learnable Knowledge
They claim that knowledge representation is hard, which is reasonable and a claim I agree with. They posit that prior approaches are limited either by their abstraction away from the data stream (first-order predicate logic) or by their lack of generality (differential equations and state-transition matrices). They argue that a class of value predictions will produce a more general form of knowledge (predictive knowledge), which will better encapsulate the agent’s knowledge of the world. Several other approaches and theories of representing knowledge learned through the sensorimotor stream have been proposed, including: (Drescher 1991; Cunningham 2013; Becker 1973).
- (Drescher 1991) explored a simulated robot baby learning conditional probability tables for boolean events
- (Ring 1997) explored continual learning of a hierarchical representation of sequences
These systems, among others, learned knowledge but remained far from learning from sensorimotor data. The authors claim that a knowledge representation built on predictions made by value functions will be broader.
Value functions as semantics
They argue that value functions have a clear semantic meaning (or “truth”) grounded in the sensorimotor data. This “truth” or “accuracy” comes from the underlying definition of a value function as an expected discounted sum of cumulants into the future. While this is convenient, and the meaning of accuracy here is valid, it does not by itself imply that value functions have semantic meaning. It is also a bit unclear whether it is useful to describe knowledge in terms of semantic meaning.
The main hypothesis of the work is that the value-function approach can serve as a theory spanning all of world knowledge.
They then go on to define value functions and the reinforcement learning setting; these terms are standard. The quantity of interest is the discounted return under a given specification. They ground the answer to a specification in data through the expected squared error between the estimate and the true return. Because all specifications have “answers” in this way, a knowledge representation built on value-function predictions will be grounded by the true answers (“grounding semantics”) of knowledge about the reward function.
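For reference, the standard quantities this grounding argument relies on can be written as below. The notation (\(G_t\), \(v^{\pi}\), \(\hat{v}\), and a constant discount \(\gamma\)) is the usual textbook one and is an assumption here, not copied from the paper.

\[
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad
v^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
\]

\[
\text{error}(\hat{v}) = \mathbb{E}\left[ \left( \hat{v}(S_t) - G_t \right)^{2} \right]
\]

Under this reading, an estimate \(\hat{v}\) is “accurate” to the extent that this expected squared error is small; the return \(G_t\) is observable from the data stream, which is what makes the answer checkable.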
From Values to Knowledge (General Value Functions)
By extending the prior notions of reward, termination, and policy to a more general cumulant, pseudo-termination function, and pseudo-policy, the grounding semantics of reward knowledge extend to grounding semantics of sensorimotor knowledge. These extensions also lead to a more general class of predictions made through temporal-difference learning.
There are specifically two types of GVFs, which must be separated in learning:
- Predictive knowledge: GVFs with fixed policies, which predict the outcome of acting according to a policy.
- Control (procedural) knowledge: GVFs which look to maximize some signal.
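One way to write the generalization described above is the recursive form below. This is a sketch in assumed notation: \(r\) is the cumulant (pseudo-reward), \(z\) the terminal value received on pseudo-termination, and \(\gamma(s)\) the probability of continuing from state \(s\); the exact form in the paper may differ.

\[
q^{\pi, r, z, \gamma}(s, a) = \mathbb{E}\left[\, r(S_{t+1}) + \bigl(1 - \gamma(S_{t+1})\bigr)\, z(S_{t+1}) + \gamma(S_{t+1})\, q^{\pi, r, z, \gamma}(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\ A_t = a,\ A_{t+1} \sim \pi(S_{t+1}, \cdot) \right]
\]

With \(r\) set to the environment reward, \(z = 0\), and a constant \(\gamma\), this reduces to the ordinary action-value function. For a predictive GVF, \(\pi\) is fixed; for a control GVF, \(\pi\) would be taken as greedy with respect to \(q\) itself.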
The Horde Architecture
The Horde architecture is simply a large collection of GVFs (both predictive and control), which is said to encapsulate world knowledge. In this architecture they use GQ(λ) to train the “answer” function from the semantics of the “predictive question”. The GVFs are trained in a massively parallel scheme with online updates.
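The parallel structure can be illustrated with a small sketch. Note the simplifications: it uses on-policy linear TD(0) rather than GQ(λ), and the `Demon` class, the four-state cyclic chain, and both questions are hypothetical, chosen only so that the true answers are easy to verify by hand.

```python
from dataclasses import dataclass
from typing import Callable, List

Features = List[float]


@dataclass
class Demon:
    """One GVF learner: a question (cumulant, gamma, z) plus a linear answer."""
    cumulant: Callable[[Features], float]        # pseudo-reward r
    gamma: Callable[[Features], float]           # continuation probability
    terminal_value: Callable[[Features], float]  # z, received on termination
    weights: List[float]
    step_size: float = 0.5

    def predict(self, x: Features) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x: Features, x_next: Features) -> None:
        # TD(0) update (a stand-in for GQ(lambda), which the paper uses).
        g = self.gamma(x_next)
        target = (self.cumulant(x_next)
                  + (1.0 - g) * self.terminal_value(x_next)
                  + g * self.predict(x_next))
        delta = target - self.predict(x)
        self.weights = [w + self.step_size * delta * xi
                        for w, xi in zip(self.weights, x)]


def one_hot(state: int, n: int = 4) -> Features:
    return [1.0 if i == state else 0.0 for i in range(n)]


def run_horde(cycles: int = 200) -> List[Demon]:
    n = 4
    # Question 1: how many steps until state 3 is reached?
    steps_to_state3 = Demon(
        cumulant=lambda x: 1.0,
        gamma=lambda x: 0.0 if x[3] == 1.0 else 1.0,
        terminal_value=lambda x: 0.0,
        weights=[0.0] * n,
    )
    # Question 2: discounted count of future visits to state 0.
    visits_to_state0 = Demon(
        cumulant=lambda x: x[0],
        gamma=lambda x: 0.5,
        terminal_value=lambda x: 0.0,
        weights=[0.0] * n,
    )
    horde = [steps_to_state3, visits_to_state0]

    state = 0
    for _ in range(cycles * n):
        next_state = (state + 1) % n  # deterministic cycle 0 -> 1 -> 2 -> 3 -> 0
        x, x_next = one_hot(state), one_hot(next_state)
        for demon in horde:  # every demon learns from the same transition
            demon.update(x, x_next)
        state = next_state
    return horde
```

The point of the design is that all demons share one stream of experience and differ only in their question functions, so adding a prediction costs one more update per time step.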
Results
They perform experiments on the Critterbot to show the “success” of the architecture.
Prediction
They provide experimental evidence of the Horde making accurate predictions about the time-to-obstacle and time-to-stop.
- Time-to-obstacle: the question functions are \(\pi(s, \text{forward}) = 1\), \(r(s) = 1\), and \(z(s) = 0\) for all \(s \in \mathcal{S}\); \(\gamma(s) = 0\) (termination) when the IR sensor reading is over a set threshold, and \(\gamma(s) = 1\) otherwise.
- Time-to-stop: semantically the same, except that the policy is now the stopping policy, the reward is 0, and the question terminates when the agent stops (velocity is below some threshold).
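To make the time-to-obstacle question concrete, the sketch below computes its “true answer” (the return) on a recorded trajectory. It assumes the question terminates (\(\gamma = 0\)) once the IR reading exceeds the threshold and continues (\(\gamma = 1\)) otherwise, with cumulant 1 and \(z = 0\). The function name, the readings, and the threshold are all hypothetical.

```python
from typing import List, Sequence


def time_to_obstacle_returns(ir_readings: Sequence[float],
                             threshold: float) -> List[float]:
    """Return G_t for each transition t -> t+1 of the trajectory.

    G_t = 1 + gamma(s_{t+1}) * G_{t+1}, with gamma = 0 once the next IR
    reading exceeds the threshold. The return is truncated at the end of
    the recording (assumption: nothing is known beyond the last sample).
    """
    returns = [0.0] * max(len(ir_readings) - 1, 0)
    g_next = 0.0  # truncated bootstrap at the end of the recording
    for t in range(len(ir_readings) - 2, -1, -1):
        gamma_next = 0.0 if ir_readings[t + 1] > threshold else 1.0
        g = 1.0 + gamma_next * g_next
        returns[t] = g
        g_next = g
    return returns


# The robot drives forward and the reading first exceeds 100 at the last sample.
print(time_to_obstacle_returns([10, 20, 40, 80, 160], threshold=100))
# prints [4.0, 3.0, 2.0, 1.0]
```

Each return counts the steps until the threshold is crossed, which is the sense in which a learned prediction for this question can be checked against the data.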
Off-policy learning of multiple spinning control policies
They show they can learn policies off-policy.
Light seeking
They show they can learn a policy to maximize a signal.
Thoughts
- The use of “accurate world knowledge” is devoid of meaning in this context. What is world knowledge, how is it related to a more general sense of knowledge, and by what measure is this world knowledge accurate? Because the paper starts with broad, overly general terms and does not define them, it is hard to find the true meaning and ideals of the work.
- The claim that a knowledge system built on value-function predictions will provide a more general representation of knowledge is not well supported; it is stated as an assumption without empirical evidence. While the claim may be true, the lack of support is troubling.
- Defining value functions as having semantic meaning is unfortunate and leads people to think about them in terms of symbols (symbols whose true “value” is learned, but symbols nonetheless).
- They also tacitly imply that value functions are the only form of knowledge considered that has semantic meaning grounded in the data; I am unsure whether this is accurate.