reed2022generalist: A Generalist Agent

Feb 21, 2025

\( \newcommand{\states}{\mathcal{S}} \newcommand{\actions}{\mathcal{A}} \newcommand{\observations}{\mathcal{O}} \newcommand{\rewards}{\mathcal{R}} \newcommand{\traces}{\mathbf{e}} \newcommand{\transition}{P} \newcommand{\reals}{\mathbb{R}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\complexs}{\mathbb{C}} \newcommand{\field}{\mathbb{F}} \newcommand{\numfield}{\mathbb{F}} \newcommand{\expected}{\mathbb{E}} \newcommand{\var}{\mathbb{V}} \newcommand{\by}{\times} \newcommand{\partialderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\defineq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\defeq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\eye}{\Imat} \newcommand{\hadamard}{\odot} \newcommand{\trans}{\top} \newcommand{\inv}{{-1}} \newcommand{\argmax}{\operatorname{argmax}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\avec}{\mathbf{a}} \newcommand{\bvec}{\mathbf{b}} \newcommand{\cvec}{\mathbf{c}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\evec}{\mathbf{e}} \newcommand{\fvec}{\mathbf{f}} \newcommand{\gvec}{\mathbf{g}} \newcommand{\hvec}{\mathbf{h}} \newcommand{\ivec}{\mathbf{i}} \newcommand{\jvec}{\mathbf{j}} \newcommand{\kvec}{\mathbf{k}} \newcommand{\lvec}{\mathbf{l}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\nvec}{\mathbf{n}} \newcommand{\ovec}{\mathbf{o}} \newcommand{\pvec}{\mathbf{p}} \newcommand{\qvec}{\mathbf{q}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\tvec}{\mathbf{t}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\wvec}{\mathbf{w}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\yvec}{\mathbf{y}} \newcommand{\zvec}{\mathbf{z}} \newcommand{\Amat}{\mathbf{A}} \newcommand{\Bmat}{\mathbf{B}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\Dmat}{\mathbf{D}} \newcommand{\Emat}{\mathbf{E}} \newcommand{\Fmat}{\mathbf{F}} \newcommand{\Gmat}{\mathbf{G}} \newcommand{\Hmat}{\mathbf{H}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Jmat}{\mathbf{J}} \newcommand{\Kmat}{\mathbf{K}} \newcommand{\Lmat}{\mathbf{L}} \newcommand{\Mmat}{\mathbf{M}} \newcommand{\Nmat}{\mathbf{N}} \newcommand{\Omat}{\mathbf{O}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\Qmat}{\mathbf{Q}} \newcommand{\Rmat}{\mathbf{R}} \newcommand{\Smat}{\mathbf{S}} \newcommand{\Tmat}{\mathbf{T}} \newcommand{\Umat}{\mathbf{U}} \newcommand{\Vmat}{\mathbf{V}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Xmat}{\mathbf{X}} \newcommand{\Ymat}{\mathbf{Y}} \newcommand{\Zmat}{\mathbf{Z}} \newcommand{\Sigmamat}{\boldsymbol{\Sigma}} \newcommand{\identity}{\Imat} \newcommand{\epsilonvec}{\boldsymbol{\epsilon}} \newcommand{\thetavec}{\boldsymbol{\theta}} \newcommand{\phivec}{\boldsymbol{\phi}} \newcommand{\muvec}{\boldsymbol{\mu}} \newcommand{\sigmavec}{\boldsymbol{\sigma}} \newcommand{\jacobian}{\mathbf{J}} \newcommand{\ind}{\perp!!!!\perp} \newcommand{\bigoh}{\text{O}} \)

tags: Reinforcement Learning
source: https://arxiv.org/pdf/2205.06175
authors: Reed, S., Zolna, K., Parisotto, E., Colmenarejo, Sergio G\‘omez, Novikov, A., Barth-maron, Gabriel, Gim\’enez, Mai, …
year: 2022

The Gato agent is a generalist agent with the base of a Transformer. This agent is trained in a supervised manner, taking in a interleaved sequence of tokenized observations, separator tokens, and previously sampled actions to product the next action.

Architecture

Tokenization of inputs

Text: Tokenization of the text information is done through (Kudo and Richardson 2018) with 32000 subwords. This results in a tokenization to integers of \([0, 32000)\).
Images are tokenized through the same process as ViT. That is images are separated into 16x16 image patches ordered in raster order (i.e. the order of image rendering). Pixels are normalized to be in the range [-1, 1], and divided by \(\sqrt(16)\).
Discrete values (e.g. Atari button presses) are flattened into sequences of integers. The range of the discretiztion is \([0, 1024)\).
Continuous values are first normalized to [-1, 1] using \[ F(x, \mu=100, M=256) = \sign(x) \frac{\log(|x|\mu + 1.0)}{\log(M\mu + 1.0)} \] which results in the following normalization:

![](/ox-hugo/2024-05-07_15-18-00_Screenshot 2024-05-07 at 3.17.54 PM.png) These values are then discretized into 1024 bins with uniform width on the domain.

Network Architecture

The architecture has two main components

The parameterized embedding function which transforms tokens to token embeddings. This function works in two different modes depending on the token being embedded:
- Text, discrete- or continuous-valued observations or actions are embedded via a lookup table into a learned vector embedding space.
- Tokens belonging to image patches are embedded using a single ResNet block to obtain a vector per patch.
The Sequence model can work for next token prediction (i.e. a Transformer, specifically (Vaswani et al. 2017)).

GATO uses a 1.2B parameter decoder-only transformer with 24 layers, and embedding size of 2048, and a post-attention feedforward hidden size of 8196.

Training regime

Loss

Given the sequence of tokens \(s_{1:T}\) and \(\theta\), the loss function is:

\[ \mathcal{L}(\theta, \mathcal{B}) = - \sum_{b=1}^{|\mathcal{B}} \sum_{t=1}^{T} m(b, l) \log p_{\theta} \left ( s_l^{(b)} \left \vert s_1^{(b)} \ldots s_{l-1}^{(b)} \right. \right ) \]

where \(m(b, l) = 1\) when the token at index l is from text or a logged action, otherwise its \(0\).

Training data

The data used to train the agent. Below is a figure including full details of how much data is used to train the agent.

Different task types:

Simulated control tasks: Meta-world, Sokoban, BabyAI, DM Control Suite, DeepMind Lab, ALE.
Vision and language: Several datasets are used which have multi-modal image and text inputs as well as a dataset for text only preditcion.
- MassiveText
- ALIGN
- LTIP
- COCO
- MultiModal MassiveWeb dataset
- OKVQA
- VQAv2
Robotics - RGB Stacking Benchmark: A robotoics benchmark for stacking objects. Both the simulated and real-world versions were used.

Results

References

Kudo, Taku, and John Richardson. 2018. “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.” August 19. doi:10.48550/arXiv.1808.06226.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems 31, 6000–6010.

Matthew Schlegel