# kostas2019: Asynchronous Coagent Networks: Stochastic Networks for Reinforcement Learning without Backpropagation or a Clock

\( \newcommand{\states}{\mathcal{S}} \newcommand{\actions}{\mathcal{A}} \newcommand{\rewards}{\mathcal{R}} \newcommand{\transition}{P} \newcommand{\reals}{\mathbb{R}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\expected}{\mathbb{E}} \newcommand{\by}{\times} \newcommand{\partialderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\defineq}{\stackrel{\mathclap{\tiny\mbox{def}}}{=}} \newcommand{\defeq}{\stackrel{\mathclap{\tiny\mbox{def}}}{=}} \newcommand{\eye}{\Imat} \newcommand{\hadamard}{\odot} \newcommand{\trans}{\top} \newcommand{\inv}{{-1}} \newcommand{\argmax}{\operatorname{argmax}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\avec}{\mathbf{a}} \newcommand{\bvec}{\mathbf{b}} \newcommand{\cvec}{\mathbf{c}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\evec}{\mathbf{e}} \newcommand{\gvec}{\mathbf{g}} \newcommand{\hvec}{\mathbf{h}} \newcommand{\lvec}{\mathbf{l}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\nvec}{\mathbf{n}} \newcommand{\pvec}{\mathbf{p}} \newcommand{\qvec}{\mathbf{q}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\wvec}{\mathbf{w}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\yvec}{\mathbf{y}} \newcommand{\zvec}{\mathbf{z}} \newcommand{\Amat}{\mathbf{A}} \newcommand{\Bmat}{\mathbf{B}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\Dmat}{\mathbf{D}} \newcommand{\Emat}{\mathbf{E}} \newcommand{\Fmat}{\mathbf{F}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\Umat}{\mathbf{U}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Xmat}{\mathbf{X}} \newcommand{\thetavec}{\boldsymbol{\theta}} \newcommand{\muvec}{\boldsymbol{\mu}} \newcommand{\sigmavec}{\boldsymbol{\sigma}} \newcommand{\jacobian}{\mathbf{J}} \)

- tags
- Reinforcement Learning, Representation
- source
- paper
- authors
- Kostas, J., Nota, C., & Thomas, P. S.
- year
- 2019

This paper continues the line of research into coagent networks, primarily looking at how these networks can be trained in an asynchronous way. Policy gradient coagent networks (PGCN) already have removed back-propagation as a requirement, meaning the updates require only information local to a particular coagent (or neuron in an ANN).

Some literature connected to this work is:

*stochastic computation graphs*- proposed to describe networks with a mixture of stochastic and deterministic nodes, with supervised, unsupervised, and RL applications.*Stochastic neural networks*- A type of ANN which builds in random variations to the network, either through a stochastic transfer function, or stochastic weights.- Multi-agent RL - (Liu 2014): showed that multi-agent systems sometimes learn more quickly when agents are given individualized rewards.

## Background

Consider the MDP, \((M=(\states, \actions, \rewards, \transition, R, d_0, \gamma))\). where the \(\states\) is all possible states, \(\actions\) set of all possible actions, and \(\rewards\) the set of all possible rewards (All finite sets). \(\transition: \states \by \actions \by \states \rightarrow [0,1], R: \states \by \actions \by \states \rightarrow [0,1], d_0(s) \triangleq Pr(S_0=s)\) are the transition function, reward distribution, and initial state distribution respectively.

The paper defines a parameterized policy as \(\pi: \states \by \actions \by \mathbb{R}^n \rightarrow [0,1]\)., such that for all \(\theta \in \reals^n\), \(\pi(\cdot, \cdot, \theta)\) is a policy. Assume \(\partialderiv{\pi(s,a,\theta)}{ \theta}\) exists for all \(s\in\states, a\in\actions\).

An agents goal is to maximize the objective

\[J(\pi) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t R_t|A_t \sim \pi(\cdot, \cdot, \theta)]\]

### How coagent networks function

We can consider an acyclic network of nodes, where each node is an independently acting coagent

The ith coagent has parameters \(\theta_i \in \reals^{n_i}\), where the remaining parameters of the network are \(\bar{\theta}_i \in \reals{n - n_i}\).

We model the output of the previous coagents in the network through a parameterized distribution \(U_t^\text{pre} \sim \pi_i^{\text{pre}}(S_t, \cdot, \bar{theta}_i)\). The output of the previous coagents is a random variable \(U_t^\text{pre}\), which takes some discrete values in a set \(\mathcal{U}^\text{pre}\).

The \(i^\text{th}\) coagent takes \(S_t\) and \(U_t^\text{pre}\) as input and produces \(U_t^i\), the conditional distribution of \(U^i_t\) is given by \(\pi(X_t^i, \cdot, \theta_i)\) where \(X_t^i = (S_t, U^\text{pre}_t)\). \(A_t\) is sampled according to a distribution \(\pi_i^\text{post}(X_t, U_t^i, \cdot, \bar{\theta_i})\).

Each agents environment is modeled as a CoMDP (*conjugate Markov decision process*).

## The Coagent Policy Gradient Theorem

If we were to simpley implement an unbiased policy gradient algorithm (say REINFORCE), the expected update or the *local policy gradient* \(\Delta_i\) for the \(i^\text{th}\) coagent would be:

\begin{equation} \Delta_i(\theta_i) \triangleq \expected\left[\sum_{t=0}^\infty \gamma^t G_t \partialderiv{\ln(\pi_i(X_t, U_t, \theta_i))}{\theta_i} \vert \theta\right] \label{eqn:kostas2019:local-policy-gradient} \end{equation}

What would happen if all coagents updated according to the local policy gradient updates?

**Coagent policy gradient theorm (CPGT)**: If \(\theta\) is fixed and all coagents update their parameters following unbiased estimates of eqn:kostas2019:local-policy-gradient then the entire network will follow an unbiased estimator of \(\nabla J(\theta)\), or the *global policy gradient*.

### Conjugate Markov Decision Process (CoMDP)

The CoMDP (or environment) of the \(i^\text{th}\), \(M^i \triangleq (\mathcal{X}^i, \mathcal{U}^i, \rewards^i, \transition^i, R^i, d_0^i, \gamma_i)\). Each component follows the normal MDP scheme laid out above. Here we will denote all random variables of the CoMDP with a tilde to clarify the distinction, for example \(\tilde{X}^i_t\) is the state of coagent \(i\) at time \(t\). The reward set and discount are the same as the MDP. We also define the transition and reward dynamics as

\[\transition^i(x, u, x’, \bar{\theta}_i) \triangleq \pi_i^\text{pre}(x’.s, x’.u_\text{pre}) \sum_{a \in \actions} \transition(x.s, a, x’.s) \pi_i^\text{post}(x, u, a)\]

\[R^i(x, u, x’, r, \bar{\theta}_i) \triangleq \sum_{a\in\actions} R(x.s, a, x’.s, r) \frac{\transition(x.s, a, x’.s) \pi_i^\text{post}(x, u, a)}{\sum_{\hat{a} \in \actions}\transition(x.s, \hat{a}, x’.s) \pi_i^\text{post}(x,u,\hat{a})}\]

They assume that given the same parameters \(\theta_i\), the \(i^\text{th}\) coagent has the same policy in both the original MDP and the \(i^\text{th}\) CoMDP.

The properties of the CoMDP are relatively straightforward from basic probability and algebraic manipulations. Something to keep in mind is the notion that the objective functions are equivalent (i.e. \(J(\theta) = J_i(\theta_i)\)). How this plays out in the network makes some amount of sense, as each coagent needs to act in such a way the whole agent can maximize the return.

I wonder if:

- Does the network ever collapse where each agent is selecting the same action? This could be a problem as it will force the network into a local optima, with no way out. If the policies are stochastic this should be avoidable.
- If the agents are dependent on the prior agents output, how can this be asynchronous? I’m still a bit confused on how the forward pass is being performed.
- Can this go beyond each node being an agent? How else can we construct asynchronous networks?

## Asynchronous Recurrent Networks

They allow for cyclic and asynchronous firings in straightforward ways. The cyclic structure is placed in \(\mathcal{X}_t\) which can contain prior firings of the coagent. The asynchronous firings are done through discritizing time into *atomic time steps* or really time steps which can be arbitrarily fine. Each coagent then has a probability of firing and any given *atomic time step*.

They provide theoretical evidence these extensions are valid, which will not be discussed in detail here.

## Case Study: Option Critic

They show the CGPT can be applied to an option critic framework. This allows detailed analysis of several multi-component algorithms to be analyzed using a more general policy gradient algorithm.

The details of this are skipped here.

## Empirical Support of CPGT

They provide some empirical evidence on a toy problem that the gradient estimates are similar to the back prop example. They show the directions are relatively similar as the number of estimate episodes increases.

## References

(Liu 2014) Bingyao Liu; Satinder Singh; Richard L. Lewis and Shiyin Qin, *Optimal {{Rewards}} for {{Cooperative Agents}}*, IEEE Transactions on Autonomous Mental Development, , pp. (2014). .