jaderberg2017reinforcement: Reinforcement Learning with Unsupervised Auxiliary Tasks

tags
Reinforcement Learning, Deep Reinforcement Learning, Auxiliary Tasks
source
https://arxiv.org/pdf/1611.05397.pdf

UNREAL Agent:

  • Base agent is a CNN-LSTM network trained on-policy with A3C (Mnih et al. 2016); its observations, rewards, and actions are stored in a replay buffer that the auxiliary tasks sample from.
  • Pixel control: auxiliary policies \(Q^\text{aux}\) are trained to maximize change in pixel intensity of different regions.
  • Reward prediction: given three recent frames, the network must predict the reward that will be obtained in the next unobserved timestep.
  • Value Function Replay: further training of the value function using the agent network.

The overall architecture jointly optimizes all of these objectives as a single combined loss.
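Concretely, the combined objective is a weighted sum of the base A3C loss and the auxiliary losses, roughly of the form below (the \(\lambda\) coefficients are hyperparameters weighting each auxiliary task; the notation paraphrases the paper):

\[
\mathcal{L}_{\text{UNREAL}}(\theta) = \mathcal{L}_{\text{A3C}} + \lambda_{\text{VR}}\,\mathcal{L}_{\text{VR}} + \lambda_{\text{PC}} \sum_{c} \mathcal{L}_{Q}^{(c)} + \lambda_{\text{RP}}\,\mathcal{L}_{\text{RP}}
\]

where the sum runs over the individual pixel-control tasks \(c\).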

Auxiliary Control Tasks

These are defined as additional pseudo-reward functions in the environment, which is similar to a control GVF. In this paper, they use n-step Q-learning as described by Mnih et al. (2016) for learning the auxiliary control tasks. They use two types of pseudo-rewards:

  • Pixel Changes: learn policies for maximally changing the pixels in each cell of an \(n \times n\) non-overlapping grid placed over the input image (a minimal sketch of this pseudo-reward follows this list).
  • Network Features: learn policies for maximally activating each of the units in a specific hidden layer.
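As a concrete illustration, here is a minimal NumPy sketch of the pixel-change pseudo-reward for one pair of frames. The function name, greyscale input, and mean-absolute-difference aggregation are my assumptions about details the notes leave out; the paper also crops and subsamples the observation before computing cell-wise changes.

```python
import numpy as np

def pixel_change_rewards(frame_prev, frame_next, n=20):
    """Pseudo-rewards for the pixel-control auxiliary task (illustrative sketch).

    frame_prev, frame_next: 2D greyscale frames whose height and width are both
    divisible by n. Returns an (n, n) array where each entry is the mean absolute
    pixel-intensity change inside that grid cell; each cell defines one pseudo-reward
    that the auxiliary Q function is trained to maximise.
    """
    h, w = frame_prev.shape
    diff = np.abs(frame_next.astype(np.float32) - frame_prev.astype(np.float32))
    # Split the difference image into an n x n grid of non-overlapping cells
    # and average the absolute change within each cell.
    cells = diff.reshape(n, h // n, n, w // n)
    return cells.mean(axis=(1, 3))
```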

Auxiliary Reward Task

In addition to the control tasks, they also propose a reward prediction task: predict the immediate reward at the end of a short sequence of recent frames, which is fed through a simple feedforward stack of the frames rather than through the LSTM. Sequences are sampled from the replay buffer in a skewed manner so that rewarding and non-rewarding sequences are equally represented, and the reward is predicted as one of three classes (zero, positive, or negative).
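A rough sketch of how one reward-prediction training example might be drawn from the replay buffer is below. The buffer layout (a time-ordered list of (observation, action, reward) tuples) and the function name are hypothetical, but the skewed sampling of rewarding vs. non-rewarding sequences and the three-class target follow the paper.

```python
import random

def sample_reward_prediction_example(replay, k=3, p_rewarding=0.5):
    """Draw (frames, class) for the reward-prediction task (illustrative sketch).

    `replay` is assumed to be a time-ordered list of (observation, action, reward)
    tuples. With probability `p_rewarding` the sampled sequence ends immediately
    before a non-zero reward, over-representing rare rewarding events.
    """
    rewarding = [t for t in range(k, len(replay)) if replay[t][2] != 0]
    non_rewarding = [t for t in range(k, len(replay)) if replay[t][2] == 0]
    if rewarding and (not non_rewarding or random.random() < p_rewarding):
        t = random.choice(rewarding)
    else:
        t = random.choice(non_rewarding)
    frames = [replay[i][0] for i in range(t - k, t)]  # the k frames preceding r_t
    r = replay[t][2]
    target = 0 if r == 0 else (1 if r > 0 else 2)     # classes: zero / positive / negative
    return frames, target
```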

Value function replay

The experience replay buffer is used not only for the auxiliary control and reward prediction tasks described above, but also for value function replay: recent sequences are resampled from the behaviour policy distribution and extra value function regression is performed on them, in addition to the on-policy updates in A3C.
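The sketch below shows how n-step return targets for value-function replay could be computed from a replayed reward sequence. The function name and signature are assumptions; the idea of regressing the value head towards returns bootstrapped from the current value estimate follows the description above.

```python
def value_replay_targets(rewards, bootstrap_value, gamma=0.99):
    """n-step return targets for value-function replay (illustrative sketch).

    `rewards` is a replayed sequence r_{t+1}, ..., r_{t+n} and `bootstrap_value`
    is the current estimate V(s_{t+n}). Returns one target per replayed state
    s_t, ..., s_{t+n-1}; the value head is regressed towards these targets with
    an L2 loss, in addition to the on-policy A3C updates.
    """
    targets = []
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret   # accumulate the discounted return backwards
        targets.append(ret)
    return list(reversed(targets))
```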

Results

They test across two sets of domains: 3D Labyrinth levels and Atari games. Across all domains tested, UNREAL shows an improvement over the baseline A3C agent.

The remaining baselines are ablations of the UNREAL architecture (A3C augmented with subsets of the auxiliary tasks), isolating the contribution of each component.

References

Mnih, Volodymyr, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” In International Conference on Machine Learning.