chandar2019: Towards Non-saturating Recurrent Units for Modelling Long-term Dependencies

Recurrent Neural Network, Machine Learning

In this paper the authors introduce a new recurrent cell, the Non-saturating Recurrent Unit (NRU). Two main contributions distinguish this architecture from other cells (e.g. LSTM).

  • The study and use of non-saturating activation functions (e.g. ReLU activations) for the non-linear transfer functions.
  • They separate out a memory vector from the hidden state, and the memory vector can be a different size from the hidden state. This enables more information to be stored.
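
To make the two points above concrete, here is a minimal toy sketch of a recurrent cell with a ReLU hidden update and a separate, larger memory vector. It is not the NRU update from the paper: the class name `ToyMemoryCell`, the weight shapes, the plain additive memory write, and the sizes used are all assumptions made purely for illustration.

```python
import random


def relu(v):
    # Non-saturating activation: the output is unbounded above.
    return [max(0.0, x) for x in v]


def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ToyMemoryCell:
    """Hypothetical cell, NOT the paper's equations: illustration only."""

    def __init__(self, input_size, hidden_size, memory_size, seed=0):
        rng = random.Random(seed)
        # The hidden update reads the previous hidden state, the input and the memory.
        self.W_h = rand_matrix(hidden_size, hidden_size + input_size + memory_size, rng)
        # The memory write is computed from the new hidden state.
        self.W_m = rand_matrix(memory_size, hidden_size, rng)

    def step(self, x, h, m):
        h_new = relu(matvec(self.W_h, h + x + m))  # list concatenation, then ReLU
        delta = matvec(self.W_m, h_new)
        m_new = [mi + di for mi, di in zip(m, delta)]  # additive write, no squashing
        return h_new, m_new


# Memory (16) is deliberately larger than the hidden state (4).
cell = ToyMemoryCell(input_size=3, hidden_size=4, memory_size=16)
h, m = [0.0] * 4, [0.0] * 16
h, m = cell.step([1.0, 0.5, -0.5], h, m)
```

The point of the sketch is only that the memory size is chosen independently of the hidden size, and that nothing in the update squashes values into a bounded range.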

See the paper for a full description of the architecture.


The claim is that the NRU lessens the effect of the vanishing gradient problem and enables longer time dependencies to be modeled.
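
A rough way to see why saturation matters for this claim: the derivative of tanh shrinks towards zero as its input moves away from zero, while the derivative of ReLU stays at 1 for any positive input. The sketch below multiplies the same local derivative over many steps as a crude stand-in for backpropagation through time. The scalar state, the unit recurrent weight, the fixed pre-activation of 2.0 and the 50 steps are simplifying assumptions of mine, not a setup taken from the paper.

```python
import math


def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, which approaches 0 as |x| grows.
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    # d/dx max(0, x) is 1 for positive inputs and 0 otherwise.
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, pre_activation, steps):
    # Product of `steps` identical local derivatives (scalar state, unit weight).
    return grad_fn(pre_activation) ** steps


tanh_chain = chained_grad(tanh_grad, 2.0, 50)  # collapses towards zero
relu_chain = chained_grad(relu_grad, 2.0, 50)  # stays at 1.0
print(tanh_chain, relu_chain)
```

ReLU is not a free win: the gradient is exactly zero for negative pre-activations, and with recurrent weights larger than one the same product can grow instead of shrink.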

While the vanishing gradient claim isn’t explored as fully as in (Wu 2016), they support it in the “model analysis” section with some preliminary evidence of lack of divergence on the copy task. While this is a start, the instabilities seen in the training curves (e.g. figures 1 and 2) are worrying. What is causing these spikes in error? How would this affect an agent needing to make decisions in the real world? Neither question is fully explored in the paper. The main analysis of the results also doesn’t address these issues, focusing instead on the “faster convergence rates”. I’m also a bit hesitant, as I don’t see confidence bounds or any mention of the number of runs used. This leads to a further question: should we care about convergence at all costs, or is there a balance between fast convergence and reliable representations?

(Wu 2016) Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio and Russ R Salakhutdinov, On Multiplicative Integration with Recurrent Neural Networks (2016).