ADAM

tags
Optimizers, Neural Network

An optimizer that effectively combines RMSProp and Momentum: it keeps exponential moving averages of the gradient (first moment, as in Momentum) and of the squared gradient (second moment, as in RMSProp), applies bias correction to both, and uses them to scale per-parameter updates.
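
As a quick reference, a minimal NumPy sketch of the Adam update under the usual defaults from the Kingma & Ba paper (the class name and hyperparameter names here are my own, not from this note):

    import numpy as np

    class Adam:
        def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
            self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
            self.m = None  # first moment: Momentum-style moving average of g
            self.v = None  # second moment: RMSProp-style moving average of g**2
            self.t = 0     # step counter for bias correction

        def step(self, params, grads):
            if self.m is None:
                self.m = np.zeros_like(params)
                self.v = np.zeros_like(params)
            self.t += 1
            self.m = self.beta1 * self.m + (1 - self.beta1) * grads
            self.v = self.beta2 * self.v + (1 - self.beta2) * grads ** 2
            m_hat = self.m / (1 - self.beta1 ** self.t)  # bias-corrected first moment
            v_hat = self.v / (1 - self.beta2 ** self.t)  # bias-corrected second moment
            return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

Dividing by sqrt(v_hat) gives the RMSProp-style per-parameter scaling, while m_hat carries the Momentum-style smoothing of the update direction.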

wu2016: On Multiplicative Integration with Recurrent Neural Networks

First they look at the gradients when the RNNs have linear activation mappings, so that the analysis focuses on the internal mechanisms. They measure the log of the L2-norm of the gradient at each epoch (averaged over the training set) on the Penn Treebank dataset, training with the ADAM optimizer. They show that the gradient norm grows much more in the vanilla architecture (which combines inputs additively) than in the proposed multiplicative-integration architecture.
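
A rough PyTorch sketch of how that per-epoch measurement could be reproduced; the model, loss function, and data loader are placeholders, and this is not the paper's actual code:

    import math
    import torch

    def log_grad_l2_norm(model, loss_fn, loader):
        # Accumulate the squared gradient norm over the training set,
        # then return log of the (batch-averaged) L2 norm for this epoch.
        total_sq_norm, batches = 0.0, 0
        for x, y in loader:
            model.zero_grad()
            loss_fn(model(x), y).backward()
            sq_norm = sum((p.grad ** 2).sum().item()
                          for p in model.parameters() if p.grad is not None)
            total_sq_norm += sq_norm
            batches += 1
        return math.log(math.sqrt(total_sq_norm / batches))

Logging this value once per epoch for both architectures (additive vs multiplicative integration) gives the kind of gradient-norm curves the paper compares.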