ADAM

tags
Optimizers, Neural Network
source
https://ruder.io/optimizing-gradient-descent/index.html#adam

An optimizer that effectively combines RMSProp and Momentum. The algorithm is as follows: using the gradient \(g_t\), we compute exponentially decaying moving averages of the first and second moments.

\begin{align*} m_t &= \beta_m m_{t-1} + (1-\beta_m) g_t \\\
v_t &= \beta_v v_{t-1} + (1-\beta_v) g^2_t \end{align*}

Both of these estimated averages are biased towards zero (especially during the first steps, since they are initialized at zero), so we correct the bias by scaling each by \(\frac{1}{1-\beta^t}\), using the corresponding \(\beta_m\) or \(\beta_v\):

\begin{align*} \hat{m}_t &= \frac{m_t}{1-\beta_m^t} \\\
\hat{v}_t &= \frac{v_t}{1-\beta_v^t} \end{align*}
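
As a quick worked check of the bias correction (my own example, using the typical \(\beta_m = 0.9\)), take the first step \(t = 1\) with \(m_0 = 0\):

\begin{align*} m_1 &= 0.9 \cdot 0 + 0.1\, g_1 = 0.1\, g_1 \\\
\hat{m}_1 &= \frac{0.1\, g_1}{1 - 0.9^1} = g_1 \end{align*}

so the corrected estimate recovers the raw gradient rather than a value shrunk towards zero.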

The bias-corrected first and second moment estimates are then used to calculate the update to the weights:

\begin{align*} \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{align*}

where \(\eta\) (the learning rate), \(\epsilon\), \(\beta_m\), and \(\beta_v\) are all hyperparameters.

Typical settings:

\begin{align*} \beta_m &= 0.9 \\\
\beta_v &= 0.999 \\\
\epsilon &= 10^{-8} \end{align*}
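
A minimal sketch of one ADAM update in NumPy, following the equations above; the function name `adam_step`, the `state` dict, and the default learning rate \(\eta = 0.001\) (the value suggested in the original Adam paper) are my own additions, not from the source:

```python
import numpy as np

def adam_step(theta, grad, state, eta=0.001, beta_m=0.9, beta_v=0.999, eps=1e-8):
    """One ADAM update; `state` holds m, v, and the step count t."""
    state["t"] += 1
    t = state["t"]
    # Decaying averages of the first and second moments of the gradient.
    state["m"] = beta_m * state["m"] + (1 - beta_m) * grad
    state["v"] = beta_v * state["v"] + (1 - beta_v) * grad**2
    # Bias correction: both averages start at zero, so rescale them.
    m_hat = state["m"] / (1 - beta_m**t)
    v_hat = state["v"] / (1 - beta_v**t)
    # Parameter update.
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps)

# Example usage on a toy quadratic loss f(theta) = theta^2.
theta = np.array([1.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
for _ in range(100):
    grad = 2 * theta  # gradient of theta^2
    theta = adam_step(theta, grad, state)
```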

wu2016: On Multiplicative Integration with Recurrent Neural Networks

First they look at the gradients when the RNNs have linear activation mappings (to focus on the internal mechanisms). They measure the log of the L2-norm of the gradient at each epoch (averaged over the training set) on the Penn Treebank dataset, training with the ADAM optimizer. They show that the norm of the gradient grows much more in the vanilla architecture (which uses additive operations) than in the proposed multiplicative-integration architecture.
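
A rough sketch of that measurement, assuming the per-minibatch gradients are already available as a list of NumPy arrays (the helper name `log_grad_norm` is mine, not from the paper):

```python
import numpy as np

def log_grad_norm(grads):
    """Log of the L2-norm of the full gradient, flattened across all parameters."""
    flat = np.concatenate([g.ravel() for g in grads])
    return np.log(np.linalg.norm(flat))

# Averaged over the training set for one epoch:
# epoch_value = np.mean([log_grad_norm(batch_grads) for batch_grads in all_batch_grads])
```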