ADAM

tags
Optimizers, Neural Network
source
https://ruder.io/optimizing-gradient-descent/index.html#adam

An optimizer that effectively combines RMSProp and Momentum. The algorithm is as follows: using the gradient \(g_t\), we compute exponentially decaying moving averages of the first and second moments.

\begin{align*} m_t &= \beta_m m_{t-1} + (1-\beta_m) g_t \\ v_t &= \beta_v v_{t-1} + (1-\beta_v) g^2_t \end{align*}
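A minimal NumPy sketch of these two updates (the names `m`, `v`, `grad`, `beta_m`, `beta_v` are chosen here to mirror the symbols above and are not from the source):

```python
import numpy as np

def update_moments(m, v, grad, beta_m=0.9, beta_v=0.999):
    """Exponential moving averages of the gradient (first moment)
    and of the element-wise squared gradient (second moment)."""
    m = beta_m * m + (1.0 - beta_m) * grad
    v = beta_v * v + (1.0 - beta_v) * grad ** 2
    return m, v
```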

Both of these estimates are biased towards zero, since \(m_0\) and \(v_0\) are initialized at zero. To correct this bias we scale each average by \(\frac{1}{1-\beta^t}\), using the corresponding \(\beta_m\) or \(\beta_v\):

\begin{align*} \hat{m}_t &= \frac{m_t}{1-\beta_m^t} \\ \hat{v}_t &= \frac{v_t}{1-\beta_v^t} \end{align*}
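The bias correction is then an element-wise division by \(1-\beta^t\), where \(t\) is the 1-indexed step count. Continuing the sketch above (same illustrative names):

```python
def bias_correct(m, v, t, beta_m=0.9, beta_v=0.999):
    """Undo the zero-initialization bias of the moving averages at step t (t >= 1)."""
    m_hat = m / (1.0 - beta_m ** t)
    v_hat = v / (1.0 - beta_v ** t)
    return m_hat, v_hat
```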

The bias-corrected first and second moment estimates are then used to update the weights:

\begin{align*} \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{align*}
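In code, the update divides the bias-corrected first moment by the square root of the bias-corrected second moment (plus \(\epsilon\) for numerical stability) and steps against the gradient direction; again a sketch with illustrative names:

```python
import numpy as np

def apply_update(theta, m_hat, v_hat, lr=0.001, eps=1e-8):
    """One Adam parameter update from the bias-corrected moments."""
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)
```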

Here \(\eta\) (the learning rate), \(\epsilon\), \(\beta_m\), and \(\beta_v\) are all hyperparameters.

Typical settings:

\begin{align*} \beta_m &= 0.9 \\ \beta_v &= 0.999 \\ \epsilon &= 10^{-8} \end{align*}
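Putting the three steps together with these default settings, a complete single-function sketch might look as follows. This is an illustration under the assumptions above, not a reference implementation, and the quadratic objective in the usage line is an arbitrary example:

```python
import numpy as np

def adam(grad_fn, theta, steps=1000, lr=0.001,
         beta_m=0.9, beta_v=0.999, eps=1e-8):
    """Minimal Adam loop: moving averages, bias correction, parameter update."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Decaying averages of the gradient and its square.
        m = beta_m * m + (1.0 - beta_m) * g
        v = beta_v * v + (1.0 - beta_v) * g ** 2
        # Correct the zero-initialization bias.
        m_hat = m / (1.0 - beta_m ** t)
        v_hat = v / (1.0 - beta_v ** t)
        # Step against the gradient, scaled per parameter.
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = adam(lambda th: 2.0 * th, np.array([1.0, -2.0]), steps=5000)
```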
