MATH 3850: Adam Paper, Algorithm 1. Every line explained, then a complete hand computation.
PART A: Algorithm 1 (Adam)
Inputs: $\alpha = 0.001$ (step size), $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $f(\theta)$ (loss function), $\theta_0$ (initial parameters)
INITIALIZATION
Line 1: $m_0 = 0$ (first moment estimate)
Line 2: $v_0 = 0$ (second moment estimate)
Line 3: $t = 0$ (timestep)

REPEAT UNTIL CONVERGED
Line 4: $t = t + 1$
Line 5: $g_t = \nabla_\theta f(\theta_{t-1})$ (gradient)
Line 6: $m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (momentum)
Line 7: $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (scale)
Line 8: $\hat{m}_t = m_t / (1 - \beta_1^t)$ (bias-corrected momentum)
Line 9: $\hat{v}_t = v_t / (1 - \beta_2^t)$ (bias-corrected scale)
Line 10: $\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update)

Return $\theta_t$ (the optimized parameters)
The key line is the update (line 10): $\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$
Compare to plain gradient descent: $\theta_t = \theta_{t-1} - \alpha \cdot g_t$
Adam replaces $g_t$ with $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. This does two things at once: the numerator $\hat{m}_t$ smooths the update direction (momentum), and the denominator $\sqrt{\hat{v}_t}$ rescales each parameter's step by that parameter's own gradient-magnitude history.
For a parameter whose gradient sign is stable, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately $\pm 1$ (Page 01 showed this). So the effective step size is approximately $\alpha$ for every parameter, regardless of gradient scale. That's the "trust region" property the paper discusses in Section 2.1.
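The ten algorithm lines can be rendered as a single-parameter update function. This is a minimal sketch, not the authors' reference code; the name `adam_step` and the scalar (one-parameter) simplification are mine:

```python
import math

def adam_step(theta, m, v, t, grad,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (Algorithm 1, lines 4-10)."""
    t += 1                                    # line 4: advance timestep
    g = grad(theta)                           # line 5: gradient at current point
    m = beta1 * m + (1 - beta1) * g           # line 6: first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g       # line 7: second moment (scale)
    m_hat = m / (1 - beta1 ** t)              # line 8: bias-corrected momentum
    v_hat = v / (1 - beta2 ** t)              # line 9: bias-corrected scale
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # line 10: update
    return theta, m, v, t
```

Calling it once with $\theta_0 = 5$, $f'(\theta) = 2\theta$, and $\alpha = 0.1$ reproduces the first hand-computed iteration below.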
PART B: Hand computation
Minimize $f(\theta) = \theta^2$. Derivative: $f'(\theta) = 2\theta$.
Start at $\theta_0 = 5$. Hyperparameters: $\alpha = 0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Initialize: $m_0 = 0$, $v_0 = 0$, $t = 0$.
(Using a simple 1-parameter function so we can focus on the algorithm, not the linear algebra.)
ITERATION 1 ($t = 1$)

Line 5 (gradient): $g_1 = f'(\theta_0) = 2(5) = 10$
Line 6 (momentum): $m_1 = 0.9(0) + 0.1(10) = 0 + 1 = 1$
Line 7 (scale): $v_1 = 0.999(0) + 0.001(10^2) = 0 + 0.1 = 0.1$
Line 8 (correct momentum): $\hat{m}_1 = \frac{1}{1 - 0.9^1} = \frac{1}{0.1} = 10$
Line 9 (correct scale): $\hat{v}_1 = \frac{0.1}{1 - 0.999^1} = \frac{0.1}{0.001} = 100$
Line 10 (update): $\theta_1 = 5 - 0.1 \times \frac{10}{\sqrt{100} + 10^{-8}} = 5 - 0.1 \times \frac{10}{10} = 5 - 0.1 = 4.9$
Notice: effective step = $\alpha \times \hat{m}/\sqrt{\hat{v}} = 0.1 \times 10/10 = 0.1$. The step size is approximately $\alpha$ regardless of gradient magnitude! That's Adam's trust region property.
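That scale-invariance is easy to check numerically: rescale the loss by a huge factor and the first step barely moves. A sketch, using the fact that at $t = 1$ the bias corrections recover $g_1$ and $g_1^2$ exactly (the curvature values chosen are illustrative, not from the paper):

```python
import math

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta0 = 5.0

steps = []
for c in (1e-3, 1.0, 1e3):                        # three wildly different scales
    g = 2 * c * theta0                             # gradient of f(theta) = c*theta^2
    m_hat = ((1 - beta1) * g) / (1 - beta1)        # t = 1 correction recovers g
    v_hat = ((1 - beta2) * g * g) / (1 - beta2)    # ... and g^2
    step = alpha * m_hat / (math.sqrt(v_hat) + eps)
    steps.append(step)
    print(f"c = {c:8g}   g_1 = {g:10g}   first step = {step:.6f}")
```

All three first steps come out at essentially $\alpha = 0.1$, whether the gradient is $0.01$ or $10{,}000$.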
ITERATION 2 ($t = 2$)

Line 5 (gradient): $g_2 = f'(\theta_1) = 2(4.9) = 9.8$
The gradient shrank slightly because $\theta$ moved closer to 0.
Line 6 (momentum): $m_2 = 0.9(1) + 0.1(9.8) = 0.9 + 0.98 = 1.88$
Blends 90% of the old momentum with 10% of the new gradient.
Line 7 (scale): $v_2 = 0.999(0.1) + 0.001(9.8^2) = 0.0999 + 0.09604 = 0.19594$
The second moment grows as it accumulates squared-gradient history.
Line 8 (correct momentum): $\hat{m}_2 = \frac{1.88}{1 - 0.9^2} = \frac{1.88}{0.19} = 9.895$
The correction factor $1 - 0.9^2 = 0.19$ is still small, so the boost (about $5\times$) is still large.
Line 9 (correct scale): $\hat{v}_2 = \frac{0.19594}{1 - 0.999^2} = \frac{0.19594}{0.001999} = 98.02$
$\beta_2^2 \approx 0.998$, so the correction factor $1 - \beta_2^2 \approx 0.002$ is still tiny (a roughly $500\times$ boost).
Line 10 (update): $\theta_2 = 4.9 - 0.1 \times \frac{9.895}{\sqrt{98.02} + 10^{-8}} = 4.9 - 0.1 \times \frac{9.895}{9.901} = 4.9 - 0.0999 = 4.8001$
Again, the effective step is $\approx 0.1$: Adam takes steps of approximately $\alpha$ regardless of gradient size.
ITERATION 3 ($t = 3$)

Line 5 (gradient): $g_3 = f'(\theta_2) = 2(4.8001) = 9.6002$
The gradient continues to shrink as $\theta$ approaches the minimum at 0.
Line 6 (momentum): $m_3 = 0.9(1.88) + 0.1(9.6002) = 1.692 + 0.96002 = 2.652$
The raw $m_3$ is still far below the true gradient because of the zero-initialization bias.
Line 7 (scale): $v_3 = 0.999(0.19594) + 0.001(9.6002^2) = 0.19574 + 0.09216 = 0.28790$
Line 8 (correct momentum): $\hat{m}_3 = \frac{2.652}{1 - 0.9^3} = \frac{2.652}{0.271} = 9.786$
The correction factor is $1 - 0.729 = 0.271$, still a meaningful boost.
Line 9 (correct scale): $\hat{v}_3 = \frac{0.28790}{1 - 0.999^3} = \frac{0.28790}{0.002997} = 96.06$
Line 10 (update): $\theta_3 = 4.8001 - 0.1 \times \frac{9.786}{\sqrt{96.06} + 10^{-8}} = 4.8001 - 0.1 \times \frac{9.786}{9.801} = 4.8001 - 0.0998 = 4.7003$
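The zero-initialization bias that lines 8-9 keep correcting is easiest to see with a hypothetical constant gradient (here fixed at 10, matching $g_1$; an assumption for illustration, not part of the hand computation): the raw $m_t$ climbs toward the gradient only slowly, while $\hat{m}_t$ recovers it exactly at every step.

```python
beta1 = 0.9
g = 10.0                                  # pretend the gradient stays constant
m = 0.0                                   # zero-initialized, as in line 1
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g       # raw first moment: g * (1 - beta1**t)
    m_hat = m / (1 - beta1 ** t)          # bias-corrected: exactly g
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}")
```

After five steps the raw $m_5 \approx 4.10$ is still less than half the true gradient; $\hat{m}_t = 10$ throughout.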
Three iterations in and the pattern is clear: each step is almost exactly $\alpha = 0.1$, confirming the trust region property. The ratio $\hat{m}_t / \sqrt{\hat{v}_t} \approx 1$, so Adam's effective step is bounded by $\alpha$.
| $t$ | $\theta$ | $g_t$ | $m_t$ | $v_t$ | $\hat{m}_t$ | $\hat{v}_t$ | Step | $\theta_{\text{new}}$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 5.000 | 10.0 | 1.000 | 0.100 | 10.00 | 100.0 | 0.100 | 4.900 |
| 2 | 4.900 | 9.80 | 1.880 | 0.196 | 9.895 | 98.02 | 0.100 | 4.800 |
| 3 | 4.800 | 9.60 | 2.652 | 0.288 | 9.786 | 96.06 | 0.100 | 4.700 |
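The whole table can be reproduced mechanically. A sketch with the same hyperparameters as the hand computation (variable names are mine):

```python
import math

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta, m, v = 5.0, 0.0, 0.0

rows = []
for t in range(1, 4):
    g = 2 * theta                             # line 5: f'(theta) = 2*theta
    m = beta1 * m + (1 - beta1) * g           # line 6
    v = beta2 * v + (1 - beta2) * g * g       # line 7
    m_hat = m / (1 - beta1 ** t)              # line 8
    v_hat = v / (1 - beta2 ** t)              # line 9
    step = alpha * m_hat / (math.sqrt(v_hat) + eps)
    theta -= step                             # line 10
    rows.append((t, g, m, v, m_hat, v_hat, step, theta))
    print(f"t={t}  g={g:.4f}  m={m:.3f}  v={v:.5f}  "
          f"m_hat={m_hat:.3f}  v_hat={v_hat:.2f}  step={step:.4f}  theta={theta:.4f}")
```

The printed rows match the table above to the rounding shown.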
The paper's recommended defaults (Section 2):
Key insight from the paper: "Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$." In practice $\beta_1$, $\beta_2$, and $\epsilon$ are rarely changed; $\alpha$ is the one hyperparameter that is commonly tuned per problem.
Gradient descent: You had to choose $\alpha$ carefully (too big = diverge, too small = crawl). Adam's default $\alpha = 0.001$ works because the adaptive scaling handles the rest.
Line search: computed a fresh $\alpha$ each step, which costs extra function evaluations. Adam doesn't need a line search; the second-moment scaling plays a similar role.
BFGS: builds an $n \times n$ approximation to the inverse Hessian. Adam stores just two length-$n$ vectors ($m$ and $v$). Same idea (use gradient history to improve steps), radically simpler implementation.
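The storage gap is worth seeing in numbers. A sketch with an illustrative parameter count (not a figure from the paper):

```python
n = 1_000_000                 # parameter count, modest by modern standards
adam_floats = 2 * n           # m and v: one float each per parameter
bfgs_floats = n * n           # dense n-by-n inverse-Hessian approximation
print(f"Adam state: {adam_floats:>16,} floats")
print(f"BFGS state: {bfgs_floats:>16,} floats")
```

At a million parameters that is two million floats for Adam versus a trillion for dense BFGS, which is why quasi-Newton methods at this scale fall back to limited-memory variants like L-BFGS.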