Page 12: The Full Adam Algorithm

MATH 3850 — Adam Paper, Algorithm 1 — every line explained, then a complete hand computation

PART A

The Complete Algorithm (Color-Coded)

Algorithm 1: Adam

Inputs: $\alpha = 0.001$ (step size), $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $f(\theta)$ (loss function), $\theta_0$ (initial parameters)

INITIALIZATION

Line 1: $m_0 \leftarrow 0$ (first moment — momentum vector, all zeros). No gradient history yet.
Line 2: $v_0 \leftarrow 0$ (second moment — scale vector, all zeros). No squared gradient history yet.
Line 3: $t \leftarrow 0$ (timestep counter). Will increment each iteration.

REPEAT UNTIL CONVERGED

Line 4: $t \leftarrow t + 1$ — advance the clock.
Line 5: $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ — compute the gradient on a random minibatch.
Line 6: $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ — update momentum (Page 09).
Line 7: $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ — update the scale tracker (Page 10).
Line 8: $\hat{m}_t \leftarrow m_t \;/\; (1 - \beta_1^t)$ — bias-correct momentum (Page 11).
Line 9: $\hat{v}_t \leftarrow v_t \;/\; (1 - \beta_2^t)$ — bias-correct scale (Page 11).
Line 10: $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$ — THE UPDATE: combine everything!

Return $\theta_t$ (the optimized parameters)
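Algorithm 1 maps almost line-for-line onto code. Here is a minimal sketch in plain Python for a single scalar parameter; the `max_iters`/`tol` stopping rule is my addition, since the paper just says "repeat until converged":

```python
import math

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, max_iters=10_000, tol=1e-8):
    """Algorithm 1 (scalar form): minimize f given its gradient function."""
    m, v, t = 0.0, 0.0, 0                      # lines 1-3: zero-init moments and clock
    theta = theta0
    for _ in range(max_iters):                 # repeat until converged
        t += 1                                 # line 4: advance the clock
        g = grad(theta)                        # line 5: gradient at current position
        m = beta1 * m + (1 - beta1) * g        # line 6: update momentum
        v = beta2 * v + (1 - beta2) * g * g    # line 7: update scale tracker
        m_hat = m / (1 - beta1 ** t)           # line 8: bias-correct momentum
        v_hat = v / (1 - beta2 ** t)           # line 9: bias-correct scale
        step = alpha * m_hat / (math.sqrt(v_hat) + eps)
        theta -= step                          # line 10: THE UPDATE
        if abs(step) < tol:                    # simple stopping rule (not in the paper)
            break
    return theta
```

For vector parameters the same code works elementwise (e.g. on NumPy arrays): every operation in lines 6-10 is per-parameter, which is why Adam needs only two extra vectors of storage.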

Line 10 decoded: the heart of Adam

$\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$

$\theta_t$
Updated parameters at timestep $t$ (what we are solving for)
$\theta_{t-1}$
Parameters from the previous step (our current position)
$\alpha$
Base step size (learning rate). Default 0.001. Because Adam normalizes steps, this directly controls how far each step goes
$\hat{m}_t$
Bias-corrected first moment: the smoothed gradient direction (numerator = which way to go)
$\sqrt{\hat{v}_t}$
Square root of bias-corrected second moment: per-parameter gradient magnitude (denominator = how much to scale down)
$\epsilon$
Tiny constant ($10^{-8}$) to prevent division by zero. Does not affect results in practice

Compare to plain gradient descent: $\theta_t = \theta_{t-1} - \alpha \cdot g_t$

Adam replaces $g_t$ with $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. This does two things at once: the numerator $\hat{m}_t$ smooths the noisy gradient into a stable direction, and the denominator $\sqrt{\hat{v}_t}$ rescales each parameter's step by its typical gradient magnitude.

Because of that rescaling, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately $\pm 1$ for each parameter (Page 01 showed this). So the effective step size is approximately $\alpha$ for every parameter, regardless of gradient scale. That's the "trust region" property the paper discusses in Section 2.1.
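That scale-invariance is easy to verify numerically. A small sketch (my own illustration, not from the paper): starting from $m_0 = v_0 = 0$, the bias corrections cancel the $(1 - \beta)$ factors exactly, so the first step is $\alpha \cdot g / (|g| + \epsilon)$ no matter how large or small the gradient $g$ is:

```python
import math

def first_adam_step(g, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step from m0 = v0 = 0: bias correction gives
    # m_hat = g and v_hat = g**2 exactly on the first iteration.
    m_hat = (1 - beta1) * g / (1 - beta1)
    v_hat = (1 - beta2) * g * g / (1 - beta2)
    return alpha * m_hat / (math.sqrt(v_hat) + eps)

for g in (1e-6, 1.0, 1e6):
    print(g, first_adam_step(g))   # each step comes out near 0.001
```

Plain gradient descent's first step would be $\alpha g$: $10^{-9}$ for the tiny gradient, $10^{3}$ for the huge one. Adam takes a step of about $\alpha$ in both cases.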

PART B

Data Flow: How the Pieces Connect

The noisy GRADIENT $g_t$ feeds two running averages:

- MOMENTUM: $m_t = \beta_1 \cdot m + (1 - \beta_1) \cdot g$ — smooths the direction
- SCALE: $v_t = \beta_2 \cdot v + (1 - \beta_2) \cdot g^2$ — tracks gradient magnitude²

Each is then bias-corrected:

- CORRECT: $\hat{m} = m / (1 - \beta_1^t)$ — fix zero-init bias in the direction
- CORRECT: $\hat{v} = v / (1 - \beta_2^t)$ — fix zero-init bias in the scale

Finally, the UPDATE combines them: $\theta \leftarrow \theta - \alpha \cdot \hat{m} / (\sqrt{\hat{v}} + \epsilon)$ — direction ÷ scale = adaptive step.

PART C

Complete Hand Computation: 3 Iterations of Adam

Setup

Minimize $f(\theta) = \theta^2$. Derivative: $f'(\theta) = 2\theta$.

Start at $\theta_0 = 5$. Hyperparameters: $\alpha = 0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

Initialize: $m_0 = 0$, $v_0 = 0$, $t = 0$.

(Using a simple 1-parameter function so we can focus on the algorithm, not the linear algebra.)

Iteration 1 ($t = 1$)

Line 5 — Gradient: $g_1 = f'(\theta_0) = 2(5) = $ $10$

Line 6 — Momentum: $m_1 = 0.9(0) + 0.1(10) = 0 + 1 = $ $1$

Line 7 — Scale: $v_1 = 0.999(0) + 0.001(10^2) = 0 + 0.1 = $ $0.1$

Line 8 — Correct momentum: $\hat{m}_1 = \frac{1}{1 - 0.9^1} = \frac{1}{0.1} = $ $10$

Line 9 — Correct scale: $\hat{v}_1 = \frac{0.1}{1 - 0.999^1} = \frac{0.1}{0.001} = $ $100$

Line 10 — Update: $\theta_1 = 5 - 0.1 \times \frac{10}{\sqrt{100} + 10^{-8}} = 5 - 0.1 \times \frac{10}{10} = 5 - 0.1 = $ $4.9$

Notice: effective step = $\alpha \times \hat{m}/\sqrt{\hat{v}} = 0.1 \times 10/10 = 0.1$. The step size is approximately $\alpha$ regardless of gradient magnitude! That's Adam's trust region property.

Iteration 2 ($t = 2$)

Line 5 — Gradient: $g_2 = f'(\theta_1) = 2(4.9) = $ $9.8$

Gradient shrank slightly because $\theta$ moved closer to 0

Line 6 — Momentum: $m_2 = 0.9(1) + 0.1(9.8) = 0.9 + 0.98 = $ $1.88$

Blends 90% of old momentum with 10% of the new gradient

Line 7 — Scale: $v_2 = 0.999(0.1) + 0.001(9.8^2) = 0.0999 + 0.09604 = $ $0.19594$

Second moment grows as it accumulates squared gradient history

Line 8 — Correct momentum: $\hat{m}_2 = \frac{1.88}{1 - 0.9^2} = \frac{1.88}{0.19} = $ $9.895$

Correction factor $0.19$ is still small, so the boost is still large

Line 9 — Correct scale: $\hat{v}_2 = \frac{0.19594}{1 - 0.999^2} = \frac{0.19594}{0.001999} = $ $98.02$

$\beta_2^2 \approx 0.998$, so the correction factor $1 - \beta_2^2 \approx 0.002$ is still tiny (massive boost)

Line 10 — Update: $\theta_2 = 4.9 - 0.1 \times \frac{9.895}{\sqrt{98.02} + 10^{-8}} = 4.9 - 0.1 \times \frac{9.895}{9.901} = 4.9 - 0.0999 = $ $4.8001$

Again, effective step $\approx 0.1$. Adam takes steps of approximately $\alpha$ regardless of gradient size.

Iteration 3 ($t = 3$)

Line 5 — Gradient: $g_3 = f'(\theta_2) = 2(4.8001) = $ $9.6002$

Gradient continues to shrink as $\theta$ approaches the minimum at 0

Line 6 — Momentum: $m_3 = 0.9(1.88) + 0.1(9.6002) = 1.692 + 0.96002 = $ $2.652$

Raw $m_t$ is still far from the true gradient due to zero-init bias

Line 7 — Scale: $v_3 = 0.999(0.19594) + 0.001(9.6002^2) = 0.19574 + 0.09216 = $ $0.28790$

Line 8 — Correct momentum: $\hat{m}_3 = \frac{2.652}{1 - 0.9^3} = \frac{2.652}{0.271} = $ $9.786$

Correction factor $1 - 0.729 = 0.271$, still a meaningful boost

Line 9 — Correct scale: $\hat{v}_3 = \frac{0.28790}{1 - 0.999^3} = \frac{0.28790}{0.002997} = $ $96.06$

Line 10 — Update: $\theta_3 = 4.8001 - 0.1 \times \frac{9.786}{\sqrt{96.06} + 10^{-8}} = 4.8001 - 0.1 \times \frac{9.786}{9.801} = 4.8001 - 0.0998 = $ $4.7003$

Three iterations in and the pattern is clear: each step is almost exactly $\alpha = 0.1$, confirming the trust region property. The ratio $\hat{m}_t / \sqrt{\hat{v}_t} \approx 1$, so Adam's effective step is bounded by $\alpha$.

| $t$ | $\theta$ | $g_t$ | $m_t$ | $v_t$ | $\hat{m}_t$ | $\hat{v}_t$ | Step | $\theta_{\text{new}}$ |
|-----|----------|-------|-------|-------|-------------|-------------|------|-----------------------|
| 1 | 5.000 | 10.0 | 1.000 | 0.100 | 10.00 | 100.0 | 0.100 | 4.900 |
| 2 | 4.900 | 9.80 | 1.880 | 0.196 | 9.895 | 98.02 | 0.100 | 4.800 |
| 3 | 4.800 | 9.60 | 2.652 | 0.288 | 9.786 | 96.06 | 0.100 | 4.700 |

Pattern: Adam takes steps of almost exactly $\alpha = 0.1$ each iteration, regardless of the gradient magnitude. That's the trust region property from Section 2.1 of the paper: $|\Delta_t| \lessapprox \alpha$. The step size is bounded by $\alpha$, which makes it easy to choose — you know roughly how far each step will go.
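The table is easy to check by machine. This short script replays the three iterations of Part C ($f(\theta) = \theta^2$, $\theta_0 = 5$, $\alpha = 0.1$); the printed values match the table to the displayed precision, with fourth-decimal differences coming from rounding in the hand computation:

```python
import math

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v, theta = 0.0, 0.0, 5.0                # m0 = v0 = 0, theta0 = 5

for t in range(1, 4):
    g = 2 * theta                          # gradient of theta^2
    m = beta1 * m + (1 - beta1) * g        # line 6
    v = beta2 * v + (1 - beta2) * g * g    # line 7
    m_hat = m / (1 - beta1 ** t)           # line 8
    v_hat = v / (1 - beta2 ** t)           # line 9
    step = alpha * m_hat / (math.sqrt(v_hat) + eps)
    theta -= step                          # line 10
    print(t, round(g, 4), round(m_hat, 3), round(v_hat, 2),
          round(step, 4), round(theta, 4))
```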
PART D

Why These Default Values?

The paper's recommended defaults come directly from Section 2: "Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$." These almost never need to be changed.

Comparison to your course methods

Gradient descent: You had to choose $\alpha$ carefully (too big = diverge, too small = crawl). Adam's default $\alpha = 0.001$ works because the adaptive scaling handles the rest.

Line search: computed a fresh $\alpha$ at every step, which is expensive. Adam doesn't need line search — the second-moment scaling plays that role.

BFGS: Built an $n \times n$ matrix. Adam stores just two vectors. Same concept (use gradient history to improve steps), radically simpler implementation.

Glossary — Full Adam

Adam
Adaptive Moment Estimation. Combines momentum ($m_t$) + adaptive rates ($v_t$) + bias correction. The name comes from "adaptive moment estimation."
Trust region property
Each Adam step is approximately bounded by $\alpha$. You know roughly how far each step will go. Compare to GD where step size depends unpredictably on gradient magnitude.
Hyperparameters
Settings you choose before training: $\alpha$, $\beta_1$, $\beta_2$, $\epsilon$. Adam's defaults rarely need changing.
The update rule
$\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. Direction from momentum, scaling from second moment, corrected for initialization bias.