Page 12: The Full Adam Algorithm

MATH 3850 — Adam Paper, Algorithm 1 — every line explained, then a complete hand computation

PART A

The Complete Algorithm (Color-Coded)

Algorithm 1: Adam

Inputs: $\alpha = 0.001$ (step size), $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $f(\theta)$ (loss function), $\theta_0$ (initial parameters)

INITIALIZATION

Line 1: $m_0 \leftarrow 0$ (first moment — momentum vector, all zeros). No gradient history yet.
Line 2: $v_0 \leftarrow 0$ (second moment — scale vector, all zeros). No squared gradient history yet.
Line 3: $t \leftarrow 0$ (timestep counter). Will increment each iteration.

REPEAT UNTIL CONVERGED

Line 4: $t \leftarrow t + 1$ — advance the clock.
Line 5: $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ — compute the gradient on a random minibatch.
Line 6: $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ — update momentum (Page 09).
Line 7: $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ — update the scale tracker (Page 10).
Line 8: $\hat{m}_t \leftarrow m_t \;/\; (1 - \beta_1^t)$ — bias-correct momentum (Page 11).
Line 9: $\hat{v}_t \leftarrow v_t \;/\; (1 - \beta_2^t)$ — bias-correct scale (Page 11).
Line 10: $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$ — THE UPDATE: combine everything!

Return $\theta_t$ (the optimized parameters)
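Algorithm 1 maps almost line-for-line onto code. Here is a minimal sketch in plain Python for a single scalar parameter; the `max_iters`/`tol` stopping rule is my addition, since the paper just says "repeat until converged":

```python
import math

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, max_iters=10_000, tol=1e-8):
    """Algorithm 1 (scalar form): minimize f given its gradient function."""
    m, v, t = 0.0, 0.0, 0                      # lines 1-3: zero-init moments and clock
    theta = theta0
    for _ in range(max_iters):                 # repeat until converged
        t += 1                                 # line 4: advance the clock
        g = grad(theta)                        # line 5: gradient at current position
        m = beta1 * m + (1 - beta1) * g        # line 6: update momentum
        v = beta2 * v + (1 - beta2) * g * g    # line 7: update scale tracker
        m_hat = m / (1 - beta1 ** t)           # line 8: bias-correct momentum
        v_hat = v / (1 - beta2 ** t)           # line 9: bias-correct scale
        step = alpha * m_hat / (math.sqrt(v_hat) + eps)
        theta -= step                          # line 10: THE UPDATE
        if abs(step) < tol:                    # simple stopping rule (not in the paper)
            break
    return theta
```

For vector parameters the same code works elementwise (e.g. on NumPy arrays): every operation in lines 6-10 is per-parameter, which is why Adam needs only two extra vectors of storage.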

Line 10 decoded: the heart of Adam

$\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$

$\theta_t$
Updated parameters at timestep $t$ (what we are solving for)
$\theta_{t-1}$
Parameters from the previous step (our current position)
$\alpha$
Base step size (learning rate). Default 0.001. Because Adam normalizes steps, this directly controls how far each step goes
$\hat{m}_t$
Bias-corrected first moment: the smoothed gradient direction (numerator = which way to go)
$\sqrt{\hat{v}_t}$
Square root of bias-corrected second moment: per-parameter gradient magnitude (denominator = how much to scale down)
$\epsilon$
Tiny constant ($10^{-8}$) to prevent division by zero. Does not affect results in practice

Compare to plain gradient descent: $\theta_t = \theta_{t-1} - \alpha \cdot g_t$

Adam replaces $g_t$ with $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. This does two things at once: the numerator $\hat{m}_t$ smooths the noisy gradient into a stable direction, and the denominator $\sqrt{\hat{v}_t}$ rescales each parameter's step by its typical gradient magnitude.

Because of that rescaling, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately $\pm 1$ for each parameter (Page 01 showed this). So the effective step size is approximately $\alpha$ for every parameter, regardless of gradient scale. That's the "trust region" property the paper discusses in Section 2.1.
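That scale-invariance is easy to verify numerically. A small sketch (my own illustration, not from the paper): starting from $m_0 = v_0 = 0$, the bias corrections cancel the $(1 - \beta)$ factors exactly, so the first step is $\alpha \cdot g / (|g| + \epsilon)$ no matter how large or small the gradient $g$ is:

```python
import math

def first_adam_step(g, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step from m0 = v0 = 0: bias correction gives
    # m_hat = g and v_hat = g**2 exactly on the first iteration.
    m_hat = (1 - beta1) * g / (1 - beta1)
    v_hat = (1 - beta2) * g * g / (1 - beta2)
    return alpha * m_hat / (math.sqrt(v_hat) + eps)

for g in (1e-6, 1.0, 1e6):
    print(g, first_adam_step(g))   # each step comes out near 0.001
```

Plain gradient descent's first step would be $\alpha g$: $10^{-9}$ for the tiny gradient, $10^{3}$ for the huge one. Adam takes a step of about $\alpha$ in both cases.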

PART B

Data Flow: How the Pieces Connect

The noisy GRADIENT $g_t$ feeds two running averages:

- MOMENTUM: $m_t = \beta_1 \cdot m + (1 - \beta_1) \cdot g$ — smooths the direction
- SCALE: $v_t = \beta_2 \cdot v + (1 - \beta_2) \cdot g^2$ — tracks gradient magnitude²

Each is then bias-corrected:

- CORRECT: $\hat{m} = m / (1 - \beta_1^t)$ — fix zero-init bias in the direction
- CORRECT: $\hat{v} = v / (1 - \beta_2^t)$ — fix zero-init bias in the scale

Finally, the UPDATE combines them: $\theta \leftarrow \theta - \alpha \cdot \hat{m} / (\sqrt{\hat{v}} + \epsilon)$ — direction ÷ scale = adaptive step.

PART C

Complete Hand Computation: 3 Iterations of Adam

Setup

Minimize $f(\theta) = \theta^2$. Derivative: $f'(\theta) = 2\theta$.

Start at $\theta_0 = 5$. Hyperparameters: $\alpha = 0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

Initialize: $m_0 = 0$, $v_0 = 0$, $t = 0$.

(Using a simple 1-parameter function so we can focus on the algorithm, not the linear algebra.)

Iteration 1 ($t = 1$)

Line 5 — Gradient: $g_1 = f'(\theta_0) = 2(5) = $ $10$

Line 6 — Momentum: $m_1 = 0.9(0) + 0.1(10) = 0 + 1 = $ $1$

Line 7 — Scale: $v_1 = 0.999(0) + 0.001(10^2) = 0 + 0.1 = $ $0.1$

Line 8 — Correct momentum: $\hat{m}_1 = \frac{1}{1 - 0.9^1} = \frac{1}{0.1} = $ $10$

Line 9 — Correct scale: $\hat{v}_1 = \frac{0.1}{1 - 0.999^1} = \frac{0.1}{0.001} = $ $100$

Line 10 — Update: $\theta_1 = 5 - 0.1 \times \frac{10}{\sqrt{100} + 10^{-8}} = 5 - 0.1 \times \frac{10}{10} = 5 - 0.1 = $ $4.9$

Notice: effective step = $\alpha \times \hat{m}/\sqrt{\hat{v}} = 0.1 \times 10/10 = 0.1$. The step size is approximately $\alpha$ regardless of gradient magnitude! That's Adam's trust region property.

Iteration 2 ($t = 2$)

Line 5 — Gradient: $g_2 = f'(\theta_1) = 2(4.9) = $ $9.8$

Gradient shrank slightly because $\theta$ moved closer to 0

Line 6 — Momentum: $m_2 = 0.9(1) + 0.1(9.8) = 0.9 + 0.98 = $ $1.88$

Blends 90% of old momentum with 10% of the new gradient

Line 7 — Scale: $v_2 = 0.999(0.1) + 0.001(9.8^2) = 0.0999 + 0.09604 = $ $0.19594$

Second moment grows as it accumulates squared gradient history

Line 8 — Correct momentum: $\hat{m}_2 = \frac{1.88}{1 - 0.9^2} = \frac{1.88}{0.19} = $ $9.895$

Correction factor $0.19$ is still small, so the boost is still large

Line 9 — Correct scale: $\hat{v}_2 = \frac{0.19594}{1 - 0.999^2} = \frac{0.19594}{0.001999} = $ $98.02$

$\beta_2^2 \approx 0.998$, so the correction factor $1 - \beta_2^2 \approx 0.002$ is still tiny (massive boost)

Line 10 — Update: $\theta_2 = 4.9 - 0.1 \times \frac{9.895}{\sqrt{98.02} + 10^{-8}} = 4.9 - 0.1 \times \frac{9.895}{9.901} = 4.9 - 0.0999 = $ $4.8001$

Again, effective step $\approx 0.1$. Adam takes steps of approximately $\alpha$ regardless of gradient size.

Iteration 3 ($t = 3$)

Line 5 — Gradient: $g_3 = f'(\theta_2) = 2(4.8001) = $ $9.6002$

Gradient continues to shrink as $\theta$ approaches the minimum at 0

Line 6 — Momentum: $m_3 = 0.9(1.88) + 0.1(9.6002) = 1.692 + 0.96002 = $ $2.652$

Raw $m_t$ is still far from the true gradient due to zero-init bias

Line 7 — Scale: $v_3 = 0.999(0.19594) + 0.001(9.6002^2) = 0.19574 + 0.09216 = $ $0.28790$

Line 8 — Correct momentum: $\hat{m}_3 = \frac{2.652}{1 - 0.9^3} = \frac{2.652}{0.271} = $ $9.786$

Correction factor $1 - 0.729 = 0.271$, still a meaningful boost

Line 9 — Correct scale: $\hat{v}_3 = \frac{0.28790}{1 - 0.999^3} = \frac{0.28790}{0.002997} = $ $96.06$

Line 10 — Update: $\theta_3 = 4.8001 - 0.1 \times \frac{9.786}{\sqrt{96.06} + 10^{-8}} = 4.8001 - 0.1 \times \frac{9.786}{9.801} = 4.8001 - 0.0998 = $ $4.7003$

Three iterations in and the pattern is clear: each step is almost exactly $\alpha = 0.1$, confirming the trust region property. The ratio $\hat{m}_t / \sqrt{\hat{v}_t} \approx 1$, so Adam's effective step is bounded by $\alpha$.

| $t$ | $\theta$ | $g_t$ | $m_t$ | $v_t$ | $\hat{m}_t$ | $\hat{v}_t$ | Step | $\theta_{\text{new}}$ |
|-----|----------|-------|-------|-------|-------------|-------------|------|-----------------------|
| 1 | 5.000 | 10.0 | 1.000 | 0.100 | 10.00 | 100.0 | 0.100 | 4.900 |
| 2 | 4.900 | 9.80 | 1.880 | 0.196 | 9.895 | 98.02 | 0.100 | 4.800 |
| 3 | 4.800 | 9.60 | 2.652 | 0.288 | 9.786 | 96.06 | 0.100 | 4.700 |

Pattern: Adam takes steps of almost exactly $\alpha = 0.1$ each iteration, regardless of the gradient magnitude. That's the trust region property from Section 2.1 of the paper: $|\Delta_t| \lessapprox \alpha$. The step size is bounded by $\alpha$, which makes it easy to choose — you know roughly how far each step will go.
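The table is easy to check by machine. This short script replays the three iterations of Part C ($f(\theta) = \theta^2$, $\theta_0 = 5$, $\alpha = 0.1$); the printed values match the table to the displayed precision, with fourth-decimal differences coming from rounding in the hand computation:

```python
import math

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v, theta = 0.0, 0.0, 5.0                # m0 = v0 = 0, theta0 = 5

for t in range(1, 4):
    g = 2 * theta                          # gradient of theta^2
    m = beta1 * m + (1 - beta1) * g        # line 6
    v = beta2 * v + (1 - beta2) * g * g    # line 7
    m_hat = m / (1 - beta1 ** t)           # line 8
    v_hat = v / (1 - beta2 ** t)           # line 9
    step = alpha * m_hat / (math.sqrt(v_hat) + eps)
    theta -= step                          # line 10
    print(t, round(g, 4), round(m_hat, 3), round(v_hat, 2),
          round(step, 4), round(theta, 4))
```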
PART D

Why These Default Values?

The paper's recommended defaults come directly from Section 2: "Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$." These almost never need to be changed.

Comparison to your course methods

Gradient descent: You had to choose $\alpha$ carefully (too big = diverge, too small = crawl). Adam's default $\alpha = 0.001$ works because the adaptive scaling handles the rest.

Line search: computed a fresh $\alpha$ at every step, which is expensive. Adam doesn't need line search — the second-moment scaling plays that role.

BFGS: Built an $n \times n$ matrix. Adam stores just two vectors. Same concept (use gradient history to improve steps), radically simpler implementation.

Glossary — Full Adam

Adam
Adaptive Moment Estimation. Combines momentum ($m_t$) + adaptive rates ($v_t$) + bias correction. The name comes from "adaptive moment estimation."
Trust region property
Each Adam step is approximately bounded by $\alpha$. You know roughly how far each step will go. Compare to GD where step size depends unpredictably on gradient magnitude.
Hyperparameters
Settings you choose before training: $\alpha$, $\beta_1$, $\beta_2$, $\epsilon$. Adam's defaults rarely need changing.
The update rule
$\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. Direction from momentum, scaling from second moment, corrected for initialization bias.