Page 10: Adaptive Learning Rates (Second Moment)

MATH 3850; Adam Paper, Section 2; Giving each parameter its own step size

PART A

The Idea: Volume Control for Each Parameter

Analogy: A mixing board in a recording studio.

A mixing board has a separate volume slider for each instrument. The drums are loud; turn them down. The violin is quiet; turn it up. You don't use ONE volume knob for all instruments, because they're at completely different levels.

Adam does this for parameters: each parameter gets its own effective step size, scaled by how large its gradients have historically been.

This is exactly the per-parameter $\alpha$ idea from Page 05. Adam implements it using a running average of squared gradients.

PART B

The Formula

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

$v_t$
Running average of squared gradients at step $t$ ("how big has this gradient been?"); measures how large this parameter's gradient has been historically
$v_{t-1}$
Previous step's squared-gradient average; carries the long-term memory of gradient scale
$g_t^2$
Current gradient squared (elementwise); removes the sign and emphasizes large values
$\beta_2$
Decay rate (default 0.999). Very high, so $v_t$ has a long memory (~1000 steps). This captures the stable, long-term scale of each parameter's gradient
$v_0 = 0$
Initialization; same bias-toward-zero issue as $m_t$, fixed by the same bias correction on Page 11
$\hat{v}_t$
Bias-corrected version: $v_t / (1 - \beta_2^t)$. Removes the initialization bias so early steps are accurate
$\frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}$
The effective per-parameter step size. Big past gradients produce big $\hat{v}_t$, which shrinks this fraction (smaller step). Small past gradients produce small $\hat{v}_t$, which grows this fraction (bigger step)
$\epsilon$
A tiny constant ($10^{-8}$) to prevent dividing by zero. Has no practical effect on the math

Why squared gradients? Why not just $|g_t|$?

Squaring does two things:

  1. Makes everything positive. Gradients can be positive or negative. We want to know the magnitude (how big), not the sign. Squaring removes the sign. (Same reason we use $x^2$ instead of $x$ in "sum of squares" from statistics.)
  2. Emphasizes outliers. A gradient of 10 contributes $100$ when squared, while a gradient of 1 contributes just $1$. This makes $v_t$ more sensitive to large gradients, which is exactly what we want; big gradients are the ones that need to be tamed.
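Both effects can be seen on a tiny example (illustrative numbers only, not from the worked problem below):

```python
grads = [10, -1, 1]

squares = [g**2 for g in grads]      # squaring: sign gone, outlier dominates
abs_vals = [abs(g) for g in grads]   # absolute value: sign gone, outlier milder

# Squaring makes the big gradient 100x the others; |g| makes it only 10x.
print(squares)   # [100, 1, 1]
print(abs_vals)  # [10, 1, 1]
```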

The paper calls $v_t$ the "second raw moment estimate" because the second moment of a random variable is $\mathbb{E}[X^2]$; the expected value of the square. $v_t$ estimates $\mathbb{E}[g_t^2]$.
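The whole pipeline for one parameter at one step, from $v_t$ to the effective step size, can be sketched in a few lines of plain Python (variable names are mine; the constants are the paper's defaults):

```python
import math

# Adam's second-moment machinery for a single parameter:
# defaults beta2 = 0.999, alpha = 0.001, eps = 1e-8
beta2, alpha, eps = 0.999, 0.001, 1e-8

v = 0.0    # v_0 = 0
g = 10.0   # one observed gradient
t = 1      # step counter (starts at 1)

v = beta2 * v + (1 - beta2) * g**2       # EMA of squared gradients
v_hat = v / (1 - beta2**t)               # bias correction
step = alpha / (math.sqrt(v_hat) + eps)  # effective per-parameter step size

print(v, v_hat, step)
```

With $g_1 = 10$ this reproduces the Part C numbers: $v_1 = 0.1$, $\hat{v}_1 = 100$, effective step $\approx 0.0001$.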

PART C

Full Hand Computation: Two Parameters

Parameter 1 (big gradients) and Parameter 2 (small gradients)

$\beta_2 = 0.999$, $v_0 = 0$, $\alpha = 0.001$

Gradients over 3 steps: Param 1: $g = [10, 12, 8]$. Param 2: $g = [0.1, -0.2, 0.3]$

Parameter 1 (big gradients):

$t = 1$: Compute $v_1$ (squared-gradient average)

Apply the EMA formula: 99.9% of $v_0 = 0$ + 0.1% of $g_1^2 = 10^2 = 100$.

$v_1 = 0.999(0) + 0.001(10^2) = 0 + 0.001(100) = $ $0.1$

Now bias-correct: divide by $(1 - \beta_2^1)$ to undo the initialization-toward-zero.

Corrected: $\hat{v}_1 = 0.1 / (1 - 0.999^1) = 0.1 / 0.001 = $ $100$

Finally compute the effective step size: $\alpha$ divided by $\sqrt{\hat{v}_1}$.

Effective $\alpha$: $0.001 / (\sqrt{100} + 10^{-8}) = 0.001 / 10 = $ $0.0001$

Step size shrank from 0.001 to 0.0001; 10x smaller because the gradient is big!

$t = 2$: Second step for Parameter 1

Same process: 99.9% of $v_1 = 0.1$ + 0.1% of $g_2^2 = 12^2 = 144$.

$v_2 = 0.999(0.1) + 0.001(12^2) = 0.0999 + 0.144 = $ $0.2439$

Bias-correct and compute step size.

Corrected: $\hat{v}_2 = 0.2439 / (1 - 0.999^2) = 0.2439 / 0.001999 \approx $ $122$

Effective $\alpha$: $0.001 / \sqrt{122} \approx 0.001 / 11.05 \approx $ $0.0000905$

Parameter 2 (small gradients):

$t = 1$: First step for Parameter 2 (tiny gradient)

Same formula, but now $g_1 = 0.1$, so $g_1^2 = 0.01$; a much smaller squared gradient.

$v_1 = 0.999(0) + 0.001(0.1^2) = 0 + 0.001(0.01) = $ $0.00001$

Bias-correct: same division by $(1 - 0.999^1) = 0.001$.

Corrected: $\hat{v}_1 = 0.00001 / 0.001 = $ $0.01$

Now $\sqrt{0.01} = 0.1$ is a tiny denominator, so the step size grows instead of shrinking.

Effective $\alpha$: $0.001 / (\sqrt{0.01} + 10^{-8}) = 0.001 / 0.1 = $ $0.01$

Step size GREW from 0.001 to 0.01; 10x bigger because the gradient is small!

Comparison at $t = 1$; the adaptive magic:

Parameter 1 effective $\alpha$: 0.0001 (small step; big gradient tamed)

Parameter 2 effective $\alpha$: 0.01 (big step; weak gradient boosted)

That's a 100x difference in step size! All automatic.

This is exactly the bathtub solution from Page 08. In SGD, both parameters get the same $\alpha = 0.001$. Adam gives parameter 1 a step of 0.0001 and parameter 2 a step of 0.01. Each direction gets the step size it needs. No hand-tuning required.
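The entire hand computation above can be replayed in code. This is a sketch with my own helper name (`effective_steps`), using the same constants as Part C:

```python
import math

beta2, alpha, eps = 0.999, 0.001, 1e-8

def effective_steps(grads):
    """Run the second-moment update over a gradient sequence and return
    the effective step size alpha / (sqrt(v_hat) + eps) at each step t."""
    v, steps = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g**2   # EMA of squared gradients
        v_hat = v / (1 - beta2**t)           # bias correction
        steps.append(alpha / (math.sqrt(v_hat) + eps))
    return steps

p1 = effective_steps([10, 12, 8])       # big gradients -> small steps
p2 = effective_steps([0.1, -0.2, 0.3])  # small gradients -> big steps

print(p1[0], p2[0])  # roughly 0.0001 vs 0.01 at t = 1: the 100x gap
```

The $t = 1$ values match the hand computation exactly, and $t = 2$ for Parameter 1 comes out near 0.0000905, as derived above.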

PART D

See the Adaptation in Action

Adaptive step sizes for two parameters
Top row: The raw gradients for each parameter. Left has big gradients (~10), right has tiny gradients (~0.2).

Bottom row: The effective step size $\alpha / \sqrt{\hat{v}_t}$ for each. Left gets a tiny step size (big gradient → big $v_t$ → big denominator → small step). Right gets a much bigger step size (small gradient → small $v_t$ → small denominator → big step). ~50x difference! That's the adaptive magic of Adam.

Connection to your course: this IS diagonal preconditioning

On Page 04, Newton's method used $[\nabla^2 f]^{-1}$ (inverse Hessian) to scale each direction by its curvature. But the Hessian is an $n \times n$ matrix; expensive.

Adam approximates a diagonal version: instead of a full matrix, just one number per parameter ($v_t$ for each $\theta_i$). Storage: $n$ numbers instead of $n^2$. For MNIST with $n = 7850$: Adam stores 7,850 numbers; Newton would need 61,622,500.
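The storage gap is easy to check with the MNIST parameter count used in the text (the breakdown into weights and biases is my assumption about where 7850 comes from):

```python
n = 784 * 10 + 10   # MNIST logistic regression: 7840 weights + 10 biases = 7850
adam_storage = n    # one v_t entry per parameter (a diagonal)
newton_storage = n * n  # a full n x n Hessian

print(adam_storage, newton_storage)  # 7850 vs 61622500
```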

The paper says (Section 5): "Adam employs a preconditioner... $\hat{v}_t$ is an approximation to the diagonal of the Fisher information matrix." You don't need to fully understand that — just know that Adam's $v_t$ serves the same purpose as Newton's Hessian, but much cheaper.

Glossary; Adaptive Rates

Second moment ($v_t$)
EMA of squared gradients. Estimates $\mathbb{E}[g^2]$ — the average squared magnitude of each parameter's gradient. Used to scale the step size.
$\beta_2$
Decay rate for the second moment. Default 0.999. Very long memory (~1000 steps). Higher = more stable scaling.
Effective step size ($\alpha / \sqrt{\hat{v}_t}$)
Adam's per-parameter learning rate. Big past gradients → big $v_t$ → small step. Small past gradients → small $v_t$ → big step.
$\epsilon$ ($10^{-8}$)
Tiny constant added to $\sqrt{\hat{v}_t}$ to prevent division by zero. Doesn't affect the math in practice.
Diagonal preconditioning
Scaling each parameter independently (diagonal of a matrix). Adam does this; Newton uses a full matrix.
Second raw moment
Statistics term: $\mathbb{E}[X^2]$. The "raw" means not centered (not subtracting the mean first). $v_t$ estimates this for the gradients.