MATH 3850; Adam Paper, Section 2; Giving each parameter its own step size
PART A. Analogy: A mixing board in a recording studio.
A mixing board has a separate volume slider for each instrument. The drums are loud; turn them down. The violin is quiet; turn it up. You don't use ONE volume knob for all instruments, because they're at completely different levels.
Adam does the same thing for parameters.
This is exactly the per-parameter $\alpha$ idea from Page 05. Adam implements it using a running average of squared gradients.
$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
Squaring does two things: it makes every term non-negative (the sign of the gradient doesn't matter, only its size), and it makes $v_t$ track the typical magnitude of the gradient (large gradients dominate the average).
The paper calls $v_t$ the "second raw moment estimate" because the second moment of a random variable $X$ is $\mathbb{E}[X^2]$, the expected value of the square. So $v_t$ estimates $\mathbb{E}[g_t^2]$.
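The EMA formula above is one line of code. A minimal sketch (the helper name `update_v` is mine, not from the paper):

```python
# Hypothetical helper: one step of the running average of squared gradients,
# v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2.
def update_v(v_prev, g, beta2=0.999):
    return beta2 * v_prev + (1 - beta2) * g**2

# Starting from v_0 = 0 with a gradient of 10:
v1 = update_v(0.0, 10.0)  # 0.999*0 + 0.001*100 = 0.1
```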
$\beta_2 = 0.999$, $v_0 = 0$, $\alpha = 0.001$
Gradients over 3 steps: Param 1: $g = [10, 12, 8]$. Param 2: $g = [0.1, -0.2, 0.3]$
Parameter 1 (big gradients):
$t = 1$: Compute $v_1$ (squared-gradient average)
Apply the EMA formula: 99.9% of $v_0 = 0$ + 0.1% of $g_1^2 = 10^2 = 100$.
$v_1 = 0.999(0) + 0.001(10^2) = 0 + 0.001(100) = $ $0.1$
Now bias-correct: divide by $(1 - \beta_2^1)$ to undo the initialization-toward-zero.
Corrected: $\hat{v}_1 = 0.1 / (1 - 0.999^1) = 0.1 / 0.001 = $ $100$
Finally compute the effective step size: $\alpha$ divided by $\sqrt{\hat{v}_1}$.
Effective $\alpha$: $0.001 / (\sqrt{100} + 10^{-8}) = 0.001 / 10 = $ $0.0001$
Step size shrunk from 0.001 to 0.0001; 10x smaller because the gradient is big!
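The $t = 1$ arithmetic for Parameter 1 can be checked in a few lines (variable names are mine; the constants match the setup above):

```python
import math

beta2, alpha, eps = 0.999, 0.001, 1e-8
g1 = 10.0

v1 = beta2 * 0.0 + (1 - beta2) * g1**2         # 0.1
vhat1 = v1 / (1 - beta2**1)                    # bias-corrected: 100
alpha_eff = alpha / (math.sqrt(vhat1) + eps)   # ~0.0001, 10x smaller
```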
$t = 2$: Second step for Parameter 1
Same process: 99.9% of $v_1 = 0.1$ + 0.1% of $g_2^2 = 12^2 = 144$.
$v_2 = 0.999(0.1) + 0.001(12^2) = 0.0999 + 0.144 = $ $0.2439$
Bias-correct and compute step size.
Corrected: $\hat{v}_2 = 0.2439 / (1 - 0.999^2) = 0.2439 / 0.001999 \approx $ $122$
Effective $\alpha$: $0.001 / \sqrt{122} \approx 0.001 / 11.05 \approx $ $0.0000905$
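The same two steps, run as a loop, reproduce the $t = 2$ numbers (a sketch; the loop structure is mine):

```python
import math

beta2, alpha = 0.999, 0.001
v = 0.0
for g in [10.0, 12.0]:                 # Parameter 1's gradients at t = 1, 2
    v = beta2 * v + (1 - beta2) * g**2

vhat2 = v / (1 - beta2**2)             # ~122 after bias correction
alpha_eff = alpha / math.sqrt(vhat2)   # ~0.0000905
```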
Parameter 2 (small gradients):
$t = 1$: First step for Parameter 2 (tiny gradient)
Same formula, but now $g_1 = 0.1$, so $g_1^2 = 0.01$; a much smaller squared gradient.
$v_1 = 0.999(0) + 0.001(0.1^2) = 0 + 0.001(0.01) = $ $0.00001$
Bias-correct: same division by $(1 - 0.999^1) = 0.001$.
Corrected: $\hat{v}_1 = 0.00001 / 0.001 = $ $0.01$
Now $\sqrt{0.01} = 0.1$ is a tiny denominator, so the step size grows instead of shrinking.
Effective $\alpha$: $0.001 / (\sqrt{0.01} + 10^{-8}) = 0.001 / 0.1 = $ $0.01$
Step size GREW from 0.001 to 0.01; 10x bigger because the gradient is small!
Comparison at $t = 1$; the adaptive magic:
Parameter 1 effective $\alpha$: 0.0001 (small step; big gradient tamed)
Parameter 2 effective $\alpha$: 0.01 (big step; weak gradient boosted)
That's a 100x difference in step size! All automatic.
This is exactly the bathtub solution from Page 08. In SGD, both parameters get the same $\alpha = 0.001$. Adam gives parameter 1 a step of 0.0001 and parameter 2 a step of 0.01. Each direction gets the step size it needs. No hand-tuning required.
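The 100x gap falls out automatically when both parameters go through the same function. A sketch (the helper `effective_alpha` is my name, not the paper's):

```python
import math

def effective_alpha(g, alpha=0.001, beta2=0.999, eps=1e-8):
    """Effective per-parameter step size after one update (t = 1)."""
    v = (1 - beta2) * g**2
    vhat = v / (1 - beta2)          # bias correction at t = 1
    return alpha / (math.sqrt(vhat) + eps)

# Same formula, same hyperparameters; only the gradient differs.
ratio = effective_alpha(0.1) / effective_alpha(10.0)   # ~100x
```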
In Page 04, Newton's method used $[\nabla^2 f]^{-1}$ (inverse Hessian) to scale each direction by its curvature. But the Hessian is an $n \times n$ matrix; expensive.
Adam approximates a diagonal version: instead of a full matrix, just one number per parameter ($v_t$ for each $\theta_i$). Storage: $n$ numbers instead of $n^2$. For MNIST with $n = 7850$: Adam stores 7,850 numbers; Newton would need 61,622,500.
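The storage gap is just $n$ versus $n^2$ (the breakdown of $n = 7850$ into weights plus biases is my annotation, not from the text above):

```python
n = 7850                    # MNIST logistic regression: 784*10 weights + 10 biases
adam_storage = n            # one v_t entry per parameter (diagonal only)
newton_storage = n * n      # full n-by-n Hessian
# 7,850 numbers vs 61,622,500
```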
The paper says (Section 5): "Adam employs a preconditioner... $\hat{v}_t$ is an approximation to the diagonal of the Fisher information matrix." You don't need to fully understand that — just know that Adam's $v_t$ serves the same purpose as Newton's Hessian, but much cheaper.