MATH 3850: This is why Adam exists. Everything in the Adam paper is a solution to these two problems.
Where we are in the story: Pages 00-05 gave you the optimization foundations. Pages 06-07 introduced the ML setting (stochastic gradients, loss functions). Now we identify the two exact problems SGD has and preview how Adam fixes each one. Pages 09-12 will show the details.
PROBLEM 1
The Zigzag Problem (Noisy Gradients Waste Movement)
Analogy: A drunk person walking home.
They know roughly which direction home is, but each step wobbles left or right randomly. They zigzag down the street, wasting half their energy going sideways instead of forward. They eventually get home, but it takes much longer than walking straight.
In SGD, each gradient is computed from a random minibatch. The direction is approximately right but noisy. Step $t$ might say "go northeast." Step $t+1$ (different random batch) might say "go northwest." The path zigzags instead of going straight north.
The noise doesn't average out fast enough step-by-step. You're wasting compute on sideways movement.
Adam's Fix: Momentum (First Moment $m_t$)
Instead of following today's noisy gradient, follow a smoothed-out average of recent gradients.
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
$m_t$
The momentum (smoothed gradient) at step $t$; a running average that filters out noise
$\beta_1$
Decay rate controlling how much history to keep (default 0.9, meaning 90% old + 10% new)
$m_{t-1}$
The momentum from the previous step; carries forward the accumulated direction
$g_t$
The current noisy gradient from this step's random minibatch
$(1 - \beta_1)$
Weight on the new gradient (0.1 with default $\beta_1$); only a small fraction of new info gets in
This is like asking 10 drunk friends for directions and averaging their answers; the random wobbles cancel out, and the average points in the right direction.
$\beta_1 = 0.9$ means: "90% of the old average + 10% of the new gradient." Heavy smoothing. The zigzag gets ironed out.
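The smoothing effect is easy to check numerically. Below is a minimal sketch (my own toy, not from the paper): a fixed "true" gradient of 1.0 is corrupted with Gaussian noise, and the momentum average tracks the signal far more tightly than the raw samples do.

```python
import random

random.seed(0)

beta1 = 0.9       # decay rate: keep 90% of history each step
true_grad = 1.0   # the signal hiding under the noise
m = 0.0           # momentum starts at zero

raw_err = ema_err = 0.0
n = 0
for t in range(200):
    g = true_grad + random.gauss(0.0, 1.0)  # noisy "minibatch" gradient
    m = beta1 * m + (1 - beta1) * g         # momentum update
    if t >= 20:                             # skip warm-up while m ramps up from 0
        raw_err += abs(g - true_grad)
        ema_err += abs(m - true_grad)
        n += 1

raw_err /= n   # average miss of the raw noisy gradients
ema_err /= n   # average miss of the smoothed estimate (much smaller)
```

For independent noise, an EMA with $\beta_1 = 0.9$ cuts the noise variance by roughly a factor of $(1 - \beta_1)/(1 + \beta_1) \approx 1/19$; that is the "ten drunk friends" averaging effect in formula form.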
Details: Page 09
PROBLEM 2
The One-Size-Fits-All Problem (Same $\alpha$ for Every Parameter)
Analogy: The bathtub.
Imagine an elongated valley shaped like a bathtub. It's very steep across the narrow width (y-direction) and very gentle along the long length (x-direction).
If $\alpha$ is big enough for x (gentle slope) → you overshoot in y (steep slope)
If $\alpha$ is small enough for y → you crawl in x
You can't win with ONE step size for both directions
In real ML models, some parameters have huge gradients and others have tiny ones. A single $\alpha$ can't serve both; it's either too aggressive for the big ones or too timid for the small ones.
This is the condition number problem from Page 03, but worse because in ML you have thousands of parameters, each needing a different scale.
Adam's Fix: Adaptive Learning Rates (Second Moment $v_t$)
Give each parameter its OWN step size, automatically adjusted based on its gradient history.
Adam tracks how big each parameter's gradient has been:
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
$v_t$
Running average of squared gradients at step $t$; tracks how big this parameter's gradient has been
$\beta_2$
Decay rate for the second moment (default 0.999, very long memory spanning ~1000 steps)
$v_{t-1}$
The previous step's squared-gradient average; carries the long-term gradient scale
$g_t^2$
The current gradient squared (elementwise); makes everything positive and emphasizes large values
$\alpha / \sqrt{v_t}$
Effective step size: big $v_t$ means big past gradients, so divide to shrink the step; small $v_t$ means weak gradients, so the step grows
Adam then divides by $\sqrt{v_t}$ in the update. Parameters with big gradients get a big denominator (smaller step). Parameters with small gradients get a small denominator (bigger step).
$\beta_2 = 0.999$ means: "99.9% of old average + 0.1% of new squared gradient." Very long memory; captures the long-term scale of each parameter's gradient.
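A quick numerical sketch (my own toy example, not from the paper) shows what this buys. Two parameters see constant gradients of very different sizes; the effective learning rate $\alpha/\sqrt{v_t}$ shrinks for the big-gradient parameter and grows for the small one, and the actual update magnitude $\alpha \cdot g_t/\sqrt{v_t}$ comes out identical for both, because dividing by $\sqrt{v_t}$ normalizes away the scale.

```python
beta2 = 0.999
alpha = 0.001

g_big, g_small = 10.0, 0.1   # steep vs. gentle parameter (held constant for the demo)
v_big = v_small = 0.0

for t in range(100):
    v_big = beta2 * v_big + (1 - beta2) * g_big ** 2
    v_small = beta2 * v_small + (1 - beta2) * g_small ** 2

# Effective per-parameter learning rate alpha / sqrt(v_t):
eff_big = alpha / v_big ** 0.5     # shrunk, because past gradients were huge
eff_small = alpha / v_small ** 0.5 # grown, because past gradients were tiny

# Actual update magnitude alpha * g / sqrt(v_t) is the same for both:
step_big = alpha * g_big / v_big ** 0.5
step_small = alpha * g_small / v_small ** 0.5
```

This scale invariance is exactly why one global $\alpha$ stops being a bottleneck: each parameter is measured against its own gradient history.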
Details: Page 10
See Both Problems (and Both Fixes) Side by Side
Left (SGD): Both problems visible. The path zigzags (problem 1: noisy gradients) and struggles with the elongated contours (problem 2: y is 20× steeper than x but both use the same $\alpha$).
Right (Adam): Both problems solved. The path is smooth (momentum killed the zigzag) and adapts to the shape (adaptive rates give y a smaller step and x a bigger step). Much faster convergence.
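The side-by-side picture can be reproduced in a few lines. This is a hedged sketch under my own toy setup (not the figure's exact code): a quadratic bathtub $f(x, y) = \tfrac{1}{2}(x^2 + 10^4 y^2)$ with deliberately exaggerated conditioning, plain SGD at the largest stable global step size, and textbook Adam with the standard bias correction from the full algorithm.

```python
import math

def grad(x, y):
    # Toy bathtub loss f(x, y) = 0.5 * (x**2 + 10_000 * y**2):
    # gentle along x, extremely steep along y.
    return x, 10_000.0 * y

# --- Plain SGD: one global step size, capped by the steep y-direction ---
x, y = 5.0, 5.0
lr = 1.9e-4  # just under the 2 / 10_000 stability limit set by y
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx
    y -= lr * gy
sgd_x, sgd_y = x, y  # y collapses quickly, but x has barely moved

# --- Adam: per-parameter step sizes from the second moment ---
x, y = 5.0, 5.0
alpha, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
mx = my = vx = vy = 0.0
for t in range(1, 201):
    gx, gy = grad(x, y)
    mx = b1 * mx + (1 - b1) * gx        # first moments (momentum)
    my = b1 * my + (1 - b1) * gy
    vx = b2 * vx + (1 - b2) * gx ** 2   # second moments (squared gradients)
    vy = b2 * vy + (1 - b2) * gy ** 2
    mhx, mhy = mx / (1 - b1 ** t), my / (1 - b1 ** t)  # bias correction
    vhx, vhy = vx / (1 - b2 ** t), vy / (1 - b2 ** t)
    x -= alpha * mhx / (math.sqrt(vhx) + eps)
    y -= alpha * mhy / (math.sqrt(vhy) + eps)
adam_x, adam_y = x, y  # both coordinates make real progress
```

Under these settings SGD's x-coordinate crawls (it barely leaves its starting point) while Adam's per-parameter scaling drives both coordinates toward the minimum.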
Preview: How the Two Fixes Work Together
Imagine two parameters with very different gradients over 3 steps:
Parameter 1 (big gradients): $g = [10, 12, 8]$
This parameter has consistently large gradients; like the steep y-direction in the bathtub.
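Plugging Parameter 1's gradients into the two running averages (a quick check with the default $\beta_1 = 0.9$, $\beta_2 = 0.999$, both averages starting from zero):

```python
beta1, beta2 = 0.9, 0.999
grads = [10.0, 12.0, 8.0]   # Parameter 1's gradients from the example

m = v = 0.0
m_hist, v_hist = [], []
for g in grads:
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment (squared-gradient average)
    m_hist.append(m)
    v_hist.append(v)

# m_hist ≈ [1.0, 2.1, 2.69]      -- creeps up toward the gradient scale
# v_hist ≈ [0.1, 0.2439, 0.3077] -- tiny at first: both averages start
#                                   biased toward zero early on
```

Notice how small both averages are after only 3 steps; this early bias toward zero is a real issue, and the full algorithm corrects for it (details on the later pages).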