Page 08: The Two Problems Adam Solves

MATH 3850. This is why Adam exists: everything in the paper is a solution to these two problems.

Where we are in the story: Pages 00-05 gave you the optimization foundations. Pages 06-07 introduced the ML setting (stochastic gradients, loss functions). Now we identify the exact two problems that SGD has and preview how Adam fixes each one. Pages 09-12 will show the details.
PROBLEM 1

The Zigzag Problem (Noisy Gradients Waste Movement)

Analogy: A drunk person walking home.

They know roughly which direction home is, but each step wobbles left or right randomly. They zigzag down the street, wasting half their energy going sideways instead of forward. They eventually get home, but it takes much longer than walking straight.

In SGD, each gradient is computed from a random minibatch. The direction is approximately right but noisy. Step $t$ might say "go northeast." Step $t+1$ (different random batch) might say "go northwest." The path zigzags instead of going straight north.

The noise doesn't average out fast enough step-by-step. You're wasting compute on sideways movement.

Adam's Fix: Momentum (First Moment $m_t$)

Instead of following today's noisy gradient, follow a smoothed-out average of recent gradients.

Adam keeps a running average of past gradients:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

  • $m_t$: the momentum (smoothed gradient) at step $t$; a running average that filters out noise
  • $\beta_1$: decay rate controlling how much history to keep (default 0.9, meaning 90% old + 10% new)
  • $m_{t-1}$: the momentum from the previous step; carries forward the accumulated direction
  • $g_t$: the current noisy gradient from this step's random minibatch
  • $(1 - \beta_1)$: weight on the new gradient (0.1 with default $\beta_1$); only a small fraction of new info gets in

This is like asking 10 drunk friends for directions and averaging their answers; the random wobbles cancel out, and the average points in the right direction.

$\beta_1 = 0.9$ means: "90% of the old average + 10% of the new gradient." Heavy smoothing. The zigzag gets ironed out.
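The first-moment update is easy to sketch in code. Below is a minimal Python illustration (the noise model, seed, and numbers are illustrative, not from the paper): a "true" gradient direction of +1 with heavy random wobble, smoothed by the same exponential moving average Adam uses.

```python
import random

def ema(gradients, beta1=0.9):
    """Adam's first-moment update, m_0 = 0: m_t = beta1*m_{t-1} + (1-beta1)*g_t."""
    m = 0.0
    history = []
    for g in gradients:
        m = beta1 * m + (1 - beta1) * g
        history.append(m)
    return history

# Noisy gradients: the true direction is +1, each sample wobbles by up to +/-2.
random.seed(0)
noisy = [1.0 + random.uniform(-2.0, 2.0) for _ in range(200)]

smoothed = ema(noisy)
# After warm-up, the smoothed signal hovers near +1 and varies far less
# than the raw gradients: the zigzag is ironed out.
```

With $\beta_1 = 0.9$, each smoothed value is effectively an average over roughly the last 10 gradients, which is exactly the "10 drunk friends" analogy in code.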

Details: Page 09

PROBLEM 2

The One-Size-Fits-All Problem (Same $\alpha$ for Every Parameter)

Analogy: The bathtub.

Imagine an elongated valley shaped like a bathtub. It's very steep across the narrow width (y-direction) and very gentle along the long length (x-direction).

  • If $\alpha$ is big enough for x (gentle slope) → you overshoot in y (steep slope)
  • If $\alpha$ is small enough for y → you crawl in x
  • You can't win with ONE step size for both directions

In real ML models, some parameters have huge gradients and others have tiny ones. A single $\alpha$ can't serve both; it's either too aggressive for the big ones or too timid for the small ones.

This is the condition number problem from Page 03, but worse because in ML you have thousands of parameters, each needing a different scale.
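You can watch the bathtub problem happen numerically. The sketch below (the quadratic and step sizes are illustrative choices, not from the paper) runs plain gradient descent on $f(x, y) = \tfrac{1}{2}(x^2 + 100y^2)$, where $y$ is 100x steeper than $x$:

```python
def gd(alpha, steps=50, x0=10.0, y0=1.0):
    """Plain gradient descent on f(x, y) = 0.5*(x**2 + 100*y**2)."""
    x, y = x0, y0
    for _ in range(steps):
        x -= alpha * x           # df/dx = x      (gentle direction)
        y -= alpha * 100 * y     # df/dy = 100*y  (steep direction)
    return x, y

# Big alpha: x converges nicely, but y multiplies by (1 - 0.05*100) = -4
# each step and explodes.
x_big, y_big = gd(alpha=0.05)

# Small alpha: y converges, but x shrinks only 0.5% per step and crawls.
x_small, y_small = gd(alpha=0.005)
```

One $\alpha$ either blows up the steep direction or stalls the gentle one; no single value fixes both.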

Adam's Fix: Adaptive Learning Rates (Second Moment $v_t$)

Give each parameter its OWN step size, automatically adjusted based on its gradient history.

Adam tracks how big each parameter's gradient has been:

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

  • $v_t$: running average of squared gradients at step $t$; tracks how big this parameter's gradient has been
  • $\beta_2$: decay rate for the second moment (default 0.999; very long memory spanning ~1000 steps)
  • $v_{t-1}$: the previous step's squared-gradient average; carries the long-term gradient scale
  • $g_t^2$: the current gradient squared (elementwise); makes everything positive and emphasizes large values
  • $\alpha / \sqrt{v_t}$: effective step size; big $v_t$ means big past gradients, so divide to shrink the step; small $v_t$ means weak gradients, so the step grows

Adam then divides by $\sqrt{v_t}$ in the update. Parameters with big gradients get a big denominator (smaller step). Parameters with small gradients get a small denominator (bigger step).

$\beta_2 = 0.999$ means: "99.9% of old average + 0.1% of new squared gradient." Very long memory; captures the long-term scale of each parameter's gradient.
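Here is a minimal Python sketch of the second-moment mechanism (the gradient streams and step counts are illustrative): two parameters with very different but steady gradient magnitudes end up with very different effective step sizes, automatically.

```python
def adaptive_step(grads, alpha=0.001, beta2=0.999, eps=1e-8):
    """Accumulate v_t, then return the effective step size alpha / (sqrt(v_t) + eps)."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
    return alpha / (v ** 0.5 + eps)

# 5000 steps so the long beta2 = 0.999 memory has time to fill up.
step_big = adaptive_step([10.0] * 5000)    # v -> ~100,  so step -> ~alpha/10
step_small = adaptive_step([0.1] * 5000)   # v -> ~0.01, so step -> ~10*alpha
```

The ratio `step_small / step_big` comes out near 100: each parameter's step is rescaled by its own gradient history, with no hand-tuning.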

Details: Page 10

  • Problem 1 (Zigzag): SGD's path is noisy; with momentum, it's smooth.
  • Problem 2 (One-Size-Fits-All): SGD uses the same step size everywhere; Adam adapts it per parameter.

See Both Problems (and Both Fixes) Side by Side

Figure: SGD problems vs. Adam solutions, side by side.
Left (SGD): Both problems visible. The path zigzags (problem 1: noisy gradients) and struggles with the elongated contours (problem 2: y is 20x steeper than x but both use the same $\alpha$).

Right (Adam): Both problems solved. The path is smooth (momentum killed the zigzag) and adapts to the shape (adaptive rates give y a smaller step and x a bigger step). Much faster convergence.

Preview: How the Two Fixes Work Together

Imagine two parameters with very different gradients over 3 steps:

Parameter 1 (big gradients): $g = [10, 12, 8]$

This parameter has consistently large gradients; like the steep y-direction in the bathtub.

Average magnitude $\approx 10$, so $v \approx 10^2 = 100$. Adam's adaptive rate: $\alpha / \sqrt{v} \approx \alpha / \sqrt{100} = \alpha / 10$

Effective step size: $\alpha / 10$ (small steps; big gradient is tamed)

Parameter 2 (small gradients): $g = [0.1, -0.2, 0.15]$

This parameter has tiny gradients; like the gentle x-direction in the bathtub. SGD barely moves it.

Average magnitude $\approx 0.15$, so $v \approx 0.15^2 \approx 0.02$. Adam's adaptive rate: $\alpha / \sqrt{v} \approx \alpha / \sqrt{0.02} \approx \alpha / 0.14$

Effective step size: $\alpha / 0.14 \approx 7\alpha$ (big steps; weak gradient is boosted)

Result:

Parameter 1 gets step size $\alpha/10$. Parameter 2 gets step size $7\alpha$. That's a 70x difference! All automatic, no hand-tuning needed.

Meanwhile, momentum ($m_t$) smooths both parameters' noisy gradients so the direction is clean.
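The back-of-envelope numbers above can be checked directly. In this sketch, a simple mean of squared gradients stands in for $v$ (which Adam's long $\beta_2$ memory approximates over time):

```python
g1 = [10, 12, 8]           # parameter 1: consistently large gradients
g2 = [0.1, -0.2, 0.15]     # parameter 2: tiny gradients

def effective_step(grads, alpha=1.0):
    v = sum(g * g for g in grads) / len(grads)   # mean of squares stands in for v_t
    return alpha / v ** 0.5

step1 = effective_step(g1)   # roughly alpha / 10
step2 = effective_step(g2)   # roughly 6.4 * alpha
ratio = step2 / step1        # roughly 65x, close to the page's ~70x estimate
```

The exact ratio here is about 65 rather than 70 because the mean of squares weights the larger samples slightly more than the average magnitude does; the point stands either way.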

The Whole Paper in Two Sentences

PROBLEM 1 (noisy gradients, zigzag) → SOLUTION 1: momentum, $m_t$ (first moment), an average of past gradients.
PROBLEM 2 (one step size for all, the bathtub) → SOLUTION 2: adaptive $\alpha$, $v_t$ (second moment), per-parameter step sizes.
Combined, they give the Adam update: $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, plus bias correction.

Adam = momentum (smooths noise) + adaptive rates (scales each parameter) + bias correction (fixes early steps)
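All three pieces fit in a few lines. Here is a minimal single-parameter sketch of the combined update ($\alpha$, the test function, and the step count are illustrative choices, not from the paper):

```python
def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # first moment: smooths noise
    v = beta2 * v + (1 - beta2) * g * g      # second moment: tracks gradient scale
    m_hat = m / (1 - beta1 ** t)             # bias correction: fixes early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) from x = 5.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, alpha=0.1)
# x ends up near the minimum at 0.
```

Note the normalized step: since $\hat{m}_t / \sqrt{\hat{v}_t}$ has magnitude near 1 when gradients are consistent, each update moves roughly $\alpha$ regardless of the raw gradient's size.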

The next 4 pages explain each piece in detail with full hand computations.

Glossary: The Two Problems

Zigzag problem
SGD's noisy gradients cause the path to wobble side-to-side instead of going straight to the minimum. Wastes compute.
One-size-fits-all problem
SGD uses one $\alpha$ for all parameters. Some need big steps, others need tiny steps. One value can't serve both.
Momentum ($m_t$, first moment)
A running average of recent gradients. Smooths out noise. Formula: $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$.
Adaptive learning rate ($v_t$, second moment)
A running average of squared gradients. Gives each parameter its own step size. Formula: $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$.
$\beta_1$ (beta-1)
Controls momentum smoothing. Default 0.9. Higher = smoother but slower to react to direction changes.
$\beta_2$ (beta-2)
Controls adaptive rate memory. Default 0.999. Higher = longer memory of gradient scales.