Page 09: Momentum (First Moment)

MATH 3850; Adam Paper, Section 2; Smoothing out noisy gradients with exponential moving averages

PART A

The Idea: Average Out the Noise

Analogy: Weather forecasting.

Tomorrow's weather forecast isn't just today's temperature. It's a weighted blend: "90% of yesterday's forecast + 10% of today's actual temperature." This way, one weird cold day doesn't throw off the forecast. The average smooths out day-to-day noise.

Adam does the same thing with gradients:

PART B

The Formula: Exponential Moving Average (EMA)

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

$m_t$
The smoothed gradient ("momentum") at timestep $t$; used in place of the raw noisy gradient.
$\beta_1 \cdot m_{t-1}$
90% of the previous step's momentum ("memory" of past gradients); carries forward the accumulated average direction.
$(1 - \beta_1) \cdot g_t$
10% of the raw stochastic gradient from the current minibatch ("today's observation"); noisy but unbiased.
$\beta_1$
Decay rate (default 0.9). Controls the tradeoff: higher = smoother average but slower to react to new info.
$m_0 = 0$
Initialization; we start with no gradient history, which biases early estimates toward zero (fixed by bias correction on Page 11).

Why "exponential" moving average?

Because each past gradient's influence decays exponentially with time. Unrolling the recursion (with $m_0 = 0$) gives

$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i$$

so a gradient from $k$ steps ago carries weight $(1 - \beta_1)\,\beta_1^{k}$.

With $\beta_1 = 0.9$, the EMA effectively averages the last $\sim 10$ gradients: a gradient 10 steps old carries weight $0.1 \cdot 0.9^{10} \approx 3.5\%$, and anything older carries even less. It's like a sliding window, but smoother.
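A quick sanity check that the recursive update really equals a weighted sum with exponentially decaying weights (the example gradients are arbitrary):

```python
beta1 = 0.9
grads = [8, 2, 7, -1, 6]  # arbitrary example gradients

# Recursive form, exactly as in the EMA formula
m = 0.0
for g in grads:
    m = beta1 * m + (1 - beta1) * g

# Unrolled form: m_t = (1 - beta1) * sum_i beta1**(t - i) * g_i
t = len(grads)
m_unrolled = (1 - beta1) * sum(beta1 ** (t - i) * g
                               for i, g in enumerate(grads, start=1))

print(m, m_unrolled)  # both ≈ 1.74768
```

The two forms agree to floating-point precision; the unrolled version is only useful for analysis, since the recursion is what you'd actually run.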

Connection to BFGS: In A2, BFGS used gradient history (the $s_k$ and $y_k$ vectors) to build a matrix $H_k$. Adam's momentum also uses gradient history, but much simpler; just a running weighted average, not a matrix update. Same spirit: "use the past to improve the present step."
PART C

Full Hand Computation

Computing $m_t$ Step by Step ($\beta_1 = 0.9$, $m_0 = 0$)

Suppose the noisy gradients are: $g_1 = 8,\; g_2 = 2,\; g_3 = 7,\; g_4 = -1,\; g_5 = 6$

(These jump around because each step uses a different random batch)

$t = 1$: $m_1 = 0.9(0) + 0.1(8)$

Applying the EMA formula: 90% of previous momentum ($m_0 = 0$, no history) + 10% of current gradient ($g_1 = 8$).

$= 0 + 0.8 = 0.8$

Only 10% of $g_1$ got through. The rest is from $m_0 = 0$ (no history yet). This is the initialization bias; Page 11 fixes it.

$t = 2$: $m_2 = 0.9(0.8) + 0.1(2)$

90% of previous momentum ($m_1 = 0.8$) + 10% of new gradient ($g_2 = 2$).

$= 0.72 + 0.2 = 0.92$

Still low. Slowly incorporating more data.

$t = 3$: $m_3 = 0.9(0.92) + 0.1(7)$

90% of $m_2 = 0.92$ + 10% of $g_3 = 7$. A big gradient, but the EMA only lets 10% in.

$= 0.828 + 0.7 = 1.528$

$t = 4$: $m_4 = 0.9(1.528) + 0.1(-1)$

90% of $m_3 = 1.528$ + 10% of $g_4 = -1$. This gradient is an outlier pointing the wrong way.

$= 1.3752 + (-0.1) = 1.2752$

$g_4 = -1$ was an outlier (negative). But $m_4$ barely budged; the average absorbs it. That's momentum working!

$t = 5$: $m_5 = 0.9(1.2752) + 0.1(6)$

90% of $m_4 = 1.2752$ + 10% of $g_5 = 6$. Back to a positive gradient, and the average keeps climbing.

$= 1.14768 + 0.6 = 1.74768$

Notice: the true average of $[8, 2, 7, -1, 6]$ is $22/5 = 4.4$. But $m_5 = 1.75$; way too low! That's the initialization bias (we started from $m_0 = 0$). The bias-corrected version: $\hat{m}_5 = 1.75 / (1 - 0.9^5) = 1.75 / 0.41 = 4.27$; much closer to 4.4!

$t$ | $g_t$ (noisy) | $m_t$ (EMA) | $\hat{m}_t$ (corrected) | True avg so far
1 | 8 | 0.800 | 8.000 | 8.00
2 | 2 | 0.920 | 4.842 | 5.00
3 | 7 | 1.528 | 5.638 | 5.67
4 | -1 | 1.275 | 3.708 | 4.00
5 | 6 | 1.748 | 4.268 | 4.40
Key takeaway from the table: The raw $m_t$ column is too low (biased toward 0 from initialization). The corrected $\hat{m}_t$ column closely tracks the true running average. The bias correction (Page 11) is what makes the EMA useful in practice.
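The whole table can be replayed in a few lines of Python; the script below runs the same EMA recursion on the same gradients and applies the Page 11 bias correction:

```python
beta1 = 0.9
grads = [8, 2, 7, -1, 6]   # the noisy gradients from the hand computation

m = 0.0
total = 0.0
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g      # raw EMA (biased toward 0 early on)
    m_hat = m / (1 - beta1 ** t)         # bias-corrected estimate (Page 11)
    total += g                           # running sum, for the true average
    print(f"t={t}  g={g:>2}  m={m:.3f}  m_hat={m_hat:.3f}  avg={total / t:.2f}")
```

Running this reproduces the table row by row: the raw $m_t$ column stays far below the true running average, while $\hat{m}_t$ tracks it closely from the very first step.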
PART D

See It Visually

Figure: Raw gradients vs. EMA momentum
Left: Raw stochastic gradients (red bars). They jump around wildly; some positive, some negative, big variance. The dashed yellow line is the true gradient (5.0). Each bar is the gradient from a different random batch.

Right: The same noisy gradients (faded red), with the EMA overlaid. The purple line ($m_t$) starts too low (initialization bias) and slowly climbs. The green line ($\hat{m}_t$) is the bias-corrected version; it tracks the true gradient (yellow dashed) much more closely. The noise is smoothed out. No more zigzag.

Why this solves Problem 1 (zigzag)

Without momentum: each step follows the noisy $g_t$ directly. Noise pushes you left, right, up, down randomly.

With momentum: each step follows the smoothed $\hat{m}_t$. The random wobbles cancel out in the average. What remains is the consistent direction; the true gradient. Zigzag eliminated.

Think of it this way: if 8 out of 10 recent gradients pointed north and 2 pointed south, the average points north. The momentum doesn't chase the 2 outliers; they're outvoted in the average, and the step follows the majority direction.
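The cancellation can be checked numerically. A small simulation (assumed setup: a true gradient of 5.0 plus Gaussian noise of standard deviation 3, mirroring the figure):

```python
import random

random.seed(0)
beta1 = 0.9
true_grad = 5.0

m = 0.0
raw, smoothed = [], []
for t in range(1, 201):
    g = true_grad + random.gauss(0, 3)   # noisy "minibatch" gradient
    m = beta1 * m + (1 - beta1) * g      # EMA momentum
    m_hat = m / (1 - beta1 ** t)         # bias-corrected (Page 11)
    raw.append(g)
    smoothed.append(m_hat)

def mean_abs_err(xs):
    """Average distance from the true gradient."""
    return sum(abs(x - true_grad) for x in xs) / len(xs)

# The EMA hugs the true gradient far more tightly than the raw samples.
print(mean_abs_err(raw), mean_abs_err(smoothed))
```

The smoothed estimate's error is a fraction of the raw gradients' error: the random wobbles largely cancel inside the average, while the consistent component survives.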

Glossary; Momentum

Exponential Moving Average (EMA)
A running weighted average where recent values matter more. Formula: $m_t = \beta m_{t-1} + (1-\beta)x_t$. Older values decay exponentially.
First moment ($m_t$)
Adam's EMA of gradients. Estimates the mean (average direction) of the gradient. Called "first moment" because the mean is the first statistical moment.
$\beta_1$
Decay rate for the first moment. Default 0.9. Controls how much history to keep. Higher = smoother but slower to adapt.
Momentum
Using a running average of past gradients to smooth out noise and maintain consistent direction. Like a heavy ball rolling downhill; it has inertia that resists sudden direction changes.
Initialization bias
$m_0 = 0$, so early values of $m_t$ are biased toward 0. Fixed by dividing by $(1 - \beta_1^t)$. Details on Page 11.