MATH 3850; Adam Paper, Section 2; Smoothing out noisy gradients with exponential moving averages
PART A. Analogy: Weather forecasting.
Tomorrow's weather forecast isn't just today's temperature. It's a weighted blend: "90% of yesterday's forecast + 10% of today's actual temperature." This way, one weird cold day doesn't throw off the forecast. The average smooths out day-to-day noise.
Adam does the same thing with gradients:
$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
Unrolling this recursion shows that each past gradient's influence decays exponentially with time:
With $\beta_1 = 0.9$, the EMA effectively averages the last $\sim 10$ gradients (anything older is <5% weight). It's like a sliding window, but smoother.
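A quick numerical check of that decay (a minimal sketch, not Adam-specific): the weight the EMA places on the gradient from $k$ steps ago is $(1-\beta_1)\beta_1^k$.

```python
# Weight placed on the gradient from k steps ago: (1 - beta1) * beta1**k
beta1 = 0.9
weights = [(1 - beta1) * beta1**k for k in range(15)]
for k, w in enumerate(weights):
    print(f"k={k:2d}  weight={w:.4f}")
# The weights shrink by a factor of 0.9 each step; by k = 10 each individual
# weight is under 0.04, which is where the "~10-step window" intuition comes from.
# Summed over all k, the weights form a geometric series that converges to 1.
```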
Suppose the noisy gradients are: $g_1 = 8,\; g_2 = 2,\; g_3 = 7,\; g_4 = -1,\; g_5 = 6$
(These jump around because each step uses a different random batch)
$t = 1$: $m_1 = 0.9(0) + 0.1(8)$
Applying the EMA formula: 90% of previous momentum ($m_0 = 0$, no history) + 10% of current gradient ($g_1 = 8$).
$= 0 + 0.8 = 0.8$
Only 10% of $g_1$ got through. The rest is from $m_0 = 0$ (no history yet). This is the initialization bias; Page 11 fixes it.
$t = 2$: $m_2 = 0.9(0.8) + 0.1(2)$
90% of previous momentum ($m_1 = 0.8$) + 10% of new gradient ($g_2 = 2$).
$= 0.72 + 0.2 = 0.92$
Still low. Slowly incorporating more data.
$t = 3$: $m_3 = 0.9(0.92) + 0.1(7)$
90% of $m_2 = 0.92$ + 10% of $g_3 = 7$. A big gradient, but the EMA only lets 10% in.
$= 0.828 + 0.7 = 1.528$
$t = 4$: $m_4 = 0.9(1.528) + 0.1(-1)$
90% of $m_3 = 1.528$ + 10% of $g_4 = -1$. This gradient is an outlier pointing the wrong way.
$= 1.3752 + (-0.1) = 1.2752$
$g_4 = -1$ was an outlier (negative). But $m_4$ barely budged; the average absorbs it. That's momentum working!
$t = 5$: $m_5 = 0.9(1.2752) + 0.1(6)$
90% of $m_4 = 1.2752$ + 10% of $g_5 = 6$. Back to a positive gradient, and the average keeps climbing.
$= 1.14768 + 0.6 = 1.74768$
Notice: the true average of $[8, 2, 7, -1, 6]$ is $22/5 = 4.4$, but $m_5 \approx 1.75$: way too low! That's the initialization bias (we started from $m_0 = 0$). The bias-corrected version: $\hat{m}_5 = 1.74768 / (1 - 0.9^5) \approx 1.74768 / 0.4095 \approx 4.27$, much closer to 4.4!
| $t$ | $g_t$ (noisy) | $m_t$ (EMA) | $\hat{m}_t$ (corrected) | True avg so far |
|---|---|---|---|---|
| 1 | 8 | 0.800 | 8.000 | 8.00 |
| 2 | 2 | 0.920 | 4.842 | 5.00 |
| 3 | 7 | 1.528 | 5.638 | 5.67 |
| 4 | -1 | 1.275 | 3.708 | 4.00 |
| 5 | 6 | 1.748 | 4.268 | 4.40 |
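The whole table can be reproduced with a few lines of Python (a sketch of the computation above, using the same $\beta_1 = 0.9$ and gradient sequence):

```python
# Reproduce the worked example: raw EMA vs. bias-corrected EMA vs. true average.
beta1 = 0.9
grads = [8, 2, 7, -1, 6]

m = 0.0            # m_0 = 0 (no history yet -- the source of the bias)
running_sum = 0.0
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g   # EMA update: m_t = b1*m_{t-1} + (1-b1)*g_t
    m_hat = m / (1 - beta1**t)        # bias correction (Page 11's fix)
    running_sum += g
    print(f"t={t}  g={g:3d}  m={m:.3f}  m_hat={m_hat:.3f}  "
          f"true_avg={running_sum / t:.2f}")
```

The corrected $\hat{m}_t$ column tracks the true running average far more closely than the raw $m_t$ column, exactly as the table shows.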
Without momentum: each step follows the noisy $g_t$ directly. Noise pushes you left, right, up, down randomly.
With momentum: each step follows the smoothed $\hat{m}_t$. The random wobbles cancel out in the average. What remains is the consistent direction: the true gradient. Zigzag eliminated.
Think of it this way: if 8 out of 10 recent gradients pointed north and 2 pointed south, the average points north. The momentum ignores the 2 outliers and follows the majority direction.
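That cancellation is easy to see in a toy simulation (my own illustration, not from the paper: the true gradient of $1.0$ and the noise scale $3.0$ are arbitrary choices). Individual noisy gradients frequently point the wrong way, yet the EMA settles near the true value.

```python
import random

# Toy demo: a constant "true" gradient of +1.0 buried in noise strong enough
# to flip the sign of many individual measurements.
random.seed(0)
beta1 = 0.9
true_grad = 1.0
steps = 200

m = 0.0
sign_flips = 0
for t in range(1, steps + 1):
    g = true_grad + random.gauss(0, 3.0)  # noisy per-batch gradient
    if g < 0:
        sign_flips += 1                   # this measurement points "south"
    m = beta1 * m + (1 - beta1) * g       # EMA update

m_hat = m / (1 - beta1**steps)            # bias correction
print(f"gradients that pointed the wrong way: {sign_flips} of {steps}")
print(f"final bias-corrected EMA: {m_hat:.3f}  (true gradient: {true_grad})")
```

Despite a large fraction of wrong-way measurements, the averaged direction stays close to the true gradient: the "majority vote" in action.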