MATH 3850, Adam Paper, Section 3: Why the first few steps are wrong, and how one division fixes it
PART A. Analogy: You just moved to a new city.
Someone asks: "What's the average temperature here?" On your first day, you measured 15°C. But that's just one day; it's not a good estimate of the average. After 30 days, your running average is solid. After 1 day? Garbage.
Adam has the same problem. It initializes $m_0 = 0$ and $v_0 = 0$. At step 1, the EMA barely reflects the actual gradient; it's dominated by the zero initialization. Bias correction fixes this by scaling up the early estimates.
Suppose the true gradient is always $g = 5$ (constant for clarity) and $\beta_1 = 0.9$.
$t = 1$: $m_1 = 0.9 \times \underbrace{0}_{m_0} + 0.1 \times 5 = 0.5$
Should be near 5. Got 0.5. That's 10x too low!
Why? Because 90% of the weight went to $m_0 = 0$ (the zero initialization), and only 10% to the actual gradient.
$t = 2$: $m_2 = 0.9(0.5) + 0.1(5) = 0.45 + 0.5 = 0.95$
Still only 0.95. Should be 5. Getting there, but slowly.
$t = 10$: After 10 steps (using the closed form derived below):
$m_{10} = 5 \times (1 - 0.9^{10}) = 5 \times (1 - 0.349) = 5 \times 0.651 \approx 3.26$
After 10 whole steps, still only 65% of the true value!
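The slow warm-up above can be reproduced in a few lines. This is a minimal sketch, assuming the same setup as the worked example ($g = 5$ constant, $\beta_1 = 0.9$, $m_0 = 0$); the function name `uncorrected_m` is just for illustration.

```python
def uncorrected_m(t, g=5.0, beta1=0.9):
    """Run Adam's first-moment EMA update t times, starting from m_0 = 0."""
    m = 0.0  # zero initialization -- the source of the bias
    for _ in range(t):
        m = beta1 * m + (1 - beta1) * g
    return m

print(round(uncorrected_m(1), 2))   # 0.5  -- 10x too low
print(round(uncorrected_m(2), 2))   # 0.95
print(round(uncorrected_m(10), 2))  # 3.26 -- still only ~65% of 5
```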
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad\qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
That's it. Just divide by $(1 - \beta^t)$. This one division removes the initialization bias: exactly in the constant-gradient case worked out here, and in expectation more generally.
At step $t$, the EMA can be written as:
$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$
If all gradients equal the true value $g$, this simplifies to:
$m_t = g \cdot (1 - \beta_1^t)$
So $m_t$ is always a fraction $(1 - \beta_1^t)$ of the true value. Dividing by that fraction cancels it out:
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} = \frac{g \cdot (1 - \beta_1^t)}{1 - \beta_1^t} = g$ ✓
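The derivation can be checked numerically. A small sketch, again assuming $g = 5$ and $\beta_1 = 0.9$ (the helper name `corrected_m` is illustrative):

```python
def corrected_m(t, g=5.0, beta1=0.9):
    """Bias-corrected first-moment estimate after t steps of constant gradient g."""
    m = 0.0
    for _ in range(t):
        m = beta1 * m + (1 - beta1) * g
    return m / (1 - beta1**t)  # the one division that cancels the bias

for t in (1, 2, 10, 100):
    print(t, corrected_m(t))  # recovers g at every step, up to float rounding
```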
Same setup: $g = 5$ (constant), $\beta_1 = 0.9$
$t = 1$: $m_1 = 0.5$
Correction factor: $1 - \beta_1^1 = 1 - 0.9 = 0.1$
$\hat{m}_1 = 0.5 / 0.1 = 5.0$ ✓ Exactly right!
$t = 2$: $m_2 = 0.95$
$1 - \beta_1^2 = 1 - 0.81 = 0.19$
$\hat{m}_2 = 0.95 / 0.19 = 5.0$ ✓ Still exact!
$t = 10$: $m_{10} = 3.26$
$1 - \beta_1^{10} = 1 - 0.349 = 0.651$
$\hat{m}_{10} = 3.26 / 0.651 \approx 5.0$ ✓ (exactly 5 before rounding)
$t = 100$: $1 - \beta_1^{100} = 1 - 0.9^{100} \approx 1 - 0.0000266 \approx 0.99997$
$\hat{m}_{100} = m_{100} / 0.99997 \approx m_{100}$
After 100 steps, the correction factor is ~1 and the bias has effectively disappeared on its own. For $\beta_1 = 0.9$, bias correction only matters for roughly the first 20-30 steps.
For $v_t$ with $\beta_2 = 0.999$, the bias is MUCH worse because $0.999^t$ decays very slowly. At $t = 1$: $1 - 0.999^1 = 0.001$, so $\hat{v}_1 = v_1 / 0.001 = 1000 \times v_1$. The correction is massive early on. This is why the paper says bias correction is "especially important" when $\beta_2$ is close to 1.
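The size of the correction for each moment can be compared directly. A sketch assuming the paper's default $\beta_1 = 0.9$, $\beta_2 = 0.999$ (`correction_scale` is an illustrative name):

```python
def correction_scale(beta, t):
    """Multiplier that bias correction applies to the raw EMA at step t."""
    return 1.0 / (1.0 - beta**t)

print(round(correction_scale(0.9, 1)))        # 10    -- m_1 scaled up 10x
print(round(correction_scale(0.999, 1)))      # 1000  -- v_1 scaled up 1000x
print(round(correction_scale(0.999, 1000), 2))  # ~1.58 -- still noticeable at t = 1000
```

With $\beta_2 = 0.999$ the multiplier stays meaningfully above 1 for thousands of steps, which is why the correction matters so much more for $v_t$ than for $m_t$.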
| $t$ | $\beta_1^t$ | $1 - \beta_1^t$ | $m_t$ | $\hat{m}_t = m_t / (1 - \beta_1^t)$ | Error |
|---|---|---|---|---|---|
| 1 | 0.9 | 0.1 | 0.50 | 5.00 | 0% |
| 2 | 0.81 | 0.19 | 0.95 | 5.00 | 0% |
| 5 | 0.59 | 0.41 | 2.05 | 5.00 | 0% |
| 10 | 0.35 | 0.65 | 3.26 | 5.00 | 0% |
| 20 | 0.12 | 0.88 | 4.39 | 5.00 | 0% |
| 50 | 0.005 | 0.995 | 4.97 | 5.00 | 0% |
| 100 | ≈0 | ≈1 | 5.00 | 5.00 | 0% |
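The table's columns can be regenerated programmatically, using the closed form for constant gradients (same assumptions as above: $g = 5$, $\beta_1 = 0.9$; the helper `row` is illustrative):

```python
def row(t, g=5.0, beta1=0.9):
    """Return (beta1^t, m_t, m_hat_t) for a constant gradient g."""
    m = g * (1 - beta1**t)       # closed form for the uncorrected EMA
    m_hat = m / (1 - beta1**t)   # bias-corrected estimate
    return beta1**t, m, m_hat

for t in (1, 2, 5, 10, 20, 50, 100):
    bt, m, m_hat = row(t)
    print(f"{t:>3}  {bt:8.3f}  {m:6.2f}  {m_hat:.2f}")  # m_hat column is 5.00 in every row
```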
Without bias correction, Adam essentially reduces to RMSProp with momentum (a different optimizer). The paper shows in Section 6.4 (Figure 4) that removing bias correction leads to instabilities, especially when $\beta_2$ is close to 1, which is the recommended setting. Bias correction is what separates Adam from its predecessors.