MATH 3850, Adam Paper, Section 3: Why the first few steps are wrong, and how one division fixes it
PART A. Analogy: You just moved to a new city.
Someone asks: "What's the average temperature here?" On your first day, you measured 15°C. But that's just one day; it's not a good estimate of the average. After 30 days, your running average is solid. After 1 day? Garbage.
Adam has the same problem. It initializes $m_0 = 0$ and $v_0 = 0$. At step 1, the EMA barely reflects the actual gradient; it's dominated by the zero initialization. Bias correction fixes this by scaling up the early estimates.
Suppose the true gradient is always $g = 5$ (constant for clarity) and $\beta_1 = 0.9$.
$t = 1$: $m_1 = 0.9 \times \underbrace{0}_{m_0} + 0.1 \times 5 = 0.5$
Should be near 5. Got 0.5. That's 10x too low!
Why? Because 90% of the weight went to $m_0 = 0$ (the zero initialization), and only 10% to the actual gradient.
$t = 2$: $m_2 = 0.9(0.5) + 0.1(5) = 0.45 + 0.5 = 0.95$
Still only 0.95. Should be 5. Getting there, but slowly.
$t = 10$: After 10 steps (using the closed form derived below):
$m_{10} = 5 \times (1 - 0.9^{10}) = 5 \times (1 - 0.349) = 5 \times 0.651 \approx 3.26$
After 10 whole steps, still only 65% of the true value!
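The slow warm-up above can be reproduced in a few lines. This is a minimal sketch, assuming the same setup as the worked example ($g = 5$ constant, $\beta_1 = 0.9$, $m_0 = 0$); the function name `uncorrected_m` is just for illustration.

```python
def uncorrected_m(t, g=5.0, beta1=0.9):
    """Run Adam's first-moment EMA update t times, starting from m_0 = 0."""
    m = 0.0  # zero initialization -- the source of the bias
    for _ in range(t):
        m = beta1 * m + (1 - beta1) * g
    return m

print(round(uncorrected_m(1), 2))   # 0.5  -- 10x too low
print(round(uncorrected_m(2), 2))   # 0.95
print(round(uncorrected_m(10), 2))  # 3.26 -- still only ~65% of 5
```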
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad\qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
That's it. Just divide by $(1 - \beta^t)$. This one division removes the initialization bias: exactly in the constant-gradient case worked out here, and in expectation more generally.
At step $t$, the EMA can be written as:
$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$
If all gradients equal the true value $g$, this simplifies to:
$m_t = g \cdot (1 - \beta_1^t)$
So $m_t$ is always a fraction $(1 - \beta_1^t)$ of the true value. Dividing by that fraction cancels it out:
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} = \frac{g \cdot (1 - \beta_1^t)}{1 - \beta_1^t} = g$ ✓
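The derivation can be checked numerically. A small sketch, again assuming $g = 5$ and $\beta_1 = 0.9$ (the helper name `corrected_m` is illustrative):

```python
def corrected_m(t, g=5.0, beta1=0.9):
    """Bias-corrected first-moment estimate after t steps of constant gradient g."""
    m = 0.0
    for _ in range(t):
        m = beta1 * m + (1 - beta1) * g
    return m / (1 - beta1**t)  # the one division that cancels the bias

for t in (1, 2, 10, 100):
    print(t, corrected_m(t))  # recovers g at every step, up to float rounding
```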
Same setup: $g = 5$ (constant), $\beta_1 = 0.9$
$t = 1$: $m_1 = 0.5$
Correction factor: $1 - \beta_1^1 = 1 - 0.9 = 0.1$
$\hat{m}_1 = 0.5 / 0.1 = 5.0$ ✓ Exactly right!
$t = 2$: $m_2 = 0.95$
$1 - \beta_1^2 = 1 - 0.81 = 0.19$
$\hat{m}_2 = 0.95 / 0.19 = 5.0$ ✓ Still exact!
$t = 10$: $m_{10} = 3.26$
$1 - \beta_1^{10} = 1 - 0.349 = 0.651$
$\hat{m}_{10} = 3.26 / 0.651 \approx 5.0$ ✓ (exactly 5 before rounding)
$t = 100$: $1 - \beta_1^{100} = 1 - 0.9^{100} \approx 1 - 0.0000266 \approx 0.99997$
$\hat{m}_{100} = m_{100} / 0.99997 \approx m_{100}$
After 100 steps, the correction factor is ~1 and the bias has effectively disappeared on its own. For $\beta_1 = 0.9$, bias correction only matters for roughly the first 20-30 steps.
For $v_t$ with $\beta_2 = 0.999$, the bias is MUCH worse because $0.999^t$ decays very slowly. At $t = 1$: $1 - 0.999^1 = 0.001$, so $\hat{v}_1 = v_1 / 0.001 = 1000 \times v_1$. The correction is massive early on. This is why the paper says bias correction is "especially important" when $\beta_2$ is close to 1.
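The size of the correction for each moment can be compared directly. A sketch assuming the paper's default $\beta_1 = 0.9$, $\beta_2 = 0.999$ (`correction_scale` is an illustrative name):

```python
def correction_scale(beta, t):
    """Multiplier that bias correction applies to the raw EMA at step t."""
    return 1.0 / (1.0 - beta**t)

print(round(correction_scale(0.9, 1)))        # 10    -- m_1 scaled up 10x
print(round(correction_scale(0.999, 1)))      # 1000  -- v_1 scaled up 1000x
print(round(correction_scale(0.999, 1000), 2))  # ~1.58 -- still noticeable at t = 1000
```

With $\beta_2 = 0.999$ the multiplier stays meaningfully above 1 for thousands of steps, which is why the correction matters so much more for $v_t$ than for $m_t$.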
| $t$ | $\beta_1^t$ | $1 - \beta_1^t$ | $m_t$ | $\hat{m}_t = m_t / (1 - \beta_1^t)$ | Error |
|---|---|---|---|---|---|
| 1 | 0.9 | 0.1 | 0.50 | 5.00 | 0% |
| 2 | 0.81 | 0.19 | 0.95 | 5.00 | 0% |
| 5 | 0.59 | 0.41 | 2.05 | 5.00 | 0% |
| 10 | 0.35 | 0.65 | 3.26 | 5.00 | 0% |
| 20 | 0.12 | 0.88 | 4.39 | 5.00 | 0% |
| 50 | 0.005 | 0.995 | 4.97 | 5.00 | 0% |
| 100 | ≈0 | ≈1 | 5.00 | 5.00 | 0% |
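The table's columns can be regenerated programmatically, using the closed form for constant gradients (same assumptions as above: $g = 5$, $\beta_1 = 0.9$; the helper `row` is illustrative):

```python
def row(t, g=5.0, beta1=0.9):
    """Return (beta1^t, m_t, m_hat_t) for a constant gradient g."""
    m = g * (1 - beta1**t)       # closed form for the uncorrected EMA
    m_hat = m / (1 - beta1**t)   # bias-corrected estimate
    return beta1**t, m, m_hat

for t in (1, 2, 5, 10, 20, 50, 100):
    bt, m, m_hat = row(t)
    print(f"{t:>3}  {bt:8.3f}  {m:6.2f}  {m_hat:.2f}")  # m_hat column is 5.00 in every row
```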
Without bias correction, Adam essentially reduces to RMSProp with momentum (a different optimizer). The paper shows in Section 6.4 (Figure 4) that removing bias correction leads to instabilities, especially when $\beta_2$ is close to 1, which is the recommended setting. Bias correction is what separates Adam from its predecessors.