Page 14: Convergence, AdaMax, and the Big Picture

MATH 3850 · Adam Paper, Sections 4 & 7 · What theory guarantees, the infinity-norm variant, and where Adam fits in optimization

PART A

Convergence Theory: The Regret Framework

Analogy: A poker player's regret.

Imagine you play poker 1,000 hands. After seeing all the results, you compute: "if I had played the single best fixed strategy for ALL hands, how much more would I have won?" That gap is your regret.

If your regret grows slowly (sublinearly), say, doubling when the number of hands quadruples, then your average regret per hand goes to zero. You're approaching the best fixed strategy without knowing it in advance.

Regret (paper's Definition, Section 4):

$$R(T) = \sum_{t=1}^{T} [f_t(\theta_t) - f_t(\theta^*)]$$

$R(T)$
Total regret after $T$ steps. The cumulative gap between Adam's actual performance and the best possible fixed strategy
$T$
Total number of optimization steps (iterations). More steps = more chances for regret to accumulate
$f_t(\theta_t)$
The loss at step $t$ using the parameters Adam chose. This is what actually happened
$f_t(\theta^*)$
The loss at step $t$ if we had used the single best fixed $\theta^*$ (known only in hindsight)
$\sum_{t=1}^{T}$
Sum over all $T$ steps. Each step contributes a gap $f_t(\theta_t) - f_t(\theta^*)$; individual gaps can be negative ($\theta^*$ is only best on the cumulative loss), but a good learner keeps the total small

If $R(T)$ grows slowly, Adam is doing well — it's not falling far behind the optimal choice.
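
To make the definition concrete, here is a tiny regret computation (my own toy example, not from the paper): quadratic losses $f_t(\theta) = (\theta - c_t)^2$, for which the best fixed $\theta^*$ in hindsight is the mean of the targets.

```python
import math

def regret(targets, thetas):
    """R(T) = sum over t of f_t(theta_t) - f_t(theta*), with f_t(x) = (x - c_t)^2."""
    theta_star = sum(targets) / len(targets)  # best fixed parameter in hindsight
    return sum((th - c) ** 2 - (theta_star - c) ** 2
               for th, c in zip(thetas, targets))

# A simple online learner: gradient descent with step size 1/sqrt(t).
targets = [1.0, 3.0, 2.0, 2.0, 1.0, 3.0]
theta, thetas = 0.0, []
for t, c in enumerate(targets, start=1):
    thetas.append(theta)                          # theta_t is committed BEFORE f_t is revealed
    theta -= (1 / math.sqrt(t)) * 2 * (theta - c)  # gradient of (theta - c)^2

R = regret(targets, thetas)
print(f"R(T) = {R:.3f}, average R(T)/T = {R / len(targets):.3f}")
```

Playing $\theta^*$ at every step would give zero regret by construction; the learner's positive but modest total is the gap the theory bounds.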

What does the paper prove? (Theorem 4.1 and Corollary 4.2)

Under certain assumptions (bounded gradients, convex losses):

$$\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)$$

$R(T)/T$
Average regret per step. If this goes to zero, Adam is approaching optimal performance on average
$O(\cdot)$
Big-O notation: "grows no faster than." Hides constant factors to focus on the rate of growth
$1/\sqrt{T}$
The convergence rate. Halving the average regret requires 4x as many steps (square root). The total regret grows slowly, so the average shrinks toward zero

In plain English: the average regret per step goes to zero as you take more steps. After 10,000 steps, your average regret is at most a constant times $1/\sqrt{10000} = 1/100$. After a million steps: a constant times $1/1000$.

This matches the best known rate for any online convex optimization method. Adam isn't just good in practice — it has provably competitive convergence.

What Does $O(1/\sqrt{T})$ Mean Concretely?

After $T = 100$ steps:

Average regret $\leq C / \sqrt{100} = C / 10$

After $T = 10{,}000$ steps:

Average regret $\leq C / \sqrt{10000} = C / 100$

100x more steps, but only 10x better. Square root growth is slow — that's good for us.

After $T = 1{,}000{,}000$ steps:

Average regret $\leq C / 1000$: practically zero.
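
These checkpoints are just the bound $C/\sqrt{T}$ evaluated at increasing $T$. The constant $C$ is hidden by the big-O and unknown, so $C = 1$ below is purely illustrative:

```python
import math

C = 1.0  # the hidden big-O constant; 1.0 is an arbitrary illustration
for T in (100, 10_000, 1_000_000):
    print(f"T = {T:>9,}: bound C/sqrt(T) = {C / math.sqrt(T):g}")
```

100x more steps shrinks the bound by only 10x, exactly the square-root behavior described above.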

Connection to your course: this is like convergence rate analysis from Page 03, but for the online/stochastic setting. Instead of "distance to minimum shrinks," it's "average regret shrinks." Same spirit, different framework.

The assumptions are important: Theorem 4.1 assumes convex loss functions and bounded gradients. Neural networks violate both (non-convex, potentially unbounded gradients). So the theory doesn't formally apply to most practical uses of Adam. But Adam still works great empirically (Page 13). This theory-practice gap is normal and widely accepted in ML.

PART B

AdaMax: The Infinity-Norm Variant (Section 7.1)

Analogy: Adam uses the average of squared gradients ($v_t$) to scale steps. AdaMax uses the maximum instead. It's like grading on the hardest test rather than the average test.

Adam's second moment (L2 norm):

$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$   (average of squares)

AdaMax's second moment (L∞ norm):

$u_t = \max(\beta_2 \cdot u_{t-1},\; |g_t|)$   (maximum of absolutes)

$u_t$
AdaMax's second moment at step $t$. Tracks the exponentially-decaying maximum of gradient magnitudes (not the average)
$\beta_2 \cdot u_{t-1}$
Previous max, slightly decayed. The $\beta_2$ factor lets old large gradients gradually fade
$|g_t|$
Absolute value of the current gradient. If this exceeds the decayed history, it becomes the new max
$\max(\cdot, \cdot)$
Take the larger of the two. This is the L-infinity norm operation: dominated by the single biggest value

The update becomes:

$$\theta_t = \theta_{t-1} - \frac{\alpha}{1 - \beta_1^t} \cdot \frac{m_t}{u_t}$$

Why does AdaMax exist?

Two nice properties:

  1. No bias correction needed for $u_t$. The max operation doesn't suffer from the zero-initialization bias that the average does. One less thing to worry about.
  2. Simpler step bound: $|\Delta_t| \leq \alpha$, always. Clean guarantee.

In practice, Adam and AdaMax perform similarly. The paper presents AdaMax as an interesting theoretical variant. Default recommendation is still Adam.
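
A minimal sketch of one AdaMax step for a single scalar parameter, directly transcribing the two moment updates and the update rule above (the function name and state-passing style are mine, not the paper's):

```python
def adamax_step(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update for a scalar parameter (paper Sec. 7.1)."""
    m = beta1 * m + (1 - beta1) * g   # first moment, same as Adam
    u = max(beta2 * u, abs(g))        # decaying max of |g|: no bias correction needed
    # Divides by u, not sqrt(v). Note: if every gradient so far is 0, u stays 0
    # and this divides by zero; a tiny floor on u would be needed in practice.
    theta -= (alpha / (1 - beta1 ** t)) * m / u
    return theta, m, u

# First step from theta = 0 with gradient g = 5:
theta, m, u = adamax_step(0.0, g=5.0, m=0.0, u=0.0, t=1)
# On this first step m/(1 - beta1) = g and u = |g|, so the move is exactly
# alpha * g/u = alpha, illustrating the clean bound |Delta_t| <= alpha.
```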

Hand Computation: AdaMax vs Adam ($u_t$ vs $v_t$)

Gradients: $g_1 = 5$, $g_2 = -3$, $g_3 = 8$. $\beta_2 = 0.999$.

Adam ($v_t$, average of squares):

$v_1 = 0.001 \times 25 = 0.025$

$v_2 = 0.999(0.025) + 0.001(9) = 0.034$

$v_3 = 0.999(0.034) + 0.001(64) = 0.098$

$v_3 = 0.098$ (running average of $[25, 9, 64]$)

AdaMax ($u_t$, running max):

$u_1 = \max(0, |5|) = 5$

$u_2 = \max(0.999 \times 5, |-3|) = \max(4.995, 3) = 4.995$

$u_3 = \max(0.999 \times 4.995, |8|) = \max(4.990, 8) = 8$

$u_3 = 8$ (largest gradient magnitude seen)

AdaMax is dominated by the biggest gradient ever seen. Adam averages them. Both serve the same purpose: scaling step sizes based on gradient magnitude.
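
The hand computation above can be checked in a few lines, using the same numbers and the same two recurrences:

```python
beta2 = 0.999
grads = [5.0, -3.0, 8.0]

v = 0.0  # Adam: exponential average of squared gradients
u = 0.0  # AdaMax: exponentially-decaying max of |g|
for g in grads:
    v = beta2 * v + (1 - beta2) * g ** 2
    u = max(beta2 * u, abs(g))

print(f"v_3 = {v:.3f}")  # matches the 0.098 computed by hand
print(f"u_3 = {u}")      # matches the 8.0 computed by hand
```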

PART C

The Big Picture: Where Adam Fits in Optimization History

1847
Gradient Descent (Cauchy) — Walk downhill with fixed step size. Simple but slow. One $\alpha$ for everything.
1687/1970s
Newton's Method — Use curvature (Hessian) for perfect steps. Quadratic convergence but expensive ($O(n^2)$ storage).
1970
BFGS (Broyden, Fletcher, Goldfarb, Shanno) — Approximate Newton cheaply using gradient history. Superlinear convergence. Still $O(n^2)$ storage.
1951/1986
SGD + Momentum (Robbins-Monro, Rumelhart) — Use random minibatches for speed. Add momentum to smooth noise. But still one $\alpha$ for all.
2011
AdaGrad (Duchi et al.) — Per-parameter adaptive rates! Great for sparse features. But step size shrinks to zero over time (accumulates forever).
2012
RMSProp (Hinton) — Fix AdaGrad's shrinking by using a decaying average instead of total accumulation. But no momentum and no bias correction.
2015
Adam (Kingma & Ba) — Combines everything: momentum (smooths noise) + adaptive rates (per-parameter scaling, decaying average) + bias correction (fixes initialization). $O(n)$ storage. Works out of the box.
Adam is the synthesis. It takes the best ideas from decades of optimization research and combines them into one clean, efficient, practical algorithm. That's why it became the default optimizer in ML frameworks like PyTorch and TensorFlow.

The Entire Paper in One Page

The problem: SGD is fast but noisy (zigzags) and uses one step size for all parameters (bathtub problem).

Solution 1 — Momentum ($m_t$): Average past gradients to smooth out noise. $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$

Solution 2 — Adaptive rates ($v_t$): Track squared gradient magnitudes per parameter. Divide by $\sqrt{v_t}$ to give each parameter its own step size. $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$

Fix — Bias correction: Both $m_t$ and $v_t$ start at 0, making early estimates too low. Divide $m_t$ by $(1-\beta_1^t)$ and $v_t$ by $(1-\beta_2^t)$ to fix.

The update: $\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$

Theory: Average regret $O(1/\sqrt{T})$ — matches the best possible rate for online convex optimization.

Practice: Works on convex AND non-convex problems. Default settings ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) rarely need tuning. Became the standard optimizer in deep learning.
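
Putting the whole page together, here is the complete Adam step in scalar form, with each line tagged to the piece of the summary it implements. This is a sketch of the paper's Algorithm 1 with its default hyperparameters; the test problem $f(\theta) = \theta^2$ is my own:

```python
def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter."""
    m = beta1 * m + (1 - beta1) * g        # Solution 1: momentum
    v = beta2 * v + (1 - beta2) * g ** 2   # Solution 2: adaptive rates
    m_hat = m / (1 - beta1 ** t)           # Fix: bias correction for m
    v_hat = v / (1 - beta2 ** t)           # Fix: bias correction for v
    theta -= alpha * m_hat / (v_hat ** 0.5 + eps)  # the update
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta), starting from theta = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(f"theta after 2000 steps: {theta:.4f}")  # near the minimum at 0
```

Note how slowly it moves: while the gradient keeps a steady sign, $\hat{m}_t/\sqrt{\hat{v}_t} \approx 1$, so each step is roughly $\alpha = 0.001$, which is why it takes a couple thousand steps to cross a distance of 1.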

Glossary · Convergence and AdaMax

Regret $R(T)$
Total accumulated gap between Adam's choices and the best fixed choice in hindsight. Lower regret = closer to optimal.
$O(1/\sqrt{T})$ average regret
Average regret per step shrinks like $1/\sqrt{T}$. After $T$ steps, you're within $C/\sqrt{T}$ of optimal on average. Goes to zero.
Online convex optimization
The theoretical framework used to analyze Adam. At each step, nature reveals a new convex loss function, and you choose parameters before seeing it.
AdaMax
Adam variant using max (L∞ norm) instead of average (L2 norm) for the second moment. Simpler bound, no bias correction needed for $u_t$.
$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$
AdaMax's second moment: the exponentially-decaying max of gradient magnitudes.

Study Guide Complete!

15 pages covering everything from basic derivatives to the full Adam algorithm.

Go to the Index page for a linked overview of all pages.