MATH 3850 — Adam Paper, Sections 4 & 7 — What theory guarantees, the infinity-norm variant, and where Adam fits in optimization
PART A — Analogy: A poker player's regret.
Imagine you play poker 1,000 hands. After seeing all the results, you compute: "if I had played the single best fixed strategy for ALL hands, how much more would I have won?" That gap is your regret.
If your regret grows slowly (sublinearly) — say it doubles when the number of hands quadruples — then your average regret per hand goes to zero. You're approaching the best fixed strategy without knowing it in advance.
Regret (paper's definition, Section 4), where $\theta^*$ is the best fixed parameter chosen in hindsight:
$$R(T) = \sum_{t=1}^{T} [f_t(\theta_t) - f_t(\theta^*)]$$
If $R(T)$ grows slowly, Adam is doing well — it's not falling far behind the optimal choice.
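A tiny worked regret computation helps make the definition concrete. This sketch uses made-up quadratic losses $f_t(\theta) = (\theta - c_t)^2$ with hypothetical targets $c_t$, and a plain gradient-descent player standing in for Adam:

```python
# Toy regret computation for quadratic losses f_t(theta) = (theta - c_t)^2.
# The targets c_t, learning rate, and gradient-descent player are all
# illustrative choices, not from the paper.

targets = [1.0, 3.0, 2.0, 4.0, 2.5]       # hypothetical per-round data
theta_star = sum(targets) / len(targets)  # best FIXED theta in hindsight (the mean)

theta, lr = 0.0, 0.3                      # the online player's state
regret = 0.0
for c in targets:
    # f_t(theta_t) - f_t(theta*), summed over rounds = R(T)
    regret += (theta - c) ** 2 - (theta_star - c) ** 2
    theta -= lr * 2 * (theta - c)         # gradient step on the loss just revealed

print(f"R(T) = {regret:.3f} over T = {len(targets)} rounds")
```

Individual rounds can even have negative regret (the player got lucky), but the cumulative $R(T)$ is what the theory bounds.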
Under certain assumptions (bounded gradients, convex losses):
$$\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)$$
In plain English: the average regret per step goes to zero as you take more steps. After 10,000 steps, your average regret is (up to a constant) at most $1/\sqrt{10000} = 1/100$. After a million steps: at most $1/1000$.
This matches the best known rate for any online convex optimization method. Adam isn't just good in practice — it has provably competitive convergence.
After $T = 100$ steps:
Average regret $\leq C / \sqrt{100} = C / 10$
After $T = 10{,}000$ steps:
Average regret $\leq C / \sqrt{10000} = C / 100$
100x more steps, but only 10x better. Square root growth is slow — that's good for us.
After $T = 1{,}000{,}000$ steps:
Average regret $\leq C / 1000$ — practically zero.
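The three bounds above can be checked in a couple of lines (the constant $C$ is unspecified by the theory, so its value here is arbitrary):

```python
import math

C = 1.0  # unknown constant from the regret bound; 1.0 is an arbitrary stand-in
for T in [100, 10_000, 1_000_000]:
    # Average-regret bound C / sqrt(T): quadrupling T only halves the bound
    print(f"T = {T:>9,}: C/sqrt(T) = {C / math.sqrt(T):.6f}")
```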
Connection to your course: this is like convergence rate analysis from Page 03, but for the online/stochastic setting. Instead of "distance to minimum shrinks," it's "average regret shrinks." Same spirit, different framework.
Analogy: Adam uses the average of squared gradients ($v_t$) to scale steps. AdaMax uses the maximum instead. It's like grading on the hardest test rather than the average test.
Adam's second moment (L2 norm):
$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (average of squares)
AdaMax's second moment (L∞ norm):
$u_t = \max(\beta_2 \cdot u_{t-1},\; |g_t|)$ (maximum of absolutes)
The update becomes:
$$\theta_t = \theta_{t-1} - \frac{\alpha}{1 - \beta_1^t} \cdot \frac{m_t}{u_t}$$
Two nice properties:
Property 1 — No bias correction needed for $u_t$: the max operation isn't pulled toward zero by the initialization $u_0 = 0$, so only $m_t$ gets corrected (the $1/(1-\beta_1^t)$ factor in the update).
Property 2 — Simpler stability: there's no $\epsilon$ in the denominator, and the magnitude of each parameter update has a clean bound ($|\Delta\theta_t| \leq \alpha$).
In practice, Adam and AdaMax perform similarly. The paper presents AdaMax as an interesting theoretical variant. Default recommendation is still Adam.
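The AdaMax update above can be sketched as a scalar step function. This is an illustrative single-parameter version; the defaults ($\alpha = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) follow the paper's suggestions for AdaMax, and the test loss $f(\theta) = \theta^2$ is a made-up example:

```python
def adamax_step(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update for a single scalar parameter (sketch of the formulas above)."""
    m = beta1 * m + (1 - beta1) * g   # first moment, same as Adam
    u = max(beta2 * u, abs(g))        # infinity-norm second moment (running max)
    # Note: only m needs bias correction; u does not
    theta = theta - (alpha / (1 - beta1 ** t)) * m / u
    return theta, m, u

# Usage: a few steps on f(theta) = theta^2, whose gradient is 2*theta
theta, m, u = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, u = adamax_step(theta, 2 * theta, m, u, t)
print(theta)  # should have moved from 1.0 toward the minimum at 0
```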
Gradients: $g_1 = 5$, $g_2 = -3$, $g_3 = 8$. $\beta_2 = 0.999$.
Adam ($v_t$, average of squares):
$v_1 = 0.001 \times 25 = 0.025$
$v_2 = 0.999(0.025) + 0.001(9) = 0.034$
$v_3 = 0.999(0.034) + 0.001(64) = 0.098$
$v_3 = 0.098$ (running average of $[25, 9, 64]$)
AdaMax ($u_t$, running max):
$u_1 = \max(0, |5|) = 5$
$u_2 = \max(0.999 \times 5, |-3|) = \max(4.995, 3) = 4.995$
$u_3 = \max(0.999 \times 4.995, |8|) = \max(4.990, 8) = 8$
$u_3 = 8$ (largest gradient magnitude seen)
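The worked example above is easy to reproduce in code — same gradients, same $\beta_2$:

```python
# Reproduce the worked example: Adam's v_t vs AdaMax's u_t
# for gradients g = [5, -3, 8] with beta2 = 0.999.

beta2 = 0.999
grads = [5.0, -3.0, 8.0]

v, u = 0.0, 0.0
for g in grads:
    v = beta2 * v + (1 - beta2) * g ** 2  # Adam: moving average of squares
    u = max(beta2 * u, abs(g))            # AdaMax: exponentially decayed max
    print(f"g = {g:+.0f}  ->  v = {v:.6f},  u = {u:.6f}")
```

The final values match the hand computation: $v_3 \approx 0.098$ and $u_3 = 8$.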
AdaMax is dominated by the biggest gradient ever seen. Adam averages them. Both serve the same purpose: scaling step sizes based on gradient magnitude.
The problem: SGD is fast but noisy (zigzags) and uses one step size for all parameters (bathtub problem).
Solution 1 — Momentum ($m_t$): Average past gradients to smooth out noise. $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$
Solution 2 — Adaptive rates ($v_t$): Track squared gradient magnitudes per parameter. Divide by $\sqrt{v_t}$ to give each parameter its own step size. $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$
Fix — Bias correction: Both $m_t$ and $v_t$ start at 0, making early estimates too low. Divide $m_t$ by $(1-\beta_1^t)$ and $v_t$ by $(1-\beta_2^t)$ to fix.
The update: $\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
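The whole recap fits in one function. A minimal single-parameter sketch, using the default hyperparameters quoted below and a made-up test loss $f(\theta) = \theta^2$:

```python
import math

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (sketch of the recap above)."""
    m = beta1 * m + (1 - beta1) * g       # Solution 1: momentum
    v = beta2 * v + (1 - beta2) * g ** 2  # Solution 2: adaptive scale
    m_hat = m / (1 - beta1 ** t)          # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # the update
    return theta, m, v

# Usage: a few steps on f(theta) = theta^2, whose gradient is 2*theta
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # should have moved from 1.0 toward the minimum at 0
```

Note the bias correction matters here: at $t=1$, each effective early step is close to the full $\alpha$ rather than being shrunk toward zero.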
Theory: Average regret $O(1/\sqrt{T})$ — matches the best possible rate for online convex optimization.
Practice: Works on convex AND non-convex problems. Default settings ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) rarely need tuning. Became the standard optimizer in deep learning.
Study Guide Complete!
15 pages covering everything from basic derivatives to the full Adam algorithm.
Go to the Index page for a linked overview of all pages.