MATH 3850: Adam Optimizer Study Guide

15 interactive pages covering foundations through the full paper — built for the April 2 demo

Paper: Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015)

Course: MATH 3850 — Numerical Optimization — University of Lethbridge

Pages with gold borders are the core Adam pages — prioritize these for the demo.

MODULE 1: Math Foundations
Page 00: Derivatives, Partial Derivatives, and Gradients
Speedometer analogy. Partial derivatives = freeze other variables. Gradient = list of slopes. Hand computation referencing A2. Vocabulary translation (x → θ, step size → learning rate).
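
A taste of the kind of hand computation this page walks through (the function below is illustrative, not necessarily the one used on the page):

$$
f(\theta_1, \theta_2) = \theta_1^2 + 3\theta_2, \qquad
\frac{\partial f}{\partial \theta_1} = 2\theta_1, \quad
\frac{\partial f}{\partial \theta_2} = 3, \qquad
\nabla f(1, 4) = (2,\, 3).
$$
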
Page 01: Vectors, Norms, and Elementwise Operations
L1/L2/L∞ norms with plots. Elementwise multiply, square, sqrt, divide — critical for reading Adam's formulas. Preview of where these show up in Adam.
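
A minimal NumPy sketch of the operations this page covers, as they later show up in Adam's update (the numbers are made up for illustration):

```python
import numpy as np

g = np.array([0.5, -2.0, 0.1])        # a made-up gradient vector

print(np.abs(g).sum())                 # L1 norm: sum of absolute values -> 2.6
print(np.sqrt((g**2).sum()))           # L2 norm: square root of sum of squares -> ~2.064
print(np.abs(g).max())                 # L-infinity norm: largest absolute value -> 2.0

g_sq = g * g                           # elementwise square, like g_t^2 in the paper
step = g / (np.sqrt(g_sq) + 1e-8)      # elementwise sqrt and divide, the shape of Adam's update
print(step)                            # roughly [1, -1, 1]: dividing by sqrt(g^2) cancels the magnitudes
```
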
Page 02: Convexity: What Makes Problems "Nice"
Bowl vs bumpy landscape. Second derivative test. 3D plots. Why Adam's theory assumes convex but practice is non-convex.
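
The one-dimensional second derivative test in one line, with illustrative examples (not necessarily the ones on the page):

$$
f''(x) \ge 0 \ \text{for all } x \;\Longrightarrow\; f \text{ is convex.} \qquad
f(x) = x^2 \Rightarrow f''(x) = 2 > 0 \ \text{(a bowl)}; \quad
f(x) = x^4 - x^2 \Rightarrow f''(0) = -2 \ \text{(bumpy)}.
$$
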
MODULE 2: Course Concepts That Connect to Adam
Page 03: Gradient Descent: The Full Picture
Step sizes (too big/small/right). Convergence rates. Condition number κ and the bathtub problem. Hand computation referencing A1. The fundamental limitation: one α for everything.
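
A minimal sketch of the limitation this page builds up to: one global step size α, capped by the steepest direction (the quadratic below is illustrative):

```python
import numpy as np

# Ill-conditioned "bathtub": f(theta) = theta_1^2 + 10 * theta_2^2, so kappa = 10.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])

theta = np.array([5.0, 5.0])
alpha = 0.05                              # must stay below 2/20, dictated by the STEEP direction
for _ in range(100):
    theta = theta - alpha * grad(theta)   # the same alpha for every coordinate

print(theta)                              # theta_2 converged almost instantly; theta_1 crawls
```
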
Page 04: Newton's Method and BFGS
Hessian = curvature. Newton step = jump to model minimum. BFGS = approximate Newton from gradient history. Hand computation mirroring Quiz 4 and A2. Bridge: BFGS → Adam.
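
The Newton step in one line, for reference while reading this page:

$$
\theta_{t+1} = \theta_t - H_f(\theta_t)^{-1} \, \nabla f(\theta_t),
$$

where $H_f$ is the Hessian. BFGS builds an approximation to $H_f^{-1}$ from successive gradient differences instead of computing it directly, which is the bridge to Adam's much cheaper per-parameter rescaling.
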
Page 05: Step Size Selection: Line Search to Adaptive Methods
Exact line search formula. Armijo backtracking. Hand computation. Why even line search still produces only ONE α shared by every parameter — Adam's per-parameter adaptive rates fix this.
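
A minimal sketch of Armijo backtracking as described on this page (the parameter values are common textbook defaults, not necessarily the page's):

```python
import numpy as np

def armijo_backtracking(f, grad_f, theta, direction, alpha=1.0, c=1e-4, rho=0.5):
    """Shrink alpha until f decreases 'enough' along the search direction (Armijo condition)."""
    fx = f(theta)
    slope = grad_f(theta) @ direction               # directional derivative; should be negative
    while f(theta + alpha * direction) > fx + c * alpha * slope:
        alpha *= rho                                # backtrack: shrink the step and try again
    return alpha

# Usage: steepest-descent direction on a simple quadratic
f = lambda th: th @ th
grad_f = lambda th: 2 * th
theta = np.array([3.0, -1.0])
print(armijo_backtracking(f, grad_f, theta, -grad_f(theta)))   # still ONE alpha, shared by all coordinates
```
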
MODULE 3: Bridge to Machine Learning
Page 06: Stochastic Gradient Descent
Full data vs minibatch. Why stochastic (460x faster). Contour plot comparison. Hand computation showing unbiased estimate. Paper notation ($g_t$, $f_t$).
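
A small NumPy sketch of the unbiasedness point (the synthetic data and batch size below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 synthetic examples, 3 features
y = rng.normal(size=1000)

def full_gradient(w):
    """Gradient of mean squared error over ALL examples (exact but slow)."""
    return 2 * X.T @ (X @ w - y) / len(y)

def minibatch_gradient(w, batch=32):
    """Gradient over a random minibatch: noisy, cheap, and unbiased on average."""
    idx = rng.choice(len(y), size=batch, replace=False)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch

w = np.zeros(3)
print(full_gradient(w))
print(np.mean([minibatch_gradient(w) for _ in range(5000)], axis=0))  # averages out close to the full gradient
```
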
Page 07: Machine Learning Basics: What Are We Optimizing?
Prediction machine analogy. Loss functions. MNIST (60K digit images). Logistic regression = weighted votes. Cross-entropy loss. Tiny hand computation.
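
The flavour of this page's tiny hand computation, done in code with the binary case for simplicity (the page's MNIST example is the 10-class version; the probabilities below are made up):

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Loss for one example: p = predicted probability of class 1, y = true label (0 or 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(0.9, 1))   # confident and right -> small loss (~0.105)
print(binary_cross_entropy(0.1, 1))   # confident and wrong -> large loss (~2.303)
```
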
Page 08: The Two Problems Adam Solves
Problem 1: zigzag (noisy gradients). Problem 2: one-size-fits-all step size. SGD vs Adam contour comparison. Dataflow diagram of Adam's structure.
MODULE 4: The Adam Paper
Page 09: Momentum (First Moment)
Weather forecast analogy. EMA formula with every symbol explained. Full hand computation (5 steps). Plot: raw noisy gradients vs smoothed EMA. Connection to BFGS.
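
A five-step EMA computation in code, mirroring the kind of hand computation on this page (the gradient readings are made-up numbers):

```python
beta1 = 0.9                                                   # the paper's default for the first moment
m = 0.0                                                       # m_0 = 0, as in the paper
for t, g in enumerate([1.0, 1.4, 0.7, 1.2, 0.9], start=1):    # five noisy gradient readings
    m = beta1 * m + (1 - beta1) * g                           # keep 90% of history, add 10% of the new gradient
    print(t, round(m, 4))                                     # smooths toward ~1, but starts far too small (bias!)
```
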
Page 10: Adaptive Learning Rates (Second Moment)
Mixing board analogy. v_t formula. Why squared gradients. Hand computation: two parameters with 50x different effective step sizes. Diagonal preconditioning connection.
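
A minimal sketch of the two-parameter computation this page does by hand: gradients that differ by 50x end up producing updates of roughly the same size (the numbers are illustrative):

```python
import numpy as np

alpha, beta2, eps = 0.001, 0.999, 1e-8       # the paper's defaults
g = np.array([1.0, 0.02])                    # parameter 1 sees gradients 50x larger than parameter 2
v = np.zeros(2)
for t in range(1, 6):
    v = beta2 * v + (1 - beta2) * g**2       # EMA of SQUARED gradients, elementwise
    v_hat = v / (1 - beta2**t)               # bias correction (Page 11 explains why)
    update = alpha * g / (np.sqrt(v_hat) + eps)
    print(t, update)                          # both entries are ~0.001: each parameter gets its own effective rate
```
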
Page 11: Bias Correction
New city analogy. Why m_0 = 0 biases the early estimates toward zero. The (1-β^t) fix. Hand computation showing perfect correction. Plot: biased vs corrected for both moments.
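
The "perfect correction" check in code: with a constant gradient, dividing by (1-β^t) recovers the true value exactly from step one (the gradient value 2.0 is arbitrary):

```python
beta1 = 0.9
m = 0.0                                     # m_0 = 0 is what causes the bias
for t in range(1, 6):
    g = 2.0                                 # pretend the true gradient is constant at 2.0
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)              # the (1 - beta^t) fix
    print(t, round(m, 3), round(m_hat, 3))  # m creeps up from 0.2; m_hat is exactly 2.0 at every step
```
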
Page 12: The Full Adam Algorithm
Color-coded pseudocode. SVG dataflow diagram. 3 complete hand-computed iterations with every intermediate value. Hyperparameter explanations. Trust region property.
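
One possible NumPy transcription of the pseudocode on this page, useful for cross-checking the hand-computed iterations (the test function and the larger alpha are made up for illustration; the other hyperparameters are the paper's defaults):

```python
import numpy as np

def adam(grad, theta0, steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                          # first moment: EMA of gradients
    v = np.zeros_like(theta)                          # second moment: EMA of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)                    # bias-corrected first moment
        v_hat = v / (1 - beta2**t)                    # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # each step is at most ~alpha per coordinate
    return theta

# Same ill-conditioned bowl as the gradient descent sketch; should land near the minimum at (0, 0)
print(adam(lambda th: np.array([2 * th[0], 20 * th[1]]), [5.0, 5.0], steps=500, alpha=0.1))
```
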
Page 13: The Experiments: Adam vs Everyone
How to read the paper's plots. 4 experiments explained: logistic regression, neural nets, CNNs, bias correction ablation. Key results and what to say in the demo.
Page 14: Convergence Theory, AdaMax, and the Big Picture
Regret framework (poker analogy). O(1/√T) bound. AdaMax (max instead of average). Timeline: GD → Newton → BFGS → AdaGrad → RMSProp → Adam. Full paper summary.
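
For reference, one common way to write the AdaMax update that this page describes, with m_t being Adam's usual first moment and a running max u_t replacing the squared-gradient average:

$$
u_t = \max\bigl(\beta_2 \, u_{t-1}, \; |g_t|\bigr), \qquad
\theta_t = \theta_{t-1} - \frac{\alpha}{1 - \beta_1^{\,t}} \cdot \frac{m_t}{u_t}.
$$
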

Good luck with the demo!

Start with Pages 08 → 09 → 10 → 11 → 12 for the core Adam content.

Use Page 13 for the experiments section of your presentation.