MATH 3850: Adam Optimizer Study Guide

15 interactive pages covering foundations through the full paper — built for the April 2 demo

Paper: Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015)

Course: MATH 3850 — Numerical Optimization — University of Lethbridge

Pages with gold borders are the core Adam pages — prioritize these for the demo.

MODULE 1: Math Foundations
Page 00: Derivatives, Partial Derivatives, and Gradients
Speedometer analogy. Partial derivatives = freeze other variables. Gradient = list of slopes. Hand computation referencing A2. Vocabulary translation (x → θ, step size → learning rate).
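
A taste of the kind of hand computation this page walks through (the function below is illustrative, not necessarily the one used on the page):

$$
f(\theta_1, \theta_2) = \theta_1^2 + 3\theta_2, \qquad
\frac{\partial f}{\partial \theta_1} = 2\theta_1, \quad
\frac{\partial f}{\partial \theta_2} = 3, \qquad
\nabla f(1, 4) = (2,\, 3).
$$
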
Page 01: Vectors, Norms, and Elementwise Operations
L1/L2/L∞ norms with plots. Elementwise multiply, square, sqrt, divide — critical for reading Adam's formulas. Preview of where these show up in Adam.
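
A minimal NumPy sketch of the operations this page covers, as they later show up in Adam's update (the numbers are made up for illustration):

```python
import numpy as np

g = np.array([0.5, -2.0, 0.1])        # a made-up gradient vector

print(np.abs(g).sum())                 # L1 norm: sum of absolute values -> 2.6
print(np.sqrt((g**2).sum()))           # L2 norm: square root of sum of squares -> ~2.064
print(np.abs(g).max())                 # L-infinity norm: largest absolute value -> 2.0

g_sq = g * g                           # elementwise square, like g_t^2 in the paper
step = g / (np.sqrt(g_sq) + 1e-8)      # elementwise sqrt and divide, the shape of Adam's update
print(step)                            # roughly [1, -1, 1]: dividing by sqrt(g^2) cancels the magnitudes
```
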
Page 02: Convexity: What Makes Problems "Nice"
Bowl vs bumpy landscape. Second derivative test. 3D plots. Why Adam's theory assumes convex but practice is non-convex.
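
The one-dimensional second derivative test in one line, with illustrative examples (not necessarily the ones on the page):

$$
f''(x) \ge 0 \ \text{for all } x \;\Longrightarrow\; f \text{ is convex.} \qquad
f(x) = x^2 \Rightarrow f''(x) = 2 > 0 \ \text{(a bowl)}; \quad
f(x) = x^4 - x^2 \Rightarrow f''(0) = -2 \ \text{(bumpy)}.
$$
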
MODULE 2: Course Concepts That Connect to Adam
Page 03: Gradient Descent: The Full Picture
Step sizes (too big/small/right). Convergence rates. Condition number κ and the bathtub problem. Hand computation referencing A1. The fundamental limitation: one α for everything.
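
A minimal sketch of the limitation this page builds up to: one global step size α, capped by the steepest direction (the quadratic below is illustrative):

```python
import numpy as np

# Ill-conditioned "bathtub": f(theta) = theta_1^2 + 10 * theta_2^2, so kappa = 10.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])

theta = np.array([5.0, 5.0])
alpha = 0.05                              # must stay below 2/20, dictated by the STEEP direction
for _ in range(100):
    theta = theta - alpha * grad(theta)   # the same alpha for every coordinate

print(theta)                              # theta_2 converged almost instantly; theta_1 crawls
```
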
Page 04: Newton's Method and BFGS
Hessian = curvature. Newton step = jump to model minimum. BFGS = approximate Newton from gradient history. Hand computation mirroring Quiz 4 and A2. Bridge: BFGS → Adam.
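
The Newton step in one line, for reference while reading this page:

$$
\theta_{t+1} = \theta_t - H_f(\theta_t)^{-1} \, \nabla f(\theta_t),
$$

where $H_f$ is the Hessian. BFGS builds an approximation to $H_f^{-1}$ from successive gradient differences instead of computing it directly, which is the bridge to Adam's much cheaper per-parameter rescaling.
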
Page 05: Step Size Selection: Line Search to Adaptive Methods
Exact line search formula. Armijo backtracking. Hand computation. Why even line search still produces only ONE α shared by every parameter — Adam's per-parameter adaptive rates fix this.
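
A minimal sketch of Armijo backtracking as described on this page (the parameter values are common textbook defaults, not necessarily the page's):

```python
import numpy as np

def armijo_backtracking(f, grad_f, theta, direction, alpha=1.0, c=1e-4, rho=0.5):
    """Shrink alpha until f decreases 'enough' along the search direction (Armijo condition)."""
    fx = f(theta)
    slope = grad_f(theta) @ direction               # directional derivative; should be negative
    while f(theta + alpha * direction) > fx + c * alpha * slope:
        alpha *= rho                                # backtrack: shrink the step and try again
    return alpha

# Usage: steepest-descent direction on a simple quadratic
f = lambda th: th @ th
grad_f = lambda th: 2 * th
theta = np.array([3.0, -1.0])
print(armijo_backtracking(f, grad_f, theta, -grad_f(theta)))   # still ONE alpha, shared by all coordinates
```
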
MODULE 3: Bridge to Machine Learning
Page 06: Stochastic Gradient Descent
Full data vs minibatch. Why stochastic (460x faster). Contour plot comparison. Hand computation showing unbiased estimate. Paper notation ($g_t$, $f_t$).
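
A small NumPy sketch of the unbiasedness point (the synthetic data and batch size below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 synthetic examples, 3 features
y = rng.normal(size=1000)

def full_gradient(w):
    """Gradient of mean squared error over ALL examples (exact but slow)."""
    return 2 * X.T @ (X @ w - y) / len(y)

def minibatch_gradient(w, batch=32):
    """Gradient over a random minibatch: noisy, cheap, and unbiased on average."""
    idx = rng.choice(len(y), size=batch, replace=False)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch

w = np.zeros(3)
print(full_gradient(w))
print(np.mean([minibatch_gradient(w) for _ in range(5000)], axis=0))  # averages out close to the full gradient
```
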
Page 07: Machine Learning Basics: What Are We Optimizing?
Prediction machine analogy. Loss functions. MNIST (60K digit images). Logistic regression = weighted votes. Cross-entropy loss. Tiny hand computation.
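
The flavour of this page's tiny hand computation, done in code with the binary case for simplicity (the page's MNIST example is the 10-class version; the probabilities below are made up):

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Loss for one example: p = predicted probability of class 1, y = true label (0 or 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(0.9, 1))   # confident and right -> small loss (~0.105)
print(binary_cross_entropy(0.1, 1))   # confident and wrong -> large loss (~2.303)
```
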
Page 08: The Two Problems Adam Solves
Problem 1: zigzag (noisy gradients). Problem 2: one-size-fits-all step size. SGD vs Adam contour comparison. Dataflow diagram of Adam's structure.
MODULE 4: The Adam Paper
Page 09: Momentum (First Moment)
Weather forecast analogy. EMA formula with every symbol explained. Full hand computation (5 steps). Plot: raw noisy gradients vs smoothed EMA. Connection to BFGS.
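
A five-step EMA computation in code, mirroring the kind of hand computation on this page (the gradient readings are made-up numbers):

```python
beta1 = 0.9                                                   # the paper's default for the first moment
m = 0.0                                                       # m_0 = 0, as in the paper
for t, g in enumerate([1.0, 1.4, 0.7, 1.2, 0.9], start=1):    # five noisy gradient readings
    m = beta1 * m + (1 - beta1) * g                           # keep 90% of history, add 10% of the new gradient
    print(t, round(m, 4))                                     # smooths toward ~1, but starts far too small (bias!)
```
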
Page 10: Adaptive Learning Rates (Second Moment)
Mixing board analogy. v_t formula. Why squared gradients. Hand computation: two parameters with 50x different effective step sizes. Diagonal preconditioning connection.
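
A minimal sketch of the two-parameter computation this page does by hand: gradients that differ by 50x end up producing updates of roughly the same size (the numbers are illustrative):

```python
import numpy as np

alpha, beta2, eps = 0.001, 0.999, 1e-8       # the paper's defaults
g = np.array([1.0, 0.02])                    # parameter 1 sees gradients 50x larger than parameter 2
v = np.zeros(2)
for t in range(1, 6):
    v = beta2 * v + (1 - beta2) * g**2       # EMA of SQUARED gradients, elementwise
    v_hat = v / (1 - beta2**t)               # bias correction (Page 11 explains why)
    update = alpha * g / (np.sqrt(v_hat) + eps)
    print(t, update)                          # both entries are ~0.001: each parameter gets its own effective rate
```
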
Page 11: Bias Correction
New city analogy. Why m_0 = 0 biases the early estimates toward zero. The (1-β^t) fix. Hand computation showing perfect correction. Plot: biased vs corrected for both moments.
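
The "perfect correction" check in code: with a constant gradient, dividing by (1-β^t) recovers the true value exactly from step one (the gradient value 2.0 is arbitrary):

```python
beta1 = 0.9
m = 0.0                                     # m_0 = 0 is what causes the bias
for t in range(1, 6):
    g = 2.0                                 # pretend the true gradient is constant at 2.0
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)              # the (1 - beta^t) fix
    print(t, round(m, 3), round(m_hat, 3))  # m creeps up from 0.2; m_hat is exactly 2.0 at every step
```
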
Page 12: The Full Adam Algorithm
Color-coded pseudocode. SVG dataflow diagram. 3 complete hand-computed iterations with every intermediate value. Hyperparameter explanations. Trust region property.
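
One possible NumPy transcription of the pseudocode on this page, useful for cross-checking the hand-computed iterations (the test function and the larger alpha are made up for illustration; the other hyperparameters are the paper's defaults):

```python
import numpy as np

def adam(grad, theta0, steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                          # first moment: EMA of gradients
    v = np.zeros_like(theta)                          # second moment: EMA of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)                    # bias-corrected first moment
        v_hat = v / (1 - beta2**t)                    # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # each step is at most ~alpha per coordinate
    return theta

# Same ill-conditioned bowl as the gradient descent sketch; should land near the minimum at (0, 0)
print(adam(lambda th: np.array([2 * th[0], 20 * th[1]]), [5.0, 5.0], steps=500, alpha=0.1))
```
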
Page 13: The Experiments: Adam vs Everyone
How to read the paper's plots. 4 experiments explained: logistic regression, neural nets, CNNs, bias correction ablation. Key results and what to say in the demo.
Page 14: Convergence Theory, AdaMax, and the Big Picture
Regret framework (poker analogy). O(1/√T) bound. AdaMax (max instead of average). Timeline: GD → Newton → BFGS → AdaGrad → RMSProp → Adam. Full paper summary.
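
For reference, one common way to write the AdaMax update that this page describes, with m_t being Adam's usual first moment and a running max u_t replacing the squared-gradient average:

$$
u_t = \max\bigl(\beta_2 \, u_{t-1}, \; |g_t|\bigr), \qquad
\theta_t = \theta_{t-1} - \frac{\alpha}{1 - \beta_1^{\,t}} \cdot \frac{m_t}{u_t}.
$$
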

Good luck with the demo!

Start with Pages 08 → 09 → 10 → 11 → 12 for the core Adam content.

Use Page 13 for the experiments section of your presentation.