09. Momentum (First Moment)
Weather forecast analogy. EMA formula with every symbol explained. Full hand computation (5 steps). Plot: raw noisy gradients vs smoothed EMA. Connection to BFGS.
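The hand computation above can be sketched in a few lines. This is a minimal illustration of the first-moment EMA, m_t = β₁·m_{t−1} + (1 − β₁)·g_t, with β₁ = 0.9 as in the paper; the five "noisy gradient" values are hypothetical, chosen only to show the smoothing.

```python
# First-moment EMA (momentum term of Adam): m_t = beta1*m_{t-1} + (1-beta1)*g_t.
beta1 = 0.9
m = 0.0  # m_0 = 0, as in the paper
gradients = [1.0, 1.2, 0.8, 1.1, 0.9]  # hypothetical noisy gradients
for t, g in enumerate(gradients, start=1):
    m = beta1 * m + (1 - beta1) * g
    print(f"step {t}: m = {m:.5f}")
```

Note how m drifts slowly toward the gradients' mean (~1.0) instead of jumping with each sample; the early values are pulled toward zero by the m_0 = 0 start, which is exactly what bias correction later fixes.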
10. Adaptive Learning Rates (Second Moment)
Mixing board analogy. v_t formula. Why squared gradients. Hand computation: two parameters with 50x different effective step sizes. Diagonal preconditioning connection.
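A minimal sketch of the second-moment idea: v_t = β₂·v_{t−1} + (1 − β₂)·g_t², and the effective step is α·g / √(v̂_t). The two constant gradients below are hypothetical values 50x apart; dividing by the root of the (bias-corrected) second moment equalizes their effective steps, which is the diagonal-preconditioning point.

```python
import math

# Second-moment EMA: v_t = beta2*v_{t-1} + (1-beta2)*g_t^2.
# Dividing by sqrt(v_hat) equalizes effective steps across parameters
# whose raw gradients differ by 50x.
beta2, alpha, eps = 0.999, 0.001, 1e-8
grads = [10.0, 0.2]      # hypothetical constant gradients, 50x apart
v = [0.0, 0.0]
steps = [0.0, 0.0]
for t in range(1, 4):
    for i, g in enumerate(grads):
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        v_hat = v[i] / (1 - beta2 ** t)          # bias-corrected second moment
        steps[i] = alpha * g / (math.sqrt(v_hat) + eps)
print(steps)  # both effective steps land near alpha = 0.001
```

For a constant gradient, v̂_t = g² exactly, so each parameter's step collapses to α regardless of gradient scale.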
11. Bias Correction
New city analogy. Why m_0=0 makes early steps wrong. The (1-β^t) fix. Hand computation showing perfect correction. Plot: biased vs corrected for both moments.
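The "perfect correction" computation can be verified directly. With a constant gradient g = 1 (a hypothetical value for illustration), the raw EMA gives m_t = 1 − β₁^t, so dividing by (1 − β₁^t) recovers the true moment exactly at every step:

```python
beta1 = 0.9
g = 1.0   # constant gradient, so the true first moment is exactly 1.0
m = 0.0   # m_0 = 0 biases early estimates toward zero
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # the (1 - beta^t) correction
    print(f"t={t}: biased m={m:.5f}, corrected m_hat={m_hat:.5f}")
```

The biased column crawls up from 0.1 toward 1.0, while the corrected column is 1.0 at every t; the same correction applies to v_t with β₂ in place of β₁.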
12. The Full Adam Algorithm
Color-coded pseudocode. SVG dataflow diagram. 3 complete hand-computed iterations with every intermediate value. Hyperparameter explanations. Trust region property.
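The pseudocode above assembles into one update rule. Here is a minimal scalar version of the paper's Algorithm 1, using its default hyperparameters (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the f(x) = x² usage example is hypothetical, just to show iterations:

```python
import math

def adam_step(theta, g, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical usage: minimize f(x) = x^2 (gradient 2x), starting at x = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Each step's magnitude is roughly bounded by α, which is the trust-region property: the parameter moves about 0.001 per iteration here no matter how large the gradient is.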
13. The Experiments: Adam vs Everyone
How to read the paper's plots. 4 experiments explained: logistic regression, neural nets, CNNs, bias correction ablation. Key results and what to say in the demo.
14. Convergence Theory, AdaMax, and the Big Picture
Regret framework (poker analogy). O(1/√T) bound. AdaMax (max instead of average). Timeline: GD → Newton → BFGS → AdaGrad → RMSProp → Adam. Full paper summary.
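A minimal sketch of the AdaMax variant: the second-moment EMA is replaced by a running max, u_t = max(β₂·u_{t−1}, |g_t|), and u needs no bias correction. The f(x) = x² setup is hypothetical; α = 0.002 is the paper's AdaMax default.

```python
# AdaMax: replace sqrt(v_hat) with u_t = max(beta2*u_{t-1}, |g_t|).
beta1, beta2, alpha = 0.9, 0.999, 0.002
theta, m, u = 1.0, 0.0, 0.0
for t in range(1, 4):
    g = 2 * theta                          # gradient of f(x) = x^2
    m = beta1 * m + (1 - beta1) * g        # first moment, as in Adam
    u = max(beta2 * u, abs(g))             # infinity-norm max, no correction
    theta -= (alpha / (1 - beta1 ** t)) * m / u
```

Swapping the average of squares for a max makes the denominator the (exponentially decayed) infinity norm of past gradients, which is the "max instead of average" idea in the outline.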