MATH 3850 ; Adam Paper, Section 6 ; What they tested, what the plots show, and why Adam wins
FIRST
How to Read the Paper's Experiment Plots
Every experiment plot in the paper has the same structure:
X-axis: "Iterations over entire dataset" = number of epochs (one epoch = one full pass through all 60,000 images)
Y-axis: "Training cost" = the loss (how wrong the predictions are). Lower is better.
Each colored line = a different optimizer (SGD, AdaGrad, RMSProp, Adam, etc.)
The line that drops fastest and lowest wins.
The paper runs the same task with different optimizers and plots their loss curves. Whichever optimizer reaches low loss in the fewest epochs is the winner.
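The comparison loop above can be sketched on a toy problem. This is a minimal stand-in, not the paper's actual setup: a one-parameter convex loss takes the place of a real training cost, and each loop iteration stands in for one epoch. The loss function, learning rates, and epoch count are all illustrative choices.

```python
import math

def loss(w):            # toy convex stand-in for "training cost"
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

def run_sgd(epochs=50, lr=0.1):
    w, curve = 0.0, []
    for _ in range(epochs):
        w -= lr * grad(w)          # plain gradient step
        curve.append(loss(w))
    return curve

def run_adam(epochs=50, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v, curve = 0.0, 0.0, 0.0, []
    for t in range(1, epochs + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # momentum (1st moment)
        v = b2 * v + (1 - b2) * g * g      # squared-gradient average (2nd moment)
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        curve.append(loss(w))
    return curve

# one loss curve per optimizer, as in the paper's plots
curves = {"SGD": run_sgd(), "Adam": run_adam()}
```

Plotting each list in `curves` against its index reproduces the structure of the paper's figures: epochs on the x-axis, training cost on the y-axis, one line per optimizer.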
EXPERIMENT 1
Section 6.1: Logistic Regression on MNIST
What they tested
Task: Classify handwritten digits (0-9) using logistic regression (Page 07).
Model: Logistic regression — 7,850 parameters. Convex problem (no local minima trap).
Why this experiment: It's convex, so we can cleanly compare optimizers without worrying about getting stuck in bad local minima. A fair, controlled test.
Optimizers compared: Adam, SGD with Nesterov momentum, AdaGrad.
Batch size: 128.
Result: Adam converges as fast as AdaGrad, both significantly faster than SGD.
Left (MNIST): Adam (yellow) and AdaGrad (green) both drop quickly. SGD (red) is noticeably slower. Adam matches AdaGrad's speed on this clean convex problem.
Right (IMDB): Same comparison but on sparse text data (movie reviews as bag-of-words). Sparse features are AdaGrad's strength — and Adam matches it here too, while beating SGD and RMSProp. Adam is competitive everywhere, not just on one type of problem.
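Why sparse features favor per-coordinate adaptivity can be shown in a few lines. This is a toy sketch of AdaGrad's mechanism (the same effect that Adam's per-coordinate v inherits): the gradient sequences are made up, with one "feature" firing every step and one firing rarely.

```python
import math

def adagrad_steps(grads, lr=0.1, eps=1e-8):
    """Per-coordinate AdaGrad: accumulate squared gradients, divide the step."""
    v, steps = 0.0, []
    for g in grads:
        v += g * g
        steps.append(lr * g / (math.sqrt(v) + eps))
    return steps

dense  = adagrad_steps([1.0] * 10)         # feature active every step
sparse = adagrad_steps([0.0] * 9 + [1.0])  # feature active once, on step 10
```

The rare feature's single real update is still full-sized (its accumulated v is small), while the dense feature's tenth update has already shrunk. That is why AdaGrad, and Adam, handle bag-of-words data well: rare but informative words still get large updates.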
EXPERIMENT 2
Section 6.2: Multi-layer Neural Networks (MNIST)
What they tested
Task: Same digit classification, but with a neural network (2 hidden layers, 1000 units each).
Key difference from Experiment 1: Neural networks are non-convex — theory doesn't guarantee convergence, but Adam works well in practice.
Two sub-experiments:
(a) With dropout (a technique that randomly disables parts of the network during training to prevent overfitting — adds even more noise)
(b) With deterministic loss (no dropout, cleaner signal)
Result: Adam outperforms AdaGrad, RMSProp, AdaDelta, and SGD on both settings.
Why does this matter?
This is the real test. Logistic regression (Experiment 1) was convex — the theory says Adam should work. Neural networks are non-convex — theory doesn't apply. The fact that Adam still wins shows it's practically robust, not just theoretically sound.
Also: dropout adds extra stochasticity beyond minibatch noise. Adam's momentum smoothing handles both sources of noise effectively.
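The smoothing effect is easy to demonstrate numerically. This is a toy sketch, not the paper's experiment: a constant "true" gradient is corrupted with Gaussian noise (standing in for minibatch plus dropout noise), and Adam's bias-corrected first-moment estimate is compared against the raw samples.

```python
import random
import statistics

random.seed(1)
true_grad = 1.0
# noisy observations of the gradient (toy noise level)
noisy = [true_grad + random.gauss(0, 2.0) for _ in range(500)]

b1, m, smoothed = 0.9, 0.0, []
for t, g in enumerate(noisy, start=1):
    m = b1 * m + (1 - b1) * g          # exponential moving average
    smoothed.append(m / (1 - b1 ** t)) # bias-corrected estimate

raw_spread = statistics.pstdev(noisy)
ema_spread = statistics.pstdev(smoothed[50:])  # skip warm-up steps
```

The EMA's spread is a fraction of the raw gradient noise (for i.i.d. noise it shrinks by roughly sqrt((1-b1)/(1+b1))), which is exactly the noise reduction momentum contributes inside Adam.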
EXPERIMENT 3
Section 6.3: Convolutional Neural Networks (CIFAR-10)
What they tested
Task: Classify small color images (airplanes, cars, birds, etc.) — 10 classes.
Dataset: CIFAR-10 — 50,000 images, 32×32 pixels, 3 color channels.
Model: CNN — a more complex architecture with convolution layers (filters that detect patterns like edges and shapes).
Key finding: Adam and SGD converge similarly in the long run, but Adam converges faster initially. AdaGrad converges much slower on CNNs.
Result: Adam adapts learning rates per-layer automatically (no manual layer-specific tuning needed), which is a practical advantage over SGD.
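The per-layer claim can be illustrated with a toy calculation. The gradient magnitudes below are made up: one sequence mimics a layer with large gradients, the other a layer with tiny ones. Because Adam divides by the root of the squared-gradient average, both end up taking steps of about the same size.

```python
import math

def adam_effective_step(grads, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam's moment updates over a gradient history; return the last step."""
    m = v = 0.0
    t = 0
    for g in grads:
        t += 1
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

big   = adam_effective_step([100.0] * 50)   # layer with large gradients (toy)
small = adam_effective_step([0.001] * 50)   # layer with tiny gradients (toy)
```

Despite a 100,000x gap in gradient scale, both effective steps come out near the base learning rate, so no hand-tuned per-layer learning rates are needed.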
EXPERIMENT 4
Section 6.4: Bias Correction Ablation
What they tested
Question: Does bias correction (Page 11) actually matter, or can we skip it?
Setup: Train a variational autoencoder (VAE) with and without bias correction, across many settings of $\alpha$, $\beta_1$, and $\beta_2$.
Key finding: Without bias correction, high $\beta_2$ values (close to 1) cause instabilities — the loss jumps around or fails to converge, especially in the first few epochs.
Result: Bias correction is essential. Without it, Adam behaves like a momentum variant of RMSProp (which also lacks correction) and performs worse across the board.
Left (10 epochs): With bias correction (red), the loss is lower than without correction (green) across all step sizes. The gap is significant early in training.
Right (100 epochs): Gap narrows over time (bias correction matters less as $t$ grows), but the corrected version is still consistently better. This is the paper's main evidence that bias correction is worth including.
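The instability at high β2 can be reproduced with one arithmetic check. This is a sketch with a made-up first gradient value: since v starts at 0, after one step the uncorrected squared-gradient average underestimates g² by a factor of (1 - β2), which inflates the first step by about sqrt(1/(1-β2)).

```python
import math

lr, b2, g = 0.001, 0.999, 0.5     # defaults; g is a toy first gradient
v = (1 - b2) * g * g              # EMA after one step, started from v = 0

step_uncorrected = lr * g / math.sqrt(v)
v_hat = v / (1 - b2 ** 1)         # bias correction at t = 1
step_corrected = lr * g / math.sqrt(v_hat)

ratio = step_uncorrected / step_corrected
# uncorrected first step is ~sqrt(1/(1 - b2)) ~= 31.6x too large
```

A first step 31x larger than intended is exactly the early-epoch jumpiness the ablation plots show; the corrected step equals the intended learning rate.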
SUMMARY
The Complete Picture from All Experiments
What the experiments prove:
On convex problems (logistic regression): Adam matches AdaGrad, both beat SGD
On non-convex problems (neural networks): Adam beats all competitors
On sparse data (IMDB text): Adam handles sparsity as well as AdaGrad (which was designed for it)
With extra noise (dropout): Adam's momentum smooths it out, performs best
Bias correction matters: removing it degrades performance, especially early on
Default hyperparameters work: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ performed well across all experiments
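The update these defaults feed into can be written out in full. This is a minimal sketch of Adam's per-parameter update following Algorithm 1 of the paper, using the default hyperparameters quoted above; the function name and plain-list state layout are my own choices.

```python
import math

def adam_step(params, grads, state, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over a list of parameters. state = (m, v) lists."""
    m, v = state
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g          # 1st moment estimate
        v[i] = b2 * v[i] + (1 - b2) * g * g      # 2nd moment estimate
        m_hat = m[i] / (1 - b1 ** t)             # bias corrections
        v_hat = v[i] / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params, (m, v)

# usage: minimize f(w) = w^2 with the default settings
w, state = [1.0], ([0.0], [0.0])
for t in range(1, 2001):
    w, state = adam_step(w, [2 * w[0]], state, t)
```

Note that `t` starts at 1, not 0; starting at 0 would divide by zero in the bias-correction terms.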
For your demo: the key takeaway
"Adam combines the best of both worlds: AdaGrad's ability to handle sparse gradients and different parameter scales, and momentum's ability to smooth out stochastic noise. It works well out of the box with default settings, across convex and non-convex problems, with and without additional noise like dropout."
Glossary ; Experiments
Training cost / loss
The y-axis on all experiment plots. How wrong the model's predictions are. Lower = better.
Epoch
One complete pass through the training data. The x-axis on the plots.
Dropout
Randomly disabling parts of a neural network during training. Prevents overfitting but adds noise.
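A sketch of one common formulation ("inverted" dropout, my choice for illustration; the paper does not specify this variant): each unit is zeroed with probability p during training, and survivors are rescaled so the expected activation is unchanged.

```python
import random

def dropout(activations, p=0.5):
    """Zero each unit with probability p; scale survivors by 1/(1-p)."""
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

random.seed(0)
out = dropout([1.0] * 8)  # each entry is now either 0.0 or 2.0
```

The random mask changes every step, which is the extra stochasticity Experiment 2 refers to.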
CNN (Convolutional Neural Network)
A neural network that uses filters to detect patterns (edges, shapes) in images. More complex than logistic regression.
Ablation study
Removing one component (e.g., bias correction) to see how much it matters. "If I remove this, does performance drop?"
AdaGrad
Adaptive gradient method. Good for sparse features. But accumulates squared gradients forever, so step size shrinks to zero over time. Adam fixes this with the decaying average ($\beta_2$).
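The shrinking-step problem and Adam's fix can be checked numerically. A toy sketch with a constant unit gradient: AdaGrad's accumulated sum grows without bound, while Adam's decaying average settles near the recent squared-gradient level.

```python
import math

b2, eps = 0.999, 1e-8
v_ada = v_adam = 0.0
T = 10_000
for t in range(1, T + 1):
    g = 1.0                                   # constant unit gradient (toy)
    v_ada += g * g                            # AdaGrad: sum grows forever
    v_adam = b2 * v_adam + (1 - b2) * g * g   # Adam: decaying average

step_ada  = 1.0 / (math.sqrt(v_ada) + eps)                    # shrinks toward 0
step_adam = 1.0 / (math.sqrt(v_adam / (1 - b2 ** T)) + eps)   # stays near 1
```

After 10,000 identical gradients AdaGrad's step has shrunk 100-fold, while Adam's stays at full size; this is why AdaGrad lagged on the longer CNN runs.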
RMSProp
Like Adam but without momentum and without bias correction. Adam = RMSProp + momentum + bias correction.