Page 13: The Experiments

MATH 3850 ; Adam Paper, Section 6 ; What they tested, what the plots show, and why Adam wins

FIRST

How to Read the Paper's Experiment Plots

Every experiment plot in the paper has the same structure: training cost on the y-axis, epochs on the x-axis, one curve per optimizer.

The paper runs the same task with different optimizers and plots their loss curves. Whichever optimizer reaches low loss in the fewest epochs is the winner.

EXPERIMENT 1

Section 6.1: Logistic Regression on MNIST

What they tested

Task: Classify handwritten digits (0-9) using logistic regression (Page 07).

Dataset: MNIST — 60,000 images, 784 pixels each, 10 classes.

Model: Logistic regression — 7,850 parameters. Convex problem (no local minima trap).

Why this experiment: It's convex, so we can cleanly compare optimizers without worrying about getting stuck in bad local minima. A fair, controlled test.

Optimizers compared: Adam, SGD with Nesterov momentum, AdaGrad.

Batch size: 128.

Result: Adam converges as fast as AdaGrad, both significantly faster than SGD.

Logistic regression experiments
Left (MNIST): Adam (yellow) and AdaGrad (green) both drop quickly. SGD (red) is noticeably slower. Adam matches AdaGrad's speed on this clean convex problem.

Right (IMDB): Same comparison but on sparse text data (movie reviews as bag-of-words). Sparse features are AdaGrad's strength — and Adam matches it here too, while beating SGD and RMSProp. Adam is competitive everywhere, not just on one type of problem.
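To make the comparison concrete, here is a minimal sketch of the same kind of race on synthetic data (not the paper's MNIST setup; the data, learning rates, and step counts are all illustrative assumptions): the same logistic-regression objective trained once with plain SGD and once with Adam.

```python
import numpy as np

# Toy version of the Section 6.1 race (synthetic data, NOT the paper's MNIST
# setup; learning rates and step counts are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # 200 examples, 5 features
true_w = rng.normal(size=5)
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

def loss_and_grad(w):
    """Binary cross-entropy loss and gradient for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))           # sigmoid predictions
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def train_sgd(steps=1000, lr=0.1):
    w = np.zeros(5)
    for _ in range(steps):
        w -= lr * loss_and_grad(w)[1]
    return loss_and_grad(w)[0]

def train_adam(steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = np.zeros(5), np.zeros(5), np.zeros(5)
    for t in range(1, steps + 1):
        g = loss_and_grad(w)[1]
        m = b1 * m + (1 - b1) * g                # first moment (momentum)
        v = b2 * v + (1 - b2) * g**2             # second moment (RMSProp-style)
        m_hat = m / (1 - b1**t)                  # bias correction
        v_hat = v / (1 - b2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return loss_and_grad(w)[0]

# Chance level is log 2 ≈ 0.693; both optimizers should end far below it.
print("SGD final loss: ", train_sgd())
print("Adam final loss:", train_adam())
```

This only mirrors the shape of the experiment (same model, different optimizers, compare final loss), not its scale: the paper's curves come from 60,000 real images, not 200 synthetic points.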

EXPERIMENT 2

Section 6.2: Multi-layer Neural Networks (MNIST)

What they tested

Task: Same digit classification, but with a neural network (2 hidden layers, 1000 units each).

Key difference from Experiment 1: Neural networks are non-convex — theory doesn't guarantee convergence, but Adam works well in practice.

Two sub-experiments: one with a standard deterministic cost function (no dropout), and one with dropout regularization, which makes the objective stochastic.

Result: Adam outperforms AdaGrad, RMSProp, AdaDelta, and SGD on both settings.

Why does this matter?

This is the real test. Logistic regression (Experiment 1) was convex — the theory says Adam should work. Neural networks are non-convex — theory doesn't apply. The fact that Adam still wins shows it's practically robust, not just theoretically sound.

Also: dropout adds extra stochasticity beyond minibatch noise. Adam's momentum smoothing handles both sources of noise effectively.
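The smoothing claim is easy to check numerically. A minimal sketch (synthetic noise standing in for minibatch and dropout noise; the signal and noise scale are assumptions): an exponential moving average of noisy gradient samples tracks the true gradient with far less variance than the raw samples.

```python
import numpy as np

# Sketch: momentum as noise smoothing. Raw "gradients" are a constant signal
# plus noise (standing in for minibatch + dropout noise); the exponential
# moving average used by Adam's first moment tracks the signal with much
# lower variance. beta1 = 0.9 as in Adam's defaults.
rng = np.random.default_rng(1)
true_grad = 1.0
noisy = true_grad + rng.normal(scale=2.0, size=5000)

beta1 = 0.9
m = 0.0
smoothed = []
for g in noisy:
    m = beta1 * m + (1 - beta1) * g
    smoothed.append(m)
smoothed = np.array(smoothed)

# Discard warm-up steps, then compare scatter around the true gradient.
raw_var = noisy[1000:].var()
ema_var = smoothed[1000:].var()
print(raw_var, ema_var)  # steady-state EMA variance is roughly (1-b1)/(1+b1) of raw
```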

EXPERIMENT 3

Section 6.3: Convolutional Neural Networks (CIFAR-10)

What they tested

Task: Classify small color images (airplanes, cars, birds, etc.) — 10 classes.

Dataset: CIFAR-10 — 50,000 images, 32×32 pixels, 3 color channels.

Model: CNN — a more complex architecture with convolution layers (filters that detect patterns like edges and shapes).

Key finding: Adam and SGD converge similarly in the long run, but Adam converges faster initially. AdaGrad converges much slower on CNNs.

Result: Adam adapts the learning rate of every parameter automatically (no manual layer-specific tuning needed), which is a practical advantage over SGD.
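The automatic adaptation is easy to see in isolation. In this sketch (toy gradients; all values are assumptions), two parameters receive gradients four orders of magnitude apart: Adam's division by $\sqrt{\hat{v}}$ gives both roughly the same step size $\alpha$, while SGD's steps inherit the gradient scales.

```python
import numpy as np

# Sketch of per-parameter adaptation (illustrative toy gradients, not the
# paper's CNN): feed two parameters constant gradients on very different
# scales and compare the resulting update sizes.
alpha, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)
g = np.array([100.0, 0.01])      # gradients 4 orders of magnitude apart

for t in range(1, 201):          # repeat the same gradients for 200 steps
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
m_hat = m / (1 - b1**200)        # bias-corrected moments
v_hat = v / (1 - b2**200)

adam_step = alpha * m_hat / (np.sqrt(v_hat) + eps)
sgd_step = alpha * g
print(adam_step)  # both entries close to alpha = 0.001
print(sgd_step)   # entries differ by the same 10^4 factor as the gradients
```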

EXPERIMENT 4

Section 6.4: Bias Correction Ablation

What they tested

Question: Does bias correction (Page 11) actually matter, or can we skip it?

Setup: Train a variational autoencoder (VAE) with and without bias correction, across many settings of $\alpha$, $\beta_1$, and $\beta_2$.

Key finding: Without bias correction, high $\beta_2$ values (close to 1) cause instabilities — the loss jumps around or fails to converge, especially in the first few epochs.

Result: Bias correction is essential. Without it, Adam degrades to RMSProp (which lacks correction) and performs worse across the board.

Bias correction ablation
Left (10 epochs): the variant with bias correction (red) achieves lower loss than the variant without it (green) across all step sizes. The gap is largest early in training.

Right (100 epochs): the gap narrows over time (bias correction matters less as $t$ grows), but the corrected version is still consistently better. This is the paper's main evidence that bias correction is worth including.
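The early-training instability is visible with nothing more than arithmetic. A one-step sketch (toy gradient value assumed): with $\beta_2 = 0.999$, the uncorrected second moment after one step is 1000x too small, so dividing by $\sqrt{v}$ inflates the first update by about $\sqrt{1000} \approx 32$x.

```python
import math

# Why skipping bias correction destabilizes the first steps (toy numbers).
beta2 = 0.999
g = 1.0                                  # first gradient seen
v = (1 - beta2) * g**2                   # uncorrected second moment: 0.001
v_hat = v / (1 - beta2**1)               # corrected: exactly g**2 = 1.0
print(v, v_hat)

alpha = 0.001
step_uncorrected = alpha * g / math.sqrt(v)      # inflated ~32x
step_corrected = alpha * g / math.sqrt(v_hat)    # 0.001, as intended
print(step_uncorrected, step_corrected)
```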

SUMMARY

The Complete Picture from All Experiments

What the experiments prove:

  1. On convex problems (logistic regression): Adam matches AdaGrad, both beat SGD
  2. On non-convex problems (neural networks): Adam beats all competitors
  3. On sparse data (IMDB text): Adam handles sparsity as well as AdaGrad (which was designed for it)
  4. With extra noise (dropout): Adam's momentum smooths it out, performs best
  5. Bias correction matters: removing it degrades performance, especially early on
  6. Default hyperparameters work: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ performed well across all experiments
Optimizer Scoreboard Across All Experiments

              Convex   Non-Convex   Sparse Data   Dropout   Auto-Tuning
  Adam           1          1            1            1           1
  AdaGrad        1          3            1            3           3
  RMSProp        2          2            3            2           2
  SGD            3          4            4            4           4

Rankings: 1 = best, 4 = worst (based on convergence speed across the paper's experiments). Adam is the only optimizer that places 1st in every category.

For your demo: the key takeaway

"Adam combines the best of both worlds: AdaGrad's ability to handle sparse gradients and different parameter scales, and momentum's ability to smooth out stochastic noise. It works well out of the box with default settings, across convex and non-convex problems, with and without additional noise like dropout."

Glossary ; Experiments

Training cost / loss
The y-axis on all experiment plots. How wrong the model's predictions are. Lower = better.
Epoch
One complete pass through the training data. The x-axis on the plots.
Dropout
Randomly disabling parts of a neural network during training. Prevents overfitting but adds noise.
CNN (Convolutional Neural Network)
A neural network that uses filters to detect patterns (edges, shapes) in images. More complex than logistic regression.
Ablation study
Removing one component (e.g., bias correction) to see how much it matters. "If I remove this, does performance drop?"
AdaGrad
Adaptive gradient method. Good for sparse features. But accumulates squared gradients forever, so step size shrinks to zero over time. Adam fixes this with the decaying average ($\beta_2$).
RMSProp
Like Adam but without momentum and without bias correction. Adam = RMSProp + momentum + bias correction.
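Both glossary points can be checked in a few lines (constant toy gradient assumed; values are illustrative): AdaGrad's ever-growing accumulator shrinks its step like $1/\sqrt{t}$, while Adam's decaying average keeps the step near $\alpha$. The loop also labels which line is the momentum part, the RMSProp part, and the bias correction.

```python
import math

# Constant toy gradient for 10,000 steps: AdaGrad's accumulator G grows
# without bound, so its effective step shrinks toward zero; Adam's decaying
# average v levels off, so its step stays near alpha.
alpha, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
g = 1.0
G = m = v = 0.0

for t in range(1, 10001):
    G += g**2                            # AdaGrad: sum squared grads forever
    m = b1 * m + (1 - b1) * g            # momentum part of Adam
    v = b2 * v + (1 - b2) * g**2         # RMSProp part of Adam

adagrad_step = alpha * g / (math.sqrt(G) + eps)
m_hat = m / (1 - b1**10000)              # bias correction part of Adam
v_hat = v / (1 - b2**10000)
adam_step = alpha * m_hat / (math.sqrt(v_hat) + eps)

print(adagrad_step)  # ~alpha/100: has shrunk 100x and keeps shrinking
print(adam_step)     # ~alpha: stable
```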