Page 06: Stochastic Gradient Descent

MATH 3850: What changes when your data is too big to look at all at once

PART A

The Problem: Data is Too Big

Analogy: Surveying a giant farm field.

You need to know the average soil temperature of a 10,000-acre field.

In optimization: "the field" is your entire dataset. "Taking a reading" is computing the gradient for one data point. With 60,000 data points (like MNIST), computing the exact gradient every step is painfully slow.

Exact Gradient Descent (your course)

  • Look at ALL data to compute gradient
  • Exact direction every step
  • Smooth path to minimum
  • Slow per step (big datasets = painful)

$$\nabla f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(\theta)$$

sum over ALL $N$ data points

  • $\nabla f(\theta)$: the exact (full-batch) gradient of the loss with respect to parameters $\theta$
  • $N$: total number of data points in the entire dataset (e.g., 60,000 for MNIST)
  • $\nabla f_i(\theta)$: gradient contribution from a single data point $i$
  • $\frac{1}{N}\sum$: average over all data points; every single one gets used each step
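As a sketch, here is that full-batch average in NumPy on a toy least-squares problem (the dataset, loss, and all numbers below are invented for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))           # toy dataset: 60 points, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # targets generated from a known parameter vector

def grad_i(theta, i):
    """Gradient of the squared error on data point i: nabla f_i(theta)."""
    residual = X[i] @ theta - y[i]
    return residual * X[i]

def full_gradient(theta):
    """Exact (full-batch) gradient: average of per-point gradients over ALL N points."""
    N = len(X)
    return sum(grad_i(theta, i) for i in range(N)) / N

theta = np.zeros(3)
g = full_gradient(theta)
print(g.shape)  # one gradient entry per parameter
```

Note the cost: one call to `full_gradient` touches every one of the $N$ data points.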

Stochastic Gradient Descent (the paper)

  • Look at a random batch (e.g., 128 points)
  • Approximate direction; noisy but close
  • Zigzaggy path, but still gets there
  • Fast per step

$$g_t = \nabla f_t(\theta)$$

gradient from just one random batch

  • $g_t$: the stochastic gradient at timestep $t$; a noisy estimate of the true gradient
  • $f_t$: loss evaluated on minibatch $t$ (a random subset, e.g., 128 images out of 60,000)
  • $\theta$: current model parameters (the knobs we are tuning)
"Stochastic" just means "random." You grab a random subset of data (a minibatch, typically 128 data points), compute the gradient on just that subset, and take a step. The gradient is noisy because it's based on a sample, not the full dataset. But on average, it points in the right direction.
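A minimal NumPy sketch of that sampling step (the per-point gradients below are simulated numbers, not gradients from a real model):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 60_000                      # dataset size, as in MNIST
# Simulated 1-D per-point gradients, centered around a "true" value of 5.0
per_point_grads = rng.normal(loc=5.0, scale=2.0, size=N)

full_grad = per_point_grads.mean()      # exact gradient: average over ALL N points

batch = rng.choice(N, size=128, replace=False)  # minibatch: 128 random indices
g_t = per_point_grads[batch].mean()             # stochastic gradient: average over the batch only

print(full_grad, g_t)   # close, but not identical: g_t is a noisy estimate
```

The minibatch estimate touches 128 points instead of 60,000, which is the entire speed advantage.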
PART B

First: How to Read the Plots Below

What are those ring-shaped lines?

They're like elevation lines on a hiking trail map, but for a math function instead of a mountain.

  1. Imagine a bowl sitting on a table. The bottom is the minimum.
  2. Look straight down at the bowl from above (bird's eye view). You'd see rings, like a target.
  3. Each ring = one height level. All points on the same ring have the same function value.
  4. Center = lowest point (minimum). Outer rings = higher up (worse).
Smooth Path vs Noisy Path on the Same Landscape
Left: Full-batch GD (smooth, direct). Right: SGD (noisy, zigzaggy). Both reach the minimum, but SGD wastes movement zigzagging side-to-side.
3D bowl vs contour map
Left: The 3D bowl. Red star at the bottom is the minimum.
Right: Same bowl viewed from above. The rings are elevation lines. Center = minimum. This is called a contour plot, and we'll use it to compare exact GD vs SGD below.
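You can check the "same ring = same height" idea numerically. This sketch uses the bowl $f(x, y) = x^2 + y^2$, chosen purely for illustration:

```python
def f(x, y):
    """A simple bowl; its contour lines are circles centered on the minimum at (0, 0)."""
    return x**2 + y**2

# Two different points on the SAME ring (same distance from the center)...
print(f(3.0, 4.0), f(5.0, 0.0))   # both 25.0: same height, so same contour line

# ...and a point on an inner ring, closer to the minimum.
print(f(1.0, 1.0))                # 2.0: lower height, closer to the center
```

Every point a contour plot draws on one ring would print the same function value.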
PART C

See the Difference: Exact vs. Stochastic

Exact GD vs SGD on contour
Both plots: Start at yellow square (top right), trying to reach the star (center = minimum).

Left (Exact GD): Each green dot is one step. Path is smooth and direct. This is what you've been doing in class; using the full dataset every step.

Right (SGD): Each red dot is one step. Path zigzags because each step uses a random batch, so the gradient direction is noisy. Still reaches the minimum, but wastes movement going side-to-side.
PART D

Why Bother with the Noisy Version?

The Adam paper uses MNIST: 60,000 images, each 784 pixels.

Exact GD: each step

  • Compute gradient using all 60,000 images = 60,000 gradient calculations
  • Very slow per step

SGD: each step

  • Compute gradient using 128 random images = 128 gradient calculations
  • About 469x faster per step (60,000 / 128 ≈ 469)

Each SGD step is noisier. But you can take hundreds of noisy steps in the time it takes for one exact step. In practice, SGD wins overwhelmingly.
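The per-step cost claim is plain arithmetic, which a few lines can verify (numbers taken from the MNIST example above):

```python
N = 60_000      # dataset size
batch = 128     # minibatch size

cost_exact = N          # per-point gradient evaluations in one exact GD step
cost_sgd = batch        # per-point gradient evaluations in one SGD step
speedup = cost_exact / cost_sgd

steps_per_epoch = -(-N // batch)   # ceiling division: SGD steps to see all data once

print(round(speedup, 2), steps_per_epoch)  # 468.75 469
```

So in the time one exact step takes, SGD can take roughly one full epoch's worth of noisy steps.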

Hand Computation: SGD vs Exact GD on a Tiny Example

Suppose our "dataset" has 4 data points with gradients: $g_1 = 6, g_2 = 2, g_3 = 8, g_4 = 4$

Exact gradient (use ALL data):

We average every data point's gradient to get the true direction.

$\nabla f = \frac{1}{4}(6 + 2 + 8 + 4) = \frac{20}{4} = 5.0$ (exact average)

Stochastic gradient (random batch of 2):

Instead of all 4 points, we only sample 2 and average those. Cheaper but noisier.

Randomly pick data points 1 and 3: $g_t = \frac{1}{2}(6 + 8) = \frac{14}{2} = 7.0$

Not exactly 5.0, but close. The "noise" is $7.0 - 5.0 = 2.0$.

Another random batch (points 2 and 4):

$g_t = \frac{1}{2}(2 + 4) = \frac{6}{2} = 3.0$

This time we underestimate. Noise is $3.0 - 5.0 = -2.0$.

On average over many batches:

Our two example batches happen to split the dataset between them, so averaging their gradients recovers the exact average. More generally, averaging over all $\binom{4}{2} = 6$ possible size-2 minibatches also gives exactly $5.0$.

$\mathbb{E}[g_t] = \frac{7.0 + 3.0}{2} = 5.0$ = the exact gradient!

The stochastic gradient is unbiased: noisy per step, but correct on average.

This is the key insight: each individual SGD step is noisy, but the noise averages out over many steps. The path zigzags but still converges. And each step is MUCH cheaper than exact GD.
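The hand computation can be checked exhaustively in a few lines: enumerate every possible size-2 minibatch of the four gradients and confirm that their average equals the exact gradient.

```python
from itertools import combinations

grads = [6, 2, 8, 4]                    # per-point gradients g_1..g_4 from the example
exact = sum(grads) / len(grads)         # full-batch gradient: 5.0

# Every possible minibatch of 2 points, and its stochastic gradient
batch_grads = [(a + b) / 2 for a, b in combinations(grads, 2)]
print(sorted(batch_grads))              # [3.0, 4.0, 5.0, 5.0, 6.0, 7.0] -- noisy per batch

# Unbiasedness: the mean over ALL minibatch gradients recovers the exact gradient
print(sum(batch_grads) / len(batch_grads))  # 5.0
```

Individual batches miss by as much as ±2.0, yet the average over all of them is exactly right.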

PART E

The Paper's Notation for SGD

The Adam paper writes:

$$g_t = \nabla_\theta f_t(\theta_{t-1})$$

  • $g_t$: the stochastic gradient vector at timestep $t$; one number per parameter
  • $\nabla_\theta$: "take partial derivatives with respect to every parameter in $\theta$"
  • $f_t$: loss function on minibatch $t$; a different random subset of data each step (this subscript $t$ is what makes it stochastic)
  • $\theta_{t-1}$: parameter values from the previous step; we evaluate the gradient at where we currently are

The subscript $t$ on $f_t$ is crucial: it means each step uses a different random batch. That's what makes it stochastic. In your course, $f$ was always the same function. Here, $f_t$ changes every step because the data sample changes.

Connection to your course

In your assignments, $f(x) = \frac{1}{2}x^TQx$ was a fixed function. You computed $\nabla f(x_k) = Qx_k$ using the entire function.

In the Adam paper, $f_t(\theta)$ is the loss on a random batch. The gradient $g_t$ is noisier but cheaper. Everything else (the update rule, the step, the convergence) works the same way, just with a noisy gradient instead of an exact one.
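Here is a sketch of that comparison on a quadratic like the course's. The matrix $Q$, step size, and noise level are all chosen for illustration, and the "minibatch noise" is simulated by adding zero-mean noise to the exact gradient $Qx$:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 0.0],
              [0.0, 1.0]])   # positive definite, so f(x) = 1/2 x^T Q x has its minimum at 0
alpha = 0.1                  # step size

def run(noisy, steps=200):
    x = np.array([5.0, 5.0])          # start away from the minimum at the origin
    for _ in range(steps):
        g = Q @ x                     # exact gradient of the quadratic
        if noisy:
            g = g + rng.normal(scale=1.0, size=2)  # simulated minibatch noise (zero mean)
        x = x - alpha * g             # same update rule as in class
    return x

x_gd = run(noisy=False)
x_sgd = run(noisy=True)
print(np.linalg.norm(x_gd), np.linalg.norm(x_sgd))  # GD lands on the minimum; SGD hovers near it
```

Exact GD converges to the minimum to machine precision; the noisy version never settles exactly but stays in a small neighborhood of it, which is the zigzag behavior from the contour plots.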

Glossary: Stochastic Gradient Descent

  • Stochastic: "random." The gradient is computed from a random sample of data instead of the full dataset.
  • Minibatch: the random sample, typically 64 or 128 data points out of thousands. Bigger batch = less noise but slower.
  • SGD (Stochastic Gradient Descent): gradient descent using minibatch gradients. Fast per step but noisy/zigzaggy.
  • $g_t$: the gradient at timestep $t$, computed from minibatch $t$. A noisy estimate of the true gradient.
  • $f_t(\theta)$: the loss function evaluated on the random minibatch at step $t$. Changes every step because the batch changes.
  • Unbiased estimate: $\mathbb{E}[g_t] = \nabla f(\theta)$. The stochastic gradient is wrong on any single step but correct on average.
  • Epoch: one complete pass through the entire dataset. With 60,000 images and batches of 128, one epoch = 469 steps.
  • Contour plot: bird's-eye view of a function. Rings = elevation lines; center = minimum. Used to visualize optimization paths.