Page 03: Gradient Descent; The Full Picture

MATH 3850; Step sizes, convergence rates, condition numbers, and why GD alone isn't enough

PART A

Step Size: The Most Important Knob in GD

Analogy: Walking downhill blindfolded.

The step size $\alpha$ is how far you walk each step. Three scenarios:

Step size comparison
Left: $\alpha = 0.02$; steps are tiny, barely moves after 15 iterations. Crawling.
Middle: $\alpha = 0.1$; smooth convergence, reaches near-zero quickly.
Right: $\alpha = 0.9$; bounces wildly! Each step overshoots the minimum and lands on the other side. This can even diverge (blow up) if $\alpha$ is too large.

Hand Computation: Why $\alpha$ too big causes bouncing

We trace GD on a simple quadratic to see what happens when the step size $\alpha$ is nearly at the stability limit. The GD update rule is $x_{k+1} = x_k - \alpha \cdot f'(x_k)$.

$f(x) = x^2$, derivative $= 2x$, $\alpha = 0.9$, start at $x_0 = 10$:

Iteration 1:

$x_1 = 10 - 0.9 \times 2(10) = 10 - 18 = -8$

We overshot! Went from $+10$ to $-8$; past the minimum at $0$ and out the other side.

Iteration 2:

$x_2 = -8 - 0.9 \times 2(-8) = -8 + 14.4 = 6.4$

Bounced back to the positive side. $|x|$ went from $10 \to 8 \to 6.4$. Slowly converging, but wastefully.

The pattern:

$x_{k+1} = x_k(1 - 2\alpha) = x_k(1 - 1.8) = -0.8 \cdot x_k$

The factor is $-0.8$: the sign flips each step (bouncing) and $|{-0.8}| \lt 1$ (slowly shrinking). If $\alpha \gt 1$, then $|1 - 2\alpha| \gt 1$ and the iterates blow up!
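The hand trace above can be verified in a few lines of Python. This is a minimal sketch of the GD update $x_{k+1} = x_k - \alpha f'(x_k)$, not assignment code; the function names are illustrative.

```python
def gd_step(x, alpha, grad):
    """One gradient-descent update: x_{k+1} = x_k - alpha * f'(x_k)."""
    return x - alpha * grad(x)

grad = lambda x: 2 * x   # f(x) = x^2  =>  f'(x) = 2x
alpha = 0.9
x = 10.0
for k in range(3):
    x = gd_step(x, alpha, grad)
    print(k + 1, x)
# iterates: -8.0, 6.4, -5.12 -- sign flips each step, |x| shrinks by 0.8
```

Each printed iterate is the previous one times $(1 - 2\alpha) = -0.8$, exactly as the pattern predicts.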

This is why your Assignment 1 used Armijo line search to pick $\alpha$; it guarantees the step size isn't too big.
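Your Assignment 1 code is not reproduced here, but a minimal backtracking sketch of the Armijo rule looks like this (parameter names `alpha0`, `beta`, `c` are illustrative defaults, not the assignment's): shrink $\alpha$ until the step gives sufficient decrease.

```python
def armijo_alpha(f, grad_x, x, alpha0=1.0, beta=0.5, c=1e-4):
    """Backtrack from alpha0, halving until the Armijo sufficient-decrease
    condition f(x - a*g) <= f(x) - c * a * g^2 holds (1-D version)."""
    alpha = alpha0
    fx = f(x)
    while f(x - alpha * grad_x) > fx - c * alpha * grad_x**2:
        alpha *= beta
    return alpha

f = lambda x: x**2
x = 10.0
a = armijo_alpha(f, 2 * x, x)   # gradient at x = 10 is 20
print(a)
```

Starting from $\alpha_0 = 1$ (which would be unstable, since $2/\lambda_{\max} = 1$), one halving gives $\alpha = 0.5$, safely inside the stable range.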

The stability condition for $f(x) = \frac{1}{2}x^TQx$:

For GD to converge, $\alpha$ must satisfy: $\alpha \lt \frac{2}{\lambda_{\max}(Q)}$

$\alpha$
The step size (learning rate) used each iteration of gradient descent
$\lambda_{\max}(Q)$
The largest eigenvalue of $Q$; controls the steepest curvature direction
$\frac{2}{\lambda_{\max}}$
The critical threshold: any $\alpha$ above this causes GD to diverge (blow up)

For $f(x) = x^2$: $Q = 2$, so $\alpha \lt 2/2 = 1$. That's why $\alpha = 0.9$ barely works but $\alpha = 1.1$ would explode.
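This threshold is easy to check numerically. A quick sketch: run GD on $f(x) = x^2$ with a step just below and just above the critical value $\alpha = 1$.

```python
def run_gd(alpha, x=10.0, steps=20):
    """Run GD on f(x) = x^2 and return the final distance from the minimum."""
    for _ in range(steps):
        x = x - alpha * 2 * x    # f'(x) = 2x
    return abs(x)

print(run_gd(0.9))   # factor |1 - 1.8| = 0.8 < 1  ->  shrinks: 10 * 0.8**20
print(run_gd(1.1))   # factor |1 - 2.2| = 1.2 > 1  ->  grows:   10 * 1.2**20
```

The first run converges toward $0$; the second grows without bound, confirming that $\alpha \gt 2/\lambda_{\max}$ diverges.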

PART B

Convergence: How Fast Are We Getting There?

Analogy: Paying off debt.

Linear convergence (gradient descent):

$$\|x_{k+1} - x^*\| \leq r \cdot \|x_k - x^*\| \quad \text{where } r = \frac{\kappa - 1}{\kappa + 1}$$

Each step, the distance to the minimum shrinks by the same ratio $r$. If $r = 0.8$, you lose 20% each step.

$x_{k+1}$
The next iterate (where we land after one GD step)
$x_k$
The current iterate (where we are now)
$x^*$
The true minimum of the function (what we are trying to reach)
$\|x_k - x^*\|$
Distance from the current point to the minimum; "how far we still have to go"
$r$
Convergence ratio: the fraction of error that remains after each step (lower is faster)
$\kappa$
Condition number $= \lambda_{\max}/\lambda_{\min}$; measures how stretched the bowl is (1 = round, 100 = bathtub)

$\kappa$ (kappa) is the condition number; it measures how "stretched" the bowl is.

$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$ = ratio of largest to smallest eigenvalue of $Q$

Convergence rates and condition number
Left: GD convergence on a log scale. A straight line means linear convergence; constant ratio each step. This is what you saw in Assignment 1.

Right: Effect of $\kappa$. $\kappa = 2$ (round bowl) converges fast. $\kappa = 10$ (stretched) is slower. $\kappa = 50$ (very stretched) is painfully slow. Higher $\kappa$ = slower GD.
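The ratio $r = (\kappa-1)/(\kappa+1)$ can be measured directly. A sketch, using the classic optimal fixed step $\alpha = 2/(\lambda_{\max} + \lambda_{\min})$ for which the worst-case contraction is exactly this ratio (the $\kappa = 10$ values below are illustrative):

```python
import math

# GD on f(x, y) = 1/2 (lmax*x^2 + lmin*y^2); measure the per-step
# contraction of the distance to the minimum at the origin.
lmax, lmin = 10.0, 1.0               # kappa = 10
alpha = 2 / (lmax + lmin)            # classic optimal fixed step
x, y = 1.0, 1.0
ratio = None
for _ in range(30):
    xn, yn = x - alpha * lmax * x, y - alpha * lmin * y   # grad = (lmax*x, lmin*y)
    ratio = math.hypot(xn, yn) / math.hypot(x, y)
    x, y = xn, yn
print(ratio)    # (kappa - 1)/(kappa + 1) = 9/11 = 0.818...
```

Each coordinate is scaled by $1 - \alpha\lambda_i = \pm 9/11$, so the measured ratio matches $(\kappa-1)/(\kappa+1)$ exactly.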

PART C

The Condition Number: Why Some Problems Are Harder

Analogy: A bathtub vs. a mixing bowl.

A round mixing bowl ($\kappa \approx 1$): slopes are similar in all directions. GD goes straight to the bottom.

A long narrow bathtub ($\kappa = 50$): very steep across the width, very gentle along the length. GD takes big steps across (steep) but tiny steps along (gentle). The result: it zigzags back and forth across the width while slowly creeping along the length.
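The bathtub in numbers: a sketch on $f(x, y) = \frac{1}{2}(50x^2 + y^2)$, so $\kappa = 50$ and the largest stable step is $2/50 = 0.04$ (the choice $\alpha = 0.038$ is illustrative).

```python
# At alpha = 0.038, the steep x direction bounces (factor 1 - 50*0.038 = -0.9)
# while the gentle y direction barely moves (factor 1 - 0.038 = 0.962).
alpha = 0.038
x, y = 1.0, 1.0
for k in range(5):
    x, y = x - alpha * 50 * x, y - alpha * y
    print(k + 1, round(x, 4), round(y, 4))
# x alternates sign each step (zigzag across the valley); y creeps toward 0
```

The single $\alpha$ is forced small by the steep direction, so the gentle direction crawls: exactly the zigzag-and-creep behavior described above.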

Hand Computation: Condition Numbers from Your Assignments

For each diagonal $Q$, the eigenvalues are just the diagonal entries. The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ tells us how stretched the bowl is, and the convergence ratio $r = (\kappa - 1)/(\kappa + 1)$ tells us how much error remains after each GD step.

Assignment 2 (SR1): $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$

$\lambda_{\max} = 2, \quad \lambda_{\min} = 1$

$\kappa = 2/1 = 2$; nearly round. GD works well.

Convergence ratio: $r = (2-1)/(2+1) = 1/3 \approx 0.33$; error shrinks by 67% each step!

Quiz 6 (BFGS): $Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$

$\lambda_{\max} = 3, \quad \lambda_{\min} = 2$

$\kappa = 3/2 = 1.5$; almost round. Very easy for GD.

Convergence ratio: $r = (1.5-1)/(1.5+1) = 0.5/2.5 = 0.2$; error shrinks by 80% each step!

A harder problem: $Q = \begin{bmatrix}100&0\\0&1\end{bmatrix}$

$\kappa = 100/1 = 100$; very elongated. GD zigzags badly.

Convergence ratio: $r = 99/101 \approx 0.98$; error only shrinks by 2% each step. Hundreds of iterations needed!
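The three hand computations above follow one recipe, which can be wrapped in a small helper (the function name is illustrative). For diagonal $Q$ the eigenvalues are just the diagonal entries.

```python
def kappa_and_ratio(diag):
    """Condition number and GD convergence ratio for a diagonal Q."""
    lmax, lmin = max(diag), min(diag)
    kappa = lmax / lmin
    r = (kappa - 1) / (kappa + 1)
    return kappa, r

for name, diag in [("Assignment 2 (SR1)", [2, 1]),
                   ("Quiz 6 (BFGS)", [2, 3]),
                   ("Harder problem", [100, 1])]:
    kappa, r = kappa_and_ratio(diag)
    print(name, kappa, round(r, 3))
# kappa = 2, r ~ 0.333;  kappa = 1.5, r = 0.2;  kappa = 100, r ~ 0.98
```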

This is exactly why methods like Newton and BFGS exist; they handle high $\kappa$ much better. And it's why Adam uses adaptive per-parameter step sizes: each direction gets its own $\alpha$, so the bathtub problem goes away.

Round vs stretched bowl contour
Left ($\kappa = 1$, round): Contours are circular. GD goes straight to the center. No zigzag.

Right ($\kappa = 10$, stretched): Contours are elongated ellipses. GD zigzags because y is 10x steeper than x, but $\alpha$ is the same for both. This is the fundamental limitation that Adam's adaptive learning rates fix: give x and y their own step sizes.

Round contours ($\kappa \approx 1$): GD heads straight for the minimum. Elongated contours ($\kappa = 10$): GD wastes steps zigzagging across the narrow axis.

The fundamental limitation of gradient descent

One step size $\alpha$ for ALL parameters. If you have 784 parameters (like MNIST), some might need $\alpha = 0.1$ and others might need $\alpha = 0.001$. GD forces them all to use the same $\alpha$.
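A sketch of the fix, in the spirit of diagonal preconditioning and Adam's per-parameter rates (the values below are a hypothetical two-parameter example, not Adam itself): on $f(x, y) = \frac{1}{2}(100x^2 + y^2)$, giving each coordinate its own step $\alpha_i = 1/\lambda_i$ hits the minimum in one step, which no single shared $\alpha$ can do.

```python
x, y = 1.0, 1.0
alpha_x, alpha_y = 1 / 100, 1.0      # per-parameter steps alpha_i = 1/lambda_i
x, y = x - alpha_x * 100 * x, y - alpha_y * y    # grad = (100x, y)
print(x, y)    # both coordinates land exactly at the minimum (0, 0)
```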

Solutions explored in your course: Armijo line search (Assignment 1) keeps $\alpha$ safely small; Newton, SR1, and BFGS (Assignment 2, Quiz 6) use curvature information to handle high $\kappa$; Adam gives each parameter its own adaptive step size.

Glossary; Gradient Descent

Step size / Learning rate ($\alpha$)
How far you step each iteration. Too small = slow. Too big = bounce/diverge. Must be $\lt 2/\lambda_{\max}$.
Linear convergence
Error shrinks by a constant ratio each step. Straight line on log scale. GD achieves this.
Quadratic convergence
Error shrinks by a SQUARED ratio; explosively fast. Newton's method achieves this.
Condition number ($\kappa$)
$\kappa = \lambda_{\max} / \lambda_{\min}$. Measures how stretched the bowl is. $\kappa = 1$ is round (easy). $\kappa = 100$ is a bathtub (hard for GD).
Convergence ratio ($r$)
$r = (\kappa - 1)/(\kappa + 1)$. How much error remains after each step. Lower is better.
Divergence
The opposite of convergence; the algorithm gets worse each step. Happens when $\alpha$ is too big.