MATH 3850: Step sizes, convergence rates, condition numbers, and why GD alone isn't enough
PART A
Analogy: Walking downhill blindfolded.
The step size $\alpha$ is how far you walk each step. Three scenarios: too small and progress is painfully slow, too large and you overshoot the valley (or even climb out the other side), just right and you reach the bottom quickly.
We trace GD on a simple quadratic to see what happens when the step size $\alpha$ is nearly at the stability limit. The GD update rule is $x_{k+1} = x_k - \alpha \cdot f'(x_k)$.
$f(x) = x^2$, derivative $= 2x$, $\alpha = 0.9$, start at $x_0 = 10$:
Iteration 1:
$x_1 = 10 - 0.9 \times 2(10) = 10 - 18 = -8$
We overshot! We went from $+10$ to $-8$, past the minimum at $0$ and out the other side.
Iteration 2:
$x_2 = -8 - 0.9 \times 2(-8) = -8 + 14.4 = 6.4$
Bounced back to the positive side. $|x|$ went from $10 \to 8 \to 6.4$: slowly converging, but wastefully.
The pattern:
$x_{k+1} = x_k(1 - 2\alpha) = x_k(1 - 1.8) = -0.8 \cdot x_k$
The factor is $-0.8$: the sign flips each step (bouncing) and $|{-0.8}| \lt 1$ (slowly shrinking). If $\alpha \gt 1$, then $1 - 2\alpha \lt -1$, so the factor has magnitude greater than $1$ and the iterates blow up!
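The trace above can be reproduced in a few lines (a minimal sketch; `gd_quadratic` is an illustrative name, not from the course):

```python
# Gradient descent on f(x) = x^2 with alpha = 0.9, starting at x0 = 10.
# Each update multiplies x by (1 - 2*alpha) = -0.8: sign flips, |x| shrinks.

def gd_quadratic(x0, alpha, steps):
    """Run GD on f(x) = x^2 (derivative 2x) and return all iterates."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # x_{k+1} = x_k - alpha * f'(x_k)
        xs.append(x)
    return xs

xs = gd_quadratic(10.0, 0.9, 4)
print([round(x, 4) for x in xs])  # [10.0, -8.0, 6.4, -5.12, 4.096]
```

The printed iterates bounce between positive and negative while shrinking by a factor of $0.8$ in magnitude, exactly matching the hand computation.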
This is why your Assignment 1 used Armijo line search to pick $\alpha$; it guarantees the step size isn't too big.
The stability condition for $f(x) = \frac{1}{2}x^TQx$:
For GD to converge, $\alpha$ must satisfy: $\alpha \lt \frac{2}{\lambda_{\max}(Q)}$
For $f(x) = x^2$: $Q = 2$, so $\alpha \lt 2/2 = 1$. That's why $\alpha = 0.9$ barely works but $\alpha = 1.1$ would explode.
Analogy: Paying off debt.
Linear convergence (gradient descent):
$$\|x_{k+1} - x^*\| \leq r \cdot \|x_k - x^*\| \quad \text{where } r = \frac{\kappa - 1}{\kappa + 1}$$
Each step, the distance to the minimum shrinks to a fraction $r$ of what it was. If $r = 0.8$, you pay off only 20% of the remaining distance each step.
$\kappa$ (kappa) is the condition number; it measures how "stretched" the bowl is.
$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$: the ratio of the largest to smallest eigenvalue of $Q$.
Analogy: A bathtub vs. a mixing bowl.
A round mixing bowl ($\kappa \approx 1$): slopes are similar in all directions. GD goes straight to the bottom.
A long narrow bathtub ($\kappa = 50$): very steep across the width, very gentle along the length. GD takes big steps across (steep) but tiny steps along (gentle). The result: it zigzags back and forth across the width while slowly creeping along the length.
For a diagonal $Q$, the eigenvalues are just the diagonal entries. The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ tells us how stretched the bowl is, and the convergence ratio $r = (\kappa - 1)/(\kappa + 1)$ tells us how much error remains after each GD step.
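As a sanity check on the arithmetic, $\kappa$ and $r$ for a diagonal $Q$ can be computed in a few lines (a sketch; `kappa_and_rate` is an illustrative name):

```python
# For a diagonal Q, the eigenvalues are the diagonal entries, so the
# condition number and GD convergence ratio are one line each.

def kappa_and_rate(diag):
    """Condition number kappa and convergence ratio r for Q = diag(diag)."""
    lam_max, lam_min = max(diag), min(diag)
    kappa = lam_max / lam_min
    r = (kappa - 1) / (kappa + 1)
    return kappa, r

for diag in ([2, 1], [2, 3], [100, 1]):
    kappa, r = kappa_and_rate(diag)
    print(diag, kappa, round(r, 3))   # [2,1]->r=0.333, [2,3]->0.2, [100,1]->0.98
```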
Assignment 2 (SR1): $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$
$\lambda_{\max} = 2, \quad \lambda_{\min} = 1$
$\kappa = 2/1 = 2$; nearly round. GD works well.
Convergence ratio: $r = (2-1)/(2+1) = 1/3 \approx 0.33$; error shrinks by 67% each step!
Quiz 6 (BFGS): $Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$
$\lambda_{\max} = 3, \quad \lambda_{\min} = 2$
$\kappa = 3/2 = 1.5$; almost round. Very easy for GD.
Convergence ratio: $r = (1.5-1)/(1.5+1) = 0.5/2.5 = 0.2$; error shrinks by 80% each step!
A harder problem: $Q = \begin{bmatrix}100&0\\0&1\end{bmatrix}$
$\kappa = 100/1 = 100$; very elongated. GD zigzags badly.
Convergence ratio: $r = 99/101 \approx 0.98$; error only shrinks by 2% each step. Hundreds of iterations needed!
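To see the iteration counts concretely, here is a sketch that runs GD on $f(x) = \frac{1}{2}x^TQx$ with the best fixed step for a quadratic, $\alpha = 2/(\lambda_{\max} + \lambda_{\min})$, and counts steps to convergence (`gd_steps` is an illustrative name; the tolerance and starting point are arbitrary choices):

```python
# GD on f(x) = 1/2 x^T Q x with diagonal Q, counting iterations until
# every coordinate is below tol. With the step alpha = 2/(l_max + l_min),
# each coordinate contracts by at most r = (kappa-1)/(kappa+1) per step.

def gd_steps(diag, x0, alpha, tol=1e-6, max_iter=100000):
    """Return how many GD steps reach max|x_i| < tol for Q = diag(diag)."""
    x = list(x0)
    for k in range(max_iter):
        if max(abs(v) for v in x) < tol:
            return k
        # gradient of 1/2 x^T Q x is Q x; for diagonal Q that's lam_i * x_i
        x = [v - alpha * lam * v for v, lam in zip(x, diag)]
    return max_iter

print(gd_steps([100.0, 1.0], [1.0, 1.0], alpha=2 / 101))  # hundreds of steps
print(gd_steps([2.0, 1.0], [1.0, 1.0], alpha=2 / 3))      # about a dozen steps
```

With $\kappa = 100$ the hard problem needs roughly $\log(10^{-6})/\log(99/101) \approx 690$ iterations, while the $\kappa = 2$ problem finishes in about 13.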
This is exactly why methods like Newton and BFGS exist; they handle high $\kappa$ much better. And it's why Adam uses adaptive per-parameter step sizes: each direction gets its own $\alpha$, so the bathtub problem goes away.
Round contours ($\kappa \approx 1$): GD heads straight for the minimum. Elongated contours ($\kappa = 10$): GD wastes steps zigzagging across the narrow axis.
GD uses one step size $\alpha$ for ALL parameters. If you have 784 parameters (like MNIST), some might need $\alpha = 0.1$ while others need $\alpha = 0.001$, but GD forces them all to use the same $\alpha$.
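A minimal illustration of why per-coordinate step sizes fix the bathtub problem (this is a hand-tuned sketch where each coordinate's step is scaled by $1/\lambda_i$, not the actual Adam update; `gd_per_coord` is a made-up name):

```python
# With Q = diag(100, 1), scaling each coordinate's step by 1/lambda_i makes
# every direction contract at the same rate (1 - alpha), so the zigzag
# disappears. This is a simplified illustration, not Adam itself.

def gd_per_coord(diag, x0, alpha, steps):
    """GD on f(x) = 1/2 x^T Q x with a per-coordinate step alpha/lambda_i."""
    x = list(x0)
    for _ in range(steps):
        # gradient component is lam * v; effective step is alpha / lam
        x = [v - (alpha / lam) * (lam * v) for v, lam in zip(x, diag)]
    return x

x = gd_per_coord([100.0, 1.0], [1.0, 1.0], alpha=0.9, steps=20)
print(x)  # both coordinates near zero after only 20 steps
```

Compare: plain GD on the same $Q$ needed hundreds of iterations; with each direction getting its own effective $\alpha$, both coordinates shrink by $90\%$ per step.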
Solutions explored in your course: