Page 05: Step Size Selection

MATH 3850: From line search to adaptive methods, the direct bridge to Adam

PART A

Exact Line Search: Find the Best $\alpha$ by Calculus

Analogy: You know which way is downhill. Now how far should you walk?

Exact line search says: "minimize $f$ along the downhill direction. Find the $\alpha$ that makes $f$ as small as possible along that line." It's like looking down a hallway and walking to the exact lowest point of the floor.

For a quadratic $f(x) = \frac{1}{2}x^TQx$:

$$\alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}$$

This gives the exact step size that minimizes $f$ along the gradient direction.

$\alpha_k$
The optimal step size at iteration $k$; changes every iteration because the gradient changes
$g_k$
The gradient $\nabla f(x_k)$ at the current point; for quadratics $g_k = Qx_k$
$g_k^T g_k$
Dot product of the gradient with itself; equals $\|g_k\|^2$, the squared magnitude of the gradient
$Q$
The Hessian matrix of the quadratic; encodes curvature in every direction
$g_k^T Q g_k$
The gradient "weighted" by curvature; large when the gradient points along a steep direction, small when along a flat direction

Why this formula? Set $\frac{d}{d\alpha}f(x_k - \alpha g_k) = 0$ and solve for $\alpha$. The numerator ($g^Tg$) is the gradient magnitude. The denominator ($g^TQg$) accounts for curvature in that direction.
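Writing that derivation out (assuming steepest descent $p_k = -g_k$ on the quadratic $f(x) = \frac{1}{2}x^TQx$, where $g_k = Qx_k$):

```latex
\begin{aligned}
\phi(\alpha) &= f(x_k - \alpha g_k) = \tfrac{1}{2}(x_k - \alpha g_k)^T Q (x_k - \alpha g_k) \\
\phi'(\alpha) &= -g_k^T Q (x_k - \alpha g_k) = -g_k^T g_k + \alpha\, g_k^T Q g_k
  \qquad (\text{using } Qx_k = g_k) \\
\phi'(\alpha) &= 0 \;\Longrightarrow\; \alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}
\end{aligned}
```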

Hand Computation: Exact Line Search (from A1/A2)

We compute the optimal step size $\alpha_0$ for the first GD iteration. The gradient $g_0 = Qx_0$ tells us the direction; the formula $\alpha = g^Tg / g^TQg$ tells us exactly how far to go.

$Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$, $x_0 = [1, 1]^T$, so $g_0 = Qx_0 = [2, 3]^T$

Numerator: $g_0^T g_0$ (squared gradient magnitude)

$[2, 3] \cdot [2, 3] = 4 + 9 = 13$

This measures how steep the slope is at $x_0$.

Denominator: $g_0^T Q g_0$ (gradient weighted by curvature)

First: $Qg_0 = \begin{bmatrix}2&0\\0&3\end{bmatrix}\begin{bmatrix}2\\3\end{bmatrix} = \begin{bmatrix}4\\9\end{bmatrix}$

Then: $g_0^T (Qg_0) = [2, 3] \cdot [4, 9] = 8 + 27 = 35$

Result:

$\alpha_0 = 13/35 \approx 0.371$

This is the optimal step size for this particular gradient and curvature. Not 0.1 or 0.01; the exact best value.

You computed this exact calculation in A2 and Quiz 6. The line search gives a different $\alpha$ each iteration based on the current gradient and curvature.
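The hand computation above is easy to check numerically. A minimal NumPy sketch (function and variable names here are illustrative, not from the course code):

```python
import numpy as np

# Quadratic from the hand computation: f(x) = 1/2 x^T Q x
Q = np.array([[2.0, 0.0],
              [0.0, 3.0]])

def exact_line_search_step(x):
    """One steepest-descent step using the exact alpha = g^T g / g^T Q g."""
    g = Q @ x                          # gradient of 1/2 x^T Q x is Qx
    alpha = (g @ g) / (g @ (Q @ g))    # numerator 13, denominator 35 at x0
    return x - alpha * g, alpha

x0 = np.array([1.0, 1.0])
x1, alpha0 = exact_line_search_step(x0)
print(alpha0)   # 13/35 ≈ 0.3714
print(x1)       # ≈ [0.2571, -0.1143]
```

Running the step repeatedly gives a different $\alpha_k$ each iteration, since the gradient changes.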

Exact line search on a 2D quadratic surface showing contours, gradient vector, and the optimal step from x0=(1,1)

Contours of $f(x) = \frac{1}{2}(2x_1^2 + 3x_2^2)$ with the exact line search step from $x_0 = (1,1)$. The gradient (red arrow) points uphill; the step (teal arrow) goes downhill with $\alpha = 13/35 \approx 0.371$, landing exactly at the minimum along that direction.

PART B

Armijo Backtracking: "Start Big, Shrink Until It Works"

Analogy: Trying shoes on. Start with the biggest size. If it's too big (doesn't satisfy the condition), try the next size down. Keep going until you find one that fits.

Armijo sufficient decrease condition:

$$f(x_k + \alpha p_k) \leq f(x_k) + c \cdot \alpha \cdot g_k^T p_k$$

In English: "the actual decrease must be at least $c$ fraction of the predicted decrease."

$f(x_k + \alpha p_k)$
The function value after taking a step of size $\alpha$ in direction $p_k$ (the "actual" new value)
$x_k$
The current point (where we are now)
$\alpha$
The trial step size; starts at 1 and gets shrunk until the condition passes
$p_k$
The search direction; for steepest descent, $p_k = -g_k$ (downhill)
$f(x_k)$
The function value at the current point (before stepping)
$c$
A small constant (e.g., $10^{-3}$); controls how much decrease we demand. Smaller $c$ = more lenient
$g_k$
The gradient $\nabla f(x_k)$; in the 1D example below, $g_k = f'(x_k)$
$g_k^T p_k$
Directional derivative: the predicted rate of decrease along $p_k$. Negative when $p_k$ is a descent direction (which is what we want)
$c \cdot \alpha \cdot g_k^T p_k$
The "predicted decrease" scaled by $c$. Since $g_k^T p_k < 0$, this whole term is negative, making the right side smaller than $f(x_k)$
$\eta$
Shrink factor (e.g., 0.8); when the condition fails, multiply $\alpha$ by $\eta$ to try a smaller step

Algorithm (from A1):

INITIALIZATION

1. Set $\alpha = 1$, $c = 10^{-3}$, $\eta = 0.8$ (start with a full step)
2. Compute $p_k = -g_k$ (steepest descent direction: go downhill)

BACKTRACKING LOOP

3. Check: $f(x_k + \alpha p_k) \leq f(x_k) + c \cdot \alpha \cdot g_k^T p_k$ (Armijo test)
4. If YES: accept $\alpha$ and stop (sufficient decrease)
5. If NO: set $\alpha \leftarrow \eta \cdot \alpha$ and go to step 3 (shrink and retry)

Your A1 used $c = 10^{-3}$, $\eta = 0.8$.

Hand Computation: One Armijo Backtrack

We walk through the Armijo backtracking procedure step by step. The gradient at $x_k = 5$ is $g_k = f'(5) = 2(5) = 10$ (the slope is steep and positive, so we need to go left). The search direction is $p_k = -g_k = -10$ (downhill). We try $\alpha = 1$ first and check if the function decreased "enough."

$f(x) = x^2$, $x_k = 5$, $g_k = 10$, $p_k = -10$ (steepest descent direction), $c = 0.001$

Try $\alpha = 1$:

$f(5 + 1 \cdot (-10)) = f(-5) = 25$

Check: $25 \leq f(5) + 0.001 \cdot 1 \cdot (10)(-10) = 25 + 0.001(-100) = 25 - 0.1 = 24.9$

$25 \leq 24.9$? NO; we overshot! (Landed at same height on the other side)

Shrink: $\alpha = 0.8 \cdot 1 = 0.8$

$f(5 + 0.8(-10)) = f(-3) = 9$

Check: $9 \leq 25 + 0.001 \cdot 0.8 \cdot (-100) = 25 - 0.08 = 24.92$

$9 \leq 24.92$? YES

Accept $\alpha = 0.8$

The Armijo condition is a safety check: "did the function actually decrease enough?" It prevents the overshooting problem from Page 03.
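The loop above is a few lines of code. A 1D sketch reproducing the hand computation (a minimal illustration, not the A1 starter code):

```python
def armijo_backtracking(f, grad, x, c=1e-3, eta=0.8, alpha0=1.0):
    """Shrink alpha until f(x + alpha*p) <= f(x) + c*alpha*g*p,
    with steepest-descent direction p = -g (1D version)."""
    g = grad(x)
    p = -g
    alpha = alpha0
    while f(x + alpha * p) > f(x) + c * alpha * g * p:
        alpha *= eta    # condition failed: shrink and retry
    return alpha

# Worked example from the text: f(x) = x^2 at x = 5
f = lambda x: x**2
grad = lambda x: 2 * x
alpha = armijo_backtracking(f, grad, 5.0)
print(alpha)   # 0.8 — alpha = 1 fails (25 > 24.9), one shrink passes
```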

Armijo Backtracking Iterations
| Iteration | $\alpha$ | New point | $f(x_k + \alpha p_k)$ | Armijo bound | Pass? |
|---|---|---|---|---|---|
| 1 | $1.0$ | $5 + 1(-10) = -5$ | $25$ | $25 - 0.1 = 24.9$ | NO |
| 2 | $0.8$ | $5 + 0.8(-10) = -3$ | $9$ | $25 - 0.08 = 24.92$ | YES |

With $\alpha = 1$ we overshot (landed at the same height on the other side). After one shrink ($\times 0.8$), $f$ drops from 25 to 9, easily passing the Armijo test.

Armijo backtracking plot showing phi(alpha), Armijo line, and accepted/rejected step sizes

The teal curve is the actual function value $\phi(\alpha) = f(x_k + \alpha p_k)$ along the search direction. The dashed red line is the Armijo bound. $\alpha = 1$ (red dot) fails because the curve is above the line; $\alpha = 0.8$ (teal square) passes easily. The gold star marks the true minimum at $\alpha^* = 0.5$.


The Armijo line slopes gently downward from $f(x_k)$. We accept any $\alpha$ where the actual function value (teal curve) is below this line. $\alpha = 1$ overshoots (teal curve above the line), so we shrink to $\alpha = 0.8$ which passes.

PART C

The Problem: Still One $\alpha$ for Everything

Even with line search, we're stuck

Both exact line search and Armijo find one scalar $\alpha$ per iteration. Every parameter gets the same step size.

But remember the bathtub from Page 03? Parameter $x$ needs a big step and parameter $y$ needs a tiny step. One $\alpha$ can't serve both.

Fixed GD

  • You pick a constant $\alpha$
  • 1 step size (same forever)
  • Too big? Diverge. Too small? Crawl.

Exact Line Search

  • Calculus formula: $g^Tg / g^TQg$
  • 1 per iteration
  • Same $\alpha$ for all params; needs $Q$

Armijo Backtracking

  • Trial and error (start big, shrink)
  • 1 per iteration
  • Same $\alpha$ for all params; extra evals

Adam

  • Automatic from gradient history
  • 1 per PARAMETER
  • No line search needed!

PART D

The Adaptive Idea: Each Parameter Gets Its Own Step Size

Analogy: Instead of one thermostat for the whole building, each room gets its own thermostat. The server room needs to be cold. The lobby needs to be warm. One setting can't work for both; you need per-room control.

Adam gives each parameter its own "thermostat" (step size), automatically adjusted based on how that parameter's gradient has been behaving.

The Bridge: How Adam Replaces Line Search

In your course, you learned three ways to pick step sizes: fixed, exact line search, Armijo backtracking. All give ONE $\alpha$ for all parameters.

Adam does something different:

The effective step size for parameter $i$:

$$\alpha_i^{\text{effective}} = \frac{\alpha}{\sqrt{\hat{v}_{t,i}} + \epsilon}$$

$\alpha_i^{\text{effective}}$
The actual step size used for parameter $i$; different for each parameter
$\alpha$
The global learning rate (e.g., 0.001); a baseline that gets adjusted per-parameter
$\hat{v}_{t,i}$
Bias-corrected second moment estimate for parameter $i$; roughly the running average of squared gradients
$\sqrt{\hat{v}_{t,i}}$
Scales the step: large past gradients produce a large denominator, shrinking the step. Small past gradients produce a small denominator, enlarging the step
$\epsilon$
A tiny constant (e.g., $10^{-8}$) to prevent division by zero when $\hat{v}$ is near zero

Each parameter $i$ gets its own denominator $\sqrt{\hat{v}_{t,i}}$, so each gets its own effective step size. No line search needed. No matrix storage needed. Just two running averages.
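The per-parameter effect is easy to see with two toy parameters (a sketch with made-up gradient values, not Adam's full update rule, which also uses the first moment $m_t$):

```python
import numpy as np

alpha, beta2, eps = 0.001, 0.999, 1e-8   # standard Adam defaults

# Suppose parameter 0 has seen large gradients, parameter 1 tiny ones.
grads = np.array([10.0, 0.01])

v = np.zeros(2)
t = 1
v = beta2 * v + (1 - beta2) * grads**2    # second-moment running average
v_hat = v / (1 - beta2**t)                # bias correction
effective = alpha / (np.sqrt(v_hat) + eps)  # per-parameter step size
print(effective)   # big-gradient parameter ~1e-4, small-gradient one ~0.1
```

Same global $\alpha$, two very different effective steps: exactly the "per-room thermostat" behavior, with no line search and no Hessian.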

Summary of the journey so far:
Pages 00-02: Foundations (derivatives, gradients, convexity)
Pages 03-04: GD is slow, Newton/BFGS use curvature to go faster but are expensive
Page 05 (here): Even line search gives only one $\alpha$ for all params; Adam solves this

Next up (Pages 06-08): The ML setting: stochastic gradients, what we're actually optimizing, and the two problems Adam solves. Then Pages 09-12: Adam itself.

Glossary: Step Size Selection

Exact line search
Find the optimal $\alpha$ by calculus. Formula: $\alpha = g^Tg / g^TQg$ for quadratics.
Armijo condition
A safety check: "did the function decrease enough?" $f(x + \alpha p) \leq f(x) + c \cdot \alpha \cdot g^Tp$. Prevents overshooting.
Backtracking
Start with $\alpha = 1$, shrink by factor $\eta$ until Armijo passes. Parameters from A1: $\eta = 0.8$, $c = 10^{-3}$.
Adaptive step size
Each parameter gets its own step size, automatically adjusted. Adam does this using the second moment $v_t$.
Effective step size ($\alpha / \sqrt{\hat{v}_t}$)
Adam's per-parameter step size. Large past gradients → large $v_t$ → small step. Small past gradients → small $v_t$ → big step.