MATH 3850: From line search to adaptive methods, the direct bridge to Adam
PART A
Analogy: You know which way is downhill. Now how far should you walk?
Exact line search says: "minimize $f$ along the downhill direction. Find the $\alpha$ that makes $f$ as small as possible along that line." It's like looking down a hallway and walking to the exact lowest point of the floor.
For a quadratic $f(x) = \frac{1}{2}x^TQx$:
$$\alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}$$
This gives the exact step size that minimizes $f$ along the gradient direction.
Why this formula? Set $\frac{d}{d\alpha}f(x_k - \alpha g_k) = 0$ and solve for $\alpha$. The numerator ($g^Tg$) is the gradient magnitude. The denominator ($g^TQg$) accounts for curvature in that direction.
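Carrying out that one-line derivation for the quadratic $f(x) = \frac{1}{2}x^TQx$, using $g_k = Qx_k$:

$$\frac{d}{d\alpha}f(x_k - \alpha g_k) = -g_k^T Q(x_k - \alpha g_k) = -g_k^T g_k + \alpha\, g_k^T Q g_k = 0 \quad\Longrightarrow\quad \alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}$$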
We compute the optimal step size $\alpha_0$ for the first GD iteration. The gradient $g_0 = Qx_0$ tells us the direction; the formula $\alpha = g^Tg / g^TQg$ tells us exactly how far to go.
$Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$, $x_0 = [1, 1]^T$, so $g_0 = Qx_0 = [2, 3]^T$
Numerator: $g_0^T g_0$ (squared gradient magnitude)
$[2, 3] \cdot [2, 3] = 4 + 9 = $ $13$
This measures how steep the slope is at $x_0$.
Denominator: $g_0^T Q g_0$ (gradient weighted by curvature)
First: $Qg_0 = \begin{bmatrix}2&0\\0&3\end{bmatrix}\begin{bmatrix}2\\3\end{bmatrix} = \begin{bmatrix}4\\9\end{bmatrix}$
Then: $g_0^T (Qg_0) = [2, 3] \cdot [4, 9] = 8 + 27 = $ $35$
Result:
$\alpha_0 = 13/35 \approx 0.371$
This is the optimal step size for this particular gradient and curvature. Not 0.1 or 0.01; the exact best value.
You computed this exact calculation in A2 and Quiz 6. The line search gives a different $\alpha$ each iteration based on the current gradient and curvature.
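The calculation above is easy to check numerically. A minimal NumPy sketch, using the same $Q$ and $x_0$ as the example:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T Q x from the example above
Q = np.array([[2.0, 0.0],
              [0.0, 3.0]])
x0 = np.array([1.0, 1.0])

g0 = Q @ x0                          # gradient at x0: [2, 3]
alpha0 = (g0 @ g0) / (g0 @ Q @ g0)   # exact line search step: 13/35

print(g0)       # [2. 3.]
print(alpha0)   # ~0.3714
```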
Contours of $f(x) = \frac{1}{2}(2x_1^2 + 3x_2^2)$ with the exact line search step from $x_0 = (1,1)$. The gradient (red arrow) points uphill; the step (teal arrow) goes downhill with $\alpha = 13/35 \approx 0.371$, landing exactly at the minimum along that direction.
Analogy: Trying shoes on. Start with the biggest size. If it's too big (doesn't satisfy the condition), try the next size down. Keep going until you find one that fits.
Armijo sufficient decrease condition:
$$f(x_k + \alpha p_k) \leq f(x_k) + c \cdot \alpha \cdot g_k^T p_k$$
In English: "the actual decrease must be at least $c$ fraction of the predicted decrease."
Algorithm (from A1):
1. Initialization: set $\alpha \leftarrow \alpha_{\text{init}}$ (e.g. $\alpha = 1$); fix constants $c \in (0,1)$ and shrink factor $\eta \in (0,1)$.
2. Backtracking loop: while $f(x_k + \alpha p_k) > f(x_k) + c\,\alpha\, g_k^T p_k$, shrink $\alpha \leftarrow \eta\,\alpha$.
3. Accept the first $\alpha$ that satisfies the condition.
Your A1 used $c = 10^{-3}$, $\eta = 0.8$.
We walk through the Armijo backtracking procedure step by step. The gradient at $x_k = 5$ is $g_k = f'(5) = 2(5) = 10$ (the slope is steep and positive, so we need to go left). The search direction is $p_k = -g_k = -10$ (downhill). We try $\alpha = 1$ first and check if the function decreased "enough."
$f(x) = x^2$, $x_k = 5$, $g_k = 10$, $p_k = -10$ (steepest descent direction), $c = 0.001$
Try $\alpha = 1$:
$f(5 + 1 \cdot (-10)) = f(-5) = 25$
Check: $25 \leq f(5) + 0.001 \cdot 1 \cdot (10)(-10) = 25 + 0.001(-100) = 25 - 0.1 = 24.9$
$25 \leq 24.9$? NO; we overshot! (Landed at same height on the other side)
Shrink: $\alpha = 0.8 \cdot 1 = 0.8$
$f(5 + 0.8(-10)) = f(-3) = 9$
Check: $9 \leq 25 + 0.001 \cdot 0.8 \cdot (-100) = 25 - 0.08 = 24.92$
$9 \leq 24.92$? YES
Accept $\alpha = 0.8$
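The walkthrough above can be reproduced with a short Python sketch of Armijo backtracking (the function name `backtracking` is illustrative; the constants $c = 10^{-3}$ and $\eta = 0.8$ are the A1 values):

```python
def backtracking(f, x, g, p, c=1e-3, eta=0.8, alpha=1.0):
    """Shrink alpha until the Armijo sufficient decrease condition holds."""
    # Condition: f(x + alpha*p) <= f(x) + c * alpha * g^T p
    while f(x + alpha * p) > f(x) + c * alpha * g * p:
        alpha *= eta              # shrink: 1.0 -> 0.8 -> 0.64 -> ...
    return alpha

f = lambda x: x**2                # example from above
x, g = 5.0, 10.0                  # g = f'(5) = 2*5 = 10
p = -g                            # steepest descent direction
alpha = backtracking(f, x, g, p)
print(alpha)                      # 0.8: alpha = 1 fails, alpha = 0.8 passes
```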
The Armijo condition is a safety check: "did the function actually decrease enough?" It prevents the overshooting problem from Page 03.
| Iteration | $\alpha$ | New point | $f(x_k + \alpha p_k)$ | Armijo bound | Pass? |
|---|---|---|---|---|---|
| 1 | $1.0$ | $5 + 1(-10) = -5$ | $25$ | $25 - 0.1 = 24.9$ | NO |
| 2 | $0.8$ | $5 + 0.8(-10) = -3$ | $9$ | $25 - 0.08 = 24.92$ | YES |
With $\alpha = 1$ we overshot (landed at same height on the other side). After one shrink ($\times 0.8$), $f$ drops from 25 to 9, easily passing the Armijo test.
The teal curve is the actual function value $\phi(\alpha) = f(x_k + \alpha p_k)$ along the search direction. The dashed red line is the Armijo bound. $\alpha = 1$ (red dot) fails because the curve is above the line; $\alpha = 0.8$ (teal square) passes easily. The gold star marks the true minimum at $\alpha^* = 0.5$.
The Armijo line slopes gently downward from $f(x_k)$. We accept any $\alpha$ where the actual function value (teal curve) is below this line. $\alpha = 1$ overshoots (teal curve above the line), so we shrink to $\alpha = 0.8$ which passes.
Both exact line search and Armijo find one scalar $\alpha$ per iteration. Every parameter gets the same step size.
But remember the bathtub from Page 03? Parameter $x$ needs a big step and parameter $y$ needs a tiny step. One $\alpha$ can't serve both.
Analogy: Instead of one thermostat for the whole building, each room gets its own thermostat. The server room needs to be cold. The lobby needs to be warm. One setting can't work for both; you need per-room control.
Adam gives each parameter its own "thermostat" (step size), automatically adjusted based on how that parameter's gradient has been behaving.
In your course, you learned three ways to pick step sizes: fixed, exact line search, Armijo backtracking. All give ONE $\alpha$ for all parameters.
Adam does something different:
The effective step size for parameter $i$:
$$\alpha_i^{\text{effective}} = \frac{\alpha}{\sqrt{\hat{v}_{t,i}} + \epsilon}$$
Each parameter $i$ gets its own denominator $\sqrt{\hat{v}_{t,i}}$, so each gets its own effective step size. No line search needed. No matrix storage needed. Just two running averages.
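A minimal sketch of how those two running averages yield per-parameter step sizes. The constants $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the standard Adam defaults, and the gradient values are made up purely to illustrate one "steep" and one "shallow" parameter:

```python
import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = np.zeros(2)        # running average of gradients (first moment)
v = np.zeros(2)        # running average of squared gradients (second moment)
x = np.array([1.0, 1.0])

# Parameter 0 sees small gradients, parameter 1 sees large ones (illustrative)
for t in range(1, 101):
    g = np.array([0.1, 10.0])
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    x -= alpha * m_hat / (np.sqrt(v_hat) + eps)

# Effective step size per parameter: alpha / (sqrt(v_hat) + eps)
print(alpha / (np.sqrt(v_hat) + eps))    # large for param 0, small for param 1
```

Note the automatic scaling: the parameter with consistently large gradients gets a small effective step, and vice versa, with no line search and no matrix storage.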