MATH 3850: Using curvature and gradient history to take smarter steps
PART A
Analogy: The gradient tells you "which way is downhill." The Hessian tells you "how curvy is the ground here?"
Imagine two hills. Both slope downward to the left. But one is a gentle rolling hill (low curvature) and the other is a cliff edge (high curvature). You'd take a big step on the gentle hill but a tiny careful step near the cliff. The Hessian captures this difference.
The Hessian is the matrix of second partial derivatives:
$$\nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$
Gradient = first derivatives = slope (direction). Hessian = second derivatives = curvature (how fast the slope changes).
For $f(x,y) = x^2 + 3y^2$: $\nabla^2 f = \begin{bmatrix}2&0\\0&6\end{bmatrix}$
Curvature in x is 2 (gentle), curvature in y is 6 (steep). The Hessian captures this difference; gradient descent (GD), with its single step size, cannot.
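A Hessian computed by hand can be double-checked with finite differences. The sketch below does that for this example; the helper `numerical_hessian` is my own illustrative name, not course-supplied code:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*y**2

def numerical_hessian(f, v, h=1e-5):
    """Approximate the Hessian with central finite differences (illustrative sketch)."""
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            # Mixed second partial d^2 f / dx_i dx_j via a 4-point stencil
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4 * h**2)
    return H

H = numerical_hessian(f, np.array([1.0, 1.0]))
print(np.round(H, 3))  # close to [[2, 0], [0, 6]]
```

For a quadratic the finite-difference estimate matches the exact Hessian up to rounding, which makes it a handy sanity check on hand computations.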
The Hessian tells Newton's method how much curvature each direction has, so it can automatically pick big steps where the curve is gentle and small steps where it is steep.
Analogy: Instead of just feeling the slope (GD), you build a little bowl-shaped model of the ground around your feet, then jump to the bottom of THAT bowl.
If the model is accurate, you land at (or very near) the actual minimum. That's why Newton converges so much faster; each step is carefully calculated, not just "walk downhill with a fixed stride."
Newton's update:
$$p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
In English: "invert the Hessian, multiply by the gradient, negate."
Compare to GD: $p_k = -\alpha \cdot \nabla f(x_k)$
Newton replaces the fixed $\alpha$ with the inverse Hessian $[\nabla^2 f]^{-1}$. The Hessian automatically scales each direction by its curvature. No step size needed!
Newton's method: compute the gradient (which way is downhill), compute the Hessian (how curvy is the ground), invert the Hessian, then multiply to get a step that accounts for curvature in every direction.
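The recipe above can be sketched in a few lines of NumPy. One practical note: rather than forming the inverse Hessian explicitly, it is cheaper and more numerically stable to solve the linear system $\nabla^2 f \, p = -\nabla f$. The function name `newton_step` is my own:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: solve H p = -g instead of computing H^{-1} explicitly."""
    p = np.linalg.solve(hess(x), -grad(x))
    return x + p

# The running example: f(x1, x2) = x1^2 + 3*x2^2
grad = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])

x1 = newton_step(grad, hess, np.array([4.0, 2.0]))
print(x1)  # [0. 0.]
```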
$f(x_1, x_2) = x_1^2 + 3x_2^2$, starting at $x_0 = [4, 2]^T$
Step 1: Compute gradient $g_0 = \nabla f(x_0)$
$\frac{\partial f}{\partial x_1} = 2x_1 = 2(4) = 8$
$\frac{\partial f}{\partial x_2} = 6x_2 = 6(2) = 12$
$g_0 = [8, 12]^T$
Step 2: Compute Hessian $B_0 = \nabla^2 f(x_0)$
$B_0 = \begin{bmatrix}2 & 0 \\ 0 & 6\end{bmatrix}$
(For quadratics, the Hessian is constant; doesn't depend on $x_0$)
Step 3: Compute $B_0^{-1}$ (inverse of 2×2 diagonal matrix)
For a diagonal matrix, just flip each diagonal entry:
$B_0^{-1} = \begin{bmatrix}1/2 & 0 \\ 0 & 1/6\end{bmatrix}$
Step 4: Newton step $p_0 = -B_0^{-1} g_0$
$p_0 = -\begin{bmatrix}1/2 & 0 \\ 0 & 1/6\end{bmatrix}\begin{bmatrix}8\\12\end{bmatrix} = -\begin{bmatrix}4\\2\end{bmatrix} = $ $\begin{bmatrix}-4\\-2\end{bmatrix}$
Step 5: Update $x_1 = x_0 + p_0$
$x_1 = \begin{bmatrix}4\\2\end{bmatrix} + \begin{bmatrix}-4\\-2\end{bmatrix} = $ $\begin{bmatrix}0\\0\end{bmatrix}$
Converged in ONE step! On a quadratic, Newton's bowl-shaped model is the function itself, so a single step lands exactly on the minimizer.
Compare to GD: with $\alpha = 0.3$, $x_1 = [4,2] - 0.3[8,12] = [1.6, -1.6]$; not even close to $[0,0]$. Newton used the curvature to pick the perfect step size in each direction.
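The comparison above can be reproduced numerically (a quick check of my own, not course-supplied code):

```python
import numpy as np

g = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])   # gradient of x1^2 + 3*x2^2
H = np.array([[2.0, 0.0], [0.0, 6.0]])             # constant Hessian

x0 = np.array([4.0, 2.0])

x_newton = x0 - np.linalg.solve(H, g(x0))   # curvature-aware step
x_gd     = x0 - 0.3 * g(x0)                 # fixed step size alpha = 0.3

print(x_newton)  # [0. 0.]
print(x_gd)      # [ 1.6 -1.6]
```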
Analogy: Newton's method requires a full survey of the terrain (computing the Hessian) before each step. Expensive!
BFGS says: "I'll keep a rough sketch of the terrain that I update after each step using the gradients I've already computed. It gets more accurate over time, and I never have to do a full survey."
BFGS update (from Assignment 2):
$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \text{where } \rho_k = \frac{1}{y_k^T s_k}$$
Key idea: You learn about the curvature from how the gradient CHANGED when you took a step. If the gradient changed a lot → high curvature. If it barely changed → low curvature.
BFGS starts with $H_0 = I$ (no curvature information), so the first step is just steepest descent. After that step, it uses the gradient change to update $H$ toward the true inverse Hessian $Q^{-1}$.
$f(x) = \frac{1}{2}x^TQx$ with $Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$, $x_0 = [1,1]^T$, $H_0 = I$
Gradient: $g_0 = Qx_0 = \begin{bmatrix}2\\3\end{bmatrix}$
Search direction: $p_0 = -H_0 g_0 = -I \begin{bmatrix}2\\3\end{bmatrix} = \begin{bmatrix}-2\\-3\end{bmatrix}$
(With $H_0 = I$, this is just steepest descent; same as GD)
Exact line search: $\alpha_0 = \frac{g_0^T g_0}{g_0^T Q g_0}$
$g_0^T g_0 = 4 + 9 = 13$
$g_0^T Q g_0 = [2,3]\begin{bmatrix}2&0\\0&3\end{bmatrix}\begin{bmatrix}2\\3\end{bmatrix} = [2,3]\begin{bmatrix}4\\9\end{bmatrix} = 8 + 27 = 35$
$\alpha_0 = 13/35$
New point: $x_1 = x_0 + \alpha_0 p_0 = [1,1]^T + \frac{13}{35}[-2,-3]^T = $ $[9/35, -4/35]^T$
Secant quantities: $s_0 = x_1 - x_0$, $y_0 = g_1 - g_0$
Then apply the BFGS update formula to get $H_1$, a better approximation to $Q^{-1}$.
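The whole first iteration, including the update formula, fits in a short NumPy sketch (my own check, not assignment code). One property worth verifying: the BFGS update enforces the secant condition $H_1 y_0 = s_0$ exactly.

```python
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 3.0]])
x0 = np.array([1.0, 1.0])
H0 = np.eye(2)                              # start with no curvature information

g0 = Q @ x0                                 # [2, 3]
p0 = -H0 @ g0                               # steepest descent, since H0 = I
alpha0 = (g0 @ g0) / (g0 @ Q @ g0)          # exact line search: 13/35
x1 = x0 + alpha0 * p0                       # [9/35, -4/35]

s0 = x1 - x0                                # step taken
y0 = Q @ x1 - g0                            # gradient change g1 - g0
rho0 = 1.0 / (y0 @ s0)

I = np.eye(2)
H1 = (I - rho0 * np.outer(s0, y0)) @ H0 @ (I - rho0 * np.outer(y0, s0)) \
     + rho0 * np.outer(s0, s0)

print(x1 * 35)        # close to [9, -4], i.e. x1 = [9/35, -4/35]
print(H1 @ y0 - s0)   # ~[0, 0]: H1 satisfies the secant condition
```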
Notice: BFGS starts as GD (since $H_0 = I$) but gets better each iteration as $H_k$ improves. By the time it converges, $H_k \approx Q^{-1}$ and each step is Newton-like.
The key insight: BFGS and Adam both use gradient history to adapt step sizes. BFGS stores a full matrix ($n^2$ numbers). Adam stores just two vectors ($2n$ numbers). For MNIST with $n = 784$: BFGS needs 614,656 numbers. Adam needs 1,568. That's the tradeoff; Adam is less precise but massively cheaper.
BFGS uses gradient history to build a matrix. Adam uses gradient history to build two running averages (momentum $m_t$ and scale $v_t$). Same spirit, different implementation.
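A minimal sketch of the Adam update on the same quadratic, assuming the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ (the helper name `adam_step` and the learning rate are my own choices, not from the notes):

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the gradient history lives in two n-vectors, m and v."""
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # running mean of squared gradients (scale)
    m_hat = m / (1 - beta1**t)             # bias corrections for the zero start
    v_hat = v / (1 - beta2**t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x1^2 + 3*x2^2; total memory is 2n numbers (m and v), not n^2.
x = np.array([4.0, 2.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 201):
    g = np.array([2.0 * x[0], 6.0 * x[1]])
    x, m, v = adam_step(x, g, m, v, t)
print(x)  # close to [0, 0] -- Adam creeps in over many steps, unlike Newton's one exact step
```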