MATH 3850: Using curvature and gradient history to take smarter steps
PART A
Analogy: The gradient tells you "which way is downhill." The Hessian tells you "how curvy is the ground here?"
Imagine two hills. Both slope downward to the left. But one is a gentle rolling hill (low curvature) and the other is a cliff edge (high curvature). You'd take a big step on the gentle hill but a tiny careful step near the cliff. The Hessian captures this difference.
The Hessian is the matrix of second partial derivatives:
$$\nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$
Gradient = first derivatives = slope (direction). Hessian = second derivatives = curvature (how fast the slope changes).
For $f(x,y) = x^2 + 3y^2$: $\nabla^2 f = \begin{bmatrix}2&0\\0&6\end{bmatrix}$
Curvature in x is 2 (gentle), curvature in y is 6 (steep). The Hessian captures this difference; gradient descent (GD), with its single step size, cannot.
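A Hessian computed by hand can be double-checked with finite differences. The sketch below does that for this example; the helper `numerical_hessian` is my own illustrative name, not course-supplied code:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*y**2

def numerical_hessian(f, v, h=1e-5):
    """Approximate the Hessian with central finite differences (illustrative sketch)."""
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            # Mixed second partial d^2 f / dx_i dx_j via a 4-point stencil
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4 * h**2)
    return H

H = numerical_hessian(f, np.array([1.0, 1.0]))
print(np.round(H, 3))  # close to [[2, 0], [0, 6]]
```

For a quadratic the finite-difference estimate matches the exact Hessian up to rounding, which makes it a handy sanity check on hand computations.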
The Hessian tells Newton's method how much curvature each direction has, so it can automatically pick big steps where the curve is gentle and small steps where it is steep.
Analogy: Instead of just feeling the slope (GD), you build a little bowl-shaped model of the ground around your feet, then jump to the bottom of THAT bowl.
If the model is accurate, you land at (or very near) the actual minimum. That's why Newton converges so much faster; each step is carefully calculated, not just "walk downhill with a fixed stride."
Newton's update:
$$p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
In English: "invert the Hessian, multiply by the gradient, negate."
Compare to GD: $p_k = -\alpha \cdot \nabla f(x_k)$
Newton replaces the fixed $\alpha$ with the inverse Hessian $[\nabla^2 f]^{-1}$. The Hessian automatically scales each direction by its curvature. No step size needed!
Newton's method: compute the gradient (which way is downhill), compute the Hessian (how curvy is the ground), invert the Hessian, then multiply to get a step that accounts for curvature in every direction.
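The recipe above can be sketched in a few lines of NumPy. One practical note: rather than forming the inverse Hessian explicitly, it is cheaper and more numerically stable to solve the linear system $\nabla^2 f \, p = -\nabla f$. The function name `newton_step` is my own:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: solve H p = -g instead of computing H^{-1} explicitly."""
    p = np.linalg.solve(hess(x), -grad(x))
    return x + p

# The running example: f(x1, x2) = x1^2 + 3*x2^2
grad = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])

x1 = newton_step(grad, hess, np.array([4.0, 2.0]))
print(x1)  # [0. 0.]
```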
$f(x_1, x_2) = x_1^2 + 3x_2^2$, starting at $x_0 = [4, 2]^T$
Step 1: Compute gradient $g_0 = \nabla f(x_0)$
$\frac{\partial f}{\partial x_1} = 2x_1 = 2(4) = 8$
$\frac{\partial f}{\partial x_2} = 6x_2 = 6(2) = 12$
$g_0 = [8, 12]^T$
Step 2: Compute Hessian $B_0 = \nabla^2 f(x_0)$
$B_0 = \begin{bmatrix}2 & 0 \\ 0 & 6\end{bmatrix}$
(For quadratics, the Hessian is constant; doesn't depend on $x_0$)
Step 3: Compute $B_0^{-1}$ (inverse of 2×2 diagonal matrix)
For a diagonal matrix, just flip each diagonal entry:
$B_0^{-1} = \begin{bmatrix}1/2 & 0 \\ 0 & 1/6\end{bmatrix}$
Step 4: Newton step $p_0 = -B_0^{-1} g_0$
$p_0 = -\begin{bmatrix}1/2 & 0 \\ 0 & 1/6\end{bmatrix}\begin{bmatrix}8\\12\end{bmatrix} = -\begin{bmatrix}4\\2\end{bmatrix} = $ $\begin{bmatrix}-4\\-2\end{bmatrix}$
Step 5: Update $x_1 = x_0 + p_0$
$x_1 = \begin{bmatrix}4\\2\end{bmatrix} + \begin{bmatrix}-4\\-2\end{bmatrix} = $ $\begin{bmatrix}0\\0\end{bmatrix}$
Converged in ONE step! On a quadratic, Newton's bowl-shaped model is the function itself, so a single step lands exactly on the minimizer.
Compare to GD: with $\alpha = 0.3$, $x_1 = [4,2] - 0.3[8,12] = [1.6, -1.6]$; not even close to $[0,0]$. Newton used the curvature to pick the perfect step size in each direction.
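The comparison above can be reproduced numerically (a quick check of my own, not course-supplied code):

```python
import numpy as np

g = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])   # gradient of x1^2 + 3*x2^2
H = np.array([[2.0, 0.0], [0.0, 6.0]])             # constant Hessian

x0 = np.array([4.0, 2.0])

x_newton = x0 - np.linalg.solve(H, g(x0))   # curvature-aware step
x_gd     = x0 - 0.3 * g(x0)                 # fixed step size alpha = 0.3

print(x_newton)  # [0. 0.]
print(x_gd)      # [ 1.6 -1.6]
```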
Analogy: Newton's method requires a full survey of the terrain (computing the Hessian) before each step. Expensive!
BFGS says: "I'll keep a rough sketch of the terrain that I update after each step using the gradients I've already computed. It gets more accurate over time, and I never have to do a full survey."
BFGS update (from Assignment 2):
$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \text{where } \rho_k = \frac{1}{y_k^T s_k}$$
Key idea: You learn about the curvature from how the gradient CHANGED when you took a step. If the gradient changed a lot → high curvature. If it barely changed → low curvature.
BFGS starts with $H_0 = I$ (no curvature information), so the first step is just steepest descent. After that step, it uses the gradient change to update $H$ toward the true inverse Hessian $Q^{-1}$.
$f(x) = \frac{1}{2}x^TQx$ with $Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$, $x_0 = [1,1]^T$, $H_0 = I$
Gradient: $g_0 = Qx_0 = \begin{bmatrix}2\\3\end{bmatrix}$
Search direction: $p_0 = -H_0 g_0 = -I \begin{bmatrix}2\\3\end{bmatrix} = \begin{bmatrix}-2\\-3\end{bmatrix}$
(With $H_0 = I$, this is just steepest descent; same as GD)
Exact line search: $\alpha_0 = \frac{g_0^T g_0}{g_0^T Q g_0}$
$g_0^T g_0 = 4 + 9 = 13$
$g_0^T Q g_0 = [2,3]\begin{bmatrix}2&0\\0&3\end{bmatrix}\begin{bmatrix}2\\3\end{bmatrix} = [2,3]\begin{bmatrix}4\\9\end{bmatrix} = 8 + 27 = 35$
$\alpha_0 = 13/35$
New point: $x_1 = x_0 + \alpha_0 p_0 = [1,1]^T + \frac{13}{35}[-2,-3]^T = $ $[9/35, -4/35]^T$
Secant quantities: $s_0 = x_1 - x_0$, $y_0 = g_1 - g_0$
Then apply the BFGS update formula to get $H_1$, a better approximation to $Q^{-1}$.
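The whole first iteration, including the update formula, fits in a short NumPy sketch (my own check, not assignment code). One property worth verifying: the BFGS update enforces the secant condition $H_1 y_0 = s_0$ exactly.

```python
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 3.0]])
x0 = np.array([1.0, 1.0])
H0 = np.eye(2)                              # start with no curvature information

g0 = Q @ x0                                 # [2, 3]
p0 = -H0 @ g0                               # steepest descent, since H0 = I
alpha0 = (g0 @ g0) / (g0 @ Q @ g0)          # exact line search: 13/35
x1 = x0 + alpha0 * p0                       # [9/35, -4/35]

s0 = x1 - x0                                # step taken
y0 = Q @ x1 - g0                            # gradient change g1 - g0
rho0 = 1.0 / (y0 @ s0)

I = np.eye(2)
H1 = (I - rho0 * np.outer(s0, y0)) @ H0 @ (I - rho0 * np.outer(y0, s0)) \
     + rho0 * np.outer(s0, s0)

print(x1 * 35)        # close to [9, -4], i.e. x1 = [9/35, -4/35]
print(H1 @ y0 - s0)   # ~[0, 0]: H1 satisfies the secant condition
```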
Notice: BFGS starts as GD (since $H_0 = I$) but gets better each iteration as $H_k$ improves. By the time it converges, $H_k \approx Q^{-1}$ and each step is Newton-like.
The key insight: BFGS and Adam both use gradient history to adapt step sizes. BFGS stores a full matrix ($n^2$ numbers). Adam stores just two vectors ($2n$ numbers). For MNIST with $n = 784$: BFGS needs 614,656 numbers. Adam needs 1,568. That's the tradeoff; Adam is less precise but massively cheaper.
BFGS uses gradient history to build a matrix. Adam uses gradient history to build two running averages (momentum $m_t$ and scale $v_t$). Same spirit, different implementation.
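A minimal sketch of the Adam update on the same quadratic, assuming the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ (the helper name `adam_step` and the learning rate are my own choices, not from the notes):

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the gradient history lives in two n-vectors, m and v."""
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # running mean of squared gradients (scale)
    m_hat = m / (1 - beta1**t)             # bias corrections for the zero start
    v_hat = v / (1 - beta2**t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x1^2 + 3*x2^2; total memory is 2n numbers (m and v), not n^2.
x = np.array([4.0, 2.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 201):
    g = np.array([2.0 * x[0], 6.0 * x[1]])
    x, m, v = adam_step(x, g, m, v, t)
print(x)  # close to [0, 0] -- Adam creeps in over many steps, unlike Newton's one exact step
```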