Page 02: Convexity

MATH 3850: What makes some optimization problems easy and others nightmarishly hard

PART A

The Big Idea: Bowls vs. Bumpy Landscapes

Analogy: Finding the lowest point in the dark.

Imagine you're dropped into a landscape at night with a flashlight that only shows the ground right under your feet.

Convex vs non-convex functions
Left (convex): $f(x) = x^2$ is a bowl. ONE minimum at the bottom. The red dashed line connects two points on the curve; notice the curve is always BELOW the line. That's the visual test for convexity.

Right (non-convex): A bumpy function with multiple valleys. Gradient descent might find the purple dot (local minimum) instead of the yellow star (global minimum). It depends on where you start. This is why non-convex problems are hard.
PART B

The Formal Definition (and What It Actually Means)

A function $f$ is convex if for any two points $x$ and $y$, and any $\alpha \in [0, 1]$:

$$f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$$

This looks scary, but it's just saying: "the curve lies below any straight line drawn between two of its points."

$f$
The function we are testing for convexity (e.g., $f(x) = x^2$)
$x, y$
Any two points in the domain; we pick them freely
$\alpha$
A blending weight between 0 and 1; controls where we sit on the line segment from $x$ to $y$
$\alpha x + (1-\alpha)y$
A point partway between $x$ and $y$ (the "midpoint" when $\alpha = 0.5$)
Left side
$f$ evaluated at that in-between point: the actual height of the curve
Right side
The corresponding point on the straight line connecting $f(x)$ and $f(y)$
$\leq$
Convexity means the curve never rises above the line

In plain English: pick any two points on the curve, draw the straight line (the chord) between them, and check that the curve never rises above that line.

Hand Computation: Is $f(x) = x^2$ convex?

We verify the convexity inequality by picking two concrete points and checking that the function value at the midpoint is below the line connecting the function values at those points.

Test with $x = -2$, $y = 3$, $\alpha = 0.5$ (midpoint):

Left side: $f(\alpha x + (1-\alpha)y)$; "value of f at the midpoint"

Midpoint = $0.5(-2) + 0.5(3) = -1 + 1.5 = 0.5$

$f(0.5) = (0.5)^2 = 0.25$

Right side: $\alpha f(x) + (1-\alpha)f(y)$; "midpoint of the LINE between f values"

$f(-2) = 4, \quad f(3) = 9$

$0.5(4) + 0.5(9) = 2 + 4.5 = 6.5$

Check: Is left $\leq$ right?

$0.25 \leq 6.5$   ✓ YES

The curve at the midpoint (0.25) is way below the line at the midpoint (6.5). Convex!
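The hand computation above can be replicated in code. A minimal sketch in plain Python (the function names are my own): it re-runs the $x = -2$, $y = 3$, $\alpha = 0.5$ check, then samples many random $\alpha$ values, since the definition quantifies over every $\alpha$ in $[0, 1]$, not just the midpoint.

```python
import random

def f(x):
    return x ** 2

def chord_test(f, x, y, alpha):
    """Return (curve height, line height) for the convexity inequality."""
    left = f(alpha * x + (1 - alpha) * y)       # f at the blended point
    right = alpha * f(x) + (1 - alpha) * f(y)   # same blend of the f-values
    return left, right

left, right = chord_test(f, -2, 3, 0.5)
print(left, right)  # 0.25 6.5 -- matches the hand computation

# The inequality must hold for EVERY alpha in [0, 1], not just 0.5:
random.seed(0)
for _ in range(1000):
    a = random.random()
    left, right = chord_test(f, -2, 3, a)
    assert left <= right
```

Of course, passing random checks at one pair of points doesn't prove convexity; a proof needs the inequality for all $x$, $y$, and $\alpha$. But a single failing check does prove non-convexity.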

In your course, $f(x) = \frac{1}{2}x^TQx$ with $Q$ positive definite is always convex. Every assignment you've done (A1, A2, Quiz 4, Quiz 6) used convex quadratics; that's why gradient descent always found the minimum.
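The same spot-check works for the course's quadratics. A sketch (assuming NumPy; the random trials are my own addition) using the A2 matrix:

```python
import numpy as np

Q = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # the A2 matrix; positive definite

def f(x):
    return 0.5 * x @ Q @ x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    a = rng.uniform()
    # curve at the blended point never exceeds the blended f-values
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-9
print("convexity inequality held in 1000 random trials")
```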

PART C

The Quick Test: Second Derivative

Analogy: Holding a bowl vs. a hat.

Cup your hands upward (like holding water); that's a convex function (curves up). Turn your hands downward (like wearing a hat); that's concave (curves down). The second derivative tells you which one you have.

For one variable:

$f''(x) > 0$ everywhere ⇒ convex (curves up, like a bowl)

$f''(x) < 0$ everywhere ⇒ concave (curves down, like a hat)

(A weaker condition also works: $f''(x) \geq 0$ everywhere is exactly equivalent to convexity; strict positivity just rules out flat stretches.)

$f''(x)$
The second derivative of $f$; measures how fast the slope is changing (curvature)
$\gt 0$
Curvature is positive everywhere, so the function curves upward like a bowl
$\lt 0$
Curvature is negative everywhere, so the function curves downward like a hat
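You can apply the second-derivative test without doing calculus by hand, using a central finite difference. A sketch (the step size $h$ is an arbitrary choice of mine):

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

bowl = lambda x: x ** 2    # f'' = 2 everywhere -> convex
hat = lambda x: -x ** 2    # f'' = -2 everywhere -> concave

print(second_derivative(bowl, 1.3))   # ~ 2.0: positive, curves up
print(second_derivative(hat, -0.7))   # ~ -2.0: negative, curves down
```

Note that the test requires the sign to hold at every point, so in practice you would evaluate this at many points, or better, compute $f''$ symbolically.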

For multiple variables:

Hessian $\nabla^2 f(x)$ is positive definite everywhere ⇒ convex

(Positive definite means all eigenvalues are positive; the surface curves "up" in every direction)

Examples from your assignments:

A2: $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$; eigenvalues are 2 and 1, both positive ⇒ convex

Quiz 6: $Q = \begin{bmatrix}2&0\\0&3\end{bmatrix}$; eigenvalues are 2 and 3, both positive ⇒ convex
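The eigenvalue check is easy to automate. A sketch (assuming NumPy; the indefinite third matrix is my own example, not from an assignment):

```python
import numpy as np

def is_positive_definite(Q):
    """True if all eigenvalues of the symmetric matrix Q are positive."""
    return bool(np.all(np.linalg.eigvalsh(Q) > 0))

Q_a2 = np.array([[2.0, 0.0], [0.0, 1.0]])       # eigenvalues 2 and 1
Q_quiz6 = np.array([[2.0, 0.0], [0.0, 3.0]])    # eigenvalues 2 and 3
Q_saddle = np.array([[2.0, 0.0], [0.0, -1.0]])  # indefinite (illustrative)

print(is_positive_definite(Q_a2))      # True  -> convex
print(is_positive_definite(Q_quiz6))   # True  -> convex
print(is_positive_definite(Q_saddle))  # False -> saddle point, not convex
```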

PART D

See It in 3D

3D convex vs non-convex surfaces
Left: A convex function in 3D; a smooth bowl with one valley. Gradient descent from ANY starting point rolls to the yellow dot at the bottom. Guaranteed.

Right: A non-convex function in 3D; like an egg carton. Many little valleys everywhere. Gradient descent rolls into whichever valley is nearest, which might not be the deepest one. This is what neural network loss functions look like.
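The "which valley you land in depends on where you start" behavior is easy to demonstrate in one dimension. A sketch, assuming a hand-picked bumpy function and step size of my own (not from the course): $f(x) = (x^2 - 1)^2 + 0.3x$ has a deep valley near $x = -1$ and a shallower one near $x = +1$.

```python
def gradient_descent(grad, x0, lr=0.05, steps=2000):
    """Plain gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex bowl f(x) = x^2 (gradient 2x): ANY start rolls to the one minimum.
for start in (-3.0, 0.5, 4.0):
    print(round(gradient_descent(lambda x: 2 * x, start), 4))  # all ~ 0.0

# Bumpy f(x) = (x^2 - 1)^2 + 0.3x, gradient 4x(x^2 - 1) + 0.3:
bumpy_grad = lambda x: 4 * x * (x ** 2 - 1) + 0.3
print(round(gradient_descent(bumpy_grad, -0.5), 2))  # near -1: the global min
print(round(gradient_descent(bumpy_grad, +0.5), 2))  # near +1: a local min
```

Same algorithm, same function; only the starting point differs, and the second run gets stuck in the shallower valley.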
PART E

Why This Matters for the Adam Paper

Adam's THEORY (Section 4)

  • Assumes the problem is convex
  • Proves a regret bound: the average regret shrinks like $O(1/\sqrt{T})$ (equivalently, total regret grows like $O(\sqrt{T})$)
  • This means: for bowl-shaped problems, Adam is provably good

Adam's PRACTICE (Section 6)

  • Used on non-convex problems (neural networks)
  • Theory doesn't guarantee it works here
  • But empirically, Adam works great anyway
  • This gap between theory and practice is normal in ML

The practical takeaway

Your assignments (A1, A2, Quiz 4, Quiz 6) all used convex quadratics; nice bowl-shaped functions. Adam's convergence proof (Theorem 4.1) applies to these.

The paper's experiments (Section 6) test Adam on non-convex problems like neural networks. Adam still works well; the theory just can't prove WHY yet. For your demo, you can say: "The theory proves Adam works on convex problems. Empirically, it works on non-convex too."

Why were all your assignments convex? Because convex problems are "solved"; we have strong guarantees. The course builds your understanding on solid ground before tackling the messy real world. Adam bridges that gap: proven on convex, practical on everything.

Glossary: Convexity

Convex function
Bowl-shaped. Any local minimum is THE global minimum. The curve always lies below any line drawn between two of its points. Safe for optimization.
Non-convex function
Bumpy landscape. Multiple local minima. Gradient descent might get stuck in a valley that isn't the deepest. Neural networks are non-convex.
Local minimum
A valley floor; lower than everything nearby. But there might be a deeper valley elsewhere.
Global minimum
THE lowest point of the entire function. What we actually want to find.
Positive definite
A matrix property meaning "the surface curves upward in every direction." Guarantees convexity for quadratic functions. Check: all eigenvalues positive.
Hessian $\nabla^2 f$
The matrix of second partial derivatives. For your 2D assignments: a 2×2 matrix. If positive definite → convex. If indefinite → saddle point.