MATH 3850; Everything you need before the Adam paper. Derivatives, partial derivatives, the gradient, gradient descent, and the vocabulary translation between your course and the paper.
PART A
Analogy: Speedometer.
Your car's speedometer tells you how fast your position is changing right now. The derivative does the same thing for a function; it tells you how fast $f(x)$ is changing as you nudge $x$.
$$f(x) = x^2 \quad\Longrightarrow\quad f'(x) = 2x$$
This says: "the slope of $x^2$ at any point $x$ equals $2x$."
• At $x = 3$: slope = $2(3) = 6$; steeply uphill
• At $x = -1$: slope = $2(-1) = -2$; going downhill
• At $x = 0$: slope = $2(0) = 0$; perfectly flat = minimum!
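The three slopes above can be checked numerically with a finite difference. A minimal Python sketch (the helper `numerical_slope` is my own illustrative name, not from the course):

```python
# Check the slope of f(x) = x^2 numerically against the analytic answer f'(x) = 2x.

def f(x):
    return x ** 2

def numerical_slope(f, x, h=1e-6):
    """Central finite difference: (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [3.0, -1.0, 0.0]:
    # Numerical estimate vs. the analytic slope 2x; they should agree closely.
    print(x, numerical_slope(f, x), 2 * x)
```

The central difference is symmetric around $x$, so at $x = 0$ it returns exactly 0, matching the flat spot.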
The entire field of optimization is based on one idea: find where the slope is zero. A flat spot (derivative = 0) is a candidate for the minimum. Every algorithm in this course (gradient descent, Newton's method, BFGS, Adam) is just a different strategy for finding those flat spots.
Analogy: Stereo volume knobs.
Imagine a stereo with two knobs: bass and treble. If you want to know "how does the sound change when I turn the bass knob?" you hold the treble knob still and only turn bass. That's a partial derivative; you change one variable and freeze everything else.
When your function has two inputs, you can't just take "the derivative." You need to ask separately about each direction. Worked example:
$$f(x, y) = 3x^2 + 2xy + y^2$$
Step 1: $\frac{\partial f}{\partial x}$; treat $y$ as a constant
$\frac{\partial}{\partial x}(3x^2) = 6x$ (power rule, same as regular derivatives)
$\frac{\partial}{\partial x}(2xy) = 2y$ ($y$ is a constant, so $2xy$ is like $2 \cdot \text{constant} \cdot x$ → derivative is $2y$)
$\frac{\partial}{\partial x}(y^2) = 0$ ($y^2$ is a constant when $y$ is frozen)
⇒ $\frac{\partial f}{\partial x} = 6x + 2y$
Step 2: $\frac{\partial f}{\partial y}$; treat $x$ as a constant
$\frac{\partial}{\partial y}(3x^2) = 0$ ($x^2$ is a constant when $x$ is frozen)
$\frac{\partial}{\partial y}(2xy) = 2x$ ($x$ is a constant, so $2xy$ is like $2x \cdot y$ → derivative is $2x$)
$\frac{\partial}{\partial y}(y^2) = 2y$ (power rule)
⇒ $\frac{\partial f}{\partial y} = 2x + 2y$
Step 3: Evaluate at the point $(1, 2)$
$\frac{\partial f}{\partial x}\Big|_{(1,2)} = 6(1) + 2(2) = 6 + 4 = 10$ "steep in the x-direction"
$\frac{\partial f}{\partial y}\Big|_{(1,2)} = 2(1) + 2(2) = 2 + 4 = 6$ "moderate in the y-direction"
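The same "freeze one variable, nudge the other" idea translates directly into code. A sketch verifying the hand-computed partials at $(1, 2)$ by finite differences (step size `h` is an arbitrary choice):

```python
# Numerically verify the partials of f(x, y) = 3x^2 + 2xy + y^2 at (1, 2).

def f(x, y):
    return 3 * x ** 2 + 2 * x * y + y ** 2

h = 1e-6
x, y = 1.0, 2.0

df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # freeze y, nudge x
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # freeze x, nudge y

print(df_dx, df_dy)  # close to 10 and 6, matching 6x + 2y and 2x + 2y
```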
Connection to your assignments: In A2, when you computed $g_0 = Qx_0 = \begin{bmatrix}2&0\\0&1\end{bmatrix}\begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}2\\1\end{bmatrix}$, that WAS computing partial derivatives. For $f(x) = \frac{1}{2}x^TQx$, the gradient is $\nabla f = Qx$; that's just the partials bundled together.
Analogy: A weather vane.
A weather vane points in one direction; the direction the wind is blowing hardest. The gradient does the same thing: it points in the direction where the function increases the steepest. If you want to go downhill (minimize), walk the opposite direction.
$$\nabla f = \left[\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\right]$$
The symbol $\nabla$ ("nabla") just means: "bundle all the partial derivatives into a list."
If you have 2 variables → the gradient is a list of 2 numbers.
If you have 1000 variables → it's a list of 1000 numbers. Same idea, longer list.
Worked example: $f(x, y) = x^2 + y^2$.
Step 1: Compute the partials
$\frac{\partial f}{\partial x} = 2x, \quad \frac{\partial f}{\partial y} = 2y$
Step 2: Bundle into the gradient
$\nabla f(x,y) = [2x,\; 2y]$
Step 3: Evaluate at $(3, 1)$
$\nabla f(3, 1) = [2(3),\; 2(1)] = [6,\; 2]$
Interpretation: at the point $(3,1)$, the function is going up steeply in x (slope 6) and gently in y (slope 2). The gradient $[6, 2]$ points in the steepest uphill direction.
This is exactly what you compute in every assignment. In A2 with $f(x) = \frac{1}{2}x^TQx$: the gradient $\nabla f = Qx$. At $x_0 = [1,1]^T$ with $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$, you got $g_0 = [2, 1]^T$. That's the gradient; slopes of 2 and 1 in each direction.
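The A2 numbers can be reproduced in a couple of lines with NumPy. For $f(x) = \frac{1}{2}x^TQx$ with symmetric $Q$, the gradient is $\nabla f = Qx$, so the gradient at a point is just a matrix-vector product:

```python
# Gradient of f(x) = (1/2) x^T Q x is grad f = Q x (for symmetric Q).
# Reproduces the A2 numbers: Q = [[2,0],[0,1]], x0 = [1,1] gives g0 = [2,1].
import numpy as np

Q = np.array([[2.0, 0.0],
              [0.0, 1.0]])
x0 = np.array([1.0, 1.0])

g0 = Q @ x0  # the gradient: all partial derivatives bundled into one vector
print(g0)    # [2. 1.]
```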
Analogy: Blindfolded on a hill.
You're standing on a hilly landscape, blindfolded. You want to reach the lowest valley. You can't see anything, but you can feel the ground under your feet; you can tell which way it slopes.
You keep going until the ground feels flat (slope $\approx$ 0), meaning you've reached the valley floor.
Why blindfolded? Because in real optimization, you can't "see" the whole function. You only know the slope at your current position.
$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla f(\theta_{\text{old}})$$
Worked example: $f(x) = x^2$. Minimum is at $x = 0$. Derivative: $f'(x) = 2x$. Step size: $\alpha = 0.1$. Start at $x_0 = 10$.
Iteration 1 ($k = 0$):
Gradient: $g_0 = f'(10) = 2(10) = 20$
Update: $x_1 = x_0 - \alpha \cdot g_0 = 10 - 0.1 \times 20 = 10 - 2 = 8$
Interpretation: slope is 20 (very steep uphill). We take a step of size 2 downhill.
Iteration 2 ($k = 1$):
Gradient: $g_1 = f'(8) = 2(8) = 16$
Update: $x_2 = 8 - 0.1 \times 16 = 8 - 1.6 = 6.4$
Iteration 3 ($k = 2$):
Gradient: $g_2 = f'(6.4) = 2(6.4) = 12.8$
Update: $x_3 = 6.4 - 0.1 \times 12.8 = 6.4 - 1.28 = 5.12$
Notice: each step is $x_{k+1} = x_k(1 - 2\alpha) = 0.8 \cdot x_k$. The position shrinks by 20% each step; that's LINEAR convergence (constant ratio). In Assignment 1, you saw the same pattern over 22 iterations.
| Step | Position $x$ | Gradient $2x$ | Nudge $\alpha \cdot g$ | New $x$ |
|---|---|---|---|---|
| 0→1 | 10.00 | 20.0 | 2.0 | 8.00 |
| 1→2 | 8.00 | 16.0 | 1.6 | 6.40 |
| 2→3 | 6.40 | 12.8 | 1.28 | 5.12 |
| 3→4 | 5.12 | 10.24 | 1.024 | 4.10 |
| ... | steps shrink as slope flattens near minimum | | | |
| 9→10 | 1.34 | 2.68 | 0.268 | 1.07 |
| 19→20 | 0.14 | 0.29 | 0.029 | 0.12 |
| 29→30 | 0.02 | 0.03 | 0.003 | 0.01 |
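The whole table comes from one short loop. A sketch of the gradient descent update for this example (the printed steps match the table above):

```python
# Gradient descent on f(x) = x^2 with alpha = 0.1, starting at x0 = 10.
# Each update multiplies x by (1 - 2*alpha) = 0.8, so the table's 20% shrink
# per step falls out automatically.

alpha = 0.1
x = 10.0
for k in range(30):
    g = 2 * x          # gradient: f'(x) = 2x
    x = x - alpha * g  # update: step of size alpha * g downhill
    if k in (0, 1, 2, 9, 19, 29):
        print(k + 1, round(x, 3))  # 8.0, 6.4, 5.12, then ~1.07, ~0.12, ~0.01
```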
Why do the steps shrink? Because the gradient (slope) gets smaller as you approach the flat bottom. Smaller gradient → smaller step. This is built into the math: the nudge is $\alpha \cdot g_k$, and $g_k \to 0$ near the minimum.
This is actually a good thing; you don't want to overshoot the minimum by taking big steps when you're close.
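What if the step size is too big? For $f(x) = x^2$ the update is $x_{k+1} = (1 - 2\alpha)\,x_k$, so any $\alpha > 1$ makes $|1 - 2\alpha| > 1$ and the iterates blow up instead of converging. A small sketch (the helper `run_gd` and the values of $\alpha$ are illustrative choices):

```python
# Compare a safe step size with a too-large one on f(x) = x^2.

def run_gd(alpha, x0=10.0, steps=5):
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)  # x_{k+1} = (1 - 2*alpha) * x_k
    return x

print(run_gd(0.1))  # shrinks toward 0 (factor 0.8 per step)
print(run_gd(1.1))  # overshoots past 0 and diverges (factor -1.2 per step)
```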
The Adam paper uses different words for things you already know. Same math, different vocabulary. Here's the Rosetta Stone:
| Your Course Says | | The Paper Says | What It Means |
|---|---|---|---|
| $x$ (the variable) | → | $\theta$ (theta) | The thing you're optimizing |
| minimize $f(x)$ | → | minimize $f(\theta)$ | Same problem, different letter |
| step size $\alpha$ | → | learning rate $\alpha$ | How big each step is |
| gradient $\nabla f(x)$ | → | $g_t = \nabla f_t(\theta)$ | Slope (from a random batch, hence $t$) |
| "the variables" | → | "the parameters" | The numbers you're tuning |
| convergence rate | → | regret bound | How fast the method approaches optimal |
| Hessian approx (BFGS) | → | moment estimates | Using gradient history to improve steps |
In machine learning, you're tuning a prediction machine: the parameters $\theta$ are its knobs, and $f(\theta)$ measures how badly it predicts. Gradient descent turns the knobs to shrink that error.
Next: Page 01, Vectors, Norms, and Elementwise Operations
The tools Adam uses in every formula