MATH 3850; Everything you need before the Adam paper. Derivatives, partial derivatives, the gradient, gradient descent, and the vocabulary translation between your course and the paper.
PART A
Analogy: Speedometer.
Your car's speedometer tells you how fast your position is changing right now. The derivative does the same thing for a function; it tells you how fast $f(x)$ is changing as you nudge $x$.
$$f(x) = x^2 \quad\Longrightarrow\quad f'(x) = 2x$$
This says: "the slope of $x^2$ at any point $x$ equals $2x$."
• At $x = 3$: slope = $2(3) = 6$; steeply uphill
• At $x = -1$: slope = $2(-1) = -2$; going downhill
• At $x = 0$: slope = $2(0) = 0$; perfectly flat = minimum!
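The three slopes above can be checked numerically with a finite difference. A minimal Python sketch (the helper `numerical_slope` is my own illustrative name, not from the course):

```python
# Check the slope of f(x) = x^2 numerically against the analytic answer f'(x) = 2x.

def f(x):
    return x ** 2

def numerical_slope(f, x, h=1e-6):
    """Central finite difference: (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [3.0, -1.0, 0.0]:
    # Numerical estimate vs. the analytic slope 2x; they should agree closely.
    print(x, numerical_slope(f, x), 2 * x)
```

The central difference is symmetric around $x$, so at $x = 0$ it returns exactly 0, matching the flat spot.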
The entire field of optimization is based on one idea: find where the slope is zero. A flat spot (derivative = 0) is a candidate for the minimum. Every algorithm in this course (gradient descent, Newton's method, BFGS, Adam) is just a different strategy for finding those flat spots.
Analogy: Stereo volume knobs.
Imagine a stereo with two knobs: bass and treble. If you want to know "how does the sound change when I turn the bass knob?" you hold the treble knob still and only turn bass. That's a partial derivative; you change one variable and freeze everything else.
When your function has two inputs, you can't just take "the derivative." You need to ask separately about each direction. Worked example:
$$f(x, y) = 3x^2 + 2xy + y^2$$
Step 1: $\frac{\partial f}{\partial x}$; treat $y$ as a constant
$\frac{\partial}{\partial x}(3x^2) = 6x$ (power rule, same as regular derivatives)
$\frac{\partial}{\partial x}(2xy) = 2y$ ($y$ is a constant, so $2xy$ is like $2 \cdot \text{constant} \cdot x$ → derivative is $2y$)
$\frac{\partial}{\partial x}(y^2) = 0$ ($y^2$ is a constant when $y$ is frozen)
⇒ $\frac{\partial f}{\partial x} = 6x + 2y$
Step 2: $\frac{\partial f}{\partial y}$; treat $x$ as a constant
$\frac{\partial}{\partial y}(3x^2) = 0$ ($x^2$ is a constant when $x$ is frozen)
$\frac{\partial}{\partial y}(2xy) = 2x$ ($x$ is a constant, so $2xy$ is like $2x \cdot y$ → derivative is $2x$)
$\frac{\partial}{\partial y}(y^2) = 2y$ (power rule)
⇒ $\frac{\partial f}{\partial y} = 2x + 2y$
Step 3: Evaluate at the point $(1, 2)$
$\frac{\partial f}{\partial x}\Big|_{(1,2)} = 6(1) + 2(2) = 6 + 4 = 10$ "steep in the x-direction"
$\frac{\partial f}{\partial y}\Big|_{(1,2)} = 2(1) + 2(2) = 2 + 4 = 6$ "moderate in the y-direction"
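The same "freeze one variable, nudge the other" idea translates directly into code. A sketch verifying the hand-computed partials at $(1, 2)$ by finite differences (step size `h` is an arbitrary choice):

```python
# Numerically verify the partials of f(x, y) = 3x^2 + 2xy + y^2 at (1, 2).

def f(x, y):
    return 3 * x ** 2 + 2 * x * y + y ** 2

h = 1e-6
x, y = 1.0, 2.0

df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # freeze y, nudge x
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # freeze x, nudge y

print(df_dx, df_dy)  # close to 10 and 6, matching 6x + 2y and 2x + 2y
```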
Connection to your assignments: In A2, when you computed $g_0 = Qx_0 = \begin{bmatrix}2&0\\0&1\end{bmatrix}\begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}2\\1\end{bmatrix}$, that WAS computing partial derivatives. For $f(x) = \frac{1}{2}x^TQx$, the gradient is $\nabla f = Qx$; that's just the partials bundled together.
Analogy: A weather vane.
A weather vane points in one direction; the direction the wind is blowing hardest. The gradient does the same thing: it points in the direction where the function increases the steepest. If you want to go downhill (minimize), walk the opposite direction.
$$\nabla f = \left[\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\right]$$
The symbol $\nabla$ ("nabla") just means: "bundle all the partial derivatives into a list."
If you have 2 variables → the gradient is a list of 2 numbers.
If you have 1000 variables → it's a list of 1000 numbers. Same idea, longer list.
Worked example: $f(x, y) = x^2 + y^2$.
Step 1: Compute the partials
$\frac{\partial f}{\partial x} = 2x, \quad \frac{\partial f}{\partial y} = 2y$
Step 2: Bundle into the gradient
$\nabla f(x,y) = [2x,\; 2y]$
Step 3: Evaluate at $(3, 1)$
$\nabla f(3, 1) = [2(3),\; 2(1)] = [6,\; 2]$
Interpretation: at the point $(3,1)$, the function is going up steeply in x (slope 6) and gently in y (slope 2). The gradient $[6, 2]$ points in the steepest uphill direction.
This is exactly what you compute in every assignment. In A2 with $f(x) = \frac{1}{2}x^TQx$: the gradient $\nabla f = Qx$. At $x_0 = [1,1]^T$ with $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$, you got $g_0 = [2, 1]^T$. That's the gradient; slopes of 2 and 1 in each direction.
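The A2 numbers can be reproduced in a couple of lines with NumPy. For $f(x) = \frac{1}{2}x^TQx$ with symmetric $Q$, the gradient is $\nabla f = Qx$, so the gradient at a point is just a matrix-vector product:

```python
# Gradient of f(x) = (1/2) x^T Q x is grad f = Q x (for symmetric Q).
# Reproduces the A2 numbers: Q = [[2,0],[0,1]], x0 = [1,1] gives g0 = [2,1].
import numpy as np

Q = np.array([[2.0, 0.0],
              [0.0, 1.0]])
x0 = np.array([1.0, 1.0])

g0 = Q @ x0  # the gradient: all partial derivatives bundled into one vector
print(g0)    # [2. 1.]
```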
Analogy: Blindfolded on a hill.
You're standing on a hilly landscape, blindfolded. You want to reach the lowest valley. You can't see anything, but you can feel the ground under your feet; you can tell which way it slopes.
You keep going until the ground feels flat (slope $\approx$ 0), meaning you've reached the valley floor.
Why blindfolded? Because in real optimization, you can't "see" the whole function. You only know the slope at your current position.
$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla f(\theta_{\text{old}})$$
Worked example: $f(x) = x^2$. Minimum is at $x = 0$. Derivative: $f'(x) = 2x$. Step size: $\alpha = 0.1$. Start at $x_0 = 10$.
Iteration 1 ($k = 0$):
Gradient: $g_0 = f'(10) = 2(10) = 20$
Update: $x_1 = x_0 - \alpha \cdot g_0 = 10 - 0.1 \times 20 = 10 - 2 = 8$
Interpretation: slope is 20 (very steep uphill). We take a step of size 2 downhill.
Iteration 2 ($k = 1$):
Gradient: $g_1 = f'(8) = 2(8) = 16$
Update: $x_2 = 8 - 0.1 \times 16 = 8 - 1.6 = 6.4$
Iteration 3 ($k = 2$):
Gradient: $g_2 = f'(6.4) = 2(6.4) = 12.8$
Update: $x_3 = 6.4 - 0.1 \times 12.8 = 6.4 - 1.28 = 5.12$
Notice: each step is $x_{k+1} = x_k(1 - 2\alpha) = 0.8 \cdot x_k$. The position shrinks by 20% each step; that's LINEAR convergence (constant ratio). In Assignment 1, you saw the same pattern over 22 iterations.
| Step | Position $x$ | Gradient $2x$ | Nudge $\alpha \cdot g$ | New $x$ |
|---|---|---|---|---|
| 0→1 | 10.00 | 20.0 | 2.0 | 8.00 |
| 1→2 | 8.00 | 16.0 | 1.6 | 6.40 |
| 2→3 | 6.40 | 12.8 | 1.28 | 5.12 |
| 3→4 | 5.12 | 10.24 | 1.024 | 4.10 |
| ... | steps shrink as slope flattens near minimum | | | |
| 9→10 | 1.34 | 2.68 | 0.268 | 1.07 |
| 19→20 | 0.14 | 0.29 | 0.029 | 0.12 |
| 29→30 | 0.02 | 0.03 | 0.003 | 0.01 |
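The whole table comes from one short loop. A sketch of the gradient descent update for this example (the printed steps match the table above):

```python
# Gradient descent on f(x) = x^2 with alpha = 0.1, starting at x0 = 10.
# Each update multiplies x by (1 - 2*alpha) = 0.8, so the table's 20% shrink
# per step falls out automatically.

alpha = 0.1
x = 10.0
for k in range(30):
    g = 2 * x          # gradient: f'(x) = 2x
    x = x - alpha * g  # update: step of size alpha * g downhill
    if k in (0, 1, 2, 9, 19, 29):
        print(k + 1, round(x, 3))  # 8.0, 6.4, 5.12, then ~1.07, ~0.12, ~0.01
```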
Why do the steps shrink? Because the gradient (slope) gets smaller as you approach the flat bottom. Smaller gradient → smaller step. This is built into the math: the nudge is $\alpha \cdot g_k$, and $g_k \to 0$ near the minimum.
This is actually a good thing; you don't want to overshoot the minimum by taking big steps when you're close.
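What if the step size is too big? For $f(x) = x^2$ the update is $x_{k+1} = (1 - 2\alpha)\,x_k$, so any $\alpha > 1$ makes $|1 - 2\alpha| > 1$ and the iterates blow up instead of converging. A small sketch (the helper `run_gd` and the values of $\alpha$ are illustrative choices):

```python
# Compare a safe step size with a too-large one on f(x) = x^2.

def run_gd(alpha, x0=10.0, steps=5):
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)  # x_{k+1} = (1 - 2*alpha) * x_k
    return x

print(run_gd(0.1))  # shrinks toward 0 (factor 0.8 per step)
print(run_gd(1.1))  # overshoots past 0 and diverges (factor -1.2 per step)
```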
The Adam paper uses different words for things you already know. Same math, different vocabulary. Here's the Rosetta Stone:
| Your Course Says | | The Paper Says | What It Means |
|---|---|---|---|
| $x$ (the variable) | → | $\theta$ (theta) | The thing you're optimizing |
| minimize $f(x)$ | → | minimize $f(\theta)$ | Same problem, different letter |
| step size $\alpha$ | → | learning rate $\alpha$ | How big each step is |
| gradient $\nabla f(x)$ | → | $g_t = \nabla f_t(\theta)$ | Slope (from a random batch, hence $t$) |
| "the variables" | → | "the parameters" | The numbers you're tuning |
| convergence rate | → | regret bound | How fast the method approaches optimal |
| Hessian approx (BFGS) | → | moment estimates | Using gradient history to improve steps |
In machine learning, you're tuning a prediction machine: the parameters $\theta$ are its knobs, and $f(\theta)$ measures how badly it predicts. Gradient descent turns the knobs to shrink that error.
Next: Page 01, Vectors, Norms, and Elementwise Operations
The tools Adam uses in every formula