MATH 3850: The tools Adam uses in every single formula
PART A
Analogy: A grocery receipt.
A receipt is a list of numbers: [$4.50, $2.00, $8.99, $1.25]. A vector is the same thing; just a list of numbers bundled together. Instead of prices, the numbers represent parameters, gradients, or positions.
A vector with 3 entries:
$$\mathbf{v} = \begin{bmatrix} 3 \\ -2 \\ 5 \end{bmatrix} = [3, -2, 5]$$
Both notations mean the same thing. The column form (left) is what textbooks use. The row form (right) is easier to write inline.
In your course, $x_0 = [1, 1]^T$ is a vector with 2 entries. In the Adam paper, $\theta$ might have 784 entries (one per pixel in an image). Same idea, bigger list.
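In NumPy (used here purely for illustration), a vector is a 1-D array, and the column/row distinction from the notation above becomes an explicit shape:

```python
import numpy as np

# The same 3-entry vector from above.
v = np.array([3, -2, 5])

# NumPy stores this as a flat 1-D array; reshaping gives the
# explicit column form that textbooks draw.
col = v.reshape(-1, 1)

print(v.shape)    # (3,)   -- three entries
print(col.shape)  # (3, 1) -- a 3x1 column
```

Whether $\theta$ has 2 entries or 784, the array is built the same way; only the length changes.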
Analogy: How far is it from home to school?
There are different ways to measure that distance; all of them measure "how big" the vector is, just differently.
For $\mathbf{v} = [3, -4]$:
L2 norm (Euclidean / straight-line distance):
$\|\mathbf{v}\|_2 = \sqrt{3^2 + (-4)^2} = \sqrt{9 + 16} = \sqrt{25} = 5$
Pythagorean theorem! The hypotenuse of a 3-4-5 triangle.
L1 norm (Manhattan / taxi distance):
$\|\mathbf{v}\|_1 = |3| + |-4| = 3 + 4 = 7$
Add the absolute values. Like walking on a grid: 3 blocks east, 4 blocks south.
L∞ norm (max / biggest component):
$\|\mathbf{v}\|_\infty = \max(|3|, |-4|) = 4$
Just pick the biggest absolute value. The Adam paper uses this in its convergence proof (Theorem 4.1).
General norm formulas:
$\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$, $\|\mathbf{v}\|_1 = \sum_i |v_i|$, $\|\mathbf{v}\|_\infty = \max_i |v_i|$
Unit balls: the set of all vectors with norm exactly 1 under each measure. The shape reveals how each norm "sees" distance.
Worked example with $\mathbf{v} = [1, -3, 2]$: we compute all three norms for the same vector to see how each one measures "size" differently.
L2 norm (straight-line distance from origin):
$\|\mathbf{v}\|_2 = \sqrt{1^2 + (-3)^2 + 2^2} = \sqrt{1 + 9 + 4} = \sqrt{14} \approx$ 3.742
L1 norm (taxi distance; add absolute values):
$\|\mathbf{v}\|_1 = |1| + |-3| + |2| = 1 + 3 + 2 =$ 6
L-infinity norm (largest absolute entry):
$\|\mathbf{v}\|_\infty = \max(|1|, |-3|, |2|) = \max(1, 3, 2) =$ 3
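The three computations above can be checked with NumPy's `np.linalg.norm`, which takes the norm order as its second argument:

```python
import numpy as np

v = np.array([1, -3, 2])

l2 = np.linalg.norm(v)                # sqrt(1 + 9 + 4) = sqrt(14)
l1 = np.linalg.norm(v, ord=1)         # 1 + 3 + 2 = 6
linf = np.linalg.norm(v, ord=np.inf)  # max(1, 3, 2) = 3

print(round(l2, 3), l1, linf)  # 3.742 6.0 3.0
```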
In Quiz 4, when you checked $\|p_0\| \leq \Delta_0$ for the trust region, you were computing the L2 norm of the step vector and comparing it to the radius. Same operation.
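As a sketch of that check (with a made-up step and radius, not the actual Quiz 4 numbers), the trust-region test is one norm call and one comparison:

```python
import numpy as np

# Hypothetical step vector and trust-region radius, for illustration only.
p0 = np.array([0.3, -0.4])
delta0 = 1.0

# L2 norm of the step vs. the radius: does the step fit inside the region?
inside = np.linalg.norm(p0) <= delta0
print(inside)  # True -- here ||p0|| = 0.5, which is within the radius 1.0
```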
Analogy: Apply a discount to every item on a receipt.
If every item gets 10% off, you multiply each price by 0.9 separately. You don't add all prices first; each item gets its own calculation. That's "elementwise": apply the operation to each entry independently.
The Adam algorithm does operations elementwise on vectors. When the paper writes $g_t^2$, it means "square each component of the gradient separately." When it writes $\hat{m}_t / \sqrt{\hat{v}_t}$, it means "divide each component of $\hat{m}$ by the corresponding square root of $\hat{v}$." If you miss this, the formulas look impossible.
Multiply matching entries:
$[2, 5, 3] \odot [4, 1, 6]$
$= [2 \times 4,\; 5 \times 1,\; 3 \times 6]$
$= [8, 5, 18]$
Square each entry:
$[3, -2, 5]^2$
$= [3^2,\; (-2)^2,\; 5^2]$
$= [9, 4, 25]$
Note: negatives become positive!
Square root each entry:
$\sqrt{[9, 4, 25]}$
$= [\sqrt{9},\; \sqrt{4},\; \sqrt{25}]$
$= [3, 2, 5]$
Divide matching entries:
$[10, 6, 15] \;/\; [2, 3, 5]$
$= [10/2,\; 6/3,\; 15/5]$
$= [5, 2, 3]$
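All four operations above are one-liners in NumPy, where arithmetic on arrays is elementwise by default:

```python
import numpy as np

# Multiply matching entries
prod = np.array([2, 5, 3]) * np.array([4, 1, 6])    # [8, 5, 18]

# Square each entry (negatives become positive)
sq = np.array([3, -2, 5]) ** 2                      # [9, 4, 25]

# Square root each entry
root = np.sqrt(np.array([9, 4, 25]))                # [3., 2., 5.]

# Divide matching entries
quot = np.array([10, 6, 15]) / np.array([2, 3, 5])  # [5., 2., 3.]

print(prod, sq, root, quot)
```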
Given gradient $g = [4, -1, 3]$:
Step 1: Square the gradient elementwise ($g^2$)
$g^2 = [4^2,\; (-1)^2,\; 3^2] =$ $[16, 1, 9]$
This is what Adam does to build $v_t$; it tracks how big each gradient component has been
Step 2: Take the square root elementwise ($\sqrt{g^2}$)
$\sqrt{[16, 1, 9]} = [\sqrt{16},\; \sqrt{1},\; \sqrt{9}] =$ $[4, 1, 3]$
We get back the absolute values! $\sqrt{g^2} = |g|$
Step 3: Divide gradient by the square root ($g \;/\; \sqrt{g^2}$)
$[4, -1, 3] \;/\; [4, 1, 3] = [4/4,\; -1/1,\; 3/3] =$ $[1, -1, 1]$
Each component normalized to $\pm 1$! This is the core idea of Adam; the direction is preserved but the magnitude is equalized across all parameters
This is essentially what Adam's update rule does: $\hat{m}_t / \sqrt{\hat{v}_t}$ normalizes the gradient so that all parameters get steps of similar magnitude, regardless of whether their gradient is huge or tiny.
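The three steps above can be replayed in NumPy; the whole pipeline is three elementwise operations:

```python
import numpy as np

g = np.array([4.0, -1.0, 3.0])
v = g ** 2                   # [16, 1, 9] -- what Adam accumulates into v_t
normalized = g / np.sqrt(v)  # sqrt(g^2) = |g|, so this is g / |g|
print(normalized)            # [ 1. -1.  1.] -- only the signs survive
```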
Here's the Adam update rule (Page 12 will explain every piece). Look at how many elementwise operations there are:
$\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$
Every operation in Adam is elementwise. Each parameter gets its own independent calculation. That's what makes it "adaptive"; each parameter is handled separately.
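A minimal sketch of one Adam step in NumPy, using the scalar hyperparameters listed below (plus $\beta_2 = 0.999$, the paper's default). This is a teaching sketch, not a production implementation; note that every arithmetic operation inside it is elementwise:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; every operation here acts on each entry independently."""
    m = beta1 * m + (1 - beta1) * g      # running average of gradients
    v = beta2 * v + (1 - beta2) * g**2   # running average of squared gradients
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step from theta_0 = [1, 1] with gradient [4, -1]:
theta, m, v = adam_step(np.array([1.0, 1.0]), np.array([4.0, -1.0]),
                        np.zeros(2), np.zeros(2), t=1)
print(theta)  # each parameter moves by about alpha, opposite the sign of g
```

At $t = 1$ starting from $m_0 = v_0 = 0$, the bias correction makes $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the step is almost exactly $-\alpha \cdot \mathrm{sign}(g)$ in each component, matching the normalization demo above.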
Scalar = a single number: $\alpha = 0.001$, $\beta_1 = 0.9$, $\epsilon = 10^{-8}$
Vector = a list of numbers: $\theta = [w_1, w_2, ..., w_{784}]$, $g_t = [\frac{\partial f}{\partial w_1}, ..., \frac{\partial f}{\partial w_{784}}]$
Matrix = a grid of numbers: $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$, $H_k$ (the Hessian approximation from BFGS)
Key difference from BFGS: BFGS stores a full matrix $H_k$ (size $n \times n$). With 784 parameters, that's 614,656 numbers! Adam only stores two vectors ($m_t$ and $v_t$, each size $n$). With 784 parameters, that's just 1,568 numbers. Much cheaper.
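The storage counts quoted above are easy to verify:

```python
n = 784
bfgs_storage = n * n  # BFGS: a full n-by-n matrix H_k
adam_storage = 2 * n  # Adam: just m_t and v_t, each of length n
print(bfgs_storage, adam_storage)  # 614656 1568
```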