MATH 3850: The tools Adam uses in every single formula
PART A
Analogy: A grocery receipt.
A receipt is a list of numbers: [$4.50, $2.00, $8.99, $1.25]. A vector is the same thing; just a list of numbers bundled together. Instead of prices, the numbers represent parameters, gradients, or positions.
A vector with 3 entries:
$$\mathbf{v} = \begin{bmatrix} 3 \\ -2 \\ 5 \end{bmatrix} = [3, -2, 5]$$
Both notations mean the same thing. The column form (left) is what textbooks use. The row form (right) is easier to write inline.
In your course, $x_0 = [1, 1]^T$ is a vector with 2 entries. In the Adam paper, $\theta$ might have 784 entries (one per pixel in an image). Same idea, bigger list.
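In NumPy (used here purely for illustration), a vector is a 1-D array, and the column/row distinction from the notation above becomes an explicit shape:

```python
import numpy as np

# The same 3-entry vector from above.
v = np.array([3, -2, 5])

# NumPy stores this as a flat 1-D array; reshaping gives the
# explicit column form that textbooks draw.
col = v.reshape(-1, 1)

print(v.shape)    # (3,)   -- three entries
print(col.shape)  # (3, 1) -- a 3x1 column
```

Whether $\theta$ has 2 entries or 784, the array is built the same way; only the length changes.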
Analogy: How far is it from home to school?
There are different ways to measure that distance; all of them measure "how big" the vector is, just differently.
For $\mathbf{v} = [3, -4]$:
L2 norm (Euclidean / straight-line distance):
$\|\mathbf{v}\|_2 = \sqrt{3^2 + (-4)^2} = \sqrt{9 + 16} = \sqrt{25} = 5$
Pythagorean theorem! The hypotenuse of a 3-4-5 triangle.
L1 norm (Manhattan / taxi distance):
$\|\mathbf{v}\|_1 = |3| + |-4| = 3 + 4 = 7$
Add the absolute values. Like walking on a grid: 3 blocks east, 4 blocks south.
L∞ norm (max / biggest component):
$\|\mathbf{v}\|_\infty = \max(|3|, |-4|) = 4$
Just pick the biggest absolute value. The Adam paper uses this in its convergence proof (Theorem 4.1).
General norm formulas:
$\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$, $\|\mathbf{v}\|_1 = \sum_i |v_i|$, $\|\mathbf{v}\|_\infty = \max_i |v_i|$
Unit balls: the set of all vectors with norm exactly 1 under each measure. The shape reveals how each norm "sees" distance.
Worked example with $\mathbf{v} = [1, -3, 2]$: we compute all three norms for the same vector to see how each one measures "size" differently.
L2 norm (straight-line distance from origin):
$\|\mathbf{v}\|_2 = \sqrt{1^2 + (-3)^2 + 2^2} = \sqrt{1 + 9 + 4} = \sqrt{14} \approx$ 3.742
L1 norm (taxi distance; add absolute values):
$\|\mathbf{v}\|_1 = |1| + |-3| + |2| = 1 + 3 + 2 =$ 6
L-infinity norm (largest absolute entry):
$\|\mathbf{v}\|_\infty = \max(|1|, |-3|, |2|) = \max(1, 3, 2) =$ 3
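The three computations above can be checked with NumPy's `np.linalg.norm`, which takes the norm order as its second argument:

```python
import numpy as np

v = np.array([1, -3, 2])

l2 = np.linalg.norm(v)                # sqrt(1 + 9 + 4) = sqrt(14)
l1 = np.linalg.norm(v, ord=1)         # 1 + 3 + 2 = 6
linf = np.linalg.norm(v, ord=np.inf)  # max(1, 3, 2) = 3

print(round(l2, 3), l1, linf)  # 3.742 6.0 3.0
```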
In Quiz 4, when you checked $\|p_0\| \leq \Delta_0$ for the trust region, you were computing the L2 norm of the step vector and comparing it to the radius. Same operation.
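As a sketch of that check (with a made-up step and radius, not the actual Quiz 4 numbers), the trust-region test is one norm call and one comparison:

```python
import numpy as np

# Hypothetical step vector and trust-region radius, for illustration only.
p0 = np.array([0.3, -0.4])
delta0 = 1.0

# L2 norm of the step vs. the radius: does the step fit inside the region?
inside = np.linalg.norm(p0) <= delta0
print(inside)  # True -- here ||p0|| = 0.5, which is within the radius 1.0
```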
Analogy: Apply a discount to every item on a receipt.
If every item gets 10% off, you multiply each price by 0.9 separately. You don't add all prices first; each item gets its own calculation. That's "elementwise": apply the operation to each entry independently.
The Adam algorithm does operations elementwise on vectors. When the paper writes $g_t^2$, it means "square each component of the gradient separately." When it writes $\hat{m}_t / \sqrt{\hat{v}_t}$, it means "divide each component of $\hat{m}$ by the corresponding square root of $\hat{v}$." If you miss this, the formulas look impossible.
Multiply matching entries:
$[2, 5, 3] \odot [4, 1, 6]$
$= [2 \times 4,\; 5 \times 1,\; 3 \times 6]$
$= [8, 5, 18]$
Square each entry:
$[3, -2, 5]^2$
$= [3^2,\; (-2)^2,\; 5^2]$
$= [9, 4, 25]$
Note: negatives become positive!
Square root each entry:
$\sqrt{[9, 4, 25]}$
$= [\sqrt{9},\; \sqrt{4},\; \sqrt{25}]$
$= [3, 2, 5]$
Divide matching entries:
$[10, 6, 15] \;/\; [2, 3, 5]$
$= [10/2,\; 6/3,\; 15/5]$
$= [5, 2, 3]$
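All four operations above are one-liners in NumPy, where arithmetic on arrays is elementwise by default:

```python
import numpy as np

# Multiply matching entries
prod = np.array([2, 5, 3]) * np.array([4, 1, 6])    # [8, 5, 18]

# Square each entry (negatives become positive)
sq = np.array([3, -2, 5]) ** 2                      # [9, 4, 25]

# Square root each entry
root = np.sqrt(np.array([9, 4, 25]))                # [3., 2., 5.]

# Divide matching entries
quot = np.array([10, 6, 15]) / np.array([2, 3, 5])  # [5., 2., 3.]

print(prod, sq, root, quot)
```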
Given gradient $g = [4, -1, 3]$:
Step 1: Square the gradient elementwise ($g^2$)
$g^2 = [4^2,\; (-1)^2,\; 3^2] =$ $[16, 1, 9]$
This is what Adam does to build $v_t$; it tracks how big each gradient component has been
Step 2: Take the square root elementwise ($\sqrt{g^2}$)
$\sqrt{[16, 1, 9]} = [\sqrt{16},\; \sqrt{1},\; \sqrt{9}] =$ $[4, 1, 3]$
We get back the absolute values! $\sqrt{g^2} = |g|$
Step 3: Divide gradient by the square root ($g \;/\; \sqrt{g^2}$)
$[4, -1, 3] \;/\; [4, 1, 3] = [4/4,\; -1/1,\; 3/3] =$ $[1, -1, 1]$
Each component normalized to $\pm 1$! This is the core idea of Adam; the direction is preserved but the magnitude is equalized across all parameters
This is essentially what Adam's update rule does: $\hat{m}_t / \sqrt{\hat{v}_t}$ normalizes the gradient so that all parameters get steps of similar magnitude, regardless of whether their gradient is huge or tiny.
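The three steps above can be replayed in NumPy; the whole pipeline is three elementwise operations:

```python
import numpy as np

g = np.array([4.0, -1.0, 3.0])
v = g ** 2                   # [16, 1, 9] -- what Adam accumulates into v_t
normalized = g / np.sqrt(v)  # sqrt(g^2) = |g|, so this is g / |g|
print(normalized)            # [ 1. -1.  1.] -- only the signs survive
```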
Here's the Adam update rule (Page 12 will explain every piece). Look at how many elementwise operations there are:
$\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t \;/\; (\sqrt{\hat{v}_t} + \epsilon)$
Every operation in Adam is elementwise. Each parameter gets its own independent calculation. That's what makes it "adaptive"; each parameter is handled separately.
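A minimal sketch of one Adam step in NumPy, using the scalar hyperparameters listed below (plus $\beta_2 = 0.999$, the paper's default). This is a teaching sketch, not a production implementation; note that every arithmetic operation inside it is elementwise:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; every operation here acts on each entry independently."""
    m = beta1 * m + (1 - beta1) * g      # running average of gradients
    v = beta2 * v + (1 - beta2) * g**2   # running average of squared gradients
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step from theta_0 = [1, 1] with gradient [4, -1]:
theta, m, v = adam_step(np.array([1.0, 1.0]), np.array([4.0, -1.0]),
                        np.zeros(2), np.zeros(2), t=1)
print(theta)  # each parameter moves by about alpha, opposite the sign of g
```

At $t = 1$ starting from $m_0 = v_0 = 0$, the bias correction makes $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the step is almost exactly $-\alpha \cdot \mathrm{sign}(g)$ in each component, matching the normalization demo above.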
Scalar = a single number: $\alpha = 0.001$, $\beta_1 = 0.9$, $\epsilon = 10^{-8}$
Vector = a list of numbers: $\theta = [w_1, w_2, ..., w_{784}]$, $g_t = [\frac{\partial f}{\partial w_1}, ..., \frac{\partial f}{\partial w_{784}}]$
Matrix = a grid of numbers: $Q = \begin{bmatrix}2&0\\0&1\end{bmatrix}$, $H_k$ (the Hessian approximation from BFGS)
Key difference from BFGS: BFGS stores a full matrix $H_k$ (size $n \times n$). With 784 parameters, that's 614,656 numbers! Adam only stores two vectors ($m_t$ and $v_t$, each size $n$). With 784 parameters, that's just 1,568 numbers. Much cheaper.
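The storage counts quoted above are easy to verify:

```python
n = 784
bfgs_storage = n * n  # BFGS: a full n-by-n matrix H_k
adam_storage = 2 * n  # Adam: just m_t and v_t, each of length n
print(bfgs_storage, adam_storage)  # 614656 1568
```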