MATH 3850: What the Adam paper is actually optimizing: loss functions, parameters, MNIST, and logistic regression
PART A. Analogy: Teaching a child to recognize animals.
You show a child 1,000 photos of cats and dogs with labels. At first, they guess randomly. You say "wrong" or "right." They adjust their internal rules. After enough examples, they can correctly label new photos they've never seen.
Machine learning is the same process but with math:
The ML optimization problem:
$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, \;\hat{y}_i(\theta))$$
This is just a regular minimization problem! Same as $\min_x f(x)$ in your course. The "function" is the average loss. The "variables" are the parameters $\theta$. We use gradient descent (or Adam) to find the $\theta$ that minimizes it.
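To make that concrete, here is a minimal sketch (not from the paper) of gradient descent applied to an average loss of exactly the $\min_\theta \frac{1}{N}\sum_i L$ form above. The model, data, and learning rate are all made up for illustration: $L$ is squared error and the only parameter is a single slope $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)   # true slope is 3

theta = 0.0          # initial guess
lr = 0.1             # step size
for _ in range(200):
    y_hat = theta * x
    grad = np.mean(2 * (y_hat - y) * x)    # d/dtheta of the average squared error
    theta -= lr * grad                     # the gradient descent step

print(theta)   # prints a value close to the true slope, 3.0
```

Everything that follows is this same loop, just with a more interesting $f(\theta)$ and a smarter step rule.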
What is MNIST? A dataset of handwritten digit images (0 through 9): 60,000 for training plus 10,000 for testing. Each image is 28 × 28 pixels = 784 pixels total. Each pixel is a brightness value from 0 (black) to 1 (white).
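A quick sketch of the data layout (a random array stands in for a real digit; no MNIST download here): each 28 × 28 grid of brightness values is flattened into a length-784 vector before the model sees it.

```python
import numpy as np

image = np.random.default_rng(0).random((28, 28))  # stand-in for one digit image
x = image.reshape(-1)                              # flatten to a vector

print(image.shape)  # (28, 28)
print(x.shape)      # (784,)
```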
Analogy: A weighted vote.
Imagine each pixel gets a "vote" for each digit. Pixel 42 might strongly vote for "7" (because 7s tend to have that pixel lit up) but vote against "0". The model learns the vote weights for all 784 pixels × 10 digits.
Logistic regression prediction:
$$\hat{y} = \text{softmax}(W \cdot x + b)$$
The parameters $\theta = \{W, b\}$:
$W$ has $10 \times 784 = 7,840$ entries. $b$ has 10 entries. Total: 7,850 parameters.
These 7,850 numbers are what Adam optimizes. The gradient $g_t$ is a vector of 7,850 partial derivatives, one for each parameter.
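The forward pass at full MNIST scale can be sketched in a few lines. $W$, $b$, and $x$ are random stand-ins here; the shapes and the parameter count are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))   # one row of vote weights per digit class
b = rng.normal(size=10)          # one bias per digit class
x = rng.random(784)              # one flattened image

scores = W @ x + b               # 10 raw "votes"
scores -= scores.max()           # shift for numerical stability
y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax: 10 probabilities

print(W.size + b.size)   # 7850 parameters, exactly as counted above
print(y_hat.sum())       # probabilities sum to 1
```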
Imagine a simplified version: 3 pixels, 2 classes (cat vs dog).
Input image: $x = [0.8, 0.2, 0.5]$ (3 pixel values)
This is the raw data; pixel brightness values the model will look at.
Weight matrix: $W = \begin{bmatrix}0.3 & -0.1 & 0.4 \\ -0.2 & 0.5 & 0.1\end{bmatrix}$ (2 classes × 3 pixels)
Compute scores: $W \cdot x$
Multiply each pixel by its weight and sum; this is the dot product for each class.
Class 0 (cat): $0.3(0.8) + (-0.1)(0.2) + 0.4(0.5) = 0.24 - 0.02 + 0.20 = 0.42$
Class 1 (dog): $(-0.2)(0.8) + 0.5(0.2) + 0.1(0.5) = -0.16 + 0.10 + 0.05 = -0.01$
Prediction: class 0 (cat) has score 0.42 > -0.01, so predict CAT
If the true label was "dog," the loss would be high, and gradient descent would adjust the 8 parameters ($W$ has 6, $b$ has 2) to make the dog score higher next time.
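The hand arithmetic above is easy to check with numpy. No bias term is used here, matching the hand computation (effectively $b = 0$).

```python
import numpy as np

x = np.array([0.8, 0.2, 0.5])          # the 3 pixel values
W = np.array([[ 0.3, -0.1, 0.4],       # class 0 (cat) weights
              [-0.2,  0.5, 0.1]])      # class 1 (dog) weights

scores = W @ x                         # one dot product per class
print(np.round(scores, 2))             # [ 0.42 -0.01]
print(scores.argmax())                 # 0, i.e. predict CAT
```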
This is exactly the same structure as the paper's MNIST experiment, just scaled up: 784 pixels instead of 3, 10 classes instead of 2, 7,850 parameters instead of 8. The optimization problem is identical in form.
Analogy: A test score, but inverted. A loss of 0 = perfect predictions. A loss of 5 = terrible predictions. We want to minimize the loss, just like we minimize $f(x)$ in your course.
The paper uses negative log-likelihood (also called cross-entropy loss):
$$L = -\log(\hat{y}_{\text{correct class}})$$
In English: "how surprised was the model by the correct answer?"
This is the function $f(\theta)$ that Adam minimizes. The "training cost" on the y-axis of the paper's plots (Figures 1-3) is this loss, averaged over the dataset.
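A small numeric illustration of "how surprised was the model": the probabilities below are made up, but they show that the loss is small when the model puts high probability on the correct class and large when it does not.

```python
import numpy as np

confident = np.array([0.9, 0.05, 0.05])   # correct class is index 0
unsure    = np.array([0.1, 0.6,  0.3 ])   # same correct class, low probability

print(round(-np.log(confident[0]), 3))  # 0.105 -- small loss, little surprise
print(round(-np.log(unsure[0]), 3))     # 2.303 -- large loss, big surprise
```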
The complete picture of what Adam does in the paper's experiments:
1. Define the objective $f(\theta)$: the average loss over the training data.
2. Sample a batch of training images.
3. Compute the gradient $g_t$ (the 7,850 partial derivatives).
4. Update the moving averages $m_t$ and $v_t$.
5. Bias-correct them to get $\hat{m}_t$ and $\hat{v}_t$.
6. Take the Adam step: $\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.
7. Repeat until the training cost stops improving.
Steps 2-6 are just gradient descent (Page 03) with the Adam improvements (Pages 09-12). The ML part is just defining what $f(\theta)$ is (the loss on data). The optimization part is identical to your course.
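The whole loop can be sketched end to end. This is a toy stand-in, not the paper's code: random data, tiny made-up sizes, and full-batch gradients instead of minibatches. The hyperparameters are the Adam paper's defaults ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 200, 20, 3                      # samples, "pixels", classes (toy sizes)
X = rng.random((N, D))
true_W = rng.normal(size=(C, D))
labels = (X @ true_W.T).argmax(axis=1)    # synthetic "correct" labels

params = [np.zeros((C, D)), np.zeros(C)]              # theta = {W, b}
m = [np.zeros_like(p) for p in params]                # first moment estimates
v = [np.zeros_like(p) for p in params]                # second moment estimates
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8    # the paper's defaults

def loss_and_grads(W, b):
    """Average cross-entropy loss and its gradients w.r.t. W and b."""
    scores = X @ W.T + b
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(scores); p /= p.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(p[np.arange(N), labels]))  # negative log-likelihood
    p[np.arange(N), labels] -= 1.0                    # dL/dscores
    p /= N
    return loss, [p.T @ X, p.sum(axis=0)]

loss0, _ = loss_and_grads(params[0], params[1])
for t in range(1, 2001):
    _, grads = loss_and_grads(params[0], params[1])
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g         # update first moment
        v[i] = beta2 * v[i] + (1 - beta2) * g * g     # update second moment
        m_hat = m[i] / (1 - beta1 ** t)               # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] = params[i] - alpha * m_hat / (np.sqrt(v_hat) + eps)

loss1, _ = loss_and_grads(params[0], params[1])
print(loss1 < loss0)   # True: the training cost went down
```

Note that the initial loss is exactly $\log 3 \approx 1.1$: with all parameters at zero, the model assigns probability $1/3$ to each of the 3 classes, which is maximal surprise.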