Page 07: Machine Learning Basics

MATH 3850; What the Adam paper is actually optimizing: loss functions, parameters, MNIST, and logistic regression

PART A

The Setup: A Machine That Learns from Data

Analogy: Teaching a child to recognize animals.

You show a child 1,000 photos of cats and dogs with labels. At first, they guess randomly. You say "wrong" or "right." They adjust their internal rules. After enough examples, they can correctly label new photos they've never seen.

Machine learning is the same process but with math:

  1. The "child" is a mathematical model (a function with adjustable parameters $\theta$)
  2. The "photos" are data points (inputs)
  3. The "labels" are the correct answers
  4. "Wrong or right" is measured by a loss function
  5. "Adjusting rules" is done by optimization (gradient descent!)

The ML optimization problem:

$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, \;\hat{y}_i(\theta))$$

$\theta$
All the model's adjustable parameters (the "knobs to tune"; weights and biases)
$N$
Total number of data points in the training set (e.g., 60,000 MNIST images)
$y_i$
The true label for data point $i$ (e.g., "this image is a 7")
$\hat{y}_i(\theta)$
The model's prediction for data point $i$, which depends on the current parameter values
$L(y, \hat{y})$
The loss function; measures how wrong a single prediction is (small = good, large = bad)
$\frac{1}{N}\sum$
Average the per-point losses over the entire dataset
$\min_\theta$
Find the parameter values that make this average loss as small as possible

This is just a regular minimization problem! Same as $\min_x f(x)$ in your course. The "function" is the average loss. The "variables" are the parameters $\theta$. We use gradient descent (or Adam) to find the $\theta$ that minimizes it.
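To make the connection concrete, here is a minimal sketch of that same minimization on a toy problem: a single parameter $\theta$, squared-error loss, and plain gradient descent. The dataset, learning rate, and step count are illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy dataset: N = 20 points from y = 3x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 3.0 * x + rng.normal(0, 0.1, size=20)

theta = 0.0  # one adjustable parameter (a slope); real models have many
lr = 0.1     # learning rate

def avg_loss(theta):
    # (1/N) * sum of per-point losses L(y_i, y_hat_i), squared error here
    return np.mean((y - theta * x) ** 2)

for step in range(200):
    grad = np.mean(-2 * x * (y - theta * x))  # d(avg loss)/d(theta)
    theta -= lr * grad                        # gradient descent step

print(round(float(theta), 2))  # close to the true slope 3.0
```

The structure is identical for MNIST: only the loss function and the number of parameters change.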

PART B

MNIST: The Paper's Main Dataset

What is MNIST? A collection of 60,000 handwritten digit images (0 through 9) for training, plus a 10,000-image test set. Each image is 28 × 28 pixels = 784 pixels total. Each pixel is a brightness value from 0 (black) to 1 (white).

[Figure: MNIST-like digit samples]
Each image is a grid of pixels. To a computer, an image is just a list of 784 numbers (the brightness of each pixel). That list IS the input to the model. The model looks at those 784 numbers and tries to guess which digit (0-9) the image shows.
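The "flattening" step is just reshaping a grid into a vector. A sketch with a fake image (random brightness values standing in for a real MNIST digit):

```python
import numpy as np

# A fake 28x28 "image": brightness values in [0, 1].
rng = np.random.default_rng(0)
image = rng.uniform(0, 1, size=(28, 28))

x = image.reshape(-1)  # flatten the grid into one list of numbers
print(x.shape)         # (784,) -- this vector IS the model's input
```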

MNIST by the numbers: 60,000 training images; 28 × 28 = 784 pixels per image; 10 classes (the digits 0-9); batches of 128 images per training step in the paper's experiments.

PART C

Logistic Regression: The Paper's First Experiment

Analogy: A weighted vote.

Imagine each pixel gets a "vote" for each digit. Pixel 42 might strongly vote for "7" (because 7s tend to have that pixel lit up) but vote against "0". The model learns the vote weights for all 784 pixels × 10 digits.

Logistic regression prediction:

$$\hat{y} = \text{softmax}(W \cdot x + b)$$

$\hat{y}$
The model's predicted probability distribution; a vector of 10 probabilities (one per digit 0-9), summing to 1
$x$
The input image flattened into a list of 784 pixel brightness values
$W$
A $10 \times 784$ weight matrix; each row holds one digit's "vote weights" for all 784 pixels
$b$
A bias vector of 10 numbers (one per digit class), shifting the raw scores up or down
$\text{softmax}$
A function that converts raw scores into probabilities; biggest score gets highest probability, all outputs sum to 1
$W \cdot x + b$
The raw "score" for each digit before converting to probabilities (a vector of 10 numbers)

The parameters $\theta = \{W, b\}$:

$W$ has $10 \times 784 = 7,840$ entries. $b$ has 10 entries. Total: 7,850 parameters.

These 7,850 numbers are what Adam optimizes. The gradient $g_t$ is a vector of 7,850 partial derivatives; one for each parameter.
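A sketch of this forward pass with NumPy, using a randomly initialized $W$ and $b$ and a fake image (the initialization scale is an illustrative assumption):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs sum to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(10, 784))  # 7,840 "vote weights"
b = np.zeros(10)                         # 10 biases, one per digit
x = rng.uniform(0, 1, size=784)          # one flattened image

scores = W @ x + b       # raw score per digit (10 numbers)
y_hat = softmax(scores)  # probability per digit

print(y_hat.shape, round(float(y_hat.sum()), 6))  # (10,) 1.0
print(W.size + b.size)                            # 7850 parameters
```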

Hand Computation: Tiny Logistic Regression

Imagine a simplified version: 3 pixels, 2 classes (cat vs dog).

Input image: $x = [0.8, 0.2, 0.5]$ (3 pixel values)

This is the raw data; pixel brightness values the model will look at.

Weight matrix: $W = \begin{bmatrix}0.3 & -0.1 & 0.4 \\ -0.2 & 0.5 & 0.1\end{bmatrix}$ (2 classes × 3 pixels)

Compute scores: $W \cdot x$ (taking the bias $b = [0, 0]$ for simplicity, so it drops out)

Multiply each pixel by its weight and sum; this is the dot product for each class.

Class 0 (cat): $0.3(0.8) + (-0.1)(0.2) + 0.4(0.5) = 0.24 - 0.02 + 0.20 = $ $0.42$

Class 1 (dog): $(-0.2)(0.8) + 0.5(0.2) + 0.1(0.5) = -0.16 + 0.10 + 0.05 = $ $-0.01$

Prediction: class 0 (cat) has score $0.42 > -0.01$, so predict CAT. (Softmax is order-preserving, so the class with the highest raw score also gets the highest probability.)

If the true label was "dog," the loss would be high, and gradient descent would adjust the 8 parameters ($W$ has 6, $b$ has 2) to make the dog score higher next time.

This is exactly the same structure as the paper's MNIST experiment, just scaled up: 784 pixels instead of 3, 10 classes instead of 2, 7,850 parameters instead of 8. The optimization problem is identical in form.
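The hand computation above can be checked in a few lines (bias taken as zero, as in the worked example):

```python
import numpy as np

x = np.array([0.8, 0.2, 0.5])        # 3 pixel values
W = np.array([[0.3, -0.1, 0.4],      # class 0 (cat) weights
              [-0.2, 0.5, 0.1]])     # class 1 (dog) weights

scores = W @ x                       # dot product per class, b = 0
print(scores)                        # approximately [0.42, -0.01]
print(int(scores.argmax()))          # 0 -> predict CAT
```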

PART D

The Loss Function: Measuring "How Wrong"

Analogy: A test score, but inverted. A loss of 0 = perfect predictions. A loss of 5 = terrible predictions. We want to minimize the loss, just like we minimize $f(x)$ in your course.

The paper uses negative log-likelihood (also called cross-entropy loss):

$$L = -\log(\hat{y}_{\text{correct class}})$$

$L$
The loss for a single data point; a non-negative number where 0 means a perfect prediction
$\hat{y}_{\text{correct class}}$
The probability the model assigned to the correct answer (e.g., how much probability it put on "7" when the image really was a 7)
$-\log(\cdot)$
Negative logarithm; turns a probability close to 1 into a small loss, and a probability close to 0 into a very large loss

In English: "how surprised was the model by the correct answer?"

This is the function $f(\theta)$ that Adam minimizes. The "training cost" on the y-axis of the paper's plots (Figures 1-3) is this loss, averaged over the dataset.
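A quick numerical check of how $-\log$ turns probabilities into losses (the three probabilities are arbitrary examples):

```python
import numpy as np

# Probability the model put on the correct class -> loss.
for p in [0.99, 0.5, 0.01]:
    print(p, round(float(-np.log(p)), 3))
# 0.99 0.01   <- confident and correct: tiny loss
# 0.5 0.693   <- unsure: moderate loss
# 0.01 4.605  <- confident and wrong: huge loss
```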

Putting it all together

The complete picture of what Adam does in the paper's experiments:

  1. Start with random parameters $\theta_0$ (random weight matrix $W$, random bias $b$)
  2. Each step: grab a random batch of 128 MNIST images
  3. Forward pass: compute predictions for all 128 images using current $\theta$
  4. Loss: compute average cross-entropy loss on the batch (how wrong were we?)
  5. Gradient: $g_t$ = partial derivatives of loss w.r.t. all 7,850 parameters
  6. Adam update: use $g_t$ to update $\theta$ (momentum + adaptive rates + bias correction)
  7. Repeat for thousands of steps until loss is small

Steps 2-6 are just gradient descent (Page 03) with the Adam improvements (Pages 09-12). The ML part is just defining what $f(\theta)$ is (the loss on data). The optimization part is identical to your course.
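The seven steps above can be sketched end to end. This uses synthetic stand-in data (not real MNIST) and plain mini-batch gradient descent in place of the full Adam update; the data sizes, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 1,000 "images" of 784 pixels, 10 classes.
N, D, C = 1000, 784, 10
X = rng.uniform(0, 1, size=(N, D))
true_W = rng.normal(size=(C, D))
labels = (X @ true_W.T).argmax(axis=1)   # labels from a hidden linear rule

# Step 1: start with random parameters.
W = rng.normal(0, 0.01, size=(C, D))
b = np.zeros(C)
lr, batch = 0.5, 128

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

for step in range(500):
    # Step 2: grab a random batch of 128 examples.
    idx = rng.choice(N, size=batch, replace=False)
    Xb, yb = X[idx], labels[idx]
    # Step 3: forward pass -- predictions for the whole batch.
    P = softmax(Xb @ W.T + b)
    # Step 4: average cross-entropy loss on the batch.
    loss = -np.log(P[np.arange(batch), yb]).mean()
    # Step 5: gradient of the loss w.r.t. W and b.
    G = P.copy()
    G[np.arange(batch), yb] -= 1.0       # dLoss/dScores
    gW = G.T @ Xb / batch
    gb = G.mean(axis=0)
    # Step 6: parameter update (plain gradient descent here; the paper
    # would apply the Adam rule to gW and gb instead).
    W -= lr * gW
    b -= lr * gb
    # Step 7: repeat.

acc = float((softmax(X @ W.T + b).argmax(axis=1) == labels).mean())
print(round(acc, 2))  # typically well above the 10% chance level
```

Swapping step 6 for the Adam update (momentum, adaptive rates, bias correction) is the only change needed to reproduce the structure of the paper's experiments.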

Glossary; Machine Learning Basics

Model
A mathematical function with adjustable parameters. Takes input data, produces predictions.
Parameters ($\theta$, or $W$ and $b$)
The numbers the model adjusts during training. Logistic regression on MNIST has 7,850 of them.
Loss function ($L$)
Measures how wrong the predictions are. Small loss = good predictions. We minimize this.
Training
The process of adjusting parameters to minimize the loss. Uses gradient descent (or Adam).
MNIST
60,000 images of handwritten digits (0-9). Each image is 28×28 = 784 pixels. The paper's main test dataset.
Logistic regression
A simple model: multiply pixels by weights, add bias, pick the highest score. The paper's first experiment (Section 6.1).
Cross-entropy / negative log-likelihood
The loss function used in the paper. $L = -\log(p_{\text{correct}})$. Penalizes confident wrong predictions heavily.
Forward pass
Computing the prediction: input → model → output. The "evaluate $f(\theta)$" step.
Epoch
One full pass through all training data. With 60,000 images and batch size 128: ~469 steps per epoch.