Loss Functions
A loss function maps a prediction and a ground truth to a scalar measuring how wrong the prediction is. The choice of loss function encodes our assumptions about the task, the noise model, and what "good performance" means.
Cross Entropy
Entropy,

$$H(p) = -\sum_x p(x) \log p(x),$$

measures the average surprise (in nats if using $\log_e$, or bits if using $\log_2$) of observing samples from $p$. Over a finite outcome space it is maximized by the uniform distribution and equals zero only for a deterministic distribution.
Cross entropy is defined as $H(p, q) = -\sum_x p(x) \log q(x)$, and the decomposition $H(p, q) = H(p) + D_{KL}(p \Vert q)$ is key. $H(p)$ depends only on the true data distribution, which is fixed during training. Since we optimize over the model (which defines $q$), only the $D_{KL}(p \Vert q)$ term varies with model parameters. Therefore minimizing cross entropy is equivalent to minimizing the KL divergence from $p$ to $q$.
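The decomposition can be checked numerically. A minimal NumPy sketch with two hypothetical discrete distributions:

```python
import numpy as np

# Hypothetical discrete distributions over four outcomes.
p = np.array([0.5, 0.25, 0.15, 0.10])   # "true" distribution
q = np.array([0.4, 0.30, 0.20, 0.10])   # "model" distribution

entropy = -np.sum(p * np.log(p))         # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl = np.sum(p * np.log(p / q))           # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q) holds to machine precision.
assert np.isclose(cross_entropy, entropy + kl)
```

Since $H(p)$ is a constant here, any change in cross entropy comes entirely from the KL term.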
For a single sample with true class $y$ and one-hot target, the loss reduces to

$$\mathcal{L} = -\log q_y$$

where $q_y$ is the model's predicted probability for the true class. This is the negative log-likelihood of the correct class.
For binary classification ($y \in \{0, 1\}$), the full loss averaged over $N$ samples is:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where $\hat{y}_i = \sigma(z_i)$ is the sigmoid output for logit $z_i$.
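Computing this loss naively (sigmoid, then log) overflows for large logits. A sketch of the standard numerically stable rearrangement, $\max(z, 0) - zy + \log(1 + e^{-|z|})$, which is algebraically identical:

```python
import numpy as np

def bce_with_logits(z, y):
    """Binary cross entropy computed directly from raw logits z.

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)) instead of
    sigmoid followed by log, which overflows for large |z|.
    """
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# Matches the naive form on moderate logits...
z = np.array([-2.0, 0.5, 3.0])
y = np.array([0.0, 1.0, 1.0])
sig = 1 / (1 + np.exp(-z))
naive = -np.mean(y * np.log(sig) + (1 - y) * np.log(1 - sig))
assert np.isclose(bce_with_logits(z, y), naive)

# ...but stays finite where the naive form diverges.
assert np.isfinite(bce_with_logits(np.array([1000.0]), np.array([0.0])))
```

This is the same trick PyTorch applies inside F.binary_cross_entropy_with_logits.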
The gradient of cross entropy with respect to the logits is

$$\frac{\partial \mathcal{L}}{\partial z_k} = q_k - y_k$$

This is simply the predicted probability minus the one-hot target. No derivatives of log or softmax appear explicitly, because they cancel. This cancellation also improves numerical stability, which is why F.cross_entropy in PyTorch takes raw logits, not probabilities.
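The cancellation can be verified against finite differences. A minimal sketch with arbitrary example logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[y])  # cross entropy for true class index y

z = np.array([2.0, -1.0, 0.5])     # arbitrary logits
y = 0                              # true class index

# Analytic gradient: predicted probabilities minus the one-hot target.
analytic = softmax(z).copy()
analytic[y] -= 1.0

# Central finite differences agree closely.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (loss(zp, y) - loss(zm, y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```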
When it is used: Standard loss for classification tasks (logistic regression, neural network classifiers, language model next-token prediction). For language models, the loss is averaged over all tokens in a sequence:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log q(x_t \mid x_{<t})$$
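The per-token averaging can be sketched with toy logits (random values standing in for real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 5, 10                        # toy sequence length and vocab size
logits = rng.normal(size=(T, V))    # one row of logits per position
targets = rng.integers(0, V, size=T)

# Stable log-softmax over the vocab dimension.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Negative log-probability of each target token, averaged over T tokens.
token_nll = -log_probs[np.arange(T), targets]
loss = token_nll.mean()
```

This mirrors the flattened F.cross_entropy(logits.view(-1, V), labels.view(-1)) call shown in the practice table below.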
KL Divergence
For discrete distributions:

$$D_{KL}(p \Vert q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

For continuous distributions:

$$D_{KL}(p \Vert q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
Key Properties
- Non-negativity (Gibbs' inequality): $D_{KL}(p \Vert q) \geq 0$, with equality iff $p = q$ almost everywhere. This follows from Jensen's inequality applied to the concave function $\log$.
- Asymmetry: $D_{KL}(p \Vert q) \neq D_{KL}(q \Vert p)$ in general. Therefore KL divergence is not a metric (it also fails the triangle inequality).
- Additivity: For independent variables, $D_{KL}(p_1 p_2 \Vert q_1 q_2) = D_{KL}(p_1 \Vert q_1) + D_{KL}(p_2 \Vert q_2)$.
- Invariance under reparameterization: KL divergence is unchanged by invertible transformations of the random variable.
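The first three properties can be checked numerically. A sketch with small hypothetical distributions:

```python
import numpy as np

def kl(p, q):
    # Discrete D_KL(p || q); assumes both have full support.
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

assert kl(p, q) >= 0 and np.isclose(kl(p, p), 0)   # Gibbs' inequality
assert not np.isclose(kl(p, q), kl(q, p))          # asymmetry

# Additivity: KL of a product of independents is the sum of the KLs.
p2, q2 = np.array([0.7, 0.3]), np.array([0.4, 0.6])
joint_p = np.outer(p, p2).ravel()
joint_q = np.outer(q, q2).ravel()
assert np.isclose(kl(joint_p, joint_q), kl(p, q) + kl(p2, q2))
```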
Forward vs. Reverse KL
The asymmetry of KL divergence has profound consequences for learning:
| Direction | Name | Behavior where $p(x) > 0$ but $q(x) = 0$ | Result |
|---|---|---|---|
| Forward KL $D_{KL}(p \Vert q)$ | Mean-seeking | Infinite penalty | $q$ covers all modes of $p$ |
| Reverse KL $D_{KL}(q \Vert p)$ | Mode-seeking | No penalty | $q$ locks onto one mode of $p$ |
Reverse KL ($D_{KL}(q \Vert p)$) is what variational inference minimizes (equivalently, maximizing the ELBO). It allows $q$ to ignore modes of $p$, producing sharper but potentially incomplete approximations. This is why variational autoencoders sometimes suffer from "mode collapse."
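The mean-seeking vs. mode-seeking behavior can be demonstrated with a toy experiment: fit a single Gaussian $q$ to a bimodal target $p$ by brute-force search over the mean, under each KL direction. All distributions and grid settings below are illustrative choices:

```python
import numpy as np

# Hypothetical target p: equal mixture of two well-separated Gaussians.
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3, 0.7) + 0.5 * gauss(x, 3, 0.7)

def kl(a, b):
    return np.sum(a * np.log(a / b)) * dx   # numerical D_KL on the grid

# Candidate q: unit-scale Gaussian; search over its mean only.
means = np.linspace(-5, 5, 201)
fwd = [kl(p, gauss(x, m, 1.0)) for m in means]   # forward:  D_KL(p || q)
rev = [kl(gauss(x, m, 1.0), p) for m in means]   # reverse:  D_KL(q || p)

best_fwd = means[np.argmin(fwd)]   # mean-seeking: lands between the modes
best_rev = means[np.argmin(rev)]   # mode-seeking: lands on one mode
```

Forward KL places $q$ at the overall mean (covering both modes poorly), while reverse KL picks one mode and ignores the other.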
For multivariate Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$:

$$D_{KL}(p \Vert q) = \frac{1}{2} \left[ \log \frac{|\Sigma_2|}{|\Sigma_1|} - k + \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right]$$

where $k$ is the dimensionality. This closed form is used in the VAE loss, where $p$ is the encoder posterior and $q$ is the prior.
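A direct NumPy sketch of the closed form, checked against the familiar diagonal VAE special case (posterior $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ against a standard normal prior):

```python
import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    """Closed-form D_KL( N(mu1, S1) || N(mu2, S2) ) for full covariances."""
    k = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - k
                  + np.trace(S2_inv @ S1) + d @ S2_inv @ d)

mu1, S1 = np.array([0.0, 1.0]), np.diag([1.0, 2.0])
assert np.isclose(gaussian_kl(mu1, S1, mu1, S1), 0)   # KL(p || p) = 0

# VAE special case: diagonal posterior vs. N(0, I) prior reduces to
# 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1).
mu, var = np.array([0.5, -0.3]), np.array([0.8, 1.2])
vae_kl = 0.5 * np.sum(mu**2 + var - np.log(var) - 1)
assert np.isclose(vae_kl, gaussian_kl(mu, np.diag(var), np.zeros(2), np.eye(2)))
```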
When it is used: VAE regularization (matching the encoder posterior to a prior) (Kingma & Welling, 2014), knowledge distillation (matching a student network's output distribution to a teacher's) (Hinton et al., 2015), and reinforcement learning (PPO constrains policy updates to stay near the previous policy via a KL penalty or its clipped surrogate; RLHF penalizes drift from a reference policy with a KL term) (Schulman et al., 2017; Ouyang et al., 2022).
Mean Squared Error and Regression Losses
| Loss | Formula | Gradient (w.r.t. $\hat{y}$) | Properties |
|---|---|---|---|
| MSE (L2) | $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | $-2(y - \hat{y})$ | Sensitive to outliers; smooth everywhere |
| MAE (L1) | $\frac{1}{N}\sum_i \lvert y_i - \hat{y}_i \rvert$ | $-\operatorname{sign}(y - \hat{y})$ | Robust to outliers; non-differentiable at zero |
| Huber | $\begin{cases} \frac{1}{2}(y - \hat{y})^2 & \lvert y - \hat{y} \rvert \leq \delta \\ \delta(\lvert y - \hat{y} \rvert - \frac{1}{2}\delta) & \text{otherwise} \end{cases}$ | Quadratic, then clipped | Quadratic near zero, linear in the tails |
| Log-cosh | $\frac{1}{N}\sum_i \log\cosh(\hat{y}_i - y_i)$ | $\tanh(\hat{y} - y)$ | Twice-differentiable Huber approximation |
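The table's claims are easy to verify on a toy residual vector containing one outlier. A minimal sketch:

```python
import numpy as np

def mse(r):
    return r ** 2

def mae(r):
    return np.abs(r)

def huber(r, d=1.0):
    return np.where(np.abs(r) <= d, 0.5 * r**2, d * (np.abs(r) - 0.5 * d))

def log_cosh(r):
    return np.log(np.cosh(r))

residuals = np.array([0.1, -0.5, 8.0])   # last entry is an outlier

# MSE is dominated by the outlier; MAE and Huber grow only linearly in it.
assert mse(residuals).mean() > mae(residuals).mean() > huber(residuals).mean()

# log-cosh behaves like r^2/2 for small r and |r| - log 2 for large r.
assert np.isclose(log_cosh(np.array([0.01]))[0], 0.5 * 0.01**2, atol=1e-6)
assert np.isclose(log_cosh(np.array([10.0]))[0], 10.0 - np.log(2), atol=1e-6)
```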
Contrastive and Ranking Losses
The triplet loss is

$$\mathcal{L} = \max(0, \, d(a, p) - d(a, n) + m)$$

where $a$, $p$, and $n$ are the anchor, positive, and negative embeddings, $d$ is a distance, and $m$ is the margin. The loss pushes the anchor closer to the positive and farther from the negative in embedding space.
The InfoNCE loss is

$$\mathcal{L} = -\log \frac{\exp(\operatorname{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{K} \exp(\operatorname{sim}(z_i, z_k) / \tau)}$$

where $(z_i, z_j)$ is the positive pair, the sum over $k$ includes the positive and $K - 1$ negatives, and $\tau$ is the temperature. This is a softmax cross-entropy over similarity scores.
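A minimal single-anchor sketch with random embeddings (the vectors, dimensions, and temperature are illustrative):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE for one anchor: softmax cross-entropy over cosine similarities."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, n) for n in negatives]) / tau
    sims -= sims.max()                        # stabilize the softmax
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.1 * rng.normal(size=8)    # near-duplicate view
negatives = [rng.normal(size=8) for _ in range(5)]

loss = info_nce(anchor, positive, negatives)
```

A more similar positive can only lower the loss: using the anchor itself as the positive (cosine similarity 1) gives a loss no larger than the one above.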
Loss Functions in Practice
| Task | Loss | Output Activation | PyTorch |
|---|---|---|---|
| Multi-class classification | Cross entropy | Softmax (implicit) | F.cross_entropy(logits, labels) |
| Binary classification | Binary cross entropy | Sigmoid | F.binary_cross_entropy_with_logits |
| Regression | MSE | None (linear) | F.mse_loss |
| Robust regression | Huber | None | F.smooth_l1_loss |
| Language modeling | Cross entropy (per token) | Softmax | F.cross_entropy(logits.view(-1, V), labels.view(-1)) |
| Object detection | Focal loss + L1/GIoU | Sigmoid + linear | Custom |
| Contrastive learning | InfoNCE | Cosine similarity | Custom |
| VAE | Reconstruction + KL | Task-dependent + Gaussian | Custom |
Notation Summary
| Symbol | Meaning |
|---|---|
| $p$ | True (data) distribution |
| $q$ | Model (predicted) distribution |
| $H(p)$ | Entropy of $p$ |
| $H(p, q)$ | Cross entropy between $p$ and $q$ |
| $D_{KL}(p \Vert q)$ | KL divergence from $p$ to $q$ |
| $q_y$ | Predicted probability for true class |
| $y_i$ | Ground-truth label for sample $i$ |
| $N$ | Number of samples |
| $C$ | Number of classes |
| $T$ | Sequence length |
| $\tau$ | Temperature parameter |
| $\mathcal{L}$ | Loss value |
| $\sigma$ | Sigmoid function |
References
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Diederik P. Kingma, Max Welling (2014). Auto-Encoding Variational Bayes. ICLR.
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov (2017). Proximal Policy Optimization Algorithms. arXiv.