Loss Functions

A loss function $\ell: \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$ maps a prediction and a ground truth to a scalar measuring how wrong the prediction is. The choice of loss function encodes our assumptions about the task, the noise model, and what "good performance" means.

Cross Entropy

The **entropy** of a discrete distribution $p$ is:

$$H(p) = -\sum_{x} p(x) \log p(x) = \mathbb{E}_{x \sim p}\left[-\log p(x)\right]$$

Entropy measures the average surprise (in nats if using $\ln$, or bits if using $\log_2$) of observing samples from $p$. It is maximized by the uniform distribution and equals zero only for a deterministic distribution.

The **cross entropy** between a true distribution $p$ and a predicted distribution $q$ is:

$$H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{\text{KL}}(p \| q)$$

The decomposition $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$ is key. $H(p)$ depends only on the true data distribution, which is fixed during training. Since we optimize over the model (which defines $q$), only the $D_{\text{KL}}$ term varies with the model parameters. Minimizing cross entropy is therefore equivalent to minimizing $D_{\text{KL}}(p \| q)$.
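The decomposition is easy to verify numerically; a quick NumPy check with arbitrary example distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

H_p  = -np.sum(p * np.log(p))       # entropy of p
H_pq = -np.sum(p * np.log(q))       # cross entropy H(p, q)
d_kl = np.sum(p * np.log(p / q))    # D_KL(p || q)

# Cross entropy = entropy + KL divergence
assert np.isclose(H_pq, H_p + d_kl)
```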

**One-hot classification.** For a classification problem with $K$ classes, the true label for sample $i$ is a one-hot vector $p = e_c$ (all zeros except position $c$). The cross entropy reduces to:

$$\mathcal{L} = -\sum_{k=1}^K p_k \log q_k = -\log q_c$$

where $q_c = \hat{y}_c$ is the model's predicted probability for the true class. This is the negative log-likelihood of the correct class.

For binary classification ($K = 2$), the full loss averaged over $N$ samples is:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

where $y_i \in \{0, 1\}$ and $\hat{y}_i = \sigma(z_i) \in (0, 1)$ is the sigmoid output.

**Gradient simplicity.** The gradient of cross-entropy loss composed with softmax has a remarkably clean form. For logits $z \in \mathbb{R}^K$ and softmax output $\hat{y} = \text{softmax}(z)$:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - \mathbf{1}[k = c]$$

This is simply the predicted probability minus the one-hot target. No derivatives of log or softmax appear explicitly, because they cancel. This cancellation also improves numerical stability, which is why `F.cross_entropy` in PyTorch takes raw logits, not probabilities.

When it is used: Standard loss for classification tasks (logistic regression, neural network classifiers, language model next-token prediction). For language models, the loss is averaged over all tokens in a sequence: $\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T \log P(x_t \mid x_{<t})$.
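The clean gradient form above can be checked against finite differences. A minimal NumPy sketch (the helper names `softmax` and `cross_entropy` are illustrative, not a library API):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, c):
    # -log softmax(z)[c], computed via a stable log-sum-exp
    z = z - z.max()
    return np.log(np.exp(z).sum()) - z[c]

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits
c = 0                            # true class index
grad_analytic = softmax(z) - np.eye(3)[c]   # y_hat - one_hot(c)

# Central finite differences on each logit
eps = 1e-6
grad_numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[k], c)
     - cross_entropy(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```

The gradient also sums to zero, since both the softmax output and the one-hot target sum to one.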

KL Divergence

The **Kullback–Leibler divergence** measures how much one distribution $p$ differs from another $q$:

$$D_{\text{KL}}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

For continuous distributions:

$$D_{\text{KL}}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

Key Properties

  1. Non-negativity (Gibbs' inequality): $D_{\text{KL}}(p \| q) \geq 0$, with equality iff $p = q$ almost everywhere. This follows from Jensen's inequality applied to the concave function $\log$.
  2. Asymmetry: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general. KL divergence is therefore not a metric (it also fails the triangle inequality).
  3. Additivity: For independent variables, $D_{\text{KL}}(p_1 p_2 \| q_1 q_2) = D_{\text{KL}}(p_1 \| q_1) + D_{\text{KL}}(p_2 \| q_2)$.
  4. Invariance under reparameterization: KL divergence is unchanged by invertible transformations of the random variable.
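Properties 1 and 2 can be checked directly on small discrete distributions; a NumPy sketch (the example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

assert kl(p, q) >= 0                       # property 1: non-negativity
assert kl(p, p) == 0.0                     # equality iff p = q
assert abs(kl(p, q) - kl(q, p)) > 1e-6     # property 2: asymmetry
```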

Forward vs. Reverse KL

The asymmetry of KL divergence has profound consequences for learning:

| Direction | Name | Behavior | $q$ where $p > 0$ but $q \approx 0$ | Result |
|---|---|---|---|---|
| $D_{\text{KL}}(p \,\Vert\, q)$ | Forward KL | Mean-seeking | Infinite penalty | $q$ covers all modes of $p$ |
| $D_{\text{KL}}(q \,\Vert\, p)$ | Reverse KL | Mode-seeking | No penalty | $q$ locks onto one mode of $p$ |
**Forward KL** ($D_{\text{KL}}(p \| q)$) is what we minimize when we train by maximum likelihood (cross-entropy). It forces $q$ to spread mass everywhere $p$ has mass, which can produce overly diffuse predictions when $p$ is multimodal.

**Reverse KL** ($D_{\text{KL}}(q \| p)$) is what variational inference minimizes (via the ELBO). It allows $q$ to ignore modes of $p$, producing sharper but potentially incomplete approximations. This is why variational posteriors tend to be overconfident, concentrating on a single mode and underestimating uncertainty.

**Closed form for Gaussians.** For $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\text{KL}}(p \| q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

For multivariate Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$:

$$D_{\text{KL}}(p \| q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1}(\mu_2 - \mu_1)\right]$$

where $d$ is the dimensionality. This is used in the VAE loss, where $p$ is the encoder posterior and $q = \mathcal{N}(0, I)$ is the prior.
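The univariate closed form can be cross-checked against direct numerical integration of the KL integral; a NumPy sketch with arbitrary example parameters:

```python
import numpy as np

def gaussian_kl(mu1, s1, mu2, s2):
    # Closed form for D_KL( N(mu1, s1^2) || N(mu2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def gaussian_pdf(x, mu, s):
    return np.exp(-((x - mu) ** 2) / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
x = np.linspace(-10.0, 10.0, 200001)
p, q = gaussian_pdf(x, mu1, s1), gaussian_pdf(x, mu2, s2)
# Riemann sum approximation of the integral of p * log(p / q)
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

assert abs(gaussian_kl(mu1, s1, mu2, s2) - numeric) < 1e-6
```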

When it is used: VAE regularization (matching the encoder posterior to a prior) (Kingma & Welling, 2014), knowledge distillation (matching a student network's output distribution to a teacher's) (Hinton et al., 2015), and reinforcement learning (PPO constrains policy updates via a KL penalty or its clipped surrogate objective; RLHF penalizes KL drift from a reference policy) (Schulman et al., 2017; Ouyang et al., 2022).

Mean Squared Error and Regression Losses

For regression with continuous targets $y_i \in \mathbb{R}$:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

MSE is the maximum likelihood loss under the assumption that $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The negative log-likelihood of $y \mid x$ under this model is $\frac{1}{2\sigma^2}(y - f(x))^2 + \text{const}$, so minimizing the negative log-likelihood is equivalent to minimizing MSE.
| Loss | Formula | Gradient | Properties |
|---|---|---|---|
| MSE (L2) | $(y - \hat{y})^2$ | $2(\hat{y} - y)$ | Sensitive to outliers; smooth everywhere |
| MAE (L1) | $\lvert y - \hat{y} \rvert$ | $\text{sign}(\hat{y} - y)$ | Robust to outliers; non-differentiable at zero |
| Huber | $\begin{cases} \frac{1}{2}(y-\hat{y})^2 & \lvert y-\hat{y} \rvert \leq \delta \\ \delta\left(\lvert y-\hat{y} \rvert - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}$ | $(\hat{y} - y)$ clipped to $[-\delta, \delta]$ | Quadratic near zero, linear in the tails |
| Log-cosh | $\log \cosh(y - \hat{y})$ | $\tanh(\hat{y} - y)$ | Twice-differentiable Huber approximation |
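The Huber loss's piecewise definition is short to implement; a NumPy sketch (defaulting to $\delta = 1$, a common choice):

```python
import numpy as np

def huber(r, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond; r = y - y_hat
    r = np.asarray(r, float)
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

# Matches half the squared error near zero...
assert huber(0.5) == 0.5 * 0.5**2
# ...and grows linearly (like MAE) for large residuals
assert huber(10.0) == 1.0 * (10.0 - 0.5)
```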

Contrastive and Ranking Losses

Given an anchor $a$, a positive $p$ (same class), and a negative $n$ (different class):

$$\mathcal{L}_{\text{triplet}} = \max\!\left(0, \; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\right)$$

where $\alpha > 0$ is the margin. The loss pushes the anchor closer to the positive and farther from the negative in embedding space.

The **InfoNCE** contrastive loss, used in CLIP [@radford2021clip], SimCLR [@chen2020simclr], and similar frameworks, is:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(f(x)^\top f(x^+) / \tau)}{\sum_{j=1}^{K} \exp(f(x)^\top f(x_j) / \tau)}$$

where $x^+$ is the positive pair, $\{x_j\}$ includes the positive and $K-1$ negatives, and $\tau > 0$ is the temperature. This is a softmax cross-entropy over similarity scores.

Minimizing InfoNCE maximizes a lower bound on mutual information: $I(X; X^+) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$ [@oord2018representation]. More negatives (larger $K$) give a tighter bound but require more computation.
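A minimal NumPy sketch of InfoNCE for a single anchor (the layout, with the positive in row 0 and L2-normalized embeddings, is an assumption of this example):

```python
import numpy as np

def info_nce(anchor, candidates, tau=0.1):
    # candidates[0] is the positive; the rest are negatives.
    # Embeddings are unit-norm, so dot products are cosine similarities.
    sims = candidates @ anchor / tau
    sims = sims - sims.max()                 # stabilize the softmax
    return -sims[0] + np.log(np.exp(sims).sum())

rng = np.random.default_rng(0)
dim, K = 16, 8
anchor = rng.normal(size=dim)
anchor /= np.linalg.norm(anchor)
negatives = rng.normal(size=(K - 1, dim))
negatives /= np.linalg.norm(negatives, axis=1, keepdims=True)
positive = anchor                            # perfectly aligned positive
candidates = np.vstack([positive, negatives])

# A perfectly aligned positive should score well below chance level (log K)
assert info_nce(anchor, candidates) < np.log(K)
```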

Loss Functions in Practice

| Task | Loss | Output Activation | PyTorch |
|---|---|---|---|
| Multi-class classification | Cross entropy | Softmax (implicit) | `F.cross_entropy(logits, labels)` |
| Binary classification | Binary cross entropy | Sigmoid | `F.binary_cross_entropy_with_logits` |
| Regression | MSE | None (linear) | `F.mse_loss` |
| Robust regression | Huber | None | `F.smooth_l1_loss` |
| Language modeling | Cross entropy (per token) | Softmax | `F.cross_entropy(logits.view(-1, V), labels.view(-1))` |
| Object detection | Focal loss + L1/GIoU | Sigmoid + linear | Custom |
| Contrastive learning | InfoNCE | Cosine similarity | Custom |
| VAE | Reconstruction + KL | Task-dependent + Gaussian | Custom |
**Label smoothing** [@szegedy2016rethinking] replaces hard one-hot labels $p = e_c$ with smoothed labels $p_k = (1-\epsilon)\mathbf{1}[k=c] + \epsilon/K$, where $\epsilon$ is typically $0.1$. This prevents the model from becoming overconfident (logits $\to \pm\infty$) and acts as a form of regularization. It is equivalent to adding a small amount of KL divergence toward the uniform distribution.
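Constructing smoothed labels is a one-liner; a NumPy sketch (the function name is illustrative):

```python
import numpy as np

def smooth_labels(c, K, eps=0.1):
    # (1 - eps) * one_hot(c) + eps / K
    p = np.full(K, eps / K)
    p[c] += 1.0 - eps
    return p

p = smooth_labels(c=2, K=5, eps=0.1)
assert np.isclose(p.sum(), 1.0)   # still a valid distribution
assert np.isclose(p[2], 0.92)     # (1 - 0.1) + 0.1/5
assert np.isclose(p[0], 0.02)     # 0.1/5
```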

Notation Summary

| Symbol | Meaning |
|---|---|
| $p$ | True (data) distribution |
| $q$ | Model (predicted) distribution |
| $H(p)$ | Entropy of $p$ |
| $H(p, q)$ | Cross entropy between $p$ and $q$ |
| $D_{\text{KL}}(p \,\Vert\, q)$ | KL divergence from $q$ to $p$ |
| $\hat{y}_c$ | Predicted probability for true class $c$ |
| $y_i$ | Ground-truth label for sample $i$ |
| $N$ | Number of samples |
| $K$ | Number of classes |
| $T$ | Sequence length |
| $\tau$ | Temperature parameter |
| $\mathcal{L}$ | Loss value |
| $\sigma(\cdot)$ | Sigmoid function |

References