Information Theory
Information theory, founded by Shannon (1948), provides the mathematical framework for quantifying information, uncertainty, and the cost of communication. In machine learning, information-theoretic quantities serve as training objectives (cross-entropy loss), regularizers (KL divergence), model selection criteria (MDL), and theoretical tools for understanding representations (information bottleneck). This chapter covers the key concepts and their ML applications.
Entropy
For a discrete random variable $X$ with probability mass function $p$, the entropy is $H(X) = -\sum_x p(x) \log p(x)$. For continuous distributions, the differential entropy is $h(X) = -\int p(x) \log p(x)\, dx$.
Properties of entropy:
- Non-negativity: $H(p) \ge 0$ for discrete distributions (with equality iff $p$ is a point mass). Note: differential entropy can be negative.
- Maximum entropy: For $K$ outcomes: $H(p) \le \log K$, with equality iff $p$ is uniform.
- Concavity: $H(\lambda p + (1-\lambda) q) \ge \lambda H(p) + (1-\lambda) H(q)$ -- mixing distributions increases uncertainty.
- Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$ -- joint entropy = marginal + conditional.
- Conditioning reduces entropy: $H(X \mid Y) \le H(X)$, with equality iff $X$ and $Y$ are independent.
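These properties are easy to check numerically. The sketch below uses plain NumPy; `entropy` is a local helper defined here, not a library function.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

# Maximum entropy: uniform over K = 4 outcomes gives log2(4) = 2 bits.
assert np.isclose(entropy([0.25, 0.25, 0.25, 0.25]), 2.0)
# Non-negativity, with zero exactly for a point mass.
assert entropy([1.0, 0.0, 0.0, 0.0]) == 0.0
# Concavity: the entropy of a mixture is at least the mixture of entropies.
p, q = np.array([0.9, 0.1]), np.array([0.1, 0.9])
assert entropy(0.5 * p + 0.5 * q) >= 0.5 * entropy(p) + 0.5 * entropy(q)
```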
Maximum entropy distributions. Given constraints, the maximum entropy distribution makes the fewest assumptions:
- Given mean $\mu$ and variance $\sigma^2$: Gaussian $\mathcal{N}(\mu, \sigma^2)$
- Given bounded support $[a, b]$: Uniform
- Given mean on $\{0, 1, 2, \dots\}$: Geometric
- Given support $[0, \infty)$ and mean: Exponential
This principle justifies common distributional assumptions in ML.
For a Bernoulli variable with success probability $p$, the binary entropy is $H_b(p) = -p \log p - (1-p) \log (1-p)$. This is maximized at $p = \tfrac{1}{2}$ ($H_b = 1$ bit) and zero at $p \in \{0, 1\}$. The binary entropy function appears in the cross-entropy loss for binary classification, where it measures the cost per sample of using model probabilities $\hat{p}$ when the true label distribution is $p$.
Cross-Entropy
The cross-entropy between $p$ and $q$ is $H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$. Since $H(p)$ is constant during optimization (the data distribution is fixed), minimizing cross-entropy is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$.
- Classification: $p$ is the one-hot label and $q$ is the softmax output. Then $H(p, q) = -\log q(y)$, which is the negative log-likelihood of the correct class.
- Language modeling: $p$ is the one-hot next-token label and $q$ is the model's predicted distribution over the vocabulary. The loss averaged over tokens is the cross-entropy $\mathcal{L}$, and $\exp(\mathcal{L})$ is the perplexity.
- Regression with Gaussian noise: $-\log q(y \mid x) = \frac{(y - \mu(x))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$, which reduces to MSE for fixed $\sigma^2$.
In all cases, minimizing cross-entropy = maximizing log-likelihood = minimizing KL divergence from the empirical distribution to the model.
Perplexity measures how many equally likely next tokens the model considers on average. Lower perplexity = better model. A perplexity of $k$ means the model is as uncertain as a uniform distribution over $k$ tokens. For English text, good language models achieve perplexities roughly in the single to low double digits on standard benchmarks (circa 2024).
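The loss-to-perplexity relationship can be verified directly. A minimal sketch; `cross_entropy` is a local helper, not a library function.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log(q + eps)))

# One-hot label: cross-entropy reduces to -log q(correct class).
label = np.array([0.0, 1.0, 0.0])
probs = np.array([0.2, 0.7, 0.1])
assert np.isclose(cross_entropy(label, probs), -np.log(0.7), atol=1e-6)

# Perplexity = exp(mean per-token cross-entropy). A model that assigns
# probability 1/k to every token has perplexity exactly k.
k = 50
losses = [-np.log(1.0 / k)] * 10   # 10 tokens scored by a uniform model
ppl = np.exp(np.mean(losses))
assert np.isclose(ppl, k)
```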
KL Divergence
The KL divergence is $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$. For continuous distributions: $D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$.
Properties of KL divergence:
- Non-negative (Gibbs' inequality): $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality iff $p = q$ a.e.
- Not a metric: not symmetric ($D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general) and does not satisfy the triangle inequality.
- Additive for products: $D_{\mathrm{KL}}(p_1 p_2 \,\|\, q_1 q_2) = D_{\mathrm{KL}}(p_1 \,\|\, q_1) + D_{\mathrm{KL}}(p_2 \,\|\, q_2)$ for independent distributions.
- Invariant under reparameterization: KL divergence is unchanged by invertible transformations of the sample space.
- Can be infinite: $D_{\mathrm{KL}}(p \,\|\, q) = \infty$ if there exists $x$ with $p(x) > 0$ and $q(x) = 0$ (support mismatch).
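A small numerical illustration of non-negativity, asymmetry, and the support-mismatch blowup, using a local `kl` helper:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x)/q(x)); inf on support mismatch."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.4, 0.4, 0.2])

# Non-negativity (Gibbs' inequality).
assert kl(p, q) >= 0
# kl(p, q) is finite: wherever p > 0, q > 0 as well.
assert np.isfinite(kl(p, q))
# kl(q, p) = inf: q puts mass on the third outcome where p has none.
assert kl(q, p) == np.inf
```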
| Direction | Minimizer Behavior | Name | Used In |
|---|---|---|---|
| $\min_q D_{\mathrm{KL}}(p \,\Vert\, q)$ | $q$ must cover all modes of $p$ (zero-avoiding) | Forward KL (mean-seeking) | MLE, cross-entropy loss |
| $\min_q D_{\mathrm{KL}}(q \,\Vert\, p)$ | $q$ concentrates on one mode of $p$ (zero-forcing) | Reverse KL (mode-seeking) | VI, ELBO, policy optimization |
- MLE minimizes forward KL $D_{\mathrm{KL}}(p_{\mathrm{data}} \,\|\, p_\theta)$: the model must assign non-zero probability everywhere the data has support, or pay infinite cost. This makes MLE-trained models "mode-covering" -- they produce diverse but sometimes low-quality samples.
- VI minimizes reverse KL $D_{\mathrm{KL}}(q \,\|\, p)$: the approximation $q$ can safely ignore modes of $p$ it cannot fit. This makes VI "mode-seeking" -- the approximation is tight around one mode but may miss others.
- RLHF uses a KL penalty $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ (reverse KL with respect to the reference) as a constraint: it prevents the policy from deviating too far from the reference model, maintaining generation quality.
- GANs approximate various -divergences (including KL) depending on the discriminator architecture.
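The mode-covering vs. mode-seeking behavior can be seen by fitting a single Gaussian to a bimodal mixture under each direction. This is an illustrative brute-force sketch (grid search over $(\mu, \sigma)$ with numerical integration), not a practical fitting procedure:

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# Bimodal target: mixture of two well-separated Gaussians.
p = 0.5 * gauss(x, -4, 1) + 0.5 * gauss(x, 4, 1)

def kl(a, b):
    eps = 1e-300  # guard against underflow in far tails
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

grid = [(mu, sig) for mu in np.linspace(-6, 6, 61)
                  for sig in np.linspace(0.5, 6, 56)]
fwd_mu, fwd_sig = min(grid, key=lambda t: kl(p, gauss(x, *t)))  # forward KL(p||q)
rev_mu, rev_sig = min(grid, key=lambda t: kl(gauss(x, *t), p))  # reverse KL(q||p)

# Forward KL: q must cover both modes -> centered between them, wide.
assert abs(fwd_mu) < 0.5 and fwd_sig > 3
# Reverse KL: q locks onto one mode near +/-4, with sigma close to 1.
assert abs(abs(rev_mu) - 4) < 0.5 and rev_sig < 1.5
```

The forward-KL fit matches the mixture's moments (mean 0, large variance), while the reverse-KL fit sits tightly on a single mode, exactly as the table above describes.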
For the special case $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$ (the VAE KL term): $D_{\mathrm{KL}}(q \,\|\, p) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$.
This closed-form KL is why VAEs use a Gaussian encoder and prior -- it makes the ELBO tractable.
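The closed form can be checked against a Monte Carlo estimate of $\mathbb{E}_{z \sim q}[\log q(z) - \log p(z)]$:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss_std(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) )."""
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

mu, sigma = 1.5, 0.7
z = rng.normal(mu, sigma, size=1_000_000)
log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
mc = np.mean(log_q - log_p)

# Monte Carlo estimate agrees with the closed form to within sampling error.
assert abs(mc - kl_gauss_std(mu, sigma)) < 0.01
```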
f-Divergences
An $f$-divergence has the form $D_f(p \,\|\, q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$ for a convex function $f$ with $f(1) = 0$. Common members:

| Name | $f(t)$ | ML Usage |
|---|---|---|
| KL divergence | $t \log t$ | MLE, ELBO |
| Reverse KL | $-\log t$ | VI, policy optimization |
| Jensen-Shannon | $\frac{t}{2}\log t - \frac{1+t}{2}\log\frac{1+t}{2}$ | Original GAN objective |
| Chi-squared | $(t-1)^2$ | Importance weighting diagnostics |
| Total variation | $\frac{1}{2}\lvert t-1 \rvert$ | |
| Hellinger | $(\sqrt{t}-1)^2$ | Statistical testing |
Every $f$-divergence admits the variational lower bound $D_f(p \,\|\, q) \ge \sup_T \, \mathbb{E}_{x \sim p}[T(x)] - \mathbb{E}_{x \sim q}[f^*(T(x))]$, where $f^*$ is the convex conjugate of $f$. This is the basis of $f$-GANs (Nowozin et al., 2016): the discriminator maximizes the variational bound, while the generator minimizes it. Different choices of $f$ yield different GAN objectives:
- $f(t) = t \log t$: KL-GAN
- $f(t)$ of the JS divergence: Original GAN
- $f(t) = (t-1)^2$ (Pearson $\chi^2$): Least-squares GAN
The Wasserstein distance (used in WGAN) is not an -divergence -- it is an optimal transport distance that metrizes weak convergence, avoiding the mode collapse issues of -divergences.
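The defining formula is easy to implement generically; a sketch with a local `f_divergence` helper, checking that specific choices of $f$ recover the familiar divergences:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p||q) = sum_x q(x) f(p(x)/q(x)), for convex f with f(1) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl_f   = f_divergence(p, q, lambda t: t * np.log(t))        # forward KL
rkl_f  = f_divergence(p, q, lambda t: -np.log(t))           # reverse KL
chi2_f = f_divergence(p, q, lambda t: (t - 1.0) ** 2)       # Pearson chi^2
tv_f   = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))  # total variation

# f(t) = t log t recovers KL(p||q); f(t) = -log t recovers KL(q||p).
assert np.isclose(kl_f, np.sum(p * np.log(p / q)))
assert np.isclose(rkl_f, np.sum(q * np.log(q / p)))
# Direct formulas: chi^2 = sum (p-q)^2 / q, TV = (1/2) sum |p - q|.
assert np.isclose(chi2_f, np.sum((p - q) ** 2 / q))
assert np.isclose(tv_f, 0.5 * np.sum(np.abs(p - q)))
```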
Mutual Information
Mutual information is the KL divergence between the joint and the product of marginals: $I(X; Y) = D_{\mathrm{KL}}(p(x, y) \,\|\, p(x)\,p(y))$. Equivalently: $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$.
Properties of mutual information:
- Non-negative: $I(X; Y) \ge 0$, with equality iff $X$ and $Y$ are independent.
- Symmetric: $I(X; Y) = I(Y; X)$ (unlike KL divergence).
- Invariant under bijections: $I(f(X); g(Y)) = I(X; Y)$ for invertible $f, g$.
- Bounds entropy: $I(X; Y) \le \min(H(X), H(Y))$.
- Chain rule: $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$.
| Application | How MI Is Used |
|---|---|
| Feature selection | Select features that maximize $I(X_i; Y)$ with the target |
| InfoNCE loss (Oord et al., 2018) | Lower bound on $I(X; Y)$ using contrastive learning: $I(X; Y) \ge \log N - \mathcal{L}_{\mathrm{InfoNCE}}$ |
| Information bottleneck (Tishby et al., 2000) | Find representation $Z$ that maximizes $I(Z; Y)$ while minimizing $I(Z; X)$ |
| Representation quality | Good representations have high $I(Z; Y)$ (predictive) and low $I(Z; X)$ (compressed) |
| Independence testing | $I(X; Y) = 0$ iff $X \perp Y$ -- stronger than correlation (captures nonlinear dependence) |
| Variational bounds | Used in MINE (Belghazi et al., 2018), InfoNCE, and other contrastive objectives |
MI captures all statistical dependencies (not just linear), making it the "gold standard" measure of association. However, MI is notoriously hard to estimate in high dimensions -- practical estimators use variational bounds.
Conditional Entropy and Chain Rules
- $H(X, Y) = H(X) + H(Y \mid X)$ (chain rule)
- $I(X; Y) = H(X) - H(X \mid Y)$ (mutual information as overlap)
- $H(X, Y) = H(X) + H(Y) - I(X; Y)$ (joint = sum - overlap)
These identities are the information-theoretic analogue of the inclusion-exclusion principle for set sizes.
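These identities can all be verified from a small joint probability table. A sketch with a local entropy helper `H`:

```python
import numpy as np

# Joint distribution p(x, y) as a table; rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.10, 0.20]])

def H(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

px, py = pxy.sum(axis=1), pxy.sum(axis=0)
# Conditional entropy from its definition: H(Y|X) = sum_x p(x) H(Y | X = x).
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))

# Chain rule: H(X, Y) = H(X) + H(Y|X).
assert np.isclose(H(pxy), H(px) + H_Y_given_X)
# Mutual information as overlap: I(X;Y) = H(Y) - H(Y|X).
I_XY = H(py) - H_Y_given_X
# Inclusion-exclusion: H(X, Y) = H(X) + H(Y) - I(X;Y).
assert np.isclose(H(pxy), H(px) + H(py) - I_XY)
# I(X;Y) equals the KL divergence between joint and product of marginals.
assert np.isclose(I_XY, np.sum(pxy * np.log(pxy / np.outer(px, py))))
```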
Data Processing Inequality
If $X \to Y \to Z$ forms a Markov chain, then $I(X; Y) \ge I(X; Z)$: processing data can only lose information, never create it. Equality holds iff $X \to Z \to Y$ is also a Markov chain (i.e., $Z$ is a sufficient statistic of $Y$ for $X$).
Implications:
- In a feedforward network $X \to h_1 \to h_2 \to \dots$, later layers cannot recover information lost by earlier layers. This motivates careful design of early layers and residual/skip connections (which break the Markov chain).
- Information bottleneck theory (Tishby and Zaslavsky, 2015) conjectures that training has two phases: (1) fitting -- $I(h; Y)$ increases; (2) compression -- $I(h; X)$ decreases, keeping only task-relevant information.
- Sufficient statistics: If a layer $h$ is a sufficient statistic of $X$ for $Y$ (i.e., $I(h; Y) = I(X; Y)$), then no information about $Y$ has been lost. This is the ideal case.
Skip connections in ResNets violate the strict Markov chain structure, allowing $I(h_\ell; X)$ to remain close to $H(X)$ even in very deep networks.
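The inequality can be demonstrated on a randomly generated discrete Markov chain $X \to Y \to Z$, with `mutual_info` as a local helper:

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) from a joint table p(x, y)."""
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log((pxy / (px * py))[nz])))

rng = np.random.default_rng(1)
# Random Markov chain X -> Y -> Z over 4-letter alphabets.
px = rng.dirichlet(np.ones(4))          # p(x)
A = rng.dirichlet(np.ones(4), size=4)   # channel p(y|x), rows sum to 1
B = rng.dirichlet(np.ones(4), size=4)   # channel p(z|y)

pxy = px[:, None] * A                   # p(x, y)
pxz = pxy @ B                           # p(x, z) = sum_y p(x, y) p(z|y)

# Data processing inequality: I(X; Z) <= I(X; Y).
assert mutual_info(pxz) <= mutual_info(pxy) + 1e-12
```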
Rate-Distortion Theory
The rate-distortion function is $R(D) = \min_{p(\hat{x} \mid x):\, \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$, where $d(x, \hat{x})$ is a distortion measure (e.g., MSE).
The information bottleneck is a rate-distortion problem: minimize the "rate" $I(Z; X)$ (compression) subject to acceptable "distortion" (preserving the task-relevant information $I(Z; Y)$). The Lagrange multiplier $\beta$ trades off compression vs. prediction quality. VAEs solve a related problem: the KL term upper-bounds the rate under the variational approximation.
Lossy compression, quantization, and knowledge distillation can all be framed as rate-distortion problems.
Maximum Entropy Principle
Maximizing entropy subject to moment constraints $\mathbb{E}_p[f_i(x)] = c_i$ yields $p(x) = \frac{1}{Z} \exp\!\left(\sum_i \lambda_i f_i(x)\right)$, where $\lambda_i$ are Lagrange multipliers and $Z$ is the normalizing constant. This is exactly an exponential family distribution with sufficient statistics $f_i(x)$.
- Constraint on the mean (support $[0, \infty)$): exponential distribution
- Constraints on mean and variance: Gaussian
- Constraint only that probabilities sum to 1: uniform
- Constraint on the mean of a function of a categorical variable: softmax (Boltzmann) distribution
This connects to statistical mechanics (Boltzmann distribution), information geometry (exponential families as maximum entropy), and ML (softmax as maximum entropy classifier).
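A classic worked example is the "Brandeis dice" problem: find the maximum entropy distribution on $\{1, \dots, 6\}$ with mean 4.5. By the result above it has the exponential form $p(i) \propto e^{\lambda i}$, and the multiplier can be found by bisection (a sketch; `maxent_dist` and `solve_lambda` are local helpers):

```python
import numpy as np

def maxent_dist(lam, vals):
    """Exponential-family form p(i) = exp(lam * i) / Z over the values."""
    w = np.exp(lam * vals)
    return w / w.sum()

def solve_lambda(target_mean, vals, lo=-5.0, hi=5.0):
    """Bisection on the Lagrange multiplier so that E_p[X] = target_mean
    (the mean is monotonically increasing in lam)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if maxent_dist(mid, vals) @ vals < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

vals = np.arange(1, 7, dtype=float)
lam = solve_lambda(4.5, vals)
p = maxent_dist(lam, vals)

assert np.isclose(p @ vals, 4.5, atol=1e-6)   # constraint satisfied
assert np.all(np.diff(p) > 0)                 # lam > 0: heavier faces likelier
# With no active constraint (lam = 0), the uniform distribution is recovered.
assert np.isclose(maxent_dist(0.0, vals) @ vals, 3.5)
```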
Notation Summary
| Symbol | Meaning |
|---|---|
| $H(p)$ | Entropy of distribution $p$ |
| $h(p)$ | Differential entropy (continuous) |
| $H(p, q)$ | Cross-entropy between $p$ and $q$ |
| $D_{\mathrm{KL}}(p \,\Vert\, q)$ | KL divergence of $p$ from $q$ |
| $D_f(p \,\Vert\, q)$ | $f$-divergence of $p$ from $q$ |
| $I(X; Y)$ | Mutual information between $X$ and $Y$ |
| $H(X \mid Y)$ | Conditional entropy of $X$ given $Y$ |
| $H(X, Y)$ | Joint entropy of $X$ and $Y$ |
| $R(D)$ | Rate-distortion function |
| $f^*$ | Convex conjugate of $f$ |
| PPL | Perplexity: $\exp(\text{cross-entropy})$ |
| $\log$ | Natural logarithm (nats) or base-2 logarithm (bits) |