
Information Theory

Information theory, founded by Shannon (1948), provides the mathematical framework for quantifying information, uncertainty, and the cost of communication. In machine learning, information-theoretic quantities serve as training objectives (cross-entropy loss), regularizers (KL divergence), model selection criteria (MDL), and theoretical tools for understanding representations (information bottleneck). This chapter covers the key concepts and their ML applications.

Entropy

The **entropy** of a discrete distribution $p$ measures the expected information content (or uncertainty):

$$H(p) = -\sum_x p(x) \log p(x) = \mathbb{E}_{p}[-\log p(X)]$$

For continuous distributions, the differential entropy is $h(p) = -\int p(x) \log p(x) \, dx$.

Properties of entropy:

  • Non-negativity: $H(p) \geq 0$ for discrete distributions (with equality iff $p$ is a point mass). Note: differential entropy can be negative.
  • Maximum entropy: For $K$ outcomes, $H(p) \leq \log K$ with equality iff $p$ is uniform.
  • Concavity: $H(\lambda p + (1-\lambda)q) \geq \lambda H(p) + (1-\lambda)H(q)$ -- mixing distributions increases uncertainty.
  • Chain rule: $H(X, Y) = H(X) + H(Y|X)$ -- joint entropy = marginal + conditional.
  • Conditioning reduces entropy: $H(X|Y) \leq H(X)$ with equality iff $X \perp Y$.
**Operational interpretation.** Entropy is the expected number of bits (if $\log_2$) or nats (if $\ln$) needed to optimally encode samples from $p$. Shannon's source coding theorem states that the optimal lossless compression rate for i.i.d. samples from $p$ is exactly $H(p)$ bits per sample. No encoding can do better on average.
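The source-coding view is easy to check numerically. Below is a minimal sketch (the `entropy` helper and the example distributions are ours, not from the text) showing that the uniform distribution attains the maximum $\log_2 K$ bits and that a near-deterministic distribution carries little information:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

uniform = np.ones(8) / 8
peaked = np.array([0.97, 0.01, 0.01, 0.01, 0, 0, 0, 0])

print(entropy(uniform))     # log2(8) = 3 bits: the maximum for 8 outcomes
print(entropy(peaked))      # far below 3 bits: nearly deterministic
print(entropy([0.5, 0.5]))  # binary entropy at p = 1/2 is exactly 1 bit
```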

**Maximum entropy distributions.** Given constraints, the maximum entropy distribution makes the fewest assumptions:

  • Given mean and variance $\to$ Gaussian
  • Given support $[a,b]$ $\to$ Uniform
  • Given mean on $\mathbb{N}$ $\to$ Geometric
  • Given support $\mathbb{R}^+$ and mean $\to$ Exponential

This principle justifies common distributional assumptions in ML.

**Binary entropy.** For a Bernoulli variable with $P(X=1) = p$:

$$H(p) = -p\log p - (1-p)\log(1-p)$$

This is maximized at $p = 1/2$ ($H = \log 2 = 1$ bit) and zero at $p \in \{0, 1\}$. The binary entropy function appears in the cross-entropy loss for binary classification, where it measures the cost per sample of using model probabilities $q$ when the true label distribution is $p$.

Cross-Entropy

The **cross-entropy** between a true distribution $p$ and a model $q$:

$$H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\text{KL}}(p \| q)$$

Since $H(p)$ is constant during optimization (the data distribution is fixed), minimizing cross-entropy is equivalent to minimizing the KL divergence $D_{\text{KL}}(p \| q)$.

**Cross-entropy as the universal training loss.** Cross-entropy loss in ML is exactly this quantity applied to empirical distributions:
  • Classification: $p$ is the one-hot label $e_y$ and $q$ is the softmax output. Then $H(e_y, q) = -\log q_y$, which is the negative log-likelihood of the correct class.
  • Language modeling: $p$ is the one-hot next-token label and $q$ is the model's predicted distribution over the vocabulary. The loss $-\log q(x_t \mid x_{<t})$ averaged over tokens is the cross-entropy, and $\exp(H)$ is the perplexity.
  • Regression with Gaussian noise: $-\log \mathcal{N}(y \mid \mu, \sigma^2) = \frac{(y-\mu)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$, which reduces to MSE for fixed $\sigma$.

In all cases, minimizing cross-entropy = maximizing log-likelihood = minimizing KL divergence from the empirical distribution to the model.
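This equivalence can be verified numerically. A minimal sketch with hypothetical distributions (the helpers are ours), checking both the one-hot case $H(e_y, q) = -\log q_y$ and the identity $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

q = np.array([0.7, 0.2, 0.1])    # model output (e.g., a softmax)
e_y = np.array([1.0, 0.0, 0.0])  # one-hot label with y = 0

# One-hot cross-entropy collapses to the NLL of the correct class.
assert np.isclose(cross_entropy(e_y, q), -np.log(q[0]))

# The decomposition H(p, q) = H(p) + KL(p || q) holds for any p.
p_soft = np.array([0.6, 0.3, 0.1])
assert np.isclose(cross_entropy(p_soft, q), entropy(p_soft) + kl(p_soft, q))
```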

**Perplexity.** The perplexity of a language model is:

$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^T \log p(x_t \mid x_{<t})\right) = \exp(H_{\text{cross-entropy}})$$

Perplexity measures how many equally likely next tokens the model considers on average. Lower perplexity = better model. A perplexity of $K$ means the model is as uncertain as a uniform distribution over $K$ tokens. For English text, good language models achieve perplexity $\sim 15$–$25$ (circa 2024).
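Perplexity is straightforward to compute from per-token probabilities. A minimal sketch with made-up numbers (the probabilities are illustrative, not from any real model), including the sanity check that a uniform model over $K$ tokens has perplexity exactly $K$:

```python
import numpy as np

# Hypothetical probabilities a model assigns to the observed tokens.
token_probs = np.array([0.2, 0.05, 0.5, 0.1])

avg_nll = -np.mean(np.log(token_probs))  # cross-entropy in nats per token
ppl = np.exp(avg_nll)                    # perplexity = exp(cross-entropy)

# Sanity check: a uniform model over a 50-token vocabulary has perplexity 50.
uniform_ppl = np.exp(-np.mean(np.log(np.full(4, 1 / 50))))
print(ppl, uniform_ppl)
```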

KL Divergence

The **Kullback-Leibler divergence** from $q$ to $p$ [@kullback1951information]:

$$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(x)}{q(x)}\right] \geq 0$$

For continuous distributions: $D_{\text{KL}}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$.

Properties of KL divergence:

  • Non-negative (Gibbs' inequality): $D_{\text{KL}}(p \| q) \geq 0$ with equality iff $p = q$ a.e.
  • Not a metric: not symmetric ($D_{\text{KL}}(p\|q) \neq D_{\text{KL}}(q\|p)$) and does not satisfy the triangle inequality.
  • Additive for products: $D_{\text{KL}}(p_1 p_2 \| q_1 q_2) = D_{\text{KL}}(p_1\|q_1) + D_{\text{KL}}(p_2\|q_2)$ for product (independent) distributions.
  • Invariant under reparameterization: KL divergence is unchanged by invertible transformations of the sample space.
  • Can be infinite: $D_{\text{KL}}(p\|q) = \infty$ if there exists $x$ with $p(x) > 0$ and $q(x) = 0$ (support mismatch).
| Direction | Minimizer Behavior | Name | Used In |
|---|---|---|---|
| $D_{\text{KL}}(p \Vert q)$ | $q$ must cover all modes of $p$ (zero-avoiding) | Forward KL (mean-seeking) | MLE, cross-entropy loss |
| $D_{\text{KL}}(q \Vert p)$ | $q$ concentrates on one mode of $p$ (zero-forcing) | Reverse KL (mode-seeking) | VI, ELBO, policy optimization |
**KL direction matters profoundly in practice.**
  • MLE minimizes forward KL $D_{\text{KL}}(p_{\text{data}} \| p_\theta)$: the model must assign non-zero probability everywhere the data has support, or pay infinite cost. This makes MLE-trained models "mode-covering" -- they produce diverse but sometimes low-quality samples.
  • VI minimizes reverse KL $D_{\text{KL}}(q_\phi \| p(\theta|\mathcal{D}))$: the approximation $q$ can safely ignore modes of $p$ it cannot fit. This makes VI "mode-seeking" -- the approximation is tight around one mode but may miss others.
  • RLHF constrains the policy with a KL term: $D_{\text{KL}}(\pi \| \pi_{\text{ref}}) \leq \epsilon$ prevents the policy from deviating too far from the reference model, maintaining generation quality.
  • GANs approximate various $f$-divergences (including KL) depending on the discriminator architecture.
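The asymmetry and the support-mismatch blow-up can both be seen in a few lines. A minimal sketch (our own `kl` helper) on hypothetical discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; infinite on support mismatch."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.8, 0.1, 0.1])
print(kl(p, q), kl(q, p))  # unequal: KL is not symmetric

r = np.array([0.5, 0.5, 0.0])
print(kl(p, r))  # inf: p puts mass where r has none
print(kl(r, p))  # finite: r's support is contained in p's
```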
**KL between Gaussians.** For two multivariate Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^d$:

$$D_{\text{KL}}(p \| q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1}(\mu_2 - \mu_1) - d + \log\frac{|\Sigma_2|}{|\Sigma_1|}\right]$$

For the special case $p = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $q = \mathcal{N}(0, I)$ (the VAE KL term):

$$D_{\text{KL}}(p \| q) = \frac{1}{2}\sum_{j=1}^d \left[\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right]$$

This closed-form KL is why VAEs use Gaussian encoders and priors -- it makes the ELBO tractable.
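The closed form can be cross-checked against a Monte Carlo estimate of $\mathbb{E}_p[\log p(z) - \log q(z)]$. A minimal sketch with randomly chosen, purely illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu = rng.normal(size=d)
log_sigma2 = 0.5 * rng.normal(size=d)
sigma2 = np.exp(log_sigma2)

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) -- the VAE regularizer.
kl_closed = 0.5 * np.sum(sigma2 + mu**2 - 1.0 - log_sigma2)

# Monte Carlo estimate: average of log p(z) - log q(z) over samples z ~ p.
z = mu + np.sqrt(sigma2) * rng.normal(size=(200_000, d))
log_p = -0.5 * np.sum((z - mu)**2 / sigma2 + log_sigma2 + np.log(2 * np.pi), axis=1)
log_q = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_p - log_q)

print(kl_closed, kl_mc)  # the two agree up to Monte Carlo error
```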

f-Divergences

The **$f$-divergence** [@csiszar1967information; @ali1966general] generalizes KL divergence using a convex function $f$ with $f(1) = 0$:

$$D_f(p \| q) = \mathbb{E}_q\left[f\left(\frac{p(x)}{q(x)}\right)\right] = \int q(x) f\left(\frac{p(x)}{q(x)}\right) dx$$

| Name | $f(t)$ | $D_f(p \Vert q)$ | ML Usage |
|---|---|---|---|
| KL divergence | $t \log t$ | $\mathbb{E}_p[\log(p/q)]$ | MLE, ELBO |
| Reverse KL | $-\log t$ | $\mathbb{E}_q[\log(q/p)]$ | VI, policy optimization |
| Jensen-Shannon | $\frac{t}{2}\log t - \frac{1+t}{2}\log\frac{1+t}{2}$ | $\frac{1}{2}D_{\text{KL}}(p \Vert m) + \frac{1}{2}D_{\text{KL}}(q \Vert m)$, $m = \frac{p+q}{2}$ | Original GAN objective |
| Chi-squared | $(t-1)^2$ | $\mathbb{E}_q[(p/q - 1)^2]$ | Importance weighting diagnostics |
| Total variation | $\frac{1}{2}\lvert t-1 \rvert$ | $\frac{1}{2}\int \lvert p - q \rvert \, dx$ | |
| Hellinger | $(\sqrt{t} - 1)^2$ | $\int (\sqrt{p} - \sqrt{q})^2 \, dx$ | Statistical testing |
**Variational representation and GANs.** Every $f$-divergence has a variational (dual) representation:

$$D_f(p \| q) = \sup_T \left\{\mathbb{E}_p[T(x)] - \mathbb{E}_q[f^*(T(x))]\right\}$$

where $f^*$ is the convex conjugate of $f$. This is the basis of $f$-GANs [@nowozin2016fgan]: the discriminator $T$ maximizes the variational bound, while the generator minimizes it. Different choices of $f$ yield different GAN objectives:

  • $f(t) = t \log t$: KL-GAN
  • JS divergence: Original GAN
  • $f(t) = (t-1)^2$: Least-squares GAN

The Wasserstein distance (used in WGAN) is not an ff-divergence -- it is an optimal transport distance that metrizes weak convergence, avoiding the mode collapse issues of ff-divergences.
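The generic formula $D_f(p \| q) = \sum_x q(x)\, f(p(x)/q(x))$ makes it easy to evaluate several rows of the table at once. A minimal sketch on hypothetical discrete distributions (all entries of $q$ strictly positive, so no support issues arise):

```python
import numpy as np

def f_divergence(p, q, f):
    """Generic D_f(p || q) = sum_x q(x) f(p(x)/q(x)) for strictly positive q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

divs = {
    "KL":         lambda t: t * np.log(t),
    "reverse KL": lambda t: -np.log(t),
    "chi2":       lambda t: (t - 1)**2,
    "TV":         lambda t: 0.5 * np.abs(t - 1),
    "Hellinger":  lambda t: (np.sqrt(t) - 1)**2,
}
for name, f in divs.items():
    print(name, f_divergence(p, q, f))

# Cross-check TV against its direct definition (1/2) * sum |p - q|.
assert np.isclose(f_divergence(p, q, divs["TV"]), 0.5 * np.sum(np.abs(p - q)))
```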

Mutual Information

The **mutual information** between $X$ and $Y$ measures the reduction in uncertainty about one variable given knowledge of the other:

$$I(X; Y) = D_{\text{KL}}(p(X, Y) \| p(X)p(Y)) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Equivalently: $I(X;Y) = H(X) + H(Y) - H(X,Y)$.

Properties of mutual information:

  • Non-negative: $I(X;Y) \geq 0$ with equality iff $X \perp Y$.
  • Symmetric: $I(X;Y) = I(Y;X)$ (unlike KL divergence).
  • Invariant under bijections: $I(X;Y) = I(f(X); g(Y))$ for invertible $f, g$.
  • Bounds entropy: $I(X;Y) \leq \min(H(X), H(Y))$.
  • Chain rule: $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$.
**Mutual information in ML:**
| Application | How MI Is Used |
|---|---|
| Feature selection | Select features $X_i$ that maximize $I(X_i; Y)$ with the target |
| InfoNCE loss (Oord et al., 2018) | Lower bound on $I(X; Z)$ via contrastive learning: $I(X;Z) \geq \log N - \mathcal{L}_{\text{NCE}}$ |
| Information bottleneck [@tishby2000information] | Find a representation $Z$ that maximizes $I(Z; Y)$ while minimizing $I(X; Z)$ |
| Representation quality | Good representations have high $I(Z; Y)$ (predictive) and low $I(Z; X)$ (compressed) |
| Independence testing | $I(X; Y) = 0 \iff X \perp Y$ -- stronger than correlation (captures nonlinear dependence) |
| Variational bounds | Used in MINE [@belghazi2018mine], InfoNCE, and other contrastive objectives |

MI captures all statistical dependencies (not just linear), making it the "gold standard" measure of association. However, MI is notoriously hard to estimate in high dimensions -- practical estimators use variational bounds.
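For small discrete variables, MI can be computed exactly from the joint table, which also verifies the identities above. A minimal sketch with a made-up $2 \times 3$ joint distribution:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Joint distribution p(x, y) over a 2x3 table (rows: x, columns: y).
pxy = np.array([[0.20, 0.10, 0.10],
                [0.05, 0.25, 0.30]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = entropy(px) + entropy(py) - entropy(pxy)

# Cross-check against the KL form: I(X;Y) = KL( p(x,y) || p(x)p(y) ).
indep = np.outer(px, py)
nz = pxy > 0
mi_kl = np.sum(pxy[nz] * np.log(pxy[nz] / indep[nz]))

print(mi, mi_kl)  # the two forms agree
```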

Conditional Entropy and Chain Rules

The **conditional entropy** of $X$ given $Y$ is the expected uncertainty remaining in $X$ after observing $Y$:

$$H(X|Y) = -\sum_{x,y} p(x,y) \log p(x|y) = H(X,Y) - H(Y)$$

**The information diagram.** The relationships between entropy, conditional entropy, mutual information, and joint entropy can be visualized as a Venn diagram:
  • $H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$ (chain rule)
  • $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$ (mutual information as overlap)
  • $H(X,Y) = H(X) + H(Y) - I(X;Y)$ (joint = sum - overlap)

These identities are the information-theoretic analogue of the inclusion-exclusion principle for set sizes.

Data Processing Inequality

For a Markov chain $X \to Y \to Z$ (i.e., $X \perp Z \mid Y$, so $Z$ depends on $X$ only through $Y$):

$$I(X; Z) \leq I(X; Y)$$

Processing data can only lose information, never create it. Equality holds iff $X \to Z \to Y$ is also a Markov chain (i.e., $Z$ is a sufficient statistic of $Y$ for $X$).

**Data processing inequality in neural networks.** In a neural network $X \to h_1 \to h_2 \to \cdots \to h_L \to \hat{Y}$, each layer forms a Markov chain, so:

$$I(X; \hat{Y}) \leq I(X; h_L) \leq \cdots \leq I(X; h_1) \leq I(X; X) = H(X)$$

Implications:

  • Later layers cannot recover information lost by earlier layers. This motivates careful design of early layers and residual/skip connections (which break the Markov chain).
  • Information bottleneck theory [@tishby2015deep] conjectures that training has two phases: (1) fitting -- $I(h; Y)$ increases; (2) compression -- $I(X; h)$ decreases, keeping only task-relevant information.
  • Sufficient statistics: If a layer $h_l$ is a sufficient statistic for $Y$ given $X$ (i.e., $I(h_l; Y) = I(X; Y)$), then no information about $Y$ has been lost. This is the ideal case.

Skip connections in ResNets violate the strict Markov chain structure, allowing $I(X; h_L)$ to remain close to $I(X; X)$ even in very deep networks.
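The inequality can be demonstrated with a toy Markov chain: a fair bit passed through two binary symmetric channels. A minimal sketch (the channel flip probabilities are illustrative):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def mi_from_joint(pxy):
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy)

def bsc(eps):
    """Transition matrix of a binary symmetric channel with flip probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.array([0.5, 0.5])        # fair input bit
p_y_given_x = bsc(0.1)           # first noisy channel: X -> Y
p_z_given_y = bsc(0.2)           # second noisy channel: Y -> Z

pxy = px[:, None] * p_y_given_x                  # joint p(x, y)
pxz = px[:, None] * (p_y_given_x @ p_z_given_y)  # joint p(x, z)

print(mi_from_joint(pxy), mi_from_joint(pxz))  # I(X;Z) <= I(X;Y)
```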

Rate-Distortion Theory

The **rate-distortion function** $R(D)$ characterizes the minimum number of bits per sample needed to represent data from source $p(x)$ with average distortion at most $D$:

$$R(D) = \min_{p(\hat{x}|x): \, \mathbb{E}[d(x,\hat{x})] \leq D} I(X; \hat{X})$$

where $d(x, \hat{x})$ is a distortion measure (e.g., MSE).

**Rate-distortion and representation learning.** The information bottleneck objective

$$\min_Z \; I(X; Z) - \beta \cdot I(Z; Y)$$

is a rate-distortion problem: minimize the "rate" $I(X; Z)$ (compression) subject to acceptable "distortion" $I(Z; Y)$ (preserving task-relevant information). The Lagrange multiplier $\beta$ trades off compression vs. prediction quality. VAEs solve a related problem: the KL term $D_{\text{KL}}(q(z|x) \| p(z))$ upper-bounds the rate $I(X; Z)$ under the variational approximation.

Lossy compression, quantization, and knowledge distillation can all be framed as rate-distortion problems.
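For a Gaussian source under squared-error distortion, $R(D)$ has the well-known closed form $R(D) = \frac{1}{2}\log(\sigma^2/D)$ nats for $D \leq \sigma^2$ (and $0$ otherwise). A minimal sketch tabulating the trade-off:

```python
import numpy as np

def gaussian_rate(D, sigma2=1.0):
    """R(D) = max(0, 0.5 * log(sigma^2 / D)) nats: the Gaussian/MSE rate-distortion curve."""
    return np.maximum(0.0, 0.5 * np.log(sigma2 / np.asarray(D, float)))

for D in [0.01, 0.1, 0.5, 1.0]:
    print(D, gaussian_rate(D))
# Tighter distortion tolerance requires more bits; D >= sigma^2 requires
# zero bits (just output the source mean).
```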

Maximum Entropy Principle

The **maximum entropy distribution** subject to constraints $\mathbb{E}_p[f_i(X)] = c_i$ is:

$$p^*(x) = \frac{1}{Z(\lambda)} h(x) \exp\left(\sum_i \lambda_i f_i(x)\right)$$

where $\lambda_i$ are Lagrange multipliers and $Z(\lambda)$ is the normalizing constant. This is exactly an exponential family distribution with sufficient statistics $f_i$.

**Maximum entropy and exponential families.** The maximum entropy principle provides a principled way to construct probability distributions when you know some statistics of the data but nothing else. The result is always an exponential family:
  • Constraint on mean (with support $\mathbb{R}^+$) $\to$ exponential distribution
  • Constraints on mean and variance $\to$ Gaussian
  • Constraint on probabilities summing to 1 $\to$ uniform
  • Constraint on the mean of a categorical $\to$ softmax (Boltzmann) distribution

This connects to statistical mechanics (Boltzmann distribution), information geometry (exponential families as maximum entropy), and ML (softmax as maximum entropy classifier).
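The softmax/Boltzmann case can be solved numerically: fix $\mathbb{E}[X]$ on a finite support and find the Lagrange multiplier by bisection, since the mean of $p(x) \propto \exp(\lambda x)$ is monotone in $\lambda$. A minimal sketch with an illustrative target mean:

```python
import numpy as np

# Maximum entropy distribution on {0, 1, 2, 3} subject to E[X] = 1.2.
xs = np.arange(4)
target_mean = 1.2

def boltzmann(lam):
    """p(x) proportional to exp(lam * x) on the support xs."""
    w = np.exp(lam * xs)
    return w / w.sum()

lo, hi = -10.0, 10.0
for _ in range(100):  # bisection on the monotone map lambda -> E[X]
    mid = (lo + hi) / 2
    if boltzmann(mid) @ xs < target_mean:
        lo = mid
    else:
        hi = mid

p = boltzmann((lo + hi) / 2)
print(p, p @ xs)  # a softmax-shaped distribution matching the mean constraint
```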

Notation Summary

| Symbol | Meaning |
|---|---|
| $H(p)$ | Entropy of distribution $p$ |
| $h(p)$ | Differential entropy (continuous) |
| $H(p, q)$ | Cross-entropy between $p$ and $q$ |
| $D_{\text{KL}}(p \Vert q)$ | KL divergence from $q$ to $p$ |
| $D_f(p \Vert q)$ | $f$-divergence from $q$ to $p$ |
| $I(X; Y)$ | Mutual information between $X$ and $Y$ |
| $H(X \mid Y)$ | Conditional entropy of $X$ given $Y$ |
| $H(X, Y)$ | Joint entropy of $X$ and $Y$ |
| $R(D)$ | Rate-distortion function |
| $f^*$ | Convex conjugate of $f$ |
| PPL | Perplexity: $\exp(H_{\text{cross-entropy}})$ |
| $\log$ | Natural logarithm (nats) or $\log_2$ (bits) |

References