
Distributions

Probability distributions are the building blocks of statistical modeling. Every ML model implicitly or explicitly defines a probability distribution: a classifier outputs a Categorical distribution, a regression model outputs a Gaussian, a language model outputs a distribution over token sequences. This chapter covers the distributions that appear most frequently in machine learning, their properties, and their relationships.

Discrete Distributions

**Bernoulli** -- single binary trial with success probability $p$:

$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

$\mathbb{E}[X] = p$, $\text{Var}(X) = p(1-p)$. The Bernoulli is an exponential family distribution with natural parameter $\eta = \log\frac{p}{1-p}$ (the log-odds, or logit).

**Categorical** -- generalization to $K$ classes with probabilities $\pi = (\pi_1, \dots, \pi_K)$:

$$P(X = k) = \pi_k, \quad \sum_{k=1}^K \pi_k = 1$$

$\mathbb{E}[\mathbf{1}_{X=k}] = \pi_k$, $\text{Var}(\mathbf{1}_{X=k}) = \pi_k(1 - \pi_k)$.

**Binomial** -- number of successes in $n$ independent Bernoulli$(p)$ trials:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

$\mathbb{E}[X] = np$, $\text{Var}(X) = np(1-p)$. As $n \to \infty$ with $np$ fixed, the Binomial converges to the Poisson. As $n \to \infty$ with $p$ fixed, $\frac{X - np}{\sqrt{np(1-p)}} \to \mathcal{N}(0,1)$ by the CLT.

**Poisson** -- models the number of events in a fixed interval when events occur independently at a constant rate $\lambda$:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

$\mathbb{E}[X] = \text{Var}(X) = \lambda$. The Gamma distribution is the conjugate prior for the rate parameter $\lambda$. The Poisson appears in count data modeling (event frequency, word counts in bag-of-words models).

The Categorical distribution is the output distribution for classification. The softmax function converts logits $z \in \mathbb{R}^K$ to a Categorical distribution: $\pi_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$. For $K = 2$ the softmax reduces to the logistic sigmoid, the inverse of the log-odds (logit) function, and it arises naturally from the exponential family form of the Categorical. Every classification neural network implicitly defines a Categorical distribution over labels.

**Relationships between discrete distributions.** The discrete distributions form a hierarchy:
  • Bernoulli is Categorical with $K = 2$, and Binomial with $n = 1$
  • Binomial is the sum of $n$ i.i.d. Bernoulli trials
  • Multinomial is the multivariate generalization: counts of $K$ categories across $n$ trials
  • Poisson is the limit of Binomial$(n, \lambda/n)$ as $n \to \infty$ (rare events)
  • Geometric is the number of trials until the first success: $P(X = k) = (1-p)^{k-1}p$
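The Poisson limit in the hierarchy above can be checked numerically. A minimal sketch using `scipy.stats`: the total variation distance between Binomial$(n, \lambda/n)$ and Poisson$(\lambda)$ shrinks as $n$ grows.

```python
import numpy as np
from scipy import stats

lam = 3.0
ks = np.arange(30)                      # support large enough to capture ~all mass
pois = stats.poisson.pmf(ks, lam)

# Total variation distance between Binomial(n, lam/n) and Poisson(lam)
tvs = {}
for n in [10, 100, 1000]:
    binom = stats.binom.pmf(ks, n, lam / n)
    tvs[n] = 0.5 * np.abs(binom - pois).sum()
    print(f"n={n:5d}  TV distance: {tvs[n]:.5f}")
```

The distance decreases roughly like $\lambda^2/n$ (Le Cam's inequality), so by $n = 1000$ the two distributions are nearly indistinguishable.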

The Gaussian Distribution

The **univariate Gaussian** (normal) distribution with mean $\mu$ and variance $\sigma^2$:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The multivariate Gaussian for $x \in \mathbb{R}^n$ with mean $\mu$ and covariance matrix $\Sigma$:

$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$$

The exponent $(x - \mu)^\top \Sigma^{-1}(x - \mu)$ is the squared Mahalanobis distance from $x$ to $\mu$. Level sets of the density are ellipsoids aligned with the eigenvectors of $\Sigma$.

| Property | Formula | Proof Sketch |
|---|---|---|
| Marginal | $p(x_A) = \mathcal{N}(\mu_A, \Sigma_{AA})$ | Integrate out $x_B$; completing the square |
| Conditional | $p(x_A \mid x_B) = \mathcal{N}(\mu_{A \mid B}, \Sigma_{A \mid B})$ with $\mu_{A \mid B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$, $\Sigma_{A \mid B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$ | Completing the square in $x_A$ (Schur complement) |
| Sum | $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \Sigma_X + \Sigma_Y)$ if independent | Convolution of Gaussians via completing the square |
| Linear transform | $AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)$ | Moment generating function or change of variables |
| Product | $\mathcal{N}(\mu_1, \Sigma_1) \cdot \mathcal{N}(\mu_2, \Sigma_2) \propto \mathcal{N}(\mu_*, \Sigma_*)$ | $\Sigma_*^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1}$ (precision addition) |
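The conditioning formula is mechanical enough to implement directly. A minimal sketch (the function name `gaussian_condition` is illustrative, not a library API):

```python
import numpy as np

def gaussian_condition(mu, Sigma, idx_a, idx_b, x_b):
    """Parameters of p(x_A | x_B = x_b) for a joint Gaussian N(mu, Sigma)."""
    mu, Sigma = np.asarray(mu), np.asarray(Sigma)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)        # regression coefficients
    mu_cond = mu_a + K @ (x_b - mu_b)     # conditional mean, linear in x_b
    Sigma_cond = S_aa - K @ S_ab.T        # Schur complement
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
m, S = gaussian_condition(mu, Sigma, [0], [1], np.array([2.0]))
print(m, S)   # mean 0 + 0.8/1.0 * (2 - 1) = 0.8, variance 2 - 0.8^2/1.0 = 1.36
```

Observing $x_B$ shifts the mean of $x_A$ along the regression line and always shrinks its variance, regardless of the observed value.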
**Gaussian conditioning and Bayesian linear regression.** The conditional formula $\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$ is a linear function of $x_B$ -- this is why Bayesian linear regression (Gaussian prior + Gaussian likelihood) yields a Gaussian posterior in closed form. The posterior mean is the regularized least-squares solution, and the posterior covariance quantifies parameter uncertainty.

The Gaussian distribution is central to ML for several deep reasons:
  1. Maximum entropy: Among all distributions with given mean and variance, the Gaussian has maximum entropy (maximum uncertainty). Using a Gaussian makes the fewest assumptions beyond first and second moments.
  2. Central Limit Theorem: Sums of many independent random variables converge to a Gaussian, regardless of the original distribution. This justifies Gaussian noise models whenever errors arise from many small independent sources.
  3. Algebraic closure: Gaussians are closed under marginalization, conditioning, linear transformation, and products. This enables exact inference in linear-Gaussian models (Kalman filter, factor analysis, Gaussian processes).
  4. MLE connection: Minimizing MSE loss $\|y - f(x)\|^2$ is equivalent to MLE under Gaussian noise $y \sim \mathcal{N}(f(x), \sigma^2 I)$.
**Precision matrix and conditional independence.** The **precision matrix** $\Lambda = \Sigma^{-1}$ encodes conditional independence: $\Lambda_{ij} = 0$ iff $X_i \perp X_j \mid X_{\setminus\{i,j\}}$. This makes the precision matrix the natural parameterization for Gaussian graphical models (also called Gaussian Markov Random Fields). In a graph where edges represent non-zero precision entries, inference is efficient when the graph is sparse.
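The sparsity pattern of $\Lambda$ versus the density of $\Sigma$ is easy to see on a small chain-structured example:

```python
import numpy as np

# Chain X1 - X2 - X3: a tridiagonal precision matrix encodes
# X1 independent of X3 given X2 (zero in the (1,3) precision entry).
Lambda = np.array([[ 2.0, -1.0,  0.0],
                   [-1.0,  2.0, -1.0],
                   [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Lambda)
print(np.round(Sigma, 3))   # dense: X1 and X3 are still marginally correlated
```

The covariance entry $\Sigma_{13} \neq 0$ even though $\Lambda_{13} = 0$: marginal correlation flows along the chain, while conditional independence is what the zero precision entry encodes.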

Exponential Family

A distribution belongs to the **exponential family** if its density can be written as:

$$p(x \mid \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)$$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function (ensures normalization), and $h(x)$ is the base measure.

| Distribution | $\eta$ (natural param) | $T(x)$ (sufficient stat) | $A(\eta)$ (log-partition) |
|---|---|---|---|
| Bernoulli($p$) | $\log\frac{p}{1-p}$ | $x$ | $\log(1 + e^\eta)$ |
| Gaussian($\mu, \sigma^2$) | $\left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$ | $(x, x^2)$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ |
| Poisson($\lambda$) | $\log \lambda$ | $x$ | $e^\eta$ |
| Categorical($\pi$) | $\log(\pi_k/\pi_K)$ | $\mathbf{1}_{x=k}$ | $\log\sum_k e^{\eta_k}$ (logsumexp) |
| Dirichlet($\alpha$) | $\alpha_k - 1$ | $\log x_k$ | $\sum_k \log\Gamma(\alpha_k) - \log\Gamma(\sum_k \alpha_k)$ |
**Why the exponential family matters for ML:**
  1. Sufficient statistics: $T(x)$ captures all information about $\eta$ from the data. For $n$ i.i.d. samples, $\sum_i T(x_i)$ is sufficient -- you never need to store raw data.
  2. Log-partition function properties: $\nabla_\eta A(\eta) = \mathbb{E}[T(X)]$ and $\nabla^2_\eta A(\eta) = \text{Cov}[T(X)]$. Since covariance matrices are PSD, $A(\eta)$ is convex, making MLE a convex problem.
  3. Conjugate priors always exist and have a known form.
  4. GLMs (Generalized Linear Models): Each exponential family member defines a GLM. Logistic regression uses Bernoulli, Poisson regression uses Poisson, linear regression uses Gaussian. The link function is $g(\mu) = \eta = w^\top x$.
  5. Softmax is the log-partition: The Categorical log-partition $A(\eta) = \text{logsumexp}(\eta)$ is exactly the log of the softmax denominator. The gradient $\nabla A = \text{softmax}(\eta)$ gives the mean parameters (class probabilities).
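The identity $\nabla_\eta A(\eta) = \mathbb{E}[T(X)]$ can be checked numerically for the Bernoulli, where $A(\eta) = \log(1 + e^\eta)$ and $\mathbb{E}[T(X)] = \mathbb{E}[X] = p = \sigma(\eta)$:

```python
import numpy as np

# Bernoulli: A(eta) = log(1 + e^eta); its derivative should equal the mean p.
A = lambda e: np.log1p(np.exp(e))

eta = 0.7
h = 1e-6
grad_A = (A(eta + h) - A(eta - h)) / (2 * h)   # central-difference derivative
p = 1.0 / (1.0 + np.exp(-eta))                  # sigmoid: E[T(X)] = E[X] = p
print(grad_A, p)
```

The same check works for the second derivative, which should match $\text{Var}(X) = p(1-p)$.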

Softmax and Temperature

The **softmax function** maps logits $z \in \mathbb{R}^K$ to a probability distribution:

$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

With temperature $\tau > 0$: $\text{softmax}(z/\tau)_k$. As $\tau \to 0$, the distribution approaches argmax (deterministic). As $\tau \to \infty$, it approaches uniform.

**Softmax properties and numerical stability:**
  • Invariance to shift: $\text{softmax}(z + c) = \text{softmax}(z)$ for any scalar $c$. For numerical stability, compute $\text{softmax}(z - \max_k z_k)$.
  • Gradient: $\frac{\partial \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i(\delta_{ij} - \text{softmax}(z)_j)$. This means the Jacobian is $\text{diag}(p) - pp^\top$.
  • Log-softmax: $\log\text{softmax}(z)_k = z_k - \text{logsumexp}(z)$. Always compute log-softmax directly (numerically stable) rather than $\log(\text{softmax}(z))$.
  • Hardmax limit: As $\tau \to 0$, softmax approaches one-hot encoding of $\arg\max z$. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) provides a differentiable approximation to discrete sampling.
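The shift-invariance and log-softmax properties above translate directly into the standard stable implementations. A minimal sketch (function names are illustrative):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Numerically stable softmax with temperature: subtract the max before exp."""
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def log_softmax(z):
    """z_k - logsumexp(z), computed without ever forming softmax(z)."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.array([1000.0, 1001.0, 1002.0])   # naive exp(1000) would overflow
p = softmax(z)
print(p)   # same as softmax([0, 1, 2]) by shift invariance
```

Without the max subtraction, `np.exp(1000)` overflows to `inf` and the result is `nan`; with it, the computation is exact up to floating-point rounding.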
**Temperature in practice:**
| Temperature | Effect | Use Case |
|---|---|---|
| $\tau \ll 1$ | Sharp, near-deterministic | Greedy decoding, distillation (hard labels) |
| $\tau = 1$ | Standard softmax | Training, standard inference |
| $\tau > 1$ | Smoother, more uniform | Knowledge distillation (soft labels), exploration |
| $\tau \to \infty$ | Uniform distribution | Maximum entropy / random baseline |

In knowledge distillation (Hinton et al., 2015), the teacher's softmax is evaluated at high temperature to expose "dark knowledge" (relative probabilities of non-target classes). The student is trained to match these soft targets.
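The flattening effect of temperature can be quantified by the entropy of the resulting distribution. A small sketch (the helper `softmax_t` is illustrative):

```python
import numpy as np

def softmax_t(z, tau):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
entropies = {}
for tau in [0.1, 1.0, 10.0]:
    p = softmax_t(logits, tau)
    entropies[tau] = -(p * np.log(p)).sum()
    print(f"tau={tau:5.1f}  p={np.round(p, 3)}  H={entropies[tau]:.3f}")
```

Entropy grows monotonically with $\tau$ toward the maximum $\log K$: low temperature commits to the argmax, high temperature exposes the relative probabilities of the non-target classes that distillation exploits.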

Continuous Distributions Beyond the Gaussian

The **continuous uniform** distribution on $[a, b]$:

$$p(x) = \frac{1}{b - a} \quad \text{for } x \in [a, b]$$

$\mathbb{E}[X] = \frac{a+b}{2}$, $\text{Var}(X) = \frac{(b-a)^2}{12}$. Among all distributions on $[a,b]$, the uniform has maximum entropy.

The **Beta** distribution on $[0, 1]$ with shape parameters $\alpha, \beta > 0$:

$$p(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad x \in [0, 1]$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$. $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$, $\text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$. The Beta is the conjugate prior for the Bernoulli parameter $p$.
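Beta-Bernoulli conjugacy makes the posterior update a one-line computation: a Beta$(a, b)$ prior plus $k$ successes in $n$ trials gives a Beta$(a + k,\, b + n - k)$ posterior. A small sketch (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 2.0, 2.0                # Beta(2, 2) prior on the success probability p
p_true = 0.7
x = rng.random(100) < p_true   # 100 Bernoulli(0.7) draws
k, n = int(x.sum()), x.size

a_post, b_post = a + k, b + n - k          # conjugate update
post_mean = a_post / (a_post + b_post)     # shrinks the sample mean toward 0.5
print(k, post_mean)
```

With 100 observations the prior's pull toward its mean of 0.5 is slight; with few observations the shrinkage dominates, which is exactly the regularizing role of the prior.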

The **Dirichlet** distribution generalizes the Beta to $K$-dimensional probability vectors:

$$p(\pi \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}, \quad \pi \in \Delta^{K-1}$$

where $\Delta^{K-1} = \{\pi : \pi_k \geq 0, \sum_k \pi_k = 1\}$ is the probability simplex. $\mathbb{E}[\pi_k] = \alpha_k / \alpha_0$ where $\alpha_0 = \sum_k \alpha_k$. The concentration parameter $\alpha_0$ controls how peaked the distribution is: $\alpha_0 \to 0$ concentrates on vertices (sparse), $\alpha_0 \to \infty$ concentrates on the center (uniform).

The **Gamma** distribution with shape $\alpha > 0$ and rate $\beta > 0$:

$$p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0$$

$\mathbb{E}[X] = \alpha/\beta$, $\text{Var}(X) = \alpha/\beta^2$. Special cases: Exponential ($\alpha = 1$), Chi-squared ($\alpha = k/2, \beta = 1/2$). The Gamma is the conjugate prior for the Poisson rate and the Gaussian precision.
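Gamma-Poisson conjugacy works the same way as Beta-Bernoulli: a Gamma$(\alpha, \beta)$ prior on the rate plus observed counts $x_1, \ldots, x_n$ gives a Gamma$(\alpha + \sum_i x_i,\, \beta + n)$ posterior. A small sketch (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 2.0, 1.0           # Gamma(2, 1) prior on the Poisson rate
lam_true = 4.0
x = rng.poisson(lam_true, size=50)

alpha_post = alpha + x.sum()     # conjugate update: add total count
beta_post = beta + x.size        # ... and number of observations
post_mean = alpha_post / beta_post
print(post_mean)                 # close to the true rate of 4
```

The posterior mean $\frac{\alpha + \sum_i x_i}{\beta + n}$ is a weighted blend of the prior mean $\alpha/\beta$ and the sample mean, with the data dominating as $n$ grows.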

| Distribution | Support | Key Property | ML Usage |
|---|---|---|---|
| Gaussian $\mathcal{N}(\mu, \sigma^2)$ | $\mathbb{R}$ | Max entropy for given mean/variance | Noise model, latent space, weight initialization |
| Uniform $U(a,b)$ | $[a,b]$ | Max entropy on bounded support | Prior for bounded parameters, random init |
| Beta $B(\alpha, \beta)$ | $[0,1]$ | Conjugate to Bernoulli | Prior on probabilities, dropout rate |
| Dirichlet $\text{Dir}(\alpha)$ | Simplex $\Delta^{K-1}$ | Conjugate to Categorical | Prior on mixture weights, topic models |
| Gamma $\Gamma(\alpha, \beta)$ | $\mathbb{R}^+$ | Conjugate to Poisson/Gaussian precision | Prior on positive parameters |
| Student-$t$ | $\mathbb{R}$ | Heavy tails (robust) | Robust regression, outlier modeling |
| Laplace | $\mathbb{R}$ | $p(x) \propto e^{-\lvert x - \mu\rvert/b}$ (sharp peak, heavy tails) | L1 regularization (sparsity-inducing prior) |
| Log-normal | $\mathbb{R}^+$ | $\log X \sim \mathcal{N}$ | Attention scores, gradient norms |

Mixture Models

A **Gaussian mixture model** (GMM) combines $K$ Gaussian components:

$$p(x) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\pi_k$ are mixing weights ($\sum_k \pi_k = 1$, $\pi_k \geq 0$). GMMs can approximate any continuous density with compact support to arbitrary accuracy (universal approximation for densities), given enough components.

**EM algorithm for GMMs.** Since the component assignments $z_i$ are unobserved (latent), direct MLE is intractable. The **Expectation-Maximization (EM)** algorithm alternates:
  1. E-step: Compute responsibilities $r_{ik} = P(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
  2. M-step: Update parameters using weighted statistics:
    • $\mu_k^{(t+1)} = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$, $\Sigma_k^{(t+1)} = \frac{\sum_i r_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_i r_{ik}}$, $\pi_k^{(t+1)} = \frac{1}{N}\sum_i r_{ik}$

EM monotonically increases the log-likelihood and converges to a local maximum. It generalizes beyond GMMs to any latent variable model.
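The E- and M-steps above fit in a few lines for a 1-D two-component GMM. A minimal sketch on synthetic data (initial values and component parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: two well-separated 1-D Gaussian clusters (30% / 70%)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

pi = np.array([0.5, 0.5])      # initial mixing weights
mu = np.array([-1.0, 1.0])     # initial means
var = np.array([1.0, 1.0])     # initial variances

for _ in range(50):
    # E-step: responsibilities r[i, k] = p(z_i = k | x_i)
    dens = np.stack([pi[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted MLE updates
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / x.size

print(np.round(mu, 2), np.round(pi, 2))   # near [-2, 3] and [0.3, 0.7]
```

Because EM only finds a local maximum, practical implementations run multiple random restarts (or initialize from k-means) and add a small floor to the variances to avoid degenerate components.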

**Latent variable models.** Mixture models introduce a latent variable $z$ (the component assignment). The marginal $p(x) = \sum_z p(x|z)p(z)$ integrates over the latent. This is the same structure as:
  • VAEs (Kingma & Welling, 2014): continuous latent $z \in \mathbb{R}^d$, $p(x|z)$ is a neural network decoder
  • Diffusion models (Ho et al., 2020): chain of latents $x_T \to x_{T-1} \to \cdots \to x_0$
  • Topic models (LDA): latent topic assignments for each word
  • Hidden Markov models: latent discrete states with temporal dependencies

The key computational challenge in all these models is computing or approximating the posterior $p(z|x)$.

Important Limit Theorems

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and finite variance. The **strong law of large numbers** states:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty$$

This justifies using empirical averages (mini-batch gradients, empirical risk) as estimators of expectations.

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$. Then:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$

The CLT tells us the rate: $\bar{X}_n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$, so the estimation error scales as $O(1/\sqrt{n})$ regardless of the original distribution.

**CLT in ML:**
  • Mini-batch gradients: The average gradient over a mini-batch of size $B$ has variance $\sigma^2/B$, approximately Gaussian for moderate $B$ by the CLT. This justifies the learning rate linear scaling rule: if you increase $B$ by a factor of $k$, increase $\eta$ by $k$ to maintain the same noise-to-signal ratio.
  • Confidence intervals: For any estimator computed as an average (accuracy, loss, BLEU score), the CLT gives approximate confidence intervals: $\hat{\mu} \pm z_{\alpha/2} \cdot \hat{\sigma}/\sqrt{n}$.
  • Batch normalization: The sample mean and variance computed over a mini-batch converge to population statistics by the LLN, with CLT-governed fluctuations.
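The CLT is easy to see empirically even for a heavily skewed distribution. A small sketch: standardized means of Exponential$(1)$ samples (mean 1, variance 1) behave like $\mathcal{N}(0, 1)$ draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Means of n skewed Exponential(1) samples, standardized by the CLT scaling
n, reps = 200, 20000
samples = rng.exponential(1.0, size=(reps, n))       # mean 1, variance 1
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))
print(z.mean(), z.std())   # approximately 0 and 1
```

The standardized means have the right first two moments at any $n$; what the CLT adds is that the full shape (histogram, quantiles) converges to the Gaussian as $n$ grows, despite the exponential's strong skew.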

Transformations of Random Variables

If $X \sim p_X$ and $Y = g(X)$ where $g$ is a differentiable bijection with inverse $g^{-1}$, then:

$$p_Y(y) = p_X(g^{-1}(y)) \left|\det \frac{\partial g^{-1}}{\partial y}\right| = p_X(g^{-1}(y)) \left|\det \frac{\partial g}{\partial x}\right|^{-1}$$

The Jacobian determinant accounts for the stretching or compression of volume by $g$.
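A quick numerical check of the formula, assuming `scipy` is available: for $Y = e^X$ with $X \sim \mathcal{N}(0, 1)$, the change of variables gives $p_Y(y) = \varphi(\log y)\,/\,y$, which should match the standard log-normal density.

```python
import numpy as np
from scipy import stats

# Y = g(X) = exp(X), so g^{-1}(y) = log y and |d g^{-1}/dy| = 1/y
y = np.linspace(0.1, 5.0, 50)
p_formula = stats.norm.pdf(np.log(y)) / y        # p_X(g^{-1}(y)) * |Jacobian|
p_scipy = stats.lognorm.pdf(y, s=1.0)            # scipy's log-normal, shape s = sigma
err = np.max(np.abs(p_formula - p_scipy))
print(err)   # agreement up to floating-point rounding
```

The $1/y$ factor is the whole story here: without the Jacobian term, the "density" of $Y$ would not even integrate to one.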

**Change of variables in ML.** This formula is the foundation of:
  • Normalizing flows (Rezende & Mohamed, 2015): Compose simple bijections $g_1 \circ g_2 \circ \cdots \circ g_L$ to transform a simple base distribution (e.g., Gaussian) into a complex target. The log-density of the transformed variable is $\log p_X(x) - \sum_l \log|\det J_l|$.
  • Reparameterization trick (Kingma & Welling, 2014): Sample $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. This moves the stochasticity out of the distribution parameters, enabling backpropagation through the sampling step.
  • Probability integral transform: If $X \sim F$, then $F(X) \sim U(0,1)$. This is used for calibration assessment and copula models.

Notation Summary

| Symbol | Meaning |
|---|---|
| $p, \pi$ | Probability density / mixing weight |
| $\mu$ | Mean (scalar or vector) |
| $\sigma^2, \Sigma$ | Variance (scalar) / covariance (matrix) |
| $\Lambda = \Sigma^{-1}$ | Precision matrix |
| $\mathcal{N}(\mu, \Sigma)$ | Gaussian distribution |
| $\tau$ | Temperature parameter |
| $K$ | Number of classes or mixture components |
| $z$ | Logits or latent variable |
| $\eta$ | Natural parameter (exponential family) |
| $T(x)$ | Sufficient statistic |
| $A(\eta)$ | Log-partition function |
| $B(\alpha, \beta)$ | Beta function |
| $\Gamma(\cdot)$ | Gamma function |
| $\Delta^{K-1}$ | $(K-1)$-dimensional probability simplex |
