Skip to main content

Distributions

Probability distributions are the building blocks of statistical modeling. Every ML model implicitly or explicitly defines a probability distribution: a classifier outputs a Categorical distribution, a regression model outputs a Gaussian, a language model outputs a distribution over token sequences. This chapter covers the distributions that appear most frequently in machine learning, their properties, and their relationships.

Discrete Distributions

**Bernoulli**: a single binary trial with success probability $p$:

P(X=x)=px(1p)1x,x{0,1}P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

E[X]=p\mathbb{E}[X] = p, Var(X)=p(1p)\text{Var}(X) = p(1-p). The Bernoulli is an exponential family distribution with natural parameter η=logp1p\eta = \log\frac{p}{1-p} (the log-odds, or logit).

**Categorical**: the generalization to $K$ classes with probabilities $\pi = (\pi_1, \dots, \pi_K)$: P(X=k)=πk,k=1Kπk=1P(X = k) = \pi_k, \quad \sum_{k=1}^K \pi_k = 1

E[1X=k]=πk\mathbb{E}[\mathbf{1}_{X=k}] = \pi_k, Var(1X=k)=πk(1πk)\text{Var}(\mathbf{1}_{X=k}) = \pi_k(1 - \pi_k).

**Binomial**: the number of successes in $n$ independent Bernoulli$(p)$ trials:

P(X=k)=(nk)pk(1p)nk,k=0,1,,nP(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n

E[X]=np\mathbb{E}[X] = np, Var(X)=np(1p)\text{Var}(X) = np(1-p). As nn \to \infty with npnp fixed, the Binomial converges to the Poisson. As nn \to \infty with pp fixed, Xnpnp(1p)N(0,1)\frac{X - np}{\sqrt{np(1-p)}} \to \mathcal{N}(0,1) by the CLT.

**Poisson**: models the number of events in a fixed interval when events occur independently at a constant rate $\lambda$:

P(X=k)=λkeλk!,k=0,1,2,P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots

E[X]=Var(X)=λ\mathbb{E}[X] = \text{Var}(X) = \lambda. The conjugate prior for the Poisson rate λ\lambda is the Gamma distribution. It appears in count data modeling (event frequency, word counts in bag-of-words models).

The Categorical distribution is the output distribution for classification. The softmax function converts logits $z \in \mathbb{R}^K$ to a Categorical distribution: $\pi_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$. The softmax is the inverse of the log-odds (logit) function and arises naturally from the exponential family form of the Categorical. Every classification neural network implicitly defines a Categorical distribution over labels. **Relationships between discrete distributions.** The discrete distributions form a hierarchy:
  • Bernoulli is Categorical with K=2K = 2, and Binomial with n=1n = 1
  • Binomial is the sum of nn i.i.d. Bernoulli trials
  • Multinomial is the multivariate generalization: counts of KK categories across nn trials
  • Poisson is the limit of Binomial(n,λ/n)(n, \lambda/n) as nn \to \infty (rare events)
  • Geometric is the number of trials until the first success: P(X=k)=(1p)k1pP(X = k) = (1-p)^{k-1}p

The Gaussian Distribution

The **univariate Gaussian** (normal) distribution with mean $\mu$ and variance $\sigma^2$:

p(x)=12πσ2exp((xμ)22σ2)p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

The multivariate Gaussian for xRnx \in \mathbb{R}^n with mean μ\mu and covariance matrix Σ\Sigma:

p(x)=1(2π)n/2Σ1/2exp(12(xμ)Σ1(xμ))p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)

The exponent (xμ)Σ1(xμ)(x - \mu)^\top \Sigma^{-1}(x - \mu) is the Mahalanobis distance from xx to μ\mu. Level sets of the density are ellipsoids aligned with the eigenvectors of Σ\Sigma.

PropertyFormulaProof Sketch
Marginalp(xA)=N(μA,ΣAA)p(x_A) = \mathcal{N}(\mu_A, \Sigma_{AA})Integrate out xBx_B; completing the square
Conditional$p(x_Ax_B) = \mathcal{N}(\mu_{A
SumX+YN(μX+μY,ΣX+ΣY)X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \Sigma_X + \Sigma_Y) if independentConvolution of Gaussians via completing the square
Linear transformAX+bN(Aμ+b,AΣA)AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)Moment generating function or change of variables
ProductN(μ1,Σ1)N(μ2,Σ2)N(μ,Σ)\mathcal{N}(\mu_1, \Sigma_1) \cdot \mathcal{N}(\mu_2, \Sigma_2) \propto \mathcal{N}(\mu_*, \Sigma_*)Σ1=Σ11+Σ21\Sigma_*^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} (precision addition)
**Gaussian conditioning and Bayesian linear regression.** The conditional formula $\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$ is a linear function of $x_B$, which is why Bayesian linear regression (Gaussian prior + Gaussian likelihood) yields a Gaussian posterior in closed form. The posterior mean is the regularized least-squares solution, and the posterior covariance quantifies parameter uncertainty. The Gaussian distribution is central to ML for several deep reasons:
  1. Maximum entropy: Among all distributions with given mean and variance, the Gaussian has maximum entropy (maximum uncertainty). Using a Gaussian makes the fewest assumptions beyond first and second moments.
  2. Central Limit Theorem: Sums of many independent random variables converge to a Gaussian, regardless of the original distribution. This justifies Gaussian noise models whenever errors arise from many small independent sources.
  3. Algebraic closure: Gaussians are closed under marginalization, conditioning, linear transformation, and products. This enables exact inference in linear-Gaussian models (Kalman filter, factor analysis, Gaussian processes).
  4. MLE connection: Minimizing MSE loss yf(x)2\|y - f(x)\|^2 is equivalent to MLE under Gaussian noise yN(f(x),σ2I)y \sim \mathcal{N}(f(x), \sigma^2 I).
**Precision matrix and conditional independence.** The **precision matrix** $\Lambda = \Sigma^{-1}$ encodes conditional independence: $\Lambda_{ij} = 0$ iff $X_i \perp X_j \mid X_{\setminus\{i,j\}}$. This makes the precision matrix the natural parameterization for Gaussian graphical models (also called Gaussian Markov Random Fields). In a graph where edges represent non-zero precision entries, inference is efficient when the graph is sparse. Let $x = (1, 0)^\top$, $\mu = (0, 0)^\top$, and $\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$. We evaluate $p(x)$ for the multivariate Gaussian.

Step 1: Determinant and inverse. Because Σ\Sigma is diagonal, Σ=21=2|\Sigma| = 2 \cdot 1 = 2 and Σ1=(1/2001)\Sigma^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1 \end{pmatrix}.

Step 2: Mahalanobis distance. With xμ=(1,0)x - \mu = (1, 0)^\top,

(xμ)Σ1(xμ)=12(1)2+1(0)2=12.(x - \mu)^\top \Sigma^{-1}(x - \mu) = \tfrac{1}{2}(1)^2 + 1 \cdot (0)^2 = \tfrac{1}{2}.

Step 3: Normalizing constant. For n=2n = 2, (2π)n/2Σ1/2=2π2(2\pi)^{n/2}|\Sigma|^{1/2} = 2\pi \cdot \sqrt{2}.

Step 4: Assemble.

p(x)=12π2exp(1212)=e1/42π20.77888.8860.0877.p(x) = \frac{1}{2\pi\sqrt{2}} \exp\left(-\tfrac{1}{2} \cdot \tfrac{1}{2}\right) = \frac{e^{-1/4}}{2\pi\sqrt{2}} \approx \frac{0.7788}{8.886} \approx 0.0877.

Consider a one-dimensional two-component GMM and the dataset $x = \{0, 1, 4, 5\}$. Suppose the current parameters are $\mu_1 = 0$, $\mu_2 = 5$, equal variances $\sigma_1^2 = \sigma_2^2 = 1$, and equal weights $\pi_1 = \pi_2 = 0.5$. Run one EM iteration. Try the E-step yourself before reading on, using the responsibility formula from the EM remark below.

E-step. The unnormalized weights for component kk are πkN(xiμk,1)\pi_k\,\mathcal{N}(x_i \mid \mu_k, 1), and rikr_{ik} normalizes them across kk. Because the two components are symmetric here, the responsibility ri1r_{i1} for component 1 depends only on the squared-distance gap di=12[(xiμ2)2(xiμ1)2]d_i = \tfrac{1}{2}\big[(x_i - \mu_2)^2 - (x_i - \mu_1)^2\big], giving ri1=σ(di)r_{i1} = \sigma(d_i) with σ\sigma the logistic function:

  • x=0x = 0: d=12(250)=12.5d = \tfrac{1}{2}(25 - 0) = 12.5, so r11=σ(12.5)1.000r_{11} = \sigma(12.5) \approx 1.000.
  • x=1x = 1: d=12(161)=7.5d = \tfrac{1}{2}(16 - 1) = 7.5, so r210.9994r_{21} \approx 0.9994.
  • x=4x = 4: d=12(116)=7.5d = \tfrac{1}{2}(1 - 16) = -7.5, so r310.0006r_{31} \approx 0.0006.
  • x=5x = 5: d=12(025)=12.5d = \tfrac{1}{2}(0 - 25) = -12.5, so r410.000r_{41} \approx 0.000.

M-step (means). Using μk(t+1)=irikxiirik\mu_k^{(t+1)} = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}:

μ1(t+1)=1.0000+0.99941+0.00064+051.000+0.9994+0.0006+01.00182.0000.501.\mu_1^{(t+1)} = \frac{1.000\cdot 0 + 0.9994\cdot 1 + 0.0006\cdot 4 + 0\cdot 5}{1.000 + 0.9994 + 0.0006 + 0} \approx \frac{1.0018}{2.000} \approx 0.501.

By the symmetry ri2=1ri1r_{i2} = 1 - r_{i1}, the mirror computation gives μ2(t+1)4.499\mu_2^{(t+1)} \approx 4.499.

M-step (weights). π1(t+1)=14iri1=14(2.000)=0.500\pi_1^{(t+1)} = \tfrac{1}{4}\sum_i r_{i1} = \tfrac{1}{4}(2.000) = 0.500, and π2(t+1)=0.500\pi_2^{(t+1)} = 0.500.

After one iteration the means have moved toward the cluster centroids (0.50.5 and 4.54.5), illustrating how EM tightens the fit. Iterating further leaves them essentially fixed at this symmetric optimum.

Exponential Family

A distribution belongs to the **exponential family** if its density can be written as:

p(xη)=h(x)exp(ηT(x)A(η))p(x | \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)

where η\eta is the natural parameter, T(x)T(x) is the sufficient statistic, A(η)A(\eta) is the log-partition function (ensures normalization), and h(x)h(x) is the base measure.

Distributionη\eta (natural param)T(x)T(x) (sufficient stat)A(η)A(\eta) (log-partition)
Bernoulli(pp)logp1p\log\frac{p}{1-p}xxlog(1+eη)\log(1 + e^\eta)
Gaussian(μ,σ2\mu, \sigma^2)(μ/σ2,1/2σ2)(\mu/\sigma^2, -1/2\sigma^2)(x,x2)(x, x^2)η12/(4η2)12log(2η2)-\eta_1^2/(4\eta_2) - \frac{1}{2}\log(-2\eta_2)
Poisson(λ\lambda)logλ\log \lambdaxxeηe^\eta
Categorical(π\pi)logπk/πK\log \pi_k / \pi_K1x=k\mathbf{1}_{x=k}logkeηk\log\sum_k e^{\eta_k} (logsumexp)
Dirichlet(α\alpha)αk1\alpha_k - 1logxk\log x_klogΓ(αk)logΓ(αk)\sum \log\Gamma(\alpha_k) - \log\Gamma(\sum \alpha_k)
**Why the exponential family matters for ML:**
  1. Sufficient statistics: T(x)T(x) captures all information about η\eta from the data. For nn i.i.d. samples, iT(xi)\sum_i T(x_i) is sufficient, so you never need to store raw data.
  2. Log-partition function properties: ηA(η)=E[T(X)]\nabla_\eta A(\eta) = \mathbb{E}[T(X)] and η2A(η)=Cov[T(X)]\nabla^2_\eta A(\eta) = \text{Cov}[T(X)]. Since covariance matrices are PSD, A(η)A(\eta) is convex, making MLE a convex problem.
  3. Conjugate priors always exist and have a known form.
  4. GLMs (Generalized Linear Models): Each exponential family member defines a GLM. Logistic regression uses Bernoulli, Poisson regression uses Poisson, linear regression uses Gaussian. The link function is g(μ)=η=wxg(\mu) = \eta = w^\top x.
  5. Softmax is the log-partition: The Categorical log-partition A(η)=logsumexp(η)A(\eta) = \text{logsumexp}(\eta) is exactly the softmax denominator. The gradient A=softmax(η)\nabla A = \text{softmax}(\eta) gives the mean parameters (class probabilities).

Softmax and Temperature

The **softmax function** maps logits $z \in \mathbb{R}^K$ to a probability distribution:

softmax(z)k=ezkj=1Kezj\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}

With temperature τ>0\tau > 0: softmax(z/τ)k\text{softmax}(z/\tau)_k. As τ0\tau \to 0, the distribution approaches argmax (deterministic). As τ\tau \to \infty, it approaches uniform.

**Softmax properties and numerical stability:**
  • Invariance to shift: softmax(z+c)=softmax(z)\text{softmax}(z + c) = \text{softmax}(z) for any scalar cc. For numerical stability, compute softmax(zmaxkzk)\text{softmax}(z - \max_k z_k).
  • Gradient: softmax(z)izj=softmax(z)i(δijsoftmax(z)j)\frac{\partial \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i(\delta_{ij} - \text{softmax}(z)_j). This means the Jacobian is diag(p)pp\text{diag}(p) - pp^\top.
  • Log-softmax: logsoftmax(z)k=zklogsumexp(z)\log\text{softmax}(z)_k = z_k - \text{logsumexp}(z). Always compute log-softmax directly (numerically stable) rather than log(softmax(z))\log(\text{softmax}(z)).
  • Hardmax limit: As τ0\tau \to 0, softmax approaches one-hot encoding of argmaxz\arg\max z. The Gumbel-Softmax trick ([?jang2017categorical]; [?maddison2017concrete]) provides a differentiable approximation to discrete sampling.
**Temperature in practice:**
TemperatureEffectUse Case
τ1\tau \ll 1Sharp, near-deterministicGreedy decoding, distillation (hard labels)
τ=1\tau = 1Standard softmaxTraining, standard inference
τ>1\tau > 1Smoother, more uniformKnowledge distillation (soft labels), exploration
τ\tau \to \inftyUniform distributionMaximum entropy / random baseline

In knowledge distillation (Hinton et al., 2015), the teacher's softmax is evaluated at high temperature to expose "dark knowledge" (relative probabilities of non-target classes). The student is trained to match these soft targets.

Continuous Distributions Beyond the Gaussian

The **continuous uniform** distribution on $[a, b]$:

p(x)=1bafor x[a,b]p(x) = \frac{1}{b - a} \quad \text{for } x \in [a, b]

E[X]=a+b2\mathbb{E}[X] = \frac{a+b}{2}, Var(X)=(ba)212\text{Var}(X) = \frac{(b-a)^2}{12}. Among all distributions on [a,b][a,b], the uniform has maximum entropy.

The **Beta** distribution on $[0, 1]$ with shape parameters $\alpha, \beta > 0$:

p(x)=xα1(1x)β1B(α,β),x[0,1]p(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad x \in [0, 1]

where B(α,β)=Γ(α)Γ(β)Γ(α+β)B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}. E[X]=αα+β\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}, Var(X)=αβ(α+β)2(α+β+1)\text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}. The Beta is the conjugate prior for the Bernoulli parameter pp.

The **Dirichlet** distribution generalizes the Beta to $K$-dimensional probability vectors:

p(πα)=Γ(kαk)kΓ(αk)k=1Kπkαk1,πΔK1p(\pi | \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}, \quad \pi \in \Delta^{K-1}

where ΔK1={π:πk0,kπk=1}\Delta^{K-1} = \{\pi : \pi_k \geq 0, \sum_k \pi_k = 1\} is the probability simplex. E[πk]=αk/α0\mathbb{E}[\pi_k] = \alpha_k / \alpha_0 where α0=kαk\alpha_0 = \sum_k \alpha_k. The concentration parameter α0\alpha_0 controls how peaked the distribution is: α00\alpha_0 \to 0 concentrates on vertices (sparse), α0\alpha_0 \to \infty concentrates on the center (uniform).

The **Gamma** distribution with shape $\alpha > 0$ and rate $\beta > 0$:

p(x)=βαΓ(α)xα1eβx,x>0p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0

E[X]=α/β\mathbb{E}[X] = \alpha/\beta, Var(X)=α/β2\text{Var}(X) = \alpha/\beta^2. Special cases: Exponential (α=1\alpha = 1), Chi-squared (α=k/2,β=1/2\alpha = k/2, \beta = 1/2). The Gamma is the conjugate prior for the Poisson rate and the Gaussian precision.

DistributionSupportKey PropertyML Usage
Gaussian N(μ,σ2)\mathcal{N}(\mu, \sigma^2)R\mathbb{R}Max entropy for given mean/varianceNoise model, latent space, weight initialization
Uniform U(a,b)U(a,b)[a,b][a,b]Max entropy on bounded supportPrior for bounded parameters, random init
Beta B(α,β)B(\alpha, \beta)[0,1][0,1]Conjugate to BernoulliPrior on probabilities, dropout rate
Dirichlet Dir(α)\text{Dir}(\alpha)Simplex ΔK1\Delta^{K-1}Conjugate to CategoricalPrior on mixture weights, topic models
Gamma Γ(α,β)\Gamma(\alpha, \beta)R+\mathbb{R}^+Conjugate to Poisson/Gaussian precisionPrior on positive parameters
Student-ttR\mathbb{R}Heavy tails (robust)Robust regression, outlier modeling
LaplaceR\mathbb{R}$p(x) \propto e^{-x
Log-normalR+\mathbb{R}^+logXN\log X \sim \mathcal{N}Multiplicative positive quantities (scales, durations)

Mixture Models

A **Gaussian mixture model** (GMM) combines $K$ Gaussian components:

p(x)=k=1KπkN(xμk,Σk)p(x) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x | \mu_k, \Sigma_k)

where πk\pi_k are mixing weights (kπk=1\sum_k \pi_k = 1, πk0\pi_k \geq 0). Given enough components, GMMs are universal approximators for densities: they can approximate a broad class of smooth densities to arbitrary accuracy, which makes them a flexible default for modeling multimodal data.

**EM algorithm for GMMs.** Since the component assignments $z_i$ are unobserved (latent), direct MLE is intractable. The **Expectation-Maximization (EM)** algorithm alternates:
  1. E-step: Compute responsibilities rik=P(zi=kxi,θ(t))=πkN(xiμk,Σk)jπjN(xiμj,Σj)r_{ik} = P(z_i = k | x_i, \theta^{(t)}) = \frac{\pi_k \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j)}
  2. M-step: Update parameters using weighted statistics:
    • μk(t+1)=irikxiirik\mu_k^{(t+1)} = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}, Σk(t+1)=irik(xiμk)(xiμk)irik\Sigma_k^{(t+1)} = \frac{\sum_i r_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_i r_{ik}}, πk(t+1)=1Nirik\pi_k^{(t+1)} = \frac{1}{N}\sum_i r_{ik}

EM monotonically increases the log-likelihood and converges to a local maximum. It generalizes beyond GMMs to any latent variable model.

**Latent variable models.** Mixture models introduce a latent variable $z$ (the component assignment). The marginal $p(x) = \sum_z p(x|z)p(z)$ integrates over the latent. This is the same structure as:
  • VAEs (Kingma & Welling, 2014): continuous latent zRdz \in \mathbb{R}^d, p(xz)p(x|z) is a neural network decoder
  • Diffusion models (Ho et al., 2020): chain of latents xTxT1x0x_T \to x_{T-1} \to \cdots \to x_0
  • Topic models (LDA): latent topic assignments for each word
  • Hidden Markov models: latent discrete states with temporal dependencies

The key computational challenge in all these models is computing or approximating the posterior p(zx)p(z|x).

Important Limit Theorems

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and finite variance. The **strong law of large numbers** states:

Xˉn=1ni=1nXia.s.μas n\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty

This justifies using empirical averages (mini-batch gradients, empirical risk) as estimators of expectations.

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$. Then:

Xˉnμσ/ndN(0,1)as n\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty

The CLT tells us the rate: Xˉn\bar{X}_n is approximately N(μ,σ2/n)\mathcal{N}(\mu, \sigma^2/n), so the estimation error scales as O(1/n)O(1/\sqrt{n}) regardless of the original distribution.

**CLT in ML:**
  • Mini-batch gradients: The average gradient over a mini-batch of size BB has variance σ2/B\sigma^2/B, approximately Gaussian for moderate BB by the CLT. This motivates the empirical learning rate linear scaling rule (Goyal et al., 2017): increasing BB by kk and increasing η\eta by kk keeps the noise-to-signal ratio roughly constant. The rule is a heuristic that works well up to moderate batch sizes and is known to break down in the very large batch regime.
  • Confidence intervals: For any estimator computed as an average (accuracy, loss, BLEU score), the CLT gives approximate confidence intervals: μ^±zα/2σ^/n\hat{\mu} \pm z_{\alpha/2} \cdot \hat{\sigma}/\sqrt{n}.
  • Batch normalization: The sample mean and variance computed over a mini-batch converge to population statistics by the LLN, with CLT-governed fluctuations.

Transformations of Random Variables

If $X \sim p_X$ and $Y = g(X)$ where $g$ is a differentiable bijection with inverse $g^{-1}$, then:

pY(y)=pX(g1(y))detg1y=pX(g1(y))detgx1p_Y(y) = p_X(g^{-1}(y)) \left|\det \frac{\partial g^{-1}}{\partial y}\right| = p_X(g^{-1}(y)) \left|\det \frac{\partial g}{\partial x}\right|^{-1}

The Jacobian determinant accounts for the stretching or compression of volume by gg.

**Change of variables in ML.** This formula is the foundation of:
  • Normalizing flows (Rezende & Mohamed, 2015): Compose simple bijections g1g2gLg_1 \circ g_2 \circ \cdots \circ g_L to transform a simple base distribution (e.g., Gaussian) into a complex target. The log-density of the transformed variable is logpX(x)llogdetJl\log p_X(x) - \sum_l \log|\det J_l|.
  • Reparameterization trick (Kingma & Welling, 2014): Sample z=μ+σϵz = \mu + \sigma \odot \epsilon where ϵN(0,I)\epsilon \sim \mathcal{N}(0, I). This moves the stochasticity out of the distribution parameters, enabling backpropagation through the sampling step.
  • Probability integral transform: If XFX \sim F with FF continuous, then F(X)U(0,1)F(X) \sim U(0,1). (The result requires continuity of FF; it fails for discrete or mixed distributions.) This is used for calibration assessment and copula models.

Notation Summary

SymbolMeaning
p,πp, \piProbability density / mixing weight
μ\muMean (scalar or vector)
σ2,Σ\sigma^2, \SigmaVariance (scalar) / covariance (matrix)
Λ=Σ1\Lambda = \Sigma^{-1}Precision matrix
N(μ,Σ)\mathcal{N}(\mu, \Sigma)Gaussian distribution
τ\tauTemperature parameter
KKNumber of classes or mixture components
zzLogits or latent variable
η\etaNatural parameter (exponential family)
T(x)T(x)Sufficient statistic
A(η)A(\eta)Log-partition function
B(α,β)B(\alpha, \beta)Beta function
Γ()\Gamma(\cdot)Gamma function
ΔK1\Delta^{K-1}(K1)(K-1)-dimensional probability simplex

References