Distributions
Probability distributions are the building blocks of statistical modeling. Every ML model implicitly or explicitly defines a probability distribution: a classifier outputs a Categorical distribution, a regression model outputs a Gaussian, a language model outputs a distribution over token sequences. This chapter covers the distributions that appear most frequently in machine learning, their properties, and their relationships.
Discrete Distributions
Bernoulli($\theta$): $p(x) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$. Mean $\theta$, variance $\theta(1-\theta)$. The Bernoulli is an exponential family distribution with natural parameter $\eta = \log\frac{\theta}{1-\theta}$ (the log-odds, or logit).
Categorical($\theta$): $p(x = k) = \theta_k$ for $k \in \{1, \dots, K\}$, with $\theta_k \ge 0$ and $\sum_k \theta_k = 1$.
Binomial($n$, $\theta$): $p(k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}$ for $k \in \{0, \dots, n\}$. Mean $n\theta$, variance $n\theta(1-\theta)$. As $n \to \infty$ with $n\theta = \lambda$ fixed, the Binomial converges to the Poisson($\lambda$). As $n \to \infty$ with $\theta$ fixed, $\mathrm{Binomial}(n, \theta) \approx \mathcal{N}(n\theta,\, n\theta(1-\theta))$ by the CLT.
Poisson($\lambda$): $p(k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \in \{0, 1, 2, \dots\}$. Mean and variance are both $\lambda$. The Gamma distribution is the conjugate prior for the rate parameter $\lambda$. The Poisson appears in count data modeling (event frequency, word counts in bag-of-words models).
- Bernoulli is Categorical with $K = 2$, and Binomial with $n = 1$
- Binomial is the sum of $n$ i.i.d. Bernoulli($\theta$) trials
- Multinomial is the multivariate generalization: counts of $K$ categories across $n$ trials
- Poisson is the limit of Binomial as $n \to \infty$, $\theta \to 0$ with $n\theta = \lambda$ fixed (rare events)
- Geometric is the number of trials until the first success: $p(k) = (1-\theta)^{k-1}\theta$
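The Poisson limit in the list above is easy to verify numerically. A minimal sketch with illustrative values ($n = 10{,}000$, $\theta = 3 \times 10^{-4}$, so $\lambda = 3$), computing both pmfs in log space for stability:

```python
from math import lgamma, exp, log

# Rare-event regime: large n, small theta, with lambda = n * theta moderate.
# (Illustrative values; any n, theta with n * theta fixed behave similarly.)
n, theta = 10_000, 3e-4
lam = n * theta  # = 3.0

def binom_pmf(k):
    """Binomial(n, theta) pmf, computed in log space for stability."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + k * log(theta) + (n - k) * log(1 - theta))

def poisson_pmf(k):
    """Poisson(lam) pmf."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

# The two pmfs agree closely at every k in the bulk of the support.
max_gap = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(20))
```

The gap shrinks as $n$ grows with $n\theta$ held fixed; here it is already below $10^{-3}$ at every $k$.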
The Gaussian Distribution
The multivariate Gaussian for $x \in \mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$:

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
The exponent is $-\frac{1}{2}$ times the squared Mahalanobis distance from $x$ to $\mu$. Level sets of the density are ellipsoids aligned with the eigenvectors of $\Sigma$.
| Property | Formula | Proof Sketch |
|---|---|---|
| Marginal | $p(x_A) = \mathcal{N}(x_A \mid \mu_A, \Sigma_{AA})$ | Integrate out $x_B$; completing the square |
| Conditional | $p(x_A \mid x_B) = \mathcal{N}\big(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B),\; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}\big)$ | Schur complement; completing the square |
| Sum | $x + y \sim \mathcal{N}(\mu_x + \mu_y,\, \Sigma_x + \Sigma_y)$ if independent | Convolution of Gaussians via completing the square |
| Linear transform | $Ax + b \sim \mathcal{N}(A\mu + b,\, A\Sigma A^\top)$ | Moment generating function or change of variables |
| Product | $\mathcal{N}(\mu_1, \Sigma_1) \cdot \mathcal{N}(\mu_2, \Sigma_2) \propto \mathcal{N}(\mu_*, \Sigma_*)$ with $\Sigma_*^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1}$ (precision addition) | Add the quadratic exponents; complete the square |
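The conditioning formula can be sanity-checked by Monte Carlo: sample the joint, keep draws where $x_B$ is near the conditioning value, and compare the empirical mean and variance of $x_A$ against the closed form. A minimal sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bivariate Gaussian over (x_A, x_B) with illustrative parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Closed-form conditional p(x_A | x_B = b).
b = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (b - mu[1])   # = 3.0
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]     # = 1.36

# Monte Carlo check: sample the joint, keep draws with x_B close to b.
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 1] - b) < 0.05]
emp_mean = near[:, 0].mean()
emp_var = near[:, 0].var()
```

The empirical moments of the retained slice match $\mu_{A|B}$ and $\Sigma_{A|B}$ up to Monte Carlo noise.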
- Maximum entropy: Among all distributions with given mean and variance, the Gaussian has maximum entropy (maximum uncertainty). Using a Gaussian makes the fewest assumptions beyond first and second moments.
- Central Limit Theorem: Sums of many independent random variables converge to a Gaussian, regardless of the original distribution. This justifies Gaussian noise models whenever errors arise from many small independent sources.
- Algebraic closure: Gaussians are closed under marginalization, conditioning, linear transformation, and products. This enables exact inference in linear-Gaussian models (Kalman filter, factor analysis, Gaussian processes).
- MLE connection: Minimizing MSE loss is equivalent to MLE under a Gaussian noise model $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Exponential Family
$$p(x \mid \eta) = h(x)\, \exp\!\big(\eta^\top T(x) - A(\eta)\big)$$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function (ensures normalization), and $h(x)$ is the base measure.
| Distribution | $\eta$ (natural param) | $T(x)$ (sufficient stat) | $A(\eta)$ (log-partition) |
|---|---|---|---|
| Bernoulli($\theta$) | $\log\frac{\theta}{1-\theta}$ | $x$ | $\log(1 + e^\eta)$ |
| Gaussian($\mu, \sigma^2$) | $\big(\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\big)$ | $(x, x^2)$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^\eta$ |
| Categorical($\theta$) | $(\log\theta_1, \dots, \log\theta_K)$ | one-hot$(x)$ | $\log\sum_k e^{\eta_k}$ (logsumexp) |
| Dirichlet($\alpha$) | $(\alpha_1 - 1, \dots, \alpha_K - 1)$ | $(\log\theta_1, \dots, \log\theta_K)$ | $\sum_k \log\Gamma(\alpha_k) - \log\Gamma\big(\sum_k \alpha_k\big)$ |
- Sufficient statistics: $T(x)$ captures all information about $\eta$ from the data. For $n$ i.i.d. samples, $\sum_{i=1}^n T(x_i)$ is sufficient -- you never need to store raw data.
- Log-partition function properties: $\nabla_\eta A(\eta) = \mathbb{E}[T(x)]$ and $\nabla_\eta^2 A(\eta) = \mathrm{Cov}[T(x)]$. Since covariance matrices are PSD, $A$ is convex, making MLE a convex problem.
- Conjugate priors always exist and have a known form.
- GLMs (Generalized Linear Models): Each exponential family member defines a GLM. Logistic regression uses Bernoulli, Poisson regression uses Poisson, linear regression uses Gaussian. The canonical link function is the inverse of the mean mapping $\mu = \nabla A(\eta)$, i.e., $\eta = (\nabla A)^{-1}(\mu)$.
- Softmax is the log-partition: The Categorical log-partition $\log\sum_k e^{\eta_k}$ is exactly the (log of the) softmax denominator. The gradient $\nabla_\eta A(\eta) = \mathrm{softmax}(\eta)$ gives the mean parameters (class probabilities).
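The identities $\nabla_\eta A(\eta) = \mathbb{E}[T(x)]$ and $\nabla_\eta^2 A(\eta) = \mathrm{Cov}[T(x)]$ above can be checked numerically for the Bernoulli, where everything is available in closed form. A minimal sketch using finite differences:

```python
import numpy as np

# Bernoulli in exponential-family form: A(eta) = log(1 + e^eta), T(x) = x.
# Then A'(eta) = sigmoid(eta) = E[x] and A''(eta) = Var[x].
def A(eta):
    return np.log1p(np.exp(eta))

eta = 0.7
eps = 1e-6
grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)   # central difference
mean_T = 1.0 / (1.0 + np.exp(-eta))                  # sigmoid(eta) = E[T(x)]

# Second derivative: use a larger step to avoid cancellation error.
eps2 = 1e-4
hess_A = (A(eta + eps2) - 2 * A(eta) + A(eta - eps2)) / eps2**2
var_T = mean_T * (1.0 - mean_T)                      # Var[T(x)]
```

Since $A''(\eta) = \theta(1-\theta) > 0$, the log-partition is convex, which is the one-dimensional case of the PSD-covariance argument above.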
Softmax and Temperature
With temperature $T$: $\mathrm{softmax}_T(z)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$. As $T \to 0$, the distribution approaches argmax (deterministic). As $T \to \infty$, it approaches uniform.
- Invariance to shift: $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$ for any scalar $c$. For numerical stability, compute $\mathrm{softmax}(z - \max_i z_i)$.
- Gradient: $\frac{\partial\, \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i \big(\delta_{ij} - \mathrm{softmax}(z)_j\big)$. This means the Jacobian is $\mathrm{diag}(s) - s s^\top$ where $s = \mathrm{softmax}(z)$.
- Log-softmax: $\log \mathrm{softmax}(z)_i = z_i - \mathrm{logsumexp}(z)$. Always compute log-softmax directly (numerically stable) rather than $\log(\mathrm{softmax}(z))$.
- Hardmax limit: As $T \to 0$, softmax approaches a one-hot encoding of $\arg\max_i z_i$. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) provides a differentiable approximation to discrete sampling.
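The stability tricks above translate directly into code. A minimal numpy sketch of max-subtracted softmax with temperature and a direct log-softmax:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature: subtract the max first."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # shift invariance: result is unchanged
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    """log softmax(z)_i = z_i - logsumexp(z), without exponentiating twice."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
p_cold = softmax(logits, T=0.1)   # sharper, approaching one-hot
p_hot = softmax(logits, T=10.0)   # flatter, approaching uniform
```

Subtracting the max before exponentiating prevents overflow for large logits while leaving the output unchanged, by the shift-invariance property.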
| Temperature | Effect | Use Case |
|---|---|---|
| $T < 1$ | Sharp, near-deterministic | Greedy decoding, distillation (hard labels) |
| $T = 1$ | Standard softmax | Training, standard inference |
| $T > 1$ | Smoother, more uniform | Knowledge distillation (soft labels), exploration |
| $T \to \infty$ | Uniform distribution | Maximum entropy / random baseline |
In knowledge distillation (Hinton et al., 2015), the teacher's softmax is evaluated at high temperature to expose "dark knowledge" (relative probabilities of non-target classes). The student is trained to match these soft targets.
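A small sketch of this effect, using hypothetical teacher logits (the values are illustrative, not from any real model):

```python
import numpy as np

def softmax_at_T(z, T):
    """Softmax of logits z at temperature T (max-subtracted for stability)."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for a 4-class problem; class 0 is the target.
teacher_logits = np.array([8.0, 3.0, 2.5, -1.0])

hard_targets = softmax_at_T(teacher_logits, T=1.0)  # near one-hot
soft_targets = softmax_at_T(teacher_logits, T=4.0)  # relative prefs visible
```

At $T = 1$ the non-target classes receive almost no probability; at $T = 4$ their relative ordering (the "dark knowledge") becomes visible to the student while the ranking of classes is preserved.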
Continuous Distributions Beyond the Gaussian
Uniform($a$, $b$): $p(x) = \frac{1}{b-a}$ for $x \in [a, b]$. Mean $\frac{a+b}{2}$, variance $\frac{(b-a)^2}{12}$. Among all distributions on $[a, b]$, the uniform has maximum entropy.
Beta($\alpha$, $\beta$): $p(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ on $[0, 1]$, where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$. Mean $\frac{\alpha}{\alpha+\beta}$, variance $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$. The Beta is the conjugate prior for the Bernoulli parameter $\theta$.
Dirichlet($\alpha$): $p(\theta) = \frac{1}{B(\alpha)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}$, where $\theta \in \Delta^{K-1}$ is the probability simplex. Mean $\mathbb{E}[\theta_k] = \frac{\alpha_k}{\sum_j \alpha_j}$. The concentration parameter controls how peaked the distribution is: $\alpha_k < 1$ concentrates on vertices (sparse), $\alpha_k > 1$ concentrates on the center (uniform).
Gamma($\alpha$, $\beta$): $p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ for $x > 0$. Mean $\frac{\alpha}{\beta}$, variance $\frac{\alpha}{\beta^2}$. Special cases: Exponential ($\alpha = 1$), Chi-squared ($\alpha = k/2$, $\beta = 1/2$). The Gamma is the conjugate prior for the Poisson rate and the Gaussian precision.
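Conjugacy makes posterior updates a matter of counting. A minimal Beta-Bernoulli sketch (prior pseudo-counts and the true parameter are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta-Bernoulli conjugacy: a Beta(a, b) prior on theta plus h successes
# and t failures gives a Beta(a + h, b + t) posterior in closed form.
a, b = 2.0, 2.0                      # illustrative prior pseudo-counts
theta_true = 0.7
x = rng.random(1000) < theta_true    # 1000 Bernoulli(0.7) draws
h, t = int(x.sum()), int((~x).sum())

a_post, b_post = a + h, b + t
post_mean = a_post / (a_post + b_post)   # shrinks the MLE toward the prior mean 0.5
mle = h / (h + t)
```

The posterior mean interpolates between the prior mean and the MLE; with 1000 observations the prior's pull is negligible.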
| Distribution | Support | Key Property | ML Usage |
|---|---|---|---|
| Gaussian | $\mathbb{R}^d$ | Max entropy for given mean/variance | Noise model, latent space, weight initialization |
| Uniform | $[a, b]$ | Max entropy on bounded support | Prior for bounded parameters, random init |
| Beta | $[0, 1]$ | Conjugate to Bernoulli | Prior on probabilities, dropout rate |
| Dirichlet | Simplex $\Delta^{K-1}$ | Conjugate to Categorical | Prior on mixture weights, topic models |
| Gamma | $(0, \infty)$ | Conjugate to Poisson/Gaussian precision | Prior on positive parameters |
| Student-$t$ | $\mathbb{R}$ | Heavy tails (robust) | Robust regression, outlier modeling |
| Laplace | $\mathbb{R}$ | $p(x) \propto e^{-\lvert x - \mu\rvert / b}$; sparsity-inducing | MAP prior for L1 regularization |
| Log-normal | $(0, \infty)$ | $\log x$ is Gaussian; heavy right tail | Attention scores, gradient norms |
Mixture Models
A mixture model has density $p(x) = \sum_{k=1}^K \pi_k\, p_k(x)$, where $\pi_k$ are mixing weights ($\pi_k \ge 0$, $\sum_k \pi_k = 1$). GMMs can approximate any continuous density with compact support to arbitrary accuracy (universal approximation for densities), given enough components.
- E-step: Compute responsibilities $r_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
- M-step: Update parameters using weighted statistics, with $N_k = \sum_i r_{ik}$:
  - $\pi_k = \frac{N_k}{N}$, $\mu_k = \frac{1}{N_k}\sum_i r_{ik}\, x_i$, $\Sigma_k = \frac{1}{N_k}\sum_i r_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top$
EM monotonically increases the log-likelihood and converges to a local maximum. It generalizes beyond GMMs to any latent variable model.
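The E- and M-steps above fit in a few lines for a 1-D GMM. A minimal sketch on synthetic data with illustrative parameters, asserting the monotonicity property along the way:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from two Gaussians (illustrative parameters):
# 30% from N(-2, 0.5^2), 70% from N(3, 1^2).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
N, K = len(x), 2

# Initialization
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
var = np.ones(K)

def log_gauss(x, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities r[i, k], computed in log space for stability
    log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    r = np.exp(log_joint - log_norm)
    ll = log_norm.sum()
    assert ll >= prev_ll - 1e-6        # EM never decreases the log-likelihood
    prev_ll = ll
    # M-step: weighted MLE updates
    Nk = r.sum(axis=0)
    pi = Nk / N
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
```

The fitted means recover the two modes and the mixing weights recover the 30/70 split, up to sampling noise.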
- VAEs (Kingma & Welling, 2014): continuous latent $z$ with prior $\mathcal{N}(0, I)$; $p(x \mid z)$ is a neural network decoder
- Diffusion models (Ho et al., 2020): chain of latents $x_1, \dots, x_T$ produced by gradually adding Gaussian noise
- Topic models (LDA): latent topic assignments for each word
- Hidden Markov models: latent discrete states with temporal dependencies
The key computational challenge in all these models is computing or approximating the posterior $p(z \mid x)$.
Important Limit Theorems
Law of Large Numbers: for i.i.d. samples with $\mathbb{E}[x_i] = \mu$, the sample average $\frac{1}{n}\sum_{i=1}^n x_i \to \mu$ as $n \to \infty$. This justifies using empirical averages (mini-batch gradients, empirical risk) as estimators of expectations.
The CLT tells us the rate: $\frac{1}{n}\sum_i x_i$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$, so the estimation error scales as $1/\sqrt{n}$ regardless of the original distribution.
- Mini-batch gradients: The average gradient over a mini-batch of size $B$ has variance $\propto 1/B$, approximately Gaussian for moderate $B$ by the CLT. This justifies the linear scaling rule: if you increase $B$ by a factor of $k$, increase the learning rate by the same factor to maintain the same noise-to-signal ratio.
- Confidence intervals: For any estimator computed as an average (accuracy, loss, BLEU score), the CLT gives approximate confidence intervals: $\hat\mu \pm 1.96\, \hat\sigma / \sqrt{n}$ for 95% coverage.
- Batch normalization: The sample mean and variance computed over a mini-batch converge to population statistics by the LLN, with CLT-governed fluctuations.
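The confidence-interval recipe can be checked empirically: even for skewed, non-Gaussian data, the normal-theory interval covers the true mean about 95% of the time. A minimal sketch using exponential data (true mean 1; sample size and trial count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# CLT-based intervals on averages of non-Gaussian (exponential) data:
# the interval mean +/- 1.96 * s / sqrt(n) should cover the true mean
# (1.0) in roughly 95% of repeated experiments.
n, trials = 400, 2000
covered = 0
for _ in range(trials):
    x = rng.exponential(1.0, size=n)
    m, s = x.mean(), x.std(ddof=1)
    half = 1.96 * s / np.sqrt(n)
    covered += (m - half <= 1.0 <= m + half)
coverage = covered / trials
```

Coverage approaches the nominal 95% as $n$ grows; the residual gap at moderate $n$ comes from the skew of the exponential.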
Transformations of Random Variables
For an invertible, differentiable transformation $y = f(x)$: $p_Y(y) = p_X\big(f^{-1}(y)\big)\left|\det \frac{\partial f^{-1}(y)}{\partial y}\right|$. The Jacobian determinant accounts for the stretching or compression of volume by $f$.
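A concrete instance of this formula: pushing $x \sim \mathcal{N}(0, 1)$ through $f(x) = e^x$ yields the standard log-normal. A minimal sketch that builds $p_Y$ from the formula and checks that it integrates to 1 (trapezoid rule on a truncated grid):

```python
import numpy as np

# Change of variables for y = f(x) = e^x with x ~ N(0, 1):
# f^{-1}(y) = log y and |d f^{-1}/dy| = 1/y, so
#   p_Y(y) = p_X(log y) / y,
# which is the standard log-normal density.
def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    return p_x(np.log(y)) / y

# Numerical check: p_Y integrates to ~1 (truncation error is ~2e-5).
y = np.linspace(1e-8, 60.0, 600_000)
vals = p_y(y)
total = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(y)))
```

Without the $1/y$ Jacobian factor the integral would not be 1, which is the quickest way to catch a missing volume correction.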
- Normalizing flows (Rezende & Mohamed, 2015): Compose simple bijections $f_1, \dots, f_L$ to transform a simple base distribution (e.g., Gaussian) into a complex target. The log-density of the transformed variable is $\log p_Y(y) = \log p_X(x) - \sum_{\ell=1}^L \log\left|\det J_{f_\ell}\right|$.
- Reparameterization trick (Kingma & Welling, 2014): Sample $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. This moves the stochasticity out of the distribution parameters, enabling backpropagation through the sampling step.
- Probability integral transform: If $X$ has continuous CDF $F$, then $F(X) \sim \mathrm{Uniform}(0, 1)$. This is used for calibration assessment and copula models.
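The probability integral transform is easy to demonstrate with a distribution whose CDF is closed-form. A minimal sketch with exponential samples (sample size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Probability integral transform: pushing X through its own CDF gives
# Uniform(0, 1). The exponential has the closed-form CDF F(x) = 1 - e^{-x}.
x = rng.exponential(1.0, size=100_000)
u = 1.0 - np.exp(-x)                      # F(X)

# Uniformity checks: mean 1/2, and the empirical CDF of u stays close
# to the identity (a Kolmogorov-Smirnov-style statistic).
n = len(u)
emp_cdf_gap = np.abs(np.sort(u) - (np.arange(1, n + 1) - 0.5) / n).max()
```

The same idea run in reverse (apply $F^{-1}$ to uniform draws) is inverse transform sampling.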
Notation Summary
| Symbol | Meaning |
|---|---|
| $p(x)$ / $\pi_k$ | Probability density / mixing weight |
| $\mu$ | Mean (scalar or vector) |
| $\sigma^2$ / $\Sigma$ | Variance (scalar) / covariance (matrix) |
| $\Lambda = \Sigma^{-1}$ | Precision matrix |
| $\mathcal{N}(\mu, \Sigma)$ | Gaussian distribution |
| $T$ | Temperature parameter |
| $K$ | Number of classes or mixture components |
| $z$ | Logits or latent variable |
| $\eta$ | Natural parameter (exponential family) |
| $T(x)$ | Sufficient statistic |
| $A(\eta)$ | Log-partition function |
| $B(\alpha, \beta)$ | Beta function |
| $\Gamma(\cdot)$ | Gamma function |
| $\Delta^{K-1}$ | $(K-1)$-dimensional probability simplex |