Distributions
Probability distributions are the building blocks of statistical modeling. Every ML model implicitly or explicitly defines a probability distribution: a classifier outputs a Categorical distribution, a regression model outputs a Gaussian, a language model outputs a distribution over token sequences. This chapter covers the distributions that appear most frequently in machine learning, their properties, and their relationships.
Discrete Distributions
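Bernoulli. For $x \in \{0, 1\}$ with success probability $p$: $P(x) = p^x (1-p)^{1-x}$. $\mathbb{E}[x] = p$, $\mathrm{Var}(x) = p(1-p)$. The Bernoulli is an exponential family distribution with natural parameter $\eta = \log\frac{p}{1-p}$ (the log-odds, or logit).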
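Categorical. For $x \in \{1, \dots, K\}$ with probabilities $\pi = (\pi_1, \dots, \pi_K)$: $P(x = k) = \pi_k$. In one-hot encoding, $\mathbb{E}[x_k] = \pi_k$, $\mathrm{Var}(x_k) = \pi_k(1-\pi_k)$.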
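Binomial. For the number of successes $k \in \{0, \dots, n\}$ in $n$ independent trials with success probability $p$: $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$. $\mathbb{E}[k] = np$, $\mathrm{Var}(k) = np(1-p)$. As $n \to \infty$ with $np = \lambda$ fixed, the Binomial converges to the Poisson. As $n \to \infty$ with $p$ fixed, it is approximately $\mathcal{N}(np, np(1-p))$ by the CLT.
Poisson. For counts $k \in \{0, 1, 2, \dots\}$ with rate $\lambda > 0$: $P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$. $\mathbb{E}[k] = \mathrm{Var}(k) = \lambda$. The conjugate prior for the Poisson rate is the Gamma distribution. It appears in count data modeling (event frequency, word counts in bag-of-words models).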
- Bernoulli is Categorical with $K = 2$, and Binomial with $n = 1$
- Binomial is the sum of $n$ i.i.d. Bernoulli trials
- Multinomial is the multivariate generalization: counts of $K$ categories across $n$ trials
- Poisson is the limit of Binomial as $n \to \infty$, $p \to 0$ with $np = \lambda$ fixed (rare events)
- Geometric is the number of trials until the first success: $P(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, \dots$
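The Poisson limit of the Binomial can be checked numerically. The sketch below is illustrative only: the function names and the choice $\lambda = 3$ are assumptions made for this example, not part of the original text. It compares the two probability mass functions as $n$ grows while the product $np$ stays fixed.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(n: int, lam: float, k_max: int = 20) -> float:
    """Largest |Binomial(n, lam/n) pmf - Poisson(lam) pmf| over k = 0..k_max."""
    p = lam / n  # keep n * p = lam fixed as n grows
    ks = range(min(k_max, n) + 1)
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in ks)


if __name__ == "__main__":
    # Illustrative values (assumed): the gap should shrink as n increases.
    for n in (10, 100, 1000, 10000):
        print(n, max_pmf_gap(n, lam=3.0))
```

The gap shrinks roughly in proportion to $1/n$, which is why the Poisson is a reasonable model for counts of rare events over many trials.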
The Gaussian Distribution
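The multivariate Gaussian for $x \in \mathbb{R}^d$ with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \lvert\Sigma\rvert^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
The quadratic form in the exponent, $(x - \mu)^\top \Sigma^{-1} (x - \mu)$, is the squared Mahalanobis distance from $x$ to $\mu$. Level sets of the density are ellipsoids aligned with the eigenvectors of $\Sigma$.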
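| Property | Formula | Proof Sketch |
|---|---|---|
| Marginal | $x_A \sim \mathcal{N}(\mu_A, \Sigma_{AA})$ | Integrate out $x_B$; completing the square |
| Conditional | $p(x_A \mid x_B) = \mathcal{N}(\mu_{A \mid B}, \Sigma_{A \mid B})$ with $\mu_{A \mid B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$ and $\Sigma_{A \mid B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$ | Completing the square in $x_A$; Schur complement |
| Sum | $x + y \sim \mathcal{N}(\mu_x + \mu_y, \Sigma_x + \Sigma_y)$ if independent | Convolution of Gaussians via completing the square |
| Linear transform | $Ax + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)$ | Moment generating function or change of variables |
| Product | $\mathcal{N}(x \mid \mu_1, \Sigma_1)\,\mathcal{N}(x \mid \mu_2, \Sigma_2) \propto \mathcal{N}(x \mid \mu_*, \Sigma_*)$ with $\Sigma_*^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1}$ (precision addition) and $\mu_* = \Sigma_*(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)$ | Add the exponents and complete the square |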
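As a small illustration of the conditional and product rows, the sketch below works with scalar blocks so that no matrix library is needed. The function names and the numeric values are hypothetical, chosen only for this example.

```python
def condition_bivariate_gaussian(mu_a, mu_b, var_a, var_b, cov_ab, x_b):
    """Mean and variance of x_A given x_B = x_b for a jointly Gaussian pair.

    Scalar version of mu_A + S_AB S_BB^{-1} (x_B - mu_B) and
    S_AA - S_AB S_BB^{-1} S_BA.
    """
    gain = cov_ab / var_b
    cond_mean = mu_a + gain * (x_b - mu_b)
    cond_var = var_a - gain * cov_ab
    return cond_mean, cond_var


def product_of_gaussians(mu1, var1, mu2, var2):
    """Mean and variance of the (renormalized) product of two 1-D Gaussian densities."""
    precision = 1.0 / var1 + 1.0 / var2  # precisions add
    var = 1.0 / precision
    mean = var * (mu1 / var1 + mu2 / var2)  # precision-weighted mean
    return mean, var


if __name__ == "__main__":
    # Assumed example values.
    print(condition_bivariate_gaussian(0.0, 0.0, 1.0, 4.0, 1.0, x_b=2.0))  # (0.5, 0.75)
    print(product_of_gaussians(0.0, 1.0, 2.0, 1.0))  # (1.0, 0.5)
```

Note that the conditional variance does not depend on the observed value of $x_B$, only on the covariance structure.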
- Maximum entropy: Among all distributions with given mean and variance, the Gaussian has maximum entropy (maximum uncertainty). Using a Gaussian makes the fewest assumptions beyond first and second moments.
- Central Limit Theorem: Suitably normalized sums of many independent random variables with finite variance converge to a Gaussian, regardless of the original distribution. This justifies Gaussian noise models whenever errors arise from many small independent sources.
- Algebraic closure: Gaussians are closed under marginalization, conditioning, linear transformation, and products of densities (up to normalization). This enables exact inference in linear-Gaussian models (Kalman filter, factor analysis, Gaussian processes).
- MLE connection: Minimizing MSE loss is equivalent to MLE under additive Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ with fixed variance.
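Step 1: Determinant and inverse. Because $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$ is diagonal, $\lvert\Sigma\rvert = \prod_i \sigma_i^2$ and $\Sigma^{-1} = \mathrm{diag}(1/\sigma_1^2, \dots, 1/\sigma_d^2)$.
Step 2: Mahalanobis distance. With $\delta = x - \mu$, $\delta^\top \Sigma^{-1} \delta = \sum_i \delta_i^2 / \sigma_i^2$.
Step 3: Normalizing constant. For dimension $d$, $(2\pi)^{d/2} \lvert\Sigma\rvert^{1/2} = (2\pi)^{d/2} \prod_i \sigma_i$.
Step 4: Assemble. $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \prod_i \sigma_i} \exp\left(-\tfrac{1}{2} \sum_i \delta_i^2 / \sigma_i^2\right)$.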
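The four steps above (determinant and inverse, Mahalanobis distance, normalizing constant, assembly) translate directly into code for a diagonal covariance. This is a minimal sketch that works in log space for numerical stability; the function name and the test point are assumptions made for illustration, not values from the worked example.

```python
import math


def diag_gaussian_logpdf(x, mu, variances):
    """Log-density of N(mu, diag(variances)) at x, following the four steps."""
    d = len(x)
    # Step 1: for a diagonal covariance, log|Sigma| is the sum of log-variances
    # and the inverse is elementwise 1 / variance.
    log_det = sum(math.log(v) for v in variances)
    # Step 2: squared Mahalanobis distance.
    maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, variances))
    # Step 3: log of the normalizing constant (2*pi)^(d/2) |Sigma|^(1/2).
    log_norm = 0.5 * (d * math.log(2.0 * math.pi) + log_det)
    # Step 4: assemble.
    return -log_norm - 0.5 * maha


if __name__ == "__main__":
    # Hypothetical 2-D point, mean, and per-dimension variances.
    print(math.exp(diag_gaussian_logpdf([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])))
```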
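E-step. The unnormalized weights for component $k$ are $\pi_k \mathcal{N}(x_i \mid \mu_k, \sigma^2)$, and $\gamma_{ik}$ normalizes them across $k$. Because the two components are symmetric here (equal mixing weights and a shared variance $\sigma^2$), the responsibility for component 1 depends only on the squared-distance gap $\Delta_i = \lVert x_i - \mu_2 \rVert^2 - \lVert x_i - \mu_1 \rVert^2$, giving $\gamma_{i1} = s\left(\Delta_i / (2\sigma^2)\right)$ with $s(t) = 1/(1 + e^{-t})$ the logistic function: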
- : , so .
- : , so .
- : , so .
- : , so .
M-step (means). Using $\mu_k^{\text{new}} = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$:
By the symmetry , the mirror computation gives .
M-step (weights). , and .
After one iteration the means have moved toward the cluster centroids ( and ), illustrating how EM tightens the fit. Iterating further leaves them essentially fixed at this symmetric optimum.
Exponential Family
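A distribution belongs to the exponential family if its density can be written as
$$p(x \mid \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)$$
where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function (ensures normalization), and $h(x)$ is the base measure.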
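| Distribution | $\eta$ (natural param) | $T(x)$ (sufficient stat) | $A(\eta)$ (log-partition) |
|---|---|---|---|
| Bernoulli($p$) | $\log\frac{p}{1-p}$ | $x$ | $\log(1 + e^{\eta})$ |
| Gaussian($\mu, \sigma^2$) | $\left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$ | $(x, x^2)$ | $\frac{\mu^2}{2\sigma^2} + \log\sigma$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^{\eta}$ |
| Categorical($\pi$) | $\eta_k = \log\pi_k + c$ (logits) | one-hot encoding of $x$ | $\log\sum_k e^{\eta_k}$ (logsumexp) |
| Dirichlet($\alpha$) | $\alpha_k - 1$ | $\log x_k$ | $\sum_k \log\Gamma(\alpha_k) - \log\Gamma\left(\sum_k \alpha_k\right)$ |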
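- Sufficient statistics: $T(x)$ captures all information about $\eta$ from the data. For $n$ i.i.d. samples, $\sum_{i=1}^n T(x_i)$ is sufficient, so you never need to store raw data.
- Log-partition function properties: $\nabla A(\eta) = \mathbb{E}[T(x)]$ and $\nabla^2 A(\eta) = \mathrm{Cov}[T(x)]$. Since covariance matrices are PSD, $A$ is convex, so the negative log-likelihood is convex in $\eta$ and MLE is a convex problem.
- Conjugate priors always exist and have a known form.
- GLMs (Generalized Linear Models): Each exponential family member defines a GLM. Logistic regression uses Bernoulli, Poisson regression uses Poisson, linear regression uses Gaussian. The canonical link function is $g = (\nabla A)^{-1}$, which maps the mean $\mathbb{E}[T(x)]$ back to the natural parameter $\eta$.
- Softmax and the log-partition: The Categorical log-partition $A(\eta) = \log\sum_k e^{\eta_k}$ is exactly the log of the softmax denominator. Its gradient $\nabla A(\eta) = \mathrm{softmax}(\eta)$ gives the mean parameters (class probabilities).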
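The log-partition identities can be sanity-checked numerically for the Bernoulli, whose log-partition is $A(\eta) = \log(1 + e^{\eta})$. The sketch below uses central finite differences; the helper names and the test value of $\eta$ are assumptions for illustration.

```python
import math


def bernoulli_log_partition(eta: float) -> float:
    """A(eta) = log(1 + exp(eta)) for the Bernoulli in natural form."""
    return math.log1p(math.exp(eta))


def sigmoid(eta: float) -> float:
    """Mean parameter p = E[x] as a function of the natural parameter."""
    return 1.0 / (1.0 + math.exp(-eta))


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    eta = 0.7  # arbitrary test point
    p = sigmoid(eta)
    # First derivative of A should match the mean E[x] = p.
    print(central_difference(bernoulli_log_partition, eta), p)
    # Second derivative of A should match the variance p * (1 - p).
    print(central_difference(sigmoid, eta), p * (1.0 - p))
```

The second derivative is a variance and therefore non-negative, which is the one-dimensional version of the convexity argument.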
Softmax and Temperature
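The softmax maps logits $z \in \mathbb{R}^K$ to a probability vector: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$. With temperature $\tau > 0$: $\mathrm{softmax}(z / \tau)_i = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}$. As $\tau \to 0^+$, the distribution approaches argmax (deterministic). As $\tau \to \infty$, it approaches uniform.
- Invariance to shift: $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$ for any scalar $c$. For numerical stability, compute $\mathrm{softmax}(z - \max_j z_j)$.
- Gradient: with $p = \mathrm{softmax}(z)$, $\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$. This means the Jacobian is $\mathrm{diag}(p) - p p^\top$.
- Log-softmax: $\log \mathrm{softmax}(z)_i = z_i - \log\sum_j e^{z_j}$. Always compute log-softmax directly (numerically stable) rather than $\log(\mathrm{softmax}(z))$.
- Hardmax limit: As $\tau \to 0^+$, softmax approaches the one-hot encoding of $\arg\max_i z_i$. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) provides a differentiable approximation to discrete sampling.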
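The stability advice in the list above can be sketched in a few lines: subtract the maximum logit before exponentiating, and derive softmax from log-softmax rather than the other way round. The function names and the `temperature` parameter name are assumptions for this example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift invariance: subtracting the max changes nothing
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))  # logsumexp
    return [s - log_z for s in scaled]


def softmax(logits, temperature=1.0):
    """Softmax obtained by exponentiating the stable log-softmax."""
    return [math.exp(lp) for lp in log_softmax(logits, temperature)]


if __name__ == "__main__":
    z = [1.0, 2.0, 3.0]
    print(softmax(z))                      # standard softmax
    print(softmax(z, temperature=0.01))    # close to one-hot on the argmax
    print(softmax(z, temperature=1000.0))  # close to uniform
    print(softmax([1000.0, 1001.0, 1002.0]))  # no overflow despite large logits
```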
| Temperature | Effect | Use Case |
|---|---|---|
| $\tau \to 0$ | Sharp, near-deterministic | Greedy decoding, distillation (hard labels) |
| $\tau = 1$ | Standard softmax | Training, standard inference |
| $\tau > 1$ | Smoother, more uniform | Knowledge distillation (soft labels), exploration |
| $\tau \to \infty$ | Uniform distribution | Maximum entropy / random baseline |
In knowledge distillation (Hinton et al., 2015), the teacher's softmax is evaluated at high temperature to expose "dark knowledge" (relative probabilities of non-target classes). The student is trained to match these soft targets.
Continuous Distributions Beyond the Gaussian
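Uniform. $p(x) = \frac{1}{b - a}$ for $x \in [a, b]$. $\mathbb{E}[x] = \frac{a + b}{2}$, $\mathrm{Var}(x) = \frac{(b - a)^2}{12}$. Among all distributions on $[a, b]$, the uniform has maximum entropy.
Beta. $p(x \mid \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$ for $x \in [0, 1]$, where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$. $\mathbb{E}[x] = \frac{\alpha}{\alpha + \beta}$, $\mathrm{Var}(x) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$. The Beta is the conjugate prior for the Bernoulli parameter $p$.
Dirichlet. $p(x \mid \alpha) = \frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)} \prod_k x_k^{\alpha_k - 1}$ for $x \in \Delta^{K-1}$, where $\Delta^{K-1}$ is the probability simplex. $\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}$ where $\alpha_0 = \sum_k \alpha_k$. The concentration parameter controls how peaked the distribution is: for a symmetric Dirichlet, $\alpha_k < 1$ concentrates mass near the vertices (sparse), while $\alpha_k > 1$ concentrates it near the center (uniform).
Gamma. $p(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$, with shape $\alpha$ and rate $\beta$. $\mathbb{E}[x] = \frac{\alpha}{\beta}$, $\mathrm{Var}(x) = \frac{\alpha}{\beta^2}$. Special cases: Exponential ($\alpha = 1$), Chi-squared with $\nu$ degrees of freedom ($\alpha = \nu/2$, $\beta = 1/2$). The Gamma is the conjugate prior for the Poisson rate and the Gaussian precision.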
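Conjugacy means the posterior stays in the same family and the update reduces to bookkeeping on the parameters. The following sketch shows this for the Gamma prior on a Poisson rate, assuming the shape-rate parameterization of the Gamma; the function name and the count data are hypothetical.

```python
def gamma_poisson_update(alpha: float, beta: float, counts):
    """Posterior (shape, rate) of a Poisson rate under a Gamma(alpha, beta) prior.

    Assumes the shape-rate parameterization: observing counts x_1..x_n gives
    Gamma(alpha + sum(x_i), beta + n).
    """
    return alpha + sum(counts), beta + len(counts)


if __name__ == "__main__":
    # Hypothetical prior and observed event counts.
    post_alpha, post_beta = gamma_poisson_update(2.0, 1.0, [3, 5, 4, 6])
    print(post_alpha, post_beta)   # 20.0 5.0
    print(post_alpha / post_beta)  # posterior mean of the rate: 4.0
```

The posterior mean sits between the prior mean (2.0) and the sample mean (4.5), moving toward the data as more counts are observed.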
| Distribution | Support | Key Property | ML Usage |
|---|---|---|---|
| Gaussian | $\mathbb{R}^d$ | Max entropy for given mean/variance | Noise model, latent space, weight initialization |
| Uniform | $[a, b]$ | Max entropy on bounded support | Prior for bounded parameters, random init |
| Beta | $[0, 1]$ | Conjugate to Bernoulli | Prior on probabilities, dropout rate |
| Dirichlet | Simplex | Conjugate to Categorical | Prior on mixture weights, topic models |
| Gamma | $(0, \infty)$ | Conjugate to Poisson/Gaussian precision | Prior on positive parameters |
| Student-$t$ | $\mathbb{R}$ | Heavy tails (robust) | Robust regression, outlier modeling |
| Laplace | $\mathbb{R}$ | $p(x) \propto e^{-\lvert x \rvert / b}$ | Prior corresponding to L1 regularization (sparsity) |
| Log-normal | $(0, \infty)$ | $\log x \sim \mathcal{N}(\mu, \sigma^2)$ | Multiplicative positive quantities (scales, durations) |
Mixture Models
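A Gaussian mixture model (GMM) with $K$ components has density
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ are mixing weights ($\pi_k \geq 0$, $\sum_k \pi_k = 1$). Given enough components, GMMs are universal approximators for densities: they can approximate a broad class of smooth densities to arbitrary accuracy, which makes them a flexible default for modeling multimodal data.
The parameters are fit with the Expectation-Maximization (EM) algorithm, which alternates two steps:
- E-step: Compute responsibilities $\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
- M-step: Update parameters using weighted statistics, with $N_k = \sum_i \gamma_{ik}$:
  - $\mu_k = \frac{1}{N_k}\sum_i \gamma_{ik} x_i$, $\Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top$, $\pi_k = \frac{N_k}{N}$
EM never decreases the log-likelihood and converges to a stationary point, typically a local maximum. It generalizes beyond GMMs to any latent variable model.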
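The E-step and M-step can be sketched for a one-dimensional mixture in plain Python. This is a minimal illustration, not a production implementation: the data, the initial parameters, and the small variance floor are assumptions introduced for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the current mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: responsibilities gamma[i][k], normalized across components.
    resp = []
    for x in data:
        unnorm = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    # M-step: responsibility-weighted counts, means, variances, and weights.
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        n_k = sum(r[k] for r in resp)
        mean_k = sum(r[k] * x for r, x in zip(resp, data)) / n_k
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, data)) / n_k
        new_w.append(n_k / len(data))
        new_m.append(mean_k)
        new_v.append(max(var_k, var_floor))  # assumed floor to avoid collapse
    return new_w, new_m, new_v


if __name__ == "__main__":
    # Hypothetical data with two well-separated clusters.
    data = [-2.1, -1.9, -2.0, -1.5, 1.8, 2.2, 2.0, 2.5]
    params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
    for step in range(5):
        params = em_step(data, *params)
        print(step, params[1], log_likelihood(data, *params))
```

Printing the log-likelihood at each step is a cheap correctness check: a decrease indicates a bug in one of the two steps.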
- VAEs (Kingma & Welling, 2014): continuous latent $z$, $p_\theta(x \mid z)$ is a neural network decoder
- Diffusion models (Ho et al., 2020): chain of latents $x_1, \dots, x_T$
- Topic models (LDA): latent topic assignments for each word
- Hidden Markov models: latent discrete states with temporal dependencies
The key computational challenge in all these models is computing or approximating the posterior $p(z \mid x)$.
Important Limit Theorems
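Law of Large Numbers. For i.i.d. samples $x_1, \dots, x_n$ with finite mean $\mu$, the sample mean $\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x_i$ converges to $\mu$ as $n \to \infty$. This justifies using empirical averages (mini-batch gradients, empirical risk) as estimators of expectations.
Central Limit Theorem. If the variance $\sigma^2$ is also finite, $\sqrt{n}(\bar{x}_n - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$. The CLT tells us the rate: $\bar{x}_n$ is approximately $\mathcal{N}(\mu, \sigma^2 / n)$, so the estimation error scales as $1/\sqrt{n}$ regardless of the original distribution.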
- Mini-batch gradients: The average gradient over a mini-batch of size $B$ has variance $\sigma^2 / B$ and is approximately Gaussian for moderate $B$ by the CLT. This motivates the empirical linear scaling rule for the learning rate (Goyal et al., 2017): multiplying the batch size $B$ by $k$ and the learning rate by $k$ keeps the noise-to-signal ratio roughly constant. The rule is a heuristic that works well up to moderate batch sizes and is known to break down in the very large batch regime.
- Confidence intervals: For any estimator computed as an average over $n$ samples (accuracy, loss, BLEU score), the CLT gives approximate 95% confidence intervals: $\hat{\mu} \pm 1.96\,\hat{\sigma}/\sqrt{n}$.
- Batch normalization: The sample mean and variance computed over a mini-batch converge to population statistics by the LLN, with CLT-governed fluctuations.
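As an illustration of the confidence-interval bullet, the sketch below computes a normal-approximation interval for a mean from per-example values. The function name and the evaluation data (900 correct predictions out of 1000) are hypothetical.

```python
import math


def mean_confidence_interval(values, z: float = 1.96):
    """Approximate CLT-based interval for the mean: mean +/- z * s / sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    # Unbiased sample variance (divides by n - 1).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical per-example correctness indicators: 90% accuracy on 1000 examples.
    correct = [1.0] * 900 + [0.0] * 100
    print(mean_confidence_interval(correct))
```

The approximation is less reliable for small $n$ or for accuracies very close to 0 or 1, where the sampling distribution is noticeably skewed.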
Transformations of Random Variables
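If $y = f(x)$ for a differentiable bijection $f$ and $x$ has density $p_X$, then
$$p_Y(y) = p_X\left(f^{-1}(y)\right) \left\lvert \det \frac{\partial f^{-1}(y)}{\partial y} \right\rvert$$
The Jacobian determinant accounts for the stretching or compression of volume by $f$.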
- Normalizing flows (Rezende & Mohamed, 2015): Compose simple bijections $z_K = f_K \circ \cdots \circ f_1(z_0)$ to transform a simple base distribution (e.g., Gaussian) into a complex target. The log-density of the transformed variable is $\log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left\lvert \det \frac{\partial f_k}{\partial z_{k-1}} \right\rvert$.
- Reparameterization trick (Kingma & Welling, 2014): Sample $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. This moves the stochasticity out of the distribution parameters, enabling backpropagation through the sampling step.
- Probability integral transform: If $x \sim F$ with $F$ continuous, then $F(x) \sim \mathrm{Uniform}(0, 1)$. (The result requires continuity of $F$; it fails for discrete or mixed distributions.) This is used for calibration assessment and copula models.
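The change-of-variables rule can be illustrated with the simplest nontrivial bijection, $y = e^{x}$ with $x$ Gaussian, which yields the log-normal density. The sketch below is a one-dimensional example under that assumption; the function names are introduced here for illustration.

```python
import math


def normal_logpdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


def lognormal_logpdf(y: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Log-density of y = exp(x), x ~ N(mu, sigma^2), via change of variables.

    The inverse map is x = log(y), whose derivative is 1 / y, so
    log p_Y(y) = log p_X(log y) + log|1 / y|.
    """
    x = math.log(y)
    log_abs_jacobian = -math.log(y)
    return normal_logpdf(x, mu, sigma) + log_abs_jacobian


if __name__ == "__main__":
    # Crude midpoint-rule check that the transformed density integrates to about 1.
    step = 0.0025
    total = sum(math.exp(lognormal_logpdf((i + 0.5) * step)) * step for i in range(20000))
    print(total)
```

Dropping the Jacobian term would leave a function that no longer integrates to one, which is the practical reason flows must track log-determinants.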
Notation Summary
| Symbol | Meaning |
|---|---|
| $p(x)$ / $\pi_k$ | Probability density / mixing weight |
| $\mu$ | Mean (scalar or vector) |
| $\sigma^2$ / $\Sigma$ | Variance (scalar) / covariance (matrix) |
| $\Lambda = \Sigma^{-1}$ | Precision matrix |
| $\mathcal{N}(\mu, \Sigma)$ | Gaussian distribution |
| $\tau$ | Temperature parameter |
| $K$ | Number of classes or mixture components |
| $z$ | Logits or latent variable |
| $\eta$ | Natural parameter (exponential family) |
| $T(x)$ | Sufficient statistic |
| $A(\eta)$ | Log-partition function |
| $B(\alpha, \beta)$ | Beta function |
| $\Gamma(\cdot)$ | Gamma function |
| $\Delta^{K-1}$ | $(K-1)$-dimensional probability simplex |
References
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Jonathan Ho, Ajay Jain, Pieter Abbeel (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
- Diederik P. Kingma, Max Welling (2014). Auto-Encoding Variational Bayes. ICLR.
- Danilo Jimenez Rezende, Shakir Mohamed (2015). Variational Inference with Normalizing Flows. ICML.