Bayesian Inference
Bayesian inference is the principled framework for updating beliefs in light of evidence. While most deep learning uses point estimation (MLE/MAP), the Bayesian perspective illuminates why regularization works, provides calibrated uncertainty estimates, and underpins generative models from VAEs to diffusion models. Understanding the Bayesian framework is essential even if you never compute a full posterior.
The Bayesian Framework
The central object is the posterior, given by Bayes' theorem:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$

The posterior encodes everything we know about $\theta$ after observing $\mathcal{D}$. It is a complete description of parameter uncertainty, not just a point estimate.
| Term | Name | Role | ML Analogy |
|---|---|---|---|
| $p(\theta)$ | Prior | Belief about $\theta$ before seeing data | Regularization (L2 = Gaussian prior) |
| $p(\mathcal{D} \mid \theta)$ | Likelihood | How likely the data is given $\theta$ | Training loss (negative log-likelihood) |
| $p(\theta \mid \mathcal{D})$ | Posterior | Updated belief after seeing data | The full "answer" to learning |
| $p(\mathcal{D})$ | Evidence (marginal likelihood) | Normalizing constant; measures model quality | Model selection criterion |
To predict a new observation $y^*$, the posterior predictive distribution integrates over all parameter values:

$$p(y^* \mid \mathcal{D}) = \int p(y^* \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$

This is a weighted average of predictions from all plausible parameter values, weighted by their posterior probability. The predictive distribution is wider (more uncertain) when the posterior is broad (little data) and narrower when the posterior is concentrated (lots of data). Point estimates (MLE/MAP) plug in a single $\hat{\theta}$ and underestimate uncertainty.
Bayesian updating composes: the posterior after one batch of data serves as the prior for the next. As a consequence, the final posterior is independent of the order in which data arrives -- only the total data matters (for i.i.d. data). Online learning algorithms that update parameters incrementally can be seen as approximations to sequential Bayesian updating.
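Order-independence is easy to check numerically with a conjugate model. A minimal sketch using the Beta-Bernoulli pair (covered in detail below); the starting prior Beta(1, 1) and the coin-flip sequences are illustrative:

```python
def beta_update(alpha, beta, data):
    """Sequential conjugate Beta-Bernoulli update: each observation
    adds a success to alpha or a failure to beta."""
    for x in data:
        alpha += x
        beta += 1 - x
    return alpha, beta

# Same five observations (three 1s, two 0s) in two different orders
a1, b1 = beta_update(1.0, 1.0, [1, 1, 0, 1, 0])
a2, b2 = beta_update(1.0, 1.0, [0, 0, 1, 1, 1])
assert (a1, b1) == (a2, b2) == (4.0, 3.0)  # only the totals matter
```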
MLE vs MAP vs Full Bayesian
MLE is consistent (converges to the true parameter as $N \to \infty$) and asymptotically efficient (achieves the Cramér-Rao lower bound). However, it can overfit with finite data and provides no uncertainty estimates.
MAP adds a regularization term to the MLE objective. It is a point estimate that uses the prior but still discards the shape of the posterior.
| Method | Objective | Regularization | Uncertainty | Computation |
|---|---|---|---|---|
| MLE | $\max_\theta \log p(\mathcal{D} \mid \theta)$ | None | No | One optimization |
| MAP | $\max_\theta \log p(\mathcal{D} \mid \theta) + \log p(\theta)$ | Via prior | No | One optimization |
| Full Bayesian | Compute $p(\theta \mid \mathcal{D})$ | Automatic | Yes | Often intractable |
| Ensemble | Train $M$ models, average predictions | Implicit (diversity) | Approximate | $M\times$ training cost |
| MC Dropout | Average $T$ predictions with dropout on | Implicit (dropout) | Approximate | $T$ forward passes |
| Prior | Penalty | Regularization |
|---|---|---|
| Gaussian $\mathcal{N}(0, \sigma_0^2 I)$ | $\frac{\lambda}{2}\|\theta\|_2^2$ | L2 (weight decay) |
| Laplace | $\lambda\|\theta\|_1$ | L1 (sparsity) |
| Hierarchical / heavy-tailed | Adaptive shrinkage | Sparse Bayesian learning |
| Uniform (improper) | None | No regularization (= MLE) |
The regularization coefficient $\lambda$ is the precision (inverse variance) $1/\sigma_0^2$ of the prior. Larger $\lambda$ means a more informative (tighter) prior, corresponding to stronger regularization.
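The Gaussian-prior/L2 correspondence can be verified directly for a linear-Gaussian model, where the MAP estimate has the closed-form ridge solution. A sketch with synthetic data; the noise scale, prior scale, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

sigma = 0.1     # observation noise std (assumed known)
sigma0 = 1.0    # prior std: w ~ N(0, sigma0^2 I)
lam = sigma**2 / sigma0**2  # equivalent weight-decay coefficient

# Closed-form ridge solution (= MAP under the Gaussian prior)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP by gradient descent on the negative log-posterior
w = np.zeros(3)
for _ in range(5000):
    # gradient of ||Xw - y||^2 / (2 sigma^2) + ||w||^2 / (2 sigma0^2)
    grad = (X.T @ (X @ w - y)) / sigma**2 + w / sigma0**2
    w -= 1e-5 * grad

assert np.allclose(w, w_ridge, atol=1e-4)
```

Setting the gradient to zero recovers the normal equations $(X^\top X + \lambda I)\, w = X^\top y$ with $\lambda = \sigma^2/\sigma_0^2$, which is exactly ridge regression.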
Conjugate Priors
| Likelihood | Conjugate Prior | Posterior | Posterior Parameters |
|---|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | Beta | $\alpha + \sum_i x_i$, $\beta + N - \sum_i x_i$ |
| Categorical($\pi$) | Dirichlet($\alpha$) | Dirichlet | $\alpha_k + n_k$ ($n_k$ = count of class $k$) |
| Gaussian ($\mu$, known $\sigma^2$) | Gaussian | Gaussian | Precision-weighted mean; precisions add |
| Gaussian ($\sigma^2$, known $\mu$) | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma | $\alpha + N/2$, $\beta + \frac{1}{2}\sum_i (x_i - \mu)^2$ |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | Gamma | $\alpha + \sum_i x_i$, $\beta + N$ |
| Multinomial | Dirichlet | Dirichlet | Counts added to pseudocounts |
- Beta prior for Bernoulli: Acts like having already seen $\alpha$ successes and $\beta$ failures before the data arrive. The prior strength is $\alpha + \beta$ (total pseudo-count).
- Dirichlet prior for Categorical: Acts like having seen $\alpha_k$ examples of class $k$. Uniform prior: $\alpha_k = 1$ for all $k$; its posterior mean reproduces Laplace smoothing (adds 1 to each count).
- Gaussian prior for Gaussian mean: The posterior mean is a precision-weighted average of the prior mean and sample mean: $\mu_N = \dfrac{\mu_0/\sigma_0^2 + N\bar{x}/\sigma^2}{1/\sigma_0^2 + N/\sigma^2}$. As $N \to \infty$, the posterior concentrates on $\bar{x}$ (data overwhelms the prior).
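The precision-weighted update for a Gaussian mean can be checked numerically. A sketch; the true mean 3.0, noise scale, and prior are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0             # known observation noise std
mu0, sigma0 = 0.0, 1.0  # prior: mu ~ N(mu0, sigma0^2)

x = rng.normal(loc=3.0, scale=sigma, size=1000)
N, xbar = len(x), x.mean()

# Posterior precision is the sum of prior and data precisions
post_prec = 1 / sigma0**2 + N / sigma**2
# Posterior mean is the precision-weighted average of mu0 and xbar
post_mean = (mu0 / sigma0**2 + N * xbar / sigma**2) / post_prec

# With N = 1000 the data overwhelm the prior: posterior mean ~ xbar
assert abs(post_mean - xbar) < 0.05
```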
The Evidence Lower Bound (ELBO)
For most models of interest (neural networks, deep generative models), the posterior is intractable because the evidence integral $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ cannot be computed in closed form. Two main strategies exist: variational inference (optimization-based) and MCMC (sampling-based).
Variational inference posits a tractable family of distributions $q_\phi(\theta)$ and seeks the member closest to the posterior in $\mathrm{KL}(q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D}))$. Since the unknown posterior appears in the KL, we cannot minimize it directly. Instead, we maximize the Evidence Lower Bound (ELBO):

$$\log p(\mathcal{D}) = \log \int \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{q_\phi(\theta)}\, q_\phi(\theta)\, d\theta \;\ge\; \mathbb{E}_{q_\phi}\!\left[\log \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{q_\phi(\theta)}\right] = \mathrm{ELBO}(\phi),$$

where the inequality is Jensen's (since $\log$ is concave). Rearranging:

$$\log p(\mathcal{D}) = \mathrm{ELBO}(\phi) + \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D})\big)$$

The gap between $\log p(\mathcal{D})$ and the ELBO is exactly $\mathrm{KL}(q_\phi \,\|\, p(\theta \mid \mathcal{D}))$. Maximizing the ELBO simultaneously: (1) makes $q_\phi$ close to the true posterior, and (2) provides a lower bound on the model evidence.
Expanding the ELBO gives $\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathcal{D} \mid \theta)] - \mathrm{KL}(q_\phi(\theta) \,\|\, p(\theta))$:
- Reconstruction term $\mathbb{E}_{q_\phi}[\log p(\mathcal{D} \mid \theta)]$: encourages $q_\phi$ to place mass on parameters that explain the data well (fit the data).
- KL regularization $\mathrm{KL}(q_\phi(\theta) \,\|\, p(\theta))$: penalizes $q_\phi$ for deviating from the prior (stay simple).
This is the Bayesian analogue of the bias-variance tradeoff: the reconstruction term reduces bias while the KL term controls variance/complexity. In VAEs, the same decomposition appears with latent variables instead of parameters: reconstruction loss vs. KL to the latent prior.
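The identity $\log p(\mathcal{D}) = \mathrm{ELBO} + \mathrm{KL}(q \,\|\, \text{posterior})$ holds exactly and can be verified on a toy discrete model, where every integral is a finite sum. The three-point prior, likelihood, and $q$ below are made-up numbers:

```python
import numpy as np

# Toy model with a discrete parameter theta in {0, 1, 2}
prior = np.array([0.5, 0.3, 0.2])  # p(theta)
lik = np.array([0.1, 0.6, 0.3])    # p(D | theta) for fixed data D

evidence = (prior * lik).sum()      # p(D)
posterior = prior * lik / evidence  # p(theta | D)

q = np.array([0.2, 0.5, 0.3])       # an arbitrary variational distribution

elbo = (q * np.log(lik * prior / q)).sum()  # E_q[log p(D, theta) - log q]
kl = (q * np.log(q / posterior)).sum()      # KL(q || posterior)

# log evidence = ELBO + KL gap, exactly
assert np.isclose(np.log(evidence), elbo + kl)
```

Improving $q$ (moving it toward the posterior) shrinks the KL term and raises the ELBO toward $\log p(\mathcal{D})$, which is the mechanism variational inference exploits.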
Approximate Inference Methods
- Metropolis-Hastings: Propose $\theta' \sim q(\theta' \mid \theta)$, accept with probability $\min\!\left(1, \dfrac{p(\theta' \mid \mathcal{D})\, q(\theta \mid \theta')}{p(\theta \mid \mathcal{D})\, q(\theta' \mid \theta)}\right)$. Only the unnormalized posterior is needed, since the evidence cancels in the ratio.
- Hamiltonian Monte Carlo (HMC): Use gradient information to make proposals that follow the posterior geometry, achieving much higher acceptance rates in high dimensions.
- Stochastic Gradient MCMC (SG-MCMC): Replace full-data gradients with mini-batch gradients, enabling MCMC at scale. SGD with injected Gaussian noise and an appropriately decaying step size is stochastic gradient Langevin dynamics (SGLD).
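A Metropolis sampler with a symmetric Gaussian proposal fits in a few lines. A sketch; the standard normal stand-in target, step size, and chain length are illustrative choices:

```python
import math
import random

random.seed(0)

def log_post(theta):
    """Unnormalized log-posterior: a standard 1-D Gaussian as a stand-in."""
    return -0.5 * theta**2

def metropolis(n_steps=20000, step=1.0):
    theta, samples = 0.0, []
    for _ in range(n_steps):
        prop = theta + random.gauss(0.0, step)  # symmetric proposal
        # Symmetric q cancels, leaving the Metropolis acceptance ratio
        if math.log(random.random()) < log_post(prop) - log_post(theta):
            theta = prop
        samples.append(theta)
    return samples

s = metropolis()
mean = sum(s) / len(s)
var = sum((x - mean)**2 for x in s) / len(s)
# Chain statistics approach the target's mean 0 and variance 1
```

Note that rejected proposals still append the current state: duplicating the state on rejection is what makes the stationary distribution correct.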
MCMC is asymptotically exact (converges to the true posterior) but slow for high-dimensional problems. VI is faster but introduces approximation error from the variational family.
| Method | Quality | Speed | Scalability | Key Limitation |
|---|---|---|---|---|
| Exact (conjugate) | Exact | Fast | Small models only | Requires conjugacy |
| MCMC (HMC) | Asymptotically exact | Slow | Small to medium models | Mixing time, burn-in |
| SG-MCMC | Approximate | Moderate | Large models | Mini-batch noise |
| Variational inference | Approximate | Fast | Large models | Restricted family |
| Laplace approximation | Gaussian at MAP | Very fast | Large models (approximate Hessians) | Unimodal, symmetric |
| Deep ensembles | Empirically good | $M\times$ training cost | Any model size | No theoretical guarantee |
The Laplace approximation fits a Gaussian at the posterior mode:

$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}\big(\theta \mid \theta_{\mathrm{MAP}},\, H^{-1}\big),$$

where $H = -\nabla^2 \log p(\theta \mid \mathcal{D})\big|_{\theta_{\mathrm{MAP}}}$ is the Hessian of the negative log-posterior at the MAP point. This is the second-order Taylor expansion of $\log p(\theta \mid \mathcal{D})$ around its mode. The Laplace approximation is fast (just requires the MAP point plus the Hessian) but assumes the posterior is approximately Gaussian -- it misses multimodality, skewness, and heavy tails.
For neural networks with $d$ parameters, the full Hessian is $O(d^2)$ in memory. Practical approaches use diagonal, Kronecker-factored (KFAC), or low-rank Hessian approximations (Ritter et al., 2018).
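In one dimension the whole procedure is a few lines. A sketch using a Beta(5, 3) posterior as a stand-in target, chosen so the mode, Hessian, and true variance all have closed forms:

```python
# Laplace approximation to a Beta(5, 3) posterior (illustrative numbers)
a, b = 5.0, 3.0

# MAP estimate: mode of Beta(a, b)
theta_map = (a - 1) / (a + b - 2)  # = 2/3

# H = -d^2/dtheta^2 [ (a-1) log theta + (b-1) log(1-theta) ] at the mode
H = (a - 1) / theta_map**2 + (b - 1) / (1 - theta_map)**2  # = 27

# Gaussian approximation N(theta_map, 1/H) vs. the exact Beta variance
approx_var = 1 / H                               # ~ 0.037
exact_var = a * b / ((a + b)**2 * (a + b + 1))   # ~ 0.026
```

The approximate variance overshoots here because the Beta(5, 3) density is skewed, illustrating the kind of error a symmetric Gaussian fit incurs.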
Model Selection and Evidence
Comparing models $\mathcal{M}_1$ and $\mathcal{M}_2$ uses the posterior odds:

$$\frac{p(\mathcal{M}_1 \mid \mathcal{D})}{p(\mathcal{M}_2 \mid \mathcal{D})} = \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)} \cdot \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)},$$

where the first factor is the Bayes factor. The evidence $p(\mathcal{D} \mid \mathcal{M})$ automatically penalizes complexity (the Bayesian Occam's razor): a model that spreads prior mass over many possible datasets assigns less to any particular one. This is the same quantity the ELBO lower-bounds, $\log p(\mathcal{D})$: maximizing the ELBO over model architectures favors models that are complex enough to explain the data but no more.
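For the Beta-Bernoulli model the evidence has a closed form, $p(\mathcal{D}) = B(\alpha + s, \beta + f)/B(\alpha, \beta)$ for $s$ successes and $f$ failures, so a Bayes factor can be computed exactly. A sketch comparing a flexible coin model against a fixed fair coin; the 7-heads/3-tails data are illustrative:

```python
from math import lgamma, exp, log

def log_beta_fn(a, b):
    """log of the Beta function B(a, b) via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_beta(s, f, a=1.0, b=1.0):
    """log p(D) for Bernoulli data under a Beta(a, b) prior (closed form)."""
    return log_beta_fn(a + s, b + f) - log_beta_fn(a, b)

def log_evidence_fair(s, f):
    """log p(D) under the simpler model: theta fixed at 0.5."""
    return (s + f) * log(0.5)

s, f = 7, 3  # 7 heads, 3 tails
bf = exp(log_evidence_beta(s, f) - log_evidence_fair(s, f))
# bf < 1: the simpler fair-coin model is favored despite 7/10 heads
```

Even with mildly lopsided data, the Bayes factor here comes out below 1: the flexible model spends prior mass on many biases the data do not support, an automatic Occam's razor in action.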
ML Connections
| Model | Bayesian Component | What It Computes |
|---|---|---|
| VAE | Encoder $q_\phi(z \mid x)$ approximates the posterior over latents; prior $p(z)$ | ELBO on $\log p(x)$ (amortized variational inference) |
| Bayesian NN | Prior over weights $p(w)$ | Posterior $p(w \mid \mathcal{D})$ gives uncertainty |
| GPT pretraining | MLE over token sequences: $\max_\theta \sum_t \log p(x_t \mid x_{<t}; \theta)$ | Point estimate of $\theta$ (no posterior) |
| Diffusion models | Variational bound on log-likelihood; forward process defines the prior | Score matching approximates $\nabla_x \log p_t(x)$ |
| Gaussian processes | Exact Bayesian inference with kernel prior | Non-parametric regression with uncertainty |
| Bayesian optimization | Posterior over objective function; acquisition function trades off exploration/exploitation | Hyperparameter tuning |
| Neural architecture search | Prior over architectures; performance posterior guides search | Automated model design |
Why is deep learning not fully Bayesian in practice?
- Computational cost: Computing the posterior over millions or billions of parameters is intractable.
- Prior specification: What is a good prior for transformer weights? We have little domain knowledge to guide the choice.
- Overparameterization: When the number of parameters exceeds the number of data points, the prior has outsized influence; a bad prior hurts more than no prior.
- Empirical success: SGD + regularization + ensembles achieves competitive uncertainty without explicit Bayesian methods.
Despite this, Bayesian ideas permeate deep learning: weight decay is a Gaussian prior, dropout is approximate variational inference, the ELBO trains VAEs, and model evidence guides hyperparameter selection. Understanding the Bayesian perspective illuminates why these techniques work.
Notation Summary
| Symbol | Meaning |
|---|---|
| $\theta$ | Model parameters |
| $\mathcal{D}$ | Observed data |
| $p(\theta)$ | Prior distribution |
| $p(\mathcal{D} \mid \theta)$ | Likelihood |
| $p(\theta \mid \mathcal{D})$ | Posterior |
| $p(\mathcal{D})$ | Evidence (marginal likelihood) |
| $q_\phi(\theta)$ | Variational approximation to the posterior |
| $\phi$ | Variational parameters |
| $\mathcal{Q}$ | Variational family |
| ELBO | Evidence lower bound |
| MLE | Maximum likelihood estimation |
| MAP | Maximum a posteriori estimation |
| MCMC | Markov chain Monte Carlo |
| HMC | Hamiltonian Monte Carlo |
| $H$ | Hessian (in Laplace approximation) |