
Bayesian Inference

Bayesian inference is the principled framework for updating beliefs in light of evidence. While most deep learning uses point estimation (MLE/MAP), the Bayesian perspective illuminates why regularization works, provides calibrated uncertainty estimates, and underpins generative models from VAEs to diffusion models. Understanding the Bayesian framework is essential even if you never compute a full posterior.

The Bayesian Framework

Given data $\mathcal{D}$ and model parameters $\theta$, Bayesian inference computes the **posterior distribution** via Bayes' theorem:

$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{\int p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}$$

The posterior encodes everything we know about $\theta$ after observing $\mathcal{D}$. It is a complete description of parameter uncertainty, not just a point estimate.

| Term | Name | Role | ML Analogy |
|---|---|---|---|
| $p(\theta)$ | Prior | Belief about $\theta$ before seeing data | Regularization (L2 = Gaussian prior) |
| $p(\mathcal{D} \mid \theta)$ | Likelihood | How likely the data is given $\theta$ | Training loss (negative log-likelihood) |
| $p(\theta \mid \mathcal{D})$ | Posterior | Updated belief after seeing data | The full "answer" to learning |
| $p(\mathcal{D})$ | Evidence (marginal likelihood) | Normalizing constant; model quality | Model selection criterion |

**Bayesian prediction.** Given a new input $x_*$, the Bayesian predictive distribution integrates over parameter uncertainty:

$$p(y_* | x_*, \mathcal{D}) = \int p(y_* | x_*, \theta) \, p(\theta | \mathcal{D}) \, d\theta$$

This is a weighted average of predictions from all plausible parameter values, weighted by their posterior probability. The predictive distribution is wider (more uncertain) when the posterior is broad (little data) and narrower when the posterior is concentrated (lots of data). Point estimates (MLE/MAP) use a single $\theta^*$ and underestimate uncertainty.
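For a conjugate model the predictive integral can be computed both ways, which makes a useful sanity check. A minimal sketch with a Beta-Bernoulli model (the prior and counts are made-up numbers): Monte Carlo averaging over posterior samples should recover the closed-form answer, which here is just the posterior mean.

```python
import random

random.seed(0)

# Beta(2, 2) prior; 7 successes and 3 failures observed (made-up counts).
# Posterior: Beta(2 + 7, 2 + 3) = Beta(9, 5).
alpha, beta = 2 + 7, 2 + 3

# Monte Carlo version of the predictive integral: sample theta from the
# posterior and average the Bernoulli predictions p(y* = 1 | theta) = theta.
samples = [random.betavariate(alpha, beta) for _ in range(100_000)]
mc_predictive = sum(samples) / len(samples)

# Conjugacy gives the integral in closed form: the posterior mean 9/14.
exact_predictive = alpha / (alpha + beta)

print(f"Monte Carlo: {mc_predictive:.3f}")
print(f"Closed form: {exact_predictive:.3f}")
```

The same sample-then-average pattern is how the predictive is approximated when no closed form exists, e.g. with MCMC samples from a neural-network posterior.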

**Sequential Bayesian updating.** Bayesian inference is inherently sequential: the posterior from one batch of data becomes the prior for the next. For data $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$:

$$p(\theta | \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 | \theta) \cdot \underbrace{p(\theta | \mathcal{D}_1)}_{\text{new prior}}$$

This means the final posterior is independent of the order in which data arrives -- only the total data matters (for i.i.d. data). Online learning algorithms that update parameters incrementally can be seen as approximations to sequential Bayesian updating.
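The order-independence claim is easy to verify for a conjugate model. A minimal sketch with Beta-Bernoulli updates (the two mini-batches are made up): updating sequentially in either order gives the same posterior as one update on the pooled data.

```python
# Sequential Beta-Bernoulli updating: the posterior after one batch is the
# prior for the next, and the result matches one update on the pooled data.
def update(alpha, beta, data):
    """Conjugate update: add successes to alpha, failures to beta."""
    s = sum(data)
    return alpha + s, beta + len(data) - s

prior = (1, 1)                        # uniform Beta(1, 1) prior
d1, d2 = [1, 0, 1, 1], [0, 1, 1]      # two made-up mini-batches

sequential = update(*update(*prior, d1), d2)   # d1 then d2
swapped    = update(*update(*prior, d2), d1)   # d2 then d1
pooled     = update(*prior, d1 + d2)           # all data at once

print(sequential, swapped, pooled)    # all three agree: (6, 3)
```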

MLE vs MAP vs Full Bayesian

**Maximum Likelihood Estimation (MLE)** finds the parameters that maximize the likelihood, ignoring the prior entirely:

$$\theta_{\text{MLE}} = \arg\max_\theta p(\mathcal{D} | \theta) = \arg\max_\theta \sum_{i=1}^N \log p(x_i | \theta)$$

MLE is consistent (converges to the true parameter as $n \to \infty$) and asymptotically efficient (achieves the Cramér-Rao lower bound). However, it can overfit with finite data and provides no uncertainty estimates.

**Maximum A Posteriori (MAP)** estimation finds the mode of the posterior:

$$\theta_{\text{MAP}} = \arg\max_\theta p(\theta | \mathcal{D}) = \arg\max_\theta \left[\log p(\mathcal{D} | \theta) + \log p(\theta)\right]$$

MAP adds a regularization term $\log p(\theta)$ to the MLE objective. It is a point estimate that uses the prior but still discards the shape of the posterior.

| Method | Objective | Regularization | Uncertainty | Computation |
|---|---|---|---|---|
| MLE | $\max \log p(\mathcal{D} \mid \theta)$ | None | No | One optimization |
| MAP | $\max \log p(\mathcal{D} \mid \theta) + \log p(\theta)$ | Via prior | No | One optimization |
| Full Bayesian | Compute $p(\theta \mid \mathcal{D})$ | Automatic | Yes | Generally intractable |
| Ensemble | Train $M$ models, average predictions | Implicit (diversity) | Approximate | $M\times$ training cost |
| MC Dropout | Average predictions with dropout on | Implicit (dropout) | Approximate | $M\times$ forward passes |

**MAP and regularization are the same thing.** The choice of prior determines the regularizer:

| Prior $p(\theta)$ | $-\log p(\theta)$ Penalty | Regularization |
|---|---|---|
| $\mathcal{N}(0, 1/\lambda)$ | $\frac{\lambda}{2}\lVert\theta\rVert_2^2 + \text{const}$ | L2 (weight decay) |
| $\text{Laplace}(0, 1/\lambda)$ | $\lambda\lVert\theta\rVert_1 + \text{const}$ | L1 (sparsity) |
| Horseshoe | Adaptive shrinkage | Sparse Bayesian learning |
| Uniform (improper) | $0$ | No regularization (= MLE) |

The regularization coefficient $\lambda$ is the precision (inverse variance) of the prior. A larger $\lambda$ means a more informative (tighter) prior, corresponding to stronger regularization.
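As a concrete check that a Gaussian prior reproduces L2 regularization: for a Gaussian mean with known $\sigma^2 = 1$ and a $\mathcal{N}(0, 1/\lambda)$ prior, gradient descent on the penalized loss should match the closed-form MAP estimate $\sum_i x_i / (n + \lambda)$. A small sketch (the data values are made up):

```python
# MAP for a Gaussian mean with known sigma^2 = 1 under a N(0, 1/lam) prior.
# Gradient descent on  0.5 * sum((x - theta)^2) + (lam / 2) * theta^2
# should match the closed-form MAP estimate sum(x) / (n + lam).
def map_estimate(xs, lam, steps=2000, lr=0.01):
    theta = 0.0
    for _ in range(steps):
        grad = sum(theta - x for x in xs) + lam * theta  # NLL' + penalty'
        theta -= lr * grad
    return theta

xs = [1.2, 0.8, 1.5, 1.1]             # made-up observations
for lam in (0.0, 1.0, 10.0):
    closed = sum(xs) / (len(xs) + lam)
    print(f"lam={lam:5.1f}: gradient descent {map_estimate(xs, lam):.4f}, "
          f"closed form {closed:.4f}")
```

With $\lambda = 0$ this recovers the MLE (the sample mean); increasing $\lambda$ shrinks the estimate toward the prior mean of zero, exactly as weight decay does.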

Conjugate Priors

A prior $p(\theta)$ is **conjugate** to a likelihood $p(\mathcal{D}|\theta)$ if the posterior $p(\theta|\mathcal{D})$ has the same functional form as the prior. Formally, the prior family is closed under Bayesian updating with the given likelihood.

| Likelihood | Conjugate Prior | Posterior | Posterior Parameters |
|---|---|---|---|
| Bernoulli($p$) | Beta$(\alpha, \beta)$ | Beta$(\alpha', \beta')$ | $\alpha' = \alpha + \sum x_i$, $\beta' = \beta + n - \sum x_i$ |
| Categorical($\pi$) | Dirichlet$(\alpha)$ | Dirichlet$(\alpha')$ | $\alpha_k' = \alpha_k + n_k$ (count of class $k$) |
| Gaussian($\mu$, known $\sigma^2$) | Gaussian$(\mu_0, \sigma_0^2)$ | Gaussian$(\mu_n, \sigma_n^2)$ | Precision-weighted mean; $\sigma_n^{-2} = \sigma_0^{-2} + n/\sigma^2$ |
| Gaussian($\sigma^2$, known $\mu$) | Inv-Gamma$(\alpha, \beta)$ | Inv-Gamma$(\alpha', \beta')$ | $\alpha' = \alpha + n/2$, $\beta' = \beta + \frac{1}{2}\sum(x_i - \mu)^2$ |
| Poisson($\lambda$) | Gamma$(\alpha, \beta)$ | Gamma$(\alpha', \beta')$ | $\alpha' = \alpha + \sum x_i$, $\beta' = \beta + n$ |
| Multinomial | Dirichlet | Dirichlet | Counts added to pseudocounts |

**Interpreting conjugate prior parameters.** The prior hyperparameters can be interpreted as "pseudo-observations":
  • Beta$(\alpha, \beta)$ prior for Bernoulli: Acts like having already seen $\alpha - 1$ successes and $\beta - 1$ failures. The prior strength is $\alpha + \beta$ (total pseudo-count).
  • Dirichlet$(\alpha)$ prior for Categorical: Acts like having seen $\alpha_k - 1$ examples of class $k$. The uniform prior sets $\alpha_k = 1$; Laplace smoothing (adding 1 to each count) is the posterior mean under this uniform prior.
  • Gaussian prior for Gaussian mean: The posterior mean is a precision-weighted average of the prior mean and sample mean: $\mu_n = \frac{\sigma_0^{-2} \mu_0 + (n/\sigma^2)\bar{x}}{\sigma_0^{-2} + n/\sigma^2}$. As $n \to \infty$, the posterior concentrates on $\bar{x}$ (the data overwhelms the prior).
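The precision-weighted average in the last bullet is easy to verify directly. A small sketch (the prior and data values are made up): precisions add, and the posterior mean sits between the prior mean and the sample mean.

```python
# Precision-weighted posterior for a Gaussian mean with known variance.
# Prior N(mu0, var0), likelihood N(theta, var) per observation.
def gaussian_posterior(mu0, var0, xs, var):
    n = len(xs)
    xbar = sum(xs) / n
    post_prec = 1 / var0 + n / var                    # precisions add
    mu_n = (mu0 / var0 + n * xbar / var) / post_prec  # precision-weighted mean
    return mu_n, 1 / post_prec

mu_n, var_n = gaussian_posterior(mu0=0.0, var0=1.0, xs=[2.0, 2.2, 1.8], var=0.5)
print(f"posterior: N({mu_n:.3f}, {var_n:.3f})")  # pulled from 0 toward xbar = 2
```

With three observations at precision $1/0.5 = 2$ each, the posterior precision is $1 + 6 = 7$, so the mean lands at $12/7 \approx 1.71$: most of the way to the data, but still shrunk toward the prior.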

The Evidence Lower Bound (ELBO)

For most models of interest (neural networks, deep generative models), the posterior $p(\theta|\mathcal{D})$ is intractable because the evidence integral $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\,p(\theta)\,d\theta$ cannot be computed in closed form. Two main strategies exist: variational inference (optimization-based) and MCMC (sampling-based).

**Variational Inference (VI)** approximates the posterior with a tractable distribution $q_\phi(\theta)$ from a variational family $\mathcal{Q}$ by minimizing the KL divergence:

$$\phi^* = \arg\min_\phi D_{\text{KL}}(q_\phi(\theta) \| p(\theta | \mathcal{D}))$$

Since $p(\theta|\mathcal{D})$ appears in the KL (and is unknown), we cannot minimize this directly. Instead, we maximize the Evidence Lower Bound (ELBO):

$$\text{ELBO}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathcal{D} | \theta)] - D_{\text{KL}}(q_\phi(\theta) \| p(\theta)) \leq \log p(\mathcal{D})$$

**Derivation of the ELBO.** Start from the log-evidence:

$$\log p(\mathcal{D}) = \log \int p(\mathcal{D}, \theta)\, d\theta = \log \int \frac{p(\mathcal{D}, \theta)}{q_\phi(\theta)}\, q_\phi(\theta)\, d\theta \geq \int q_\phi(\theta) \log \frac{p(\mathcal{D}, \theta)}{q_\phi(\theta)}\, d\theta$$

where the inequality is Jensen's (since $\log$ is concave). Rearranging:

$$\log p(\mathcal{D}) = \underbrace{\mathbb{E}_q[\log p(\mathcal{D}|\theta)] - D_{\text{KL}}(q \| p(\theta))}_{\text{ELBO}} + \underbrace{D_{\text{KL}}(q_\phi(\theta) \| p(\theta|\mathcal{D}))}_{\geq 0}$$

The gap between $\log p(\mathcal{D})$ and the ELBO is exactly $D_{\text{KL}}(q \| p(\theta|\mathcal{D}))$. Maximizing the ELBO simultaneously: (1) makes $q$ close to the true posterior, and (2) provides a lower bound on the model evidence.

**ELBO decomposition and the reconstruction-regularization tradeoff.** The ELBO has two terms:
  1. Reconstruction term $\mathbb{E}_{q_\phi}[\log p(\mathcal{D}|\theta)]$: encourages $q$ to place mass on parameters that explain the data well (fit the data).
  2. KL regularization $-D_{\text{KL}}(q_\phi \| p(\theta))$: penalizes $q$ for deviating from the prior (stay simple).

This is the Bayesian analogue of the bias-variance tradeoff: the reconstruction term reduces bias while the KL term controls variance/complexity. In VAEs, the same decomposition appears with latent variables instead of parameters: reconstruction loss vs. KL to the latent prior.
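To see the bound and its gap concretely, here is a model small enough that every quantity is available in closed form: prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(\theta, 1)$ per observation, variational family $q = \mathcal{N}(m, s^2)$. The data are made-up toy numbers; the check is that the ELBO stays below $\log p(\mathcal{D})$ and touches it exactly when $q$ equals the true posterior $\mathcal{N}(S/(n+1),\, 1/(n+1))$.

```python
import math

# Conjugate Gaussian model: prior N(0, 1), likelihood N(theta, 1) per point,
# variational family q = N(m, s2). S and SS are the sufficient statistics.
xs = [0.5, 1.0, 1.5]                  # made-up data
n, S, SS = len(xs), sum(xs), sum(x * x for x in xs)

def elbo(m, s2):
    # E_q[log p(D | theta)] uses E_q[(x - theta)^2] = (x - m)^2 + s2
    exp_loglik = -0.5 * n * math.log(2 * math.pi) \
                 - 0.5 * (SS - 2 * m * S + n * (m * m + s2))
    # KL(N(m, s2) || N(0, 1)) in closed form
    kl = 0.5 * (s2 + m * m - 1 - math.log(s2))
    return exp_loglik - kl

# Exact log-evidence for this conjugate model
log_evidence = -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1) \
               - 0.5 * (SS - S * S / (n + 1))

print(f"ELBO at a poor q: {elbo(0.0, 1.0):.4f}")                  # strictly below
print(f"ELBO at optimum:  {elbo(S / (n + 1), 1 / (n + 1)):.4f}")  # = log p(D)
print(f"log p(D):         {log_evidence:.4f}")
```

The difference between the two ELBO values is exactly $D_{\text{KL}}(q \| p(\theta|\mathcal{D}))$ for the poor $q$, which is the gap in the decomposition above.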

Approximate Inference Methods

**Markov Chain Monte Carlo (MCMC)** draws samples from the posterior by constructing a Markov chain whose stationary distribution is $p(\theta|\mathcal{D})$. The most common algorithms are:
  1. Metropolis-Hastings: Propose $\theta' \sim q(\theta'|\theta_t)$, accept with probability $\min\!\left(1, \frac{p(\theta'|\mathcal{D})\, q(\theta_t|\theta')}{p(\theta_t|\mathcal{D})\, q(\theta'|\theta_t)}\right)$.
  2. Hamiltonian Monte Carlo (HMC): Use gradient information to make proposals that follow the posterior geometry, achieving much higher acceptance rates in high dimensions.
  3. Stochastic Gradient MCMC (SG-MCMC): Replace full-data gradients with mini-batch gradients, enabling MCMC at scale. SGD with an appropriately decaying learning rate approximates stochastic gradient Langevin dynamics.
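The Metropolis-Hastings step fits in a few lines for a 1D posterior known only up to a constant. A sketch on a toy Gaussian-prior, Gaussian-likelihood model with made-up data, where the exact posterior mean $\sum_i x_i/(n+1)$ is available for comparison (a symmetric random-walk proposal makes the $q$-ratio cancel):

```python
import math
import random

random.seed(0)

# Random-walk Metropolis-Hastings targeting p(theta | D) up to a constant:
# N(0, 1) prior, N(theta, 1) likelihood, made-up observations.
xs = [0.5, 1.0, 1.5]

def log_post(theta):
    """log p(theta | D) plus an unknown additive constant."""
    return -0.5 * theta ** 2 - 0.5 * sum((x - theta) ** 2 for x in xs)

theta, samples = 0.0, []
for _ in range(50_000):
    proposal = theta + random.gauss(0.0, 0.5)  # symmetric: q-ratio cancels
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal                       # accept; else keep current state
    samples.append(theta)

post_mean = sum(samples[5_000:]) / len(samples[5_000:])  # discard burn-in
print(f"MCMC posterior mean ~ {post_mean:.3f}")  # exact: sum(xs)/(n+1) = 0.75
```

Note that only the unnormalized log-posterior is needed: the intractable evidence $p(\mathcal{D})$ cancels in the acceptance ratio, which is the whole point of MCMC.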

MCMC is asymptotically exact (converges to the true posterior) but slow for high-dimensional problems. VI is faster but introduces approximation error from the variational family.

| Method | Quality | Speed | Scalability | Key Limitation |
|---|---|---|---|---|
| Exact (conjugate) | Exact | Fast | Small models only | Requires conjugacy |
| MCMC (HMC) | Asymptotically exact | Slow | $\sim 10^4$ params | Mixing time, burn-in |
| SG-MCMC | Approximate | Moderate | $\sim 10^6$ params | Mini-batch noise |
| Variational inference | Approximate | Fast | $\sim 10^9$ params | Restricted family $\mathcal{Q}$ |
| Laplace approximation | Gaussian at MAP | Very fast | $\sim 10^6$ params | Unimodal, symmetric |
| Deep ensembles | Empirically good | $M\times$ cost | Any model size | No theoretical guarantee |

**Laplace approximation.** Approximate the posterior as a Gaussian centered at the MAP estimate:

$$p(\theta|\mathcal{D}) \approx \mathcal{N}(\theta_{\text{MAP}}, H^{-1})$$

where $H = -\nabla^2 \log p(\theta|\mathcal{D})\big|_{\theta_{\text{MAP}}}$ is the Hessian of the negative log-posterior at the MAP point. This is the second-order Taylor expansion of $\log p(\theta|\mathcal{D})$ around its mode. The Laplace approximation is fast (it requires only the MAP point plus the Hessian) but assumes the posterior is approximately Gaussian -- it misses multimodality, skewness, and heavy tails.

For neural networks, the full Hessian requires $O(P^2)$ memory for $P$ parameters. Practical approaches use diagonal, Kronecker-factored (KFAC), or low-rank Hessian approximations (Ritter et al., 2018).
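For a one-dimensional posterior the whole recipe fits in a few lines. A sketch using a Beta(9, 5) posterior (made-up counts), where the exact moments are known for comparison: find the mode, evaluate the curvature of the negative log-density there, and read off the Gaussian.

```python
# Laplace approximation to a Beta(9, 5) posterior: a Gaussian centered at
# the mode with variance H^{-1}, H the curvature of -log p at the mode.
a, b = 9, 5

theta_map = (a - 1) / (a + b - 2)   # mode of Beta(a, b)
# H = -d^2/dtheta^2 [ (a-1) log(theta) + (b-1) log(1 - theta) ] at the mode
H = (a - 1) / theta_map ** 2 + (b - 1) / (1 - theta_map) ** 2
laplace_var = 1 / H

exact_mean = a / (a + b)
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))

print(f"Laplace: N({theta_map:.3f}, {laplace_var:.4f})")
print(f"Exact:   mean {exact_mean:.3f}, var {exact_var:.4f}")
```

The approximation is close but not exact: the Beta posterior is slightly skewed, so its mean differs from its mode and the Gaussian variance overshoots, which is exactly the kind of error the symmetric-Gaussian assumption introduces.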

Model Selection and Evidence

The **model evidence** (marginal likelihood) $p(\mathcal{D}|\mathcal{M})$ for model $\mathcal{M}$ scores how well the model class (not just the best fit) explains the data:

$$p(\mathcal{D}|\mathcal{M}) = \int p(\mathcal{D}|\theta, \mathcal{M}) \, p(\theta|\mathcal{M}) \, d\theta$$

Comparing two models, the posterior odds factor into the Bayes factor (ratio of evidences) times the prior odds:

$$\frac{p(\mathcal{M}_1|\mathcal{D})}{p(\mathcal{M}_2|\mathcal{D})} = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_2)} \cdot \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}$$

**Bayesian Occam's razor.** The evidence automatically penalizes complexity. A complex model spreads its prior probability over many possible datasets, so $p(\mathcal{D}|\mathcal{M})$ is small for any particular $\mathcal{D}$. A simple model concentrates its prior on fewer datasets -- if $\mathcal{D}$ is one of them, the evidence is high. The evidence balances fit and complexity without an explicit regularization term.
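This Occam effect can be checked numerically on coin flips. A sketch (counts are made up) comparing a fixed fair-coin model against a flexible model with a uniform Beta(1, 1) prior over the bias, whose evidence is the Beta-Bernoulli marginal $B(1+s, 1+f)/B(1, 1)$:

```python
import math

# Bayesian Occam's razor on coin flips: flexible model (uniform Beta(1, 1)
# prior over the bias) vs a fixed fair-coin model with p = 0.5.
def log_beta_fn(a, b):
    """log of the Beta function via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_flexible(s, f):
    # Beta-Bernoulli marginal likelihood for s successes, f failures
    return log_beta_fn(1 + s, 1 + f) - log_beta_fn(1, 1)

def log_evidence_fair(s, f):
    return (s + f) * math.log(0.5)

for s, f in [(5, 5), (9, 1)]:        # balanced vs heavily biased counts
    bf = math.exp(log_evidence_flexible(s, f) - log_evidence_fair(s, f))
    print(f"{s} heads, {f} tails: Bayes factor (flexible / fair) = {bf:.2f}")
```

On balanced data the simple fair-coin model wins (Bayes factor below 1) even though the flexible model contains it as a special case: the flexible model spread its prior mass over biases the data do not support. On heavily biased data the flexible model wins.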

Because the ELBO is a lower bound on $\log p(\mathcal{D})$, it inherits some of this behavior: comparing ELBOs across model architectures favors models that are complex enough to explain the data but no more.

ML Connections

| Model | Bayesian Component | What It Computes |
|---|---|---|
| VAE | Encoder $q_\phi(z \mid x)$ approximates the posterior $p(z \mid x)$ | Trained by maximizing the ELBO |
| Bayesian NN | Prior over weights $p(w)$; posterior $p(w \mid \mathcal{D})$ gives uncertainty | Uncertainty-aware predictions |
| GPT pretraining | MLE over token sequences: $\max \sum_t \log p(x_t \mid x_{<t}; \theta)$ | Point estimate of $\theta$ |
| Diffusion models | Variational bound on log-likelihood; forward process defines the prior | Score matching approximates $\nabla \log p(x)$ |
| Gaussian processes | Exact Bayesian inference with kernel prior | Non-parametric regression with uncertainty |
| Bayesian optimization | Posterior over objective function; acquisition function trades off exploration/exploitation | Hyperparameter tuning |
| Neural architecture search | Prior over architectures; performance posterior guides search | Automated model design |

**The gap between Bayesian theory and deep learning practice.** In theory, full Bayesian inference is optimal: the predictive distribution minimizes expected loss under the true posterior. In practice, deep learning relies almost entirely on point estimates (SGD/Adam) for several reasons:
  1. Computational cost: Computing the posterior over $10^9$ parameters is intractable.
  2. Prior specification: What is a good prior for transformer weights? We have little domain knowledge to guide the choice.
  3. Overparameterization: When $P \gg N$ (more parameters than data points), the prior has outsized influence; a bad prior hurts more than no prior.
  4. Empirical success: SGD + regularization + ensembles achieves competitive uncertainty without explicit Bayesian methods.

Despite this, Bayesian ideas permeate deep learning: weight decay is a Gaussian prior, dropout is approximate variational inference, the ELBO trains VAEs, and model evidence guides hyperparameter selection. Understanding the Bayesian perspective illuminates why these techniques work.

Notation Summary

| Symbol | Meaning |
|---|---|
| $\theta$ | Model parameters |
| $\mathcal{D}$ | Observed data |
| $p(\theta)$ | Prior distribution |
| $p(\mathcal{D} \mid \theta)$ | Likelihood |
| $p(\theta \mid \mathcal{D})$ | Posterior distribution |
| $p(\mathcal{D})$ | Evidence (marginal likelihood) |
| $q_\phi(\theta)$ | Variational approximation to the posterior |
| $\phi$ | Variational parameters |
| $\mathcal{Q}$ | Variational family |
| ELBO | Evidence lower bound |
| MLE | Maximum likelihood estimation |
| MAP | Maximum a posteriori estimation |
| MCMC | Markov Chain Monte Carlo |
| HMC | Hamiltonian Monte Carlo |
| $H$ | Hessian (in Laplace approximation) |