Diffusion Models

Diffusion models are a class of generative models that learn to reverse a noise-corruption process. They have become the dominant approach for image, video, and audio generation, achieving state-of-the-art sample quality while offering stable training (no adversarial dynamics) and a principled variational objective. This chapter covers the mathematical foundations: the forward and reverse processes, the training objective and its derivation, the connection to score matching, and practical techniques like guidance.

Forward Process

The **forward (diffusion) process** gradually adds Gaussian noise to data $x_0 \sim q(x_0)$ over $T$ steps [@ho2020ddpm]:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I)$$

where $\beta_t \in (0, 1)$ is the noise schedule. Each step scales down the signal by $\sqrt{1 - \beta_t}$ and adds noise with variance $\beta_t$, preserving unit variance when the input has unit variance.

**Closed-form sampling at any timestep.** Using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, we can sample $x_t$ directly from $x_0$ without iterating through intermediate steps:

$$q(x_t | x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I)$$

Equivalently: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.

**Proof.** By induction. At step 1: $x_1 = \sqrt{\alpha_1}x_0 + \sqrt{\beta_1}\epsilon_1$. Assume $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}x_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\bar{\epsilon}$. Then $x_t = \sqrt{\alpha_t}x_{t-1} + \sqrt{\beta_t}\epsilon_t = \sqrt{\alpha_t \bar{\alpha}_{t-1}}x_0 + \sqrt{\alpha_t(1 - \bar{\alpha}_{t-1})}\bar{\epsilon} + \sqrt{\beta_t}\epsilon_t$. Since $\bar{\epsilon}$ and $\epsilon_t$ are independent Gaussians, their sum is Gaussian with variance $\alpha_t(1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$. $\square$

As $t \to T$ (with an appropriate schedule), $\bar{\alpha}_T \approx 0$, so $x_T \approx \mathcal{N}(0, I)$ -- the data signal is completely destroyed.
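The variance identity at the heart of the induction can be checked numerically; a minimal sketch, assuming the linear schedule endpoints $\beta_1 = 10^{-4}$, $\beta_T = 0.02$ from Ho et al. (2020):

```python
import numpy as np

# Linear schedule (endpoints from Ho et al., 2020)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Inductive step of the proof: the combined noise variance at step t,
# alpha_t * (1 - abar_{t-1}) + beta_t, must equal 1 - abar_t.
for t in range(1, T):
    lhs = alphas[t] * (1.0 - alpha_bars[t - 1]) + betas[t]
    rhs = 1.0 - alpha_bars[t]
    assert abs(lhs - rhs) < 1e-12

# By t = T the signal is essentially destroyed: abar_T ~ 0.
assert alpha_bars[-1] < 1e-3
```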

**Noise schedules.** The choice of $\beta_t$ (or equivalently $\bar{\alpha}_t$) significantly affects generation quality:

| Schedule | $\beta_t$ or SNR | Properties |
| --- | --- | --- |
| Linear (Ho et al., 2020) | $\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$ | Simple, but wastes steps in near-noise regime |
| Cosine [@nichol2021improved] | $\bar{\alpha}_t = \cos^2\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$ | Smoother SNR decay, better for small images |
| Squared cosine | $\bar{\alpha}_t$ quadratic | Used in Stable Diffusion |
| Log-SNR linear | $\log(\bar{\alpha}_t / (1 - \bar{\alpha}_t))$ linear in $t$ | Uniform in log-SNR space; theoretically motivated |

The signal-to-noise ratio (SNR) at step $t$ is $\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$, which decreases monotonically from $\infty$ to $0$. The schedule should distribute steps evenly across the useful SNR range.
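To see how schedules differ concretely, the sketch below compares SNR decay under the linear and cosine schedules (the cosine curve uses the usual $f(t)/f(0)$ normalization from Nichol and Dhariwal; endpoint values are illustrative assumptions):

```python
import numpy as np

T = 1000
s = 0.008  # small offset used by the cosine schedule

# Linear schedule, defined on beta_t.
betas_lin = np.linspace(1e-4, 0.02, T)
abar_lin = np.cumprod(1.0 - betas_lin)

# Cosine schedule, defined directly on abar_t (normalized so abar ~ 1 at t = 0).
t = np.arange(1, T + 1)
f = np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2
abar_cos = f / np.cos((s / (1 + s)) * np.pi / 2) ** 2

def snr(abar):
    return abar / (1.0 - abar)

# Both SNR curves decrease monotonically from high to ~0 ...
assert np.all(np.diff(snr(abar_lin)) < 0)
assert np.all(np.diff(snr(abar_cos)) < 0)

# ... but the linear schedule collapses to near-zero SNR much earlier,
# spending many late steps in an almost pure-noise regime.
assert snr(abar_lin)[T // 2] < snr(abar_cos)[T // 2]
```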

The forward process is completely deterministic given $x_0$ and $\epsilon$. No neural networks are involved. The "diffusion" is simply a prescribed schedule of noise addition. This means training data can be generated on-the-fly: sample $x_0$ from the dataset, sample $t \sim \text{Uniform}(1,T)$, sample $\epsilon \sim \mathcal{N}(0,I)$, and compute $x_t$ analytically.
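A minimal sketch of this on-the-fly sampling, with a random array standing in for a dataset image:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form: no iteration over steps."""
    ab = alpha_bars[t - 1]  # alpha_bars[0] corresponds to t = 1
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = rng.standard_normal((3, 32, 32))   # stand-in for a dataset image
t = int(rng.integers(1, T + 1))         # t ~ Uniform(1, T)
eps = rng.standard_normal(x0.shape)     # eps ~ N(0, I)
xt = q_sample(x0, t, eps)
assert xt.shape == x0.shape
```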

Reverse Process

The **reverse process** learns to denoise, generating data from noise by iterating:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I)$$

where $\mu_\theta$ is predicted by a neural network. The variance $\sigma_t^2$ can be fixed ($\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$) or learned.

**Three equivalent parameterizations.** The model can predict any of three quantities, and the mean $\mu_\theta$ is derived from the prediction:

| Parameterization | Model predicts | Mean formula | Loss weight |
| --- | --- | --- | --- |
| Noise prediction ($\epsilon$-pred) | $\epsilon_\theta(x_t, t) \approx \epsilon$ | $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta\right)$ | Uniform across noise levels |
| Data prediction ($x_0$-pred) | $\hat{x}_\theta(x_t, t) \approx x_0$ | $\mu_\theta = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\hat{x}_\theta + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$ | Emphasizes low-noise steps |
| Velocity prediction ($v$-pred) | $v_\theta \approx \sqrt{\bar{\alpha}_t}\epsilon - \sqrt{1-\bar{\alpha}_t}x_0$ | Derived from $v_\theta$ | Balanced; better for high resolution |

All three are mathematically equivalent -- they differ only in the implicit weighting of the loss across noise levels. $\epsilon$-prediction is the default in DDPM; $v$-prediction is preferred in progressive distillation and high-resolution models.
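Because the parameterizations are linear in $(x_0, \epsilon)$, converting between them is closed-form; a sketch with an arbitrarily chosen scalar $\bar{\alpha}_t$:

```python
import numpy as np

rng = np.random.default_rng(1)
abar = 0.7                      # \bar{alpha}_t at some step t (arbitrary)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# v-prediction target: v = sqrt(abar) * eps - sqrt(1 - abar) * x0
v = np.sqrt(abar) * eps - np.sqrt(1 - abar) * x0

# Any one prediction plus x_t recovers the other two:
x0_from_eps = (xt - np.sqrt(1 - abar) * eps) / np.sqrt(abar)
eps_from_v = np.sqrt(abar) * v + np.sqrt(1 - abar) * xt
x0_from_v = np.sqrt(abar) * xt - np.sqrt(1 - abar) * v

assert np.allclose(x0_from_eps, x0)
assert np.allclose(eps_from_v, eps)
assert np.allclose(x0_from_v, x0)
```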

The Forward Posterior

The forward process posterior $q(x_{t-1} | x_t, x_0)$ -- the distribution of the previous step given both the current noisy version and the original data -- is tractable:

$$q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I)$$

where:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$

This is derived by applying Bayes' rule, $q(x_{t-1}|x_t, x_0) \propto q(x_t|x_{t-1}) \cdot q(x_{t-1}|x_0)$, and completing the square in the Gaussian exponent.

**The key insight of DDPM.** The forward posterior $q(x_{t-1}|x_t, x_0)$ is the "ideal" reverse step -- if we knew $x_0$, we could denoise perfectly. Since we do not know $x_0$, we train a neural network to estimate it (or equivalently, to estimate the noise $\epsilon$ that was added). Substituting the estimate $\hat{x}_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_\theta) / \sqrt{\bar{\alpha}_t}$ into $\tilde{\mu}_t$ gives the learned reverse mean.
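Substituting a noise estimate into $\tilde{\mu}_t$ reproduces the $\epsilon$-prediction mean formula from the parameterization table; a numerical sketch using the true noise as an oracle estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

beta_t, abar_prev = 0.02, 0.9   # illustrative values for one step
alpha_t = 1 - beta_t
abar_t = abar_prev * alpha_t

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
xt = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps

# Forward-posterior mean with the x0 estimate substituted in (oracle eps here).
x0_hat = (xt - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
mu_tilde = (np.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0_hat \
         + (np.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t)) * xt

# epsilon-parameterization of the same mean.
mu_eps = (xt - beta_t / np.sqrt(1 - abar_t) * eps) / np.sqrt(alpha_t)

assert np.allclose(mu_tilde, mu_eps)
```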

DDPM Loss

The variational lower bound (VLB) on $-\log p_\theta(x_0)$ decomposes into a sum of KL divergences:

$$-\log p_\theta(x_0) \leq \underbrace{D_{\text{KL}}(q(x_T|x_0) \| p(x_T))}_{L_T \text{ (prior matching)}} + \sum_{t=2}^T \underbrace{D_{\text{KL}}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t))}_{L_{t-1} \text{ (denoising matching)}} - \underbrace{\log p_\theta(x_0|x_1)}_{L_0 \text{ (reconstruction)}}$$

Since both $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$ are Gaussian, each KL term has a closed form that reduces to an MSE between means.

The simplified loss drops the per-timestep weighting:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

where $t \sim \text{Uniform}(1, T)$, $x_0 \sim q(x_0)$, and $\epsilon \sim \mathcal{N}(0, I)$.

**Training is denoising regression.** The training loop is remarkably simple:
  1. Sample a clean image $x_0$ from the dataset
  2. Sample $t \sim \text{Uniform}(1, T)$ and $\epsilon \sim \mathcal{N}(0, I)$
  3. Compute $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$
  4. Predict $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  5. Compute the loss $\|\epsilon - \hat{\epsilon}\|^2$ and backpropagate

There is no adversarial training, no mode collapse, no training instability. The loss is a standard regression objective. This simplicity, combined with strong sample quality, is why diffusion models have largely replaced GANs.
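The five-step loop can be sketched end to end; here `eps_model` is a placeholder callable standing in for a trained network, so the example shows only the data flow, not a real model:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_model(xt, t):
    # Placeholder "network": a fixed function of x_t. A real model would be
    # a U-Net or Transformer that also consumes a timestep embedding of t.
    return 0.5 * xt

def training_loss(x0):
    t = int(rng.integers(1, T + 1))                   # step 2: t ~ Uniform(1, T)
    eps = rng.standard_normal(x0.shape)               # step 2: eps ~ N(0, I)
    ab = alpha_bars[t - 1]
    xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps     # step 3: closed-form x_t
    eps_hat = eps_model(xt, t)                        # step 4: predict noise
    return np.mean((eps - eps_hat) ** 2)              # step 5: MSE (backprop in a real framework)

x0 = rng.standard_normal((32, 32))                    # step 1: stand-in image
loss = training_loss(x0)
assert np.isfinite(loss) and loss >= 0
```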

Score Matching

The **score function** of a distribution $p(x)$ is the gradient of the log-density:

$$s(x) = \nabla_x \log p(x)$$

Score matching [@hyvarinen2005estimation] trains a model $s_\theta(x) \approx \nabla_x \log p(x)$ without requiring the normalizing constant of $p$.

**The score-diffusion connection.** The noise prediction $\epsilon_\theta$ and the score are related by:

$$s_\theta(x_t, t) = \nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

**Derivation.** Since $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$:

$$\nabla_{x_t} \log q(x_t|x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}$$

So training the noise predictor $\epsilon_\theta$ is equivalent to training a score estimator, up to a known scaling factor. This unifies DDPM with the score-based SDE framework of Song et al. [@song2021score].
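A quick numerical check of the scaling, using the conditional score of $q(x_t|x_0)$:

```python
import numpy as np

rng = np.random.default_rng(4)
abar = 0.6  # \bar{alpha}_t at some step (arbitrary)

x0 = rng.standard_normal(10)
eps = rng.standard_normal(10)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# Score of the Gaussian q(x_t | x_0) = N(sqrt(abar) x0, (1 - abar) I).
score = -(xt - np.sqrt(abar) * x0) / (1 - abar)

# It equals -eps / sqrt(1 - abar): a perfect noise predictor is a
# (rescaled) perfect score estimator.
assert np.allclose(score, -eps / np.sqrt(1 - abar))
```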

**Three views of the same model:**

| View | Model outputs | Training loss | Sampling |
| --- | --- | --- | --- |
| Noise prediction | $\epsilon_\theta(x_t, t) \approx \epsilon$ | $\|\epsilon - \epsilon_\theta\|^2$ | DDPM ancestral sampling |
| Score estimation | $s_\theta(x_t, t) \approx \nabla \log q_t$ | Denoising score matching | Langevin dynamics / probability flow ODE |
| Denoising | $\hat{x}_\theta(x_t, t) \approx x_0$ | $\|x_0 - \hat{x}_\theta\|^2$ (reweighted) | DDIM / flow matching |

All three are mathematically equivalent. The score/SDE perspective [@song2021score] generalizes discrete-time DDPM to continuous time, where the forward process is an SDE $dx = f(x,t)\,dt + g(t)\,dW$ and the reverse is $dx = [f - g^2 \nabla \log p_t]\,dt + g\, d\bar{W}$.

Sampling Algorithms

| Algorithm | Steps | Stochastic? | Key idea |
| --- | --- | --- | --- |
| DDPM (Ho et al., 2020) | $T$ (1000) | Yes | Ancestral sampling; add noise at each step |
| DDIM [@song2021denoising] | $S \ll T$ | No (deterministic) | Skip steps via non-Markovian reverse; same training |
| DPM-Solver [@lu2022dpm] | 10-20 | No | High-order ODE solver for probability flow ODE |
| Euler (1st order) | 20-50 | Yes/No | Simplest discretization of the reverse SDE/ODE |
| Heun (2nd order) | 20-50 | No | Predictor-corrector; better quality per step |
| Consistency models [@song2023consistency] | 1-2 | No | Direct mapping from noise to data; distilled or trained |
| Rectified flow [@liu2023flow] | 1-few | No | Learn straight-line trajectories from noise to data |

**DDIM and the speed-quality tradeoff.** DDIM (Denoising Diffusion Implicit Models) [@song2021denoising] shows that the same trained model (same $\epsilon_\theta$) can be used with a *non-Markovian* reverse process that skips steps. By selecting a subsequence of $S$ timesteps from $\{1, \ldots, T\}$, DDIM generates samples in $S$ steps instead of $T$. The key insight: the forward marginals $q(x_t|x_0)$ are the same regardless of the reverse process structure, so we can use any reverse process consistent with these marginals.

Setting the stochasticity parameter $\eta = 0$ gives a fully deterministic ODE (the probability flow ODE), enabling exact likelihood computation and meaningful latent-space interpolation.
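A sketch of one deterministic ($\eta = 0$) DDIM update, using the true noise as an oracle stand-in for $\epsilon_\theta$: the step first forms $\hat{x}_0$ from the noise prediction, then re-noises it to the earlier timestep's marginal.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # linear schedule

def ddim_step(xt, eps_pred, t, t_prev):
    """One deterministic (eta = 0) DDIM update from step t to t_prev."""
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t_prev - 1]
    x0_hat = (xt - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_pred

x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
t, t_prev = 800, 600   # skipping 200 forward steps in one update
xt = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1 - alpha_bars[t - 1]) * eps

# With an oracle predictor (eps_pred = true eps), the update lands exactly
# on the t_prev marginal for the same (x0, eps) pair.
x_prev = ddim_step(xt, eps, t, t_prev)
expected = np.sqrt(alpha_bars[t_prev - 1]) * x0 \
         + np.sqrt(1 - alpha_bars[t_prev - 1]) * eps
assert np.allclose(x_prev, expected)
```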

Classifier-Free Guidance

**Classifier-free guidance** (CFG) [@ho2022cfg] combines conditional and unconditional predictions:

$$\tilde{\epsilon}_\theta(x_t, t, c) = (1 + w) \, \epsilon_\theta(x_t, t, c) - w \, \epsilon_\theta(x_t, t, \varnothing)$$

where $c$ is the conditioning signal (e.g., a text prompt), $\varnothing$ is the null condition (trained by randomly dropping the condition during training with probability $p_{\text{uncond}} \approx 0.1$), and $w \geq 0$ is the guidance scale.

**CFG as implicit classifier guidance.** Rewriting the guided prediction in terms of scores:

$$\tilde{s}(x_t, t, c) = s(x_t, t) + (1+w) \left[s(x_t, t, c) - s(x_t, t)\right] = s(x_t, t) + (1+w) \nabla_{x_t} \log p(c|x_t)$$

This shows that CFG amplifies the implicit classifier gradient $\nabla \log p(c|x_t)$ by a factor of $(1+w)$, pushing samples toward higher conditional likelihood. The effect:

| $w$ | Behavior | Trade-off |
| --- | --- | --- |
| $w = 0$ | Standard conditional model | Maximum diversity |
| $w = 1$-$3$ | Mild guidance | Good diversity-quality balance |
| $w = 5$-$15$ | Strong guidance (typical for image gen) | High quality, lower diversity |
| $w \to \infty$ | Approaches mode of $p(c|x_t)$ | |

Guidance is now standard in essentially all conditional diffusion models (Stable Diffusion, DALL-E, Imagen). The guidance scale is the primary user-facing quality knob.
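Applying guidance is a one-line combination of two forward passes; a sketch with random arrays standing in for the conditional and unconditional model outputs:

```python
import numpy as np

rng = np.random.default_rng(6)

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return (1 + w) * eps_cond - w * eps_uncond

eps_c = rng.standard_normal(8)   # stand-in for eps_theta(x_t, t, c)
eps_u = rng.standard_normal(8)   # stand-in for eps_theta(x_t, t, null)

# w = 0 recovers the plain conditional model ...
assert np.allclose(cfg(eps_c, eps_u, 0.0), eps_c)
# ... and larger w moves the output along the (eps_c - eps_u) direction.
assert np.allclose(cfg(eps_c, eps_u, 7.5), eps_c + 7.5 * (eps_c - eps_u))
```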

Latent Diffusion and Practical Architecture

**Latent Diffusion Models (LDM)** [@rombach2022ldm] apply the diffusion process in a learned latent space rather than pixel space:
  1. Encode: $z_0 = \mathcal{E}(x_0)$ using a pretrained VAE encoder ($256 \times 256 \times 3 \to 32 \times 32 \times 4$, a typical $8\times$ spatial compression).
  2. Diffuse: run the forward/reverse process on $z_t$ instead of $x_t$.
  3. Decode: $\hat{x}_0 = \mathcal{D}(\hat{z}_0)$ using the VAE decoder.

This reduces the dimensionality by $\sim 48\times$ (from $256^2 \times 3 = 196{,}608$ values to $32^2 \times 4 = 4{,}096$), making the diffusion process dramatically cheaper while preserving perceptual quality. The denoising network is typically a U-Net with cross-attention layers for text conditioning, or more recently a Diffusion Transformer (DiT) [@peebles2023dit] that uses a standard Transformer architecture on patchified latents.

**Flow matching** [@lipman2023flow] is a related framework that learns a velocity field $v_\theta(x_t, t)$ defining an ODE from noise to data: $dx/dt = v_\theta(x_t, t)$. Instead of the diffusion forward process, flow matching interpolates linearly: $x_t = (1-t)x_0 + t\epsilon$. The training loss is:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|v_\theta(x_t, t) - (\epsilon - x_0)\|^2\right]$$

Flow matching produces straighter trajectories than diffusion, enabling fewer sampling steps. It has become the basis for several modern generative models (e.g., Stable Diffusion 3).
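A sketch confirming that the regression target $\epsilon - x_0$ is the time derivative of the linear interpolation (with a zero placeholder standing in for $v_\theta$):

```python
import numpy as np

rng = np.random.default_rng(7)

def interpolate(x0, eps, t):
    """Flow-matching path: x_t = (1 - t) x0 + t eps, with t in [0, 1]."""
    return (1 - t) * x0 + t * eps

x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)

# The target velocity (eps - x0) is dx_t/dt, constant along the path.
t, h = 0.3, 1e-6
finite_diff = (interpolate(x0, eps, t + h) - interpolate(x0, eps, t)) / h
assert np.allclose(finite_diff, eps - x0, atol=1e-4)

# Training loss for one sample, with a zero placeholder for v_theta(x_t, t).
v_pred = np.zeros_like(x0)
loss = np.mean((v_pred - (eps - x0)) ** 2)
assert loss >= 0
```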

Notation Summary

| Symbol | Meaning |
| --- | --- |
| $x_0$ | Clean data |
| $x_t$ | Noisy data at timestep $t$ |
| $\beta_t$ | Noise schedule at step $t$ |
| $\alpha_t = 1 - \beta_t$ | Signal retention per step |
| $\bar{\alpha}_t = \prod_{s \leq t}\alpha_s$ | Cumulative signal retention |
| $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ | Signal-to-noise ratio |
| $\epsilon$ | Gaussian noise added during the forward process |
| $\epsilon_\theta$ | Neural network noise prediction |
| $s_\theta$ | Score function estimate $\nabla \log p_t$ |
| $\hat{x}_\theta$ | Neural network clean-data prediction |
| $v_\theta$ | Velocity prediction (flow matching) |
| $q(x_t \| x_0)$ | Forward marginal |
| $p_\theta(x_{t-1} \| x_t)$ | Learned reverse transition |
| $c$ | Conditioning signal (text, class label) |
| $w$ | Guidance scale (CFG) |
| $T$ | Number of diffusion steps |
| LDM | Latent diffusion model |
| DiT | Diffusion Transformer |

References