Diffusion Models
Diffusion models are a class of generative models that learn to reverse a noise-corruption process. They have become the dominant approach for image, video, and audio generation, achieving state-of-the-art sample quality while offering stable training (no adversarial dynamics) and a principled variational objective. This chapter covers the mathematical foundations: the forward and reverse processes, the training objective and its derivation, the connection to score matching, and practical techniques like guidance.
Forward Process
The forward process is a fixed Markov chain that gradually adds Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$, where $\beta_t \in (0,1)$ is the noise schedule. Each step scales down the signal by $\sqrt{1-\beta_t}$ and adds noise with variance $\beta_t$, preserving unit variance when the input has unit variance.
Equivalently: $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$, where $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
Proof. By induction. At step 1: $x_1 = \sqrt{\alpha_1}\,x_0 + \sqrt{1-\alpha_1}\,\epsilon$, so the claim holds with $\bar\alpha_1 = \alpha_1$. Assume $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon$. Then $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon' = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{\alpha_t(1-\bar\alpha_{t-1})}\,\epsilon + \sqrt{1-\alpha_t}\,\epsilon'$. Since $\epsilon$ and $\epsilon'$ are independent Gaussians, their sum is Gaussian with variance $\alpha_t(1-\bar\alpha_{t-1}) + (1-\alpha_t) = 1-\bar\alpha_t$.
As $t \to T$ (with an appropriate schedule), $\bar\alpha_t \to 0$, so $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$ -- the data signal is completely destroyed.
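The closed form means a training pipeline can jump straight from $x_0$ to any $x_t$ without simulating the chain. A minimal NumPy sketch (function names here are illustrative, not from any library):

```python
import numpy as np

# Sketch: sample x_t directly from x_0 via the closed form
# q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I).
def make_linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    betas = np.linspace(beta_1, beta_T, T)
    alpha_bars = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step (t is a 0-based index)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

With the standard linear schedule, $\bar\alpha_T \approx 4 \times 10^{-5}$, so $x_T$ is statistically indistinguishable from pure noise, matching the limit above.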
| Schedule | $\beta_t$ or SNR | Properties |
|---|---|---|
| Linear (Ho et al., 2020) | $\beta_t$ linear from $10^{-4}$ to $0.02$ | Simple, but wastes steps in near-noise regime |
| Cosine (Nichol & Dhariwal, 2021) | $\bar\alpha_t \propto \cos^2\!\big(\tfrac{t/T + s}{1 + s} \cdot \tfrac{\pi}{2}\big)$ | Smoother SNR decay, better for small images |
| Scaled linear | $\sqrt{\beta_t}$ linear ($\beta_t$ quadratic) | Used in Stable Diffusion |
| Log-SNR linear | $\log \mathrm{SNR}(t)$ linear in $t$ | Uniform in log-SNR space; theoretically motivated |
The signal-to-noise ratio (SNR) at step $t$ is $\mathrm{SNR}(t) = \bar\alpha_t / (1-\bar\alpha_t)$, which monotonically decreases from a very large value at $t = 1$ (where $\bar\alpha_1 \approx 1$) to nearly $0$ at $t = T$. The schedule should distribute steps evenly across the useful SNR range.
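The schedules can be compared directly through their SNR curves. A sketch, assuming the cosine form of Nichol & Dhariwal (2021) with their offset $s = 0.008$ (function names are illustrative):

```python
import numpy as np

# Compare schedules through SNR(t) = abar_t / (1 - abar_t).
def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    return np.cumprod(1.0 - np.linspace(beta_1, beta_T, T))

def cosine_alpha_bar(T=1000, s=0.008):
    t = np.arange(1, T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / np.cos(s / (1 + s) * np.pi / 2) ** 2  # normalize so abar ~ 1 at t = 0

def snr(alpha_bar):
    return alpha_bar / (1.0 - alpha_bar)
```

At the midpoint of the trajectory the cosine schedule retains far more signal than the linear one, which is exactly the "wasted steps in the near-noise regime" the table refers to.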
Reverse Process
The reverse process is learned: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big)$, where $\mu_\theta$ is predicted by a neural network. The variance $\sigma_t^2$ can be fixed ($\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t$) or learned.
| Parameterization | Model predicts | Mean formula | Loss weight |
|---|---|---|---|
| Noise prediction ($\epsilon$-pred) | $\epsilon_\theta(x_t, t)$ | $\mu_\theta = \tfrac{1}{\sqrt{\alpha_t}}\big(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\big)$ | Roughly uniform across noise levels |
| Data prediction ($x_0$-pred) | $\hat{x}_{0,\theta}(x_t, t)$ | $\mu_\theta = \tfrac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\hat{x}_0 + \tfrac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t$ | Emphasizes low-noise steps |
| Velocity prediction ($v$-pred) | $v_\theta(x_t, t)$, with $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$ | Derived from $\hat{x}_0 = \sqrt{\bar\alpha_t}\,x_t - \sqrt{1-\bar\alpha_t}\,v_\theta$ | Balanced; better for high resolution |
All three are mathematically equivalent -- they differ only in the implicit weighting of the loss across noise levels. $\epsilon$-prediction is the default in DDPM; $v$-prediction is preferred in progressive distillation and high-resolution models.
The Forward Posterior
The reverse process is trained to match the forward posterior $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\big)$, where:

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t.$$

This is derived by applying Bayes' rule, $q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\,q(x_{t-1} \mid x_0)$, and completing the square in the Gaussian exponent.
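The completing-the-square step can be checked numerically in the scalar case: the precision-weighted product of the two Gaussian factors must reproduce $\tilde\mu_t$ and $\tilde\beta_t$. Both helper functions below are illustrative sketches:

```python
import numpy as np

# Closed-form posterior parameters (tilde-mu, tilde-beta) in the scalar case.
def posterior_params(x0, xt, alpha_t, abar_prev):
    abar_t = alpha_t * abar_prev
    beta_t = 1.0 - alpha_t
    mu = (np.sqrt(abar_prev) * beta_t * x0
          + np.sqrt(alpha_t) * (1.0 - abar_prev) * xt) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mu, var

# Same quantity from Bayes' rule: precision-weighted product of the two
# Gaussian factors q(x_t | x_{t-1}) and q(x_{t-1} | x_0), viewed in x_{t-1}.
def posterior_params_by_bayes(x0, xt, alpha_t, abar_prev):
    beta_t = 1.0 - alpha_t
    prec = alpha_t / beta_t + 1.0 / (1.0 - abar_prev)
    mean = (np.sqrt(alpha_t) * xt / beta_t
            + np.sqrt(abar_prev) * x0 / (1.0 - abar_prev)) / prec
    return mean, 1.0 / prec
```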
DDPM Loss
Training minimizes the variational bound, which decomposes into per-timestep KL terms: $L_{\mathrm{VLB}} = \mathbb{E}_q\big[D_{\mathrm{KL}}(q(x_T \mid x_0)\,\|\,p(x_T)) + \sum_{t=2}^{T} D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1)\big]$. Since both $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$ are Gaussian, each KL term has a closed form that reduces to an MSE between means.
The simplified loss drops the per-timestep weighting:

$$L_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon, t}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big],$$

where $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, and $t \sim \mathrm{Uniform}\{1, \dots, T\}$.
- Sample a clean image $x_0$ from the dataset
- Sample $t \sim \mathrm{Uniform}\{1, \dots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$
- Compute $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$
- Predict $\epsilon_\theta(x_t, t)$
- Compute the loss $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ and backpropagate
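The steps above fit in a few lines. In this sketch `eps_theta` is a placeholder callable standing in for the network, and no autograd framework is used (a real implementation would backpropagate through the loss):

```python
import numpy as np

# One DDPM training step, mirroring the five steps listed above.
def training_step(x0, alpha_bars, eps_theta, rng):
    t = rng.integers(len(alpha_bars))                    # sample timestep
    eps = rng.standard_normal(x0.shape)                  # sample noise
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = eps_theta(xt, t)                           # predict the noise
    return np.mean((eps - eps_hat) ** 2)                 # simple MSE loss

# A deliberately trivial "network" that always predicts zero noise:
zero_net = lambda xt, t: np.zeros_like(xt)
```

With `zero_net` the expected loss is $\mathbb{E}\,\epsilon^2 = 1$ per coordinate, a useful sanity baseline when debugging a real model.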
There is no adversarial training, no mode collapse, no training instability. The loss is a standard regression objective. This simplicity, combined with strong sample quality, is why diffusion models have largely replaced GANs.
Score Matching
Score matching (Hyvärinen, 2005) trains a model $s_\theta(x) \approx \nabla_x \log p(x)$ without requiring the normalizing constant of $p(x)$.
Derivation. Since $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}.$$
So training the noise predictor is equivalent to training a score estimator, up to a known scaling factor: $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$. This unifies DDPM with the score-based SDE framework of Song et al. (2021).
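The identity is easy to verify numerically: computing the Gaussian score directly and rescaling the noise must give the same vector (helper names are illustrative):

```python
import numpy as np

# Check: grad_x log q(x_t | x_0) = -(x_t - sqrt(abar) x0) / (1 - abar)
#                                = -eps / sqrt(1 - abar).
def score_gaussian(xt, x0, abar_t):
    return -(xt - np.sqrt(abar_t) * x0) / (1.0 - abar_t)

def score_from_eps(eps, abar_t):
    return -eps / np.sqrt(1.0 - abar_t)
```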
| View | Model outputs | Training loss | Sampling |
|---|---|---|---|
| Noise prediction | $\epsilon_\theta(x_t, t)$ | $\lVert \epsilon - \epsilon_\theta \rVert^2$ | DDPM ancestral sampling |
| Score estimation | $s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$ | Denoising score matching | Langevin dynamics / probability flow ODE |
| Denoising | $\hat{x}_{0,\theta}(x_t, t)$ | $\lVert x_0 - \hat{x}_0 \rVert^2$ (reweighted) | DDIM / flow matching |
All three are mathematically equivalent. The score/SDE perspective (Song et al., 2021) generalizes discrete-time DDPM to continuous time, where the forward process is the SDE $dx = f(x, t)\,dt + g(t)\,dw$ and the reverse is $dx = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w}$.
Sampling Algorithms
| Algorithm | Steps | Stochastic? | Key Idea |
|---|---|---|---|
| DDPM (Ho et al., 2020) | $T$ (1000) | Yes | Ancestral sampling; add noise at each step |
| DDIM (Song et al., 2021) | 20-50 | No (deterministic) | Skip steps via non-Markovian reverse; same training |
| DPM-Solver (Lu et al., 2022) | 10-20 | No | High-order ODE solver for probability flow ODE |
| Euler (1st order) | 20-50 | Yes/No | Simplest discretization of the reverse SDE/ODE |
| Heun (2nd order) | 20-50 | No | Predictor-corrector; better quality per step |
| Consistency models (Song et al., 2023) | 1-2 | No | Direct mapping from noise to data; distilled or trained |
| Rectified flow (Liu et al., 2023) | 1-few | No | Learn straight-line trajectories from noise to data |
Setting the DDIM stochasticity parameter $\eta = 0$ gives a fully deterministic ODE (the probability flow ODE), enabling exact likelihood computation and meaningful latent-space interpolation.
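A single deterministic DDIM update, written as a sketch against the $\epsilon$-prediction parameterization (`eps_theta` is any callable standing in for the trained network):

```python
import numpy as np

# One DDIM step with eta = 0: estimate the clean sample, then re-noise it to
# the lower level t_prev reusing the *predicted* noise -- no fresh randomness.
def ddim_step(xt, t, t_prev, alpha_bars, eps_theta):
    eps_hat = eps_theta(xt, t)
    x0_hat = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    return (np.sqrt(alpha_bars[t_prev]) * x0_hat
            + np.sqrt(1.0 - alpha_bars[t_prev]) * eps_hat)
```

With an oracle that returns the true noise, one large DDIM jump lands exactly where the closed form places $x_{t_{\mathrm{prev}}}$, which is why the sampler can skip most timesteps.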
Classifier-Free Guidance
The guided prediction is $\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$, where $c$ is the conditioning signal (e.g., text prompt), $\varnothing$ is the null condition (trained by randomly dropping the condition during training with probability $p_{\mathrm{uncond}}$, commonly around 0.1), and $w$ is the guidance scale.
This shows that CFG amplifies the implicit classifier gradient $\nabla_{x_t} \log p(c \mid x_t)$ by a factor of $w$, pushing samples toward higher conditional likelihood. The effect:
| $w$ | Behavior | Trade-off |
|---|---|---|
| $w = 1$ | Standard conditional model | Maximum diversity |
| $1 < w \lesssim 3$ | Mild guidance | Good diversity-quality balance |
| $w \approx 5$-$10$ | Strong guidance (typical for image gen) | High quality, lower diversity |
| $w \to \infty$ | Approaches mode of $p(c \mid x_t)$ | Oversaturated, low-diversity samples |
Guidance is now standard in essentially all conditional diffusion models (Stable Diffusion, DALL-E, Imagen). The guidance scale is the primary user-facing quality knob.
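The guidance combination itself is a one-liner. The sketch below assumes the conditional and unconditional predictions have already been computed (in practice both come from one batched forward pass):

```python
import numpy as np

# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the conditional one by the guidance scale w.
def cfg_combine(eps_cond, eps_uncond, w):
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Note that $w = 1$ recovers the plain conditional prediction and $w = 0$ the unconditional one, consistent with the table above.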
Latent Diffusion and Practical Architecture
- Encode: $z = E(x)$ using a pretrained VAE encoder (spatial downsampling factor $f$, typically $f = 8$).
- Diffuse: run the forward/reverse process on $z$ instead of $x$.
- Decode: $\hat{x} = D(z_0)$ using the VAE decoder.
This reduces the dimensionality dramatically (e.g., from $512 \times 512 \times 3$ pixels to $64 \times 64 \times 4$ latents, a $48\times$ reduction), making the diffusion process far cheaper while preserving perceptual quality. The denoising network is typically a U-Net with cross-attention layers for text conditioning, or more recently a Diffusion Transformer (DiT) (Peebles & Xie, 2023) that uses a standard Transformer architecture on patchified latents.
Flow matching trains a velocity field $v_\theta(x_t, t)$ against the straight-line interpolation $x_t = (1-t)\,x_0 + t\,\epsilon$, whose target velocity is $\epsilon - x_0$. This produces straighter trajectories than diffusion, enabling fewer sampling steps. It has become the basis for several modern generative models (e.g., Stable Diffusion 3).
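A minimal sketch of the rectified-flow training pair (the name `flow_pair` is illustrative):

```python
import numpy as np

# Rectified-flow interpolant: x_t = (1 - t) x0 + t eps, with the constant
# target velocity eps - x0 along the straight path.
def flow_pair(x0, eps, t):
    xt = (1.0 - t) * x0 + t * eps
    return xt, eps - x0
```

Integrating the target velocity from $x_t$ for the remaining time $1 - t$ lands exactly on the noise endpoint; that exactness is what makes the trajectories straight and cheap to sample.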
Notation Summary
| Symbol | Meaning |
|---|---|
| $x_0$ | Clean data |
| $x_t$ | Noisy data at timestep $t$ |
| $\beta_t$ | Noise schedule at step $t$ |
| $\alpha_t = 1 - \beta_t$ | Signal retention per step |
| $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ | Cumulative signal retention |
| $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$ | Signal-to-noise ratio |
| $\epsilon$ | Gaussian noise added during forward process |
| $\epsilon_\theta(x_t, t)$ | Neural network noise prediction |
| $s_\theta(x_t, t)$ | Score function estimate: $\nabla_{x_t} \log p_t(x_t)$ |
| $\hat{x}_{0,\theta}(x_t, t)$ | Neural network clean data prediction |
| $v_\theta(x_t, t)$ | Velocity prediction (flow matching) |
| $q(x_t \mid x_0)$ | Forward (noising) distribution |
| $p_\theta(x_{t-1} \mid x_t)$ | Learned reverse (denoising) distribution |
| $c$ | Conditioning signal (text, class label) |
| $w$ | Guidance scale (CFG) |
| $T$ | Number of diffusion steps |
| LDM | Latent diffusion model |
| DiT | Diffusion Transformer |