Diffusion Models
Diffusion models are a class of generative models that learn to reverse a noise-corruption process. They have become the dominant approach for image, video, and audio generation, achieving state-of-the-art sample quality while offering stable training (no adversarial dynamics) and a principled variational objective. This chapter covers the mathematical foundations: the forward and reverse processes, the training objective and its derivation, the connection to score matching, and practical techniques like guidance.
Forward Process
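The forward process adds Gaussian noise to the data over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

where $\beta_t \in (0, 1)$ is the noise schedule. Each step scales down the signal by $\sqrt{1-\beta_t}$ and adds noise with variance $\beta_t$, preserving unit variance when the input has unit variance.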
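Remark (closed-form sampling). Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. Then $x_t$ can be sampled directly from $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$

Equivalently: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.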
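Proof. By induction. At step 1: $x_1 = \sqrt{\alpha_1}\,x_0 + \sqrt{1-\alpha_1}\,\epsilon_1$. Assume $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\bar\epsilon_{t-1}$. Then $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{\alpha_t(1-\bar\alpha_{t-1})}\,\bar\epsilon_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t$. Since $\bar\epsilon_{t-1}$ and $\epsilon_t$ are independent Gaussians, their sum has variance $\alpha_t(1-\bar\alpha_{t-1}) + (1-\alpha_t) = 1 - \bar\alpha_t$.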
As $T \to \infty$ (with appropriate schedule), $\bar\alpha_T \to 0$, so $q(x_T \mid x_0) \to \mathcal{N}(0, I)$: the data signal is completely destroyed.
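The closed-form jump can be checked numerically against step-by-step noising. The following is a minimal NumPy sketch, not a reference implementation: the helper names (`make_alpha_bar`, `q_sample`, `q_sample_iterative`), the linear schedule values, and the constant toy signal are all assumptions made for illustration.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative signal retention: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))

def q_sample(x0, t, alpha_bar, eps):
    """Closed-form sample of x_t given x_0 (t is a 1-indexed step)."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def q_sample_iterative(x0, t, betas, rng):
    """Same distribution, reached by applying t single noising steps."""
    x = x0
    for beta in betas[:t]:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear schedule
alpha_bar = make_alpha_bar(betas)

x0 = np.full(50_000, 2.0)                # constant toy "signal" so statistics are easy to read
t = 300
xt_closed = q_sample(x0, t, alpha_bar, rng.standard_normal(x0.shape))
xt_iter = q_sample_iterative(x0, t, betas, rng)

# Both means should sit near 2 * sqrt(alpha_bar_t), both variances near 1 - alpha_bar_t.
print(xt_closed.mean(), xt_iter.mean(), 2.0 * np.sqrt(alpha_bar[t - 1]))
print(xt_closed.var(), xt_iter.var(), 1.0 - alpha_bar[t - 1])
```

The two samplers agree in distribution, but the closed form costs one draw per sample rather than $t$ draws, which is what makes training on random timesteps cheap.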
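| Schedule | $\beta_t$ or SNR | Properties |
|---|---|---|
| Linear (Ho et al., 2020) | $\beta_t$ linear in $t$, from $10^{-4}$ to $0.02$ | Simple, but wastes steps in near-noise regime |
| Cosine (Nichol & Dhariwal, 2021) | $\bar\alpha_t = f(t)/f(0)$, $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ | Smoother SNR decay, better for small images |
| Scaled linear (Rombach et al., 2022) | $\sqrt{\beta_t}$ linear in $t$ (i.e. $\sqrt{\beta_t}$ interpolated linearly) | Used in Stable Diffusion 1.x and 2.x; gentler than plain linear |
| Log-SNR linear | $\log \mathrm{SNR}(t)$ linear in $t$ | Uniform in log-SNR space; theoretically motivated |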
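The signal-to-noise ratio (SNR) at step $t$ is $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$, which monotonically decreases from $\infty$ (at $t = 0$) to approximately $0$ (at $t = T$). The schedule should distribute steps evenly across the useful SNR range.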
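To make the comparison between schedules concrete, here is a small sketch that builds a linear and a cosine schedule and counts how many steps each spends at very low SNR. The function names, the endpoint values $10^{-4}$ and $0.02$, the offset `s = 0.008`, and the threshold `1e-2` are assumptions chosen for illustration.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise levels (assumed endpoints)."""
    return np.linspace(beta_start, beta_end, T)

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t)/f(0) with f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return (f / f[0])[1:]            # drop t = 0, keep steps 1..T

def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)

T = 1000
ab_linear = np.cumprod(1.0 - linear_betas(T))
ab_cosine = cosine_alpha_bar(T)

# Count steps where the signal is already almost gone (SNR below an assumed threshold).
low_linear = int((snr(ab_linear) < 1e-2).sum())
low_cosine = int((snr(ab_cosine) < 1e-2).sum())
print("steps with SNR < 0.01:", low_linear, "(linear) vs", low_cosine, "(cosine)")
```

Under these assumptions the linear schedule spends noticeably more of its steps in the near-noise regime than the cosine schedule does.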
Reverse Process
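The reverse process is a learned Gaussian transition:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

where $\mu_\theta$ is predicted by a neural network. The variance $\Sigma_\theta = \sigma_t^2 I$ can be fixed ($\sigma_t^2 = \beta_t$ or the posterior variance $\sigma_t^2 = \tilde\beta_t$) or learned.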
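| Parameterization | Model predicts | Mean formula | Loss weight |
|---|---|---|---|
| Noise prediction ($\epsilon$-pred) | $\epsilon_\theta(x_t, t)$ | $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\right)$ | Uniform across noise levels |
| Data prediction ($x_0$-pred) | $\hat{x}_\theta(x_t, t)$ | $\mu_\theta = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\hat{x}_\theta + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t$ | Weight $1/\mathrm{SNR}(t)$ relative to $\epsilon$-pred; emphasizes high-noise steps |
| Velocity prediction ($v$-pred) | $v_\theta(x_t, t)$, with target $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$ | Derived from $\hat{x}_0 = \sqrt{\bar\alpha_t}\,x_t - \sqrt{1-\bar\alpha_t}\,v_\theta$ | Balanced; better for high resolution |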
All three are mathematically equivalent; they differ only in the implicit weighting of the loss across noise levels. $\epsilon$-prediction is the default in DDPM; $v$-prediction is preferred in progressive distillation and high-resolution models.
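The equivalence can be seen by converting between the three targets. The sketch below assumes the variance-preserving form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ and the velocity target $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$; the function names and test values are illustrative only.

```python
import numpy as np

def x0_from_eps(x_t, eps, ab):
    """Recover x_0 from a noise prediction."""
    return (x_t - np.sqrt(1 - ab) * eps) / np.sqrt(ab)

def eps_from_x0(x_t, x0, ab):
    """Recover the noise from a clean-data prediction."""
    return (x_t - np.sqrt(ab) * x0) / np.sqrt(1 - ab)

def v_from_x0_eps(x0, eps, ab):
    """Velocity target v = sqrt(ab) * eps - sqrt(1 - ab) * x0."""
    return np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0

def x0_from_v(x_t, v, ab):
    """Recover x_0 from a velocity prediction."""
    return np.sqrt(ab) * x_t - np.sqrt(1 - ab) * v

def eps_from_v(x_t, v, ab):
    """Recover the noise from a velocity prediction."""
    return np.sqrt(1 - ab) * x_t + np.sqrt(ab) * v

rng = np.random.default_rng(0)
ab = 0.3                                   # assumed alpha_bar_t at some step
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
v = v_from_x0_eps(x0, eps, ab)

# A slightly wrong x0 prediction and the eps prediction it implies.
x0_hat = x0 + 0.1
eps_hat = eps_from_x0(x_t, x0_hat, ab)
ratio = np.sum((eps - eps_hat) ** 2) / np.sum((x0 - x0_hat) ** 2)
print(ratio, ab / (1 - ab))                # squared eps error = SNR * squared x0 error
```

The last line illustrates the implicit reweighting: the same prediction error is scaled by $\mathrm{SNR}(t)$ when measured in noise space rather than data space.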
The Forward Posterior
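Conditioned on $x_0$, the reverse of a single forward step is Gaussian and available in closed form:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$

where:

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$

This is derived by applying Bayes' rule: $q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\,q(x_{t-1} \mid x_0)$, and completing the square in the Gaussian exponent.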
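As a sanity check on the posterior formulas, the sketch below computes the posterior mean and variance and confirms that substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ gives the familiar noise-prediction form of the mean. The function name `posterior_mean_var`, the linear schedule, and the convention $\bar\alpha_0 = 1$ are assumptions for illustration.

```python
import numpy as np

def posterior_mean_var(x_t, x0, t, betas, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed."""
    beta_t = betas[t - 1]
    ab_t = alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0    # convention: alpha_bar_0 = 1
    coef_x0 = np.sqrt(ab_prev) * beta_t / (1 - ab_t)
    coef_xt = np.sqrt(1 - beta_t) * (1 - ab_prev) / (1 - ab_t)
    var = (1 - ab_prev) / (1 - ab_t) * beta_t
    return coef_x0 * x0 + coef_xt * x_t, var

betas = np.linspace(1e-4, 0.02, 1000)               # assumed linear schedule
alpha_bar = np.cumprod(1 - betas)

rng = np.random.default_rng(0)
t = 500
x0, eps = rng.standard_normal(3), rng.standard_normal(3)
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps

mean, var = posterior_mean_var(x_t, x0, t, betas, alpha_bar)

# Same mean written in terms of the noise instead of x_0.
mean_eps_form = (x_t - betas[t - 1] / np.sqrt(1 - alpha_bar[t - 1]) * eps) / np.sqrt(1 - betas[t - 1])
print(np.max(np.abs(mean - mean_eps_form)), var, betas[t - 1])
```

The posterior variance is always smaller than $\beta_t$, since knowing $x_0$ removes some uncertainty about $x_{t-1}$.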
DDPM Loss
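Training maximizes a variational lower bound on the log-likelihood, which decomposes into per-timestep KL terms:

$$-\log p_\theta(x_0) \le \mathbb{E}_q\!\left[ D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\Vert\,p(x_T)\right) + \sum_{t=2}^{T} D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\Vert\,p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1) \right]$$

Since both $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$ are Gaussian, each KL term has a closed form that reduces to an MSE between means.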
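The simplified loss drops the per-timestep weighting:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$

where $t \sim \mathrm{Uniform}\{1, \dots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$, and $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.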
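- Sample a clean image $x_0$ from the dataset
- Sample $t \sim \mathrm{Uniform}\{1, \dots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$
- Compute $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$
- Predict $\hat\epsilon = \epsilon_\theta(x_t, t)$
- Compute loss $\lVert \epsilon - \hat\epsilon \rVert^2$ and backpropagate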
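The training steps above can be sketched in a few lines. This is a NumPy-only illustration of the loss computation, assuming a hypothetical `model(x_t, t)` callable; in practice the network and the backward pass would come from an autodiff framework, which is omitted here. `zero_model` is a stand-in used only so the snippet runs.

```python
import numpy as np

def ddpm_training_loss(model, x0_batch, alpha_bar, rng):
    """One Monte Carlo estimate of the simplified DDPM loss for a batch."""
    B = x0_batch.shape[0]
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=B)                  # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)           # eps ~ N(0, I)
    # Broadcast alpha_bar_t over all non-batch axes.
    ab = alpha_bar[t - 1].reshape((B,) + (1,) * (x0_batch.ndim - 1))
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1 - ab) * eps
    eps_hat = model(x_t, t)                             # network prediction
    return np.mean((eps - eps_hat) ** 2)

def zero_model(x_t, t):
    """Hypothetical placeholder network that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))   # assumed linear schedule
x0_batch = rng.standard_normal((64, 3, 8, 8))               # toy "images"
loss = ddpm_training_loss(zero_model, x0_batch, alpha_bar, rng)
print(loss)   # close to 1, the second moment of standard normal noise
```

A predictor that ignores its input scores a loss near 1, so that value is a useful baseline when reading training curves.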
There is no adversarial training, and therefore none of the mode collapse or min-max instability associated with GANs: the loss is a standard regression objective. This simplicity, combined with strong sample quality, is why diffusion models have largely replaced GANs for image generation.
Score Matching
Score matching (Hyvärinen, 2005) trains a model $s_\theta(x) \approx \nabla_x \log p(x)$ without requiring the normalizing constant of $p(x)$.
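Derivation. Since $q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}$$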
So training the noise predictor is equivalent to training a score estimator, up to a known scaling factor: $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1-\bar\alpha_t}$. This unifies DDPM with the score-based SDE framework of Song et al. (2021).
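The scaling between noise and score can be verified numerically. The sketch below compares $-\epsilon/\sqrt{1-\bar\alpha_t}$ with a finite-difference gradient of the Gaussian log-density; the helper names and the value of $\bar\alpha_t$ are assumptions for illustration.

```python
import numpy as np

def score_from_eps(eps, ab):
    """Score of q(x_t | x_0) expressed through the noise: -eps / sqrt(1 - ab)."""
    return -eps / np.sqrt(1.0 - ab)

def log_q(x_t, x0, ab):
    """log N(x_t; sqrt(ab) * x0, (1 - ab) I), up to an additive constant."""
    return -np.sum((x_t - np.sqrt(ab) * x0) ** 2) / (2.0 * (1.0 - ab))

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
ab = 0.6                                   # assumed alpha_bar_t
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

analytic = score_from_eps(eps, ab)
numeric = numerical_grad(lambda x: log_q(x, x0, ab), x_t)
print(np.max(np.abs(analytic - numeric)))  # agreement up to finite-difference error
```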
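| View | Model outputs | Training loss | Sampling |
|---|---|---|---|
| Noise prediction | $\epsilon_\theta(x_t, t)$ | $\lVert \epsilon - \epsilon_\theta \rVert^2$ | DDPM ancestral sampling |
| Score estimation | $s_\theta(x_t, t)$ | Denoising score matching | Langevin dynamics / probability flow ODE |
| Denoising | $\hat{x}_\theta(x_t, t)$ | $\lVert x_0 - \hat{x}_\theta \rVert^2$ (reweighted) | DDIM / flow matching |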
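All three are mathematically equivalent. The score/SDE perspective (Song et al., 2021) generalizes discrete-time DDPM to continuous time, where the forward process is an SDE $dx = f(x, t)\,dt + g(t)\,dw$ and the reverse is $dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}$.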
Sampling Algorithms
| Algorithm | Steps | Stochastic? | Key Idea |
|---|---|---|---|
| DDPM (Ho et al., 2020) | $T$ (1000) | Yes | Ancestral sampling; add noise at each step |
| DDIM (Song et al., 2021) | $S \ll T$ | No (deterministic) | Skip steps via non-Markovian reverse; same training |
| DPM-Solver (Lu et al., 2022) | 10-20 | No | High-order ODE solver for probability flow ODE |
| Euler (1st order) | 20-50 | Yes/No | Simplest discretization of the reverse SDE/ODE |
| Heun (2nd order) | 20-50 | No | Predictor-corrector; better quality per step |
| Consistency models (Song et al., 2023) | 1-2 | No | Direct mapping from noise to data; distilled or trained |
| Rectified flow (Liu et al., 2023) | 1-few | No | Learn straight-line trajectories from noise to data |
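The DDIM update first forms the clean-data estimate $\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)) / \sqrt{\bar\alpha_t}$ and then steps to

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t z, \qquad \sigma_t = \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,\sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$$

with $z \sim \mathcal{N}(0, I)$. Setting the stochasticity parameter $\eta = 0$ gives a fully deterministic sampler that discretizes the probability flow ODE, enabling exact likelihood computation and meaningful latent space interpolation.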
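A deterministic DDIM-style step with skipped timesteps might look like the sketch below. It is illustrative only: `ddim_step` is a hypothetical helper, the schedule is an assumed linear one, and an "oracle" that knows $x_0$ stands in for the trained network so that the result can be checked exactly.

```python
import numpy as np

def ddim_step(x_t, eps_hat, ab_t, ab_prev, eta=0.0, rng=None):
    """One DDIM update from alpha_bar level ab_t to ab_prev (eta = 0 is deterministic)."""
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
    sigma = eta * np.sqrt((1 - ab_prev) / (1 - ab_t)) * np.sqrt(1 - ab_t / ab_prev)
    direction = np.sqrt(np.maximum(1 - ab_prev - sigma ** 2, 0.0)) * eps_hat
    if eta > 0:
        rng = rng if rng is not None else np.random.default_rng()
        noise = sigma * rng.standard_normal(np.shape(x_t))
    else:
        noise = 0.0
    return np.sqrt(ab_prev) * x0_hat + direction + noise

T = 1000
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(5), rng.standard_normal(5)

steps = np.linspace(T, 1, 10).round().astype(int)        # use 10 of the 1000 training steps
x = np.sqrt(alpha_bar[T - 1]) * x0 + np.sqrt(1 - alpha_bar[T - 1]) * eps
for i, t in enumerate(steps):
    ab_t = alpha_bar[t - 1]
    ab_prev = alpha_bar[steps[i + 1] - 1] if i + 1 < len(steps) else 1.0
    # Oracle noise estimate (uses the true x0); a trained network would go here.
    eps_hat = (x - np.sqrt(ab_t) * x0) / np.sqrt(1 - ab_t)
    x = ddim_step(x, eps_hat, ab_t, ab_prev)

print(np.max(np.abs(x - x0)))   # tiny: with a perfect noise estimate, skipping steps loses nothing
```

With a perfect noise estimate the deterministic update lands back on $x_0$ regardless of how many steps are skipped; with a real network, the step count trades speed against discretization error.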
Classifier-Free Guidance
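Classifier-free guidance (CFG) combines the conditional and unconditional noise predictions of a single network:

$$\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + w\left[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset)\right]$$

where $c$ is the conditioning signal (e.g., text prompt), $\emptyset$ is the null condition (trained by randomly dropping the condition during training with probability $p_{\text{uncond}}$), and $w$ is the guidance scale.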
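In score terms, Bayes' rule gives $\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$, so the guided score is

$$\tilde{s}(x_t, c) = \nabla_{x_t} \log p(x_t) + w\,\nabla_{x_t} \log p(c \mid x_t)$$

This shows that CFG amplifies the implicit classifier gradient by a factor of $w$, pushing samples toward higher conditional likelihood. The effect: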
| $w$ | Behavior | Trade-off |
|---|---|---|
| $w = 1$ | Standard conditional model | Maximum diversity |
| Slightly above $1$ | Mild guidance | Good diversity-quality balance |
| Well above $1$ | Strong guidance (typical for image gen) | High quality, lower diversity |
| $w \to \infty$ | Approaches mode of $p(c \mid x_t)$ | Lowest diversity |
Guidance is now standard in essentially all conditional diffusion models (Stable Diffusion, DALL-E 2, Imagen). The guidance scale $w$ is the primary user-facing quality knob.
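Both halves of classifier-free guidance, condition dropout at training time and prediction mixing at sampling time, fit in a short sketch. It assumes the convention $\tilde\epsilon = \epsilon_\emptyset + w(\epsilon_c - \epsilon_\emptyset)$; the names `cfg_combine` and `drop_condition`, the all-zeros null embedding, and the dropout probability `0.1` are assumptions for illustration.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Guided prediction: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def drop_condition(cond, null_cond, p_uncond, rng):
    """Training-time dropout: replace each row of cond by null_cond with prob p_uncond."""
    drop = rng.random(cond.shape[0]) < p_uncond
    return np.where(drop[:, None], null_cond, cond), drop

rng = np.random.default_rng(0)

# Sampling side: mix two (here random, stand-in) network outputs.
eps_uncond, eps_cond = rng.standard_normal(6), rng.standard_normal(6)
guided = cfg_combine(eps_uncond, eps_cond, w=7.5)

# Training side: drop roughly 10% of the conditioning vectors.
cond = rng.standard_normal((10_000, 16))       # toy condition embeddings
null_cond = np.zeros(16)                       # assumed null embedding
cond_dropped, drop = drop_condition(cond, null_cond, p_uncond=0.1, rng=rng)
print(drop.mean())                             # fraction actually dropped, near 0.1
```

Because one network serves both roles, each guided sampling step costs two forward passes (often batched together).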
Latent Diffusion and Practical Architecture
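- Encode: $z_0 = \mathcal{E}(x_0)$ using a pretrained VAE encoder ($512 \times 512 \times 3 \to 64 \times 64 \times 4$, typical $8\times$ compression).
- Diffuse: Run forward/reverse process on $z$ instead of $x$.
- Decode: $\hat{x}_0 = \mathcal{D}(z_0)$ using the VAE decoder.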
The spatial resolution drops by $8\times$ per side ($64\times$ in pixel count), and counting the change in channel depth (3 to 4) the total element count falls by $48\times$ (from $786{,}432$ to $16{,}384$), making the diffusion process dramatically cheaper while preserving perceptual quality. The denoising network is typically a U-Net with cross-attention layers for text conditioning, or more recently a Diffusion Transformer (DiT) (Peebles & Xie, 2023) that uses a standard Transformer architecture on patchified latents.
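The compression arithmetic is easy to reproduce. The snippet below assumes a $512 \times 512$ RGB input, an $8\times$ downsampling factor per side, and 4 latent channels; `latent_shape` is a hypothetical helper, not part of any library.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an image of the given size (assumed VAE settings)."""
    return (height // downsample, width // downsample, latent_channels)

pixel_elements = 512 * 512 * 3
h, w, c = latent_shape(512, 512)
latent_elements = h * w * c

print(pixel_elements, latent_elements, pixel_elements // latent_elements)
# 786432 16384 48
```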
Flow matching produces straighter trajectories than diffusion, enabling fewer sampling steps. It has become the basis for several modern generative models (e.g., Stable Diffusion 3).
Worked Examples
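Step 1: Signal-retention factors. With $\alpha_t = 1 - \beta_t$ we get the per-step factors $\alpha_1, \dots, \alpha_T$.
Step 2: Cumulative products. Using $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, multiply the retention factors up to each step to obtain $\bar\alpha_1, \dots, \bar\alpha_T$.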
Step 3: Signal-to-noise ratio at the final step. $\mathrm{SNR}(T) = \bar\alpha_T / (1 - \bar\alpha_T) \approx 1$, so by the final step the signal and noise have roughly equal power.
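Step 4: Apply the closed-form forward map. From the remark on closed-form sampling, $x_T = \sqrt{\bar\alpha_T}\,x_0 + \sqrt{1-\bar\alpha_T}\,\epsilon$. With a given $x_0$ and a single noise draw $\epsilon$, this evaluates $x_T$ directly.
Notice that no intermediate $x_1, \dots, x_{T-1}$ were needed: the closed form jumps straight to $x_T$.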
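The four steps can be traced numerically with a toy schedule. All numbers below are assumed for illustration (a three-step schedule, a two-dimensional data point, and a fixed noise draw); they are chosen so that the final SNR comes out close to 1.

```python
import numpy as np

betas = np.array([0.1, 0.2, 0.3])        # assumed toy schedule, T = 3
x0 = np.array([1.0, -2.0])               # assumed data point
eps = np.array([0.5, -1.0])              # assumed noise draw

alphas = 1.0 - betas                     # Step 1: per-step signal retention
alpha_bar = np.cumprod(alphas)           # Step 2: cumulative products
snr = alpha_bar / (1.0 - alpha_bar)      # Step 3: signal-to-noise ratio per step

# Step 4: closed-form jump straight to the final step.
x3 = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps

print(np.round(alpha_bar, 3))            # approximately [0.9, 0.72, 0.504]
print(round(float(snr[-1]), 3))          # approximately 1.016
print(np.round(x3, 3))
```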
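Apply the mixing rule. The CFG equation gives $\tilde\epsilon = (1 - w)\,\epsilon_\theta(x_t, \emptyset) + w\,\epsilon_\theta(x_t, c)$. For guidance scale $w$ the scalar weights are $1 - w$ and $w$, applied component-wise to the unconditional and conditional predictions to obtain $\tilde\epsilon$.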
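Interpret the scale. Rewriting as $\tilde\epsilon = \epsilon_\theta(x_t, \emptyset) + w\left[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset)\right]$, the conditional-minus-unconditional difference is $\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset)$. Larger $w$ scales this difference by $w$, so doubling the scale multiplies it by $2w$ instead of $w$, pushing the prediction twice as far from the unconditional output along the same direction. This is the diversity-versus-fidelity knob: bigger $w$ sharpens adherence to the condition at the cost of sample diversity.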
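A small numeric instance makes the mixing rule concrete. The two prediction vectors and the scales $w = 3$ and $w = 6$ below are assumed values chosen for illustration, using the convention $\tilde\epsilon = \epsilon_\emptyset + w(\epsilon_c - \epsilon_\emptyset)$ with $\epsilon_c = \epsilon_\theta(x_t, c)$ and $\epsilon_\emptyset = \epsilon_\theta(x_t, \emptyset)$.

$$
\begin{aligned}
\epsilon_c &= (1.0,\ -0.5), \qquad \epsilon_\emptyset = (0.4,\ 0.1), \qquad \epsilon_c - \epsilon_\emptyset = (0.6,\ -0.6) \\
w = 3:\quad \tilde\epsilon &= (1-3)\,\epsilon_\emptyset + 3\,\epsilon_c = (-0.8,\ -0.2) + (3.0,\ -1.5) = (2.2,\ -1.7) \\
&= \epsilon_\emptyset + 3\,(0.6,\ -0.6) = (0.4 + 1.8,\ 0.1 - 1.8) \\
w = 6:\quad \tilde\epsilon &= \epsilon_\emptyset + 6\,(0.6,\ -0.6) = (0.4 + 3.6,\ 0.1 - 3.6) = (4.0,\ -3.5)
\end{aligned}
$$

Under these assumed values the offset from the unconditional prediction grows from $(1.8, -1.8)$ to $(3.6, -3.6)$: the same direction, twice the length.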
Notation Summary
| Symbol | Meaning |
|---|---|
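| $x_0$ | Clean data |
| $x_t$ | Noisy data at timestep $t$ |
| $\beta_t$ | Noise schedule at step $t$ |
| $\alpha_t = 1 - \beta_t$ | Signal retention per step |
| $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ | Cumulative signal retention |
| $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$ | Signal-to-noise ratio |
| $\epsilon$ | Gaussian noise added during forward process |
| $\epsilon_\theta(x_t, t)$ | Neural network noise prediction |
| $s_\theta(x_t, t)$ | Score function estimate: $\nabla_{x_t} \log p_t(x_t)$ |
| $\hat{x}_\theta(x_t, t)$ | Neural network clean data prediction |
| $v_\theta(x_t, t)$ | Velocity prediction (flow matching) |
| $q(x_t \mid x_0)$ | Forward (noising) process marginal |
| $p_\theta(x_{t-1} \mid x_t)$ | Learned reverse (denoising) transition |
| $c$ | Conditioning signal (text, class label) |
| $w$ | Guidance scale (CFG) |
| $T$ | Number of diffusion steps |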
| LDM | Latent diffusion model |
| DiT | Diffusion Transformer |
References
- Jonathan Ho, Ajay Jain, Pieter Abbeel (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
- Aapo Hyvärinen (2005). Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research.
- Xingchao Liu, Chengyue Gong, Qiang Liu (2023). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR.
- Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu (2022). DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. NeurIPS.
- Alexander Quinn Nichol, Prafulla Dhariwal (2021). Improved Denoising Diffusion Probabilistic Models. ICML.
- William Peebles, Saining Xie (2023). Scalable Diffusion Models with Transformers. ICCV.
- Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Bjorn Ommer (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.
- Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR.
- Jiaming Song, Chenlin Meng, Stefano Ermon (2021). Denoising Diffusion Implicit Models. ICLR.
- Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever (2023). Consistency Models. ICML.