World Models + Reinforcement Learning
The integration of world models with reinforcement learning is arguably the most mature application of world modeling, with a clear arc from planning with known models, to learned dynamics, to foundation-scale agents. This section covers the key milestones, from the foundational Dyna architecture through AlphaGo and AlphaZero to modern approaches that leverage generative models.
Although the methods below are introduced in roughly chronological order, they are best understood along the two taxonomy axes used throughout this chapter: the prediction space in which the model operates (perceptual pixel-space versus abstract value-equivalent latent space) and the generative paradigm it adopts (deterministic prediction, autoregressive token modeling, diffusion, or no observation generation at all). These axes, rather than release date, explain why methods like MuZero and DIAMOND make such different design choices despite targeting the same benchmarks. The closing comparison table maps each method onto these axes.
The Dyna Architecture and Its Legacy
Most modern world-model RL methods can be traced back to the Dyna architecture proposed by Sutton (1990). Dyna interleaves real experience with simulated experience generated by a learned model, using both to update the value function and policy. Its three components (learn a model from real experience, generate simulated experience from the model, update the policy from both) form essentially the same loop used by Dreamer, MuZero, MBPO, and most other modern world-model RL methods, although each instantiates it differently. The evolution from Dyna to DreamerV3 is a story of scaling this basic idea from tabular environments to complex visual environments through advances in deep learning and generative modeling.
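The three-phase loop can be made concrete with a minimal tabular Dyna-Q sketch. It assumes a deterministic environment and uses a hypothetical five-state chain (`chain_step`) invented for illustration; the hyperparameters are arbitrary and not taken from Sutton (1990).

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q: direct RL, model learning, and planning from the model."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q[(state, action)]
    model = {}              # learned deterministic model: (s, a) -> (r, s', done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)
            update(s, a, r, s2, done)        # (1) update from real experience
            model[(s, a)] = (r, s2, done)    # (2) learn the model
            for _ in range(planning_steps):  # (3) update from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


def chain_step(state, action):
    """Hypothetical 5-state chain: action 1 moves right, action 0 moves left."""
    nxt = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    done = nxt == 4
    return (1.0 if done else 0.0), nxt, done


q_values = dyna_q(chain_step, start_state=0, actions=[0, 1])
```

Setting `planning_steps=0` reduces the sketch to ordinary Q-learning, which isolates the contribution of the simulated updates.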
AlphaGo and AlphaZero: Search with Perfect Models
Before learned world models achieved prominence, AlphaGo (Silver et al., 2016) demonstrated the power of combining neural networks with model-based planning using a known (hand-coded) world model: the rules of Go. AlphaGo used Monte Carlo Tree Search (MCTS) with a policy network to guide search and a value network to evaluate positions, becoming the first program to defeat a top human professional at Go. AlphaZero (Silver et al., 2018) dispensed with human expert data, learning entirely through self-play with MCTS planning, and extended the approach to chess and shogi. The key architectural insight, using a neural network to provide both a policy prior and a value estimate to guide tree search, directly informed MuZero's design. The critical step from AlphaZero to MuZero was replacing the known game rules with a learned dynamics model, enabling the approach to work in environments where the rules are unknown.
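How a policy prior and a value estimate jointly guide search can be illustrated with the PUCT selection rule used in the AlphaZero family, which scores each action as Q(s, a) + c · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a)). The sketch below covers only the selection step at a single node; the function name, the dictionary layout, and the value of `c_puct` are illustrative assumptions.

```python
import math


def puct_select(priors, visit_counts, value_sums, c_puct=1.5):
    """Pick the child maximizing mean value plus a prior-weighted exploration bonus."""
    total_visits = sum(visit_counts.values())

    def score(action):
        n = visit_counts[action]
        q = value_sums[action] / n if n > 0 else 0.0  # mean value from search so far
        u = c_puct * priors[action] * math.sqrt(total_visits) / (1 + n)
        return q + u

    return max(priors, key=score)


# Hypothetical statistics at one node of the search tree.
priors = {"a": 0.6, "b": 0.3, "c": 0.1}
visit_counts = {"a": 10, "b": 1, "c": 0}
value_sums = {"a": 4.0, "b": 0.5, "c": 0.0}
choice = puct_select(priors, visit_counts, value_sums)
```

In this example the heavily visited action `a` has its bonus shrunk by the `1 + N` denominator, so the search turns to the less explored `b`.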
MuZero: Value-Equivalent World Models
Schrittwieser et al. (2020) introduced MuZero, which represents a fundamental rethinking of what world models should predict. Unlike Dreamer-style models that reconstruct observations (pixel-level prediction), MuZero learns a world model that predicts only the quantities relevant to planning: rewards, value functions, and policy priors. The model never reconstructs observations: it operates entirely in an abstract latent space where the only requirement is that the predicted quantities support effective Monte Carlo Tree Search (MCTS).
By combining this learned, value-equivalent dynamics model with MCTS planning, MuZero achieved superhuman performance in Go, chess, shogi, and Atari without being provided the rules of any game. The model learns the "rules" (dynamics) purely from experience, then uses MCTS to search for optimal actions in the learned model. MuZero demonstrated a crucial insight: world models need not be perceptually accurate; they only need to be decision-theoretically accurate. A model that perfectly predicts pixel values but poorly predicts rewards is less useful than one that ignores visual details but accurately predicts task-relevant quantities.
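The value-equivalent interface can be sketched as three functions: a representation function that encodes an observation into a latent state, a dynamics function that advances the latent state and predicts a reward, and a prediction function that outputs a policy prior and a value. The toy callables below are hypothetical stand-ins for neural networks; the point is only that nothing resembling an observation is ever decoded.

```python
def unroll(represent, dynamics, predict, observation, actions):
    """Roll a value-equivalent model forward in latent space.

    Only (reward, value, policy) triples leave the model; observations
    are encoded once at the root and never reconstructed.
    """
    state = represent(observation)
    policy, value = predict(state)
    outputs = [(0.0, value, policy)]  # no reward is predicted at the root
    for action in actions:
        reward, state = dynamics(state, action)
        policy, value = predict(state)
        outputs.append((reward, value, policy))
    return outputs


# Hypothetical toy functions standing in for learned networks.
def represent(observation):
    return (sum(observation) / len(observation),)


def dynamics(state, action):
    next_state = (0.9 * state[0] + 0.1 * action,)
    return next_state[0], next_state  # (predicted reward, next latent state)


def predict(state):
    return [0.5, 0.5], state[0]  # (policy prior, value estimate)


outputs = unroll(represent, dynamics, predict, observation=[0.0, 1.0], actions=[1, 0, 1])
```

In a full agent, the training losses would compare these predicted rewards, values, and policies with targets derived from real trajectories and search, and MCTS would call the dynamics and prediction functions during planning.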
EfficientZero: Sample-Efficient MuZero
Ye et al. (2021) proposed EfficientZero, extending MuZero with three key improvements: (1) self-supervised temporal consistency, a SimSiam-style objective that aligns the next-state latent predicted by rolling the dynamics model forward with the latent obtained by encoding the actual next observation, (2) end-to-end value prefix prediction, which predicts the cumulative reward over the unrolled steps rather than each per-step reward, and (3) model-based off-policy correction of value targets. Using only 2 hours of real-time game experience (100k environment steps), EfficientZero achieved 194% mean (109% median) human-normalized performance on the 26-game Atari 100k suite, becoming the first algorithm to exceed median human performance on Atari 100k and establishing a new standard for sample efficiency. However, EfficientZero inherits the computational overhead of MCTS planning and was initially validated only on the Atari 100k benchmark, leaving its relative advantage in other domains an open question.
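The mean and median figures above are aggregates of per-game human-normalized scores, conventionally computed as (agent − random) / (human − random). The sketch below shows the arithmetic on made-up scores for three hypothetical games; none of the numbers are results from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random_score, human):
    """0.0 matches a random policy, 1.0 matches the human reference score."""
    return (agent - random_score) / (human - random_score)


# Hypothetical (agent, random, human) raw scores, for illustration only.
games = {
    "game_a": (50.0, 10.0, 30.0),
    "game_b": (12.0, 2.0, 22.0),
    "game_c": (500.0, 0.0, 100.0),
}
normalized = [human_normalized(*scores) for scores in games.values()]
mean_score = mean(normalized)
median_score = median(normalized)
```

Because a few games with very large normalized scores can pull the mean well above the median, benchmark results are usually reported with both statistics.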
IRIS: World Modeling as Sequence Prediction
Micheli et al. (2023) introduced IRIS (Imagination with auto-Regression over an Inner Speech), which offers an elegant conceptual unification: it treats world modeling as autoregressive sequence prediction, using the same transformer architecture that powers language models. IRIS tokenizes observations with a VQ-VAE (Oord et al., 2017), producing discrete tokens analogous to words in a language, and models environment dynamics autoregressively with a transformer, predicting the next "word" in the environment's "language". By framing trajectories of observations, actions, and rewards as token sequences, IRIS achieves competitive Atari 100k performance with a conceptually simple architecture.
IRIS is significant because it suggests that world modeling and language modeling can be cast as the same computational problem at an abstract level: both involve predicting the next element in a sequence conditioned on history. This unification opens the door to applying the scaling laws and architectural innovations from NLP to world modeling.
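The sequence-prediction framing can be sketched by flattening a trajectory into one token stream, with each frame's discrete tokens followed by the action taken. Offsetting action ids past the observation vocabulary is an assumption made here to keep everything in a single list; it is not a claim about how IRIS embeds actions, and the tiny vocabulary and frame size are illustrative.

```python
def flatten_trajectory(frame_tokens, actions, vocab_size):
    """Interleave per-frame observation tokens with one action token per step."""
    sequence = []
    for tokens, action in zip(frame_tokens, actions):
        sequence.extend(tokens)               # discrete codes from the tokenizer
        sequence.append(vocab_size + action)  # assumed: action ids offset past the codes
    return sequence


def next_token_pairs(sequence):
    """(input, target) pairs for autoregressive next-token training."""
    return list(zip(sequence[:-1], sequence[1:]))


# Two hypothetical frames of two tokens each, over a vocabulary of 8 codes.
sequence = flatten_trajectory(frame_tokens=[[3, 7], [1, 4]], actions=[0, 2], vocab_size=8)
pairs = next_token_pairs(sequence)
```

This sketch treats every position as a prediction target. A real world model would typically predict only the environment's side of the sequence (next-frame tokens, reward, termination) and take actions as given inputs.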
Model-Based Policy Optimization (MBPO)
Janner et al. (2019) proposed MBPO, which addresses a practical question: how to benefit from an imperfect world model without being hurt by its errors. MBPO uses short model rollouts, branched from real states stored in a replay buffer, to generate additional training data for a model-free learner (soft actor-critic, SAC). By keeping rollouts short (typically 1-5 steps), MBPO limits the accumulation of model errors while still gaining substantial sample efficiency improvements.
MBPO provided both practical and theoretical contributions: empirically, it achieved 10-100x sample efficiency improvements over model-free SAC on the MuJoCo benchmark (Todorov et al., 2012), and theoretically, it provided conditions under which model-based augmentation provably helps (when the model's error is bounded and the rollout horizon is appropriately chosen). The key practical insight, using the model for short rollouts from real states rather than long imagined trajectories, has influenced the design of many subsequent methods.
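The branched-rollout idea can be sketched as follows: sample a real state from the replay buffer, roll the learned model forward for a few steps under the current policy, and collect the synthetic transitions for the model-free learner. The `model_step` and `policy` callables and the toy dynamics are hypothetical placeholders, not MBPO's ensemble model or SAC policy.

```python
import random


def branched_rollouts(real_buffer, model_step, policy, horizon, n_starts, seed=0):
    """Generate short model rollouts that each start from a real, observed state."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_starts):
        state = rng.choice(real_buffer)[0]  # branch from a state seen in the real environment
        for _ in range(horizon):            # short horizon limits compounding model error
            action = policy(state)
            reward, next_state, done = model_step(state, action)
            synthetic.append((state, action, reward, next_state, done))
            if done:
                break
            state = next_state
    return synthetic


# Hypothetical 1-D example: transitions are (state, action, reward, next_state, done).
real_buffer = [(0.0, 1.0, 0.0, 0.1, False), (0.5, -1.0, 0.0, 0.4, False)]


def model_step(state, action):
    next_state = state + 0.1 * action  # stand-in for a learned dynamics model
    return -abs(next_state), next_state, False


def policy(state):
    return -1.0 if state > 0 else 1.0


synthetic = branched_rollouts(real_buffer, model_step, policy, horizon=3, n_starts=4)
```

The synthetic transitions would then be mixed into the training data of the model-free algorithm. `horizon` is the main knob trading extra data against accumulated model error.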
SimPLe: Simulated Policy Learning
Kaiser et al. (2020) proposed SimPLe (Simulated Policy Learning), one of the first successful applications of world models to Atari games. SimPLe trains a deterministic video prediction model (pixel-space) from 100k environment steps, then trains a policy entirely within the learned model using PPO. Despite its simplicity, SimPLe demonstrated that even relatively basic pixel-space world models can support effective policy learning when the environment complexity is moderate. SimPLe established the Atari 100k benchmark as the standard testbed for sample-efficient world model methods.
Generative World Models for RL at Scale
Recent work has explored using large-scale generative models (including diffusion models and autoregressive transformers) as world models for RL. The key insight is that improvements in generative model quality (sharper, more coherent, more diverse generations) can translate into better world model quality, and thus better downstream RL performance. This trend is exemplified by DIAMOND (Alonso et al., 2024), which uses a diffusion world model, and STORM (Zhang et al., 2023), which uses a transformer-based stochastic world model. Wu et al. (2024) further demonstrated that pre-training world models on in-the-wild videos (not just task-specific environment data) improves downstream RL performance, supporting the foundation world model hypothesis for RL.
This generative-modeling direction stands in tension with the value-equivalent family discussed above. MuZero and EfficientZero achieve their strongest results precisely by discarding observation fidelity in favor of RL-specific objectives (reward, value, and policy prediction, plus EfficientZero's temporal-consistency and off-policy-correction losses), suggesting that decision-relevant accuracy, rather than generative fidelity, drives sample efficiency. The two families therefore make different bets: generative world models wager that perceptual quality and broad video pre-training transfer to control, while value-equivalent models wager that task-relevant prediction is what matters. Which bet pays off appears to depend on the regime: pixel-grounded generative models excel where rich visual dynamics must be modeled, whereas value-equivalent models dominate the most sample-constrained settings such as Atari 100k. A definitive comparison across matched compute and data budgets remains open.
Adaptive World Models: Model-Based Meta-RL
A complementary line of work targets non-stationarity rather than raw sample efficiency. Clavera et al. (2018) proposed combining model-based RL with meta-learning, training a dynamics model that can rapidly adapt to new environments with a few gradient steps. This meta-model-based approach addresses a key limitation of standard world models: their difficulty in adapting to changing dynamics. By meta-learning the dynamics model, the agent can quickly recalibrate its world model when the environment changes, combining the sample efficiency of model-based methods with the adaptability of meta-learning.
Positioned against the other families, meta-model-based RL trades peak single-task performance for robustness to dynamics shift. Where MuZero and MBPO optimize a single fixed environment (MuZero through value-equivalent search, MBPO through short rollouts that bound model error), meta-learning instead optimizes for fast re-fitting across a distribution of tasks. The adaptability comes at a cost: the meta-trained model must amortize capacity across many environments rather than specializing, so it typically underperforms a dedicated single-environment world model in any one stationary setting while substantially outperforming it once the dynamics change.
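The "few gradient steps" adaptation described above can be illustrated with a deliberately tiny example: a one-parameter linear dynamics model, s' = θ·s + a, re-fitted on a handful of recent transitions starting from an initialization that is assumed to have been meta-learned. Only the inner adaptation loop is shown; the outer meta-training loop, the model form, and all numbers are illustrative assumptions.

```python
def adapt(theta, transitions, lr=0.1, steps=5):
    """Take a few gradient steps on mean squared one-step prediction error."""
    for _ in range(steps):
        grad = sum(2.0 * (theta * s + a - s_next) * s
                   for s, a, s_next in transitions) / len(transitions)
        theta -= lr * grad
    return theta


# Hypothetical setup: the environment's dynamics have shifted to theta = 0.5,
# while the (assumed) meta-learned initialization is theta = 0.9.
true_theta, meta_init = 0.5, 0.9
recent = [(s, 0.1, true_theta * s + 0.1) for s in (1.0, -1.0, 2.0)]
adapted = adapt(meta_init, recent)
```

Five steps on three transitions move the parameter most of the way to the new dynamics. The aim of meta-training would be to find an initialization from which such short adaptation works well across the whole task distribution.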
Comparative Synthesis
Mapping the methods above onto the chapter's two axes makes the design space legible. The prediction space column records what the model commits to representing, and the generative paradigm column records how (if at all) it produces observations. The decisive split is between models that reconstruct perceptual detail and value-equivalent models that predict only decision-relevant quantities.
| Method | Prediction space | Generative paradigm | Planning / use of model | Core bet |
|---|---|---|---|---|
| Dyna (Sutton, 1990) | Tabular / abstract | None (transition sampling) | Simulated experience for value updates | Reusing a learned model multiplies the value of real data |
| AlphaZero (Silver et al., 2018) | Known rules (perfect model) | None | MCTS over given dynamics | Search plus learned priors beats either alone |
| MuZero (Schrittwieser et al., 2020) | Abstract value-equivalent latent | None (no observation reconstruction) | MCTS over learned dynamics | Decision-theoretic, not perceptual, accuracy matters |
| EfficientZero (Ye et al., 2021) | Abstract value-equivalent latent | None (self-supervised latent consistency) | MCTS over learned dynamics | Auxiliary self-supervision unlocks extreme sample efficiency |
| IRIS (Micheli et al., 2023) | Pixel-space (discrete tokens) | Autoregressive transformer | Policy trained in imagination | World modeling is sequence prediction |
| SimPLe (Kaiser et al., 2020) | Pixel-space | Deterministic video prediction | Policy trained in imagination | Even simple pixel models support policy learning |
| MBPO (Janner et al., 2019) | State-space (proprioceptive) | Short-horizon dynamics rollouts | Replay augmentation for model-free RL | Short rollouts bound compounding model error |
| DIAMOND (Alonso et al., 2024) | Pixel-space | Diffusion | Policy trained in imagination | Higher generative fidelity transfers to control |
| STORM (Zhang et al., 2023) | Pixel-space (stochastic latent) | Transformer with stochastic latents | Policy trained in imagination | Stochastic latent dynamics scale efficiently |
| Meta-MBRL (Clavera et al., 2018) | State-space, meta-learned | Adaptive dynamics rollouts | Fast re-fitting across tasks | Robustness to dynamics shift over peak single-task performance |
Read column by column, the table shows that the field has not converged on a single answer. The value-equivalent methods (MuZero, EfficientZero) dominate the most sample-constrained regimes by refusing to model observations at all, while the generative methods (IRIS, SimPLe, DIAMOND, STORM) accept the cost of pixel-level prediction to capture rich visual dynamics and to inherit scaling tools from generative modeling. MBPO and meta-MBRL sit apart again, using the model not for imagination-based control but to bound error and to adapt across tasks. As noted in the generative-models discussion above, a definitive head-to-head comparison across matched compute and data budgets remains open, so the table positions the methods by design intent rather than by a single performance ranking.
References
- Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret (2024). Diffusion for World Modeling: Visual Details Matter in Atari. NeurIPS.
- Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel (2018). Model-Based Reinforcement Learning via Meta-Policy Optimization. CoRL.
- Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS.
- Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos (2020). Model Based Reinforcement Learning for Atari. ICLR.
- Vincent Micheli, Eloi Alonso, Francois Fleuret (2023). Transformers are Sample-Efficient World Models. ICLR.
- Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu (2017). Neural Discrete Representation Learning. NeurIPS.
- Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature.
- David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature.
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2018). A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play. Science.
- Richard S. Sutton (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ICML.
- Emanuel Todorov, Tom Erez, Yuval Tassa (2012). MuJoCo: A physics engine for model-based control. IROS.
- Jialong Wu, Haoyu Ma, Chaoyi Deng, Mingsheng Long (2024). Pre-training Contextualized World Models with In-the-Wild Videos for Reinforcement Learning. NeurIPS.
- Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao (2021). Mastering Atari Games with Limited Data. NeurIPS.
- Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, Gao Huang (2023). STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning. NeurIPS.