Introduction & Motivation

The concept of a world model, an internal representation of how the environment works, lies at the heart of intelligent behavior. Humans do not interact with the world purely through trial and error; instead, we simulate potential actions and their consequences in our minds before committing to a decision. This capacity for mental simulation, or "model-based" reasoning, is what allows us to plan, generalize, and adapt efficiently to novel situations [@craik1943nature, @johnsonlaird1983mental]. A child who has never pushed a glass off a table can predict that it will fall and break, because they have an internal model of gravity, fragility, and the consequences of actions on objects.

In artificial intelligence, world models formalize this idea: an agent learns a predictive model of its environment's dynamics, enabling it to plan and make decisions by "imagining" the outcomes of potential actions without executing them in the real world [@ha2018world, @hafner2019learning]. The idea has deep roots in model-based reinforcement learning, from the Dyna architecture (Sutton, 1990) through Gaussian process dynamics models (Deisenroth & Rasmussen, 2011) to modern neural network-based approaches [@nagabandi2018neural, @chua2018deep]. The potential is transformative: a robot that can accurately predict the consequences of its actions in its "mind" before executing them can, in principle, learn orders of magnitude more efficiently than one that must learn purely through physical trial and error. For a comprehensive survey of model-based RL, see Moerland et al. (2023).
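
The interplay between acting and "imagining" can be made concrete with a small tabular sketch in the spirit of Dyna. This is a minimal illustration under stated assumptions, not Sutton's original formulation: the chain environment `step`, the hyperparameters, and the deterministic lookup-table model are all hypothetical choices made for brevity.

```python
import random


def step(state, action, n_states=6):
    """Hypothetical chain world: action 1 moves right, action 0 moves left.
    Reaching the right end yields reward 1 and ends the episode."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done


def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.95,
           eps=0.1, n_states=6, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    model = {}  # learned model (assumed deterministic): (s, a) -> (s', r, done)

    def update(s, a, r, s2, done):
        # One-step Q-learning backup, used for real and imagined transitions.
        target = r if done else r + gamma * max(q[(s2, 0)], q[(s2, 1)])
        q[(s, a)] += alpha * (target - q[(s, a)])

    def greedy(s):
        best = max(q[(s, 0)], q[(s, 1)])
        return rng.choice([a for a in (0, 1) if q[(s, a)] == best])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice((0, 1)) if rng.random() < eps else greedy(s)
            s2, r, done = step(s, a, n_states)
            update(s, a, r, s2, done)       # learn from real experience
            model[(s, a)] = (s2, r, done)   # refine the world model
            for _ in range(planning_steps):  # learn from imagined experience
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q
```

Each real transition is used twice: once to update the value estimates directly, and once to refine the model, which then supplies `planning_steps` additional imagined updates at no extra environment cost.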

The seminal work of Ha and Schmidhuber (2018) crystallized the modern vision of world models for deep learning, building on Schmidhuber's earlier theoretical foundations for learning world models through compression and prediction (Schmidhuber, 2015). Their three-component architecture, comprising a visual encoder (VAE), a dynamics model (MDN-RNN), and a compact controller, established the template that subsequent work has refined and scaled. A key insight was that the controller could be remarkably small (fewer than 1,000 parameters in their car-racing experiment) when paired with a good world model, because the world model provides the structured representation that makes control tractable.
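
To see why such a controller can be so small, consider a single linear layer that maps the encoder's latent code `z` and the dynamics model's hidden state `h` to an action. The sketch below counts its parameters and runs one forward pass. The dimensions (32, 256, 3) are assumed here for illustration, recalled from the car-racing setup rather than stated in this chapter, and the `tanh` squashing and random initialization are likewise assumptions.

```python
import math
import random

Z_DIM, H_DIM, A_DIM = 32, 256, 3  # assumed latent, hidden-state, action sizes


def controller_param_count(z_dim, h_dim, a_dim):
    """Weights plus biases of one linear layer mapping [z, h] to an action."""
    return (z_dim + h_dim) * a_dim + a_dim


def make_controller(z_dim, h_dim, a_dim, seed=0):
    """Randomly initialized linear controller (illustrative initialization)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(z_dim + h_dim)]
               for _ in range(a_dim)]
    biases = [0.0] * a_dim
    return weights, biases


def act(controller, z, h):
    """a = tanh(W [z, h] + b): the controller only sees world-model features."""
    weights, biases = controller
    x = list(z) + list(h)
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


print(controller_param_count(Z_DIM, H_DIM, A_DIM))  # 867
```

Under these assumed sizes the count is (32 + 256) × 3 + 3 = 867. All of the representational work sits in the encoder and the dynamics model, which is what leaves so little for the controller to learn.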

The field builds on a rich history of model-based reasoning in AI and cognitive science. Craik (1943) first articulated the idea that organisms carry "small-scale models" of external reality in their heads, using them to reason about the world, predict events, and guide action. In reinforcement learning, the Dyna architecture (Sutton, 1990) formalized the interplay between real experience and simulated experience from a learned model. PILCO (Deisenroth & Rasmussen, 2011) demonstrated that Gaussian process dynamics models could achieve remarkable sample efficiency in continuous control tasks, learning to swing up and balance a cart-pole from fewer than 20 seconds of interaction. That result motivated the development of neural network-based world models capable of scaling to higher-dimensional observations. Nagabandi et al. (2018) bridged the gap between model-based and model-free methods by using learned neural network dynamics models for model-predictive control, while PETS (Chua et al., 2018) introduced probabilistic ensemble dynamics models that quantify uncertainty in predictions and plan under model uncertainty using trajectory sampling with the cross-entropy method (CEM). Read together, this lineage marks a steady shift from high-fidelity models of full dynamics toward models tuned for utility, predicting only enough of the environment to support sample-efficient planning, which is the fidelity-versus-utility tension that organizes this chapter.
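
Planning with an ensemble and CEM can be sketched on a toy one-dimensional problem. This is an illustrative simplification rather than the PETS implementation: the "ensemble" is a hypothetical list of scalar gains standing in for learned probabilistic networks, and the cost function, population size, and horizon are assumptions chosen to keep the example short.

```python
import random
import statistics


def make_ensemble(n_models=5, seed=0):
    """Hypothetical stand-ins for learned models: x' = x + gain * a.
    Disagreement between the gains plays the role of model uncertainty."""
    rng = random.Random(seed)
    return [1.0 + rng.gauss(0.0, 0.05) for _ in range(n_models)]


def expected_cost(x0, actions, ensemble, goal, rng, n_particles=10):
    """Trajectory sampling: each particle is rolled out through one randomly
    chosen ensemble member, and costs are averaged over particles."""
    total = 0.0
    for _ in range(n_particles):
        gain = rng.choice(ensemble)
        x = x0
        for a in actions:
            x = x + gain * a
            total += (x - goal) ** 2
    return total / n_particles


def cem_plan(x0, goal, ensemble, horizon=5, pop=64, n_elite=8, iters=5, seed=0):
    """Cross-entropy method over action sequences clipped to [-1, 1]."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [0.5] * horizon
    for _ in range(iters):
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(pop)
        ]
        # Rank candidates by imagined cost; keep the best as elites.
        candidates.sort(key=lambda acts: expected_cost(x0, acts, ensemble, goal, rng))
        elites = candidates[:n_elite]
        # Refit the sampling distribution to the elites.
        mean = [statistics.mean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elites)]
    return mean


plan = cem_plan(x0=0.0, goal=1.0, ensemble=make_ensemble())
```

In a model-predictive control loop, only the first action of `plan` would be executed before replanning from the newly observed state, which limits how far errors in the learned model can compound.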

The field has expanded dramatically since then, driven by five converging forces:

  1. Sample efficiency imperative. Model-free reinforcement learning often requires millions of environment interactions to learn even simple behaviors, making it impractical for many real-world applications (especially robotics, where each interaction is slow and costly). World models offer a path to orders-of-magnitude improvements in sample efficiency by enabling "learning in imagination," training policies on model-generated trajectories rather than real experience (Hafner et al., 2020), a strategy that recent transformer-based world models continue to push on benchmarks such as Atari 100k (Zhang et al., 2023).

  2. Generative modeling advances. The rapid progress in generative modeling (from VAEs and GANs to diffusion models and autoregressive transformers) has dramatically improved the fidelity of generated world simulations. Modern video generation models can produce photorealistic, temporally coherent sequences that serve as high-quality world simulators [@alonso2024diffusion, @valevski2024diffusion]. The evolution from early action-conditioned video prediction [@oh2015action, @finn2016unsupervised, @chiappa2017recurrent] to diffusion-based world models represents a qualitative leap in simulation fidelity.

  3. Foundation model paradigm. The success of large pre-trained models in NLP and vision has raised the tantalizing possibility of universal world models: foundation-scale models trained on broad data that can simulate diverse environments without domain-specific training [@bruce2024genie, @yang2023learning]. This line has scaled rapidly, with Genie 3 generating photorealistic, interactive worlds from text or image prompts (DeepMind, 2025).

  4. Autonomous driving and robotics. The autonomous driving industry has emerged as a major consumer and driver of world model research, using learned simulators for scenario generation, planning, and testing [@hu2023gaia1, @wang2024driving_world_survey]. Similarly, robotics has embraced world models for sim-to-real transfer and manipulation planning (Hansen et al., 2024).

  5. LLM-as-world-model hypothesis. An emerging perspective suggests that large language models, trained to predict the next token on vast text corpora, have implicitly learned world models, capturing the dynamics, physics, and causal structure of the world through the statistical patterns of language [@hao2023reasoning, @lecun2022path].

This chapter reviews the evolution of world models from classical approaches through modern foundation-scale systems. We organize the literature by application domain and architectural paradigm, covering latent-space and pixel-space models, video prediction, object-centric world models, foundation world models, robotics and autonomous driving applications, reasoning, and the intersection with reinforcement learning. Throughout, we trace the central tension between model fidelity (predicting everything accurately) and model utility (predicting only what matters for downstream tasks).

To keep coverage legible, we state the chapter's boundaries explicitly. In scope are learned predictive models of environment dynamics that an agent uses to plan, imagine, or learn, spanning latent-space and pixel-space models, video prediction models when used as world models, object-centric and foundation-scale world models, and their integration with reinforcement learning, robotics, autonomous driving, and reasoning. Out of scope are pure model-free reinforcement learning (covered only as a baseline for comparison), generic video and image generation that is not used or evaluated as a world model, and classical analytic system identification and physics-based simulators that are hand-specified rather than learned. Where these boundaries are crossed (for example, video generators such as Sora discussed through the lens of world modeling), we make the connection explicit rather than treating such systems as world models by default.


References