Taxonomy of Approaches

World models can be organized along several orthogonal dimensions, and understanding these dimensions is essential for navigating the rapidly expanding literature.

By prediction space:

  • Pixel-space models (SimPLe (Kaiser et al., 2020), DIAMOND (Alonso et al., 2024), GameNGen (Valevski et al., 2024)): Predict raw observations directly. These models must capture every visual detail, which is computationally expensive but yields predictions that are easy to inspect and evaluate. Pixel-space models benefit from advances in generative modeling (diffusion, autoregressive), since higher visual fidelity tends to translate into better world model quality. The historical trajectory spans from early action-conditioned predictors [@oh2015action, @finn2016unsupervised] through stochastic video models [@denton2018stochastic, @babaeizadeh2021fitvid] to modern diffusion-based approaches [@alonso2024diffusion, @valevski2024diffusion]. The computational cost is significant: diffusion models need several denoising network evaluations per frame (DIAMOND reports using 3), whereas VAE-based models generate a frame in a single pass. Pixel-space models excel in domains where visual fidelity matters for downstream tasks (e.g., object recognition from predicted frames, or human evaluation of simulation quality), but struggle with high-resolution environments where the dimensionality of the prediction target makes learning inefficient.
  • Latent-space models (PlaNet, Dreamer series, MuZero, TD-MPC): Predict in learned abstract state spaces. These models learn compressed representations that retain task-relevant information while discarding irrelevant visual details, enabling more efficient planning and learning. PlaNet (Hafner et al., 2019) established the foundational RSSM architecture, which combines a deterministic path (for stable long-horizon prediction) with stochastic latent variables (for capturing uncertainty). The Dreamer series [@hafner2019learning, @hafner2020dream, @hafner2021mastering, @hafner2023mastering] is the most mature lineage of latent-space world models, evolving the RSSM from a GRU-based model with continuous Gaussian latents (v1), through discrete categorical latents (v2) (Hafner et al., 2021), to a symlog-normalized architecture (v3) (Hafner et al., 2023) that handles reward scales spanning 6 orders of magnitude across different domains. MuZero (Schrittwieser et al., 2020) and TD-MPC [@hansen2022temporal, @hansen2024tdmpc2] represent value-equivalent and planning-optimized variants. The key advantage of latent-space models is computational efficiency: a 64-dimensional latent state has roughly 190 times fewer values to predict than a 64x64x3 pixel observation (12,288 values).
  • Hybrid models (JEPA, V-JEPA, STORM, EfficientZero): Combine latent prediction with selective pixel or representation decoding. These models aim to balance the efficiency of latent prediction with the interpretability and fidelity of observation-space prediction. EfficientZero (Ye et al., 2021) exemplifies this hybrid approach: it uses latent dynamics with an auxiliary self-supervised temporal-consistency loss, which aligns the predicted next latent state with the encoding of the actual next observation and improves representation quality without requiring pixel-accurate prediction. V-JEPA (Bardes et al., 2024) represents a different hybrid strategy: predicting in a learned representation space (like latent models) but using a self-supervised objective that preserves information relevant to downstream understanding tasks (unlike MuZero's value-centric approach). However, JEPA-style self-supervised objectives can be vulnerable to representation collapse, where the encoder maps all inputs to a constant or otherwise uninformative embedding and the objective is satisfied trivially.
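The symlog normalization mentioned for DreamerV3 can be illustrated with a minimal sketch. It assumes the standard definition symlog(x) = sign(x) ln(1 + |x|) with inverse symexp(y) = sign(y) (exp(|y|) - 1); the reward values are hypothetical and chosen only to span several orders of magnitude.

```python
import math


def symlog(x: float) -> float:
    """Compress magnitude symmetrically around zero: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


# Hypothetical rewards spanning many orders of magnitude (illustration only).
rewards = [-1000.0, -1.0, 0.0, 0.001, 1.0, 1_000_000.0]

for r in rewards:
    z = symlog(r)
    # After the transform, all targets live on a comparable scale,
    # and symexp recovers the original value.
    print(f"reward={r:>12g}  symlog={z:8.4f}  roundtrip={symexp(z):g}")
```

Because the transform is approximately linear near zero and logarithmic for large magnitudes, a network can regress small and large targets with a single loss scale, and predictions are mapped back with `symexp`.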

By generative paradigm:

  • Deterministic models (SimPLe, early dynamics models): Produce single-point predictions of future states. Simple to train but fundamentally unable to represent the multimodality of future outcomes. When the environment is stochastic (as most real environments are), deterministic models average over possible futures, producing blurry or physically implausible predictions that degrade rapidly over long horizons.
  • Stochastic models (PlaNet, Dreamer series, SVG): Use latent variables (typically trained with a VAE-style variational objective) to model distributions over future states, capturing the inherent uncertainty and multimodality of the environment. The RSSM architecture, combining deterministic and stochastic components, is the dominant paradigm. The deterministic path provides a "highway" for gradient flow and stable long-horizon prediction, while the stochastic path captures the unpredictable aspects of the environment. The KL divergence between the posterior (inferred after seeing the observation) and the prior (predicted before seeing it) measures how much the observation surprised the model: a high KL indicates an unexpected observation. Related uncertainty signals in latent space have been used to drive exploration (Sekar et al., 2020).
  • Autoregressive models (VideoGPT, IRIS, STORM, GAIA-1): Model sequences of discrete tokens autoregressively, treating world simulation as sequence prediction [@yan2021videogpt, @micheli2023transformers, @zhang2025storm_wm, @hu2023gaia1]. These models leverage the powerful scaling properties of autoregressive transformers, and the connection to language modeling enables transfer of architectural insights. The tokenization strategy (VQ-VAE codebook size, patch size, temporal resolution) is a critical design choice: a larger codebook and more tokens per frame capture more visual detail, but a larger codebook enlarges the output vocabulary, and more tokens per frame (smaller patches or finer temporal resolution) lengthen the sequence the transformer must process.
  • Diffusion-based models (DIAMOND (Alonso et al., 2024), GameNGen (Valevski et al., 2024), Sora (OpenAI, 2024)): Generate predictions through iterative denoising, producing the highest visual fidelity at the cost of multiple forward passes per prediction step. The denoising process can be viewed as iterative refinement, where each step corrects the prediction at a different noise level (coarse structure at high noise, fine details at low noise). In diffusion-based world models, the diversity of samples can be used as a rough uncertainty signal, though sample diversity is not equivalent to calibrated uncertainty and should be interpreted with caution.
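The KL-based surprise signal described for stochastic models can be made concrete with a small sketch. It assumes diagonal Gaussian prior and posterior distributions (as in PlaNet-style continuous latents; models with categorical latents would use the categorical KL instead), and the function name and numbers are hypothetical.

```python
import math
from typing import Sequence


def kl_diag_gaussians(
    mu_q: Sequence[float],
    std_q: Sequence[float],
    mu_p: Sequence[float],
    std_p: Sequence[float],
) -> float:
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions.

    Here q plays the role of the posterior (after seeing the observation)
    and p the prior (the model's prediction before seeing it).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2.0 * sp**2) - 0.5
    return total


# Hypothetical 2-dimensional latent: the prior predicts a state near the origin.
prior_mu, prior_std = [0.0, 0.0], [1.0, 1.0]

# Expected observation: the posterior barely moves away from the prior.
expected = kl_diag_gaussians([0.1, 0.0], [0.9, 1.0], prior_mu, prior_std)

# Surprising observation: the posterior shifts far from what the prior predicted.
surprising = kl_diag_gaussians([3.0, -2.0], [0.5, 0.5], prior_mu, prior_std)

print(f"KL (expected observation):   {expected:.4f}")
print(f"KL (surprising observation): {surprising:.4f}")
```

The larger value for the second case is what makes the quantity usable as a surprise signal; whether it is a good exploration bonus in a given environment is a separate empirical question.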

By application domain:

  • Game playing and Atari (MuZero (Schrittwieser et al., 2020), EfficientZero (Ye et al., 2021), SimPLe (Kaiser et al., 2020), IRIS (Micheli et al., 2023), DIAMOND (Alonso et al., 2024)): The most thoroughly benchmarked domain, with a standardized evaluation protocol and clear comparison across methods. Atari 100k limits the agent to 100k environment steps; with the standard frame-skip of 4, this corresponds to 400k game frames, approximately 2 hours of real-time play. The benchmark has driven rapid progress: mean human-normalized scores rose from approximately 0.35 for SimPLe (Kaiser et al., 2020) to approximately 1.12 for DreamerV3 (Hafner et al., 2023) and 1.46 for DIAMOND (Alonso et al., 2024).
  • Continuous control and robotics (Dreamer series, TD-MPC, DayDreamer): Control tasks requiring precise physical prediction, from locomotion to manipulation. DayDreamer (Wu et al., 2023) demonstrated that DreamerV2 can learn real-world robotic locomotion and manipulation from scratch, learning to walk in 1 hour of real interaction, a landmark result for sample-efficient real-world RL.
  • Autonomous driving (GAIA-1 (Hu et al., 2023), MILE (Hu et al., 2022), OccWorld (Zheng et al., 2024), Cosmos (NVIDIA, 2024)): Safety-critical applications requiring both high fidelity and reliable uncertainty estimation. GAIA-1 (Hu et al., 2023) generates diverse driving scenarios from language and action conditioning, while MILE (Hu et al., 2022) uses multi-modal imitation learning with a latent world model. OccWorld (Zheng et al., 2024) predicts future 3D occupancy grids for planning.
  • Video generation and prediction (SVG, FitVid, Video Diffusion, Sora (OpenAI, 2024)): Video prediction as a stepping stone toward general world modeling, with evaluation focused on visual quality metrics (FVD, LPIPS, SSIM) rather than downstream decision-making. The convergence of video generation and world modeling is a defining trend of the field.
  • Reasoning and language-grounded dynamics (Dynalang, WorldGPT, LLM-as-world-model): Connecting world models to language understanding, enabling compositional reasoning about dynamics. Dynalang (Lin et al., 2024) trains a multimodal world model on language and visual observations jointly, enabling the agent to follow language instructions and reason about linguistically described goals.
  • Interactive environments (Genie (Bruce et al., 2024), UniSim (Yang et al., 2023), GameNGen (Valevski et al., 2024)): Foundation-scale models that generate interactive worlds from video data. They represent the frontier of world modeling because they create entire environments rather than predicting within existing ones.
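The Atari 100k arithmetic and the human-normalized score used in the comparison above can be checked with a short sketch. It assumes the usual normalization (agent - random) / (human - random) and an emulator rate of 60 frames per second; the raw scores in the example are hypothetical and do not come from any paper.

```python
from typing import Dict, Tuple


def atari_100k_hours(agent_steps: int = 100_000, frame_skip: int = 4, fps: float = 60.0) -> float:
    """Real-time play equivalent of the interaction budget, in hours."""
    frames = agent_steps * frame_skip  # 100k steps * 4 = 400k game frames
    return frames / fps / 3600.0


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """0.0 corresponds to a random policy, 1.0 to the human reference score."""
    return (agent - random) / (human - random)


def mean_hns(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Mean human-normalized score over games; values are (agent, random, human)."""
    per_game = [human_normalized_score(a, r, h) for a, r, h in scores.values()]
    return sum(per_game) / len(per_game)


# Hypothetical raw scores for two made-up games (illustration only).
example = {
    "game_a": (150.0, 50.0, 250.0),   # halfway between random and human
    "game_b": (900.0, 100.0, 500.0),  # twice the human margin over random
}

print(f"Budget: {atari_100k_hours():.2f} hours of play")
print(f"Mean HNS: {mean_hns(example):.2f}")
```

Because the mean is taken over per-game ratios, a few games with very large normalized scores can dominate it, which is one reason benchmark papers often report the median alongside the mean.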

Across these domains the recurring tension is fidelity versus controllability. Games and continuous control reward compact, decision-relevant predictions and sample efficiency over photorealism. Autonomous driving demands both high visual fidelity and calibrated uncertainty under safety constraints. Video generation and interactive environments push fidelity and long-horizon coherence, while controllability and grounding remain open challenges.

By level of structure:

  • Monolithic models: Learn a single, undifferentiated dynamics function that maps entire states to entire states. Most current models fall in this category.
  • Object-centric models (C-SWM, SlotFormer): Explicitly decompose scenes into objects and model their individual dynamics and interactions [@kipf2020contrastive, @wu2023slotformer, @locatello2020object]. These models offer compositional generalization but require solving the unsupervised object discovery problem. Physics-aware graph networks [@battaglia2016interaction, @sanchez2020learning] provide the interaction modeling backbone.
  • Physics-informed models: Incorporate physical priors (conservation laws, symmetries, differential equations) into the dynamics model structure, improving generalization and physical plausibility.
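The contrast between monolithic and object-centric dynamics can be sketched in a few lines. The example below is a toy illustration in the spirit of interaction-style models, not the method of any cited paper: each object's update is the sum of pairwise effects computed by one shared function. The hand-written spring effect, the 1-D state layout, and all names are assumptions made for illustration; in a learned model the effect function would be a neural network.

```python
from typing import Callable, List, Tuple

# State of one object in a 1-D world: (position, velocity). Hypothetical layout.
State = Tuple[float, float]


def spring_effect(receiver: State, sender: State, stiffness: float = 1.0) -> float:
    """Hand-written stand-in for a learned pairwise effect: a linear spring force."""
    return stiffness * (sender[0] - receiver[0])


def step(
    objects: List[State],
    effect: Callable[[State, State], float] = spring_effect,
    dt: float = 0.1,
) -> List[State]:
    """Advance every object by summing the shared pairwise effect over all others."""
    next_objects: List[State] = []
    for i, (pos, vel) in enumerate(objects):
        force = sum(effect(objects[i], objects[j]) for j in range(len(objects)) if j != i)
        new_vel = vel + dt * force
        next_objects.append((pos + dt * new_vel, new_vel))
    return next_objects


scene = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(step(scene))

# The same function handles a scene with a different number of objects,
# which a fixed-size monolithic state-to-state map cannot do directly.
print(step(scene + [(5.0, 0.0)]))
```

Two properties of this structure are worth noting: permuting the input objects permutes the outputs in the same way, and because the spring effect is antisymmetric, total momentum is conserved, a simple example of a physical prior built into the model structure.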

The following table situates representative methods along the dimensions above, making the cross-cutting tradeoffs explicit:

| Method | Prediction space | Generative paradigm | Primary domain | Central tradeoff |
|---|---|---|---|---|
| DreamerV3 (Hafner et al., 2023) | Latent | Stochastic | Control, games | Efficiency and generality over pixel fidelity |
| MuZero (Schrittwieser et al., 2020) | Latent (value-equivalent) | Deterministic | Games | Planning accuracy over reconstruction |
| IRIS (Micheli et al., 2023) | Latent tokens | Autoregressive | Games | Sequence-modeling scalability over per-frame fidelity |
| DIAMOND (Alonso et al., 2024) | Pixel | Diffusion | Games | Visual fidelity over inference cost |
| GAIA-1 (Hu et al., 2023) | Pixel tokens | Autoregressive | Driving | Scenario diversity over real-time control |
| Genie (Bruce et al., 2024) | Latent | Autoregressive | Interactive worlds | Generality and controllability over fidelity |

These dimensions are largely orthogonal: one can build a latent-space, diffusion-based, object-centric world model for robotics, for example, choosing one option along each of the four dimensions. The trend in the field is toward models that are latent-space (for efficiency), stochastic (for uncertainty), autoregressive or diffusion-based (for quality), and trained on broad data (for generality).
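One way to work with the taxonomy is to treat each method as a record with one field per dimension. The sketch below encodes the six rows of the table above; the class name, field names, and helper function are hypothetical, and the field values are copied from the table rather than re-derived.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorldModelEntry:
    """One row of the taxonomy table; field names are illustrative."""
    method: str
    prediction_space: str
    paradigm: str
    domain: str


TAXONOMY: List[WorldModelEntry] = [
    WorldModelEntry("DreamerV3", "Latent", "Stochastic", "Control, games"),
    WorldModelEntry("MuZero", "Latent (value-equivalent)", "Deterministic", "Games"),
    WorldModelEntry("IRIS", "Latent tokens", "Autoregressive", "Games"),
    WorldModelEntry("DIAMOND", "Pixel", "Diffusion", "Games"),
    WorldModelEntry("GAIA-1", "Pixel tokens", "Autoregressive", "Driving"),
    WorldModelEntry("Genie", "Latent", "Autoregressive", "Interactive worlds"),
]


def select(entries: List[WorldModelEntry], **criteria: str) -> List[str]:
    """Return method names whose fields contain every given substring (case-insensitive)."""
    return [
        e.method
        for e in entries
        if all(value.lower() in getattr(e, field).lower() for field, value in criteria.items())
    ]


# Each dimension can be queried independently or in combination.
print(select(TAXONOMY, paradigm="autoregressive"))
print(select(TAXONOMY, prediction_space="latent", domain="games"))
```

Querying one field at a time while leaving the others free mirrors the claim that the dimensions vary largely independently of one another.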


References