Taxonomy of Approaches

World models can be organized along several orthogonal dimensions, and understanding these dimensions is essential for navigating the rapidly expanding literature.

By prediction space:

  • Pixel-space models (SimPLe, DIAMOND, GameNGen): Predict raw observations directly. These models must capture every visual detail, which is computationally expensive but produces interpretable, evaluable predictions. Pixel-space models benefit from advances in generative modeling (diffusion, autoregressive), as higher visual fidelity directly translates to better world model quality. The historical trajectory spans from early action-conditioned predictors [@oh2015action, @finn2016unsupervised] through stochastic video models [@denton2018stochastic, @babaeizadeh2021fitvid] to modern diffusion-based approaches [@alonso2024diffusion, @valevski2024diffusion]. The computational cost is significant: DIAMOND requires ~10 diffusion steps per frame, while VAE-based models generate in a single pass. Pixel-space models excel in domains where visual fidelity matters for downstream tasks (e.g., object recognition from predicted frames, or human evaluation of simulation quality), but struggle with high-resolution environments where the dimensionality of the prediction target makes learning inefficient.
  • Latent-space models (PlaNet, Dreamer series, MuZero, TD-MPC): Predict in learned abstract state spaces. These models learn compressed representations that retain task-relevant information while discarding irrelevant visual details, enabling more efficient planning and learning. The Dreamer series [@hafner2019learning, @hafner2020dream, @hafner2021mastering, @hafner2023mastering] represents the most mature lineage of latent-space world models, evolving the RSSM from GRU-based (v1) through discrete latent spaces (v2) to a symlog-normalized architecture (v3) that handles reward scales spanning 6 orders of magnitude across different domains. MuZero (Schrittwieser et al., 2020) and TD-MPC [@hansen2022temporal, @hansen2024tdmpc2] demonstrate the value-equivalent and planning-optimized variants. PlaNet (Hafner et al., 2019) established the foundational RSSM architecture combining deterministic paths (for stable long-horizon prediction) with stochastic latent variables (for capturing uncertainty). The key advantage of latent-space models is computational efficiency: a 64-dimensional latent state is orders of magnitude cheaper to predict than a 64x64x3 pixel observation.
  • Hybrid models (JEPA, V-JEPA, STORM): Combine latent prediction with selective pixel or representation decoding. These models aim to balance the efficiency of latent prediction with the interpretability and fidelity of observation-space prediction. EfficientZero (Ye et al., 2021) exemplifies this hybrid approach by using latent dynamics with an auxiliary observation reconstruction loss that improves representation quality without requiring pixel-accurate prediction. V-JEPA (Bardes et al., 2024) represents a different hybrid strategy: predicting in a learned representation space (like latent models) but using a self-supervised objective that preserves information relevant to downstream understanding tasks (unlike MuZero's value-centric approach).
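The cost asymmetry between pixel-space and latent-space prediction can be made concrete with a toy rollout. This is an illustrative sketch, not any published architecture: random linear maps stand in for learned encoder, dynamics, and decoder networks, and the point is simply that the model encodes once, steps cheaply in a 64-dimensional latent space, and decodes to pixels only when a frame is actually needed.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, LATENT_DIM, ACTION_DIM = 64 * 64 * 3, 64, 4

# Random linear maps stand in for learned networks (illustration only).
W_enc = rng.normal(0, 0.01, (LATENT_DIM, OBS_DIM))                   # encoder
W_dyn = rng.normal(0, 0.01, (LATENT_DIM, LATENT_DIM + ACTION_DIM))   # latent dynamics
W_dec = rng.normal(0, 0.01, (OBS_DIM, LATENT_DIM))                   # decoder

def rollout(obs, actions, decode_last=True):
    """Encode once, predict entirely in latent space, decode only if asked."""
    z = W_enc @ obs
    for a in actions:
        z = np.tanh(W_dyn @ np.concatenate([z, a]))  # cheap 64-dim step
    return W_dec @ z if decode_last else z

obs = rng.normal(size=OBS_DIM)
actions = [rng.normal(size=ACTION_DIM) for _ in range(10)]
latent = rollout(obs, actions, decode_last=False)
assert latent.shape == (LATENT_DIM,)
```

A pixel-space model would instead have to regenerate all 12,288 observation dimensions at every step of the rollout, which is the efficiency gap the bullet above describes.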

By generative paradigm:

  • Deterministic models (SimPLe, early dynamics models): Produce single-point predictions of future states. Simple to train but fundamentally unable to represent the multimodality of future outcomes. When the environment is stochastic (as most real environments are), deterministic models average over possible futures, producing blurry or physically implausible predictions that degrade rapidly over long horizons.
  • Stochastic models (PlaNet, Dreamer series, SVG): Use latent variables (typically via VAE) to model distributions over future states, capturing the inherent uncertainty and multimodality of the environment. The RSSM architecture, combining deterministic and stochastic components, is the dominant paradigm. The deterministic path provides a "highway" for gradient flow and stable long-horizon prediction, while the stochastic path captures the unpredictable aspects of the environment. The KL divergence between the prior and posterior distributions quantifies the model's uncertainty: high KL indicates unexpected observations, serving as a natural signal for exploration (Sekar et al., 2020).
  • Autoregressive models (VideoGPT, IRIS, STORM, GAIA-1): Model sequences of discrete tokens autoregressively, treating world simulation as sequence prediction [@yan2021videogpt, @micheli2023transformers, @zhang2025storm_wm, @hu2023gaia1]. These models leverage the powerful scaling properties of autoregressive transformers, and the connection to language modeling enables transfer of architectural insights. The tokenization strategy (VQ-VAE codebook size, patch size, temporal resolution) is a critical design choice: larger codebooks capture more visual detail but increase the sequence length that the transformer must process.
  • Diffusion-based models (DIAMOND, GameNGen, Sora): Generate predictions through iterative denoising, producing the highest visual fidelity at the cost of multiple forward passes per prediction step. The denoising process can be viewed as iterative refinement, where each step corrects the prediction at a different noise level (coarse structure at high noise, fine details at low noise). Diffusion-based world models can also provide calibrated uncertainty estimates through the diversity of samples at different noise levels.
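The prior-posterior KL divergence described for stochastic models is easy to illustrate with diagonal Gaussians. This is a minimal sketch with hand-picked numbers; in a real RSSM both distributions are produced by learned networks at every time step.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Prior: what the dynamics model predicted before seeing the observation.
mu_p, logvar_p = np.zeros(8), np.zeros(8)

# Posterior after an expected observation: close to the prior -> small KL.
kl_expected = kl_diag_gaussians(np.full(8, 0.1), np.zeros(8), mu_p, logvar_p)

# Posterior after a surprising observation: far from the prior -> large KL.
kl_surprise = kl_diag_gaussians(np.full(8, 3.0), np.zeros(8), mu_p, logvar_p)

assert kl_surprise > kl_expected  # KL magnitude tracks surprise
```

The same quantity that regularizes training thus doubles as the exploration signal mentioned above: states where the posterior diverges sharply from the prior are exactly the states the model did not anticipate.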

By application domain:

  • Game playing and Atari (MuZero, EfficientZero, SimPLe, IRIS, DIAMOND): The most well-benchmarked domain, with standardized evaluation protocols (Atari 100k, which limits the agent to 100k environment steps -- approximately 2 hours of real-time play) and clear comparison across methods. The Atari 100k benchmark has driven rapid progress: from SimPLe's mean human-normalized score of 0.44 (2020) to DreamerV3's 1.19 (2023) to DIAMOND's 1.46 (2024).
  • Continuous control and robotics (Dreamer series, TD-MPC, DayDreamer): Control tasks requiring precise physical prediction, from locomotion to manipulation. DayDreamer (Wu et al., 2023) demonstrated that DreamerV2 can learn real-world robotic locomotion and manipulation from scratch, learning to walk in 1 hour of real interaction -- a landmark result for sample-efficient real-world RL.
  • Autonomous driving (GAIA-1, MILE, OccWorld, Cosmos): Safety-critical applications requiring both high fidelity and reliable uncertainty estimation. GAIA-1 (Hu et al., 2023) generates diverse driving scenarios from language and action conditioning, while MILE (Hu et al., 2022) uses model-based imitation learning with a latent world model. OccWorld (Zheng et al., 2024) predicts future 3D occupancy grids for planning.
  • Video generation and prediction (SVG, FitVid, Video Diffusion, Sora): Video prediction as a stepping stone toward general world modeling, with evaluation focused on visual quality metrics (FVD, LPIPS, SSIM) rather than downstream decision-making. The convergence of video generation and world modeling is a defining trend of the field.
  • Reasoning and language-grounded dynamics (Dynalang, WorldGPT, LLM-as-world-model): Connecting world models to language understanding, enabling compositional reasoning about dynamics. Dynalang (Lin et al., 2024) trains a multimodal world model on language and visual observations jointly, enabling the agent to follow language instructions and reason about linguistically described goals.
  • Interactive environments (Genie, UniSim, GameNGen): Foundation-scale models that generate interactive worlds from video data, representing the frontier of world modeling -- creating entire environments rather than predicting within existing ones.
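The human-normalized scores used in the Atari 100k comparisons above follow a standard per-game formula: the agent's raw score is rescaled so that 0 corresponds to a random policy and 1 to the human reference. The per-game constants below are placeholders for illustration, not the published reference values.

```python
def human_normalized_score(agent, random_score, human_score):
    """Standard Atari normalization: 0 = random play, 1 = human level."""
    return (agent - random_score) / (human_score - random_score)

# Placeholder per-game reference values (illustration only).
random_score, human_score = 1.7, 30.5

# An agent scoring exactly halfway between random and human play.
assert abs(human_normalized_score(16.1, random_score, human_score) - 0.5) < 1e-9
```

Benchmark tables then aggregate this quantity across the 26 games, typically reporting both the mean (sensitive to outlier games) and the median (more robust).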

By the level of structure:

  • Monolithic models: Learn a single, undifferentiated dynamics function that maps entire states to entire states. Most current models fall into this category.
  • Object-centric models (C-SWM, SlotFormer): Explicitly decompose scenes into objects and model their individual dynamics and interactions [@kipf2020contrastive, @wu2023slotformer, @locatello2020object]. These models offer compositional generalization but require solving the unsupervised object discovery problem. Physics-aware graph networks [@battaglia2016interaction, @sanchez2020learning] provide the interaction modeling backbone.
  • Physics-informed models: Incorporate physical priors (conservation laws, symmetries, differential equations) into the dynamics model structure, improving generalization and physical plausibility.
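A minimal object-centric step in the interaction-network style [@battaglia2016interaction] can be sketched as follows. Random linear maps stand in for the learned relation and update functions (illustration only): each object's next state depends on its own state plus the aggregated pairwise effects of every other object.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBJ, STATE_DIM = 3, 4  # e.g., (x, y, vx, vy) per object

# Random linear maps stand in for learned relation/update networks (sketch).
W_rel = rng.normal(0, 0.1, (STATE_DIM, 2 * STATE_DIM))
W_upd = rng.normal(0, 0.1, (STATE_DIM, 2 * STATE_DIM))

def step(states):
    """One interaction-network-style update: aggregate pairwise effects."""
    nxt = np.empty_like(states)
    for i in range(N_OBJ):
        # Sum the effect of every other object on object i.
        effect = sum(
            np.tanh(W_rel @ np.concatenate([states[i], states[j]]))
            for j in range(N_OBJ) if j != i
        )
        nxt[i] = states[i] + W_upd @ np.concatenate([states[i], effect])
    return nxt

states = rng.normal(size=(N_OBJ, STATE_DIM))
assert step(states).shape == (N_OBJ, STATE_DIM)
```

Because the same relation function is shared across all object pairs, a model of this shape can in principle generalize to scenes with more objects than it was trained on, which is the compositional advantage the bullet above refers to.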

These dimensions are largely orthogonal: one can build a latent-space, stochastic, object-centric, diffusion-based world model for robotics, for example. The trend in the field is toward models that are latent-space (for efficiency), stochastic (for uncertainty), autoregressive or diffusion-based (for quality), and trained on broad data (for generality).

