Open Problems & Future Directions
Long-Horizon Prediction and Compounding Errors
Current world models struggle with predictions beyond a few dozen steps, as errors compound over time (Ke et al., 2019). A small prediction error at step t propagates through subsequent predictions, driving the rollout away from reality. For a model with per-step error epsilon, the accumulated error after H steps grows as O(epsilon * H) in the best case (stable dynamics, where per-step errors add but are not amplified) and as O(epsilon * L^H) in the worst case, where L > 1 bounds how strongly the dynamics amplify perturbations (e.g., a Lipschitz constant of the learned transition function). This compounding is the fundamental limitation that motivates short-horizon methods such as MBPO (Janner et al., 2019), which uses truncated rollouts of at most 15 steps, and the latent-space prediction of MuZero (Schrittwieser et al., 2020), which avoids pixel-level compounding entirely.
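The two growth regimes can be seen in a toy linear system. The sketch below (all matrices and magnitudes are illustrative, not from any cited paper) injects a fixed one-step error into rollouts under stable dynamics (spectral radius below 1) and unstable dynamics (spectral radius above 1):

```python
import numpy as np

def rollout_error(A, eps, horizon):
    """Propagate a one-step prediction error of size eps through linear
    dynamics x' = A x. At each step the carried-over error is transformed
    by the dynamics and a fresh per-step error is added."""
    errors = []
    e = np.zeros(A.shape[0])
    for _ in range(horizon):
        e = A @ e          # previous error transformed by the dynamics
        e[0] += eps        # fresh one-step model error
        errors.append(np.linalg.norm(e))
    return errors

# Stable dynamics (eigenvalues < 1): error grows at most linearly, then saturates.
A_stable = np.array([[0.9, 0.0], [0.0, 0.8]])
# Unstable dynamics (eigenvalues > 1): error grows geometrically with horizon.
A_unstable = np.array([[1.2, 0.0], [0.0, 1.1]])

stable = rollout_error(A_stable, eps=0.01, horizon=30)
unstable = rollout_error(A_unstable, eps=0.01, horizon=30)
```

With these numbers the stable rollout's error saturates near eps / (1 - 0.9) = 0.1, while the unstable rollout's error exceeds 10 after 30 steps, despite an identical per-step error.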
Promising directions include:
- Hierarchical world models that operate at multiple temporal scales; for example, Director (Hafner et al., 2022) learns high-level goals that decompose long-horizon tasks into manageable sub-goals.
- Abstract world models that predict in compressed spaces where errors are less damaging (MuZero; JEPA (LeCun, 2022)).
- Error-aware planning that accounts for model uncertainty when making decisions, e.g., using ensemble disagreement or posterior uncertainty to downweight predictions from regions of high model uncertainty.
- Diffusion-based world models (Alonso et al., 2024), whose iterative refinement process may compound errors differently than autoregressive prediction.
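Error-aware planning with ensemble disagreement can be sketched as follows. This is a minimal illustration, not any cited system's implementation: the "ensemble" is three toy linear models, and candidate action sequences are scored by mean predicted return minus a penalty proportional to how much the members disagree along the rollout.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_member(w):
    """A toy ensemble member: 1-D linear dynamics with member-specific weight w."""
    return lambda s, a: w * s + a

# An "ensemble" of three slightly different learned dynamics models.
ensemble = [make_member(w) for w in (0.95, 1.00, 1.05)]

def score(s0, actions, reward_fn, beta=1.0):
    """Score an action sequence: mean predicted return minus a disagreement
    penalty. beta controls how strongly rollouts are downweighted when they
    visit states where the members disagree (a proxy for model uncertainty)."""
    total, penalty = 0.0, 0.0
    states = [s0] * len(ensemble)
    for a in actions:
        states = [f(s, a) for f, s in zip(ensemble, states)]
        total += np.mean([reward_fn(s) for s in states])
        penalty += np.std(states)   # ensemble disagreement at this step
    return total - beta * penalty

reward = lambda s: -abs(s - 1.0)    # toy task: prefer states near 1.0
plans = [rng.normal(0, 0.5, size=5) for _ in range(64)]
best = max(plans, key=lambda acts: score(0.0, list(acts), reward))
```

Random-shooting selection over the penalized score prefers plans whose predicted outcomes the ensemble members agree on, which is the essence of disagreement-based pessimism in model-based planning.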
Compositional and Object-Centric World Models
Real-world environments have compositional structure: objects can be combined, physical laws apply uniformly, and novel configurations can arise from familiar components. Learning world models with explicit compositional structure -- object-centric representations, relational reasoning, and modular dynamics [@kipf2020contrastive, @wu2023slotformer] -- remains an important challenge. Current foundation world models (Genie, Sora) learn monolithic models that do not explicitly represent objects, limiting their ability to generalize to novel object configurations. The integration of object discovery (slot attention, unsupervised segmentation) with dynamics modeling is a promising but technically challenging direction.
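The generalization benefit of modular dynamics comes from parameter sharing across object slots. The sketch below (toy hand-written rules, not a learned model) applies one shared per-slot update plus a pairwise interaction term, so the same "model" handles scenes with any number of objects unchanged — the compositional property that monolithic models lack:

```python
import numpy as np

def pairwise_interaction(slots):
    """Sum of simple pairwise effects: each slot is nudged away from every
    other slot (a stand-in for a learned relational module)."""
    out = np.zeros_like(slots)
    n = len(slots)
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i] += 0.1 * (slots[i] - slots[j])   # toy repulsion
    return out

def step(slots, velocity=0.05):
    """One dynamics step: a shared per-slot rule plus pairwise interactions.
    Because the rule is shared, it applies to any number of slots."""
    return slots + velocity + pairwise_interaction(slots)

three_objects = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
five_objects = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                         [1.0, 1.0], [0.5, 0.5]])

# Same model, different object counts -- no retraining required.
next3 = step(three_objects)
next5 = step(five_objects)
```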
Sim-to-Real Transfer
World models trained in simulation or on video data must ultimately transfer to real-world settings. The sim-to-real gap -- differences in visual appearance, physics, and dynamics between simulated and real environments -- continues to be a major obstacle (Zhao et al., 2020). While DayDreamer (Wu et al., 2023) demonstrated that world models can be trained directly on real robot data, the data efficiency is not yet sufficient for complex manipulation tasks. Foundation world models trained on diverse video data offer a potential solution: pre-training on broad data may produce models that transfer more easily to specific real-world settings.
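One standard mitigation for the sim-to-real gap is domain randomization: training the world model across a distribution of simulator configurations rather than a single one. A minimal sketch, with illustrative parameter names and ranges not tuned for any specific simulator:

```python
import random

# Illustrative randomization ranges; a real setup would tune these per robot.
RANDOMIZATION = {
    "friction":   (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "latency_ms": (0.0, 40.0),
}

def sample_sim_params(rng=random):
    """Sample one randomized physics configuration per training episode, so
    the world model sees a distribution of dynamics rather than a single
    simulator instance."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION.items()}

params = sample_sim_params()
```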
Scaling Laws for World Models
Unlike language models, where scaling laws are well-characterized [@kaplan2020scaling, @hoffmann2022training], the relationship between world model size, training data, and downstream task performance is poorly understood. TD-MPC2 (Hansen et al., 2024) provided initial evidence that world models follow favorable scaling laws in the robotics domain, but a comprehensive characterization across domains, architectures, and applications is missing. Understanding these scaling properties is critical for allocating resources effectively and predicting future capabilities.
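Characterizing such scaling laws typically means fitting a power law L(N) = a * N^(-b) to (model size, loss) pairs; the law is linear in log-log space, so ordinary least squares on the logs recovers the exponent. The data below is synthetic, generated from a known power law purely to illustrate the fitting procedure:

```python
import numpy as np

# Hypothetical (model size, validation loss) pairs: synthetic data following
# L(N) = 50 * N^(-0.2) with mild multiplicative noise, for illustration only.
rng = np.random.default_rng(0)
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
losses = 50.0 * sizes ** -0.2 * np.exp(rng.normal(0, 0.02, sizes.shape))

# log L = log a - b * log N, so a straight-line fit in log-log space
# recovers the coefficient a and the scaling exponent b.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
a_hat, b_hat = np.exp(intercept), -slope
```

The fitted exponent b_hat lands close to the true 0.2; with real measurements the open question is whether such a fit holds across domains and whether downstream task performance, not just prediction loss, follows it.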
Unifying Perception, Prediction, and Action
Current systems typically learn perception (encoders), prediction (dynamics models), and action (policies) as separate modules with separate objectives. A grand challenge is developing unified architectures that jointly optimize all three components, potentially achieving emergent capabilities not possible with modular designs. The JEPA framework (LeCun, 2022) proposes one path toward unification, and the success of end-to-end approaches in autonomous driving (e.g., combining perception and planning in a single model) provides empirical motivation.
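The structural idea — one shared encoder whose gradients come from perception, prediction, and action losses simultaneously — can be sketched with a forward pass. All weights and data here are random placeholders; the point is the shared-objective wiring, not the specific losses or any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heads over a shared 8-dim latent: decoder (perception), latent
# dynamics (prediction), and a 4-way policy (action).
W_enc = rng.normal(0, 0.1, (16, 8))
W_dec = rng.normal(0, 0.1, (8, 16))
W_dyn = rng.normal(0, 0.1, (9, 8))   # latent + scalar action -> next latent
W_pi  = rng.normal(0, 0.1, (8, 4))

def joint_loss(obs, next_obs, action, target_action):
    z = np.tanh(obs @ W_enc)                         # shared encoder
    recon = z @ W_dec                                # perception head
    z_pred = np.tanh(np.append(z, action) @ W_dyn)   # prediction head
    logits = z @ W_pi                                # action head
    l_perc = np.mean((recon - obs) ** 2)
    l_pred = np.mean((z_pred - np.tanh(next_obs @ W_enc)) ** 2)
    p = np.exp(logits - logits.max()); p /= p.sum()
    l_act = -np.log(p[target_action])
    # A single scalar objective: gradients from all three tasks would
    # shape the same encoder weights under joint training.
    return l_perc + l_pred + l_act

loss = joint_loss(rng.normal(size=16), rng.normal(size=16), 0.5, 2)
```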
World Models for Scientific Discovery
Beyond game playing and robotics, world models could serve as learned simulators for scientific domains where physics-based simulators are expensive or incomplete. Applications include weather prediction (where GraphCast (Lam et al., 2023) and FourCastNet (Pathak et al., 2022) have already shown the value of learned simulators, achieving competitive or superior accuracy compared to traditional numerical weather prediction at a fraction of the computational cost), molecular dynamics, protein folding, materials science, and climate modeling. The key question is whether foundation world models trained on video data encode sufficient physical understanding to be useful for scientific simulation, or whether domain-specific training data (from physics simulators) is required.
The World Model vs. Video Generator Debate
The rapid progress in video generation (Sora, Genie, Cosmos) has raised a fundamental question: is a high-fidelity video generator sufficient as a world model, or does it lack something essential? Arguments for sufficiency point to the impressive physical understanding demonstrated by these models. Arguments against point to systematic failures in edge cases, lack of explicit causal reasoning, and the inability to support counterfactual queries ("what would have happened if I had turned left instead of right?"). Resolving this debate is crucial for determining the research agenda: whether to focus on scaling video generation or on developing new architectures with explicit causal structure.
LeCun (2022) argued forcefully that video generators are not world models because they lack the ability to perform interventions and counterfactual reasoning -- two hallmarks of genuine causal understanding. A video generator can predict what typically follows a given scene, but cannot answer "what would happen if this specific variable changed while everything else remained the same." This distinction, rooted in Pearl's causal hierarchy (Pearl, 2009), suggests that additional architectural inductive biases (causal graphs, intervention mechanisms) may be necessary beyond what purely observational training provides.
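The observation/intervention gap can be made concrete with a three-variable structural causal model (a standard textbook construction, not from any cited paper): a confounder Z drives both X and Y, and X also drives Y. Conditioning on X observed near 1 inherits Z's influence; the do-intervention severs the Z -> X edge and exposes only X's causal effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def sample(do_x=None):
    """SCM: Z -> X, Z -> Y, X -> Y. Passing do_x severs the Z -> X edge
    (an intervention); conditioning on X after the fact does not."""
    z = rng.normal(0, 1, n)
    x = z + rng.normal(0, 0.1, n) if do_x is None else np.full(n, float(do_x))
    y = 2.0 * x + 3.0 * z + rng.normal(0, 0.1, n)
    return x, y

# Observational: E[Y | X near 1] picks up Z's confounding influence (near 5).
x, y = sample()
obs = y[np.abs(x - 1.0) < 0.05].mean()

# Interventional: E[Y | do(X = 1)] reflects only X's causal effect (near 2).
_, y_do = sample(do_x=1.0)
interv = y_do.mean()
```

A model trained purely on observational rollouts can only estimate the first quantity; answering the second requires representing, implicitly or explicitly, which edges an action severs.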
Safety and Reliability
World models used for planning in safety-critical domains (autonomous driving, medical robotics) must be reliable -- their predictions must be accurate enough that policies trained on them behave safely in the real world. Developing methods for quantifying world model uncertainty, detecting out-of-distribution inputs (where predictions are likely to be inaccurate), and providing safety guarantees for world-model-based planning is an essential prerequisite for deployment.
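One simple out-of-distribution check is a Mahalanobis distance on the model's latent features, thresholded on held-out in-distribution data. The sketch below uses Gaussian stand-ins for the latents (illustrative only; real latents would come from the world model's encoder, and the 99th-percentile threshold is an assumed design choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in latent features from the training distribution.
train_latents = rng.normal(0.0, 1.0, size=(5000, 8))

mu = train_latents.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_latents, rowvar=False))

def mahalanobis(z):
    """Distance of a latent z from the training distribution; large values
    flag inputs where the world model's predictions should not be trusted."""
    d = z - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Calibrate a threshold on held-out in-distribution data (99th percentile).
held_out = rng.normal(0.0, 1.0, size=(1000, 8))
threshold = np.percentile([mahalanobis(z) for z in held_out], 99)

in_dist = mahalanobis(rng.normal(0.0, 1.0, size=8))
out_dist = mahalanobis(rng.normal(6.0, 1.0, size=8))   # shifted input
```

A planner could refuse to act, or fall back to a conservative policy, whenever a rollout visits latents beyond the calibrated threshold.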
References
- Eloi Alonso, Adam Jelley, Anssi Kanervisto, Tim Pearce (2024). Diffusion for World Modeling: Visual Details Matter in Atari. NeurIPS.
- Danijar Hafner, Kuang-Huei Lee, Ian Fischer, Pieter Abbeel (2022). Deep Hierarchical Planning from Pixels. NeurIPS.
- Nicklas Hansen, Hao Su, Xiaolong Wang (2024). TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR.
- Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS.
- Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati (2019). Modeling the Long Term Future in Model-Based Reinforcement Learning. ICLR.
- Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Battaglia (2023). Learning Skillful Medium-Range Global Weather Forecasting. Science.
- Yann LeCun (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, Animashree Anandkumar (2022). FourCastNet: A Global Data-driven High-resolution Weather Forecasting Model using Adaptive Fourier Neural Operators. arXiv.
- Judea Pearl (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
- Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature.
- Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, Ken Goldberg (2023). DayDreamer: World Models for Physical Robot Learning. CoRL.
- Wenshuai Zhao, Jorge Pena Queralta, Tomi Westerlund (2020). Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey. IEEE Symposium Series on Computational Intelligence.