World Models for Reasoning

This section surveys the intersection of world models with reasoning, language, and planning. We focus on how world models can be used for abstract reasoning (rather than sensorimotor control), including the use of language as an interface to world models, LLMs as implicit world models, and causal structure learning for reasoning. Out of scope are vision-only world models without reasoning capabilities, purely model-free reasoning systems, and classical symbolic planners that do not learn.

An increasingly important direction connects world models to reasoning and language understanding. Rather than viewing world models purely as tools for RL-based control, this perspective sees world models as a substrate for general-purpose reasoning, enabling agents to reason about causation, predict consequences, and plan in abstract spaces. The strong form of this position is argued by LeCun's joint-embedding predictive architecture (JEPA) program (LeCun, 2022), which holds that a learned world model, not a policy or a language model alone, is the missing component for human-like reasoning and planning.

The subsections below are organized by where the world model lives relative to language. We distinguish four families: models for which language is the only signal ever seen (LLM-as-world-model), models in which language is fused into the world model as one observation modality among several (language-grounded world models), models built to recover explicit causal mechanisms rather than correlations (causal world models), and models invoked as an internal simulator during reasoning itself (reasoning as internal simulation). Two organizing principles separate these families: grounding modality (what signal the model learns from) and causal versus correlational structure (what kind of regularity it captures).

LLM-as-World-Model

An emerging and provocative perspective treats large language models themselves as world models that have implicitly learned dynamics of the world through text prediction (Hao et al., 2023; Guan et al., 2023). The argument is that by predicting the next token in vast text corpora (which describe physical processes, causal relationships, and sequential events), LLMs have been forced to build internal models of how the world works. Evidence for this view includes LLMs' ability to:

  • Predict the outcomes of physical interactions (e.g., "what happens if you drop a glass?")
  • Reason about spatial relationships and navigation
  • Simulate board games and simple physical systems step-by-step
  • Generate plausible continuations of action sequences

A compelling piece of evidence comes from Othello-GPT (Li et al., 2023), a GPT model trained purely on sequences of Othello moves (no board state information) whose internal representations nonetheless encode the board state. Li et al. recovered this representation with nonlinear (two-layer MLP) probes, showing that the model learned the game dynamics (legal moves, piece placement, and flipping rules) purely from move sequences. This demonstrates that sequence prediction can give rise to genuine world models, at least in structured domains. Follow-up work by Nanda et al. (2023) refined this picture by showing that the representation is in fact linear once probed in the right basis: rather than absolute "black" vs "white", the model encodes "my colour" vs "opponent colour", which a linear probe recovers cleanly. Crucially, this linear world representation is causal and controllable, since editing it via vector arithmetic on the activations changes the model's downstream predictions.
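The basis argument can be made concrete with a toy experiment. The sketch below does not use Othello-GPT activations; it builds synthetic three-dimensional "activations" for a single square under the assumption that the model linearly encodes two features, whose piece is on the square relative to the mover and whose turn it is. The directions `REL_DIR` and `TURN_DIR`, the noise level, and the SGD probe are all illustrative choices. Under these assumptions, absolute colour is the product (an XOR) of the two encoded features, so a linear probe can classify at most about three of the four clusters correctly, while the player-relative label is read out almost perfectly.

```python
import random

REL_DIR = (1.0, 0.5, 0.0)    # hypothetical direction encoding "mine vs theirs"
TURN_DIR = (0.0, 0.5, 1.0)   # hypothetical direction encoding whose turn it is

def make_data(n, seed, noise=0.1):
    """Return (activation, relative_label, absolute_label) triples for one square."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        player = rng.choice((1, -1))     # +1: black to move, -1: white to move
        relative = rng.choice((1, -1))   # +1: mover's piece, -1: opponent's piece
        absolute = relative * player     # +1: black piece, -1: white piece
        act = [relative * r + player * t + rng.gauss(0, noise)
               for r, t in zip(REL_DIR, TURN_DIR)]
        rows.append((act, relative, absolute))
    return rows

def score(w, b, x):
    """Linear readout w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_probe(xs, ys, epochs=30, lr=0.05):
    """Fit a linear probe with plain SGD on squared error."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of examples whose predicted sign matches the label."""
    hits = sum((score(w, b, x) >= 0) == (y > 0) for x, y in zip(xs, ys))
    return hits / len(xs)

def probe_accuracy(label_index, n=1000):
    """Train on one synthetic sample, evaluate on another (1: relative, 2: absolute)."""
    train, test = make_data(n, seed=0), make_data(n, seed=1)
    w, b = train_linear_probe([r[0] for r in train], [r[label_index] for r in train])
    return accuracy(w, b, [r[0] for r in test], [r[label_index] for r in test])

if __name__ == "__main__":
    print("linear probe, player-relative label:", probe_accuracy(1))
    print("linear probe, absolute colour label:", probe_accuracy(2))
```

The same synthetic activations support both readings: a nonlinear probe could still recover absolute colour by combining the two features, which is consistent with the original nonlinear-probe result, while the linear probe succeeds only in the player-relative basis.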

However, the LLM-as-world-model hypothesis faces important critiques. LLMs' "world knowledge" is derived entirely from text, which is a lossy, biased, and incomplete representation of the physical world. They make systematic errors on spatial reasoning, physical prediction, and counterfactual reasoning that suggest their "world model" is closer to statistical pattern matching than to a genuine simulator (Mitchell & Krakauer, 2023). Gurnee and Tegmark (2024) found that LLMs develop linear representations of space and time within their activations, suggesting some genuine structure learning, but these representations remain fragile under distribution shift. The degree to which LLMs learn causal structure (rather than correlational patterns) remains actively debated.

WorldGPT (Ge et al., 2024) explicitly frames LLMs as multimodal world models, integrating visual and textual information for environment simulation. By conditioning a large multimodal model on both visual observations and text descriptions, WorldGPT can predict the visual consequences of described actions, bridging the gap between language-based and vision-based world modeling.

Pre-Trained LMs for Decision-Making (Li et al., 2023) demonstrated that pre-trained language models can be used directly as interactive decision-making agents by representing states, actions, and goals as text. The language model's pre-trained knowledge of world dynamics (encoded in its weights from training on text describing physical and social interactions) provides a surprisingly effective prior for RL in text-based environments, supporting the view that LLMs encode implicit world models.

Language Models Meet World Models (Xiang et al., 2024) proposed enhancing language models with embodied experiences, training them not just on text but on interaction data from physical environments. The resulting models show improved reasoning about physical dynamics, spatial relationships, and action consequences, suggesting that embodied grounding supplies world knowledge that text alone does not provide robustly.

These four works can be read along an axis of grounding modality, from least to most grounded. Othello-GPT shows that a world model can emerge from text-only token streams in a closed symbolic domain, but the structure it recovers is only as rich as the symbol sequence. WorldGPT adds visual grounding, predicting perceptual consequences of described actions, which buys multimodal coverage at the cost of needing aligned vision-text data. Pre-Trained LMs for Decision-Making keeps the model text-only but places it in an interactive loop, trading perceptual fidelity for cheap reuse of pretrained knowledge as a decision-making prior. Language Models Meet World Models goes furthest, injecting embodied interaction data, which improves physical reasoning but requires the most expensive and least scalable supervision. The common tradeoff is fidelity versus scalability: text is abundant but lossy about physics, while embodied grounding is faithful but data-hungry, and each approach picks a different point on that frontier.

Language-Grounded World Models

The following approaches integrate language directly into the world model architecture, using language either as an observation modality or as a structured interface for reasoning.

Dynalang (Lin et al., 2024) learns to model the world with language. By integrating language annotations into a Dreamer-style world model (adding language as an additional observation modality), Dynalang can ground language understanding in environment dynamics. The model learns to use language for predicting future states and for planning, creating a tighter loop between language comprehension and world understanding.

Dynalang is evaluated on a suite of language-grounded environments: HomeGrid (a gridworld with language hints), Messenger (a benchmark requiring an agent to read game-manual text), LangRoom (an embodied question-answering task), and vision-language navigation in photorealistic Habitat scenes. The problem is framed as a partially observed Markov decision process in which language tokens are folded into the observation stream alongside pixels, and the agent is trained to predict future text and image representations as part of a Dreamer-style world model. The evaluation metric is task return (episodic reward), reported as learning curves against step budget rather than as a single headline number.

The clearest concrete result is on Messenger: the language-free and language-conditioned model-free baselines (IMPALA, R2D2, and the task-specific EMMA model) fail to fit the hardest Stage 3 split at all, whereas Dynalang reaches non-trivial performance by reading the game manuals. On HomeGrid, adding language hints on top of instructions improves Dynalang's score, while the same hints degrade the R2D2 baseline, indicating that the world model can exploit predictive language that a model-free policy cannot.

A caveat for readers: because the paper reports outcomes through figures rather than tabulated numbers, these are directional comparisons (which method dominates, and where the gap is largest) rather than precise point estimates, and the evaluation uses few seeds, so absolute magnitudes should be read with care. The pattern nonetheless suggests that language provides a useful inductive bias for world modeling, encoding high-level task structure and causal relationships that are difficult to learn from raw observations.
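To illustrate what "folding language tokens into the observation stream" can look like as a data structure, the following is a minimal sketch and not Dynalang's actual code. It assumes one token is paired with each frame and that a filler token is used once the text runs out; the names `Observation`, `fold_language`, `prediction_targets`, and `PAD_TOKEN` are hypothetical, and frames are stand-in tuples of floats rather than images.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

PAD_TOKEN = "<pad>"  # hypothetical filler for steps that carry no language

@dataclass(frozen=True)
class Observation:
    frame: Tuple[float, ...]  # stand-in for an image observation
    token: str                # at most one language token per step (assumption)

def fold_language(frames: Sequence[Sequence[float]],
                  tokens: Sequence[str]) -> List[Observation]:
    """Pair each frame with one token, padding when the text is shorter.

    Tokens beyond the number of frames are dropped in this sketch.
    """
    kept = list(tokens[:len(frames)])
    padded = kept + [PAD_TOKEN] * (len(frames) - len(kept))
    return [Observation(tuple(f), t) for f, t in zip(frames, padded)]

def prediction_targets(stream: List[Observation]) -> List[Tuple[Observation, Observation]]:
    """Build (current, next) pairs: the target covers the next frame and the next token."""
    return list(zip(stream[:-1], stream[1:]))

if __name__ == "__main__":
    frames = [[0.0], [0.1], [0.2], [0.3]]
    stream = fold_language(frames, ["get", "the", "plate"])
    for current, nxt in prediction_targets(stream):
        print(current.token, "->", nxt.token)
```

Treating text as one more per-step observation is what lets a single predictive objective cover both modalities: the same (current, next) pairs supervise future frames and future tokens.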

More broadly, there is growing interest in world models that operate on or are conditioned by structured language descriptions rather than (or in addition to) raw sensory inputs (Wong et al., 2023). These models can leverage the compositional structure of language to generalize to novel situations described in natural language. This connects to the "language of thought" hypothesis in cognitive science: that humans reason about the world using an internal symbolic language, which may be analogous to how language-conditioned world models operate.

SayPlan (Rana et al., 2023) grounds LLM planning in 3D scene graphs, using the scene graph as an explicit world model that the LLM reasons over. By providing the LLM with a structured representation of the environment, SayPlan enables long-horizon task planning that is grounded in the actual physical layout. Inner Monologue (Huang et al., 2023) uses verbal feedback from the environment (success detection, scene description, human feedback) to enable closed-loop LLM planning, where the LLM updates its internal world model based on observed outcomes. These approaches demonstrate that language provides a natural interface between world models and planning, enabling compositional reasoning about novel scenarios.
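A small sketch can show how a graph-structured world model and verbal feedback fit together. It is not the SayPlan or Inner Monologue implementation: the scene graph, the `goto`/`pick` action vocabulary, and the function `check_plan` are hypothetical, and the plan is written by hand where a real system would have an LLM propose it. The idea illustrated is only that a candidate plan can be simulated against an explicit graph and the first failure returned as text that a planner could condition on.

```python
from typing import Dict, List, Optional, Set

# Hypothetical scene graph: rooms are nodes, doors are edges, objects live in rooms.
ROOM_EDGES: Dict[str, Set[str]] = {
    "hall": {"kitchen", "office"},
    "kitchen": {"hall"},
    "office": {"hall"},
}
OBJECT_ROOM: Dict[str, str] = {"mug": "kitchen", "stapler": "office"}

def check_plan(start: str, plan: List[str]) -> Optional[str]:
    """Simulate a plan on the graph.

    Returns None if every step is feasible, otherwise a sentence describing
    the first infeasible step (the verbal feedback a planner could read).
    """
    room = start
    for i, step in enumerate(plan):
        verb, _, arg = step.partition(" ")
        if verb == "goto":
            if arg not in ROOM_EDGES.get(room, set()):
                return f"step {i}: cannot go from {room} to {arg}; no connecting door"
            room = arg
        elif verb == "pick":
            if OBJECT_ROOM.get(arg) != room:
                return f"step {i}: {arg} is not in {room}"
        else:
            return f"step {i}: unknown action '{verb}'"
    return None

if __name__ == "__main__":
    print(check_plan("office", ["goto kitchen", "pick mug"]))
    print(check_plan("office", ["goto hall", "goto kitchen", "pick mug"]))
```

In a closed-loop setting, the returned sentence would be appended to the planner's context before it proposes a revised plan.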

Causal World Models

A distinct research direction pursues world models with explicit causal structure, going beyond the correlational patterns learned by standard predictive models. CausalWorld (Ahmed et al., 2020) proposed a benchmark for causal structure and transfer in robotic manipulation, testing whether agents can identify causal relationships (e.g., pushing an object causes it to move) and generalize across environments with different causal structures. The key argument for causal world models is that correlational models fail under intervention: when the agent takes a novel action that differs from the training distribution, a correlational model may produce incorrect predictions because it has not learned the underlying causal mechanism. Causal models, in contrast, can support counterfactual reasoning ("what would have happened if I had acted differently?"), which is essential for robust planning in novel situations.
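The failure of correlational models under intervention can be shown with a few lines of simulation. The sketch below is a toy structural model, not the CausalWorld benchmark: it assumes a hidden confounder `z` that influences both the logged action `a` and the outcome `y`, with hypothetical coefficients (a true action effect of 2.0 and a confounder effect of 3.0). A regression fitted on logged data absorbs the confounder into the action coefficient, so its prediction for a freely chosen action is wrong, while the same regression fitted on randomized (intervened) actions recovers the causal effect.

```python
import random

TRUE_EFFECT = 2.0       # hypothetical causal effect of the action on the outcome
CONFOUNDER_EFFECT = 3.0  # hypothetical effect of the hidden confounder

def simulate(n, intervene, seed):
    """Sample (action, outcome) pairs.

    intervene=False: the action depends on the hidden confounder z (logged data).
    intervene=True:  the action is set independently of z, i.e. do(a).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        a = rng.gauss(0, 1) if intervene else z + rng.gauss(0, 1)
        y = TRUE_EFFECT * a + CONFOUNDER_EFFECT * z + rng.gauss(0, 0.1)
        data.append((a, y))
    return data

def slope(data):
    """Ordinary least-squares slope of outcome on action."""
    n = len(data)
    mean_a = sum(a for a, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov = sum((a - mean_a) * (y - mean_y) for a, y in data)
    var = sum((a - mean_a) ** 2 for a, _ in data)
    return cov / var

if __name__ == "__main__":
    print("slope from logged data:     ", slope(simulate(20000, False, seed=0)))
    print("slope from intervened data: ", slope(simulate(20000, True, seed=1)))
```

With these coefficients the logged-data slope converges to 2 + 3 * cov(a, z) / var(a) = 2 + 3 * (1 / 2) = 3.5, whereas the interventional slope converges to the true effect of 2.0, so a planner that trusted the correlational estimate would overpredict the consequence of its own actions.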

Reasoning as Internal Simulation

The relationship between world models and reasoning can be understood through the lens of simulation theory (Craik, 1943): reasoning is the process of running internal simulations and evaluating their outcomes. Chain-of-thought reasoning in LLMs can be viewed as a form of step-by-step simulation, where each reasoning step predicts the outcome of a mental action. This perspective suggests that improving LLMs' reasoning capabilities may require giving them better internal world models: more accurate simulators of causal processes, physical dynamics, and logical inference.

This idea is operationalized concretely by Reasoning via Planning (RAP) (Hao et al., 2023), which makes the LLM-as-world-model view actionable: it repurposes the same LLM as both a world model (predicting the next state given an action) and a reasoning agent, then uses Monte Carlo tree search to plan over the simulated state trajectories rather than generating a single chain of thought greedily. By searching over imagined outcomes and backing up their estimated rewards, RAP improves performance on plan generation, math, and logical-inference tasks relative to standard chain-of-thought prompting, giving empirical weight to the claim that better internal simulation, not just more reasoning tokens, drives reasoning gains.
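The search loop described here can be sketched on a toy problem. The code below is not the RAP implementation: `world_model` and `reward` are deterministic stand-ins for what would be LLM calls, the task (reach a target integer from 1 using "+1" and "*2") is invented for illustration, and random rollouts are used as a simple value estimate. What it shares with the description above is the structure: states are predicted by a world model, candidate action sequences are explored by Monte Carlo tree search, and rewards are backed up along the visited path.

```python
import math
import random

ACTIONS = ("+1", "*2")     # hypothetical action space for the toy task
TARGET, MAX_DEPTH = 10, 5  # reach 10 from 1 within 5 steps

def world_model(state, action):
    """Stand-in for prompting an LLM to predict the next state."""
    return state + 1 if action == "+1" else state * 2

def reward(state):
    """Stand-in for an LLM-derived reward signal."""
    return 1.0 if state == TARGET else 0.0

class Node:
    def __init__(self, state, depth, parent=None, action=None):
        self.state, self.depth = state, depth
        self.parent, self.action = parent, action
        self.children = []
        self.visits, self.value = 0, 0.0

    def is_terminal(self):
        return self.state >= TARGET or self.depth == MAX_DEPTH

def uct(child, c=1.4):
    """Upper confidence bound used to choose which child to descend into."""
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore

def rollout(state, depth, rng):
    """Estimate a state's value by simulating random actions in the world model."""
    while state < TARGET and depth < MAX_DEPTH:
        state = world_model(state, rng.choice(ACTIONS))
        depth += 1
    return reward(state)

def mcts(root_state, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(root_state, 0)
    for _ in range(iterations):
        node = root
        while node.children and not node.is_terminal():  # selection
            node = max(node.children, key=uct)
        if not node.is_terminal():                        # expansion
            node.children = [Node(world_model(node.state, a), node.depth + 1, node, a)
                             for a in ACTIONS]
            node = rng.choice(node.children)
        value = rollout(node.state, node.depth, rng)      # simulation
        while node is not None:                           # backpropagation
            node.visits += 1
            node.value += value
            node = node.parent
    plan, node = [], root
    while node.children:  # read out the most-visited path as the final plan
        node = max(node.children, key=lambda ch: ch.visits)
        plan.append(node.action)
    return plan, node.state

if __name__ == "__main__":
    print(mcts(1))
```

The contrast with greedy chain-of-thought is in the selection and backpropagation steps: a branch that looks unpromising after one simulated outcome can still be revisited, and a branch that leads to reward accumulates visits.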

The convergence of world models (from the RL community) and reasoning (from the NLP community) represents an active research frontier. Current evidence suggests that text-derived world models succeed in domains with rich linguistic structure (game rules, instruction-following, structured symbolic systems) but struggle with spatial reasoning, counterfactual prediction, and physical dynamics where language provides only an indirect and lossy signal. The challenge is to combine the scalability of language supervision with the fidelity of embodied grounding, either through hybrid architectures (like WorldGPT and Dynalang) or through better alignment between linguistic and perceptual modalities.


References