
Classic World Models

This section traces the evolution of latent-space world models from their modern inception to the current state of the art. The trajectory from Ha and Schmidhuber's initial work through the Dreamer series represents one of the most coherent and successful research programs in deep RL, with each iteration building directly on the insights and limitations of its predecessor. For a broader historical perspective on model-based RL, see Moerland et al. (2023).

Pre-Deep-Learning Foundations

Before the deep learning era, model-based RL already demonstrated the power of learned dynamics models. PILCO (Deisenroth & Rasmussen, 2011) used Gaussian processes to learn dynamics models with calibrated uncertainty, achieving remarkably data-efficient control (learning cart-pole swing-up in under 30 seconds of interaction). PETS (Chua et al., 2018) scaled model-based approaches to deeper networks using probabilistic ensembles, demonstrating that ensembles of neural networks could capture epistemic uncertainty well enough for planning. Nagabandi et al. (2018) showed that neural network dynamics models could be combined with model-predictive control and then fine-tuned with model-free RL, establishing the paradigm of model-based initialization followed by model-free refinement. These works established that the fundamental bottleneck in model-based RL is the quality and uncertainty calibration of the learned dynamics model -- a theme that continues through all subsequent work.

World Models (Ha & Schmidhuber, 2018)

The foundational work by Ha and Schmidhuber (2018) introduced the modern world model paradigm. Their architecture consists of three components: (1) a Variational Autoencoder (VAE) (Kingma & Welling, 2014) that compresses high-dimensional observations (images) into a compact latent code z, (2) a Mixture Density Network-RNN (MDN-RNN) that models the dynamics in latent space, predicting the next latent state and its uncertainty as a mixture of Gaussians, and (3) a compact linear controller that maps the latent code and RNN hidden state to actions.

The key insight was the separation of concerns: the VAE learns what the world looks like (perception), the RNN learns how it changes (dynamics), and the controller learns what to do (policy). The agent was trained entirely in "dreams" -- rollouts generated by the learned model -- using evolution strategies to optimize the controller. This enabled rapid policy learning with minimal real-world interaction. On the CarRacing environment, the dream-trained agent achieved competitive performance with a tiny controller (just 867 parameters), demonstrating that most of the "intelligence" is in the world model, not the policy.
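As a sanity check on the 867-parameter figure, the controller is a single linear map from the 32-dimensional VAE latent and the 256-dimensional RNN hidden state to CarRacing's three continuous actions (dimensions as reported in the paper):

```python
# Controller parameter count for the CarRacing agent:
# one linear layer from [z; h] to actions, plus biases.
z_dim, h_dim, action_dim = 32, 256, 3
controller_params = (z_dim + h_dim) * action_dim + action_dim
print(controller_params)  # 867
```

The tiny controller is deliberate: with evolution strategies as the optimizer, a small parameter vector keeps the search tractable, and the heavy lifting is pushed into the VAE and MDN-RNN.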

PlaNet: The Recurrent State-Space Model

Hafner et al. (2019) introduced PlaNet (Deep Planning Network), which made two crucial contributions that shaped all subsequent work. First, PlaNet introduced the Recurrent State-Space Model (RSSM), which combines deterministic and stochastic state components:

  • Deterministic state h_t: An RNN hidden state that captures predictable dynamics and provides memory across timesteps.
  • Stochastic state s_t: Sampled from a learned prior p(s_t | h_t) or posterior q(s_t | h_t, o_t) that captures environmental uncertainty and multimodal future possibilities.
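The two components above combine into a one-step transition, sketched below with stand-in linear maps for the learned GRU and prior networks (all sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 200, 30, 4  # hypothetical sizes for h_t, s_t, and actions

# Dummy linear weights standing in for the learned networks.
W_gru = rng.normal(scale=0.1, size=(H, H + S + A))
W_prior = rng.normal(scale=0.1, size=(2 * S, H))  # mean and log-std of p(s_t | h_t)

def rssm_step(h, s, a):
    """One RSSM transition: deterministic update, then stochastic sample."""
    h_next = np.tanh(W_gru @ np.concatenate([h, s, a]))   # deterministic path (memory)
    stats = W_prior @ h_next
    mean, log_std = stats[:S], stats[S:]
    s_next = mean + np.exp(log_std) * rng.normal(size=S)  # stochastic path (uncertainty)
    return h_next, s_next

h, s = rssm_step(np.zeros(H), np.zeros(S), np.zeros(A))
```

During training the posterior q(s_t | h_t, o_t) replaces the prior wherever an observation is available; at imagination time only the prior is used.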

The RSSM's hybrid architecture addresses a key limitation of pure RNN models (which cannot represent stochastic futures) and pure state-space models (which have limited memory). Second, PlaNet plans online using the Cross-Entropy Method (CEM), sampling random action sequences, evaluating them through the learned model, and iteratively refining toward the best actions. PlaNet achieved competitive performance on a suite of continuous control tasks from image observations using 50x fewer environment interactions than model-free methods (D4PG).
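The CEM planning loop can be sketched as follows; the `evaluate` function is a stand-in for rolling an action sequence through the learned RSSM and summing predicted rewards, and all sizes here are illustrative:

```python
import numpy as np

def cem_plan(evaluate, horizon=5, action_dim=1, pop=400, elites=40, iters=10, seed=0):
    """Cross-Entropy Method over open-loop action sequences.

    evaluate: maps a (horizon, action_dim) sequence to a scalar predicted return.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        seqs = mean + std * rng.normal(size=(pop, horizon, action_dim))
        returns = np.array([evaluate(s) for s in seqs])
        top = seqs[np.argsort(returns)[-elites:]]    # keep the best sequences
        mean, std = top.mean(0), top.std(0) + 1e-6   # refit the sampling distribution
    return mean[0]  # execute only the first action (model-predictive control)

# Toy check: predicted return peaks when every action equals 0.5.
best = cem_plan(lambda s: -np.sum((s - 0.5) ** 2))
```

Replanning from scratch at every environment step is what Dreamer later replaces with an amortized policy.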

Dreamer v1: Imagination-Based Policy Learning

Hafner et al. (2020) introduced Dreamer v1, which replaced PlaNet's online planning (CEM) with a learned policy and value function trained within imagined trajectories. Rather than planning from scratch at each step (which is computationally expensive), Dreamer learns an actor-critic pair that operates entirely in "imagination" -- trajectories unrolled through the learned RSSM.

The training procedure has three components, trained in a coordinated loop:

  1. World model learning: Learn the RSSM dynamics, encoder, decoder, and reward predictor from real experience.
  2. Behavior learning: Imagine trajectories using the learned world model and train an actor and critic via backpropagation through the dynamics model (analytic gradients rather than the high-variance REINFORCE estimator).
  3. Environment interaction: Execute the learned policy in the real environment to collect new data.
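The imagination phase (step 2) can be sketched as a rollout that never touches the environment; the linear maps below are stand-ins for the learned RSSM, reward head, and actor, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 64, 16, 2  # hypothetical sizes for h_t, s_t, and actions

# Dummy linear stand-ins for the learned transition, prior, reward, and actor networks.
W_dyn = rng.normal(scale=0.1, size=(H, H + S + A))
W_prior = rng.normal(scale=0.1, size=(S, H))
w_reward = rng.normal(scale=0.1, size=H + S)
W_actor = rng.normal(scale=0.1, size=(A, H + S))

def imagine(h, s, horizon=15):
    """Unroll an imagined trajectory purely through the model."""
    rewards = []
    for _ in range(horizon):
        a = np.tanh(W_actor @ np.concatenate([h, s]))      # policy acts in imagination
        h = np.tanh(W_dyn @ np.concatenate([h, s, a]))     # deterministic transition
        s = W_prior @ h + 0.1 * rng.normal(size=S)         # sample from the learned prior
        rewards.append(w_reward @ np.concatenate([h, s]))  # predicted reward
    return np.array(rewards)

rews = imagine(np.zeros(H), np.zeros(S))
```

Because every operation in this rollout is differentiable (with reparameterized sampling), the actor can be trained by backpropagating the value of the imagined trajectory directly through the dynamics, which this numpy sketch does not show.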

Dreamer v1 achieved state-of-the-art sample efficiency on the DeepMind Control Suite, matching the asymptotic performance of the top model-free method (D4PG) while requiring 20x fewer environment interactions. The use of analytic gradients through the dynamics model for policy learning (rather than CEM planning or REINFORCE) was a key contribution, producing lower-variance gradient estimates that enabled stable learning.

Dreamer v2: Discrete Representations

Hafner et al. (2021) introduced Dreamer v2, which made several important modifications. The most significant was replacing the continuous Gaussian latent state with discrete categorical variables -- each stochastic state is represented as a concatenation of multiple categorical distributions (e.g., 32 categorical variables with 32 classes each). Discrete representations offer several advantages: they can represent multimodal distributions naturally, avoid posterior collapse issues common in continuous VAEs, and are more compatible with the discrete nature of many environments.
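A minimal sketch of sampling such a discrete latent (32 variables of 32 classes each, giving a 1024-dimensional one-hot state); in training, gradients flow through the samples via a straight-through estimator, which this numpy sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
num_vars, num_classes = 32, 32  # 32 categorical variables, 32 classes each

logits = rng.normal(size=(num_vars, num_classes))       # encoder/prior output
probs = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax
probs /= probs.sum(-1, keepdims=True)

# Sample one class per variable; the stochastic state is the flattened one-hots.
classes = np.array([rng.choice(num_classes, p=p) for p in probs])
z = np.eye(num_classes)[classes].reshape(-1)            # 1024-d sparse binary vector
```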

Dreamer v2 also introduced KL balancing to control the tradeoff between the prior and posterior losses, preventing the model from converging to a trivial solution. It was the first agent trained inside a learned world model to achieve human-level performance across the Atari benchmark of 55 games, demonstrating that world models could compete with model-free methods even on complex, discrete-action environments.
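KL balancing can be sketched as follows; the alpha = 0.8 weighting follows the paper, and the stop-gradients that distinguish the two terms in a real implementation are indicated only in comments here:

```python
import numpy as np

def categorical_kl(post, prior, eps=1e-8):
    """Mean KL between per-variable categoricals, shape (num_vars, num_classes)."""
    return np.sum(post * (np.log(post + eps) - np.log(prior + eps)), axis=-1).mean()

def balanced_kl_loss(post, prior, alpha=0.8):
    # In a real implementation the first term detaches (stop-gradients) the
    # posterior, training the prior toward it, while the second detaches the
    # prior, lightly regularizing the posterior. Numerically both are the same KL.
    return alpha * categorical_kl(post, prior) + (1 - alpha) * categorical_kl(post, prior)
```

Weighting the prior-training term more heavily (alpha > 0.5) makes the prior chase the posterior faster than the posterior drifts toward the prior, which keeps imagined rollouts consistent with encoded states.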

Dreamer v3: Universal World Models

Hafner et al. (2023) achieved a landmark result with Dreamer v3: a single agent with a single, fixed set of hyperparameters that masters diverse domains including continuous control (DeepMind Control Suite), discrete control (Atari games), long-horizon planning (collecting a diamond in Minecraft -- a feat requiring hundreds of steps of planning), and 3D navigation (DMLab). No previous method could handle this diversity without domain-specific tuning.

Key innovations in Dreamer v3 include:

  • Symlog predictions: Transforming targets with the symlog function (sign(x) * log(|x| + 1)) enables the model to handle reward scales spanning orders of magnitude across domains, eliminating the need for reward normalization.
  • Free bits: Setting a minimum KL divergence below which no gradient flows prevents posterior collapse while avoiding excessive regularization.
  • Improved discrete world model: Refined categorical representations with better initialization and training dynamics.
  • Return normalization: Scaling returns by a running estimate of their percentile range, so that large returns are scaled down without small, noisy returns being amplified.
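The symlog transform and its inverse are simple to state directly:

```python
import numpy as np

def symlog(x):
    """Compress targets symmetrically: sign(x) * log(|x| + 1)."""
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(x):
    """Inverse of symlog, used to decode predictions back to raw scale."""
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)

# Rewards spanning orders of magnitude map into a comparable range:
print(symlog(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])))
```

Unlike a plain log, symlog is defined for negative inputs and behaves approximately linearly near zero, so small rewards pass through nearly unchanged.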

Dreamer v3 was the first model-based agent to collect a diamond in Minecraft without human data or curricula, a task that required discovering and executing a complex sequence of subtasks (collect wood, craft tools, mine stone, mine iron, mine diamond) spanning hundreds of environment steps.

DreamerPro: Reconstruction-Free World Models

Deng et al. (2022) proposed DreamerPro, which replaces the decoder-based reconstruction objective in the RSSM with a prototypical representation learning objective. Instead of learning to reconstruct pixel observations (which wastes capacity on task-irrelevant visual details), DreamerPro uses prototypical self-supervised learning (inspired by SwAV and BYOL) to learn representations that capture task-relevant structure. DreamerPro matches DreamerV2's performance while being more robust to visual distractors, demonstrating that reconstruction is not necessary for effective world models -- only task-relevant representation learning.
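A sketch of the prototype-assignment step in the spirit of SwAV (the Sinkhorn balancing that SwAV applies to assignments is omitted, and the shapes and temperature are illustrative):

```python
import numpy as np

def prototype_assign(features, prototypes, temperature=0.1):
    """Soft-assign L2-normalized features to a set of prototype vectors."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    logits = (f @ c.T) / temperature           # cosine similarity scores
    logits -= logits.max(-1, keepdims=True)    # numerical stability
    p = np.exp(logits)
    return p / p.sum(-1, keepdims=True)        # one distribution per feature

rng = np.random.default_rng(0)
assign = prototype_assign(rng.normal(size=(16, 64)), rng.normal(size=(32, 64)))
```

The training signal then comes from predicting these assignments across views or timesteps rather than from reconstructing pixels.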

Plan2Explore: Curiosity-Driven World Models

Sekar et al. (2020) proposed Plan2Explore, which uses the world model itself to drive exploration. Rather than exploring randomly, Plan2Explore plans to visit states where the world model's predictions are most uncertain (highest disagreement between ensemble members). This curiosity-driven exploration enables the agent to learn a better world model faster, which in turn enables better downstream task learning. Plan2Explore achieved competitive zero-shot and few-shot performance on the DeepMind Control Suite, demonstrating that a well-explored world model can transfer to new tasks without additional environment interaction. This connects to the broader literature on curiosity-driven exploration (Pathak et al., 2017), where prediction error serves as an intrinsic reward signal.
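Ensemble disagreement as an intrinsic reward can be sketched as the variance across members' predictions of the next latent state:

```python
import numpy as np

def disagreement_reward(ensemble_preds):
    """Intrinsic reward: variance of predicted next latents across ensemble members.

    ensemble_preds has shape (num_members, latent_dim) for one transition.
    """
    return float(ensemble_preds.var(axis=0).mean())

rng = np.random.default_rng(0)
agreed = np.tile(rng.normal(size=16), (5, 1))  # identical predictions: well-modeled state
novel = rng.normal(size=(5, 16))               # members disagree: poorly-modeled state
```

States the model already predicts well yield near-zero reward, so the planner is steered toward the frontier of what the model knows.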

Director: Hierarchical World Models

Hafner et al. (2022) proposed Director, which extends Dreamer with hierarchical goal-conditioned planning. Rather than predicting a flat sequence of actions, Director learns a high-level "manager" that sets subgoals in latent space and a low-level "worker" that achieves these subgoals. This hierarchical decomposition enables planning at multiple temporal scales, with the manager operating over longer horizons and the worker handling short-term control. Director achieved improved performance on long-horizon tasks where flat planning degrades due to compounding prediction errors.

Masked World Models (MWM)

Seo et al. (2023) proposed Masked World Models, which improve the representation learning component of Dreamer-style world models by incorporating masked autoencoder pre-training. By masking random patches of observations and training the encoder to reconstruct them, MWM learns richer visual representations that improve downstream dynamics modeling and policy learning. MWM achieved state-of-the-art results on the DeepMind Control Suite, particularly on tasks with complex visual observations. This work highlighted that the quality of the encoder -- not just the dynamics model -- is a critical bottleneck in world model performance.
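MAE-style patch masking can be sketched as follows; the patch size and mask ratio are illustrative, not necessarily the paper's settings:

```python
import numpy as np

def mask_patches(image, patch=8, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patches of an (H, W, C) image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch             # patch grid dimensions
    n_mask = int(gh * gw * mask_ratio)
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    out = image.copy()
    for i in idx:
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

masked = mask_patches(np.ones((64, 64, 3)))
```

The encoder only sees the surviving patches and must infer the rest, which forces it to learn global scene structure rather than local texture.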

TransDreamer and Transformer-Based World Models

Chen et al. (2022) proposed TransDreamer, which replaces the RNN backbone in the RSSM with a transformer, enabling parallel training and better long-range dependency modeling. Robine et al. (2023) further demonstrated that transformer-based world models can achieve competitive performance on the Atari 100k benchmark (only 100k environment interactions), suggesting that the transformer architecture's ability to capture long-range temporal dependencies directly benefits dynamics prediction. The shift from RNN to transformer backbones mirrors the broader trend in deep learning and enables world models to leverage the scaling properties of transformers.

Stochastic Latent Actor-Critic (SLAC)

Lee et al. (2020) proposed SLAC, which combines a sequential latent variable model with model-free actor-critic training. Unlike Dreamer, which trains the policy entirely in imagination, SLAC uses the latent variable model primarily for representation learning -- providing a compact, informative state representation to the actor-critic. SLAC demonstrated that even when the world model is not used for planning or imagined rollouts, learning a good dynamics model still provides substantial benefits through improved representations. This work helped clarify the dual role of world models: as simulators for planning and as representation learners for model-free RL.

The Objective Mismatch Problem

Lambert et al. (2020) identified a fundamental challenge in model-based RL: the objective mismatch between how world models are trained (minimizing prediction error) and how they are used (maximizing task reward). A world model that is globally accurate may allocate capacity to predicting task-irrelevant details while poorly predicting task-critical dynamics. Conversely, MuZero-style models trained with task-relevant objectives (reward and value prediction) can be highly effective for planning despite having no observation reconstruction capability. This insight motivates value-equivalent and decision-aware approaches to world model training.


References