
Classic World Models

This section traces the evolution of latent-space world models from their modern inception to the current state of the art. The trajectory from Ha and Schmidhuber's initial work through the Dreamer series represents one of the most coherent and successful research programs in deep RL, with each iteration building directly on the insights and limitations of its predecessor. For a broader historical perspective on model-based RL, see Moerland et al. (2023).

Pre-Deep-Learning Foundations

Before the deep learning era, model-based RL already demonstrated the power of learned dynamics models. PILCO (Deisenroth & Rasmussen, 2011) used Gaussian processes to learn dynamics models with calibrated uncertainty, achieving remarkably data-efficient control (learning cart-pole swing-up in under 30 seconds of interaction). PETS (Chua et al., 2018) scaled model-based approaches to deeper networks using probabilistic ensembles, demonstrating that ensembles of neural networks could capture epistemic uncertainty well enough for planning. Nagabandi et al. (2018) showed that neural network dynamics models could be combined with model-predictive control and then fine-tuned with model-free RL, establishing the paradigm of model-based initialization followed by model-free refinement. These works established that the fundamental bottleneck in model-based RL is the quality and uncertainty calibration of the learned dynamics model -- a theme that continues through all subsequent work.

World Models (Ha & Schmidhuber, 2018)

The foundational work by Ha and Schmidhuber (2018) introduced the modern world model paradigm. Their architecture consists of three components: (1) a Variational Autoencoder (VAE) (Kingma & Welling, 2014) that compresses high-dimensional observations (images) into a compact latent code z, (2) a Mixture Density Network-RNN (MDN-RNN) that models the dynamics in latent space, predicting the next latent state and its uncertainty as a mixture of Gaussians, and (3) a compact linear controller that maps the latent code and RNN hidden state to actions.

The key insight was the separation of concerns: the VAE learns what the world looks like (perception), the RNN learns how it changes (dynamics), and the controller learns what to do (policy). The agent was trained entirely in "dreams" -- rollouts generated by the learned model -- using evolution strategies to optimize the controller. This enabled rapid policy learning with minimal real-world interaction. On the CarRacing environment, the dream-trained agent achieved competitive performance with a tiny controller (just 867 parameters), demonstrating that most of the "intelligence" is in the world model, not the policy.
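As a sanity check on the 867-parameter figure, the controller is a single linear map from the 32-dimensional VAE latent and the 256-dimensional RNN hidden state to CarRacing's three continuous actions (dimensions as reported in the paper):

```python
# Controller parameter count for the CarRacing agent:
# one linear layer from [z; h] to actions, plus biases.
z_dim, h_dim, action_dim = 32, 256, 3
controller_params = (z_dim + h_dim) * action_dim + action_dim
print(controller_params)  # 867
```

The tiny controller is deliberate: with evolution strategies as the optimizer, a small parameter vector keeps the search tractable, and the heavy lifting is pushed into the VAE and MDN-RNN.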

PlaNet: The Recurrent State-Space Model

Hafner et al. (2019) introduced PlaNet (Deep Planning Network), which made two crucial contributions that shaped all subsequent work. First, PlaNet introduced the Recurrent State-Space Model (RSSM), which combines deterministic and stochastic state components:

  • Deterministic state h_t: An RNN hidden state that captures predictable dynamics and provides memory across timesteps.
  • Stochastic state s_t: Sampled from a learned prior p(s_t | h_t) or posterior q(s_t | h_t, o_t) that captures environmental uncertainty and multimodal future possibilities.
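The two components above combine into a one-step transition, sketched below with stand-in linear maps for the learned GRU and prior networks (all sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 200, 30, 4  # hypothetical sizes for h_t, s_t, and actions

# Dummy linear weights standing in for the learned networks.
W_gru = rng.normal(scale=0.1, size=(H, H + S + A))
W_prior = rng.normal(scale=0.1, size=(2 * S, H))  # mean and log-std of p(s_t | h_t)

def rssm_step(h, s, a):
    """One RSSM transition: deterministic update, then stochastic sample."""
    h_next = np.tanh(W_gru @ np.concatenate([h, s, a]))   # deterministic path (memory)
    stats = W_prior @ h_next
    mean, log_std = stats[:S], stats[S:]
    s_next = mean + np.exp(log_std) * rng.normal(size=S)  # stochastic path (uncertainty)
    return h_next, s_next

h, s = rssm_step(np.zeros(H), np.zeros(S), np.zeros(A))
```

During training the posterior q(s_t | h_t, o_t) replaces the prior wherever an observation is available; at imagination time only the prior is used.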

The RSSM's hybrid architecture addresses a key limitation of pure RNN models (which cannot represent stochastic futures) and pure state-space models (which have limited memory). Second, PlaNet plans online using the Cross-Entropy Method (CEM), sampling random action sequences, evaluating them through the learned model, and iteratively refining toward the best actions. PlaNet achieved competitive performance on a suite of continuous control tasks from image observations using 50x fewer environment interactions than model-free methods (D4PG).
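The CEM planning loop can be sketched as follows; the `evaluate` function is a stand-in for rolling an action sequence through the learned RSSM and summing predicted rewards, and all sizes here are illustrative:

```python
import numpy as np

def cem_plan(evaluate, horizon=5, action_dim=1, pop=400, elites=40, iters=10, seed=0):
    """Cross-Entropy Method over open-loop action sequences.

    evaluate: maps a (horizon, action_dim) sequence to a scalar predicted return.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        seqs = mean + std * rng.normal(size=(pop, horizon, action_dim))
        returns = np.array([evaluate(s) for s in seqs])
        top = seqs[np.argsort(returns)[-elites:]]    # keep the best sequences
        mean, std = top.mean(0), top.std(0) + 1e-6   # refit the sampling distribution
    return mean[0]  # execute only the first action (model-predictive control)

# Toy check: predicted return peaks when every action equals 0.5.
best = cem_plan(lambda s: -np.sum((s - 0.5) ** 2))
```

Replanning from scratch at every environment step is what Dreamer later replaces with an amortized policy.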

Dreamer v1: Imagination-Based Policy Learning

Hafner et al. (2020) introduced Dreamer v1, which replaced PlaNet's online planning (CEM) with a learned policy and value function trained within imagined trajectories. Rather than planning from scratch at each step (which is computationally expensive), Dreamer learns an actor-critic pair that operates entirely in "imagination" -- trajectories unrolled through the learned RSSM.

The training procedure has three components, trained in a coordinated loop:

  1. World model learning: Learn the RSSM dynamics, encoder, decoder, and reward predictor from real experience.
  2. Behavior learning: Imagine trajectories using the learned world model and train an actor and critic via backpropagation through the dynamics model (analytic gradients rather than the high-variance REINFORCE estimator).
  3. Environment interaction: Execute the learned policy in the real environment to collect new data.
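The imagination phase (step 2) can be sketched as a rollout that never touches the environment; the linear maps below are stand-ins for the learned RSSM, reward head, and actor, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 64, 16, 2  # hypothetical sizes for h_t, s_t, and actions

# Dummy linear stand-ins for the learned transition, prior, reward, and actor networks.
W_dyn = rng.normal(scale=0.1, size=(H, H + S + A))
W_prior = rng.normal(scale=0.1, size=(S, H))
w_reward = rng.normal(scale=0.1, size=H + S)
W_actor = rng.normal(scale=0.1, size=(A, H + S))

def imagine(h, s, horizon=15):
    """Unroll an imagined trajectory purely through the model."""
    rewards = []
    for _ in range(horizon):
        a = np.tanh(W_actor @ np.concatenate([h, s]))      # policy acts in imagination
        h = np.tanh(W_dyn @ np.concatenate([h, s, a]))     # deterministic transition
        s = W_prior @ h + 0.1 * rng.normal(size=S)         # sample from the learned prior
        rewards.append(w_reward @ np.concatenate([h, s]))  # predicted reward
    return np.array(rewards)

rews = imagine(np.zeros(H), np.zeros(S))
```

Because every operation in this rollout is differentiable (with reparameterized sampling), the actor can be trained by backpropagating the value of the imagined trajectory directly through the dynamics, which this numpy sketch does not show.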

Dreamer v1 achieved state-of-the-art sample efficiency on the DeepMind Control Suite, matching the asymptotic performance of the top model-free method (D4PG) while requiring 20x fewer environment interactions. The use of analytic gradients through the dynamics model for policy learning (rather than CEM planning or REINFORCE) was a key contribution, producing lower-variance gradient estimates that enabled stable learning.

Dreamer v2: Discrete Representations

Hafner et al. (2021) introduced Dreamer v2, which made several important modifications. The most significant was replacing the continuous Gaussian latent state with discrete categorical variables -- each stochastic state is represented as a concatenation of multiple categorical distributions (e.g., 32 categorical variables with 32 classes each). Discrete representations offer several advantages: they can represent multimodal distributions naturally, avoid posterior collapse issues common in continuous VAEs, and are more compatible with the discrete nature of many environments.
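A minimal sketch of sampling such a discrete latent (32 variables of 32 classes each, giving a 1024-dimensional one-hot state); in training, gradients flow through the samples via a straight-through estimator, which this numpy sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
num_vars, num_classes = 32, 32  # 32 categorical variables, 32 classes each

logits = rng.normal(size=(num_vars, num_classes))       # encoder/prior output
probs = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax
probs /= probs.sum(-1, keepdims=True)

# Sample one class per variable; the stochastic state is the flattened one-hots.
classes = np.array([rng.choice(num_classes, p=p) for p in probs])
z = np.eye(num_classes)[classes].reshape(-1)            # 1024-d sparse binary vector
```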

Dreamer v2 also introduced KL balancing to control the tradeoff between the prior and posterior losses, preventing the model from converging to a trivial solution. It was the first agent trained inside a learned world model to achieve human-level performance across the Atari benchmark of 55 games, demonstrating that world models could compete with model-free methods even on complex, discrete-action environments.
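KL balancing can be sketched as follows; the alpha = 0.8 weighting follows the paper, and the stop-gradients that distinguish the two terms in a real implementation are indicated only in comments here:

```python
import numpy as np

def categorical_kl(post, prior, eps=1e-8):
    """Mean KL between per-variable categoricals, shape (num_vars, num_classes)."""
    return np.sum(post * (np.log(post + eps) - np.log(prior + eps)), axis=-1).mean()

def balanced_kl_loss(post, prior, alpha=0.8):
    # In a real implementation the first term detaches (stop-gradients) the
    # posterior, training the prior toward it, while the second detaches the
    # prior, lightly regularizing the posterior. Numerically both are the same KL.
    return alpha * categorical_kl(post, prior) + (1 - alpha) * categorical_kl(post, prior)
```

Weighting the prior-training term more heavily (alpha > 0.5) makes the prior chase the posterior faster than the posterior drifts toward the prior, which keeps imagined rollouts consistent with encoded states.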

Dreamer v3: Universal World Models

Hafner et al. (2023) achieved a landmark result with Dreamer v3: a single agent with a single, fixed set of hyperparameters that masters diverse domains including continuous control (DeepMind Control Suite), discrete control (Atari games), long-horizon planning (collecting a diamond in Minecraft -- a feat requiring hundreds of steps of planning), and 3D navigation (DMLab). No previous method could handle this diversity without domain-specific tuning.

Key innovations in Dreamer v3 include:

  • Symlog predictions: Transforming targets with the symlog function (sign(x) * log(|x| + 1)) enables the model to handle reward scales spanning orders of magnitude across domains, eliminating the need for reward normalization.
  • Free bits: Setting a minimum KL divergence below which no gradient flows prevents posterior collapse while avoiding excessive regularization.
  • Improved discrete world model: Refined categorical representations with better initialization and training dynamics.
  • Return normalization: Scaling returns by a running estimate of their percentile range, so that large returns are scaled down without small, noisy returns being amplified.
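The symlog transform and its inverse are simple to state directly:

```python
import numpy as np

def symlog(x):
    """Compress targets symmetrically: sign(x) * log(|x| + 1)."""
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(x):
    """Inverse of symlog, used to decode predictions back to raw scale."""
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)

# Rewards spanning orders of magnitude map into a comparable range:
print(symlog(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])))
```

Unlike a plain log, symlog is defined for negative inputs and behaves approximately linearly near zero, so small rewards pass through nearly unchanged.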

Dreamer v3 was the first model-based agent to collect a diamond in Minecraft without human data or curricula, a task that required discovering and executing a complex sequence of subtasks (collect wood, craft tools, mine stone, mine iron, mine diamond) spanning hundreds of environment steps.

DreamerPro: Reconstruction-Free World Models

Deng et al. (2022) proposed DreamerPro, which replaces the decoder-based reconstruction objective in the RSSM with a prototypical representation learning objective. Instead of learning to reconstruct pixel observations (which wastes capacity on task-irrelevant visual details), DreamerPro uses prototypical self-supervised learning (inspired by SwAV and BYOL) to learn representations that capture task-relevant structure. DreamerPro matches DreamerV2's performance while being more robust to visual distractors, demonstrating that reconstruction is not necessary for effective world models -- only task-relevant representation learning.
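A sketch of the prototype-assignment step in the spirit of SwAV (the Sinkhorn balancing that SwAV applies to assignments is omitted, and the shapes and temperature are illustrative):

```python
import numpy as np

def prototype_assign(features, prototypes, temperature=0.1):
    """Soft-assign L2-normalized features to a set of prototype vectors."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    logits = (f @ c.T) / temperature           # cosine similarity scores
    logits -= logits.max(-1, keepdims=True)    # numerical stability
    p = np.exp(logits)
    return p / p.sum(-1, keepdims=True)        # one distribution per feature

rng = np.random.default_rng(0)
assign = prototype_assign(rng.normal(size=(16, 64)), rng.normal(size=(32, 64)))
```

The training signal then comes from predicting these assignments across views or timesteps rather than from reconstructing pixels.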

Plan2Explore: Curiosity-Driven World Models

Sekar et al. (2020) proposed Plan2Explore, which uses the world model itself to drive exploration. Rather than exploring randomly, Plan2Explore plans to visit states where the world model's predictions are most uncertain (highest disagreement between ensemble members). This curiosity-driven exploration enables the agent to learn a better world model faster, which in turn enables better downstream task learning. Plan2Explore achieved competitive zero-shot and few-shot performance on the DeepMind Control Suite, demonstrating that a well-explored world model can transfer to new tasks without additional environment interaction. This connects to the broader literature on curiosity-driven exploration (Pathak et al., 2017), where prediction error serves as an intrinsic reward signal.
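Ensemble disagreement as an intrinsic reward can be sketched as the variance across members' predictions of the next latent state:

```python
import numpy as np

def disagreement_reward(ensemble_preds):
    """Intrinsic reward: variance of predicted next latents across ensemble members.

    ensemble_preds has shape (num_members, latent_dim) for one transition.
    """
    return float(ensemble_preds.var(axis=0).mean())

rng = np.random.default_rng(0)
agreed = np.tile(rng.normal(size=16), (5, 1))  # identical predictions: well-modeled state
novel = rng.normal(size=(5, 16))               # members disagree: poorly-modeled state
```

States the model already predicts well yield near-zero reward, so the planner is steered toward the frontier of what the model knows.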

Director: Hierarchical World Models

Hafner et al. (2022) proposed Director, which extends Dreamer with hierarchical goal-conditioned planning. Rather than predicting a flat sequence of actions, Director learns a high-level "manager" that sets subgoals in latent space and a low-level "worker" that achieves these subgoals. This hierarchical decomposition enables planning at multiple temporal scales, with the manager operating over longer horizons and the worker handling short-term control. Director achieved improved performance on long-horizon tasks where flat planning degrades due to compounding prediction errors.

Masked World Models (MWM)

Seo et al. (2023) proposed Masked World Models, which improve the representation learning component of Dreamer-style world models by incorporating masked autoencoder pre-training. By masking random patches of observations and training the encoder to reconstruct them, MWM learns richer visual representations that improve downstream dynamics modeling and policy learning. MWM achieved state-of-the-art results on the DeepMind Control Suite, particularly on tasks with complex visual observations. This work highlighted that the quality of the encoder -- not just the dynamics model -- is a critical bottleneck in world model performance.
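MAE-style patch masking can be sketched as follows; the patch size and mask ratio are illustrative, not necessarily the paper's settings:

```python
import numpy as np

def mask_patches(image, patch=8, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patches of an (H, W, C) image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch             # patch grid dimensions
    n_mask = int(gh * gw * mask_ratio)
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    out = image.copy()
    for i in idx:
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

masked = mask_patches(np.ones((64, 64, 3)))
```

The encoder only sees the surviving patches and must infer the rest, which forces it to learn global scene structure rather than local texture.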

TransDreamer and Transformer-Based World Models

Chen et al. (2022) proposed TransDreamer, which replaces the RNN backbone in the RSSM with a transformer, enabling parallel training and better long-range dependency modeling. Robine et al. (2023) further demonstrated that transformer-based world models can achieve competitive performance on the Atari 100k benchmark (only 100k environment interactions), suggesting that the transformer architecture's ability to capture long-range temporal dependencies directly benefits dynamics prediction. The shift from RNN to transformer backbones mirrors the broader trend in deep learning and enables world models to leverage the scaling properties of transformers.

Stochastic Latent Actor-Critic (SLAC)

Lee et al. (2020) proposed SLAC, which combines a sequential latent variable model with model-free actor-critic training. Unlike Dreamer, which trains the policy entirely in imagination, SLAC uses the latent variable model primarily for representation learning -- providing a compact, informative state representation to the actor-critic. SLAC demonstrated that even when the world model is not used for planning or imagined rollouts, learning a good dynamics model still provides substantial benefits through improved representations. This work helped clarify the dual role of world models: as simulators for planning and as representation learners for model-free RL.

The Objective Mismatch Problem

Lambert et al. (2020) identified a fundamental challenge in model-based RL: the objective mismatch between how world models are trained (minimizing prediction error) and how they are used (maximizing task reward). A world model that is globally accurate may allocate capacity to predicting task-irrelevant details while poorly predicting task-critical dynamics. Conversely, MuZero-style models trained with task-relevant objectives (reward and value prediction) can be highly effective for planning despite having no observation reconstruction capability. This insight motivates value-equivalent and decision-aware approaches to world model training.


References