
Benchmarks & Evaluation

Standard Reinforcement Learning Benchmarks

  • Atari 100k: Atari games evaluated with a budget of exactly 100,000 environment interactions (approximately 2 hours of real-time play) (Kaiser et al., 2020). This is the primary benchmark for sample-efficient world model RL, testing whether a world model can learn enough about game dynamics from limited experience to train an effective policy. Current state of the art: DreamerV3 achieves ~1.19 human-normalized mean score (Hafner et al., 2023); DIAMOND achieves ~1.46 using diffusion-based world modeling (Alonso et al., 2024); EfficientZero reaches superhuman mean and median human-normalized scores (Ye et al., 2021). The benchmark includes 26 games spanning diverse challenges (exploration, memory, rapid reaction, planning).

  • Atari 200M: Full-scale Atari evaluation (200 million frames) for assessing asymptotic performance. Used to evaluate whether world models can match model-free methods given sufficient data. DreamerV3 matches or exceeds top model-free methods on most games.

  • DeepMind Control Suite (DMControl): A set of continuous control tasks with physics simulation (Tassa et al., 2018). Standard for evaluating world models on locomotion (walker, cheetah, hopper), manipulation (finger, reacher), and balancing tasks (cartpole, pendulum). Tasks range from easy (cartpole balance) to hard (humanoid walk) and test both sample efficiency and asymptotic performance. Dreamer v1/v2/v3 achieved progressively better results, with DreamerV3 matching model-free methods at 1M steps while requiring only 100K steps for most tasks (Hafner et al., 2020, 2021, 2023). TD-MPC2 (Hansen et al., 2024) demonstrated scaling to 80+ tasks across diverse embodiments.

  • Minecraft (Diamond): A demanding long-horizon planning benchmark: autonomously collecting a diamond in Minecraft requires discovering and executing a complex sequence of subtasks (collect wood, craft planks, craft sticks, craft pickaxe, mine cobblestone, craft stone pickaxe, mine iron, smelt iron, craft iron pickaxe, mine diamond) spanning thousands of environment steps with sparse reward (received only at the end). DreamerV3 (Hafner et al., 2023) was the first agent, and remains one of the few, to achieve this without human data, curricula, or task-specific engineering.
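The human-normalized scores quoted for Atari 100k above are computed per game against random and human reference scores, then aggregated. A minimal sketch of that computation, using hypothetical baseline numbers for illustration:

```python
def human_normalized(agent_score, random_score, human_score):
    """0.0 = random play, 1.0 = the human reference score."""
    return (agent_score - random_score) / (human_score - random_score)

# Hypothetical raw scores: (agent, random baseline, human baseline).
per_game = {
    "Breakout": (45.0, 1.7, 30.5),
    "Pong": (15.0, -20.7, 14.6),
}
normalized = [human_normalized(a, r, h) for a, r, h in per_game.values()]
mean_hns = sum(normalized) / len(normalized)  # the headline benchmark number
```

Mean and median of the per-game normalized scores are the two standard aggregates; the median is less sensitive to a few games with extreme normalized scores.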

Robotics Benchmarks

  • Meta-World: A suite of 50 robotic manipulation tasks for multi-task and meta-learning evaluation (Yu et al., 2020), ranging from simple reaching to complex assembly. Used to evaluate whether world models can generalize across manipulation skills. TD-MPC2 demonstrated that a single world model can handle 50 Meta-World tasks with a single set of hyperparameters.
  • CALVIN: A benchmark for long-horizon, language-conditioned robotic manipulation, requiring execution of multi-step instructions in a simulated kitchen environment (Mees et al., 2022). CALVIN tests the ability to chain 5+ sub-tasks from language descriptions, requiring both world prediction and language grounding.
  • Habitat: An embodied AI platform for navigation and interaction in photorealistic 3D environments (Savva et al., 2019). Used to evaluate world models for visual navigation and household tasks. Habitat 2.0 extended to interactive tasks (rearrangement, social navigation) requiring prediction of object and agent dynamics.
  • CARLA: An autonomous driving simulator used to evaluate world models for driving tasks, including scenario generation, planning, and control (Dosovitskiy et al., 2017). CARLA provides diverse weather, lighting, and traffic conditions, and is the standard testbed for end-to-end driving approaches.
  • RLBench: A suite of 100 robot manipulation tasks with varying difficulty (James et al., 2020), from simple pick-and-place to complex multi-step assembly. RLBench provides both state-based and vision-based evaluation, enabling analysis of how much visual prediction contributes to task performance.
  • BEHAVIOR-1K: A benchmark of 1,000 everyday activities in realistic simulated environments (Srivastava et al., 2022), requiring world models to predict the outcomes of complex, multi-step household tasks. BEHAVIOR-1K tests long-horizon prediction and compositional generalization in realistic settings.
  • Real-World Robot Benchmarks: DayDreamer-style evaluations on physical robots (quadruped walking, robotic arm manipulation, RC car driving) (Wu et al., 2022) provide the most practically relevant but least standardized evaluation. RT-2 evaluations (Brohan et al., 2023) test vision-language-action models on real robotic manipulation, providing a testbed for assessing whether world models transfer from simulation to real hardware.

Video Prediction and Generation Benchmarks

  • FVD (Fréchet Video Distance): The standard quantitative metric for video generation quality, measuring the distance between generated and real video distributions in a learned feature space. Lower FVD indicates higher quality, but FVD does not fully capture temporal coherence or physical plausibility.
  • SSIM, PSNR, LPIPS: Pixel-level and perceptual similarity metrics for individual frames. These metrics are easy to compute but poorly correlated with downstream task performance.
  • RoboNet, Something-Something, Kinetics: Video datasets used for training and evaluating video prediction models, spanning robotic manipulation, human actions, and diverse activities.
  • PhysBench: Emerging benchmarks such as PhysBench are specifically designed to test physical understanding (object permanence, gravity, collision dynamics) in video prediction models.
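Of the frame-level metrics above, PSNR is the simplest to state: it is the log-scaled ratio of the maximum pixel value to the mean squared error between frames. A minimal pure-Python sketch over flat pixel sequences (illustrative only; practical evaluations use vectorized library implementations):

```python
import math

def psnr(frame_a, frame_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Higher PSNR is "better", but as noted below, a model can score well on such pixel-level metrics while missing task-critical dynamics.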

Evaluation Challenges

World model evaluation faces several fundamental challenges:

  1. Prediction quality vs. task performance disconnect: Pixel-level prediction accuracy (PSNR, SSIM, FVD) does not reliably correlate with downstream task performance (Schrittwieser et al., 2020). MuZero's world model never reconstructs observations at all, predicting only the rewards, values, and policies needed for planning, yet it supports superhuman play. Conversely, a visually perfect world model may predict irrelevant details accurately while missing task-critical dynamics. This disconnect raises a fundamental question: what should we optimize world models for?

  2. Long-horizon evaluation: World model predictions degrade rapidly over long horizons due to compounding errors. Evaluating 1-step prediction quality is easy but uninformative about planning utility; evaluating 100-step prediction quality is informative but difficult (what is the right metric for stochastic multi-step prediction?).

  3. Stochastic environments: In stochastic environments, the "correct" future prediction is not unique. Deterministic metrics penalize the model for generating one plausible future when the ground truth happened to be a different plausible future. Distribution-level metrics (FVD) partially address this but are noisy and expensive to compute.

  4. Generalization evaluation: There is no standard protocol for evaluating whether a world model generalizes across environments, embodiments, or physics parameters. Developing standardized tests for generalization (transfer to new objects, new physics, new visual appearances) is an open challenge.

  5. Safety-critical evaluation: For autonomous driving and medical robotics, the relevant question is not "how accurate is the world model on average?" but "how accurate is it in the worst case?" Evaluating worst-case prediction quality and its impact on downstream safety remains largely unsolved.
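The compounding-error effect in challenge 2 can be made concrete with a toy open-loop rollout: a model with a tiny one-step error drifts steadily from the true trajectory because it never re-observes the true state. A sketch with hypothetical 1-D exponential dynamics:

```python
def open_loop_errors(true_step, model_step, x0, horizon):
    """Roll both systems forward; the model never re-observes the true state."""
    x_true, x_pred, errors = x0, x0, []
    for _ in range(horizon):
        x_true = true_step(x_true)
        x_pred = model_step(x_pred)
        errors.append(abs(x_true - x_pred))
    return errors

# ~1% per-step modeling error on simple exponential dynamics.
errs = open_loop_errors(lambda x: 1.05 * x, lambda x: 1.06 * x, 1.0, 20)
# The 1-step error is tiny, but the error grows monotonically with horizon.
```

This is why 1-step metrics are uninformative about planning utility: here the 20-step error is tens of times larger than the 1-step error, even though each individual transition is predicted almost perfectly.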


References