
Video Prediction Models

Video prediction models learn to generate future video frames conditioned on past frames and, optionally, actions, forming a class of models closely related to world models. While video prediction and world modeling have historically been separate communities, they are converging rapidly: state-of-the-art world models increasingly use video generation architectures, and video prediction models are increasingly evaluated on their ability to support downstream decision-making.

Early Video Prediction: From LSTMs to Adversarial Training

The foundations of neural video prediction were laid by Srivastava et al. (2015), who demonstrated that LSTM encoder-decoders could learn useful video representations through future frame prediction. Oh et al. (2015) showed that action-conditioned video prediction in Atari games could be achieved with convolutional architectures, establishing the paradigm of conditioning predictions on agent actions. Finn et al. (2016) extended this to robotic manipulation, predicting future frames conditioned on robot actions to enable simple planning through visual foresight. Chiappa et al. (2017) proposed recurrent environment simulators that could generate long, coherent rollouts of game environments.

A key early insight was that pixel-space prediction with mean squared error produces blurry outputs because MSE averages over multiple possible futures. Mathieu et al. (2016) addressed this with adversarial training, using a discriminator to encourage sharp predictions, while Lotter et al. (2017) proposed PredNet, inspired by predictive coding theory in neuroscience, which learns hierarchical prediction errors. These early works established the core challenges that continue to define the field: handling stochasticity (multiple plausible futures), maintaining long-horizon coherence, and avoiding the blurriness of MSE-trained models.
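A toy calculation makes the blur concrete: when a model must emit a single prediction scored with MSE, the optimum is the pixelwise mean of the plausible futures, which matches none of them. The 1-D "frames" below are purely illustrative, not from any of the cited papers.

```python
import numpy as np

# Two equally likely futures for a 1-D "frame": the ball ends up
# on the left (pixel 0 bright) or on the right (pixel 4 bright).
future_left = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
future_right = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
futures = np.stack([future_left, future_right])

# Expected MSE of a single prediction p over the distribution of futures.
mse = lambda p: np.mean((futures - p) ** 2)

# The MSE-minimizing prediction is the pixelwise mean: a "blurry" frame
# with half-intensity blobs at both ends, matching neither future.
blurry = futures.mean(axis=0)

# Committing to either sharp future has higher expected MSE than the blur.
assert mse(blurry) < mse(future_left)
assert mse(blurry) < mse(future_right)
```

This is exactly the failure mode that adversarial losses and latent-variable models were introduced to escape: they let the model commit to one sharp future per sample instead of averaging.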

Stochastic Video Generation (SVG)

Denton and Fergus (2018) proposed SVG, which uses a VAE framework with a learned prior to generate stochastic video predictions. The key contribution was addressing the fundamental multimodality of video prediction: given a ball rolling toward a wall, it might bounce left or right, and a good model should be able to generate both futures. By sampling different latent codes from the learned prior, SVG can generate diverse plausible futures, each internally consistent. This was an important conceptual contribution, establishing that deterministic video prediction models are fundamentally limited by their inability to represent uncertainty.
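The learned-prior sampling idea can be sketched in a few lines. Random linear maps stand in for SVG's actual LSTM prior and CNN decoder networks; all names, shapes, and weights below are illustrative placeholders, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for SVG's networks (the real model uses LSTMs over CNN
# features); all weights here are random placeholders.
W_prior = rng.normal(size=(4, 8))    # past features -> [mu, log_sigma]
W_dec = rng.normal(size=(4 + 4, 4))  # [past features, latent z] -> next frame

def predict(past_feat):
    """Sample one plausible future conditioned on the past."""
    stats = past_feat @ W_prior
    mu, log_sigma = stats[:4], stats[4:]
    z = mu + np.exp(log_sigma) * rng.normal(size=4)  # sample learned prior
    return np.concatenate([past_feat, z]) @ W_dec

past = rng.normal(size=4)
future_a = predict(past)
future_b = predict(past)

# Different latent samples yield different futures from the same past --
# the multimodality that a deterministic predictor cannot represent.
assert not np.allclose(future_a, future_b)
```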

FitVid

Babaeizadeh et al. (2021) introduced FitVid, which systematically studied the design space of video prediction architectures. Rather than proposing a new algorithmic idea, FitVid conducted careful ablations of architecture choices: encoder depth, skip connections, latent space size, conditioning mechanisms, and resolution. The striking finding was that careful architecture choices (deep encoders, skip connections, large latent spaces) matter more than algorithmic novelty, achieving state-of-the-art video prediction quality with a relatively standard VAE-based architecture. This result served as an important corrective for the field, suggesting that engineering fundamentals may be more important than novel loss functions or training procedures.

Autoregressive Video Models

VideoGPT (Yan et al., 2021) proposed tokenizing video with a VQ-VAE and modeling the resulting discrete tokens autoregressively with a GPT-like transformer. This approach leverages the success of autoregressive language models for video generation, treating video as a "language" of visual tokens. VideoGPT demonstrated that the sequence modeling paradigm transfers effectively to the video domain, generating coherent short videos with temporal consistency.
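The tokenization step reduces to nearest-neighbour lookup in a codebook. The sketch below uses a random codebook in place of a trained VQ-VAE encoder; the sizes (512 codes, 8-dim features, 4x4 latent grid) are illustrative, not VideoGPT's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A VQ-VAE tokenizer replaces each encoder feature vector with the index
# of its nearest codebook entry; this codebook is a random stand-in.
codebook = rng.normal(size=(512, 8))      # 512 codes, 8-dim each
features = rng.normal(size=(2, 4, 4, 8))  # (frames, H, W, dim)

# Nearest-neighbour quantization: one discrete token per spatial position.
flat = features.reshape(-1, 8)
dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1).reshape(2, 4, 4)

# A GPT-style transformer then models p(token_i | tokens_<i) over the
# flattened token sequence, exactly as in language modeling.
sequence = tokens.reshape(-1)  # 32 tokens for this tiny "video"
assert sequence.shape == (32,)
```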

iVideoGPT (Wu et al., 2024) extended this paradigm to interactive world models, where the autoregressive video model is conditioned on actions to generate interactive environment simulations. The key insight is that conditioning an autoregressive video model on actions transforms it from a passive predictor into an interactive simulator. iVideoGPT showed that scaling VideoGPT-style models creates increasingly capable world simulators, with performance improving predictably with model size and data.

STORM (Zhang et al., 2025) proposed a transformer-based world model that uses a "stochastic transformer" to model both the deterministic and stochastic components of environment dynamics. STORM replaces the RSSM's RNN backbone with a transformer, enabling parallel training (which is much faster than sequential RNN training) while maintaining the ability to model stochastic futures. STORM achieves competitive performance with DreamerV3 on the Atari 100k benchmark while being significantly faster to train.
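The parallel-training point can be seen in a toy causal self-attention step: all timesteps are processed in one batched matrix product, with a mask enforcing the autoregressive property, whereas an RNN must unroll its steps sequentially. No learned weights here; this is purely illustrative.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (toy, no projections).

    The whole sequence is handled in one matrix product -- the reason
    transformer world models train in parallel over timesteps.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # block future steps
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))  # 6 latent timesteps, 4-dim each
out = causal_self_attention(seq)

# Step t attends only to steps <= t, so prefix outputs are unchanged
# when the sequence is extended -- the autoregressive property.
out_prefix = causal_self_attention(seq[:3])
assert np.allclose(out[:3], out_prefix)
```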

Video Diffusion Models

Diffusion-based video generation has emerged as the dominant paradigm for high-fidelity video synthesis. Ho et al. (2022) demonstrated Video Diffusion Models that generate temporally coherent video through iterative denoising, extending the success of image diffusion models to the temporal domain. Imagen Video (Ho et al., 2022) scaled this approach to high-definition video generation with a cascade of spatial and temporal super-resolution diffusion models, producing 1280x768 videos at 24 fps. MCVD (Voleti et al., 2022) introduced Masked Conditional Video Diffusion for flexible video prediction, interpolation, and generation. Make-A-Video (Singer et al., 2022) demonstrated text-to-video generation by decoupling spatial and temporal learning, training on image-text pairs for spatial understanding and on unlabeled video for temporal dynamics. Stable Video Diffusion (Blattmann et al., 2023) scaled latent video diffusion to large datasets, achieving state-of-the-art video generation quality with a model that could be fine-tuned for specific downstream applications. VideoPoet (Kondratyuk et al., 2024) unified video generation tasks within a single large language model, treating video tokens as another modality in a multimodal autoregressive framework.

These models produce substantially higher visual fidelity than VAE-based approaches, with sharper details and more realistic textures, but at significantly higher computational cost (requiring multiple denoising steps per frame). The convergence of video diffusion models with world modeling (as in DIAMOND and GameNGen) represents a major trend: the same architectures used for creative video generation can serve as high-fidelity environment simulators when conditioned on actions.
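The cost argument can be made concrete with a toy sketch of iterative denoising: each frame is produced by many small refinement steps rather than one forward pass. The "denoiser" below cheats by pointing toward a known clean frame, standing in for a learned noise-prediction network; it is an illustration of the sampling loop's structure, not a real diffusion sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "clean frame" the toy denoiser knows about (a learned network
# would instead predict the noise to remove at each step).
clean_frame = np.linspace(0.0, 1.0, 8)

def denoise_step(x, step_size=0.2):
    """One small refinement step toward the clean frame."""
    return x + step_size * (clean_frame - x)

x = rng.normal(size=8)   # start from pure noise
for _ in range(50):      # many denoising steps per frame -> high cost
    x = denoise_step(x)

# After enough steps the sample is close to a clean frame; a VAE decoder
# would have produced its (blurrier) output in a single pass instead.
assert np.allclose(x, clean_frame, atol=1e-3)
```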

Object-Centric Video Prediction

A distinct line of work focuses on structured, object-centric video prediction that explicitly decomposes scenes into objects and models their dynamics independently. The intellectual roots trace back to Interaction Networks (Battaglia et al., 2016), which modeled physical interactions between objects using graph neural networks, and Visual Interaction Networks (Watters et al., 2017), which extended this to learning physics from video. The broader framework of graph networks for physical reasoning was formalized by Battaglia et al. (2018).

Object discovery methods provide the perceptual foundation for object-centric world models. MONet (Burgess et al., 2019) and IODINE (Greff et al., 2019) demonstrated unsupervised decomposition of scenes into object representations, while Slot Attention (Locatello et al., 2020) provided a scalable, differentiable mechanism for binding visual features to object "slots." Graph-based physics simulation (Sanchez-Gonzalez et al., 2020) showed that learned simulators with graph structure could generalize to novel physical scenarios.
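The core of the Slot Attention update fits in a few lines: the attention softmax is normalized over the slot axis, so slots compete for input features, and each slot then updates to a weighted mean of the features it wins. The sketch below omits the paper's learned projections, layer norms, and GRU update, so it is a simplified illustration rather than the published algorithm.

```python
import numpy as np

def slot_attention(inputs, n_slots=3, n_iters=3, seed=0):
    """Simplified Slot Attention: competition via softmax over slots."""
    rng = np.random.default_rng(seed)
    N, d = inputs.shape
    slots = rng.normal(size=(n_slots, d))           # random slot init
    for _ in range(n_iters):
        logits = inputs @ slots.T / np.sqrt(d)      # (N, n_slots)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)     # normalize over SLOTS
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ inputs                  # weighted-mean update
    return slots, attn

rng = np.random.default_rng(1)
# Two well-separated clusters of "object" features in a toy 4-dim space.
features = np.concatenate([rng.normal(0, 0.1, (5, 4)) + 3.0,
                           rng.normal(0, 0.1, (5, 4)) - 3.0])
slots, attn = slot_attention(features)

# Each feature's attention distributes over slots and sums to one.
assert np.allclose(attn.sum(axis=1), 1.0)
```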

C-SWM (Kipf et al., 2020) proposed Contrastive Structured World Models, which learn object-centric state representations and pairwise interaction dynamics using a contrastive loss. By decomposing the scene into objects and learning their interaction graph, C-SWM can generalize to novel object configurations not seen during training.
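A single-state sketch of the contrastive transition loss, assuming squared Euclidean energies: the predicted next state is pulled toward the true next embedding, while negative samples are pushed at least a margin away from it via a hinge. The published model computes these energies per object slot, with a graph network producing the transition update; everything below is a simplified stand-in.

```python
import numpy as np

def cswm_loss(z_t, z_next, z_neg, delta, margin=1.0):
    """Simplified C-SWM-style contrastive transition loss.

    `delta` stands in for the learned transition model's output T(z_t, a).
    Positive term: predicted next state should match the true z_next.
    Negative term: a hinge keeps negatives at least `margin` (in squared
    distance) away from the true next-state embedding.
    """
    positive = ((z_t + delta - z_next) ** 2).sum()
    negative = ((z_neg - z_next) ** 2).sum()
    return positive + max(0.0, margin - negative)

z_t = np.zeros(2)
delta = np.array([1.0, 0.0])   # hypothetical transition prediction
z_next = np.array([1.0, 0.0])  # true next state: positive term is zero

# A distant negative incurs no hinge penalty; the loss is purely the
# (here zero) positive term.
loss = cswm_loss(z_t, z_next, np.array([5.0, 0.0]), delta)
assert np.isclose(loss, 0.0)
```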

SlotFormer (Wu et al., 2023) extended this idea, using slot attention to discover objects and a transformer to model their temporal dynamics. SlotFormer achieves state-of-the-art unsupervised video prediction on multi-object scenes and supports downstream visual reasoning tasks. The compositional structure of object-centric world models enables systematic generalization: understanding the dynamics of individual objects and their interactions generalizes to novel combinations.

Object-centric world models connect to a deep question about the right level of abstraction for world models: should world models predict pixels (maximally detailed but computationally expensive), abstract features (efficient but potentially losing important structure), or objects and their relations (structured and generalizable but requiring object discovery)?


References