
Introduction & Motivation

Biological neural systems learn continuously throughout their lifetime, seamlessly integrating new knowledge without destroying what was previously learned. The human brain can acquire new skills over decades -- learning to drive, mastering a new language, adapting to new technology -- while retaining core competencies learned in childhood. Artificial neural networks, by contrast, suffer from catastrophic forgetting (also called catastrophic interference) -- the tendency to abruptly lose performance on previously learned tasks when trained on new data [@mccloskey1989catastrophic; @ratcliff1990connectionist]. This phenomenon, first identified by McCloskey and Cohen (1989) and later formalized by French (1999), remains one of the central obstacles to building AI systems that can learn and adapt over time.

The root cause of catastrophic forgetting lies in the stability-plasticity dilemma (Grossberg, 1980): a learning system must be plastic enough to acquire new knowledge, yet stable enough to retain what it has already learned. In standard neural network training with gradient descent, the parameters are globally shared across all tasks, and optimizing for a new task objective can arbitrarily overwrite representations critical for earlier tasks. This is fundamentally different from biological learning, where complementary learning systems in the hippocampus and neocortex work in concert to consolidate memories while maintaining flexibility for new learning [@mcclelland1995there; @kumaran2016what].
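This failure mode is easy to reproduce at toy scale. The sketch below (plain NumPy; the two deliberately conflicting synthetic tasks and all names are illustrative, not drawn from any benchmark) trains a logistic regression on task A, then fine-tunes the same shared weights on task B, and measures how task A accuracy collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(w, X, y, lr=0.5, steps=200):
    # Plain gradient descent on the logistic loss. All parameters are
    # shared across tasks, so nothing protects the old solution.
    for _ in range(steps):
        p = sigmoid(X @ w)
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(np.mean((sigmoid(X @ w) > 0.5) == y))

# Task A: label is the sign of the first feature.
# Task B: the label rule is reversed, so its gradients directly
# overwrite the task-A solution.
X_a = rng.normal(size=(500, 2)); y_a = (X_a[:, 0] > 0).astype(float)
X_b = rng.normal(size=(500, 2)); y_b = (X_b[:, 0] < 0).astype(float)

w = train(np.zeros(2), X_a, y_a)
acc_a_before = accuracy(w, X_a, y_a)  # high after learning task A
w = train(w, X_b, y_b)                # naive sequential fine-tuning
acc_a_after = accuracy(w, X_a, y_a)   # task A performance collapses
```

The conflict here is maximal by construction; real task sequences forget more gradually, but the mechanism -- gradients of the new objective moving shared weights away from the old optimum -- is the same.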

Continual learning (CL), also referred to as lifelong learning or incremental learning, seeks to endow neural networks with the ability to learn from a non-stationary stream of data, accumulating knowledge over time while maintaining performance on earlier tasks [@parisi2019continual; @delange2021continual; @wang2024comprehensive]. This capability is essential in practice: real-world deployment scenarios -- from autonomous vehicles encountering new road conditions, to medical AI systems facing evolving disease patterns, to recommendation systems adapting to shifting user preferences -- require models that adapt without retraining from scratch.

The field has experienced a renaissance driven by four converging forces:

  1. Prohibitive retraining costs. The deployment of large-scale AI systems has made the cost of full retraining prohibitive. Training a frontier language model costs tens of millions of dollars and takes months on thousands of accelerators. Updating such models with new knowledge must be incremental, not from scratch.

  2. Foundation model adaptation. The success of foundation models (BERT, GPT, LLaMA, etc.) has raised urgent questions about how to update these models with new knowledge -- new facts, new languages, new capabilities -- without degrading existing performance. This has given rise to a new subfield of continual learning specifically for LLMs [@scialom2025continual; @jang2022towards; @wu2024continual_llm_survey].

  3. Neuroscience inspiration. Research on synaptic consolidation, complementary learning systems, and memory replay in the brain has inspired new algorithmic approaches. The discovery that the brain consolidates memories during sleep through replay of hippocampal activity (Ji & Wilson, 2007) directly motivated experience replay methods in continual learning (Shin et al., 2017). Similarly, work on synaptic metaplasticity (Abraham, 2008) inspired parameter importance-based methods like EWC (Kirkpatrick et al., 2017).

  4. Pre-trained model paradigm shift. The rise of pre-trained vision transformers (ViT) and large language models has created an entirely new paradigm for continual learning -- one where the focus shifts from learning representations from scratch to efficiently adapting pre-trained representations. This has given rise to prompt-based continual learning methods [@wang2022learning_l2p; @wang2022dualprompt] that represent a fundamentally different approach from classical continual learning.
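
To make the synaptic-consolidation idea in point 3 concrete, here is a minimal sketch of the EWC-style quadratic penalty, assuming a model flattened to a single parameter vector; the function names and the diagonal-Fisher estimate from squared per-example gradients are illustrative, not taken from any library:

```python
import numpy as np

def fisher_diagonal(grads_per_example):
    # grads_per_example: (n_examples, n_params) array of per-example
    # log-likelihood gradients at the task-A solution. The diagonal
    # Fisher is estimated as the mean of squared gradients.
    return np.mean(grads_per_example ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    # Quadratic pull toward the task-A solution theta_star, weighted
    # by parameter importance: (lam / 2) * sum_i F_i (theta_i - theta*_i)^2.
    # Important parameters (large F_i) become stiff; unimportant ones
    # stay plastic for the new task.
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

# The task-B training objective would then be (schematically):
#   total_loss = task_b_loss(theta) + ewc_penalty(theta, theta_star, fisher)
```

This mirrors the loss in Kirkpatrick et al. (2017); in practice the Fisher diagonal is accumulated per layer rather than over one flat vector.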

The scale of the forgetting problem has been systematically documented. In a comprehensive empirical study, de Lange et al. (2021) showed that naive sequential fine-tuning on Split-CIFAR-100 causes accuracy on the first task to drop from over 90% to below 20% after learning 10 subsequent tasks. The severity depends on task similarity: highly dissimilar tasks cause more abrupt forgetting, while related tasks may partially benefit from shared representations (Lee et al., 2021). Masana et al. (2023) conducted an extensive evaluation of class-incremental methods, comparing 13 methods across multiple settings and revealing that the gap between the best continual learning methods and joint training (the upper bound) remains 10-20 percentage points on challenging benchmarks -- a sobering reality check for the field.
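Results like these are reported with standard metrics. One common formulation in the literature records an accuracy matrix `acc[i, j]` -- the accuracy on task j after training on task i -- and derives average accuracy and average forgetting from it. The numbers below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

# acc[i, j] = accuracy on task j after training on task i
# (lower triangle only: task j is unseen until step j).
# Hypothetical numbers for a 3-task sequence.
acc = np.array([
    [0.92, 0.00, 0.00],
    [0.55, 0.90, 0.00],
    [0.30, 0.70, 0.88],
])

T = acc.shape[0]
# Average accuracy: mean over all tasks after the final training step.
avg_acc = acc[-1].mean()
# Forgetting on task j: best accuracy it ever reached minus its final
# accuracy; averaged over all tasks except the last.
forgetting = np.array([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)])
avg_forgetting = forgetting.mean()
```

With these illustrative numbers, task 0 forgets 0.62 and task 1 forgets 0.20, so average forgetting is 0.41 -- the kind of summary statistic behind the Split-CIFAR-100 figures cited above.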

The relationship between continual learning and transfer learning deserves careful attention. Positive forward transfer -- where learning earlier tasks improves learning of later tasks -- is achievable when tasks share structure, but most methods focus almost exclusively on preventing negative backward transfer (forgetting) rather than promoting positive transfer. The few methods that explicitly optimize for forward transfer (the lateral connections of Progressive Neural Networks (Rusu et al., 2016); modular networks (Mendez & Eaton, 2022)) achieve it only in limited settings. The grand challenge of knowledge accumulation -- where each new task makes the system globally better -- remains largely unsolved.

This chapter provides a comprehensive review of continual learning methods, organized by algorithmic family. We cover regularization-based, replay-based, architecture-based, and meta-learning approaches, as well as the emerging paradigm of prompt-based continual learning for pre-trained models, and the intersection of continual learning with large language models. We discuss standard benchmarks, evaluation protocols and their pitfalls, and conclude with open problems and connections to other chapters in this survey. Throughout, we aim to provide not just a catalog of methods but a critical analysis of what works, why, and under what conditions.


References