Connections to Other Chapters
Efficient Architecture Design (Chapter 3): Architecture-based continual learning methods are deeply intertwined with efficient architecture design, and many of the most promising directions in continual learning emerge from this intersection:
- Mixture-of-Experts for continual learning: MoE architectures (Shazeer et al., 2017) naturally support continual learning by routing new tasks to new experts while freezing old ones, providing both computational efficiency and forgetting prevention. The sparse activation patterns in MoE models reduce interference between tasks, and adding experts per task mirrors the parameter-isolation strategy of PackNet (Mallya & Lazebnik, 2018) and the column expansion of PNN (Rusu et al., 2016). LoRAMoE (Dou et al., 2024) combines LoRA modules in an MoE framework specifically for continual adaptation of LLMs.
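To make the routing idea concrete, here is a minimal pure-Python sketch (the `TaskRoutedMoE` class and its linear "experts" are illustrative inventions, not any published architecture; real MoE routers are learned networks rather than task-id lookups):

```python
class TaskRoutedMoE:
    """Toy mixture-of-experts for continual learning: each task gets its
    own expert (here, a simple linear map), and experts for earlier tasks
    are frozen so later training cannot overwrite them."""

    def __init__(self, dim):
        self.dim = dim
        self.experts = {}    # task_id -> weight vector
        self.frozen = set()  # task_ids whose experts no longer update

    def add_task(self, task_id):
        # Freeze all existing experts before adding a new one.
        self.frozen.update(self.experts)
        self.experts[task_id] = [0.0] * self.dim

    def train_step(self, task_id, x, target, lr=0.1):
        # Only the active (unfrozen) expert receives gradient updates.
        if task_id in self.frozen:
            raise ValueError(f"expert for task {task_id} is frozen")
        w = self.experts[task_id]
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - target
        for i in range(self.dim):
            w[i] -= lr * err * x[i]

    def forward(self, task_id, x):
        w = self.experts[task_id]
        return sum(wi * xi for wi, xi in zip(w, x))
```

Routing by task id, as here, corresponds to the task-incremental setting; the freeze step is what prevents forgetting by construction.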
- Parameter-efficient fine-tuning: LoRA (Hu et al., 2022), adapters (Houlsby et al., 2019), and prefix tuning from Chapter 3 serve dual purposes: enabling efficient inference and facilitating continual learning by limiting the number of parameters modified per task. O-LoRA (Wang et al., 2023) explicitly constrains successive LoRA modules to be orthogonal, directly preventing interference. AdapterCL (Madotto et al., 2021) trains task-specific adapters while keeping the backbone frozen, bridging PEFT and continual learning.
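The orthogonality constraint can be sketched as a penalty term. The following pure-Python function (a simplification of the O-LoRA idea; the name `ortho_penalty` and the list-of-rows representation are ours) computes the quantity that O-LoRA-style training drives toward zero:

```python
def ortho_penalty(A_old, A_new):
    """Squared Frobenius norm of A_old @ A_new^T.

    A_old, A_new: lists of rows (each row a list of floats) representing
    the low-rank 'A' factors of two LoRA modules over the same input
    dimension. Driving this penalty to zero makes every row of A_new
    orthogonal to every row of A_old, so the new module's update lives
    in a subspace the old module does not use."""
    total = 0.0
    for u in A_old:
        for v in A_new:
            dot = sum(ui * vi for ui, vi in zip(u, v))
            total += dot * dot
    return total
```

Orthogonal factors incur zero penalty; overlapping ones incur a positive penalty proportional to their alignment, which is added (with a weight) to the new task's loss.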
- Knowledge distillation for compression and CL: Knowledge distillation (Hinton et al., 2015), originally developed for model compression, became a core technique in continual learning through LwF (Li & Hoiem, 2017), DER++ (Buzzega et al., 2020), and FOSTER (Wang et al., 2022). The same teacher-student framework serves both efficient deployment and knowledge preservation.
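The shared teacher-student machinery is compact enough to sketch. Below is the temperature-softened distillation loss of Hinton et al. (2015) in pure Python (function names are ours; LwF additionally combines this term with the new task's cross-entropy loss):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions.
    In LwF-style continual learning, the 'teacher' is a frozen copy of
    the model from before the new task, so this term penalizes drift on
    old-task outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi + 1e-12) for pi, qi in zip(p, q))
```

By Gibbs' inequality the loss is minimized when the student matches the teacher, which is exactly the knowledge-preservation pressure continual learning needs.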
- State space models: SSMs like Mamba (Gu & Dao, 2024), with their recurrent structure, may offer advantages for online continual learning settings where data arrives sequentially. Their constant-memory inference is analogous to the bounded-memory constraint in online CL.
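The constant-memory property is visible even in a toy linear SSM recurrence (this scalar sketch omits Mamba's selective, input-dependent parameters; coefficient values are arbitrary):

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Toy linear state-space recurrence: h_t = a*h_{t-1} + b*x_t,
    y_t = c*h_t. Only the current state h is kept, so memory is O(1)
    in sequence length -- the property that makes recurrent SSMs
    attractive for processing unbounded data streams."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

Contrast this with attention, whose key-value cache grows linearly with the sequence seen so far.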
- Quantization and continual learning: Model quantization techniques interact with continual learning in complex ways -- quantized models may have different forgetting characteristics than full-precision models, and quantization-aware continual training remains underexplored.
World Models (Chapter 2): Continual learning is essential for world models that must adapt to changing environments, creating a bidirectional connection between these chapters:
- Non-stationary environments: A robot's world model must update as it encounters new objects, environments, or physical interactions, making catastrophic forgetting a practical concern for model-based RL. Dreamer agents (Hafner et al., 2020; Hafner et al., 2023) deployed in non-stationary environments face the classic stability-plasticity dilemma: the dynamics model must adapt to new environments while retaining knowledge of previously learned physics.
- Complementary learning systems: The CLS framework that inspires CLS-ER (Arani et al., 2022) and other replay-based methods also provides a blueprint for how world models might consolidate environmental knowledge. The hippocampal-cortical interaction modeled by CLS-ER is directly relevant to how an agent should consolidate world knowledge acquired through exploration.
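One mechanism CLS-ER uses for consolidation is an exponential moving average of the working model's weights into a slowly-changing stable model. A heavily simplified sketch (function name ours; the full method also replays samples and matches the stable model's logits):

```python
def ema_consolidate(stable, working, decay=0.999):
    """One consolidation step: the stable ('cortical') model drifts
    slowly toward the fast-updating working ('hippocampal') model,
    integrating new knowledge without abruptly overwriting old weights.
    Both arguments are flat lists of parameters."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(stable, working)]
```

The decay rate sets the stability-plasticity tradeoff directly: closer to 1 means slower consolidation and stronger retention.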
- Experience replay in RL: The replay buffers used in model-based RL, e.g. Dyna (Sutton, 1990) and MuZero (Schrittwieser et al., 2020), are essentially identical to the replay buffers in continual learning. Reservoir sampling (Vitter, 1985) provides the theoretical foundation for maintaining representative buffers in both settings.
- Continual world model learning: As environments evolve (new objects appear, physics change, layouts are modified), world models must perform continual learning of dynamics and observation models -- a setting that combines the challenges of both chapters.
Agentic Search (Chapter 4): Agents that interact with evolving knowledge bases must continually update their retrieval and reasoning capabilities, creating several important connections:
- Continual retrieval learning: Dense retrieval models like DPR (Karpukhin et al., 2020) and ColBERT (Khattab & Zaharia, 2020) must be continually updated as document collections grow and change. The retriever must index new documents without degrading retrieval quality for existing queries -- a class-incremental learning problem applied to retrieval.
- Knowledge editing and RAG: Knowledge editing in LLMs (ROME (Meng et al., 2022), MEMIT (Meng et al., 2023)) connects to RAG systems where the model's parametric knowledge must be kept consistent with the retrieved evidence. When external knowledge changes, the model must update its parametric knowledge accordingly, facing the same stability-plasticity tradeoff as classical continual learning.
- Self-improving search agents: Agents like Voyager (Wang et al., 2023) and Auto-GPT (Gravitas, 2023) that learn from their search experiences over time face a continual learning challenge: how to accumulate effective search strategies without forgetting previously successful approaches. Reflexion (Shinn et al., 2023) maintains a memory of past reasoning failures that must be managed to avoid interference.
- Continual reasoning: As LLMs are continually fine-tuned for new reasoning capabilities (chain-of-thought (Wei et al., 2022), tool use, planning), maintaining existing reasoning abilities while acquiring new ones is a continual learning problem that connects to both continual instruction tuning and agentic capabilities.
Randomized Algorithms (Chapter 5): Sketching, hashing, and sampling techniques from Chapter 5 provide essential computational infrastructure for scalable continual learning:
- Compressed replay buffers: Sketching algorithms (Count-Min Sketch, CountSketch) can compress replay buffers, storing approximate representations of previous task distributions in bounded memory rather than raw exemplars. REMIND (Hayes et al., 2020) stores compressed intermediate representations, a form of lossy compression related to sketching.
- Efficient parameter importance computation: Random projections enable efficient computation of parameter importance for EWC-style methods (Kirkpatrick et al., 2017) without materializing the full Fisher information matrix. For a model with N parameters, the diagonal Fisher requires O(N) storage, but the full Fisher is O(N^2) -- random projections reduce this to O(N * k) where k is the projection dimension.
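The projection trick can be sketched directly. The hypothetical helper below accumulates a k x k sketch of the empirical Fisher, F = (1/m) * sum_i g_i g_i^T, by projecting each per-sample gradient before taking outer products (dimensions here are tiny and illustrative):

```python
import random

def sketch_fisher(grads, k, seed=0):
    """Project each gradient g (length N) to z = P g with a random
    Gaussian matrix P (k x N), then accumulate z z^T / m. Storage is
    O(N*k) for P plus O(k^2) for the sketch, instead of O(N^2) for
    the full Fisher matrix."""
    n = len(grads[0])
    rng = random.Random(seed)
    scale = 1.0 / (k ** 0.5)       # standard scaling for JL projections
    P = [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(k)]
    F = [[0.0] * k for _ in range(k)]
    m = len(grads)
    for g in grads:
        z = [sum(pij * gj for pij, gj in zip(row, g)) for row in P]
        for i in range(k):
            for j in range(k):
                F[i][j] += z[i] * z[j] / m
    return F
```

The sketch preserves the Fisher's quadratic form approximately (in expectation), which is what an EWC-style penalty actually needs.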
- Streaming PCA for representation tracking: GPM (Saha et al., 2021) maintains a basis for each task's representation subspace. Streaming PCA algorithms (Frequent Directions) from Chapter 5 provide the algorithmic machinery for maintaining these bases efficiently as the number of tasks grows.
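GPM's core step, projecting new-task gradients out of the protected subspace, reduces to Gram-Schmidt bookkeeping. A simplified pure-Python sketch (GPM itself extracts the basis from layer activations via SVD, which is omitted here):

```python
def project_out(g, basis):
    """Remove from gradient g its component inside the stored subspace:
    g' = g - sum_b <g, b> b over orthonormal basis vectors b. Updating
    along g' leaves earlier tasks' representations unchanged to first
    order -- the core step of gradient projection memory."""
    g = list(g)
    for b in basis:
        coef = sum(gi * bi for gi, bi in zip(g, b))
        g = [gi - coef * bi for gi, bi in zip(g, b)]
    return g

def add_direction(v, basis, tol=1e-10):
    """Gram-Schmidt: after finishing a task, add the part of v lying
    outside the current basis (normalized), growing the protected
    subspace. Directions already in the span are ignored."""
    r = project_out(v, basis)
    norm = sum(x * x for x in r) ** 0.5
    if norm > tol:
        basis.append([x / norm for x in r])
    return basis
```

Frequent Directions would replace the exact basis with a fixed-size sketch, keeping memory bounded as tasks accumulate.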
- Reservoir sampling: Vitter's reservoir sampling (Vitter, 1985) provides the theoretical foundation for replay buffer management, guaranteeing that the buffer maintains a uniform random sample of all data seen so far -- a critical property for unbiased replay.
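Vitter's Algorithm R is short enough to state in full; this is the standard algorithm itself, wrapped in a buffer class whose name is ours:

```python
import random

class ReservoirBuffer:
    """Vitter's Algorithm R: after seeing n items, the buffer holds a
    uniform random sample of size min(n, capacity), using O(capacity)
    memory over an unbounded stream -- exactly the guarantee needed
    for an unbiased replay buffer."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / n_seen,
            # replacing a uniformly chosen slot.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item
```

An inductive argument shows every item seen so far survives with equal probability capacity/n, which is why replayed batches remain unbiased estimates of the full history.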
- The streaming computation perspective: Online continual learning -- processing data in a single pass with bounded memory -- is formally equivalent to the streaming computation model that motivates sketching and sampling algorithms in Chapter 5. This mathematical equivalence suggests that lower bounds from streaming complexity theory may apply to continual learning, potentially establishing fundamental limits on what can be learned in a single pass.
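A familiar micro-example of the streaming model is Welford's single-pass mean and variance, which estimates distribution statistics of an unbounded stream in O(1) memory, the same resource profile online CL demands of its learners:

```python
def streaming_mean_var(xs):
    """Welford's algorithm: one pass over the stream, constant memory,
    returning the running mean and population variance. Numerically
    stabler than accumulating raw sums of squares."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / n if n else 0.0)
```

Streaming lower bounds constrain what any such one-pass, bounded-memory procedure can compute, which is what makes them candidates for fundamental limits on online continual learning.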
References
- Elahe Arani, Fahad Sarfraz, Bahram Zonooz (2022). Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System Theory. ICLR.
- Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, Simone Calderara (2020). Dark Experience for General Continual Learning: a Strong, Simple Baseline. NeurIPS.
- Significant Gravitas (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub.
- Albert Gu, Tri Dao (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv.
- Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR.
- Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap (2023). Mastering Diverse Domains through World Models. arXiv.
- Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, Christopher Kanan (2020). REMIND Your Neural Network to Prevent Catastrophic Forgetting. ECCV.
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly (2019). Parameter-Efficient Transfer Learning for NLP. ICML.
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Shihan Dou, Enyu Zhou, Yan Liu, et al. (2024). LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin. ACL.
- Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
- Omar Khattab, Matei Zaharia (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR.
- James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS.
- Zhizhong Li, Derek Hoiem (2017). Learning without Forgetting. IEEE TPAMI.
- Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhiguang Wang, Liang Qiu, Yue Zhang, Pascale Fung (2021). Continual Learning in Task-Oriented Dialogue Systems. EMNLP.
- Arun Mallya, Svetlana Lazebnik (2018). PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. CVPR.
- Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
- Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, David Bau (2023). Mass-Editing Memory in a Transformer. ICLR.
- Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, et al. (2016). Progressive Neural Networks. arXiv.
- Gobinda Saha, Isha Garg, Kaushik Roy (2021). Gradient Projection Memory for Continual Learning. ICLR.
- Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature.
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR.
- Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS.
- Richard S. Sutton (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ICML.
- Jeffrey S. Vitter (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software.
- Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan (2022). FOSTER: Feature Boosting and Compression for Class-Incremental Learning. ECCV.
- Xiao Wang, Tianze Chen, Qiming Ge, et al. (2023). Orthogonal Subspace Learning for Language Model Continual Learning. EMNLP Findings.
- Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.