Connections to Other Chapters
Efficient Architecture Design (Chapter 3): Architecture-based continual learning methods are deeply intertwined with efficient architecture design, and many of the most promising directions in continual learning emerge from this intersection:
- Mixture-of-Experts for continual learning: MoE architectures (Shazeer et al., 2017) naturally support continual learning by routing new tasks to new experts while freezing old ones, providing both computational efficiency and forgetting prevention. The sparse activation patterns in MoE models reduce interference between tasks, and adding experts per task mirrors the parameter-isolation strategy of PackNet (Mallya & Lazebnik, 2018) and the column expansion of PNN (Rusu et al., 2016). LoRAMoE (Dou et al., 2024) combines LoRA modules in an MoE framework specifically for continual adaptation of LLMs.
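To make the routing idea concrete, here is a minimal pure-Python sketch (the `TaskRoutedMoE` class and its linear "experts" are illustrative inventions, not any published architecture; real MoE routers are learned networks rather than task-id lookups):

```python
class TaskRoutedMoE:
    """Toy mixture-of-experts for continual learning: each task gets its
    own expert (here, a simple linear map), and experts for earlier tasks
    are frozen so later training cannot overwrite them."""

    def __init__(self, dim):
        self.dim = dim
        self.experts = {}    # task_id -> weight vector
        self.frozen = set()  # task_ids whose experts no longer update

    def add_task(self, task_id):
        # Freeze all existing experts before adding a new one.
        self.frozen.update(self.experts)
        self.experts[task_id] = [0.0] * self.dim

    def train_step(self, task_id, x, target, lr=0.1):
        # Only the active (unfrozen) expert receives gradient updates.
        if task_id in self.frozen:
            raise ValueError(f"expert for task {task_id} is frozen")
        w = self.experts[task_id]
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - target
        for i in range(self.dim):
            w[i] -= lr * err * x[i]

    def forward(self, task_id, x):
        w = self.experts[task_id]
        return sum(wi * xi for wi, xi in zip(w, x))
```

Routing by task id, as here, corresponds to the task-incremental setting; the freeze step is what prevents forgetting by construction.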
- Parameter-efficient fine-tuning: LoRA (Hu et al., 2022), adapters (Houlsby et al., 2019), and prefix tuning from Chapter 3 serve dual purposes: enabling efficient inference and facilitating continual learning by limiting the number of parameters modified per task. O-LoRA (Wang et al., 2023) explicitly constrains successive LoRA modules to be orthogonal, directly preventing interference. AdapterCL (Madotto et al., 2021) trains task-specific adapters while keeping the backbone frozen, bridging PEFT and continual learning.
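The orthogonality constraint can be sketched as a penalty term. The following pure-Python function (a simplification of the O-LoRA idea; the name `ortho_penalty` and the list-of-rows representation are ours) computes the quantity that O-LoRA-style training drives toward zero:

```python
def ortho_penalty(A_old, A_new):
    """Squared Frobenius norm of A_old @ A_new^T.

    A_old, A_new: lists of rows (each row a list of floats) representing
    the low-rank 'A' factors of two LoRA modules over the same input
    dimension. Driving this penalty to zero makes every row of A_new
    orthogonal to every row of A_old, so the new module's update lives
    in a subspace the old module does not use."""
    total = 0.0
    for u in A_old:
        for v in A_new:
            dot = sum(ui * vi for ui, vi in zip(u, v))
            total += dot * dot
    return total
```

Orthogonal factors incur zero penalty; overlapping ones incur a positive penalty proportional to their alignment, which is added (with a weight) to the new task's loss.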
- Knowledge distillation for compression and CL: Knowledge distillation (Hinton et al., 2015), originally developed for model compression, became a core technique in continual learning through LwF (Li & Hoiem, 2017), DER++ (Buzzega et al., 2020), and FOSTER (Wang et al., 2022). The same teacher-student framework serves both efficient deployment and knowledge preservation.
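The shared teacher-student machinery is compact enough to sketch. Below is the temperature-softened distillation loss of Hinton et al. (2015) in pure Python (function names are ours; LwF additionally combines this term with the new task's cross-entropy loss):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions.
    In LwF-style continual learning, the 'teacher' is a frozen copy of
    the model from before the new task, so this term penalizes drift on
    old-task outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi + 1e-12) for pi, qi in zip(p, q))
```

By Gibbs' inequality the loss is minimized when the student matches the teacher, which is exactly the knowledge-preservation pressure continual learning needs.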
- State space models: SSMs like Mamba (Gu & Dao, 2024), with their recurrent structure, may offer advantages for online continual learning settings where data arrives sequentially. Their constant-memory inference is analogous to the bounded-memory constraint in online CL.
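The constant-memory property is visible even in a toy linear SSM recurrence (this scalar sketch omits Mamba's selective, input-dependent parameters; coefficient values are arbitrary):

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Toy linear state-space recurrence: h_t = a*h_{t-1} + b*x_t,
    y_t = c*h_t. Only the current state h is kept, so memory is O(1)
    in sequence length -- the property that makes recurrent SSMs
    attractive for processing unbounded data streams."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

Contrast this with attention, whose key-value cache grows linearly with the sequence seen so far.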
- Quantization and continual learning: Model quantization techniques interact with continual learning in complex ways -- quantized models may have different forgetting characteristics than full-precision models, and quantization-aware continual training remains underexplored.
World Models (Chapter 2): Continual learning is essential for world models that must adapt to changing environments, creating a bidirectional connection between these chapters:
- Non-stationary environments: A robot's world model must update as it encounters new objects, environments, or physical interactions, making catastrophic forgetting a practical concern for model-based RL. Dreamer agents (Hafner et al., 2020; Hafner et al., 2023) deployed in non-stationary environments face the classic stability-plasticity dilemma: the dynamics model must adapt to new environments while retaining knowledge of previously learned physics.
- Complementary learning systems: The CLS framework that inspires CLS-ER (Arani et al., 2022) and other replay-based methods also provides a blueprint for how world models might consolidate environmental knowledge. The hippocampal-cortical interaction modeled by CLS-ER is directly relevant to how an agent should consolidate world knowledge acquired through exploration.
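One mechanism CLS-ER uses for consolidation is an exponential moving average of the working model's weights into a slowly-changing stable model. A heavily simplified sketch (function name ours; the full method also replays samples and matches the stable model's logits):

```python
def ema_consolidate(stable, working, decay=0.999):
    """One consolidation step: the stable ('cortical') model drifts
    slowly toward the fast-updating working ('hippocampal') model,
    integrating new knowledge without abruptly overwriting old weights.
    Both arguments are flat lists of parameters."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(stable, working)]
```

The decay rate sets the stability-plasticity tradeoff directly: closer to 1 means slower consolidation and stronger retention.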
- Experience replay in RL: The replay buffers used in model-based RL, e.g. Dyna (Sutton, 1990) and MuZero (Schrittwieser et al., 2020), are essentially identical to the replay buffers in continual learning. Reservoir sampling (Vitter, 1985) provides the theoretical foundation for maintaining representative buffers in both settings.
- Continual world model learning: As environments evolve (new objects appear, physics change, layouts are modified), world models must perform continual learning of dynamics and observation models -- a setting that combines the challenges of both chapters.
Agentic Search (Chapter 4): Agents that interact with evolving knowledge bases must continually update their retrieval and reasoning capabilities, creating several important connections:
- Continual retrieval learning: Dense retrieval models like DPR (Karpukhin et al., 2020) and ColBERT (Khattab & Zaharia, 2020) must be continually updated as document collections grow and change. The retriever must index new documents without degrading retrieval quality for existing queries -- a class-incremental learning problem applied to retrieval.
- Knowledge editing and RAG: Knowledge editing in LLMs (ROME (Meng et al., 2022), MEMIT (Meng et al., 2023)) connects to RAG systems where the model's parametric knowledge must be kept consistent with the retrieved evidence. When external knowledge changes, the model must update its parametric knowledge accordingly, facing the same stability-plasticity tradeoff as classical continual learning.
- Self-improving search agents: Agents like Voyager (Wang et al., 2023) and Auto-GPT (Gravitas, 2023) that learn from their search experiences over time face a continual learning challenge: how to accumulate effective search strategies without forgetting previously successful approaches. Reflexion (Shinn et al., 2023) maintains a memory of past reasoning failures that must be managed to avoid interference.
- Continual reasoning: As LLMs are continually fine-tuned for new reasoning capabilities (chain-of-thought (Wei et al., 2022), tool use, planning), maintaining existing reasoning abilities while acquiring new ones is a continual learning problem that connects to both continual instruction tuning and agentic capabilities.
Randomized Algorithms (Chapter 5): Sketching, hashing, and sampling techniques from Chapter 5 provide essential computational infrastructure for scalable continual learning:
- Compressed replay buffers: Sketching algorithms (Count-Min Sketch, CountSketch) can compress replay buffers, storing approximate representations of previous task distributions in bounded memory rather than raw exemplars. REMIND (Hayes et al., 2020) stores compressed intermediate representations, a form of lossy compression related to sketching.
- Efficient parameter importance computation: Random projections enable efficient computation of parameter importance for EWC-style methods (Kirkpatrick et al., 2017) without materializing the full Fisher information matrix. For a model with N parameters, the diagonal Fisher requires O(N) storage, but the full Fisher is O(N^2) -- random projections reduce this to O(N * k) where k is the projection dimension.
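The projection trick can be sketched directly. The hypothetical helper below accumulates a k x k sketch of the empirical Fisher, F = (1/m) * sum_i g_i g_i^T, by projecting each per-sample gradient before taking outer products (dimensions here are tiny and illustrative):

```python
import random

def sketch_fisher(grads, k, seed=0):
    """Project each gradient g (length N) to z = P g with a random
    Gaussian matrix P (k x N), then accumulate z z^T / m. Storage is
    O(N*k) for P plus O(k^2) for the sketch, instead of O(N^2) for
    the full Fisher matrix."""
    n = len(grads[0])
    rng = random.Random(seed)
    scale = 1.0 / (k ** 0.5)       # standard scaling for JL projections
    P = [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(k)]
    F = [[0.0] * k for _ in range(k)]
    m = len(grads)
    for g in grads:
        z = [sum(pij * gj for pij, gj in zip(row, g)) for row in P]
        for i in range(k):
            for j in range(k):
                F[i][j] += z[i] * z[j] / m
    return F
```

The sketch preserves the Fisher's quadratic form approximately (in expectation), which is what an EWC-style penalty actually needs.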
- Streaming PCA for representation tracking: GPM (Saha et al., 2021) maintains a basis for each task's representation subspace. Streaming PCA algorithms (Frequent Directions) from Chapter 5 provide the algorithmic machinery for maintaining these bases efficiently as the number of tasks grows.
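GPM's core step, projecting new-task gradients out of the protected subspace, reduces to Gram-Schmidt bookkeeping. A simplified pure-Python sketch (GPM itself extracts the basis from layer activations via SVD, which is omitted here):

```python
def project_out(g, basis):
    """Remove from gradient g its component inside the stored subspace:
    g' = g - sum_b <g, b> b over orthonormal basis vectors b. Updating
    along g' leaves earlier tasks' representations unchanged to first
    order -- the core step of gradient projection memory."""
    g = list(g)
    for b in basis:
        coef = sum(gi * bi for gi, bi in zip(g, b))
        g = [gi - coef * bi for gi, bi in zip(g, b)]
    return g

def add_direction(v, basis, tol=1e-10):
    """Gram-Schmidt: after finishing a task, add the part of v lying
    outside the current basis (normalized), growing the protected
    subspace. Directions already in the span are ignored."""
    r = project_out(v, basis)
    norm = sum(x * x for x in r) ** 0.5
    if norm > tol:
        basis.append([x / norm for x in r])
    return basis
```

Frequent Directions would replace the exact basis with a fixed-size sketch, keeping memory bounded as tasks accumulate.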
- Reservoir sampling: Vitter's reservoir sampling (Vitter, 1985) provides the theoretical foundation for replay buffer management, guaranteeing that the buffer maintains a uniform random sample of all data seen so far -- a critical property for unbiased replay.
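Vitter's Algorithm R is short enough to state in full; this is the standard algorithm itself, wrapped in a buffer class whose name is ours:

```python
import random

class ReservoirBuffer:
    """Vitter's Algorithm R: after seeing n items, the buffer holds a
    uniform random sample of size min(n, capacity), using O(capacity)
    memory over an unbounded stream -- exactly the guarantee needed
    for an unbiased replay buffer."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / n_seen,
            # replacing a uniformly chosen slot.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item
```

An inductive argument shows every item seen so far survives with equal probability capacity/n, which is why replayed batches remain unbiased estimates of the full history.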
- The streaming computation perspective: Online continual learning -- processing data in a single pass with bounded memory -- is formally equivalent to the streaming computation model that motivates sketching and sampling algorithms in Chapter 5. This mathematical equivalence suggests that lower bounds from streaming complexity theory may apply to continual learning, potentially establishing fundamental limits on what can be learned in a single pass.
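A familiar micro-example of the streaming model is Welford's single-pass mean and variance, which estimates distribution statistics of an unbounded stream in O(1) memory, the same resource profile online CL demands of its learners:

```python
def streaming_mean_var(xs):
    """Welford's algorithm: one pass over the stream, constant memory,
    returning the running mean and population variance. Numerically
    stabler than accumulating raw sums of squares."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / n if n else 0.0)
```

Streaming lower bounds constrain what any such one-pass, bounded-memory procedure can compute, which is what makes them candidates for fundamental limits on online continual learning.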
References
- Elahe Arani, Fahad Sarfraz, Bahram Zonooz (2022). Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System Theory. ICLR.
- Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, Simone Calderara (2020). Dark Experience for General Continual Learning: a Strong, Simple Baseline. NeurIPS.
- Significant Gravitas (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub.
- Albert Gu, Tri Dao (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv.
- Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR.
- Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap (2023). Mastering Diverse Domains through World Models. arXiv.
- Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, Christopher Kanan (2020). REMIND Your Neural Network to Prevent Catastrophic Forgetting. ECCV.
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly (2019). Parameter-Efficient Transfer Learning for NLP. ICML.
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Shihan Dou, Enyu Zhou, Yan Liu, et al. (2024). LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin. ACL.
- Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
- Omar Khattab, Matei Zaharia (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR.
- James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS.
- Zhizhong Li, Derek Hoiem (2017). Learning without Forgetting. IEEE TPAMI.
- Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhiguang Wang, Liang Qiu, Yue Zhang, Pascale Fung (2021). Continual Learning in Task-Oriented Dialogue Systems. EMNLP.
- Arun Mallya, Svetlana Lazebnik (2018). PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. CVPR.
- Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
- Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, David Bau (2023). Mass-Editing Memory in a Transformer. ICLR.
- Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, et al. (2016). Progressive Neural Networks. arXiv.
- Gobinda Saha, Isha Garg, Kaushik Roy (2021). Gradient Projection Memory for Continual Learning. ICLR.
- Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature.
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR.
- Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS.
- Richard S. Sutton (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ICML.
- Jeffrey S. Vitter (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software.
- Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan (2022). FOSTER: Feature Boosting and Compression for Class-Incremental Learning. ECCV.
- Xiao Wang, Tianze Chen, Qiming Ge, et al. (2023). Orthogonal Subspace Learning for Language Model Continual Learning. EMNLP Findings.
- Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.