Regularization Methods
Regularization-based approaches constrain weight updates to preserve knowledge encoded in parameters deemed important for previous tasks. This family can be subdivided into two categories: weight regularization methods that directly penalize changes to important parameters, and functional regularization methods that constrain the model's input-output behavior rather than its parameters.
Weight Regularization
Elastic Weight Consolidation (EWC)
Kirkpatrick et al. (2017) proposed EWC, drawing inspiration from synaptic consolidation in neuroscience. The key insight is that not all parameters are equally important for a given task -- some encode critical task knowledge while others can be safely modified. EWC adds a quadratic penalty to the loss function that penalizes changes to parameters proportional to their estimated importance for previous tasks:
L_total = L_new(theta) + (lambda / 2) * sum_i F_i * (theta_i - theta_i^*)^2
where F_i is the diagonal of the Fisher information matrix (approximating parameter importance) and theta_i^* are the optimal parameters for the previous task. The Fisher information matrix captures how sensitive the model's output distribution is to each parameter: parameters with high Fisher information are those whose modification would substantially alter the model's predictions, and thus should be protected.
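A minimal NumPy sketch of the penalty and a diagonal Fisher estimate (the function names and array shapes are illustrative, not from the original paper):

```python
import numpy as np

def diag_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients
    of the log-likelihood; input shape (N, P) -> output shape (P,)."""
    return np.mean(per_sample_grads ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic EWC penalty: (lambda / 2) * sum_i F_i * (theta_i - theta_i*)^2."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
```

In practice the per-sample gradients come from the old task's data and the model's own predictive distribution; the penalty is simply added to the new task's loss at each training step.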
EWC demonstrated that selective parameter protection could substantially mitigate forgetting on sequential Atari game learning -- for instance, retaining performance on a previously learned game while learning a new one. The method is elegant in its simplicity and connects to a rich theory of natural gradient methods in optimization.
Key limitations: (1) The diagonal Fisher is a crude approximation of the full Fisher information matrix, which is prohibitively expensive to store and compute. This approximation ignores parameter interactions, which can be critical. (2) The quadratic penalty assumes a Gaussian posterior over parameters centered at the previous optimum, which becomes increasingly inaccurate as the number of tasks grows. (3) The Fisher information is computed at a single point (the optimum for the previous task), but importance may change as the model moves through parameter space.
Online EWC (Schwarz et al., 2018) addresses the growing memory cost of storing one Fisher matrix per task by maintaining a single running sum of Fisher information matrices. This reduces storage from O(T * |theta|) to O(|theta|) while still approximating the cumulative importance of parameters across all tasks.
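The running update can be sketched in one line; the `gamma` decay factor (which downweights older tasks) is a common variant, assumed here for illustration:

```python
import numpy as np

def update_online_fisher(fisher_running, fisher_task, gamma=1.0):
    """Online EWC: fold the newest task's Fisher into a single running
    accumulator instead of storing one matrix per task.
    gamma < 1 gradually downweights older tasks."""
    return gamma * fisher_running + fisher_task
```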
Synaptic Intelligence (SI)
Zenke et al. (2017) proposed Synaptic Intelligence (SI), which computes parameter importance online during training rather than post-hoc. While EWC estimates importance at the end of training on each task (a single snapshot), SI tracks the contribution of each parameter to the loss reduction along the entire training trajectory. Specifically, SI accumulates the "path integral" of the gradient for each parameter throughout training:
omega_i = sum_t -g_i(t) * (theta_i(t) - theta_i(t-1)),    Omega_i = omega_i / ((Delta theta_i)^2 + epsilon)
where g_i(t) is the gradient of the loss, omega_i accumulates each parameter's contribution to the loss reduction along the trajectory, and Delta theta_i is the parameter's total change over the task. This yields an importance measure Omega_i that is both cheaper to compute and arguably more representative than the Fisher-based approximation of EWC, as it captures importance across the entire optimization trajectory rather than at a single point.
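The online accumulation can be sketched as follows (the class interface is an illustrative assumption, not the authors' code):

```python
import numpy as np

class SynapticIntelligence:
    """Minimal SI sketch. During training, omega accumulates each
    parameter's contribution to loss reduction; at task end it is
    normalized by the parameter's total squared displacement."""

    def __init__(self, n_params, eps=1e-3):
        self.omega = np.zeros(n_params)       # running path integral
        self.importance = np.zeros(n_params)  # consolidated importance
        self.eps = eps

    def step(self, grad, delta_theta):
        # -g_i * delta_theta_i is a first-order estimate of the
        # per-step loss decrease attributable to parameter i
        self.omega += -grad * delta_theta

    def end_task(self, theta, theta_task_start):
        total_change = theta - theta_task_start
        self.importance += self.omega / (total_change ** 2 + self.eps)
        self.omega[:] = 0.0  # reset for the next task
```

The consolidated `importance` then plays the same role as the Fisher diagonal in an EWC-style quadratic penalty.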
Empirical comparisons show SI and EWC perform comparably on standard benchmarks, with SI being more memory-efficient since importance is accumulated online without storing additional matrices per task (Zenke et al., 2017). However, both methods share the fundamental limitation of capacity saturation.
Riemannian Walk (RWalk)
Chaudhry et al. (2018) proposed RWalk, which unifies EWC and SI by combining the Fisher information (local curvature) with the path integral (trajectory-based importance). RWalk also introduced the forgetting measure metric and provided a more principled comparison of regularization methods. The key finding was that combining both sources of importance information -- the Fisher's snapshot at the optimum and SI's trajectory-based accumulation -- yields better estimates than either alone.
Memory Aware Synapses (MAS)
Aljundi et al. (2018) proposed MAS, which estimates parameter importance based on the sensitivity of the learned function (rather than the loss) to parameter changes. The importance is computed as the magnitude of the gradient of the output with respect to each parameter, averaged over the data:
Omega_i = (1/N) * sum_n |grad_{theta_i} ||F(x_n; theta)||_2^2|
This is a crucial distinction from EWC: while EWC's Fisher information measures sensitivity of the loss (and thus requires labels), MAS measures sensitivity of the model's output (and can be computed in an unsupervised manner). This makes MAS task-agnostic and applicable in settings where task boundaries are unclear or labels are unavailable.
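The label-free property can be demonstrated on a toy model. Below, `f(x) = theta * x` (elementwise) is an illustrative assumption, and the gradient of the squared output norm is estimated by finite differences; note that no labels appear anywhere:

```python
import numpy as np

def mas_importance(theta, xs, eps=1e-5):
    """MAS importance sketch for a toy elementwise-linear model
    f(x) = theta * x: mean over inputs of |d ||f(x)||_2^2 / d theta_i|,
    estimated by central finite differences. Entirely unsupervised."""
    imp = np.zeros_like(theta)
    for x in xs:
        for i in range(len(theta)):
            tp, tm = theta.copy(), theta.copy()
            tp[i] += eps
            tm[i] -= eps
            imp[i] += abs(np.sum((tp * x) ** 2) - np.sum((tm * x) ** 2)) / (2 * eps)
    return imp / len(xs)
```

For this model the analytic importance is |2 * theta_i * x_i^2| averaged over the data, which the finite-difference estimate recovers.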
Functional Regularization
Learning without Forgetting (LwF)
Li and Hoiem (2017) introduced LwF, which uses knowledge distillation (Hinton et al., 2015) to preserve the behavior of the network on old tasks while learning new ones. Rather than constraining parameters directly (which makes assumptions about the geometry of parameter space), LwF constrains the network's input-output behavior: when training on new task data, it minimizes a distillation loss that encourages the network to produce the same outputs as the old model on the new task's inputs:
L_total = L_new(theta) + alpha * L_distill(f_theta(x), f_{theta_old}(x))
where f_{theta_old} is the model snapshot before learning the new task. The distillation loss uses soft targets (softmax outputs with temperature scaling), which carry richer information than hard labels -- specifically, they encode the model's uncertainty and inter-class relationships.
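A minimal sketch of the combined loss (the KL form of the distillation term and the parameter names are illustrative; implementations vary in the exact distillation formulation):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lwf_loss(logits, old_logits, labels, alpha=1.0, T=2.0):
    """Cross-entropy on the new task's labels plus a temperature-softened
    distillation term (KL divergence) toward the old model's outputs."""
    p = softmax(logits)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    q_old = softmax(old_logits, T)
    q_new = softmax(logits, T)
    kl = np.mean(np.sum(q_old * (np.log(q_old + 1e-12) - np.log(q_new + 1e-12)),
                        axis=-1))
    return ce + alpha * kl
```

When the new model's outputs match the old model's, the distillation term vanishes and only the new-task cross-entropy remains.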
LwF requires no stored exemplars from previous tasks, making it fully privacy-preserving and memory-efficient. However, it assumes that new task data is somewhat representative of previous input distributions -- if the new data distribution is very different from previous tasks, the distillation targets may not accurately reflect the old model's behavior on its original data. Despite this limitation, the functional regularization paradigm introduced by LwF has been enormously influential, forming a key component of many subsequent methods including DER++ (Buzzega et al., 2020), FOSTER (Wang et al., 2022), and LUCIR (Hou et al., 2019).
Less-Forgetting Learning and Encoder-Based Distillation
Less-Forgetting Learning (Jung et al., 2016) preceded LwF and proposed a simpler form of functional regularization by constraining intermediate representations rather than final outputs. This idea was later refined in methods like PODNet (Douillard et al., 2020), which applies pooled distillation across multiple spatial levels of a convolutional network, capturing both local and global feature statistics. PODNet's multi-scale distillation achieves significantly better performance than output-level distillation alone, particularly for long task sequences in class-incremental learning.
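The pooling idea behind PODNet can be illustrated with a simplified, unnormalized sketch (the original method L2-normalizes flattened pooled maps and sums over multiple network stages; this toy variant shows only the spatial pooling):

```python
import numpy as np

def pod_spatial_loss(feat_old, feat_new):
    """POD-style sketch: distill width- and height-pooled statistics of
    intermediate feature maps, shape (B, C, H, W), rather than only the
    final outputs. Pooling keeps aggregate spatial statistics while
    allowing individual activations to move."""
    loss = 0.0
    for axis in (2, 3):  # pool over height, then over width
        p_old = feat_old.sum(axis=axis)
        p_new = feat_new.sum(axis=axis)
        loss += np.mean((p_old - p_new) ** 2)
    return loss
```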
LUCIR and Bias Correction
Hou et al. (2019) identified a critical problem in class-incremental learning: the model's classifier becomes biased toward recently learned classes because they dominate the training data. LUCIR (Learning a Unified Classifier Incrementally via Rebalancing) addresses this through three mechanisms: (1) cosine normalization of the classifier to remove the magnitude bias, (2) a less-forget constraint that preserves the orientation of feature vectors, and (3) inter-class separation to maintain discrimination between old and new classes. LUCIR highlighted that continual learning requires not just preventing forgetting of features but also maintaining a balanced classification boundary.
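Mechanism (1) can be sketched as a cosine-normalized classifier (in LUCIR, `eta` is a learnable scale; it is fixed here for illustration):

```python
import numpy as np

def cosine_logits(features, W, eta=1.0):
    """Cosine-normalized classifier sketch: L2-normalize both features
    and class weight vectors so logits depend only on angle, removing
    the weight-magnitude bias toward recently learned classes.
    features: (B, D), W: (num_classes, D)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    w = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    return eta * f @ w.T
```

Because every class weight is normalized to unit length, a new class cannot dominate the logits simply by having larger weights.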
BiC: Bias Correction
Wu et al. (2019) proposed BiC (Bias Correction), which explicitly corrects the classifier bias toward new classes by learning a small linear transformation on the output logits using a held-out validation set. Despite its simplicity (just two scalar parameters per incremental step), BiC substantially improves class-incremental performance. This work demonstrated that a significant portion of the accuracy drop in class-incremental learning comes from classifier bias rather than feature forgetting -- a finding that influenced many subsequent methods.
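The correction step itself is a two-parameter affine map on the new-class logits (fitting alpha and beta on the validation set is omitted from this sketch):

```python
import numpy as np

def bic_correct(logits, new_class_idx, alpha, beta):
    """BiC post-hoc correction: apply a learned affine transform
    (alpha, beta) to the logits of newly added classes only; old-class
    logits pass through unchanged. alpha and beta are two scalars fit
    on a small held-out validation set after each incremental step."""
    out = logits.copy()
    out[:, new_class_idx] = alpha * logits[:, new_class_idx] + beta
    return out
```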
WA: Weight Alignment
Zhao et al. (2020) proposed WA (Weight Alignment), which corrects the bias in class-incremental learning by aligning the norms of the weight vectors for old and new classes in the final classification layer. After training on a new task, WA normalizes the classifier weights so that old and new classes have comparable norms. This simple post-hoc correction, applied after standard fine-tuning with knowledge distillation, achieves competitive performance with minimal overhead.
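The post-hoc alignment step can be sketched as follows (a minimal version that matches mean norms; the function name is an illustrative assumption):

```python
import numpy as np

def weight_align(W, old_idx, new_idx):
    """WA sketch: rescale new-class classifier weight vectors so their
    mean norm matches that of the old-class weights. Applied once after
    training on the new task. W has shape (num_classes, feat_dim)."""
    mean_old = np.mean(np.linalg.norm(W[old_idx], axis=1))
    mean_new = np.mean(np.linalg.norm(W[new_idx], axis=1))
    W = W.copy()
    W[new_idx] *= mean_old / mean_new
    return W
```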
Limitations of Regularization Approaches
Regularization methods share a fundamental limitation: as the number of tasks grows, the feasible region of parameter space that satisfies all constraints shrinks, eventually leaving insufficient capacity for new learning (Hsu et al., 2018). This phenomenon, sometimes called capacity saturation, means regularization methods tend to degrade on long task sequences. Geometrically, each new task adds constraints that restrict the feasible region, and eventually there is no point in parameter space that simultaneously satisfies all constraints well.
Additionally, weight regularization methods generally struggle with class-incremental learning, where task identity is not provided at test time (van de Ven & Tolias, 2019). Without task identity, the model cannot route to task-specific output heads, and the regularization constraints alone are insufficient to maintain a global classification boundary. Functional regularization methods (LwF and descendants) fare better in Class-IL but still face the fundamental challenge of maintaining calibrated decision boundaries across an ever-growing set of classes.
A practical issue is hyperparameter sensitivity: the regularization strength lambda is difficult to set a priori. Too strong a penalty prevents new learning; too weak a penalty allows forgetting, and the right balance may vary across tasks (Lange et al., 2021).
References
- Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, Tinne Tuytelaars (2018). Memory Aware Synapses: Learning What (not) to Forget. ECCV.
- Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, Simone Calderara (2020). Dark Experience for General Continual Learning: a Strong, Simple Baseline. NeurIPS.
- Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan, Philip H.S. Torr (2018). Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence. ECCV.
- Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, Eduardo Valle (2020). PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. ECCV.
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, Dahua Lin (2019). Learning a Unified Classifier Incrementally via Rebalancing. CVPR.
- Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, Zsolt Kira (2018). Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines. NeurIPS CL Workshop.
- Heechul Jung, Jeongwoo Ju, Minju Jung, Junmo Kim (2016). Less-Forgetting Learning in Deep Neural Networks. arXiv.
- James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS.
- Matthias De Lange, Rahaf Aljundi, Marc Masana, et al. (2021). A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE TPAMI.
- Zhizhong Li, Derek Hoiem (2017). Learning without Forgetting. IEEE TPAMI.
- Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, et al. (2018). Progress & Compress: A Scalable Framework for Continual Learning. ICML.
- Gido M. van de Ven, Andreas S. Tolias (2019). Three Scenarios for Continual Learning. NeurIPS Continual Learning Workshop.
- Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan (2022). FOSTER: Feature Boosting and Compression for Class-Incremental Learning. ECCV.
- Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, Yun Fu (2019). Large Scale Incremental Learning. CVPR.
- Friedemann Zenke, Ben Poole, Surya Ganguli (2017). Continual Learning through Synaptic Intelligence. ICML.
- Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, Shu-Tao Xia (2020). Maintaining Discrimination and Fairness in Class Incremental Learning. CVPR.