Regularization
Regularization is the set of techniques that constrain a model to prefer simpler hypotheses, preventing it from memorizing the training set at the expense of generalization. Understanding regularization requires understanding why overfitting occurs and how different regularizers address the underlying causes.
The Overfitting Problem
A model with sufficient capacity can achieve zero training loss ($\mathcal{L}_{\text{train}} = 0$) by memorizing every training example, including noise. However, such a model performs poorly on unseen data ($\mathcal{L}_{\text{test}} \gg 0$) because it has learned patterns specific to the training set rather than the underlying data-generating process. The generalization gap, $\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$, measures the severity of overfitting.
L2 Regularization (Ridge / Weight Decay)
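L2 regularization adds a squared-norm penalty to the training loss:
$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \lVert\theta\rVert_2^2$$
where $\lambda \ge 0$ controls the regularization strength.
The gradient of the regularized loss is $\nabla \mathcal{L}(\theta) + \lambda \theta$, giving the SGD update with learning rate $\eta$:
$$\theta \leftarrow \theta - \eta \left( \nabla \mathcal{L}(\theta) + \lambda \theta \right) = (1 - \eta \lambda)\, \theta - \eta \nabla \mathcal{L}(\theta)$$
The multiplicative factor $(1 - \eta \lambda)$ shrinks every weight toward zero each step, hence the name weight decay.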
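The weight-decay form of the SGD step can be sketched in a few lines of plain Python. This assumes the penalty is written as $\frac{\lambda}{2}\lVert\theta\rVert_2^2$, so that its gradient is $\lambda\theta$; the function name `sgd_step_l2` and the arguments `lr` and `weight_decay` are illustrative, not taken from any library.

```python
def sgd_step_l2(theta, grad, lr, weight_decay):
    """One SGD step on an L2-regularized loss.

    theta: list of parameters.
    grad:  gradient of the unregularized loss at theta.
    """
    shrink = 1.0 - lr * weight_decay  # multiplicative decay factor
    return [shrink * t - lr * g for t, g in zip(theta, grad)]


# With a zero data gradient, the weights decay geometrically toward zero.
theta = [1.0, -2.0]
for _ in range(3):
    theta = sgd_step_l2(theta, grad=[0.0, 0.0], lr=0.1, weight_decay=0.5)

# One step with a nonzero gradient, for comparison with the explicit form
# theta - lr * (grad + weight_decay * theta).
one_step = sgd_step_l2([1.0], grad=[0.2], lr=0.1, weight_decay=0.5)
```

Here each step multiplies the weights by 0.95, so after three steps with no data gradient they sit at $0.95^3$ of their starting values.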
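From a Bayesian perspective, L2 regularization corresponds to MAP estimation under a zero-mean Gaussian prior on the parameters, with prior variance inversely proportional to $\lambda$. Larger $\lambda$ corresponds to a tighter prior, a stronger belief that parameters should be small.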
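For linear regression with design matrix $X$ and targets $y$, the ridge solution has the closed form
$$\hat{\theta} = (X^\top X + \lambda I)^{-1} X^\top y$$
The term $\lambda I$ ensures the matrix $X^\top X + \lambda I$ is invertible even when $X^\top X$ is singular (for example, with more features than samples). Geometrically, it shrinks the OLS solution toward zero, most strongly along directions of small variance in $X$.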
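To make the invertibility point concrete, here is a small sketch of the ridge closed form $(X^\top X + \lambda I)^{-1} X^\top y$ restricted to two features, so the inverse can be written out by hand. The helper `ridge_2d` and the toy data are hypothetical; a real implementation would use a linear-algebra library rather than an explicit inverse.

```python
def ridge_2d(X, y, lam):
    """Closed-form ridge solution for exactly two features.

    Solves (X^T X + lam * I) theta = X^T y with an explicit 2x2 inverse.
    """
    # Entries of the symmetric matrix A = X^T X + lam * I.
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    # Right-hand side X^T y.
    u = sum(r[0] * t for r, t in zip(X, y))
    v = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ZeroDivisionError("X^T X + lam * I is singular")
    return [(d * u - b * v) / det, (a * v - b * u) / det]


# Two identical columns make X^T X singular, so OLS has no unique solution.
X = [[1.0, 1.0], [2.0, 2.0]]
y = [2.0, 4.0]
theta = ridge_2d(X, y, lam=1.0)  # well defined once lam > 0
```

With `lam=0.0` the same call raises, because the determinant of $X^\top X$ is exactly zero for this data; any positive `lam` restores a unique solution that splits the weight evenly between the duplicated features.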
L1 Regularization (Lasso)
Geometric interpretation: The L1 constraint set is a diamond (cross-polytope) with corners on the coordinate axes. The loss function's level curves are more likely to first touch a corner of the diamond than a smooth point, and corners correspond to sparse solutions (some coordinates exactly zero).
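For example, consider the soft-thresholding operator $S_\lambda(z) = \operatorname{sign}(z) \max(\lvert z \rvert - \lambda, 0)$ applied to a least-squares value $z$. With $\lvert z \rvert \le \lambda$, we get $S_\lambda(z) = 0$: the coordinate is set exactly to zero. With $\lvert z \rvert > \lambda$ and the same $\lambda$, we get $S_\lambda(z) = \operatorname{sign}(z)(\lvert z \rvert - \lambda)$, a shrunken but nonzero value. Any coordinate whose least-squares value falls within $[-\lambda, \lambda]$ is eliminated entirely, which is the mechanism behind L1 sparsity.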
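The thresholding behavior is easy to check numerically. The sketch below implements soft-thresholding for a single coordinate; the function name `soft_threshold` and the input values are chosen purely for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly zero inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


lam = 0.5  # hypothetical regularization strength
small = soft_threshold(0.3, lam)      # inside [-lam, lam]: eliminated
large = soft_threshold(1.5, lam)      # outside: shrunk by lam
negative = soft_threshold(-1.5, lam)  # shrinkage is symmetric in sign
```

Values inside the interval collapse to zero, while values outside it move toward zero by exactly `lam`, regardless of sign.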
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | $\lambda \sum_i \lvert\theta_i\rvert$ | $\frac{\lambda}{2} \sum_i \theta_i^2$ |
| Effect on parameters | Sparse (many exactly zero) | Small (all shrunk, none exactly zero) |
| Bayesian prior | Laplace: $p(\theta_i) \propto e^{-\lambda \lvert\theta_i\rvert}$ | Gaussian: $p(\theta_i) \propto e^{-\lambda \theta_i^2 / 2}$ |
| Constraint geometry | Diamond (cross-polytope) | Sphere (ball) |
| Feature selection | Yes (implicit) | No |
| Smoothness | Non-differentiable at $\theta_i = 0$ | Smooth everywhere |
| Use in deep learning | Rare (non-smooth optimization) | Standard (as weight decay) |
Dropout
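To maintain the expected activation magnitude, inverted dropout scales the surviving activations by $1/(1-p)$ during training:
$$\tilde{h} = \frac{m \odot h}{1 - p}, \qquad m_i \sim \text{Bernoulli}(1 - p)$$
where $h$ is the vector of activations, $m$ is the binary dropout mask, and $p$ is the drop rate.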
This way, no modification is needed at test time.
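A minimal sketch of inverted dropout on a list of activations, using only the standard library. The function name `inverted_dropout`, the `training` flag, and the explicit `rng` argument are assumptions made for the example; deep learning frameworks provide their own layers for this.

```python
import random


def inverted_dropout(h, p, rng, training=True):
    """Apply inverted dropout with drop rate p to a list of activations h."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)  # identity at test time: no rescaling needed
    keep = 1.0 - p
    # Each unit survives with probability keep and is rescaled by 1 / keep.
    return [x / keep if rng.random() < keep else 0.0 for x in h]


rng = random.Random(0)  # fixed seed so the example is repeatable
h = [1.0] * 10000
out = inverted_dropout(h, p=0.2, rng=rng)
mean_train = sum(out) / len(out)  # close to the original mean of 1.0
out_eval = inverted_dropout(h, p=0.2, rng=rng, training=False)
```

Because surviving units are scaled up by `1 / keep`, the average activation during training stays close to its undropped value, which is why the evaluation path can return the input unchanged.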
- Vision Transformers (ViT): Often use no dropout or very low rates, relying instead on data augmentation and weight decay.
- LLMs: Typically do not use dropout during pretraining (GPT-3, LLaMA), as the massive dataset size provides implicit regularization.
- Where it persists: Attention dropout (dropping attention weights) and path dropout / DropPath (dropping entire residual branches) remain useful, especially when data is limited.
Batch Normalization and Layer Normalization
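Batch normalization standardizes each feature using the statistics of the current mini-batch, then applies a learnable affine transform:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, $\gamma$ and $\beta$ are learnable scale and shift parameters, and $\epsilon$ is a small constant for numerical stability.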
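Layer normalization applies the same kind of transform with per-sample statistics:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
where $\mu$ and $\sigma^2$ are computed over the features for each sample independently, and $\epsilon$ is a small constant for numerical stability.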
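The only difference between the two normalizations is the axis over which the mean and variance are taken. The sketch below shows this on a tiny batch stored as a list of rows; it covers training-mode statistics only (no running averages), and omits the learnable scale and shift by assuming $\gamma = 1$ and $\beta = 0$. All function names are illustrative.

```python
import math


def normalize(values, eps=1e-5):
    """Shift values to zero mean and scale them to (nearly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch."""
    columns = [normalize(col, eps) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]


def layer_norm(x, eps=1e-5):
    """Normalize each sample (row) using statistics over its own features."""
    return [normalize(row, eps) for row in x]


# Batch of 2 samples with 3 features each.
x = [[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]]
bn = batch_norm(x)  # every column now has zero mean across the batch
ln = layer_norm(x)  # every row now has zero mean across its features
```

Note that `layer_norm` gives the same result for a sample no matter what else is in the batch, whereas `batch_norm` output for one sample depends on the others.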
Early Stopping
In practice, use patience $k$: stop if the validation loss $\mathcal{L}_{\text{val}}$ has not improved for $k$ consecutive evaluations. Save the best checkpoint so that it can be restored after stopping.
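The patience rule can be sketched as a small function over a recorded sequence of validation losses. The name `early_stopping_indices` and the loss values are hypothetical; in a real training loop the comparison would run once per evaluation, and the checkpoint would be written at the marked line.

```python
def early_stopping_indices(val_losses, patience):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Stops at the first evaluation where the loss has failed to improve for
    `patience` consecutive evaluations; otherwise runs to the end.
    """
    best_loss = float("inf")
    best_index = -1
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index = loss, i  # a real loop would save a checkpoint here
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best_index
    return len(val_losses) - 1, best_index


# Hypothetical validation curve: improves, then drifts upward.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_index, best_index = early_stopping_indices(losses, patience=3)
```

On this curve the best loss occurs at index 2, and training stops at index 5 after three evaluations without improvement. A larger patience tolerates longer plateaus at the cost of extra compute.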
Data Augmentation
| Domain | Augmentations |
|---|---|
| Vision | Random crop, horizontal flip, color jitter, RandAugment, Mixup, CutMix |
| Text | Token masking (BERT), back-translation, synonym replacement, dropout on embeddings |
| Audio | Time stretching, pitch shifting, SpecAugment (masking frequency/time bands) |
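Mixup (Zhang et al., 2018) creates virtual training examples by linearly interpolating both inputs and labels: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$, where $\lambda \sim \text{Beta}(\alpha, \alpha)$. Here $\lambda$ denotes the mixing coefficient, not the regularization strength.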
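A minimal sketch of Mixup for a single pair of examples, with inputs and one-hot labels stored as plain lists. The function name `mixup_pair` and the toy vectors are illustrative; in practice the interpolation is usually applied to a whole batch at once.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so the example is repeatable
x_mix, y_mix, lam = mixup_pair(
    x_i=[0.0, 0.0], y_i=[1.0, 0.0],
    x_j=[1.0, 2.0], y_j=[0.0, 1.0],
    alpha=0.4, rng=rng,
)
```

The mixed label remains a valid probability vector because the same coefficient weights both labels, and small values of `alpha` push the coefficient toward 0 or 1, so most virtual examples stay close to one of the originals.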
Bias-Variance Tradeoff
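For a target $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, the expected squared error of the learned predictor $\hat{f}$ at a point $x$ decomposes as
$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$$
The expectation is over different training sets drawn from the same distribution, and over the noise $\epsilon$.
Proof. Let $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$. Then:
$$y - \hat{f}(x) = \left(f(x) - \bar{f}(x)\right) + \left(\bar{f}(x) - \hat{f}(x)\right) + \epsilon$$
Expanding the square and noting that cross terms vanish (since $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\hat{f}(x) - \bar{f}(x)] = 0$ by definition, and $\epsilon$ is independent of $\hat{f}$):
$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \left(f(x) - \bar{f}(x)\right)^2 + \mathbb{E}\left[\left(\hat{f}(x) - \bar{f}(x)\right)^2\right] + \sigma^2$$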
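The decomposition can be checked by simulation. The sketch below is a toy setup, not part of the original argument: the true function value at the query point is a constant `mu`, each training set has `n` noisy observations, and the predictor is a deliberately biased shrunken sample mean. All names and parameter values are assumptions chosen so that the expected answer is easy to work out by hand.

```python
import random


def simulate(mu=2.0, sigma=1.0, n=5, shrink=0.5, trials=20000, seed=0):
    """Monte Carlo estimate of bias^2, variance, and test error at one point.

    Each trial draws a fresh training set and a fresh noisy test label.
    The predictor is shrink * (sample mean), which is biased toward zero.
    """
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(trials):
        train = [mu + rng.gauss(0.0, sigma) for _ in range(n)]
        f_hat = shrink * sum(train) / n      # prediction from this training set
        y_new = mu + rng.gauss(0.0, sigma)   # fresh noisy test label
        preds.append(f_hat)
        sq_errors.append((y_new - f_hat) ** 2)
    mean_pred = sum(preds) / trials
    bias_sq = (mu - mean_pred) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    mse = sum(sq_errors) / trials
    return bias_sq, variance, mse


bias_sq, variance, mse = simulate()
decomposed = bias_sq + variance + 1.0 ** 2  # add the noise variance sigma^2
```

For these settings the analytic values are a squared bias of $(1 - 0.5)^2 \cdot 2^2 = 1$, a variance of $0.5^2 \cdot 1 / 5 = 0.05$, and a noise term of $1$, so both `mse` and `decomposed` should land near 2.05 up to sampling error.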
| Regime | Bias | Variance | Model Behavior | Fix |
|---|---|---|---|---|
| Underfitting | High | Low | Model too simple; misses patterns | Increase capacity, reduce regularization |
| Overfitting | Low | High | Model memorizes noise | Increase regularization, more data |
| Sweet spot | Balanced | Balanced | Best generalization | None needed |
| Double descent | Low | Low (surprisingly) | Very overparameterized | Let model grow past interpolation threshold |
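- Implicit regularization from SGD, which tends toward minimum-norm solutions in the overparameterized regime (this is provable for linear least squares initialized at zero; for deep networks it is an empirical tendency)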
- Inductive biases of neural network architectures (locality, weight sharing)
- Benign overfitting: the model interpolates training data but does so smoothly, with low test error
Additional Regularization Techniques
Summary of Regularization Techniques
| Technique | Mechanism | Computational Cost | Modern Usage |
|---|---|---|---|
| L2 / Weight decay | Shrinks weights toward zero | Negligible | Standard (AdamW) |
| L1 | Drives weights to exactly zero | Negligible | Rare in deep learning |
| Dropout | Random neuron zeroing; ensemble effect | ~10% training overhead | Declining; used in attention/path |
| BatchNorm | Normalizes by batch stats; adds noise | ~5% overhead | CNNs |
| LayerNorm / RMSNorm | Normalizes by feature stats | ~5% overhead | Transformers, LLMs |
| Data augmentation | Increases effective dataset size | Data pipeline overhead | Standard for vision |
| Early stopping | Limits optimization trajectory | None (saves compute) | Standard |
| Label smoothing | Softens one-hot targets | Negligible | Common for classification |
| Gradient clipping | Bounds gradient magnitude | Negligible | Standard for transformers |
Notation Summary
| Symbol | Meaning |
|---|---|
| $\lambda$ | Regularization strength / weight decay coefficient |
| $\lVert\theta\rVert_1$, $\lVert\theta\rVert_2$ | L1 and L2 norms of parameters |
| $p$ | Dropout probability (drop rate) |
| $m$ | Binary dropout mask |
| $\odot$ | Elementwise product |
| $\gamma$, $\beta$ | Learnable scale and shift in normalization |
| $\mu_B$, $\sigma_B^2$ | Batch mean and variance |
| $\hat{f}$ | Learned predictor (random variable over training sets) |
| $f$ | True function |
| $\sigma^2$ | Irreducible noise variance |
| $\mathcal{L}_{\text{val}}$ | Validation loss |