Regularization is the set of techniques that constrain a model to prefer simpler hypotheses, preventing it from memorizing the training set at the expense of generalization. Understanding regularization requires understanding why overfitting occurs and how different regularizers address the underlying causes.
A model with sufficient capacity can achieve zero training loss ($\mathcal{L}_{\text{train}} \approx 0$) by memorizing every training example, including noise. However, such a model performs poorly on unseen data ($\mathcal{L}_{\text{test}} \gg 0$) because it has learned patterns specific to the training set rather than the underlying data-generating process. The **generalization gap** $\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$ measures the severity of overfitting.
A remarkable fact: a neural network with $P$ parameters can memorize a dataset of $N \leq P$ random label assignments -- where the labels bear no relation to the inputs [@zhang2021understanding]. This proves that the model class alone does not determine generalization; the training procedure and data structure also matter.
Add the squared Euclidean norm of the parameters to the loss:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = \mathcal{L}(\theta) + \frac{\lambda}{2}\sum_i \theta_i^2$$

where $\lambda > 0$ controls the regularization strength.
The gradient of the regularized loss is $\nabla \mathcal{L}_{\text{reg}} = \nabla \mathcal{L} + \lambda\theta$, giving the SGD update:

$$\theta_{t+1} = \theta_t - \eta\left(\nabla \mathcal{L}(\theta_t) + \lambda\theta_t\right) = (1 - \eta\lambda)\theta_t - \eta\nabla \mathcal{L}(\theta_t)$$

The multiplicative factor $(1 - \eta\lambda)$ shrinks every weight toward zero at each step, hence the name **weight decay**.
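In code, the update is a one-liner. A minimal NumPy sketch (the function name and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1, weight_decay=0.01):
    """One SGD step with L2 regularization folded into the update.

    Equivalent to a gradient step on L(theta) + (wd/2)*||theta||^2:
    theta <- (1 - lr*wd) * theta - lr * grad.
    """
    return (1 - lr * weight_decay) * theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.zeros(2)            # with a zero data gradient, only decay acts
theta = sgd_step(theta, grad)
# each weight is shrunk by the factor (1 - 0.1 * 0.01) = 0.999
```

With the data gradient set to zero, the update reduces to pure exponential decay of the weights, which makes the multiplicative shrinkage easy to see.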
**Bayesian interpretation.** L2 regularization is equivalent to placing a Gaussian prior $p(\theta) \propto \exp(-\lambda\|\theta\|^2/2)$ on the parameters. Identifying $\mathcal{L}(\theta)$ with the negative log-likelihood $-\log p(\mathcal{D} \mid \theta)$, minimizing $\mathcal{L}_{\text{reg}}$ is equivalent to computing the **maximum a posteriori (MAP)** estimate:

$$\theta_{\text{MAP}} = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right] = \arg\min_\theta \left[\mathcal{L}(\theta) + \frac{\lambda}{2}\|\theta\|^2\right]$$

Larger $\lambda$ corresponds to a tighter prior -- a stronger belief that parameters should be small.
**L2 regularization $\neq$ weight decay for adaptive optimizers.** For SGD, adding $\frac{\lambda}{2}\|\theta\|^2$ to the loss is equivalent to multiplying $\theta$ by $(1 - \eta\lambda)$ each step. For Adam, the L2 gradient term $\lambda\theta$ gets divided by $\sqrt{\hat{v}_t}$, making the effective regularization strength parameter-dependent and inconsistent. **AdamW** [@loshchilov2019adamw] fixes this by applying weight decay directly: $\theta \leftarrow (1 - \eta\lambda)\theta - \eta \cdot \text{Adam\_step}$, decoupling regularization from the adaptive learning rate. AdamW is the standard for transformer training.
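A minimal sketch of the decoupled update, with the decay applied to $\theta$ separately from the adaptive step (the function and its defaults are illustrative, not a production optimizer):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: weight decay is applied directly to theta,
    decoupled from the adaptive gradient scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)            # bias-corrected second moment
    theta = (1 - lr * weight_decay) * theta               # decoupled decay
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive step
    return theta, m, v
```

Note that the decay factor $(1 - \eta\lambda)$ never passes through $\sqrt{\hat{v}_t}$, so every parameter is regularized with the same effective strength regardless of its gradient history.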
**Closed-form solution for linear regression.** For $\mathcal{L} = \|y - X\theta\|^2 + \lambda\|\theta\|^2$, the solution is:

$$\theta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

The term $\lambda I$ ensures the matrix is invertible even when $X^\top X$ is singular (more features than samples). Geometrically, it shrinks the OLS solution toward zero along directions of small variance in $X$.
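The closed form is a single linear solve. A NumPy sketch with made-up data, with shapes chosen so that $X^\top X$ alone is singular:

```python
import numpy as np

# Ridge solution (X^T X + lambda I)^{-1} X^T y on illustrative data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # more features than samples, so
y = rng.normal(size=5)         # X^T X (10x10, rank <= 5) is singular
lam = 0.5

theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
# the system is solvable despite rank(X^T X) <= 5
```

Using `np.linalg.solve` instead of explicitly inverting the matrix is both faster and numerically more stable.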
**Why L1 produces sparsity.** The subgradient of $|\theta_i|$ is $\text{sign}(\theta_i)$, which applies constant pressure of magnitude $\lambda$ regardless of how small $\theta_i$ is. Once $|\theta_i|$ is small enough that this pressure dominates the data gradient, $\theta_i$ is driven exactly to zero. In contrast, L2's gradient $\lambda \theta_i$ diminishes as $\theta_i \to 0$, never reaching zero.
**Geometric interpretation.** The L1 constraint set $\|\theta\|_1 \leq c$ is a diamond (cross-polytope) with corners on the coordinate axes. The loss function's level curves are more likely to first touch a corner of the diamond than a smooth point -- and corners correspond to sparse solutions (some coordinates exactly zero).
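The "constant pressure" argument corresponds to the soft-thresholding (proximal) operator used by Lasso solvers such as ISTA. A minimal sketch:

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal operator of t*||.||_1: shrink every coordinate by t and
    zero out anything with |theta_i| <= t -- the source of exact sparsity."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

w = np.array([0.3, -0.05, 1.2, 0.0])
soft_threshold(w, 0.1)   # -> [0.2, 0.0, 1.1, 0.0]: small weights snap to zero
```

Compare with L2, whose proximal operator $\theta / (1 + t)$ only rescales and therefore never produces exact zeros.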
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | $\lambda \sum_i \lvert\theta_i\rvert$ | $\frac{\lambda}{2} \sum_i \theta_i^2$ |
| Effect on parameters | Sparse (many exactly zero) | Small (all shrunk, none exactly zero) |
| Bayesian prior | Laplace: $p(\theta_i) \propto e^{-\lambda \lvert\theta_i\rvert}$ | Gaussian: $p(\theta_i) \propto e^{-\lambda \theta_i^2 / 2}$ |
| Constraint geometry | Diamond (cross-polytope) | Sphere (ball) |
| Feature selection | Yes (implicit) | No |
| Smoothness | Non-differentiable at $\theta_i = 0$ | Smooth everywhere |
| Use in deep learning | Rare (non-smooth optimization) | Standard (as weight decay) |
**Elastic Net** combines both: $\lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2}\|\theta\|_2^2$. This encourages sparsity (L1) while handling correlated features gracefully (L2). In deep learning, a practical approximation is weight decay (L2) with structured pruning (explicit sparsity).
During training, independently set each neuron's activation to zero with probability $p$ (the **drop rate**, typically $p = 0.1$ to $0.5$) [@srivastava2014dropout]:
$$h_{\text{drop}} = h \odot m, \qquad m_i \sim \text{Bernoulli}(1 - p)$$

To maintain the expected activation magnitude, **inverted dropout** scales by $1/(1-p)$ during training:

$$\text{Train: } \tilde{h} = \frac{1}{1-p} \cdot h \odot m \qquad\qquad \text{Test: } \tilde{h} = h$$
This way, no modification is needed at test time.
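A minimal NumPy sketch of inverted dropout (the function name is illustrative):

```python
import numpy as np

def inverted_dropout(h, p=0.5, training=True, rng=None):
    """Inverted dropout: scale by 1/(1-p) at train time so that
    E[output] = h, and test time needs no modification at all."""
    if not training:
        return h
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p      # keep each unit with prob 1 - p
    return h * mask / (1 - p)
```

Because the scaling happens during training, inference is just the identity: the same forward pass works with dropout compiled out entirely.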
**Dropout as ensemble averaging.** A network with $n$ units and dropout samples from $2^n$ possible sub-networks (one for each dropout mask). At test time, using all units (with properly scaled weights) approximates the geometric mean of all sub-network predictions. This ensemble interpretation explains dropout's regularization effect: no neuron can rely on the presence of any other, which forces redundant representations.
**Dropout in modern architectures.** Dropout is less common in modern transformers than in older architectures:
- **Vision Transformers (ViT):** often use no dropout or very low rates, relying instead on data augmentation and weight decay.
- **LLMs:** typically do not use dropout during pretraining (GPT-3, LLaMA), as the massive dataset size provides implicit regularization.
- **Where it persists:** attention dropout (dropping attention weights) and path dropout / DropPath (dropping entire residual branches) remain useful, especially when data is limited.
**Batch normalization** [@ioffe2015batch] normalizes activations across the batch dimension, then applies a learned affine transformation:
$$\hat{h}_i = \gamma \cdot \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$

where $\mu_B = \frac{1}{B}\sum_{i=1}^B h_i$ and $\sigma_B^2 = \frac{1}{B}\sum_{i=1}^B (h_i - \mu_B)^2$ are the batch mean and variance, and $\gamma, \beta$ are learnable scale and shift parameters.
**Layer normalization** [@ba2016layer] normalizes across the feature dimension instead of the batch dimension:
$$\hat{h}_j = \gamma_j \cdot \frac{h_j - \mu_h}{\sqrt{\sigma_h^2 + \epsilon}} + \beta_j$$

where $\mu_h = \frac{1}{d}\sum_{j=1}^d h_j$ and $\sigma_h^2 = \frac{1}{d}\sum_{j=1}^d (h_j - \mu_h)^2$ are computed over the $d$ features of each sample independently.
**When to use which:** BatchNorm works well for CNNs with large, fixed batch sizes but breaks down when batch size is small or the batch statistics are unreliable (e.g., distributed training with small per-GPU batches). LayerNorm is independent of batch size and is the standard for transformers and sequence models. **RMSNorm** ($\hat{h} = h / \sqrt{\frac{1}{d}\sum h_j^2 + \epsilon}$, without mean subtraction) is increasingly preferred for LLMs due to simplicity and slight speed advantage.
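Both LayerNorm and RMSNorm are a few lines each. A NumPy sketch, normalizing over the last (feature) axis:

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize each sample over its d features (batch-size independent)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def rms_norm(h, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction, just divide by the root mean square."""
    rms = np.sqrt((h ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * h / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# each row now has approximately zero mean and unit variance
```

Dropping the mean subtraction saves one reduction per token, which is part of RMSNorm's speed advantage at LLM scale.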
Monitor validation loss during training and stop when it begins to increase:
$$\theta^* = \theta_{t^*}, \qquad t^* = \arg\min_{t \leq T} \mathcal{L}_{\text{val}}(\theta_t)$$

In practice, use **patience** $k$: stop if validation loss has not improved for $k$ consecutive evaluations. Save the best checkpoint.
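A sketch of the patience loop; `evaluate` is a hypothetical hook assumed to train one epoch and return the validation loss:

```python
def train_with_early_stopping(evaluate, max_epochs=100, patience=5):
    """Generic early-stopping loop: `evaluate(epoch)` runs one epoch of
    training and returns the validation loss (hypothetical hook)."""
    best_loss, best_epoch, bad_evals = float("inf"), 0, 0
    for epoch in range(max_epochs):
        val_loss = evaluate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch, bad_evals = val_loss, epoch, 0
            # save_checkpoint(epoch)  # persist the best weights here
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break                 # patience exhausted: stop training
    return best_epoch, best_loss
```

For example, a U-shaped validation curve like `(epoch - 10)**2` bottoms out at epoch 10, and the loop stops `patience` evaluations later.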
**Connection to L2 regularization.** For linear regression with gradient descent and a small learning rate, early stopping after $t$ steps is equivalent to L2 regularization with $\lambda \approx 1/(\eta t)$ [@bishop2006pattern]. Intuitively, the number of training steps limits how far $\theta$ can move from its initialization (which acts as a zero-centered prior). More steps mean weaker regularization.
**Data augmentation** is a form of regularization that increases the effective dataset size by applying label-preserving transformations. Common augmentations:
| Domain | Augmentations |
|---|---|
| Vision | Random crop, horizontal flip, color jitter, RandAugment, Mixup, CutMix |
| Text | Token masking (BERT), back-translation, synonym replacement, dropout on embeddings |
| Audio | Time stretching, pitch shifting, SpecAugment (masking frequency/time bands) |
**Mixup** (Zhang et al., 2018) creates virtual training examples by linearly interpolating both inputs and labels:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \qquad \lambda \sim \text{Beta}(\alpha, \alpha)$$
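A minimal NumPy sketch of a Mixup batch, assuming one-hot labels so they can be interpolated alongside the inputs:

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Mixup: convex combination of a batch with a shuffled copy of itself.
    One lambda ~ Beta(alpha, alpha) is drawn per batch (as in the paper)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))       # partner index for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix, lam
```

Because the labels are mixed with the same $\lambda$ as the inputs, the standard cross-entropy loss can be used unchanged on the interpolated targets.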
For a model $\hat{f}$ trained on dataset $\mathcal{D}$ drawn from $P(X,Y)$ with $Y = f(X) + \epsilon$, $\epsilon \sim (0, \sigma^2)$, the expected prediction error at a point $x$ decomposes as:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\left(f(x) - \bar{f}(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[(\hat{f}(x) - \bar{f}(x))^2\right]}_{\text{Variance}}$$

where $\bar{f}(x) = \mathbb{E}_{\mathcal{D}}[\hat{f}(x)]$. The expectation is over different training sets $\mathcal{D}$ drawn from the same distribution.
*Proof.* Let $\bar{f}(x) = \mathbb{E}_{\mathcal{D}}[\hat{f}(x)]$. Then:

$$\mathbb{E}\left[(y - \hat{f})^2\right] = \mathbb{E}\left[(y - f + f - \bar{f} + \bar{f} - \hat{f})^2\right]$$

Expanding and noting that the cross terms vanish (since $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\hat{f} - \bar{f}] = 0$ by definition, and $\epsilon$ is independent of $\hat{f}$):

$$= \mathbb{E}[\epsilon^2] + (f - \bar{f})^2 + \mathbb{E}\left[(\hat{f} - \bar{f})^2\right] = \sigma^2 + \text{Bias}^2 + \text{Variance} \qquad \square$$
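The decomposition can be verified numerically. A Monte Carlo sketch with a toy estimator (the sample mean fitting a constant function; all values are illustrative):

```python
import numpy as np

# Monte Carlo check of the decomposition for a toy estimator:
# predict f(x) = 2 with the mean of n noisy training samples.
rng = np.random.default_rng(0)
f_true, sigma, n, trials = 2.0, 1.0, 5, 20_000

preds = rng.normal(f_true, sigma, size=(trials, n)).mean(axis=1)  # one fit per training set
y_new = rng.normal(f_true, sigma, size=trials)                    # fresh test targets

mse = np.mean((y_new - preds) ** 2)
bias2 = (f_true - preds.mean()) ** 2
var = preds.var()
# mse is close to sigma^2 + bias2 + var  (noise + Bias^2 + Variance)
```

Here the sample mean is unbiased, so the bias term is near zero and the variance term matches the familiar $\sigma^2/n$ of the sample mean.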
| Regime | Bias | Variance | Model Behavior | Fix |
|---|---|---|---|---|
| Underfitting | High | Low | Model too simple; misses patterns | Increase capacity, reduce regularization |
| Overfitting | Low | High | Model memorizes noise | Increase regularization, more data |
| Sweet spot | Balanced | Balanced | Best generalization | -- |
| Double descent | Low | Low (surprisingly) | Very overparameterized | Let model grow past interpolation threshold |
**Double descent.** Classical theory predicts that test error follows a U-shaped curve as model capacity increases (underfitting $\to$ optimal $\to$ overfitting). However, modern deep learning exhibits **double descent** [@belkin2019reconciling; @nakkiran2021deep]: test error peaks at the interpolation threshold (where the model can just barely fit the training data) and then *decreases* as capacity increases further into the overparameterized regime. This is partly explained by:
- **Implicit regularization** from SGD, which finds minimum-norm solutions in the overparameterized regime
- **Inductive biases** of neural network architectures (locality, weight sharing)
- **Benign overfitting** -- the model interpolates the training data but does so smoothly, with low test error