
Regularization

Regularization is the set of techniques that constrain a model to prefer simpler hypotheses, preventing it from memorizing the training set at the expense of generalization. Understanding regularization requires understanding why overfitting occurs and how different regularizers address the underlying causes.

The Overfitting Problem

A model with sufficient capacity can achieve zero training loss ($\mathcal{L}_{\text{train}} \approx 0$) by memorizing every training example, including noise. However, such a model performs poorly on unseen data ($\mathcal{L}_{\text{test}} \gg 0$) because it has learned patterns specific to the training set rather than the underlying data-generating process. The generalization gap $\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$ measures the severity of overfitting.

A remarkable fact: a neural network with $P$ parameters can memorize a dataset of $N \leq P$ random label assignments -- where the labels bear no relation to the inputs [@zhang2021understanding]. This proves that the model class alone does not determine generalization; the training procedure and data structure also matter.

L2 Regularization (Ridge / Weight Decay)

Add the squared Euclidean norm of parameters to the loss:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2 = \mathcal{L}(\theta) + \frac{\lambda}{2} \sum_i \theta_i^2$$

where $\lambda > 0$ controls the regularization strength.

The gradient of the regularized loss is $\nabla \mathcal{L}_{\text{reg}} = \nabla \mathcal{L} + \lambda \theta$, giving the SGD update:

$$\theta_{t+1} = \theta_t - \eta(\nabla \mathcal{L}(\theta_t) + \lambda \theta_t) = (1 - \eta\lambda)\theta_t - \eta \nabla \mathcal{L}(\theta_t)$$

The multiplicative factor $(1 - \eta\lambda)$ shrinks every weight toward zero each step, hence the name **weight decay**.
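
The equivalence of the two forms of the update can be checked numerically; a minimal sketch, with illustrative values of $\eta$ and $\lambda$ and a random vector standing in for the data gradient $\nabla \mathcal{L}(\theta_t)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)            # current parameters
grad = rng.normal(size=5)             # stand-in for the data gradient ∇L(θ)
eta, lam = 0.1, 0.01                  # learning rate and decay coefficient

# Form 1: gradient of L + (λ/2)||θ||² is ∇L + λθ
update_a = theta - eta * (grad + lam * theta)

# Form 2: multiplicative shrink by (1 - ηλ), then the plain gradient step
update_b = (1 - eta * lam) * theta - eta * grad

assert np.allclose(update_a, update_b)  # identical for plain SGD
```

For plain SGD the two forms are algebraically identical; the next subsection explains why this breaks down for adaptive optimizers.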

**Bayesian interpretation.** L2 regularization is equivalent to placing a Gaussian prior $p(\theta) \propto \exp(-\lambda\|\theta\|^2/2)$ on the parameters. Minimizing $\mathcal{L}_{\text{reg}}$ is equivalent to computing the **maximum a posteriori (MAP)** estimate:

$$\theta_{\text{MAP}} = \arg\max_\theta \; p(\mathcal{D}|\theta) \cdot p(\theta) = \arg\min_\theta \; \underbrace{-\log p(\mathcal{D}|\theta)}_{\mathcal{L}(\theta)} + \underbrace{\frac{\lambda}{2}\|\theta\|^2}_{-\log p(\theta) + \text{const}}$$

Larger $\lambda$ corresponds to a tighter prior -- a stronger belief that parameters should be small.

**L2 regularization $\neq$ weight decay for adaptive optimizers.** For SGD, adding $\frac{\lambda}{2}\|\theta\|^2$ to the loss is equivalent to multiplying $\theta$ by $(1 - \eta\lambda)$ each step. For Adam, the L2 gradient term $\lambda\theta$ gets divided by $\sqrt{\hat{v}_t}$, making the effective regularization strength parameter-dependent and inconsistent. **AdamW** [@loshchilov2019adamw] fixes this by applying weight decay directly: $\theta \leftarrow (1 - \eta\lambda)\theta - \eta \cdot \text{Adam\_step}$, decoupling regularization from the adaptive learning rate. AdamW is the standard for transformer training.

**Closed-form solution for linear regression.** For $\mathcal{L} = \|y - X\theta\|^2 + \lambda\|\theta\|^2$, the solution is:

$$\theta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

The term $\lambda I$ ensures the matrix is invertible even when $X^\top X$ is singular (more features than samples). Geometrically, it shrinks the OLS solution toward zero along directions of small variance in $X$.
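
A small NumPy sketch of the closed form, using a hypothetical random design with more features than samples so that $X^\top X$ is singular and plain OLS would fail:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))    # 10 samples, 50 features: X^T X has rank <= 10
y = rng.normal(size=10)
lam = 1.0

# Adding λI makes X^T X + λI positive definite, hence invertible.
theta = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T @ y)

# The solution satisfies the regularized normal equations X^T(Xθ - y) + λθ = 0.
residual = X.T @ (X @ theta - y) + lam * theta
assert np.allclose(residual, 0)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable choice.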

L1 Regularization (Lasso)

Add the sum of absolute values of parameters:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \|\theta\|_1 = \mathcal{L}(\theta) + \lambda \sum_i |\theta_i|$$

**Why L1 produces sparsity.** The subgradient of $|\theta_i|$ is $\text{sign}(\theta_i)$, which applies constant pressure of magnitude $\lambda$ regardless of how small $\theta_i$ is. Once $|\theta_i|$ is small enough that this pressure dominates the data gradient, $\theta_i$ is driven exactly to zero. In contrast, L2's gradient $\lambda \theta_i$ diminishes as $\theta_i \to 0$, never reaching zero.
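
This "driven exactly to zero" behavior is what the soft-thresholding (proximal) step for the L1 penalty implements; a minimal sketch, with an illustrative threshold standing in for $\eta\lambda$:

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal step for t·||θ||₁: shrink each coordinate by t,
    setting anything with magnitude below t exactly to zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

theta = np.array([0.005, -0.02, 0.8, -1.5])
out = soft_threshold(theta, 0.05)   # threshold t = ηλ, illustrative value

assert out[0] == 0.0 and out[1] == 0.0          # small coords: exactly zero
assert np.allclose(out[2:], [0.75, -1.45])      # large coords: shrunk by t
```

Note that the L2 analogue would multiply every coordinate by a factor less than one, shrinking but never zeroing.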

**Geometric interpretation.** The L1 constraint set $\|\theta\|_1 \leq c$ is a diamond (cross-polytope) with corners on the coordinate axes. The loss function's level curves are more likely to first touch a corner of the diamond than a smooth point -- and corners correspond to sparse solutions (some coordinates exactly zero).

| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | $\lambda \sum_i \lvert\theta_i\rvert$ | $\frac{\lambda}{2} \sum_i \theta_i^2$ |
| Effect on parameters | Sparse (many exactly zero) | Small (all shrunk, none exactly zero) |
| Bayesian prior | Laplace: $p(\theta_i) \propto e^{-\lambda\lvert\theta_i\rvert}$ | Gaussian: $p(\theta_i) \propto e^{-\lambda\theta_i^2/2}$ |
| Constraint geometry | Diamond (cross-polytope) | Sphere ($\ell_2$ ball) |
| Feature selection | Yes (implicit) | No |
| Smoothness | Non-differentiable at $\theta_i = 0$ | Smooth everywhere |
| Use in deep learning | Rare (non-smooth optimization) | Standard (as weight decay) |

**Elastic Net** combines both: $\lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2}\|\theta\|_2^2$. This encourages sparsity (L1) while handling correlated features gracefully (L2). In deep learning, a practical approximation is weight decay (L2) with structured pruning (explicit sparsity).

Dropout

During training, independently set each neuron's activation to zero with probability $p$ (the **drop rate**, typically $p = 0.1$ to $0.5$) [@srivastava2014dropout]:

$$h^{\text{drop}} = h \odot m, \quad m_i \sim \text{Bernoulli}(1 - p)$$

To maintain the expected activation magnitude, **inverted dropout** scales by $1/(1-p)$ during training:

$$\text{Train: } \quad \tilde{h} = \frac{1}{1 - p} \cdot h \odot m \qquad \text{Test: } \quad \tilde{h} = h$$

This way, no modification is needed at test time.
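
A minimal NumPy sketch of inverted dropout (the `dropout` helper is illustrative, not from any particular framework); the assertion checks that rescaling keeps the expected activation unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, train=True, rng=rng):
    """Inverted dropout: zero each unit with probability p,
    rescale survivors by 1/(1-p) so E[output] = h."""
    if not train:
        return h                        # identity at test time
    mask = rng.random(h.shape) >= p     # keep with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones(100_000)
out = dropout(h, p=0.3)
assert abs(out.mean() - 1.0) < 0.01     # expectation preserved (up to noise)
assert np.all(dropout(h, p=0.3, train=False) == h)
```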

**Dropout as ensemble averaging.** A network with $n$ units and dropout samples from $2^n$ possible sub-networks (one for each dropout mask). At test time, using all units (with properly scaled weights) approximates the geometric mean of all sub-network predictions. This ensemble interpretation explains dropout's regularization effect: no unit can rely on the presence of any particular other unit, which forces redundant representations.

**Dropout in modern architectures.** Dropout is less common in modern transformers than in older architectures:
  • Vision Transformers (ViT): Often use no dropout or very low rates, relying instead on data augmentation and weight decay.
  • LLMs: Typically do not use dropout during pretraining (GPT-3, LLaMA), as the massive dataset size provides implicit regularization.
  • Where it persists: Attention dropout (dropping attention weights) and path dropout / DropPath (dropping entire residual branches) remain useful, especially when data is limited.

Batch Normalization and Layer Normalization

**Batch normalization** [@ioffe2015batch] normalizes activations across the batch dimension, then applies a learned affine transformation:

$$\hat{h}_i = \gamma \cdot \frac{h_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \beta$$

where $\mu_{\mathcal{B}} = \frac{1}{B}\sum_{i=1}^B h_i$ and $\sigma_{\mathcal{B}}^2 = \frac{1}{B}\sum_{i=1}^B (h_i - \mu_{\mathcal{B}})^2$ are the batch mean and variance, and $\gamma, \beta$ are learnable scale and shift parameters.

**Layer normalization** [@ba2016layer] normalizes across the feature dimension instead of the batch dimension:

$$\hat{h}_j = \gamma_j \cdot \frac{h_j - \mu_h}{\sqrt{\sigma_h^2 + \epsilon}} + \beta_j$$

where $\mu_h = \frac{1}{d}\sum_{j=1}^d h_j$ and $\sigma_h^2 = \frac{1}{d}\sum_{j=1}^d (h_j - \mu_h)^2$ are computed over the $d$ features for each sample independently.

**When to use which:** BatchNorm works well for CNNs with large, fixed batch sizes but breaks down when batch size is small or the batch statistics are unreliable (e.g., distributed training with small per-GPU batches). LayerNorm is independent of batch size and is the standard for transformers and sequence models. **RMSNorm** ($\hat{h} = h / \sqrt{\frac{1}{d}\sum h_j^2 + \epsilon}$, without mean subtraction) is increasingly preferred for LLMs due to simplicity and slight speed advantage.
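
A minimal NumPy sketch of LayerNorm and RMSNorm as defined above (helper names are illustrative); note that each row is normalized independently of the rest of the batch:

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature dimension (last axis)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def rms_norm(h, gamma, eps=1e-5):
    """RMSNorm: scale by the root-mean-square only, no mean subtraction."""
    rms = np.sqrt((h ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * h / rms

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                       # 4 samples, 8 features
out = layer_norm(h, gamma=np.ones(8), beta=np.zeros(8))

# Each row now has (approximately) zero mean and unit standard deviation.
assert np.allclose(out.mean(axis=-1), 0, atol=1e-6)
assert np.allclose(out.std(axis=-1), 1, atol=1e-2)
```

A BatchNorm sketch would be identical except for normalizing over `axis=0` (the batch dimension) and maintaining running statistics for test time.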

Early Stopping

Monitor validation loss during training and stop when it begins to increase:

$$\theta^* = \theta_{t^*}, \quad t^* = \arg\min_{t \leq T} \mathcal{L}_{\text{val}}(\theta_t)$$

In practice, use patience $k$: stop if validation loss has not improved for $k$ consecutive evaluations. Save the best checkpoint.
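
The patience rule can be sketched as follows, with a precomputed list of validation losses standing in for the training loop:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the index of the best checkpoint, stopping once the loss
    has failed to improve for `patience` consecutive evaluations."""
    best_loss, best_step, bad_evals = float("inf"), 0, 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0  # save checkpoint
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break                                        # stop training
    return best_step

# Loss improves until step 3, then rises: training halts,
# and the step-3 checkpoint is the one kept.
assert train_with_early_stopping([1.0, 0.8, 0.7, 0.65, 0.7, 0.9, 1.2]) == 3
```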

**Connection to L2 regularization.** For linear regression with gradient descent and small learning rate, early stopping after $t$ steps is equivalent to L2 regularization with $\lambda \approx 1/(\eta t)$ [@bishop2006pattern]. Intuitively, the number of training steps limits how far $\theta$ can move from its initialization (which acts as a zero prior). More steps $=$ less regularization.

Data Augmentation

**Data augmentation** is a form of regularization that increases the effective dataset size by applying label-preserving transformations. Common augmentations:

| Domain | Augmentations |
|---|---|
| Vision | Random crop, horizontal flip, color jitter, RandAugment, Mixup, CutMix |
| Text | Token masking (BERT), back-translation, synonym replacement, dropout on embeddings |
| Audio | Time stretching, pitch shifting, SpecAugment (masking frequency/time bands) |

**Mixup** (Zhang et al., 2018) creates virtual training examples by linearly interpolating both inputs and labels: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \sim \text{Beta}(\alpha, \alpha)$.
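
A minimal sketch of Mixup on a single pair of one-hot-labeled examples (the `mixup` helper is illustrative; real pipelines mix whole shuffled batches):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2, rng=rng):
    """Interpolate a pair of inputs and their one-hot labels with λ ~ Beta(α, α)."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])   # example of class 0
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])   # example of class 1
x, y = mixup(x1, y1, x2, y2)

assert np.isclose(y.sum(), 1.0)   # the mixed label is still a valid distribution
```

With small $\alpha$ the Beta distribution concentrates near 0 and 1, so most mixed examples stay close to one of the originals.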

Bias--Variance Tradeoff

For a model $\hat{f}$ trained on dataset $\mathcal{D}$ drawn from $P(X,Y)$ with $Y = f(X) + \epsilon$, $\epsilon \sim (0, \sigma^2)$, the expected prediction error at a point $x$ decomposes as:

$$\mathbb{E}_{\mathcal{D}}\!\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}_{\mathcal{D}}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[(\hat{f}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x)])^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

The expectation is over different training sets $\mathcal{D}$ drawn from the same distribution.

**Proof.** Let $\bar{f}(x) = \mathbb{E}_{\mathcal{D}}[\hat{f}(x)]$. Then:

$$\mathbb{E}[(y - \hat{f})^2] = \mathbb{E}[(y - f + f - \bar{f} + \bar{f} - \hat{f})^2]$$

Expanding and noting that the cross terms vanish (since $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\hat{f} - \bar{f}] = 0$ by definition, and $\epsilon$ is independent of $\hat{f}$):

$$= \mathbb{E}[\epsilon^2] + (f - \bar{f})^2 + \mathbb{E}[(\hat{f} - \bar{f})^2] = \sigma^2 + \text{Bias}^2 + \text{Variance} \quad \square$$
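
The decomposition can be verified by Monte Carlo simulation; a sketch using the sample mean of $n$ noisy draws as a toy predictor (which has zero bias and variance $\sigma^2/n$), with all constants illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
f_x, sigma = 2.0, 0.5                  # true value f(x) and noise std at a point x
n_datasets, n_train = 20_000, 5

# Each "trained model" predicts the mean of its n_train noisy samples of f(x).
preds = (f_x + sigma * rng.normal(size=(n_datasets, n_train))).mean(axis=1)

y = f_x + sigma * rng.normal(size=n_datasets)   # fresh test targets, one per model
total = ((y - preds) ** 2).mean()
bias2 = (preds.mean() - f_x) ** 2
variance = preds.var()

# Decomposition holds up to Monte Carlo error: total ≈ bias² + variance + σ²
assert abs(total - (bias2 + variance + sigma**2)) < 0.02
```

Here `variance` comes out near $\sigma^2/n_{\text{train}} = 0.05$ and `bias2` near zero, matching the theory for the sample-mean estimator.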

| Regime | Bias | Variance | Model Behavior | Fix |
|---|---|---|---|---|
| Underfitting | High | Low | Model too simple; misses patterns | Increase capacity, reduce regularization |
| Overfitting | Low | High | Model memorizes noise | Increase regularization, more data |
| Sweet spot | Balanced | Balanced | Best generalization | -- |
| Double descent | Low | Low (surprisingly) | Very overparameterized | Let model grow past interpolation threshold |

**Double descent.** Classical theory predicts that test error follows a U-shaped curve as model capacity increases (underfitting $\to$ optimal $\to$ overfitting). However, modern deep learning exhibits **double descent** [@belkin2019reconciling; @nakkiran2021deep]: test error peaks at the interpolation threshold (where the model can just barely fit the training data) and then *decreases* as capacity increases further into the overparameterized regime. This is partly explained by:
  1. Implicit regularization from SGD, which finds minimum-norm solutions in the overparameterized regime
  2. Inductive biases of neural network architectures (locality, weight sharing)
  3. Benign overfitting -- the model interpolates training data but does so smoothly, with low test error

Summary of Regularization Techniques

| Technique | Mechanism | Computational Cost | Modern Usage |
|---|---|---|---|
| L2 / Weight decay | Shrinks weights toward zero | Negligible | Standard (AdamW) |
| L1 | Drives weights to exactly zero | Negligible | Rare in deep learning |
| Dropout | Random neuron zeroing; ensemble effect | ~10% training overhead | Declining; used in attention/path |
| BatchNorm | Normalizes by batch stats; adds noise | ~5% overhead | CNNs |
| LayerNorm / RMSNorm | Normalizes by feature stats | ~5% overhead | Transformers, LLMs |
| Data augmentation | Increases effective dataset size | Data pipeline overhead | Standard for vision |
| Early stopping | Limits optimization trajectory | None (saves compute) | Standard |
| Label smoothing | Softens one-hot targets | Negligible | Common for classification |
| Gradient clipping | Bounds gradient magnitude | Negligible | Standard for transformers |

Notation Summary

| Symbol | Meaning |
|---|---|
| $\lambda$ | Regularization strength / weight decay coefficient |
| $\lVert\theta\rVert_1, \lVert\theta\rVert_2$ | L1 and L2 norms of parameters |
| $p$ | Dropout probability (drop rate) |
| $m$ | Binary dropout mask |
| $\odot$ | Elementwise product |
| $\gamma, \beta$ | Learnable scale and shift in normalization |
| $\mu_{\mathcal{B}}, \sigma_{\mathcal{B}}^2$ | Batch mean and variance |
| $\hat{f}$ | Learned predictor (a random variable over training sets) |
| $f$ | True function |
| $\sigma^2$ | Irreducible noise variance |
| $\mathcal{L}_{\text{val}}$ | Validation loss |

References