Statistical Learning Theory
Statistical learning theory provides the mathematical foundations for understanding when and why machine learning works. It answers the central question: how can a model that performs well on training data be expected to perform well on unseen data? The theory connects model complexity, sample size, and generalization through elegant inequalities that guide both algorithm design and practical intuitions.
The Learning Problem
The goal of learning is to find a hypothesis $h$ minimizing the true risk $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]$, but the data distribution $\mathcal{D}$ is unknown. We can only compute the empirical risk (training loss):
$$\hat{R}(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i).$$
The gap $R(h) - \hat{R}(h)$ is the generalization gap. The central goal of learning theory is to bound this gap.
The key question is: when does low empirical risk guarantee low true risk? This requires controlling the uniform convergence of $\hat{R}(h)$ to $R(h)$ over the entire hypothesis class $\mathcal{H}$.
- Approximation error (the excess risk of the best in-class hypothesis $h^*$ over the Bayes-optimal predictor) depends on the expressiveness of $\mathcal{H}$ (it decreases with larger $\mathcal{H}$)
- Estimation error $R(\hat{h}) - R(h^*)$ depends on the complexity of $\mathcal{H}$ relative to the sample size $n$ (it increases with larger $\mathcal{H}$)
A larger hypothesis class reduces approximation error but increases estimation error. The optimal $\mathcal{H}$ balances these two sources of error. This is the statistical formalization of the bias-variance tradeoff.
PAC Learning
A class $\mathcal{H}$ is PAC-learnable if there is an algorithm that, for every accuracy $\epsilon$, confidence $\delta$, and distribution $\mathcal{D}$, returns a hypothesis $\hat{h}$ with $R(\hat{h}) \leq R(h^*) + \epsilon$ with probability at least $1 - \delta$, given $n(\epsilon, \delta)$ samples. The minimal $n(\epsilon, \delta)$ is the sample complexity of learning $\mathcal{H}$.
Setting the RHS of the union bound equal to $\delta$ and solving for $\epsilon$: with probability at least $1 - \delta$, for every $h \in \mathcal{H}$,
$$R(h) \leq \hat{R}(h) + \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}.$$
Proof sketch. Apply Hoeffding's inequality to each $h \in \mathcal{H}$ individually: $\Pr[|R(h) - \hat{R}(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}$. Take a union bound over all $|\mathcal{H}|$ hypotheses, giving $\Pr[\exists h \in \mathcal{H}: |R(h) - \hat{R}(h)| > \epsilon] \leq 2|\mathcal{H}|e^{-2n\epsilon^2}$.
- Logarithmic in $|\mathcal{H}|$: Even exponentially large hypothesis classes can be learned with polynomial data. A class of $b$-bit programs has $|\mathcal{H}| \leq 2^b$, requiring only $O(b/\epsilon^2)$ samples.
- Quadratic in $1/\epsilon$: Halving the error requires $4\times$ more data.
- Independent of $\mathcal{D}$: The bound holds for any data distribution (distribution-free learning).
This bound is only useful for finite $\mathcal{H}$. For continuous hypothesis classes (neural networks, kernel methods), we need VC dimension or Rademacher complexity.
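As a quick numerical check of the finite-class bound, the sketch below (a hypothetical helper name, assuming the Hoeffding-plus-union-bound sample complexity $n \geq \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2}$ derived above) computes how many samples suffice for a class of 32-bit programs:

```python
import math

def finite_class_sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """Samples sufficient so that, w.p. >= 1 - delta, every h in a finite
    class of size H_size satisfies |R(h) - R_hat(h)| <= eps
    (Hoeffding + union bound)."""
    return math.ceil((math.log(H_size) + math.log(2.0 / delta)) / (2.0 * eps**2))

# A class of 32-bit programs: |H| = 2**32, target eps = 0.05, delta = 0.01.
n = finite_class_sample_complexity(2**32, eps=0.05, delta=0.01)
```

Because the dependence on $|\mathcal{H}|$ is logarithmic, squaring the class size only doubles its contribution to $n$, while halving $\epsilon$ quadruples $n$.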
VC Dimension
A set of points is shattered by $\mathcal{H}$ if every possible labeling of the points is realized by some $h \in \mathcal{H}$. The VC dimension $d_{VC}(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter. If $\mathcal{H}$ can shatter arbitrarily large sets, then $d_{VC}(\mathcal{H}) = \infty$.
- Upper bound: By Radon's theorem, any $d + 2$ points in $\mathbb{R}^d$ can be partitioned into two subsets whose convex hulls intersect, so there exists a labeling that no hyperplane can realize.
- Lower bound: The $d + 1$ points $\{0, e_1, \dots, e_d\}$ can be shattered. For any labeling, an appropriate hyperplane can separate the classes.
Thus a linear classifier in $\mathbb{R}^d$ has VC dimension $d + 1$: it can perfectly classify any $d + 1$ points in general position, but there exist $d + 2$ points it cannot handle.
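The same shattering argument can be checked mechanically for a simpler class. The brute-force sketch below (illustrative; function names are made up) verifies that intervals on $\mathbb{R}$ shatter two points but no set of three, i.e. their VC dimension is 2:

```python
from itertools import product

def intervals_can_realize(points, labels):
    """Check whether some interval [a, b] produces `labels` on `points`
    (1 inside the interval, 0 outside). If any interval realizes a labeling,
    one with endpoints at sample points does, so those candidates suffice."""
    candidates = sorted(points)
    for a in candidates:
        for b in candidates:
            if all((1 if a <= x <= b else 0) == y for x, y in zip(points, labels)):
                return True
    # the empty interval realizes the all-zeros labeling
    return all(y == 0 for y in labels)

def shatters(points):
    """True iff every binary labeling of `points` is realized by an interval."""
    return all(intervals_can_realize(points, labs)
               for labs in product([0, 1], repeat=len(points)))
```

For three points $x_1 < x_2 < x_3$, the labeling $(1, 0, 1)$ fails: any interval containing the outer points contains the middle one.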
| Hypothesis class | VC dimension | Notes |
|---|---|---|
| Constant classifier | $0$ | Always predicts the same label |
| Thresholds on $\mathbb{R}$ | $1$ | |
| Intervals on $\mathbb{R}$ | $2$ | |
| Linear classifiers in $\mathbb{R}^d$ | $d + 1$ | Hyperplane + bias |
| Axis-aligned rectangles in $\mathbb{R}^2$ | $4$ | Classify points inside rectangle |
| Degree-$p$ polynomials on $\mathbb{R}$ | $p + 1$ | Real-valued polynomial thresholded |
| $1$-nearest neighbor | $\infty$ | Can shatter any finite set |
| Neural networks (ReLU, $W$ weights, $L$ layers) | $O(WL \log W)$ | Bounds from Bartlett et al. (2019) |
| Finite class | $\leq \log_2 \vert\mathcal{H}\vert$ | |
The VC generalization bound: with probability at least $1 - \delta$, for all $h \in \mathcal{H}$, $R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{d_{VC}\log(n/d_{VC}) + \log(1/\delta)}{n}}\right)$. This is a uniform bound: it holds simultaneously for all $h \in \mathcal{H}$, not just the ERM solution.
The fundamental theorem of statistical learning: for binary classification, the following are equivalent:
- $\mathcal{H}$ has finite VC dimension
- $\mathcal{H}$ is PAC-learnable
- Uniform convergence holds for $\mathcal{H}$
- ERM is a consistent learning algorithm for $\mathcal{H}$
This theorem completely characterizes learnability for binary classification: a class is learnable if and only if its VC dimension is finite. The sample complexity is $n(\epsilon, \delta) = \Theta\left(\frac{d_{VC} + \log(1/\delta)}{\epsilon^2}\right)$ in the agnostic setting.
Rademacher Complexity
The empirical Rademacher complexity of $\mathcal{H}$ on a sample $S = (x_1, \dots, x_n)$ is
$$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(x_i)\right],$$
where $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher variables ($\Pr[\sigma_i = +1] = \Pr[\sigma_i = -1] = 1/2$). The Rademacher complexity is $\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{H})]$.
The resulting generalization bound holds uniformly over $\mathcal{H}$, and hence for an individual hypothesis $\hat{h}$ selected by any (possibly data-dependent) algorithm: with probability at least $1 - \delta$,
$$R(\hat{h}) \leq \hat{R}(\hat{h}) + 2\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}.$$
| Feature | VC Dimension | Rademacher Complexity |
|---|---|---|
| Data dependence | No (worst-case over ) | Yes (adapts to the data distribution) |
| Loss function | Only 0-1 loss | Any bounded loss |
| Tightness | Often loose | Tighter in practice |
| Computability | Often hard to compute exactly | Can be estimated from data |
| Multi-class | Requires extensions (Natarajan dim) | Directly applicable |
For linear models $\mathcal{H} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B\}$ and data with $\|x_i\|_2 \leq R$:
$$\hat{\mathfrak{R}}_S(\mathcal{H}) \leq \frac{BR}{\sqrt{n}}.$$
This bound depends on the norm of the weights, not the number of parameters -- explaining why overparameterized models with small weights can still generalize.
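For this linear class the supremum has a closed form, $\sup_{\|w\|_2 \leq B} \frac{1}{n}\sum_i \sigma_i \langle w, x_i \rangle = \frac{B}{n}\big\|\sum_i \sigma_i x_i\big\|_2$, so the empirical Rademacher complexity can be estimated by Monte Carlo and compared to the $BR/\sqrt{n}$ bound. A minimal sketch (dimensions and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 200, 10, 1.0                  # samples, dimension, weight-norm bound
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)  # ||x_i|| <= R = 1

def empirical_rademacher(X, B, trials=2000, rng=rng):
    """Monte Carlo estimate of E_sigma[(B/n) * ||sum_i sigma_i x_i||]."""
    n = X.shape[0]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += (B / n) * np.linalg.norm(sigma @ X)
    return total / trials

est = empirical_rademacher(X, B)
bound = B * 1.0 / np.sqrt(n)            # B * R / sqrt(n) with R = 1
```

The estimate lands slightly below the bound: Jensen's inequality ($\mathbb{E}\|Z\| \leq \sqrt{\mathbb{E}\|Z\|^2}$) is the only step that loosens it.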
For deep networks, spectrally-normalized margin bounds play the same role: they depend on the product of the layers' spectral norms times a sum of Frobenius-to-spectral norm ratios, not the parameter count. This helps explain why networks with billions of parameters can generalize: what matters is the effective complexity (norm-based), not the raw capacity (parameter count).
PAC-Bayesian Bounds
The PAC-Bayes theorem (McAllester): fix a prior $P$ over $\mathcal{H}$ before seeing data; then with probability at least $1 - \delta$, for every posterior $Q$,
$$R(Q) \leq \hat{R}(Q) + \sqrt{\frac{KL(Q \,\|\, P) + \log(n/\delta)}{2(n-1)}}.$$
- The bound penalizes posteriors that deviate from the prior (measured by KL divergence). This formalizes the Bayesian intuition that simpler models (close to the prior) generalize better.
- Flat minima: If $P = \mathcal{N}(0, \sigma_P^2 I)$ and $Q = \mathcal{N}(w, \sigma_Q^2 I)$, then $KL(Q \,\|\, P) = \frac{1}{2}\left[\frac{\|w\|_2^2 + d\,\sigma_Q^2}{\sigma_P^2} - d + d \log\frac{\sigma_P^2}{\sigma_Q^2}\right]$. Larger $\sigma_Q$ (flatter minimum, since $\hat{R}(Q)$ averages over a wider region with low loss) gives a tighter bound.
- Compression: If only $k$ of $d$ parameters matter, the effective KL is $O(k)$ rather than $O(d)$. This connects to pruning and lottery ticket hypotheses.
- PAC-Bayes bounds are currently the tightest non-vacuous generalization bounds for deep networks (Dziugaite and Roy, 2017).
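As a concrete illustration, assuming a McAllester-style complexity term $\sqrt{\frac{KL(Q\|P) + \log(n/\delta)}{2(n-1)}}$ with isotropic Gaussian prior and posterior, the KL has a closed form, and widening the posterior around the same weights visibly tightens the bound (all numbers below are arbitrary):

```python
import numpy as np

def kl_isotropic_gaussians(w, sigma_q, sigma_p):
    """KL( N(w, sigma_q^2 I) || N(0, sigma_p^2 I) ) in d = w.size dimensions."""
    d = w.size
    return 0.5 * ((np.dot(w, w) + d * sigma_q**2) / sigma_p**2
                  - d + d * np.log(sigma_p**2 / sigma_q**2))

def pac_bayes_gap(kl, n, delta):
    """McAllester-style complexity term: sqrt((KL + log(n/delta)) / (2(n-1)))."""
    return np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

w = np.full(1000, 0.01)   # a small-norm solution in d = 1000
flat  = pac_bayes_gap(kl_isotropic_gaussians(w, sigma_q=0.9, sigma_p=1.0), n=60000, delta=0.05)
sharp = pac_bayes_gap(kl_isotropic_gaussians(w, sigma_q=0.1, sigma_p=1.0), n=60000, delta=0.05)
# The wider posterior (sigma_q = 0.9, closer to the prior) has a much smaller KL,
# hence a tighter bound -- provided the flat region really does keep R_hat(Q) low.
```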
Bias-Variance Decomposition
For squared loss with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and a predictor $\hat{f}_D$ trained on a random dataset $D$:
$$\mathbb{E}_{D, \varepsilon}\left[(y - \hat{f}_D(x))^2\right] = \underbrace{\left(f(x) - \bar{f}(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\left[(\hat{f}_D(x) - \bar{f}(x))^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$
where $\bar{f}(x) = \mathbb{E}_D[\hat{f}_D(x)]$ is the average prediction over training sets. Expanding the square and noting that the cross-terms vanish (by $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{E}_D[\hat{f}_D(x) - \bar{f}(x)] = 0$) gives the decomposition.
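The decomposition can be verified numerically: refit a fixed-capacity model on many fresh datasets, estimate bias and variance at a test point, and check that they sum (with the noise term) to the expected error. A sketch with arbitrary illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)     # true regression function
sigma = 0.3                             # noise standard deviation
x0 = 0.5                                # evaluation point
degree, n, trials = 3, 30, 2000

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    coeffs = np.polyfit(x, y, degree)   # fit f_hat_D by least squares
    preds[t] = np.polyval(coeffs, x0)

bias2 = (preds.mean() - f(x0)) ** 2     # squared bias at x0
variance = preds.var()                  # variance over training sets
expected_error = bias2 + variance + sigma**2
```

A direct Monte Carlo estimate of $\mathbb{E}[(y - \hat{f}_D(x_0))^2]$ matches `expected_error` up to sampling noise, confirming that the cross-terms vanish.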
| Complexity | Bias | Variance | Test Error | Regime |
|---|---|---|---|---|
| Too simple (underparameterized) | High | Low | High (underfitting) | Classical |
| Balanced | Moderate | Moderate | Minimum | Classical sweet spot |
| Too complex (slightly overparameterized) | Low | High | High (overfitting) | Interpolation threshold |
| Very overparameterized | Low | Low | Low | Double descent / benign overfitting |
Why Overparameterized Models Generalize
- Classical regime ($p < n$): U-shaped curve, minimum at the "sweet spot."
- Interpolation threshold ($p \approx n$): Test error peaks -- the model has just enough parameters to memorize the training data, but in a brittle way.
- Modern regime ($p \gg n$): Test error decreases again as the model grows -- more parameters lead to better generalization.
The interpolation threshold is the point where the model can perfectly fit the training data ($\hat{R} = 0$). Beyond this point, there are many interpolating solutions, and the optimizer's implicit bias selects one with good generalization properties.
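For linear regression this is easy to see directly: when $p > n$ there is a whole affine family of interpolators, and the pseudoinverse picks the minimum-$\ell_2$-norm one (the solution gradient descent initialized at zero converges to). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                            # overparameterized: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_min = np.linalg.pinv(X) @ y            # minimum-norm interpolating solution
train_loss = np.mean((X @ w_min - y) ** 2)   # ~0: the model interpolates

# Any other interpolator differs by a null-space component of X and
# therefore has strictly larger norm (the components are orthogonal).
null_dir = np.linalg.svd(X)[2][-1]       # right-singular vector in the null space
w_other = w_min + null_dir
```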
Classical theory predicts that models with more parameters than data points should overfit. Modern neural networks violate this prediction (Nakkiran et al., 2021). The explanations include:
| Explanation | Mechanism | Key Result |
|---|---|---|
| Implicit regularization | SGD biases toward low-complexity solutions | For linear models, (S)GD converges to the minimum-norm solution (Gunasekar et al., 2018) |
| Flat minima | Solutions found by SGD have low curvature | PAC-Bayesian bounds are tighter for flat minima (Dziugaite and Roy, 2017) |
| Norm-based bounds | Generalization depends on weight norms, not parameter count | Rademacher bound $BR/\sqrt{n}$ is independent of parameter count |
| NTK regime | Infinite-width networks behave like kernel regression | Convergence + generalization guarantees (Jacot et al., 2018) |
| Benign overfitting | Noise is memorized in "unimportant" directions | Requires suitable effective ranks of the data covariance (Bartlett et al., 2020) |
| Feature learning | Networks learn useful representations | Beyond NTK -- finite-width networks learn features that kernels cannot |
- Effective number of parameters: Parameters constrained by regularization or implicit bias contribute less than free parameters. For ridge regression with weight decay $\lambda$, the effective degrees of freedom are $\sum_i \frac{s_i^2}{s_i^2 + \lambda}$, where $s_i$ are the singular values of the design matrix.
- Description length: The number of bits needed to describe the model up to a given precision. Compression experiments show neural networks can be compressed dramatically without losing accuracy.
- Flat minima volume: The volume of parameter space around with low loss -- PAC-Bayesian bounds formalize this as a generalization measure.
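The first notion has a closed form for ridge regression; a short sketch, assuming the standard $\mathrm{df}(\lambda) = \sum_i s_i^2 / (s_i^2 + \lambda)$ formula:

```python
import numpy as np

def effective_dof(X, lam):
    """Effective degrees of freedom of ridge regression with penalty lam:
    df(lam) = sum_i s_i^2 / (s_i^2 + lam), where s_i are singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

X = np.random.default_rng(0).normal(size=(100, 30))
# lam = 0 recovers the raw parameter count (the rank of X);
# increasing lam smoothly shrinks the effective capacity toward zero.
```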
Generalization bounds based on parameter count are vacuous for deep learning (predicting error bounds far above 1 for any realistic network). Bounds based on norms, margins, compression, or PAC-Bayes are more informative, though still not fully tight.
Uniform Convergence and Covering Numbers
The $\epsilon$-covering number $\mathcal{N}(\epsilon, \mathcal{H})$ is the smallest number of balls of radius $\epsilon$ (in a suitable metric on hypotheses) needed to cover $\mathcal{H}$. Key relationships:
- VC dimension bounds covering numbers: the growth function satisfies $\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d_{VC}} \binom{n}{i} \leq \left(\frac{en}{d_{VC}}\right)^{d_{VC}}$ (Sauer-Shelah lemma)
- Covering numbers bound Rademacher complexity via Dudley's entropy integral: $\mathfrak{R}_n(\mathcal{H}) \lesssim \int_0^\infty \sqrt{\frac{\log \mathcal{N}(\epsilon, \mathcal{H})}{n}}\, d\epsilon$
- Fat-shattering dimension: Generalizes VC dimension to real-valued functions, directly bounds covering numbers
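The Sauer-Shelah bound is easy to evaluate directly; the sketch below contrasts the exponential regime ($n \leq d_{VC}$, where all $2^n$ labelings may be realized) with the polynomial regime ($n > d_{VC}$):

```python
from math import comb

def sauer_shelah_bound(n: int, d: int) -> int:
    """Upper bound on the growth function Pi_H(n) for a class of VC dimension d:
    Pi_H(n) <= sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(min(d, n) + 1))

# For n <= d the bound equals 2**n; for n > d it grows only like (e*n/d)**d,
# which is what makes VC-based generalization bounds possible.
small = sauer_shelah_bound(5, 5)      # exponential regime
large = sauer_shelah_bound(100, 3)    # polynomial regime: far below 2**100
```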
Notation Summary
| Symbol | Meaning |
|---|---|
| $R(h)$ | True risk (expected loss) |
| $\hat{R}(h)$ | Empirical risk (training loss) |
| $R(h) - \hat{R}(h)$ | Generalization gap |
| $\mathcal{H}$ | Hypothesis class |
| $h^*$ | Best hypothesis in class: $\arg\min_{h \in \mathcal{H}} R(h)$ |
| $d_{VC}$ | VC dimension |
| $\mathfrak{R}_n(\mathcal{H})$ | Rademacher complexity |
| $\mathcal{N}(\epsilon, \mathcal{H})$ | $\epsilon$-covering number |
| $\epsilon, \delta$ | PAC parameters (accuracy, confidence) |
| $\sigma_i$ | Rademacher random variable ($\pm 1$ with equal probability) |
| $n$ | Number of training samples |
| $p$ | Number of model parameters |
| ERM | Empirical risk minimization |
| PAC | Probably approximately correct |
References
- Peter Bartlett, Nick Harvey, Christopher Liaw, Abbas Mehrabian (2019). Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. JMLR.
- Peter Bartlett, Philip Long, Gábor Lugosi, Alexander Tsigler (2020). Benign Overfitting in Linear Regression. PNAS.
- Gintare Karolina Dziugaite, Daniel Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. UAI.
- Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro (2018). Characterizing Implicit Bias in Terms of Optimization Geometry. ICML.
- Arthur Jacot, Franck Gabriel, Clement Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS.
- Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever (2021). Deep Double Descent: Where Bigger Models and More Data Can Hurt. JMLR.