Multivariable Calculus
Multivariable calculus provides the mathematical machinery for optimizing functions of many variables -- the central computational task in machine learning. Every gradient computation, every optimizer step, and every loss landscape analysis uses the tools developed here.
Partial Derivatives
The partial derivative $\frac{\partial f}{\partial x_i}$ measures how $f$ changes as $x_i$ varies with all other variables held fixed. The gradient collects all partial derivatives into a single vector:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)^{\top}$$
Directional Derivative
The directional derivative of $f$ along a unit vector $\mathbf{u}$ is

$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = \|\nabla f\| \cos\theta,$$

where $\theta$ is the angle between $\mathbf{u}$ and $\nabla f$. This is maximized ($\theta = 0$) when $\mathbf{u}$ is parallel to $\nabla f$ and minimized ($\theta = \pi$) when $\mathbf{u}$ is anti-parallel.
The gradient descent step $\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)$ moves in this direction of steepest descent (the negative gradient) with step size $\eta$. Note that this is only the best direction for infinitesimally small steps -- for finite $\eta$, curvature matters.
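A minimal sketch of this update rule, on a made-up two-dimensional quadratic (the function, step size, and iteration count are illustrative choices, not from the text):

```python
import numpy as np

# Hypothetical objective: f(x, y) = (x - 1)^2 + 10 y^2, minimized at (1, 0).
def grad_f(p):
    x, y = p
    return np.array([2.0 * (x - 1.0), 20.0 * y])

eta = 0.05                       # step size (learning rate)
p = np.array([4.0, 2.0])         # arbitrary starting point
for _ in range(500):
    p = p - eta * grad_f(p)      # step along the negative gradient

print(p)                         # approaches the minimizer (1, 0)
```

The elongated level sets (curvature 2 in $x$, 20 in $y$) already show why the gradient direction alone is not enough: the usable step size is limited by the most curved direction.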
Taylor Expansion
First order (linear approximation):

$$f(x + \delta) \approx f(x) + \nabla f(x)^{\top} \delta$$

Second order (quadratic approximation):

$$f(x + \delta) \approx f(x) + \nabla f(x)^{\top} \delta + \tfrac{1}{2}\, \delta^{\top} H \delta$$

where $H$ is the Hessian matrix with $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
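A quick numerical check of the two approximations, on a made-up smooth function (the function and evaluation point are arbitrary illustrations):

```python
import numpy as np

# Hypothetical test function f(x, y) = x^2 y + sin(y), with hand-derived
# gradient and Hessian.
def f(p):
    x, y = p
    return x ** 2 * y + np.sin(y)

def grad(p):
    x, y = p
    return np.array([2 * x * y, x ** 2 + np.cos(y)])

def hess(p):
    x, y = p
    return np.array([[2 * y,      2 * x],
                     [2 * x, -np.sin(y)]])

x = np.array([1.0, 0.5])
d = np.array([1e-2, -1e-2])                  # small perturbation delta

first  = f(x) + grad(x) @ d                  # linear model
second = first + 0.5 * d @ hess(x) @ d       # quadratic model

err1 = abs(f(x + d) - first)                 # O(|d|^2) error
err2 = abs(f(x + d) - second)                # O(|d|^3) error
print(err1, err2)
```

Shrinking $\delta$ by a factor of 10 should shrink `err1` by roughly 100 and `err2` by roughly 1000, matching the stated orders.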
| Method | Uses | Approximation | Step | Cost per step | Convergence |
|---|---|---|---|---|---|
| Gradient descent | $\nabla f$ only | Linear | $-\eta \nabla f$ | $O(n)$ | Linear |
| Newton's method | $\nabla f$ and $H$ | Quadratic | $-H^{-1} \nabla f$ | $O(n^3)$ | Quadratic (near optimum) |
| L-BFGS | $\nabla f$ + approximate $H^{-1}$ | Quasi-quadratic | $-\tilde{H}^{-1} \nabla f$ | $O(mn)$, history size $m$ | Superlinear |
Newton's method finds the exact minimum of the quadratic approximation in one step. It converges quadratically near a minimum (the error is roughly squared each iteration) but is impractical for neural networks ($n \sim 10^6$ or more) because forming and inverting the Hessian costs $O(n^2)$ memory and $O(n^3)$ computation.
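The one-step claim can be verified directly on an exactly quadratic objective (the matrix $A$ and vector $b$ below are made-up values):

```python
import numpy as np

# For f(theta) = 1/2 theta^T A theta - b^T theta with A positive definite,
# the gradient is A theta - b, the Hessian is A, and the minimizer is A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # Hessian (positive definite)
b = np.array([1.0, -1.0])

theta = np.array([10.0, -7.0])       # arbitrary starting point
grad = A @ theta - b
theta_new = theta - np.linalg.solve(A, grad)   # Newton step: theta - H^{-1} grad

print(theta_new, np.linalg.solve(A, b))        # identical: one step suffices
```

Note the use of `solve` rather than an explicit inverse; in practice the $O(n^3)$ linear solve, not the gradient, dominates the cost.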
Implicit Function Theorem and Implicit Differentiation
- Differentiating through optimization: If $\theta^*(\lambda) = \arg\min_\theta L(\theta, \lambda)$, the optimality condition is $\nabla_\theta L(\theta^*, \lambda) = 0$. The implicit function theorem gives $\frac{d\theta^*}{d\lambda} = -\left(\nabla^2_\theta L\right)^{-1} \nabla_\lambda \nabla_\theta L$ without unrolling the optimization.
- Implicit MAML (Rajeswaran et al., 2019): Differentiates through the inner loop of meta-learning (MAML itself is Finn et al., 2017) without storing the optimization trajectory.
- Differentiable optimization layers (Amos & Kolter, 2017): Backpropagation through convex optimization problems embedded as neural network layers.
- Equilibrium models (DEQ) (Bai et al., 2019): Differentiates through the fixed point without storing the forward iterations.
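The implicit-function-theorem formula above can be checked numerically on a toy inner problem with a closed-form solution (the problem and constants are invented for illustration):

```python
# Toy inner problem: L(theta, lam) = 1/2 (theta - lam)^2 + 1/2 c theta^2.
# Optimality (theta - lam) + c*theta = 0 gives theta*(lam) = lam / (1 + c).
c = 0.3
lam = 2.0

d2L_dtheta2 = 1 + c                          # second derivative in theta
d2L_dtheta_dlam = -1.0                       # mixed second derivative
ift_grad = -d2L_dtheta_dlam / d2L_dtheta2    # IFT: d theta*/d lam = 1/(1+c)

# Finite-difference check against the closed-form minimizer
eps = 1e-6
fd_grad = ((lam + eps) / (1 + c) - (lam - eps) / (1 + c)) / (2 * eps)

print(ift_grad, fd_grad)                     # agree to numerical precision
```

The same calculation in matrix form (Hessian solve instead of scalar division) is what iMAML and OptNet-style layers perform in their backward pass.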
Total Derivative and Chain Rule
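For a function whose arguments themselves depend on a parameter, say $f(x(t), y(t))$, the total derivative accumulates contributions along every dependency path:

$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$

Applied recursively through a composition of functions, this is the multivariate chain rule -- the mathematical core of backpropagation, which evaluates these products efficiently from outputs back to inputs.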
Loss Landscapes
The loss function $L(\theta)$ defines a surface in the $n$-dimensional parameter space ($n$ = number of parameters). Understanding this surface guides algorithm design.
| Feature | Definition | Hessian Condition | Frequency in High-$n$ |
|---|---|---|---|
| Local minimum | $f(\theta^*) \le f(\theta)$ in a neighborhood | $H \succ 0$ (all eigenvalues $> 0$) | Rare: $\sim 2^{-n}$ of critical points |
| Local maximum | $f(\theta^*) \ge f(\theta)$ in a neighborhood | $H \prec 0$ (all eigenvalues $< 0$) | Rare |
| Saddle point | Neither min nor max | $H$ indefinite (mixed eigenvalues) | Overwhelmingly common |
| Plateau | $\nabla f \approx 0$ over a region | Near-zero gradient and curvature | Common in deep networks |
| Valley | Narrow region with low loss | High curvature in most directions | Where SGD typically converges |
- Saddle points dominate. For a random critical point, each Hessian eigenvalue is independently positive or negative with roughly equal probability. The probability that all eigenvalues are positive (a minimum) is $\sim 2^{-n}$, which is astronomically small for $n$ in the millions.
- Local minima are near-global. Empirically, the loss values at different local minima of overparameterized networks are clustered near the global minimum -- there are no "bad" local minima that trap optimization (Choromanska et al., 2015).
- Mode connectivity. Good local minima found by different training runs are connected by low-loss paths through parameter space (Draxler et al., 2018). This suggests the loss landscape has a connected valley structure rather than isolated basins.
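The rarity of minima can be demonstrated by drawing random symmetric Hessians and counting all-positive spectra (the dimension and trial count below are arbitrary illustrative choices; for correlated random-matrix ensembles the probability decays even faster than the independent-sign $2^{-n}$ heuristic suggests):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                # already enough to make minima vanish
trials = 2000

all_positive = 0
for _ in range(trials):
    M = rng.standard_normal((n, n))
    H = (M + M.T) / 2                 # random symmetric "Hessian"
    if np.all(np.linalg.eigvalsh(H) > 0):
        all_positive += 1

print(all_positive / trials)          # essentially 0 even at n = 10
```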
Integration in ML
| Application | Integral | Approximation |
|---|---|---|
| Bayesian inference | $p(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)\, p(\theta) / p(\mathcal{D})$ | MCMC, variational inference |
| Expected loss | $\mathbb{E}_{(x, y) \sim p}\left[\ell(f_\theta(x), y)\right]$ | Monte Carlo (mini-batch) |
| Marginal likelihood | $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ | Variational bounds, Laplace approximation |
| Normalizing flows | Change of variables: $p_x(x) = p_z(f(x))\, \lvert \det \partial f / \partial x \rvert$ | Exact (tractable Jacobians) |
| Diffusion models | Reverse-time SDE/ODE integration | Numerical integrators (Euler, Heun) |
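The expected-loss row is the one used in every training loop: the integral is replaced by a sample average. A minimal sketch with a made-up distribution and loss (chosen so the exact expectation is known):

```python
import numpy as np

# Toy setup: x ~ N(0, 1) and loss l(x) = x^2, so E[l(x)] = Var(x) = 1 exactly.
rng = np.random.default_rng(42)

samples = rng.standard_normal(100_000)   # draws from p(x)
mc_estimate = (samples ** 2).mean()      # Monte Carlo average; a mini-batch
                                         # is the same estimator with small N

print(mc_estimate)                       # close to the exact value 1.0
```

The estimator's standard error shrinks as $1/\sqrt{N}$, which is why small mini-batches give noisy but unbiased gradient estimates.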
Notation Summary
| Symbol | Meaning |
|---|---|
| $\nabla f$ | Gradient of $f$ (column vector) |
| $D_{\mathbf{u}} f$ | Directional derivative in direction $\mathbf{u}$ |
| $H$ (or $\nabla^2 f$) | Hessian matrix |
| $\delta$ | Perturbation vector |
| $H \succ 0$ | $H$ is positive definite |
| $H \prec 0$ | $H$ is negative definite |
| $H \succeq 0$ | $H$ is positive semi-definite |
| $\frac{df}{dt}$ | Total derivative |
| $\frac{\partial f}{\partial x}$ | Partial derivative |