Multivariable Calculus
Multivariable calculus provides the mathematical machinery for optimizing functions of many variables, the central computational task in machine learning. Every gradient computation, every optimizer step, and every loss landscape analysis uses the tools developed here.
Partial Derivatives
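The gradient collects all partial derivatives of $f: \mathbb{R}^n \to \mathbb{R}$ into a single vector:

$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^\top$$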
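A quick way to sanity-check a hand-derived gradient is to compare it against central finite differences. The sketch below assumes a hypothetical test function `f(x1, x2) = x1^2 + 3*x1*x2`; the function, the evaluation point, and the step `h` are illustrative choices, not taken from the text.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h   # perturb only coordinate i, hold the others fixed
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # Hypothetical test function: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]


def analytic_gradient(x):
    # Partial derivatives worked out by hand: [2*x1 + 3*x2, 3*x1]
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


point = [1.0, 2.0]
approx = numerical_gradient(f, point)
exact = analytic_gradient(point)
print("finite differences:", approx)
print("analytic gradient: ", exact)
```

Each entry of the result perturbs one coordinate while holding the others fixed, which is the definition of a partial derivative; stacking the entries gives the gradient vector.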
Directional Derivative
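The directional derivative of $f$ at $\mathbf{x}$ along a unit vector $\mathbf{u}$ is

$$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x})^\top \mathbf{u} = \|\nabla f(\mathbf{x})\| \cos\theta$$

where $\theta$ is the angle between $\mathbf{u}$ and $\nabla f(\mathbf{x})$. This is maximized ($\cos\theta = 1$) when $\mathbf{u}$ is parallel to $\nabla f(\mathbf{x})$ and minimized ($\cos\theta = -1$) when $\mathbf{u}$ is anti-parallel, so $-\nabla f(\mathbf{x})$ is the direction of steepest descent.
The gradient descent step $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x})$ moves in this direction with step size $\eta$. Note that this is only the best direction for infinitesimally small steps: for finite $\eta$, curvature matters.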
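The claim that the directional derivative peaks along the gradient can be checked numerically. This is a minimal sketch, assuming a hypothetical function `f(x1, x2) = x1^2 + 2*x2^2` and a one-degree sweep over unit vectors in the plane; both are illustrative assumptions.

```python
import math


def grad_f(x):
    # Gradient of the hypothetical function f(x1, x2) = x1^2 + 2*x2^2
    return [2 * x[0], 4 * x[1]]


def directional_derivative(grad, u):
    # Inner product of the gradient with a unit direction u
    return sum(g * ui for g, ui in zip(grad, u))


x = [1.0, 1.0]
g = grad_f(x)
g_norm = math.hypot(*g)

# Sweep unit vectors u = (cos a, sin a) in one-degree increments
best_angle, best_value = None, -math.inf
for k in range(360):
    angle = math.radians(k)
    u = [math.cos(angle), math.sin(angle)]
    value = directional_derivative(g, u)
    if value > best_value:
        best_angle, best_value = angle, value

gradient_angle = math.atan2(g[1], g[0])
print("angle of best direction (deg):", math.degrees(best_angle))
print("angle of gradient (deg):      ", math.degrees(gradient_angle))
print("largest directional derivative:", best_value, "vs gradient norm:", g_norm)
```

Up to the one-degree resolution of the sweep, the best direction coincides with the gradient direction and the largest directional derivative approaches the gradient norm.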
Taylor Expansion
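First order (linear approximation):

$$f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta}$$

Second order (quadratic approximation):

$$f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} + \tfrac{1}{2} \boldsymbol{\delta}^\top \mathbf{H} \boldsymbol{\delta}$$

where $\mathbf{H}$ is the Hessian matrix with $H_{ij} = \partial^2 f / \partial x_i \partial x_j$, evaluated at $\mathbf{x}$.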
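The difference between the two approximations shows up in how the error scales with the perturbation size: roughly quadratically for first order and cubically for second order. The sketch below assumes a hypothetical smooth function `f(x1, x2) = exp(x1) * sin(x2)` with hand-coded gradient and Hessian; the expansion point and perturbation direction are arbitrary choices.

```python
import math


def f(x):
    # Hypothetical smooth test function
    return math.exp(x[0]) * math.sin(x[1])


def grad(x):
    e = math.exp(x[0])
    return [e * math.sin(x[1]), e * math.cos(x[1])]


def hessian(x):
    e, s, c = math.exp(x[0]), math.sin(x[1]), math.cos(x[1])
    return [[e * s, e * c], [e * c, -e * s]]


def taylor(x, delta, order):
    """First- or second-order Taylor approximation of f(x + delta)."""
    value = f(x) + sum(gi * di for gi, di in zip(grad(x), delta))
    if order == 2:
        H = hessian(x)
        quad = sum(delta[i] * H[i][j] * delta[j]
                   for i in range(2) for j in range(2))
        value += 0.5 * quad
    return value


x0 = [0.5, 1.0]
errors = {}
for scale in (0.1, 0.05):
    delta = [scale, -scale]
    exact = f([x0[0] + delta[0], x0[1] + delta[1]])
    errors[scale] = (abs(exact - taylor(x0, delta, 1)),
                     abs(exact - taylor(x0, delta, 2)))
    print(scale, errors[scale])
```

Halving the perturbation should cut the first-order error by roughly a factor of 4 and the second-order error by roughly a factor of 8, matching the order of the first neglected term in each expansion.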
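| Method | Uses | Approximation | Step | Cost per step | Convergence |
|---|---|---|---|---|---|
| Gradient descent | $\nabla f$ only | Linear | $-\eta \nabla f$ | $O(d)$ | Linear (error shrinks by about $1 - 1/\kappa$ per step) |
| Newton's method | $\nabla f$ and $\mathbf{H}$ | Quadratic | $-\mathbf{H}^{-1} \nabla f$ | $O(d^3)$ | Quadratic (near optimum) |
| L-BFGS | $\nabla f$ + approximate $\mathbf{H}^{-1}$ | Quasi-quadratic | $-\tilde{\mathbf{H}}^{-1} \nabla f$ | $O(md)$, $m \ll d$ | Superlinear |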
For a strictly convex quadratic objective, Newton's method finds the exact minimum in one step (the quadratic approximation is exact). For a general smooth $f$, it instead converges quadratically near a minimum ($\|\mathbf{x}_{k+1} - \mathbf{x}^*\| \le C \|\mathbf{x}_k - \mathbf{x}^*\|^2$). Either way it is impractical for neural networks (where $d$ is in the millions or more) because forming and inverting the Hessian costs $O(d^2)$ memory and $O(d^3)$ computation.
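Consider the two-dimensional quadratic $f(\mathbf{x}) = \tfrac{1}{2}(\lambda_1 x_1^2 + \lambda_2 x_2^2)$ with $\lambda_1 \gg \lambda_2 > 0$. The gradient and Hessian are

$$\nabla f(\mathbf{x}) = \begin{pmatrix} \lambda_1 x_1 \\ \lambda_2 x_2 \end{pmatrix}, \qquad \mathbf{H} = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}$$

so $\mathbf{H} \succ 0$. The minimum is at $\mathbf{x}^* = \mathbf{0}$ and the condition number is $\kappa = \lambda_1 / \lambda_2$.
Gradient descent. With step size $\eta = 1/\lambda_1$, one step gives

$$\mathbf{x} - \eta \nabla f(\mathbf{x}) = \begin{pmatrix} 0 \\ (1 - 1/\kappa)\, x_2 \end{pmatrix}$$

The coordinate matched to the step size, $x_1$, snaps to its optimum, but $x_2$ barely moves; subsequent steps shrink $x_2$ by a factor of $1 - 1/\kappa$ each iteration, the linear rate.
Newton's method. Because $f$ is exactly quadratic, one Newton step reaches the minimum:

$$\mathbf{x} - \mathbf{H}^{-1} \nabla f(\mathbf{x}) = \mathbf{0}$$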
Newton rescales each direction by its curvature, which is exactly what gradient descent fails to do on ill-conditioned problems.
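The contrast can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical diagonal quadratic with curvatures 1 and 100 (condition number 100) and a gradient descent step size equal to the reciprocal of the largest curvature; the numbers are illustrative and not taken from the text.

```python
# Hypothetical diagonal Hessian entries: f(x) = 0.5 * (1*x1^2 + 100*x2^2)
CURVATURES = [1.0, 100.0]


def gradient(x):
    return [c * xi for c, xi in zip(CURVATURES, x)]


def gradient_descent_step(x, eta):
    return [xi - eta * gi for xi, gi in zip(x, gradient(x))]


def newton_step(x):
    # H is diagonal, so applying H^{-1} divides each gradient entry
    # by the curvature of its own coordinate.
    return [xi - gi / c for xi, gi, c in zip(x, gradient(x), CURVATURES)]


start = [1.0, 1.0]
eta = 1.0 / max(CURVATURES)  # largest step that is stable in the stiff direction

x_gd = start
for _ in range(10):
    x_gd = gradient_descent_step(x_gd, eta)

x_newton = newton_step(start)

print("gradient descent after 10 steps:", x_gd)
print("Newton after 1 step:            ", x_newton)
```

After ten gradient descent steps the high-curvature coordinate is at its optimum while the low-curvature coordinate has only shrunk by a factor of `0.99**10`; the single Newton step lands on the minimum in both coordinates.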
Implicit Function Theorem and Implicit Differentiation
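- Differentiating through optimization: If $\mathbf{x}^*(\theta) = \arg\min_{\mathbf{x}} f(\mathbf{x}, \theta)$, the optimality condition is $\nabla_{\mathbf{x}} f(\mathbf{x}^*, \theta) = 0$. The implicit function theorem gives $\partial \mathbf{x}^* / \partial \theta = -\left(\nabla^2_{\mathbf{x}\mathbf{x}} f\right)^{-1} \nabla^2_{\mathbf{x}\theta} f$ without unrolling the optimization.
- Implicit MAML: Differentiates through the inner loop of MAML-style meta-learning (Finn et al., 2017) without storing the optimization trajectory.
- Differentiable optimization layers (Amos & Kolter, 2017): Backpropagation through convex optimization problems embedded as neural network layers.
- Equilibrium models (DEQ) (Bai et al., 2019): Differentiates through the fixed point $\mathbf{z}^* = f_\theta(\mathbf{z}^*, \mathbf{x})$ without storing the forward iterations.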
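The first item in the list above can be illustrated with a scalar toy problem. The sketch assumes a hypothetical inner objective `f(x, theta) = x^4/4 + x^2/2 - theta*x`, whose optimality condition is `x^3 + x - theta = 0`; the inner solver, the objective, and the value of `theta` are all illustrative assumptions. The implicit gradient uses only second derivatives at the solution and is compared with a finite-difference estimate obtained by re-solving the inner problem.

```python
def solve_inner(theta, iters=50):
    """Minimize f(x, theta) = x^4/4 + x^2/2 - theta*x over x.

    Uses Newton iterations on the optimality condition x^3 + x - theta = 0.
    """
    x = 0.0
    for _ in range(iters):
        x -= (x ** 3 + x - theta) / (3 * x ** 2 + 1)
    return x


def implicit_gradient(theta):
    """dx*/dtheta from the implicit function theorem, no unrolling."""
    x_star = solve_inner(theta)
    d2f_dx2 = 3 * x_star ** 2 + 1   # second derivative of f in x at the solution
    d2f_dxdtheta = -1.0             # mixed derivative of f in x and theta
    return -d2f_dxdtheta / d2f_dx2


theta = 2.0
x_star = solve_inner(theta)
ift = implicit_gradient(theta)

# Reference value: finite differences through the whole inner solve
h = 1e-5
fd = (solve_inner(theta + h) - solve_inner(theta - h)) / (2 * h)

print("x*(theta):        ", x_star)
print("implicit gradient:", ift)
print("finite difference:", fd)
```

The implicit gradient needs only the solution point, not the fifty solver iterations that produced it, which is the memory saving the methods listed above rely on.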
Total Derivative and Chain Rule
Loss Landscapes
The loss function $L(\theta)$ defines a surface over the $d$-dimensional parameter space ($d$ = number of parameters). Understanding this surface guides algorithm design.
| Feature | Definition | Hessian Condition (sufficient) | Frequency in High-$d$ |
|---|---|---|---|
| Local minimum | $L(\theta^*) \le L(\theta)$ in a neighborhood | $\mathbf{H} \succ 0$ (all eigenvalues $> 0$) | Rare: $\sim 2^{-d}$ of critical points |
| Local maximum | $L(\theta^*) \ge L(\theta)$ in a neighborhood | $\mathbf{H} \prec 0$ (all eigenvalues $< 0$) | Rare |
| Saddle point | Neither min nor max | $\mathbf{H}$ indefinite (mixed eigenvalues) | Overwhelmingly common |
| Plateau | $\nabla L \approx 0$ over a region | Near-zero gradient and curvature | Common in deep networks |
| Valley | Narrow region with low loss | High curvature in most directions | Where SGD typically converges |
- Saddle points dominate. Heuristically, if we model each Hessian eigenvalue at a random critical point as independently positive or negative with roughly equal probability (Dauphin et al., 2014), the probability that all $d$ eigenvalues are positive (a minimum) is approximately $2^{-d}$, which is astronomically small for large $d$. (This independence assumption is a simplification; in practice, eigenvalue signs correlate with loss level.)
- Local minima are near-global. Empirically, the loss values at different local minima of overparameterized networks are clustered near the global minimum, so there are no "bad" local minima that trap optimization (Choromanska et al., 2015).
- Mode connectivity. Good local minima found by different training runs are connected by low-loss paths through parameter space (Draxler et al., 2018). This suggests the loss landscape has a connected valley structure rather than isolated basins.
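The Hessian conditions and the counting heuristic above can be sketched in code. The example assumes 2x2 symmetric Hessians, for which eigenvalues have a closed form, and the three test Hessians come from hypothetical functions (`x^2 + y^2`, `x^2 - y^2`, `-x^2 - y^2`); the helper names are illustrative.

```python
import math


def eigenvalues_2x2(a, b, c):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius


def classify(hessian_entries, tol=1e-12):
    """Classify a critical point from the entries (a, b, c) of its Hessian."""
    lo, hi = eigenvalues_2x2(*hessian_entries)
    if lo > tol:
        return "local minimum"
    if hi < -tol:
        return "local maximum"
    if lo < -tol and hi > tol:
        return "saddle point"
    return "inconclusive (degenerate)"


def log10_prob_all_positive(d):
    # Heuristic from the text: each eigenvalue sign is an independent fair
    # coin flip, so P(all positive) = 2^(-d). Returned as log10 to avoid underflow.
    return -d * math.log10(2)


print(classify((2.0, 0.0, 2.0)))     # Hessian of x^2 + y^2 at the origin
print(classify((2.0, 0.0, -2.0)))    # Hessian of x^2 - y^2 at the origin
print(classify((-2.0, 0.0, -2.0)))   # Hessian of -x^2 - y^2 at the origin
print("log10 P(minimum) for d = 1000:", log10_prob_all_positive(1000))
```

Even for a modest `d = 1000`, the heuristic probability has a base-10 logarithm near -301, which gives a sense of how quickly minima become rare relative to saddle points under this simplified model.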
Integration in ML
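| Application | Integral | Approximation |
|---|---|---|
| Bayesian inference | $p(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)\,p(\theta) / \int p(\mathcal{D} \mid \theta')\,p(\theta')\,d\theta'$ | MCMC, variational inference |
| Expected loss | $\mathbb{E}_{(x,y) \sim p}[\ell(\theta; x, y)] = \int \ell(\theta; x, y)\,p(x, y)\,dx\,dy$ | Monte Carlo (mini-batch) |
| Marginal likelihood | $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ | Variational lower bound (ELBO), Laplace approximation |
| Normalizing flows | Change of variables: $p(z) = p(x)\,\lvert \det \partial f / \partial x \rvert^{-1}$ with $z = f(x)$ | Exact, using architectures with tractable Jacobian determinants |
| Diffusion models | Time integral of the reverse-process dynamics (SDE/ODE) | Numerical integrators (Euler, Heun) |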
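The expected-loss row is the one most readers meet in practice: a mini-batch average is a Monte Carlo estimate of the integral. The sketch below assumes a synthetic one-parameter regression dataset, a squared-error loss, and a batch size of 64; all of these are hypothetical choices made for illustration.

```python
import random

random.seed(0)


def loss(theta, x, y):
    # Squared error of a one-parameter linear model (illustrative choice)
    return (theta * x - y) ** 2


# Synthetic dataset: y = 2x + Gaussian noise (hypothetical data)
data = []
for _ in range(10_000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + random.gauss(0.0, 0.1)))


def full_loss(theta):
    """Average loss over the whole dataset (stands in for the expectation)."""
    return sum(loss(theta, x, y) for x, y in data) / len(data)


def minibatch_loss(theta, batch_size):
    """Monte Carlo estimate of the average loss from a random mini-batch."""
    batch = random.sample(data, batch_size)
    return sum(loss(theta, x, y) for x, y in batch) / batch_size


theta = 1.5
exact = full_loss(theta)
estimates = [minibatch_loss(theta, 64) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)

print("full-data loss:          ", exact)
print("one mini-batch estimate: ", estimates[0])
print("mean of 200 estimates:   ", mean_estimate)
```

Individual mini-batch estimates are noisy, but they are unbiased for the full-data average, so their mean settles close to it; this is the property stochastic gradient methods depend on.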
Notation Summary
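| Symbol | Meaning |
|---|---|
| $\nabla f$ | Gradient of $f$ (column vector) |
| $D_{\mathbf{u}} f$ | Directional derivative in direction $\mathbf{u}$ |
| $\mathbf{H}$ | Hessian matrix |
| $\boldsymbol{\delta}$ | Perturbation vector |
| $\mathbf{H} \succ 0$ | $\mathbf{H}$ is positive definite |
| $\mathbf{H} \prec 0$ | $\mathbf{H}$ is negative definite |
| $\mathbf{H} \succeq 0$ | $\mathbf{H}$ is positive semi-definite |
| $\frac{df}{dt}$ | Total derivative |
| $\frac{\partial f}{\partial x_i}$ | Partial derivative |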
References
- Brandon Amos, J. Zico Kolter (2017). OptNet: Differentiable Optimization as a Layer in Neural Networks. ICML.
- Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2019). Deep Equilibrium Models. NeurIPS.
- Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun (2015). The Loss Surfaces of Multilayer Networks. AISTATS.
- Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NeurIPS.
- Felix Draxler, Kambis Veschgini, Manfred Salmhofer, Fred Hamprecht (2018). Essentially No Barriers in Neural Network Energy Landscape. ICML.
- Chelsea Finn, Pieter Abbeel, Sergey Levine (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML.