Multivariable Calculus

Multivariable calculus provides the mathematical machinery for optimizing functions of many variables -- the central computational task in machine learning. Every gradient computation, every optimizer step, and every loss landscape analysis uses the tools developed here.

Partial Derivatives

For $f: \mathbb{R}^n \to \mathbb{R}$, the **partial derivative** with respect to $x_i$ measures the rate of change of $f$ while holding all other variables fixed:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_n)}{h}$$

The gradient collects all $n$ partial derivatives into a single vector:

$$\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{bmatrix} \in \mathbb{R}^n$$

The gradient $\nabla f(x)$ points in the direction of steepest ascent of $f$ at $x$. More precisely, for any unit vector $u$, the rate of change $\nabla f^\top u$ is maximized when $u = \nabla f / \|\nabla f\|$. Gradient descent moves in the opposite direction: $x_{t+1} = x_t - \eta \nabla f(x_t)$.
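As a quick numerical sanity check (a sketch; the test function `f` and step size `h` are illustrative choices, not anything prescribed above), each partial derivative can be approximated by a central difference and compared against the analytic gradient:

```python
import numpy as np

def f(x):
    # Illustrative scalar field: f(x) = x0^2 + 3*x1^2
    return x[0]**2 + 3 * x[1]**2

def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative with a central difference,
    perturbing one coordinate at a time while holding the rest fixed."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([1.0, 2.0])
analytic = np.array([2 * x[0], 6 * x[1]])  # gradient of f by hand: [2*x0, 6*x1]
numeric = numerical_gradient(f, x)
```

This coordinate-wise check is the standard way to debug hand-derived gradients before trusting them in an optimizer.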

Directional Derivative

The rate of change of $f$ at point $x$ in the direction of a unit vector $u$ is:

$$D_u f(x) = \nabla f(x)^\top u = \|\nabla f(x)\| \cos \theta$$

where $\theta$ is the angle between $\nabla f$ and $u$. This is maximized ($= \|\nabla f\|$) when $u$ is parallel to $\nabla f$ and minimized ($= -\|\nabla f\|$) when $u$ is anti-parallel.
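The identity $D_u f = \nabla f^\top u$ can be verified numerically by differencing $f$ along a direction (a sketch; `f`, the point `x`, and the direction `u` are illustrative choices):

```python
import numpy as np

def f(x):
    # Same illustrative quadratic: f(x) = x0^2 + 3*x1^2
    return x[0]**2 + 3 * x[1]**2

x = np.array([1.0, 2.0])
grad = np.array([2.0, 12.0])            # analytic gradient at x

u = np.array([1.0, 1.0]) / np.sqrt(2)   # an arbitrary unit direction
h = 1e-6
# Finite-difference slope of f along u vs. the formula grad . u
slope_fd = (f(x + h * u) - f(x - h * u)) / (2 * h)
slope_formula = grad @ u
# cos(theta) recovers the ||grad f|| cos(theta) form of the same quantity
cos_theta = grad @ u / np.linalg.norm(grad)
```

The slope along any unit direction never exceeds $\|\nabla f\|$, which is exactly the "steepest ascent" claim.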

**Why gradient descent is the steepest descent.** We want to find the unit direction $u^*$ that decreases $f$ the most:

$$u^* = \arg\min_{\|u\| = 1} D_u f = \arg\min_{\|u\| = 1} \nabla f^\top u = -\frac{\nabla f}{\|\nabla f\|}$$

The step $x_{t+1} = x_t - \eta \nabla f$ moves in this direction with step size $\eta \|\nabla f\|$. Note that this is only the best direction for infinitesimally small steps -- for finite $\eta$, curvature matters.

Taylor Expansion

The Taylor expansion of $f: \mathbb{R}^n \to \mathbb{R}$ around a point $x_0$ is:

First order (linear approximation): $f(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta$

Second order (quadratic approximation):

$$f(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta + \frac{1}{2} \delta^\top H(x_0) \, \delta$$

where $H(x_0) = \nabla^2 f(x_0)$ is the Hessian matrix with $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.

**First-order vs. second-order methods:**
| Method | Uses | Approximation | Step | Cost per step | Convergence |
|---|---|---|---|---|---|
| Gradient descent | $\nabla f$ only | Linear | $-\eta \nabla f$ | $O(n)$ | Linear ($O(e^{-t/\kappa})$) |
| Newton's method | $\nabla f$ and $H$ | Quadratic | $-H^{-1} \nabla f$ | $O(n^3)$ | Quadratic (near optimum) |
| L-BFGS | $\nabla f$ + approximate $H^{-1}$ | Quasi-quadratic | $-\tilde{H}^{-1} \nabla f$ | $O(mn)$, $m \ll n$ | Superlinear |

Newton's method finds the exact minimum of the quadratic approximation in one step. It converges quadratically near a minimum ($\|x_{t+1} - x^*\| \leq c\|x_t - x^*\|^2$) but is impractical for neural networks ($n > 10^6$) because forming and inverting the Hessian costs $O(n^2)$ memory and $O(n^3)$ computation.
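The "exact in one step on a quadratic" claim is easy to demonstrate (a sketch; the matrix `A`, vector `b`, and starting point are illustrative). For $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, the second-order Taylor expansion is the function itself, so a single Newton step from any starting point lands on the minimizer $A^{-1} b$:

```python
import numpy as np

# Convex quadratic f(x) = 1/2 x^T A x - b^T x, with Hessian A (positive definite)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

x0 = np.array([5.0, 5.0])
# One Newton step: x1 = x0 - H^{-1} grad f(x0); solve instead of forming A^{-1}
x1 = x0 - np.linalg.solve(A, grad(x0))
x_star = np.linalg.solve(A, b)   # the true minimizer, where grad f = 0
```

Using `np.linalg.solve` rather than an explicit inverse is the standard numerically stable way to apply $H^{-1}$.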

Implicit Function Theorem and Implicit Differentiation

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be continuously differentiable, and suppose $F(x_0, y_0) = 0$ and $\frac{\partial F}{\partial y}(x_0, y_0)$ is invertible. Then there exists a neighborhood of $x_0$ and a unique differentiable function $y(x)$ such that $F(x, y(x)) = 0$, with:

$$\frac{\partial y}{\partial x} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$$

**Implicit differentiation in ML.** This theorem underpins several important techniques:
  1. Differentiating through optimization: If $y^*(x) = \arg\min_y g(x, y)$, the optimality condition is $\nabla_y g(x, y^*) = 0$. The implicit function theorem gives $\partial y^* / \partial x$ without unrolling the optimization.
  2. Implicit MAML (Rajeswaran et al., 2019): Differentiates through the inner loop of meta-learning without storing the optimization trajectory.
  3. Differentiable optimization layers ([@amos2017optnet]): Backpropagation through convex optimization problems embedded as neural network layers.
  4. Equilibrium models (DEQ) ([@bai2019deep]): Differentiates through the fixed point $z^* = f(z^*; \theta)$ without storing the forward iterations.
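A minimal scalar example of the pattern (a sketch; the equation $F(x, y) = y^3 + y - x = 0$ and the Newton solver are illustrative choices): the theorem gives $dy/dx = 1/(3y^2 + 1)$ directly, which matches differencing "through the solver" without ever unrolling it.

```python
import numpy as np

# F(x, y) = y^3 + y - x = 0 implicitly defines y(x); dF/dy = 3y^2 + 1 > 0,
# so the implicit function theorem applies everywhere.
def solve_y(x, iters=100):
    """Find the root of F(x, .) in y by Newton's method."""
    y = 0.0
    for _ in range(iters):
        y -= (y**3 + y - x) / (3 * y**2 + 1)
    return y

x0 = 2.0
y0 = solve_y(x0)  # y0 = 1, since 1^3 + 1 = 2
# Implicit function theorem: dy/dx = -(dF/dy)^{-1} dF/dx = 1 / (3 y^2 + 1)
dydx_ift = 1.0 / (3 * y0**2 + 1)
# Finite-difference check that differentiates through the whole solver
h = 1e-5
dydx_fd = (solve_y(x0 + h) - solve_y(x0 - h)) / (2 * h)
```

The same structure scales up: in the ML applications above, `solve_y` is an inner optimization or fixed-point iteration, and the theorem replaces backpropagation through its iterates with one linear solve.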

Total Derivative and Chain Rule

If $f$ depends on $t$ both directly and through $x(t)$, the **total derivative** is:

$$\frac{df}{dt} = \frac{\partial f}{\partial t} + \sum_{i=1}^n \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} = \frac{\partial f}{\partial t} + \nabla_x f^\top \frac{dx}{dt}$$

The total derivative appears in:
  - **Learning rate schedules:** The loss changes due to both parameter updates and the changing learning rate
  - **Curriculum learning:** The data distribution changes over time, so the expected loss has both explicit and implicit dependence on time
  - **Neural ODEs** [@chen2018neural]: Model the hidden state as a continuous function $h(t)$ governed by $dh/dt = f(h(t), t; \theta)$, where the total derivative controls dynamics
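The two-term decomposition can be checked numerically (a sketch; $f(x, t) = x^2 t$ and $x(t) = \sin t$ are illustrative choices): the explicit term $\partial f/\partial t$ plus the chain-rule term $(\partial f/\partial x)(dx/dt)$ together reproduce the derivative of the composed function $g(t) = f(x(t), t)$.

```python
import numpy as np

# f depends on t both directly and through x(t) = sin(t)
def f(x, t):
    return x**2 * t

def g(t):
    return f(np.sin(t), t)

t0 = 1.0
x0 = np.sin(t0)
# Total derivative: df/dt = df/dt (explicit) + (df/dx)(dx/dt)
#                 = x^2   +   (2 x t)(cos t)
total = x0**2 + 2 * x0 * t0 * np.cos(t0)
# Central-difference derivative of the composed function
h = 1e-6
fd = (g(t0 + h) - g(t0 - h)) / (2 * h)
```

Dropping either term gives the wrong answer, which is the usual bug when a quantity depends on time through more than one path.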

Loss Landscapes

The loss function $\mathcal{L}(\theta)$ defines a surface in the $P$-dimensional parameter space ($P$ = number of parameters). Understanding this surface guides algorithm design.

| Feature | Definition | Hessian Condition | Frequency in High-$d$ |
|---|---|---|---|
| Local minimum | $\mathcal{L}(\theta^*) \leq \mathcal{L}(\theta)$ in a neighborhood | $H \succ 0$ (all eigenvalues $> 0$) | Rare: $\sim 2^{-P}$ of critical points |
| Local maximum | $\mathcal{L}(\theta^*) \geq \mathcal{L}(\theta)$ in a neighborhood | $H \prec 0$ (all eigenvalues $< 0$) | Rare |
| Saddle point | Neither min nor max | $H$ indefinite (mixed eigenvalues) | Overwhelmingly common |
| Plateau | $\nabla \mathcal{L} \approx 0$ over a region | Near-zero gradient and curvature | Common in deep networks |
| Valley | Narrow region with low loss | High curvature in most directions | Where SGD typically converges |
In high dimensions ($P \gg 1$), the character of the loss landscape changes qualitatively:
  1. Saddle points dominate. For a random critical point, each Hessian eigenvalue is independently positive or negative with roughly equal probability. The probability that all $P$ eigenvalues are positive (a minimum) is $\sim 2^{-P}$, which is astronomically small for $P = 10^9$.
  2. Local minima are near-global. Empirically, the loss values at different local minima of overparameterized networks are clustered near the global minimum -- there are no "bad" local minima that trap optimization ([@choromanska2015loss]).
  3. Mode connectivity. Good local minima found by different training runs are connected by low-loss paths through parameter space ([@draxler2018essentially]). This suggests the loss landscape has a connected valley structure rather than isolated basins.
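The "minima become vanishingly rare" effect shows up even at small scale. A sketch, with random symmetric Gaussian matrices standing in for Hessians at random critical points (an illustrative model, not a neural-network Hessian; in this Gaussian ensemble the eigenvalues are correlated, so the true decay is even faster than the independent-sign $2^{-P}$ heuristic):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_minima(P, trials=2000):
    """Fraction of random symmetric P x P matrices that are positive
    definite, i.e. whose critical point would be a local minimum."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((P, P))
        H = (A + A.T) / 2                  # symmetrize to get a "Hessian"
        if np.linalg.eigvalsh(H)[0] > 0:   # smallest eigenvalue positive?
            count += 1
    return count / trials

f2 = fraction_minima(2)    # a modest fraction of 2x2 matrices qualify
f10 = fraction_minima(10)  # already essentially zero at P = 10
```

Almost every sampled "critical point" at $P = 10$ is a saddle, matching the claim above.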

Integration in ML

While optimization (differentiation) dominates ML computationally, integration appears in several important places:
| Application | Integral | Approximation |
|---|---|---|
| Bayesian inference | $p(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)\,p(\theta)\,/\,p(\mathcal{D})$ | Variational inference, MCMC |
| Expected loss | $\mathbb{E}_{x,y}[\ell(f_\theta(x), y)]$ | Monte Carlo (mini-batch) |
| Marginal likelihood | $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ | Laplace approximation, ELBO |
| Normalizing flows | Change of variables: $p(z) = p(x)\,\lvert\det \partial f / \partial x\rvert$ | Exact (tractable Jacobian) |
| Diffusion models | $\int_0^T f(x_t, t)\,dt$ (SDE/ODE) | Numerical integrators (Euler, Heun) |
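The expected-loss row is the one every training loop computes implicitly: mini-batch losses are Monte Carlo estimates of an integral over the data distribution. A sketch (the linear model, noise level, and batch sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x,y}[(w*x - y)^2] under y = 2x + 0.1*noise, x ~ N(0,1),
# by averaging mini-batch losses, exactly as SGD does.
w = 1.5

def batch_loss(batch_size):
    x = rng.standard_normal(batch_size)
    y = 2 * x + 0.1 * rng.standard_normal(batch_size)
    return np.mean((w * x - y) ** 2)

# Closed form for comparison: E[(w-2)^2 x^2] + 0.01 = 0.25 + 0.01 = 0.26
estimates = [batch_loss(64) for _ in range(500)]
mc_estimate = np.mean(estimates)
```

Each mini-batch gives an unbiased but noisy estimate; averaging many batches concentrates around the true integral at the usual $O(1/\sqrt{N})$ Monte Carlo rate.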

Notation Summary

| Symbol | Meaning |
|---|---|
| $\nabla f$ | Gradient of $f$ (column vector) |
| $D_u f$ | Directional derivative in direction $u$ |
| $H = \nabla^2 f$ | Hessian matrix |
| $\delta$ | Perturbation vector |
| $H \succ 0$ | $H$ is positive definite |
| $H \prec 0$ | $H$ is negative definite |
| $H \succeq 0$ | $H$ is positive semi-definite |
| $df/dt$ | Total derivative |
| $\partial f/\partial x_i$ | Partial derivative |
