Skip to main content

Multivariable Calculus

Multivariable calculus provides the mathematical machinery for optimizing functions of many variables, the central computational task in machine learning. Every gradient computation, every optimizer step, and every loss landscape analysis uses the tools developed here.

Partial Derivatives

For $f: \mathbb{R}^n \to \mathbb{R}$, the **partial derivative** with respect to $x_i$ measures the rate of change of $f$ while holding all other variables fixed:

fxi=limh0f(x1,,xi+h,,xn)f(x1,,xn)h\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_n)}{h}

The gradient collects all nn partial derivatives into a single vector:

f(x)=[f/x1f/xn]Rn\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{bmatrix} \in \mathbb{R}^n

The gradient $\nabla f(x)$ points in the direction of steepest ascent of $f$ at $x$. More precisely, for any unit vector $u$, the rate of change $\nabla f^\top u$ is maximized when $u = \nabla f / \|\nabla f\|$. Gradient descent moves in the opposite direction: $x_{t+1} = x_t - \eta \nabla f(x_t)$.

Directional Derivative

The rate of change of $f$ at point $x$ in the direction of a unit vector $u$ is:

Duf(x)=f(x)u=f(x)cosθD_u f(x) = \nabla f(x)^\top u = \|\nabla f(x)\| \cos \theta

where θ\theta is the angle between f\nabla f and uu. This is maximized (=f= \|\nabla f\|) when uu is parallel to f\nabla f and minimized (=f= -\|\nabla f\|) when uu is anti-parallel.

**Why gradient descent is the steepest descent.** We want to find the unit direction $u^*$ that decreases $f$ the most:

u=argminu=1Duf=argminu=1fu=ffu^* = \arg\min_{\|u\| = 1} D_u f = \arg\min_{\|u\| = 1} \nabla f^\top u = -\frac{\nabla f}{\|\nabla f\|}

The step xt+1=xtηfx_{t+1} = x_t - \eta \nabla f moves in this direction with step size ηf\eta \|\nabla f\|. Note that this is only the best direction for infinitesimally small steps: for finite η\eta, curvature matters.

Taylor Expansion

The Taylor expansion of $f: \mathbb{R}^n \to \mathbb{R}$ around a point $x_0$ is:

First order (linear approximation): f(x0+δ)f(x0)+f(x0)δf(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta

Second order (quadratic approximation):

f(x0+δ)f(x0)+f(x0)δ+12δH(x0)δf(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta + \frac{1}{2} \delta^\top H(x_0) \, \delta

where H(x0)=2f(x0)H(x_0) = \nabla^2 f(x_0) is the Hessian matrix with Hij=2fxixjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}.

**First-order vs. second-order methods:**
MethodUsesApproximationStepCost per stepConvergence
Gradient descentf\nabla f onlyLinearηf-\eta \nabla fO(n)O(n)Linear (O((11/κ)t)O((1 - 1/\kappa)^t) per step)
Newton's methodf\nabla f and HHQuadraticH1f-H^{-1} \nabla fO(n3)O(n^3)Quadratic (near optimum)
L-BFGSf\nabla f + approximate H1H^{-1}Quasi-quadraticH~1f-\tilde{H}^{-1} \nabla fO(mn)O(mn), mnm \ll nSuperlinear

For a strictly quadratic objective, Newton's method finds the exact minimum in one step (the quadratic approximation is exact). For a general smooth ff, it instead converges quadratically near a minimum (xt+1xcxtx2\|x_{t+1} - x^*\| \leq c\|x_t - x^*\|^2). Either way it is impractical for neural networks (n>106n > 10^6) because forming and inverting the Hessian costs O(n2)O(n^2) memory and O(n3)O(n^3) computation.

**Gradient descent vs. Newton on a quadratic.** Take the anisotropic quadratic

f(x1,x2)=12(x12+10x22),x0=[11].f(x_1, x_2) = \tfrac{1}{2}\left(x_1^2 + 10\,x_2^2\right), \qquad x_0 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.

The gradient and Hessian are

f(x)=[x110x2],H=[10010],\nabla f(x) = \begin{bmatrix} x_1 \\ 10\,x_2 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 \\ 0 & 10 \end{bmatrix},

so f(x0)=(1,10)\nabla f(x_0) = (1, 10)^\top. The minimum is at x=(0,0)x^* = (0, 0) and the condition number is κ=10\kappa = 10.

Gradient descent. With step size η=0.1\eta = 0.1, one step gives

x1=x0ηf(x0)=[11]0.1[110]=[0.90].x_1 = x_0 - \eta\,\nabla f(x_0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - 0.1\begin{bmatrix} 1 \\ 10 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 0 \end{bmatrix}.

The well-conditioned coordinate x2x_2 snaps to its optimum, but x1x_1 barely moves; subsequent steps shrink x1x_1 by a factor of 0.90.9 each iteration, the linear O((11/κ)t)O((1 - 1/\kappa)^t) rate.

Newton's method. Because ff is exactly quadratic, one Newton step reaches the minimum:

x1=x0H1f(x0)=[11][1000.1][110]=[00]=x.x_1 = x_0 - H^{-1}\nabla f(x_0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix}\begin{bmatrix} 1 \\ 10 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = x^*.

Newton rescales each direction by its curvature, which is exactly what gradient descent fails to do on ill-conditioned problems.

Implicit Function Theorem and Implicit Differentiation

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be continuously differentiable, and suppose $F(x_0, y_0) = 0$ and $\frac{\partial F}{\partial y}(x_0, y_0)$ is invertible. Then there exists a neighborhood of $x_0$ and a unique differentiable function $y(x)$ such that $F(x, y(x)) = 0$, with:

yx=(Fy)1Fx\frac{\partial y}{\partial x} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}

**Implicit differentiation in ML.** This theorem underpins several important techniques:
  1. Differentiating through optimization: If y(x)=argminyg(x,y)y^*(x) = \arg\min_y g(x, y), the optimality condition is yg(x,y)=0\nabla_y g(x, y^*) = 0. The implicit function theorem gives y/x\partial y^* / \partial x without unrolling the optimization.
  2. Implicit MAML (Finn et al., 2017): Differentiates through the inner loop of meta-learning without storing the optimization trajectory.
  3. Differentiable optimization layers (Amos & Kolter, 2017): Backpropagation through convex optimization problems embedded as neural network layers.
  4. Equilibrium models (DEQ) (Bai et al., 2019): Differentiates through the fixed point z=f(z;θ)z^* = f(z^*; \theta) without storing the forward iterations.

Total Derivative and Chain Rule

If $f$ depends on $t$ both directly and through $x(t)$, the **total derivative** is:

dfdt=ft+i=1nfxidxidt=ft+xfdxdt\frac{df}{dt} = \frac{\partial f}{\partial t} + \sum_{i=1}^n \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} = \frac{\partial f}{\partial t} + \nabla_x f^\top \frac{dx}{dt}

The total derivative appears in: - **Learning rate schedules:** The loss changes due to both parameter updates and the changing learning rate - **Curriculum learning:** The data distribution changes over time, so the expected loss has both explicit and implicit dependence on time - **Neural ODEs** [@chen2018neural]: Model the hidden state as a continuous function $h(t)$ governed by $dh/dt = f(h(t), t; \theta)$, where the total derivative controls dynamics

Loss Landscapes

The loss function L(θ)\mathcal{L}(\theta) defines a surface in the PP-dimensional parameter space (PP = number of parameters). Understanding this surface guides algorithm design.

FeatureDefinitionHessian Condition (sufficient)Frequency in High-dd
Local minimumL(θ)L(θ)\mathcal{L}(\theta^*) \leq \mathcal{L}(\theta) in a neighborhoodH0H \succ 0 (all eigenvalues >0> 0)Rare: 2P\sim 2^{-P} of critical points
Local maximumL(θ)L(θ)\mathcal{L}(\theta^*) \geq \mathcal{L}(\theta) in a neighborhoodH0H \prec 0 (all eigenvalues <0< 0)Rare
Saddle pointNeither min nor maxHH indefinite (mixed eigenvalues)Overwhelmingly common
PlateauL0\nabla \mathcal{L} \approx 0 over a regionNear-zero gradient and curvatureCommon in deep networks
ValleyNarrow region with low lossHigh curvature in most directionsWhere SGD typically converges
In high dimensions ($P \gg 1$), the character of the loss landscape changes qualitatively:
  1. Saddle points dominate. Heuristically, if we model each Hessian eigenvalue at a random critical point as independently positive or negative with roughly equal probability (Dauphin et al., 2014), the probability that all PP eigenvalues are positive (a minimum) is approximately 2P2^{-P}, which is astronomically small for P=109P = 10^9. (This independence assumption is a simplification; in practice, eigenvalue signs correlate with loss level.)
  2. Local minima are near-global. Empirically, the loss values at different local minima of overparameterized networks are clustered near the global minimum, so there are no "bad" local minima that trap optimization (Choromanska et al., 2015).
  3. Mode connectivity. Good local minima found by different training runs are connected by low-loss paths through parameter space (Draxler et al., 2018). This suggests the loss landscape has a connected valley structure rather than isolated basins.

Integration in ML

While optimization (differentiation) dominates ML computationally, integration appears in several important places:
ApplicationIntegralApproximation
Bayesian inference$p(\theta\mathcal{D}) = p(\mathcal{D}
Expected lossEx,y[(fθ(x),y)]\mathbb{E}_{x,y}[\ell(f_\theta(x), y)]Monte Carlo (mini-batch)
Marginal likelihood$p(\mathcal{D}) = \int p(\mathcal{D}\theta)p(\theta)d\theta$
Normalizing flowsChange of variables: $p(z) = p(x)\det \partial f / \partial x
Diffusion models0Tf(xt,t)dt\int_0^T f(x_t, t)dt (SDE/ODE)Numerical integrators (Euler, Heun)

Notation Summary

SymbolMeaning
f\nabla fGradient of ff (column vector)
DufD_u fDirectional derivative in direction uu
H=2fH = \nabla^2 fHessian matrix
δ\deltaPerturbation vector
H0H \succ 0HH is positive definite
H0H \prec 0HH is negative definite
H0H \succeq 0HH is positive semi-definite
df/dtdf/dtTotal derivative
f/xi\partial f/\partial x_iPartial derivative

References