Multivariable Calculus

Multivariable calculus provides the mathematical machinery for optimizing functions of many variables, the central computational task in machine learning. Every gradient computation, every optimizer step, and every loss landscape analysis uses the tools developed here.

Partial Derivatives

For $f: \mathbb{R}^n \to \mathbb{R}$, the **partial derivative** with respect to $x_i$ measures the rate of change of $f$ while holding all other variables fixed:

$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_n)}{h}$

The gradient collects all $n$ partial derivatives into a single vector:

$\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{bmatrix} \in \mathbb{R}^n$

The gradient $\nabla f(x)$ points in the direction of steepest ascent of $f$ at $x$. More precisely, provided $\nabla f(x) \neq 0$, for any unit vector $u$ the rate of change $\nabla f^\top u$ is maximized when $u = \nabla f / \|\nabla f\|$. Gradient descent moves in the opposite direction: $x_{t+1} = x_t - \eta \nabla f(x_t)$.
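The definitions above can be checked numerically. The following is a minimal sketch, not a production implementation: the test function `f`, the helper names, and the step `h` are illustrative assumptions, and a central difference is used in place of the one-sided limit because it is more accurate for a finite `h`.

```python
def f(x):
    """Illustrative function f(x1, x2) = x1^2 + 3*x1*x2 (an assumption for this sketch)."""
    x1, x2 = x
    return x1 ** 2 + 3.0 * x1 * x2


def numerical_gradient(func, x, h=1e-5):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h    # perturb only coordinate i, hold the others fixed
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2.0 * h))
    return grad


def gradient_descent_step(x, grad, eta):
    """One update x_{t+1} = x_t - eta * grad f(x_t)."""
    return [xi - eta * gi for xi, gi in zip(x, grad)]


x = [1.0, 2.0]
grad = numerical_gradient(f, x)   # analytic gradient: (2*x1 + 3*x2, 3*x1) = (8, 3)
x_next = gradient_descent_step(x, grad, eta=0.1)
print(grad, x_next, f(x), f(x_next))
```

In this example one descent step lowers the function value, which is what moving against the gradient is meant to achieve for a small enough $\eta$.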

Directional Derivative

The rate of change of $f$ at point $x$ in the direction of a unit vector $u$ is:

$D_u f(x) = \nabla f(x)^\top u = \|\nabla f(x)\| \cos \theta$

where $\theta$ is the angle between $\nabla f$ and $u$. This is maximized ($= \|\nabla f\|$) when $u$ is parallel to $\nabla f$ and minimized ($= -\|\nabla f\|$) when $u$ is anti-parallel.

**Why gradient descent is steepest descent.** We want to find the unit direction $u^*$ that decreases $f$ the most:

$u^* = \arg\min_{\|u\| = 1} D_u f = \arg\min_{\|u\| = 1} \nabla f^\top u = -\frac{\nabla f}{\|\nabla f\|}$

The step $x_{t+1} = x_t - \eta \nabla f$ moves in this direction over a distance $\eta \|\nabla f\|$. Note that this is only the best direction for infinitesimally small steps: for finite $\eta$, curvature matters.
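A quick numerical sanity check of the argmin above is to sample many unit vectors and confirm that none of them beats $-\nabla f / \|\nabla f\|$. This is only a sketch: the gradient value, the sample count, and the helper names are hypothetical choices made for illustration.

```python
import math
import random


def directional_derivative(grad, u):
    """D_u f = grad^T u for a unit vector u."""
    return sum(g * ui for g, ui in zip(grad, u))


def random_unit_vector(n, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


rng = random.Random(0)
grad = [1.0, 10.0, -2.0]   # hypothetical gradient at some point x
grad_norm = math.sqrt(sum(g * g for g in grad))

u_star = [-g / grad_norm for g in grad]           # claimed steepest-descent direction
d_star = directional_derivative(grad, u_star)     # should equal -||grad||

best_sampled = min(
    directional_derivative(grad, random_unit_vector(len(grad), rng))
    for _ in range(10_000)
)
print(d_star, -grad_norm, best_sampled)
```

Random sampling cannot prove optimality, but it does show that no sampled direction decreases $f$ faster than the negative normalized gradient, as the Cauchy-Schwarz inequality guarantees.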

Taylor Expansion

The Taylor expansion of $f: \mathbb{R}^n \to \mathbb{R}$ around a point $x_0$ is:

First order (linear approximation): $f(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta$

Second order (quadratic approximation):

$f(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta + \frac{1}{2} \delta^\top H(x_0) \, \delta$

where $H(x_0) = \nabla^2 f(x_0)$ is the Hessian matrix with $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
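To see how the two approximations behave, the sketch below compares their errors as the perturbation $\delta$ shrinks. The function $f(x_1, x_2) = e^{x_1} \sin x_2$, the expansion point, and the direction are assumptions chosen only because the derivatives are easy to write by hand.

```python
import math


def f(x):
    """Illustrative smooth function f(x1, x2) = exp(x1) * sin(x2)."""
    return math.exp(x[0]) * math.sin(x[1])


def grad_f(x):
    e = math.exp(x[0])
    return [e * math.sin(x[1]), e * math.cos(x[1])]


def hess_f(x):
    e = math.exp(x[0])
    s, c = math.sin(x[1]), math.cos(x[1])
    return [[e * s, e * c], [e * c, -e * s]]


def taylor1(x0, d):
    """First-order model: f(x0) + grad^T d."""
    return f(x0) + sum(g * di for g, di in zip(grad_f(x0), d))


def taylor2(x0, d):
    """Second-order model: first-order model + 0.5 * d^T H d."""
    H = hess_f(x0)
    quad = sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))
    return taylor1(x0, d) + 0.5 * quad


def taylor_errors(scale, x0=(0.0, 0.5), direction=(1.0, -1.0)):
    """Absolute errors of both models for the perturbation d = scale * direction."""
    d = [scale * di for di in direction]
    exact = f([x0[0] + d[0], x0[1] + d[1]])
    return abs(exact - taylor1(x0, d)), abs(exact - taylor2(x0, d))


for scale in (0.1, 0.01):
    print(scale, taylor_errors(scale))
```

Shrinking $\delta$ by a factor of 10 should reduce the first-order error by roughly 100 and the second-order error by roughly 1000, reflecting the $O(\|\delta\|^2)$ and $O(\|\delta\|^3)$ remainders.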

**First-order vs. second-order methods:**

Method	Uses	Approximation	Step	Cost per step	Convergence
Gradient descent	$\nabla f$ only	Linear	$-\eta \nabla f$	$O(n)$	Linear (error shrinks as $O((1 - 1/\kappa)^t)$ after $t$ steps)
Newton's method	$\nabla f$ and $H$	Quadratic	$-H^{-1} \nabla f$	$O(n^3)$	Quadratic (near optimum)
L-BFGS	$\nabla f$ + approximate $H^{-1}$	Quasi-quadratic	$-\tilde{H}^{-1} \nabla f$	$O(mn)$, $m \ll n$	Linear in theory, often much faster in practice (full BFGS is superlinear)

For a strictly convex quadratic objective, Newton's method finds the exact minimum in one step (the quadratic approximation is exact). For a general smooth $f$, it instead converges quadratically near a minimum ($\|x_{t+1} - x^*\| \leq c\|x_t - x^*\|^2$). Either way it is impractical for neural networks ($n > 10^6$) because forming and inverting the Hessian costs $O(n^2)$ memory and $O(n^3)$ computation.

**Gradient descent vs. Newton on a quadratic.** Take the anisotropic quadratic

$f(x_1, x_2) = \tfrac{1}{2}\left(x_1^2 + 10\,x_2^2\right), \qquad x_0 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.$

The gradient and Hessian are

$\nabla f(x) = \begin{bmatrix} x_1 \\ 10\,x_2 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 \\ 0 & 10 \end{bmatrix},$

so $\nabla f(x_0) = (1, 10)^\top$. The minimum is at $x^* = (0, 0)$ and the condition number is $\kappa = 10$.

Gradient descent. With step size $\eta = 0.1$, one step from $x_0$ gives the new iterate, written $x^{+}$ to avoid a clash with the coordinate $x_1$:

$x^{+} = x_0 - \eta\,\nabla f(x_0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - 0.1\begin{bmatrix} 1 \\ 10 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 0 \end{bmatrix}.$

The high-curvature coordinate $x_2$ snaps to its optimum because $\eta = 1/10$ is exactly the inverse of its curvature, but the low-curvature coordinate $x_1$ barely moves; subsequent steps shrink $x_1$ by a factor of $1 - \eta = 0.9 = 1 - 1/\kappa$ each iteration, which is the linear $O((1 - 1/\kappa)^t)$ rate.

Newton's method. Because $f$ is exactly quadratic, one Newton step reaches the minimum:

$x^{+} = x_0 - H^{-1}\nabla f(x_0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix}\begin{bmatrix} 1 \\ 10 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = x^*.$

Newton rescales each direction by its curvature, which is exactly what gradient descent fails to do on ill-conditioned problems.
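The worked example can be reproduced in a few lines. This is a minimal sketch for this specific quadratic only: the Hessian inverse is hard-coded as a diagonal, and the function names are illustrative.

```python
def grad_f(x):
    """Gradient of f(x1, x2) = 0.5 * (x1^2 + 10 * x2^2)."""
    return [x[0], 10.0 * x[1]]


# The Hessian is diag(1, 10), so its inverse is diag(1, 0.1).
H_INV_DIAG = [1.0, 0.1]


def gd_step(x, eta=0.1):
    """One gradient descent step with a single scalar step size."""
    g = grad_f(x)
    return [x[0] - eta * g[0], x[1] - eta * g[1]]


def newton_step(x):
    """One Newton step: each coordinate is rescaled by its inverse curvature."""
    g = grad_f(x)
    return [x[0] - H_INV_DIAG[0] * g[0], x[1] - H_INV_DIAG[1] * g[1]]


x0 = [1.0, 1.0]
x_gd = gd_step(x0)           # approximately (0.9, 0.0)
x_newton = newton_step(x0)   # approximately (0.0, 0.0)

# Repeated gradient descent: the first coordinate decays like 0.9^t.
x = list(x0)
for _ in range(10):
    x = gd_step(x)
print(x_gd, x_newton, x)
```

After ten gradient descent steps the first coordinate is still about $0.9^{10} \approx 0.35$, while a single Newton step has already reached the minimum.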

Implicit Function Theorem and Implicit Differentiation

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be continuously differentiable, and suppose $F(x_0, y_0) = 0$ and $\frac{\partial F}{\partial y}(x_0, y_0)$ is invertible. Then there exists a neighborhood of $x_0$ and a unique differentiable function $y(x)$ on it such that $y(x_0) = y_0$ and $F(x, y(x)) = 0$, with:

$\frac{\partial y}{\partial x} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$
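
As a quick check of this formula, consider the unit circle, a standard textbook case (the specific $F$ here is an illustrative choice, with $n = m = 1$):

$F(x, y) = x^2 + y^2 - 1, \qquad \frac{\partial F}{\partial x} = 2x, \qquad \frac{\partial F}{\partial y} = 2y$

$\frac{dy}{dx} = -\left(2y\right)^{-1} \left(2x\right) = -\frac{x}{y} \qquad (y \neq 0)$

On the upper branch $y(x) = \sqrt{1 - x^2}$, direct differentiation gives $-x / \sqrt{1 - x^2} = -x/y$, which agrees. At $y = 0$ the factor $\partial F / \partial y$ vanishes, the invertibility condition fails, and the circle cannot be written locally as a function of $x$.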

**Implicit differentiation in ML.** This theorem underpins several important techniques:

- **Differentiating through optimization:** If $y^*(x) = \arg\min_y g(x, y)$, the optimality condition is $\nabla_y g(x, y^*) = 0$. Applying the implicit function theorem with $F = \nabla_y g$ gives $\partial y^* / \partial x$ without unrolling the optimization.
- **Implicit MAML** (Rajeswaran et al., 2019): Differentiates through the inner loop of meta-learning without storing the optimization trajectory.
- **Differentiable optimization layers** (Amos & Kolter, 2017): Backpropagation through convex optimization problems embedded as neural network layers.
- **Equilibrium models (DEQ)** (Bai et al., 2019): Differentiates through the fixed point $z^* = f(z^*; \theta)$ without storing the forward iterations.
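
Differentiating through optimization can be sketched on a scalar toy problem. Everything specific here is an assumption made for illustration: the inner objective $g(x, y) = (y - \sin x)^2 + 0.1\,y^4$, the Newton inner solver, and the function names. The implicit gradient uses only the solution $y^*$, and it is compared against a finite difference of the full solver.

```python
import math


def F(x, y):
    """Optimality condition dg/dy for g(x, y) = (y - sin x)^2 + 0.1 * y^4."""
    return 2.0 * (y - math.sin(x)) + 0.4 * y ** 3


def dF_dy(x, y):
    return 2.0 + 1.2 * y ** 2   # always positive, so the inverse exists


def dF_dx(x, y):
    return -2.0 * math.cos(x)


def solve_inner(x, iters=50):
    """Find y*(x) = argmin_y g(x, y) by Newton's method on F(x, y) = 0."""
    y = 0.0
    for _ in range(iters):
        y -= F(x, y) / dF_dy(x, y)
    return y


def implicit_grad(x):
    """dy*/dx = -(dF/dy)^{-1} dF/dx, evaluated at the solution only."""
    y_star = solve_inner(x)
    return -dF_dx(x, y_star) / dF_dy(x, y_star)


x = 0.7
h = 1e-6
finite_diff = (solve_inner(x + h) - solve_inner(x - h)) / (2.0 * h)
implicit = implicit_grad(x)
print(implicit, finite_diff)
```

The implicit gradient never looks at the solver's intermediate iterates, which is the property that the memory-saving methods listed above rely on.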

Total Derivative and Chain Rule

If $f$ depends on $t$ both directly and through $x(t)$, the **total derivative** is:

$\frac{df}{dt} = \frac{\partial f}{\partial t} + \sum_{i=1}^n \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} = \frac{\partial f}{\partial t} + \nabla_x f^\top \frac{dx}{dt}$

The total derivative appears in:

- **Learning rate schedules:** The loss changes due to both parameter updates and the changing learning rate.
- **Curriculum learning:** The data distribution changes over time, so the expected loss has both explicit and implicit dependence on time.
- **Neural ODEs** (Chen et al., 2018): Model the hidden state as a continuous function $h(t)$ governed by $dh/dt = f(h(t), t; \theta)$, where the total derivative governs the dynamics.
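
The total derivative formula can be verified on a small example. The choices $f(t, x) = t\,x^2$ and $x(t) = \sin t$ are assumptions made for this sketch; the point is that the partial derivative alone misses the dependence on $t$ through $x(t)$.

```python
import math


def x_of_t(t):
    return math.sin(t)


def dx_dt(t):
    return math.cos(t)


def f(t, x):
    """Illustrative function with direct and indirect dependence on t."""
    return t * x ** 2


def partial_f_t(t, x):
    return x ** 2          # derivative with x held fixed


def partial_f_x(t, x):
    return 2.0 * t * x


def total_derivative(t):
    """df/dt = partial f/partial t + (partial f/partial x) * dx/dt."""
    x = x_of_t(t)
    return partial_f_t(t, x) + partial_f_x(t, x) * dx_dt(t)


def finite_difference(t, h=1e-6):
    """Central difference of the composite map t -> f(t, x(t))."""
    return (f(t + h, x_of_t(t + h)) - f(t - h, x_of_t(t - h))) / (2.0 * h)


t = 1.3
print(total_derivative(t), finite_difference(t), partial_f_t(t, x_of_t(t)))
```

The first two printed values should agree closely, while the third (the partial derivative alone) differs by the chain-rule term.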

Loss Landscapes

The loss function $\mathcal{L}(\theta)$ defines a surface in the $P$-dimensional parameter space ($P$ = number of parameters). Understanding this surface guides algorithm design.

Feature	Definition	Hessian condition (sufficient)	Frequency in high dimensions
Local minimum	$\mathcal{L}(\theta^*) \leq \mathcal{L}(\theta)$ in a neighborhood	$H \succ 0$ (all eigenvalues $> 0$)	Rare: $\sim 2^{-P}$ of critical points
Local maximum	$\mathcal{L}(\theta^*) \geq \mathcal{L}(\theta)$ in a neighborhood	$H \prec 0$ (all eigenvalues $< 0$)	Rare
Saddle point	Neither min nor max	$H$ indefinite (mixed eigenvalues)	Overwhelmingly common
Plateau	$\nabla \mathcal{L} \approx 0$ over a region	Near-zero gradient and curvature	Common in deep networks
Valley	Narrow region with low loss	High curvature across the valley, low curvature along its floor	Where SGD typically converges

In high dimensions ($P \gg 1$), the character of the loss landscape changes qualitatively:

- **Saddle points dominate.** Heuristically, if we model each Hessian eigenvalue at a random critical point as independently positive or negative with equal probability (Dauphin et al., 2014), the probability that all $P$ eigenvalues are positive (a minimum) is $2^{-P}$, which is astronomically small for $P = 10^9$. (This independence assumption is a simplification; in practice, eigenvalue signs correlate with loss level.)
- **Local minima are near-global.** Empirically, the loss values at different local minima of overparameterized networks are clustered near the global minimum, so "bad" local minima that trap optimization appear to be rare (Choromanska et al., 2015).
- **Mode connectivity.** Good local minima found by different training runs are connected by low-loss paths through parameter space (Draxler et al., 2018). This suggests the loss landscape has a connected valley structure rather than isolated basins.
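
The sign-counting heuristic behind the $2^{-P}$ estimate can be simulated directly. This sketch implements only the simplified independence model (each eigenvalue sign is a fair coin flip); it says nothing about real network Hessians, and the trial count and function name are arbitrary illustrative choices.

```python
import random


def fraction_all_positive(P, trials, rng):
    """Estimate the probability that P independent fair signs are all positive."""
    hits = 0
    for _ in range(trials):
        # A "minimum" in this toy model: every one of the P eigenvalue signs is positive.
        if all(rng.random() < 0.5 for _ in range(P)):
            hits += 1
    return hits / trials


rng = random.Random(0)
estimates = {P: fraction_all_positive(P, 50_000, rng) for P in (1, 2, 4, 8)}
for P, estimate in estimates.items():
    print(P, estimate, 2.0 ** -P)
```

Even at $P = 8$ fewer than one critical point in 200 is a minimum under this model, which gives a sense of how quickly the fraction collapses as $P$ grows.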

Integration in ML

While optimization (differentiation) dominates ML computationally, integration appears in several important places:

Application	Integral	Approximation
Bayesian inference	$p(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)\,p(\theta) / p(\mathcal{D})$	
Expected loss	$\mathbb{E}_{x,y}[\ell(f_\theta(x), y)]$	Monte Carlo (mini-batch)
Marginal likelihood	$p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$	
Normalizing flows	Change of variables with $z = f(x)$: $p(z) = p(x)\,\lvert \det \partial f / \partial x \rvert^{-1}$	
Diffusion models	$\int_0^T f(x_t, t)dt$ (SDE/ODE)	Numerical integrators (Euler, Heun)
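
The two numerical integrators named in the last row can be compared on a test problem with a known solution. This is a sketch under stated assumptions: the ODE $dx/dt = -x$ with $x(0) = 1$ stands in for a real sampling ODE, and the function names and step count are illustrative.

```python
import math


def euler(f, x0, t0, t1, n):
    """Explicit Euler: one slope evaluation per step, first-order accurate."""
    h = (t1 - t0) / n
    x, t = x0, t0
    for _ in range(n):
        x += h * f(x, t)
        t += h
    return x


def heun(f, x0, t0, t1, n):
    """Heun's method: average the slopes at both ends of the step, second-order accurate."""
    h = (t1 - t0) / n
    x, t = x0, t0
    for _ in range(n):
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)   # slope at the Euler-predicted endpoint
        x += 0.5 * h * (k1 + k2)
        t += h
    return x


def decay(x, t):
    """Test ODE dx/dt = -x, whose exact solution is x(t) = exp(-t)."""
    return -x


exact = math.exp(-1.0)
err_euler = abs(euler(decay, 1.0, 0.0, 1.0, 10) - exact)
err_heun = abs(heun(decay, 1.0, 0.0, 1.0, 10) - exact)
print(err_euler, err_heun)
```

With the same ten steps, Heun's method costs twice as many function evaluations but yields a much smaller error, which is the usual trade-off when choosing an integrator.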

Notation Summary

Symbol	Meaning
$\nabla f$	Gradient of $f$ (column vector)
$D_u f$	Directional derivative in direction $u$
$H = \nabla^2 f$	Hessian matrix
$\delta$	Perturbation vector
$H \succ 0$	$H$ is positive definite
$H \prec 0$	$H$ is negative definite
$H \succeq 0$	$H$ is positive semi-definite
$df/dt$	Total derivative
$\partial f/\partial x_i$	Partial derivative
