Matrix Calculus
Matrix calculus is the language of neural network gradient derivations. Every backpropagation formula, every optimizer update, and every gradient identity used in ML is an application of the rules developed here.
Gradient of a Scalar Function
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the column vector of partial derivatives, $\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^\top$. The gradient points in the direction of steepest ascent of $f$. Its magnitude $\lVert \nabla f(x) \rVert_2$ equals the maximum directional derivative: the rate of change in the steepest direction.
Common Gradient Identities
These identities are the building blocks of every gradient derivation in ML. Let $a, b, x \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$ (or appropriate dimensions):
| Function | Gradient | Proof Sketch |
|---|---|---|
| $f(x) = x^\top x = \lVert x \rVert_2^2$ | $\nabla f = 2x$ | Special case of next row with $A = I$ |
| $f(x) = x^\top A x$ ($A$ symmetric) | $\nabla f = 2Ax$ | Since $\nabla(x^\top A x) = (A + A^\top)x$ |
| $f(x) = \lVert x \rVert_2$ | $\nabla f = x / \lVert x \rVert_2$ | Chain rule on $\sqrt{x^\top x}$ |
| $f(x) = \lVert x \rVert_1 = \sum_i \lvert x_i \rvert$ | $\nabla f = \operatorname{sign}(x)$ | Elementwise; subgradient where $x_i = 0$ |
| $f(x) = \lVert Ax - b \rVert_2^2$ | $\nabla f = 2A^\top(Ax - b)$ | Chain rule with $u = Ax - b$ |
| $f(x) = \sum_i \log(1 + e^{x_i})$ (softplus) | $\nabla f = \sigma(x)$ | $\sigma$ is the elementwise sigmoid |
Setting the gradient of $\lVert Ax - b \rVert_2^2$ to zero gives $A^\top A x = A^\top b$. This is the normal equation. The solution $\hat{x}$ satisfies $A\hat{x} = \operatorname{proj}_{\operatorname{col}(A)}(b)$: $A\hat{x}$ is the orthogonal projection of $b$ onto the column space of $A$.
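A quick numerical sanity check of the normal equation and the projection property, using illustrative random data (the matrix sizes here are arbitrary assumptions, not from the text):

```python
import numpy as np

# Sketch: solve least squares via the normal equation A^T A x = A^T b,
# then verify the residual b - A x_hat is orthogonal to col(A) and that
# the result matches NumPy's least-squares solver.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal-equation solution
residual = b - A @ x_hat

# Projection property: residual is orthogonal to every column of A.
assert np.allclose(A.T @ residual, 0, atol=1e-10)
# Agrees with the library solver.
assert np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0])
```

In practice one solves the normal system (or uses `lstsq`, which is based on a more stable factorization) rather than forming $(A^\top A)^{-1}$ explicitly.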
Jacobian
For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix with entries $J_{ij} = \partial f_i / \partial x_j$. The Jacobian is the best linear approximation to $f$ near $x$: $f(x + v) \approx f(x) + Jv$.
| Operation | Jacobian | Shape |
|---|---|---|
| $f(x) = Ax$ (linear) | $A$ | $m \times n$ |
| $f(x) = \sigma(x)$ (elementwise) | $\operatorname{diag}(\sigma'(x))$ (diagonal) | $n \times n$ |
| $f(x) = \operatorname{softmax}(x)$ | $\operatorname{diag}(s) - ss^\top$ where $s = \operatorname{softmax}(x)$ | $n \times n$ |
| $f(x) = a \odot x$ (Hadamard) | $\operatorname{diag}(a)$ (w.r.t. $x$) | $n \times n$ (diagonal) |
| Layer norm | Complex; see dedicated derivation | $n \times n$ |
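The closed forms in the table can be checked numerically. A minimal sketch for the softmax row, comparing $\operatorname{diag}(s) - ss^\top$ against central finite differences (the test point is an arbitrary assumption):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
s = softmax(x)
J_closed = np.diag(s) - np.outer(s, s)   # closed-form Jacobian

# Central finite differences, one input coordinate at a time.
eps = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    e_j = np.zeros(3); e_j[j] = eps
    J_fd[:, j] = (softmax(x + e_j) - softmax(x - e_j)) / (2 * eps)

assert np.allclose(J_closed, J_fd, atol=1e-8)
```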
Hessian
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of second partial derivatives, $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. The Hessian is symmetric (by Schwarz's theorem, assuming continuous second derivatives) and encodes the local curvature of $f$.
The second-order Taylor expansion around a point $x$ is $f(x + v) \approx f(x) + \nabla f(x)^\top v + \frac{1}{2} v^\top H v$.
| Application | How the Hessian Is Used |
|---|---|
| Newton's method | Step: $x \leftarrow x - H^{-1}\nabla f$. Quadratic convergence but $O(n^3)$ per step |
| Critical point classification | $H$ PD $\Rightarrow$ minimum, ND $\Rightarrow$ maximum, indefinite $\Rightarrow$ saddle |
| Curvature analysis | Eigenvalues of $H$ reveal loss landscape geometry |
| Natural gradient | Uses Fisher information $F$ (expected Hessian of log-likelihood) |
| Hessian-free optimization | Uses $Hv$ products (no explicit $H$) for conjugate gradient |
| Influence functions | $H^{-1}$ weights the effect of removing a training point |
| Pruning (OBS/OBD) | Uses $H$ to estimate the cost of removing a weight |
| Laplace approximation | Approximates posterior $p(\theta \mid \mathcal{D})$ as a Gaussian with covariance $H^{-1}$ |
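To make the Newton step concrete: on a positive-definite quadratic the Hessian is constant, so a single Newton step lands exactly on the minimizer. A minimal sketch with assumed random data:

```python
import numpy as np

# Sketch: Newton's method x <- x - H^{-1} grad f applied to
# f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b and whose
# Hessian is A everywhere. One step from any start hits A^{-1} b.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)          # positive-definite Hessian
b = rng.standard_normal(4)

x = np.zeros(4)
grad = A @ x - b
x = x - np.linalg.solve(A, grad)     # Newton step (solve, don't invert)

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-10)
```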
Chain Rule for Matrices
For a composition $h = g \circ f$, the chain rule reads $J_h(x) = J_g(f(x)) \, J_f(x)$. This is the Jacobian of $g$ (evaluated at $f(x)$) multiplied by the Jacobian of $f$ (evaluated at $x$). For a scalar loss $L$ (as in backpropagation), the gradient is $\nabla_x L = J_f(x)^\top \nabla_{f(x)} L$.
For a linear layer $y = Wx$ with upstream gradient $\delta = \nabla_y L$: the gradient w.r.t. $W$ is always an outer product of the upstream gradient and the layer input, $\nabla_W L = \delta x^\top$, and the gradient w.r.t. the input is always a multiplication by $W^\top$, $\nabla_x L = W^\top \delta$. This pattern holds for every linear layer in every neural network.
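This outer-product pattern can be sketched and checked numerically. Here the loss is taken to be linear in $y$, $L = c^\top y$, so that $c$ plays the role of the upstream gradient $\delta$; the sizes and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
c = rng.standard_normal(3)           # upstream gradient delta = dL/dy

loss = lambda W_, x_: c @ (W_ @ x_)  # L = c^T (W x)

gW = np.outer(c, x)                  # grad_W L = delta x^T
gx = W.T @ c                         # grad_x L = W^T delta

# Finite-difference check of the weight gradient, entry by entry.
eps = 1e-6
gW_fd = np.zeros_like(W)
for i in range(3):
    for j in range(4):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        gW_fd[i, j] = (loss(Wp, x) - loss(Wm, x)) / (2 * eps)

assert np.allclose(gW, gW_fd, atol=1e-6)
```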
Matrix-Valued Derivatives
For a scalar function $f(X)$ of a matrix $X \in \mathbb{R}^{m \times n}$, the gradient $\nabla_X f$, with entries $(\nabla_X f)_{ij} = \partial f / \partial X_{ij}$, has the same shape as the input:
| Function | Gradient | Notes |
|---|---|---|
| $f(X) = \operatorname{tr}(AX)$ | $A^\top$ | Analogue of $\nabla_x (a^\top x) = a$ |
| $f(X) = \operatorname{tr}(A^\top X)$ | $A$ | Frobenius inner product $\langle A, X \rangle_F$ |
| $f(X) = \operatorname{tr}(X^{-1} A)$ | $-(X^{-1} A X^{-1})^\top$ | Appears in Gaussian log-likelihood |
| $f(X) = \log\det X$ | $X^{-\top}$ | Key for Gaussian log-likelihood |
| $f(X) = \det X$ | $\det(X)\, X^{-\top}$ | Jacobi's formula |
Using the differential identity $\mathrm{d}f = \operatorname{tr}\!\left((\nabla_X f)^\top \mathrm{d}X\right)$:
This trace-based approach systematically handles most gradient derivations encountered in ML.
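The log-determinant row is easy to verify numerically. A minimal sketch, entry by entry via central differences, on an assumed well-conditioned random matrix (`slogdet` is used instead of `log(det(...))` only for numerical safety; the gradient of $\log\lvert\det X\rvert$ is the same $X^{-\top}$):

```python
import numpy as np

def logdet(M):
    # log |det M| via the stable sign/log-magnitude decomposition.
    _, logabsdet = np.linalg.slogdet(M)
    return logabsdet

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # diagonally dominant

G_closed = np.linalg.inv(X).T                     # claimed gradient X^{-T}

eps = 1e-6
G_fd = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        G_fd[i, j] = (logdet(Xp) - logdet(Xm)) / (2 * eps)

assert np.allclose(G_closed, G_fd, atol=1e-5)
```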
Directional Derivative and Differential
The directional derivative of $f$ at $x$ in direction $v$ is $D_v f(x) = \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = \nabla f(x)^\top v$. This measures the rate of change of $f$ along direction $v$. The gradient is the direction that maximizes the directional derivative (subject to $\lVert v \rVert_2 = 1$).
The differential of $f$ is $\mathrm{d}f = \nabla f(x)^\top \mathrm{d}x$. This notation makes the chain rule automatic: if $f = g(u)$ with $u = h(x)$, then $\mathrm{d}f = J_g \, \mathrm{d}u$ and $\mathrm{d}u = J_h \, \mathrm{d}x$ (with appropriate products). Identifying the coefficient of $\mathrm{d}x$ in the final expression gives $\nabla_x f$.
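A minimal sketch of the identity $D_v f(x) = \nabla f(x)^\top v$, using the quadratic $f(x) = x^\top A x$ with its known gradient $(A + A^\top)x$; the matrix, point, and direction are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
x = rng.standard_normal(5)
v = rng.standard_normal(5); v /= np.linalg.norm(v)   # unit direction

f = lambda z: z @ A @ z
grad = (A + A.T) @ x                 # closed-form gradient of x^T A x

# Central difference along v approximates the directional derivative.
t = 1e-6
D_v = (f(x + t * v) - f(x - t * v)) / (2 * t)

assert np.allclose(D_v, grad @ v, atol=1e-6)
```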
Notation Summary
| Symbol | Meaning |
|---|---|
| $\nabla f$ | Gradient of scalar $f$ w.r.t. vector $x$ (column vector) |
| $J$ | Jacobian matrix ($m \times n$) |
| $H$ | Hessian matrix ($n \times n$, symmetric) |
| $\operatorname{tr}(A)$ | Trace of $A$ |
| $\det(A)$ | Determinant of $A$ |
| $v$ | Perturbation vector |
| $D_v f$ | Directional derivative of $f$ in direction $v$ |
| $\mathrm{d}f$ | Differential of $f$ |
| $\odot$ | Elementwise (Hadamard) product |
| $\delta$ | Upstream gradient ($\nabla_y L$) |