
Matrix Calculus

Matrix calculus is the language of neural network gradient derivations. Every backpropagation formula, every optimizer update, and every gradient identity used in ML is an application of the rules developed here.

Gradient of a Scalar Function

For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$, the **gradient** is the column vector of partial derivatives:

$$\nabla_x f = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{bmatrix} \in \mathbb{R}^n$$

The gradient points in the direction of steepest ascent of $f$. Its magnitude $\|\nabla f\|$ equals the maximum directional derivative: the rate of change in the steepest direction.

**Layout convention.** There are two conventions for matrix calculus: the **numerator layout** (the Jacobian of $f: \mathbb{R}^n \to \mathbb{R}^m$ has shape $m \times n$) and the **denominator layout** (shape $n \times m$). We follow the mix that is standard in ML: Jacobians in **numerator layout** (the "Jacobian formulation"), with the gradient of a scalar function written as a column vector, which matches PyTorch's `.grad` attribute (same shape as the tensor it belongs to).

Common Gradient Identities

These identities are the building blocks of every gradient derivation in ML. Let $a, x \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$ (or appropriate dimensions):

| Function $f(x)$ | Gradient $\nabla_x f$ | Proof Sketch |
| --- | --- | --- |
| $a^\top x$ | $a$ | $\frac{\partial}{\partial x_i} \sum_j a_j x_j = a_i$ |
| $x^\top x = \Vert x \Vert_2^2$ | $2x$ | Special case of the next row with $A = I$ |
| $x^\top A x$ | $(A + A^\top) x$ | $\frac{\partial}{\partial x_i} \sum_{jk} x_j A_{jk} x_k = \sum_k A_{ik}x_k + \sum_j A_{ji}x_j$ |
| $x^\top A x$ ($A$ symmetric) | $2Ax$ | Since $A + A^\top = 2A$ |
| $\Vert Ax - b \Vert_2^2$ | $2A^\top(Ax - b)$ | Chain rule on $(Ax-b)^\top(Ax-b)$ |
| $\Vert x \Vert_1 = \sum_i \lvert x_i \rvert$ | $\text{sign}(x)$ | Defined where all $x_i \neq 0$ |
| $\sigma(w^\top x)$ | $\sigma(w^\top x)(1 - \sigma(w^\top x)) \cdot w$ | Chain rule with $\sigma' = \sigma(1-\sigma)$ |
| $\log(1 + e^{w^\top x})$ (softplus) | $\sigma(w^\top x) \cdot w$ | $\text{softplus}' = \sigma$ |
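A quick way to sanity-check identities like these is a central finite-difference gradient. The sketch below, which assumes NumPy (variable names are illustrative), verifies two rows of the table numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

def numerical_grad(f, x, eps=1e-6):
    """Central differences: g_i = (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# x^T A x  ->  (A + A^T) x
g_quad = numerical_grad(lambda v: v @ A @ v, x)
assert np.allclose(g_quad, (A + A.T) @ x, atol=1e-5)

# ||Ax - b||^2  ->  2 A^T (Ax - b)
g_lsq = numerical_grad(lambda v: np.sum((A @ v - b) ** 2), x)
assert np.allclose(g_lsq, 2 * A.T @ (A @ x - b), atol=1e-5)
```

The same pattern works for any identity in the table: compare the closed-form gradient against `numerical_grad` at a few random points.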
**Deriving the normal equation.** For linear regression, minimize $\mathcal{L}(\theta) = \|X\theta - y\|^2 = (X\theta - y)^\top(X\theta - y)$:

$$\nabla_\theta \mathcal{L} = 2X^\top(X\theta - y) = 0$$

$$X^\top X \theta = X^\top y$$

$$\theta^* = (X^\top X)^{-1} X^\top y \quad \text{(assuming } X^\top X \text{ is invertible)}$$

This is the normal equation. The solution $\hat{y} = X\theta^*$ is the orthogonal projection of $y$ onto $\text{col}(X)$.
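A minimal sketch of the normal equation in NumPy (synthetic data, illustrative names), checked against NumPy's least-squares solver. In practice one solves the linear system rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(100)

# Normal equation: solve X^T X theta = X^T y (avoid the explicit inverse)
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: lstsq uses an SVD-based solver, which is better conditioned
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(theta_normal, theta_lstsq, atol=1e-8)
assert np.allclose(theta_normal, theta_true, atol=0.05)
```

Note the design choice: `np.linalg.solve` on $X^\top X$ squares the condition number of $X$, so SVD/QR-based solvers like `lstsq` are preferred for ill-conditioned problems.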

Jacobian

For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the **Jacobian** is the $m \times n$ matrix of all first partial derivatives:

$$J = \frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The Jacobian is the best linear approximation to $f$ near $x$:

$$f(x + \delta) = f(x) + J \delta + O(\|\delta\|^2)$$

| Operation $f(x)$ | Jacobian $\partial f / \partial x$ | Shape |
| --- | --- | --- |
| $Ax$ (linear) | $A$ | $m \times n$ |
| $\text{ReLU}(x)$ (elementwise) | $\text{diag}(\mathbf{1}[x > 0])$ | $n \times n$ (diagonal) |
| $\text{softmax}(x)$ | $\text{diag}(s) - ss^\top$ where $s = \text{softmax}(x)$ | $n \times n$ |
| $x \odot y$ (Hadamard) | $\text{diag}(y)$ (w.r.t. $x$) | $n \times n$ (diagonal) |
| $\Vert x \Vert_2$ | $x^\top / \Vert x \Vert_2$ | $1 \times n$ |
| Layer norm | Complex; see dedicated derivation | $n \times n$ |
**The softmax Jacobian** $J_{ij} = s_i(\delta_{ij} - s_j)$ is dense and $n \times n$. For a vocabulary of $V = 100{,}000$ tokens, materializing it would require $V^2 = 10^{10}$ entries. This is why backpropagation uses VJPs (vector-Jacobian products $v^\top J$) rather than full Jacobians -- the VJP through softmax costs $O(V)$, not $O(V^2)$.
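The $O(V)$ claim is easy to verify at small scale. A sketch in NumPy (small $n$ so the full Jacobian can still be materialized for comparison): the VJP $v^\top J = s \odot (v - v^\top s)$ needs only elementwise operations and one dot product.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

rng = np.random.default_rng(2)
n = 7
x = rng.standard_normal(n)
v = rng.standard_normal(n)  # the "upstream" vector in the VJP
s = softmax(x)

# O(n^2): materialize J = diag(s) - s s^T, then compute v^T J
J = np.diag(s) - np.outer(s, s)
vjp_full = v @ J

# O(n): v^T J = s * (v - v.s), never forming J
vjp_fast = s * (v - v @ s)

assert np.allclose(vjp_full, vjp_fast)
```

Expanding $v^\top(\text{diag}(s) - ss^\top) = (v \odot s) - (v^\top s)\,s = s \odot (v - (v^\top s)\mathbf{1})$ gives the fast form directly.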

Hessian

For $f: \mathbb{R}^n \to \mathbb{R}$, the **Hessian** is the $n \times n$ matrix of second partial derivatives:

$$H = \nabla^2 f, \qquad H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

The Hessian is symmetric (by Schwarz's theorem, assuming continuous second derivatives) and encodes the local curvature of $f$.

The second-order Taylor expansion around a point $x_0$:

$$f(x_0 + \delta) \approx f(x_0) + \nabla f(x_0)^\top \delta + \frac{1}{2} \delta^\top H(x_0) \, \delta$$

| Application | How the Hessian Is Used |
| --- | --- |
| Newton's method | Step: $\delta = -H^{-1} \nabla f$. Quadratic convergence but $O(n^3)$ per step |
| Critical point classification | PD (positive definite) $\to$ minimum, ND (negative definite) $\to$ maximum, indefinite $\to$ saddle |
| Curvature analysis | Eigenvalues of $H$ reveal loss landscape geometry |
| Natural gradient | Uses Fisher information (expected Hessian of the negative log-likelihood) |
| Hessian-free optimization | Uses $Hv$ products (no explicit $H$) for conjugate gradient |
| Influence functions | $H^{-1}$ weights the effect of removing a training point |
| Pruning (OBS/OBD) | Uses $H^{-1}$ to estimate the cost of removing a weight |
| Laplace approximation | Approximates the posterior $p(\theta \mid \mathcal{D})$ as a Gaussian with covariance $H^{-1}$ |
**Newton's method** sets the Taylor approximation's gradient to zero: $\nabla f + H\delta = 0 \implies \delta^* = -H^{-1} \nabla f$. This converges quadratically near a minimum ($\|x_{t+1} - x^*\| \leq c\|x_t - x^*\|^2$) but costs $O(n^2)$ memory (for storing $H$) and $O(n^3)$ computation (for solving $H\delta = -\nabla f$). For a neural network with $n = 10^9$ parameters, $H$ would require $10^{18}$ bytes -- hence the use of first-order methods (SGD, Adam) and Hessian-free methods (which compute $Hv$ products in $O(n)$ time without forming $H$).
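At small scale, Newton's method is a few lines. A minimal sketch in NumPy on a strictly convex toy objective $f(x) = \frac{1}{2}x^\top Q x + \frac{1}{4}\sum_i x_i^4 - b^\top x$ (chosen so the Hessian is positive definite everywhere; the problem and names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)       # symmetric positive definite
b = rng.standard_normal(n)

def grad(x):
    return Q @ x + x ** 3 - b     # gradient of the toy objective

def hess(x):
    return Q + np.diag(3 * x ** 2)  # PD everywhere, so Newton steps are well defined

x = np.ones(n)
for _ in range(50):
    # delta = -H^{-1} grad f, computed by solving H delta = grad f (never invert H)
    x = x - np.linalg.solve(hess(x), grad(x))

assert np.linalg.norm(grad(x)) < 1e-8
```

Solving $H\delta = \nabla f$ with `np.linalg.solve` instead of forming $H^{-1}$ is the same $O(n^3)$ cost discussed above but more numerically stable.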

Chain Rule for Matrices

For composed functions $f \circ g$ where $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^p$:

$$\frac{\partial (f \circ g)}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x} \in \mathbb{R}^{p \times n}$$

This is the Jacobian of $f$ (evaluated at $g(x)$) multiplied by the Jacobian of $g$ (evaluated at $x$). For scalar $f$ (as in backpropagation), the gradient is:

$$\nabla_x (f \circ g) = \left(\frac{\partial g}{\partial x}\right)^\top \nabla_g f$$

**Gradient through a neural network layer.** For $h = \sigma(Wx + b)$ and scalar loss $\mathcal{L}$, let $z = Wx + b$ and $\bar{h} = \partial \mathcal{L}/\partial h$ (the upstream gradient):

$$\frac{\partial \mathcal{L}}{\partial z} = \bar{h} \odot \sigma'(z) \quad \text{(elementwise, since } \sigma \text{ is applied elementwise)}$$

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial z} \, x^\top \quad \text{(outer product)}$$

$$\frac{\partial \mathcal{L}}{\partial x} = W^\top \frac{\partial \mathcal{L}}{\partial z} \quad \text{(propagate gradient to the input)}$$

$$\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z} \quad \text{(bias gradient equals the upstream gradient)}$$

The gradient w.r.t. $W$ is always an outer product of the upstream gradient and the layer input. The gradient w.r.t. the input is always a multiplication by $W^\top$. This pattern holds for every linear layer in every neural network.
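The four formulas above can be implemented and checked directly. A sketch in NumPy for $h = \sigma(Wx + b)$ with the scalar loss $\mathcal{L} = \bar{h}^\top h$ (chosen so that $\partial\mathcal{L}/\partial h = \bar{h}$ exactly; all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
m, n = 3, 5
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
h_bar = rng.standard_normal(m)  # upstream gradient dL/dh

def loss(W, b, x):
    return h_bar @ sigmoid(W @ x + b)

# Backward pass, following the formulas above
z = W @ x + b
s = sigmoid(z)
dz = h_bar * s * (1 - s)        # dL/dz = h_bar ⊙ σ'(z)
dW = np.outer(dz, x)            # dL/dW = dz x^T   (outer product)
dx = W.T @ dz                   # dL/dx = W^T dz
db = dz                         # dL/db = dz

# Finite-difference check on one entry of dW
eps = 1e-6
i, j = 1, 2
Wp = W.copy(); Wp[i, j] += eps
Wm = W.copy(); Wm[i, j] -= eps
fd = (loss(Wp, b, x) - loss(Wm, b, x)) / (2 * eps)
assert np.isclose(dW[i, j], fd, atol=1e-5)
```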

Matrix-Valued Derivatives

For a scalar function of a matrix $f: \mathbb{R}^{m \times n} \to \mathbb{R}$, the gradient has the same shape as the input:

$$\left(\frac{\partial f}{\partial A}\right)_{ij} = \frac{\partial f}{\partial A_{ij}}, \qquad \frac{\partial f}{\partial A} \in \mathbb{R}^{m \times n}$$

| Function $f(A)$ | Gradient $\frac{\partial f}{\partial A}$ | Notes |
| --- | --- | --- |
| $\text{tr}(A)$ | $I$ | $\text{tr}(A) = \sum_i A_{ii}$ |
| $\text{tr}(AB)$ | $B^\top$ | Analogue of $d(ax)/dx = a$ |
| $\text{tr}(A^\top B)$ | $B$ | Frobenius inner product |
| $\text{tr}(ABA^\top)$ | $A(B + B^\top)$ | Appears in the Gaussian log-likelihood |
| $\text{tr}(A^{-1}B)$ | $-A^{-\top}B^\top A^{-\top}$ | From $d(A^{-1}) = -A^{-1}\,dA\,A^{-1}$ |
| $\log \det(A)$ | $A^{-\top}$ | Key for the Gaussian log-likelihood |
| $\det(A)$ | $\det(A) \cdot A^{-\top}$ | Jacobi's formula |
| $a^\top A^{-1}b$ | $-A^{-\top}ab^\top A^{-\top}$ | Also from $d(A^{-1}) = -A^{-1}\,dA\,A^{-1}$ |
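These identities can also be checked by elementwise finite differences. A sketch for the $\log\det$ row, assuming NumPy (an SPD matrix guarantees $\det A > 0$; `slogdet` is used for numerical robustness):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)          # symmetric positive definite, so det(A) > 0

analytic = np.linalg.inv(A).T    # claimed: d log det(A) / dA = A^{-T}

eps = 1e-6
numeric = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        # slogdet returns (sign, log|det|); central difference on log|det|
        numeric[i, j] = (np.linalg.slogdet(Ap)[1] - np.linalg.slogdet(Am)[1]) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```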
**The trace trick for deriving gradients.** Many matrix gradient derivations become simple using the identity $\text{tr}(a^\top b) = a^\top b$ (a scalar equals its own trace) and the cyclic property of trace. For example, to find $\partial \|Ax-b\|^2 / \partial A$:

$$\|Ax-b\|^2 = \text{tr}((Ax-b)^\top(Ax-b)) = \text{tr}(x^\top A^\top Ax - 2b^\top Ax + b^\top b)$$

Using $\partial\, \text{tr}(X^\top AY)/\partial A = XY^\top$ (together with the product rule on the quadratic term):

$$\frac{\partial}{\partial A}\|Ax-b\|^2 = 2(Ax-b)x^\top$$

This trace-based approach systematically handles most gradient derivations encountered in ML.
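The result just derived is easy to confirm numerically. A sketch in NumPy, comparing $2(Ax-b)x^\top$ against elementwise central differences of $\|Ax-b\|^2$ in the entries of $A$:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = rng.standard_normal(m)

def f(M):
    return np.sum((M @ x - b) ** 2)   # ||Mx - b||^2

analytic = 2 * np.outer(A @ x - b, x)  # derived: 2 (Ax - b) x^T

eps = 1e-6
numeric = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        numeric[i, j] = (f(Ap) - f(Am)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```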

Directional Derivative and Differential

The **directional derivative** of $f$ at $x$ in direction $v$ is:

$$D_v f(x) = \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = \nabla f(x)^\top v$$

This measures the rate of change of $f$ along direction $v$. The gradient is the direction that maximizes the directional derivative (subject to $\|v\| = 1$).
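The identity $D_v f = \nabla f^\top v$ can be checked with a one-dimensional finite difference along $v$. A sketch in NumPy, using the softplus gradient from the identities table ($\nabla \log(1 + e^{a^\top x}) = \sigma(a^\top x)\, a$; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
a = rng.standard_normal(n)
x = rng.standard_normal(n)
v = rng.standard_normal(n)

def f(x):
    return np.log(1 + np.exp(a @ x))          # softplus(a^T x)

grad = a / (1 + np.exp(-(a @ x)))             # sigma(a^T x) * a

# Central difference along direction v approximates D_v f
t = 1e-6
dir_deriv = (f(x + t * v) - f(x - t * v)) / (2 * t)

assert np.isclose(dir_deriv, grad @ v, atol=1e-5)
```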

**Differential notation** simplifies matrix calculus. The differential $df$ is defined as:

$$df = \nabla f^\top dx \ \ \text{(vector argument)}, \qquad df = \text{tr}\!\left(\left(\frac{\partial f}{\partial A}\right)^{\!\top} dA\right) \ \ \text{(matrix argument)}$$

This notation makes the chain rule automatic: if $f(A) = g(h(A))$, then $df = \frac{\partial g}{\partial h}\, dh$ and $dh = \frac{\partial h}{\partial A}\, dA$ (with appropriate products). Identifying the coefficient of $dA$ in the final expression gives $\partial f/\partial A$.

Notation Summary

| Symbol | Meaning |
| --- | --- |
| $\nabla_x f$ | Gradient of scalar $f$ w.r.t. vector $x$ (column vector) |
| $J = \partial f / \partial x$ | Jacobian matrix ($m \times n$) |
| $H = \nabla^2 f$ | Hessian matrix ($n \times n$, symmetric) |
| $\text{tr}(A)$ | Trace of $A$ |
| $\det(A)$ | Determinant of $A$ |
| $A^{-\top}$ | $(A^{-1})^\top = (A^\top)^{-1}$ |
| $\delta$ | Perturbation vector |
| $D_v f$ | Directional derivative of $f$ in direction $v$ |
| $df$ | Differential of $f$ |
| $\odot$ | Elementwise (Hadamard) product |
| $\bar{h}$ | Upstream gradient ($\partial \mathcal{L}/\partial h$) |