
Backpropagation

Backpropagation is the algorithm that makes training neural networks tractable. It computes the gradient of the loss with respect to every parameter in $O(L \cdot C_{\text{fwd}})$ time and $O(L \cdot A)$ memory, where $L$ is the number of layers, $C_{\text{fwd}}$ is the per-layer forward cost, and $A$ is the per-layer activation size. Without it, computing gradients would require $O(P)$ forward passes (one per parameter), making training of billion-parameter models impossible.

The Chain Rule

For a composition of differentiable functions $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, the chain rule gives the gradient of a scalar loss $\mathcal{L}$ with respect to the parameters $\theta_l$ of layer $l$:

$$\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_{l+1}}{\partial h_l} \cdot \frac{\partial h_l}{\partial \theta_l}$$

where $h_l = f_l(h_{l-1}; \theta_l)$ is the output of layer $l$, with $h_0 = x$ (input) and $\mathcal{L} = \ell(h_L, y)$ (loss).

**Why compute right-to-left?** The key insight is that $\partial \mathcal{L} / \partial h_l$ appears in the gradient for *every* layer $l' \leq l$. By computing the chain from right to left (from loss to input), we reuse the intermediate quantity $\partial \mathcal{L} / \partial h_l$ (called the "upstream gradient" or "delta") for all parameters in layers $l' \leq l$. Computing left-to-right (forward mode) would require a separate pass for each parameter.

Computational Graphs

A neural network defines a **directed acyclic graph** (DAG) where:
  • Nodes represent operations (matmul, add, ReLU, softmax) or variables (inputs, parameters)
  • Edges represent data flow (tensors flowing between operations)
  • Forward pass: evaluates nodes in topological order from inputs to loss, caching all intermediate activations
  • Backward pass: traverses the graph in reverse topological order, computing gradients using the cached activations

Each node stores its local Jacobian -- how its output changes with respect to its inputs. The backward pass chains these local Jacobians via vector-Jacobian products (VJPs).

When a node has multiple consumers (e.g., a residual connection where $h_l$ feeds both $f_{l+1}$ and the skip connection), the gradients from all consumers are **summed**. This is the multivariate chain rule: if $z = f(x)$ and $w = g(x)$ both depend on $x$, then $\partial \mathcal{L}/\partial x = \partial \mathcal{L}/\partial z \cdot \partial z/\partial x + \partial \mathcal{L}/\partial w \cdot \partial w/\partial x$.
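The summation rule can be checked on a toy fan-out graph. This is an illustrative sketch (the functions $z = 3x$ and $w = x^2$ are my own choices, not from the text):

```python
# Fan-out example: x feeds two consumers, z = 3*x and w = x**2,
# and the loss is their sum. The backward pass must SUM the
# gradient contributions arriving from both consumers.
x = 2.0

# Forward
z = 3.0 * x            # consumer 1
w = x ** 2             # consumer 2
loss = z + w

# Backward: dL/dz = dL/dw = 1 since loss = z + w
dz = 1.0
dw = 1.0
dx = dz * 3.0 + dw * 2.0 * x   # multivariate chain rule: contributions sum

print(dx)  # 3 + 2*2 = 7.0
```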

A Concrete Example

**Two-layer network with MSE loss.** Let $\hat{y} = \sigma(W_2 \, \sigma(W_1 x + b_1) + b_2)$ with MSE loss $\mathcal{L} = \frac{1}{2}\|y - \hat{y}\|^2$.

Forward pass (compute and cache all intermediates):

$$z_1 = W_1 x + b_1, \quad h_1 = \sigma(z_1), \quad z_2 = W_2 h_1 + b_2, \quad \hat{y} = \sigma(z_2)$$

Backward pass (apply chain rule in reverse, reusing cached values):

| Step | Gradient | Depends On |
| --- | --- | --- |
| Loss output | $\delta_{\hat{y}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y$ | $\hat{y}, y$ |
| Pre-activation 2 | $\delta_{z_2} = \delta_{\hat{y}} \odot \sigma'(z_2)$ | $\delta_{\hat{y}}, z_2$ (cached) |
| Weight 2 | $\frac{\partial \mathcal{L}}{\partial W_2} = \delta_{z_2} \, h_1^\top$ | $\delta_{z_2}, h_1$ (cached) |
| Bias 2 | $\frac{\partial \mathcal{L}}{\partial b_2} = \delta_{z_2}$ | $\delta_{z_2}$ |
| Hidden activation | $\delta_{h_1} = W_2^\top \delta_{z_2}$ | $\delta_{z_2}, W_2$ |
| Pre-activation 1 | $\delta_{z_1} = \delta_{h_1} \odot \sigma'(z_1)$ | $\delta_{h_1}, z_1$ (cached) |
| Weight 1 | $\frac{\partial \mathcal{L}}{\partial W_1} = \delta_{z_1} \, x^\top$ | $\delta_{z_1}, x$ (cached) |
| Bias 1 | $\frac{\partial \mathcal{L}}{\partial b_1} = \delta_{z_1}$ | $\delta_{z_1}$ |

Notice the pattern: each layer's backward pass produces (1) gradients for its own parameters and (2) a gradient with respect to its input, which becomes the upstream gradient for the layer below.
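The derivation above transcribes almost line-for-line into NumPy. This is an illustrative sketch (not from the source) that instantiates $\sigma$ as the logistic sigmoid and checks one weight gradient against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                      # input (column vector)
y = rng.normal(size=(2, 1))                      # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

def forward(W1):
    z1 = W1 @ x + b1; h1 = sigmoid(z1)           # layer 1
    z2 = W2 @ h1 + b2; yhat = sigmoid(z2)        # layer 2
    return z1, h1, z2, yhat

# Forward pass: cache all intermediates
z1, h1, z2, yhat = forward(W1)
loss = 0.5 * np.sum((y - yhat) ** 2)

# Backward pass, in the same order as the derivation
d_yhat = yhat - y                                # dL/dyhat
d_z2 = d_yhat * sigmoid(z2) * (1 - sigmoid(z2))  # sigma'(z) = sigma(z)(1 - sigma(z))
dW2, db2 = d_z2 @ h1.T, d_z2
d_h1 = W2.T @ d_z2
d_z1 = d_h1 * sigmoid(z1) * (1 - sigmoid(z1))
dW1, db1 = d_z1 @ x.T, d_z1

# Sanity check: compare dW1[0,0] to a central finite difference
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
Wm = W1.copy(); Wm[0, 0] -= eps
num = (0.5 * np.sum((y - forward(Wp)[3]) ** 2)
       - 0.5 * np.sum((y - forward(Wm)[3]) ** 2)) / (2 * eps)
print(abs(num - dW1[0, 0]))  # agreement to ~1e-10
```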

**Generic backward pass for $z = Wx + b$.** Given upstream gradient $\delta_z = \partial \mathcal{L}/\partial z$:

$$\frac{\partial \mathcal{L}}{\partial W} = \delta_z \, x^\top \ \text{(outer product)}, \qquad \frac{\partial \mathcal{L}}{\partial b} = \delta_z, \qquad \frac{\partial \mathcal{L}}{\partial x} = W^\top \delta_z \ \text{(matmul with transpose)}$$

This is why the backward pass has approximately the same computational cost as the forward pass: both are dominated by matrix multiplications of the same shapes (but transposed).

Jacobians and Automatic Differentiation

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, the **Jacobian** is the $m \times n$ matrix of all first partial derivatives:

$$J = \frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}$$

Modern automatic differentiation avoids materializing full Jacobians. Instead, it computes products with vectors:
  • Vector-Jacobian product (VJP): given $v \in \mathbb{R}^m$ (upstream gradient), compute $v^\top J \in \mathbb{R}^n$ in $O(mn)$ time. This is reverse-mode AD (backpropagation).

  • Jacobian-vector product (JVP): given $u \in \mathbb{R}^n$ (perturbation direction), compute $Ju \in \mathbb{R}^m$ in $O(mn)$ time. This is forward-mode AD.

**When to use each mode:**

| Scenario | Inputs ($n$) | Outputs ($m$) | Best Mode | Cost |
| --- | --- | --- | --- | --- |
| Neural net (loss is scalar) | Millions of params | 1 | Reverse (VJP) | 1 backward pass |
| Sensitivity analysis | Few | Many | Forward (JVP) | $n$ forward passes |
| Hessian-vector product | $n$ | $n$ | Forward-over-reverse | 1 forward + 1 backward |

Neural networks have $m = 1$ (scalar loss) and $n \gg 1$ (many parameters), making reverse mode overwhelmingly more efficient. This is why backpropagation (reverse-mode AD) is the standard.
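The asymmetry is easiest to see on a linear map $f(x) = Wx$, whose Jacobian is just $W$. A sketch with hand-written VJPs and JVPs (NumPy only, no AD library):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 1                          # many inputs, one scalar output (like a loss)
W = rng.normal(size=(m, n))          # f(x) = W x, so the Jacobian J is just W

# Reverse mode (VJP): ONE pass with upstream v = [1.0] yields the whole gradient
v = np.ones(m)
grad_reverse = v @ W                 # v^T J, shape (n,)

# Forward mode (JVP): each pass with basis direction e_i yields ONE component,
# so assembling the full gradient takes n passes
grad_forward = np.array([W @ np.eye(n)[i] for i in range(n)]).ravel()

print(np.allclose(grad_reverse, grad_forward))  # same gradient, n-fold cost gap
```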

**Hessian-vector products without forming the Hessian.** The Hessian $H = \nabla^2 \mathcal{L} \in \mathbb{R}^{P \times P}$ is far too large to materialize ($P^2$ elements). But the product $Hv$ can be computed exactly by composing forward and reverse mode:

$$Hv = \nabla_\theta \left( (\nabla_\theta \mathcal{L})^\top v \right)$$

This costs one forward pass + one backward pass, regardless of $P$. Hessian-vector products are used in second-order methods (conjugate gradient, L-BFGS), eigenvalue estimation of the Hessian (for understanding loss landscape curvature), and natural gradient methods.
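For a quadratic loss $\mathcal{L}(\theta) = \frac{1}{2}\theta^\top A \theta$ the identity can be checked directly. The sketch below (my own toy setup) uses a central difference of the gradient as a numerical stand-in for the forward-over-reverse composition; for a quadratic this is exact up to roundoff:

```python
import numpy as np

rng = np.random.default_rng(2)
P = 6
A = rng.normal(size=(P, P))
A = A + A.T                          # symmetric: this IS the Hessian of L below
theta = rng.normal(size=P)
v = rng.normal(size=P)

def grad(theta):
    # Gradient of L(theta) = 0.5 * theta^T A theta (one "backward pass")
    return A @ theta

# Hessian-vector product WITHOUT materializing H: the directional
# derivative of the gradient along v, which is what Hv = grad(grad(L)^T v)
# computes. Cost: two gradient evaluations, independent of P.
eps = 1e-5
Hv = (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

print(np.allclose(Hv, A @ v, atol=1e-6))  # matches the explicit product A v
```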

Gradient Flow Problems

**Gradient flow** refers to how the gradient magnitude evolves as it propagates backward through layers. For a depth-$L$ network without skip connections, the gradient at layer $l$ involves a product of $L - l$ Jacobians:

$$\frac{\partial \mathcal{L}}{\partial h_l} = \prod_{k=l+1}^{L} \frac{\partial h_k}{\partial h_{k-1}} \cdot \frac{\partial \mathcal{L}}{\partial h_L}$$

If the spectral norm $\|\partial h_k / \partial h_{k-1}\| < 1$ for most layers, the product shrinks exponentially (vanishing). If $\|\partial h_k / \partial h_{k-1}\| > 1$, it grows exponentially (exploding).
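A quick simulation makes the exponential behavior concrete. This is an illustrative sketch with random Gaussian Jacobians, scaled so that each layer multiplies a typical gradient vector by roughly the chosen gain:

```python
import numpy as np

rng = np.random.default_rng(3)
d, L = 32, 50
delta = np.ones(d)                   # upstream gradient at the loss

def backprop_norms(gain):
    g = delta.copy()
    norms = []
    for _ in range(L):
        # Random layer Jacobian; the 1/sqrt(d) scaling makes it stretch
        # a typical vector by ~gain per layer
        J = gain * rng.normal(size=(d, d)) / np.sqrt(d)
        g = J.T @ g                  # one step of the backward product
        norms.append(np.linalg.norm(g))
    return norms

vanish = backprop_norms(0.5)         # per-layer gain < 1: shrinks like 0.5^L
explode = backprop_norms(1.5)        # per-layer gain > 1: grows like 1.5^L
print(f"after {L} layers: vanishing {vanish[-1]:.2e}, exploding {explode[-1]:.2e}")
```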

| Problem | Mechanism | Symptom | Solution |
| --- | --- | --- | --- |
| Vanishing gradients | $\lVert \sigma' \rVert < 1$ (sigmoid, tanh saturate); deep products shrink | Early layers do not learn | ReLU, skip connections, normalization |
| Exploding gradients | $\lVert W_l \rVert > 1$; products of weight matrices grow | Loss spikes, NaN | Gradient clipping, careful initialization |
| Dead neurons | ReLU: $\sigma'(z) = 0$ for $z < 0$; once dead, gradient is permanently zero | Parameters stop updating | Leaky ReLU ($\alpha = 0.01$), lower learning rate |
| Shattered gradients | Gradient correlation between nearby points decays with depth | Gradients resemble white noise in deep nets | BatchNorm, residual connections |

For a residual block $h_l = f_l(h_{l-1}) + h_{l-1}$, the Jacobian is:

$$\frac{\partial h_l}{\partial h_{l-1}} = \frac{\partial f_l}{\partial h_{l-1}} + I$$

The additive identity $I$ provides a "gradient highway" -- even if $\partial f_l / \partial h_{l-1}$ is small, the gradient passes through the skip connection unattenuated. For a network with $L$ residual blocks:

$$\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \cdot \prod_{l=1}^{L}\left(\frac{\partial f_l}{\partial h_{l-1}} + I\right)$$

Expanding this product yields $2^L$ terms, one for each subset of layers. The "direct path" (all identity terms) contributes $\partial \mathcal{L}/\partial h_L$ directly, ensuring gradients reach early layers regardless of depth (He et al., 2016).
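The "gradient highway" effect shows up clearly in simulation. A sketch (my own toy setup) comparing the backward product with and without the identity term:

```python
import numpy as np

rng = np.random.default_rng(4)
d, L = 32, 50
g0 = np.ones(d)                      # gradient arriving at the top, dL/dh_L

def grad_norm_at_input(residual):
    g = g0.copy()
    for _ in range(L):
        # Small per-layer Jacobian df_l/dh_{l-1}
        J = 0.1 * rng.normal(size=(d, d)) / np.sqrt(d)
        if residual:
            J = J + np.eye(d)        # residual Jacobian: df_l/dh_{l-1} + I
        g = J.T @ g
    return np.linalg.norm(g)

plain = grad_norm_at_input(False)    # shrinks like 0.1^L: vanishes
resid = grad_norm_at_input(True)     # identity path keeps the gradient alive
print(f"plain: {plain:.2e}  residual: {resid:.2e}")
```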

Weight Initialization

**Proper initialization preserves gradient scale.** For a linear layer $h_l = W_l h_{l-1}$, the variance of the output is $\text{Var}(h_{l,j}) = n_{\text{in}} \cdot \text{Var}(W_{l,ij}) \cdot \text{Var}(h_{l-1,i})$. To prevent activations from exploding or vanishing across layers, we need $n_{\text{in}} \cdot \text{Var}(W) = 1$.

| Scheme | Weight Variance | Activation | Reference |
| --- | --- | --- | --- |
| Xavier / Glorot | $\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$ | Sigmoid, Tanh | (Glorot & Bengio, 2010) |
| He / Kaiming | $\text{Var}(W) = \frac{2}{n_{\text{in}}}$ | ReLU | (He et al., 2015) |
| LeCun | $\text{Var}(W) = \frac{1}{n_{\text{in}}}$ | SELU | (LeCun et al., 2012) |
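The variance-preservation argument can be verified empirically. An illustrative sketch (shapes and depth are my own choices) pushing a batch through a deep ReLU stack with He initialization:

```python
import numpy as np

rng = np.random.default_rng(5)
d, depth, batch = 256, 20, 512
h = rng.normal(size=(batch, d))      # unit-variance input activations

for _ in range(depth):
    # He/Kaiming: Var(W) = 2 / n_in, compensating for ReLU zeroing ~half the units
    W = rng.normal(size=(d, d)) * np.sqrt(2.0 / d)
    h = np.maximum(h @ W, 0.0)       # ReLU

# With Var(W) = 1/n_in (no factor of 2) the std would instead shrink
# roughly as (1/sqrt(2))^depth under ReLU.
print(f"activation std after {depth} layers: {h.std():.3f}")  # stays O(1)
```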

Memory Cost of Backpropagation

Backpropagation must cache all intermediate activations from the forward pass for use in the backward pass. For a transformer with $L$ layers, sequence length $T$, hidden dimension $d$, and batch size $B$:

$$\text{Activation memory} \approx L \cdot B \cdot T \cdot d \cdot (\text{bytes per element}) \cdot (\text{constant factor})$$

For LLaMA-7B ($L=32$, $d=4096$) with $B=1$, $T=2048$ in FP16 and a constant factor of roughly 10: $32 \times 1 \times 2048 \times 4096 \times 2 \times 10 \approx 5$ GB. At larger batch sizes or sequence lengths, activation memory exceeds the model parameter memory itself, which is why gradient checkpointing (recomputing activations during the backward pass instead of caching them) is standard for training large models.
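The arithmetic is worth packaging as a helper. A minimal sketch, assuming FP16 (2 bytes per element) and a constant factor of ~10 covering attention scores, MLP intermediates, and similar per-layer buffers:

```python
def activation_memory_gb(L, B, T, d, bytes_per_el=2, const_factor=10):
    """Rough transformer activation-memory estimate in GB. The constant
    factor (assumed ~10 here) covers attention scores, MLP intermediates,
    and other cached per-layer tensors."""
    return L * B * T * d * bytes_per_el * const_factor / 1e9

# LLaMA-7B-like shapes from the text
print(round(activation_memory_gb(L=32, B=1, T=2048, d=4096), 2))  # 5.37
```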

Notation Summary

| Symbol | Meaning |
| --- | --- |
| $h_l$ | Activation (output) of layer $l$ |
| $z_l$ | Pre-activation (before nonlinearity) of layer $l$ |
| $\delta_l = \partial \mathcal{L}/\partial z_l$ | Error signal (delta) at layer $l$ |
| $\sigma, \sigma'$ | Activation function and its derivative |
| $\odot$ | Elementwise (Hadamard) product |
| $J$ | Jacobian matrix |
| VJP | Vector-Jacobian product (reverse-mode AD) |
| JVP | Jacobian-vector product (forward-mode AD) |
| $n_{\text{in}}, n_{\text{out}}$ | Fan-in and fan-out of a layer |
| $I$ | Identity matrix |
