Backpropagation
Backpropagation is the algorithm that makes training neural networks tractable. It computes the gradient of the loss with respect to every parameter in $O(F)$ time and $O(NA)$ memory, where $N$ is the number of layers, $F$ is the forward-pass cost, and $A$ is the per-layer activation size. Without it, computing gradients by finite differences would require $P$ forward passes (one per parameter), making training of billion-parameter models impossible.
The Chain Rule
Write the network as a composition of layers and let $a^{(l)}$ denote the output of layer $l$, with $a^{(0)} = x$ (input) and $a^{(N)} = L$ (loss). The chain rule propagates the gradient backward one layer at a time:

$$\frac{\partial L}{\partial a^{(l)}} = \left( \frac{\partial a^{(l+1)}}{\partial a^{(l)}} \right)^{\top} \frac{\partial L}{\partial a^{(l+1)}}$$
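As a quick sanity check, the chain rule can be verified numerically on a tiny composition (the function and values below are illustrative, not from the text):

```python
import math

# Composition: a1 = w * x, a2 = sin(a1), L = a2 ** 2.
# The backward pass multiplies the local derivatives layer by layer.
w, x = 0.7, 2.0

a1 = w * x                 # layer 1 (cached)
a2 = math.sin(a1)          # layer 2 (cached)
L = a2 ** 2                # loss

# dL/dw = (dL/da2) * (da2/da1) * (da1/dw)
dL_dw = (2 * a2) * math.cos(a1) * x

# Central finite difference for comparison.
eps = 1e-6
fd = (math.sin((w + eps) * x) ** 2 - math.sin((w - eps) * x) ** 2) / (2 * eps)
assert abs(dL_dw - fd) < 1e-6
```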
Computational Graphs
- Nodes represent operations (matmul, add, ReLU, softmax) or variables (inputs, parameters)
- Edges represent data flow (tensors flowing between operations)
- Forward pass: evaluates nodes in topological order from inputs to loss, caching all intermediate activations
- Backward pass: traverses the graph in reverse topological order, computing gradients using the cached activations
Each node stores its local Jacobian -- how its output changes with respect to its inputs. The backward pass chains these local Jacobians via vector-Jacobian products (VJPs).
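This structure fits in a few lines of Python. The sketch below is a toy with hypothetical names (`Var`, `backward`), not a real framework: each node caches what it needs during the forward pass and supplies local VJP closures for the backward pass:

```python
# Minimal reverse-mode autodiff sketch (toy; hypothetical names).
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent_var, local_vjp) pairs
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, lambda g: g * other.value),
                            (other, lambda g: g * self.value)))

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=((self, lambda g: g), (other, lambda g: g)))

def backward(loss):
    # Topological order via DFS, then chain local VJPs in reverse.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(loss)
    loss.grad = 1.0
    for v in reversed(order):
        for p, vjp in v.parents:
            p.grad += vjp(v.grad)   # accumulate along all paths

x = Var(3.0); w = Var(2.0)
L = w * x + x      # dL/dw = x = 3, dL/dx = w + 1 = 3
backward(L)
```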
A Concrete Example
Consider a two-layer network with elementwise nonlinearity $\sigma$ and loss $\ell$.

Forward pass (compute and cache all intermediates):

$$z^{(1)} = W^{(1)} x + b^{(1)}, \qquad a^{(1)} = \sigma(z^{(1)})$$

$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \qquad a^{(2)} = \sigma(z^{(2)}), \qquad L = \ell(a^{(2)}, y)$$

Backward pass (apply chain rule in reverse, reusing cached values):

| Step | Gradient | Depends On |
|---|---|---|
| Loss output | $\partial L / \partial a^{(2)}$ | |
| Pre-activation 2 | $\delta^{(2)} = \partial L / \partial a^{(2)} \odot \sigma'(z^{(2)})$ | $z^{(2)}$ (cached) |
| Weight 2 | $\partial L / \partial W^{(2)} = \delta^{(2)} (a^{(1)})^{\top}$ | $a^{(1)}$ (cached) |
| Bias 2 | $\partial L / \partial b^{(2)} = \delta^{(2)}$ | |
| Hidden activation | $\partial L / \partial a^{(1)} = (W^{(2)})^{\top} \delta^{(2)}$ | |
| Pre-activation 1 | $\delta^{(1)} = \partial L / \partial a^{(1)} \odot \sigma'(z^{(1)})$ | $z^{(1)}$ (cached) |
| Weight 1 | $\partial L / \partial W^{(1)} = \delta^{(1)} x^{\top}$ | $x$ (cached) |
| Bias 1 | $\partial L / \partial b^{(1)} = \delta^{(1)}$ | |
Notice the pattern: each layer's backward pass produces (1) gradients for its own parameters and (2) a gradient for its input, which becomes the upstream gradient for the preceding layer.
This is why the backward pass has approximately the same computational cost as the forward pass: both are dominated by matrix multiplications of the same shapes (but transposed).
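The example above can be sketched in NumPy; the layer sizes, sigmoid nonlinearity, and squared-error loss are illustrative assumptions. A finite-difference check confirms the backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                    # input (cached)
y = rng.normal(size=(2, 1))                    # target
W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1                           # pre-activation 1 (cached)
    a1 = sigmoid(z1)                           # hidden activation (cached)
    z2 = W2 @ a1 + b2                          # pre-activation 2 (cached)
    a2 = sigmoid(z2)                           # output
    L = 0.5 * np.sum((a2 - y) ** 2)            # squared-error loss
    return L, (z1, a1, z2, a2)

L, (z1, a1, z2, a2) = forward(W1, b1, W2, b2)

# Backward pass, reusing cached values; sigmoid'(z) = sigmoid(z)*(1 - sigmoid(z)).
dL_da2 = a2 - y                                # loss output
delta2 = dL_da2 * a2 * (1 - a2)                # pre-activation 2
dW2 = delta2 @ a1.T                            # weight 2 (uses cached a1)
db2 = delta2                                   # bias 2
dL_da1 = W2.T @ delta2                         # hidden activation
delta1 = dL_da1 * a1 * (1 - a1)                # pre-activation 1
dW1 = delta1 @ x.T                             # weight 1 (uses cached x)
db1 = delta1                                   # bias 1

# Finite-difference check on one entry of W1.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
fd = (forward(W1p, b1, W2, b2)[0] - forward(W1m, b1, W2, b2)[0]) / (2 * eps)
assert abs(dW1[0, 0] - fd) < 1e-6
```

Note that every backward step is a matrix multiplication with the same shapes as the forward pass, just transposed.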
Jacobians and Automatic Differentiation
For a function $f : \mathbb{R}^n \to \mathbb{R}^m$ with Jacobian $J \in \mathbb{R}^{m \times n}$:

- **Vector-Jacobian product (VJP):** Given $v \in \mathbb{R}^m$ (upstream gradient), compute $v^{\top} J$ in time comparable to one evaluation of $f$. This is reverse-mode AD (backpropagation).
- **Jacobian-vector product (JVP):** Given $u \in \mathbb{R}^n$ (perturbation direction), compute $J u$ in time comparable to one evaluation of $f$. This is forward-mode AD.
| Scenario | Inputs ($n$) | Outputs ($m$) | Best Mode | Cost |
|---|---|---|---|---|
| Neural net (loss is scalar) | Millions of params | 1 | Reverse (VJP) | 1 backward pass |
| Sensitivity analysis | Few | Many | Forward (JVP) | $n$ forward passes |
| Hessian-vector product | Many | 1 | Forward-over-reverse | 1 forward + 1 backward |
Neural networks have $m = 1$ (scalar loss) and $n$ in the millions or billions (parameters), making reverse mode overwhelmingly more efficient. This is why backpropagation (reverse-mode AD) is the standard.

A Hessian-vector product can be computed without materializing the Hessian, via $Hv = \nabla_{\theta}\big( (\nabla_{\theta} L)^{\top} v \big)$. This costs one forward pass + one backward pass, regardless of $n$. Hessian-vector products are used in second-order methods (conjugate gradient, L-BFGS), eigenvalue estimation of the Hessian (for understanding loss landscape curvature), and natural gradient methods.
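On a toy quadratic loss $L = \tfrac{1}{2}\theta^{\top} A \theta$, the gradient is $A\theta$ and the Hessian-vector product is exactly $Av$, which makes the trick easy to sanity-check. In the sketch below a central finite difference of the gradient stands in for the forward-over-reverse pass; the point is that the cost is two gradient evaluations, independent of $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.normal(size=(n, n))
A = A + A.T                        # symmetric matrix = Hessian of the quadratic
theta = rng.normal(size=n)
v = rng.normal(size=n)

grad = lambda t: A @ t             # analytic gradient of L = 0.5 * t^T A t

# Hv ~ directional derivative of the gradient along v:
# two gradient evaluations, never the full n x n Hessian.
eps = 1e-5
hvp = (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)
assert np.allclose(hvp, A @ v, atol=1e-6)
```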
Gradient Flow Problems
The gradient reaching layer $l$ contains a product of layer Jacobians, $\prod_{k=l}^{N-1} \partial a^{(k+1)} / \partial a^{(k)}$. If the spectral norm $\left\| \partial a^{(k+1)} / \partial a^{(k)} \right\| < 1$ for most layers, the product shrinks exponentially with depth (vanishing). If it is $> 1$, the product grows exponentially (exploding).
| Problem | Mechanism | Symptom | Solution |
|---|---|---|---|
| Vanishing gradients | $|\sigma'| \ll 1$ (sigmoid, tanh saturate); deep products shrink | Early layers do not learn | ReLU, skip connections, normalization |
| Exploding gradients | $\|W\| > 1$; products of weight matrices grow | Loss spikes, NaN | Gradient clipping, careful initialization |
| Dead neurons | ReLU: $\sigma'(z) = 0$ for $z < 0$; once dead, gradient is permanently zero | Parameters stop updating | Leaky ReLU ($\alpha z$ for $z < 0$), lower learning rate |
| Shattered gradients | Gradient correlation between nearby points decays with depth | Gradients resemble white noise in deep nets | BatchNorm, residual connections |
A residual block computes $a^{(l+1)} = a^{(l)} + F(a^{(l)})$, so its Jacobian is $I + \partial F / \partial a^{(l)}$. The additive identity provides a "gradient highway" -- even if $\partial F / \partial a^{(l)}$ is small, the gradient passes through the skip connection unattenuated. For a network with $N$ residual blocks:

$$\frac{\partial a^{(N)}}{\partial a^{(0)}} = \prod_{l=0}^{N-1} \left( I + \frac{\partial F}{\partial a^{(l)}} \right)$$

Expanding this product yields $2^N$ terms, one for each subset of layers. The "direct path" (all identity factors) contributes $I$ directly, ensuring gradients reach early layers regardless of depth (He et al., 2016).
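A quick numerical illustration (random layer Jacobians with spectral norm around 0.5; all scales are assumptions): the plain product of Jacobians vanishes with depth, while the residual product of $(I + J)$ factors does not:

```python
import numpy as np

rng = np.random.default_rng(2)
d, depth = 16, 40
plain = np.eye(d)
resid = np.eye(d)
for _ in range(depth):
    # Random Gaussian Jacobian scaled so its spectral norm is ~0.5.
    J = rng.normal(size=(d, d)) * (0.25 / np.sqrt(d))
    plain = J @ plain                  # plain deep net: product of J's
    resid = (np.eye(d) + J) @ resid    # residual net: product of (I + J)'s

plain_norm = np.linalg.norm(plain, 2)  # spectral norm
resid_norm = np.linalg.norm(resid, 2)
assert plain_norm < 1e-6               # vanished exponentially
assert resid_norm > 1e-2               # identity path keeps gradients alive
```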
Weight Initialization
| Scheme | Weight Variance | Activation | Reference |
|---|---|---|---|
| Xavier / Glorot | $\mathrm{Var}(W) = \dfrac{2}{n_{\text{in}} + n_{\text{out}}}$ | Sigmoid, Tanh | (Glorot & Bengio, 2010) |
| He / Kaiming | $\mathrm{Var}(W) = \dfrac{2}{n_{\text{in}}}$ | ReLU | (He et al., 2015) |
| LeCun | $\mathrm{Var}(W) = \dfrac{1}{n_{\text{in}}}$ | SELU | (LeCun et al., 2012) |
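The ReLU correction factor in He initialization can be seen empirically: push a signal through a deep stack of random ReLU layers and track the activation scale (the width and depth below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, depth = 256, 50
x0 = rng.normal(size=n)

def final_std(weight_std):
    a = x0.copy()
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * weight_std
        a = np.maximum(0.0, W @ a)         # ReLU layer
    return a.std()

he = final_std(np.sqrt(2.0 / n))           # He / Kaiming: Var(W) = 2 / n_in
naive = final_std(np.sqrt(1.0 / n))        # missing the factor of 2 for ReLU

# He init keeps the signal's scale roughly constant; the naive variance
# halves it every layer, shrinking by ~2^-25 over 50 layers.
assert naive < he / 10
```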
Memory Cost of Backpropagation
Backpropagation must cache every intermediate activation for the backward pass, so activation memory scales as $O(B \cdot T \cdot d \cdot N)$ for batch size $B$, sequence length $T$, hidden size $d$, and $N$ layers. For LLaMA-7B ($N = 32$, $d = 4096$) with activations in FP16, this can reach tens of gigabytes at typical batch sizes and sequence lengths. This often exceeds the model parameter memory (about 14 GB in FP16), which is why gradient checkpointing (recomputing activations during the backward pass instead of caching them) is standard for training large models.
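A back-of-envelope sketch of the activation memory, assuming one cached $d$-sized FP16 activation per token per layer; the batch size and sequence length are illustrative, and real transformers cache several tensors per layer (attention intermediates, the wider MLP hidden state), so this is a lower bound:

```python
# LLaMA-7B-scale dimensions: 32 layers, hidden size 4096.
n_layers, d, bytes_fp16 = 32, 4096, 2
batch, seq_len = 16, 2048              # illustrative training shape

# One d-sized activation per token per layer, in FP16.
activation_bytes = batch * seq_len * d * n_layers * bytes_fp16
print(f"{activation_bytes / 2**30:.1f} GiB")   # -> 8.0 GiB (lower bound)
```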
Notation Summary
| Symbol | Meaning |
|---|---|
| $a^{(l)}$ | Activation (output) of layer $l$ |
| $z^{(l)}$ | Pre-activation (before nonlinearity) of layer $l$ |
| $\delta^{(l)}$ | Error signal (delta) at layer $l$ |
| $\sigma, \sigma'$ | Activation function and its derivative |
| $\odot$ | Elementwise (Hadamard) product |
| $J$ | Jacobian matrix |
| VJP | Vector-Jacobian product (reverse-mode AD) |
| JVP | Jacobian-vector product (forward-mode AD) |
| $n_{\text{in}}, n_{\text{out}}$ | Fan-in and fan-out of a layer |
| $I$ | Identity matrix |
References
- Xavier Glorot, Yoshua Bengio (2010). Understanding the Difficulty of Training Deep Feedforward Neural Networks. AISTATS.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016). Deep Residual Learning for Image Recognition. CVPR.
- Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller (2012). Efficient BackProp. In Neural Networks: Tricks of the Trade, Springer.