Gradient Descent
The Optimization Problem
Machine learning reduces to empirical risk minimization (ERM): finding parameters that minimize a loss function averaged over a training dataset :
For most models of interest (neural networks), is non-convex and high-dimensional (billions of parameters), ruling out closed-form solutions. Gradient descent solves this iteratively by moving in the direction of steepest descent.
Vanilla Gradient Descent
where is the learning rate (step size).
Drag the blue point to set a starting position on the curve, adjust the learning rate , and click Run GD to watch gradient descent converge. The dashed orange line shows the tangent (gradient direction) at the starting point. Try a large learning rate to see oscillation or divergence.
Convergence Analysis
Equivalently, the loss is bounded by a quadratic around any point:
The constant is the largest eigenvalue of the Hessian .
This is an rate: to achieve -accuracy, we need iterations.
The condition number determines convergence speed. Poorly conditioned problems () converge slowly because the loss surface is elongated: the gradient oscillates across the narrow direction while making slow progress along the wide direction.
Stochastic Gradient Descent (SGD)
Properties of the stochastic gradient :
- Unbiased:
- Bounded variance: , where is the per-sample gradient variance
- Computational cost: per step instead of
This is an rate to a stationary point (where ). Note: for non-convex functions, a stationary point may be a saddle point or local minimum, not necessarily a global minimum.
Run full-batch GD with from . Because the loss is separable, each coordinate updates independently by the scalar rule , where is that coordinate's curvature:
- Steep direction (): the multiplier is , so follows . The sign flips every step: this is the oscillation across the narrow axis, shrinking only by a factor of per step.
- Shallow direction (): the multiplier is , so follows , decreasing monotonically but slowly.
The first two iterates are therefore and . The step size is constrained by the steep direction (any diverges there), yet the shallow direction barely moves with that same . This tension is exactly what the condition number measures, and it is the gap that momentum and adaptive methods are designed to close.
- SGD noise helps escape sharp minima (high curvature) in favor of flat minima (low curvature), which tend to generalize better (Keskar et al., 2017).
- The noise scale is proportional to (learning rate divided by batch size), which is why the linear scaling rule (Goyal et al., 2017) prescribes scaling proportionally to .
- Large-batch training (small noise) can converge to sharp minima that generalize poorly, unless carefully tuned with warmup and learning rate schedules.
Momentum
SGD oscillates in directions with high curvature (large eigenvalues of the Hessian) while making slow progress along directions with low curvature. Momentum fixes this by accumulating a velocity vector:
where is the momentum coefficient (typically ). The velocity is an exponential moving average of past gradients with effective window size .
Nesterov accelerated gradient evaluates the gradient at the "look-ahead" position :
Nesterov momentum achieves the optimal convergence rate for smooth convex functions, compared to for vanilla GD (Nesterov, 1983). In practice, the improvement over classical momentum is often modest for deep learning.
Adam (Adaptive Moment Estimation)
Bias-corrected estimates (since ):
Parameter update:
Default hyperparameters: , , .
- Parameters with large gradient magnitude (large RMS, since is the uncentered second moment, i.e. the mean of squared gradients) have large , so their effective learning rate is smaller. Adam normalizes by RMS magnitude, not by variance, so a large but consistent gradient is also damped.
- Parameters with small gradient magnitude (small RMS) have small , so their effective learning rate is larger, letting Adam make confident updates where the signal is small but steady.
- The bias correction is essential in the first few steps: without it, and are biased toward zero because they are initialized at zero and the exponential average has not converged.
The raw moment updates are
Both are heavily shrunk toward zero by the initialization. The bias correction divides out exactly this shrinkage. At the denominators are and , so
Notice that after correction and exactly, recovering the true first-step statistics rather than the zero-biased ones. The parameter update is
So the first step has magnitude , independent of the gradient scale: with , Adam takes a step of size in the descent direction. Without bias correction, the same step would instead be , more than triple the size, which is the wild early update that warmup exists to suppress.
AdamW is the standard optimizer for transformer training. Typical hyperparameters: , , , weight decay .
Other Optimizers
| Optimizer | Update Rule (Simplified) | Memory | Key Property |
|---|---|---|---|
| SGD | Simple, good generalization | ||
| SGD + Momentum | ; | Dampens oscillations | |
| Adam | Uses as above | Adaptive per-parameter rates | |
| AdamW | Adam + decoupled weight decay | Standard for transformers | |
| Adafactor | Factored second moments | Memory-efficient for large matrices | |
| LAMB | Adam + layerwise LR scaling | Large-batch training | |
| Lion | Sign-based momentum update | Memory-efficient, empirically strong | |
| Muon | Orthogonalized momentum | Emerging; strong empirical results |
Learning Rate Schedules
The learning rate is rarely constant. A schedule adapts it over training:
| Schedule | Formula | Typical Use |
|---|---|---|
| Constant | Baselines, short fine-tuning | |
| Step decay | CNNs (ResNet); , at 30/60/90 epochs | |
| Cosine annealing | Transformer pretraining (Loshchilov & Hutter, 2017) | |
| Linear warmup + cosine | Standard for LLM pretraining | |
| Warmup + inverse sqrt | Original Transformer (Vaswani et al., 2017) | |
| WSD (Warmup-Stable-Decay) | Warmup constant cosine decay | Practical for unknown training length |
Gradient Clipping
where is the maximum allowed gradient norm (typically ). This preserves the gradient direction but bounds its magnitude.
Convergence Rate Summary
| Setting | Algorithm | Rate | Metric |
|---|---|---|---|
| -smooth convex | GD () | ||
| -smooth, -strongly convex | GD () | ||
| -smooth convex | Nesterov | ||
| -smooth convex | SGD () | ||
| -smooth non-convex | SGD () |
Notation Summary
| Symbol | Meaning |
|---|---|
| Model parameters | |
| Learning rate (possibly time-varying) | |
| Loss function (empirical risk) | |
| Per-sample loss | |
| Gradient (or stochastic gradient) at step | |
| Velocity (momentum) or second moment (Adam) | |
| First moment estimate (Adam) | |
| Bias-corrected moment estimates | |
| Exponential decay rates (Adam) | |
| Momentum coefficient | |
| Lipschitz constant of the gradient (-smoothness) | |
| (strong convexity) | Strong convexity parameter |
| Condition number | |
| $B = | \mathcal{B} |
| Per-sample gradient variance | |
| Gradient clipping threshold |
References
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677. ↗
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR. ↗
- Ilya Loshchilov, Frank Hutter (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR. ↗
- Yurii Nesterov (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). Doklady Akademii Nauk SSSR (Soviet Mathematics Doklady). ↗
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). Attention Is All You Need. NeurIPS. ↗