
Non-Convex Optimization

Neural network loss functions are non-convex in their parameters. Despite the lack of convexity guarantees, stochastic gradient descent reliably finds good solutions. Understanding why requires studying the geometry of the loss landscape and the implicit biases of optimization algorithms.

The Non-Convex Landscape of Deep Learning

A network with $P$ parameters defines a loss surface $\mathcal{L}: \mathbb{R}^P \to \mathbb{R}$. This surface has an astronomical number of critical points (where $\nabla \mathcal{L} = 0$), and most of them are saddle points, not minima.

Saddle Points vs. Local Minima

A critical point $\theta^*$ (where $\nabla \mathcal{L}(\theta^*) = 0$) is a **saddle point** if the Hessian $H = \nabla^2 \mathcal{L}(\theta^*)$ has both positive and negative eigenvalues. The number of negative eigenvalues is called the **index** of the saddle point. For a random critical point in $P$ dimensions (under mild assumptions), the probability of being a local minimum is:

$$P(\text{local minimum}) \approx 2^{-P}$$

since each Hessian eigenvalue must be positive independently. For $P = 10^8$ (a 100M-parameter model), this probability is less than $10^{-30000000}$.
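
That exponent can be sanity-checked in log space; a quick sketch using only the numbers from the text:

```python
import math

# 2^{-P} in base 10: log10(2^{-P}) = -P * log10(2)
P = 10**8                        # 100M parameters, as in the text
log10_prob = -P * math.log10(2)
print(log10_prob)                # about -3.01e7, i.e. well below 10^{-30000000}
```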

The practical implication: **the main obstacle to optimization is not bad local minima but saddle points and plateaus.** SGD's stochastic noise naturally helps escape saddle points because the noise has a component along the negative curvature directions, pushing the iterate away from the saddle. Formal results show that perturbed gradient descent escapes strict saddle points in polynomial time [@jin2017escape].
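
A toy illustration of this mechanism (not the formal result): on the strict saddle $f(x, y) = x^2 - y^2$, deterministic gradient descent started on the stable manifold stalls at the origin, while adding gradient noise lets the iterate escape along the negative-curvature direction. All constants here are arbitrary example values:

```python
import random

def grad(x, y):
    # f(x, y) = x^2 - y^2 has a strict saddle at the origin:
    # Hessian eigenvalues are +2 (along x) and -2 (along y).
    return 2 * x, -2 * y

def descend(noise, steps=100, lr=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 1.0, 0.0          # start exactly on the saddle's stable manifold
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * (gx + noise * rng.gauss(0, 1))
        y -= lr * (gy + noise * rng.gauss(0, 1))
    return x, y

# Deterministic GD stalls at the saddle; perturbed GD drifts away along -y^2,
# because any noise component along y is amplified by the negative curvature.
print(descend(noise=0.0))   # stuck near (0, 0)
print(descend(noise=0.01))  # |y| has grown large: escaped the saddle
```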

Loss Surface Geometry

Empirical and theoretical studies have revealed rich structure in neural network loss landscapes:

| Property | Observation | Implication |
| --- | --- | --- |
| Flat vs. sharp minima | SGD converges to wide, flat valleys; large-batch GD finds sharper minima ([@keskar2017large]) | Flat minima generalize better (robust to weight perturbation) |
| Mode connectivity | Good local minima are connected by paths with low loss ([@draxler2018essentially]; [@garipov2018loss]) | The "loss landscape" is more like a connected valley than isolated basins |
| Linear mode connectivity | Checkpoints from the same run can be linearly interpolated with monotone loss ([@frankle2020linear]) | Simplifies model averaging and ensembling |
| Edge of stability | GD with large LR evolves the Hessian so $\lambda_{\max}(H) \approx 2/\eta$ ([@cohen2021gradient]) | Training self-organizes to the stability boundary |
| Loss landscape simplification | BatchNorm, skip connections, and overparameterization smooth the landscape ([@li2018visualizing]) | Architecture design = implicit optimization design |
| Progressive sharpening | Training loss surface becomes sharper over training ([@jastrzebski2020break]) | Motivates learning rate decay |


**Edge of stability.** Classical optimization theory says GD converges only if $\eta < 2/\lambda_{\max}(H)$. Cohen et al. (2021) showed that in practice, GD first increases $\lambda_{\max}$ until it reaches $\approx 2/\eta$, then oscillates at this boundary while still decreasing the loss in a time-averaged sense. This "edge of stability" phenomenon is not predicted by classical convergence theory and suggests that neural network training dynamics are qualitatively different from classical optimization.

**Why flat minima generalize.** A minimum is "flat" if the loss surface curves gently around it: small perturbations to $\theta$ cause small changes in $\mathcal{L}$. Formally, if $\theta^*$ is in a flat minimum, then for a random perturbation $\delta$ with $\|\delta\| = \epsilon$:

$$|\mathcal{L}(\theta^* + \delta) - \mathcal{L}(\theta^*)| \leq \frac{1}{2}\epsilon^2 \lambda_{\max}(H)$$

A smaller $\lambda_{\max}$ (a flatter minimum) means the loss is robust to the weight perturbations that occur when moving from training to test data. This is formalized by PAC-Bayesian generalization bounds ([@mcallester1999pac]), which relate generalization to the volume of parameter space around $\theta^*$ with low loss.
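
The perturbation bound can be checked numerically on a toy quadratic loss whose Hessian (and hence $\lambda_{\max}$) is known exactly; this is a sketch on made-up eigenvalues, not a measurement on a real network:

```python
import random

# Toy quadratic loss L(theta) = 1/2 theta^T H theta with a diagonal Hessian,
# so theta* = 0 is a minimum and lambda_max is the largest diagonal entry.
H = [0.1, 0.5, 3.0]                  # eigenvalues; lambda_max = 3.0
def loss(theta):
    return 0.5 * sum(h * t * t for h, t in zip(H, theta))

def random_perturbation(eps, rng):
    # Uniform random direction scaled to norm eps.
    v = [rng.gauss(0, 1) for _ in H]
    norm = sum(x * x for x in v) ** 0.5
    return [eps * x / norm for x in v]

rng = random.Random(0)
eps, lam_max = 0.1, max(H)
bound = 0.5 * eps**2 * lam_max       # the flatness bound from the text
worst = max(loss(random_perturbation(eps, rng)) for _ in range(1000))
print(worst <= bound)                # True: every perturbation respects the bound
```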

Why Deep Networks Train Despite Non-Convexity

Several factors explain the empirical success:

| Factor | Mechanism | Evidence |
| --- | --- | --- |
| Overparameterization | $P \gg N$ creates many global minima; gradient descent reaches one ([@du2019gradient]) | Networks with more parameters are easier to train |
| SGD noise | Stochastic gradients escape saddle points and sharp minima | Small-batch training generalizes better ([@keskar2017large]) |
| Skip connections | Residual connections create smooth loss surfaces | ResNets train where plain networks fail (He et al., 2016) |
| Normalization | BatchNorm/LayerNorm smooth the loss landscape | Enables much larger learning rates ([@ioffe2015batch]; [@santurkar2018does]) |
| Initialization | He/Xavier initialization places $\theta_0$ in a well-behaved region | Bad initialization causes vanishing/exploding gradients |
| Implicit regularization | SGD's trajectory has an inductive bias toward simple solutions | Minimum-norm solutions in linear models ([@gunasekar2018characterizing]) |
| Progressive learning | Warmup + LR decay matches training phases to landscape geometry | Standard in all large-scale training |

**Overparameterization regimes.** In the extreme overparameterization limit, the **Neural Tangent Kernel (NTK)** regime [@jacot2018neural], the network behaves like a linear model in a kernel space defined by the initialization. Training is approximately convex, and the dynamics are well-understood. Real networks operate in a "rich" or "feature learning" regime that is harder to analyze but achieves better performance.

Escaping Bad Regions

| Strategy | How It Works | Used In |
| --- | --- | --- |
| SGD noise | Gradient noise provides random perturbation | Standard training |
| Learning rate warmup | Gradually increases LR to avoid early sharp minima | Transformer pretraining |
| Cyclic LR / restarts | Periodic LR increases to escape local basins | SGDR ([@loshchilov2017sgdr]) |
| Gradient clipping | Bounds gradient magnitude to prevent overshooting | RNNs, transformers |
| Sharpness-aware minimization (SAM) | Minimizes worst-case loss in a neighborhood | Improves generalization ([@foret2021sam]) |
| Stochastic weight averaging (SWA) | Averages weights along the trajectory | Better generalization, flatter minima |
| Exponential moving average (EMA) | Maintains running average of weights | Standard in modern training |
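
As a concrete example of the last row, a minimal EMA-of-weights sketch, where weights are a flat list of floats and a deterministic drift toward a fixed target stands in for real optimizer updates (real frameworks apply the same update per-tensor after each step; `decay` and the target values are arbitrary):

```python
def ema_update(avg, weights, decay=0.999):
    # Standard EMA rule: avg <- decay * avg + (1 - decay) * weights
    return [decay * a + (1 - decay) * w for a, w in zip(avg, weights)]

weights = [0.0, 0.0]
avg = list(weights)
target = [1.0, 2.0]
for step in range(1000):
    # Stand-in for an optimizer step: weights drift toward the target.
    weights = [w + 0.01 * (t - w) for w, t in zip(weights, target)]
    avg = ema_update(avg, weights, decay=0.99)

print(avg)  # smoothed trajectory, lagging slightly behind the raw weights
```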

Learning Rate Schedules

The learning rate is the single most important hyperparameter. It controls the noise scale ($\propto \eta / B$), the convergence speed, and the type of minimum found.

Warmup starts with a small LR and linearly increases to the target:

$$\eta_t = \eta_{\max} \cdot \min\left(\frac{t}{w}, \, 1\right) \quad \text{for } t \leq w$$

Cosine decay smoothly decreases the LR:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi (t - w)}{T - w}\right)\right) \quad \text{for } t > w$$

**Three phases of training:**
  1. Warmup phase ($t < w$): LR ramps from 0 to $\eta_{\max}$. Adam's second-moment estimates ($\hat{v}_t$) are unreliable initially; warmup gives them time to stabilize. The loss drops rapidly during warmup.
  2. Stable phase ($w < t < T_{\text{decay}}$): LR is near $\eta_{\max}$. The model explores the loss landscape, learning features and escaping shallow minima. Most learning happens here.
  3. Decay phase ($t > T_{\text{decay}}$): LR decreases toward $\eta_{\min}$. The optimizer refines its solution, settling into a flat minimum. The loss decreases slowly but generalization improves.

The WSD (Warmup-Stable-Decay) schedule makes this explicit: warmup for 5% of training, constant LR for ~75%, then cosine decay for the final ~20%. This is increasingly popular because it does not require knowing the total training length in advance.
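
The warmup and cosine-decay formulas above combine into a single schedule function; a minimal sketch, where the step counts and learning rates are arbitrary example values:

```python
import math

def lr_schedule(t, T, w, eta_max, eta_min=0.0):
    """Linear warmup for t <= w, then cosine decay to eta_min for w < t <= T."""
    if t <= w:
        return eta_max * t / w
    progress = (t - w) / (T - w)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

T, w = 1000, 100
print(lr_schedule(50, T, w, 1e-3))    # mid-warmup: half the peak LR
print(lr_schedule(100, T, w, 1e-3))   # end of warmup: peak LR eta_max
print(lr_schedule(1000, T, w, 1e-3))  # end of training: decayed to eta_min
```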

| Schedule | Formula | Best For | Key Parameter |
| --- | --- | --- | --- |
| Constant | $\eta_t = \eta_0$ | Fine-tuning, short runs | $\eta_0$ |
| Step decay | $\eta_t = \eta_0 \gamma^{\lfloor t/s \rfloor}$ | CNNs (ResNet) | $\gamma = 0.1$, steps at 30/60/90 epochs |
| Cosine annealing | $\frac{1}{2}\eta_0(1 + \cos(\pi t/T))$ | Pretraining | Min LR, $T$ |
| Warmup + cosine | Linear warmup then cosine | LLM pretraining | Warmup steps $w$ |
| WSD | Warmup $\to$ constant $\to$ cosine | Flexible pretraining | Phase boundaries |
| 1-cycle | Ramp up then down | Fast convergence | Max LR, $T$ |
| Inverse sqrt | $\eta_0 / \sqrt{t}$ | Original Transformer | $\eta_0$ |

Notation Summary

| Symbol | Meaning |
| --- | --- |
| $P$ | Number of model parameters |
| $\lambda_{\max}(H)$ | Largest eigenvalue of the Hessian |
| $\eta, \eta_t$ | Learning rate (possibly time-varying) |
| $w$ | Number of warmup steps |
| $T$ | Total training steps |
| $\eta_{\min}, \eta_{\max}$ | Minimum and maximum learning rate |
| $\gamma$ | Decay factor for step schedule |
| $\kappa$ | Condition number |
| SAM | Sharpness-Aware Minimization |
| SWA | Stochastic Weight Averaging |
| EMA | Exponential Moving Average |
| NTK | Neural Tangent Kernel |

References