Convex optimization is the "easy case" of optimization: convex problems have a rich theory with clean convergence guarantees, efficient algorithms, and no bad local minima. Many ML problems are convex (linear regression, logistic regression, SVMs), and understanding convexity illuminates why non-convex deep learning is harder and what structure makes it tractable.
**First-order condition:** $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$ (the tangent hyperplane is a global underestimator).
**Second-order condition:** $\nabla^2 f(x) \succeq 0$ for all $x$ (the Hessian is PSD everywhere).
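As a numerical spot-check of the first-order condition (an illustrative sketch, not part of any proof), take the convex softplus function $f(x) = \log(1 + e^x)$, whose derivative is the sigmoid, and verify that every tangent line underestimates the function on a grid:

```python
import numpy as np

# Spot-check the first-order condition for the convex softplus
# f(x) = log(1 + e^x); its derivative is the sigmoid.
f = lambda x: np.log1p(np.exp(x))
fprime = lambda x: 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 101)
# Tangent-line gap f(y) - [f(x) + f'(x)(y - x)] should be >= 0 everywhere.
min_gap = min(
    f(y) - (f(x) + fprime(x) * (y - x))
    for x in xs for y in xs
)
```

The minimum gap over the grid is zero (attained at $y = x$), confirming the tangent line never rises above the graph.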
For a convex function, **every local minimum is a global minimum**. Moreover, the set of global minima is convex (it may be a single point, a line, or a higher-dimensional flat).
**Proof.** Suppose $x^*$ is a local minimum but not global: there exists $y$ with $f(y) < f(x^*)$. By convexity, for any $\lambda \in (0, 1)$: $f(\lambda y + (1 - \lambda) x^*) \leq \lambda f(y) + (1 - \lambda) f(x^*) < f(x^*)$. But $\lambda y + (1 - \lambda) x^*$ can be made arbitrarily close to $x^*$ by taking $\lambda \to 0$, contradicting local optimality. $\square$
This theorem is why convex optimization has clean convergence guarantees: gradient descent cannot get stuck in bad local minima. For non-convex problems (neural networks), this guarantee fails -- but empirically, the local minima found by SGD on overparameterized networks have loss values very close to the global minimum.
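A minimal sketch of the theorem in action, using synthetic data (all names and constants here are illustrative): logistic regression has a convex loss, so gradient descent reaches the same global minimum value from very different initializations.

```python
import numpy as np

# Convexity in action: gradient descent on a logistic-regression loss
# converges to the same minimum value from wildly different starts.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Noisy labels so the data is non-separable and the minimum is finite.
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=50) > 0).astype(float)
sign = 2 * y - 1  # +/-1 labels

def loss(w):
    # Numerically stable mean logistic loss: mean log(1 + exp(-s * Xw)).
    return np.mean(np.logaddexp(0.0, -sign * (X @ w)))

def grad(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted probabilities
    return X.T @ (p - y) / len(y)

final_losses = []
for _ in range(5):
    w = rng.normal(scale=3.0, size=3)  # very different starting points
    for _ in range(3000):
        w -= 0.3 * grad(w)
    final_losses.append(loss(w))

# All runs land at (numerically) the same global minimum value.
spread = max(final_losses) - min(final_losses)
```

For a non-convex loss, the same experiment could end at distinct loss values depending on the starting point.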
A differentiable function $f$ has **$L$-Lipschitz continuous gradient** (is **$L$-smooth**) if:
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall x, y$$
Equivalently, $f$ is upper-bounded by a quadratic around any point:

$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$$
For twice-differentiable $f$, the smallest valid constant is the largest eigenvalue the Hessian attains: $L = \sup_x \lambda_{\max}(\nabla^2 f(x))$.
**Connection to learning rate.** $L$-smoothness guarantees that gradient descent with step size $\eta \leq 1/L$ makes monotonic progress: each step decreases the loss by at least $\frac{\eta}{2}\|\nabla f\|^2$. Using $\eta > 1/L$ can cause the loss to increase or diverge. This is the theoretical justification for the common practice of reducing the learning rate when training becomes unstable.
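This threshold is easy to see on a quadratic, where $L$ is the largest Hessian eigenvalue. The sketch below (illustrative constants) shows $\eta = 1/L$ decreasing the loss monotonically, while a step size beyond the $2/L$ stability limit diverges:

```python
import numpy as np

# f(x) = 0.5 x^T A x has Hessian A, so L = lambda_max(A) = 10 here.
A = np.diag([1.0, 10.0])
L = 10.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def run(eta, steps=50):
    x = np.array([1.0, 1.0])
    losses = [f(x)]
    for _ in range(steps):
        x = x - eta * grad(x)
        losses.append(f(x))
    return losses

stable = run(1.0 / L)      # eta = 1/L: descent lemma applies
unstable = run(2.5 / L)    # beyond 2/L: iterates blow up on a quadratic

monotone = all(b <= a for a, b in zip(stable, stable[1:]))
diverged = unstable[-1] > unstable[0]
```

Between $1/L$ and $2/L$ the quadratic still converges but the per-step descent guarantee no longer holds, which is why practical advice centers on $\eta \leq 1/L$.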
A function is **$\mu$-strongly convex** ($\mu > 0$) if the Hessian is bounded below by $\mu I$:
$$\nabla^2 f(x) \succeq \mu I \quad \forall x$$
Equivalently:
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$$
Strong convexity guarantees a unique global minimum and a lower bound on curvature.
**L2 regularization creates strong convexity.** Adding $\frac{\lambda}{2}\|\theta\|^2$ to any convex loss makes it $\lambda$-strongly convex (since $\nabla^2(\frac{\lambda}{2}\|\theta\|^2) = \lambda I$). This guarantees:
- A unique global minimum
- Exponential convergence with condition number $\kappa = (L + \lambda)/\lambda$
- Better numerical conditioning
This is one of the key benefits of weight decay beyond regularization: it improves optimization by making the problem better conditioned.
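The conditioning effect can be checked directly on a least-squares problem with a nearly redundant feature (synthetic data, illustrative $\lambda$): adding $\frac{\lambda}{2}\|w\|^2$ shifts the Hessian from $X^\top X$ to $X^\top X + \lambda I$, lifting the smallest eigenvalue to at least $\lambda$ and collapsing the condition number.

```python
import numpy as np

# Ill-conditioned design: the last column nearly duplicates the first.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
X[:, -1] = X[:, 0] + 1e-3 * rng.normal(size=100)

H = X.T @ X                      # Hessian of 0.5 * ||Xw - y||^2
lam = 1.0
H_reg = H + lam * np.eye(10)     # Hessian after adding (lam/2)||w||^2

kappa = np.linalg.cond(H)        # huge: near-singular direction
kappa_reg = np.linalg.cond(H_reg)
mu_reg = np.linalg.eigvalsh(H_reg)[0]  # >= lam: strong convexity
```

The regularized Hessian's smallest eigenvalue is bounded below by $\lambda$, so gradient descent converges linearly instead of crawling along the near-flat direction.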
**Nesterov acceleration.** For smooth convex problems, Nesterov's accelerated gradient method achieves the rate $O(L/t^2)$ -- a quadratic improvement over gradient descent's $O(L/t)$ -- by evaluating the gradient at a "lookahead" point. For strongly convex problems, the improvement is from $O((1-1/\kappa)^t)$ to $O((1-1/\sqrt{\kappa})^t)$. These rates are provably optimal: no first-order method can converge faster [@nesterov1983method].
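The $\sqrt{\kappa}$ speedup is dramatic on ill-conditioned problems. A minimal sketch on a diagonal quadratic with $\kappa = 1000$, using the constant-momentum form of Nesterov's method for strongly convex objectives (all constants illustrative):

```python
import numpy as np

# Diagonal quadratic f(x) = 0.5 * sum(d_i * x_i^2) with kappa = L/mu = 1000.
mu, L = 1.0, 1000.0
d = np.geomspace(mu, L, 20)          # Hessian eigenvalues
f = lambda x: 0.5 * np.sum(d * x * x)
grad = lambda x: d * x

x0 = np.ones(20)
steps = 200

# Plain gradient descent, step size 1/L.
x = x0.copy()
for _ in range(steps):
    x = x - grad(x) / L
gd_loss = f(x)

# Nesterov: gradient at the lookahead point, momentum from sqrt(kappa).
kappa = L / mu
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
x, x_prev = x0.copy(), x0.copy()
for _ in range(steps):
    v = x + beta * (x - x_prev)      # lookahead point
    x_prev = x
    x = v - grad(v) / L
nest_loss = f(x)
```

After 200 steps, plain gradient descent has barely moved along the low-curvature directions ($(1 - 1/\kappa)^{200} \approx 0.82$ per coordinate), while the accelerated iterate is orders of magnitude closer to the optimum.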
The **convex conjugate** of $f$ is $f^*(y) = \sup_x \{y^\top x - f(x)\}$. Key properties: $f^{**} = f$ for closed convex $f$ (the biconjugate recovers the original function), $(f + g)^* = f^* \star g^*$ (infimal convolution), and $\nabla f^* = (\nabla f)^{-1}$ for strictly convex, differentiable $f$ (gradients are inverse maps).
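These properties can be checked numerically by taking the supremum over a grid. For the quadratic $f(x) = \frac{a}{2}x^2$ the conjugate has the closed form $f^*(y) = y^2/(2a)$, and the gradient maps $x \mapsto ax$ and $y \mapsto y/a$ are inverses (a sketch with an illustrative grid, not a general-purpose routine):

```python
import numpy as np

# f(x) = (a/2) x^2 on a grid; its conjugate is f*(y) = y^2 / (2a).
a = 2.0
xs = np.linspace(-20.0, 20.0, 4001)
f_vals = 0.5 * a * xs**2

def conjugate(vals, grid, ys):
    # f*(y) = sup over the grid of (y * x - f(x))
    return np.array([np.max(y * grid - vals) for y in ys])

ys = np.linspace(-5.0, 5.0, 11)
f_star = conjugate(f_vals, xs, ys)
closed_form = ys**2 / (2 * a)
err = np.max(np.abs(f_star - closed_form))  # small grid-discretization error
```

The grid conjugate matches the closed form up to discretization error; applying the same construction to $f^*$ would recover $f$, illustrating $f^{**} = f$.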
Convex conjugates appear in ML in several places:
- **Variational representation of KL divergence:** $D_{\text{KL}}(p \| q) = \sup_T \{\mathbb{E}_p[T] - \log \mathbb{E}_q[e^T]\}$ (Donsker--Varadhan)
- **f-GAN losses:** Use the conjugate of $f$-divergences to derive trainable discriminator objectives
- **Legendre transform:** Converts between natural and expectation parameterizations of exponential families
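The Donsker--Varadhan representation is easy to verify on small discrete distributions: every test function $T$ gives a lower bound on the KL divergence, with equality at the optimal witness $T^* = \log(p/q)$ (a self-contained sketch with illustrative distributions):

```python
import numpy as np

# DV bound: E_p[T] - log E_q[e^T] <= KL(p||q) for any T,
# with equality at T* = log(p/q). Checked on discrete p, q.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
kl = np.sum(p * np.log(p / q))

def dv_bound(T):
    return np.sum(p * T) - np.log(np.sum(q * np.exp(T)))

dv_opt = dv_bound(np.log(p / q))        # optimal witness: equals KL

rng = np.random.default_rng(0)
bounds = [dv_bound(T) for T in rng.normal(size=(100, 3))]
gap = kl - max(bounds)                  # every random T stays below KL
```

This is exactly the structure exploited by neural KL estimators (e.g. MINE-style objectives), where $T$ is parameterized by a network and the bound is maximized by gradient ascent.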