Convex Optimization

Convex optimization is the "easy case" of optimization: convex problems have a rich theory with clean convergence guarantees, efficient algorithms, and no bad local minima. Many ML problems are convex (linear regression, logistic regression, SVMs), and understanding convexity illuminates why non-convex deep learning is harder and what structure makes it tractable.

Convex Sets

A set $C \subseteq \mathbb{R}^n$ is **convex** if the line segment between any two points in $C$ lies entirely within $C$:

$$\forall x, y \in C, \; \forall \lambda \in [0,1]: \quad \lambda x + (1-\lambda)y \in C$$

**Examples of convex sets:**

- Hyperplanes: $\{x : a^\top x = b\}$
- Half-spaces: $\{x : a^\top x \leq b\}$
- Balls: $\{x : \|x - c\| \leq r\}$ (in any norm)
- Polyhedra: $\{x : Ax \leq b\}$ (intersection of half-spaces)
- Positive semidefinite cone: $\{X \in \mathbb{S}^n : X \succeq 0\}$

Intersections of convex sets are convex. This means constraints like $Ax \leq b$ and $\|x\| \leq c$ together define a convex feasible region.
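A quick numerical sanity check of this fact (a sketch; the matrix `A`, vector `b`, and ball radius `c` are arbitrary illustrative values): sample points feasible for both a polyhedron and a Euclidean ball, and verify that random convex combinations of feasible points stay feasible.

```python
import numpy as np

rng = np.random.default_rng(0)

def feasible(x, A, b, c):
    """Membership in the intersection of {Ax <= b} and {||x|| <= c}."""
    return bool(np.all(A @ x <= b + 1e-9) and np.linalg.norm(x) <= c + 1e-9)

A = np.array([[1.0, 1.0], [-1.0, 2.0]])
b = np.array([1.0, 2.0])
c = 1.5

# Collect feasible points, then check convex combinations of random pairs.
points = [x for x in rng.uniform(-2, 2, size=(2000, 2)) if feasible(x, A, b, c)]
for _ in range(1000):
    x = points[rng.integers(len(points))]
    y = points[rng.integers(len(points))]
    lam = rng.uniform()
    assert feasible(lam * x + (1 - lam) * y, A, b, c)
print("all convex combinations remained feasible")
```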

Convex Functions

A function $f: \mathbb{R}^n \to \mathbb{R}$ is **convex** if for all $x, y$ in its domain and $\lambda \in [0,1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

Geometrically, the chord between any two points on the graph of $f$ lies above (or on) the graph.

Equivalent characterizations (for twice-differentiable $f$):

1. First-order condition: $f(y) \geq f(x) + \nabla f(x)^\top(y - x)$ for all $x, y$ (the tangent hyperplane is a global underestimator)
2. Second-order condition: $\nabla^2 f(x) \succeq 0$ for all $x$ (the Hessian is PSD everywhere)

For a convex function, **every local minimum is a global minimum**. Moreover, the set of global minima is convex (it may be a single point, a line, or a higher-dimensional flat).

Proof. Suppose $x^*$ is a local minimum but not global: there exists $y$ with $f(y) < f(x^*)$. By convexity, for any $\lambda \in (0,1)$: $f(\lambda y + (1-\lambda)x^*) \leq \lambda f(y) + (1-\lambda)f(x^*) < f(x^*)$. But $\lambda y + (1-\lambda)x^*$ can be made arbitrarily close to $x^*$ by taking $\lambda \to 0$, contradicting local optimality. $\square$

This theorem is why convex optimization has clean convergence guarantees: gradient descent cannot get stuck in bad local minima. For non-convex problems (neural networks), this guarantee fails -- but empirically, the local minima found by SGD on overparameterized networks have loss values very close to the global minimum.
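The theorem can be observed directly in a toy problem (a minimal sketch; the quartic $f(x) = x^4 + x^2$ and the step size are illustrative choices): gradient descent started from very different points lands on the same global minimum.

```python
def f(x):
    # A convex function with a unique global minimum at x = 0.
    return x**4 + x**2

def grad(x):
    return 4 * x**3 + 2 * x

# Run gradient descent from several starting points.
minima = []
for x0 in [-5.0, -0.3, 2.0, 7.0]:
    x = x0
    for _ in range(20000):
        x -= 1e-3 * grad(x)
    minima.append(x)

print(minima)  # every start converges to (essentially) the same point
```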
| Function | Convex? | Strongly convex? | Notes |
|---|---|---|---|
| $\|x\|_p$ for $p \geq 1$ | Yes | No | Any norm is convex |
| $\|Ax - b\|_2^2$ | Yes | Yes (if $A$ full rank, $\mu = \sigma_{\min}(A)^2$) | Linear regression |
| $\log(1 + e^{-yz})$ | Yes (in model params) | No | Logistic regression loss |
| $\max(0, 1 - yz)$ | Yes | No | Hinge loss (SVM) |
| $-\sum_i p_i \log q_i$ | Yes in $q$ | No | Cross-entropy |
| ReLU network loss | No | No | Non-convex in weights |
| $e^x$ | Yes | No | Exponential |
| $x \log x$ for $x > 0$ | Yes | No | Entropy (negated) |

Lipschitz Smoothness

A differentiable function $f$ has **$L$-Lipschitz continuous gradient** (is **$L$-smooth**) if:

$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall x, y$$

Equivalently, $f$ is upper-bounded by a quadratic around any point:

$$f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y-x\|^2$$

For twice-differentiable $f$, the smallest such constant is the supremum of the largest Hessian eigenvalue: $L = \sup_x \lambda_{\max}(\nabla^2 f(x))$.

**Connection to learning rate.** $L$-smoothness guarantees that gradient descent with step size $\eta \leq 1/L$ makes monotonic progress: each step decreases the loss by at least $\frac{\eta}{2}\|\nabla f\|^2$. Using $\eta > 1/L$ can cause the loss to increase or diverge. This is the theoretical justification for the common practice of reducing the learning rate when training becomes unstable.
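On a quadratic this threshold behavior is exact and easy to demonstrate (a sketch; $f(x) = \tfrac{L}{2}x^2$ with $L = 10$ is an illustrative choice, and the GD iterate is $x \leftarrow (1 - \eta L)\,x$):

```python
def run_gd(eta, L=10.0, x0=1.0, steps=50):
    # Gradient descent on f(x) = (L/2) x^2, whose gradient is L*x.
    # Each step multiplies x by the factor (1 - eta*L).
    x = x0
    for _ in range(steps):
        x -= eta * L * x
    return abs(x)

print(run_gd(eta=0.1))   # eta = 1/L: |1 - eta*L| = 0, converges
print(run_gd(eta=0.25))  # eta = 2.5/L: |1 - eta*L| = 1.5, diverges
```

The stability boundary for this quadratic is $\eta < 2/L$; any larger step size makes the multiplicative factor exceed 1 in magnitude and the iterates blow up.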

Strong Convexity

A function is **$\mu$-strongly convex** ($\mu > 0$) if the Hessian is bounded below by $\mu I$:

$$\nabla^2 f(x) \succeq \mu I \quad \forall x$$

Equivalently:

$$f(y) \geq f(x) + \nabla f(x)^\top (y-x) + \frac{\mu}{2} \|y - x\|^2$$

Strong convexity guarantees a unique global minimum and a lower bound on curvature.

**L2 regularization creates strong convexity.** Adding $\frac{\lambda}{2}\|\theta\|^2$ to any convex loss makes it $\lambda$-strongly convex (since $\nabla^2(\frac{\lambda}{2}\|\theta\|^2) = \lambda I$). This guarantees:

- A unique global minimum
- Exponential convergence with condition number $\kappa = (L + \lambda)/\lambda$
- Better numerical conditioning

This is one of the key benefits of weight decay beyond regularization: it improves optimization by making the problem better conditioned.
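The Hessian shift is easy to see numerically (a sketch; the matrix shape and $\lambda = 0.5$ are illustrative): for an underdetermined least-squares problem, $X^\top X$ is rank-deficient, but adding $\lambda I$ lifts its smallest eigenvalue to about $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 30))        # more features than samples: X^T X is rank-deficient
H = X.T @ X                          # Hessian of (1/2)||X theta - y||^2 (not strongly convex)
lam = 0.5
H_reg = H + lam * np.eye(30)         # Hessian after adding (lam/2)||theta||^2

print(np.linalg.eigvalsh(H)[0])      # ~0: no strong convexity
print(np.linalg.eigvalsh(H_reg)[0])  # ~lam: now lam-strongly convex
```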

Convergence Guarantees

For a convex, $L$-smooth function, gradient descent with step size $\eta = 1/L$ satisfies:

$$f(x_t) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2t}$$

For a $\mu$-strongly convex, $L$-smooth function, gradient descent with $\eta = 1/L$ converges exponentially:

$$f(x_t) - f(x^*) \leq \left(1 - \frac{\mu}{L}\right)^t \left(f(x_0) - f(x^*)\right) = \left(1 - \frac{1}{\kappa}\right)^t \left(f(x_0) - f(x^*)\right)$$

where $\kappa = L/\mu$ is the condition number.

| Setting | Rate (GD) | Rate (Accelerated) | Example |
|---|---|---|---|
| Convex, $L$-smooth | $O(L/t)$ | $O(L/t^2)$ | Logistic regression |
| $\mu$-strongly convex, $L$-smooth | $O((1-1/\kappa)^t)$ | $O((1-1/\sqrt{\kappa})^t)$ | Ridge regression |
| Convex, non-smooth | $O(1/\sqrt{t})$ | $O(1/\sqrt{t})$ | Lasso, SVM |

**Nesterov acceleration.** For smooth convex problems, Nesterov's accelerated gradient method achieves the rate $O(L/t^2)$ -- a quadratic improvement over gradient descent's $O(L/t)$ -- by evaluating the gradient at a "lookahead" point. For strongly convex problems, the improvement is from $O((1-1/\kappa)^t)$ to $O((1-1/\sqrt{\kappa})^t)$. These rates are provably optimal: no first-order method can converge faster [@nesterov1983method].
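The gap between the two rates shows up dramatically on an ill-conditioned quadratic (a sketch; $\kappa = 1000$ and the constant-momentum variant with $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ are illustrative choices for the strongly convex setting):

```python
import numpy as np

mu, L = 1.0, 1000.0           # f(x) = 0.5 * x^T diag(mu, L) x, kappa = 1000
d = np.array([mu, L])

def grad(x):
    return d * x

def gd(x0, steps, eta):
    x = x0.copy()
    for _ in range(steps):
        x -= eta * grad(x)
    return x

def nesterov(x0, steps, eta):
    # Constant momentum beta = (sqrt(kappa)-1)/(sqrt(kappa)+1) for strongly convex f.
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(steps):
        y = x + beta * (x - x_prev)   # lookahead point
        x_prev = x
        x = y - eta * grad(y)         # gradient step at the lookahead
    return x

x0 = np.ones(2)
print(np.linalg.norm(gd(x0, 500, 1 / L)))        # ~(1 - 1/kappa)^500: still far from 0
print(np.linalg.norm(nesterov(x0, 500, 1 / L)))  # orders of magnitude closer to 0
```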

Proximal Methods

For a convex function $g$ (possibly non-smooth), the **proximal operator** is:

$$\text{prox}_{\eta g}(x) = \arg\min_z \left\{ g(z) + \frac{1}{2\eta}\|z - x\|^2 \right\}$$

The proximal operator generalizes the gradient step to non-smooth functions. For smooth $g$, it reduces to a gradient step: $\text{prox}_{\eta g}(x) = x - \eta \nabla g(x) + O(\eta^2)$.

**Proximal operator for L1.** For $g(\theta) = \lambda\|\theta\|_1$, the proximal operator is the **soft-thresholding** operator:

$$\text{prox}_{\eta \lambda \|\cdot\|_1}(\theta)_i = \text{sign}(\theta_i) \max(|\theta_i| - \eta\lambda, 0)$$

This is the basis of ISTA (Iterative Shrinkage-Thresholding Algorithm) for sparse optimization.
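A minimal ISTA implementation for the lasso (a sketch; the problem sizes, $\lambda = 0.5$, and the sparse ground truth are illustrative): alternate a gradient step on the smooth least-squares term with the soft-thresholding prox of the L1 term.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, steps=500):
    """Minimize 0.5 * ||X theta - y||^2 + lam * ||theta||_1."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth part
    eta = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (X @ theta - y)                          # gradient of smooth term
        theta = soft_threshold(theta - eta * g, eta * lam)  # prox of L1 term
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
true_theta = np.zeros(20)
true_theta[:3] = [2.0, -1.5, 1.0]      # sparse ground truth
y = X @ true_theta + 0.01 * rng.normal(size=50)

theta = ista(X, y, lam=0.5)
print(np.nonzero(np.abs(theta) > 1e-6)[0])  # a sparse support is recovered
```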

Duality

The **convex conjugate** of $f$ is:

$$f^*(y) = \sup_x \left\{ y^\top x - f(x) \right\}$$

Key properties: f=ff^{**} = f for convex ff (biconjugate is the original function), (f+g)=fg(f + g)^* = f^* \star g^* (infimal convolution), and f=(f)1\nabla f^* = (\nabla f)^{-1} (gradients are inverses).

Convex conjugates appear in ML in several places:

- **Variational representation of KL divergence:** $D_{\text{KL}}(p \| q) = \sup_T \{\mathbb{E}_p[T] - \log \mathbb{E}_q[e^T]\}$ (Donsker--Varadhan)
- **f-GAN losses:** Use the conjugate of $f$-divergences to derive trainable discriminator objectives
- **Legendre transform:** Converts between natural and expectation parameterizations of exponential families
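The definition of the conjugate can be checked numerically by taking the supremum over a grid (a sketch; $f(x) = \tfrac{a}{2}x^2$ with $a = 3$ is an illustrative choice, whose conjugate is $f^*(y) = y^2/(2a)$):

```python
import numpy as np

a = 3.0
xs = np.linspace(-50, 50, 200001)  # dense grid standing in for sup over x

def conjugate(y):
    # f*(y) = sup_x { y*x - f(x) } for f(x) = 0.5 * a * x^2
    return np.max(y * xs - 0.5 * a * xs**2)

for y in [-2.0, 0.0, 1.0, 4.0]:
    print(conjugate(y), y**2 / (2 * a))  # grid sup vs. closed form: they agree
```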

ML Applications of Convex Optimization

| Problem | Convex? | Strongly convex? | Solver |
|---|---|---|---|
| Linear regression | Yes | Yes (if $\text{rank}(X) = n$) | Closed-form or CG |
| Ridge regression | Yes | Yes ($\mu = \lambda$) | Closed-form |
| Logistic regression | Yes | No (only with L2 reg) | L-BFGS, Newton-CG |
| SVM (hard margin) | Yes | Yes | QP solver |
| SVM (soft margin) | Yes | Yes | SMO, LibSVM |
| Lasso | Yes | No | ISTA/FISTA, coordinate descent |
| Matrix factorization | No | No | Alternating minimization |
| Neural networks | No | No | SGD, Adam |
| Transformer training | No | No | AdamW + warmup + cosine |

Notation Summary

| Symbol | Meaning |
|---|---|
| $L$ | Lipschitz constant of the gradient ($L$-smoothness) |
| $\mu$ | Strong convexity parameter |
| $\kappa = L/\mu$ | Condition number |
| $H \succeq 0$ | $H$ is positive semidefinite |
| $x^*$ | Global minimizer |
| $f^*$ | Optimal value: $f(x^*)$ |
| $\text{prox}_g$ | Proximal operator of $g$ |
| $f^*(\cdot)$ | Convex conjugate of $f$ |
| $O(1/t)$ | Sublinear convergence rate |
| $O((1-c)^t)$ | Linear (exponential) convergence rate |