Skip to main content

Convex Optimization

Convex optimization is the "easy case" of optimization: convex problems have a rich theory with clean convergence guarantees, efficient algorithms, and no bad local minima. Many ML problems are convex (linear regression, logistic regression, SVMs), and understanding convexity illuminates why non-convex deep learning is harder and what structure makes it tractable.

Convex Sets

A set $C \subseteq \mathbb{R}^n$ is **convex** if the line segment between any two points in $C$ lies entirely within $C$:

x,yC,  λ[0,1]:λx+(1λ)yC\forall x, y \in C, \; \forall \lambda \in [0,1]: \quad \lambda x + (1-\lambda)y \in C

**Examples of convex sets:** - Hyperplanes: $\{x : a^\top x = b\}$ - Half-spaces: $\{x : a^\top x \leq b\}$ - Balls: $\{x : \|x - c\| \leq r\}$ (in any norm) - Polyhedra: $\{x : Ax \leq b\}$ (intersection of half-spaces) - Positive semidefinite cone: $\{X \in \mathbb{S}^n : X \succeq 0\}$

Intersections of convex sets are convex. This means constraints like AxbAx \leq b and xc\|x\| \leq c together define a convex feasible region.

Convex Functions

A function $f: \mathbb{R}^n \to \mathbb{R}$ is **convex** if for all $x, y$ in its domain and $\lambda \in [0,1]$:

f(λx+(1λ)y)λf(x)+(1λ)f(y)f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)

Geometrically, the chord between any two points on the graph of ff lies above (or on) the graph.

Equivalent characterizations (for twice-differentiable ff):

  1. First-order condition: f(y)f(x)+f(x)(yx)f(y) \geq f(x) + \nabla f(x)^\top(y - x) for all x,yx, y (the tangent hyperplane is a global underestimator)
  2. Second-order condition: 2f(x)0\nabla^2 f(x) \succeq 0 for all xx (the Hessian is PSD everywhere)
For a convex function, **every local minimum is a global minimum**. Moreover, the set of global minima is convex (it may be a single point, a line, or a higher-dimensional flat).

Proof. Suppose xx^* is a local minimum but not global: there exists yy with f(y)<f(x)f(y) < f(x^*). By convexity, for any λ(0,1)\lambda \in (0,1): f(λy+(1λ)x)λf(y)+(1λ)f(x)<f(x)f(\lambda y + (1-\lambda)x^*) \leq \lambda f(y) + (1-\lambda)f(x^*) < f(x^*). But λy+(1λ)x\lambda y + (1-\lambda)x^* can be made arbitrarily close to xx^* by taking λ0\lambda \to 0, contradicting local optimality. \square

This theorem is why convex optimization has clean convergence guarantees: gradient descent cannot get stuck in bad local minima. For non-convex problems (neural networks), this guarantee fails, but empirically the local minima found by SGD on overparameterized networks have loss values very close to the global minimum.
FunctionConvex?Strongly Convex?Notes
xp\|x\|_p for p1p \geq 1YesNoAny norm is convex
Axb22\|Ax - b\|_2^2YesYes (if AA full rank, μ=σmin(A)2\mu = \sigma_{\min}(A)^2)Linear regression
log(1+eyz)\log(1 + e^{-yz})Yes (in model params)NoLogistic regression loss
max(0,1yz)\max(0, 1 - yz)YesNoHinge loss (SVM)
pilogqi-\sum p_i \log q_iYes in qqNoCross entropy
ReLU network lossNoNoNon-convex in weights
exe^xYesNoExponential
xlogxx \log x for x>0x > 0YesNoEntropy (negated)

Lipschitz Smoothness

A differentiable function $f$ has **$L$-Lipschitz continuous gradient** (is **$L$-smooth**) if:

f(x)f(y)Lxyx,y\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall x, y

Equivalently, ff is upper-bounded by a quadratic around any point:

f(y)f(x)+f(x)(yx)+L2yx2f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y-x\|^2

The constant LL is the largest eigenvalue of the Hessian: L=supxλmax(2f(x))L = \sup_x \lambda_{\max}(\nabla^2 f(x)).

**Connection to learning rate.** $L$-smoothness guarantees that gradient descent with step size $\eta \leq 1/L$ makes monotonic progress: each step decreases the loss by at least $\frac{\eta}{2}\|\nabla f\|^2$. Using $\eta > 1/L$ can cause the loss to increase or diverge. This is the theoretical justification for the common practice of reducing the learning rate when training becomes unstable.

Strong Convexity

A function is **$\mu$-strongly convex** ($\mu > 0$) if the Hessian is bounded below by $\mu I$:

2f(x)μIx\nabla^2 f(x) \succeq \mu I \quad \forall x

Equivalently:

f(y)f(x)+f(x)(yx)+μ2yx2f(y) \geq f(x) + \nabla f(x)^\top (y-x) + \frac{\mu}{2} \|y - x\|^2

Strong convexity guarantees a unique global minimum and a lower bound on curvature.

**L2 regularization creates strong convexity.** Adding $\frac{\lambda}{2}\|\theta\|^2$ to any convex loss makes it $\lambda$-strongly convex (since $\nabla^2(\frac{\lambda}{2}\|\theta\|^2) = \lambda I$). This guarantees: - A unique global minimum - Exponential convergence with condition number $\kappa = (L + \lambda)/\lambda$ - Better numerical conditioning

This is one of the key benefits of weight decay beyond regularization: it improves optimization by making the problem better conditioned.

Convergence Guarantees

For a convex, $L$-smooth function, gradient descent with step size $\eta = 1/L$ satisfies:

f(xt)f(x)Lx0x22tf(x_t) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2t}

**How many iterations to reach a target accuracy?** Suppose $f$ is convex and $L$-smooth with $L = 10$, the initial distance to the optimum is $\|x_0 - x^*\| = 2$, and we want $f(x_t) - f(x^*) \leq \epsilon$ with $\epsilon = 10^{-3}$.

The bound guarantees f(xt)f(x)Lx0x22tf(x_t) - f(x^*) \leq \dfrac{L\|x_0 - x^*\|^2}{2t}. Setting the right-hand side equal to ϵ\epsilon and solving for tt:

tLx0x22ϵ=10222103=400.002=20,000.t \geq \frac{L\|x_0 - x^*\|^2}{2\epsilon} = \frac{10 \cdot 2^2}{2 \cdot 10^{-3}} = \frac{40}{0.002} = 20{,}000.

So 20,00020{,}000 gradient steps suffice. Notice the dependence on 1/ϵ1/\epsilon: tightening the target from 10310^{-3} to 10610^{-6} multiplies the iteration count by 10001000, to 2×1072 \times 10^7. This slow O(1/ϵ)O(1/\epsilon) scaling is exactly what Nesterov acceleration improves to O(1/ϵ)O(1/\sqrt{\epsilon}), and what strong convexity improves to the much faster O(κlog(1/ϵ))O(\kappa \log(1/\epsilon)).

For a $\mu$-strongly convex, $L$-smooth function, gradient descent with $\eta = 1/L$ converges exponentially:

f(xt)f(x)(1μL)t(f(x0)f(x))=(11κ)t(f(x0)f(x))f(x_t) - f(x^*) \leq \left(1 - \frac{\mu}{L}\right)^t \left(f(x_0) - f(x^*)\right) = \left(1 - \frac{1}{\kappa}\right)^t \left(f(x_0) - f(x^*)\right)

where κ=L/μ\kappa = L/\mu is the condition number.

SettingRate (GD)Rate (Accelerated)Example
Convex, LL-smoothO(L/t)O(L/t)O(L/t2)O(L/t^2)Logistic regression
μ\mu-strongly convex, LL-smoothO((11/κ)t)O((1-1/\kappa)^t)O((11/κ)t)O((1-1/\sqrt{\kappa})^t)Ridge regression
Convex, non-smoothO(1/t)O(1/\sqrt{t})O(1/t)O(1/\sqrt{t})Lasso, SVM
**Nesterov acceleration.** For smooth convex problems, Nesterov's accelerated gradient method achieves the rate $O(L/t^2)$ (a quadratic improvement over gradient descent's $O(L/t)$) by evaluating the gradient at a "lookahead" point. For strongly convex problems, the improvement is from $O((1-1/\kappa)^t)$ to $O((1-1/\sqrt{\kappa})^t)$. These rates are provably optimal: no first-order method can converge faster [@nesterov1983method].

Proximal Methods

For a convex function $g$ (possibly non-smooth), the **proximal operator** is:

proxηg(x)=argminz{g(z)+12ηzx2}\text{prox}_{\eta g}(x) = \arg\min_z \left\{ g(z) + \frac{1}{2\eta}\|z - x\|^2 \right\}

The proximal operator generalizes the gradient step to non-smooth functions. For smooth gg, it reduces to the gradient step: proxηg(x)=xηg(x)+O(η2)\text{prox}_{\eta g}(x) = x - \eta \nabla g(x) + O(\eta^2).

**Proximal operator for L1.** For $g(\theta) = \lambda\|\theta\|_1$, the proximal operator is the **soft-thresholding** operator:

proxηλ1(θ)i=sign(θi)max(θiηλ,0)\text{prox}_{\eta \lambda \|\cdot\|_1}(\theta)_i = \text{sign}(\theta_i) \max(|\theta_i| - \eta\lambda, 0)

This is the basis of ISTA (Iterative Shrinkage-Thresholding Algorithm) for sparse optimization.

Numeric computation. Take θ=(3,0.4,0.5,2)\theta = (3,\, -0.4,\, 0.5,\, -2) with threshold ηλ=0.5\eta\lambda = 0.5. Applying the formula coordinate by coordinate:

  • θ1=3\theta_1 = 3: sign(3)max(30.5,0)=+12.5=2.5\text{sign}(3)\max(3 - 0.5, 0) = +1 \cdot 2.5 = 2.5
  • θ2=0.4\theta_2 = -0.4: sign(0.4)max(0.40.5,0)=10=0\text{sign}(-0.4)\max(0.4 - 0.5, 0) = -1 \cdot 0 = 0
  • θ3=0.5\theta_3 = 0.5: sign(0.5)max(0.50.5,0)=+10=0\text{sign}(0.5)\max(0.5 - 0.5, 0) = +1 \cdot 0 = 0
  • θ4=2\theta_4 = -2: sign(2)max(20.5,0)=11.5=1.5\text{sign}(-2)\max(2 - 0.5, 0) = -1 \cdot 1.5 = -1.5

So prox0.51(θ)=(2.5,0,0,1.5)\text{prox}_{0.5\|\cdot\|_1}(\theta) = (2.5,\, 0,\, 0,\, -1.5). Coordinates with magnitude below the threshold 0.50.5 are set exactly to zero; the survivors are shrunk toward zero by 0.50.5. This is exactly how L1 regularization induces sparsity.

Duality

The **convex conjugate** of $f$ is:

f(y)=supx{yxf(x)}f^*(y) = \sup_x \left\{ y^\top x - f(x) \right\}

Key properties: f=ff^{**} = f for convex ff (biconjugate is the original function), (f+g)=fg(f + g)^* = f^* \star g^* (infimal convolution), and f=(f)1\nabla f^* = (\nabla f)^{-1} (gradients are inverses).

Convex conjugates appear in ML in several places: - **Variational representation of KL divergence:** $D_{\text{KL}}(p \| q) = \sup_T \{\mathbb{E}_p[T] - \log \mathbb{E}_q[e^T]\}$ (Donsker-Varadhan) - **f-GAN losses:** Use the conjugate of $f$-divergences to derive trainable discriminator objectives - **Legendre transform:** Converts between natural and expectation parameterizations of exponential families

ML Applications of Convex Optimization

ProblemConvex?Strongly Convex?Solver
Linear regressionYesYes (if rank(X)=n\text{rank}(X) = n)Closed-form or CG
Ridge regressionYesYes (μ=λ\mu = \lambda)Closed-form
Logistic regressionYesNo (only with L2 reg)L-BFGS, Newton-CG
SVM (hard margin)YesYesQP solver
SVM (soft margin)YesYesSMO, LibSVM
LassoYesNoISTA/FISTA, coordinate descent
Matrix factorizationNoNoAlternating minimization
Neural networksNoNoSGD, Adam
Transformer trainingNoNoAdamW + warmup + cosine

Notation Summary

SymbolMeaning
LLLipschitz constant of the gradient (LL-smoothness)
μ\muStrong convexity parameter
κ=L/μ\kappa = L/\muCondition number
H0H \succeq 0HH is positive semi-definite
xx^*Global minimizer
f(x)f(x^*)Optimal value (the minimum of ff)
proxg\text{prox}_gProximal operator of gg
f()f^*(\cdot)Convex conjugate of ff
O(1/t)O(1/t)Sublinear convergence rate
O((1c)t)O((1-c)^t)Linear (exponential) convergence rate