Convex Optimization
Convex optimization is the "easy case" of optimization: convex problems have a rich theory with clean convergence guarantees, efficient algorithms, and no bad local minima. Many ML problems are convex (linear regression, logistic regression, SVMs), and understanding convexity illuminates why non-convex deep learning is harder and what structure makes it tractable.
Convex Sets
Intersections of convex sets are convex. This means constraints like and together define a convex feasible region.
Convex Functions
Geometrically, the chord between any two points on the graph of lies above (or on) the graph.
Equivalent characterizations (for twice-differentiable ):
- First-order condition: for all (the tangent hyperplane is a global underestimator)
- Second-order condition: for all (the Hessian is PSD everywhere)
Proof. Suppose is a local minimum but not global: there exists with . By convexity, for any : . But can be made arbitrarily close to by taking , contradicting local optimality.
| Function | Convex? | Strongly Convex? | Notes |
|---|---|---|---|
| for | Yes | No | Any norm is convex |
| Yes | Yes (if full rank, ) | Linear regression | |
| Yes (in model params) | No | Logistic regression loss | |
| Yes | No | Hinge loss (SVM) | |
| Yes in | No | Cross entropy | |
| ReLU network loss | No | No | Non-convex in weights |
| Yes | No | Exponential | |
| for | Yes | No | Entropy (negated) |
Lipschitz Smoothness
Equivalently, is upper-bounded by a quadratic around any point:
The constant is the largest eigenvalue of the Hessian: .
Strong Convexity
Equivalently:
Strong convexity guarantees a unique global minimum and a lower bound on curvature.
This is one of the key benefits of weight decay beyond regularization: it improves optimization by making the problem better conditioned.
Convergence Guarantees
The bound guarantees . Setting the right-hand side equal to and solving for :
So gradient steps suffice. Notice the dependence on : tightening the target from to multiplies the iteration count by , to . This slow scaling is exactly what Nesterov acceleration improves to , and what strong convexity improves to the much faster .
where is the condition number.
| Setting | Rate (GD) | Rate (Accelerated) | Example |
|---|---|---|---|
| Convex, -smooth | Logistic regression | ||
| -strongly convex, -smooth | Ridge regression | ||
| Convex, non-smooth | Lasso, SVM |
Proximal Methods
The proximal operator generalizes the gradient step to non-smooth functions. For smooth , it reduces to the gradient step: .
This is the basis of ISTA (Iterative Shrinkage-Thresholding Algorithm) for sparse optimization.
Numeric computation. Take with threshold . Applying the formula coordinate by coordinate:
- :
- :
- :
- :
So . Coordinates with magnitude below the threshold are set exactly to zero; the survivors are shrunk toward zero by . This is exactly how L1 regularization induces sparsity.
Duality
Key properties: for convex (biconjugate is the original function), (infimal convolution), and (gradients are inverses).
ML Applications of Convex Optimization
| Problem | Convex? | Strongly Convex? | Solver |
|---|---|---|---|
| Linear regression | Yes | Yes (if ) | Closed-form or CG |
| Ridge regression | Yes | Yes () | Closed-form |
| Logistic regression | Yes | No (only with L2 reg) | L-BFGS, Newton-CG |
| SVM (hard margin) | Yes | Yes | QP solver |
| SVM (soft margin) | Yes | Yes | SMO, LibSVM |
| Lasso | Yes | No | ISTA/FISTA, coordinate descent |
| Matrix factorization | No | No | Alternating minimization |
| Neural networks | No | No | SGD, Adam |
| Transformer training | No | No | AdamW + warmup + cosine |
Notation Summary
| Symbol | Meaning |
|---|---|
| Lipschitz constant of the gradient (-smoothness) | |
| Strong convexity parameter | |
| Condition number | |
| is positive semi-definite | |
| Global minimizer | |
| Optimal value (the minimum of ) | |
| Proximal operator of | |
| Convex conjugate of | |
| Sublinear convergence rate | |
| Linear (exponential) convergence rate |