# Constrained Optimization
Many ML problems involve optimizing subject to constraints -- from SVMs with margin constraints to RLHF with KL penalties to fairness constraints in responsible AI. The Lagrangian framework transforms these constrained problems into unconstrained ones, enabling gradient-based solutions.
## The Constrained Problem

The general form is

$$
\min_{x \in \mathbb{R}^n} f(x)
\quad \text{subject to} \quad
g_i(x) \le 0,\ i = 1, \dots, m,
\qquad
h_j(x) = 0,\ j = 1, \dots, p,
$$

where $f$ is the objective function, $g_i$ are inequality constraints, and $h_j$ are equality constraints. The set of points satisfying all constraints is the feasible set.
## Lagrange Multipliers (Equality Constraints)

For $\min f(x)$ subject to $h(x) = 0$, the necessary conditions for optimality are:

$$
\nabla f(x^*) + \nu \nabla h(x^*) = 0, \qquad h(x^*) = 0.
$$

Example (maximum-variance direction): maximize $w^\top \Sigma w$ subject to $w^\top w = 1$. Lagrangian: $L(w, \lambda) = w^\top \Sigma w - \lambda (w^\top w - 1)$. Setting $\nabla_w L = 0$: $2 \Sigma w - 2 \lambda w = 0$, giving $\Sigma w = \lambda w$. The optimal $w$ is the eigenvector with the largest eigenvalue. The multiplier $\lambda$ equals the maximum variance (the eigenvalue).
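This variance-maximization example can be checked numerically. The sketch below (NumPy; the covariance matrix is arbitrary illustrative data, not from the text) verifies both the stationarity condition $\Sigma w = \lambda w$ and the claim that the multiplier equals the achieved variance:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
Sigma = A.T @ A / 100                      # sample covariance (symmetric PSD)

# Top eigenvector of Sigma solves: max w^T Sigma w  s.t.  ||w|| = 1
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
lam, w = eigvals[-1], eigvecs[:, -1]

# Stationarity of the Lagrangian: Sigma w = lambda w
assert np.allclose(Sigma @ w, lam * w)

# The multiplier equals the maximum variance
assert np.isclose(w @ Sigma @ w, lam)
```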
## KKT Conditions (Inequality Constraints)

With multipliers $\lambda_i$ for the inequality constraints and $\nu_j$ for the equality constraints, the KKT conditions at a candidate optimum $x^*$ are:

- Stationarity: $\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) + \sum_j \nu_j \nabla h_j(x^*) = 0$
- Primal feasibility: $g_i(x^*) \le 0$ for all $i$; $h_j(x^*) = 0$ for all $j$
- Dual feasibility: $\lambda_i \ge 0$ for all $i$
- Complementary slackness: $\lambda_i \, g_i(x^*) = 0$ for all $i$
In SVMs, the data points with active constraints ($g_i(x^*) = 0$, $\lambda_i > 0$) are the support vectors -- they determine the decision boundary. All other points have $\lambda_i = 0$ and do not affect the solution.
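The four conditions can be checked by hand on a toy problem (my illustrative example, not the text's): minimize $(x - 2)^2$ subject to $x \le 1$, i.e. $g(x) = x - 1 \le 0$. The optimum is $x^* = 1$ with multiplier $\lambda = 2$:

```python
# KKT check for:  min (x-2)^2  s.t.  x - 1 <= 0
x_star, lam = 1.0, 2.0

grad_f = 2 * (x_star - 2)                # f'(x) = 2(x - 2)
grad_g = 1.0                             # g'(x) = 1

assert grad_f + lam * grad_g == 0        # stationarity
assert x_star - 1 <= 0                   # primal feasibility
assert lam >= 0                          # dual feasibility
assert lam * (x_star - 1) == 0           # complementary slackness
```

The constraint is active ($g(x^*) = 0$) and its multiplier is strictly positive, so this point plays the role of a "support vector" in the SVM analogy.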
## Lagrangian Duality

The Lagrangian is $L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$. The dual function is the infimum of the Lagrangian over $x$:

$$
q(\lambda, \nu) = \inf_x L(x, \lambda, \nu).
$$

The dual problem maximizes the dual function: $d^* = \max_{\lambda \ge 0,\, \nu} q(\lambda, \nu)$.
Strong duality ($d^* = p^*$, zero duality gap) holds when:

- The problem is convex, and
- Slater's condition is satisfied: there exists a strictly feasible point $x$ with $g_i(x) < 0$ for all $i$ (strict inequality)
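To make the zero-gap claim concrete, consider a toy convex problem of my choosing: $\min x^2$ subject to $x \ge 1$. The Lagrangian is $L(x, \lambda) = x^2 + \lambda (1 - x)$; minimizing over $x$ gives $x = \lambda / 2$ and dual function $q(\lambda) = \lambda - \lambda^2 / 4$:

```python
import numpy as np

def q(lam):
    # q(lambda) = inf_x [x^2 + lam*(1 - x)], attained at x = lam/2
    return lam - lam**2 / 4

lams = np.linspace(0, 4, 401)
d_star = q(lams).max()         # dual optimum, attained at lambda = 2
p_star = 1.0                   # primal optimum: x* = 1, f(x*) = 1

# The problem is convex and Slater holds (x = 2 is strictly feasible),
# so the duality gap is zero
assert np.isclose(d_star, p_star)
```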
Why work with the dual?

- Easier problem structure: The dual is always a concave maximization (even when the primal is non-convex), and may have fewer variables or simpler constraints.
- Lower bounds: Weak duality gives certificates of (near-)optimality: if a primal-feasible $x$ and dual-feasible $(\lambda, \nu)$ satisfy $f(x) - q(\lambda, \nu) \le \epsilon$, then $x$ is $\epsilon$-optimal.
- Kernel trick: The SVM dual depends on data only through inner products, enabling kernel methods.
- Constraint interpretation: Dual variables measure the marginal cost (shadow price) of each constraint, guiding which constraints to relax.
Example: the hard-margin SVM primal is $\min_{w, b} \tfrac{1}{2} \lVert w \rVert^2$ subject to $y_i (w^\top x_i + b) \ge 1$. The Lagrangian introduces multipliers $\alpha_i \ge 0$ for the margin constraints. Taking the infimum over $w$ and $b$ and substituting back yields the dual:

$$
\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
\quad \text{subject to} \quad
\alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0.
$$

The dual depends on data only through inner products $x_i^\top x_j$, enabling the kernel trick: replace $x_i^\top x_j$ with $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ for an implicit feature map $\phi$.
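A quick check that a kernel really is an inner product in feature space: for the homogeneous quadratic kernel $k(x, z) = (x^\top z)^2$ (my choice for illustration), an explicit feature map is $\phi(x) = \mathrm{vec}(x x^\top)$:

```python
import numpy as np

rng = np.random.default_rng(1)
x, z = rng.standard_normal(3), rng.standard_normal(3)

def phi(v):
    # Explicit quadratic feature map: all pairwise products v_i * v_j
    return np.outer(v, v).ravel()

k_implicit = (x @ z) ** 2        # kernel evaluation: O(d) work
k_explicit = phi(x) @ phi(z)     # feature-space inner product: O(d^2) work

assert np.isclose(k_implicit, k_explicit)
```

The kernel computes the same inner product without ever forming the $d^2$-dimensional features, which is the whole point of the trick.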
## Penalty Methods and Augmented Lagrangian

Instead of enforcing constraints exactly, penalty methods add a term to the objective that punishes violations; the augmented Lagrangian combines such a penalty with explicit multiplier estimates. Common ML instances:
| Constraint | Penalty Approximation | Used In |
|---|---|---|
| $\mathrm{KL}(\pi \parallel \pi_{\mathrm{ref}}) \le \epsilon$ | $-\beta \, \mathrm{KL}(\pi \parallel \pi_{\mathrm{ref}})$ added to the reward | RLHF (the $\beta$ coefficient) |
| $\lVert W \rVert_2 \le 1$ | $W / \sigma(W)$ or projected gradient | Spectral normalization |
| $\lVert w \rVert_2^2 \le c$ | $\lambda \lVert w \rVert_2^2$ added to the loss | Weight decay = soft L2 constraint |
| $\sum_i w_i = 1$, $w_i \ge 0$ | Softmax reparameterization | Mixture models, attention |
| Fairness gap $\le \epsilon$ | $\lambda \cdot (\text{violation})^2$ penalty | Fair ML |
The penalty coefficient ($\beta$ or $\lambda$) corresponds to the Lagrange multiplier in the equivalent constrained formulation: by strong duality, tuning the coefficient is equivalent to choosing the constraint bound $\epsilon$.
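The coefficient-multiplier correspondence can be seen numerically on a toy problem (hypothetical example: $\min (x-2)^2$ s.t. $x \le 1$, whose exact KKT multiplier is $\lambda^* = 2$). With quadratic penalty $\mu \max(0, x - 1)^2$, the penalized minimizer has closed form $x_\mu = (2 + \mu) / (1 + \mu)$, and the penalty gradient $2 \mu (x_\mu - 1)$ converges to $\lambda^*$:

```python
lam_true = 2.0                          # exact KKT multiplier for: min (x-2)^2 s.t. x <= 1
for mu in [1.0, 10.0, 100.0, 1000.0]:
    x_mu = (2 + mu) / (1 + mu)          # argmin of (x-2)^2 + mu*(x-1)^2 on x > 1
    lam_est = 2 * mu * (x_mu - 1)       # penalty gradient plays the role of lambda

assert x_mu > 1                         # penalty solutions are slightly infeasible
assert abs(lam_est - lam_true) < 1e-2   # estimate -> lambda* as mu -> infinity
```

This is the mechanism the augmented Lagrangian exploits: it keeps an explicit running estimate of $\lambda$ so that $\mu$ need not grow unboundedly.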
## Projected Gradient Descent

When the feasible set $C$ admits an efficient projection $\Pi_C$, the constrained problem can be attacked directly: take a gradient step, then project back onto $C$:

$$
x_{k+1} = \Pi_C\big(x_k - \eta \nabla f(x_k)\big).
$$
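A minimal projected gradient descent sketch (my illustrative setup: unit L2-ball feasible set, quadratic objective). For $f(x) = \lVert x - t \rVert^2$ with $t$ outside the ball, the constrained minimizer is the projection of $t$ onto the ball:

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    # Projection onto {x : ||x||_2 <= radius}: rescale if outside the ball
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

target = np.array([3.0, 4.0])              # lies outside the unit ball
x = np.zeros(2)
eta = 0.1
for _ in range(200):
    grad = 2 * (x - target)                # gradient of f(x) = ||x - target||^2
    x = project_l2_ball(x - eta * grad)    # gradient step, then project

# Constrained minimizer is target / ||target|| = [0.6, 0.8]
assert np.allclose(x, [0.6, 0.8], atol=1e-6)
```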
## ML Applications of Constrained Optimization

| Application | Constraint | Solution Approach |
|---|---|---|
| SVM | $y_i (w^\top x_i + b) \ge 1$ | KKT + kernel trick via dual |
| RLHF | $\mathrm{KL}(\pi \parallel \pi_{\mathrm{ref}}) \le \epsilon$ | Lagrangian relaxation ($\beta$ penalty) |
| Constrained generation | Bounded attribute scores (e.g., toxicity) | Constrained decoding, penalty |
| Fair ML | Demographic parity or equalized odds | Lagrangian dual, penalty |
| Spectral normalization | $\lVert W \rVert_2 \le 1$ | Power iteration + projection |
| Gradient clipping | $\lVert g \rVert_2 \le c$ | Projection onto L2 ball |
| Weight clipping | $\lvert w_i \rvert \le c$ | Projection onto box (WGAN) |
| Simplex weights | $\sum_i w_i = 1$, $w_i \ge 0$ | Softmax reparameterization |
| Orthogonal weights | $W^\top W = I$ | Cayley transform, manifold optimization |
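Two of the projections in the table are one-liners. A sketch (threshold values are arbitrary):

```python
import numpy as np

def clip_grad_l2(g, max_norm):
    # Gradient clipping = projection onto the L2 ball of radius max_norm
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

def clip_weights_box(w, c):
    # Weight clipping (WGAN-style) = projection onto the box [-c, c]^n
    return np.clip(w, -c, c)

g = np.array([3.0, 4.0])                   # ||g||_2 = 5, outside the unit ball
assert np.isclose(np.linalg.norm(clip_grad_l2(g, 1.0)), 1.0)

w = np.array([-0.5, 0.2, 0.9])
assert np.all(np.abs(clip_weights_box(w, 0.1)) <= 0.1)
```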
## Notation Summary

| Symbol | Meaning |
|---|---|
| $f(x)$ | Objective function |
| $g_i(x)$ | Inequality constraints |
| $h_j(x)$ | Equality constraints |
| $\lambda_i$ | Dual variable (Lagrange multiplier) for inequality constraint $i$ |
| $\nu_j$ | Dual variable for equality constraint $j$ |
| $L(x, \lambda, \nu)$ | Lagrangian function |
| $q(\lambda, \nu)$ | Dual function |
| $d^*$, $p^*$ | Optimal dual and primal values |
| $\Pi_C$ | Projection operator onto convex set $C$ |
| KKT | Karush--Kuhn--Tucker conditions |