
# Constrained Optimization

Many ML problems involve optimizing subject to constraints -- from SVMs with margin constraints to RLHF with KL penalties to fairness constraints in responsible AI. The Lagrangian framework transforms these constrained problems into unconstrained ones, enabling gradient-based solutions.

## The Constrained Problem

The standard form of a constrained optimization problem is:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \; i = 1, \dots, m \quad \text{and} \quad h_j(x) = 0, \; j = 1, \dots, p$$

where $f$ is the objective function, $g_i$ are inequality constraints, and $h_j$ are equality constraints. The set of points satisfying all constraints is the **feasible set**.

**Constrained problems in ML:**

  • **SVM:** $\min \frac{1}{2}\|w\|^2$ s.t. $y_i(w^\top x_i + b) \geq 1$
  • **Trust region:** $\min \nabla f^\top \delta + \frac{1}{2}\delta^\top H \delta$ s.t. $\|\delta\| \leq \Delta$
  • **Simplex constraint:** $\min f(\pi)$ s.t. $\pi \geq 0$, $\sum_i \pi_i = 1$ (e.g., mixture weights)
  • **Spectral norm constraint:** $\min \mathcal{L}(\theta)$ s.t. $\|W_l\|_2 \leq c$ for each layer

## Lagrange Multipliers (Equality Constraints)

For $\min_x f(x)$ subject to $h(x) = 0$ where $h: \mathbb{R}^n \to \mathbb{R}^p$, introduce **Lagrange multipliers** $\nu \in \mathbb{R}^p$ and form the **Lagrangian**:

$$\mathcal{L}(x, \nu) = f(x) + \nu^\top h(x) = f(x) + \sum_{j=1}^p \nu_j h_j(x)$$

The necessary conditions for optimality are:

$$\nabla_x \mathcal{L} = \nabla f(x^*) + \sum_j \nu_j^* \nabla h_j(x^*) = 0 \qquad \text{and} \qquad h(x^*) = 0$$

**Geometric interpretation.** At a constrained optimum $x^*$, the gradient of $f$ must be a linear combination of the constraint gradients. If $\nabla f$ had a component tangent to the constraint surface $h(x) = 0$, we could move along that surface and decrease $f$ -- contradicting optimality. Therefore $\nabla f = -\sum_j \nu_j \nabla h_j$, meaning $\nabla f$ is normal to the constraint surface.

**Sensitivity interpretation.** The Lagrange multiplier $\nu_j^*$ measures the sensitivity of the optimal objective value to the constraint. If we relax the constraint from $h_j(x) = 0$ to $h_j(x) = \epsilon_j$, the optimal cost changes by approximately $-\nu_j^* \epsilon_j$. A large $|\nu_j^*|$ means the constraint is "expensive" -- relaxing it would significantly improve the objective.

**PCA as constrained optimization.** The first principal component maximizes variance subject to a unit-norm constraint:

$$\max_v v^\top \Sigma v \quad \text{s.t. } \|v\|^2 = 1$$

Lagrangian: $\mathcal{L}(v, \lambda) = v^\top \Sigma v - \lambda(v^\top v - 1)$. Setting $\nabla_v \mathcal{L} = 0$: $2\Sigma v = 2\lambda v$, giving $\Sigma v = \lambda v$. The optimal $v$ is the eigenvector with the largest eigenvalue, and the multiplier $\lambda$ equals the maximum variance (the eigenvalue).
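This is easy to check numerically: for a symmetric PSD matrix, the top eigenvector satisfies the stationarity condition and its eigenvalue equals the variance attained at the optimum. A minimal sketch with NumPy (the random matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T  # symmetric PSD matrix standing in for a covariance

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
v = eigvecs[:, -1]   # top eigenvector, unit norm (feasible)
lam = eigvals[-1]    # top eigenvalue = Lagrange multiplier

# Stationarity of the Lagrangian: Sigma v = lam v
assert np.allclose(Sigma @ v, lam * v)
# The multiplier equals the variance attained at the optimum
assert np.isclose(v @ Sigma @ v, lam)
```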

## KKT Conditions (Inequality Constraints)

For the general constrained problem, the **KKT conditions** are necessary for optimality (and sufficient when the problem is convex). At an optimal point $x^*$ with multipliers $\lambda^* \in \mathbb{R}^m$ (inequality) and $\nu^* \in \mathbb{R}^p$ (equality):
  1. Stationarity: $\nabla_x f(x^*) + \sum_{i=1}^m \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(x^*) = 0$
  2. Primal feasibility: $g_i(x^*) \leq 0$ for all $i$; $h_j(x^*) = 0$ for all $j$
  3. Dual feasibility: $\lambda_i^* \geq 0$ for all $i$
  4. Complementary slackness: $\lambda_i^* g_i(x^*) = 0$ for all $i$
**Complementary slackness** is the key structural insight: for each inequality constraint, either:

  • The constraint is **active** ($g_i(x^*) = 0$): the constraint is tight and its multiplier $\lambda_i^*$ can be positive -- the constraint is "pushing"
  • The constraint is **inactive** ($g_i(x^*) < 0$): the constraint is slack and $\lambda_i^* = 0$ -- the constraint is irrelevant at the optimum

In SVMs, the data points with active constraints ($y_i(w^\top x_i + b) = 1$) are the **support vectors** -- they determine the decision boundary. All other points have $\lambda_i = 0$ and do not affect the solution.

**When are KKT conditions sufficient?** For convex problems (convex $f$ and $g_i$, affine $h_j$), the KKT conditions are both necessary and sufficient for global optimality. For non-convex problems, they are only necessary (a KKT point may be a saddle point or local maximum).
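The conditions can be verified numerically on a toy convex problem such as $\min (x_1-1)^2 + (x_2-1)^2$ s.t. $x_1 + x_2 \leq 1$. A sketch using SciPy's SLSQP solver (the specific problem is illustrative; the multiplier is recovered from stationarity at the active constraint):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1)**2 + (x[1] - 1)**2
g = lambda x: x[0] + x[1] - 1             # inequality constraint g(x) <= 0

# SciPy's "ineq" convention requires fun(x) >= 0, so pass -g
res = minimize(f, x0=[0.0, 0.0], method="SLSQP",
               constraints={"type": "ineq", "fun": lambda x: -g(x)})
x_star = res.x                            # ~ [0.5, 0.5]: the constraint is active

# Recover the multiplier from stationarity: grad f + lam * grad g = 0
grad_f = 2 * (x_star - 1)                 # gradient of f at x_star
lam = -grad_f[0]                          # grad g = (1, 1), so lam ~ 1

assert lam >= 0                                                # dual feasibility
assert abs(lam * g(x_star)) < 1e-6                             # complementary slackness
assert np.allclose(grad_f + lam * np.ones(2), 0, atol=1e-4)    # stationarity
```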

## Lagrangian Duality

The **Lagrangian** for the general constrained problem is:

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^p \nu_j h_j(x)$$

The **dual function** is the infimum of the Lagrangian over $x$:

$$d(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$$

The **dual problem** maximizes the dual function: $\max_{\lambda \geq 0, \nu} d(\lambda, \nu)$.

**Weak duality** always holds: $d^* \leq f^*$ (the dual optimal value lower-bounds the primal optimal value). The difference $f^* - d^*$ is the **duality gap**.

**Strong duality** ($d^* = f^*$, zero duality gap) holds when:

  • The problem is convex, and
  • **Slater's condition** is satisfied: there exists a strictly feasible point $\hat{x}$ with $g_i(\hat{x}) < 0$ for all $i$ (strict inequality)
**Why duality is useful:**
  1. Easier problem structure: The dual is always a concave maximization (even when the primal is non-convex), and may have fewer variables or simpler constraints.
  2. Lower bounds: Weak duality gives certificates of (near-)optimality: if a primal-feasible $x$ and dual-feasible $(\lambda, \nu)$ satisfy $f(x) - d(\lambda, \nu) \leq \epsilon$, then $x$ is $\epsilon$-optimal.
  3. Kernel trick: The SVM dual depends on data only through inner products, enabling kernel methods.
  4. Constraint interpretation: Dual variables $\lambda_i^*$ measure the cost of each constraint, guiding which constraints to relax.
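These properties are easy to see on a one-dimensional problem. For $\min x^2$ s.t. $x \geq 1$ (i.e. $g(x) = 1 - x \leq 0$), the Lagrangian $x^2 + \lambda(1-x)$ is minimized at $x = \lambda/2$, giving the closed-form dual $d(\lambda) = \lambda - \lambda^2/4$: concave, below $f^* = 1$ everywhere, and equal to it at $\lambda^* = 2$. A worked sketch:

```python
import numpy as np

# Primal: min x^2  s.t.  g(x) = 1 - x <= 0, so f* = 1 at x* = 1.
# L(x, lam) = x^2 + lam*(1 - x) is minimized at x = lam/2, giving:
d = lambda lam: lam - lam**2 / 4

lams = np.linspace(0.0, 4.0, 401)
f_star = 1.0

assert np.all(d(lams) <= f_star + 1e-12)   # weak duality: d(lam) <= f* for all lam >= 0
lam_star = lams[np.argmax(d(lams))]        # maximizer of the concave dual
assert np.isclose(lam_star, 2.0)           # lam* = 2
assert np.isclose(d(lam_star), f_star)     # strong duality: zero gap (convex + Slater)
```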
**SVM dual derivation.** The primal SVM problem (with slack variables for the soft-margin case):

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0$$

The Lagrangian introduces multipliers $\alpha_i \geq 0$ for the margin constraints. Taking the infimum over $w, b, \xi$ and substituting back yields the dual:

$$\max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \; \sum_i \alpha_i y_i = 0$$

The dual depends on the data only through inner products $x_i^\top x_j$, enabling the **kernel trick**: replace $x_i^\top x_j$ with $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ for an implicit feature map $\phi$.
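The dual above can be solved directly with a generic solver on a small synthetic dataset. A sketch using SciPy's SLSQP (the data, linear kernel, and $C$ are illustrative; in practice a dedicated solver such as SMO would be used):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.5, (10, 2)),   # class -1
               rng.normal(+1.5, 0.5, (10, 2))])  # class +1
y = np.array([-1.0] * 10 + [1.0] * 10)
C = 1.0
K = X @ X.T  # linear kernel; swap in k(x_i, x_j) for the kernel trick

def neg_dual(a):  # negate: we maximize the dual with a minimizer
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

# Complementary slackness: only support vectors get alpha_i > 0
sv = alpha > 1e-6
w = (alpha * y) @ X                  # primal weights via KKT stationarity
assert abs(alpha @ y) < 1e-6         # equality constraint holds
assert sv.sum() < len(y)             # most points are not support vectors
```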

## Penalty Methods and Augmented Lagrangian

In deep learning, hard constraints are often replaced by **penalty terms** added to the loss:
| Constraint | Penalty Approximation | Used In |
|---|---|---|
| $D_{\text{KL}}(\pi \| \pi_{\text{ref}}) \leq \epsilon$ | $\beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ | RLHF (the $\beta$ coefficient) |
| $\|W\|_2 \leq c$ | $\lambda \|W\|_2^2$ or projected gradient | Spectral normalization |
| $\|\theta\| \leq c$ | $\frac{\lambda}{2}\|\theta\|^2$ | Weight decay = soft L2 constraint |
| $\sum_i \pi_i = 1, \; \pi_i \geq 0$ | Softmax reparameterization | Mixture models, attention |
| $\text{fairness}(f) \leq \delta$ | $\lambda \cdot \text{fairness}(f)$ | Fair ML |

The penalty coefficient $\lambda$ (or $\beta$) plays the role of the Lagrange multiplier in the equivalent constrained formulation: under strong duality, tuning $\lambda$ is equivalent to choosing the constraint bound. The **augmented Lagrangian** method goes one step further, combining a multiplier term with a quadratic penalty, $\mathcal{L}_\rho(x, \nu) = f(x) + \nu^\top h(x) + \frac{\rho}{2}\|h(x)\|^2$, and updating the multiplier estimate $\nu \leftarrow \nu + \rho \, h(x)$ between minimizations; this converges without driving $\rho \to \infty$, avoiding the ill-conditioning of pure penalty methods.
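The penalty-constraint correspondence can be seen by increasing the coefficient on a toy problem: for $\min (x-2)^2$ s.t. $x \leq 1$ with quadratic penalty $\lambda \max(0, x-1)^2$, the penalized minimizer is $x_\lambda = (2+\lambda)/(1+\lambda)$, which approaches the constrained optimum $x^* = 1$ as $\lambda$ grows. A sketch (SciPy's scalar minimizer stands in for a gradient-based inner solve):

```python
from scipy.optimize import minimize_scalar

# min (x-2)^2  s.t.  x <= 1; the constrained optimum is x* = 1
xs = []
for lam in [1.0, 10.0, 100.0, 1000.0]:
    penalized = lambda x, lam=lam: (x - 2)**2 + lam * max(0.0, x - 1)**2
    xs.append(minimize_scalar(penalized).x)

# Penalized minimizers approach x* = 1 from outside the feasible set
assert all(a > b for a, b in zip(xs, xs[1:]))   # monotonically decreasing toward 1
assert abs(xs[-1] - 1.0) < 1e-2                 # close to x* for large lambda
```

Note the downside this illustrates: feasibility is only reached in the limit $\lambda \to \infty$, which is why augmented Lagrangian methods add multiplier updates instead of relying on a huge penalty.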

## Projected Gradient Descent

For constrained optimization over a convex set $C$, **projected gradient descent** takes a gradient step and then projects back onto $C$:

$$x_{t+1} = \Pi_C(x_t - \eta \nabla f(x_t)) \quad \text{where} \quad \Pi_C(z) = \arg\min_{x \in C} \|x - z\|^2$$

**Common projections:**

  • **Box constraints** $[a, b]^n$: $\Pi(z) = \text{clip}(z, a, b)$ (elementwise)
  • **Simplex** $\Delta^n$: sort $z$ descending, find threshold $\tau$, set $\Pi(z)_i = \max(z_i - \tau, 0)$ [@duchi2008simplex]
  • **L2 ball** $\|x\| \leq c$: $\Pi(z) = z \cdot \min(1, c/\|z\|)$ (gradient clipping is a projection)
  • **Spectral norm** $\|W\|_2 \leq c$: compute the SVD, clip singular values to $\leq c$
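The simplex projection is the least obvious of these. A minimal sketch of the sort-and-threshold algorithm and its use inside projected gradient descent (the target vector and step size are illustrative):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto {x : x >= 0, sum(x) = 1} (sort-and-threshold)."""
    u = np.sort(z)[::-1]                        # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(z) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]  # last index where the threshold is valid
    tau = css[rho] / (rho + 1)
    return np.maximum(z - tau, 0.0)

# PGD on f(pi) = ||pi - t||^2 over the simplex, with t outside the simplex
t = np.array([0.8, 0.6, -0.2])                  # unconstrained optimum, infeasible
pi = np.ones(3) / 3
for _ in range(100):
    pi = project_simplex(pi - 0.1 * 2 * (pi - t))  # step, then project

assert np.all(pi >= 0) and np.isclose(pi.sum(), 1.0)  # feasible
assert np.allclose(pi, [0.6, 0.4, 0.0], atol=1e-3)    # = projection of t onto the simplex
```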

## ML Applications of Constrained Optimization

| Application | Constraint | Solution Approach |
|---|---|---|
| SVM | $y_i(w^\top x_i + b) \geq 1$ | KKT + kernel trick via dual |
| RLHF | $D_{\text{KL}}(\pi \| \pi_{\text{ref}}) \leq \epsilon$ | Lagrangian relaxation ($\beta$ penalty) |
| Constrained generation | $c(y) \leq \delta$ (e.g., toxicity) | Constrained decoding, penalty |
| Fair ML | Demographic parity or equalized odds | Lagrangian dual, penalty |
| Spectral normalization | $\|W\|_2 \leq 1$ | Power iteration + projection |
| Gradient clipping | $\|g\| \leq c$ | Projection onto L2 ball |
| Weight clipping | $\theta \in [a, b]$ | Projection onto box (WGAN) |
| Simplex weights | $\pi \geq 0$, $\sum \pi_i = 1$ | Softmax reparameterization |
| Orthogonal weights | $W^\top W = I$ | Cayley transform, manifold optimization |

## Notation Summary

| Symbol | Meaning |
|---|---|
| $f(x)$ | Objective function |
| $g_i(x) \leq 0$ | Inequality constraints |
| $h_j(x) = 0$ | Equality constraints |
| $\lambda_i \geq 0$ | Dual variable (Lagrange multiplier) for inequality constraint $i$ |
| $\nu_j$ | Dual variable for equality constraint $j$ |
| $\mathcal{L}(x, \lambda, \nu)$ | Lagrangian function |
| $d(\lambda, \nu)$ | Dual function |
| $d^*, f^*$ | Optimal dual and primal values |
| $\Pi_C$ | Projection operator onto convex set $C$ |
| KKT | Karush--Kuhn--Tucker conditions |