RL Algorithms for LLMs

Background

Reinforcement learning from human feedback (RLHF) fine-tunes a pretrained language model $\pi_\theta$ to align with human preferences, using a reward signal derived from human judgments (Ouyang et al., 2022). The standard pipeline has three stages:

  1. Supervised fine-tuning (SFT): Train on high-quality demonstration data to obtain a reference policy $\pi_{\text{ref}}$.
  2. Reward modeling: Train a scalar reward model $r_\phi(x, y)$ on human preference pairs $(y_w \succ y_l | x)$ using the Bradley--Terry model.
  3. RL optimization: Maximize the expected reward while constraining the policy to stay close to $\pi_{\text{ref}}$.

The general objective is:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \, D_{\text{KL}}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right)$$

The KL penalty serves two purposes: (1) it prevents **reward hacking** -- the policy finding degenerate outputs that score high on the (imperfect) reward model but are actually low-quality; (2) it prevents **catastrophic forgetting** of capabilities learned during pretraining. The coefficient $\beta$ controls this tradeoff: larger $\beta$ keeps the policy closer to $\pi_{\text{ref}}$ at the cost of lower reward. The **Bradley--Terry model** assigns a probability that response $y_w$ is preferred over $y_l$ given prompt $x$:

$$P(y_w \succ y_l | x) = \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) = \frac{1}{1 + \exp\!\left(-(r_\phi(x, y_w) - r_\phi(x, y_l))\right)}$$

The reward model is trained by maximizing the log-likelihood of observed preferences:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
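
As a numerical sketch of this loss (plain Python; `r_w` and `r_l` are hypothetical scalar scores standing in for $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$):

```python
import math

def bt_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of one preference pair under Bradley--Terry.

    Equivalent to -log(sigma(r_w - r_l)), written as a numerically
    stable softplus to avoid overflow for large negative margins.
    """
    margin = r_w - r_l
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

loss_correct = bt_loss(2.0, 0.0)   # small: preferred response scores higher
loss_reversed = bt_loss(0.0, 2.0)  # large: ranking is reversed
loss_tied = bt_loss(0.0, 0.0)      # exactly log(2): the model is indifferent
```

Training the reward model amounts to minimizing this quantity averaged over the preference dataset.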

REINFORCE: The Foundation

REINFORCE is the simplest policy gradient method. By the **log-derivative trick** (also called the REINFORCE trick or score function estimator), the gradient of the expected reward is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ \nabla_\theta \log \pi_\theta(y|x) \cdot R(x, y) \right]$$

Derivation. We want $\nabla_\theta \mathbb{E}_{y \sim \pi_\theta}[R(y)]$. Since $\nabla_\theta \pi_\theta(y|x) = \pi_\theta(y|x) \nabla_\theta \log \pi_\theta(y|x)$:

$$\nabla_\theta \mathbb{E}[R] = \nabla_\theta \sum_y \pi_\theta(y|x) R(y) = \sum_y \pi_\theta(y|x) \nabla_\theta \log \pi_\theta(y|x) \cdot R(y) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(y|x) \cdot R(y)\right]$$

A baseline $b(x)$ reduces variance without introducing bias (since $\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot b] = 0$):

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(y|x) \cdot (R(x,y) - b(x)) \right]$$

Limitations: High variance (requires many samples per update), on-policy only (samples must be discarded after each parameter update), no trust region (large updates can destabilize training).
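
A toy illustration of the baseline's effect, assuming a hypothetical one-parameter Bernoulli "policy" over two actions (`grad_estimate` below is just a plain Monte Carlo estimator of the policy gradient, not part of any library):

```python
import math
import random

def grad_estimate(theta, rewards, baseline=0.0, n=50_000, seed=0):
    """Monte Carlo estimate of d/dtheta E[R] for a Bernoulli(sigmoid(theta))
    policy, via the score function: score(a) = a - sigmoid(theta)."""
    rng = random.Random(seed)
    p1 = 1.0 / (1.0 + math.exp(-theta))  # P(action 1)
    total = 0.0
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        total += (a - p1) * (rewards[a] - baseline)
    return total / n

rewards = [0.0, 1.0]
# True gradient at theta = 0 is sigmoid'(0) = 0.25.
g_plain = grad_estimate(0.0, rewards)                # noisy, centered on 0.25
g_base = grad_estimate(0.0, rewards, baseline=0.5)   # same mean, far less noise
```

With the baseline at the mean reward, every sample in this toy contributes exactly the same value, so the variance collapses to zero while the estimator's mean is unchanged.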

PPO (Proximal Policy Optimization)

PPO constrains how much the policy can change per update using a clipped surrogate objective [@schulman2017ppo]. Define the probability ratio between the current and old policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$

The clipped objective is:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta) \, \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, \hat{A}_t \right) \right]$$

where $\hat{A}_t$ is the advantage estimate (typically computed via GAE -- Generalized Advantage Estimation) and $\epsilon$ (typically $0.1$--$0.2$) defines the trust region.

**Why the min and clip?** The $\min$ makes the objective pessimistic:
  • If the advantage $\hat{A}_t > 0$ (action was good): increasing $r_t$ beyond $1 + \epsilon$ gets clipped, preventing overly aggressive exploitation.
  • If $\hat{A}_t < 0$ (action was bad): decreasing $r_t$ below $1 - \epsilon$ gets clipped, preventing the policy from moving too far in a single step.

This achieves a similar effect to TRPO's hard KL constraint but is much simpler to implement and computationally cheaper.
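
A minimal sketch of the clipped term for a single token (plain Python; `ratio` and `advantage` stand in for $r_t(\theta)$ and $\hat{A}_t$):

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """One term of the PPO clipped surrogate:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains from pushing the ratio above 1 + eps are capped,
# so there is no incentive to move past the trust region.
capped_gain = ppo_clip_term(1.5, 1.0)    # 1.2, not 1.5
# Negative advantage: the objective floors at (1 - eps) * A once the ratio
# drops below 1 - eps, removing the incentive for larger steps.
floored_loss = ppo_clip_term(0.5, -1.0)  # -0.8, not -0.5
```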

GAE computes the advantage by exponentially weighting $n$-step temporal difference estimates [@schulman2016gae]:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \text{where } \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda = 0$ gives the one-step TD advantage (low variance, high bias); $\lambda = 1$ gives the Monte Carlo advantage (high variance, low bias).
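
The sum above telescopes into a backward recursion, which is how GAE is sketched here (plain Python; the rewards and value estimates are hypothetical):

```python
def gae(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """GAE advantages for one trajectory via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1},
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: [r_0, ..., r_{T-1}]; values: [V(s_0), ..., V(s_T)]
    (one extra entry to bootstrap the final step).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam = 0 recovers the one-step TD advantage delta_t;
# lam = 1 recovers the Monte Carlo advantage (return minus V(s_t)).
adv_td = gae([1.0, 0.0], [0.5, 0.2, 0.0], gamma=1.0, lam=0.0)
adv_mc = gae([1.0, 0.0], [0.5, 0.2, 0.0], gamma=1.0, lam=1.0)
```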

**In RLHF:** PPO requires maintaining four models in memory simultaneously:

  1. **Policy** $\pi_\theta$ -- the model being trained
  2. **Reference** $\pi_{\text{ref}}$ -- frozen SFT model for KL computation
  3. **Reward model** $r_\phi$ -- evaluates response quality
  4. **Value function** $V_\psi$ -- estimates expected return for advantage computation

For a 7B parameter model, this means ~28B parameters in memory across the four models -- roughly 56 GB for the weights alone in FP16, before optimizer states, gradients, and activations. This memory overhead motivates methods like DPO, GRPO, and RLOO that eliminate one or more of these components.

DPO (Direct Preference Optimization)

DPO eliminates the reward model entirely by deriving a closed-form relationship between the optimal policy and the reward [@rafailov2023dpo].

Derivation. The KL-constrained reward maximization objective has a closed-form optimal policy:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ is the partition function. Solving for the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting this into the Bradley--Terry preference model $P(y_w \succ y_l) = \sigma(r(x,y_w) - r(x,y_l))$ and noting that the $\beta \log Z(x)$ term cancels in the difference:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

DPO reparameterizes the reward as the log-ratio between policy and reference, turning RL into a supervised classification problem on preference pairs. Key advantages:

  - **No reward model** -- eliminates $r_\phi$ from memory
  - **No value function** -- eliminates $V_\psi$ from memory
  - **No sampling during training** -- uses pre-collected preference data (off-policy)
  - **Simpler implementation** -- just a modified cross-entropy loss

Key limitation: because it is off-policy, DPO can suffer from distributional shift if the preference data was collected under a very different policy than $\pi_\theta$.

**DPO gradient analysis.** The gradient of $\mathcal{L}_{\text{DPO}}$ has an intuitive form:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} \propto -\underbrace{\sigma\!\left(\hat{r}_l - \hat{r}_w\right)}_{\text{weighting}} \left[\beta \nabla_\theta \log \pi_\theta(y_w|x) - \beta \nabla_\theta \log \pi_\theta(y_l|x)\right]$$

where $\hat{r}_i = \beta \log \frac{\pi_\theta(y_i|x)}{\pi_{\text{ref}}(y_i|x)}$ is the implicit reward. The weighting term is large when the model incorrectly ranks the pair (assigns higher implicit reward to $y_l$ than $y_w$), focusing learning on mistakes.
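
The loss is cheap to sketch numerically, since it needs only summed token log-probabilities under the policy and the frozen reference (the input values below are hypothetical):

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss for one preference pair, from summed token log-probs
    log pi(y|x) under the policy (logp_*) and reference (logp_ref_*)."""
    r_hat_w = beta * (logp_w - logp_ref_w)  # implicit reward of y_w
    r_hat_l = beta * (logp_l - logp_ref_l)  # implicit reward of y_l
    margin = r_hat_w - r_hat_l
    # -log(sigma(margin)) as a numerically stable softplus
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Below log(2): the policy already ranks y_w above y_l relative to the reference.
good = dpo_loss(logp_w=-10.0, logp_ref_w=-12.0, logp_l=-10.0, logp_ref_l=-9.0)
# Above log(2): the implicit ranking is reversed, so the loss (and the
# gradient weighting) is large.
bad = dpo_loss(logp_w=-10.0, logp_ref_w=-9.0, logp_l=-10.0, logp_ref_l=-12.0)
```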

GRPO (Group Relative Policy Optimization)

GRPO [@shao2024deepseekmath] removes the value network from PPO by estimating advantages from a **group of sampled outputs**. For a prompt $x$, sample $G$ responses $\{y_1, \dots, y_G\}$ from $\pi_{\theta_{\text{old}}}$ and compute rewards $\{r_1, \dots, r_G\}$. The advantage for each response is the group-normalized reward:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\}) + \epsilon}$$
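
This normalization is only a few lines (plain Python; the binary rewards below mimic a verifiable correctness signal):

```python
def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary correctness rewards over G = 4 sampled solutions: correct samples
# get advantage ~ +1, incorrect ones ~ -1; the group mean is the baseline.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```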

The objective uses PPO-style clipping at the token level, averaging over all tokens in each response:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_i, \; \text{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_i \right) \right] - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} | x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} | x, y_{i,<t})}$ is the per-token probability ratio.

**Why group normalization works.** By comparing outputs within a group from the same prompt, GRPO obtains a relative ranking signal without a learned value function. The group mean acts as a prompt-specific baseline, and the standard deviation normalizes for reward scale. This reduces memory from 4 models (PPO) to 2 (policy + reference), at the cost of requiring $G$ forward passes per prompt. GRPO was used to train DeepSeek-Math and DeepSeek-R1 [@shao2024deepseekmath; @deepseekai2025r1]. GRPO typically computes the KL penalty at the token level using an approximation:

$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \approx \frac{1}{|y|} \sum_{t=1}^{|y|} \left[\frac{\pi_{\text{ref}}(y_t | x, y_{<t})}{\pi_\theta(y_t | x, y_{<t})} - \log \frac{\pi_{\text{ref}}(y_t | x, y_{<t})}{\pi_\theta(y_t | x, y_{<t})} - 1\right]$$

This uses the identity $D_{\text{KL}}(p \| q) = \mathbb{E}_p[q/p - \log(q/p) - 1]$ (the expectation is over tokens sampled from $\pi_\theta$), giving an unbiased KL estimate whose every term is non-negative.
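
The estimator can be sketched per sequence (plain Python; the inputs are hypothetical per-token log-probabilities under the policy and the reference):

```python
import math

def kl_estimate(logp_policy, logp_ref):
    """Sequence-averaged estimate of KL(pi_theta || pi_ref) from per-token
    log-probs at tokens sampled from pi_theta, using r - log(r) - 1 with
    r = pi_ref / pi_theta. Each term is non-negative."""
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        log_ratio = lr - lp  # log(pi_ref / pi_theta) at this token
        total += math.exp(log_ratio) - log_ratio - 1.0
    return total / len(logp_policy)

# Identical per-token log-probs give exactly zero; any mismatch,
# in either direction, contributes a positive term.
zero = kl_estimate([-1.0, -2.5], [-1.0, -2.5])
```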

RLOO (REINFORCE Leave-One-Out)

RLOO is a variance reduction technique for REINFORCE that uses a leave-one-out baseline [@ahmadian2024rloo]. Sample $K$ responses per prompt and use the average reward of the *other* $K-1$ responses as the baseline for each:

$$b_i = \frac{1}{K-1} \sum_{j \neq i} r_j$$

$$\nabla_\theta J = \frac{1}{K} \sum_{i=1}^{K} \nabla_\theta \log \pi_\theta(y_i|x) \cdot (r_i - b_i)$$

RLOO has several desirable properties:
  1. Unbiased: Each leave-one-out baseline $b_i$ is independent of $y_i$, so the gradient estimator is unbiased.
  2. Lower variance than REINFORCE: The per-sample baseline captures the difficulty of the prompt, reducing variance from prompt-to-prompt reward differences.
  3. No learned value function: Like GRPO, avoids the memory cost of a separate value network.
  4. Simple implementation: Compared to GRPO, RLOO does not clip probability ratios or normalize by standard deviation -- it is closer to vanilla REINFORCE with a smart baseline.
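
The leave-one-out weights are one line each (plain Python; the binary rewards mimic a correctness check over $K = 4$ samples):

```python
def rloo_weights(rewards):
    """REINFORCE weights (r_i - b_i) with leave-one-out baselines
    b_i = mean of the other K - 1 rewards."""
    K = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (K - 1) for r in rewards]

# Correct answers get positive weight, incorrect ones negative,
# and the weights sum to zero.
w = rloo_weights([1.0, 0.0, 0.0, 1.0])
```

Note that each weight equals $\frac{K}{K-1}(r_i - \bar{r})$, so RLOO's signal differs from GRPO's advantage only by the missing standard-deviation normalization (and the absence of ratio clipping).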

Comparison

| Algorithm | Reward Model | Value Network | On/Off-Policy | Models in Memory | Key Idea |
|---|---|---|---|---|---|
| REINFORCE | Yes | No | On | 2 | Vanilla policy gradient + baseline |
| PPO | Yes | Yes | On | 4 | Clipped surrogate with trust region |
| DPO | No | No | Off | 2 | Reward = log policy ratio; supervised loss |
| GRPO | Yes | No | On | 2 | Group-normalized advantages with clipping |
| RLOO | Yes | No | On | 2 | Leave-one-out baseline for variance reduction |
**Current trends (as of 2025):**

  - **Outcome-based RL** (using verifiable rewards like code execution, math proof checking) is replacing learned reward models for domains where automated verification is possible [@deepseekai2025r1].
  - **GRPO** has become the dominant algorithm for math and code reasoning, where sampling $G$ solutions and checking correctness provides a natural reward signal.
  - **DPO** remains popular for general-purpose alignment where preference data is readily available and the simplicity of offline training is valued.
  - **Online DPO variants** (like OAIF [@guo2024direct]) combine DPO's simplicity with online data collection to mitigate distributional shift.

Notation Summary

| Symbol | Meaning |
|---|---|
| $\pi_\theta$ | Current policy (the model being trained) |
| $\pi_{\text{ref}}$ | Reference policy (frozen SFT model) |
| $\pi^*$ | Optimal policy under KL-constrained objective |
| $r_\phi(x, y)$ | Reward model score |
| $\beta$ | KL penalty coefficient |
| $\hat{A}_t$ | Advantage estimate |
| $\epsilon$ | PPO/GRPO clipping parameter |
| $\gamma, \lambda$ | GAE discount and decay factors |
| $y_w, y_l$ | Preferred and rejected responses |
| $\sigma(\cdot)$ | Sigmoid function |
| $G, K$ | Number of sampled responses per prompt |
| $Z(x)$ | Partition function |

References