RL Algorithms for LLMs

Background

Reinforcement learning from human feedback (RLHF) fine-tunes a pretrained language model $\pi_\theta$ to align with human preferences, using a reward signal derived from human judgments (Ouyang et al., 2022). The standard pipeline has three stages:

  1. Supervised fine-tuning (SFT): Train on high-quality demonstration data to obtain a reference policy $\pi_{\text{ref}}$.
  2. Reward modeling: Train a scalar reward model $r_\phi(x, y)$ on human preference pairs $(y_w \succ y_l | x)$ using the Bradley--Terry model.
  3. RL optimization: Maximize the expected reward while constraining the policy to stay close to $\pi_{\text{ref}}$.

The general objective is:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \, D_{\text{KL}}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right)$$

The KL penalty serves two purposes: (1) it prevents **reward hacking** -- the policy finding degenerate outputs that score high on the (imperfect) reward model but are actually low-quality; (2) it prevents **catastrophic forgetting** of capabilities learned during pretraining. The coefficient $\beta$ controls this tradeoff: larger $\beta$ keeps the policy closer to $\pi_{\text{ref}}$ at the cost of lower reward. The **Bradley--Terry model** assigns a probability that response $y_w$ is preferred over $y_l$ given prompt $x$:

$$P(y_w \succ y_l | x) = \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) = \frac{1}{1 + \exp\!\left(-(r_\phi(x, y_w) - r_\phi(x, y_l))\right)}$$

The reward model is trained by maximizing the log-likelihood of observed preferences:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
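
As a numerical sketch of this loss (plain Python; `r_w` and `r_l` are hypothetical scalar scores standing in for $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$):

```python
import math

def bt_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of one preference pair under Bradley--Terry.

    Equivalent to -log(sigma(r_w - r_l)), written as a numerically
    stable softplus to avoid overflow for large negative margins.
    """
    margin = r_w - r_l
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

loss_correct = bt_loss(2.0, 0.0)   # small: preferred response scores higher
loss_reversed = bt_loss(0.0, 2.0)  # large: ranking is reversed
loss_tied = bt_loss(0.0, 0.0)      # exactly log(2): the model is indifferent
```

Training the reward model amounts to minimizing this quantity averaged over the preference dataset.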

REINFORCE: The Foundation

REINFORCE is the simplest policy gradient method. By the **log-derivative trick** (also called the REINFORCE trick or score function estimator), the gradient of the expected reward is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ \nabla_\theta \log \pi_\theta(y|x) \cdot R(x, y) \right]$$

Derivation. We want $\nabla_\theta \mathbb{E}_{y \sim \pi_\theta}[R(y)]$. Since $\nabla_\theta \pi_\theta(y|x) = \pi_\theta(y|x) \nabla_\theta \log \pi_\theta(y|x)$:

$$\nabla_\theta \mathbb{E}[R] = \nabla_\theta \sum_y \pi_\theta(y|x) R(y) = \sum_y \pi_\theta(y|x) \nabla_\theta \log \pi_\theta(y|x) \cdot R(y) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(y|x) \cdot R(y)\right]$$

A baseline $b(x)$ reduces variance without introducing bias (since $\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot b] = 0$):

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(y|x) \cdot (R(x,y) - b(x)) \right]$$

Limitations: High variance (requires many samples per update), on-policy only (samples must be discarded after each parameter update), no trust region (large updates can destabilize training).
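
A toy illustration of the baseline's effect, assuming a hypothetical one-parameter Bernoulli "policy" over two actions (`grad_estimate` below is just a plain Monte Carlo estimator of the policy gradient, not part of any library):

```python
import math
import random

def grad_estimate(theta, rewards, baseline=0.0, n=50_000, seed=0):
    """Monte Carlo estimate of d/dtheta E[R] for a Bernoulli(sigmoid(theta))
    policy, via the score function: score(a) = a - sigmoid(theta)."""
    rng = random.Random(seed)
    p1 = 1.0 / (1.0 + math.exp(-theta))  # P(action 1)
    total = 0.0
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        total += (a - p1) * (rewards[a] - baseline)
    return total / n

rewards = [0.0, 1.0]
# True gradient at theta = 0 is sigmoid'(0) = 0.25.
g_plain = grad_estimate(0.0, rewards)                # noisy, centered on 0.25
g_base = grad_estimate(0.0, rewards, baseline=0.5)   # same mean, far less noise
```

With the baseline at the mean reward, every sample in this toy contributes exactly the same value, so the variance collapses to zero while the estimator's mean is unchanged.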

PPO (Proximal Policy Optimization)

PPO constrains how much the policy can change per update using a clipped surrogate objective [@schulman2017ppo]. Define the probability ratio between the current and old policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$

The clipped objective is:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta) \, \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, \hat{A}_t \right) \right]$$

where $\hat{A}_t$ is the advantage estimate (typically computed via GAE -- Generalized Advantage Estimation) and $\epsilon$ (typically $0.1$--$0.2$) defines the trust region.

**Why the min and clip?** The $\min$ makes the objective pessimistic:
  • If the advantage $\hat{A}_t > 0$ (action was good): increasing $r_t$ beyond $1 + \epsilon$ gets clipped, preventing overly aggressive exploitation.
  • If $\hat{A}_t < 0$ (action was bad): decreasing $r_t$ below $1 - \epsilon$ gets clipped, preventing the policy from moving too far in a single step.

This achieves a similar effect to TRPO's hard KL constraint but is much simpler to implement and computationally cheaper.
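
A minimal sketch of the clipped term for a single token (plain Python; `ratio` and `advantage` stand in for $r_t(\theta)$ and $\hat{A}_t$):

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """One term of the PPO clipped surrogate:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains from pushing the ratio above 1 + eps are capped,
# so there is no incentive to move past the trust region.
capped_gain = ppo_clip_term(1.5, 1.0)    # 1.2, not 1.5
# Negative advantage: the objective floors at (1 - eps) * A once the ratio
# drops below 1 - eps, removing the incentive for larger steps.
floored_loss = ppo_clip_term(0.5, -1.0)  # -0.8, not -0.5
```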

GAE computes the advantage by exponentially weighting $n$-step temporal difference estimates [@schulman2016gae]:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \text{where } \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda = 0$ gives the one-step TD advantage (low variance, high bias); $\lambda = 1$ gives the Monte Carlo advantage (high variance, low bias).
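
The sum above telescopes into a backward recursion, which is how GAE is sketched here (plain Python; the rewards and value estimates are hypothetical):

```python
def gae(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """GAE advantages for one trajectory via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1},
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: [r_0, ..., r_{T-1}]; values: [V(s_0), ..., V(s_T)]
    (one extra entry to bootstrap the final step).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam = 0 recovers the one-step TD advantage delta_t;
# lam = 1 recovers the Monte Carlo advantage (return minus V(s_t)).
adv_td = gae([1.0, 0.0], [0.5, 0.2, 0.0], gamma=1.0, lam=0.0)
adv_mc = gae([1.0, 0.0], [0.5, 0.2, 0.0], gamma=1.0, lam=1.0)
```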

**In RLHF:** PPO requires maintaining four models in memory simultaneously:

  1. **Policy** $\pi_\theta$ -- the model being trained
  2. **Reference** $\pi_{\text{ref}}$ -- frozen SFT model for KL computation
  3. **Reward model** $r_\phi$ -- evaluates response quality
  4. **Value function** $V_\psi$ -- estimates expected return for advantage computation

For a 7B parameter model, this means ~28B parameters in memory across the four models -- roughly 56 GB for the weights alone in FP16, before optimizer states, gradients, and activations. This memory overhead motivates methods like DPO, GRPO, and RLOO that eliminate one or more of these components.

DPO (Direct Preference Optimization)

DPO eliminates the reward model entirely by deriving a closed-form relationship between the optimal policy and the reward [@rafailov2023dpo].

Derivation. The KL-constrained reward maximization objective has a closed-form optimal policy:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ is the partition function. Solving for the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting this into the Bradley--Terry preference model $P(y_w \succ y_l) = \sigma(r(x,y_w) - r(x,y_l))$ and noting that the $\beta \log Z(x)$ term cancels in the difference:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

DPO reparameterizes the reward as the log-ratio between policy and reference, turning RL into a supervised classification problem on preference pairs. Key advantages:

  - **No reward model** -- eliminates $r_\phi$ from memory
  - **No value function** -- eliminates $V_\psi$ from memory
  - **No sampling during training** -- uses pre-collected preference data (off-policy)
  - **Simpler implementation** -- just a modified cross-entropy loss

Key limitation: because it is off-policy, DPO can suffer from distributional shift if the preference data was collected under a very different policy than $\pi_\theta$.

**DPO gradient analysis.** The gradient of $\mathcal{L}_{\text{DPO}}$ has an intuitive form:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} \propto -\underbrace{\sigma\!\left(\hat{r}_l - \hat{r}_w\right)}_{\text{weighting}} \left[\beta \nabla_\theta \log \pi_\theta(y_w|x) - \beta \nabla_\theta \log \pi_\theta(y_l|x)\right]$$

where $\hat{r}_i = \beta \log \frac{\pi_\theta(y_i|x)}{\pi_{\text{ref}}(y_i|x)}$ is the implicit reward. The weighting term is large when the model incorrectly ranks the pair (assigns higher implicit reward to $y_l$ than $y_w$), focusing learning on mistakes.
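
The loss is cheap to sketch numerically, since it needs only summed token log-probabilities under the policy and the frozen reference (the input values below are hypothetical):

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss for one preference pair, from summed token log-probs
    log pi(y|x) under the policy (logp_*) and reference (logp_ref_*)."""
    r_hat_w = beta * (logp_w - logp_ref_w)  # implicit reward of y_w
    r_hat_l = beta * (logp_l - logp_ref_l)  # implicit reward of y_l
    margin = r_hat_w - r_hat_l
    # -log(sigma(margin)) as a numerically stable softplus
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Below log(2): the policy already ranks y_w above y_l relative to the reference.
good = dpo_loss(logp_w=-10.0, logp_ref_w=-12.0, logp_l=-10.0, logp_ref_l=-9.0)
# Above log(2): the implicit ranking is reversed, so the loss (and the
# gradient weighting) is large.
bad = dpo_loss(logp_w=-10.0, logp_ref_w=-9.0, logp_l=-10.0, logp_ref_l=-12.0)
```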

GRPO (Group Relative Policy Optimization)

GRPO [@shao2024deepseekmath] removes the value network from PPO by estimating advantages from a **group of sampled outputs**. For a prompt $x$, sample $G$ responses $\{y_1, \dots, y_G\}$ from $\pi_{\theta_{\text{old}}}$ and compute rewards $\{r_1, \dots, r_G\}$. The advantage for each response is the group-normalized reward:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\}) + \epsilon}$$
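
This normalization is only a few lines (plain Python; the binary rewards below mimic a verifiable correctness signal):

```python
def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary correctness rewards over G = 4 sampled solutions: correct samples
# get advantage ~ +1, incorrect ones ~ -1; the group mean is the baseline.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```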

The objective uses PPO-style clipping at the token level, averaging over all tokens in each response:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_i, \; \text{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_i \right) \right] - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} | x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} | x, y_{i,<t})}$ is the per-token probability ratio.

**Why group normalization works.** By comparing outputs within a group from the same prompt, GRPO obtains a relative ranking signal without a learned value function. The group mean acts as a prompt-specific baseline, and the standard deviation normalizes for reward scale. This reduces memory from 4 models (PPO) to 2 (policy + reference), at the cost of requiring $G$ forward passes per prompt. GRPO was used to train DeepSeek-Math and DeepSeek-R1 [@shao2024deepseekmath; @deepseekai2025r1]. GRPO typically computes the KL penalty at the token level using an approximation:

$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \approx \frac{1}{|y|} \sum_{t=1}^{|y|} \left[\frac{\pi_{\text{ref}}(y_t | x, y_{<t})}{\pi_\theta(y_t | x, y_{<t})} - \log \frac{\pi_{\text{ref}}(y_t | x, y_{<t})}{\pi_\theta(y_t | x, y_{<t})} - 1\right]$$

This uses the identity $D_{\text{KL}}(p \| q) = \mathbb{E}_p[q/p - \log(q/p) - 1]$ (the expectation is over tokens sampled from $\pi_\theta$), giving an unbiased KL estimate whose every term is non-negative.
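
The estimator can be sketched per sequence (plain Python; the inputs are hypothetical per-token log-probabilities under the policy and the reference):

```python
import math

def kl_estimate(logp_policy, logp_ref):
    """Sequence-averaged estimate of KL(pi_theta || pi_ref) from per-token
    log-probs at tokens sampled from pi_theta, using r - log(r) - 1 with
    r = pi_ref / pi_theta. Each term is non-negative."""
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        log_ratio = lr - lp  # log(pi_ref / pi_theta) at this token
        total += math.exp(log_ratio) - log_ratio - 1.0
    return total / len(logp_policy)

# Identical per-token log-probs give exactly zero; any mismatch,
# in either direction, contributes a positive term.
zero = kl_estimate([-1.0, -2.5], [-1.0, -2.5])
```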

RLOO (REINFORCE Leave-One-Out)

RLOO is a variance reduction technique for REINFORCE that uses a leave-one-out baseline [@ahmadian2024rloo]. Sample $K$ responses per prompt and use the average reward of the *other* $K-1$ responses as the baseline for each:

$$b_i = \frac{1}{K-1} \sum_{j \neq i} r_j$$

$$\nabla_\theta J = \frac{1}{K} \sum_{i=1}^{K} \nabla_\theta \log \pi_\theta(y_i|x) \cdot (r_i - b_i)$$

RLOO has several desirable properties:
  1. Unbiased: Each leave-one-out baseline $b_i$ is independent of $y_i$, so the gradient estimator is unbiased.
  2. Lower variance than REINFORCE: The per-sample baseline captures the difficulty of the prompt, reducing variance from prompt-to-prompt reward differences.
  3. No learned value function: Like GRPO, avoids the memory cost of a separate value network.
  4. Simple implementation: Compared to GRPO, RLOO does not clip probability ratios or normalize by standard deviation -- it is closer to vanilla REINFORCE with a smart baseline.
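
The leave-one-out weights are one line each (plain Python; the binary rewards mimic a correctness check over $K = 4$ samples):

```python
def rloo_weights(rewards):
    """REINFORCE weights (r_i - b_i) with leave-one-out baselines
    b_i = mean of the other K - 1 rewards."""
    K = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (K - 1) for r in rewards]

# Correct answers get positive weight, incorrect ones negative,
# and the weights sum to zero.
w = rloo_weights([1.0, 0.0, 0.0, 1.0])
```

Note that each weight equals $\frac{K}{K-1}(r_i - \bar{r})$, so RLOO's signal differs from GRPO's advantage only by the missing standard-deviation normalization (and the absence of ratio clipping).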

Comparison

| Algorithm | Reward Model | Value Network | On/Off-Policy | Models in Memory | Key Idea |
|---|---|---|---|---|---|
| REINFORCE | Yes | No | On | 2 | Vanilla policy gradient + baseline |
| PPO | Yes | Yes | On | 4 | Clipped surrogate with trust region |
| DPO | No | No | Off | 2 | Reward = log policy ratio; supervised loss |
| GRPO | Yes | No | On | 2 | Group-normalized advantages with clipping |
| RLOO | Yes | No | On | 2 | Leave-one-out baseline for variance reduction |
**Current trends (as of 2025):**

  - **Outcome-based RL** (using verifiable rewards like code execution, math proof checking) is replacing learned reward models for domains where automated verification is possible [@deepseekai2025r1].
  - **GRPO** has become the dominant algorithm for math and code reasoning, where sampling $G$ solutions and checking correctness provides a natural reward signal.
  - **DPO** remains popular for general-purpose alignment where preference data is readily available and the simplicity of offline training is valued.
  - **Online DPO variants** (like OAIF [@guo2024direct]) combine DPO's simplicity with online data collection to mitigate distributional shift.

Notation Summary

| Symbol | Meaning |
|---|---|
| $\pi_\theta$ | Current policy (the model being trained) |
| $\pi_{\text{ref}}$ | Reference policy (frozen SFT model) |
| $\pi^*$ | Optimal policy under KL-constrained objective |
| $r_\phi(x, y)$ | Reward model score |
| $\beta$ | KL penalty coefficient |
| $\hat{A}_t$ | Advantage estimate |
| $\epsilon$ | PPO/GRPO clipping parameter |
| $\gamma, \lambda$ | GAE discount and decay factors |
| $y_w, y_l$ | Preferred and rejected responses |
| $\sigma(\cdot)$ | Sigmoid function |
| $G, K$ | Number of sampled responses per prompt |
| $Z(x)$ | Partition function |

References