RL Algorithms for LLMs
Background
Reinforcement learning from human feedback (RLHF) fine-tunes a pretrained language model to align with human preferences, using a reward signal derived from human judgments (Ouyang et al., 2022). The standard pipeline has three stages:
- Supervised fine-tuning (SFT): Train on high-quality demonstration data to obtain a reference policy $\pi_{\text{ref}}$.
- Reward modeling: Train a scalar reward model $r_\phi(x, y)$ on human preference pairs using the Bradley-Terry model.
- RL optimization: Maximize the expected reward while constraining the policy $\pi_\theta$ to stay close to $\pi_{\text{ref}}$.
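The general objective is:
$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)\Big]$$
where $\beta$ is the KL penalty coefficient.
The reward model is trained by maximizing the log-likelihood of observed preferences:
$$\max_{\phi}\; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
where $y_w$ is the preferred response, $y_l$ the rejected one, and $\sigma$ the sigmoid function.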
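As a minimal sketch of the Bradley-Terry objective, the snippet below evaluates the negative log-likelihood for a few preference pairs. The function names and the scores are hypothetical and chosen only for illustration; a real reward model would produce the scores from a network.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_nll(pairs):
    """Mean negative log-likelihood over (score_preferred, score_rejected) pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical reward-model scores for three preference pairs.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(bradley_terry_nll(pairs))
```

Minimizing this quantity is the same as maximizing the log-likelihood: the loss shrinks as the preferred response's score pulls ahead of the rejected one's.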
REINFORCE: The Foundation
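In the RLHF context, the reward $R(x, y)$ is typically the learned reward model's score minus the KL penalty. (This is why the comparison table later lists REINFORCE as requiring a reward model: the reward signal $R$ is generic in principle, but in the RLHF setting it is supplied by $r_\phi$.)
Derivation. We want $\nabla_\theta J(\theta)$, where $J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[R(x, y)]$. Since $\nabla_\theta \pi_\theta(y \mid x) = \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)$:
$$\nabla_\theta J(\theta) = \sum_{y} R(x, y)\, \nabla_\theta \pi_\theta(y \mid x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]$$
A baseline $b$ that does not depend on $y$ reduces variance without introducing bias (since $\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[\nabla_\theta \log \pi_\theta(y \mid x)] = 0$):
$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[(R(x, y) - b)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]$$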
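The unbiasedness claim can be checked exactly on a toy problem. The sketch below assumes a three-action softmax policy (a stand-in for a language model, with made-up logits and rewards) and sums over all actions instead of sampling, so it shows that the expected gradient is unchanged by the baseline. It does not demonstrate the variance reduction itself.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_policy_gradient(logits, rewards, baseline=0.0):
    """Exact E_a[(R(a) - b) * grad log pi(a)], summed over all actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_prob(probs, a)
        for k in range(len(grad)):
            grad[k] += p * (rewards[a] - baseline) * g[k]
    return grad

logits = [0.2, -0.1, 0.5]   # hypothetical policy parameters
rewards = [1.0, 0.0, 3.0]   # hypothetical reward per action
g_plain = expected_policy_gradient(logits, rewards)
g_base = expected_policy_gradient(logits, rewards, baseline=2.0)
print(g_plain)
print(g_base)  # same vector up to floating-point error
```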
Limitations: High variance (requires many samples per update), on-policy only (samples must be discarded after each parameter update), no trust region (large updates can destabilize training).
PPO (Proximal Policy Optimization)
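The clipped objective is:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t\big)\Big]$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the current policy and the policy that collected the data, $\hat{A}_t$ is the advantage estimate (typically computed via GAE, Generalized Advantage Estimation), and $\epsilon$ (typically $0.1$ to $0.2$) defines the trust region.
- If the advantage $\hat{A}_t > 0$ (action was good): increasing $r_t(\theta)$ beyond $1 + \epsilon$ gets clipped, preventing overly aggressive exploitation.
- If $\hat{A}_t < 0$ (action was bad): decreasing $r_t(\theta)$ below $1 - \epsilon$ gets clipped, preventing the policy from moving too far in a single step.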
This achieves a similar effect to TRPO's hard KL constraint but is much simpler to implement and computationally cheaper.
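The two clipping cases can be seen in a minimal per-token sketch of the clipped surrogate. The function name and the default clipping value of 0.2 are assumptions for illustration; a real implementation would operate on batched tensors and average over tokens.

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate for one token: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Probability ratios above and below the trust region, for good and bad actions.
for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(ratio, adv, ppo_clip_term(ratio, adv))
```

With a positive advantage, the ratio 1.5 is capped at 1.2, so pushing the probability higher earns nothing extra. With a negative advantage, the ratio 0.5 is held at 0.8. In the other two cases the `min` keeps the unclipped term, so moves that make the objective worse are still penalized in full.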
In GAE, $\lambda = 0$ gives the one-step TD advantage (low variance, high bias); $\lambda = 1$ gives the Monte Carlo advantage (high variance, low bias).
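A short sketch of GAE makes the two extremes concrete. It assumes the standard recursion with TD residual $\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)$ and $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$, where $R_t$ is the per-step reward. The function name, the rewards, and the value estimates are hypothetical.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has len(rewards) + 1 entries; the last one is the bootstrap
    value of the state after the final step (0.0 for a finished episode).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical episode: reward only at the final step, as is common for LLMs.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]
print(gae(rewards, values, gamma=1.0, lam=0.0))  # one-step TD residuals
print(gae(rewards, values, gamma=1.0, lam=1.0))  # Monte Carlo return minus V(s_t)
```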
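PPO keeps four models in memory: the policy, the reference policy, the reward model, and the value network. For 7B-parameter models, this means ~28B parameters in memory, or roughly 56 GB for the weights alone in FP16 (2 bytes per parameter). Gradients, optimizer states, and activations for the two trained models (the policy and the value network) add substantially more, which is why multi-GPU setups are the norm in practice. This memory overhead motivates methods like DPO, GRPO, and RLOO that eliminate one or more of these components.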
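The weight-only arithmetic is easy to reproduce. This sketch assumes every model has the same parameter count and is stored in FP16 at 2 bytes per parameter; it deliberately ignores gradients, optimizer states, and activations, which dominate during training. The function name is hypothetical.

```python
def weight_memory_gb(n_models: int, params_per_model: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_models * params_per_model * bytes_per_param / 1e9

print(weight_memory_gb(4, 7e9))  # four 7B models in FP16 -> 56.0
print(weight_memory_gb(2, 7e9))  # two 7B models in FP16 -> 28.0
```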
DPO (Direct Preference Optimization)
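Derivation. The KL-constrained reward maximization objective has a closed-form optimal policy:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big)$$
where $Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\, \exp\big(r(x, y) / \beta\big)$ is the partition function. Solving for the reward:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Substituting this into the Bradley-Terry preference model, noting that $Z(x)$ cancels, and replacing $\pi^*$ with the trainable policy $\pi_\theta$ gives the DPO loss:
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$$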
Key limitation: because it is off-policy, DPO can suffer from distributional shift if the preference data was collected under a very different policy than $\pi_\theta$.
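The gradient of the DPO loss is
$$\nabla_\theta \mathcal{L}_{\text{DPO}}(\theta) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\, \big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big]$$
where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward. The weighting term $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is large when the model incorrectly ranks the pair (assigns higher implicit reward to $y_l$ than $y_w$), focusing learning on mistakes.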
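To illustrate how the loss and the gradient weight behave, here is a minimal sketch for a single preference pair. It assumes the inputs are summed sequence log-probabilities under the policy and the frozen reference model. The function names, the value of beta, and the numbers are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_pair(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return (loss, gradient_weight) for one preference pair."""
    reward_w = beta * (logp_w - ref_logp_w)  # implicit reward of the preferred response
    reward_l = beta * (logp_l - ref_logp_l)  # implicit reward of the rejected response
    margin = reward_w - reward_l
    loss = -log_sigmoid(margin)
    weight = math.exp(log_sigmoid(-margin))  # sigma(reward_l - reward_w)
    return loss, weight

# Hypothetical log-probabilities: the second pair is ranked the wrong way round.
print(dpo_pair(-10.0, -14.0, -12.0, -12.0))
print(dpo_pair(-14.0, -10.0, -12.0, -12.0))
```

The misranked pair receives both the larger loss and the larger weight, which matches the description of the weighting term above.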
GRPO (Group Relative Policy Optimization)
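For a group of $G$ sampled responses with rewards $r_1, \dots, r_G$, the group mean is $\mu_r = \frac{1}{G} \sum_{i=1}^{G} r_i$. The (population) standard deviation is
$$\sigma_r = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (r_i - \mu_r)^2}$$
Omitting any small stabilizing constant in the denominator for clarity, the normalized advantages are
$$\hat{A}_i = \frac{r_i - \mu_r}{\sigma_r}$$
So correct responses get a positive advantage and incorrect ones a negative advantage. Notice that with no value network at all, the group itself supplies the baseline: the policy is pushed up on the responses that beat the group average and down on those that fall below it.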
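The group-relative advantage can be sketched in a few lines. The rewards below are a hypothetical group of four responses with binary correctness scores, and the `eps` argument is an assumed stabilizing constant that many implementations add to the denominator.

```python
import math

def grpo_advantages(rewards, eps=0.0):
    """Normalize rewards by the group mean and population standard deviation."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    if std + eps == 0.0:
        # All rewards are equal: the group carries no learning signal.
        return [0.0] * g
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: responses 1 and 4 are correct, 2 and 3 are not.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

When every response in the group gets the same reward, the standard deviation is zero and the sketch returns zero advantages, so that prompt contributes nothing to the update.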
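The objective uses PPO-style clipping at the token level, averaging over all tokens in each response:
$$J_{\text{GRPO}}(\theta) = \mathbb{E}\Big[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big(\min\big(r_{i,t}(\theta)\, \hat{A}_i,\ \operatorname{clip}(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_i\big) - \beta\, \hat{D}_{\text{KL}}^{(i,t)}\Big)\Big]$$
where $r_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$ is the per-token probability ratio.
The KL penalty is estimated per token as
$$\hat{D}_{\text{KL}}^{(i,t)} = \frac{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - \log \frac{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - 1$$
This uses the identity $\mathbb{E}_{y \sim \pi_\theta}[\pi_{\text{ref}}(y) / \pi_\theta(y)] = 1$, together with the inequality $u - 1 - \log u \ge 0$, for a non-negative, unbiased KL estimate. Note that "unbiased" refers to the estimator's expected value matching the true KL; the gradient of this estimator with respect to $\theta$ is biased relative to the true KL gradient, a known GRPO subtlety.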
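Both properties of the estimator can be verified exactly on a small categorical distribution. The sketch assumes the estimator has the form ratio minus log-ratio minus one, with the ratio taken as reference probability over policy probability; the two distributions are made up for illustration.

```python
import math

def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-sample estimate u - log(u) - 1 with u = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0

pi_theta = [0.7, 0.2, 0.1]  # hypothetical current policy over three tokens
pi_ref = [0.5, 0.3, 0.2]    # hypothetical reference policy

terms = [kl_estimate(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]
expected_estimate = sum(p * t for p, t in zip(pi_theta, terms))
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

print(terms)                        # every per-sample term is non-negative
print(expected_estimate, exact_kl)  # the expectation under pi_theta equals KL(pi_theta || pi_ref)
```

The check covers the value of the estimate only. It says nothing about the gradient, which is where the bias mentioned above arises.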
RLOO (REINFORCE Leave-One-Out)
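For each response $i$ in a group of $G$ samples with rewards $r_1, \dots, r_G$, the leave-one-out baseline is the mean reward of the other responses:
$$b_i = \frac{1}{G - 1} \sum_{j \ne i} r_j$$
The advantages are
$$A_i = r_i - b_i = \frac{G}{G - 1}\, (r_i - \mu_r)$$
where $\mu_r = \frac{1}{G} \sum_{j} r_j$ is the group mean. Compared with GRPO's normalized advantages, RLOO produces the same sign pattern (reward the correct, penalize the incorrect) but does not divide by the group standard deviation, so the magnitudes differ. The key property is that $b_i$ never uses $r_i$, which keeps the estimator unbiased.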
- Unbiased: Each leave-one-out baseline $b_i$ is independent of $y_i$, so the gradient estimator is unbiased.
- Lower variance than REINFORCE: The per-sample baseline captures the difficulty of the prompt, reducing variance from prompt-to-prompt reward differences.
- No learned value function: Like GRPO, avoids the memory cost of a separate value network.
- Simple implementation: Compared to GRPO, RLOO does not clip probability ratios or normalize by standard deviation: it is closer to vanilla REINFORCE with a smart baseline.
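A minimal sketch of the leave-one-out advantage shows how little machinery is involved. The function name and the rewards are hypothetical; each baseline is the mean reward of the other responses in the group.

```python
def rloo_advantages(rewards):
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical group: responses 1 and 4 are correct, 2 and 3 are not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

Each advantage equals the reward's deviation from the group mean scaled by k / (k - 1), so the signs agree with a mean-centered baseline while the sample's own reward stays out of its baseline.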
Comparison
| Algorithm | Reward Model | Value Network | On/Off-Policy | Models in Memory | Key Idea |
|---|---|---|---|---|---|
| REINFORCE | Yes | No | On | 3 | Vanilla policy gradient + baseline |
| PPO | Yes | Yes | On | 4 | Clipped surrogate with trust region |
| DPO | No | No | Off | 2 | Reward = log policy ratio; supervised loss |
| GRPO | Yes | No | On | 3 | Group-normalized advantages with clipping |
| RLOO | Yes | No | On | 3 | Leave-one-out baseline for variance reduction |
Notation Summary
| Symbol | Meaning |
|---|---|
| $\pi_\theta$ | Current policy (the model being trained) |
| $\pi_{\text{ref}}$ | Reference policy (frozen SFT model) |
| $\pi^*$ | Optimal policy under KL-constrained objective |
| $r_\phi(x, y)$ | Reward model score |
| $\beta$ | KL penalty coefficient |
| $\hat{A}$ | Advantage estimate |
| $\epsilon$ | PPO/GRPO clipping parameter |
| $\gamma, \lambda$ | GAE discount and decay factors |
| $y_w, y_l$ | Preferred and rejected responses |
| $\sigma$ | Sigmoid function |
| $G$ | Number of sampled responses per prompt |
| $Z(x)$ | Partition function |