RL Algorithms for LLMs
Background
Reinforcement learning from human feedback (RLHF) fine-tunes a pretrained language model to align with human preferences, using a reward signal derived from human judgments (Ouyang et al., 2022). The standard pipeline has three stages:
- Supervised fine-tuning (SFT): Train on high-quality demonstration data to obtain a reference policy $\pi_{\mathrm{ref}}$.
- Reward modeling: Train a scalar reward model on human preference pairs using the Bradley--Terry model.
- RL optimization: Maximize the expected reward while constraining the policy $\pi_\theta$ to stay close to $\pi_{\mathrm{ref}}$.
The general objective is:

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]$$
The reward model $r_\phi$ is trained by maximizing the log-likelihood of observed preferences:

$$\mathcal{L}_R(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \big]$$
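In code, this loss is just a negative log-sigmoid of the score margin between the chosen and rejected response. A minimal numpy sketch (the function name `bt_loss` and the toy scores are illustrative, not from any library):

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood, averaged over preference pairs."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # -log sigmoid(margin), written stably as log(1 + exp(-margin))
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy scores from a hypothetical reward model on three preference pairs
loss = bt_loss([2.0, 1.5, 0.3], [0.5, 1.0, 0.8])
```

The loss shrinks as the model assigns larger margins to the preferred responses, which is exactly the gradient signal that trains $r_\phi$.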
REINFORCE: The Foundation
Derivation. We want $\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta}[R(y)]$. Since $\nabla_\theta \pi_\theta(y) = \pi_\theta(y)\, \nabla_\theta \log \pi_\theta(y)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[ R(y)\, \nabla_\theta \log \pi_\theta(y) \big]$$
A baseline $b$ reduces variance without introducing bias (since $\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y)] = 0$):

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[ \big(R(y) - b\big)\, \nabla_\theta \log \pi_\theta(y) \big]$$
Limitations: High variance (requires many samples per update), on-policy only (samples must be discarded after each parameter update), no trust region (large updates can destabilize training).
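The variance reduction from a baseline can be checked empirically on a toy problem. The sketch below uses a three-armed bandit with a fixed softmax policy; the setup and the helper `sample_grad` are illustrative, not RLHF-scale code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit with a fixed softmax policy over logits theta
theta = np.array([0.1, 0.2, 0.3])
true_rewards = np.array([1.0, 2.0, 3.0])
probs = np.exp(theta) / np.exp(theta).sum()

def sample_grad(baseline):
    """One REINFORCE gradient sample: (R - b) * grad_theta log pi(a)."""
    a = rng.choice(3, p=probs)
    grad_logp = -probs                 # gradient of log softmax...
    grad_logp[a] += 1.0                # ...is one-hot minus probs
    return (true_rewards[a] - baseline) * grad_logp

# Estimate gradient variance with and without a mean-reward baseline
no_base = np.array([sample_grad(0.0) for _ in range(5000)])
with_base = np.array([sample_grad(true_rewards.mean()) for _ in range(5000)])
```

Both estimators have (approximately) the same mean, but the baselined one has much lower per-component variance, which is the whole point of the baseline.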
PPO (Proximal Policy Optimization)
The clipped objective is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$
where $\hat{A}_t$ is the advantage estimate (typically computed via GAE -- Generalized Advantage Estimation) and $\epsilon$ (typically 0.1--0.2) defines the trust region.
- If the advantage $\hat{A}_t > 0$ (action was good): increasing $r_t(\theta)$ beyond $1 + \epsilon$ gets clipped, preventing overly aggressive exploitation.
- If $\hat{A}_t < 0$ (action was bad): decreasing $r_t(\theta)$ below $1 - \epsilon$ gets clipped, preventing the policy from moving too far in a single step.
This achieves a similar effect to TRPO's hard KL constraint but is much simpler to implement and computationally cheaper.
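The clipping behavior in both cases is easy to see with a few lines of numpy (`ppo_clip_objective` is an illustrative helper, not a library function):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-token PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the min makes the objective a pessimistic (lower) bound
    return np.minimum(unclipped, clipped)

good = ppo_clip_objective(1.5, 1.0)    # A > 0: capped at 1.2
bad = ppo_clip_objective(0.5, -1.0)    # A < 0: capped at -0.8
```

With a positive advantage, any ratio above $1 + \epsilon$ earns no extra objective, so the gradient there is zero; the symmetric cap applies for negative advantages.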
GAE mixes $n$-step advantage estimates with exponential weighting:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $\delta_t$ is the TD error of the value function. $\lambda = 0$ gives the one-step TD advantage (low variance, high bias); $\lambda = 1$ gives the Monte Carlo advantage (high variance, low bias).
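GAE is usually computed with a single backward pass over the trajectory. A small numpy sketch, assuming a finished episode with a bootstrap value of 0 appended after the terminal state:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-recursive GAE; `values` has length T+1 (terminal bootstrap = 0)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 1.0]        # sparse terminal reward, as in RLHF
values = [0.5, 0.6, 0.7, 0.0]    # V(s_t) estimates; V after terminal = 0
adv = gae(rewards, values)
```

Setting `lam=1.0, gamma=1.0` recovers the Monte Carlo advantage `sum(rewards) - values[0]`, and `lam=0.0` recovers the one-step TD error, matching the two limits above.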
PPO for RLHF keeps four models in memory: the policy being trained, the frozen reference policy, the reward model, and the value network. For a 7B-parameter model this means ~28B parameters, or roughly 56 GB in FP16 for the weights alone -- before gradients, optimizer states, and activations, which in practice push training across multiple 80 GB GPUs. This memory overhead motivates methods like DPO, GRPO, and RLOO that eliminate one or more of these components.
DPO (Direct Preference Optimization)
Derivation. The KL-constrained reward maximization objective has a closed-form optimal policy:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \frac{1}{\beta}\, r(x, y) \Big)$$
where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\big( r(x, y) / \beta \big)$ is the partition function. Solving for the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$
Substituting this into the Bradley--Terry preference model and noting that $\beta \log Z(x)$ cancels:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\bigg[ \log \sigma\bigg( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \bigg) \bigg]$$
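A minimal numpy sketch of this loss, assuming per-response log-probabilities have already been summed over tokens (all names and numbers are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed token log-probs of chosen (w) / rejected (l) responses."""
    # Implicit rewards: beta * log(pi_theta / pi_ref)
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    margin = r_w - r_l
    return float(np.mean(np.log1p(np.exp(-margin))))  # -log sigmoid(margin)

# Hypothetical sequence log-probs under the policy and the frozen reference
loss = dpo_loss(np.array([-12.0]), np.array([-15.0]),
                np.array([-13.0]), np.array([-13.5]))
```

Note this is an ordinary supervised loss over a static preference dataset: no sampling from the policy, no reward model, no value network.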
Key limitation: because it is off-policy, DPO can suffer from distribution shift if the preference data was collected under a policy very different from $\pi_{\mathrm{ref}}$.
The gradient of the DPO loss is:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big]$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward. The weighting term $\sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big)$ is large when the model incorrectly ranks the pair (assigns higher implicit reward to $y_l$ than $y_w$), focusing learning on mistakes.
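The behavior of the weighting term can be seen directly. In this sketch, `dpo_grad_weight` is an illustrative helper taking implicit rewards as plain numbers:

```python
import numpy as np

def dpo_grad_weight(r_w, r_l):
    """Sigmoid weight on a pair's gradient: large when the ranking is wrong."""
    return 1.0 / (1.0 + np.exp(-(r_l - r_w)))

correct = dpo_grad_weight(1.0, -1.0)  # model already ranks the pair correctly
wrong = dpo_grad_weight(-1.0, 1.0)    # model ranks the pair backwards
```

Correctly ranked pairs contribute a small gradient, badly ranked ones a large gradient, so training effort concentrates on the model's mistakes.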
GRPO (Group Relative Policy Optimization)
The advantage of each response is its reward normalized within the group of $G$ samples drawn for the same prompt:

$$\hat{A}_i = \frac{R_i - \mathrm{mean}(R_1, \ldots, R_G)}{\mathrm{std}(R_1, \ldots, R_G)}$$

The objective uses PPO-style clipping at the token level, averaging over all tokens in each response:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big( \min\big( r_{i,t}(\theta)\, \hat{A}_i,\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_i \big) - \beta\, \hat{\mathbb{D}}_{\mathrm{KL}}\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big] \Big) \Bigg]$$
where $r_{i,t}(\theta) = \dfrac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}$ is the per-token probability ratio.
The per-token KL penalty is computed with the estimator

$$\hat{\mathbb{D}}_{\mathrm{KL}}\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big] = \frac{\pi_{\mathrm{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - 1$$

This uses the identity $\log u \le u - 1$ for a non-negative, unbiased KL estimate.
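GRPO's two ingredients -- normalizing rewards within each group and the non-negative per-token KL estimate -- are each a few lines of numpy (function names are illustrative):

```python
import numpy as np

def group_advantages(rewards):
    """Normalize rewards within a group of G samples for one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards zero-variance groups

def kl_estimate(logp_theta, logp_ref):
    """Per-token estimator exp(logr) - logr - 1, with logr = log pi_ref - log pi_theta.
    Non-negative for every sample; unbiased under samples from pi_theta."""
    logr = logp_ref - logp_theta
    return np.exp(logr) - logr - 1.0

adv = group_advantages([1.0, 0.0, 0.0, 1.0])           # e.g. pass/fail rewards
kl = kl_estimate(np.array([-1.2, -0.7]), np.array([-1.0, -0.9]))
```

The group advantages always average to zero, so within each group the better-than-average responses are pushed up and the rest pushed down.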
RLOO (REINFORCE Leave-One-Out)
For each prompt, RLOO samples $k$ responses and uses the mean reward of the other $k - 1$ as a per-sample baseline:

$$\nabla_\theta J(\theta) = \frac{1}{k} \sum_{i=1}^{k} \Big( R(y_i) - \frac{1}{k - 1} \sum_{j \neq i} R(y_j) \Big)\, \nabla_\theta \log \pi_\theta(y_i \mid x)$$

- Unbiased: Each leave-one-out baseline is independent of $y_i$, so the gradient estimator is unbiased.
- Lower variance than REINFORCE: The per-sample baseline captures the difficulty of the prompt, reducing variance from prompt-to-prompt reward differences.
- No learned value function: Like GRPO, avoids the memory cost of a separate value network.
- Simple implementation: Compared to GRPO, RLOO does not clip probability ratios or normalize by standard deviation -- it is closer to vanilla REINFORCE with a smart baseline.
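The leave-one-out baseline can be computed for all $k$ samples at once (a numpy sketch; `rloo_advantages` is an illustrative name):

```python
import numpy as np

def rloo_advantages(rewards):
    """Advantage of each of k samples: R_i minus the mean of the other k-1 rewards."""
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    loo_mean = (r.sum() - r) / (k - 1)  # leave-one-out baseline, vectorized
    return r - loo_mean

adv = rloo_advantages([3.0, 1.0, 2.0])
```

The advantages always sum to zero across the group, and unlike GRPO there is no division by the group standard deviation and no ratio clipping.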
Comparison
| Algorithm | Reward Model | Value Network | On/Off-Policy | Models in Memory | Key Idea |
|---|---|---|---|---|---|
| REINFORCE | Yes | No | On | 2 | Vanilla policy gradient + baseline |
| PPO | Yes | Yes | On | 4 | Clipped surrogate with trust region |
| DPO | No | No | Off | 2 | Reward = log policy ratio; supervised loss |
| GRPO | Yes | No | On | 3 | Group-normalized advantages with clipping |
| RLOO | Yes | No | On | 2 | Leave-one-out baseline for variance reduction |
Notation Summary
| Symbol | Meaning |
|---|---|
| $\pi_\theta$ | Current policy (the model being trained) |
| $\pi_{\mathrm{ref}}$ | Reference policy (frozen SFT model) |
| $\pi^*$ | Optimal policy under KL-constrained objective |
| $r_\phi(x, y)$ | Reward model score |
| $\beta$ | KL penalty coefficient |
| $\hat{A}$ | Advantage estimate |
| $\epsilon$ | PPO/GRPO clipping parameter |
| $\gamma$, $\lambda$ | GAE discount and decay factors |
| $y_w$, $y_l$ | Preferred and rejected responses |
| $\sigma$ | Sigmoid function |
| $G$, $k$ | Number of sampled responses per prompt (GRPO / RLOO) |
| $Z(x)$ | Partition function |