RLHF Explained: How We Teach AI to Be Nice
ChatGPT isn't just smart; it's polite. How? Reinforcement Learning from Human Feedback (RLHF).
TLDR: A raw LLM is a super-smart parrot that has read the entire internet, including its worst parts. RLHF (Reinforcement Learning from Human Feedback) is the training pipeline that transforms it from a pattern-matching engine into an assistant that is helpful, harmless, and honest.
The Parrot Who Read Everything
Imagine a parrot that has read every book, forum post, Reddit thread, and dark corner of the web. Ask it anything and it can produce text. But it might:
- Answer in the style of a conspiracy forum.
- Generate offensive content because that's statistically common in its training data.
- Give a confident-sounding wrong answer because wrong answers also appear in training data.
RLHF is the rehabilitation process that teaches this parrot which outputs humans actually prefer.
The Three Stages of RLHF
```mermaid
flowchart LR
    SFT["Stage 1: SFT\nSupervised Fine-Tuning\nHuman writes ideal answers\n→ imitation learning"]
    RM["Stage 2: Reward Model\nHumans rank A vs B\n→ train a preference predictor"]
    PPO["Stage 3: PPO\nRL with KL penalty\n→ optimize policy to maximize reward"]
    SFT --> RM --> PPO
```
Stage 1: Supervised Fine-Tuning (SFT)
Human labelers write high-quality answers to a sample of prompts, and the base LLM is fine-tuned to imitate this behavior. This produces the SFT Policy (π_SFT), the "before RLHF" model.
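Under the hood, the SFT objective is plain next-token cross-entropy, computed only over the labeler-written response tokens (the prompt tokens are masked out). A minimal NumPy sketch with a toy 4-token vocabulary and invented logits:

```python
import numpy as np

def sft_loss(logits, targets, response_mask):
    """Cross-entropy averaged over response tokens only; prompt tokens are masked."""
    # numerically stable softmax over the vocabulary axis
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    token_nll = -np.log(probs[np.arange(len(targets)), targets])
    return (token_nll * response_mask).sum() / response_mask.sum()

# toy example: 3 positions, only the last two belong to the ideal response
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1],
                   [0.1, 0.1, 2.5, 0.1]])
targets = np.array([0, 1, 2])
mask = np.array([0.0, 1.0, 1.0])  # position 0 is the prompt
loss = sft_loss(logits, targets, mask)
```

This is imitation learning in its simplest form: the model is rewarded for reproducing the labeler's tokens, nothing more.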
Stage 2: Reward Model Training
Human labelers are shown pairs of model outputs (A vs B) for the same prompt and asked "Which is better?" These preference labels train a Reward Model (RM) that predicts a numeric score for any (prompt, response) pair, with no human in the loop at inference time.
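Reward models of this kind are typically trained with a Bradley-Terry-style pairwise loss: maximize the probability that the chosen response outscores the rejected one. A sketch with made-up scalar RM scores:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the RM already
    ranks the human-preferred response higher, large when it disagrees."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

good = preference_loss(2.0, 0.5)  # RM agrees with the labeler -> low loss
bad = preference_loss(0.5, 2.0)   # RM disagrees -> high loss, strong gradient
```

Note that only the score *difference* matters, which is exactly why rankings (rather than absolute ratings) are enough to train it.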
Stage 3: RL Fine-Tuning with PPO
The SFT model is used as the starting policy. PPO (Proximal Policy Optimization) generates responses, scores them via the Reward Model, and updates the policy weights to maximize reward.
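Full PPO is beyond a short example, but the shape of Stage 3 can be sketched with a plain policy-gradient (REINFORCE) update over a toy three-response space. The reward-model scores below are invented, and the KL penalty discussed in the next section is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
responses = ["helpful answer", "rude answer", "gibberish"]
rm_scores = np.array([1.0, -1.0, 0.5])  # pretend reward-model scores
logits = np.zeros(3)                    # policy starts uniform

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(3, p=probs)          # sample a response from the policy
    # REINFORCE: nudge log-probs in proportion to the sampled reward
    grad = -probs
    grad[i] += 1.0                      # gradient of log pi(i)
    logits += 0.1 * rm_scores[i] * grad

best = responses[int(np.argmax(logits))]
```

After training, probability mass shifts toward the responses the reward model scores highly; real PPO adds clipping, value baselines, and batching on top of this basic idea.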
The KL Divergence Constraint: Why the Model Doesn't Collapse
Left unconstrained, PPO finds whatever response the Reward Model scores highest, and that may be nonsensical text that superficially satisfies the reward function (Goodhart's Law).
The KL divergence penalty prevents this:
$$\text{Maximize: } \mathbb{E}\left[ R(x, y) - \beta \cdot \log \frac{\pi_{RL}(y|x)}{\pi_{SFT}(y|x)} \right]$$
Plain English:
- $R(x, y)$: reward score; higher is better.
- $\beta \cdot \log(\pi_{RL}/\pi_{SFT})$: how far the RL policy has drifted from the SFT baseline.
- Together: maximize reward, but penalize large deviations from the original model.
If $\beta$ is too small: the model hacks the reward function with gibberish. If $\beta$ is too large: the model barely changes from SFT.
Typical $\beta$ values: 0.01 to 0.1.
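Concretely, the per-response quantity PPO maximizes can be computed like this (all numbers invented for illustration):

```python
def penalized_reward(reward, logp_rl, logp_sft, beta=0.05):
    """R(x, y) minus beta * log(pi_RL(y|x) / pi_SFT(y|x))."""
    return reward - beta * (logp_rl - logp_sft)

# drifted far from SFT: high RM score, but the KL term claws some of it back
drifted = penalized_reward(reward=4.0, logp_rl=-2.0, logp_sft=-10.0)   # 4 - 0.05*8 = 3.6
# same RM score, no drift: the penalty is zero
on_policy = penalized_reward(reward=4.0, logp_rl=-6.0, logp_sft=-6.0)  # 4.0
```

The larger $\beta$ is, the more that 0.4-point haircut grows, which is exactly the small/large trade-off described above.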
What Human Preference Data Looks Like
Labelers evaluate output pairs using a rubric:
| Dimension | Question |
| --- | --- |
| Helpfulness | Does the response directly address the prompt? |
| Honesty | Does it avoid false claims and express appropriate uncertainty? |
| Harmlessness | Does it avoid toxic, dangerous, or offensive content? |
| Conciseness | Is it free of unnecessary filler and repetition? |
The preference signal is a ranking (A > B), not a rating (A = 8/10, B = 6/10). Rankings are more reliable and faster to collect than absolute scores.
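A single training example for the Reward Model might look like this; the field names and contents are illustrative, not any specific dataset's schema:

```python
comparison = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "response_a": "Plants are like tiny solar-powered chefs: they use sunlight, "
                  "water, and air to cook their own food.",
    "response_b": "Photosynthesis is the conversion of light energy into chemical "
                  "energy via the light-dependent and Calvin cycle reactions.",
    "preference": "a",  # a ranking, not an absolute score
}

# the RM's training signal: score(prompt, response_a) should exceed
# score(prompt, response_b)
preferred = comparison["response_" + comparison["preference"]]
```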
RLHF Limitations and Alternatives
| Limitation | Why It Matters |
| --- | --- |
| Expensive human labeling | Thousands of high-quality (prompt, comparison) pairs are needed, from skilled, well-briefed labelers |
| Reward model is imperfect | It can be gamed, driving mode collapse during PPO |
| KL constraint is a crude fix | It prevents collapse but may cap the performance ceiling |
| Labeler disagreement | Different people rank the same output differently, especially for subjective content |
Alternatives and successors:
- DPO (Direct Preference Optimization): Skips the RM and PPO entirely and optimizes on preference pairs directly. Simpler, often competitive with RLHF. Used in Llama 3.
- RLAIF (RL from AI Feedback): Replace human labelers with a stronger LLM-as-judge. Used in Claude (Constitutional AI).
- PPO-Lite: A simplified PPO variant used when compute is constrained.
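For contrast, the core DPO loss fits in a few lines: it needs only log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (the values below are invented):

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: logistic loss on the policy-vs-reference log-prob margin.
    No reward model, no PPO rollouts."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# the policy has moved toward the chosen response relative to the reference,
# so the margin is positive and the loss is below -log(0.5)
loss = dpo_loss(pi_chosen=-5.0, pi_rejected=-4.0, ref_chosen=-6.0, ref_rejected=-3.0)
```

Because the whole pipeline reduces to a supervised classification loss over preference pairs, DPO avoids the sampling loop and reward-model training that make RLHF expensive.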
Summary
- RLHF = SFT → Reward Model → PPO: three stages to transform a base LLM into an aligned assistant.
- Reward Model is a trained preference predictor; it replaces the human labeler at scale.
- KL penalty prevents PPO from collapsing the output distribution to reward-hacking gibberish.
- DPO skips the RM and PPO entirely, and is increasingly preferred for its simplicity.
- RLAIF replaces human labelers with an AI judge for scalable feedback.
Practice Quiz
Why is a KL divergence penalty added to the RLHF objective?
- A) To reduce training compute cost.
- B) To prevent the RL-optimized model from drifting so far from the SFT baseline that it generates reward-hacking gibberish.
- C) To speed up convergence by reducing exploration.
Answer: B
What type of human annotation does RLHF collect for training the Reward Model?
- A) Absolute quality scores (1-10) for each response.
- B) Pairwise preferences: "Response A is better than Response B" for the same prompt.
- C) Token-level corrections on model outputs.
Answer: B
What is the main advantage of DPO over RLHF?
- A) DPO uses more human feedback and is therefore more accurate.
- B) DPO removes the RL training loop entirely, directly optimizing for preferences without a separate Reward Model or PPO.
- C) DPO is faster at inference time.
Answer: B

Written by Abstract Algorithms (@abstractalgorithms)