# RLHF Explained: How We Teach AI to Be Nice
ChatGPT isn't just smart; it's polite. How? Reinforcement Learning from Human Feedback (RLHF).
TLDR: A raw LLM is a super-smart parrot that read the entire internet, including its worst parts. RLHF (Reinforcement Learning from Human Feedback) is the training pipeline that transforms it from a pattern-matching engine into an assistant that is helpful, harmless, and honest.
## The Parrot Who Read Everything
Imagine a parrot that has read every book, forum post, Reddit thread, and dark corner of the web. Ask it anything and it can produce text. But it might:
- Answer in the style of a conspiracy forum.
- Generate offensive content because that's statistically common in its training data.
- Give a confident-sounding wrong answer because wrong answers also appear in training data.
RLHF is the rehabilitation process that teaches this parrot which outputs humans actually prefer.
## RLHF Key Vocabulary
Before diving into the pipeline, here is a quick reference for the terms that appear throughout this post. If you have seen these words before but never formally defined them, this table is your anchor.
| Term | What It Means |
| --- | --- |
| Base LLM | The raw pre-trained language model, trained only on next-token prediction with no alignment signal |
| SFT (Supervised Fine-Tuning) | Fine-tuning the base LLM on human-written ideal responses so the model learns to imitate good behavior |
| Reward Model (RM) | A neural network trained on human preference pairs; it outputs a scalar quality score for any (prompt, response) pair |
| PPO (Proximal Policy Optimization) | The reinforcement learning algorithm used to update the LLM policy weights to maximize the reward model's score |
| KL Divergence | A penalty term that measures how much the RL-optimized policy has drifted from the SFT baseline; it prevents reward hacking |
| Preference Data | Ranked pairs of responses (A > B) collected from human labelers for the same prompt; the core training signal for the RM |
| RLAIF | Reinforcement Learning from AI Feedback: replaces human labelers with a stronger LLM-as-judge (used in Claude's Constitutional AI) |
| DPO | Direct Preference Optimization: trains directly on preferences without a separate reward model or PPO loop |
| Alignment | The property of a model whose outputs are reliably helpful, harmless, and honest across diverse inputs |
These nine terms are the full vocabulary of modern RLHF. By the end of this post, you will have seen all of them in context.
## RLHF Full Pipeline Sequence

```mermaid
sequenceDiagram
    participant H as Human Labelers
    participant SFT as SFT Model
    participant RM as Reward Model
    participant PPO as PPO Trainer
    participant P as RLHF Policy
    SFT->>H: Generate candidate responses
    H->>H: Rank responses (A > B)
    H->>RM: Train on preference pairs
    RM->>RM: Learn scalar reward score
    SFT->>PPO: Initialize policy from SFT
    PPO->>RM: Score generated responses
    RM-->>PPO: Reward signal
    PPO->>PPO: Update policy (KL-constrained)
    PPO->>P: RLHF-aligned model
```
This sequence diagram traces the complete RLHF pipeline from SFT model to aligned policy. Human labelers rank candidate responses, training those preferences into the Reward Model; then PPO generates responses, scores them against the RM, and applies a KL-constrained update to keep the policy close to the SFT baseline. The takeaway is that the pipeline has three distinct actors (labelers, the reward model, and PPO), and a failure in any one of them degrades the final aligned policy.
## RLHF Phases State Diagram

```mermaid
stateDiagram-v2
    [*] --> PreTraining
    PreTraining --> SFT : Fine-tune on human demos
    SFT --> RewardModel : Train on preference pairs
    RewardModel --> PPO : RL optimization with KL penalty
    PPO --> RLHF_Tuned : Aligned policy
    RLHF_Tuned --> [*]
```
This state diagram shows that RLHF is a strictly sequential process: you cannot train the reward model before SFT completes, and you cannot run PPO before the reward model is ready. Each state transition represents a distinct dataset, training objective, and evaluation checkpoint. The key insight is that the pipeline is only as strong as its weakest stage: a weak SFT checkpoint produces noisy preference comparisons, which trains a weak reward model, which produces a weak PPO policy.
## The Three Stages of RLHF

```mermaid
flowchart LR
    SFT["Stage 1: SFT<br/>Supervised Fine-Tuning<br/>humans write ideal answers<br/>(imitation learning)"]
    RM["Stage 2: Reward Model<br/>humans rank A vs B<br/>train a preference predictor"]
    PPO["Stage 3: PPO<br/>RL with KL penalty<br/>optimize policy to maximize reward"]
    SFT --> RM --> PPO
```

This flowchart compresses the entire RLHF pipeline into a three-node summary. Stage 1 (SFT) produces a behaviorally reasonable model by imitating human-written responses. Stage 2 (Reward Model) learns a scalar preference signal from pairwise human rankings. Stage 3 (PPO) uses that signal to shift the policy toward higher-scoring outputs while the KL penalty prevents collapse. Reading left to right, each stage hands off its primary artifact (a fine-tuned model, a trained preference predictor, and finally an aligned policy) to the next.
### Stage 1: Supervised Fine-Tuning (SFT)
Human labelers write high-quality answers to a sample of prompts. The base LLM is fine-tuned to imitate this behavior. This creates the SFT Policy ($\pi_{SFT}$), the "before RLHF" model.
### Stage 2: Reward Model Training
Human labelers are shown pairs of model outputs (A vs B) for the same prompt and asked "Which is better?" These preference labels train a Reward Model (RM) that predicts a numeric score for any (prompt, response) pair, with no human involvement at inference time.
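The RM's pairwise objective is the Bradley-Terry log-sigmoid loss on the score gap between the chosen and rejected response. A minimal pure-Python sketch (`pairwise_reward_loss` is an illustrative helper, not a `trl` API; the scores are made-up numbers):

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    # Bradley-Terry: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    # The loss shrinks as the RM scores chosen responses above rejected ones.
    losses = [
        -math.log(1.0 / (1.0 + math.exp(-(rc - rr))))
        for rc, rr in zip(r_chosen, r_rejected)
    ]
    return sum(losses) / len(losses)

# When the RM already ranks chosen above rejected, the loss is small...
low = pairwise_reward_loss([2.0], [0.0])
# ...and large when the ranking is inverted.
high = pairwise_reward_loss([0.0], [2.0])
```

Training drives the score gap positive on every labeled pair, which is exactly what makes the RM usable as a stand-in judge during PPO.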
### Stage 3: RL Fine-Tuning with PPO
The SFT model is used as the starting policy. PPO (Proximal Policy Optimization) generates responses, scores them via the Reward Model, and updates the policy weights to maximize reward.
## The KL Divergence Constraint: Why the Model Doesn't Collapse
Left unconstrained, PPO finds the response the Reward Model scores highest, and it may be nonsensical text that superficially satisfies the reward function (Goodhart's Law).
The KL divergence penalty prevents this:
$$\text{Maximize: } \mathbb{E}\left[ R(x, y) - \beta \cdot \log \frac{\pi_{RL}(y|x)}{\pi_{SFT}(y|x)} \right]$$
Plain English:
- $R(x, y)$: the reward score; higher is better.
- $\beta \cdot \log(\pi_{RL}/\pi_{SFT})$: how much the RL policy has drifted from the SFT baseline.
- Together: maximize reward, but penalize large deviations from the original model.
If $\beta$ is too small, the model hacks the reward function with gibberish. If $\beta$ is too large, the model barely changes from SFT.
Typical $\beta$ values: 0.01–0.1.
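Numerically, the objective trades raw reward against drift from the SFT baseline. A toy sketch of the formula above (the log-probabilities are made-up numbers for illustration):

```python
def kl_penalized_reward(reward, logp_rl, logp_sft, beta=0.05):
    # R(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x))
    # The log-ratio equals the difference of log-probabilities.
    return reward - beta * (logp_rl - logp_sft)

# Two responses with the same raw reward of 1.0:
# the first stays close to the SFT policy, the second drifted far from it.
on_policy = kl_penalized_reward(reward=1.0, logp_rl=-10.0, logp_sft=-10.5)
drifted = kl_penalized_reward(reward=1.0, logp_rl=-5.0, logp_sft=-20.0)
```

Even with identical raw rewards, the drifted response ends up with a much lower effective reward, which is exactly how the penalty steers PPO away from reward-hacking outputs.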
## Deep Dive: Human Preference Data and Annotation
Labelers evaluate output pairs using a rubric:
| Dimension | Question |
| --- | --- |
| Helpfulness | Does the response directly address the prompt? |
| Honesty | Does it avoid false claims and express appropriate uncertainty? |
| Harmlessness | Does it avoid toxic, dangerous, or offensive content? |
| Conciseness | Is it free of unnecessary filler and repetition? |
The preference signal is a ranking (A > B), not a rating (A = 8/10, B = 6/10). Rankings are more reliable and faster to collect than absolute scores.
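Rankings also compose well: a single ranked list of K responses expands into K(K-1)/2 pairwise training examples. A small sketch (`ranking_to_pairs` is a hypothetical helper, not part of any labeling tool):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a labeler's ranking (best first) into (chosen, rejected) pairs.

    Every earlier item in the list is preferred over every later one.
    """
    return [(chosen, rejected) for chosen, rejected in combinations(ranked_responses, 2)]

# One ranking of 3 responses yields 3 pairwise comparisons for RM training.
pairs = ranking_to_pairs(["best", "ok", "worst"])
```

This multiplier is one reason rankings are the preferred collection format: a single labeling pass over K responses produces quadratically many RM training pairs.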
## RLHF Trade-offs and Failure Modes
| Challenge | Why It Matters |
| --- | --- |
| Expensive human labeling | Thousands of high-quality (prompt, comparison) pairs are needed, from skilled, well-briefed labelers |
| Reward model is imperfect | It can be gamed (mode collapse in PPO): the model finds responses that score high but are not actually good |
| KL constraint is a crude fix | It prevents collapse but may limit the performance ceiling; $\beta$ tuning requires significant experimentation |
| Labeler disagreement | Different people rank the same output differently, especially for subjective content |
| Reward hacking (Goodhart's Law) | PPO optimizes whatever the RM scores; if the RM is imperfect, so is the policy |
| Scaling cost | Each alignment iteration requires new preference data; human labeling does not scale cheaply |
Alternatives and successors:
- DPO (Direct Preference Optimization): Skips the RM and PPO entirely, optimizing on preferences directly. Simpler and often competitive with RLHF. Used in Llama 3.
- RLAIF (RL from AI Feedback): Replaces human labelers with a stronger LLM-as-judge. Used in Claude (Constitutional AI).
- PPO-Lite: A simplified PPO variant used when compute is constrained.
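To see why DPO is simpler, its per-example loss can be written directly in terms of policy and reference log-probabilities, with no reward model in sight. A pure-Python sketch (`dpo_loss` is illustrative; real implementations batch this over token-level log-probs, and the numbers below are made up):

```python
import math

def dpo_loss(logp_chosen_pol, logp_rejected_pol,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    # Implicit reward of a response = beta * (policy log-prob - reference log-prob).
    # The loss is -log sigmoid of the chosen-vs-rejected implicit-reward margin.
    margin = (logp_chosen_pol - logp_chosen_ref) - (logp_rejected_pol - logp_rejected_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already favors the chosen response relative to the reference: low loss.
improving = dpo_loss(-10.0, -12.0, -11.0, -11.0)
# Policy is indifferent between the two: loss sits at log(2).
neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)
```

The reference log-probs play the same role as the KL penalty in PPO: they anchor the policy to the SFT baseline, but here the anchoring is baked into the loss rather than enforced by a separate constraint.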
## Decision Guide: RLHF vs DPO vs RLAIF
| Situation | Recommendation |
| --- | --- |
| Building a flagship assistant with a large labeling budget | Full RLHF (SFT → RM → PPO) is the proven approach |
| Small team, limited annotation budget | DPO: eliminates the reward model training step entirely |
| Need to scale alignment beyond human labeling capacity | RLAIF: use a strong LLM judge to generate preference data |
| Tight compute budget for the alignment phase | DPO or PPO-Lite: lighter training loop |
| Need an interpretable reward signal | RLHF with an explicit RM: the reward score is inspectable |
| Iterating quickly on alignment behaviors | DPO: converges faster and requires fewer moving parts |
## Real-World Applications of RLHF
RLHF is not an academic curiosity; it is the alignment backbone of the most widely deployed AI products today.
| Product / Project | How RLHF or Its Successor Is Used |
| --- | --- |
| ChatGPT / GPT-4 (OpenAI) | Classic three-stage RLHF: SFT on contractor-written responses, RM trained on pairwise preferences, PPO fine-tuning |
| Claude (Anthropic) | Constitutional AI + RLAIF: a set of written principles guides an AI judge that provides feedback, dramatically reducing reliance on human labelers |
| Llama 3 (Meta) | DPO instead of PPO; the same preference data drives direct optimization without a separate reward model training loop |
| Customer-Facing Chatbots | Enterprise teams apply lightweight RLHF or DPO on domain-specific preference data to reduce off-topic or brand-unsafe responses |
| Code Assistants (Copilot, Cursor) | Preference data collected from accept/reject signals on code completions; a reward model scores helpfulness and correctness |
Why does the product landscape matter for a beginner? Because it shows that RLHF is not one fixed recipe. OpenAI pioneered it, Anthropic scaled it with AI feedback, and Meta simplified it with DPO. The underlying insight, that human preference signals are a powerful training target, is shared across all three.
## Practical: What RLHF Looks Like in Code
Most researchers use the `trl` library (Transformer Reinforcement Learning) from Hugging Face to implement RLHF. Below is a minimal sketch of the three stages. These are not production-ready snippets, but they illustrate the shape of the pipeline.
Stage 1: SFT is standard fine-tuning:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

trainer = SFTTrainer(
    model=model,
    train_dataset=sft_dataset,  # rows: {"prompt": ..., "response": ...}
    dataset_text_field="response",
)
trainer.train()
sft_model = trainer.model
```
Stage 2: Reward Model training on preference pairs:
```python
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig

# Preference dataset rows: {"chosen": ..., "rejected": ...}
# Both are full prompt+response strings.
# The RM is a sequence-classification head that emits a single scalar score.
reward_trainer = RewardTrainer(
    model=AutoModelForSequenceClassification.from_pretrained(
        "meta-llama/Llama-3-8B", num_labels=1
    ),
    args=RewardConfig(output_dir="./reward_model"),
    train_dataset=preference_dataset,
)
reward_trainer.train()
reward_model = reward_trainer.model
```
Stage 3: PPO optimization loop:
```python
from trl import PPOTrainer, PPOConfig

ppo_trainer = PPOTrainer(
    config=PPOConfig(kl_penalty="kl", init_kl_coef=0.05),  # beta = 0.05
    model=sft_model,
    ref_model=sft_model,  # frozen reference = SFT baseline (use a separate copy in practice)
    reward_model=reward_model,
    tokenizer=tokenizer,
)

for batch in ppo_dataloader:
    query_tensors = batch["input_ids"]
    # Generate a response from the current policy
    response_tensors = ppo_trainer.generate(query_tensors)
    # Score each response with the reward model
    # (sketch: real code tokenizes prompt+response and reads the scalar head output)
    rewards = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]
    # PPO update: maximize reward subject to the KL penalty
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```
The `kl_penalty` and `init_kl_coef` arguments are the tunable $\beta$ from the KL divergence formula above. Most practitioners start around 0.05 and adjust based on how much the model diverges from the SFT checkpoint during training.
## Hugging Face TRL: RewardTrainer and PPOTrainer for RLHF Pipelines
Hugging Face TRL (Transformer Reinforcement Learning) is the standard open-source library for implementing RLHF in Python. It provides `SFTTrainer`, `RewardTrainer`, and `PPOTrainer`, which map one-to-one to the three RLHF stages described above, all building on the familiar `transformers` Trainer API and integrating natively with PEFT and bitsandbytes for memory-efficient training.
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RewardTrainer, RewardConfig, PPOTrainer, PPOConfig
from datasets import load_dataset

model_name = "meta-llama/Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# -- Stage 2: Train the Reward Model on human preference pairs ---------------
# Dataset rows: {"chosen": "<prompt> + <good response>", "rejected": "<prompt> + <bad response>"}
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")
# The RM is a sequence-classification head with a single scalar output.
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(
        output_dir="./reward-model",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        bf16=True,
    ),
    tokenizer=tokenizer,
    train_dataset=preference_dataset,
)
reward_trainer.train()

# -- Stage 3: PPO fine-tuning with KL penalty --------------------------------
sft_model = AutoModelForCausalLM.from_pretrained("./sft-checkpoint")  # SFT baseline
ppo_trainer = PPOTrainer(
    config=PPOConfig(
        learning_rate=1e-6,
        batch_size=32,
        kl_penalty="kl",
        init_kl_coef=0.05,  # beta in the KL divergence formula
        target_kl=0.1,
    ),
    model=sft_model,
    ref_model=sft_model,  # frozen copy of the SFT model = reference policy
    tokenizer=tokenizer,
)

for batch in ppo_dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=128)
    # Score each (prompt, response): concatenate tokens and read the scalar head
    rewards = [
        reward_model(input_ids=torch.cat([q, r]).unsqueeze(0)).logits.squeeze()
        for q, r in zip(query_tensors, response_tensors)
    ]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```
`RewardTrainer` handles the Bradley-Terry pairwise loss automatically: you supply `chosen` and `rejected` columns and it computes the log-sigmoid objective. `PPOTrainer` manages the KL-penalized policy update loop, the frozen reference model, and gradient clipping, so you only need to supply prompts, generate responses, score them, and call `.step()`.
For a full deep-dive on Hugging Face TRL, a dedicated follow-up post is planned.
## Production Lessons from RLHF
Running RLHF in practice is harder than the clean three-stage diagram suggests. Here are the lessons that practitioners encounter repeatedly.
Reward hacking is real. PPO will find degenerate outputs that score high on the reward model but look nothing like good responses: long repetitive lists, excessive hedging, or flattery. The KL penalty suppresses this but does not eliminate it. Monitor output diversity throughout training.
KL coefficient tuning takes real experimentation. $\beta = 0.01$ and $\beta = 0.1$ can produce completely different models. Too small and the model drifts to reward-hacking behavior; too large and the PPO pass adds nothing beyond SFT.
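One practical signal to watch while tuning is the sampled KL estimate between the current policy and the SFT reference. A minimal sketch (`mean_kl_estimate` is a hypothetical helper; the per-token log-probs are made-up numbers):

```python
def mean_kl_estimate(logp_rl, logp_sft):
    """Monte-Carlo KL estimate from per-token log-probs of sampled responses.

    For tokens sampled from the policy, KL(pi_RL || pi_SFT) is approximated
    by the mean of (log pi_RL - log pi_SFT) over those tokens.
    """
    return sum(r - s for r, s in zip(logp_rl, logp_sft)) / len(logp_rl)

# Track this per batch during PPO: a steadily climbing estimate means the
# policy is drifting from the SFT baseline, often toward reward hacking.
kl = mean_kl_estimate([-1.0, -2.0, -0.5], [-1.5, -2.2, -1.3])
```

If the estimate climbs while reward also climbs, the gains may be coming from drift rather than genuinely better responses, which is the cue to raise $\beta$.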
Labeler quality matters more than labeler quantity. A small set of well-briefed, consistent labelers produces better reward models than a large crowd with inconsistent rubrics. Labeler disagreement directly injects noise into the RM's training signal.
DPO is simpler and often competitive. For teams without a dedicated RL infrastructure, DPO is the right default. It skips the reward model training loop entirely and converges faster. Llama 3's alignment used DPO.
RLAIF scales better than human labeling. Once you have a strong enough base model to serve as a judge, RLAIF (AI feedback) can generate vastly more preference pairs per day than human contractors. This is the core insight behind Constitutional AI.
The SFT step matters. A weak SFT checkpoint produces a weak RM and a weak PPO policy. Garbage in, garbage out: investing in high-quality SFT demonstrations pays dividends in every subsequent stage.
## TLDR: Summary & Key Takeaways
TLDR: RLHF transforms a base LLM into a helpful, harmless assistant through three stages (SFT, Reward Model training, and PPO), with a KL penalty preventing the model from gaming the reward signal.
- RLHF = SFT → Reward Model → PPO: three stages to transform a base LLM into an aligned assistant.
- The Reward Model is a trained preference predictor; it replaces the human labeler at scale.
- The KL penalty prevents PPO from collapsing the output distribution into reward-hacking gibberish.
- DPO skips the RM and PPO entirely and is increasingly preferred for its simplicity.
- RLAIF replaces human labelers with an AI judge for scalable feedback.
## Related Posts
- RLHF Training Pipeline: From Preferences to Policy
- SFT Explained: Supervised Fine-Tuning LLMs in Practice
- LoRA Explained: Fine-Tuning LLMs on a Budget
- How GPT / LLMs Work Under the Hood

Written by
Abstract Algorithms
@abstractalgorithms
