# RLHF Explained: How We Teach AI to Be Nice
ChatGPT isn't just smart; it's polite. How? Reinforcement Learning from Human Feedback (RLHF).
TLDR: A raw LLM is a super-smart parrot that read the entire internet, including its worst parts. RLHF (Reinforcement Learning from Human Feedback) is the training pipeline that transforms it from a pattern-matching engine into an assistant that is helpful, harmless, and honest.
## The Parrot Who Read Everything
Imagine a parrot that has read every book, forum post, Reddit thread, and dark corner of the web. Ask it anything and it can produce text. But it might:
- Answer in the style of a conspiracy forum.
- Generate offensive content because that's statistically common in its training data.
- Give a confident-sounding wrong answer because wrong answers also appear in training data.
RLHF is the rehabilitation process that teaches this parrot which outputs humans actually prefer.
## RLHF Key Vocabulary
Before diving into the pipeline, here is a quick reference for the terms that appear throughout this post. If you have seen these words before but never formally defined them, this table is your anchor.
| Term | What It Means |
| --- | --- |
| Base LLM | The raw pre-trained language model, trained only on next-token prediction with no alignment signal |
| SFT (Supervised Fine-Tuning) | Fine-tuning the base LLM on human-written ideal responses so the model learns to imitate good behavior |
| Reward Model (RM) | A neural network trained on human preference pairs; it outputs a scalar quality score for any (prompt, response) pair |
| PPO (Proximal Policy Optimization) | The reinforcement learning algorithm used to update the LLM policy weights to maximize the reward model's score |
| KL Divergence | A penalty term that measures how much the RL-optimized policy has drifted from the SFT baseline; it prevents reward hacking |
| Preference Data | Ranked pairs of responses (A > B) collected from human labelers for the same prompt; the core training signal for the RM |
| RLAIF | Reinforcement Learning from AI Feedback: replaces human labelers with a stronger LLM-as-judge (used in Claude's Constitutional AI) |
| DPO | Direct Preference Optimization: trains directly on preferences without a separate reward model or PPO loop |
| Alignment | The property of a model whose outputs are reliably helpful, harmless, and honest across diverse inputs |
These nine terms are the full vocabulary of modern RLHF. By the end of this post, you will have seen all of them in context.
## RLHF Full Pipeline Sequence

```mermaid
sequenceDiagram
    participant H as Human Labelers
    participant SFT as SFT Model
    participant RM as Reward Model
    participant PPO as PPO Trainer
    participant P as RLHF Policy
    SFT->>H: Generate candidate responses
    H->>H: Rank responses (A > B)
    H->>RM: Train on preference pairs
    RM->>RM: Learn scalar reward score
    SFT->>PPO: Initialize policy from SFT
    PPO->>RM: Score generated responses
    RM-->>PPO: Reward signal
    PPO->>PPO: Update policy (KL-constrained)
    PPO->>P: RLHF-aligned model
```
This sequence diagram traces the complete RLHF pipeline from SFT model to aligned policy. Human labelers rank candidate responses, training those preferences into the Reward Model; then PPO generates responses, scores them against the RM, and applies a KL-constrained update to keep the policy close to the SFT baseline. The takeaway is that the pipeline has three distinct actors (labelers, the reward model, and PPO), and a failure in any one of them degrades the final aligned policy.
## RLHF Phases State Diagram

```mermaid
stateDiagram-v2
    [*] --> PreTraining
    PreTraining --> SFT : Fine-tune on human demos
    SFT --> RewardModel : Train on preference pairs
    RewardModel --> PPO : RL optimization with KL penalty
    PPO --> RLHF_Tuned : Aligned policy
    RLHF_Tuned --> [*]
```
This state diagram shows that RLHF is a strictly sequential process: you cannot train the reward model before SFT completes, and you cannot run PPO before the reward model is ready. Each state transition represents a distinct dataset, training objective, and evaluation checkpoint. The key insight is that the pipeline is only as strong as its weakest stage: a weak SFT checkpoint produces noisy preference comparisons, which trains a weak reward model, which produces a weak PPO policy.
## The Three Stages of RLHF

```mermaid
flowchart LR
    SFT["Stage 1: SFT<br/>Supervised Fine-Tuning<br/>humans write ideal answers<br/>(imitation learning)"]
    RM["Stage 2: Reward Model<br/>humans rank A vs B<br/>train a preference predictor"]
    PPO["Stage 3: PPO<br/>RL with KL penalty<br/>optimize policy to maximize reward"]
    SFT --> RM --> PPO
```

This flowchart compresses the entire RLHF pipeline into a three-node summary. Stage 1 (SFT) produces a behaviorally reasonable model by imitating human-written responses. Stage 2 (Reward Model) learns a scalar preference signal from pairwise human rankings. Stage 3 (PPO) uses that signal to shift the policy toward higher-scoring outputs while the KL penalty prevents collapse. Reading left to right, each stage hands off its primary artifact (a fine-tuned model, a trained preference predictor, and finally an aligned policy) to the next.
### Stage 1: Supervised Fine-Tuning (SFT)
Human labelers write high-quality answers to a sample of prompts. The base LLM is fine-tuned to imitate this behavior. This creates the SFT Policy ($\pi_{SFT}$), the "before RLHF" model.
### Stage 2: Reward Model Training
Human labelers are shown pairs of model outputs (A vs B) for the same prompt and asked "Which is better?" These preference labels train a Reward Model (RM) that predicts a numeric score for any (prompt, response) pair, with no human involvement at inference time.
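The RM's pairwise objective is the Bradley-Terry log-sigmoid loss on the score gap between the chosen and rejected response. A minimal pure-Python sketch (`pairwise_reward_loss` is an illustrative helper, not a `trl` API; the scores are made-up numbers):

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    # Bradley-Terry: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    # The loss shrinks as the RM scores chosen responses above rejected ones.
    losses = [
        -math.log(1.0 / (1.0 + math.exp(-(rc - rr))))
        for rc, rr in zip(r_chosen, r_rejected)
    ]
    return sum(losses) / len(losses)

# When the RM already ranks chosen above rejected, the loss is small...
low = pairwise_reward_loss([2.0], [0.0])
# ...and large when the ranking is inverted.
high = pairwise_reward_loss([0.0], [2.0])
```

Training drives the score gap positive on every labeled pair, which is exactly what makes the RM usable as a stand-in judge during PPO.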
### Stage 3: RL Fine-Tuning with PPO
The SFT model is used as the starting policy. PPO (Proximal Policy Optimization) generates responses, scores them via the Reward Model, and updates the policy weights to maximize reward.
## The KL Divergence Constraint: Why the Model Doesn't Collapse
Left unconstrained, PPO finds the response the Reward Model scores highest, and it may be nonsensical text that superficially satisfies the reward function (Goodhart's Law).
The KL divergence penalty prevents this:
$$\text{Maximize: } \mathbb{E}\left[ R(x, y) - \beta \cdot \log \frac{\pi_{RL}(y|x)}{\pi_{SFT}(y|x)} \right]$$
Plain English:
- $R(x, y)$: the reward score; higher is better.
- $\beta \cdot \log(\pi_{RL}/\pi_{SFT})$: how much the RL policy has drifted from the SFT baseline.
- Together: maximize reward, but penalize large deviations from the original model.
If $\beta$ is too small, the model hacks the reward function with gibberish. If $\beta$ is too large, the model barely changes from SFT.
Typical $\beta$ values: 0.01–0.1.
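Numerically, the objective trades raw reward against drift from the SFT baseline. A toy sketch of the formula above (the log-probabilities are made-up numbers for illustration):

```python
def kl_penalized_reward(reward, logp_rl, logp_sft, beta=0.05):
    # R(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x))
    # The log-ratio equals the difference of log-probabilities.
    return reward - beta * (logp_rl - logp_sft)

# Two responses with the same raw reward of 1.0:
# the first stays close to the SFT policy, the second drifted far from it.
on_policy = kl_penalized_reward(reward=1.0, logp_rl=-10.0, logp_sft=-10.5)
drifted = kl_penalized_reward(reward=1.0, logp_rl=-5.0, logp_sft=-20.0)
```

Even with identical raw rewards, the drifted response ends up with a much lower effective reward, which is exactly how the penalty steers PPO away from reward-hacking outputs.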
## Deep Dive: Human Preference Data and Annotation
Labelers evaluate output pairs using a rubric:
| Dimension | Question |
| --- | --- |
| Helpfulness | Does the response directly address the prompt? |
| Honesty | Does it avoid false claims and express appropriate uncertainty? |
| Harmlessness | Does it avoid toxic, dangerous, or offensive content? |
| Conciseness | Is it free of unnecessary filler and repetition? |
The preference signal is a ranking (A > B), not a rating (A = 8/10, B = 6/10). Rankings are more reliable and faster to collect than absolute scores.
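Rankings also compose well: a single ranked list of K responses expands into K(K-1)/2 pairwise training examples. A small sketch (`ranking_to_pairs` is a hypothetical helper, not part of any labeling tool):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a labeler's ranking (best first) into (chosen, rejected) pairs.

    Every earlier item in the list is preferred over every later one.
    """
    return [(chosen, rejected) for chosen, rejected in combinations(ranked_responses, 2)]

# One ranking of 3 responses yields 3 pairwise comparisons for RM training.
pairs = ranking_to_pairs(["best", "ok", "worst"])
```

This multiplier is one reason rankings are the preferred collection format: a single labeling pass over K responses produces quadratically many RM training pairs.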
## RLHF Trade-offs and Failure Modes
| Challenge | Why It Matters |
| --- | --- |
| Expensive human labeling | Thousands of high-quality (prompt, comparison) pairs are needed, from skilled, well-briefed labelers |
| Reward model is imperfect | It can be gamed (mode collapse in PPO): the model finds responses that score high but are not actually good |
| KL constraint is a crude fix | It prevents collapse but may limit the performance ceiling; $\beta$ tuning requires significant experimentation |
| Labeler disagreement | Different people rank the same output differently, especially for subjective content |
| Reward hacking (Goodhart's Law) | PPO optimizes whatever the RM scores; if the RM is imperfect, so is the policy |
| Scaling cost | Each alignment iteration requires new preference data; human labeling does not scale cheaply |
Alternatives and successors:
- DPO (Direct Preference Optimization): Skips the RM and PPO entirely, optimizing on preferences directly. Simpler and often competitive with RLHF. Used in Llama 3.
- RLAIF (RL from AI Feedback): Replaces human labelers with a stronger LLM-as-judge. Used in Claude (Constitutional AI).
- PPO-Lite: A simplified PPO variant used when compute is constrained.
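To see why DPO is simpler, its per-example loss can be written directly in terms of policy and reference log-probabilities, with no reward model in sight. A pure-Python sketch (`dpo_loss` is illustrative; real implementations batch this over token-level log-probs, and the numbers below are made up):

```python
import math

def dpo_loss(logp_chosen_pol, logp_rejected_pol,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    # Implicit reward of a response = beta * (policy log-prob - reference log-prob).
    # The loss is -log sigmoid of the chosen-vs-rejected implicit-reward margin.
    margin = (logp_chosen_pol - logp_chosen_ref) - (logp_rejected_pol - logp_rejected_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already favors the chosen response relative to the reference: low loss.
improving = dpo_loss(-10.0, -12.0, -11.0, -11.0)
# Policy is indifferent between the two: loss sits at log(2).
neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)
```

The reference log-probs play the same role as the KL penalty in PPO: they anchor the policy to the SFT baseline, but here the anchoring is baked into the loss rather than enforced by a separate constraint.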
## Decision Guide: RLHF vs DPO vs RLAIF
| Situation | Recommendation |
| --- | --- |
| Building a flagship assistant with a large labeling budget | Full RLHF (SFT → RM → PPO) is the proven approach |
| Small team, limited annotation budget | DPO: eliminates the reward model training step entirely |
| Need to scale alignment beyond human labeling capacity | RLAIF: use a strong LLM judge to generate preference data |
| Tight compute budget for the alignment phase | DPO or PPO-Lite: lighter training loop |
| Need an interpretable reward signal | RLHF with an explicit RM: the reward score is inspectable |
| Iterating quickly on alignment behaviors | DPO: converges faster and requires fewer moving parts |
## Real-World Applications of RLHF
RLHF is not an academic curiosity; it is the alignment backbone of the most widely deployed AI products today.
| Product / Project | How RLHF or Its Successor Is Used |
| --- | --- |
| ChatGPT / GPT-4 (OpenAI) | Classic three-stage RLHF: SFT on contractor-written responses, RM trained on pairwise preferences, PPO fine-tuning |
| Claude (Anthropic) | Constitutional AI + RLAIF: a set of written principles guides an AI judge that provides feedback, dramatically reducing reliance on human labelers |
| Llama 3 (Meta) | DPO instead of PPO; the same preference data drives direct optimization without a separate reward model training loop |
| Customer-Facing Chatbots | Enterprise teams apply lightweight RLHF or DPO on domain-specific preference data to reduce off-topic or brand-unsafe responses |
| Code Assistants (Copilot, Cursor) | Preference data collected from accept/reject signals on code completions; a reward model scores helpfulness and correctness |
Why does the product landscape matter for a beginner? Because it shows that RLHF is not one fixed recipe. OpenAI pioneered it, Anthropic scaled it with AI feedback, and Meta simplified it with DPO. The underlying insight, that human preference signals are a powerful training target, is shared across all three.
## Practical: What RLHF Looks Like in Code
Most researchers use the `trl` library (Transformer Reinforcement Learning) from Hugging Face to implement RLHF. Below is a minimal sketch of the three stages. These are not production-ready snippets, but they illustrate the shape of the pipeline.
Stage 1: SFT is standard fine-tuning:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

trainer = SFTTrainer(
    model=model,
    train_dataset=sft_dataset,  # rows: {"prompt": ..., "response": ...}
    dataset_text_field="response",
)
trainer.train()
sft_model = trainer.model
```
Stage 2: Reward Model training on preference pairs:
```python
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig

# Preference dataset rows: {"chosen": ..., "rejected": ...}
# Both are full prompt+response strings.
# The RM is a sequence-classification head that emits a single scalar score.
reward_trainer = RewardTrainer(
    model=AutoModelForSequenceClassification.from_pretrained(
        "meta-llama/Llama-3-8B", num_labels=1
    ),
    args=RewardConfig(output_dir="./reward_model"),
    train_dataset=preference_dataset,
)
reward_trainer.train()
reward_model = reward_trainer.model
```
Stage 3: PPO optimization loop:
```python
from trl import PPOTrainer, PPOConfig

ppo_trainer = PPOTrainer(
    config=PPOConfig(kl_penalty="kl", init_kl_coef=0.05),  # beta = 0.05
    model=sft_model,
    ref_model=sft_model,  # frozen reference = SFT baseline (use a separate copy in practice)
    reward_model=reward_model,
    tokenizer=tokenizer,
)

for batch in ppo_dataloader:
    query_tensors = batch["input_ids"]
    # Generate a response from the current policy
    response_tensors = ppo_trainer.generate(query_tensors)
    # Score each response with the reward model
    # (sketch: real code tokenizes prompt+response and reads the scalar head output)
    rewards = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]
    # PPO update: maximize reward subject to the KL penalty
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```
The `kl_penalty` and `init_kl_coef` arguments are the tunable $\beta$ from the KL divergence formula above. Most practitioners start around 0.05 and adjust based on how much the model diverges from the SFT checkpoint during training.
## Hugging Face TRL: RewardTrainer and PPOTrainer for RLHF Pipelines
Hugging Face TRL (Transformer Reinforcement Learning) is the standard open-source library for implementing RLHF in Python. It provides `SFTTrainer`, `RewardTrainer`, and `PPOTrainer`, which map one-to-one to the three RLHF stages described above, all building on the familiar `transformers` Trainer API and integrating natively with PEFT and bitsandbytes for memory-efficient training.
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RewardTrainer, RewardConfig, PPOTrainer, PPOConfig
from datasets import load_dataset

model_name = "meta-llama/Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# -- Stage 2: Train the Reward Model on human preference pairs ---------------
# Dataset rows: {"chosen": "<prompt> + <good response>", "rejected": "<prompt> + <bad response>"}
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")
# The RM is a sequence-classification head with a single scalar output.
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(
        output_dir="./reward-model",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        bf16=True,
    ),
    tokenizer=tokenizer,
    train_dataset=preference_dataset,
)
reward_trainer.train()

# -- Stage 3: PPO fine-tuning with KL penalty --------------------------------
sft_model = AutoModelForCausalLM.from_pretrained("./sft-checkpoint")  # SFT baseline
ppo_trainer = PPOTrainer(
    config=PPOConfig(
        learning_rate=1e-6,
        batch_size=32,
        kl_penalty="kl",
        init_kl_coef=0.05,  # beta in the KL divergence formula
        target_kl=0.1,
    ),
    model=sft_model,
    ref_model=sft_model,  # frozen copy of the SFT model = reference policy
    tokenizer=tokenizer,
)

for batch in ppo_dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=128)
    # Score each (prompt, response): concatenate tokens and read the scalar head
    rewards = [
        reward_model(input_ids=torch.cat([q, r]).unsqueeze(0)).logits.squeeze()
        for q, r in zip(query_tensors, response_tensors)
    ]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```
`RewardTrainer` handles the Bradley-Terry pairwise loss automatically: you supply `chosen` and `rejected` columns and it computes the log-sigmoid objective. `PPOTrainer` manages the KL-penalized policy update loop, the frozen reference model, and gradient clipping, so you only need to supply prompts, generate responses, score them, and call `.step()`.
For a full deep-dive on Hugging Face TRL, a dedicated follow-up post is planned.
## Production Lessons from RLHF
Running RLHF in practice is harder than the clean three-stage diagram suggests. Here are the lessons that practitioners encounter repeatedly.
Reward hacking is real. PPO will find degenerate outputs that score high on the reward model but look nothing like good responses: long repetitive lists, excessive hedging, or flattery. The KL penalty suppresses this but does not eliminate it. Monitor output diversity throughout training.
KL coefficient tuning takes real experimentation. $\beta = 0.01$ and $\beta = 0.1$ can produce completely different models. Too small and the model drifts to reward-hacking behavior; too large and the PPO pass adds nothing beyond SFT.
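One practical signal to watch while tuning is the sampled KL estimate between the current policy and the SFT reference. A minimal sketch (`mean_kl_estimate` is a hypothetical helper; the per-token log-probs are made-up numbers):

```python
def mean_kl_estimate(logp_rl, logp_sft):
    """Monte-Carlo KL estimate from per-token log-probs of sampled responses.

    For tokens sampled from the policy, KL(pi_RL || pi_SFT) is approximated
    by the mean of (log pi_RL - log pi_SFT) over those tokens.
    """
    return sum(r - s for r, s in zip(logp_rl, logp_sft)) / len(logp_rl)

# Track this per batch during PPO: a steadily climbing estimate means the
# policy is drifting from the SFT baseline, often toward reward hacking.
kl = mean_kl_estimate([-1.0, -2.0, -0.5], [-1.5, -2.2, -1.3])
```

If the estimate climbs while reward also climbs, the gains may be coming from drift rather than genuinely better responses, which is the cue to raise $\beta$.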
Labeler quality matters more than labeler quantity. A small set of well-briefed, consistent labelers produces better reward models than a large crowd with inconsistent rubrics. Labeler disagreement directly injects noise into the RM's training signal.
DPO is simpler and often competitive. For teams without a dedicated RL infrastructure, DPO is the right default. It skips the reward model training loop entirely and converges faster. Llama 3's alignment used DPO.
RLAIF scales better than human labeling. Once you have a strong enough base model to serve as a judge, RLAIF (AI feedback) can generate vastly more preference pairs per day than human contractors. This is the core insight behind Constitutional AI.
The SFT step matters. A weak SFT checkpoint produces a weak RM and a weak PPO policy. Garbage in, garbage out: investing in high-quality SFT demonstrations pays dividends in every subsequent stage.
## TLDR: Summary & Key Takeaways
TLDR: RLHF transforms a base LLM into a helpful, harmless assistant through three stages (SFT, Reward Model training, and PPO), with a KL penalty preventing the model from gaming the reward signal.
- RLHF = SFT → Reward Model → PPO: three stages to transform a base LLM into an aligned assistant.
- The Reward Model is a trained preference predictor; it replaces the human labeler at scale.
- The KL penalty prevents PPO from collapsing the output distribution into reward-hacking gibberish.
- DPO skips the RM and PPO entirely and is increasingly preferred for its simplicity.
- RLAIF replaces human labelers with an AI judge for scalable feedback.
## Related Posts
- RLHF Training Pipeline: From Preferences to Policy
- SFT Explained: Supervised Fine-Tuning LLMs in Practice
- LoRA Explained: Fine-Tuning LLMs on a Budget
- How GPT / LLMs Work Under the Hood

Written by
Abstract Algorithms
@abstractalgorithms
