RLHF in Practice: From Human Preferences to Better LLM Policies
RLHF turns human preference signals into policy updates for more useful LLM behavior.
TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a reward model, then optimize a policy (often with KL constraints to stay close to a reference model). RLHF can significantly improve usefulness and harmlessness, but it introduces risks like reward hacking and annotation bias.
Why RLHF Exists After Pretraining and SFT
GPT-2 could write fluent text but produced harmful content on request. The fix wasn't more data. It was RLHF: human raters scored outputs, a reward model learned their preferences, and PPO optimized the policy to score higher. This post explains that pipeline step by step.
Pretraining teaches language fluency. SFT teaches example-following behavior. Yet product teams often still see issues after both stages:
- responses that are technically correct but unhelpful,
- tone that ignores user intent,
- overconfident mistakes,
- weak refusal behavior in unsafe scenarios.
RLHF exists because not all desired behavior is easy to encode as direct supervised labels. Humans can often compare two responses faster than they can craft one perfect reference response.
| Stage | Strength | Limitation |
| --- | --- | --- |
| Pretraining | Broad language priors | No product preference alignment |
| SFT | Teaches explicit response patterns | Limited by demonstration coverage |
| RLHF | Optimizes for preference signals | Sensitive to reward model quality |
RLHF does not replace SFT. It usually builds on SFT as a stronger initialization.
The RLHF Pipeline: Data, Reward, Policy
Most production RLHF pipelines include three parts:
- Preference data collection
- Reward model training
- Policy optimization with regularization
1) Preference data collection
Annotators compare candidate responses for the same prompt:
- Which answer is more helpful?
- Which answer is safer?
- Which answer follows instructions better?
2) Reward model training
A separate model learns to score responses so preferred outputs receive higher scores.
3) Policy optimization
The assistant policy is updated to maximize reward while staying close to a reference policy using KL penalties.
| Component | Input | Output | Common failure mode |
| --- | --- | --- | --- |
| Preference dataset | Prompt + response pairs + preference labels | Ranked examples | Annotator inconsistency |
| Reward model | Ranked examples | Scalar reward signal | Overfitting to annotation artifacts |
| Policy optimizer | Prompt + reward signal | Updated policy | Reward hacking / style collapse |
RLHF End-to-End Training Sequence

```mermaid
sequenceDiagram
    participant SFT as SFT Model
    participant A as Annotators
    participant RM as Reward Model
    participant PPO as PPO Trainer
    participant KL as KL Constraint
    SFT->>A: Generate candidate responses
    A->>A: Rank response pairs (A > B)
    A->>RM: Preference labels
    RM->>RM: Train: maximize P(preferred > rejected)
    SFT->>PPO: Initialize policy weights
    PPO->>PPO: Generate response
    PPO->>RM: Score response
    RM-->>PPO: Reward scalar
    PPO->>KL: Measure drift from SFT
    KL-->>PPO: KL penalty
    PPO->>PPO: Update policy weights
```
This diagram traces the full RLHF training loop from raw SFT model to an aligned policy. Annotators rank candidate responses, the reward model (RM) learns those preferences, and then PPO iteratively generates responses, scores them against the RM, and applies KL-penalised weight updates. The key takeaway is that the SFT model serves double duty: it initialises the PPO policy and acts as the frozen reference for KL penalty computation.
Reward Model: Prompt to Score Flow

```mermaid
flowchart LR
    P[Prompt]
    RA[Response A]
    RB[Response B]
    RM["Reward Model (Bradley-Terry trained)"]
    SA["Score A: r(x, y_A)"]
    SB["Score B: r(x, y_B)"]
    Rank["Rank: A > B if r_A > r_B"]
    Policy["Policy Update (maximize reward)"]
    P --> RM
    RA --> RM --> SA
    RB --> RM --> SB
    SA --> Rank
    SB --> Rank
    Rank --> Policy
```
This flowchart shows how a single prompt feeds both Response A and Response B into the Bradley-Terry-trained reward model, producing two scalar scores. The scores are then compared to rank the responses, and that ranking drives the policy update โ the model is pushed to generate outputs that score more like Response A. The central insight is that the reward model replaces the human rater at policy-optimisation time, so its quality directly determines the ceiling of what PPO can achieve.
Preference Data Design: The Most Underrated Lever
Teams often obsess over PPO hyperparameters and underinvest in preference data design.
Good preference data characteristics
- clear rubric (helpfulness, harmlessness, honesty),
- diverse prompt categories,
- difficult borderline cases,
- explicit annotation disagreement tracking.
| Data design choice | Why it matters |
| --- | --- |
| Pairwise comparison instead of absolute scoring | Lower cognitive load for annotators |
| Rubric with examples | Improves label consistency |
| Multi-domain prompt mix | Prevents narrow alignment behavior |
| Audited annotator calibration rounds | Reduces drift over time |
If your labels are inconsistent, the reward model learns noise, and policy optimization amplifies that noise.
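A quick way to quantify that label consistency is a unanimity rate across annotators. The sketch below uses a hypothetical `pairwise_agreement` helper (deliberately simpler than chance-corrected metrics like Cohen's kappa) to flag how often all raters picked the same winner:

```python
from collections import Counter

def pairwise_agreement(labels_by_annotator):
    """Share of comparison items where every annotator picked the same winner.
    labels_by_annotator: one list of 'A'/'B' labels per annotator."""
    n_items = len(labels_by_annotator[0])
    unanimous = sum(
        1 for i in range(n_items)
        if len(Counter(ann[i] for ann in labels_by_annotator)) == 1
    )
    return unanimous / n_items

# Three annotators, four comparisons; only items 0 and 2 are unanimous:
labels = [
    ["A", "A", "B", "A"],
    ["A", "B", "B", "A"],
    ["A", "A", "B", "B"],
]
print(pairwise_agreement(labels))  # → 0.5
```

A low unanimity rate on rubric-covered items is a signal to fix the rubric before training the reward model, not after.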
Deep Dive: Reward Modeling and KL-Constrained Policy Updates
Internals: reward model objective
Given prompt \(x\) and two responses \(y_w\) (winner) and \(y_l\) (loser), the reward model \(r_\theta\) is often trained with a Bradley-Terry style objective:

\[ \mathcal{L}_{RM} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \]
This encourages higher reward for preferred responses.
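Plugging numbers into this objective shows why it behaves well: the loss is small when the reward model already ranks the winner higher and grows when it prefers the loser. A minimal pure-Python sketch of the per-pair loss (illustrative only, not training code):

```python
import math

def bt_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry reward-model loss: -log sigmoid(r_w - r_l)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

print(round(bt_loss(2.0, 0.0), 4))  # → 0.1269 (winner already scored higher)
print(round(bt_loss(0.0, 2.0), 4))  # → 2.1269 (RM prefers the loser: large loss)
```

Only the score margin matters, which is why reward-model scores are meaningful relatively, not on an absolute scale.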
Policy optimization objective (conceptual)
A common RLHF objective is:
\[ \max_{\pi} \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot|x)} \big[ r_\theta(x,y) \big] - \beta \, \mathrm{KL}\big( \pi(\cdot|x) \,\|\, \pi_{ref}(\cdot|x) \big) \]
Where:
- \(\pi\) is the trainable policy,
- \(\pi_{ref}\) is a frozen reference policy,
- \(\beta\) controls how far the policy can drift.
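In practice, many implementations fold the KL term into the per-sample reward before the policy-gradient step. A toy sketch with a hypothetical `shaped_reward` helper (the KL is estimated from the sampled response's log-probabilities, a single-sample estimator):

```python
def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float, beta: float) -> float:
    """Reward the optimizer actually sees: RM score minus a KL penalty.
    log(pi / pi_ref) is estimated from the sampled response's log-probs."""
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# Two responses with the same RM score; the second drifted far from the reference:
print(shaped_reward(1.0, logp_policy=-10.0, logp_ref=-10.5, beta=0.1))  # → 0.95
print(shaped_reward(1.0, logp_policy=-6.0, logp_ref=-12.0, beta=0.1))   # ~0.4
```

The drifted response loses most of its reward even though the reward model rated it identically; that is the mechanism that keeps the policy anchored to the reference.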
Performance analysis and stability controls
| Signal | Healthy trend | Warning sign |
| --- | --- | --- |
| Reward score | Gradual increase | Sharp spikes with human eval decline |
| KL divergence | Controlled range | Explosive drift from reference |
| Human eval | Improves on heldout prompts | Reward up but human preference down |
| Refusal behavior | More consistent policy adherence | Over-refusal or unsafe permissiveness |
Reward can become a proxy target that misses true user value. Human eval checkpoints are non-negotiable.
Internals

RLHF uses a three-stage pipeline: supervised fine-tuning (SFT) on demonstrations, reward model (RM) training on human preference pairs, and policy optimization via PPO. The reward model \(r_\phi(x, y)\) scores response quality; PPO maximizes \(\mathbb{E}[r_\phi(x, y)] - \beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{SFT})\), where the KL penalty prevents the policy from drifting too far from the SFT base. The reference policy \(\pi_{SFT}\) is frozen throughout PPO to stabilize training.
Performance Analysis

RLHF requires 4× the compute of SFT alone due to PPO's actor-critic rollout loop. Training InstructGPT (1.3B) required ~320 GPU-hours of PPO after SFT, which is modest compared to pre-training but operationally complex. DPO (Direct Preference Optimization), a drop-in RLHF alternative, achieves comparable alignment with a simple cross-entropy loss and 2–3× less training time.
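For intuition on why DPO can skip the reward model: its per-pair loss is a Bradley-Terry loss applied to implicit rewards, beta times the log-ratio of policy to reference probabilities. A pure-Python sketch (illustrative, not TRL's implementation):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: Bradley-Terry on implicit rewards beta * log(pi/pi_ref)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The policy already prefers the chosen response relative to the reference:
print(round(dpo_loss(-5.0, -7.0, -6.0, -6.0), 4))  # → 0.5981
```

The reward model's role is absorbed into the log-ratio terms, which is exactly what removes an entire training stage from the pipeline.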
RLHF Training Flow in One Diagram

```mermaid
flowchart TD
    A[Prompt Set] --> B[Generate candidate responses]
    B --> C[Human preference comparisons]
    C --> D[Train reward model]
    D --> E[Initialize policy from SFT model]
    E --> F[RL optimization with KL penalty]
    F --> G[Offline and online evaluations]
    G --> H{Pass acceptance criteria?}
    H -- No --> I[Refine data rubric and reward model]
    I --> D
    H -- Yes --> J[Release aligned policy]
```
This loop is expensive, so prioritizing data quality and evaluation design upfront saves many failed RL cycles.
Real-World Applications: Where RLHF Delivers Value
Helpful assistant quality
RLHF can reduce evasive or generic responses and improve usefulness under ambiguous prompts.
Safety policy consistency
Preference labels can encode policy-aligned refusal behavior better than plain SFT in many settings.
Tone and interaction quality
User satisfaction often improves when RLHF encourages clearer, context-sensitive responses.
| Use case | RLHF benefit |
| --- | --- |
| Consumer chat assistants | Better helpfulness and tone |
| Enterprise copilots | Policy-consistent behavior under edge prompts |
| Agentic workflows | Improved decision quality under preference criteria |
Risks, Trade-offs, and Failure Patterns
| Risk | How it appears | Mitigation |
| --- | --- | --- |
| Reward hacking | Policy exploits reward model quirks | Strong KL control + periodic human audits |
| Annotator bias | Responses optimized for narrow labeler style | Diverse annotator pool + rubric governance |
| Over-regularization | Model barely improves from SFT | Tune KL coefficient and rollout strategy |
| Under-regularization | Policy drift and unstable behavior | Tight KL bounds + early stop checks |
| Expensive iteration loop | Slow experimentation cadence | Smaller pilot loops before large training runs |
RLHF can improve alignment, but it can also institutionalize the wrong preferences if the process is poorly governed.
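These failure patterns can be wired into automated run checks. A toy monitor under assumed thresholds (the `kl_max` budget and the 0.5 win-rate floor are illustrative values, not recommendations):

```python
def check_run_health(reward_trend, kl_values, human_win_rate, kl_max=10.0):
    """Flag the two classic failure patterns: reward hacking
    (reward up, human eval down) and policy drift (KL blow-up)."""
    alerts = []
    if reward_trend > 0 and human_win_rate < 0.5:
        alerts.append("possible reward hacking: reward rising, human eval below 50%")
    if max(kl_values) > kl_max:
        alerts.append("policy drift: KL exceeded budget")
    return alerts

# Rising reward, exploding KL, and a falling human win rate: both alarms fire.
print(check_run_health(0.3, [2.0, 4.5, 12.1], 0.42))
```

Automated alerts like these do not replace human audits, but they catch the obvious divergences before a full evaluation cycle is wasted.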
Decision Guide: RLHF vs Simpler Preference Methods
| Situation | Preferred approach |
| --- | --- |
| Early-stage product, limited budget | High-quality SFT first |
| Need robust preference optimization at scale | RLHF pipeline |
| Need lower-complexity preference tuning | DPO or related direct-preference variants |
| Safety-critical behavior shifts | RLHF with strong evaluation governance |
A common practical path is: Pretraining -> SFT -> preference optimization (DPO/RLHF) -> continuous eval.
Practical Sketch with TRL-Style Components

This example demonstrates a minimal PPO-style RLHF training loop using Hugging Face TRL, the same generate-score-update pattern used in InstructGPT-style and open-source alignment pipelines. It is chosen because TRL's PPOTrainer encapsulates the three essential RLHF moves (generate, score, update) in a tight loop that mirrors the theoretical pipeline. As you read, focus on how the reward signal from the reward model flows into trainer.step() and how the KL-penalty config (kl_penalty, target_kl) controls drift from the reference policy.
```python
# Conceptual pseudo-pipeline (not production-ready):
# 1) Train reward model on preference pairs
# 2) Run PPO-style policy optimization with KL control
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    learning_rate=1e-6,
    batch_size=64,
    mini_batch_size=8,
    kl_penalty="kl",
    target_kl=0.1,
)

# trainer = PPOTrainer(config=ppo_config, model=policy_model, ref_model=ref_model, tokenizer=tok)
# for batch in rollout_loader:
#     responses = trainer.generate(batch["prompt_ids"])
#     rewards = reward_model.score(batch["prompts"], responses)
#     trainer.step(batch["prompt_ids"], responses, rewards)
```
Production systems add:
- reward model sanity checks,
- adversarial prompt suites,
- rollback thresholds based on human evaluation.
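A reward model sanity check can be as simple as pairwise accuracy on a held-out preference set before any PPO compute is spent. A sketch with an illustrative acceptance threshold (0.7 is an assumed value, not a standard):

```python
def rm_holdout_accuracy(scored_pairs):
    """Fraction of held-out preference pairs where the reward model
    scores the human-preferred response above the rejected one."""
    correct = sum(1 for r_chosen, r_rejected in scored_pairs if r_chosen > r_rejected)
    return correct / len(scored_pairs)

# (score_chosen, score_rejected) tuples from a held-out eval split:
pairs = [(1.2, 0.3), (0.8, 1.1), (2.0, -0.5), (0.4, 0.1)]
acc = rm_holdout_accuracy(pairs)
print(acc)  # → 0.75
assert acc > 0.7, "reward model below sanity threshold; do not start PPO"
```

A reward model that cannot beat this bar on held-out pairs will only teach the policy noise, so the check belongs before the expensive RL stage, not after.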
Hugging Face TRL and DeepSpeed: Scaling RLHF to Multi-GPU Clusters

Hugging Face TRL provides the DPOTrainer as a simpler, often superior alternative to the full PPO pipeline: it directly optimizes the policy on preference pairs without a separate reward model training step. DeepSpeed (Microsoft) is the distributed training engine that makes RLHF computationally feasible at scale: its ZeRO optimizer stages shard model states, gradients, and optimizer states across GPUs, enabling PPO and DPO training of 7B–70B models on multi-GPU clusters that would otherwise run out of memory.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# DPO eliminates the reward model training step entirely:
# it directly optimizes the policy on (prompt, chosen, rejected) triples.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)  # frozen

# Dataset format: {"prompt": str, "chosen": str, "rejected": str}
preference_data = load_dataset("json", data_files="dpo_preferences.jsonl", split="train")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen SFT baseline for KL regularization
    args=DPOConfig(
        output_dir="./dpo-output",
        beta=0.1,  # KL regularization strength (same role as beta in PPO)
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
        # DeepSpeed ZeRO Stage 2 config for multi-GPU training
        deepspeed="ds_config_zero2.json",
    ),
    tokenizer=tokenizer,
    train_dataset=preference_data,
)
dpo_trainer.train()
dpo_trainer.save_model("./dpo-aligned-model")
```
A minimal ds_config_zero2.json enables multi-GPU training without changing a line of Python:
```json
{
  "zero_optimization": { "stage": 2, "overlap_comm": true },
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0
}
```
| Tool | Role | When to reach for it |
| --- | --- | --- |
| TRL DPOTrainer | Preference optimization without RM training | Simpler alternative to PPO for most teams |
| TRL PPOTrainer | Full RLHF with reward model + KL-constrained policy updates | When an interpretable reward signal is required |
| DeepSpeed ZeRO | Shard model/optimizer state across GPUs | Training 7B+ models on multi-GPU clusters |
For a full deep-dive on Hugging Face TRL and DeepSpeed, dedicated follow-up posts are planned.
Practical Lessons from Alignment Teams
- A better rubric often beats a bigger reward model.
- Keep a fixed human-eval holdout set across runs.
- Monitor KL and human preference together, never in isolation.
- Add category-level breakdowns (safety, factuality, tone, refusal).
- Treat RLHF as governance + modeling, not only training code.
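The category-level breakdown in the list above can be a few lines of bookkeeping. A sketch with a hypothetical `category_win_rates` helper over human-eval outcomes:

```python
from collections import defaultdict

def category_win_rates(evals):
    """evals: (category, won_vs_baseline) pairs from human review.
    Per-category rates expose regressions a single average hides."""
    wins, totals = defaultdict(int), defaultdict(int)
    for category, won in evals:
        totals[category] += 1
        wins[category] += int(won)
    return {c: wins[c] / totals[c] for c in totals}

results = [("safety", True), ("safety", True), ("tone", False),
           ("tone", True), ("factuality", False), ("factuality", False)]
print(category_win_rates(results))  # the 50% overall average hides a 0% factuality win rate
```

Tracking these rates on the same fixed holdout set across runs is what makes run-to-run comparisons trustworthy.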
TLDR: Summary & Key Takeaways

TLDR: RLHF adds a preference-optimization loop on top of SFT. Collect pairwise rankings, train a reward model, then run KL-constrained PPO to shift the policy toward human-preferred outputs.
- RLHF optimizes model policy using preference signals, not only supervised labels.
- Reward model quality determines how useful RLHF updates will be.
- KL-constrained optimization helps prevent destructive policy drift.
- Human evaluations remain the ground truth against reward overfitting.
- The strongest RLHF systems combine technical rigor with annotation governance.
One-liner: RLHF can make assistants far more aligned, but only if your preference pipeline is trustworthy.
Written by
Abstract Algorithms
@abstractalgorithms