
RLHF in Practice: From Human Preferences to Better LLM Policies

RLHF turns human preference signals into policy updates for more useful LLM behavior.

Abstract Algorithms
· 11 min read

AI-assisted content.

TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a reward model, then optimize a policy (often with KL constraints to stay close to a reference model). RLHF can significantly improve usefulness and harmlessness, but it introduces risks like reward hacking and annotation bias.


📖 Why RLHF Exists After Pretraining and SFT

GPT-2 could write fluent text but produced harmful content on request. The fix wasn't more data; it was RLHF: human raters scored outputs, a reward model learned their preferences, and PPO optimized the policy to score higher. This post explains that pipeline step by step.

Pretraining teaches language fluency. SFT teaches example-following behavior. Yet product teams often still see issues after both stages:

  • responses that are technically correct but unhelpful,
  • tone that ignores user intent,
  • overconfident mistakes,
  • weak refusal behavior in unsafe scenarios.

RLHF exists because not all desired behavior is easy to encode as direct supervised labels. Humans can often compare two responses faster than they can craft one perfect reference response.

| Stage | Strength | Limitation |
|---|---|---|
| Pretraining | Broad language priors | No product preference alignment |
| SFT | Teaches explicit response patterns | Limited by demonstration coverage |
| RLHF | Optimizes for preference signals | Sensitive to reward model quality |

RLHF does not replace SFT. It usually builds on SFT as a stronger initialization.


๐Ÿ” The RLHF Pipeline: Data, Reward, Policy

Most production RLHF pipelines include three parts:

  1. Preference data collection
  2. Reward model training
  3. Policy optimization with regularization

1) Preference data collection

Annotators compare candidate responses for the same prompt:

  • Which answer is more helpful?
  • Which answer is safer?
  • Which answer follows instructions better?
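Concretely, each comparison can be stored as a small record; a sketch (the field names are illustrative, though (prompt, chosen, rejected) triples are a common convention for preference data):

```python
# A single preference record, as it might appear in a JSONL file.
# "annotator_id" and "rubric_dimension" are hypothetical extras that
# make disagreement tracking and rubric audits possible later.
record = {
    "prompt": "Explain what a KL penalty does in RLHF.",
    "chosen": "A KL penalty keeps the updated policy close to a reference model...",
    "rejected": "KL is a kind of learning rate.",
    "annotator_id": "ann-042",
    "rubric_dimension": "helpfulness",
}

def is_valid(rec: dict) -> bool:
    """Minimal validation before a record enters reward model training."""
    required = ("prompt", "chosen", "rejected")
    return all(rec.get(k) for k in required) and rec["chosen"] != rec["rejected"]
```

Even a check this small catches empty responses and identical pairs, two artifacts that silently degrade reward model training.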

2) Reward model training

A separate model learns to score responses so preferred outputs receive higher scores.

3) Policy optimization

The assistant policy is updated to maximize reward while staying close to a reference policy using KL penalties.

| Component | Input | Output | Common failure mode |
|---|---|---|---|
| Preference dataset | Prompt + response pairs + preference labels | Ranked examples | Annotator inconsistency |
| Reward model | Ranked examples | Scalar reward signal | Overfitting to annotation artifacts |
| Policy optimizer | Prompt + reward signal | Updated policy | Reward hacking / style collapse |

📊 RLHF End-to-End Training Sequence

sequenceDiagram
    participant SFT as SFT Model
    participant A as Annotators
    participant RM as Reward Model
    participant PPO as PPO Trainer
    participant KL as KL Constraint

    SFT->>A: Generate candidate responses
    A->>A: Rank response pairs (A > B)
    A->>RM: Preference labels
    RM->>RM: Train: maximize P(preferred > rejected)
    SFT->>PPO: Initialize policy weights
    PPO->>PPO: Generate response
    PPO->>RM: Score response
    RM-->>PPO: Reward scalar
    PPO->>KL: Measure drift from SFT
    KL-->>PPO: KL penalty
    PPO->>PPO: Update policy weights

This diagram traces the full RLHF training loop from raw SFT model to an aligned policy. Annotators rank candidate responses, the reward model (RM) learns those preferences, and then PPO iteratively generates responses, scores them against the RM, and applies KL-penalized weight updates. The key takeaway is that the SFT model serves double duty: it initializes the PPO policy and acts as the frozen reference for KL penalty computation.

📊 Reward Model: Prompt to Score Flow

flowchart LR
    P[Prompt]
    RA[Response A]
    RB[Response B]
    RM["Reward Model (Bradley-Terry trained)"]
    SA["Score A: r(x, y_A)"]
    SB["Score B: r(x, y_B)"]
    Rank["Rank: A > B if r_A > r_B"]
    Policy["Policy Update (maximize reward)"]

    P --> RM
    RA --> RM --> SA
    RB --> RM --> SB
    SA --> Rank
    SB --> Rank
    Rank --> Policy

This flowchart shows how a single prompt feeds both Response A and Response B into the Bradley-Terry-trained reward model, producing two scalar scores. The scores are then compared to rank the responses, and that ranking drives the policy update: the model is pushed to generate outputs that score more like Response A. The central insight is that the reward model replaces the human rater at policy-optimization time, so its quality directly determines the ceiling of what PPO can achieve.


โš™๏ธ Preference Data Design: The Most Underrated Lever

Teams often obsess over PPO hyperparameters and underinvest in preference data design.

Good preference data characteristics

  • clear rubric (helpfulness, harmlessness, honesty),
  • diverse prompt categories,
  • difficult borderline cases,
  • explicit annotation disagreement tracking.

| Data design choice | Why it matters |
|---|---|
| Pairwise comparison instead of absolute scoring | Lower cognitive load for annotators |
| Rubric with examples | Improves label consistency |
| Multi-domain prompt mix | Prevents narrow alignment behavior |
| Audited annotator calibration rounds | Reduces drift over time |

If your labels are inconsistent, the reward model learns noise, and policy optimization amplifies that noise.
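A quick way to surface that inconsistency is raw inter-annotator agreement; a minimal sketch (production pipelines often prefer chance-corrected statistics such as Cohen's kappa):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of comparisons where two annotators picked the same winner.

    labels are sequences like ["A", "B", "A", ...], one entry per comparison.
    This is raw agreement, not chance-corrected, so treat it as a smoke test.
    """
    assert len(labels_a) == len(labels_b), "annotators must label the same comparisons"
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Two annotators disagree on one of four comparisons -> 0.75 agreement.
rate = agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

Tracking this per rubric dimension tells you whether low agreement comes from a vague rubric or from genuinely borderline prompts.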


🧠 Deep Dive: Reward Modeling and KL-Constrained Policy Updates

Internals: reward model objective

Given prompt x and two responses y_w (winner) and y_l (loser), the reward model r_θ is often trained with a Bradley-Terry style objective:

\[ \mathcal{L}_{\mathrm{RM}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \]

This encourages higher reward for preferred responses.
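A minimal pure-Python sketch of this loss (real reward models compute it with framework tensor ops over batched scores):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry_loss(r_winner, r_loser):
    """-log sigma(r_w - r_l), averaged over preference pairs.

    r_winner / r_loser are lists of scalar reward-model scores for the
    preferred and rejected responses of each pair.
    """
    losses = [-math.log(sigmoid(w - l)) for w, l in zip(r_winner, r_loser)]
    return sum(losses) / len(losses)

# Preferred responses already score higher, so the loss is small;
# the larger the margin, the smaller the loss.
loss = bradley_terry_loss([1.2, 0.3], [0.4, -0.1])
```

The gradient of this loss pushes the winner's score up and the loser's score down, exactly the "higher reward for preferred responses" behavior described above.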

Policy optimization objective (conceptual)

A common RLHF objective is:

\[ \max_\pi \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot|x)}\big[r_\theta(x, y)\big] - \beta \, \mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi_{\mathrm{ref}}(\cdot|x)\big) \]

Where:

  • π is the trainable policy,
  • π_ref is a frozen reference policy,
  • β controls how far the policy can drift.
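At rollout time this objective often shows up as a shaped per-sequence reward, using the common single-sample KL estimate log π(y|x) − log π_ref(y|x); a sketch with illustrative names:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Sequence-level reward with a KL penalty toward the reference policy.

    rm_score: scalar reward model score for the sampled response.
    logp_policy / logp_ref: per-token log-probs of that response under the
    trainable policy and the frozen reference. beta plays the same role as
    in the objective above.
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# The policy assigns higher log-probs than the reference, so the KL
# estimate is positive and the shaped reward dips below the raw RM score.
r = shaped_reward(1.0, logp_policy=[-1.0, -2.0], logp_ref=[-1.5, -2.1], beta=0.1)
```

When the policy matches the reference exactly, the penalty vanishes and the shaped reward equals the raw reward model score.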

Performance analysis and stability controls

| Signal | Healthy trend | Warning sign |
|---|---|---|
| Reward score | Gradual increase | Sharp spikes with human eval decline |
| KL divergence | Controlled range | Explosive drift from reference |
| Human eval | Improves on held-out prompts | Reward up but human preference down |
| Refusal behavior | More consistent policy adherence | Over-refusal or unsafe permissiveness |

Reward can become a proxy target that misses true user value. Human eval checkpoints are non-negotiable.
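These warning signs can be wired into a simple automated check; a sketch with illustrative thresholds (tune kl_ceiling and spike_ratio to your model and reward scale):

```python
def health_check(reward_hist, kl_hist, kl_ceiling=15.0, spike_ratio=1.5):
    """Flag warning signs from recent training history.

    reward_hist / kl_hist: per-step mean reward and mean KL values.
    Returns a list of warning tags; an empty list means no flags fired.
    Thresholds here are placeholders, not recommended defaults.
    """
    warnings = []
    if kl_hist and kl_hist[-1] > kl_ceiling:
        warnings.append("kl_drift")
    if len(reward_hist) >= 2 and reward_hist[-1] > spike_ratio * max(reward_hist[:-1]):
        warnings.append("reward_spike")
    return warnings
```

A "reward_spike" flag alone is not proof of reward hacking, but combined with a human-eval decline it is the classic signature.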


🔬 Internals

RLHF uses a three-stage pipeline: supervised fine-tuning (SFT) on demonstrations, reward model (RM) training on human preference pairs, and policy optimization via PPO. The reward model r_φ(x, y) scores response quality; PPO maximizes E[r_φ(x, y)] - β·KL(π_θ || π_SFT), where the KL penalty prevents the policy from drifting too far from the SFT base. The reference policy π_SFT is frozen throughout PPO to stabilize training.

⚡ Performance Analysis

RLHF requires roughly 4× the compute of SFT alone due to PPO's actor-critic rollout loop. Training InstructGPT (1.3B) required ~320 GPU-hours of PPO after SFT, modest compared to pre-training but operationally complex. DPO (Direct Preference Optimization), a drop-in RLHF alternative, achieves comparable alignment with a simple cross-entropy loss and 2–3× less training time.

📊 RLHF Training Flow in One Diagram

flowchart TD
    A[Prompt Set] --> B[Generate candidate responses]
    B --> C[Human preference comparisons]
    C --> D[Train reward model]
    D --> E[Initialize policy from SFT model]
    E --> F[RL optimization with KL penalty]
    F --> G[Offline and online evaluations]
    G --> H{Pass acceptance criteria?}
    H -- No --> I[Refine data rubric and reward model]
    I --> D
    H -- Yes --> J[Release aligned policy]

This loop is expensive, so prioritizing data quality and evaluation design upfront saves many failed RL cycles.


๐ŸŒ Real-World Applications: Where RLHF Delivers Value

Helpful assistant quality

RLHF can reduce evasive or generic responses and improve usefulness under ambiguous prompts.

Safety policy consistency

Preference labels can encode policy-aligned refusal behavior better than plain SFT in many settings.

Tone and interaction quality

User satisfaction often improves when RLHF encourages clearer, context-sensitive responses.

| Use case | RLHF benefit |
|---|---|
| Consumer chat assistants | Better helpfulness and tone |
| Enterprise copilots | Policy-consistent behavior under edge prompts |
| Agentic workflows | Improved decision quality under preference criteria |

โš–๏ธ Trade-offs & Failure Modes: Risks, Trade-offs, and Failure Patterns

| Risk | How it appears | Mitigation |
|---|---|---|
| Reward hacking | Policy exploits reward model quirks | Strong KL control + periodic human audits |
| Annotator bias | Responses optimized for narrow labeler style | Diverse annotator pool + rubric governance |
| Over-regularization | Model barely improves from SFT | Tune KL coefficient and rollout strategy |
| Under-regularization | Policy drift and unstable behavior | Tight KL bounds + early stop checks |
| Expensive iteration loop | Slow experimentation cadence | Smaller pilot loops before large training runs |

RLHF can improve alignment, but it can also institutionalize the wrong preferences if the process is poorly governed.


🧭 Decision Guide: RLHF vs Simpler Preference Methods

| Situation | Preferred approach |
|---|---|
| Early-stage product, limited budget | High-quality SFT first |
| Need robust preference optimization at scale | RLHF pipeline |
| Need lower-complexity preference tuning | DPO or direct preference optimization variants |
| Safety-critical behavior shifts | RLHF with strong evaluation governance |

A common practical path is: Pretraining -> SFT -> preference optimization (DPO/RLHF) -> continuous eval.


🧪 Practical Sketch with TRL-Style Components

This example demonstrates a minimal PPO-style RLHF training loop using Hugging Face TRL, the same generate-score-update pattern popularized by InstructGPT-style and open-source alignment pipelines. TRL's PPOTrainer is used here because it encapsulates the three essential RLHF moves (generate, score, update) in a tight loop that mirrors the theoretical pipeline. As you read, focus on how the reward signal from the reward model flows into trainer.step() and how the KL-penalty config (kl_penalty, target_kl) controls drift from the reference policy.

# Conceptual pseudo-pipeline (not production-ready):
# 1) Train reward model on preference pairs
# 2) Run PPO-style policy optimization with KL control

from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    learning_rate=1e-6,
    batch_size=64,
    mini_batch_size=8,
    kl_penalty="kl",
    target_kl=0.1,
)

# trainer = PPOTrainer(config=ppo_config, model=policy_model, ref_model=ref_model, tokenizer=tok)
# for batch in rollout_loader:
#     responses = trainer.generate(batch["prompt_ids"])
#     rewards = reward_model.score(batch["prompts"], responses)
#     trainer.step(batch["prompt_ids"], responses, rewards)

Production systems add:

  • reward model sanity checks,
  • adversarial prompt suites,
  • rollback thresholds based on human evaluation.
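The first of those checks can be as simple as verifying ranking accuracy on a held-out preference set; a sketch where the score callable and the 0.7 accuracy floor are placeholders for whatever your pipeline exposes:

```python
def rm_sanity_check(score, holdout, min_accuracy=0.7):
    """Verify the reward model still ranks held-out preference pairs correctly.

    score: callable (prompt, response) -> scalar reward.
    holdout: list of (prompt, chosen, rejected) triples never seen in training.
    Returns True if ranking accuracy meets the (illustrative) floor.
    """
    correct = sum(score(p, w) > score(p, l) for p, w, l in holdout)
    return correct / len(holdout) >= min_accuracy

# Toy scorer that prefers longer responses, for demonstration only.
toy_score = lambda prompt, response: len(response)
pairs = [("q1", "a detailed answer", "ok"), ("q2", "thorough reply", "meh")]
passed = rm_sanity_check(toy_score, pairs)
```

Running this before every PPO run is cheap insurance: a reward model that has regressed on its own held-out preferences will only mislead the policy optimizer.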

๐Ÿ› ๏ธ Hugging Face TRL and DeepSpeed: Scaling RLHF to Multi-GPU Clusters

Hugging Face TRL provides the DPOTrainer as a simpler, often equally effective alternative to the full PPO pipeline: it directly optimizes the policy on preference pairs without a separate reward model training step. DeepSpeed (Microsoft) is the distributed training engine that makes RLHF computationally feasible at scale: its ZeRO optimizer stages shard model states, gradients, and optimizer states across GPUs, enabling PPO and DPO training of 7B–70B models on multi-GPU clusters that would otherwise run out of memory.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# DPO eliminates the reward model training step entirely:
# it directly optimizes the policy on (prompt, chosen, rejected) triples.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
ref_model  = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)  # frozen

# Dataset format: {"prompt": str, "chosen": str, "rejected": str}
preference_data = load_dataset("json", data_files="dpo_preferences.jsonl", split="train")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,          # frozen SFT baseline for KL regularization
    args=DPOConfig(
        output_dir="./dpo-output",
        beta=0.1,                  # KL regularization strength (same role as β in PPO)
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
        # DeepSpeed ZeRO Stage 2 config for multi-GPU training
        deepspeed="ds_config_zero2.json",
    ),
    tokenizer=tokenizer,          # newer TRL releases use processing_class= instead
    train_dataset=preference_data,
)

dpo_trainer.train()
dpo_trainer.save_model("./dpo-aligned-model")

A minimal ds_config_zero2.json enables multi-GPU training without changing a line of Python:

{
  "zero_optimization": { "stage": 2, "overlap_comm": true },
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0
}

| Tool | Role | When to reach for it |
|---|---|---|
| TRL DPOTrainer | Preference optimization without RM training | Simpler alternative to PPO for most teams |
| TRL PPOTrainer | Full RLHF with reward model + KL-constrained policy updates | When an interpretable reward signal is required |
| DeepSpeed ZeRO | Shard model/optimizer state across GPUs | Training 7B+ models on multi-GPU clusters |

For a full deep-dive on Hugging Face TRL and DeepSpeed, dedicated follow-up posts are planned.


📚 Practical Lessons from Alignment Teams

  • A better rubric often beats a bigger reward model.
  • Keep a fixed human-eval holdout set across runs.
  • Monitor KL and human preference together, never in isolation.
  • Add category-level breakdowns (safety, factuality, tone, refusal).
  • Treat RLHF as governance + modeling, not only training code.
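The category-level breakdown mentioned above is cheap to wire up; a minimal sketch over pairwise human-eval outcomes:

```python
from collections import defaultdict

def category_breakdown(evals):
    """Aggregate human-eval win rates per category.

    evals: iterable of (category, won) pairs, e.g. ("safety", True) meaning
    the RLHF model beat the baseline on a safety prompt.
    Returns {category: win_rate}.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for category, won in evals:
        totals[category] += 1
        wins[category] += int(won)
    return {c: wins[c] / totals[c] for c in totals}

results = [("safety", True), ("safety", False), ("tone", True)]
breakdown = category_breakdown(results)  # {"safety": 0.5, "tone": 1.0}
```

An aggregate win rate can look healthy while one category (often refusal behavior) quietly regresses; per-category rates make that visible.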

📌 TLDR: Summary & Key Takeaways

TLDR: RLHF adds a preference-optimization loop on top of SFT: collect pairwise rankings, train a reward model, then run KL-constrained PPO to shift the policy toward human-preferred outputs.

  • RLHF optimizes model policy using preference signals, not only supervised labels.
  • Reward model quality determines how useful RLHF updates will be.
  • KL-constrained optimization helps prevent destructive policy drift.
  • Human evaluations remain the ground truth against reward overfitting.
  • The strongest RLHF systems combine technical rigor with annotation governance.

One-liner: RLHF can make assistants far more aligned, but only if your preference pipeline is trustworthy.



Written by Abstract Algorithms (@abstractalgorithms)