
RLHF in Practice: From Human Preferences to Better LLM Policies

RLHF turns human preference signals into policy updates for more useful LLM behavior.

Abstract Algorithms
· 11 min read

AI-assisted content.

TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a reward model, then optimize a policy (often with KL constraints to stay close to a reference model). RLHF can significantly improve usefulness and harmlessness, but it introduces risks like reward hacking and annotation bias.


📖 Why RLHF Exists After Pretraining and SFT

GPT-2 could write fluent text but produced harmful content on request. The fix wasn't more data; it was RLHF: human raters scored outputs, a reward model learned their preferences, and PPO optimized the policy to score higher. This post explains that pipeline step by step.

Pretraining teaches language fluency. SFT teaches example-following behavior. Yet product teams often still see issues after both stages:

  • responses that are technically correct but unhelpful,
  • tone that ignores user intent,
  • overconfident mistakes,
  • weak refusal behavior in unsafe scenarios.

RLHF exists because not all desired behavior is easy to encode as direct supervised labels. Humans can often compare two responses faster than they can craft one perfect reference response.

| Stage | Strength | Limitation |
|---|---|---|
| Pretraining | Broad language priors | No product preference alignment |
| SFT | Teaches explicit response patterns | Limited by demonstration coverage |
| RLHF | Optimizes for preference signals | Sensitive to reward model quality |

RLHF does not replace SFT. It usually builds on SFT as a stronger initialization.


๐Ÿ” The RLHF Pipeline: Data, Reward, Policy

Most production RLHF pipelines include three parts:

  1. Preference data collection
  2. Reward model training
  3. Policy optimization with regularization

1) Preference data collection

Annotators compare candidate responses for the same prompt:

  • Which answer is more helpful?
  • Which answer is safer?
  • Which answer follows instructions better?
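Concretely, each comparison can be stored as a small record; a sketch (the field names are illustrative, though (prompt, chosen, rejected) triples are a common convention for preference data):

```python
# A single preference record, as it might appear in a JSONL file.
# "annotator_id" and "rubric_dimension" are hypothetical extras that
# make disagreement tracking and rubric audits possible later.
record = {
    "prompt": "Explain what a KL penalty does in RLHF.",
    "chosen": "A KL penalty keeps the updated policy close to a reference model...",
    "rejected": "KL is a kind of learning rate.",
    "annotator_id": "ann-042",
    "rubric_dimension": "helpfulness",
}

def is_valid(rec: dict) -> bool:
    """Minimal validation before a record enters reward model training."""
    required = ("prompt", "chosen", "rejected")
    return all(rec.get(k) for k in required) and rec["chosen"] != rec["rejected"]
```

Even a check this small catches empty responses and identical pairs, two artifacts that silently degrade reward model training.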

2) Reward model training

A separate model learns to score responses so preferred outputs receive higher scores.

3) Policy optimization

The assistant policy is updated to maximize reward while staying close to a reference policy using KL penalties.

| Component | Input | Output | Common failure mode |
|---|---|---|---|
| Preference dataset | Prompt + response pairs + preference labels | Ranked examples | Annotator inconsistency |
| Reward model | Ranked examples | Scalar reward signal | Overfitting to annotation artifacts |
| Policy optimizer | Prompt + reward signal | Updated policy | Reward hacking / style collapse |

📊 RLHF End-to-End Training Sequence

sequenceDiagram
    participant SFT as SFT Model
    participant A as Annotators
    participant RM as Reward Model
    participant PPO as PPO Trainer
    participant KL as KL Constraint

    SFT->>A: Generate candidate responses
    A->>A: Rank response pairs (A > B)
    A->>RM: Preference labels
    RM->>RM: Train: maximize P(preferred > rejected)
    SFT->>PPO: Initialize policy weights
    PPO->>PPO: Generate response
    PPO->>RM: Score response
    RM-->>PPO: Reward scalar
    PPO->>KL: Measure drift from SFT
    KL-->>PPO: KL penalty
    PPO->>PPO: Update policy weights

This diagram traces the full RLHF training loop from raw SFT model to an aligned policy. Annotators rank candidate responses, the reward model (RM) learns those preferences, and then PPO iteratively generates responses, scores them against the RM, and applies KL-penalized weight updates. The key takeaway is that the SFT model serves double duty: it initializes the PPO policy and acts as the frozen reference for KL penalty computation.

📊 Reward Model: Prompt to Score Flow

flowchart LR
    P[Prompt]
    RA[Response A]
    RB[Response B]
    RM["Reward Model (Bradley-Terry trained)"]
    SA["Score A: r(x, y_A)"]
    SB["Score B: r(x, y_B)"]
    Rank["Rank: A > B if r_A > r_B"]
    Policy["Policy Update (maximize reward)"]

    P --> RM
    RA --> RM --> SA
    RB --> RM --> SB
    SA --> Rank
    SB --> Rank
    Rank --> Policy

This flowchart shows how a single prompt feeds both Response A and Response B into the Bradley-Terry-trained reward model, producing two scalar scores. The scores are then compared to rank the responses, and that ranking drives the policy update: the model is pushed to generate outputs that score more like Response A. The central insight is that the reward model replaces the human rater at policy-optimization time, so its quality directly determines the ceiling of what PPO can achieve.


โš™๏ธ Preference Data Design: The Most Underrated Lever

Teams often obsess over PPO hyperparameters and underinvest in preference data design.

Good preference data characteristics

  • clear rubric (helpfulness, harmlessness, honesty),
  • diverse prompt categories,
  • difficult borderline cases,
  • explicit annotation disagreement tracking.

| Data design choice | Why it matters |
|---|---|
| Pairwise comparison instead of absolute scoring | Lower cognitive load for annotators |
| Rubric with examples | Improves label consistency |
| Multi-domain prompt mix | Prevents narrow alignment behavior |
| Audited annotator calibration rounds | Reduces drift over time |

If your labels are inconsistent, the reward model learns noise, and policy optimization amplifies that noise.
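A quick way to surface that inconsistency is raw inter-annotator agreement; a minimal sketch (production pipelines often prefer chance-corrected statistics such as Cohen's kappa):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of comparisons where two annotators picked the same winner.

    labels are sequences like ["A", "B", "A", ...], one entry per comparison.
    This is raw agreement, not chance-corrected, so treat it as a smoke test.
    """
    assert len(labels_a) == len(labels_b), "annotators must label the same comparisons"
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Two annotators disagree on one of four comparisons -> 0.75 agreement.
rate = agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

Tracking this per rubric dimension tells you whether low agreement comes from a vague rubric or from genuinely borderline prompts.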


🧠 Deep Dive: Reward Modeling and KL-Constrained Policy Updates

Internals: reward model objective

Given prompt x and two responses y_w (winner) and y_l (loser), the reward model r_θ is often trained with a Bradley-Terry style objective:

\[ \mathcal{L}_{\mathrm{RM}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \]

This encourages higher reward for preferred responses.
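A minimal pure-Python sketch of this loss (real reward models compute it with framework tensor ops over batched scores):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry_loss(r_winner, r_loser):
    """-log sigma(r_w - r_l), averaged over preference pairs.

    r_winner / r_loser are lists of scalar reward-model scores for the
    preferred and rejected responses of each pair.
    """
    losses = [-math.log(sigmoid(w - l)) for w, l in zip(r_winner, r_loser)]
    return sum(losses) / len(losses)

# Preferred responses already score higher, so the loss is small;
# the larger the margin, the smaller the loss.
loss = bradley_terry_loss([1.2, 0.3], [0.4, -0.1])
```

The gradient of this loss pushes the winner's score up and the loser's score down, exactly the "higher reward for preferred responses" behavior described above.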

Policy optimization objective (conceptual)

A common RLHF objective is:

\[ \max_\pi \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot|x)}\big[r_\theta(x, y)\big] - \beta \, \mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi_{\mathrm{ref}}(\cdot|x)\big) \]

Where:

  • π is the trainable policy,
  • π_ref is a frozen reference policy,
  • β controls how far the policy can drift.
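At rollout time this objective often shows up as a shaped per-sequence reward, using the common single-sample KL estimate log π(y|x) − log π_ref(y|x); a sketch with illustrative names:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Sequence-level reward with a KL penalty toward the reference policy.

    rm_score: scalar reward model score for the sampled response.
    logp_policy / logp_ref: per-token log-probs of that response under the
    trainable policy and the frozen reference. beta plays the same role as
    in the objective above.
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# The policy assigns higher log-probs than the reference, so the KL
# estimate is positive and the shaped reward dips below the raw RM score.
r = shaped_reward(1.0, logp_policy=[-1.0, -2.0], logp_ref=[-1.5, -2.1], beta=0.1)
```

When the policy matches the reference exactly, the penalty vanishes and the shaped reward equals the raw reward model score.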

Performance analysis and stability controls

| Signal | Healthy trend | Warning sign |
|---|---|---|
| Reward score | Gradual increase | Sharp spikes with human eval decline |
| KL divergence | Controlled range | Explosive drift from reference |
| Human eval | Improves on held-out prompts | Reward up but human preference down |
| Refusal behavior | More consistent policy adherence | Over-refusal or unsafe permissiveness |

Reward can become a proxy target that misses true user value. Human eval checkpoints are non-negotiable.
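These warning signs can be wired into a simple automated check; a sketch with illustrative thresholds (tune kl_ceiling and spike_ratio to your model and reward scale):

```python
def health_check(reward_hist, kl_hist, kl_ceiling=15.0, spike_ratio=1.5):
    """Flag warning signs from recent training history.

    reward_hist / kl_hist: per-step mean reward and mean KL values.
    Returns a list of warning tags; an empty list means no flags fired.
    Thresholds here are placeholders, not recommended defaults.
    """
    warnings = []
    if kl_hist and kl_hist[-1] > kl_ceiling:
        warnings.append("kl_drift")
    if len(reward_hist) >= 2 and reward_hist[-1] > spike_ratio * max(reward_hist[:-1]):
        warnings.append("reward_spike")
    return warnings
```

A "reward_spike" flag alone is not proof of reward hacking, but combined with a human-eval decline it is the classic signature.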


🔬 Internals

RLHF uses a three-stage pipeline: supervised fine-tuning (SFT) on demonstrations, reward model (RM) training on human preference pairs, and policy optimization via PPO. The reward model r_φ(x, y) scores response quality; PPO maximizes E[r_φ(x, y)] - β·KL(π_θ || π_SFT), where the KL penalty prevents the policy from drifting too far from the SFT base. The reference policy π_SFT is frozen throughout PPO to stabilize training.

⚡ Performance Analysis

RLHF requires roughly 4× the compute of SFT alone due to PPO's actor-critic rollout loop. Training InstructGPT (1.3B) required ~320 GPU-hours of PPO after SFT, modest compared to pre-training but operationally complex. DPO (Direct Preference Optimization), a drop-in RLHF alternative, achieves comparable alignment with a simple cross-entropy loss and 2–3× less training time.

📊 RLHF Training Flow in One Diagram

flowchart TD
    A[Prompt Set] --> B[Generate candidate responses]
    B --> C[Human preference comparisons]
    C --> D[Train reward model]
    D --> E[Initialize policy from SFT model]
    E --> F[RL optimization with KL penalty]
    F --> G[Offline and online evaluations]
    G --> H{Pass acceptance criteria?}
    H -- No --> I[Refine data rubric and reward model]
    I --> D
    H -- Yes --> J[Release aligned policy]

This loop is expensive, so prioritizing data quality and evaluation design upfront saves many failed RL cycles.


๐ŸŒ Real-World Applications: Where RLHF Delivers Value

Helpful assistant quality

RLHF can reduce evasive or generic responses and improve usefulness under ambiguous prompts.

Safety policy consistency

Preference labels can encode policy-aligned refusal behavior better than plain SFT in many settings.

Tone and interaction quality

User satisfaction often improves when RLHF encourages clearer, context-sensitive responses.

| Use case | RLHF benefit |
|---|---|
| Consumer chat assistants | Better helpfulness and tone |
| Enterprise copilots | Policy-consistent behavior under edge prompts |
| Agentic workflows | Improved decision quality under preference criteria |

โš–๏ธ Trade-offs & Failure Modes: Risks, Trade-offs, and Failure Patterns

| Risk | How it appears | Mitigation |
|---|---|---|
| Reward hacking | Policy exploits reward model quirks | Strong KL control + periodic human audits |
| Annotator bias | Responses optimized for narrow labeler style | Diverse annotator pool + rubric governance |
| Over-regularization | Model barely improves from SFT | Tune KL coefficient and rollout strategy |
| Under-regularization | Policy drift and unstable behavior | Tight KL bounds + early stop checks |
| Expensive iteration loop | Slow experimentation cadence | Smaller pilot loops before large training runs |

RLHF can improve alignment, but it can also institutionalize the wrong preferences if the process is poorly governed.


🧭 Decision Guide: RLHF vs Simpler Preference Methods

| Situation | Preferred approach |
|---|---|
| Early-stage product, limited budget | High-quality SFT first |
| Need robust preference optimization at scale | RLHF pipeline |
| Need lower-complexity preference tuning | DPO or direct preference optimization variants |
| Safety-critical behavior shifts | RLHF with strong evaluation governance |

A common practical path is: Pretraining -> SFT -> preference optimization (DPO/RLHF) -> continuous eval.


🧪 Practical Sketch with TRL-Style Components

This example demonstrates a minimal PPO-style RLHF training loop using Hugging Face TRL, the same generate-score-update pattern popularized by InstructGPT-style and open-source alignment pipelines. TRL's PPOTrainer is used here because it encapsulates the three essential RLHF moves (generate, score, update) in a tight loop that mirrors the theoretical pipeline. As you read, focus on how the reward signal from the reward model flows into trainer.step() and how the KL-penalty config (kl_penalty, target_kl) controls drift from the reference policy.

# Conceptual pseudo-pipeline (not production-ready):
# 1) Train reward model on preference pairs
# 2) Run PPO-style policy optimization with KL control

from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    learning_rate=1e-6,
    batch_size=64,
    mini_batch_size=8,
    kl_penalty="kl",
    target_kl=0.1,
)

# trainer = PPOTrainer(config=ppo_config, model=policy_model, ref_model=ref_model, tokenizer=tok)
# for batch in rollout_loader:
#     responses = trainer.generate(batch["prompt_ids"])
#     rewards = reward_model.score(batch["prompts"], responses)
#     trainer.step(batch["prompt_ids"], responses, rewards)

Production systems add:

  • reward model sanity checks,
  • adversarial prompt suites,
  • rollback thresholds based on human evaluation.
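The first of those checks can be as simple as verifying ranking accuracy on a held-out preference set; a sketch where the score callable and the 0.7 accuracy floor are placeholders for whatever your pipeline exposes:

```python
def rm_sanity_check(score, holdout, min_accuracy=0.7):
    """Verify the reward model still ranks held-out preference pairs correctly.

    score: callable (prompt, response) -> scalar reward.
    holdout: list of (prompt, chosen, rejected) triples never seen in training.
    Returns True if ranking accuracy meets the (illustrative) floor.
    """
    correct = sum(score(p, w) > score(p, l) for p, w, l in holdout)
    return correct / len(holdout) >= min_accuracy

# Toy scorer that prefers longer responses, for demonstration only.
toy_score = lambda prompt, response: len(response)
pairs = [("q1", "a detailed answer", "ok"), ("q2", "thorough reply", "meh")]
passed = rm_sanity_check(toy_score, pairs)
```

Running this before every PPO run is cheap insurance: a reward model that has regressed on its own held-out preferences will only mislead the policy optimizer.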

๐Ÿ› ๏ธ Hugging Face TRL and DeepSpeed: Scaling RLHF to Multi-GPU Clusters

Hugging Face TRL provides the DPOTrainer as a simpler, often equally effective alternative to the full PPO pipeline: it directly optimizes the policy on preference pairs without a separate reward model training step. DeepSpeed (Microsoft) is the distributed training engine that makes RLHF computationally feasible at scale: its ZeRO optimizer stages shard model states, gradients, and optimizer states across GPUs, enabling PPO and DPO training of 7B–70B models on multi-GPU clusters that would otherwise run out of memory.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# DPO eliminates the reward model training step entirely:
# it directly optimizes the policy on (prompt, chosen, rejected) triples.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
ref_model  = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)  # frozen

# Dataset format: {"prompt": str, "chosen": str, "rejected": str}
preference_data = load_dataset("json", data_files="dpo_preferences.jsonl", split="train")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,          # frozen SFT baseline for KL regularization
    args=DPOConfig(
        output_dir="./dpo-output",
        beta=0.1,                  # KL regularization strength (same role as β in PPO)
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
        # DeepSpeed ZeRO Stage 2 config for multi-GPU training
        deepspeed="ds_config_zero2.json",
    ),
    tokenizer=tokenizer,          # newer TRL releases use processing_class= instead
    train_dataset=preference_data,
)

dpo_trainer.train()
dpo_trainer.save_model("./dpo-aligned-model")

A minimal ds_config_zero2.json enables multi-GPU training without changing a line of Python:

{
  "zero_optimization": { "stage": 2, "overlap_comm": true },
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0
}

| Tool | Role | When to reach for it |
|---|---|---|
| TRL DPOTrainer | Preference optimization without RM training | Simpler alternative to PPO for most teams |
| TRL PPOTrainer | Full RLHF with reward model + KL-constrained policy updates | When an interpretable reward signal is required |
| DeepSpeed ZeRO | Shard model/optimizer state across GPUs | Training 7B+ models on multi-GPU clusters |

For a full deep-dive on Hugging Face TRL and DeepSpeed, dedicated follow-up posts are planned.


📚 Practical Lessons from Alignment Teams

  • A better rubric often beats a bigger reward model.
  • Keep a fixed human-eval holdout set across runs.
  • Monitor KL and human preference together, never in isolation.
  • Add category-level breakdowns (safety, factuality, tone, refusal).
  • Treat RLHF as governance + modeling, not only training code.
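The category-level breakdown mentioned above is cheap to wire up; a minimal sketch over pairwise human-eval outcomes:

```python
from collections import defaultdict

def category_breakdown(evals):
    """Aggregate human-eval win rates per category.

    evals: iterable of (category, won) pairs, e.g. ("safety", True) meaning
    the RLHF model beat the baseline on a safety prompt.
    Returns {category: win_rate}.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for category, won in evals:
        totals[category] += 1
        wins[category] += int(won)
    return {c: wins[c] / totals[c] for c in totals}

results = [("safety", True), ("safety", False), ("tone", True)]
breakdown = category_breakdown(results)  # {"safety": 0.5, "tone": 1.0}
```

An aggregate win rate can look healthy while one category (often refusal behavior) quietly regresses; per-category rates make that visible.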

📌 TLDR: Summary & Key Takeaways

TLDR: RLHF adds a preference-optimization loop on top of SFT: collect pairwise rankings, train a reward model, then run KL-constrained PPO to shift the policy toward human-preferred outputs.

  • RLHF optimizes model policy using preference signals, not only supervised labels.
  • Reward model quality determines how useful RLHF updates will be.
  • KL-constrained optimization helps prevent destructive policy drift.
  • Human evaluations remain the ground truth against reward overfitting.
  • The strongest RLHF systems combine technical rigor with annotation governance.

One-liner: RLHF can make assistants far more aligned, but only if your preference pipeline is trustworthy.



Written by Abstract Algorithms (@abstractalgorithms)