
Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF

Supervised fine-tuning, parameter-efficient LoRA, and reinforcement learning from human feedback — when to use each and how to implement them

Abstract Algorithms

TLDR: A pretrained LLM is a generalist. Fine-tuning makes it a specialist. Supervised Fine-Tuning (SFT) teaches it your domain's language through labeled examples. LoRA does the same with 99% fewer trainable parameters. RLHF shapes its behavior using human preference signals. Picking the right approach depends on your data budget, GPU budget, and whether you need format adaptation or behavioral alignment.


Your team is three weeks into building an AI assistant for a mid-size law firm. The brief is clear: help paralegals review contracts, flag non-standard clauses, and suggest revisions in the firm's house style.

You wire up GPT-4 with a system prompt describing the firm's preferred citation format, their standard indemnification language, and the three clause patterns their partners watch for. First demo goes fine — the toy examples look great. Then a partner drops in a real 80-page commercial lease agreement.

The model answers in a generic, confident tone that could have come from any legal website. It flags a boilerplate limitation-of-liability clause as "potentially unusual." It formats its output in standard markdown prose instead of the firm's structured clause-number annotations. Worst of all, it confidently cites a case that doesn't exist.

No amount of system prompt engineering fixes this. The base model simply does not know this firm's terminology, its citation house style, or which clause patterns are actually anomalous in commercial real estate versus tech licensing. The knowledge has to be baked in — not just described at inference time.

That's what fine-tuning is for. And that's where most teams hit their first fork in the road: Which fine-tuning approach do you reach for?


📖 What Fine-Tuning Actually Is — and Three Things It Isn't

Fine-tuning takes a pretrained language model — which has already learned the structure of language, world knowledge, and general reasoning from hundreds of billions of tokens — and updates its weights on a smaller, domain-specific dataset. After fine-tuning, the model has changed. Its parameters now encode both the broad pretraining knowledge and the specialized patterns from your domain.

Here's what fine-tuning is not:

It's not prompting. A system prompt describes what you want. Fine-tuning shows the model hundreds or thousands of examples of exactly how to do it. Prompting changes inference-time context; fine-tuning changes the model's weights.

It's not training from scratch. Pretraining a 7B-parameter model from scratch on raw text requires weeks of cluster compute and millions of dollars. Fine-tuning starts from a model that already understands language and adapts it in a few hours or days, often on a single GPU.

It's not RAG (Retrieval-Augmented Generation). RAG injects relevant documents into the model's context at inference time so it can answer grounded questions. Fine-tuning changes how the model reasons and writes, not just what facts it can access. For the legal assistant above, you might use both: fine-tune on house style and clause patterns, then use RAG to retrieve the relevant contract sections at query time.

What fine-tuning actually does: it updates model weights using gradient descent on your labeled examples, minimizing a loss function (typically cross-entropy over next-token prediction) until the model generates outputs resembling your training labels. The resulting model is your specialized model — no runtime retrieval, no giant prompt, no external document store required.
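As a toy illustration of the objective being minimized (random logits and arbitrary labels here, not a real model), this is ordinary cross-entropy over next-token predictions:

```python
import torch
import torch.nn.functional as F

# Toy sketch of the SFT objective: cross-entropy over next-token prediction.
# `logits` stands in for the model's scores at each position; `labels` are
# the ground-truth next tokens from a training example.
vocab_size, seq_len = 10, 4
torch.manual_seed(0)
logits = torch.randn(seq_len, vocab_size)   # one score vector per position
labels = torch.tensor([1, 3, 5, 7])         # target token id at each position

# Gradient descent on this loss pushes probability mass toward the labels.
loss = F.cross_entropy(logits, labels)
```

Lower loss means the model assigns higher probability to exactly the tokens in your training labels — which is all "resembling your training labels" means mechanically.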


🔍 Three Ways to Fine-Tune: SFT, LoRA, and RLHF at a Glance

Before diving into each technique, here's the high-level landscape. The three dominant fine-tuning approaches differ fundamentally in what they optimize, how much they cost, and what problem they solve best.

| Approach | What It Optimizes | Training Cost | Data Required | When to Use | Typical Result |
|---|---|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Next-token prediction on labeled outputs | Medium — full or near-full fine-tune | 1K–100K instruction-response pairs | Domain adaptation, format/style compliance | Strong domain fit; may forget general capabilities |
| LoRA (Low-Rank Adaptation) | Same as SFT, but on tiny adapter matrices | Low — trains ~0.1–1% of parameters | 500–10K examples | Budget-constrained domain adaptation | Near-SFT quality at a fraction of compute |
| RLHF (Reinforcement Learning from Human Feedback) | Human preference scores via a reward model | High — requires reward model + PPO loop | 10K+ comparison pairs (human preferences) | Behavioral alignment, safety, helpfulness | Best alignment; expensive and operationally complex |

The decision tree below shows how to navigate between them before you write a single line of training code. It is a practical guide, not an exhaustive taxonomy — most production systems combine multiple approaches (base pretraining → SFT → RLHF is exactly what OpenAI used for ChatGPT).

flowchart TD
    A[Do you have labeled examples in your domain?] --> |No| B[Use few-shot prompting or RAG instead]
    A --> |Yes| C{How many labeled examples?}
    C --> |Fewer than 500| D[LoRA or QLoRA on a single consumer GPU]
    C --> |500 to 50000| E{What is your primary goal?}
    C --> |More than 50000| F[Full SFT or high-rank LoRA both viable]
    E --> |Domain vocabulary and output format| G[SFT or LoRA adapter]
    E --> |Safety and behavioral alignment| H[RLHF or Direct Preference Optimization]
    E --> |Instruction following| I[SFT on instruction-response pairs]
    D --> J[QLoRA 4-bit quantization - single 24GB GPU is enough]
    F --> G
    G --> K[Start with LoRA rank 8 to 16 then evaluate vs full SFT]
    H --> L[DPO for under 50K pairs - PPO for larger scale RLHF]
    I --> K

Start at the top of this diagram by asking whether you have labeled data at all — if not, fine-tuning is premature. Once you confirm you have training examples, the number of examples and your primary objective branch you directly to the right technique. Notice that LoRA appears in multiple branches: it is the practical default for most non-alignment use cases because it collapses the cost difference between "we have a single GPU" and "we have a small cluster."


⚙️ Supervised Fine-Tuning: Teaching an LLM Your Domain's Language Step by Step

Supervised Fine-Tuning is the conceptually simplest approach. You provide the model with a dataset of (instruction, response) pairs — the outputs you want the model to produce — and train it using the same cross-entropy next-token-prediction loss used during pretraining. The difference is that pretraining runs on raw, unsupervised internet text; SFT runs on your curated, labeled examples.

Dataset Format: Instruction-Response Pairs

The standard format for SFT is an instruction template. The most common are Alpaca-style and ChatML-style templates:

### Instruction: Summarize the following contract clause in the firm's standard notation.
### Input: "The Licensor shall not be liable for any indirect, incidental, special, exemplary, or consequential damages..."
### Response: LIMITATION: Standard liability cap — excludes indirect/consequential damages. Flag if: uncapped direct damages appear in same clause. Revision status: Accepted.

Each example teaches the model the target output structure. After seeing thousands of these, the model stops producing generic summaries and starts producing structured clause annotations.
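Rendering raw records into that template is typically a one-liner. A minimal sketch — the helper name `format_example` is illustrative, not a library API:

```python
def format_example(instruction: str, input_text: str, response: str) -> str:
    """Render one record in the Alpaca-style template shown above.
    Illustrative helper; field names and layout are assumptions."""
    return (
        f"### Instruction: {instruction}\n"
        f"### Input: {input_text}\n"
        f"### Response: {response}"
    )

text = format_example(
    "Summarize the following contract clause in the firm's standard notation.",
    "The Licensor shall not be liable for any indirect damages...",
    "LIMITATION: Standard liability cap — excludes indirect/consequential damages.",
)
```

Whatever template you choose, use it identically at training and inference time — a mismatch between the two is a common silent failure.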

Loss Function: Cross-Entropy on the Response Tokens Only

During SFT, the loss is computed only on the response tokens — not the instruction or input tokens. This ensures the model is rewarded for generating the right output, not for memorizing the prompt format. In HuggingFace, this is handled automatically when you use the DataCollatorForLanguageModeling with appropriate labels masking, or by using SFTTrainer from the trl library.
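The masking convention itself is simple: HuggingFace losses ignore any label set to -100. A minimal sketch, assuming the first `prompt_len` tokens are the instruction/input and the rest the response:

```python
import torch

# Response-only loss masking: positions labeled -100 contribute no loss.
# Token ids here are arbitrary placeholders.
input_ids = torch.tensor([[101, 102, 103, 104, 201, 202, 203]])
prompt_len = 4                      # instruction + input tokens

labels = input_ids.clone()
labels[:, :prompt_len] = -100       # mask the prompt; train only on the response
# labels is now [[-100, -100, -100, -100, 201, 202, 203]]
```

SFTTrainer applies this masking for you when configured with a completion-only collator; the sketch above is what that machinery does under the hood.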

Running SFT with HuggingFace Trainer

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
import torch

# 1. Load base model and tokenizer
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # required: LLaMA has no default pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare instruction-response training data
raw_data = [
    {
        "text": (
            "### Instruction: Summarize this contract clause.\n"
            "### Input: The party of the first part agrees to deliver services by Q3 2024...\n"
            "### Response: DELIVERY: Contractor commits to Q3 2024 delivery with weekly status reports."
        )
    },
    {
        "text": (
            "### Instruction: Flag any non-standard indemnification language.\n"
            "### Input: Each party shall indemnify the other against all losses without cap...\n"
            "### Response: FLAG: Mutual uncapped indemnification is non-standard. Recommend adding a liability cap equal to fees paid in prior 12 months."
        )
    },
    # Add your full dataset here — minimum 1K examples for production use
]

dataset = Dataset.from_list(raw_data)

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.set_format("torch")

# 3. Training configuration
training_args = TrainingArguments(
    output_dir="./sft-legal-llm",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size = 8
    learning_rate=2e-5,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=50,
    fp16=True,
    report_to="none",                # set to "wandb" for experiment tracking
)

# 4. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
trainer.save_model("./sft-legal-llm-final")

The key numbers to tune for SFT: learning rate (2e-5 is a safe default for full fine-tuning; go lower for larger models), batch size (larger batches stabilize training), and epochs (2–4 is typical — more risks catastrophic forgetting of general capabilities).


🧠 LoRA: Fine-Tuning 7B Parameters by Updating Only 0.1% of Them

Supervised fine-tuning of a 7B-parameter model requires the full model in FP16 (~14GB), FP16 gradients (~14GB more), and FP32 optimizer states for Adam (~56GB for momentum and variance). Total: roughly 84GB of GPU memory before activations — beyond what even a single A100 80GB can handle.

LoRA (Low-Rank Adaptation, Hu et al. 2021) sidesteps this entirely through a clever algebraic insight: the weight updates from fine-tuning tend to be low-rank. Rather than updating the full weight matrix W (of shape d × k), LoRA freezes W and introduces two small trainable matrices B (shape d × r) and A (shape r × k), where r << min(d, k). The adapted weight during inference is simply:

W_adapted = W_frozen + B × A

The rank r is a hyperparameter you choose. Common values are 4, 8, 16, and 64. The alpha hyperparameter scales the update: the effective scaling factor is alpha / r, which acts like a learning rate multiplier for the adapter.
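The whole mechanism fits in a few lines. A minimal sketch of a LoRA-adapted linear map, using small made-up dimensions and the standard initialization (B starts at zero, so the adapter is a no-op at step 0):

```python
import torch

# LoRA forward pass: y = x @ (W + (alpha/r) * B @ A).T
d, k, r, alpha = 64, 32, 8, 16
torch.manual_seed(0)
W = torch.randn(d, k)           # frozen pretrained weight (d x k)
A = torch.randn(r, k) * 0.01    # trainable, small random init
B = torch.zeros(d, r)           # trainable, zero init => B @ A = 0 initially

x = torch.randn(5, k)
scaling = alpha / r             # effective learning-rate-like multiplier
y = x @ (W + scaling * (B @ A)).T
```

Because B is zero-initialized, the adapted model's output exactly equals the frozen model's output before any training step — fine-tuning can only move it gradually away from the pretrained behavior.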

The Internals: How Low-Rank Decomposition Shrinks the Weight Update

To make this concrete, consider a typical attention query projection matrix in a 7B LLM. These matrices are often 4096 × 4096 = 16,777,216 parameters. With LoRA at rank r=16, you instead train:

  • Matrix A: 16 × 4096 = 65,536 parameters
  • Matrix B: 4096 × 16 = 65,536 parameters
  • Total per layer: 131,072 parameters — a 99.2% reduction in that matrix

A 7B model with 32 transformer layers and LoRA applied to 4 attention projections per layer (Q, K, V, and output) has approximately 32 × 4 × 131,072 ≈ 16.7 million total trainable parameters, compared to 7 billion in the base model. That is 0.24% of the total parameters.
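The arithmetic above is easy to reproduce (assuming, as the text does, all four projections are 4096 × 4096):

```python
# Parameter arithmetic for one 4096 x 4096 projection at LoRA rank 16.
d = k = 4096
r = 16

full = d * k                   # 16,777,216 params in the frozen matrix
lora = r * k + d * r           # A (r x k) + B (d x r) = 131,072
reduction = 1 - lora / full    # 99.2% fewer trainable params in that matrix

layers, projections = 32, 4    # Q, K, V, output per layer
total_trainable = layers * projections * lora
print(f"{reduction:.1%} reduction; {total_trainable:,} trainable params total")
```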

The intuition for why this works: pretrained LLMs are already very good. Fine-tuning for a domain-specific task typically requires only a small perturbation to the weight space — the adaptation is inherently low-dimensional. The base model's "knowledge" stays untouched; only the task-specific delta is learned.

Which layers to apply LoRA to? The answer depends on the task. For text generation and style adaptation, the query and value projections (q_proj and v_proj) in the attention layers give the most gain per parameter. For tasks requiring deep domain knowledge, also adding LoRA to the MLP (feed-forward) layers often helps, at the cost of more trainable parameters.

Performance Analysis: Memory and Speed Savings from LoRA

The memory savings from LoRA are substantial in practice:

| Configuration | Trainable Params | GPU Memory (Training) | Training Time (1K steps) |
|---|---|---|---|
| Full SFT — 7B model | 7.0B (100%) | ~70GB | Baseline |
| LoRA r=16 — 7B model | ~16.7M (0.24%) | ~18GB | ~40% of full SFT |
| QLoRA r=16 — 7B model in 4-bit | ~16.7M (0.24%) | ~10GB | ~55% of full SFT |
| LoRA r=64 — 7B model | ~67M (0.96%) | ~22GB | ~50% of full SFT |

QLoRA combines LoRA with 4-bit quantization of the frozen base model weights, bringing a 7B model's training footprint to under 12GB — well within a single 24GB consumer GPU (NVIDIA RTX 3090/4090). The trainable LoRA adapters themselves remain in BFloat16 for precision.
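Loading a base model this way is a configuration change, not a code rewrite. A sketch using `BitsAndBytesConfig` (model name as in the examples above; requires the `bitsandbytes` package and a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style loading: frozen base weights quantized to 4-bit NF4,
# matmuls and LoRA adapters computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute precision for forward/backward
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto",
)
# From here, wrap with get_peft_model(...) exactly as in the LoRA example below.
```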

Inference adds zero latency: B × A is merged back into W at deployment time using model.merge_and_unload(), so the final model runs identically to a full fine-tuned model.

Running LoRA Fine-Tuning with PEFT

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
import torch

# 1. Load base model
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                               # rank — dimensionality of the adapter matrices
    lora_alpha=32,                      # scaling: effective LR multiplier = alpha/r = 2.0
    lora_dropout=0.05,                  # regularization — set to 0 for inference
    target_modules=["q_proj", "v_proj"],# apply LoRA to attention query + value projections
    bias="none",                        # do not train bias terms
)

# 3. Wrap the model — only B and A matrices are trainable after this
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable/total counts, roughly (exact numbers depend on the architecture):
#   trainable params: ~1.7M || all params: ~1.24B || trainable%: well under 1%

# 4. Train — the Trainer is unaware of LoRA; it trains whatever is marked requires_grad=True
training_args = TrainingArguments(
    output_dir="./lora-legal-llm",
    num_train_epochs=3,
    per_device_train_batch_size=8,      # LoRA allows larger batch sizes due to lower memory use
    learning_rate=3e-4,                 # LoRA tolerates a higher LR than full fine-tuning
    warmup_ratio=0.05,
    save_strategy="epoch",
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,    # same format as SFT example above
)

trainer.train()

# 5. Save only the LoRA adapter weights (~10MB vs ~14GB for the full model)
model.save_pretrained("./lora-adapter-only")

# 6. For deployment: merge adapter into base model (zero inference overhead)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./lora-merged-final")

The line model.print_trainable_parameters() is your sanity check — always confirm the trainable percentage is in the expected range before burning GPU hours. If it shows 100%, you forgot to call get_peft_model.


📊 RLHF: The Three-Stage Training Pipeline That Shaped ChatGPT

Supervised fine-tuning and LoRA are both fundamentally about imitation learning — the model learns to replicate examples you labeled. This works well for format and domain adaptation. It fails for one important class of problem: teaching the model to be helpful, harmless, and honest in ways that are hard to express as labeled examples.

When InstructGPT (the direct predecessor to ChatGPT) was SFT-trained on human-written responses, the model learned to produce text that looked like what a helpful assistant would say. But it also learned to produce confident-sounding nonsense, to be sycophantic, and to give long verbose answers because length correlated with human approval in training labels. The SFT model optimized for the form of helpfulness, not its substance.

RLHF solves this by introducing a reward signal derived from human comparisons rather than human labels. Instead of asking humans to write the ideal response (hard and expensive), you ask them which of two model-generated responses is better (easy and scalable). A reward model is trained on these preference pairs and used to guide the LLM's training via reinforcement learning.
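The reward model's training objective is a pairwise ranking loss (Bradley-Terry style): push the preferred response's scalar score above the rejected one's. A minimal sketch with made-up scores:

```python
import torch
import torch.nn.functional as F

# Reward-model objective on one preference pair: maximize the probability
# that the human-preferred response outscores the rejected one.
r_chosen = torch.tensor(1.8)     # scalar reward for the preferred response
r_rejected = torch.tensor(0.5)   # scalar reward for the rejected response

loss = -F.logsigmoid(r_chosen - r_rejected)
# Loss shrinks as the margin (r_chosen - r_rejected) grows.
```

Averaged over thousands of pairs, this trains a scorer that can stand in for the human labelers during Stage 3.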

The RLHF pipeline has three distinct stages. The diagram below shows how they connect:

flowchart TD
    A[Stage 1 - SFT Baseline - Train on instruction-response pairs] --> B[Stage 2 - Reward Model Training]
    B --> C[Collect human preference pairs - same prompt two responses humans rank one over the other]
    C --> D[Train a reward model to predict human preference scores]
    D --> E[Stage 3 - PPO Policy Optimization]
    E --> F[Initialize policy from the SFT checkpoint]
    F --> G[Generate responses with current policy]
    G --> H[Score responses with the frozen reward model]
    H --> I[Compute PPO loss - maximize reward while penalizing KL divergence from SFT policy]
    I --> J{KL divergence within target range?}
    J --> |No - policy has drifted too far| K[Clip gradient update and increase KL coefficient]
    J --> |Yes| L[Apply gradient update to policy network]
    K --> L
    L --> M{Converged?}
    M --> |Not yet| G
    M --> |Yes| N[Aligned model ready for deployment]

The three-stage flow is critical to understand. Stage 1 (SFT) gives the model baseline instruction-following capability. Stage 2 (reward model training) creates a proxy for human preference that can be evaluated cheaply at scale. Stage 3 (PPO training) uses that proxy to push the model's output distribution toward outputs humans prefer — while the KL penalty keeps it from drifting so far from the SFT policy that it starts producing incoherent or degenerate outputs.
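The KL-shaped reward at the heart of Stage 3 can be sketched in a few lines (per-token log-probs and the coefficient `beta` are illustrative values; production implementations like trl compute this per token inside the PPO step):

```python
import torch

# PPO reward shaping: reward-model score minus a KL penalty that keeps
# the policy close to the frozen SFT reference.
beta = 0.1                                           # KL coefficient
logprob_policy = torch.tensor([-1.2, -0.8, -2.0])    # current policy, per token
logprob_ref = torch.tensor([-1.0, -0.9, -1.5])       # SFT reference, per token
reward_model_score = torch.tensor(0.9)               # scalar score for the response

kl_per_token = logprob_policy - logprob_ref          # sample estimate of KL(policy || ref)
shaped_reward = reward_model_score - beta * kl_per_token.sum()
```

If the policy drifts toward tokens the reference finds unlikely, the KL term grows and eats into the reward — this is the mechanism that prevents degenerate outputs.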

RLHF with the TRL PPOTrainer

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
import torch

# 1. Load the SFT-trained model with a PPO value head
#    The value head estimates how much reward a given state (prompt context) is worth
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-legal-llm-final")
tokenizer = AutoTokenizer.from_pretrained("./sft-legal-llm-final")
tokenizer.pad_token = tokenizer.eos_token

# 2. Load a separately trained reward model
#    This model scores (prompt + response) pairs and returns a scalar reward
reward_model = pipeline(
    "text-classification",
    model="./reward-model",
    device=0,
    return_all_scores=False,
)

# 3. PPO configuration
ppo_config = PPOConfig(
    model_name="sft-legal-llm",
    learning_rate=1.41e-5,
    ppo_epochs=4,              # number of optimization passes per batch of rollouts
    mini_batch_size=16,
    batch_size=256,
    gradient_accumulation_steps=1,
    kl_penalty="kl",           # penalize KL divergence from the SFT reference policy
    target_kl=6.0,             # stop updating when KL divergence exceeds this threshold
    cliprange=0.2,             # PPO clip parameter — limits how large each policy update can be
    vf_coef=0.1,               # value function loss coefficient
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
    dataset=prompt_dataset,    # your tokenized prompt dataset — PPOTrainer builds its dataloader from this
)

# 4. Training loop: generate, score, update
for epoch in range(3):
    for batch in ppo_trainer.dataloader:
        query_tensors = batch["input_ids"]

        # Generate responses from the current policy
        response_tensors = ppo_trainer.generate(
            query_tensors,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
        )

        # Decode and score each (prompt, response) pair with the reward model
        responses = [
            tokenizer.decode(r, skip_special_tokens=True)
            for r in response_tensors
        ]
        rewards = [
            torch.tensor(reward_model(resp)[0]["score"])
            for resp in responses
        ]

        # PPO update: maximize expected reward while keeping KL penalty within bounds
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

# 5. Save the RLHF-aligned model
ppo_trainer.save_pretrained("./rlhf-aligned-legal-llm")

The PPO training loop is the most computationally intensive step in the entire fine-tuning pipeline. Each iteration requires a forward pass through the policy (to generate), a forward pass through the reward model (to score), and then a backward pass through the policy (to update). This is why RLHF at scale requires dedicated infrastructure — teams typically run this on 8–64 GPUs for large models.


🌍 Real-World Fine-Tuning: How GitHub Copilot, ChatGPT, and MedPaLM Were Built

Fine-tuning is not an academic exercise. The most commercially successful AI products of the past three years are almost universally the result of disciplined fine-tuning pipelines.

ChatGPT (OpenAI, 2022): The canonical RLHF success story. GPT-3.5 (a strong SFT-trained base) was aligned using the InstructGPT RLHF recipe: human-written SFT data → reward model trained on human comparisons → PPO optimization. The result was a model that felt dramatically more helpful and less likely to produce harmful content than the raw SFT model — even though the underlying weights differed by relatively small parameter deltas.

GitHub Copilot (GitHub + OpenAI, 2021): Primarily SFT. The Codex model was fine-tuned on a massive corpus of public GitHub code with inline comments as "instructions" and subsequent code as "responses." The key insight was scale: training on billions of code examples, not just thousands, gave it enough coverage to understand project-level context, API conventions, and idiomatic patterns across dozens of languages.

Code Llama (Meta, 2023): A two-stage pipeline. Llama 2 underwent continued pretraining on 500 billion code tokens, then was instruction fine-tuned on code-specific instruction-response pairs for the "instruct" variant. The fill-in-the-middle (FIM) capability required modifying the training data format to include prefix/suffix/middle token markers — a dataset engineering challenge as much as a training one.

MedPaLM 2 (Google DeepMind, 2023): Domain SFT combined with chain-of-thought fine-tuning. PaLM 2 was fine-tuned on a curated dataset of medical questions, expert-written answers, and reasoning chains from physicians. It achieved 86.5% on USMLE-style questions — above the passing threshold for human medical licensing. The fine-tuning data quality (expert-curated medical QA, not web-scraped content) was the decisive factor in outperforming much larger generalist models.

Enterprise customer service bots: The hidden majority of fine-tuning deployments. Companies like Intercom, Zendesk, and Salesforce fine-tune 3B–13B open-source models (Mistral, Llama, Qwen) using LoRA on 2K–20K examples of their own resolved support tickets. The result is a model that speaks in the company's brand voice, understands their product taxonomy, and handles their most common query patterns — deployed on a single A10G GPU at a fraction of the API cost.


⚖️ Trade-offs and Failure Modes: Where Each Approach Breaks Down

Every fine-tuning approach has a characteristic failure mode. Knowing these in advance saves weeks of debugging.

SFT overfits on small datasets. Below ~500 high-quality examples, SFT models typically memorize the training set rather than generalizing. Symptoms: perfect loss on training data, near-random outputs on novel phrasings of the same task. Mitigation: data augmentation (rephrase instructions programmatically), early stopping on validation loss, or switch to LoRA which trains fewer parameters and regularizes implicitly.

SFT causes catastrophic forgetting. Aggressive SFT on a narrow domain degrades performance on everything outside that domain. A model fine-tuned only on legal documents may fail standard common-sense reasoning questions it answered correctly before. Mitigation: include a small percentage (~5–10%) of general-purpose instruction data in every SFT training run, or use LoRA which preserves the frozen base weights entirely.
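The data-mixing mitigation is a few lines of dataset plumbing. A sketch with stand-in records (the ~10% ratio and dataset contents are illustrative):

```python
import random

# Mix ~10% general-purpose instruction data into the domain SFT set
# to reduce catastrophic forgetting.
random.seed(0)
domain_data = [{"text": f"legal example {i}"} for i in range(900)]
general_data = [{"text": f"general example {i}"} for i in range(2000)]

mix_ratio = 0.10   # target fraction of general data in the final training set
n_general = round(len(domain_data) * mix_ratio / (1 - mix_ratio))
train_set = domain_data + random.sample(general_data, n_general)
random.shuffle(train_set)
```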

LoRA's rank is a finicky hyperparameter. Set r too high (r=128 on a small dataset) and you're effectively doing full fine-tuning with extra steps — memory savings vanish, and the adapter can still overfit. Set it too low (r=1) and the adapter lacks the capacity to learn domain-specific patterns — training loss plateaus early and validation performance disappoints. A safe starting point is r=8 or r=16, then ablate up and down.

LoRA may underperform full SFT on complex structural tasks. For tasks that require the model to deeply internalize a new output schema — multi-step JSON generation with nested schemas, code generation in a proprietary DSL, or complex multi-document reasoning — full SFT often outperforms LoRA by 5–15 points on task-specific metrics. If you have the compute, benchmark both before committing to LoRA-only.

RLHF is vulnerable to reward hacking. The reward model is a proxy for human preferences, not a perfect oracle. Policies trained with PPO quickly learn to exploit the reward model's blind spots. Classic examples: generating very long responses (length correlates with human approval scores), being sycophantically agreeable regardless of accuracy, or using hedge phrases that sound safe but provide no useful information. Mitigation: diverse reward model training data, periodic human evaluation of PPO outputs (not just reward scores), and a tight KL penalty to prevent the policy from drifting too far from the SFT baseline.

RLHF is operationally expensive and unstable. PPO is notoriously sensitive to hyperparameters. A learning rate that's 2× too high can cause the policy to collapse to a degenerate mode within 100 steps. The three-model setup (SFT policy, reference policy, reward model) requires significant infrastructure coordination. For teams without dedicated ML research engineering capacity, DPO (Direct Preference Optimization) is a compelling RLHF alternative: it achieves similar alignment results without the separate reward model or the PPO training loop, using only offline preference data.
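What makes DPO so much simpler operationally is that its loss needs only log-probabilities from the policy and a frozen reference — no reward model, no rollouts. A sketch on one preference pair, with made-up sequence log-probs:

```python
import torch
import torch.nn.functional as F

# DPO objective on one preference pair. Inputs are the summed log-probs of
# the chosen/rejected responses under the policy and the frozen SFT reference.
beta = 0.1

logp_chosen_policy = torch.tensor(-12.0)     # log p_policy(chosen | prompt)
logp_rejected_policy = torch.tensor(-15.0)
logp_chosen_ref = torch.tensor(-13.0)        # same quantities under the reference
logp_rejected_ref = torch.tensor(-14.0)

# Implicit reward margin, scaled by beta
margin = beta * (
    (logp_chosen_policy - logp_chosen_ref)
    - (logp_rejected_policy - logp_rejected_ref)
)
loss = -F.logsigmoid(margin)
```

In practice trl's DPOTrainer computes these log-probs and this loss for you; the sketch shows why no reward model or PPO loop is needed.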


🧭 Choosing Your Fine-Tuning Strategy: A Decision Guide for Engineers

The decision tree below distills the trade-off analysis into a practical flowchart for engineering teams. After the diagram, the reference table maps common production scenarios to concrete recommendations.

flowchart TD
    A[Fine-Tuning Strategy Decision] --> B{What is your GPU budget?}
    B --> |Single consumer GPU 16 to 24GB| C[QLoRA with 4-bit quantized base model]
    B --> |1 to 4 A100s| D{What is your primary goal?}
    B --> |8 plus A100s or a cloud cluster| E[Full SFT - no memory constraints apply]
    D --> |Domain vocabulary and output format| F[LoRA with rank 8 to 16]
    D --> |Safety and behavioral alignment| G[RLHF or DPO with preference data]
    D --> |General instruction following| H[SFT on instruction-response pairs]
    C --> I[LoRA rank 8 - 500 to 10K examples - peft plus trl SFTTrainer]
    E --> J[SFT with cosine LR schedule and mixed general-domain data to prevent forgetting]
    F --> K[peft LoraConfig plus SFTTrainer from trl]
    G --> L[DPO if fewer than 50K pairs - PPO via trl PPOTrainer for larger budgets]
    H --> K

Use the strategy decision tree entry point (GPU budget) to reach a leaf, then cross-reference with the table below for specific tooling and data requirements.

| Scenario | Recommended Approach | Data Minimum | Key Tool | Watch Out For |
|---|---|---|---|---|
| Domain style and format adaptation | LoRA (r=8–16) | 1K–5K examples | peft + SFTTrainer | Rank too high = no memory savings |
| Instruction following, new capabilities | SFT (full or LoRA) | 5K–50K examples | SFTTrainer | Catastrophic forgetting |
| Safety and alignment tuning | RLHF / DPO | 10K+ preference pairs | trl PPOTrainer / DPOTrainer | Reward hacking |
| Budget-constrained, single GPU | QLoRA (r=8, 4-bit) | 500+ examples | peft + BitsAndBytesConfig | Lower quality ceiling than full SFT |
| Production deployment, minimal overhead | LoRA + merge | 1K+ examples | merge_and_unload() | Need to re-merge when updating base |
| Academic / research fine-tune | Full SFT | 10K+ examples | Trainer | High GPU cost |

🧪 Data Preparation and Evaluation: Making Fine-Tuning Work in Practice

The most common reason fine-tuning fails in practice is not a bad training loop — it's bad data. Understanding what good fine-tuning data looks like, and how to measure whether training actually worked, is where experienced practitioners spend most of their time.

What Good Fine-Tuning Data Looks Like

Diversity over volume, always. 10,000 near-identical examples teach the model one narrow pattern very well and nothing else. 2,000 diverse examples covering the full range of your task — edge cases, unusual phrasings, failure modes you want the model to handle gracefully — produce a far more useful model. When building a legal clause assistant, don't just include standard commercial leases. Include manufacturing contracts, employment agreements, IP licensing clauses, and international contract language.

Response quality is the ceiling. The model can never surpass the quality of its training labels. Poor responses — vague summaries, inconsistent formatting, errors — become the model's new baseline. If you can afford to label only 2,000 examples, pay for expert labelers rather than crowdsourcing 20,000 low-quality labels.

Dataset size guidance by technique:

| Technique | Minimum viable | Sweet spot | Diminishing returns beyond |
|---|---|---|---|
| SFT (full fine-tune) | 1,000 examples | 10K–100K | 500K+ (better to pretrain on domain data) |
| LoRA / PEFT | 500 examples | 2K–10K | 50K+ (consider full SFT at this scale) |
| RLHF preference pairs | 5,000 pairs | 20K–100K | Architecture-dependent |
| DPO preference pairs | 3,000 pairs | 10K–50K | Architecture-dependent |

Evaluation: Four Ways to Know if Fine-Tuning Actually Worked

Perplexity on held-out domain data: The most direct signal. Fine-tune on 90% of your domain dataset and measure perplexity on the held-out 10%. Lower perplexity means the model predicts your domain's output distribution more accurately. Baseline against the pretrained model's perplexity on the same held-out set — the gap tells you how much the fine-tuning changed the model.
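Converting a measured loss into perplexity is a one-liner (the loss values here are made up for illustration, not benchmarks):

```python
import math

# Perplexity = exp(mean per-token cross-entropy). Compare base vs fine-tuned
# models on the *same* held-out split.
def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

base_loss, finetuned_loss = 2.9, 2.1   # mean next-token NLL on held-out domain data
gap = perplexity(base_loss) - perplexity(finetuned_loss)
```

A large gap means fine-tuning substantially changed how well the model predicts your domain's text; a near-zero gap suggests the training run had little effect.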

Task-specific benchmarks: Define 3–5 representative tasks and score them automatically where possible. For legal clause annotation: precision and recall on flag/no-flag decisions against expert annotations. For code generation: execution-based pass-rate (does the generated code run and produce correct output?). For chatbots: ROUGE or BLEURT against reference responses.

Human evaluation with blind comparison: Show human evaluators pairs of outputs — one from the base model, one from your fine-tuned model — without telling them which is which. Ask them to rate on helpfulness, accuracy, and format compliance. This is the ground truth. It is expensive, but any model going to production should have at least a 50-response blind evaluation to confirm the fine-tuning direction is correct.

Catastrophic forgetting test: Run your fine-tuned model on standard general benchmarks — MMLU (multi-task reasoning), HellaSwag (commonsense NLI), TruthfulQA (factual accuracy). Compare scores against the pretrained base model. A drop of more than 2–3 points on these benchmarks is a warning sign that fine-tuning was too aggressive. Mitigation: add 5–10% general instruction data to your SFT dataset and reduce the number of training epochs.
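This check is small enough to live in CI. A sketch, with hypothetical benchmark scores, that flags any benchmark where the fine-tuned model dropped past the threshold:

```python
def forgetting_alerts(base_scores, tuned_scores, max_drop=3.0):
    """Return benchmarks where the fine-tuned model dropped more than max_drop points."""
    return [task for task, base in base_scores.items()
            if base - tuned_scores[task] > max_drop]

# Hypothetical accuracy scores (0-100) for the base and fine-tuned models
base  = {"mmlu": 65.2, "hellaswag": 78.1, "truthfulqa": 49.0}
tuned = {"mmlu": 60.8, "hellaswag": 77.6, "truthfulqa": 48.5}
print(forgetting_alerts(base, tuned))  # ['mmlu'] — a 4.4-point drop warrants mitigation
```

Failing the build on a non-empty alert list turns "we should probably check for forgetting" into a gate that every model artifact must pass.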


🛠️ HuggingFace TRL and PEFT: The Production Fine-Tuning Stack

TRL (Transformer Reinforcement Learning) and PEFT (Parameter-Efficient Fine-Tuning) are the two HuggingFace libraries that together form the standard production fine-tuning stack. TRL provides SFTTrainer, RewardTrainer, PPOTrainer, and DPOTrainer — a complete pipeline from SFT through RLHF. PEFT provides LoraConfig, QLoRA quantization integration, and adapter management utilities.

The SFTTrainer from TRL is the modern recommended way to run SFT — it handles chat template formatting, efficient packing of short examples into fixed-length sequences, and seamless PEFT integration in a single API:

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Model and tokenizer
model_name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bfloat16 is preferred on Ampere+ GPUs
    device_map="auto",
)

# 2. Optional LoRA configuration — remove peft_config for full SFT
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # all attention projections
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

# 3. SFTTrainer — handles tokenization, packing, and training in one unified API
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,           # HuggingFace Dataset with a "text" or "messages" column
    peft_config=lora_config,         # omit this line for full fine-tuning
    args=SFTConfig(
        output_dir="./production-llm",
        max_seq_length=2048,
        packing=True,                # pack short examples together for GPU efficiency
        num_train_epochs=2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        lr_scheduler_type="cosine",
        logging_steps=25,
        save_strategy="steps",
        save_steps=200,
        bf16=True,
        report_to="none",
    ),
)

trainer.train()
trainer.save_model("./production-llm-final")
tokenizer.save_pretrained("./production-llm-final")
```

The packing=True flag is a significant efficiency gain — instead of padding short examples to the maximum sequence length, it concatenates multiple examples (separated by EOS tokens) into a single training sequence. This can increase GPU utilization from 40–60% (with padding) to 85–95%.
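TRL's actual implementation is more involved, but the core idea of packing can be sketched in a few lines: concatenate tokenized examples into one stream with EOS separators, then slice fixed-length chunks, dropping the incomplete tail. The token IDs below are hypothetical:

```python
def pack_examples(tokenized_examples, max_len, eos_id):
    """Concatenate examples (each followed by EOS) and slice into max_len chunks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    # Keep full chunks only — the incomplete tail is dropped
    return [stream[i:i + max_len]
            for i in range(0, len(stream) - max_len + 1, max_len)]

examples = [[11, 12], [21, 22, 23], [31]]  # three short tokenized examples
print(pack_examples(examples, max_len=4, eos_id=0))
# [[11, 12, 0, 21], [22, 23, 0, 31]] — every position is a real training token
```

Contrast with padding: the same three examples padded to length 4 would waste five of twelve positions on pad tokens that contribute nothing to the loss.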

For a full deep-dive on the TRL library's reward training and DPO workflows, see the companion post on RLHF Training Pipeline: From Preferences to Policy, or the HuggingFace TRL documentation at huggingface.co/docs/trl.


📚 Lessons Learned: Common Mistakes That Waste GPU Hours

These are the errors that appear most frequently in fine-tuning projects — not in the tutorial stage, but after teams move to production data.

1. Too-small SFT datasets creating confident-but-wrong specialists. A model trained on 150 curated legal clause examples will appear to work beautifully on your 150 held-out examples. It will fail on the 151st — a valid clause phrased slightly differently. The model has memorized the training patterns, not learned to generalize. The fix is not to blindly collect more data; it is to collect diverse data that covers the long tail of phrasings, contexts, and edge cases for your task.

2. Choosing LoRA rank by intuition instead of by measurement. Engineers often pick r=64 because "more must be better." At r=64 on a dataset of 1,000 examples, you're training 4× more parameters than at r=16 with no improvement — and often worse performance due to overfitting. Run a simple ablation: train with r=4, r=8, r=16, r=32 and compare validation loss curves. The rank where validation loss stops improving is your optimal r.

3. Forgetting to set tokenizer.pad_token = tokenizer.eos_token. LLaMA-family and many other modern tokenizers do not define a pad token by default. If you pass batches of variable-length sequences without a pad token, you'll see a Python ValueError or, worse, silent training errors if the collator falls back to a None padding strategy. Always set this line immediately after loading the tokenizer.

4. Measuring RLHF success only via reward score. A rising average reward during PPO training is a necessary but not sufficient signal of improvement. The reward model can be gamed. Always instrument your RLHF training with at least two independent signals: the reward model score and a separate held-out task benchmark. If task benchmark performance plateaus or drops while reward score climbs, you're watching reward hacking happen in real time.

5. Not running a catastrophic forgetting baseline before deployment. It takes ten minutes to run your fine-tuned model through a 100-question MMLU sample and compare it against the base model. Teams that skip this step frequently discover — after deployment — that their specialized model no longer passes basic reasoning checks that the base model handled effortlessly. Build this check into your CI/CD pipeline for every model artifact.

6. Training for too many epochs with no early stopping. On small datasets, fine-tuning loss reaches a minimum and then climbs on validation data while training loss continues to drop. Two to four epochs is a standard starting point; always monitor validation loss and use load_best_model_at_end=True in TrainingArguments to automatically recover the best checkpoint.
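The rank ablation in lesson 2 is cheap to reason about before you train anything: LoRA's trainable parameter count scales linearly in r, because each adapted weight matrix of shape (d_out, d_in) adds an A matrix of r·d_in parameters and a B matrix of d_out·r parameters. A sketch with hypothetical 4096-dimensional attention projections:

```python
def lora_trainable_params(r, target_shapes):
    """Each adapted (d_out, d_in) matrix gets A: (r, d_in) and B: (d_out, r)."""
    return sum(r * (d_in + d_out) for d_out, d_in in target_shapes)

# Hypothetical: q/k/v/o projections, all 4096 x 4096
shapes = [(4096, 4096)] * 4
for r in (4, 8, 16, 32, 64):
    print(r, lora_trainable_params(r, shapes))
# r=64 trains exactly 4x the parameters of r=16 — capacity you pay for
# in overfitting risk when the dataset is only 1,000 examples
```

The linear scaling is why the ablation is worth running: doubling r doubles adapter capacity every time, but validation loss typically stops rewarding the extra capacity well before the largest rank you can afford.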


📌 TLDR & Key Takeaways

Fine-tuning is the engineering step between "a model that can do almost anything" and "a model that does your thing well." Here's the decision framework in six lines:

  • Use SFT when you have 1K+ high-quality labeled examples and need the model to learn a new format, vocabulary, or domain deeply. Use it as the mandatory first stage before RLHF.
  • Use LoRA when compute is constrained, when you need to maintain multiple domain adapters, or as your default approach when you don't have evidence that full fine-tuning would materially outperform it.
  • Use RLHF (or DPO) when your problem is alignment — safety, helpfulness, tone — rather than format adaptation. You need preference data (comparisons), not just labels.
  • Data quality beats data quantity. 2,000 diverse, expert-quality examples outperform 20,000 noisy crowd-sourced ones in virtually every fine-tuning scenario.
  • Always test for catastrophic forgetting. Any fine-tuning run should be validated against a general benchmark before deployment, not just against domain-specific metrics.
  • LoRA → merge → deploy. For production serving of a single adapter, merge it into the base model with merge_and_unload() before deployment — an unmerged adapter adds inference complexity and latency for no benefit. Keep adapters separate only when you genuinely need to hot-swap multiple domain adapters on one base model.
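What merge_and_unload computes is simply W' = W + (lora_alpha / r) · B · A folded into the base weights, after which the adapter matrices can be discarded. A toy verification in plain Python, with 2×2 matrices standing in for real projection weights, showing that the merged weight reproduces the base-plus-adapter forward pass exactly:

```python
def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W, A, B, scale):
    """Fold the adapter into the base weight: W' = W + scale * (B @ A)."""
    r = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (identity, for clarity)
A = [[1.0, 2.0]]              # LoRA A: shape (r=1, d_in=2)
B = [[3.0], [4.0]]            # LoRA B: shape (d_out=2, r=1)
scale = 2.0                   # lora_alpha / r

x = [1.0, 1.0]
unmerged = [w + scale * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(merge_lora(W, A, B, scale), x)
print(unmerged, merged)  # identical: [19.0, 25.0] [19.0, 25.0]
```

Because the merge is exact, the deployed model is byte-for-byte a standard checkpoint: no PEFT dependency at inference time, no per-request adapter lookup.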

📝 Practice Quiz

  1. A company wants to fine-tune a 7B LLM to respond in their brand's tone and use their internal product terminology. They have 3,000 labeled support ticket examples and one NVIDIA A10G GPU (24GB). Which approach should they use?

    A) Full supervised fine-tuning — it gives the best quality
    B) RLHF — tone is a preference problem, not an imitation problem
    C) QLoRA with rank 8–16 — fits in 24GB and handles 3K examples well
    D) RAG — inject brand guidelines at inference time instead

    Correct Answer: C — Full SFT of a 7B model requires ~70GB of training memory, exceeding the A10G's 24GB. RLHF requires preference pairs and a reward model, which the company doesn't have. QLoRA (4-bit quantized base + LoRA adapters in BF16) trains a 7B model on a single 24GB GPU with 3K examples comfortably.

  2. In a LoRA configuration with r=16 and lora_alpha=32, what is the effective scaling factor applied to the adapter output?

    A) 16
    B) 32
    C) 2.0
    D) 0.5

    Correct Answer: C — The effective scaling factor is lora_alpha / r = 32 / 16 = 2.0. This multiplier acts as a per-adapter learning rate adjustment. Increasing lora_alpha relative to r results in larger weight updates from the adapter.

  3. After several hundred PPO training steps, your RLHF reward score is increasing but your model's accuracy on a held-out domain QA benchmark is declining. What is the most likely cause?

    A) The KL penalty is too high and preventing the model from learning
    B) The base SFT model was not trained for long enough
    C) Reward hacking — the model is exploiting weaknesses in the reward model proxy
    D) The learning rate is too low for PPO to make meaningful updates

    Correct Answer: C — Rising reward score alongside declining task performance is the classic signature of reward hacking. The policy has learned to maximize the reward model's score through behaviors the reward model rates highly (length, hedge phrases, sycophancy) that don't actually improve task quality. The fix involves stricter KL penalties, more diverse reward model training data, and human evaluation of PPO outputs at regular intervals.

  4. Which of the following is a direct consequence of setting a LoRA rank that is too low (e.g., r=1 on a complex task)?

    A) Training will fail with an out-of-memory error
    B) The adapter lacks the capacity to learn the target task and validation loss plateaus early
    C) The merged model will be larger than the original base model
    D) Inference will be significantly slower than the base model

    Correct Answer: B — A very low rank limits the expressivity of the adapter matrices. The model trains quickly (few parameters) but cannot encode enough task-specific information to generalize. Training loss may decrease, but validation loss stops improving early, and task performance remains close to the baseline pretrained model.

  5. Open-ended — challenge question: You've built an RLHF-aligned customer service assistant that scores very well on your reward model but your support managers report that customers are increasingly frustrated. The model gives long, polite, but ultimately non-committal answers. Propose a complete diagnosis and mitigation plan covering data, training, and evaluation changes you would make.

    No single correct answer — strong responses should cover: identifying long-response bias in the reward model training data, examining whether "politeness" was over-indexed in human preference labels, adding task completion metrics to the reward signal alongside human preference, implementing a response-length penalty, collecting new preference data explicitly contrasting helpful-but-brief vs. polite-but-vague responses, and adding a separate automated classifier to flag non-committal phrasing as an evaluation signal.


Fine-tuning is a broad topic — the posts below offer targeted deep dives on each of the techniques introduced here:

Written by Abstract Algorithms (@abstractalgorithms)