
Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF

Supervised fine-tuning, parameter-efficient LoRA, and reinforcement learning from human feedback — when to use each and how to implement them

Abstract Algorithms

TLDR: A pretrained LLM is a generalist. Fine-tuning makes it a specialist. Supervised Fine-Tuning (SFT) teaches it your domain's language through labeled examples. LoRA does the same with 99% fewer trainable parameters. RLHF shapes its behavior using human preference signals. Picking the right approach depends on your data budget, GPU budget, and whether you need format adaptation or behavioral alignment.


Your team is three weeks into building an AI assistant for a mid-size law firm. The brief is clear: help paralegals review contracts, flag non-standard clauses, and suggest revisions in the firm's house style.

You wire up GPT-4 with a system prompt describing the firm's preferred citation format, their standard indemnification language, and the three clause patterns their partners watch for. First demo goes fine — the toy examples look great. Then a partner drops in a real 80-page commercial lease agreement.

The model answers in a generic, confident tone that could have come from any legal website. It flags a boilerplate limitation-of-liability clause as "potentially unusual." It formats its output in standard markdown prose instead of the firm's structured clause-number annotations. Worst of all, it confidently cites a case that doesn't exist.

No amount of system prompt engineering fixes this. The base model simply does not know this firm's terminology, its citation house style, or which clause patterns are actually anomalous in commercial real estate versus tech licensing. The knowledge has to be baked in — not just described at inference time.

That's what fine-tuning is for. And that's where most teams hit their first fork in the road: Which fine-tuning approach do you reach for?


📖 What Fine-Tuning Actually Is — and Three Things It Isn't

Fine-tuning takes a pretrained language model — which has already learned the structure of language, world knowledge, and general reasoning from hundreds of billions of tokens — and updates its weights on a smaller, domain-specific dataset. After fine-tuning, the model has changed. Its parameters now encode both the broad pretraining knowledge and the specialized patterns from your domain.

Here's what fine-tuning is not:

It's not prompting. A system prompt describes what you want. Fine-tuning shows the model hundreds or thousands of examples of exactly how to do it. Prompting changes inference-time context; fine-tuning changes the model's weights.

It's not training from scratch. Pretraining a 7B-parameter model from scratch on raw text requires weeks of cluster compute and millions of dollars. Fine-tuning starts from a model that already understands language and adapts it in a few hours or days, often on a single GPU.

It's not RAG (Retrieval-Augmented Generation). RAG injects relevant documents into the model's context at inference time so it can answer grounded questions. Fine-tuning changes how the model reasons and writes, not just what facts it can access. For the legal assistant above, you might use both: fine-tune on house style and clause patterns, then use RAG to retrieve the relevant contract sections at query time.

What fine-tuning actually does: it updates model weights using gradient descent on your labeled examples, minimizing a loss function (typically cross-entropy over next-token prediction) until the model generates outputs resembling your training labels. The resulting model is your specialized model — no runtime retrieval, no giant prompt, no external document store required.
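As a toy illustration of the objective being minimized (random logits and arbitrary labels here, not a real model), this is ordinary cross-entropy over next-token predictions:

```python
import torch
import torch.nn.functional as F

# Toy sketch of the SFT objective: cross-entropy over next-token prediction.
# `logits` stands in for the model's scores at each position; `labels` are
# the ground-truth next tokens from a training example.
vocab_size, seq_len = 10, 4
torch.manual_seed(0)
logits = torch.randn(seq_len, vocab_size)   # one score vector per position
labels = torch.tensor([1, 3, 5, 7])         # target token id at each position

# Gradient descent on this loss pushes probability mass toward the labels.
loss = F.cross_entropy(logits, labels)
```

Lower loss means the model assigns higher probability to exactly the tokens in your training labels — which is all "resembling your training labels" means mechanically.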


🔍 Three Ways to Fine-Tune: SFT, LoRA, and RLHF at a Glance

Before diving into each technique, here's the high-level landscape. The three dominant fine-tuning approaches differ fundamentally in what they optimize, how much they cost, and what problem they solve best.

| Approach | What It Optimizes | Training Cost | Data Required | When to Use | Typical Result |
|---|---|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Next-token prediction on labeled outputs | Medium — full or near-full fine-tune | 1K–100K instruction-response pairs | Domain adaptation, format/style compliance | Strong domain fit; may forget general capabilities |
| LoRA (Low-Rank Adaptation) | Same as SFT, but on tiny adapter matrices | Low — trains ~0.1–1% of parameters | 500–10K examples | Budget-constrained domain adaptation | Near-SFT quality at a fraction of compute |
| RLHF (Reinforcement Learning from Human Feedback) | Human preference scores via a reward model | High — requires reward model + PPO loop | 10K+ comparison pairs (human preferences) | Behavioral alignment, safety, helpfulness | Best alignment; expensive and operationally complex |

The decision tree below shows how to navigate between them before you write a single line of training code. It is a practical guide, not an exhaustive taxonomy — most production systems combine multiple approaches (base pretraining → SFT → RLHF is exactly what OpenAI used for ChatGPT).

flowchart TD
    A[Do you have labeled examples in your domain?] --> |No| B[Use few-shot prompting or RAG instead]
    A --> |Yes| C{How many labeled examples?}
    C --> |Fewer than 500| D[LoRA or QLoRA on a single consumer GPU]
    C --> |500 to 50000| E{What is your primary goal?}
    C --> |More than 50000| F[Full SFT or high-rank LoRA both viable]
    E --> |Domain vocabulary and output format| G[SFT or LoRA adapter]
    E --> |Safety and behavioral alignment| H[RLHF or Direct Preference Optimization]
    E --> |Instruction following| I[SFT on instruction-response pairs]
    D --> J[QLoRA 4-bit quantization - single 24GB GPU is enough]
    F --> G
    G --> K[Start with LoRA rank 8 to 16 then evaluate vs full SFT]
    H --> L[DPO for under 50K pairs - PPO for larger scale RLHF]
    I --> K

Start at the top of this diagram by asking whether you have labeled data at all — if not, fine-tuning is premature. Once you confirm you have training examples, the number of examples and your primary objective branch you directly to the right technique. Notice that LoRA appears in multiple branches: it is the practical default for most non-alignment use cases because it collapses the cost difference between "we have a single GPU" and "we have a small cluster."


⚙️ Supervised Fine-Tuning: Teaching an LLM Your Domain's Language Step by Step

Supervised Fine-Tuning is the conceptually simplest approach. You provide the model with a dataset of (instruction, response) pairs — the outputs you want the model to produce — and train it using the same cross-entropy next-token-prediction loss used during pretraining. The difference is that pretraining runs on raw, unsupervised internet text; SFT runs on your curated, labeled examples.

Dataset Format: Instruction-Response Pairs

The standard format for SFT is an instruction template. The most common are Alpaca-style and ChatML-style templates:

### Instruction: Summarize the following contract clause in the firm's standard notation.
### Input: "The Licensor shall not be liable for any indirect, incidental, special, exemplary, or consequential damages..."
### Response: LIMITATION: Standard liability cap — excludes indirect/consequential damages. Flag if: uncapped direct damages appear in same clause. Revision status: Accepted.

Each example teaches the model the target output structure. After seeing thousands of these, the model stops producing generic summaries and starts producing structured clause annotations.
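Rendering raw records into that template is typically a one-liner. A minimal sketch — the helper name `format_example` is illustrative, not a library API:

```python
def format_example(instruction: str, input_text: str, response: str) -> str:
    """Render one record in the Alpaca-style template shown above.
    Illustrative helper; field names and layout are assumptions."""
    return (
        f"### Instruction: {instruction}\n"
        f"### Input: {input_text}\n"
        f"### Response: {response}"
    )

text = format_example(
    "Summarize the following contract clause in the firm's standard notation.",
    "The Licensor shall not be liable for any indirect damages...",
    "LIMITATION: Standard liability cap — excludes indirect/consequential damages.",
)
```

Whatever template you choose, use it identically at training and inference time — a mismatch between the two is a common silent failure.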

Loss Function: Cross-Entropy on the Response Tokens Only

During SFT, the loss is computed only on the response tokens — not the instruction or input tokens. This ensures the model is rewarded for generating the right output, not for memorizing the prompt format. In HuggingFace, this is handled automatically when you use the DataCollatorForLanguageModeling with appropriate labels masking, or by using SFTTrainer from the trl library.
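The masking convention itself is simple: HuggingFace losses ignore any label set to -100. A minimal sketch, assuming the first `prompt_len` tokens are the instruction/input and the rest the response:

```python
import torch

# Response-only loss masking: positions labeled -100 contribute no loss.
# Token ids here are arbitrary placeholders.
input_ids = torch.tensor([[101, 102, 103, 104, 201, 202, 203]])
prompt_len = 4                      # instruction + input tokens

labels = input_ids.clone()
labels[:, :prompt_len] = -100       # mask the prompt; train only on the response
# labels is now [[-100, -100, -100, -100, 201, 202, 203]]
```

SFTTrainer applies this masking for you when configured with a completion-only collator; the sketch above is what that machinery does under the hood.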

Running SFT with HuggingFace Trainer

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
import torch

# 1. Load base model and tokenizer
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # required: LLaMA has no default pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare instruction-response training data
raw_data = [
    {
        "text": (
            "### Instruction: Summarize this contract clause.\n"
            "### Input: The party of the first part agrees to deliver services by Q3 2024...\n"
            "### Response: DELIVERY: Contractor commits to Q3 2024 delivery with weekly status reports."
        )
    },
    {
        "text": (
            "### Instruction: Flag any non-standard indemnification language.\n"
            "### Input: Each party shall indemnify the other against all losses without cap...\n"
            "### Response: FLAG: Mutual uncapped indemnification is non-standard. Recommend adding a liability cap equal to fees paid in prior 12 months."
        )
    },
    # Add your full dataset here — minimum 1K examples for production use
]

dataset = Dataset.from_list(raw_data)

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.set_format("torch")

# 3. Training configuration
training_args = TrainingArguments(
    output_dir="./sft-legal-llm",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size = 8
    learning_rate=2e-5,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=50,
    fp16=True,
    report_to="none",                # set to "wandb" for experiment tracking
)

# 4. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
trainer.save_model("./sft-legal-llm-final")

The key numbers to tune for SFT: learning rate (2e-5 is a safe default for full fine-tuning; go lower for larger models), batch size (larger batches stabilize training), and epochs (2–4 is typical — more risks catastrophic forgetting of general capabilities).


🧠 LoRA: Fine-Tuning 7B Parameters by Updating Only 0.1% of Them

Supervised fine-tuning of a 7B-parameter model requires the full model in FP16 (~14GB), FP16 gradients (~14GB more), and FP32 optimizer states for Adam (~56GB for momentum and variance). Total: roughly 84GB of GPU memory before activations — beyond what even a single A100 80GB can handle.

LoRA (Low-Rank Adaptation, Hu et al. 2021) sidesteps this entirely through a clever algebraic insight: the weight updates from fine-tuning tend to be low-rank. Rather than updating the full weight matrix W (of shape d × k), LoRA freezes W and introduces two small trainable matrices B (shape d × r) and A (shape r × k), where r << min(d, k). The adapted weight during inference is simply:

W_adapted = W_frozen + B × A

The rank r is a hyperparameter you choose. Common values are 4, 8, 16, and 64. The alpha hyperparameter scales the update: the effective scaling factor is alpha / r, which acts like a learning rate multiplier for the adapter.
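The whole mechanism fits in a few lines. A minimal sketch of a LoRA-adapted linear map, using small made-up dimensions and the standard initialization (B starts at zero, so the adapter is a no-op at step 0):

```python
import torch

# LoRA forward pass: y = x @ (W + (alpha/r) * B @ A).T
d, k, r, alpha = 64, 32, 8, 16
torch.manual_seed(0)
W = torch.randn(d, k)           # frozen pretrained weight (d x k)
A = torch.randn(r, k) * 0.01    # trainable, small random init
B = torch.zeros(d, r)           # trainable, zero init => B @ A = 0 initially

x = torch.randn(5, k)
scaling = alpha / r             # effective learning-rate-like multiplier
y = x @ (W + scaling * (B @ A)).T
```

Because B is zero-initialized, the adapted model's output exactly equals the frozen model's output before any training step — fine-tuning can only move it gradually away from the pretrained behavior.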

The Internals: How Low-Rank Decomposition Shrinks the Weight Update

To make this concrete, consider a typical attention query projection matrix in a 7B LLM. These matrices are often 4096 × 4096 = 16,777,216 parameters. With LoRA at rank r=16, you instead train:

  • Matrix A: 16 × 4096 = 65,536 parameters
  • Matrix B: 4096 × 16 = 65,536 parameters
  • Total per layer: 131,072 parameters — a 99.2% reduction in that matrix

A 7B model with 32 transformer layers and LoRA applied to 4 attention projections per layer (Q, K, V, and output) has approximately 32 × 4 × 131,072 ≈ 16.7 million total trainable parameters, compared to 7 billion in the base model. That is 0.24% of the total parameters.
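The arithmetic above is easy to reproduce (assuming, as the text does, all four projections are 4096 × 4096):

```python
# Parameter arithmetic for one 4096 x 4096 projection at LoRA rank 16.
d = k = 4096
r = 16

full = d * k                   # 16,777,216 params in the frozen matrix
lora = r * k + d * r           # A (r x k) + B (d x r) = 131,072
reduction = 1 - lora / full    # 99.2% fewer trainable params in that matrix

layers, projections = 32, 4    # Q, K, V, output per layer
total_trainable = layers * projections * lora
print(f"{reduction:.1%} reduction; {total_trainable:,} trainable params total")
```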

The intuition for why this works: pretrained LLMs are already very good. Fine-tuning for a domain-specific task typically requires only a small perturbation to the weight space — the adaptation is inherently low-dimensional. The base model's "knowledge" stays untouched; only the task-specific delta is learned.

Which layers to apply LoRA to? The answer depends on the task. For text generation and style adaptation, the query and value projections (q_proj and v_proj) in the attention layers give the most gain per parameter. For tasks requiring deep domain knowledge, also adding LoRA to the MLP (feed-forward) layers often helps, at the cost of more trainable parameters.

Performance Analysis: Memory and Speed Savings from LoRA

The memory savings from LoRA are substantial in practice:

| Configuration | Trainable Params | GPU Memory (Training) | Training Time (1K steps) |
|---|---|---|---|
| Full SFT — 7B model | 7.0B (100%) | ~70GB | Baseline |
| LoRA r=16 — 7B model | ~16.7M (0.24%) | ~18GB | ~40% of full SFT |
| QLoRA r=16 — 7B model in 4-bit | ~16.7M (0.24%) | ~10GB | ~55% of full SFT |
| LoRA r=64 — 7B model | ~67M (0.96%) | ~22GB | ~50% of full SFT |

QLoRA combines LoRA with 4-bit quantization of the frozen base model weights, bringing a 7B model's training footprint to under 12GB — well within a single 24GB consumer GPU (NVIDIA RTX 3090/4090). The trainable LoRA adapters themselves remain in BFloat16 for precision.
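Loading a base model this way is a configuration change, not a code rewrite. A sketch using `BitsAndBytesConfig` (model name as in the examples above; requires the `bitsandbytes` package and a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style loading: frozen base weights quantized to 4-bit NF4,
# matmuls and LoRA adapters computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute precision for forward/backward
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto",
)
# From here, wrap with get_peft_model(...) exactly as in the LoRA example below.
```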

Inference adds zero latency: B × A is merged back into W at deployment time using model.merge_and_unload(), so the final model runs identically to a full fine-tuned model.

Running LoRA Fine-Tuning with PEFT

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
import torch

# 1. Load base model
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                               # rank — dimensionality of the adapter matrices
    lora_alpha=32,                      # scaling: effective LR multiplier = alpha/r = 2.0
    lora_dropout=0.05,                  # regularization — set to 0 for inference
    target_modules=["q_proj", "v_proj"],# apply LoRA to attention query + value projections
    bias="none",                        # do not train bias terms
)

# 3. Wrap the model — only B and A matrices are trainable after this
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable/total counts, roughly (exact numbers depend on the architecture):
#   trainable params: ~1.7M || all params: ~1.24B || trainable%: well under 1%

# 4. Train — the Trainer is unaware of LoRA; it trains whatever is marked requires_grad=True
training_args = TrainingArguments(
    output_dir="./lora-legal-llm",
    num_train_epochs=3,
    per_device_train_batch_size=8,      # LoRA allows larger batch sizes due to lower memory use
    learning_rate=3e-4,                 # LoRA tolerates a higher LR than full fine-tuning
    warmup_ratio=0.05,
    save_strategy="epoch",
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,    # same format as SFT example above
)

trainer.train()

# 5. Save only the LoRA adapter weights (~10MB vs ~14GB for the full model)
model.save_pretrained("./lora-adapter-only")

# 6. For deployment: merge adapter into base model (zero inference overhead)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./lora-merged-final")

The line model.print_trainable_parameters() is your sanity check — always confirm the trainable percentage is in the expected range before burning GPU hours. If it shows 100%, you forgot to call get_peft_model.


📊 RLHF: The Three-Stage Training Pipeline That Shaped ChatGPT

Supervised fine-tuning and LoRA are both fundamentally about imitation learning — the model learns to replicate examples you labeled. This works well for format and domain adaptation. It fails for one important class of problem: teaching the model to be helpful, harmless, and honest in ways that are hard to express as labeled examples.

When InstructGPT (the direct predecessor to ChatGPT) was SFT-trained on human-written responses, the model learned to produce text that looked like what a helpful assistant would say. But it also learned to produce confident-sounding nonsense, to be sycophantic, and to give long verbose answers because length correlated with human approval in training labels. The SFT model optimized for the form of helpfulness, not its substance.

RLHF solves this by introducing a reward signal derived from human comparisons rather than human labels. Instead of asking humans to write the ideal response (hard and expensive), you ask them which of two model-generated responses is better (easy and scalable). A reward model is trained on these preference pairs and used to guide the LLM's training via reinforcement learning.
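The reward model's training objective is a pairwise ranking loss (Bradley-Terry style): push the preferred response's scalar score above the rejected one's. A minimal sketch with made-up scores:

```python
import torch
import torch.nn.functional as F

# Reward-model objective on one preference pair: maximize the probability
# that the human-preferred response outscores the rejected one.
r_chosen = torch.tensor(1.8)     # scalar reward for the preferred response
r_rejected = torch.tensor(0.5)   # scalar reward for the rejected response

loss = -F.logsigmoid(r_chosen - r_rejected)
# Loss shrinks as the margin (r_chosen - r_rejected) grows.
```

Averaged over thousands of pairs, this trains a scorer that can stand in for the human labelers during Stage 3.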

The RLHF pipeline has three distinct stages. The diagram below shows how they connect:

flowchart TD
    A[Stage 1 - SFT Baseline - Train on instruction-response pairs] --> B[Stage 2 - Reward Model Training]
    B --> C[Collect human preference pairs - same prompt two responses humans rank one over the other]
    C --> D[Train a reward model to predict human preference scores]
    D --> E[Stage 3 - PPO Policy Optimization]
    E --> F[Initialize policy from the SFT checkpoint]
    F --> G[Generate responses with current policy]
    G --> H[Score responses with the frozen reward model]
    H --> I[Compute PPO loss - maximize reward while penalizing KL divergence from SFT policy]
    I --> J{KL divergence within target range?}
    J --> |No - policy has drifted too far| K[Clip gradient update and increase KL coefficient]
    J --> |Yes| L[Apply gradient update to policy network]
    K --> L
    L --> M{Converged?}
    M --> |Not yet| G
    M --> |Yes| N[Aligned model ready for deployment]

The three-stage flow is critical to understand. Stage 1 (SFT) gives the model baseline instruction-following capability. Stage 2 (reward model training) creates a proxy for human preference that can be evaluated cheaply at scale. Stage 3 (PPO training) uses that proxy to push the model's output distribution toward outputs humans prefer — while the KL penalty keeps it from drifting so far from the SFT policy that it starts producing incoherent or degenerate outputs.
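The KL-shaped reward at the heart of Stage 3 can be sketched in a few lines (per-token log-probs and the coefficient `beta` are illustrative values; production implementations like trl compute this per token inside the PPO step):

```python
import torch

# PPO reward shaping: reward-model score minus a KL penalty that keeps
# the policy close to the frozen SFT reference.
beta = 0.1                                           # KL coefficient
logprob_policy = torch.tensor([-1.2, -0.8, -2.0])    # current policy, per token
logprob_ref = torch.tensor([-1.0, -0.9, -1.5])       # SFT reference, per token
reward_model_score = torch.tensor(0.9)               # scalar score for the response

kl_per_token = logprob_policy - logprob_ref          # sample estimate of KL(policy || ref)
shaped_reward = reward_model_score - beta * kl_per_token.sum()
```

If the policy drifts toward tokens the reference finds unlikely, the KL term grows and eats into the reward — this is the mechanism that prevents degenerate outputs.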

RLHF with the TRL PPOTrainer

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
import torch

# 1. Load the SFT-trained model with a PPO value head
#    The value head estimates how much reward a given state (prompt context) is worth
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-legal-llm-final")
tokenizer = AutoTokenizer.from_pretrained("./sft-legal-llm-final")
tokenizer.pad_token = tokenizer.eos_token

# 2. Load a separately trained reward model
#    This model scores (prompt + response) pairs and returns a scalar reward
reward_model = pipeline(
    "text-classification",
    model="./reward-model",
    device=0,
    return_all_scores=False,
)

# 3. PPO configuration
ppo_config = PPOConfig(
    model_name="sft-legal-llm",
    learning_rate=1.41e-5,
    ppo_epochs=4,              # number of optimization passes per batch of rollouts
    mini_batch_size=16,
    batch_size=256,
    gradient_accumulation_steps=1,
    kl_penalty="kl",           # penalize KL divergence from the SFT reference policy
    target_kl=6.0,             # stop updating when KL divergence exceeds this threshold
    cliprange=0.2,             # PPO clip parameter — limits how large each policy update can be
    vf_coef=0.1,               # value function loss coefficient
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
    dataset=prompt_dataset,    # your tokenized prompt dataset — PPOTrainer builds its dataloader from this
)

# 4. Training loop: generate, score, update
for epoch in range(3):
    for batch in ppo_trainer.dataloader:
        query_tensors = batch["input_ids"]

        # Generate responses from the current policy
        response_tensors = ppo_trainer.generate(
            query_tensors,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
        )

        # Decode and score each (prompt, response) pair with the reward model
        responses = [
            tokenizer.decode(r, skip_special_tokens=True)
            for r in response_tensors
        ]
        rewards = [
            torch.tensor(reward_model(resp)[0]["score"])
            for resp in responses
        ]

        # PPO update: maximize expected reward while keeping KL penalty within bounds
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

# 5. Save the RLHF-aligned model
ppo_trainer.save_pretrained("./rlhf-aligned-legal-llm")

The PPO training loop is the most computationally intensive step in the entire fine-tuning pipeline. Each iteration requires a forward pass through the policy (to generate), a forward pass through the reward model (to score), and then a backward pass through the policy (to update). This is why RLHF at scale requires dedicated infrastructure — teams typically run this on 8–64 GPUs for large models.


🌍 Real-World Fine-Tuning: How GitHub Copilot, ChatGPT, and MedPaLM Were Built

Fine-tuning is not an academic exercise. The most commercially successful AI products of the past three years are almost universally the result of disciplined fine-tuning pipelines.

ChatGPT (OpenAI, 2022): The canonical RLHF success story. GPT-3.5 (a strong SFT-trained base) was aligned using the InstructGPT RLHF recipe: human-written SFT data → reward model trained on human comparisons → PPO optimization. The result was a model that felt dramatically more helpful and less likely to produce harmful content than the raw SFT model — even though the underlying weights differed by relatively small parameter deltas.

GitHub Copilot (GitHub + OpenAI, 2021): Primarily SFT. The Codex model was fine-tuned on a massive corpus of public GitHub code with inline comments as "instructions" and subsequent code as "responses." The key insight was scale: training on billions of code examples, not just thousands, gave it enough coverage to understand project-level context, API conventions, and idiomatic patterns across dozens of languages.

Code Llama (Meta, 2023): A two-stage pipeline. Llama 2 underwent continued pretraining on 500 billion code tokens, then was instruction fine-tuned on code-specific instruction-response pairs for the "instruct" variant. The fill-in-the-middle (FIM) capability required modifying the training data format to include prefix/suffix/middle token markers — a dataset engineering challenge as much as a training one.

MedPaLM 2 (Google DeepMind, 2023): Domain SFT combined with chain-of-thought fine-tuning. PaLM 2 was fine-tuned on a curated dataset of medical questions, expert-written answers, and reasoning chains from physicians. It achieved 86.5% on USMLE-style questions — above the passing threshold for human medical licensing. The fine-tuning data quality (expert-curated medical QA, not web-scraped content) was the decisive factor in outperforming much larger generalist models.

Enterprise customer service bots: The hidden majority of fine-tuning deployments. Companies like Intercom, Zendesk, and Salesforce fine-tune 3B–13B open-source models (Mistral, Llama, Qwen) using LoRA on 2K–20K examples of their own resolved support tickets. The result is a model that speaks in the company's brand voice, understands their product taxonomy, and handles their most common query patterns — deployed on a single A10G GPU at a fraction of the API cost.


⚖️ Trade-offs and Failure Modes: Where Each Approach Breaks Down

Every fine-tuning approach has a characteristic failure mode. Knowing these in advance saves weeks of debugging.

SFT overfits on small datasets. Below ~500 high-quality examples, SFT models typically memorize the training set rather than generalizing. Symptoms: perfect loss on training data, near-random outputs on novel phrasings of the same task. Mitigation: data augmentation (rephrase instructions programmatically), early stopping on validation loss, or switch to LoRA which trains fewer parameters and regularizes implicitly.

SFT causes catastrophic forgetting. Aggressive SFT on a narrow domain degrades performance on everything outside that domain. A model fine-tuned only on legal documents may fail standard common-sense reasoning questions it answered correctly before. Mitigation: include a small percentage (~5–10%) of general-purpose instruction data in every SFT training run, or use LoRA which preserves the frozen base weights entirely.
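The data-mixing mitigation is a few lines of dataset plumbing. A sketch with stand-in records (the ~10% ratio and dataset contents are illustrative):

```python
import random

# Mix ~10% general-purpose instruction data into the domain SFT set
# to reduce catastrophic forgetting.
random.seed(0)
domain_data = [{"text": f"legal example {i}"} for i in range(900)]
general_data = [{"text": f"general example {i}"} for i in range(2000)]

mix_ratio = 0.10   # target fraction of general data in the final training set
n_general = round(len(domain_data) * mix_ratio / (1 - mix_ratio))
train_set = domain_data + random.sample(general_data, n_general)
random.shuffle(train_set)
```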

LoRA's rank is a finicky hyperparameter. Set r too high (r=128 on a small dataset) and you're effectively doing full fine-tuning with extra steps — memory savings vanish, and the adapter can still overfit. Set it too low (r=1) and the adapter lacks the capacity to learn domain-specific patterns — training loss plateaus early and validation performance disappoints. A safe starting point is r=8 or r=16, then ablate up and down.

LoRA may underperform full SFT on complex structural tasks. For tasks that require the model to deeply internalize a new output schema — multi-step JSON generation with nested schemas, code generation in a proprietary DSL, or complex multi-document reasoning — full SFT often outperforms LoRA by 5–15 points on task-specific metrics. If you have the compute, benchmark both before committing to LoRA-only.

RLHF is vulnerable to reward hacking. The reward model is a proxy for human preferences, not a perfect oracle. Policies trained with PPO quickly learn to exploit the reward model's blind spots. Classic examples: generating very long responses (length correlates with human approval scores), being sycophantically agreeable regardless of accuracy, or using hedge phrases that sound safe but provide no useful information. Mitigation: diverse reward model training data, periodic human evaluation of PPO outputs (not just reward scores), and a tight KL penalty to prevent the policy from drifting too far from the SFT baseline.

RLHF is operationally expensive and unstable. PPO is notoriously sensitive to hyperparameters. A learning rate that's 2× too high can cause the policy to collapse to a degenerate mode within 100 steps. The three-model setup (SFT policy, reference policy, reward model) requires significant infrastructure coordination. For teams without dedicated ML research engineering capacity, DPO (Direct Preference Optimization) is a compelling RLHF alternative: it achieves similar alignment results without the separate reward model or the PPO training loop, using only offline preference data.
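What makes DPO so much simpler operationally is that its loss needs only log-probabilities from the policy and a frozen reference — no reward model, no rollouts. A sketch on one preference pair, with made-up sequence log-probs:

```python
import torch
import torch.nn.functional as F

# DPO objective on one preference pair. Inputs are the summed log-probs of
# the chosen/rejected responses under the policy and the frozen SFT reference.
beta = 0.1

logp_chosen_policy = torch.tensor(-12.0)     # log p_policy(chosen | prompt)
logp_rejected_policy = torch.tensor(-15.0)
logp_chosen_ref = torch.tensor(-13.0)        # same quantities under the reference
logp_rejected_ref = torch.tensor(-14.0)

# Implicit reward margin, scaled by beta
margin = beta * (
    (logp_chosen_policy - logp_chosen_ref)
    - (logp_rejected_policy - logp_rejected_ref)
)
loss = -F.logsigmoid(margin)
```

In practice trl's DPOTrainer computes these log-probs and this loss for you; the sketch shows why no reward model or PPO loop is needed.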


🧭 Choosing Your Fine-Tuning Strategy: A Decision Guide for Engineers

The decision tree below distills the trade-off analysis into a practical flowchart for engineering teams. After the diagram, the reference table maps common production scenarios to concrete recommendations.

flowchart TD
    A[Fine-Tuning Strategy Decision] --> B{What is your GPU budget?}
    B --> |Single consumer GPU 16 to 24GB| C[QLoRA with 4-bit quantized base model]
    B --> |1 to 4 A100s| D{What is your primary goal?}
    B --> |8 plus A100s or a cloud cluster| E[Full SFT - no memory constraints apply]
    D --> |Domain vocabulary and output format| F[LoRA with rank 8 to 16]
    D --> |Safety and behavioral alignment| G[RLHF or DPO with preference data]
    D --> |General instruction following| H[SFT on instruction-response pairs]
    C --> I[LoRA rank 8 - 500 to 10K examples - peft plus trl SFTTrainer]
    E --> J[SFT with cosine LR schedule and mixed general-domain data to prevent forgetting]
    F --> K[peft LoraConfig plus SFTTrainer from trl]
    G --> L[DPO if fewer than 50K pairs - PPO via trl PPOTrainer for larger budgets]
    H --> K

Use the strategy decision tree entry point (GPU budget) to reach a leaf, then cross-reference with the table below for specific tooling and data requirements.

| Scenario | Recommended Approach | Data Minimum | Key Tool | Watch Out For |
|---|---|---|---|---|
| Domain style and format adaptation | LoRA (r=8–16) | 1K–5K examples | peft + SFTTrainer | Rank too high = no memory savings |
| Instruction following, new capabilities | SFT (full or LoRA) | 5K–50K examples | SFTTrainer | Catastrophic forgetting |
| Safety and alignment tuning | RLHF / DPO | 10K+ preference pairs | trl PPOTrainer / DPOTrainer | Reward hacking |
| Budget-constrained, single GPU | QLoRA (r=8, 4-bit) | 500+ examples | peft + BitsAndBytesConfig | Lower quality ceiling than full SFT |
| Production deployment, minimal overhead | LoRA + merge | 1K+ examples | merge_and_unload() | Need to re-merge when updating base |
| Academic / research fine-tune | Full SFT | 10K+ examples | Trainer | High GPU cost |

🧪 Data Preparation and Evaluation: Making Fine-Tuning Work in Practice

The most common reason fine-tuning fails in practice is not a bad training loop — it's bad data. Understanding what good fine-tuning data looks like, and how to measure whether training actually worked, is where experienced practitioners spend most of their time.

What Good Fine-Tuning Data Looks Like

Diversity over volume, always. 10,000 near-identical examples teach the model one narrow pattern very well and nothing else. 2,000 diverse examples covering the full range of your task — edge cases, unusual phrasings, failure modes you want the model to handle gracefully — produce a far more useful model. When building a legal clause assistant, don't just include standard commercial leases. Include manufacturing contracts, employment agreements, IP licensing clauses, and international contract language.

Response quality is the ceiling. The model can never surpass the quality of its training labels. Poor responses — vague summaries, inconsistent formatting, errors — become the model's new baseline. If you can afford to label only 2,000 examples, pay for expert labelers rather than crowdsourcing 20,000 low-quality labels.

Dataset size guidance by technique:

| Technique | Minimum viable | Sweet spot | Diminishing returns beyond |
|---|---|---|---|
| SFT (full fine-tune) | 1,000 examples | 10K–100K | 500K+ (better to pretrain on domain data) |
| LoRA / PEFT | 500 examples | 2K–10K | 50K+ (consider full SFT at this scale) |
| RLHF preference pairs | 5,000 pairs | 20K–100K | Architecture-dependent |
| DPO preference pairs | 3,000 pairs | 10K–50K | Architecture-dependent |

Evaluation: Four Ways to Know if Fine-Tuning Actually Worked

Perplexity on held-out domain data: The most direct signal. Fine-tune on 90% of your domain dataset and measure perplexity on the held-out 10%. Lower perplexity means the model predicts your domain's output distribution more accurately. Baseline against the pretrained model's perplexity on the same held-out set — the gap tells you how much the fine-tuning changed the model.
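Converting a measured loss into perplexity is a one-liner (the loss values here are made up for illustration, not benchmarks):

```python
import math

# Perplexity = exp(mean per-token cross-entropy). Compare base vs fine-tuned
# models on the *same* held-out split.
def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

base_loss, finetuned_loss = 2.9, 2.1   # mean next-token NLL on held-out domain data
gap = perplexity(base_loss) - perplexity(finetuned_loss)
```

A large gap means fine-tuning substantially changed how well the model predicts your domain's text; a near-zero gap suggests the training run had little effect.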

Task-specific benchmarks: Define 3–5 representative tasks and score them automatically where possible. For legal clause annotation: precision and recall on flag/no-flag decisions against expert annotations. For code generation: execution-based pass-rate (does the generated code run and produce correct output?). For chatbots: ROUGE or BLEURT against reference responses.

Human evaluation with blind comparison: Show human evaluators pairs of outputs — one from the base model, one from your fine-tuned model — without telling them which is which. Ask them to rate on helpfulness, accuracy, and format compliance. This is the ground truth. It is expensive, but any model going to production should have at least a 50-response blind evaluation to confirm the fine-tuning direction is correct.

Catastrophic forgetting test: Run your fine-tuned model on standard general benchmarks — MMLU (multi-task reasoning), HellaSwag (commonsense NLI), TruthfulQA (factual accuracy). Compare scores against the pretrained base model. A drop of more than 2–3 points on these benchmarks is a warning sign that fine-tuning was too aggressive. Mitigation: add 5–10% general instruction data to your SFT dataset and reduce the number of training epochs.
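This check is small enough to live in CI. A sketch, with hypothetical benchmark scores, that flags any benchmark where the fine-tuned model dropped past the threshold:

```python
def forgetting_alerts(base_scores, tuned_scores, max_drop=3.0):
    """Return benchmarks where the fine-tuned model dropped more than max_drop points."""
    return [task for task, base in base_scores.items()
            if base - tuned_scores[task] > max_drop]

# Hypothetical accuracy scores (0-100) for the base and fine-tuned models
base  = {"mmlu": 65.2, "hellaswag": 78.1, "truthfulqa": 49.0}
tuned = {"mmlu": 60.8, "hellaswag": 77.6, "truthfulqa": 48.5}
print(forgetting_alerts(base, tuned))  # ['mmlu'] — a 4.4-point drop warrants mitigation
```

Failing the build on a non-empty alert list turns "we should probably check for forgetting" into a gate that every model artifact must pass.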


🛠️ HuggingFace TRL and PEFT: The Production Fine-Tuning Stack

TRL (Transformer Reinforcement Learning) and PEFT (Parameter-Efficient Fine-Tuning) are the two HuggingFace libraries that together form the standard production fine-tuning stack. TRL provides SFTTrainer, RewardTrainer, PPOTrainer, and DPOTrainer — a complete pipeline from SFT through RLHF. PEFT provides LoraConfig, QLoRA quantization integration, and adapter management utilities.

The SFTTrainer from TRL is the modern recommended way to run SFT — it handles chat template formatting, efficient packing of short examples into fixed-length sequences, and seamless PEFT integration in a single API:

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Model and tokenizer
model_name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bfloat16 is preferred on Ampere+ GPUs
    device_map="auto",
)

# 2. Optional LoRA configuration — remove peft_config for full SFT
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # all attention projections
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

# 3. SFTTrainer — handles tokenization, packing, and training in one unified API
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,           # HuggingFace Dataset with a "text" or "messages" column
    peft_config=lora_config,         # omit this line for full fine-tuning
    args=SFTConfig(
        output_dir="./production-llm",
        max_seq_length=2048,
        packing=True,                # pack short examples together for GPU efficiency
        num_train_epochs=2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        lr_scheduler_type="cosine",
        logging_steps=25,
        save_strategy="steps",
        save_steps=200,
        bf16=True,
        report_to="none",
    ),
)

trainer.train()
trainer.save_model("./production-llm-final")
tokenizer.save_pretrained("./production-llm-final")
```

The packing=True flag is a significant efficiency gain — instead of padding short examples to the maximum sequence length, it concatenates multiple examples (separated by EOS tokens) into a single training sequence. This can increase GPU utilization from 40–60% (with padding) to 85–95%.
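TRL's actual implementation is more involved, but the core idea of packing can be sketched in a few lines: concatenate tokenized examples into one stream with EOS separators, then slice fixed-length chunks, dropping the incomplete tail. The token IDs below are hypothetical:

```python
def pack_examples(tokenized_examples, max_len, eos_id):
    """Concatenate examples (each followed by EOS) and slice into max_len chunks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    # Keep full chunks only — the incomplete tail is dropped
    return [stream[i:i + max_len]
            for i in range(0, len(stream) - max_len + 1, max_len)]

examples = [[11, 12], [21, 22, 23], [31]]  # three short tokenized examples
print(pack_examples(examples, max_len=4, eos_id=0))
# [[11, 12, 0, 21], [22, 23, 0, 31]] — every position is a real training token
```

Contrast with padding: the same three examples padded to length 4 would waste five of twelve positions on pad tokens that contribute nothing to the loss.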

For a full deep-dive on the TRL library's reward training and DPO workflows, see the companion post on RLHF Training Pipeline: From Preferences to Policy, or the HuggingFace TRL documentation at huggingface.co/docs/trl.


📚 Lessons Learned: Common Mistakes That Waste GPU Hours

These are the errors that appear most frequently in fine-tuning projects — not in the tutorial stage, but after teams move to production data.

1. Too-small SFT datasets creating confident-but-wrong specialists. A model trained on 150 curated legal clause examples will appear to work beautifully on your 150 held-out examples. It will fail on the 151st — a valid clause phrased slightly differently. The model has memorized the training patterns, not learned to generalize. The fix is not to blindly collect more data; it is to collect diverse data that covers the long tail of phrasings, contexts, and edge cases for your task.

2. Choosing LoRA rank by intuition instead of by measurement. Engineers often pick r=64 because "more must be better." At r=64 on a dataset of 1,000 examples, you're training 4× more parameters than at r=16 with no improvement — and often worse performance due to overfitting. Run a simple ablation: train with r=4, r=8, r=16, r=32 and compare validation loss curves. The rank where validation loss stops improving is your optimal r.

3. Forgetting to set tokenizer.pad_token = tokenizer.eos_token. LLaMA-family and many other modern tokenizers do not define a pad token by default. If you pass batches of variable-length sequences without a pad token, you'll see a Python ValueError or, worse, silent training errors if the collator falls back to a None padding strategy. Always set this line immediately after loading the tokenizer.

4. Measuring RLHF success only via reward score. A rising average reward during PPO training is a necessary but not sufficient signal of improvement. The reward model can be gamed. Always instrument your RLHF training with at least two independent signals: the reward model score and a separate held-out task benchmark. If task benchmark performance plateaus or drops while reward score climbs, you're watching reward hacking happen in real time.

5. Not running a catastrophic forgetting baseline before deployment. It takes ten minutes to run your fine-tuned model through a 100-question MMLU sample and compare it against the base model. Teams that skip this step frequently discover — after deployment — that their specialized model no longer passes basic reasoning checks that the base model handled effortlessly. Build this check into your CI/CD pipeline for every model artifact.

6. Training for too many epochs with no early stopping. On small datasets, fine-tuning loss reaches a minimum and then climbs on validation data while training loss continues to drop. Two to four epochs is a standard starting point; always monitor validation loss and use load_best_model_at_end=True in TrainingArguments to automatically recover the best checkpoint.
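The rank ablation in lesson 2 is cheap to reason about before you train anything: LoRA's trainable parameter count scales linearly in r, because each adapted weight matrix of shape (d_out, d_in) adds an A matrix of r·d_in parameters and a B matrix of d_out·r parameters. A sketch with hypothetical 4096-dimensional attention projections:

```python
def lora_trainable_params(r, target_shapes):
    """Each adapted (d_out, d_in) matrix gets A: (r, d_in) and B: (d_out, r)."""
    return sum(r * (d_in + d_out) for d_out, d_in in target_shapes)

# Hypothetical: q/k/v/o projections, all 4096 x 4096
shapes = [(4096, 4096)] * 4
for r in (4, 8, 16, 32, 64):
    print(r, lora_trainable_params(r, shapes))
# r=64 trains exactly 4x the parameters of r=16 — capacity you pay for
# in overfitting risk when the dataset is only 1,000 examples
```

The linear scaling is why the ablation is worth running: doubling r doubles adapter capacity every time, but validation loss typically stops rewarding the extra capacity well before the largest rank you can afford.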


📌 TLDR & Key Takeaways

Fine-tuning is the engineering step between "a model that can do almost anything" and "a model that does your thing well." Here's the decision framework in six lines:

  • Use SFT when you have 1K+ high-quality labeled examples and need the model to learn a new format, vocabulary, or domain deeply. Use it as the mandatory first stage before RLHF.
  • Use LoRA when compute is constrained, when you need to maintain multiple domain adapters, or as your default approach when you don't have evidence that full fine-tuning would materially outperform it.
  • Use RLHF (or DPO) when your problem is alignment — safety, helpfulness, tone — rather than format adaptation. You need preference data (comparisons), not just labels.
  • Data quality beats data quantity. 2,000 diverse, expert-quality examples outperform 20,000 noisy crowd-sourced ones in virtually every fine-tuning scenario.
  • Always test for catastrophic forgetting. Any fine-tuning run should be validated against a general benchmark before deployment, not just against domain-specific metrics.
  • LoRA → merge → deploy. For production serving of a single adapter, merge it into the base model with merge_and_unload() before deployment — an unmerged adapter adds inference complexity and latency for no benefit. Keep adapters separate only when you genuinely need to hot-swap multiple domain adapters on one base model.
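What merge_and_unload computes is simply W' = W + (lora_alpha / r) · B · A folded into the base weights, after which the adapter matrices can be discarded. A toy verification in plain Python, with 2×2 matrices standing in for real projection weights, showing that the merged weight reproduces the base-plus-adapter forward pass exactly:

```python
def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W, A, B, scale):
    """Fold the adapter into the base weight: W' = W + scale * (B @ A)."""
    r = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (identity, for clarity)
A = [[1.0, 2.0]]              # LoRA A: shape (r=1, d_in=2)
B = [[3.0], [4.0]]            # LoRA B: shape (d_out=2, r=1)
scale = 2.0                   # lora_alpha / r

x = [1.0, 1.0]
unmerged = [w + scale * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(merge_lora(W, A, B, scale), x)
print(unmerged, merged)  # identical: [19.0, 25.0] [19.0, 25.0]
```

Because the merge is exact, the deployed model is byte-for-byte a standard checkpoint: no PEFT dependency at inference time, no per-request adapter lookup.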

📝 Practice Quiz

  1. A company wants to fine-tune a 7B LLM to respond in their brand's tone and use their internal product terminology. They have 3,000 labeled support ticket examples and one NVIDIA A10G GPU (24GB). Which approach should they use?

    A) Full supervised fine-tuning — it gives the best quality
    B) RLHF — tone is a preference problem, not an imitation problem
    C) QLoRA with rank 8–16 — fits in 24GB and handles 3K examples well
    D) RAG — inject brand guidelines at inference time instead

    Correct Answer: C — Full SFT of a 7B model requires ~70GB of training memory, exceeding the A10G's 24GB. RLHF requires preference pairs and a reward model, which the company doesn't have. QLoRA (4-bit quantized base + LoRA adapters in BF16) trains a 7B model on a single 24GB GPU with 3K examples comfortably.

  2. In a LoRA configuration with r=16 and lora_alpha=32, what is the effective scaling factor applied to the adapter output?

    A) 16
    B) 32
    C) 2.0
    D) 0.5

    Correct Answer: C — The effective scaling factor is lora_alpha / r = 32 / 16 = 2.0. This multiplier acts as a per-adapter learning rate adjustment. Increasing lora_alpha relative to r results in larger weight updates from the adapter.

  3. After several hundred PPO training steps, your RLHF reward score is increasing but your model's accuracy on a held-out domain QA benchmark is declining. What is the most likely cause?

    A) The KL penalty is too high and preventing the model from learning
    B) The base SFT model was not trained for long enough
    C) Reward hacking — the model is exploiting weaknesses in the reward model proxy
    D) The learning rate is too low for PPO to make meaningful updates

    Correct Answer: C — Rising reward score alongside declining task performance is the classic signature of reward hacking. The policy has learned to maximize the reward model's score through behaviors the reward model rates highly (length, hedge phrases, sycophancy) that don't actually improve task quality. The fix involves stricter KL penalties, more diverse reward model training data, and human evaluation of PPO outputs at regular intervals.

  4. Which of the following is a direct consequence of setting a LoRA rank that is too low (e.g., r=1 on a complex task)?

    A) Training will fail with an out-of-memory error
    B) The adapter lacks the capacity to learn the target task and validation loss plateaus early
    C) The merged model will be larger than the original base model
    D) Inference will be significantly slower than the base model

    Correct Answer: B — A very low rank limits the expressivity of the adapter matrices. The model trains quickly (few parameters) but cannot encode enough task-specific information to generalize. Training loss may decrease, but validation loss stops improving early, and task performance remains close to the baseline pretrained model.

  5. Open-ended — challenge question: You've built an RLHF-aligned customer service assistant that scores very well on your reward model but your support managers report that customers are increasingly frustrated. The model gives long, polite, but ultimately non-committal answers. Propose a complete diagnosis and mitigation plan covering data, training, and evaluation changes you would make.

    No single correct answer — strong responses should cover: identifying long-response bias in the reward model training data, examining whether "politeness" was over-indexed in human preference labels, adding task completion metrics to the reward signal alongside human preference, implementing a response-length penalty, collecting new preference data explicitly contrasting helpful-but-brief vs. polite-but-vague responses, and adding a separate automated classifier to flag non-committal phrasing as an evaluation signal.


Fine-tuning is a broad topic — the posts below offer targeted deep dives on each of the techniques introduced here:

Written by Abstract Algorithms (@abstractalgorithms)