
Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive

From the math of low-rank decomposition to running QLoRA on a single A100 — everything you need to fine-tune a 70B model without a supercomputer.

Abstract Algorithms · 31 min read

AI-assisted content.

TLDR: LoRA freezes the base model and trains two tiny matrices per layer — 0.1 % of parameters, 70 % less GPU memory, near-identical quality. QLoRA adds 4-bit NF4 quantization of the frozen base, enabling 70B fine-tuning on 2× A100 80 GB instead of 8×. Use HuggingFace PEFT + TRL + bitsandbytes; always call merge_and_unload() before serving; monitor both task loss and a general-capability eval (e.g. MMLU) to catch catastrophic forgetting before it reaches production.


## 📖 The Memory Wall That Blocked Every LLM Fine-Tune Before 2022

Before 2022, fine-tuning a large language model was an HPC problem. A 7-billion-parameter model stored in 32-bit floating point occupies 28 GB of GPU memory for the weights alone. Add the Adam optimizer and the situation explodes: Adam maintains a momentum estimate and a variance estimate for each parameter — that is four copies of the model in memory simultaneously (weights + gradients + two optimizer states), totalling roughly 112 GB just to start training. A single A100 80 GB card cannot hold this. You needed at minimum four high-end GPUs connected over NVLink, and a full fine-tune of a 70B model required eight to sixteen A100 80 GB cards — hardware costing well above $200 000 per node.
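The 112 GB figure can be checked directly. A quick sketch of full fine-tuning's static memory budget (weights, gradients, and Adam's two moment estimates, all in fp32; activations and framework overhead come on top in a real run):

```python
def full_finetune_memory_gb(n_params: float, bytes_per_value: int = 4) -> float:
    """fp32 memory for weights + gradients + Adam momentum + Adam variance."""
    copies = 4  # weights, gradients, first-moment estimate, second-moment estimate
    return n_params * bytes_per_value * copies / 1e9

print(full_finetune_memory_gb(7e9))    # 112.0 GB for a 7B model
print(full_finetune_memory_gb(70e9))   # 1120.0 GB for a 70B model
```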

The consequence was severe: only well-funded labs could run behavioural fine-tuning. Everyone else was stuck prompting the base model and hoping the in-context examples were enough. Domain adaptation — training a model to speak fluent medical, legal, or customer-support language — was effectively closed to teams without supercomputers.

LoRA (Low-Rank Adaptation of Large Language Models, Hu et al. 2021) dismantled that wall. Instead of updating all 7 billion weights, LoRA freezes the original model entirely and inserts two tiny trainable matrices alongside each attention projection. The total number of trainable parameters drops to roughly 0.5 % of the original — sometimes as low as 0.1 % for small ranks. GPU memory consumption falls by 60–70 % because optimizer states exist only for the tiny adapter matrices, not the full model.

QLoRA (Dettmers et al. 2023) pushed the boundary further. It quantises the frozen base model weights to 4-bit Normal Float (NF4), shrinking the frozen base to roughly a quarter of its 16-bit footprint while keeping adapter training in full bfloat16 precision. The result is transformative: a Llama 3 70B fine-tune that required 8× A100 80 GB under full fine-tuning drops to 2× A100 80 GB under QLoRA — and down to a single A100 80 GB for 13B-class models.

The practical impact is immediate. A team at a mid-sized fintech company fine-tuned Mistral-7B-Instruct on 1 200 customer-support transcripts using QLoRA on a single 48 GB A40 GPU over a weekend. The resulting model outperformed GPT-4o on their proprietary query types while eliminating per-call API costs. That is the world LoRA opens.


## 🔍 LoRA in Plain English: Margin Notes on a Frozen Textbook

Think of the pre-trained base model as a textbook that already contains enormous knowledge about language, reasoning, and the world. Full fine-tuning is like reprinting the entire textbook with a few changed chapters — enormously expensive and wasteful since most pages need no change at all.

LoRA does something more elegant. It keeps the original textbook exactly as it is and writes margin notes — small, targeted annotations that modify how specific passages are interpreted. The notes are lightweight, swap in and out easily, and capture exactly the behavioural change you want without touching a single original page.

In practical terms, every attention weight matrix W (shaped d × k) in the transformer encodes a learned "skill" — how to compute query vectors, key vectors, or value projections. When you fine-tune a model to behave differently on a downstream task, you are not rewriting every skill from scratch. Research shows the effective change matrix ΔW — the difference between the fully fine-tuned weights and the original weights — has a surprisingly low intrinsic rank. That is, even though ΔW is a d × k matrix, almost all of its information can be captured by a much smaller object.

LoRA exploits this low-rank structure directly. Rather than storing ΔW as a full d × k matrix (which would be as expensive as the original), it approximates ΔW as the product of two much smaller matrices:

  • Matrix A — shape r × k, the down-projection, where r is the rank, typically 4–64
  • Matrix B — shape d × r, the up-projection

The product B × A produces a rank-r approximation of ΔW. Only A and B are trained; W is completely frozen. The parameter saving is dramatic: instead of training d × k parameters, you train d×r + r×k parameters. For a standard attention projection where d = k = 4096 and r = 16, this reduces from 16.7 million parameters to just 131 072 — a 128× reduction for that single layer.
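The arithmetic can be sketched in a few lines (the count is symmetric in the two factors: one d × r matrix plus one r × k matrix):

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in one adapter: one d x r matrix plus one r x k matrix."""
    return d * r + r * k

full = 4096 * 4096                     # parameters in the frozen projection itself
adapter = lora_params(4096, 4096, 16)  # parameters in the rank-16 adapter
print(full, adapter, full // adapter)  # 16777216 131072 128
```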


## ⚙️ How LoRA Adds a Parallel Path to Every Attention Layer

LoRA does not modify the frozen weight matrix. Instead, it runs a lightweight parallel branch alongside the frozen forward pass. For every attention projection the forward computation becomes:

y = W_frozen · x  +  (B · A) · x

where W_frozen is completely frozen and B · A is the trainable low-rank adapter. At initialisation, B is set to all zeros and A is drawn from a small Gaussian. This means B · A = 0 at step zero, so the model output is identical to the base model at the start of training. Training can then proceed stably from a known-good baseline rather than from a random initialisation.
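A toy numpy sketch of this forward pass (shapes follow the convention used in the math section below: A is r × k with Gaussian init, B is d × r with zero init) confirms the step-zero equivalence:

```python
import numpy as np

d = k = 64
r = 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))               # frozen base weight
A = rng.normal(scale=0.01, size=(r, k))   # small Gaussian init
B = np.zeros((d, r))                      # zero init, so B @ A = 0 at step zero
x = rng.normal(size=k)

y = W @ x + (B @ A) @ x                   # the LoRA forward pass
assert np.allclose(y, W @ x)              # identical to the base model before training
print("step-zero output matches frozen base")
```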

The diagram below shows how a single transformer attention layer is modified by LoRA. The frozen W path and the trainable B → A path run in parallel and their outputs are added together before being passed to the next component.

```mermaid
graph TD
    Input[Input activation x] --> FrozenW[Frozen weight W - shape d x k]
    Input --> AdapterA[LoRA down-projection A - shape r x k]
    AdapterA --> AdapterB[LoRA up-projection B - shape d x r]
    FrozenW --> Adder[Add frozen and adapter outputs]
    AdapterB --> Adder
    Adder --> Output[Layer output y]
```


The adapter branch compresses the input from `k` dimensions down to `r` dimensions via `A` (the down-projection) and then expands back to `d` dimensions via `B` (the up-projection). This bottleneck structure is what enforces the low-rank constraint: no matter what values `A` and `B` learn, their product `B × A` can represent at most `r` linearly independent directions in the output space. The rank `r` is therefore the key capacity hyperparameter — it controls how much behavioural change the adapter can encode.
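The rank cap is easy to verify numerically: however the two factors are trained, their product cannot exceed rank r.

```python
import numpy as np

d, k, r = 256, 256, 16
rng = np.random.default_rng(1)
B = rng.normal(size=(d, r))            # up-projection
A = rng.normal(size=(r, k))            # down-projection
delta_w = B @ A                        # a full d x k matrix in shape...
print(np.linalg.matrix_rank(delta_w))  # 16: its rank can never exceed r
```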

LoRA adapters are typically attached to the query and value projections (`q_proj`, `v_proj`) inside every attention block, though the best practice for stronger adaptation is to also target the key projection (`k_proj`), output projection (`o_proj`), and the three MLP projections (`gate_proj`, `up_proj`, `down_proj`). Targeting all seven projection types roughly doubles the parameter count but produces noticeably better adaptation on complex tasks.

---

## 🧠 Deep Dive: The Math of Low-Rank Decomposition, NF4 Quantization, and QLoRA's Architecture

### Mathematical Model: The Low-Rank Decomposition Formula and Why Fine-Tuning Deltas Are Naturally Sparse

The formal statement of LoRA is concise. Given a frozen pre-trained weight matrix `W_0 ∈ ℝ^(d×k)`, the modified forward pass is:

h = W_0 · x + (B · A) · x = (W_0 + B · A) · x


where `B ∈ ℝ^(d×r)`, `A ∈ ℝ^(r×k)`, rank `r << min(d, k)`.

The key insight from the original LoRA paper is empirical: when full fine-tuning changes a pre-trained model's weights (i.e., `ΔW = W_fine_tuned - W_0`), the resulting change matrix has a very low effective rank — the singular values of `ΔW` fall off quickly after the first few dominant components. This means approximating `ΔW ≈ B · A` at rank 16 or 32 captures the overwhelming majority of the useful fine-tuning signal while discarding noise.
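A synthetic illustration of that singular-value decay (these are randomly generated matrices, not real fine-tuning deltas, so this only demonstrates the mechanism, not the empirical finding itself):

```python
import numpy as np

# A change matrix with a strong rank-16 component plus small dense noise,
# mimicking the claim that ΔW's singular values fall off after the first few.
rng = np.random.default_rng(0)
d = k = 512
signal = rng.normal(size=(d, 16)) @ rng.normal(size=(16, k))
delta_w = signal + 0.05 * rng.normal(size=(d, k))

s = np.linalg.svd(delta_w, compute_uv=False)
energy_top16 = (s[:16] ** 2).sum() / (s ** 2).sum()
print(f"top-16 energy fraction: {energy_top16:.4f}")  # close to 1.0
```

For such a matrix, a rank-16 approximation keeps essentially all of the signal while discarding the dense noise floor.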

The adapter output is scaled by a factor of `lora_alpha / r` before being added to the frozen output. If `lora_alpha = 32` and `r = 16`, the effective scale is `2.0`. This means the adapter contribution is amplified by 2× relative to the frozen base. The alpha parameter functions as a learning rate multiplier for the adapter branch: increasing `alpha` relative to `r` makes the adapter more aggressively override the base model behaviour. The default ratio `alpha = 2 × r` is a well-tested starting point — it keeps adapter influence significant without overwhelming the frozen base early in training.
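In code, the scale is a one-line function of the two hyperparameters:

```python
def lora_scale(lora_alpha: int, r: int) -> float:
    """Multiplier applied to the adapter branch: h = W x + (alpha / r) * (B A) x."""
    return lora_alpha / r

print(lora_scale(32, 16))  # 2.0
print(lora_scale(64, 32))  # 2.0: the default alpha = 2r keeps the scale constant
```

This is why raising `r` without raising `alpha` quietly weakens the adapter's contribution.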

### LoRA and QLoRA Internals: NF4 Quantization, Double Quant, and the 4-Bit Architecture

QLoRA introduces three innovations on top of LoRA to squeeze the frozen base into 4 bits without significant quality loss.

**Normal Float 4 (NF4):** Standard 4-bit integer quantisation distributes its 16 representable values uniformly across a numerical range. This is inefficient for neural network weights, which are empirically normally distributed around zero. NF4 instead places its 16 quantisation levels at the quantiles of the standard normal distribution — more levels near zero where most weights cluster, fewer at the extremes. For normally distributed values this is information-theoretically optimal and empirically outperforms standard INT4 or FP4.
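The quantile-placement idea can be sketched with the standard library's `NormalDist`. Note this is an illustration only, not the exact NF4 codebook shipped in bitsandbytes (the real codebook is asymmetric and reserves an exact zero level):

```python
from statistics import NormalDist

nd = NormalDist()
probs = [(i + 0.5) / 16 for i in range(16)]   # midpoints of 16 equal-mass bins
levels = [nd.inv_cdf(p) for p in probs]       # 16 quantile-spaced levels

gaps = [b - a for a, b in zip(levels, levels[1:])]
print(min(gaps) == gaps[len(gaps) // 2])      # True: levels are densest near zero
```

The smallest gap between adjacent levels sits at the centre of the distribution, exactly where most weights cluster.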

**Double quantisation:** The quantisation constants themselves (one per block of weights) consume memory. QLoRA quantises these constants a second time using FP8, saving approximately 0.37 bits per parameter on average — roughly 3 GB on a 65B-parameter model. Small, but meaningful at scale.

**Paged optimisers:** When GPU memory is exhausted during a backward pass, CUDA would normally crash with an out-of-memory error. QLoRA uses NVIDIA's unified memory mechanism to automatically page optimizer states to CPU RAM and back as needed, eliminating these crashes at the cost of some training-step throughput.

The net result: a Llama 3 70B model that requires 140 GB in fp16 fits in roughly 35 GB in NF4 — close to a 4× compression ratio, with the quantisation constants adding a small overhead on top. Adapter training proceeds in bfloat16, so all gradient computations are numerically stable despite the compressed frozen base.

### Performance Analysis: Rank Selection, Parameter Counts, and Training Cost Trade-offs

To understand what rank means in practice, consider a model with hidden dimension `d = 4096` and a single attention projection of shape `4096 × 4096`. A LoRA adapter at rank `r` adds `2 × r × 4096` trainable parameters — one matrix `A` of shape `r × 4096` and one matrix `B` of shape `4096 × r`. For `r = 16` this is 131 072 parameters; for `r = 64` it is 524 288.

A Llama 3 8B model has 32 transformer layers, each with four attention projections and three MLP projections. As a back-of-the-envelope estimate that treats every projection as a square `4096 × 4096` matrix, targeting all 7 projections at `r = 16` gives:

32 layers × 7 projections × 2 × 16 × 4096 = 29,360,128 trainable parameters

Out of 8 billion total parameters, that is roughly 0.37 %. The true count is somewhat higher, because the MLP projections are wider than 4096 while grouped-query attention makes `k_proj` and `v_proj` narrower; PEFT prints the exact figure when you call `model.print_trainable_parameters()` before training.
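The square-projection shortcut undercounts slightly. Using Mistral-7B's actual shapes (hidden size 4096, 1024-wide k/v projections under grouped-query attention, MLP width 14336, 32 layers), a few lines reproduce the exact figure that `print_trainable_parameters()` reports in the training script later in this post:

```python
# Exact adapter parameter count for Mistral-7B at r=16, all 7 target modules.
# Projection shapes listed as (out_features, in_features).
HIDDEN, KV, MLP, R, LAYERS = 4096, 1024, 14336, 16, 32
projections = [
    (HIDDEN, HIDDEN),  # q_proj
    (KV, HIDDEN),      # k_proj (grouped-query attention: only 1024 wide)
    (KV, HIDDEN),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (MLP, HIDDEN),     # gate_proj
    (MLP, HIDDEN),     # up_proj
    (HIDDEN, MLP),     # down_proj
]
total = LAYERS * sum(R * (d + k) for d, k in projections)
print(total)  # 41943040, i.e. ~0.58 % of 7.28B parameters
```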

The practical guidance for rank selection:
- `r = 4` — use for style or tone changes; very fast training, minimal expressiveness
- `r = 16` — the universal default; covers instruction-following, domain adaptation, format changes
- `r = 32` or `r = 64` — use when training on complex multi-step reasoning tasks or when `r = 16` shows a loss plateau well above baseline
- Increasing `r` beyond 64 rarely helps and approaches the cost of full fine-tuning

---

## 📊 The QLoRA Training Pipeline From Data to Served Model

The complete QLoRA workflow involves six distinct stages, each with specific tooling and failure points. Understanding the pipeline as a whole — rather than just the training step — is the difference between a working deployment and a model that behaves perfectly in the notebook but bizarrely in production.

The diagram below shows every stage, from the raw base model to a production-ready merged model served behind vLLM. Pay particular attention to the merge step: LoRA adapters loaded at inference time without merging incur a 2× compute overhead in the adapter projection, and the adapter checkpoint is architecturally fragile (it requires the exact same PEFT version and base model version to load).

```mermaid
graph TD
    BaseModel[Base model in fp16] --> Quantise[NF4 4-bit quantisation via bitsandbytes]
    Quantise --> FrozenBase[Frozen 4-bit base in memory]
    FrozenBase --> PrepKbit[prepare model for kbit training - gradient checkpointing]
    PrepKbit --> AttachAdapters[Attach LoRA adapters in bfloat16 via PEFT]
    TrainingData[Formatted training data] --> SFTLoop[SFTTrainer supervised fine-tuning loop]
    AttachAdapters --> SFTLoop
    SFTLoop --> SaveAdapter[Save LoRA adapter checkpoint]
    SaveAdapter --> MergeStep[Load base in fp16 and call merge and unload]
    MergeStep --> MergedModel[Merged fp16 model - no adapter dependency]
    MergedModel --> ServingLayer[vLLM or TGI serving endpoint]
```

Each arrow in the diagram hides a potential failure mode. The quantisation step must use bnb_4bit_quant_type="nf4" and bnb_4bit_compute_dtype=torch.bfloat16 — using float16 as the compute dtype can produce NaN gradients on A100 and H100 GPUs. The merge step loads the base model in fp16 (not 4-bit) because merging requires full-precision arithmetic to fuse W + B×A accurately. Skipping the merge and serving with adapter weights attached doubles inference latency for no accuracy benefit.

The performance comparison table below shows how different configurations trade GPU requirements against training time and quality relative to a full fine-tune baseline:

| Configuration | Model | GPU Required | Train Time per 1 K examples | Quality vs Full FT |
|---|---|---|---|---|
| Full fine-tune | Llama 3 8B | 4× A100 40 GB | ~2 hours | Baseline |
| LoRA (r=16) | Llama 3 8B | 1× A100 40 GB | ~45 min | −2 to −4 % |
| QLoRA (r=16) | Llama 3 8B | 1× RTX 4090 24 GB | ~90 min | −3 to −6 % |
| LoRA (r=16) | Llama 3 70B | 4× A100 80 GB | ~8 hours | −2 to −4 % |
| QLoRA (r=16) | Llama 3 70B | 2× A100 80 GB | ~14 hours | −3 to −6 % |

Quality deltas are measured as relative difference on task-specific benchmarks and vary significantly with dataset size and quality — 2 000 well-curated examples consistently outperform 20 000 auto-scraped examples.


## 🧪 Complete Working Example: Fine-Tuning Mistral-7B with QLoRA on a Custom Support Dataset

This section walks through a complete, runnable QLoRA fine-tune from data preparation to inference with the merged model. The scenario is a Mistral-7B-Instruct model adapted to summarise customer complaint tickets in a specific corporate format — a realistic domain-adaptation task that highlights every config decision you will encounter in practice.

The code is split into two scripts that mirror the two operational stages: training (including adapter saving) and merging + inference. Both scripts are annotated inline with the reasoning behind each hyperparameter choice.

### Training Script: `qlora_train.py`

# qlora_train.py — fine-tune Mistral-7B-Instruct with QLoRA on custom data
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# --- 1. Training data: instruction-response pairs ---
# In production, load from JSONL or Hugging Face dataset.
# Minimum viable size: ~200 examples. Recommended: 500–2000.
TRAINING_EXAMPLES = [
    {
        "instruction": "Summarize the following customer complaint in one sentence.",
        "input": "I've been waiting 3 weeks for my order and no one is responding to my emails.",
        "response": "Customer has been waiting 3 weeks for an undelivered order with no email response from support.",
    },
    # ... add 500-2000 more examples for real use cases
]

def format_for_mistral(ex: dict) -> str:
    """Apply Mistral's instruction template.

    Critically: use the model's own chat template, not a custom one.
    Mismatched templates are the #1 cause of fine-tuned models ignoring instructions.
    For other base models, use: tokenizer.apply_chat_template(messages, tokenize=False)
    """
    user_msg = ex["instruction"]
    if ex.get("input"):
        user_msg += f"\n\n{ex['input']}"
    return f"<s>[INST] {user_msg} [/INST] {ex['response']} </s>"

# --- 2. QLoRA quantization config ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # Normal Float 4 — optimal for normally distributed LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16, # bfloat16 avoids NaN on A100/H100; do NOT use float16
    bnb_4bit_use_double_quant=True,        # quantise the quantisation constants for ~0.5 GB additional saving
)

# --- 3. Load model + tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token
tokenizer.padding_side = "right"           # right-padding avoids position embedding shifts

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",          # fills GPU first, spills to CPU if needed
    trust_remote_code=True,
)
# Enables gradient checkpointing for the 4-bit base — required before get_peft_model
model = prepare_model_for_kbit_training(model)

# --- 4. LoRA adapter config ---
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — 16 balances adapter capacity vs parameter count
    lora_alpha=32,      # effective scale = alpha / r = 2.0; amplifies adapter contribution
    lora_dropout=0.05,  # light regularisation; 0.0 is acceptable for larger datasets
    bias="none",        # don't train biases — adds minimal expressiveness for significant overhead
    target_modules=[    # target all attention + MLP projections for maximal adaptation
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output: trainable params: 41,943,040 || all params: 7,283,965,952 || trainable%: 0.5759

# --- 5. Dataset ---
dataset = Dataset.from_list([
    {"text": format_for_mistral(ex)} for ex in TRAINING_EXAMPLES
])

# --- 6. Training ---
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,       # set to your median sequence length; longer = more GPU memory
    args=TrainingArguments(
        output_dir="./mistral-7b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch size = 2 × 8 = 16
        warmup_ratio=0.05,               # warm up over first 5% of steps; prevents early LR spikes
        learning_rate=2e-4,              # standard QLoRA LR; lower to 1e-4 if loss is unstable
        fp16=False,
        bf16=True,                       # must match bnb_4bit_compute_dtype
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="no",        # add eval_dataset + strategy="steps" for early stopping
        report_to="none",                # swap to "wandb" for production monitoring
    ),
)
trainer.train()

# --- 7. Save adapter only (not the full merged model) ---
model.save_pretrained("./mistral-7b-qlora-adapter")
tokenizer.save_pretrained("./mistral-7b-qlora-adapter")
print("Adapter saved. Run qlora_merge_and_infer.py to merge before serving.")

### Merge and Inference Script: `qlora_merge_and_infer.py`

# qlora_merge_and_infer.py — merge adapter into base model weights, then run inference.
# This script runs AFTER training is complete.
# Always merge before deploying to production — adapter-only serving is 2x slower.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE_MODEL   = "mistralai/Mistral-7B-Instruct-v0.3"
ADAPTER_PATH = "./mistral-7b-qlora-adapter"
MERGED_PATH  = "./mistral-7b-merged"

def merge_adapter():
    """Fuse the LoRA adapter matrices (B × A) into the frozen base weights (W).

    Why: After merge_and_unload(), W_new = W_frozen + B × A for each layer.
    The adapter is removed, removing the extra projection overhead at inference.
    Load the base in fp16 for merging — NOT 4-bit.
    Merging requires full-precision matrix addition; NF4 cannot represent the result accurately.
    """
    print("Loading base model in fp16 for merging...")
    tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16,  # full precision needed for merge arithmetic
        device_map="auto"
    )
    print("Attaching LoRA adapter...")
    model = PeftModel.from_pretrained(base, ADAPTER_PATH)

    print("Merging and unloading...")
    merged = model.merge_and_unload()  # fuses B×A into W, removes PEFT wrapper

    merged.save_pretrained(MERGED_PATH)
    tokenizer.save_pretrained(MERGED_PATH)
    print(f"Merged model saved to {MERGED_PATH}")

def infer(prompt: str) -> str:
    """Run inference on the merged model.

    The merged model is a standard HuggingFace CausalLM — no PEFT dependency.
    In production, replace this with vLLM for batched serving with 3–5x higher throughput.
    """
    tokenizer = AutoTokenizer.from_pretrained(MERGED_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MERGED_PATH,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    inputs = tokenizer(
        f"<s>[INST] {prompt} [/INST]",
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.1,   # low temperature for deterministic summarisation
            do_sample=True,
        )
    # Slice off the input tokens; decode only the generated response
    response_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(response_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    merge_adapter()
    test_prompt = (
        "Summarize the following customer complaint in one sentence.\n\n"
        "I ordered a laptop stand three weeks ago. It never arrived, tracking hasn't updated "
        "in two weeks, and your customer support keeps closing my tickets without responding."
    )
    print("\nModel response:")
    print(infer(test_prompt))

The training script's most important configuration choices are the bnb_4bit_compute_dtype (always bfloat16, not float16), the target_modules list (all seven projection types for full adaptation), and the gradient_accumulation_steps (multiply with batch size to get an effective batch of 16–32). The merge script's most important detail is loading the base model in fp16 — not 4-bit — because the merge arithmetic requires numerical precision that NF4 cannot provide.


## 📈 Evaluating Your Fine-Tune: Loss Curves, Perplexity, and the MMLU Canary

Training loss alone is a dangerously incomplete signal. A model can reach near-zero training loss on 800 examples while simultaneously catastrophically forgetting how to perform multi-step reasoning on topics outside the training set. Production-safe evaluation requires three distinct signals monitored together.

Training loss tells you whether the model is learning the target task format. A healthy loss curve for a QLoRA run drops sharply in the first 10–20 % of steps and then decelerates into a gentle plateau around 0.3–0.8 for instruction-following tasks. If the curve never leaves the starting value (typically 2.0–4.0 for causal LMs), check your chat template — the most common cause of a flat loss curve is misformatted training data that the model cannot fit even at high learning rates. If the curve drops all the way to near 0.0 on a small dataset, you are overfitting — attach an EarlyStoppingCallback (with early_stopping_patience) or reduce num_train_epochs.
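These rules of thumb can be encoded as a quick triage check. The helper below is hypothetical (not part of any training library) and uses the thresholds from the paragraph above:

```python
def diagnose_loss_curve(losses: list[float]) -> str:
    """Hypothetical helper: classify a training-loss trajectory by rule of thumb."""
    start, end = losses[0], losses[-1]
    if end > start * 0.9:
        return "flat: check your chat template and data formatting"
    if end < 0.05:
        return "near zero: likely overfitting, reduce epochs or add data"
    return "healthy: sharp early drop settling into a plateau"

print(diagnose_loss_curve([2.8, 1.1, 0.7, 0.55, 0.52]))   # healthy
print(diagnose_loss_curve([2.9, 2.85, 2.8, 2.82, 2.79]))  # flat
```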

Perplexity on a held-out split of your training distribution is the in-domain quality signal. Compute it with trainer.evaluate() after adding an eval_dataset to your SFTTrainer config. Perplexity decreasing on the eval split while training loss is still high indicates the learning is generalising rather than memorising.
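Since SFT loss is the mean per-token negative log-likelihood, converting the `eval_loss` that `trainer.evaluate()` reports into perplexity is one line:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood,
    i.e. math.exp(trainer.evaluate()["eval_loss"]) for a causal LM."""
    return math.exp(mean_nll)

print(round(perplexity(0.7), 2))  # 2.01: the model is effectively choosing between ~2 tokens
```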

A general-capability canary — such as MMLU (5-shot), HellaSwag, or a fixed internal benchmark — detects catastrophic forgetting before it reaches users. Run this eval every epoch or every 500 steps. Seeing your domain task score rise while MMLU drops from 62 % to 48 % is a clear signal that your learning rate is too high or your dataset lacks diversity. The fix is typically a lower learning rate (5e-5 instead of 2e-4), adding system-prompt diversity to the training examples, or using DPO-style preference optimisation instead of pure SFT for the second stage.

For quick MMLU evaluation during a QLoRA run, the lm-evaluation-harness library from EleutherAI runs against any HuggingFace-compatible checkpoint with a single CLI command and produces per-task accuracy scores that can be logged to W&B or MLflow alongside your training metrics.
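Assuming a recent lm-evaluation-harness install, the invocation looks roughly like this (paths are illustrative; check `lm_eval --help` for the flags your version supports):

```shell
# 5-shot MMLU eval against the merged checkpoint from the example above
lm_eval --model hf \
    --model_args pretrained=./mistral-7b-merged,dtype=float16 \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path ./eval_results
```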


## 🏗️ Advanced Deployment Patterns: Adapter Composition, Continued Pre-Training, and Multi-Tenant Serving

Once you have mastered the basic QLoRA fine-tune loop, three advanced patterns become relevant as you move from single-model prototypes to production serving systems.

### Stacking Multiple LoRA Adapters with LoraHub

LoraHub (Huang et al. 2023) demonstrates that LoRA adapters trained on different tasks can be combined by weighted interpolation. If you have an adapter fine-tuned on code generation and a separate adapter fine-tuned on structured data extraction, LoraHub searches for a coefficient vector [w1, w2] such that ΔW = w1 × ΔW_code + w2 × ΔW_extraction maximises performance on a new task — without any gradient-based training, using only a handful of in-context examples for calibration. This is valuable when you want to compose specialised adapters rather than training a single adapter on a mixture dataset.
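The composition itself is just a weighted sum of the adapters' effective weight deltas. A toy numpy sketch, with hand-picked weights standing in for LoraHub's gradient-free search:

```python
import numpy as np

# Two task-specific rank-8 adapters over the same 64 x 64 projection.
rng = np.random.default_rng(0)
d, k, r = 64, 64, 8
delta_code = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
delta_extract = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))

w1, w2 = 0.7, 0.3  # in LoraHub these are found by search on in-context examples
delta_combined = w1 * delta_code + w2 * delta_extract
print(delta_combined.shape, np.linalg.matrix_rank(delta_combined))  # rank at most 16
```

Note that the combined delta's rank can be up to the sum of the component ranks, which is one reason SVD-based composition is offered as an alternative to plain linear mixing.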

PEFT supports this through add_weighted_adapter(), which merges a list of adapter checkpoints using either linear combination or SVD-based composition. Linear combination is faster; SVD-based produces lower approximation error when the adapters' weight spaces differ significantly.

### Continued Pre-Training Before Instruction Fine-Tuning

For domains with highly specialised vocabulary — genomics, patent law, semiconductor design — a two-stage approach outperforms direct instruction fine-tuning. Stage one uses LoRA (not QLoRA) with a large rank (r=128) to run continued pre-training on domain documents with a standard next-token prediction objective. This step updates the model's vocabulary distribution without changing instruction-following behaviour. Stage two then runs a standard QLoRA instruction fine-tune on a smaller supervised dataset. The stage-one checkpoint absorbs domain vocabulary cheaply; the stage-two checkpoint aligns on the task format. The Microsoft Phi-3 and Yi series models both follow this two-stage curriculum at the base model level.

### Multi-Tenant Adapter Hot-Swapping with vLLM

vLLM's LoRARequest API loads and unloads LoRA adapters per-request against a shared base model instance. This means a single GPU cluster hosting Llama 3 8B can simultaneously serve hundreds of customer-specific fine-tuned models by swapping adapters in the KV-cache-aware serving pipeline. Three requirements must hold for this to work correctly: all adapters must use the same rank (r), all adapters must target the same set of modules, and all adapters must be trained against the same base model checkpoint (not different quantisation configurations). Any deviation causes shape mismatches that crash the serving process.

The operational pattern is: train all customer adapters centrally with a shared LoraConfig template, store them in object storage (S3 or GCS), and reference them by ID in the LoRARequest at inference time. The base model stays loaded in GPU memory across all requests; only the much-smaller adapter tensors are swapped.


## 🌍 Who Is Running LoRA in Production and What They Have Learned

LoRA and QLoRA have moved from research papers into the core infrastructure of companies building LLM products. The patterns below reflect public case studies, blog posts, and open-source releases from organisations operating at scale.

Mistral AI and the vertical AI product wave. The Mistral 7B base model was explicitly designed to be fine-tunable on consumer hardware, and the company's release strategy — base + instruct + API — implicitly assumes that customers will run LoRA adapters on top. Legal AI startups (Harvey, Casetext), medical AI companies (Nabla, Suki), and customer-service platforms (Freshdesk, Zendesk AI) all use LoRA-adapted Mistral variants to deliver domain accuracy that a generic instruction model cannot match.

HuggingFace Zephyr: DPO on top of SFT on top of LoRA. The Zephyr-7B-beta model is a public demonstration of the full fine-tuning pipeline. The team ran SFT with LoRA adapters on Mistral-7B using a synthetic instruction dataset, then ran DPO (Direct Preference Optimisation) on the resulting checkpoint using human-preference pairs. The final model outperformed Llama 2 70B on the MT-Bench chat benchmark using a model one-tenth the size and a fraction of the compute. The DPO stage used DPOTrainer from the TRL library — the same library used in the training example above.

Anyscale: per-tenant adapter hot-swapping. Anyscale's managed fine-tuning product allows enterprise customers to maintain their own LoRA adapters and have them loaded at inference time using vLLM's dynamic adapter loading feature. Each tenant's adapter is stored in object storage and loaded into a shared base model instance on demand — a multiplexed serving architecture that makes per-customer model personalisation economically viable. The critical requirement for this pattern is that all adapters target the same rank and the same set of modules; otherwise the base model cannot serve them interchangeably.

Nous Research Hermes: instruction-tuning on synthetic data. The Hermes series (Hermes 2, Hermes 3) demonstrates that dataset quality dominates dataset size. Nous Research used synthetically generated instruction data from Claude and GPT-4 to create highly diverse training examples, then ran LoRA fine-tuning on Llama base models. Hermes 2 Pro Llama 3 8B consistently scores above identically-parameterised models trained on larger but lower-quality datasets — a real-world validation of the "500 curated examples beats 5 000 scraped ones" principle.


## ⚖️ Seven Ways QLoRA Fine-Tunes Go Wrong — and How to Fix Each One

| Failure Mode | Root Cause | Symptom | Fix |
|---|---|---|---|
| Catastrophic forgetting | Learning rate too high; training data lacks diversity | MMLU or general benchmarks drop sharply after epoch 1 | Lower LR to 5e-5; add diverse system prompts; cap at 1–2 epochs |
| Rank too low for task complexity | r=4 used for multi-step reasoning or code generation | Loss plateaus well above 0.5; outputs are grammatically correct but logically wrong | Increase r to 32 or 64; re-run with all 7 target modules |
| Overfitting on small dataset | < 200 examples; 5+ epochs | Train loss → 0; eval perplexity rises; model repeats training phrases verbatim | Add more examples, reduce epochs to 2, or switch to LoRA + DPO preference training |
| Wrong target modules | Only q_proj targeted | Model adapts tone but not reasoning; complex format instructions ignored | Add v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj |
| NF4 compute instability | bnb_4bit_compute_dtype=torch.float16 on Ampere/Hopper GPU | NaN loss after first 10–20 steps | Change to torch.bfloat16; re-run |
| Adapter not merged before serving | merge_and_unload() skipped; PEFT wrapper left attached | 2× inference latency; serving crashes on model-version mismatch after upgrades | Always merge after training; save the merged model independently of the adapter |
| Chat template mismatch | tokenizer.apply_chat_template() not used; custom template applied | Model ignores instruction format; outputs raw completions without following the [INST] template | Use tokenizer.apply_chat_template(messages, tokenize=False) in your format function |

These seven failure modes cover the overwhelming majority of issues reported in fine-tuning forums, GitHub issues on PEFT and TRL, and internal post-mortems from teams running QLoRA in production. The two most damaging in production — catastrophic forgetting and adapter-not-merged latency — both have simple preventions: a canary eval and a merge step in the deployment pipeline.
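Those two preventions can be combined into a single pre-deployment check. A minimal sketch, assuming eval scores in percent have already been computed upstream (the function name and thresholds are illustrative, not from any library):

```python
def deployment_gate(task_acc, mmlu_acc, baseline_mmlu, merged,
                    min_task_acc=85.0, max_mmlu_drop=5.0):
    """Return (passes, reasons) for a fine-tuned checkpoint.

    All scores are in percent. `merged` indicates that merge_and_unload()
    has been run and the merged weights saved for serving.
    """
    reasons = []
    if task_acc < min_task_acc:
        reasons.append(f"task accuracy {task_acc:.1f}% below {min_task_acc:.1f}% gate")
    if baseline_mmlu - mmlu_acc > max_mmlu_drop:
        reasons.append(
            f"MMLU dropped {baseline_mmlu - mmlu_acc:.1f} points: catastrophic forgetting")
    if not merged:
        reasons.append("adapter not merged: expect ~1.6x serving latency")
    return not reasons, reasons

# A checkpoint that aced its task eval but forgot general knowledge fails:
ok, why = deployment_gate(task_acc=94.0, mmlu_acc=48.0, baseline_mmlu=62.0, merged=True)
```

Wiring this into CI as a hard gate, rather than a dashboard metric, is what actually stops a forgetting regression from shipping.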


🧭 Choosing Your Fine-Tuning Strategy: LoRA vs QLoRA vs Full Fine-Tune vs DPO

| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Maximum quality, compute available (≥ 4× A100) | Full fine-tune | No rank approximation error; optimal for high-stakes applications |
| 1 GPU, behavioural change, ≤ 7B model | LoRA (r=16, all projections) | Sufficient VRAM in fp16; adapter overhead is minimal |
| 1 GPU, 7B–13B model, ≤ 24 GB VRAM | QLoRA (r=16) | NF4 quantisation makes it fit; 90-min training on RTX 4090 |
| 2× A100, 70B model fine-tune | QLoRA (r=16–32) | Only viable single-node option; full fine-tune requires 8× |
| Have chosen/rejected completion pairs | DPO with LoRA adapters | DPOTrainer + LoRA targets alignment efficiently without RLHF infrastructure |
| Style or tone change only | LoRA r=4 on q_proj and v_proj | Minimal adapter; fast training; negligible quality drop for simple changes |
| Multi-tenant, per-customer adapters | LoRA (consistent rank + modules) | vLLM adapter hot-swap requires uniform adapter shapes across all tenants |
| Limited data (< 200 examples) | Prompt engineering first | Fine-tuning below 200 examples almost always overfits; few-shot prompting often matches or exceeds quality |

The decision between LoRA and QLoRA is almost entirely a VRAM question. If the base model in fp16 fits in your available GPU memory with room for adapter gradients and optimizer states, use standard LoRA — it trains slightly faster than QLoRA and has no risk of NF4 compute-dtype mismatch. If VRAM is the bottleneck, QLoRA is the correct choice with essentially no workflow changes beyond the BitsAndBytesConfig block.
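That VRAM question is simple arithmetic. A back-of-the-envelope sketch (pure Python; the 6 GiB headroom figure is a rough assumption for activations, adapter gradients, and adapter optimizer states, not a measured constant):

```python
def base_model_gib(params_b, bits):
    """Approximate weight memory in GiB for a model with params_b
    billion parameters stored at the given bit width."""
    return params_b * 1e9 * bits / 8 / 2**30

def pick_strategy(params_b, vram_gib, headroom_gib=6.0):
    """Choose LoRA if the fp16 base plus headroom fits in VRAM,
    else QLoRA with an NF4 (4-bit) base, else give up on one GPU."""
    if base_model_gib(params_b, 16) + headroom_gib <= vram_gib:
        return "LoRA (fp16 base)"
    if base_model_gib(params_b, 4) + headroom_gib <= vram_gib:
        return "QLoRA (NF4 base)"
    return "multi-GPU or smaller model"
```

For example, a 7B model in fp16 is about 13 GiB, so it trains with standard LoRA on an A100 80 GB, while a 13B model on a 24 GB card only fits once the base is quantised to NF4.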


🛠️ HuggingFace PEFT, TRL, and bitsandbytes: The Three Libraries Powering the Stack

The QLoRA fine-tuning workflow depends on three HuggingFace ecosystem libraries that each solve a distinct piece of the problem. Understanding what each library does — and what it does not do — prevents the most common misconfiguration errors.

PEFT (Parameter-Efficient Fine-Tuning) is the adapter injection layer. Its get_peft_model() function wraps any HuggingFace PreTrainedModel with the adapter architecture specified in a LoraConfig. PEFT handles the target_modules injection (finding and wrapping the right nn.Linear layers), the zero-initialisation of B matrices, the trainable_parameters() tracking, and the merge_and_unload() fusion. It supports LoRA, QLoRA, prefix tuning, IA3, and adaptation prompts through the same unified API. PEFT does not handle training loops, dataset formatting, or quantisation — those are responsibilities of TRL and bitsandbytes respectively.
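To see what that injection does under the hood, here is a dependency-free sketch of a single LoRA-wrapped linear layer — the helper names are illustrative, not PEFT's, and the matrices are toy-sized:

```python
import random

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, k, r = 4, 4, 2  # output dim, input dim, LoRA rank
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen base weight
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # A: small random init
B = [[0.0] * r for _ in range(d)]                                  # B: zero init

x = [[1.0], [2.0], [3.0], [4.0]]   # a column input vector
base_out = matmul(W, x)
delta_W = matmul(B, A)             # rank-r update, all zeros at step 0
lora_out = matmul(matadd(W, delta_W), x)

# Because B starts at zero, the adapted layer reproduces the frozen base
# exactly at step 0 — training begins from the base model's distribution.
assert lora_out == base_out
```

During training only A and B receive gradients; W never changes, which is exactly why optimizer state stays tiny.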

TRL (Transformer Reinforcement Learning) provides SFTTrainer for supervised fine-tuning and DPOTrainer for direct preference optimisation. Both extend HuggingFace Trainer with LLM-specific utilities: automatic dataset packing (fitting multiple short sequences into one context window to maximise GPU utilisation), chat-template application, and ConstantLengthDataset handling. For teams that have human feedback data in the form of chosen/rejected response pairs, DPOTrainer can run DPO on top of an SFT checkpoint — or directly on a LoRA-adapted base — to align the model with human preferences without the complexity of PPO:

from trl import DPOTrainer, DPOConfig

# Minimal DPO setup on top of a LoRA-adapted model.
# dpo_dataset must contain "prompt", "chosen", and "rejected" fields.
dpo_trainer = DPOTrainer(
    model=model,                   # PEFT model from get_peft_model()
    ref_model=None,                # None = use implicit reference from LoRA frozen base
    args=DPOConfig(
        output_dir="./dpo-output",
        num_train_epochs=1,        # DPO converges faster than SFT; 1 epoch is often enough
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-5,        # lower LR than SFT; DPO is sensitive to overshooting
        bf16=True,
        beta=0.1,                  # DPO temperature — controls how strongly to penalise rejected
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()

bitsandbytes is the GPU quantisation backend. It provides the BitsAndBytesConfig class consumed by AutoModelForCausalLM.from_pretrained() and implements the actual NF4 and INT8 quantisation kernels via CUDA. Without bitsandbytes, load_in_4bit=True has no effect. The library also provides the paged Adam optimiser via optim="paged_adamw_32bit" in TrainingArguments, which automatically manages CPU offload of optimizer states to prevent out-of-memory crashes during the backward pass. Adding this argument to the training script above is recommended for any QLoRA run on a model above 13B parameters.
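The paged optimizer matters so much because Adam's state dwarfs everything else in a full fine-tune. Rough arithmetic (fp32 states, two Adam moments per trainable parameter, and the ~0.5 % trainable fraction quoted above — all approximations):

```python
def adam_state_gib(trainable_params, bytes_per_state=4, states=2):
    """Memory for Adam's momentum + variance estimates, in GiB."""
    return trainable_params * states * bytes_per_state / 2**30

full_7b = adam_state_gib(7e9)          # full fine-tune: every weight trainable
lora_7b = adam_state_gib(7e9 * 0.005)  # LoRA: ~0.5 % of parameters trainable
```

Full fine-tuning of a 7B model needs roughly 52 GiB of optimizer state alone; the LoRA adapter's state is about 200× smaller, which is why paging it between CPU and GPU is cheap enough to be practical.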

For a full guide on when to choose fine-tuning over retrieval-augmented generation, see RAG vs Fine-Tuning: When to Use Each.


📚 Lessons Learned from Running QLoRA Fine-Tunes in Production

  1. merge_and_unload() is a deployment gate, not an optimisation. Adapters left attached at inference time add a full extra matrix multiplication to every forward pass. A 7B model with 7 adapter-attached projection layers runs at roughly 1.6× the latency of the merged equivalent. Always merge, always save the merged model separately from the adapter checkpoint.

  2. tokenizer.apply_chat_template() prevents the #1 silent failure. Every base model has a specific conversation template — Mistral uses [INST]/[/INST], Llama 3 uses a different header format, Phi-3 uses yet another. Training with a mismatched template produces a model that has learned to follow instructions in the wrong format. At inference time, this manifests as a model that ignores the instruction entirely or produces garbled outputs. Always use tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) in your data formatting function.

  3. r=16, alpha=32 is your starting point — only move up if loss plateaus. Increasing rank increases training time roughly linearly and does not always improve final quality. Start at r=16 for every new fine-tuning task. If the loss curve plateaus clearly above your target range of 0.5–0.8, try r=32. If you are adapting a model to generate structured code or complex multi-step documents, r=64 may be necessary.

  4. Monitor a general capability benchmark every epoch. Task-specific training loss going to zero while MMLU drops from 60 % to 45 % is a loss of general capability that will create user-visible regressions in every use case outside your training distribution. Integrate a 5-shot MMLU or similar benchmark into your evaluation loop and set a threshold: if general capability drops more than 5 points absolute, the model fails the deployment gate.

  5. QLoRA with paged Adam adds 20–30 % training time versus standard LoRA on the same VRAM. This overhead comes from CPU-GPU optimizer state paging. Budget for it in your training time estimates. For models that fit in VRAM without paging (e.g. 7B models on A100 80 GB), use standard LoRA + adamw_torch for faster training.

  6. Dataset quality beats dataset size — always, at every scale. Across public benchmarks and internal experiments, 500 carefully curated examples with correct format, diverse instructions, and accurate responses consistently outperform 5 000 auto-scraped, lightly-filtered examples. Before scaling your dataset, audit a random 50-example sample manually. If more than 5 % of samples have format errors, truncation artefacts, or factually incorrect responses, clean the dataset before adding more data.
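That 50-sample audit with a 5 % threshold is easy to automate as a first pass before the manual review. A minimal sketch — the heuristic checks here are illustrative placeholders, and the example schema (dicts with "prompt" and "response") is an assumption:

```python
import random

def audit_sample(dataset, n=50, max_bad_frac=0.05, seed=0):
    """Flag a dataset for cleaning if more than max_bad_frac of a
    random sample fails basic format checks."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(n, len(dataset)))

    def looks_bad(ex):
        resp = ex.get("response", "")
        return (not resp.strip()                  # empty / missing response
                or resp.rstrip().endswith("...")  # likely truncation artefact
                or "prompt" not in ex)            # format error
    bad = sum(looks_bad(ex) for ex in sample)
    return bad / len(sample) <= max_bad_frac, bad

clean = [{"prompt": "p", "response": "a full answer"}] * 100
dirty = clean[:90] + [{"prompt": "p", "response": "cut off..."}] * 10
```

Automated checks catch format and truncation problems; factual accuracy still requires a human pass over the same sample.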


📌 TLDR

  • LoRA freezes the base model and trains two tiny matrices per layer (B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k)) so that the effective weight update ΔW ≈ B × A has rank r. Only 0.1–0.6 % of parameters are trained; GPU memory drops by 60–70 %.
  • QLoRA adds NF4 4-bit quantisation of the frozen base (from 140 GB to 35 GB for a 70B model), double quantisation, and paged optimisers. A 70B fine-tune that requires 8× A100 80 GB under full fine-tuning runs on 2× A100 80 GB under QLoRA.
  • The three-library stack: PEFT injects adapters, TRL drives the training loop, bitsandbytes quantises the base.
  • Always merge_and_unload() before serving. Unmerged adapters run at 1.6× inference latency with no quality benefit.
  • Always use tokenizer.apply_chat_template(). Mismatched templates are the #1 silent failure mode.
  • Monitor general capability (MMLU or equivalent) alongside task loss to detect catastrophic forgetting before deployment.
  • Default configuration: r=16, alpha=32, all 7 attention + MLP projection modules targeted, bfloat16 compute, paged adamw for ≥ 30B models.

📝 Practice Quiz

  1. What is the primary reason that QLoRA can fine-tune a 70B model on 2× A100 80 GB GPUs when full fine-tuning requires 8×?

    • A) QLoRA uses a smaller rank than LoRA, which reduces the number of trainable parameters further
    • B) QLoRA quantises the frozen base model weights to 4-bit NF4, reducing the base model's memory footprint by approximately 4×
    • C) QLoRA skips the optimizer state entirely and trains only with first-order gradients
    • D) QLoRA replaces Adam with SGD, which requires no optimizer state

    Correct Answer: B
  2. You train a Mistral-7B QLoRA adapter at r=16 for 5 epochs on 600 customer support examples. After training, task-specific accuracy on your eval set is 94 %, but you observe that the model now refuses to answer general knowledge questions it answered correctly before fine-tuning. Which failure mode does this describe, and which single training argument change is most likely to fix it?

    • A) Rank too low — increase r to 64
    • B) Catastrophic forgetting caused by excessive training — reduce num_train_epochs to 1–2 and lower learning_rate to 5e-5
    • C) Chat template mismatch — apply tokenizer.apply_chat_template() during data formatting
    • D) NF4 instability — switch bnb_4bit_compute_dtype to torch.float16

    Correct Answer: B
  3. Why must B be initialised to zeros in LoRA, and what property does this zero-initialisation guarantee at the start of training?

    • A) Zero initialisation of B ensures the adapter never exceeds rank r during training, preserving the low-rank constraint
    • B) Zero initialisation of B ensures the adapter output B×A equals zero at step 0, so the model starts training from the exact same output distribution as the unmodified base model
    • C) Zero initialisation of B prevents gradient vanishing by ensuring the first backward pass propagates through a numerically stable path
    • D) Zero initialisation of B minimises the KL divergence between the adapter distribution and the base model distribution throughout training

    Correct Answer: B
  4. Open-ended challenge: You have fine-tuned a Llama 3 8B model with QLoRA (r=16, all 7 projection modules targeted) on 800 carefully curated domain-specific examples over 3 epochs. Task-specific eval accuracy reaches 91 %. However, MMLU accuracy drops from 62 % to 48 %, and users report that the model occasionally generates responses in a completely different language than the input. Diagnose the two most likely root causes for these symptoms and describe a revised training strategy — including specific changes to hyperparameters, data composition, and evaluation frequency — that would achieve at least 88 % task accuracy while keeping MMLU within 5 percentage points of the base model baseline.


Written by Abstract Algorithms (@abstractalgorithms)