# Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive
From the math of low-rank decomposition to running QLoRA on a single A100 — everything you need to fine-tune a 70B model without a supercomputer.
TLDR: LoRA freezes the base model and trains two tiny matrices per layer — 0.1 % of parameters, 70 % less GPU memory, near-identical quality. QLoRA adds 4-bit NF4 quantization of the frozen base, enabling 70B fine-tuning on 2× A100 80 GB instead of 8×. Use HuggingFace PEFT + TRL + bitsandbytes; always call `merge_and_unload()` before serving; monitor both task loss and a general-capability eval (e.g. MMLU) to catch catastrophic forgetting before it reaches production.
## 📖 The Memory Wall That Blocked Every LLM Fine-Tune Before 2022
Before 2022, fine-tuning a large language model was an HPC problem. A 7-billion-parameter model stored in 32-bit floating point occupies 28 GB of GPU memory for the weights alone. Add the Adam optimizer and the situation explodes: Adam maintains a momentum estimate and a variance estimate for each parameter — that is four copies of the model in memory simultaneously (weights + gradients + two optimizer states), totalling roughly 112 GB just to start training. A single A100 80 GB card cannot hold this. You needed at minimum four high-end GPUs connected over NVLink, and a full fine-tune of a 70B model required eight to sixteen A100 80 GB cards — hardware costing well above $200 000 per node.
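The back-of-envelope arithmetic above is easy to reproduce. A minimal sketch (the function name and fixed byte counts are illustrative; real training needs additional memory for activations and framework overhead on top of this floor):

```python
def full_finetune_vram_gb(n_params: float,
                          bytes_weights: int = 4,  # fp32 weights
                          bytes_grads: int = 4,    # fp32 gradients
                          bytes_optim: int = 8) -> float:  # Adam momentum + variance, fp32 each
    """Rough VRAM floor for full fine-tuning with Adam: four fp32 copies of the model."""
    return n_params * (bytes_weights + bytes_grads + bytes_optim) / 1e9

print(full_finetune_vram_gb(7e9))  # 7B model → 112.0 GB, matching the figure above
```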
The consequence was severe: only well-funded labs could run behavioural fine-tuning. Everyone else was stuck prompting the base model and hoping the in-context examples were enough. Domain adaptation — training a model to speak fluent medical, legal, or customer-support language — was effectively closed to teams without supercomputers.
LoRA (Low-Rank Adaptation of Large Language Models, Hu et al. 2021) dismantled that wall. Instead of updating all 7 billion weights, LoRA freezes the original model entirely and inserts two tiny trainable matrices alongside each attention projection. The total number of trainable parameters drops to roughly 0.5 % of the original — sometimes as low as 0.1 % for small ranks. GPU memory consumption falls by 60–70 % because optimizer states exist only for the tiny adapter matrices, not the full model.
QLoRA (Dettmers et al. 2023) pushed the boundary further. It quantises the frozen base model weights to 4-bit Normal Float (NF4), cutting the memory footprint of the frozen base to roughly a quarter of fp16 while keeping adapter training in full bfloat16 precision. The result is transformative: a Llama 3 70B fine-tune that required 8× A100 80 GB under full fine-tuning drops to 2× A100 80 GB under QLoRA — and down to a single A100 80 GB for 13B-class models.
The practical impact is immediate. A team at a mid-sized fintech company fine-tuned Mistral-7B-Instruct on 1 200 customer-support transcripts using QLoRA on a single 48 GB A40 GPU over a weekend. The resulting model outperformed GPT-4o on their proprietary query types while eliminating per-call API costs. That is the world LoRA opens.
## 🔍 LoRA in Plain English: Margin Notes on a Frozen Textbook
Think of the pre-trained base model as a textbook that already contains enormous knowledge about language, reasoning, and the world. Full fine-tuning is like reprinting the entire textbook with a few changed chapters — enormously expensive and wasteful since most pages need no change at all.
LoRA does something more elegant. It keeps the original textbook exactly as it is and writes margin notes — small, targeted annotations that modify how specific passages are interpreted. The notes are lightweight, swap in and out easily, and capture exactly the behavioural change you want without touching a single original page.
In practical terms, every attention weight matrix W (shaped d × k) in the transformer encodes a learned "skill" — how to compute query vectors, key vectors, or value projections. When you fine-tune a model to behave differently on a downstream task, you are not rewriting every skill from scratch. Research shows the effective change matrix ΔW — the difference between the fully fine-tuned weights and the original weights — has a surprisingly low intrinsic rank. That is, even though ΔW is a d × k matrix, almost all of its information can be captured by a much smaller object.
LoRA exploits this low-rank structure directly. Rather than storing ΔW as a full d × k matrix (which would be as expensive as the original), it approximates ΔW as the product of two much smaller matrices:
- Matrix A — shape `r × k`, the down-projection, where `r` is the rank, typically 4–64
- Matrix B — shape `d × r`, the up-projection
The product B × A produces a rank-r approximation of ΔW. Only A and B are trained; W is completely frozen. The parameter saving is dramatic: instead of training d × k parameters, you train d×r + r×k parameters. For a standard attention projection where d = k = 4096 and r = 16, this reduces from 16.7 million parameters to just 131 072 — a 128× reduction for that single layer.
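The arithmetic is worth checking once by hand. A quick sketch (the function name is invented; dimension names follow the text):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full ΔW (d*k) vs the LoRA factors B (d x r) plus A (r x k)."""
    return d * k, d * r + r * k

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full // lora)  # 16777216 131072 128
```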
## ⚙️ How LoRA Adds a Parallel Path to Every Attention Layer
LoRA does not modify the frozen weight matrix. Instead, it runs a lightweight parallel branch alongside the frozen forward pass. For every attention projection the forward computation becomes:
y = W_frozen · x + (B · A) · x
where W_frozen is completely frozen and B · A is the trainable low-rank adapter. At initialisation, B is set to all zeros and A is drawn from a small Gaussian. This means B · A = 0 at step zero, so the model output is identical to the base model at the start of training. Training can then proceed stably from a known-good baseline rather than from a random initialisation.
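The zero-initialisation property is easy to see in a toy implementation. This is a pure-Python illustration under the text's conventions (the class name and sizes are made up; real code uses torch tensors via the PEFT library):

```python
import random

class LoRALinear:
    """Minimal sketch of a LoRA-augmented linear layer: y = W·x + (alpha/r)·B·A·x."""

    def __init__(self, d: int, k: int, r: int, alpha: float):
        rng = random.Random(0)
        # Frozen base weight W (d x k) — never updated during fine-tuning.
        self.W = [[rng.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]
        # A (r x k): small Gaussian init; B (d x r): zeros, so B·A = 0 at step zero.
        self.A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]
        self.scale = alpha / r  # the lora_alpha / r scaling factor

    def forward(self, x: list[float]) -> list[float]:
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]       # down-project k -> r
        adapter = [self.scale * sum(b * ai for b, ai in zip(row, ax))       # up-project r -> d
                   for row in self.B]
        return [b + a for b, a in zip(base, adapter)]
```

At initialisation the adapter path contributes exactly zero, so the layer's output is identical to the frozen base — precisely the "known-good baseline" property described above.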
The diagram below shows how a single transformer attention layer is modified by LoRA. The frozen W path and the trainable B → A path run in parallel and their outputs are added together before being passed to the next component.
```mermaid
graph TD
    Input[Input activation x] --> FrozenW[Frozen weight W - shape d x k]
    Input --> AdapterA[LoRA down-projection A - shape r x k]
    AdapterA --> AdapterB[LoRA up-projection B - shape d x r]
    FrozenW --> Adder[Add frozen and adapter outputs]
    AdapterB --> Adder
    Adder --> Output[Layer output y]
```
The adapter branch compresses the input from `k` dimensions down to `r` dimensions via `A` (the down-projection) and then expands back to `d` dimensions via `B` (the up-projection). This bottleneck structure is what enforces the low-rank constraint: no matter what values `A` and `B` learn, their product `B × A` can represent at most `r` linearly independent directions in the output space. The rank `r` is therefore the key capacity hyperparameter — it controls how much behavioural change the adapter can encode.
LoRA adapters are typically attached to the query and value projections (`q_proj`, `v_proj`) inside every attention block, though the best practice for stronger adaptation is to also target the key projection (`k_proj`), output projection (`o_proj`), and the three MLP projections (`gate_proj`, `up_proj`, `down_proj`). Targeting all seven projection types roughly doubles the parameter count but produces noticeably better adaptation on complex tasks.
---
## 🧠 Deep Dive: The Math of Low-Rank Decomposition, NF4 Quantization, and QLoRA's Architecture
### Mathematical Model: The Low-Rank Decomposition Formula and Why Fine-Tuning Deltas Are Naturally Sparse
The formal statement of LoRA is concise. Given a frozen pre-trained weight matrix `W_0 ∈ ℝ^(d×k)`, the modified forward pass is:
h = W_0 · x + (B · A) · x = (W_0 + B · A) · x
where `B ∈ ℝ^(d×r)`, `A ∈ ℝ^(r×k)`, rank `r << min(d, k)`.
The key insight from the original LoRA paper is empirical: when full fine-tuning changes a pre-trained model's weights (i.e., `ΔW = W_fine_tuned - W_0`), the resulting change matrix has a very low effective rank — the singular values of `ΔW` fall off quickly after the first few dominant components. This means approximating `ΔW ≈ B · A` at rank 16 or 32 captures the overwhelming majority of the useful fine-tuning signal while discarding noise.
The adapter output is scaled by a factor of `lora_alpha / r` before being added to the frozen output. If `lora_alpha = 32` and `r = 16`, the effective scale is `2.0`. This means the adapter contribution is amplified by 2× relative to the frozen base. The alpha parameter functions as a learning rate multiplier for the adapter branch: increasing `alpha` relative to `r` makes the adapter more aggressively override the base model behaviour. The default ratio `alpha = 2 × r` is a well-tested starting point — it keeps adapter influence significant without overwhelming the frozen base early in training.
### LoRA and QLoRA Internals: NF4 Quantization, Double Quant, and the 4-Bit Architecture
QLoRA introduces three innovations on top of LoRA to squeeze the frozen base into 4 bits without significant quality loss.
**Normal Float 4 (NF4):** Standard 4-bit integer quantisation distributes its 16 representable values uniformly across a numerical range. This is inefficient for neural network weights, which are empirically normally distributed around zero. NF4 instead places its 16 quantisation levels at the quantiles of the standard normal distribution — more levels near zero where most weights cluster, fewer at the extremes. For normally distributed values this is information-theoretically optimal and empirically outperforms standard INT4 or FP4.
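The quantile idea can be sketched with the standard library alone. The codebook below is a simplified illustration of the concept — the real NF4 codebook from the QLoRA paper is asymmetric and includes an exact zero, and all function names here are invented:

```python
from statistics import NormalDist

def nf4_like_levels(bits: int = 4) -> list[float]:
    """Place 2**bits quantisation levels at evenly spaced quantiles of N(0,1),
    normalised into [-1, 1]: more levels near zero, fewer at the extremes."""
    n = 2 ** bits
    nd = NormalDist()
    quantiles = [(i + 0.5) / n for i in range(n)]
    levels = [nd.inv_cdf(q) for q in quantiles]
    m = max(abs(l) for l in levels)
    return [l / m for l in levels]

def quantise(block: list[float], levels: list[float]) -> list[float]:
    """Absmax-scale a block of weights, then snap each value to the nearest level."""
    scale = max(abs(v) for v in block) or 1.0
    return [min(levels, key=lambda l: abs(l - v / scale)) * scale for v in block]
```

Because the levels are quantiles of a normal distribution, the spacing between adjacent levels is tightest around zero — exactly where real LLM weights cluster.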
**Double quantisation:** The quantisation constants themselves (one per block of weights) consume memory. QLoRA quantises these constants a second time using FP8, saving approximately 0.37 bits per parameter on average — roughly 3 GB on a 65B-parameter model. Small, but meaningful at scale.
**Paged optimisers:** When GPU memory is exhausted during a backward pass, CUDA would normally crash with an out-of-memory error. QLoRA uses NVIDIA's unified memory mechanism to automatically page optimizer states to CPU RAM and back as needed, eliminating these crashes at the cost of some training-step throughput.
The net result: a Llama 3 70B model that requires 140 GB in fp16 fits in roughly 35 GB in NF4 — close to a 4× compression ratio, with the quantisation constants adding only a small overhead on top. Adapter training proceeds in bfloat16, so all gradient computations are numerically stable despite the compressed frozen base.
### Performance Analysis: Rank Selection, Parameter Counts, and Training Cost Trade-offs
To understand what rank means in practice, consider a model with hidden dimension `d = 4096` and a single attention projection of shape `4096 × 4096`. A LoRA adapter at rank `r` adds `2 × r × 4096` trainable parameters — one matrix `A` of shape `4096 × r` and one matrix `B` of shape `r × 4096`. For `r = 16` this is 131 072 parameters, for `r = 64` it is 524 288.
A Llama 3 8B model has 32 transformer layers, each with 4 attention projections and 3 MLP projections. Treating every projection as `4096 × 4096` for a back-of-envelope estimate (real checkpoints use grouped-query attention and wider MLP layers, so exact counts differ), targeting all 7 projections at `r = 16` gives:

32 layers × 7 projections × 2 × 16 × 4096 = 29,360,128 trainable parameters

Out of 8 billion total parameters, this is roughly 0.37 %. The PEFT library reports this figure when you call `model.print_trainable_parameters()` before training.
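For a real checkpoint the projection shapes are not all square. Using Mistral-7B's published dimensions (hidden size 4096, grouped-query KV dimension 1024, MLP intermediate size 14336, 32 layers), the exact count can be reproduced — it matches the 41,943,040 figure that PEFT prints for the configuration used in the training script later in this article (the helper name and shape table below are my own):

```python
def lora_params(shapes: dict[str, tuple[int, int]], r: int, n_layers: int) -> int:
    """Sum r*(d_in + d_out) over every targeted projection in every layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

# (in_features, out_features) for each targeted projection in Mistral-7B
mistral_shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}
print(lora_params(mistral_shapes, r=16, n_layers=32))  # 41943040
```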
The practical guidance for rank selection:
- `r = 4` — use for style or tone changes on smaller models (≤ 7B); very fast training, minimal expressiveness
- `r = 16` — the universal default; covers instruction-following, domain adaptation, format changes
- `r = 32` or `r = 64` — use when training on complex multi-step reasoning tasks or when `r = 16` shows a loss plateau well above baseline
- Increasing `r` beyond 64 rarely helps and approaches the cost of full fine-tuning
---
## 📊 The QLoRA Training Pipeline From Data to Served Model
The complete QLoRA workflow involves six distinct stages, each with specific tooling and failure points. Understanding the pipeline as a whole — rather than just the training step — is the difference between a working deployment and a model that behaves perfectly in the notebook but bizarrely in production.
The diagram below shows every stage, from the raw base model to a production-ready merged model served behind vLLM. Pay particular attention to the merge step: LoRA adapters loaded at inference time without merging incur a 2× compute overhead in the adapter projection, and the adapter checkpoint is architecturally fragile (it requires the exact same PEFT version and base model version to load).
```mermaid
graph TD
    BaseModel[Base model in fp16] --> Quantise[NF4 4-bit quantisation via bitsandbytes]
    Quantise --> FrozenBase[Frozen 4-bit base in memory]
    FrozenBase --> PrepKbit[prepare_model_for_kbit_training - gradient checkpointing]
    PrepKbit --> AttachAdapters[Attach LoRA adapters in bfloat16 via PEFT]
    TrainingData[Formatted training data] --> SFTLoop[SFTTrainer supervised fine-tuning loop]
    AttachAdapters --> SFTLoop
    SFTLoop --> SaveAdapter[Save LoRA adapter checkpoint]
    SaveAdapter --> MergeStep[Load base in fp16 and call merge_and_unload]
    MergeStep --> MergedModel[Merged fp16 model - no adapter dependency]
    MergedModel --> ServingLayer[vLLM or TGI serving endpoint]
```
Each arrow in the diagram hides a potential failure mode. The quantisation step must use `bnb_4bit_quant_type="nf4"` and `bnb_4bit_compute_dtype=torch.bfloat16` — using float16 as the compute dtype can produce NaN gradients on A100 and H100 GPUs. The merge step loads the base model in fp16 (not 4-bit) because merging requires full-precision arithmetic to fuse `W + B×A` accurately. Skipping the merge and serving with adapter weights attached doubles inference latency for no accuracy benefit.
The performance comparison table below shows how different configurations trade GPU requirements against training time and quality relative to a full fine-tune baseline:
| Configuration | Model | GPU Required | Train Time per 1 K examples | Quality vs Full FT |
| --- | --- | --- | --- | --- |
| Full fine-tune | Llama 3 8B | 4× A100 40 GB | ~2 hours | Baseline |
| LoRA (r=16) | Llama 3 8B | 1× A100 40 GB | ~45 min | -2 to -4 % |
| QLoRA (r=16) | Llama 3 8B | 1× RTX 4090 24 GB | ~90 min | -3 to -6 % |
| LoRA (r=16) | Llama 3 70B | 4× A100 80 GB | ~8 hours | -2 to -4 % |
| QLoRA (r=16) | Llama 3 70B | 2× A100 80 GB | ~14 hours | -3 to -6 % |
Quality deltas are measured as relative difference on task-specific benchmarks and vary significantly with dataset size and quality — 2 000 well-curated examples consistently outperform 20 000 auto-scraped examples.
## 🧪 Complete Working Example: Fine-Tuning Mistral-7B with QLoRA on a Custom Support Dataset
This section walks through a complete, runnable QLoRA fine-tune from data preparation to inference with the merged model. The scenario is a Mistral-7B-Instruct model adapted to summarise customer complaint tickets in a specific corporate format — a realistic domain-adaptation task that highlights every config decision you will encounter in practice.
The code is split into two scripts that mirror the two operational stages: training (including adapter saving) and merging + inference. Both scripts are annotated inline with the reasoning behind each hyperparameter choice.
### Training Script: `qlora_train.py`
```python
# qlora_train.py — fine-tune Mistral-7B-Instruct with QLoRA on custom data
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# --- 1. Training data: instruction-response pairs ---
# In production, load from JSONL or a Hugging Face dataset.
# Minimum viable size: ~200 examples. Recommended: 500–2000.
TRAINING_EXAMPLES = [
    {
        "instruction": "Summarize the following customer complaint in one sentence.",
        "input": "I've been waiting 3 weeks for my order and no one is responding to my emails.",
        "response": "Customer has been waiting 3 weeks for an undelivered order with no email response from support.",
    },
    # ... add 500-2000 more examples for real use cases
]

def format_for_mistral(ex: dict) -> str:
    """Apply Mistral's instruction template.

    Critically: use the model's own chat template, not a custom one.
    Mismatched templates are the #1 cause of fine-tuned models ignoring instructions.
    For other base models, use: tokenizer.apply_chat_template(messages, tokenize=False)
    """
    user_msg = ex["instruction"]
    if ex.get("input"):
        user_msg += f"\n\n{ex['input']}"
    return f"<s>[INST] {user_msg} [/INST] {ex['response']} </s>"

# --- 2. QLoRA quantization config ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4 — optimal for normally distributed LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 avoids NaN on A100/H100; do NOT use float16
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants (~0.37 bits/param saved)
)

# --- 3. Load model + tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token
tokenizer.padding_side = "right"           # right-padding avoids position embedding shifts

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # fills GPU first, spills to CPU if needed
    trust_remote_code=True,
)

# Enables gradient checkpointing for the 4-bit base — required before get_peft_model
model = prepare_model_for_kbit_training(model)

# --- 4. LoRA adapter config ---
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — 16 balances adapter capacity vs parameter count
    lora_alpha=32,      # effective scale = alpha / r = 2.0; amplifies adapter contribution
    lora_dropout=0.05,  # light regularisation; 0.0 is acceptable for larger datasets
    bias="none",        # don't train biases — minimal expressiveness gain for significant overhead
    target_modules=[    # target all attention + MLP projections for maximal adaptation
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output: trainable params: 41,943,040 || all params: 7,283,965,952 || trainable%: 0.5759

# --- 5. Dataset ---
dataset = Dataset.from_list([
    {"text": format_for_mistral(ex)} for ex in TRAINING_EXAMPLES
])

# --- 6. Training ---
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,  # set to your median sequence length; longer = more GPU memory
    args=TrainingArguments(
        output_dir="./mistral-7b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch size = 2 × 8 = 16
        warmup_ratio=0.05,              # warm up over first 5 % of steps; prevents early LR spikes
        learning_rate=2e-4,             # standard QLoRA LR; lower to 1e-4 if loss is unstable
        fp16=False,
        bf16=True,                      # must match bnb_4bit_compute_dtype
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="no",       # add eval_dataset + strategy="steps" for early stopping
        report_to="none",               # swap to "wandb" for production monitoring
    ),
)
trainer.train()

# --- 7. Save adapter only (not the full merged model) ---
model.save_pretrained("./mistral-7b-qlora-adapter")
tokenizer.save_pretrained("./mistral-7b-qlora-adapter")
print("Adapter saved. Run qlora_merge_and_infer.py to merge before serving.")
```
### Merge and Inference Script: `qlora_merge_and_infer.py`
```python
# qlora_merge_and_infer.py — merge adapter into base model weights, then run inference.
# This script runs AFTER training is complete.
# Always merge before deploying to production — adapter-only serving is 2x slower.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
ADAPTER_PATH = "./mistral-7b-qlora-adapter"
MERGED_PATH = "./mistral-7b-merged"

def merge_adapter():
    """Fuse the LoRA adapter matrices (B × A) into the frozen base weights (W).

    Why: after merge_and_unload(), W_new = W_frozen + B × A for each layer.
    The adapter is removed, eliminating the extra projection overhead at inference.

    Load the base in fp16 for merging — NOT 4-bit. Merging requires
    full-precision matrix addition; NF4 cannot represent the result accurately.
    """
    print("Loading base model in fp16 for merging...")
    tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16,  # full precision needed for merge arithmetic
        device_map="auto",
    )
    print("Attaching LoRA adapter...")
    model = PeftModel.from_pretrained(base, ADAPTER_PATH)
    print("Merging and unloading...")
    merged = model.merge_and_unload()  # fuses B×A into W, removes PEFT wrapper
    merged.save_pretrained(MERGED_PATH)
    tokenizer.save_pretrained(MERGED_PATH)
    print(f"Merged model saved to {MERGED_PATH}")

def infer(prompt: str) -> str:
    """Run inference on the merged model.

    The merged model is a standard HuggingFace CausalLM — no PEFT dependency.
    In production, replace this with vLLM for batched serving with 3–5x higher throughput.
    """
    tokenizer = AutoTokenizer.from_pretrained(MERGED_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MERGED_PATH,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    inputs = tokenizer(
        f"<s>[INST] {prompt} [/INST]",
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.1,  # low temperature for near-deterministic summarisation
            do_sample=True,
        )
    # Slice off the input tokens; decode only the generated response
    response_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(response_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    merge_adapter()
    test_prompt = (
        "Summarize the following customer complaint in one sentence.\n\n"
        "I ordered a laptop stand three weeks ago. It never arrived, tracking hasn't updated "
        "in two weeks, and your customer support keeps closing my tickets without responding."
    )
    print("\nModel response:")
    print(infer(test_prompt))
```
The training script's most important configuration choices are the `bnb_4bit_compute_dtype` (always bfloat16, not float16), the `target_modules` list (all seven projection types for full adaptation), and `gradient_accumulation_steps` (multiplied by the batch size to give an effective batch of 16–32). The merge script's most important detail is loading the base model in fp16 — not 4-bit — because the merge arithmetic requires numerical precision that NF4 cannot provide.
## 📈 Evaluating Your Fine-Tune: Loss Curves, Perplexity, and the MMLU Canary
Training loss alone is a dangerously incomplete signal. A model can reach near-zero training loss on 800 examples while simultaneously catastrophically forgetting how to perform multi-step reasoning on topics outside the training set. Production-safe evaluation requires three distinct signals monitored together.
Training loss tells you whether the model is learning the target task format. A healthy loss curve for a QLoRA run drops sharply in the first 10–20 % of steps and then decelerates into a gentle plateau around 0.3–0.8 for instruction-following tasks. If the curve never leaves the starting value (typically 2.0–4.0 for causal LMs), check your chat template — the most common cause of a flat loss curve is misformatted training data that the model cannot fit even with high learning rates. If the curve drops all the way to near 0.0 on a small dataset, you are overfitting — add `early_stopping_patience` or reduce `num_train_epochs`.

Perplexity on a held-out split of your training distribution is the in-domain quality signal. Compute it with `trainer.evaluate()` after adding an `eval_dataset` to your `SFTTrainer` config. Perplexity decreasing on the eval split while training loss is still high indicates that the model is generalising rather than memorising.
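Concretely, perplexity is just the exponential of the mean per-token cross-entropy that `trainer.evaluate()` returns under the `eval_loss` key (a one-line helper; the function name is my own):

```python
import math

def perplexity(mean_eval_loss: float) -> float:
    # Causal-LM eval loss is mean per-token cross-entropy in nats,
    # so perplexity = exp(loss).
    return math.exp(mean_eval_loss)

print(round(perplexity(1.2), 2))  # an eval_loss of 1.2 → perplexity 3.32
```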
A general-capability canary — such as MMLU (5-shot), HellaSwag, or a fixed internal benchmark — detects catastrophic forgetting before it reaches users. Run this eval every epoch or every 500 steps. Seeing your domain task score rise while MMLU drops from 62 % to 48 % is a clear signal that your learning rate is too high or your dataset lacks diversity. The fix is typically a lower learning rate (5e-5 instead of 2e-4), adding system-prompt diversity to the training examples, or using DPO-style preference optimisation instead of pure SFT for the second stage.
For quick MMLU evaluation during a QLoRA run, the lm-evaluation-harness library from EleutherAI runs against any HuggingFace-compatible checkpoint with a single CLI command and produces per-task accuracy scores that can be logged to W&B or MLflow alongside your training metrics.
## 🏗️ Advanced Deployment Patterns: Adapter Composition, Continued Pre-Training, and Multi-Tenant Serving
Once you have mastered the basic QLoRA fine-tune loop, three advanced patterns become relevant as you move from single-model prototypes to production serving systems.
### Stacking Multiple LoRA Adapters with LoraHub
LoraHub (Huang et al. 2023) demonstrates that LoRA adapters trained on different tasks can be combined by weighted interpolation. If you have an adapter fine-tuned on code generation and a separate adapter fine-tuned on structured data extraction, LoraHub searches for a coefficient vector [w1, w2] such that ΔW = w1 × ΔW_code + w2 × ΔW_extraction maximises performance on a new task — without any gradient-based training, using only a handful of in-context examples for calibration. This is valuable when you want to compose specialised adapters rather than training a single adapter on a mixture dataset.
PEFT supports this through add_weighted_adapter(), which merges a list of adapter checkpoints using either linear combination or SVD-based composition. Linear combination is faster; SVD-based produces lower approximation error when the adapters' weight spaces differ significantly.
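A linear combination is exactly what it sounds like. A minimal sketch on plain nested lists (illustrative only — PEFT's `add_weighted_adapter()` performs the equivalent on real adapter tensors, and the function name here is invented):

```python
def combine_deltas(deltas: list[list[list[float]]],
                   weights: list[float]) -> list[list[float]]:
    """LoraHub-style linear composition: ΔW = Σ w_i · ΔW_i, elementwise."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [
        [sum(w * d[i][j] for w, d in zip(weights, deltas)) for j in range(cols)]
        for i in range(rows)
    ]

code_delta = [[1.0, 0.0], [0.0, 1.0]]      # toy ΔW from a "code" adapter
extract_delta = [[0.0, 2.0], [2.0, 0.0]]   # toy ΔW from an "extraction" adapter
print(combine_deltas([code_delta, extract_delta], [0.5, 0.5]))
# [[0.5, 1.0], [1.0, 0.5]]
```

LoraHub's contribution is the search over the weight vector itself, calibrated on a handful of in-context examples rather than by gradient descent.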
### Continued Pre-Training Before Instruction Fine-Tuning
For domains with highly specialised vocabulary — genomics, patent law, semiconductor design — a two-stage approach outperforms direct instruction fine-tuning. Stage one uses LoRA (not QLoRA) with a large rank (r=128) to run continued pre-training on domain documents with a standard next-token prediction objective. This step updates the model's vocabulary distribution without changing instruction-following behaviour. Stage two then runs a standard QLoRA instruction fine-tune on a smaller supervised dataset. The stage-one checkpoint absorbs domain vocabulary cheaply; the stage-two checkpoint aligns on the task format. The Microsoft Phi-3 and Yi series models both follow this two-stage curriculum at the base model level.
### Multi-Tenant Adapter Hot-Swapping with vLLM
vLLM's LoRARequest API loads and unloads LoRA adapters per-request against a shared base model instance. This means a single GPU cluster hosting Llama 3 8B can simultaneously serve hundreds of customer-specific fine-tuned models by swapping adapters in the KV-cache-aware serving pipeline. Three requirements must hold for this to work correctly: all adapters must use the same rank (r), all adapters must target the same set of modules, and all adapters must be trained against the same base model checkpoint (not different quantisation configurations). Any deviation causes shape mismatches that crash the serving process.
The operational pattern is: train all customer adapters centrally with a shared LoraConfig template, store them in object storage (S3 or GCS), and reference them by ID in the LoRARequest at inference time. The base model stays loaded in GPU memory across all requests; only the much-smaller adapter tensors are swapped.
## 🌍 Who Is Running LoRA in Production and What They Have Learned
LoRA and QLoRA have moved from research papers into the core infrastructure of companies building LLM products. The patterns below reflect public case studies, blog posts, and open-source releases from organisations operating at scale.
Mistral AI and the vertical AI product wave. The Mistral 7B base model was explicitly designed to be fine-tunable on consumer hardware, and the company's release strategy — base + instruct + API — implicitly assumes that customers will run LoRA adapters on top. Legal AI startups (Harvey, Casetext), medical AI companies (Nabla, Suki), and customer-service platforms (Freshdesk, Zendesk AI) all use LoRA-adapted Mistral variants to deliver domain accuracy that a generic instruction model cannot match.
HuggingFace Zephyr: DPO on top of SFT on top of LoRA. The Zephyr-7B-beta model is a public demonstration of the full fine-tuning pipeline. The team ran SFT with LoRA adapters on Mistral-7B using a synthetic instruction dataset, then ran DPO (Direct Preference Optimisation) on the resulting checkpoint using human-preference pairs. The final model outperformed Llama 2 70B on the MT-Bench chat benchmark using a model one-tenth the size and a fraction of the compute. The DPO stage used DPOTrainer from the TRL library — the same library used in the training example above.
Anyscale: per-tenant adapter hot-swapping. Anyscale's managed fine-tuning product allows enterprise customers to maintain their own LoRA adapters and have them loaded at inference time using vLLM's dynamic adapter loading feature. Each tenant's adapter is stored in object storage and loaded into a shared base model instance on demand — a multiplexed serving architecture that makes per-customer model personalisation economically viable. The critical requirement for this pattern is that all adapters target the same rank and the same set of modules; otherwise the base model cannot serve them interchangeably.
Nous Research Hermes: instruction-tuning on synthetic data. The Hermes series (Hermes 2, Hermes 3) demonstrates that dataset quality dominates dataset size. Nous Research used synthetically generated instruction data from Claude and GPT-4 to create highly diverse training examples, then ran LoRA fine-tuning on Llama base models. Hermes 2 Pro Llama 3 8B consistently scores above identically-parameterised models trained on larger but lower-quality datasets — a real-world validation of the "500 curated examples beats 5 000 scraped ones" principle.
## ⚖️ Seven Ways QLoRA Fine-Tunes Go Wrong — and How to Fix Each One
| Failure Mode | Root Cause | Symptom | Fix |
| --- | --- | --- | --- |
| Catastrophic forgetting | Learning rate too high; training data lacks diversity | MMLU or general benchmarks drop sharply after epoch 1 | Lower LR to 5e-5; add diverse system prompts; cap at 1–2 epochs |
| Rank too low for task complexity | r=4 used for multi-step reasoning or code generation | Loss plateaus well above 0.5; outputs are grammatically correct but logically wrong | Increase r to 32 or 64; re-run with all 7 target modules |
| Overfitting on small dataset | < 200 examples; 5+ epochs | Train loss → 0; eval perplexity rises; model repeats training phrases verbatim | Add more examples, reduce epochs to 2, or switch to LoRA + DPO preference training |
| Wrong target modules | Only q_proj targeted | Model adapts tone but not reasoning; complex format instructions ignored | Add v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj |
| NF4 compute instability | bnb_4bit_compute_dtype=torch.float16 on Ampere/Hopper GPU | NaN loss after first 10–20 steps | Change to torch.bfloat16; re-run |
| Adapter not merged before serving | merge_and_unload() skipped; PEFT wrapper left attached | 2× inference latency; serving crashes on model-version mismatch after upgrades | Always merge after training; save the merged model independently of the adapter |
| Chat template mismatch | tokenizer.apply_chat_template() not used; custom template applied | Model ignores instruction format; outputs raw completions without following the [INST] template | Use tokenizer.apply_chat_template(messages, tokenize=False) in your format function |
These seven failure modes cover the overwhelming majority of issues reported in fine-tuning forums, GitHub issues on PEFT and TRL, and internal post-mortems from teams running QLoRA in production. The two most damaging in production — catastrophic forgetting and adapter-not-merged latency — both have simple preventions: a canary eval and a merge step in the deployment pipeline.
🧭 Choosing Your Fine-Tuning Strategy: LoRA vs QLoRA vs Full Fine-Tune vs DPO
| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Maximum quality, compute available (≥ 4× A100) | Full fine-tune | No rank approximation error; optimal for high-stakes applications |
| 1 GPU, behavioural change, ≤ 7B model | LoRA (r=16, all projections) | Sufficient VRAM in fp16; adapter overhead is minimal |
| 1 GPU, 7B–13B model, ≤ 24 GB VRAM | QLoRA (r=16) | NF4 quantisation makes it fit; 90-min training on RTX 4090 |
| 2× A100, 70B model fine-tune | QLoRA (r=16–32) | Only viable single-node option; full fine-tune requires 8× |
| Have chosen/rejected completion pairs | DPO with LoRA adapters | DPOTrainer + LoRA targets alignment efficiently without RLHF infrastructure |
| Style or tone change only | LoRA r=4 on q_proj and v_proj | Minimal adapter; fast training; negligible quality drop for simple changes |
| Multi-tenant, per-customer adapters | LoRA (consistent rank + modules) | vLLM adapter hot-swap requires uniform adapter shapes across all tenants |
| Limited data (< 200 examples) | Prompt engineering first | Fine-tuning below 200 examples almost always overfits; few-shot prompting often matches or exceeds quality |
The decision between LoRA and QLoRA is almost entirely a VRAM question. If the base model in fp16 fits in your available GPU memory with room for adapter gradients and optimizer states, use standard LoRA — it trains slightly faster than QLoRA and has no risk of NF4 compute-dtype mismatch. If VRAM is the bottleneck, QLoRA is the correct choice with essentially no workflow changes beyond the BitsAndBytesConfig block.
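The VRAM question reduces to a back-of-envelope estimate. The sketch below uses the approximations from this post (four fp32 copies for a full Adam fine-tune, 2 bytes per parameter for an fp16 frozen base, roughly 0.5 bytes per parameter under NF4) and ignores activations, CUDA context, and adapter overhead, so treat it as a lower bound rather than a profiler.

```python
def training_vram_gb(n_params_b: float, mode: str) -> float:
    """Rough training-VRAM estimate in GB for a model with n_params_b
    billion parameters. Constants follow the approximations in this post."""
    n = n_params_b * 1e9
    if mode == "full":   # fp32 weights + gradients + 2 Adam states = 4 copies
        return 4 * 4 * n / 1e9
    if mode == "lora":   # fp16 frozen base; adapter overhead is negligible
        return 2 * n / 1e9
    if mode == "qlora":  # NF4 base at ~0.5 bytes per parameter
        return 0.5 * n / 1e9
    raise ValueError(f"unknown mode: {mode}")
```

Plugging in the numbers from this post: a 7B full fine-tune lands at ~112 GB, a 7B fp16 base at ~14 GB, and a 70B NF4 base at ~35 GB, matching the figures quoted earlier.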
🛠️ HuggingFace PEFT, TRL, and bitsandbytes: The Three Libraries Powering the Stack
The QLoRA fine-tuning workflow depends on three HuggingFace ecosystem libraries that each solve a distinct piece of the problem. Understanding what each library does — and what it does not do — prevents the most common misconfiguration errors.
PEFT (Parameter-Efficient Fine-Tuning) is the adapter injection layer. Its `get_peft_model()` function wraps any HuggingFace `PreTrainedModel` with the adapter architecture specified in a `LoraConfig`. PEFT handles the `target_modules` injection (finding and wrapping the right `nn.Linear` layers), the zero-initialisation of B matrices, the `print_trainable_parameters()` reporting, and the `merge_and_unload()` fusion. It supports LoRA, QLoRA, prefix tuning, IA3, and adaption prompts through the same unified API. PEFT does not handle training loops, dataset formatting, or quantisation — those are the responsibilities of TRL and bitsandbytes respectively.
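The arithmetic PEFT manages can be illustrated in a few lines of dependency-free Python. This is a toy sketch of the LoRA math, not PEFT's implementation: `B` starts at zero so the adapter is a no-op at step 0, and merging folds `(alpha/r)·B·A` into the frozen weight, which is exactly what `merge_and_unload()` computes.

```python
# Toy sketch of the LoRA update; tiny dimensions keep the math readable.

def matmul(M, N):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def add_scaled(M, N, scale):
    return [[M[i][j] + scale * N[i][j] for j in range(len(M[0]))]
            for i in range(len(M))]

d, k, r, alpha = 4, 4, 2, 4
W = [[float(i == j) for j in range(k)] for i in range(d)]  # frozen base weight
A = [[0.1] * k for _ in range(r)]                          # A: small nonzero init
B = [[0.0] * r for _ in range(d)]                          # B: zero init

delta_w = matmul(B, A)          # ΔW = B·A is exactly zero before training,
assert all(v == 0.0 for row in delta_w for v in row)  # so output == base model

# After training has updated B, merging folds the adapter into the base:
B = [[0.5] * r for _ in range(d)]
W_merged = add_scaled(W, matmul(B, A), alpha / r)  # W + (alpha/r)·B·A
```

The zero-initialised `B` is why a LoRA run starts from the base model's exact output distribution, and the final line is why a merged model has no extra inference cost.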
TRL (Transformer Reinforcement Learning) provides `SFTTrainer` for supervised fine-tuning and `DPOTrainer` for direct preference optimisation. Both extend the HuggingFace `Trainer` with LLM-specific utilities: automatic dataset packing (fitting multiple short sequences into one context window to maximise GPU utilisation), chat-template application, and `ConstantLengthDataset` handling. For teams that have human feedback data in the form of chosen/rejected response pairs, `DPOTrainer` can run DPO on top of an SFT checkpoint — or directly on a LoRA-adapted base — to align the model with human preferences without the complexity of PPO:
```python
from trl import DPOTrainer, DPOConfig

# Minimal DPO setup on top of a LoRA-adapted model.
# dpo_dataset must contain "prompt", "chosen", and "rejected" fields.
dpo_trainer = DPOTrainer(
    model=model,        # PEFT model from get_peft_model()
    ref_model=None,     # None = use implicit reference from LoRA frozen base
    args=DPOConfig(
        output_dir="./dpo-output",
        num_train_epochs=1,  # DPO converges faster than SFT; 1 epoch is often enough
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-5,  # lower LR than SFT; DPO is sensitive to overshooting
        bf16=True,
        beta=0.1,            # DPO temperature — controls how strongly to penalise rejected
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,     # recent TRL versions rename this to processing_class
)
dpo_trainer.train()
```
bitsandbytes is the GPU quantisation backend. It provides the BitsAndBytesConfig class consumed by AutoModelForCausalLM.from_pretrained() and implements the actual NF4 and INT8 quantisation kernels via CUDA. Without bitsandbytes, load_in_4bit=True has no effect. The library also provides the paged Adam optimiser via optim="paged_adamw_32bit" in TrainingArguments, which automatically manages CPU offload of optimizer states to prevent out-of-memory crashes during the backward pass. Adding this argument to the training script above is recommended for any QLoRA run on a model above 13B parameters.
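Putting the three pieces together at load time looks roughly like this. It is a configuration sketch, not a full training script: the model id is an illustrative placeholder, and the four `bnb_4bit_*` settings mirror the recommendations earlier in this post (NF4, double quantisation, bfloat16 compute).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 base quantisation with double quantisation and bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # float16 here risks NaN loss
)

# Placeholder model id — substitute your base checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```

From here, PEFT's `get_peft_model()` attaches the adapters and TRL's trainer drives the loop; for models above ~13B, pair this with `optim="paged_adamw_32bit"` as noted above.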
For a full guide on when to choose fine-tuning over retrieval-augmented generation, see RAG vs Fine-Tuning: When to Use Each.
📚 Lessons Learned from Running QLoRA Fine-Tunes in Production
- **`merge_and_unload()` is a deployment gate, not an optimisation.** Adapters left attached at inference time add a full extra matrix multiplication to every forward pass. A 7B model with 7 adapter-attached projection layers runs at roughly 1.6× the latency of the merged equivalent. Always merge, and always save the merged model separately from the adapter checkpoint.
- **`tokenizer.apply_chat_template()` prevents the #1 silent failure.** Every base model has a specific conversation template — Mistral uses `[INST]`/`[/INST]`, Llama 3 uses a different header format, Phi-3 uses yet another. Training with a mismatched template produces a model that has learned to follow instructions in the wrong format. At inference time, this manifests as a model that ignores the instruction entirely or produces garbled outputs. Always use `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` in your data formatting function.
- **r=16, alpha=32 is your starting point — only move up if loss plateaus.** Increasing rank increases training time roughly linearly and does not always improve final quality. Start at r=16 for every new fine-tuning task. If the loss curve shows a clear plateau above a target of 0.5–0.8, try r=32. If you are adapting a model to generate structured code or complex multi-step documents, r=64 may be necessary.
- **Monitor a general capability benchmark every epoch.** Task-specific training loss going to zero while MMLU drops from 60 % to 45 % is a loss of general capability that will create user-visible regressions in every use case outside your training distribution. Integrate a 5-shot MMLU or similar benchmark into your evaluation loop and set a threshold: if general capability drops more than 5 points absolute, the model fails the deployment gate.
- **QLoRA with paged Adam adds 20–30 % training time versus standard LoRA on the same VRAM.** This overhead comes from CPU–GPU optimizer-state paging. Budget for it in your training time estimates. For models that fit in VRAM without paging (e.g. 7B models on an A100 80 GB), use standard LoRA + `adamw_torch` for faster training.
- **Dataset quality beats dataset size — always, at every scale.** Across public benchmarks and internal experiments, 500 carefully curated examples with correct format, diverse instructions, and accurate responses consistently outperform 5 000 auto-scraped, lightly-filtered examples. Before scaling your dataset, audit a random 50-example sample manually. If more than 5 % of samples have format errors, truncation artefacts, or factually incorrect responses, clean the dataset before adding more data.
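The 50-example audit is easy to make reproducible. A minimal sketch, assuming your dataset is a list of dicts; the flag counting itself stays manual, and the helper names here are illustrative rather than from any library.

```python
import random

def audit_sample(dataset: list, n: int = 50, seed: int = 0) -> list:
    """Draw a reproducible random sample of examples for manual review."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(n, len(dataset)))

def fails_quality_bar(flagged: int, sampled: int, threshold: float = 0.05) -> bool:
    """Apply the rule of thumb above: more than 5 % of the sample with
    format errors, truncation artefacts, or wrong answers means the
    dataset needs cleaning before you add more data."""
    return flagged / sampled > threshold
```

A fixed seed means a second reviewer sees the same 50 examples, which makes the audit result reproducible across runs.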
📌 TLDR
- LoRA freezes the base model and trains two tiny matrices per layer (`B ∈ ℝ^(d×r)` and `A ∈ ℝ^(r×k)`) so that the effective weight update `ΔW ≈ B × A` has rank `r`. Only 0.1–0.6 % of parameters are trained; GPU memory drops by 60–70 %.
- QLoRA adds NF4 4-bit quantisation of the frozen base (from 140 GB to 35 GB for a 70B model), double quantisation, and paged optimisers. A 70B fine-tune that requires 8× A100 80 GB under full fine-tuning runs on 2× A100 80 GB under QLoRA.
- The three-library stack: PEFT injects adapters, TRL drives the training loop, bitsandbytes quantises the base.
- Always `merge_and_unload()` before serving. Unmerged adapters run at 1.6× inference latency with no quality benefit.
- Always use `tokenizer.apply_chat_template()`. Mismatched templates are the #1 silent failure mode.
- Monitor general capability (MMLU or equivalent) alongside task loss to detect catastrophic forgetting before deployment.
- Default configuration: r=16, alpha=32, all 7 attention + MLP projection modules targeted, bfloat16 compute, paged adamw for ≥ 30B models.
📝 Practice Quiz
What is the primary reason that QLoRA can fine-tune a 70B model on 2× A100 80 GB GPUs when full fine-tuning requires 8×?
- A) QLoRA uses a smaller rank than LoRA, which reduces the number of trainable parameters further
- B) QLoRA quantises the frozen base model weights to 4-bit NF4, reducing the base model's memory footprint by approximately 4×
- C) QLoRA skips the optimizer state entirely and trains only with first-order gradients
- D) QLoRA replaces Adam with SGD, which requires no optimizer state

Correct Answer: B
You train a Mistral-7B QLoRA adapter at r=16 for 5 epochs on 600 customer support examples. After training, task-specific accuracy on your eval set is 94 %, but you observe that the model now refuses to answer general knowledge questions it answered correctly before fine-tuning. Which failure mode does this describe, and which single training argument change is most likely to fix it?
- A) Rank too low — increase r to 64
- B) Catastrophic forgetting caused by excessive training — reduce num_train_epochs to 1–2 and lower learning_rate to 5e-5
- C) Chat template mismatch — apply tokenizer.apply_chat_template() during data formatting
- D) NF4 instability — switch bnb_4bit_compute_dtype to torch.float16

Correct Answer: B
Why must `B` be initialised to zeros in LoRA, and what property does this zero-initialisation guarantee at the start of training?
- A) Zero initialisation of B ensures the adapter never exceeds rank r during training, preserving the low-rank constraint
- B) Zero initialisation of B ensures the adapter output B×A equals zero at step 0, so the model starts training from the exact same output distribution as the unmodified base model
- C) Zero initialisation of B prevents gradient vanishing by ensuring the first backward pass propagates through a numerically stable path
- D) Zero initialisation of B minimises the KL divergence between the adapter distribution and the base model distribution throughout training

Correct Answer: B
Open-ended challenge: You have fine-tuned a Llama 3 8B model with QLoRA (r=16, all 7 projection modules targeted) on 800 carefully curated domain-specific examples over 3 epochs. Task-specific eval accuracy reaches 91 %. However, MMLU accuracy drops from 62 % to 48 %, and users report that the model occasionally generates responses in a completely different language than the input. Diagnose the two most likely root causes for these symptoms and describe a revised training strategy — including specific changes to hyperparameters, data composition, and evaluation frequency — that would achieve at least 88 % task accuracy while keeping MMLU within 5 percentage points of the base model baseline.
🔗 Related Posts
- RAG vs Fine-Tuning: When to Use Each — companion post: when retrieval-augmented generation is the better choice than adapter fine-tuning
- Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs — cost model and decision framework for self-hosting a fine-tuned model versus calling an external API
- AI Agents Explained: When LLMs Start Using Tools — how fine-tuned models are used as backbone planners inside tool-calling agent loops
- LLM Skill Registry: Routing and Evaluation for Production Agents — routing requests to the right fine-tuned adapter at inference time in a multi-skill agent system
- Headless Agents: Deploying Skills as an MCP Server — packaging a fine-tuned model as a callable skill behind the Model Context Protocol

Written by
Abstract Algorithms
@abstractalgorithms
