# Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive
From the math of low-rank decomposition to running QLoRA on a single A100 — everything you need to fine-tune a 70B model without a supercomputer.
TLDR: LoRA freezes the base model and trains two tiny matrices per layer — 0.1 % of parameters, 70 % less GPU memory, near-identical quality. QLoRA adds 4-bit NF4 quantization of the frozen base, enabling 70B fine-tuning on 2× A100 80 GB instead of 8×. Use HuggingFace PEFT + TRL + bitsandbytes; always call `merge_and_unload()` before serving; monitor both task loss and a general-capability eval (e.g. MMLU) to catch catastrophic forgetting before it reaches production.
## 📖 The Memory Wall That Blocked Every LLM Fine-Tune Before 2022
Before 2022, fine-tuning a large language model was an HPC problem. A 7-billion-parameter model stored in 32-bit floating point occupies 28 GB of GPU memory for the weights alone. Add the Adam optimizer and the situation explodes: Adam maintains a momentum estimate and a variance estimate for each parameter — that is four copies of the model in memory simultaneously (weights + gradients + two optimizer states), totalling roughly 112 GB just to start training. A single A100 80 GB card cannot hold this. You needed at minimum four high-end GPUs connected over NVLink, and a full fine-tune of a 70B model required eight to sixteen A100 80 GB cards — hardware costing well above $200 000 per node.
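The back-of-envelope arithmetic above is easy to reproduce. A minimal sketch (the function name and fixed byte counts are illustrative; real training needs additional memory for activations and framework overhead on top of this floor):

```python
def full_finetune_vram_gb(n_params: float,
                          bytes_weights: int = 4,  # fp32 weights
                          bytes_grads: int = 4,    # fp32 gradients
                          bytes_optim: int = 8) -> float:  # Adam momentum + variance, fp32 each
    """Rough VRAM floor for full fine-tuning with Adam: four fp32 copies of the model."""
    return n_params * (bytes_weights + bytes_grads + bytes_optim) / 1e9

print(full_finetune_vram_gb(7e9))  # 7B model → 112.0 GB, matching the figure above
```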
The consequence was severe: only well-funded labs could run behavioural fine-tuning. Everyone else was stuck prompting the base model and hoping the in-context examples were enough. Domain adaptation — training a model to speak fluent medical, legal, or customer-support language — was effectively closed to teams without supercomputers.
LoRA (Low-Rank Adaptation of Large Language Models, Hu et al. 2021) dismantled that wall. Instead of updating all 7 billion weights, LoRA freezes the original model entirely and inserts two tiny trainable matrices alongside each attention projection. The total number of trainable parameters drops to roughly 0.5 % of the original — sometimes as low as 0.1 % for small ranks. GPU memory consumption falls by 60–70 % because optimizer states exist only for the tiny adapter matrices, not the full model.
QLoRA (Dettmers et al. 2023) pushed the boundary further. It quantises the frozen base model weights to 4-bit Normal Float (NF4), cutting the memory footprint of the frozen base to roughly a quarter of fp16 while keeping adapter training in full bfloat16 precision. The result is transformative: a Llama 3 70B fine-tune that required 8× A100 80 GB under full fine-tuning drops to 2× A100 80 GB under QLoRA — and down to a single A100 80 GB for 13B-class models.
The practical impact is immediate. A team at a mid-sized fintech company fine-tuned Mistral-7B-Instruct on 1 200 customer-support transcripts using QLoRA on a single 48 GB A40 GPU over a weekend. The resulting model outperformed GPT-4o on their proprietary query types while eliminating per-call API costs. That is the world LoRA opens.
## 🔍 LoRA in Plain English: Margin Notes on a Frozen Textbook
Think of the pre-trained base model as a textbook that already contains enormous knowledge about language, reasoning, and the world. Full fine-tuning is like reprinting the entire textbook with a few changed chapters — enormously expensive and wasteful since most pages need no change at all.
LoRA does something more elegant. It keeps the original textbook exactly as it is and writes margin notes — small, targeted annotations that modify how specific passages are interpreted. The notes are lightweight, swap in and out easily, and capture exactly the behavioural change you want without touching a single original page.
In practical terms, every attention weight matrix W (shaped d × k) in the transformer encodes a learned "skill" — how to compute query vectors, key vectors, or value projections. When you fine-tune a model to behave differently on a downstream task, you are not rewriting every skill from scratch. Research shows the effective change matrix ΔW — the difference between the fully fine-tuned weights and the original weights — has a surprisingly low intrinsic rank. That is, even though ΔW is a d × k matrix, almost all of its information can be captured by a much smaller object.
LoRA exploits this low-rank structure directly. Rather than storing ΔW as a full d × k matrix (which would be as expensive as the original), it approximates ΔW as the product of two much smaller matrices:
- Matrix A — shape `r × k`, the down-projection, where `r` is the rank, typically 4–64
- Matrix B — shape `d × r`, the up-projection
The product B × A produces a rank-r approximation of ΔW. Only A and B are trained; W is completely frozen. The parameter saving is dramatic: instead of training d × k parameters, you train d×r + r×k parameters. For a standard attention projection where d = k = 4096 and r = 16, this reduces from 16.7 million parameters to just 131 072 — a 128× reduction for that single layer.
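The arithmetic is worth checking once by hand. A quick sketch (the function name is invented; dimension names follow the text):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full ΔW (d*k) vs the LoRA factors B (d x r) plus A (r x k)."""
    return d * k, d * r + r * k

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full // lora)  # 16777216 131072 128
```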
## ⚙️ How LoRA Adds a Parallel Path to Every Attention Layer
LoRA does not modify the frozen weight matrix. Instead, it runs a lightweight parallel branch alongside the frozen forward pass. For every attention projection the forward computation becomes:
y = W_frozen · x + (B · A) · x
where W_frozen is completely frozen and B · A is the trainable low-rank adapter. At initialisation, B is set to all zeros and A is drawn from a small Gaussian. This means B · A = 0 at step zero, so the model output is identical to the base model at the start of training. Training can then proceed stably from a known-good baseline rather than from a random initialisation.
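The zero-initialisation property is easy to see in a toy implementation. This is a pure-Python illustration under the text's conventions (the class name and sizes are made up; real code uses torch tensors via the PEFT library):

```python
import random

class LoRALinear:
    """Minimal sketch of a LoRA-augmented linear layer: y = W·x + (alpha/r)·B·A·x."""

    def __init__(self, d: int, k: int, r: int, alpha: float):
        rng = random.Random(0)
        # Frozen base weight W (d x k) — never updated during fine-tuning.
        self.W = [[rng.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]
        # A (r x k): small Gaussian init; B (d x r): zeros, so B·A = 0 at step zero.
        self.A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]
        self.scale = alpha / r  # the lora_alpha / r scaling factor

    def forward(self, x: list[float]) -> list[float]:
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]       # down-project k -> r
        adapter = [self.scale * sum(b * ai for b, ai in zip(row, ax))       # up-project r -> d
                   for row in self.B]
        return [b + a for b, a in zip(base, adapter)]
```

At initialisation the adapter path contributes exactly zero, so the layer's output is identical to the frozen base — precisely the "known-good baseline" property described above.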
The diagram below shows how a single transformer attention layer is modified by LoRA. The frozen W path and the trainable B → A path run in parallel and their outputs are added together before being passed to the next component.
```mermaid
graph TD
    Input[Input activation x] --> FrozenW[Frozen weight W - shape d x k]
    Input --> AdapterA[LoRA down-projection A - shape r x k]
    AdapterA --> AdapterB[LoRA up-projection B - shape d x r]
    FrozenW --> Adder[Add frozen and adapter outputs]
    AdapterB --> Adder
    Adder --> Output[Layer output y]
```
The adapter branch compresses the input from `k` dimensions down to `r` dimensions via `A` (the down-projection) and then expands back to `d` dimensions via `B` (the up-projection). This bottleneck structure is what enforces the low-rank constraint: no matter what values `A` and `B` learn, their product `B × A` can represent at most `r` linearly independent directions in the output space. The rank `r` is therefore the key capacity hyperparameter — it controls how much behavioural change the adapter can encode.
LoRA adapters are typically attached to the query and value projections (`q_proj`, `v_proj`) inside every attention block, though the best practice for stronger adaptation is to also target the key projection (`k_proj`), output projection (`o_proj`), and the three MLP projections (`gate_proj`, `up_proj`, `down_proj`). Targeting all seven projection types roughly doubles the parameter count but produces noticeably better adaptation on complex tasks.
---
## 🧠 Deep Dive: The Math of Low-Rank Decomposition, NF4 Quantization, and QLoRA's Architecture
### Mathematical Model: The Low-Rank Decomposition Formula and Why Fine-Tuning Deltas Are Naturally Sparse
The formal statement of LoRA is concise. Given a frozen pre-trained weight matrix `W_0 ∈ ℝ^(d×k)`, the modified forward pass is:
h = W_0 · x + (B · A) · x = (W_0 + B · A) · x
where `B ∈ ℝ^(d×r)`, `A ∈ ℝ^(r×k)`, rank `r << min(d, k)`.
The key insight from the original LoRA paper is empirical: when full fine-tuning changes a pre-trained model's weights (i.e., `ΔW = W_fine_tuned - W_0`), the resulting change matrix has a very low effective rank — the singular values of `ΔW` fall off quickly after the first few dominant components. This means approximating `ΔW ≈ B · A` at rank 16 or 32 captures the overwhelming majority of the useful fine-tuning signal while discarding noise.
The adapter output is scaled by a factor of `lora_alpha / r` before being added to the frozen output. If `lora_alpha = 32` and `r = 16`, the effective scale is `2.0`. This means the adapter contribution is amplified by 2× relative to the frozen base. The alpha parameter functions as a learning rate multiplier for the adapter branch: increasing `alpha` relative to `r` makes the adapter more aggressively override the base model behaviour. The default ratio `alpha = 2 × r` is a well-tested starting point — it keeps adapter influence significant without overwhelming the frozen base early in training.
### LoRA and QLoRA Internals: NF4 Quantization, Double Quant, and the 4-Bit Architecture
QLoRA introduces three innovations on top of LoRA to squeeze the frozen base into 4 bits without significant quality loss.
**Normal Float 4 (NF4):** Standard 4-bit integer quantisation distributes its 16 representable values uniformly across a numerical range. This is inefficient for neural network weights, which are empirically normally distributed around zero. NF4 instead places its 16 quantisation levels at the quantiles of the standard normal distribution — more levels near zero where most weights cluster, fewer at the extremes. For normally distributed values this is information-theoretically optimal and empirically outperforms standard INT4 or FP4.
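The quantile idea can be sketched with the standard library alone. The codebook below is a simplified illustration of the concept — the real NF4 codebook from the QLoRA paper is asymmetric and includes an exact zero, and all function names here are invented:

```python
from statistics import NormalDist

def nf4_like_levels(bits: int = 4) -> list[float]:
    """Place 2**bits quantisation levels at evenly spaced quantiles of N(0,1),
    normalised into [-1, 1]: more levels near zero, fewer at the extremes."""
    n = 2 ** bits
    nd = NormalDist()
    quantiles = [(i + 0.5) / n for i in range(n)]
    levels = [nd.inv_cdf(q) for q in quantiles]
    m = max(abs(l) for l in levels)
    return [l / m for l in levels]

def quantise(block: list[float], levels: list[float]) -> list[float]:
    """Absmax-scale a block of weights, then snap each value to the nearest level."""
    scale = max(abs(v) for v in block) or 1.0
    return [min(levels, key=lambda l: abs(l - v / scale)) * scale for v in block]
```

Because the levels are quantiles of a normal distribution, the spacing between adjacent levels is tightest around zero — exactly where real LLM weights cluster.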
**Double quantisation:** The quantisation constants themselves (one per block of weights) consume memory. QLoRA quantises these constants a second time using FP8, saving approximately 0.37 bits per parameter on average — roughly 3 GB on a 65B-parameter model. Small, but meaningful at scale.
**Paged optimisers:** When GPU memory is exhausted during a backward pass, CUDA would normally crash with an out-of-memory error. QLoRA uses NVIDIA's unified memory mechanism to automatically page optimizer states to CPU RAM and back as needed, eliminating these crashes at the cost of some training-step throughput.
The net result: a Llama 3 70B model that requires 140 GB in fp16 fits in roughly 35 GB in NF4 — close to a 4× compression ratio, with the quantisation constants adding only a small overhead on top. Adapter training proceeds in bfloat16, so all gradient computations are numerically stable despite the compressed frozen base.
### Performance Analysis: Rank Selection, Parameter Counts, and Training Cost Trade-offs
To understand what rank means in practice, consider a model with hidden dimension `d = 4096` and a single attention projection of shape `4096 × 4096`. A LoRA adapter at rank `r` adds `2 × r × 4096` trainable parameters — one matrix `A` of shape `4096 × r` and one matrix `B` of shape `r × 4096`. For `r = 16` this is 131 072 parameters, for `r = 64` it is 524 288.
A Llama 3 8B model has 32 transformer layers, each with 4 attention projections and 3 MLP projections. Treating every projection as `4096 × 4096` for a back-of-envelope estimate (real checkpoints use grouped-query attention and wider MLP layers, so exact counts differ), targeting all 7 projections at `r = 16` gives:

32 layers × 7 projections × 2 × 16 × 4096 = 29,360,128 trainable parameters

Out of 8 billion total parameters, this is roughly 0.37 %. The PEFT library reports this figure when you call `model.print_trainable_parameters()` before training.
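For a real checkpoint the projection shapes are not all square. Using Mistral-7B's published dimensions (hidden size 4096, grouped-query KV dimension 1024, MLP intermediate size 14336, 32 layers), the exact count can be reproduced — it matches the 41,943,040 figure that PEFT prints for the configuration used in the training script later in this article (the helper name and shape table below are my own):

```python
def lora_params(shapes: dict[str, tuple[int, int]], r: int, n_layers: int) -> int:
    """Sum r*(d_in + d_out) over every targeted projection in every layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

# (in_features, out_features) for each targeted projection in Mistral-7B
mistral_shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}
print(lora_params(mistral_shapes, r=16, n_layers=32))  # 41943040
```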
The practical guidance for rank selection:
- `r = 4` — use for style or tone changes on smaller models (≤ 7B); very fast training, minimal expressiveness
- `r = 16` — the universal default; covers instruction-following, domain adaptation, format changes
- `r = 32` or `r = 64` — use when training on complex multi-step reasoning tasks or when `r = 16` shows a loss plateau well above baseline
- Increasing `r` beyond 64 rarely helps and approaches the cost of full fine-tuning
---
## 📊 The QLoRA Training Pipeline From Data to Served Model
The complete QLoRA workflow involves six distinct stages, each with specific tooling and failure points. Understanding the pipeline as a whole — rather than just the training step — is the difference between a working deployment and a model that behaves perfectly in the notebook but bizarrely in production.
The diagram below shows every stage, from the raw base model to a production-ready merged model served behind vLLM. Pay particular attention to the merge step: LoRA adapters loaded at inference time without merging incur a 2× compute overhead in the adapter projection, and the adapter checkpoint is architecturally fragile (it requires the exact same PEFT version and base model version to load).
```mermaid
graph TD
    BaseModel[Base model in fp16] --> Quantise[NF4 4-bit quantisation via bitsandbytes]
    Quantise --> FrozenBase[Frozen 4-bit base in memory]
    FrozenBase --> PrepKbit[prepare_model_for_kbit_training - gradient checkpointing]
    PrepKbit --> AttachAdapters[Attach LoRA adapters in bfloat16 via PEFT]
    TrainingData[Formatted training data] --> SFTLoop[SFTTrainer supervised fine-tuning loop]
    AttachAdapters --> SFTLoop
    SFTLoop --> SaveAdapter[Save LoRA adapter checkpoint]
    SaveAdapter --> MergeStep[Load base in fp16 and call merge_and_unload]
    MergeStep --> MergedModel[Merged fp16 model - no adapter dependency]
    MergedModel --> ServingLayer[vLLM or TGI serving endpoint]
```
Each arrow in the diagram hides a potential failure mode. The quantisation step must use `bnb_4bit_quant_type="nf4"` and `bnb_4bit_compute_dtype=torch.bfloat16` — using float16 as the compute dtype can produce NaN gradients on A100 and H100 GPUs. The merge step loads the base model in fp16 (not 4-bit) because merging requires full-precision arithmetic to fuse `W + B×A` accurately. Skipping the merge and serving with adapter weights attached doubles inference latency for no accuracy benefit.
The performance comparison table below shows how different configurations trade GPU requirements against training time and quality relative to a full fine-tune baseline:
| Configuration | Model | GPU Required | Train Time per 1 K examples | Quality vs Full FT |
| --- | --- | --- | --- | --- |
| Full fine-tune | Llama 3 8B | 4× A100 40 GB | ~2 hours | Baseline |
| LoRA (r=16) | Llama 3 8B | 1× A100 40 GB | ~45 min | -2 to -4 % |
| QLoRA (r=16) | Llama 3 8B | 1× RTX 4090 24 GB | ~90 min | -3 to -6 % |
| LoRA (r=16) | Llama 3 70B | 4× A100 80 GB | ~8 hours | -2 to -4 % |
| QLoRA (r=16) | Llama 3 70B | 2× A100 80 GB | ~14 hours | -3 to -6 % |
Quality deltas are measured as relative difference on task-specific benchmarks and vary significantly with dataset size and quality — 2 000 well-curated examples consistently outperform 20 000 auto-scraped examples.
## 🧪 Complete Working Example: Fine-Tuning Mistral-7B with QLoRA on a Custom Support Dataset
This section walks through a complete, runnable QLoRA fine-tune from data preparation to inference with the merged model. The scenario is a Mistral-7B-Instruct model adapted to summarise customer complaint tickets in a specific corporate format — a realistic domain-adaptation task that highlights every config decision you will encounter in practice.
The code is split into two scripts that mirror the two operational stages: training (including adapter saving) and merging + inference. Both scripts are annotated inline with the reasoning behind each hyperparameter choice.
### Training Script: `qlora_train.py`
```python
# qlora_train.py — fine-tune Mistral-7B-Instruct with QLoRA on custom data
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# --- 1. Training data: instruction-response pairs ---
# In production, load from JSONL or a Hugging Face dataset.
# Minimum viable size: ~200 examples. Recommended: 500–2000.
TRAINING_EXAMPLES = [
    {
        "instruction": "Summarize the following customer complaint in one sentence.",
        "input": "I've been waiting 3 weeks for my order and no one is responding to my emails.",
        "response": "Customer has been waiting 3 weeks for an undelivered order with no email response from support.",
    },
    # ... add 500-2000 more examples for real use cases
]

def format_for_mistral(ex: dict) -> str:
    """Apply Mistral's instruction template.

    Critically: use the model's own chat template, not a custom one.
    Mismatched templates are the #1 cause of fine-tuned models ignoring instructions.
    For other base models, use: tokenizer.apply_chat_template(messages, tokenize=False)
    """
    user_msg = ex["instruction"]
    if ex.get("input"):
        user_msg += f"\n\n{ex['input']}"
    return f"<s>[INST] {user_msg} [/INST] {ex['response']} </s>"

# --- 2. QLoRA quantization config ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4 — optimal for normally distributed LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 avoids NaN on A100/H100; do NOT use float16
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants (~0.37 bits/param saved)
)

# --- 3. Load model + tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token
tokenizer.padding_side = "right"           # right-padding avoids position embedding shifts

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # fills GPU first, spills to CPU if needed
    trust_remote_code=True,
)

# Enables gradient checkpointing for the 4-bit base — required before get_peft_model
model = prepare_model_for_kbit_training(model)

# --- 4. LoRA adapter config ---
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — 16 balances adapter capacity vs parameter count
    lora_alpha=32,      # effective scale = alpha / r = 2.0; amplifies adapter contribution
    lora_dropout=0.05,  # light regularisation; 0.0 is acceptable for larger datasets
    bias="none",        # don't train biases — minimal expressiveness gain for significant overhead
    target_modules=[    # target all attention + MLP projections for maximal adaptation
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output: trainable params: 41,943,040 || all params: 7,283,965,952 || trainable%: 0.5759

# --- 5. Dataset ---
dataset = Dataset.from_list([
    {"text": format_for_mistral(ex)} for ex in TRAINING_EXAMPLES
])

# --- 6. Training ---
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,  # set to your median sequence length; longer = more GPU memory
    args=TrainingArguments(
        output_dir="./mistral-7b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch size = 2 × 8 = 16
        warmup_ratio=0.05,              # warm up over first 5 % of steps; prevents early LR spikes
        learning_rate=2e-4,             # standard QLoRA LR; lower to 1e-4 if loss is unstable
        fp16=False,
        bf16=True,                      # must match bnb_4bit_compute_dtype
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="no",       # add eval_dataset + strategy="steps" for early stopping
        report_to="none",               # swap to "wandb" for production monitoring
    ),
)
trainer.train()

# --- 7. Save adapter only (not the full merged model) ---
model.save_pretrained("./mistral-7b-qlora-adapter")
tokenizer.save_pretrained("./mistral-7b-qlora-adapter")
print("Adapter saved. Run qlora_merge_and_infer.py to merge before serving.")
```
### Merge and Inference Script: `qlora_merge_and_infer.py`
```python
# qlora_merge_and_infer.py — merge adapter into base model weights, then run inference.
# This script runs AFTER training is complete.
# Always merge before deploying to production — adapter-only serving is 2x slower.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
ADAPTER_PATH = "./mistral-7b-qlora-adapter"
MERGED_PATH = "./mistral-7b-merged"

def merge_adapter():
    """Fuse the LoRA adapter matrices (B × A) into the frozen base weights (W).

    Why: after merge_and_unload(), W_new = W_frozen + B × A for each layer.
    The adapter is removed, eliminating the extra projection overhead at inference.

    Load the base in fp16 for merging — NOT 4-bit. Merging requires
    full-precision matrix addition; NF4 cannot represent the result accurately.
    """
    print("Loading base model in fp16 for merging...")
    tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16,  # full precision needed for merge arithmetic
        device_map="auto",
    )
    print("Attaching LoRA adapter...")
    model = PeftModel.from_pretrained(base, ADAPTER_PATH)
    print("Merging and unloading...")
    merged = model.merge_and_unload()  # fuses B×A into W, removes PEFT wrapper
    merged.save_pretrained(MERGED_PATH)
    tokenizer.save_pretrained(MERGED_PATH)
    print(f"Merged model saved to {MERGED_PATH}")

def infer(prompt: str) -> str:
    """Run inference on the merged model.

    The merged model is a standard HuggingFace CausalLM — no PEFT dependency.
    In production, replace this with vLLM for batched serving with 3–5x higher throughput.
    """
    tokenizer = AutoTokenizer.from_pretrained(MERGED_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MERGED_PATH,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    inputs = tokenizer(
        f"<s>[INST] {prompt} [/INST]",
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.1,  # low temperature for near-deterministic summarisation
            do_sample=True,
        )
    # Slice off the input tokens; decode only the generated response
    response_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(response_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    merge_adapter()
    test_prompt = (
        "Summarize the following customer complaint in one sentence.\n\n"
        "I ordered a laptop stand three weeks ago. It never arrived, tracking hasn't updated "
        "in two weeks, and your customer support keeps closing my tickets without responding."
    )
    print("\nModel response:")
    print(infer(test_prompt))
```
The training script's most important configuration choices are the `bnb_4bit_compute_dtype` (always bfloat16, not float16), the `target_modules` list (all seven projection types for full adaptation), and `gradient_accumulation_steps` (multiplied by the batch size to give an effective batch of 16–32). The merge script's most important detail is loading the base model in fp16 — not 4-bit — because the merge arithmetic requires numerical precision that NF4 cannot provide.
## 📈 Evaluating Your Fine-Tune: Loss Curves, Perplexity, and the MMLU Canary
Training loss alone is a dangerously incomplete signal. A model can reach near-zero training loss on 800 examples while simultaneously catastrophically forgetting how to perform multi-step reasoning on topics outside the training set. Production-safe evaluation requires three distinct signals monitored together.
Training loss tells you whether the model is learning the target task format. A healthy loss curve for a QLoRA run drops sharply in the first 10–20 % of steps and then decelerates into a gentle plateau around 0.3–0.8 for instruction-following tasks. If the curve never leaves the starting value (typically 2.0–4.0 for causal LMs), check your chat template — the most common cause of a flat loss curve is misformatted training data that the model cannot fit even with high learning rates. If the curve drops all the way to near 0.0 on a small dataset, you are overfitting — add `early_stopping_patience` or reduce `num_train_epochs`.

Perplexity on a held-out split of your training distribution is the in-domain quality signal. Compute it with `trainer.evaluate()` after adding an `eval_dataset` to your `SFTTrainer` config. Perplexity decreasing on the eval split while training loss is still high indicates that the model is generalising rather than memorising.
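Concretely, perplexity is just the exponential of the mean per-token cross-entropy that `trainer.evaluate()` returns under the `eval_loss` key (a one-line helper; the function name is my own):

```python
import math

def perplexity(mean_eval_loss: float) -> float:
    # Causal-LM eval loss is mean per-token cross-entropy in nats,
    # so perplexity = exp(loss).
    return math.exp(mean_eval_loss)

print(round(perplexity(1.2), 2))  # an eval_loss of 1.2 → perplexity 3.32
```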
A general-capability canary — such as MMLU (5-shot), HellaSwag, or a fixed internal benchmark — detects catastrophic forgetting before it reaches users. Run this eval every epoch or every 500 steps. Seeing your domain task score rise while MMLU drops from 62 % to 48 % is a clear signal that your learning rate is too high or your dataset lacks diversity. The fix is typically a lower learning rate (5e-5 instead of 2e-4), adding system-prompt diversity to the training examples, or using DPO-style preference optimisation instead of pure SFT for the second stage.
For quick MMLU evaluation during a QLoRA run, the lm-evaluation-harness library from EleutherAI runs against any HuggingFace-compatible checkpoint with a single CLI command and produces per-task accuracy scores that can be logged to W&B or MLflow alongside your training metrics.
## 🏗️ Advanced Deployment Patterns: Adapter Composition, Continued Pre-Training, and Multi-Tenant Serving
Once you have mastered the basic QLoRA fine-tune loop, three advanced patterns become relevant as you move from single-model prototypes to production serving systems.
### Stacking Multiple LoRA Adapters with LoraHub
LoraHub (Huang et al. 2023) demonstrates that LoRA adapters trained on different tasks can be combined by weighted interpolation. If you have an adapter fine-tuned on code generation and a separate adapter fine-tuned on structured data extraction, LoraHub searches for a coefficient vector [w1, w2] such that ΔW = w1 × ΔW_code + w2 × ΔW_extraction maximises performance on a new task — without any gradient-based training, using only a handful of in-context examples for calibration. This is valuable when you want to compose specialised adapters rather than training a single adapter on a mixture dataset.
PEFT supports this through add_weighted_adapter(), which merges a list of adapter checkpoints using either linear combination or SVD-based composition. Linear combination is faster; SVD-based produces lower approximation error when the adapters' weight spaces differ significantly.
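A linear combination is exactly what it sounds like. A minimal sketch on plain nested lists (illustrative only — PEFT's `add_weighted_adapter()` performs the equivalent on real adapter tensors, and the function name here is invented):

```python
def combine_deltas(deltas: list[list[list[float]]],
                   weights: list[float]) -> list[list[float]]:
    """LoraHub-style linear composition: ΔW = Σ w_i · ΔW_i, elementwise."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [
        [sum(w * d[i][j] for w, d in zip(weights, deltas)) for j in range(cols)]
        for i in range(rows)
    ]

code_delta = [[1.0, 0.0], [0.0, 1.0]]      # toy ΔW from a "code" adapter
extract_delta = [[0.0, 2.0], [2.0, 0.0]]   # toy ΔW from an "extraction" adapter
print(combine_deltas([code_delta, extract_delta], [0.5, 0.5]))
# [[0.5, 1.0], [1.0, 0.5]]
```

LoraHub's contribution is the search over the weight vector itself, calibrated on a handful of in-context examples rather than by gradient descent.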
### Continued Pre-Training Before Instruction Fine-Tuning
For domains with highly specialised vocabulary — genomics, patent law, semiconductor design — a two-stage approach outperforms direct instruction fine-tuning. Stage one uses LoRA (not QLoRA) with a large rank (r=128) to run continued pre-training on domain documents with a standard next-token prediction objective. This step updates the model's vocabulary distribution without changing instruction-following behaviour. Stage two then runs a standard QLoRA instruction fine-tune on a smaller supervised dataset. The stage-one checkpoint absorbs domain vocabulary cheaply; the stage-two checkpoint aligns on the task format. The Microsoft Phi-3 and Yi series models both follow this two-stage curriculum at the base model level.
### Multi-Tenant Adapter Hot-Swapping with vLLM
vLLM's LoRARequest API loads and unloads LoRA adapters per-request against a shared base model instance. This means a single GPU cluster hosting Llama 3 8B can simultaneously serve hundreds of customer-specific fine-tuned models by swapping adapters in the KV-cache-aware serving pipeline. Three requirements must hold for this to work correctly: all adapters must use the same rank (r), all adapters must target the same set of modules, and all adapters must be trained against the same base model checkpoint (not different quantisation configurations). Any deviation causes shape mismatches that crash the serving process.
The operational pattern is: train all customer adapters centrally with a shared LoraConfig template, store them in object storage (S3 or GCS), and reference them by ID in the LoRARequest at inference time. The base model stays loaded in GPU memory across all requests; only the much-smaller adapter tensors are swapped.
## 🌍 Who Is Running LoRA in Production and What They Have Learned
LoRA and QLoRA have moved from research papers into the core infrastructure of companies building LLM products. The patterns below reflect public case studies, blog posts, and open-source releases from organisations operating at scale.
Mistral AI and the vertical AI product wave. The Mistral 7B base model was explicitly designed to be fine-tunable on consumer hardware, and the company's release strategy — base + instruct + API — implicitly assumes that customers will run LoRA adapters on top. Legal AI startups (Harvey, Casetext), medical AI companies (Nabla, Suki), and customer-service platforms (Freshdesk, Zendesk AI) all use LoRA-adapted Mistral variants to deliver domain accuracy that a generic instruction model cannot match.
HuggingFace Zephyr: DPO on top of SFT on top of LoRA. The Zephyr-7B-beta model is a public demonstration of the full fine-tuning pipeline. The team ran SFT with LoRA adapters on Mistral-7B using a synthetic instruction dataset, then ran DPO (Direct Preference Optimisation) on the resulting checkpoint using human-preference pairs. The final model outperformed Llama 2 70B on the MT-Bench chat benchmark using a model one-tenth the size and a fraction of the compute. The DPO stage used DPOTrainer from the TRL library — the same library used in the training example above.
Anyscale: per-tenant adapter hot-swapping. Anyscale's managed fine-tuning product allows enterprise customers to maintain their own LoRA adapters and have them loaded at inference time using vLLM's dynamic adapter loading feature. Each tenant's adapter is stored in object storage and loaded into a shared base model instance on demand — a multiplexed serving architecture that makes per-customer model personalisation economically viable. The critical requirement for this pattern is that all adapters target the same rank and the same set of modules; otherwise the base model cannot serve them interchangeably.
Nous Research Hermes: instruction-tuning on synthetic data. The Hermes series (Hermes 2, Hermes 3) demonstrates that dataset quality dominates dataset size. Nous Research used synthetically generated instruction data from Claude and GPT-4 to create highly diverse training examples, then ran LoRA fine-tuning on Llama base models. Hermes 2 Pro Llama 3 8B consistently scores above identically-parameterised models trained on larger but lower-quality datasets — a real-world validation of the "500 curated examples beats 5 000 scraped ones" principle.
## ⚖️ Seven Ways QLoRA Fine-Tunes Go Wrong — and How to Fix Each One
| Failure Mode | Root Cause | Symptom | Fix |
| --- | --- | --- | --- |
| Catastrophic forgetting | Learning rate too high; training data lacks diversity | MMLU or general benchmarks drop sharply after epoch 1 | Lower LR to 5e-5; add diverse system prompts; cap at 1–2 epochs |
| Rank too low for task complexity | r=4 used for multi-step reasoning or code generation | Loss plateaus well above 0.5; outputs are grammatically correct but logically wrong | Increase r to 32 or 64; re-run with all 7 target modules |
| Overfitting on small dataset | < 200 examples; 5+ epochs | Train loss → 0; eval perplexity rises; model repeats training phrases verbatim | Add more examples, reduce epochs to 2, or switch to LoRA + DPO preference training |
| Wrong target modules | Only q_proj targeted | Model adapts tone but not reasoning; complex format instructions ignored | Add v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj |
| NF4 compute instability | bnb_4bit_compute_dtype=torch.float16 on Ampere/Hopper GPU | NaN loss after first 10–20 steps | Change to torch.bfloat16; re-run |
| Adapter not merged before serving | merge_and_unload() skipped; PEFT wrapper left attached | 2× inference latency; serving crashes on model-version mismatch after upgrades | Always merge after training; save the merged model independently of the adapter |
| Chat template mismatch | tokenizer.apply_chat_template() not used; custom template applied | Model ignores instruction format; outputs raw completions without following the [INST] template | Use tokenizer.apply_chat_template(messages, tokenize=False) in your format function |
These seven failure modes cover the overwhelming majority of issues reported in fine-tuning forums, GitHub issues on PEFT and TRL, and internal post-mortems from teams running QLoRA in production. The two most damaging in production — catastrophic forgetting and adapter-not-merged latency — both have simple preventions: a canary eval and a merge step in the deployment pipeline.
🧭 Choosing Your Fine-Tuning Strategy: LoRA vs QLoRA vs Full Fine-Tune vs DPO
| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Maximum quality, compute available (≥ 4× A100) | Full fine-tune | No rank approximation error; optimal for high-stakes applications |
| 1 GPU, behavioural change, ≤ 7B model | LoRA (r=16, all projections) | Sufficient VRAM in fp16; adapter overhead is minimal |
| 1 GPU, 7B–13B model, ≤ 24 GB VRAM | QLoRA (r=16) | NF4 quantisation makes it fit; 90-min training on RTX 4090 |
| 2× A100, 70B model fine-tune | QLoRA (r=16–32) | Only viable single-node option; full fine-tune requires 8× |
| Have chosen/rejected completion pairs | DPO with LoRA adapters | DPOTrainer + LoRA targets alignment efficiently without RLHF infrastructure |
| Style or tone change only | LoRA r=4 on q_proj and v_proj | Minimal adapter; fast training; negligible quality drop for simple changes |
| Multi-tenant, per-customer adapters | LoRA (consistent rank + modules) | vLLM adapter hot-swap requires uniform adapter shapes across all tenants |
| Limited data (< 200 examples) | Prompt engineering first | Fine-tuning below 200 examples almost always overfits; few-shot prompting often matches or exceeds quality |
The decision between LoRA and QLoRA is almost entirely a VRAM question. If the base model in fp16 fits in your available GPU memory with room for adapter gradients and optimizer states, use standard LoRA — it trains slightly faster than QLoRA and has no risk of NF4 compute-dtype mismatch. If VRAM is the bottleneck, QLoRA is the correct choice with essentially no workflow changes beyond the BitsAndBytesConfig block.
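The VRAM question reduces to a back-of-envelope estimate. The sketch below uses the approximations from this post (four fp32 copies for a full Adam fine-tune, 2 bytes per parameter for an fp16 frozen base, roughly 0.5 bytes per parameter under NF4) and ignores activations, CUDA context, and adapter overhead, so treat it as a lower bound rather than a profiler.

```python
def training_vram_gb(n_params_b: float, mode: str) -> float:
    """Rough training-VRAM estimate in GB for a model with n_params_b
    billion parameters. Constants follow the approximations in this post."""
    n = n_params_b * 1e9
    if mode == "full":   # fp32 weights + gradients + 2 Adam states = 4 copies
        return 4 * 4 * n / 1e9
    if mode == "lora":   # fp16 frozen base; adapter overhead is negligible
        return 2 * n / 1e9
    if mode == "qlora":  # NF4 base at ~0.5 bytes per parameter
        return 0.5 * n / 1e9
    raise ValueError(f"unknown mode: {mode}")
```

Plugging in the numbers from this post: a 7B full fine-tune lands at ~112 GB, a 7B fp16 base at ~14 GB, and a 70B NF4 base at ~35 GB, matching the figures quoted earlier.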
🛠️ HuggingFace PEFT, TRL, and bitsandbytes: The Three Libraries Powering the Stack
The QLoRA fine-tuning workflow depends on three HuggingFace ecosystem libraries that each solve a distinct piece of the problem. Understanding what each library does — and what it does not do — prevents the most common misconfiguration errors.
PEFT (Parameter-Efficient Fine-Tuning) is the adapter injection layer. Its `get_peft_model()` function wraps any HuggingFace `PreTrainedModel` with the adapter architecture specified in a `LoraConfig`. PEFT handles the `target_modules` injection (finding and wrapping the right `nn.Linear` layers), the zero-initialisation of B matrices, the `print_trainable_parameters()` reporting, and the `merge_and_unload()` fusion. It supports LoRA, QLoRA, prefix tuning, IA3, and adaption prompts through the same unified API. PEFT does not handle training loops, dataset formatting, or quantisation — those are the responsibilities of TRL and bitsandbytes respectively.
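The arithmetic PEFT manages can be illustrated in a few lines of dependency-free Python. This is a toy sketch of the LoRA math, not PEFT's implementation: `B` starts at zero so the adapter is a no-op at step 0, and merging folds `(alpha/r)·B·A` into the frozen weight, which is exactly what `merge_and_unload()` computes.

```python
# Toy sketch of the LoRA update; tiny dimensions keep the math readable.

def matmul(M, N):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def add_scaled(M, N, scale):
    return [[M[i][j] + scale * N[i][j] for j in range(len(M[0]))]
            for i in range(len(M))]

d, k, r, alpha = 4, 4, 2, 4
W = [[float(i == j) for j in range(k)] for i in range(d)]  # frozen base weight
A = [[0.1] * k for _ in range(r)]                          # A: small nonzero init
B = [[0.0] * r for _ in range(d)]                          # B: zero init

delta_w = matmul(B, A)          # ΔW = B·A is exactly zero before training,
assert all(v == 0.0 for row in delta_w for v in row)  # so output == base model

# After training has updated B, merging folds the adapter into the base:
B = [[0.5] * r for _ in range(d)]
W_merged = add_scaled(W, matmul(B, A), alpha / r)  # W + (alpha/r)·B·A
```

The zero-initialised `B` is why a LoRA run starts from the base model's exact output distribution, and the final line is why a merged model has no extra inference cost.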
TRL (Transformer Reinforcement Learning) provides `SFTTrainer` for supervised fine-tuning and `DPOTrainer` for direct preference optimisation. Both extend the HuggingFace `Trainer` with LLM-specific utilities: automatic dataset packing (fitting multiple short sequences into one context window to maximise GPU utilisation), chat-template application, and `ConstantLengthDataset` handling. For teams that have human feedback data in the form of chosen/rejected response pairs, `DPOTrainer` can run DPO on top of an SFT checkpoint — or directly on a LoRA-adapted base — to align the model with human preferences without the complexity of PPO:
```python
from trl import DPOTrainer, DPOConfig

# Minimal DPO setup on top of a LoRA-adapted model.
# dpo_dataset must contain "prompt", "chosen", and "rejected" fields.
dpo_trainer = DPOTrainer(
    model=model,        # PEFT model from get_peft_model()
    ref_model=None,     # None = use implicit reference from LoRA frozen base
    args=DPOConfig(
        output_dir="./dpo-output",
        num_train_epochs=1,  # DPO converges faster than SFT; 1 epoch is often enough
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-5,  # lower LR than SFT; DPO is sensitive to overshooting
        bf16=True,
        beta=0.1,            # DPO temperature — controls how strongly to penalise rejected
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,     # recent TRL versions rename this to processing_class
)
dpo_trainer.train()
```
bitsandbytes is the GPU quantisation backend. It provides the BitsAndBytesConfig class consumed by AutoModelForCausalLM.from_pretrained() and implements the actual NF4 and INT8 quantisation kernels via CUDA. Without bitsandbytes, load_in_4bit=True has no effect. The library also provides the paged Adam optimiser via optim="paged_adamw_32bit" in TrainingArguments, which automatically manages CPU offload of optimizer states to prevent out-of-memory crashes during the backward pass. Adding this argument to the training script above is recommended for any QLoRA run on a model above 13B parameters.
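Putting the three pieces together at load time looks roughly like this. It is a configuration sketch, not a full training script: the model id is an illustrative placeholder, and the four `bnb_4bit_*` settings mirror the recommendations earlier in this post (NF4, double quantisation, bfloat16 compute).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 base quantisation with double quantisation and bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # float16 here risks NaN loss
)

# Placeholder model id — substitute your base checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```

From here, PEFT's `get_peft_model()` attaches the adapters and TRL's trainer drives the loop; for models above ~13B, pair this with `optim="paged_adamw_32bit"` as noted above.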
For a full guide on when to choose fine-tuning over retrieval-augmented generation, see RAG vs Fine-Tuning: When to Use Each.
📚 Lessons Learned from Running QLoRA Fine-Tunes in Production
- **`merge_and_unload()` is a deployment gate, not an optimisation.** Adapters left attached at inference time add a full extra matrix multiplication to every forward pass. A 7B model with 7 adapter-attached projection layers runs at roughly 1.6× the latency of the merged equivalent. Always merge, and always save the merged model separately from the adapter checkpoint.
- **`tokenizer.apply_chat_template()` prevents the #1 silent failure.** Every base model has a specific conversation template — Mistral uses `[INST]`/`[/INST]`, Llama 3 uses a different header format, Phi-3 uses yet another. Training with a mismatched template produces a model that has learned to follow instructions in the wrong format. At inference time, this manifests as a model that ignores the instruction entirely or produces garbled outputs. Always use `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` in your data formatting function.
- **r=16, alpha=32 is your starting point — only move up if loss plateaus.** Increasing rank increases training time roughly linearly and does not always improve final quality. Start at r=16 for every new fine-tuning task. If the loss curve shows a clear plateau above a target of 0.5–0.8, try r=32. If you are adapting a model to generate structured code or complex multi-step documents, r=64 may be necessary.
- **Monitor a general capability benchmark every epoch.** Task-specific training loss going to zero while MMLU drops from 60 % to 45 % is a loss of general capability that will create user-visible regressions in every use case outside your training distribution. Integrate a 5-shot MMLU or similar benchmark into your evaluation loop and set a threshold: if general capability drops more than 5 points absolute, the model fails the deployment gate.
- **QLoRA with paged Adam adds 20–30 % training time versus standard LoRA on the same VRAM.** This overhead comes from CPU–GPU optimizer-state paging. Budget for it in your training time estimates. For models that fit in VRAM without paging (e.g. 7B models on an A100 80 GB), use standard LoRA + `adamw_torch` for faster training.
- **Dataset quality beats dataset size — always, at every scale.** Across public benchmarks and internal experiments, 500 carefully curated examples with correct format, diverse instructions, and accurate responses consistently outperform 5 000 auto-scraped, lightly-filtered examples. Before scaling your dataset, audit a random 50-example sample manually. If more than 5 % of samples have format errors, truncation artefacts, or factually incorrect responses, clean the dataset before adding more data.
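The 50-example audit is easy to make reproducible. A minimal sketch, assuming your dataset is a list of dicts; the flag counting itself stays manual, and the helper names here are illustrative rather than from any library.

```python
import random

def audit_sample(dataset: list, n: int = 50, seed: int = 0) -> list:
    """Draw a reproducible random sample of examples for manual review."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(n, len(dataset)))

def fails_quality_bar(flagged: int, sampled: int, threshold: float = 0.05) -> bool:
    """Apply the rule of thumb above: more than 5 % of the sample with
    format errors, truncation artefacts, or wrong answers means the
    dataset needs cleaning before you add more data."""
    return flagged / sampled > threshold
```

A fixed seed means a second reviewer sees the same 50 examples, which makes the audit result reproducible across runs.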
📌 TLDR
- LoRA freezes the base model and trains two tiny matrices per layer (`B ∈ ℝ^(d×r)` and `A ∈ ℝ^(r×k)`) so that the effective weight update `ΔW ≈ B × A` has rank `r`. Only 0.1–0.6 % of parameters are trained; GPU memory drops by 60–70 %.
- QLoRA adds NF4 4-bit quantisation of the frozen base (from 140 GB to 35 GB for a 70B model), double quantisation, and paged optimisers. A 70B fine-tune that requires 8× A100 80 GB under full fine-tuning runs on 2× A100 80 GB under QLoRA.
- The three-library stack: PEFT injects adapters, TRL drives the training loop, bitsandbytes quantises the base.
- Always `merge_and_unload()` before serving. Unmerged adapters run at 1.6× inference latency with no quality benefit.
- Always use `tokenizer.apply_chat_template()`. Mismatched templates are the #1 silent failure mode.
- Monitor general capability (MMLU or equivalent) alongside task loss to detect catastrophic forgetting before deployment.
- Default configuration: r=16, alpha=32, all 7 attention + MLP projection modules targeted, bfloat16 compute, paged adamw for ≥ 30B models.
📝 Practice Quiz
What is the primary reason that QLoRA can fine-tune a 70B model on 2× A100 80 GB GPUs when full fine-tuning requires 8×?
- A) QLoRA uses a smaller rank than LoRA, which reduces the number of trainable parameters further
- B) QLoRA quantises the frozen base model weights to 4-bit NF4, reducing the base model's memory footprint by approximately 4×
- C) QLoRA skips the optimizer state entirely and trains only with first-order gradients
- D) QLoRA replaces Adam with SGD, which requires no optimizer state

Correct Answer: B
You train a Mistral-7B QLoRA adapter at r=16 for 5 epochs on 600 customer support examples. After training, task-specific accuracy on your eval set is 94 %, but you observe that the model now refuses to answer general knowledge questions it answered correctly before fine-tuning. Which failure mode does this describe, and which single training argument change is most likely to fix it?
- A) Rank too low — increase r to 64
- B) Catastrophic forgetting caused by excessive training — reduce num_train_epochs to 1–2 and lower learning_rate to 5e-5
- C) Chat template mismatch — apply tokenizer.apply_chat_template() during data formatting
- D) NF4 instability — switch bnb_4bit_compute_dtype to torch.float16

Correct Answer: B
Why must `B` be initialised to zeros in LoRA, and what property does this zero-initialisation guarantee at the start of training?
- A) Zero initialisation of B ensures the adapter never exceeds rank r during training, preserving the low-rank constraint
- B) Zero initialisation of B ensures the adapter output B×A equals zero at step 0, so the model starts training from the exact same output distribution as the unmodified base model
- C) Zero initialisation of B prevents gradient vanishing by ensuring the first backward pass propagates through a numerically stable path
- D) Zero initialisation of B minimises the KL divergence between the adapter distribution and the base model distribution throughout training

Correct Answer: B
Open-ended challenge: You have fine-tuned a Llama 3 8B model with QLoRA (r=16, all 7 projection modules targeted) on 800 carefully curated domain-specific examples over 3 epochs. Task-specific eval accuracy reaches 91 %. However, MMLU accuracy drops from 62 % to 48 %, and users report that the model occasionally generates responses in a completely different language than the input. Diagnose the two most likely root causes for these symptoms and describe a revised training strategy — including specific changes to hyperparameters, data composition, and evaluation frequency — that would achieve at least 88 % task accuracy while keeping MMLU within 5 percentage points of the base model baseline.
🔗 Related Posts
- RAG vs Fine-Tuning: When to Use Each — companion post: when retrieval-augmented generation is the better choice than adapter fine-tuning
- Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs — cost model and decision framework for self-hosting a fine-tuned model versus calling an external API
- AI Agents Explained: When LLMs Start Using Tools — how fine-tuned models are used as backbone planners inside tool-calling agent loops
- LLM Skill Registry: Routing and Evaluation for Production Agents — routing requests to the right fine-tuned adapter at inference time in a multi-skill agent system
- Headless Agents: Deploying Skills as an MCP Server — packaging a fine-tuned model as a callable skill behind the Model Context Protocol

Written by
Abstract Algorithms
@abstractalgorithms
