
LoRA Explained: How to Fine-Tune LLMs on a Budget

Want to train your own LLM but don't have 100 GPUs? LoRA (Low-Rank Adaptation) lets you fine-tune...

Abstract Algorithms · 5 min read

TLDR: Fine-tuning a 7B-parameter LLM updates billions of weights and requires expensive GPUs. LoRA (Low-Rank Adaptation) freezes the original weights and trains only tiny adapter matrices that are added on top. 90%+ memory reduction; zero inference latency penalty.


📖 The Sticky Note Analogy

You have a 1,000-page textbook (the pre-trained LLM). You want to update it with new Quantum Physics content.

  • Full fine-tuning: Rewrite every page in the book. You need a massive printing press (8×A100 GPUs).
  • LoRA: Leave the book exactly as it is. Write updates on transparent sticky notes and paste them on the relevant pages. Tiny, cheap, portable.

When you read the book with sticky notes, you get the updated knowledge. When you remove them, you get back the original.


🔢 Why Full Fine-Tuning Is Expensive

A 7B-parameter model stores each parameter as a 16-bit float (2 bytes, ~14 GB of weights). During training, you also need:

  • Optimizer states (Adam: 2× the parameters)
  • Gradients (1× the parameters)
  • Activations (variable)

Total memory for full fine-tuning a 7B model: ~56–112 GB. That means 4–8 A100 GPUs, each costing thousands of dollars per month to rent.
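The arithmetic above can be sketched in a few lines. This is a back-of-envelope estimate following the accounting in this section; real usage varies with activation memory and optimizer precision:

```python
# Rough memory estimate for full fine-tuning in fp16,
# following the accounting above (activations excluded).
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    weights   = n_params * bytes_per_param      # model weights
    gradients = n_params * bytes_per_param      # 1x the parameters
    optimizer = 2 * n_params * bytes_per_param  # Adam: 2x (momentum + variance)
    return (weights + gradients + optimizer) / 1e9

print(full_finetune_memory_gb(7e9))  # 56.0 -> ~56 GB before activations
```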

LoRA changes the math entirely.


⚙️ The Low-Rank Decomposition: Why It Works

The attention projection matrices in a Transformer are square, with shape $d \times d$ (e.g., $1000 \times 1000$).

Full fine-tuning trains $\Delta W$ of shape $1000 \times 1000$ = 1,000,000 parameters.

LoRA observes that the useful change in weights during fine-tuning tends to lie on a low-dimensional subspace (intrinsic dimensionality hypothesis). So instead of a full update, it trains two small matrices:

  • Matrix A: shape $r \times d$ ($4 \times 1000$) — 4,000 parameters.
  • Matrix B: shape $d \times r$ ($1000 \times 4$) — 4,000 parameters.

The effective weight update: $$\Delta W = B A$$ $$W_{\text{new}} = W_{\text{frozen}} + \alpha \cdot B A$$

Where $\alpha$ is a scaling factor hyperparameter.

Parameter count: 8,000 vs 1,000,000 — a 125× reduction at rank $r=4$.
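A quick sanity check of the parameter arithmetic:

```python
# LoRA parameter count vs. a full d x d update, at rank r.
d, r = 1000, 4
full_params = d * d          # full fine-tuning: Delta W is d x d
lora_params = r * d + d * r  # A (r x d) plus B (d x r)
print(full_params, lora_params, full_params // lora_params)  # 1000000 8000 125
```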

flowchart TD
    Input["Input: x"]
    Frozen["W_frozen\n(not trained — frozen)"]
    A["Matrix A\n(d × r, trained)"]
    B["Matrix B\n(r × d, trained)"]
    Sum["Sum: W_frozen·x + α·B·A·x"]
    Output["Output: h"]

    Input --> Frozen --> Sum
    Input --> A --> B --> Sum
    Sum --> Output

🧠 Zero Inference Latency: Merging After Training

During training, we compute both paths (frozen W and $\alpha BA$) and sum them.

After training, we merge: $W_{\text{merged}} = W_{\text{frozen}} + \alpha BA$.

This is just matrix addition — done once. The merged model is a standard Transformer with no extra branches. Zero latency overhead at inference compared to the original model.
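The equivalence is easy to verify numerically. A minimal NumPy sketch with hypothetical small dimensions, using the LoRA paper's shapes (B is $d \times r$, A is $r \times d$; B is zero-initialized in real training, random here just for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 2.0

W = rng.normal(size=(d, d))  # frozen pre-trained weights
B = rng.normal(size=(d, r))  # trained up-projection
A = rng.normal(size=(r, d))  # trained down-projection
x = rng.normal(size=d)

h_train = W @ x + alpha * (B @ (A @ x))  # training-time: two paths, summed
W_merged = W + alpha * (B @ A)           # merge once after training
h_infer = W_merged @ x                   # inference: single matmul, no overhead

print(np.allclose(h_train, h_infer))  # True
```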


⚖️ Rank, Alpha, and Target Modules: Key Hyperparameters

| Hyperparameter | What It Controls | Typical Range |
| --- | --- | --- |
| `r` (rank) | Capacity of adaptation; higher = more expressive | 4, 8, 16, 64 |
| `alpha` | Scaling of the LoRA update; often set to `r` or `2r` | 16, 32, 64 |
| `target_modules` | Which layers get adapters | `q_proj`, `v_proj` (attention) |

Rule of thumb: Start with r=8, alpha=16 on only q_proj and v_proj. Increase r if the task is complex (code generation, math reasoning).

Training with Hugging Face PEFT

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Gated repo: request access on Hugging Face and log in first.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor (alpha)
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 8,030,261,248 || trainable%: 0.042

Less than 0.05% of parameters are trained. A single A100 (40 GB) can handle a 7B model fine-tune.
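The printed count can be reproduced by hand. Assuming Llama-3-8B's published dimensions (32 layers, hidden size 4096, and a 1024-dim value projection due to grouped-query attention), each adapted matrix contributes $r \times (d_{in} + d_{out})$ parameters:

```python
# Reproduce print_trainable_parameters() analytically.
# Assumed Llama-3-8B dims: 32 layers, hidden 4096, GQA value dim 1024.
layers, d_model, d_kv, r = 32, 4096, 1024, 8

per_q = r * (d_model + d_model)  # q_proj: 4096 -> 4096
per_v = r * (d_model + d_kv)     # v_proj: 4096 -> 1024 (grouped-query attention)
total = layers * (per_q + per_v)
print(total)  # 3407872
```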


🌍 LoRA vs Other PEFT Methods

| Method | What Is Trained | Memory | Quality |
| --- | --- | --- | --- |
| Full Fine-Tuning | All parameters | Very high | Best |
| LoRA | Low-rank adapter matrices only | ~10% of full | Close to full (within 1–2%) |
| QLoRA | LoRA on 4-bit quantized model | ~5% of full | Slightly lower; fits on consumer GPUs |
| Prefix Tuning | Prepended soft tokens | Low | Task-specific context only |
| Adapter Layers | Small bottleneck layers | Medium | Good; higher inference overhead |

QLoRA = LoRA + 4-bit quantization of the frozen base model. It enables fine-tuning a 70B model on a single 48 GB GPU and is used extensively by the open-source LLaMA community.
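A sketch of what the QLoRA setup looks like with `transformers` and `bitsandbytes` (a config fragment, not a full training script; it requires a GPU and access to the gated model, and the quantization options shown are the NF4 settings from the QLoRA paper):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
```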


📌 Summary

  • LoRA freezes original weights; trains only low-rank matrices B (d×r) and A (r×d).
  • Parameter reduction: from $d^2$ to $2dr$. At $r=4, d=1000$: 1M → 8K parameters.
  • Zero inference overhead: merge $W + \alpha BA$ after training; single forward pass.
  • QLoRA adds 4-bit quantization to run fine-tuning on consumer hardware.
  • Hugging Face PEFT wraps LoRA configuration in a few lines of Python.

📝 Practice Quiz

  1. What does LoRA train instead of the full weight matrix?

    • A) The attention bias terms only.
    • B) Two small low-rank matrices, B (d×r) and A (r×d), whose product approximates the weight update.
    • C) Only the last layer of the model.
      Answer: B
  2. After LoRA training, how is the inference overhead compared to the original model?

    • A) ~10% slower due to the extra adapter pass.
    • B) Zero — A and B are merged into the original weight matrix before deployment.
    • C) 2× slower because two forward passes are needed.
      Answer: B
  3. What does QLoRA add on top of LoRA?

    • A) Quantized outputs for faster decoding.
    • B) 4-bit quantization of the frozen base model weights, enabling fine-tuning on consumer GPUs.
    • C) Multi-task training on multiple datasets simultaneously.
      Answer: B

Written by Abstract Algorithms (@abstractalgorithms)