LoRA Explained: How to Fine-Tune LLMs on a Budget
Abstract Algorithms
TLDR: Fine-tuning a 7B-parameter LLM updates billions of weights and requires expensive GPUs. LoRA (Low-Rank Adaptation) freezes the original weights and trains only tiny adapter matrices that are added on top. 90%+ memory reduction; zero inference latency penalty.
📖 The Sticky Note Analogy
You have a 1,000-page textbook (the pre-trained LLM). You want to update it with new Quantum Physics content.
- Full fine-tuning: Rewrite every page in the book. You need a massive printing press (8×A100 GPUs).
- LoRA: Leave the book exactly as it is. Write updates on transparent sticky notes and paste them on the relevant pages. Tiny, cheap, portable.
When you read the book with sticky notes, you get the updated knowledge. When you remove them, you get back the original.
🔢 Why Full Fine-Tuning Is Expensive
A 7B-parameter model stores each parameter as a 16-bit float (~14 GB of weights). During training, you also need:
- Optimizer states (Adam keeps two extra values per parameter, usually in 32-bit)
- Gradients (one per parameter)
- Activations (variable; depends on batch size and sequence length)
Total memory for full fine-tuning of a 7B model: roughly 56–112 GB. That means 4–8 A100-class GPUs, costing thousands of dollars per month each to rent.
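As a sanity check on these numbers, here is a back-of-envelope sketch. The assumptions are labeled in the comments (fp16 weights and gradients, two fp32 Adam states per trained parameter, activations ignored); real memory use varies with framework and settings.

```python
# Back-of-envelope memory estimate: full fine-tuning vs. LoRA.
# Assumptions: fp16 weights and gradients (2 bytes each),
# two fp32 Adam moments per trained parameter (8 bytes total),
# activations excluded for simplicity.

def full_finetune_gb(n_params: float) -> float:
    weights = n_params * 2           # fp16 weights
    gradients = n_params * 2         # fp16 gradients
    adam_states = n_params * 2 * 4   # two fp32 moments
    return (weights + gradients + adam_states) / 1e9

def lora_gb(n_params: float, trainable_frac: float = 0.001) -> float:
    weights = n_params * 2                       # frozen base stays in fp16
    t = n_params * trainable_frac                # only adapters are trained
    return (weights + t * 2 + t * 2 * 4) / 1e9   # grads + Adam states for adapters only

print(f"full: ~{full_finetune_gb(7e9):.0f} GB")   # ~84 GB before activations
print(f"lora: ~{lora_gb(7e9):.0f} GB")            # ~14 GB before activations
```

The full-fine-tuning estimate lands inside the 56–112 GB range quoted above; the LoRA estimate is dominated by the frozen fp16 weights themselves.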
LoRA changes the math entirely.
⚙️ The Low-Rank Decomposition: Why It Works
Consider a square weight matrix in a Transformer of shape $d \times d$ (e.g., $1000 \times 1000$), such as an attention projection.
Full fine-tuning trains $\Delta W$ of shape $1000 \times 1000$ = 1,000,000 parameters.
LoRA observes that the useful change in weights during fine-tuning tends to lie in a low-dimensional subspace (the intrinsic dimensionality hypothesis). So instead of a full update, it trains two small matrices:
- Matrix A: shape $d \times r$ ($1000 \times 4$) — 4,000 parameters.
- Matrix B: shape $r \times d$ ($4 \times 1000$) — 4,000 parameters.
The effective weight update: $$\Delta W = A \times B$$ $$W_{\text{new}} = W_{\text{frozen}} + \alpha \cdot A \times B$$
Where $\alpha$ is a scaling factor hyperparameter.
Parameter count: 8,000 vs 1,000,000 — a 125× reduction at rank $r=4$.
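A minimal NumPy sketch of the factorization at exactly these sizes. The initialization shown is illustrative but follows the LoRA convention of zero-initializing one factor so that $\Delta W$ starts at zero:

```python
import numpy as np

# Low-rank factorization at the sizes from the text: d = 1000, r = 4.
# Delta W = A @ B has the full d x d shape, but only 2*d*r trainable
# parameters stand behind it.
d, r = 1000, 4
A = np.zeros((d, r))                # d x r, trained (zero-init: Delta W starts at 0)
B = np.random.randn(r, d) * 0.01    # r x d, trained (small random init)

delta_W = A @ B
print(delta_W.shape)      # (1000, 1000)
print(A.size + B.size)    # 8000 trainable parameters vs 1,000,000
```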
```mermaid
flowchart TD
    Input["Input: x"]
    Frozen["W_frozen (frozen, not trained)"]
    B["Matrix B (r × d, trained)"]
    A["Matrix A (d × r, trained)"]
    Sum["Sum: W_frozen·x + α·A·B·x"]
    Output["Output: h"]
    Input --> Frozen --> Sum
    Input --> B --> A --> Sum
    Sum --> Output
```
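The two paths above can be written as plain matrix math. A toy-sized sketch (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

# Toy LoRA forward pass: frozen path plus scaled adapter path.
d, r, alpha = 8, 2, 1.0
rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(d, r))          # trained, d x r
B = rng.normal(size=(r, d))          # trained, r x d
x = rng.normal(size=d)

# Input goes through B first (down to rank r), then A (back up to d).
h = W_frozen @ x + alpha * (A @ (B @ x))
print(h.shape)  # (8,)
```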
🧠 Zero Inference Latency: Merging After Training
During training, we compute both paths (frozen $W$ and $\alpha AB$) and sum them.
After training, we merge: $W_{\text{merged}} = W_{\text{frozen}} + \alpha AB$.
This is just matrix addition — done once. The merged model is a standard Transformer with no extra branches. Zero latency overhead at inference compared to the original model.
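A quick numerical check that the merge is exact (toy sizes, random matrices): the merged single matrix produces the same output as the two-branch forward pass.

```python
import numpy as np

# Folding alpha * A @ B into W once gives identical outputs to running
# both branches, so inference needs no extra computation path.
d, r, alpha = 16, 4, 2.0
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, d))
x = rng.normal(size=d)

two_branch = W @ x + alpha * (A @ (B @ x))   # training-time forward
W_merged = W + alpha * (A @ B)               # done once, after training
merged = W_merged @ x                        # deployment-time forward

print(np.allclose(two_branch, merged))  # True
```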
⚖️ Rank, Alpha, and Target Modules: Key Hyperparameters
| Hyperparameter | What It Controls | Typical Values |
| --- | --- | --- |
| `r` (rank) | Capacity of the adaptation; higher = more expressive | 4, 8, 16, 64 |
| `alpha` | Scaling of the LoRA update; often set to `r` or `2r` | 16, 32, 64 |
| `target_modules` | Which layers get adapters | `q_proj`, `v_proj` (attention) |
Rule of thumb: Start with r=8, alpha=16 on only q_proj and v_proj. Increase r if the task is complex (code generation, math reasoning).
Training with Hugging Face PEFT
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 8,030,261,248 || trainable%: 0.042
```
Less than 0.05% of the parameters are trained. A single A100 (40 GB) can handle a fine-tune of a model in the 7–8B range.
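The 3,407,872 figure from the printout can be reproduced by hand from the Llama-3-8B shapes (32 layers, hidden size 4096, and grouped-query attention projecting values from 4096 down to 1024):

```python
# Reproducing print_trainable_parameters() arithmetic for Llama-3-8B
# with r=8 adapters on q_proj and v_proj.
r = 8
layers = 32
q_in, q_out = 4096, 4096   # q_proj: hidden -> hidden
v_in, v_out = 4096, 1024   # v_proj: hidden -> KV heads (GQA)

# Each adapted projection gets an A (out x r) and a B (r x in) matrix,
# contributing r * (in + out) parameters.
per_layer = r * (q_in + q_out) + r * (v_in + v_out)
total = per_layer * layers
print(total)  # 3407872
```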
🌍 LoRA vs Other PEFT Methods
| Method | What Is Trained | Memory | Quality |
| --- | --- | --- | --- |
| Full Fine-Tuning | All parameters | Very high | Best |
| LoRA | Low-rank adapter matrices only | ~10% of full | Close to full (within 1–2%) |
| QLoRA | LoRA on a 4-bit quantized model | ~5% of full | Slightly lower; fits on consumer GPUs |
| Prefix Tuning | Prepended soft tokens | Low | Task-specific context only |
| Adapter Layers | Small bottleneck layers | Medium | Good; higher inference overhead |
QLoRA = LoRA + 4-bit quantization of the frozen base model. It enables fine-tuning a 70B model on a single 48 GB GPU and is used extensively in the open-source LLaMA community.
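A sketch of a typical QLoRA setup using `BitsAndBytesConfig` from transformers. The model name and hyperparameters are illustrative, not a tested recipe, and running it requires a GPU with bitsandbytes installed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantization config for the frozen base model (the QLoRA part).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, introduced by QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for de-quantized compute
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B", quantization_config=bnb_config
)

# The LoRA part is unchanged: adapters train in higher precision on top.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]
))
```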
📌 Summary
- LoRA freezes original weights; trains only low-rank matrices A (d×r) and B (r×d).
- Parameter reduction: from $d^2$ to $2dr$. At $r=4, d=1000$: 1M → 8K parameters.
- Zero inference overhead: merge $W + \alpha AB$ after training; single forward pass.
- QLoRA adds 4-bit quantization to run fine-tuning on consumer hardware.
- Hugging Face PEFT wraps LoRA configuration in a few lines of Python.
📝 Practice Quiz
What does LoRA train instead of the full weight matrix?
- A) The attention bias terms only.
- B) Two small low-rank matrices A (d×r) and B (r×d) whose product approximates the weight update.
- C) Only the last layer of the model.
Answer: B
After LoRA training, how is the inference overhead compared to the original model?
- A) ~10% slower due to the extra adapter pass.
- B) Zero — A and B are merged into the original weight matrix before deployment.
- C) 2× slower because two forward passes are needed.
Answer: B
What does QLoRA add on top of LoRA?
- A) Quantized outputs for faster decoding.
- B) 4-bit quantization of the frozen base model weights, enabling fine-tuning on consumer GPUs.
- C) Multi-task training on multiple datasets simultaneously.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms