RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)
A practical decision guide with Python code for both paths — choose the right approach before you spend weeks building the wrong one.
TLDR: RAG gives LLMs access to current knowledge at inference time; fine-tuning changes how they reason and write. Use RAG when your data changes. Use fine-tuning when you need consistent style, tone, or domain reasoning. Use both for production assistants. Never fine-tune to inject facts that will be stale before you finish training.
📖 The Two Teams That Built the Wrong Thing
Six months ago, Team A finished a RAG pipeline to answer questions about their internal documentation. They celebrated the launch and closed the sprint. Two months into production, the numbers told a different story: recall was stuck at 60%, the LLM was writing responses in a tone that made every answer sound like a Wikipedia summary, and it kept confusing internal product terminology with generic industry terms. Engineers blamed the vector store. The tech lead blamed the embedding model. The real culprit was the decision to use RAG to fix a problem that RAG cannot solve.
Around the same time, Team B decided to fine-tune a model on three years of support tickets. The result looked impressive in evaluation: the model used the correct product names, matched the support team's voice exactly, and confidently handled multi-step troubleshooting flows. Then a product release changed two pricing tiers and deprecated a feature. The fine-tuned model kept referencing the old tiers with complete confidence. Users filed tickets about advice that contradicted the website. Retraining took eleven days and $4,000 in GPU costs — and by the time it was done, two more features had changed.
Both teams picked the wrong tool. What makes this painful is that their failure modes looked almost identical on the surface: the model gave wrong answers. The root causes were completely different.
This is the central trap with RAG and fine-tuning: they both improve LLM output quality, but for fundamentally different reasons. Understanding that distinction — clearly, with real criteria, before you start building — is what separates teams that ship working AI systems from teams that spend months chasing the wrong problem.
🔍 Why RAG and Fine-Tuning Solve Different Problems
To use either tool well, you need to understand what each one actually changes about an LLM.
RAG (Retrieval-Augmented Generation) is a runtime architecture, not a training technique. The model's weights are never touched. Instead, at inference time, a retrieval layer fetches relevant documents or passages and injects them into the prompt as context. The model reads this context and uses it to answer. If the context is accurate and relevant, the answer will be grounded in it. If the context is stale or missing, the model falls back to its pretraining knowledge — with no retrieval signal that anything went wrong.
Think of RAG as giving a surgeon the patient's chart right before the operation. The surgeon's skills, training, and judgment stay exactly as they were. What changes is the specific, current information they have access to in the moment.
Fine-tuning modifies the model's weights using a targeted dataset. After fine-tuning, the model has changed at a fundamental level: its internal representations, its preferred output formats, its vocabulary for a domain, and its reasoning patterns all reflect the training data. You are not giving the model a reference — you are changing how it thinks and writes, permanently (or until the next training run).
Think of fine-tuning as sending that same surgeon to a specialty residency. The operations they perform after the residency reflect deeply internalized expertise — not a reference manual they read that morning.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| What changes | Nothing in the model | The model's weights |
| When the change applies | Every inference (new context) | Permanently, until retrained |
| Best for | Providing current facts, domain documents | Changing reasoning style, tone, terminology |
| Data freshness | Always current (re-index documents) | Frozen at training time |
| Failure mode | Retrieval miss → model hallucinates | Stale facts baked into weights |
| Setup cost | Hours to days | Days to weeks |
| Cost to update | Re-index changed documents | Retrain (compute + time) |
The confusion between them stems from the fact that both can make a model give better answers. But they do so by different mechanisms. Using RAG to "teach" terminology does not work because RAG only provides context — if the model does not understand the terminology in the first place, injected context about it will not reliably change how the model interprets it. Using fine-tuning to keep facts current is expensive and self-defeating because the model's weights go stale the moment the underlying reality changes.
⚙️ Six Signals That Tell You Which Tool to Reach For
Before writing a single line of code, apply this scoring table. Rate each factor for your situation. Whichever column accumulates more check marks is the starting recommendation.
| Factor | RAG is the better choice | Fine-Tuning is the better choice |
|---|---|---|
| Data freshness | Data changes weekly or daily | Knowledge domain is stable for months |
| Knowledge type | Facts, documents, policies, product specs | Reasoning style, tone, output format, domain jargon as syntax |
| Data volume | Any size (chunked at indexing time) | More than 500 high-quality labeled examples |
| Latency budget | Can absorb 100–500ms retrieval round-trip | Need sub-500ms with no external dependencies |
| Infrastructure | Vector DB + embedding model (no GPU) | GPU for training and for serving the fine-tuned model |
| Auditability need | Must cite sources; answers must be traceable | Behavior audit (did the model respond correctly?), not source audit |
Three clear decision tiers emerge from this table:
RAG only: Your knowledge is external and changes frequently. Every answer needs to be citable. Your team has no ML infrastructure and does not want to build it. Most internal documentation chatbots, customer-facing FAQ systems, and support bots with large ever-changing knowledge bases belong here.
Fine-tuning only: You need the model to consistently write in a specific style or follow a specific output format. Retrieval latency is not acceptable. The domain knowledge is stable and well-represented in your training data. Legal clause extraction, medical note formatting, and code generation with company-specific API conventions are common examples.
Both — the production-grade path: You need the model to have correct, current facts AND to respond in a consistent style or domain voice. Most mature AI assistants land here. Fine-tuning handles tone, format, and terminology; RAG handles grounding in current facts.
🧠 Deep Dive: How RAG and Fine-Tuning Work Under the Hood
Both approaches improve LLM output, but through completely different mechanisms. Understanding the internal machinery of each — before choosing — prevents the most expensive architectural mistakes.
RAG Internals: From Chunking to Re-Ranking
Most teams treat RAG as a solved problem after they wire up an embedding model and a vector store. Then they wonder why recall is 60%. The reality is that chunking strategy is where 80% of RAG failures originate — not the LLM, not the embedding model, and usually not the vector store.
Chunking: Where most RAG pipelines break. A chunk is the unit of text that gets embedded and stored. If your chunks are too large, the embedding averages over too much content and loses precision — the top-k results will contain the right information buried in noise, but the model may not extract it correctly. If your chunks are too small, you lose the surrounding context that makes a passage meaningful, and the model gets fragments that do not stand alone.
Three chunking strategies dominate production systems:
- Fixed-size chunking (e.g., 512 tokens with 64-token overlap) is the simplest and fastest. It works well for uniform-structure documents but breaks mid-sentence and mid-concept on prose or technical specs.
- Recursive character text splitting breaks at paragraph boundaries first, then sentence boundaries, then word boundaries, degrading gracefully until the chunk falls within the size limit. This is the right default for most text.
- Semantic chunking embeds every sentence and merges adjacent sentences when their embeddings are similar, splitting when similarity drops. It produces semantically coherent chunks but requires an embedding call per sentence during indexing.
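The recursive strategy is simple enough to sketch in a few lines. The following is a minimal illustration of the idea, not a drop-in replacement for a library splitter: the size limit here is measured in characters rather than tokens, the separator list is an assumption, and overlap is omitted for brevity.

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split at the coarsest boundary that yields chunks under max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_len:
                        # Part is still too long: recurse with finer separators
                        chunks.extend(recursive_split(part, max_len, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator produced a split: fall back to a hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

The graceful degradation is the point: paragraph boundaries are tried first, and the hard character split at the bottom only fires when no natural boundary exists.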
Embedding model choice matters more than most teams realize. The embedding model converts your chunks and queries into dense vectors. The distance between these vectors determines retrieval quality. text-embedding-3-large from OpenAI produces high-quality embeddings but adds cost and API latency. BAAI/bge-m3 is a strong open-weight alternative with multilingual support. nomic-embed-text runs locally and produces surprisingly competitive results for English-only knowledge bases. Mismatched embedding models — using one model to index and a different model to query — are a silent failure mode that produces nonsensical retrieval results.
Vector store and retrieval strategy. A vector store indexes your chunk embeddings and performs approximate nearest-neighbor search at query time. The index algorithm matters: HNSW (Hierarchical Navigable Small World) is the de facto standard for production deployments — it achieves sub-10ms retrieval over millions of vectors with configurable precision/recall trade-offs. Flat (exact) search is only viable at small scale.
But vector similarity alone — also called dense retrieval — is not always best. For technical documentation with precise terminology (model names, version numbers, function signatures), sparse retrieval (BM25 keyword matching) often outperforms dense retrieval because exact keyword matches beat semantic generalization. Hybrid retrieval fuses both signals using Reciprocal Rank Fusion (RRF), and it consistently beats either approach alone for mixed-content knowledge bases.
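Reciprocal Rank Fusion itself is only a few lines. A minimal sketch (the k=60 smoothing constant is the value commonly cited from the original RRF paper; the document IDs and rankings below are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs via Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every list it
    appears in, so items that rank well in several channels rise.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector-similarity ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 ranking
print(rrf_fuse([dense, sparse]))
# doc_a and doc_b appear in both channels, so they outrank doc_c and doc_d
```

Because RRF operates on ranks rather than raw scores, it needs no score normalization between the dense and sparse channels, which is exactly why it is the standard fusion step.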
Re-ranking: the cheap precision multiplier. After retrieving top-k candidates, a cross-encoder re-ranker re-scores each candidate against the original query using a full attention pass — much more expensive than embedding similarity but orders of magnitude more accurate. Running a cross-encoder/ms-marco-MiniLM-L6-v2 re-ranker on the top-20 vector results and keeping the top-5 routinely doubles precision at the cost of roughly 50–100ms of additional latency. For most production RAG systems, this is one of the highest-ROI improvements available.
The full pipeline order: query → embed → hybrid vector+BM25 search → top-20 candidates → CrossEncoder re-rank → top-5 context chunks → prompt assembly → LLM.
Fine-Tuning Internals: LoRA, QLoRA, and Weight Update Mechanics
Fine-tuning a large model from scratch on your dataset is called full fine-tuning: all parameters are updated during training. This produces the best possible specialization but is prohibitively expensive for models above 7B parameters. A 7B-parameter model at FP16 requires approximately 14GB just for weights; gradients and Adam optimizer states push total training memory to several times that before activations are even counted. For most teams, this is not practical.
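The arithmetic is worth making explicit. A rough back-of-envelope sketch, assuming mixed-precision training with the Adam optimizer (exact figures vary by implementation and sharding strategy):

```python
params = 7e9                         # 7B-parameter model
weights_fp16 = params * 2            # 2 bytes/param  -> ~14 GB
grads_fp16 = params * 2              # gradients at the same precision -> ~14 GB
adam_states_fp32 = params * 4 * 2    # momentum + variance at 4 bytes each -> ~56 GB

total_gb = (weights_fp16 + grads_fp16 + adam_states_fp32) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB
```

That is multiple A100s of memory before a single activation is stored, which is the pressure that makes parameter-efficient methods the default.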
Parameter-Efficient Fine-Tuning (PEFT) is the practical alternative. Instead of updating all weights, PEFT freezes the base model and trains only a small set of additional parameters. The dominant PEFT approach is LoRA.
LoRA: Low-Rank Adaptation. LoRA works by decomposing weight updates into two small matrices rather than updating the full weight matrix directly. For a pretrained weight matrix W of size d×k, LoRA adds a parallel update path: W + ΔW where ΔW = BA, B has shape d×r, and A has shape r×k, with rank r ≪ min(d, k). Only A and B are trained. The base model weights stay frozen. In practice, r=16 gives a good balance — it trains roughly 0.1–1% of total parameters while capturing most of the behavioral shift you want. Because B is initialized to zero, LoRA adapters have no effect at initialization and only diverge during training, which means the base model's general capability is preserved.
lora_alpha controls the scaling of the LoRA update: the effective update magnitude is (lora_alpha / r) * ΔW. Setting lora_alpha = 2 * r (e.g., alpha=32 when r=16) is a stable default. The target_modules parameter specifies which weight matrices receive LoRA adapters — for transformer models, the query and value projections (q_proj, v_proj) are the standard targets, though adding k_proj and o_proj improves results for complex reasoning tasks.
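The update mechanics can be sketched directly in NumPy. This illustrates the forward-pass math only, not the PEFT implementation; the shapes are toy-sized and the initialization scale is an assumption:

```python
import numpy as np

d, k, r = 64, 64, 16
alpha = 2 * r  # the stable default: lora_alpha = 2 * r

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init -> no effect at start

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight: W + (alpha / r) * B @ A; only A and B receive gradients
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal((1, k))
# Because B starts at zero, the adapted model matches the base model exactly
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialized B is what makes LoRA safe to attach: training starts from the base model's exact behavior and diverges only as far as the data pushes it.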
QLoRA: Fine-tuning 70B models on two A100s. QLoRA extends LoRA by quantizing the base model weights to 4 bits (using NF4 quantization, which is better suited to normally distributed weights than standard int4) before freezing them, then training LoRA adapters in BF16. The quantized base model reduces memory usage by roughly 4×. A 70B-parameter model that would require eight A100s for standard LoRA now fits on two. The tradeoff: quantization adds a small inference overhead and introduces quantization error in the frozen base — negligible for style adaptation tasks, occasionally noticeable for precise numerical reasoning.
Training data format: instruction triples, not raw text. Fine-tuning on raw completions is the most common beginner mistake. Models trained on raw text learn to continue text — they do not learn to follow instructions. Instruction-tuning formats every training example as a system/user/assistant triple, teaching the model to respond to directives rather than continue a document. For behavior alignment, Direct Preference Optimization (DPO) pairs each prompt with a preferred and rejected response — 200 high-quality preference pairs routinely outperform 2,000 raw SFT examples.
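Concretely, the two data shapes look like this. The field names follow the common chat and DPO conventions used by TRL-style trainers; the product name and content are purely illustrative:

```python
# Supervised fine-tuning (SFT): a system/user/assistant triple
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a support engineer for AcmeDB."},
        {"role": "user", "content": "How do I rotate the cluster TLS certs?"},
        {"role": "assistant", "content": "Run the cert rotation command, then restart each node."},
    ]
}

# Direct Preference Optimization (DPO): prompt plus preferred/rejected pair
dpo_example = {
    "prompt": "How do I rotate the cluster TLS certs?",
    "chosen": "Run the cert rotation command, then restart each node.",
    "rejected": "TLS certificates are used to encrypt traffic between nodes.",
}
```

Note what the DPO pair encodes that the SFT triple cannot: the rejected response is factually true but unhelpful, which is exactly the behavioral distinction preference pairs teach.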
Overfitting signals to watch: A diverging gap between training loss (still decreasing) and validation loss (increasing) is the textbook signal. More subtle: measuring the model on a held-out general benchmark like MMLU or HellaSwag — if the fine-tuned model loses more than 5–10 percentage points on general reasoning versus the base, the learning rate is too high or the dataset is too small to prevent catastrophic forgetting.
Performance Analysis: Speed, Cost, and Quality Across Both Approaches
The table below compares the operational profile of each strategy across the dimensions that matter most when choosing. Use it alongside the six-factor decision table above to build your recommendation.
| Metric | RAG only | Fine-Tuning only | RAG + Fine-Tuning |
|---|---|---|---|
| Time to first value | Hours (index + prompt) | Days to weeks (data prep + train) | Weeks |
| Cost to update | Re-index changed docs (cheap) | Retrain on GPU (expensive) | Re-index only |
| Recall on fresh data | High (if retrieved) | None (frozen weights) | High |
| Tone and style adherence | Low (prompt-dependent, inconsistent) | High (baked into weights) | High |
| Hallucination risk | Lower (grounded in context) | Higher (if undertrained) | Lowest |
| Latency overhead | +100–500ms retrieval round-trip | 0ms (no retrieval at inference) | +100–500ms |
| Infrastructure floor | Vector DB + embedding model | GPU for training + fine-tuned serving | Both |
📊 The RAG Pipeline Flow, Visualized
The following diagram traces a query through a production-grade RAG pipeline. The two retrieval branches (vector similarity and BM25) run in parallel, their results are fused and re-ranked, and only then does the LLM receive context.
```mermaid
graph TD
    A[User Query] --> B[Embed Query with Embedding Model]
    B --> C[Dense Vector Search in HNSW Index]
    A --> D[BM25 Sparse Keyword Search]
    C --> E[Merge Candidates via RRF Score Fusion]
    D --> E
    E --> F[CrossEncoder Re-Ranker]
    F --> G[Top-K Context Chunks Selected]
    G --> H[Prompt Assembly with System Instructions]
    H --> I[LLM Inference]
    I --> J[Grounded Response with Citations]
```
The diagram separates the two retrieval channels intentionally: dense search catches semantically similar passages even when query wording differs from document wording, while sparse BM25 catches exact product names, version numbers, and technical identifiers that embedding similarity tends to dilute. RRF fusion gives higher combined scores to documents that rank well in both channels, boosting precision without requiring a trained re-ranker at the fusion step.
📊 The LoRA Training and Serving Flow, Visualized
Training with LoRA involves freezing the base model and routing gradients only through the small adapter matrices. Serving can either merge the adapter weights back into the base model for zero-latency overhead, or keep them separate and apply them dynamically when multiple adapters need to share a single base model.
```mermaid
graph TD
    A[Instruction-Formatted Training Examples] --> B[Tokenize with Chat Template]
    B --> C[Load Base Model in 4-bit via QLoRA]
    C --> D[Attach LoRA Adapter Layers to q-proj and v-proj]
    D --> E[Forward Pass - Frozen Base plus Trainable Adapter]
    E --> F[Compute Causal LM Loss]
    F --> G[Backprop Through Adapter Only]
    G --> H[Checkpoint Adapter Weights]
    H --> I[Merge Adapter into Base Model]
    I --> J[Serve with vLLM or HuggingFace TGI]
```
The key insight from this flow is what does NOT appear in the gradient path: the frozen base model layers. Because the base model parameters never receive gradient updates, there is no risk of catastrophic forgetting in those layers — the forgetting risk comes entirely from the LoRA adapter influencing the activation patterns that flow through frozen weights. Keeping rank r small (8–16 for style adaptation, 32–64 for domain reasoning) limits the adapter's capacity to distort base knowledge.
⚖️ Seven Failure Modes: Symptoms, Root Causes, and Fixes
These are the most common production failures across both approaches, drawn from real deployment patterns.
| Symptom | Approach | Root Cause | Fix |
|---|---|---|---|
| Recall below 70%; model ignores documents in context | RAG | Chunk size too large (embeddings diluted) or retrieval returning semantically distant passages | Use recursive chunking at 256–512 tokens; add re-ranker |
| Model answers fluently but ignores retrieved context, falling back to stale training knowledge | RAG | Context is appended after a long system prompt; model's attention dilutes toward the end | Move context closer to the question; reduce system prompt length |
| Query latency spikes above 2 seconds at scale | RAG | Flat index — exact nearest-neighbor search is O(n) | Migrate to HNSW index; add caching for frequent queries |
| Fine-tuned model writes correctly but states facts that changed post-training | Fine-Tuning | Fine-tuning was used to inject facts, not just style | Never inject facts via fine-tuning; add RAG layer for current information |
| Model's general reasoning degrades after fine-tuning (MMLU drops 15%) | Fine-Tuning | Learning rate too high, causing catastrophic forgetting in shared representations | Reduce LR to 1e-4 or lower; use cosine warmup; increase regularization with LoRA dropout |
| Fine-tuned model overfits to training phrasing — brittleness on paraphrased inputs | Fine-Tuning | Training set too small (fewer than 500 examples) or too homogeneous in phrasing | Augment training examples with paraphrases; use DPO pairs instead of raw SFT completions |
| Combined RAG + fine-tuned model disagrees with itself — retrieved context contradicts baked-in training beliefs | Both | Fine-tuning introduced strong priors that override retrieved context | Re-prompt to explicitly instruct the model to prefer context over prior knowledge; reduce LoRA rank |
🌍 How Production Teams Actually Use Both: Five Real Deployments
Understanding where the boundary between RAG and fine-tuning falls in real systems prevents the most common architectural mistakes.
Notion AI is a canonical RAG-without-fine-tuning deployment. Every user's content changes constantly — new pages, edits, restructured databases. Fine-tuning a model on each user's workspace would be impossibly expensive and perpetually stale. Instead, Notion's AI operates over the current state of the user's pages at inference time, injecting relevant blocks into context. The model itself is a managed API (GPT-4 family). Freshness over style is the correct trade-off here — Notion's users expect factual answers about their own documents, not a house writing style.
GitHub Copilot uses a layered approach that combines both techniques. The core code completion model was fine-tuned on a massive corpus of public code, which gave it deep competence in code patterns, idioms, and API conventions — behavioral knowledge that cannot be effectively retrieved. On top of that, Copilot's newer repository-aware features inject current code from open files and the local repository as context at suggestion time, which is textbook RAG. Neither approach alone would work: retrieval without domain-adapted behavior produces generic completions; fine-tuning without retrieval cannot see the user's actual code.
Stripe's support bot illustrates the combined architecture that most internal-facing enterprise assistants eventually converge on. The model was fine-tuned on historical support transcripts to internalize the company's support voice, escalation language, and troubleshooting patterns — all stable behavioral knowledge. Current product documentation, pricing tables, and API changelog notes are injected via RAG. This separation is deliberate: the fine-tuned model handles how to respond, the RAG layer handles what is currently true.
BloombergGPT went the other direction: heavyweight domain training on a roughly 700-billion-token corpus, about half of it proprietary financial text, with no retrieval. The goal was domain-specific reasoning — understanding the implicit relationships between financial entities, the conventions of earnings call transcripts, the meaning of regulatory language — not just access to current financial data. Bloomberg's terminal already provides current data through structured queries; what they needed was a model that could reason about that data the way a trained financial analyst would. Deep training on domain text, not RAG, is the right tool for internalizing complex domain reasoning patterns.
Cursor (the AI code editor) shows how RAG architecture can substitute for some fine-tuning needs. Instead of fine-tuning on each codebase, Cursor indexes the current repository at session time and retrieves the most relevant files and functions as context for each suggestion. For code style and project-specific conventions, it relies on explicit context injection rather than baked-in training. This makes the tool immediately useful on any codebase without per-project training, at the cost of some inference latency and a hard cap on how much codebase context fits in a single prompt.
🧭 Nine-Question Decision Checklist
Answer each question. Follow the arrows to a recommendation.
- Does your knowledge base change more often than once per month? → Yes → RAG. No → Continue.
- Must every answer be traceable to a source document? → Yes → RAG. No → Continue.
- Is the problem about tone, writing style, or output format? → Yes → Fine-Tuning. No → Continue.
- Do you have more than 500 high-quality labeled examples? → No → RAG (fine-tuning on fewer examples usually hurts). Yes → Continue.
- Is retrieval latency acceptable in your product? → No → Fine-Tuning. Yes → Continue.
- Does your team have GPU infrastructure for training and serving? → No → RAG. Yes → Continue.
- Does the model need to understand domain-specific terminology as syntax (e.g., legal clause types, financial instrument names)? → Yes → Fine-Tuning. No → Continue.
- Do you need both current facts AND consistent style? → Yes → Both. No → Continue.
- Are you building a production assistant that handles sensitive, regulated, or legally significant content? → Yes → Both (RAG for grounding, fine-tuning for compliance tone). No → RAG (the safer starting point).
Recommendation tiers:
- RAG only: Questions 1, 2, 4 (No), 5 (No), or 6 (No) triggered.
- Fine-Tuning only: Questions 3 or 7 triggered, and questions 1 and 2 did not trigger.
- Both: Questions 8 or 9 triggered, or you reached question 8 with no hard stops.
- Start with RAG and add fine-tuning later: When in doubt. RAG gives you fast feedback on what the model is getting wrong. Those failure patterns become your fine-tuning training signal.
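The checklist collapses naturally into a short function. This is a sketch of the same early-exit logic, not a formal spec; the parameter names are invented here and mirror questions 1 through 9 in order:

```python
def recommend(changes_monthly: bool, needs_citations: bool,
              style_problem: bool, has_500_examples: bool,
              latency_ok: bool, has_gpu: bool,
              jargon_as_syntax: bool, needs_facts_and_style: bool,
              regulated: bool) -> str:
    """Walk the nine-question checklist top to bottom, stopping at the first trigger."""
    if changes_monthly or needs_citations:   # Q1, Q2
        return "RAG"
    if style_problem:                        # Q3
        return "Fine-Tuning"
    if not has_500_examples:                 # Q4: too few examples to fine-tune safely
        return "RAG"
    if not latency_ok:                       # Q5: no room for a retrieval round-trip
        return "Fine-Tuning"
    if not has_gpu:                          # Q6
        return "RAG"
    if jargon_as_syntax:                     # Q7
        return "Fine-Tuning"
    if needs_facts_and_style or regulated:   # Q8, Q9
        return "Both"
    return "RAG"  # the safer starting point when nothing triggers
```

The ordering matters: freshness and auditability (questions 1 and 2) short-circuit everything else, which encodes the article's core rule that facts belong in retrieval, not weights.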
🧪 Worked Examples: Building a RAG Chain and Fine-Tuning Mistral-7B with LoRA
The two examples below demonstrate the practical shape of each approach. They are meant to be readable as architecture blueprints — the specific library calls matter less than the pattern each one enacts.
Example 1: A RAG Pipeline Over Internal Documentation (LangChain + Chroma)
This pipeline indexes a list of document strings, stores their embeddings in Chroma, and wraps everything in a LangChain LCEL chain that retrieves the five most relevant chunks before sending them to GPT-4o-mini. The RecursiveCharacterTextSplitter handles chunking, respecting paragraph and sentence boundaries before falling back to character splits.
Notice the RunnablePassthrough on the question branch — it passes the raw query string through unchanged to the prompt, while the retriever branch fetches and joins the relevant chunks. This is the standard LCEL RAG pattern and is trivially composable with re-rankers, conversation memory, or output parsers.
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def build_rag_pipeline(docs: list[str]) -> object:
    """Build a RAG chain over a list of document strings.

    Chunks each document, embeds with text-embedding-3-small,
    and wraps in an LCEL chain that retrieves k=5 before answering.
    """
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.create_documents(docs)
    vectorstore = Chroma.from_documents(
        chunks,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
        collection_name="internal-docs",
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context: {context}\n\n"
        "Question: {question}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def format_docs(docs):
        return "\n\n".join(d.page_content for d in docs)

    return (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )


# Usage
chain = build_rag_pipeline([
    "Our refund policy allows 30-day no-questions-asked returns on all products.",
    "Premium tier subscribers receive priority support with a 2-hour SLA.",
    "API rate limits are 1,000 requests per minute for standard accounts.",
])
print(chain.invoke("What is the refund policy?"))
# Output: "You can return any product within 30 days, no questions asked."
```
To add hybrid retrieval (BM25 + dense), replace the Chroma retriever with a BM25Retriever from langchain-community and an EnsembleRetriever that combines both with equal weight. To add re-ranking, wrap the ensemble retriever with a ContextualCompressionRetriever and a CrossEncoderReranker using cross-encoder/ms-marco-MiniLM-L6-v2.
Example 2: Fine-Tuning Mistral-7B with LoRA via HuggingFace PEFT
This snippet fine-tunes Mistral-7B-Instruct with QLoRA (4-bit quantized base + BF16 adapter training). The key parameters to understand are r=16 (adapter rank — higher means more capacity but more parameters), lora_alpha=32 (scaling factor, keep at 2× rank), and target_modules (which linear layers receive adapters).
The model.print_trainable_parameters() call is not cosmetic — it confirms that you are training approximately 0.1% of total parameters. If the number is 100%, your PEFT config was not applied correctly and you are performing full fine-tuning on a quantized model, which will produce poor results.
```python
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"


def fine_tune_with_lora(examples: list[dict]) -> None:
    """Fine-tune Mistral-7B with QLoRA on instruction-following examples.

    Each example must have:
      - "instruction": str (the user prompt)
      - "response": str (the expected model output)

    After training, the LoRA adapter is saved to ./mistral-lora-adapter.
    Merge with the base model using `model.merge_and_unload()` before serving.
    """
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    def format_example(ex: dict) -> dict:
        # Mistral instruct format: [INST] user [/INST] assistant
        return {
            "text": f"<s>[INST] {ex['instruction']} [/INST] {ex['response']} </s>"
        }

    dataset = Dataset.from_list([format_example(e) for e in examples])
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        load_in_4bit=True,  # QLoRA: freeze base in NF4 quantization
        device_map="auto",
    )
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                 # Rank: higher = more capacity, more trainable params
        lora_alpha=32,        # Scaling factor; keep at 2x r
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # Attention projections only
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Expected: trainable params: ~8M || all params: ~7.24B || trainable%: 0.11%

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text",
        args=TrainingArguments(
            output_dir="./mistral-lora-checkpoints",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,  # Effective batch size: 16
            learning_rate=2e-4,
            fp16=True,
            logging_steps=10,
            save_strategy="epoch",
        ),
    )
    trainer.train()
    model.save_pretrained("./mistral-lora-adapter")
    print("Adapter saved. Merge with base model before high-throughput serving.")
```
After saving, merge the adapter for serving with model = model.merge_and_unload() — this folds the LoRA matrices back into the base weight matrices and produces a standard HuggingFace model that can be served with vLLM at full throughput without any adapter overhead.
🛠️ LlamaIndex and HuggingFace PEFT: The OSS Stack That Powers Both Paths
LlamaIndex: A Higher-Level RAG Abstraction
LlamaIndex (formerly GPT Index) is a data framework designed specifically for connecting LLMs to external data sources. Where LangChain gives you building blocks (splitters, retrievers, chains), LlamaIndex gives you a higher-level abstraction: a VectorStoreIndex that manages chunking, embedding, indexing, and retrieval behind a single QueryEngine interface.
Its NodePostprocessors API lets you attach re-rankers, metadata filters, and context compressors to the query pipeline without manually wiring them together. For teams that want production-grade RAG with less plumbing, LlamaIndex converges faster.
The snippet below builds a RAG query engine from a directory of PDFs — the most common enterprise use case — in under ten lines:
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load all PDFs from ./docs (supports .pdf, .txt, .md, .docx)
documents = SimpleDirectoryReader("./docs").load_data()

# Chunk into 512-token nodes with 64-token overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# Build the vector index (embeds and indexes all nodes)
index = VectorStoreIndex(nodes)

# Create a query engine with top-5 retrieval
query_engine = index.as_query_engine(similarity_top_k=5)

# Query
response = query_engine.query("What is our refund policy?")
print(response)
```
LlamaIndex integrates natively with Chroma, Pinecone, Weaviate, Qdrant, and dozens of other vector stores. For persistence across sessions, pass a StorageContext to the index constructor rather than rebuilding from documents on every startup.
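Building on the snippet above, the persist-and-reload flow looks roughly like this. This is a sketch, not a complete program: the directory path is a placeholder, and `index` is the `VectorStoreIndex` built earlier.

```python
from llama_index.core import StorageContext, load_index_from_storage

# First run: persist the built index (vectors, nodes, metadata) to disk
index.storage_context.persist(persist_dir="./storage")

# Later runs: reload from disk instead of re-embedding every document
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)
```

Re-embedding a large corpus on every process start is both slow and costly, so this reload path is usually the first thing to add once the prototype works.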
For a full deep-dive on production RAG patterns with LangChain and Chroma, see LangChain RAG: Retrieval-Augmented Generation in Practice.
HuggingFace PEFT: The Unified Interface for Parameter-Efficient Fine-Tuning
HuggingFace PEFT (Parameter-Efficient Fine-Tuning) is the library that makes LoRA, QLoRA, prefix tuning, and IA3 adapters first-class citizens in the HuggingFace ecosystem. The API is consistent across adapter types: LoraConfig, PrefixTuningConfig, and IA3Config all follow the same get_peft_model(base_model, config) pattern, making it easy to experiment with different methods without rewriting your training loop.
The companion library TRL (trl) provides SFTTrainer for supervised fine-tuning on formatted instruction datasets, DPOTrainer for Direct Preference Optimization, and PPOTrainer for RLHF. Together, PEFT and TRL cover the full fine-tuning spectrum from simple style adaptation (SFT with LoRA) to complex behavioral alignment (DPO or PPO).
Key operational note: when using QLoRA (load_in_4bit=True), also import BitsAndBytesConfig and explicitly set bnb_4bit_quant_type="nf4" and bnb_4bit_compute_dtype=torch.bfloat16. The default FP4 quantization produces noticeably worse results than NF4, which is designed for the roughly normal distribution of language model weights.
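As a sketch, that configuration looks roughly like this; the model name is illustrative, and actually loading it requires a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization config as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # explicitly NF4, not the FP4 default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True,         # optional: also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```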
For a full deep-dive on LoRA and QLoRA including rank selection, merge strategies, and serving with vLLM, see Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF.
📚 Six Lessons from Teams That Got This Wrong First
Chunking strategy causes more RAG failures than any other component. Teams blame the LLM when recall is low. The problem is almost always chunk size, overlap, or the absence of a re-ranker. Fix the chunking and retrieval before touching the prompt or the model.
Hybrid retrieval (BM25 + dense) consistently outperforms pure semantic search for technical documents. Product names, API method names, version numbers, and error codes do not embed well — they look semantically similar to many other strings. BM25 catches exact matches that dense retrieval misses.
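The fusion step of hybrid retrieval is commonly implemented with Reciprocal Rank Fusion (RRF), which needs only the ranked ID lists from each retriever. A minimal, dependency-free sketch; the document IDs are illustrative, and in a real pipeline the two input rankings would come from BM25 and the vector store:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 catches the exact-match hit ("err-429") that dense retrieval ranked low
bm25_hits = ["err-429", "rate-limits", "faq"]
dense_hits = ["rate-limits", "pricing", "err-429"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # documents appearing high in both lists rise to the top
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.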
Fine-tuning with fewer than 500 examples usually hurts more than it helps. With fewer examples, the model memorizes training phrasing rather than generalizing. If you are below this threshold, use few-shot prompting or DPO pairs instead. DPO with 200 high-quality preference pairs often beats SFT on 1,000 raw completions.
A CrossEncoder re-ranker doubles RAG precision with minimal latency cost. Adding cross-encoder/ms-marco-MiniLM-L6-v2 as a post-retrieval re-ranker on top-20 results is consistently one of the highest-ROI improvements in a RAG pipeline. It adds 50–100ms per query but routinely cuts irrelevant context by half.
Combine strategically: fine-tune for tone and style, use RAG for facts. The two approaches are not in competition. The most reliable production pattern is a fine-tuned model that has internalized domain voice and output format, with a RAG layer that grounds its answers in current knowledge. Never use fine-tuning to inject facts that change; they will be stale before training finishes.
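The rerank step described above has a simple shape. In the sketch below the scorer is a deliberate placeholder: in a real pipeline it would be sentence-transformers' CrossEncoder loaded with the model named above, scoring (query, passage) pairs jointly.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each (query, candidate) pair and keep the best top_k.

    score_fn stands in for a CrossEncoder's predict call, which
    scores each query-passage pair in its own forward pass.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Toy scorer: token overlap between query and passage (placeholder only)
def overlap_score(query, passage):
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

top20 = ["refunds are issued within 14 days",
         "our office is closed on holidays",
         "refund requests require an order id"]
print(rerank("how do I get a refund", top20, overlap_score, top_k=2))
```

The pattern is always the same: retrieve wide (top-20), score pairs with the heavier model, keep narrow (top-5) for the prompt.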
Build your evaluation set before you build your system. A 50-question golden set with human-verified answers lets you measure your baseline before spending GPU hours or building pipelines. Teams that skip this step spend weeks optimizing a metric they never defined.
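A minimal shape for such a harness is sketched below. The `answer_fn` stub and the substring-matching rule are placeholders; real evaluation of factual accuracy typically uses an LLM judge or human grading rather than substring checks.

```python
def evaluate(golden_set, answer_fn):
    """Run each golden question through the system and score fact hits.

    golden_set: list of {"question": ..., "must_contain": ...} dicts.
    answer_fn:  the system under test (RAG pipeline, fine-tuned model, ...).
    Returns accuracy as a fraction, so baselines stay comparable run to run.
    """
    hits = sum(
        1 for item in golden_set
        if item["must_contain"].lower() in answer_fn(item["question"]).lower()
    )
    return hits / len(golden_set)

golden = [
    {"question": "What is the refund window?", "must_contain": "14 days"},
    {"question": "Which tier includes SSO?", "must_contain": "Enterprise"},
]

# Stub standing in for the real pipeline under test
def stub_system(question):
    return "Refunds are accepted within 14 days of purchase."

print(evaluate(golden, stub_system))  # 1 of 2 checks pass -> 0.5
```

Running this same harness against the no-RAG baseline, the RAG-only system, and the combined system is what turns "it feels better" into a defensible shipping decision.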
📌 TLDR: The One-Page Decision Cheat Sheet
RAG is the right choice when your data changes, when answers must be traceable to source documents, and when your team lacks ML training infrastructure. It delivers fast time-to-value, cheap updates, and grounded responses — at the cost of retrieval latency and quality that depends entirely on retrieval quality.
Fine-Tuning is the right choice when you need consistent style, domain reasoning, or output format that cannot be reliably controlled through prompting. It eliminates retrieval latency and bakes in behavioral consistency — at the cost of training time, GPU budget, and rapid staleness for any factual knowledge you try to embed.
Both is the right architecture for production assistants that need to be simultaneously authoritative in style and accurate on current facts. Fine-tune the model to know how to respond; use RAG to ensure it knows what is currently true.
The one rule that saves the most time: Never fine-tune to inject facts. Facts change. Weights do not update themselves. RAG exists precisely to solve the freshness problem that fine-tuning cannot.
📝 Practice Quiz: Test Your Decision Reasoning
- A company builds a chatbot that answers questions about their product catalog. New products are added weekly. Which approach is most appropriate?
A) Fine-tuning on the catalog B) RAG over a product database C) Full fine-tuning with RLHF D) Prefix tuning
Correct Answer: B. Fine-tuning would produce a model with stale product knowledge days after training. RAG indexes the current catalog and retrieves live information at query time.
- A legal AI team fine-tunes a model on 10,000 contract examples and reports excellent results on their test set. In production, the model paraphrases queries slightly differently and precision collapses. What is the most likely root cause?
A) The embedding model is mismatched B) The model has overfit to the training phrasing — too few examples or too high a learning rate C) The vector store index is using flat search D) The system prompt is too long
Correct Answer: B. This is the classic fine-tuning overfitting pattern. The model memorized training phrasing rather than generalizing. Mitigation: augment training examples with paraphrases and reduce the learning rate.
- A team wants to fine-tune Mistral-7B but only has access to a single 24GB GPU. Which technique makes this feasible?
A) Full fine-tuning with gradient checkpointing B) QLoRA — 4-bit quantized base model with BF16 LoRA adapters C) Prefix tuning on all transformer layers D) LoRA with rank r=256
Correct Answer: B. QLoRA quantizes the frozen base model to 4 bits (reducing memory ~4×), trains small LoRA adapters in BF16, and fits fine-tuning of a 7B model comfortably within 24GB. Full fine-tuning at FP16 requires roughly 60GB for a 7B model including optimizer states.
- You add a CrossEncoder re-ranker to a RAG pipeline that was previously returning top-20 vector similarity results and selecting the top-5 for the prompt. What is the most accurate description of what the re-ranker does?
A) It embeds the query and documents using a bi-encoder and recomputes cosine similarity B) It generates a summary of each retrieved chunk before scoring C) It performs a full cross-attention pass between the query and each candidate document, producing a more accurate relevance score than vector similarity alone D) It filters out chunks shorter than a minimum length threshold
Correct Answer: C. A CrossEncoder re-ranker attends to both the query and each document jointly in a single forward pass, capturing fine-grained token-level interactions that embedding similarity (which encodes query and document independently) misses. This substantially improves precision at the cost of additional latency.
- Open-ended: Your company's internal HR chatbot answers questions about leave policies, benefits, and compliance procedures. Policies change quarterly. Users report that responses feel generic and impersonal — the tone does not match your company's internal communication style. Design a solution using both RAG and fine-tuning. Explain what each component handles, how you would structure the training data, and how you would evaluate the combined system.
Correct Answer: Fine-tune a base model on a curated set of historical HR communications — internal announcements, policy summaries written by the HR team, and high-quality support ticket responses — formatted as instruction-following triples. This teaches the model the company's internal voice, the degree of formality expected, and the phrasing conventions for sensitive HR topics. Use LoRA with r=16 to limit the parameter footprint and preserve the base model's general reasoning capabilities. Do not include policy facts in the training data — inject them at inference time via RAG instead.
For the RAG layer: chunk the HR policy documents at 256–512 tokens using a recursive splitter. Re-index on a quarterly schedule aligned to policy review cycles, or trigger re-indexing via a CI/CD pipeline when policy documents are updated in the source-of-truth system. Use hybrid retrieval (dense + BM25) to catch exact policy names and section numbers alongside semantic similarity.
Evaluation: build a 50-question golden set spanning three categories — factual accuracy (does the answer match the current policy?), tone adherence (does a blind human evaluator rate the response as matching the company voice?), and completeness (does the answer cover all relevant policy points for the question?). Measure each category separately against baseline (no fine-tuning, no RAG), RAG-only, and fine-tune-only to confirm that the combined system outperforms both individual approaches before shipping.
🔗 Related Posts
- Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF — companion deep-dive on LoRA, QLoRA, SFT, and RLHF implementation
- LangChain RAG: Retrieval-Augmented Generation in Practice — end-to-end RAG pipeline with LangChain, Chroma, and production patterns
- Build vs Buy LLM: Self-Host vs API — choosing between managed API LLMs and self-hosted deployments
- AI Agents Explained: When LLMs Start Using Tools — how RAG integrates with agentic architectures
- LLM Skill Registry: Routing and Evaluation for Production Agents — routing and evaluation patterns that apply to both RAG and fine-tuned components
- Headless Agents: Deploying Skills as MCP Server — productionizing LLM skills including RAG-backed knowledge tools

Written by
Abstract Algorithms
@abstractalgorithms