RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)
A practical decision guide with Python code for both paths — choose the right approach before you spend weeks building the wrong one.
TLDR: RAG gives LLMs access to current knowledge at inference time; fine-tuning changes how they reason and write. Use RAG when your data changes. Use fine-tuning when you need consistent style, tone, or domain reasoning. Use both for production assistants. Never fine-tune to inject facts that will be stale before you finish training.
📖 The Two Teams That Built the Wrong Thing
Six months ago, Team A finished a RAG pipeline to answer questions about their internal documentation. They celebrated the launch and closed the sprint. Two months into production, the numbers told a different story: recall was stuck at 60%, the LLM was writing responses in a tone that made every answer sound like a Wikipedia summary, and it kept confusing internal product terminology with generic industry terms. Engineers blamed the vector store. The tech lead blamed the embedding model. The real culprit was the decision to use RAG to fix a problem that RAG cannot solve.
Around the same time, Team B decided to fine-tune a model on three years of support tickets. The result looked impressive in evaluation: the model used the correct product names, matched the support team's voice exactly, and confidently handled multi-step troubleshooting flows. Then a product release changed two pricing tiers and deprecated a feature. The fine-tuned model kept referencing the old tiers with complete confidence. Users filed tickets about advice that contradicted the website. Retraining took eleven days and $4,000 in GPU costs — and by the time it was done, two more features had changed.
Both teams picked the wrong tool. What makes this painful is that their failure modes looked almost identical on the surface: the model gave wrong answers. The root causes were completely different.
This is the central trap with RAG and fine-tuning: they both improve LLM output quality, but for fundamentally different reasons. Understanding that distinction — clearly, with real criteria, before you start building — is what separates teams that ship working AI systems from teams that spend months chasing the wrong problem.
🔍 Why RAG and Fine-Tuning Solve Different Problems
To use either tool well, you need to understand what each one actually changes about an LLM.
RAG (Retrieval-Augmented Generation) is a runtime architecture, not a training technique. The model's weights are never touched. Instead, at inference time, a retrieval layer fetches relevant documents or passages and injects them into the prompt as context. The model reads this context and uses it to answer. If the context is accurate and relevant, the answer will be grounded in it. If the context is stale or missing, the model falls back to its pretraining knowledge — with no retrieval signal that anything went wrong.
Think of RAG as giving a surgeon the patient's chart right before the operation. The surgeon's skills, training, and judgment stay exactly as they were. What changes is the specific, current information they have access to in the moment.
Fine-tuning modifies the model's weights using a targeted dataset. After fine-tuning, the model has changed at a fundamental level: its internal representations, its preferred output formats, its vocabulary for a domain, and its reasoning patterns all reflect the training data. You are not giving the model a reference — you are changing how it thinks and writes, permanently (or until the next training run).
Think of fine-tuning as sending that same surgeon to a specialty residency. The operations they perform after the residency reflect deeply internalized expertise — not a reference manual they read that morning.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| What changes | Nothing in the model | The model's weights |
| When the change applies | Every inference (new context) | Permanently, until retrained |
| Best for | Providing current facts, domain documents | Changing reasoning style, tone, terminology |
| Data freshness | Always current (re-index documents) | Frozen at training time |
| Failure mode | Retrieval miss → model hallucinates | Stale facts baked into weights |
| Setup cost | Hours to days | Days to weeks |
| Cost to update | Re-index changed documents | Retrain (compute + time) |
The confusion between them stems from the fact that both can make a model give better answers. But they do so by different mechanisms. Using RAG to "teach" terminology does not work because RAG only provides context — if the model does not understand the terminology in the first place, injected context about it will not reliably change how the model interprets it. Using fine-tuning to keep facts current is expensive and self-defeating because the model's weights go stale the moment the underlying reality changes.
⚙️ Six Signals That Tell You Which Tool to Reach For
Before writing a single line of code, apply this scoring table. Rate each factor for your situation. Whichever column accumulates more check marks is the starting recommendation.
| Factor | RAG is the better choice | Fine-Tuning is the better choice |
|---|---|---|
| Data freshness | Data changes weekly or daily | Knowledge domain is stable for months |
| Knowledge type | Facts, documents, policies, product specs | Reasoning style, tone, output format, domain jargon as syntax |
| Data volume | Any size (chunked at indexing time) | More than 500 high-quality labeled examples |
| Latency budget | Can absorb 100–500ms retrieval round-trip | Need sub-500ms with no external dependencies |
| Infrastructure | Vector DB + embedding model (no GPU) | GPU for training and for serving the fine-tuned model |
| Auditability need | Must cite sources; answers must be traceable | Behavior audit (did the model respond correctly?), not source audit |
Three clear decision tiers emerge from this table:
RAG only: Your knowledge is external and changes frequently. Every answer needs to be citable. Your team has no ML infrastructure and does not want to build it. Most internal documentation chatbots, customer-facing FAQ systems, and support bots with large ever-changing knowledge bases belong here.
Fine-tuning only: You need the model to consistently write in a specific style or follow a specific output format. Retrieval latency is not acceptable. The domain knowledge is stable and well-represented in your training data. Legal clause extraction, medical note formatting, and code generation with company-specific API conventions are common examples.
Both — the production-grade path: You need the model to have correct, current facts AND to respond in a consistent style or domain voice. Most mature AI assistants land here. Fine-tuning handles tone, format, and terminology; RAG handles grounding in current facts.
🧠 Deep Dive: How RAG and Fine-Tuning Work Under the Hood
Both approaches improve LLM output, but through completely different mechanisms. Understanding the internal machinery of each — before choosing — prevents the most expensive architectural mistakes.
RAG Internals: From Chunking to Re-Ranking
Most teams treat RAG as a solved problem after they wire up an embedding model and a vector store. Then they wonder why recall is 60%. The reality is that chunking strategy is where 80% of RAG failures originate — not the LLM, not the embedding model, and usually not the vector store.
Chunking: Where most RAG pipelines break. A chunk is the unit of text that gets embedded and stored. If your chunks are too large, the embedding averages over too much content and loses precision — the top-k results will contain the right information buried in noise, but the model may not extract it correctly. If your chunks are too small, you lose the surrounding context that makes a passage meaningful, and the model gets fragments that do not stand alone.
Three chunking strategies dominate production systems:
- Fixed-size chunking (e.g., 512 tokens with 64-token overlap) is the simplest and fastest. It works well for uniform-structure documents but breaks mid-sentence and mid-concept on prose or technical specs.
- Recursive character text splitting breaks at paragraph boundaries first, then sentence boundaries, then word boundaries, degrading gracefully until the chunk falls within the size limit. This is the right default for most text.
- Semantic chunking embeds every sentence and merges adjacent sentences when their embeddings are similar, splitting when similarity drops. It produces semantically coherent chunks but requires an embedding call per sentence during indexing.
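The recursive strategy is simple enough to sketch in a few lines. The following is a minimal illustration of the idea, not a drop-in replacement for a library splitter: the size limit here is measured in characters rather than tokens, the separator list is an assumption, and overlap is omitted for brevity.

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split at the coarsest boundary that yields chunks under max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_len:
                        # Part is still too long: recurse with finer separators
                        chunks.extend(recursive_split(part, max_len, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator produced a split: fall back to a hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

The graceful degradation is the point: paragraph boundaries are tried first, and the hard character split at the bottom only fires when no natural boundary exists.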
Embedding model choice matters more than most teams realize. The embedding model converts your chunks and queries into dense vectors. The distance between these vectors determines retrieval quality. text-embedding-3-large from OpenAI produces high-quality embeddings but adds cost and API latency. BAAI/bge-m3 is a strong open-weight alternative with multilingual support. nomic-embed-text runs locally and produces surprisingly competitive results for English-only knowledge bases. Mismatched embedding models — using one model to index and a different model to query — are a silent failure mode that produces nonsensical retrieval results.
Vector store and retrieval strategy. A vector store indexes your chunk embeddings and performs approximate nearest-neighbor search at query time. The index algorithm matters: HNSW (Hierarchical Navigable Small World) is the de facto standard for production deployments — it achieves sub-10ms retrieval over millions of vectors with configurable precision/recall trade-offs. Flat (exact) search is only viable at small scale.
But vector similarity alone — also called dense retrieval — is not always best. For technical documentation with precise terminology (model names, version numbers, function signatures), sparse retrieval (BM25 keyword matching) often outperforms dense retrieval because exact keyword matches beat semantic generalization. Hybrid retrieval fuses both signals using Reciprocal Rank Fusion (RRF), and it consistently beats either approach alone for mixed-content knowledge bases.
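Reciprocal Rank Fusion itself is only a few lines. A minimal sketch (the k=60 smoothing constant is the value commonly cited from the original RRF paper; the document IDs and rankings below are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs via Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every list it
    appears in, so items that rank well in several channels rise.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector-similarity ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 ranking
print(rrf_fuse([dense, sparse]))
# doc_a and doc_b appear in both channels, so they outrank doc_c and doc_d
```

Because RRF operates on ranks rather than raw scores, it needs no score normalization between the dense and sparse channels, which is exactly why it is the standard fusion step.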
Re-ranking: the cheap precision multiplier. After retrieving top-k candidates, a cross-encoder re-ranker re-scores each candidate against the original query using a full attention pass — much more expensive than embedding similarity but orders of magnitude more accurate. Running a cross-encoder/ms-marco-MiniLM-L6-v2 re-ranker on the top-20 vector results and keeping the top-5 routinely doubles precision at the cost of roughly 50–100ms of additional latency. For most production RAG systems, this is one of the highest-ROI improvements available.
The full pipeline order: query → embed → hybrid vector+BM25 search → top-20 candidates → CrossEncoder re-rank → top-5 context chunks → prompt assembly → LLM.
Fine-Tuning Internals: LoRA, QLoRA, and Weight Update Mechanics
Fine-tuning a large model from scratch on your dataset is called full fine-tuning: all parameters are updated during training. This produces the best possible specialization but is prohibitively expensive for models above 7B parameters. A 7B-parameter model at FP16 requires approximately 14GB just for weights; gradients and Adam optimizer states push total training memory to several times that before activations are even counted. For most teams, this is not practical.
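The arithmetic is worth making explicit. A rough back-of-envelope sketch, assuming mixed-precision training with the Adam optimizer (exact figures vary by implementation and sharding strategy):

```python
params = 7e9                         # 7B-parameter model
weights_fp16 = params * 2            # 2 bytes/param  -> ~14 GB
grads_fp16 = params * 2              # gradients at the same precision -> ~14 GB
adam_states_fp32 = params * 4 * 2    # momentum + variance at 4 bytes each -> ~56 GB

total_gb = (weights_fp16 + grads_fp16 + adam_states_fp32) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB
```

That is multiple A100s of memory before a single activation is stored, which is the pressure that makes parameter-efficient methods the default.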
Parameter-Efficient Fine-Tuning (PEFT) is the practical alternative. Instead of updating all weights, PEFT freezes the base model and trains only a small set of additional parameters. The dominant PEFT approach is LoRA.
LoRA: Low-Rank Adaptation. LoRA works by decomposing weight updates into two small matrices rather than updating the full weight matrix directly. For a pretrained weight matrix W of size d×k, LoRA adds a parallel update path: W + ΔW where ΔW = BA, B has shape d×r, and A has shape r×k, with rank r ≪ min(d, k). Only A and B are trained. The base model weights stay frozen. In practice, r=16 gives a good balance — it trains roughly 0.1–1% of total parameters while capturing most of the behavioral shift you want. Because B is initialized to zero, LoRA adapters have no effect at initialization and only diverge during training, which means the base model's general capability is preserved.
lora_alpha controls the scaling of the LoRA update: the effective update magnitude is (lora_alpha / r) * ΔW. Setting lora_alpha = 2 * r (e.g., alpha=32 when r=16) is a stable default. The target_modules parameter specifies which weight matrices receive LoRA adapters — for transformer models, the query and value projections (q_proj, v_proj) are the standard targets, though adding k_proj and o_proj improves results for complex reasoning tasks.
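The update mechanics can be sketched directly in NumPy. This illustrates the forward-pass math only, not the PEFT implementation; the shapes are toy-sized and the initialization scale is an assumption:

```python
import numpy as np

d, k, r = 64, 64, 16
alpha = 2 * r  # the stable default: lora_alpha = 2 * r

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init -> no effect at start

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight: W + (alpha / r) * B @ A; only A and B receive gradients
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal((1, k))
# Because B starts at zero, the adapted model matches the base model exactly
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialized B is what makes LoRA safe to attach: training starts from the base model's exact behavior and diverges only as far as the data pushes it.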
QLoRA: Fine-tuning 70B models on two A100s. QLoRA extends LoRA by quantizing the base model weights to 4 bits (using NF4 quantization, which is better suited to normally distributed weights than standard int4) before freezing them, then training LoRA adapters in BF16. The quantized base model reduces memory usage by roughly 4×. A 70B-parameter model that would require eight A100s for standard LoRA now fits on two. The tradeoff: quantization adds a small inference overhead and introduces quantization error in the frozen base — negligible for style adaptation tasks, occasionally noticeable for precise numerical reasoning.
Training data format: instruction triples, not raw text. Fine-tuning on raw completions is the most common beginner mistake. Models trained on raw text learn to continue text — they do not learn to follow instructions. Instruction-tuning formats every training example as a system/user/assistant triple, teaching the model to respond to directives rather than continue a document. For behavior alignment, Direct Preference Optimization (DPO) pairs each prompt with a preferred and rejected response — 200 high-quality preference pairs routinely outperform 2,000 raw SFT examples.
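Concretely, the two data shapes look like this. The field names follow the common chat and DPO conventions used by TRL-style trainers; the product name and content are purely illustrative:

```python
# Supervised fine-tuning (SFT): a system/user/assistant triple
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a support engineer for AcmeDB."},
        {"role": "user", "content": "How do I rotate the cluster TLS certs?"},
        {"role": "assistant", "content": "Run the cert rotation command, then restart each node."},
    ]
}

# Direct Preference Optimization (DPO): prompt plus preferred/rejected pair
dpo_example = {
    "prompt": "How do I rotate the cluster TLS certs?",
    "chosen": "Run the cert rotation command, then restart each node.",
    "rejected": "TLS certificates are used to encrypt traffic between nodes.",
}
```

Note what the DPO pair encodes that the SFT triple cannot: the rejected response is factually true but unhelpful, which is exactly the behavioral distinction preference pairs teach.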
Overfitting signals to watch: A diverging gap between training loss (still decreasing) and validation loss (increasing) is the textbook signal. More subtle: measuring the model on a held-out general benchmark like MMLU or HellaSwag — if the fine-tuned model loses more than 5–10 percentage points on general reasoning versus the base, the learning rate is too high or the dataset is too small to prevent catastrophic forgetting.
Performance Analysis: Speed, Cost, and Quality Across Both Approaches
The table below compares the operational profile of each strategy across the dimensions that matter most when choosing. Use it alongside the six-factor decision table above to build your recommendation.
| Metric | RAG only | Fine-Tuning only | RAG + Fine-Tuning |
|---|---|---|---|
| Time to first value | Hours (index + prompt) | Days to weeks (data prep + train) | Weeks |
| Cost to update | Re-index changed docs (cheap) | Retrain on GPU (expensive) | Re-index only |
| Recall on fresh data | High (if retrieved) | None (frozen weights) | High |
| Tone and style adherence | Low (prompt-dependent, inconsistent) | High (baked into weights) | High |
| Hallucination risk | Lower (grounded in context) | Higher (if undertrained) | Lowest |
| Latency overhead | +100–500ms retrieval round-trip | 0ms (no retrieval at inference) | +100–500ms |
| Infrastructure floor | Vector DB + embedding model | GPU for training + fine-tuned serving | Both |
📊 The RAG Pipeline Flow, Visualized
The following diagram traces a query through a production-grade RAG pipeline. The two retrieval branches (vector similarity and BM25) run in parallel, their results are fused and re-ranked, and only then does the LLM receive context.
```mermaid
graph TD
    A[User Query] --> B[Embed Query with Embedding Model]
    B --> C[Dense Vector Search in HNSW Index]
    A --> D[BM25 Sparse Keyword Search]
    C --> E[Merge Candidates via RRF Score Fusion]
    D --> E
    E --> F[CrossEncoder Re-Ranker]
    F --> G[Top-K Context Chunks Selected]
    G --> H[Prompt Assembly with System Instructions]
    H --> I[LLM Inference]
    I --> J[Grounded Response with Citations]
```
The diagram separates the two retrieval channels intentionally: dense search catches semantically similar passages even when query wording differs from document wording, while sparse BM25 catches exact product names, version numbers, and technical identifiers that embedding similarity tends to dilute. RRF fusion gives higher combined scores to documents that rank well in both channels, boosting precision without requiring a trained re-ranker at the fusion step.
📊 The LoRA Training and Serving Flow, Visualized
Training with LoRA involves freezing the base model and routing gradients only through the small adapter matrices. Serving can either merge the adapter weights back into the base model for zero-latency overhead, or keep them separate and apply them dynamically when multiple adapters need to share a single base model.
```mermaid
graph TD
    A[Instruction-Formatted Training Examples] --> B[Tokenize with Chat Template]
    B --> C[Load Base Model in 4-bit via QLoRA]
    C --> D[Attach LoRA Adapter Layers to q-proj and v-proj]
    D --> E[Forward Pass - Frozen Base plus Trainable Adapter]
    E --> F[Compute Causal LM Loss]
    F --> G[Backprop Through Adapter Only]
    G --> H[Checkpoint Adapter Weights]
    H --> I[Merge Adapter into Base Model]
    I --> J[Serve with vLLM or HuggingFace TGI]
```
The key insight from this flow is what does NOT appear in the gradient path: the frozen base model layers. Because the base model parameters never receive gradient updates, there is no risk of catastrophic forgetting in those layers — the forgetting risk comes entirely from the LoRA adapter influencing the activation patterns that flow through frozen weights. Keeping rank r small (8–16 for style adaptation, 32–64 for domain reasoning) limits the adapter's capacity to distort base knowledge.
⚖️ Seven Failure Modes: Symptoms, Root Causes, and Fixes
These are the most common production failures across both approaches, drawn from real deployment patterns.
| Symptom | Approach | Root Cause | Fix |
|---|---|---|---|
| Recall below 70%; model ignores documents in context | RAG | Chunk size too large (embeddings diluted) or retrieval returning semantically distant passages | Use recursive chunking at 256–512 tokens; add re-ranker |
| Model answers fluently but ignores retrieved context, falling back to stale training knowledge | RAG | Context is appended after a long system prompt; model's attention dilutes toward the end | Move context closer to the question; reduce system prompt length |
| Query latency spikes above 2 seconds at scale | RAG | Flat index — exact nearest-neighbor search is O(n) | Migrate to HNSW index; add caching for frequent queries |
| Fine-tuned model writes correctly but states facts that changed post-training | Fine-Tuning | Fine-tuning was used to inject facts, not just style | Never inject facts via fine-tuning; add RAG layer for current information |
| Model's general reasoning degrades after fine-tuning (MMLU drops 15%) | Fine-Tuning | Learning rate too high, causing catastrophic forgetting in shared representations | Reduce LR to 1e-4 or lower; use cosine warmup; increase regularization with LoRA dropout |
| Fine-tuned model overfits to training phrasing — brittleness on paraphrased inputs | Fine-Tuning | Training set too small (fewer than 500 examples) or too homogeneous in phrasing | Augment training examples with paraphrases; use DPO pairs instead of raw SFT completions |
| Combined RAG + fine-tuned model disagrees with itself — retrieved context contradicts baked-in training beliefs | Both | Fine-tuning introduced strong priors that override retrieved context | Re-prompt to explicitly instruct the model to prefer context over prior knowledge; reduce LoRA rank |
🌍 How Production Teams Actually Use Both: Five Real Deployments
Understanding where the boundary between RAG and fine-tuning falls in real systems prevents the most common architectural mistakes.
Notion AI is a canonical RAG-without-fine-tuning deployment. Every user's content changes constantly — new pages, edits, restructured databases. Fine-tuning a model on each user's workspace would be impossibly expensive and perpetually stale. Instead, Notion's AI operates over the current state of the user's pages at inference time, injecting relevant blocks into context. The model itself is a managed API (GPT-4 family). Freshness over style is the correct trade-off here — Notion's users expect factual answers about their own documents, not a house writing style.
GitHub Copilot uses a layered approach that combines both techniques. The core code completion model was fine-tuned on a massive corpus of public code, which gave it deep competence in code patterns, idioms, and API conventions — behavioral knowledge that cannot be effectively retrieved. On top of that, Copilot's newer repository-aware features inject current code from open files and the local repository as context at suggestion time, which is textbook RAG. Neither approach alone would work: retrieval without domain-adapted behavior produces generic completions; fine-tuning without retrieval cannot see the user's actual code.
Stripe's support bot illustrates the combined architecture that most internal-facing enterprise assistants eventually converge on. The model was fine-tuned on historical support transcripts to internalize the company's support voice, escalation language, and troubleshooting patterns — all stable behavioral knowledge. Current product documentation, pricing tables, and API changelog notes are injected via RAG. This separation is deliberate: the fine-tuned model handles how to respond, the RAG layer handles what is currently true.
BloombergGPT went the other direction: heavyweight domain training on a roughly 700-billion-token corpus, about half of it proprietary financial text, with no retrieval. The goal was domain-specific reasoning — understanding the implicit relationships between financial entities, the conventions of earnings call transcripts, the meaning of regulatory language — not just access to current financial data. Bloomberg's terminal already provides current data through structured queries; what they needed was a model that could reason about that data the way a trained financial analyst would. Deep training on domain text, not RAG, is the right tool for internalizing complex domain reasoning patterns.
Cursor (the AI code editor) shows how RAG architecture can substitute for some fine-tuning needs. Instead of fine-tuning on each codebase, Cursor indexes the current repository at session time and retrieves the most relevant files and functions as context for each suggestion. For code style and project-specific conventions, it relies on explicit context injection rather than baked-in training. This makes the tool immediately useful on any codebase without per-project training, at the cost of some inference latency and a hard cap on how much codebase context fits in a single prompt.
🧭 Nine-Question Decision Checklist
Answer each question. Follow the arrows to a recommendation.
- Does your knowledge base change more often than once per month? → Yes → RAG. No → Continue.
- Must every answer be traceable to a source document? → Yes → RAG. No → Continue.
- Is the problem about tone, writing style, or output format? → Yes → Fine-Tuning. No → Continue.
- Do you have more than 500 high-quality labeled examples? → No → RAG (fine-tuning on fewer examples usually hurts). Yes → Continue.
- Is retrieval latency acceptable in your product? → No → Fine-Tuning. Yes → Continue.
- Does your team have GPU infrastructure for training and serving? → No → RAG. Yes → Continue.
- Does the model need to understand domain-specific terminology as syntax (e.g., legal clause types, financial instrument names)? → Yes → Fine-Tuning. No → Continue.
- Do you need both current facts AND consistent style? → Yes → Both. No → Continue.
- Are you building a production assistant that handles sensitive, regulated, or legally significant content? → Yes → Both (RAG for grounding, fine-tuning for compliance tone). No → RAG (the safer starting point).
Recommendation tiers:
- RAG only: Questions 1, 2, 4 (No), 5 (No), or 6 (No) triggered.
- Fine-Tuning only: Questions 3 or 7 triggered, and questions 1 and 2 did not trigger.
- Both: Questions 8 or 9 triggered, or you reached question 8 with no hard stops.
- Start with RAG and add fine-tuning later: When in doubt. RAG gives you fast feedback on what the model is getting wrong. Those failure patterns become your fine-tuning training signal.
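The checklist collapses naturally into a short function. This is a sketch of the same early-exit logic, not a formal spec; the parameter names are invented here and mirror questions 1 through 9 in order:

```python
def recommend(changes_monthly: bool, needs_citations: bool,
              style_problem: bool, has_500_examples: bool,
              latency_ok: bool, has_gpu: bool,
              jargon_as_syntax: bool, needs_facts_and_style: bool,
              regulated: bool) -> str:
    """Walk the nine-question checklist top to bottom, stopping at the first trigger."""
    if changes_monthly or needs_citations:   # Q1, Q2
        return "RAG"
    if style_problem:                        # Q3
        return "Fine-Tuning"
    if not has_500_examples:                 # Q4: too few examples to fine-tune safely
        return "RAG"
    if not latency_ok:                       # Q5: no room for a retrieval round-trip
        return "Fine-Tuning"
    if not has_gpu:                          # Q6
        return "RAG"
    if jargon_as_syntax:                     # Q7
        return "Fine-Tuning"
    if needs_facts_and_style or regulated:   # Q8, Q9
        return "Both"
    return "RAG"  # the safer starting point when nothing triggers
```

The ordering matters: freshness and auditability (questions 1 and 2) short-circuit everything else, which encodes the article's core rule that facts belong in retrieval, not weights.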
🧪 Worked Examples: Building a RAG Chain and Fine-Tuning Mistral-7B with LoRA
The two examples below demonstrate the practical shape of each approach. They are meant to be readable as architecture blueprints — the specific library calls matter less than the pattern each one enacts.
Example 1: A RAG Pipeline Over Internal Documentation (LangChain + Chroma)
This pipeline indexes a list of document strings, stores their embeddings in Chroma, and wraps everything in a LangChain LCEL chain that retrieves the five most relevant chunks before sending them to GPT-4o-mini. The RecursiveCharacterTextSplitter handles chunking, respecting paragraph and sentence boundaries before falling back to character splits.
Notice the RunnablePassthrough on the question branch — it passes the raw query string through unchanged to the prompt, while the retriever branch fetches and joins the relevant chunks. This is the standard LCEL RAG pattern and is trivially composable with re-rankers, conversation memory, or output parsers.
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def build_rag_pipeline(docs: list[str]) -> object:
    """Build a RAG chain over a list of document strings.

    Chunks each document, embeds with text-embedding-3-small,
    and wraps in an LCEL chain that retrieves k=5 before answering.
    """
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.create_documents(docs)
    vectorstore = Chroma.from_documents(
        chunks,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
        collection_name="internal-docs",
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context: {context}\n\n"
        "Question: {question}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def format_docs(docs):
        return "\n\n".join(d.page_content for d in docs)

    return (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )


# Usage
chain = build_rag_pipeline([
    "Our refund policy allows 30-day no-questions-asked returns on all products.",
    "Premium tier subscribers receive priority support with a 2-hour SLA.",
    "API rate limits are 1,000 requests per minute for standard accounts.",
])
print(chain.invoke("What is the refund policy?"))
# Output: "You can return any product within 30 days, no questions asked."
```
To add hybrid retrieval (BM25 + dense), replace the Chroma retriever with a BM25Retriever from langchain-community and an EnsembleRetriever that combines both with equal weight. To add re-ranking, wrap the ensemble retriever with a ContextualCompressionRetriever and a CrossEncoderReranker using cross-encoder/ms-marco-MiniLM-L6-v2.
Example 2: Fine-Tuning Mistral-7B with LoRA via HuggingFace PEFT
This snippet fine-tunes Mistral-7B-Instruct with QLoRA (4-bit quantized base + BF16 adapter training). The key parameters to understand are r=16 (adapter rank — higher means more capacity but more parameters), lora_alpha=32 (scaling factor, keep at 2× rank), and target_modules (which linear layers receive adapters).
The model.print_trainable_parameters() call is not cosmetic — it confirms that you are training approximately 0.1% of total parameters. If the number is 100%, your PEFT config was not applied correctly and you are performing full fine-tuning on a quantized model, which will produce poor results.
```python
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"


def fine_tune_with_lora(examples: list[dict]) -> None:
    """Fine-tune Mistral-7B with QLoRA on instruction-following examples.

    Each example must have:
      - "instruction": str (the user prompt)
      - "response": str (the expected model output)

    After training, the LoRA adapter is saved to ./mistral-lora-adapter.
    Merge with the base model using `model.merge_and_unload()` before serving.
    """
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    def format_example(ex: dict) -> dict:
        # Mistral instruct format: [INST] user [/INST] assistant
        return {
            "text": f"<s>[INST] {ex['instruction']} [/INST] {ex['response']} </s>"
        }

    dataset = Dataset.from_list([format_example(e) for e in examples])
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        load_in_4bit=True,  # QLoRA: freeze base in NF4 quantization
        device_map="auto",
    )
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                 # Rank: higher = more capacity, more trainable params
        lora_alpha=32,        # Scaling factor; keep at 2x r
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # Attention projections only
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Expected: trainable params: ~8M || all params: ~7.24B || trainable%: 0.11%

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text",
        args=TrainingArguments(
            output_dir="./mistral-lora-checkpoints",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,  # Effective batch size: 16
            learning_rate=2e-4,
            fp16=True,
            logging_steps=10,
            save_strategy="epoch",
        ),
    )
    trainer.train()
    model.save_pretrained("./mistral-lora-adapter")
    print("Adapter saved. Merge with base model before high-throughput serving.")
```
After saving, merge the adapter for serving with model = model.merge_and_unload() — this folds the LoRA matrices back into the base weight matrices and produces a standard HuggingFace model that can be served with vLLM at full throughput without any adapter overhead.
🛠️ LlamaIndex and HuggingFace PEFT: The OSS Stack That Powers Both Paths
LlamaIndex: A Higher-Level RAG Abstraction
LlamaIndex (formerly GPT Index) is a data framework designed specifically for connecting LLMs to external data sources. Where LangChain gives you building blocks (splitters, retrievers, chains), LlamaIndex gives you a higher-level abstraction: a VectorStoreIndex that manages chunking, embedding, indexing, and retrieval behind a single QueryEngine interface.
Its NodePostprocessors API lets you attach re-rankers, metadata filters, and context compressors to the query pipeline without manually wiring them together. For teams that want production-grade RAG with less plumbing, LlamaIndex converges faster.
The snippet below builds a RAG query engine from a directory of PDFs — the most common enterprise use case — in under ten lines:
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load all PDFs from ./docs (supports .pdf, .txt, .md, .docx)
documents = SimpleDirectoryReader("./docs").load_data()

# Chunk into 512-token nodes with 64-token overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# Build the vector index (embeds and indexes all nodes)
index = VectorStoreIndex(nodes)

# Create a query engine with top-5 retrieval
query_engine = index.as_query_engine(similarity_top_k=5)

# Query
response = query_engine.query("What is our refund policy?")
print(response)
```
LlamaIndex integrates natively with Chroma, Pinecone, Weaviate, Qdrant, and dozens of other vector stores. For persistence across sessions, pass a StorageContext to the index constructor rather than rebuilding from documents on every startup.
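Building on the snippet above, the persist-and-reload flow looks roughly like this. This is a sketch, not a complete program: the directory path is a placeholder, and `index` is the `VectorStoreIndex` built earlier.

```python
from llama_index.core import StorageContext, load_index_from_storage

# First run: persist the built index (vectors, nodes, metadata) to disk
index.storage_context.persist(persist_dir="./storage")

# Later runs: reload from disk instead of re-embedding every document
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)
```

Re-embedding a large corpus on every process start is both slow and costly, so this reload path is usually the first thing to add once the prototype works.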
For a full deep-dive on production RAG patterns with LangChain and Chroma, see LangChain RAG: Retrieval-Augmented Generation in Practice.
HuggingFace PEFT: The Unified Interface for Parameter-Efficient Fine-Tuning
HuggingFace PEFT (Parameter-Efficient Fine-Tuning) is the library that makes LoRA, QLoRA, prefix tuning, and IA3 adapters first-class citizens in the HuggingFace ecosystem. The API is consistent across adapter types: LoraConfig, PrefixTuningConfig, and IA3Config all follow the same get_peft_model(base_model, config) pattern, making it easy to experiment with different methods without rewriting your training loop.
The companion library TRL (trl) provides SFTTrainer for supervised fine-tuning on formatted instruction datasets, DPOTrainer for Direct Preference Optimization, and PPOTrainer for RLHF. Together, PEFT and TRL cover the full fine-tuning spectrum from simple style adaptation (SFT with LoRA) to complex behavioral alignment (DPO or PPO).
Key operational note: when using QLoRA (load_in_4bit=True), also import BitsAndBytesConfig and explicitly set bnb_4bit_quant_type="nf4" and bnb_4bit_compute_dtype=torch.bfloat16. The default FP4 quantization produces noticeably worse results than NF4, which is designed for the roughly normal distribution of language model weights.
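As a sketch, that configuration looks roughly like this; the model name is illustrative, and actually loading it requires a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization config as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # explicitly NF4, not the FP4 default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True,         # optional: also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```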
For a full deep-dive on LoRA and QLoRA including rank selection, merge strategies, and serving with vLLM, see Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF.
📚 Six Lessons from Teams That Got This Wrong First
Chunking strategy causes more RAG failures than any other component. Teams blame the LLM when recall is low. The problem is almost always chunk size, overlap, or the absence of a re-ranker. Fix the chunking and retrieval before touching the prompt or the model.
Hybrid retrieval (BM25 + dense) consistently outperforms pure semantic search for technical documents. Product names, API method names, version numbers, and error codes do not embed well — they look semantically similar to many other strings. BM25 catches exact matches that dense retrieval misses.
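The fusion step of hybrid retrieval is commonly implemented with Reciprocal Rank Fusion (RRF), which needs only the ranked ID lists from each retriever. A minimal, dependency-free sketch; the document IDs are illustrative, and in a real pipeline the two input rankings would come from BM25 and the vector store:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 catches the exact-match hit ("err-429") that dense retrieval ranked low
bm25_hits = ["err-429", "rate-limits", "faq"]
dense_hits = ["rate-limits", "pricing", "err-429"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # documents appearing high in both lists rise to the top
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.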
Fine-tuning with fewer than 500 examples usually hurts more than it helps. With fewer examples, the model memorizes training phrasing rather than generalizing. If you are below this threshold, use few-shot prompting or DPO pairs instead. DPO with 200 high-quality preference pairs often beats SFT on 1,000 raw completions.
A CrossEncoder re-ranker doubles RAG precision with minimal latency cost. Adding cross-encoder/ms-marco-MiniLM-L6-v2 as a post-retrieval re-ranker on top-20 results is consistently one of the highest-ROI improvements in a RAG pipeline. It adds 50–100ms per query but routinely cuts irrelevant context by half.
Combine strategically: fine-tune for tone and style, use RAG for facts. The two approaches are not in competition. The most reliable production pattern is a fine-tuned model that has internalized domain voice and output format, with a RAG layer that grounds its answers in current knowledge. Never use fine-tuning to inject facts that change; they will be stale before training finishes.
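The rerank step described above has a simple shape. In the sketch below the scorer is a deliberate placeholder: in a real pipeline it would be sentence-transformers' CrossEncoder loaded with the model named above, scoring (query, passage) pairs jointly.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each (query, candidate) pair and keep the best top_k.

    score_fn stands in for a CrossEncoder's predict call, which
    scores each query-passage pair in its own forward pass.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Toy scorer: token overlap between query and passage (placeholder only)
def overlap_score(query, passage):
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

top20 = ["refunds are issued within 14 days",
         "our office is closed on holidays",
         "refund requests require an order id"]
print(rerank("how do I get a refund", top20, overlap_score, top_k=2))
```

The pattern is always the same: retrieve wide (top-20), score pairs with the heavier model, keep narrow (top-5) for the prompt.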
Build your evaluation set before you build your system. A 50-question golden set with human-verified answers lets you measure your baseline before spending GPU hours or building pipelines. Teams that skip this step spend weeks optimizing a metric they never defined.
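A minimal shape for such a harness is sketched below. The `answer_fn` stub and the substring-matching rule are placeholders; real evaluation of factual accuracy typically uses an LLM judge or human grading rather than substring checks.

```python
def evaluate(golden_set, answer_fn):
    """Run each golden question through the system and score fact hits.

    golden_set: list of {"question": ..., "must_contain": ...} dicts.
    answer_fn:  the system under test (RAG pipeline, fine-tuned model, ...).
    Returns accuracy as a fraction, so baselines stay comparable run to run.
    """
    hits = sum(
        1 for item in golden_set
        if item["must_contain"].lower() in answer_fn(item["question"]).lower()
    )
    return hits / len(golden_set)

golden = [
    {"question": "What is the refund window?", "must_contain": "14 days"},
    {"question": "Which tier includes SSO?", "must_contain": "Enterprise"},
]

# Stub standing in for the real pipeline under test
def stub_system(question):
    return "Refunds are accepted within 14 days of purchase."

print(evaluate(golden, stub_system))  # 1 of 2 checks pass -> 0.5
```

Running this same harness against the no-RAG baseline, the RAG-only system, and the combined system is what turns "it feels better" into a defensible shipping decision.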
📌 TLDR: The One-Page Decision Cheat Sheet
RAG is the right choice when your data changes, when answers must be traceable to source documents, and when your team lacks ML training infrastructure. It delivers fast time-to-value, cheap updates, and grounded responses — at the cost of retrieval latency and quality that depends entirely on retrieval quality.
Fine-Tuning is the right choice when you need consistent style, domain reasoning, or output format that cannot be reliably controlled through prompting. It eliminates retrieval latency and bakes in behavioral consistency — at the cost of training time, GPU budget, and rapid staleness for any factual knowledge you try to embed.
Both is the right architecture for production assistants that need to be simultaneously authoritative in style and accurate on current facts. Fine-tune the model to know how to respond; use RAG to ensure it knows what is currently true.
The one rule that saves the most time: Never fine-tune to inject facts. Facts change. Weights do not update themselves. RAG exists precisely to solve the freshness problem that fine-tuning cannot.
📝 Practice Quiz: Test Your Decision Reasoning
- A company builds a chatbot that answers questions about their product catalog. New products are added weekly. Which approach is most appropriate?
A) Fine-tuning on the catalog B) RAG over a product database C) Full fine-tuning with RLHF D) Prefix tuning
Correct Answer: B. Fine-tuning would produce a model with stale product knowledge days after training. RAG indexes the current catalog and retrieves live information at query time.
- A legal AI team fine-tunes a model on 10,000 contract examples and reports excellent results on their test set. In production, the model paraphrases queries slightly differently and precision collapses. What is the most likely root cause?
A) The embedding model is mismatched B) The model has overfit to the training phrasing — too few examples or too high a learning rate C) The vector store index is using flat search D) The system prompt is too long
Correct Answer: B. This is the classic fine-tuning overfitting pattern. The model memorized training phrasing rather than generalizing. Mitigation: augment training examples with paraphrases and reduce the learning rate.
- A team wants to fine-tune Mistral-7B but only has access to a single 24GB GPU. Which technique makes this feasible?
A) Full fine-tuning with gradient checkpointing B) QLoRA — 4-bit quantized base model with BF16 LoRA adapters C) Prefix tuning on all transformer layers D) LoRA with rank r=256
Correct Answer: B. QLoRA quantizes the frozen base model to 4 bits (reducing memory ~4×), trains small LoRA adapters in BF16, and fits fine-tuning of a 7B model comfortably within 24GB. Full fine-tuning at FP16 requires roughly 60GB for a 7B model including optimizer states.
- You add a CrossEncoder re-ranker to a RAG pipeline that was previously returning top-20 vector similarity results and selecting the top-5 for the prompt. What is the most accurate description of what the re-ranker does?
A) It embeds the query and documents using a bi-encoder and recomputes cosine similarity B) It generates a summary of each retrieved chunk before scoring C) It performs a full cross-attention pass between the query and each candidate document, producing a more accurate relevance score than vector similarity alone D) It filters out chunks shorter than a minimum length threshold
Correct Answer: C. A CrossEncoder re-ranker attends to both the query and each document jointly in a single forward pass, capturing fine-grained token-level interactions that embedding similarity (which encodes query and document independently) misses. This substantially improves precision at the cost of additional latency.
- Open-ended: Your company's internal HR chatbot answers questions about leave policies, benefits, and compliance procedures. Policies change quarterly. Users report that responses feel generic and impersonal — the tone does not match your company's internal communication style. Design a solution using both RAG and fine-tuning. Explain what each component handles, how you would structure the training data, and how you would evaluate the combined system.
Correct Answer: Fine-tune a base model on a curated set of historical HR communications — internal announcements, policy summaries written by the HR team, and high-quality support ticket responses — formatted as instruction-following triples. This teaches the model the company's internal voice, the degree of formality expected, and the phrasing conventions for sensitive HR topics. Use LoRA with r=16 to limit the parameter footprint and preserve the base model's general reasoning capabilities. Do not include policy facts in the training data — inject them at inference time via RAG instead.
For the RAG layer: chunk the HR policy documents at 256–512 tokens using a recursive splitter. Re-index on a quarterly schedule aligned to policy review cycles, or trigger re-indexing via a CI/CD pipeline when policy documents are updated in the source-of-truth system. Use hybrid retrieval (dense + BM25) to catch exact policy names and section numbers alongside semantic similarity.
Evaluation: build a 50-question golden set spanning three categories — factual accuracy (does the answer match the current policy?), tone adherence (does a blind human evaluator rate the response as matching the company voice?), and completeness (does the answer cover all relevant policy points for the question?). Measure each category separately against baseline (no fine-tuning, no RAG), RAG-only, and fine-tune-only to confirm that the combined system outperforms both individual approaches before shipping.
🔗 Related Posts
- Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF — companion deep-dive on LoRA, QLoRA, SFT, and RLHF implementation
- LangChain RAG: Retrieval-Augmented Generation in Practice — end-to-end RAG pipeline with LangChain, Chroma, and production patterns
- Build vs Buy LLM: Self-Host vs API — choosing between managed API LLMs and self-hosted deployments
- AI Agents Explained: When LLMs Start Using Tools — how RAG integrates with agentic architectures
- LLM Skill Registry: Routing and Evaluation for Production Agents — routing and evaluation patterns that apply to both RAG and fine-tuned components
- Headless Agents: Deploying Skills as MCP Server — productionizing LLM skills including RAG-backed knowledge tools

Written by
Abstract Algorithms
@abstractalgorithms