
LangChain RAG: Retrieval-Augmented Generation in Practice

Ground your LLM in real data: build a RAG pipeline with FAISS, Chroma, and LangChain retrievers — step by step.

Abstract Algorithms · 20 min read

⚡ TLDR: RAG in 30 Seconds

RAG (Retrieval-Augmented Generation) fixes the LLM knowledge-cutoff problem by fetching relevant documents at query time and injecting them as context. With LangChain you build the full pipeline — load → split → embed → index → retrieve → answer — in clean, composable Python. This post walks every step end-to-end with working code and a complete Company Knowledge Base example.

📖 The Stale Knowledge Problem: Why LLMs Confidently Get It Wrong

Imagine you are the lead developer at a legal firm. Your team spends months building an AI assistant to help associates query hundreds of proprietary case files, internal memos, and client contracts. You wire up GPT-4 and deploy it. On day one, a senior partner asks: "What were the liability exclusions in the Hartwell vs. Meridian settlement?" The assistant replies with a confident, well-structured answer — and every detail is fabricated. There is no Hartwell vs. Meridian in the model's training data. It hallucinated the entire thing.

This is the core limitation every LLM shares: a training cutoff. The model learned from a snapshot of the world. It knows nothing about documents added after that snapshot, and it knows nothing about your private data — ever. Asking it about proprietary case files is like asking someone to recall a book they have never read.

The naive fix is fine-tuning: retrain (or adapt) the model on your documents. Fine-tuning is expensive, slow to iterate, and still does not guarantee factual grounding — the model may interpolate incorrectly between training examples. You also cannot fine-tune continuously every time a new document lands.

Retrieval-Augmented Generation (RAG) is the practical solution the industry converged on. Instead of baking documents into weights, RAG fetches the most relevant document chunks at query time, pastes them into the prompt as context, and lets the LLM synthesize an answer from evidence it can actually see. The model is not guessing from memory; it is reading a provided excerpt and summarizing it. Hallucination drops dramatically because the answer is anchored in retrieved text.

🔍 Grounding the Model: The Core Idea Behind RAG

A useful mental model: think of a closed-book exam versus an open-book exam. A vanilla LLM takes the closed-book test — it can only recall what it memorized during training. RAG turns it into an open-book exam. Before the LLM answers, the system hunts through a library for the most relevant passages and places them in front of the model.

The three-phase loop is:

  1. Retrieve — given the user's query, search a document store for the top-k most relevant chunks.
  2. Augment — prepend those chunks to the prompt as context.
  3. Generate — let the LLM produce an answer grounded in that context.

The LLM's role shifts from "memory oracle" to "reading-comprehension engine." This is a far easier task for a language model, and the answers are verifiable against the source chunks.

| Phase | What happens | LangChain component |
|---|---|---|
| Retrieve | Vector similarity search over embedded chunks | VectorStoreRetriever |
| Augment | Insert retrieved chunks into a prompt template | ChatPromptTemplate |
| Generate | LLM reads context and produces answer | ChatOpenAI / local LLM |
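
The three-phase loop can be sketched end to end with stand-in functions (a toy illustration, not LangChain code — the keyword-overlap retriever and the echoing "LLM" are stubs):

```python
# Conceptual sketch of the retrieve -> augment -> generate loop.
# Real pipelines replace these stubs with a vector store and an LLM call.

DOCS = [
    "Remote work requires manager approval via the HR portal.",
    "The free tier includes 1,000 API calls per month.",
    "Core hours are 10am to 3pm local time.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank documents by shared lowercase words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def augment(query: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks to the question as context."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Stand-in for the LLM call: echo the first context line as the 'answer'."""
    return prompt.split("\n")[1]

query = "How do I get remote work approval?"
print(generate(augment(query, retrieve(query))))
```

The structure is the same as the production chain built later in this post; only the internals of each phase change.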

⚙️ The RAG Pipeline Step by Step: From Documents to Answers

Building RAG involves two distinct pipelines: ingestion (run once or on schedule) and retrieval (run per query).

Ingestion: Getting Documents Into the Vector Store

Step 1 — Document Loading. LangChain's DocumentLoader abstractions handle PDFs, HTML pages, plain text, Notion exports, and more. Each loader returns a list of Document objects (content + metadata).

pip install langchain langchain-community chromadb faiss-cpu sentence-transformers openai

from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader

# Load a local text file
loader = TextLoader("company_policy.txt")
docs = loader.load()

# Load a PDF
pdf_loader = PyPDFLoader("org_chart.pdf")
pdf_docs = pdf_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://example.com/faq")
web_docs = web_loader.load()

all_docs = docs + pdf_docs + web_docs

Step 2 — Text Splitting. LLMs have a finite context window. A 50-page PDF cannot be stuffed into a single prompt. RecursiveCharacterTextSplitter breaks documents into overlapping chunks. The overlap ensures that sentences split across chunk boundaries are still represented in at least one complete chunk, preserving coherence.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # characters per chunk
    chunk_overlap=64,    # characters shared between consecutive chunks
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")

Chunk size is a tuning dial. Too large and you waste token budget with irrelevant text. Too small and you lose context that the LLM needs to formulate a coherent answer. A 400–600 character chunk with 10–15% overlap is a reliable starting point.
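
The boundary effect is easy to demonstrate with a bare character chunker (a toy sketch; the real splitter also respects the separator hierarchy):

```python
# Why overlap matters: with overlap, a sentence that straddles a chunk
# boundary still appears whole in at least one chunk.

def chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Synthetic document with one key sentence sitting across the 512 boundary.
key = "The policy requires 48 hours notice."
text = "A" * 490 + f" {key} " + "B" * 400

no_overlap = chunk(text, size=512, overlap=0)
with_overlap = chunk(text, size=512, overlap=64)

# Without overlap the key sentence is split across two chunks and never
# appears whole; with 64 characters of overlap one chunk contains it intact.
print(any(key in c for c in no_overlap), any(key in c for c in with_overlap))
```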

Step 3 — Embedding Generation. Each chunk is converted to a dense numeric vector that encodes its semantic meaning. Similar chunks end up close together in vector space, enabling similarity search. LangChain supports both cloud and local embedding models.

# Option A: OpenAI embeddings (cloud, high quality)
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Option B: HuggingFace local embeddings (free, runs on CPU)
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

The HuggingFace option is fully local — no API keys, no cost per call, and privacy-preserving for sensitive documents like legal case files.

Step 4 — Vector Store Indexing. Embed all chunks and store vectors in a searchable index.

from langchain_community.vectorstores import FAISS

# Build FAISS index from chunks
vectorstore = FAISS.from_documents(chunks, embeddings)

# Persist locally for reuse
vectorstore.save_local("faiss_index")

# Load from disk on next run
vectorstore = FAISS.load_local("faiss_index", embeddings,
                                allow_dangerous_deserialization=True)

Retrieval: Answering a Query

Step 5 — Similarity Search. At query time, embed the user's question and find the nearest chunk vectors.

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})

results = retriever.invoke("What is the remote work policy?")
for doc in results:
    print(doc.page_content[:200])

🧠 Deep Dive: How Vector Retrieval Actually Works

The Internals: Embeddings, Cosine Similarity, and FAISS Indexes

When you embed a sentence like "remote work requires manager approval", the embedding model produces a vector with hundreds of dimensions — typically 384 to 1536 floats depending on the model. Each dimension captures a latent semantic feature learned during model training.

When a query arrives, you embed it into the same vector space and measure cosine similarity — the cosine of the angle between the query vector and every stored document vector. A score of 1.0 means the vectors point in exactly the same direction (semantically identical); 0.0 means orthogonal (unrelated). The retriever returns the k chunks with the highest similarity scores.
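
The scoring step is small enough to write out directly (toy 3-dimensional vectors stand in for the hundreds of dimensions a real embedding model produces):

```python
import math

# What the retriever computes under the hood: cosine similarity between the
# query vector and every stored vector, then top-k by score.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

store = {
    "remote work policy": [0.9, 0.1, 0.0],
    "free tier limits":   [0.1, 0.9, 0.2],
    "org chart":          [0.0, 0.2, 0.9],
}

query_vec = [0.8, 0.2, 0.1]  # pretend this embeds "Can I work from home?"

ranked = sorted(store, key=lambda k: cosine(query_vec, store[k]), reverse=True)
print(ranked[:2])  # the two nearest chunks; "remote work policy" ranks first
```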

FAISS (Facebook AI Similarity Search) implements this efficiently. The default IndexFlatL2 computes exact nearest-neighbor distances and scales linearly with the number of vectors — fine for thousands of documents. For millions of vectors, IndexIVFFlat partitions the space into Voronoi cells and only searches within relevant cells, trading a tiny recall loss for dramatic speed gains.

Under the hood, FAISS.from_documents calls your embedding model once per chunk, stores the resulting vectors in a NumPy array, and wraps it in a FAISS index. The save_local call writes this index plus a pickle of the docstore (metadata + text) to disk.

Performance Analysis: Retrieval Speed vs. Index Size

| Index type | Exact | Speed | Memory | Best for |
|---|---|---|---|---|
| IndexFlatL2 | ✅ Yes | O(n) per query | High | < 100k docs |
| IndexIVFFlat | ⚠️ Approximate | O(√n) | Medium | 100k–10M docs |
| IndexHNSWFlat | ⚠️ Approximate | O(log n) | Very high | Low-latency prod |

The real bottleneck in most RAG pipelines is not retrieval speed — it is LLM call latency (often 1–5 seconds) and embedding throughput during ingestion. Batching embedding calls and caching the index on disk reduce ingestion from minutes to seconds on re-runs.

Context window budget is also a performance constraint. With k=4 chunks of 512 characters each, you consume roughly 500 tokens of context — modest. Push k to 20 with large chunks and you exhaust the context window before the LLM even starts generating.
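
The budget arithmetic is worth making explicit, using the common ~4 characters per token heuristic (a rough assumption — real tokenizers vary by model):

```python
# Back-of-envelope context budget check with the ~4 chars/token heuristic.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

# k=4 chunks of 512 characters: the modest case from the text above.
print(estimate_tokens("x" * (4 * 512)))    # 512 tokens — roughly the "500" cited

# k=20 with 2,000-character chunks blows the budget up fast:
print(estimate_tokens("x" * (20 * 2000)))  # 10,000 tokens of context alone
```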

🏗️ Retrieval Strategies: MMR, Metadata Filters, and Contextual Compression

Plain similarity search can return redundant chunks — five chunks saying the same thing in slightly different words. Maximal Marginal Relevance (MMR) balances relevance against diversity: each subsequent result must be both similar to the query and dissimilar to already-selected results.

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.6}
)
# lambda_mult: 1.0 = pure similarity, 0.0 = pure diversity
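
The greedy selection rule behind MMR can be sketched in a few lines (an illustration of the algorithm, not LangChain's implementation; the toy similarity scores are made up):

```python
# Greedy MMR: each pick maximizes lambda * sim(query, doc) minus
# (1 - lambda) * max similarity to anything already selected.

def mmr(query_sim: dict, doc_sim, k: int, lam: float = 0.6) -> list[str]:
    selected: list[str] = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy scores: docs "a" and "b" are near-duplicates; "c" is distinct but relevant.
query_sim = {"a": 0.95, "b": 0.93, "c": 0.80}
pairs = {frozenset("ab"): 0.99, frozenset("ac"): 0.10, frozenset("bc"): 0.12}
doc_sim = lambda x, y: pairs[frozenset((x, y))]

print(mmr(query_sim, doc_sim, k=2))  # picks "a" then "c", skipping duplicate "b"
```

Plain similarity search would return "a" then "b"; MMR penalizes "b" for its 0.99 similarity to the already-selected "a" and picks the diverse "c" instead.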

Metadata filtering narrows search to a document subset before running similarity search — useful when your index holds multi-tenant or multi-topic documents.

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "company_policy.txt"}}
)

Contextual Compression is a post-retrieval refinement step. The raw retrieved chunk may contain mostly irrelevant sentences with one golden sentence buried inside. A ContextualCompressionRetriever wraps a base retriever and runs a secondary LLM pass to extract only the sentences relevant to the query.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 6})
)

compressed_docs = compression_retriever.invoke("What is the vacation policy?")

The trade-off: contextual compression improves answer precision but adds a second LLM call per query — roughly doubling latency. Reserve it for cases where answer quality clearly outweighs the cost.

📊 Visualizing the Full RAG Pipeline

graph TD
    A[Raw Documents\nPDFs · TXT · HTML] --> B[Document Loaders]
    B --> C[Text Splitter\nRecursiveCharacterTextSplitter]
    C --> D[Embedding Model\nOpenAI · HuggingFace]
    D --> E[(Vector Store\nFAISS · Chroma)]

    F[User Query] --> G[Query Embedder]
    G --> H{Similarity Search\nTop-k Chunks}
    E --> H
    H --> I[Context Formatter\nformat_docs]
    I --> J[Prompt Template\nSystem + Context + Question]
    J --> K[LLM\nGPT-4o · Claude · Local]
    K --> L[Final Answer]

    style A fill:#e8f5e9,stroke:#388e3c
    style E fill:#e3f2fd,stroke:#1976d2
    style K fill:#fce4ec,stroke:#c62828
    style L fill:#fff9c4,stroke:#f57f17

The pipeline splits cleanly into two halves: everything above the Vector Store node is the ingestion path (run offline). Everything from User Query down is the retrieval + generation path (run per request).

Let us return to the legal firm. The team ingests three document types:

  • Case summaries (PDF): 2,000+ files covering past litigation outcomes.
  • Internal policy memos (TXT): HR policy, billing rates, escalation procedures.
  • Client contracts (PDF): active engagement letters with custom terms.

With a RAG pipeline in place, the workflow changes entirely. When a partner asks "What were the confidentiality terms in the Vantage Capital engagement?", the system:

  1. Embeds the query and retrieves the 4 most relevant chunks from the Vantage Capital contract.
  2. Injects those chunks into a prompt alongside the question.
  3. The LLM reads the actual contract text and extracts the relevant clauses.

The answer is now traceable — the firm can display the source chunk alongside the answer as a citation. If the contract says something different from what the LLM answered, the source is right there to audit.
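
Surfacing those citations is a small formatting step, sketched here against Document-like objects carrying a source metadata key (the stand-in class and file name are illustrative):

```python
# Sketch: attach source citations from retrieved docs to the final answer.

class Doc:  # stand-in for langchain_core.documents.Document
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def answer_with_citations(answer: str, docs) -> str:
    # Deduplicate sources so multi-chunk hits from one file cite it once.
    sources = sorted({d.metadata.get("source", "unknown") for d in docs})
    return f"{answer}\n\nSources: {', '.join(sources)}"

docs = [
    Doc("Confidentiality clause 4.2 ...", {"source": "vantage_capital_contract.pdf"}),
    Doc("Definitions ...", {"source": "vantage_capital_contract.pdf"}),
]
print(answer_with_citations("Clause 4.2 restricts disclosure to named parties.", docs))
```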

Sample queries against the Company Knowledge Base (from the worked example below):

| Query | Retrieved Source | Behavior |
|---|---|---|
| "What is the remote work approval process?" | company_policy.txt | Accurate clause extracted |
| "Who owns the AI product roadmap?" | org_chart.txt | Correct title + name returned |
| "What does the FAQ say about free tier limits?" | product_faq.txt | Exact limit figures cited |
| "How many days notice for contract termination?" | company_policy.txt | Clause number + days cited |
| "Is the VP of Engineering the same as the CTO?" | org_chart.txt | Roles correctly distinguished |

⚖️ Trade-offs & Failure Modes in Production RAG

RAG dramatically reduces hallucinations but introduces its own failure taxonomy.

Failure Mode 1 — The LLM ignores the context. Retrieved chunks are present in the prompt, but the LLM answers from its pretrained weights instead. This happens when the retrieved context contradicts the model's strong priors (e.g., a custom company name that collides with a known public entity) or when the context is buried too far from the question in a long prompt. Mitigation: use explicit instruction in the system prompt — "Answer only using the provided context. If the context does not contain the answer, say 'I don't know.'"

Failure Mode 2 — Wrong chunks retrieved. The similarity search returns chunks about a tangentially related topic. Root causes: chunk size too large (dilutes the signal), embeddings too generic for domain-specific jargon, or insufficient k. Mitigation: tune chunk size (try 256–512 chars), add metadata filters, and evaluate retrieval with a precision@k benchmark before integrating the LLM.

Failure Mode 3 — Stale index. Documents change but the vector store is not refreshed. An employee's policy query returns outdated vacation-day counts from last year's handbook. Mitigation: trigger re-ingestion on document update events (file watcher, webhook, or scheduled job). With Chroma, you can upsert documents by ID without rebuilding the full index.

Failure Mode 4 — Context window overflow. High k + large chunks = prompt too long for the model. The LLM silently truncates or refuses the request. Mitigation: track token counts before sending; reduce k, shrink chunk size, or use contextual compression.
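
A minimal guard might trim ranked chunks to a token budget before building the prompt (a sketch using the rough 4-characters-per-token heuristic; a production version would use the model's real tokenizer):

```python
# Drop lowest-ranked chunks until the context fits a token budget.

def trim_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    kept, used = [], 0
    for chunk in chunks:  # chunks arrive ranked most-relevant first
        cost = len(chunk) // 4          # rough chars-per-token heuristic
        if used + cost > max_tokens:
            break                        # stop before overflowing the window
        kept.append(chunk)
        used += cost
    return kept

chunks = ["x" * 2000] * 10                           # ten ~500-token chunks
print(len(trim_to_budget(chunks, max_tokens=1600)))  # keeps only the top 3
```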

| Failure | Symptom | Fix |
|---|---|---|
| LLM ignores context | Correct docs retrieved, wrong answer | Stronger system prompt + temperature=0 |
| Wrong chunks | Answer unrelated to query | Tune chunk size, use MMR, add metadata filter |
| Stale index | Outdated facts cited | Automated re-ingestion pipeline |
| Context overflow | Truncated or refused responses | Reduce k, compress, or use larger context model |

🧭 Decision Guide: When to Build a RAG Pipeline

| Situation | Recommendation |
|---|---|
| Use when | You have domain-specific or private documents not in the model's training data |
| Use when | Facts change frequently (pricing, policies, personnel) — fine-tuning can't keep up |
| Avoid when | Your document set is very small (< 20 docs) — just stuff them directly into the system prompt |
| Avoid when | Queries require multi-hop reasoning across many documents — vanilla RAG retrieves per query, not chains of reasoning |
| Better alternative | LangGraph agentic RAG when the assistant needs to decide whether and what to retrieve based on conversation state |
| Edge case | Queries spanning multiple source documents: use MMR + higher k, or build a document graph |

🧪 Practical Examples: Building the Full RAG Chain with LCEL

Below is a complete, self-contained Company Knowledge Base assistant. It ingests three mock documents, indexes them, and answers five realistic queries using a clean LCEL chain.

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# ── 1. Mock documents ────────────────────────────────────────────────────────
raw_docs = [
    Document(
        page_content=(
            "Remote Work Policy: Employees may work remotely up to 3 days per week. "
            "Manager approval is required via the HR portal. Requests must be submitted "
            "at least 48 hours in advance. Core hours are 10am–3pm in the employee's "
            "local timezone. Equipment is the employee's responsibility when working remotely. "
            "Violation of core-hour requirements may result in the remote privilege being revoked."
        ),
        metadata={"source": "company_policy.txt", "section": "remote-work"},
    ),
    Document(
        page_content=(
            "Product FAQ — Free Tier: The free tier includes 1,000 API calls per month "
            "and 500 MB of storage. Rate limits are 10 requests per second. Paid plans "
            "start at $49/month for 50,000 API calls. Enterprise pricing is available on "
            "request. SLA guarantees of 99.9% uptime apply only to paid plans. "
            "Free tier accounts are subject to fair-use suspension after 3 consecutive "
            "months of exceeding soft limits."
        ),
        metadata={"source": "product_faq.txt", "section": "pricing"},
    ),
    Document(
        page_content=(
            "Org Chart — Engineering: The VP of Engineering is Jordan Riley. "
            "Jordan reports directly to the CEO, Alex Nguyen. "
            "The CTO role is currently vacant following the departure of Sam Patel in Q1. "
            "The AI product roadmap is owned by the Head of Product, Casey Kim. "
            "Engineering has four sub-teams: Platform, Frontend, Data, and AI Research. "
            "Each sub-team is led by a Senior Engineering Manager."
        ),
        metadata={"source": "org_chart.txt", "section": "engineering"},
    ),
]

# ── 2. Split ─────────────────────────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=40)
chunks = splitter.split_documents(raw_docs)

# ── 3. Embed + Index ─────────────────────────────────────────────────────────
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

# ── 4. Prompt ─────────────────────────────────────────────────────────────────
system_msg = (
    "You are a helpful company assistant. Answer ONLY from the context provided. "
    "If the context does not contain the answer, respond with: 'I don't have that information.'"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_msg),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# ── 5. LCEL Chain ─────────────────────────────────────────────────────────────
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# ── 6. Run queries ────────────────────────────────────────────────────────────
queries = [
    "What is the remote work approval process?",
    "Who owns the AI product roadmap?",
    "What are the free tier API call limits?",
    "How many days do I need to submit a remote work request in advance?",
    "Is the CTO the same person as the VP of Engineering?",
]

for q in queries:
    print(f"\nQ: {q}")
    print(f"A: {rag_chain.invoke(q)}")

The LCEL chain reads left-to-right: the retriever fetches relevant chunks, format_docs joins them into a single string, the prompt template merges context and question, the LLM synthesizes the answer, and StrOutputParser extracts the plain text response. Streaming is free — replace rag_chain.invoke(q) with rag_chain.stream(q) to yield tokens as they arrive.

🛠️ ChromaDB: Persistent Local Vector Storage in Practice

FAISS is excellent for in-memory and file-based indexes, but it requires manual serialization. ChromaDB is a purpose-built, embeddable vector database with a client-server mode, built-in persistence, and collection management. It stores embeddings, documents, and metadata in a SQLite + HNSW-backed store that survives process restarts without any extra save/load calls.

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# First run: build and persist
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",   # persists automatically
    collection_name="company_kb"
)

# Subsequent runs: reload without re-embedding
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="company_kb"
)

# Upsert new documents without a full rebuild
vectorstore.add_documents(new_chunks)

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

ChromaDB exposes a collection_name concept, letting you partition different document sets (e.g., "legal_cases", "hr_policies", "product_docs") into isolated namespaces within a single persist_directory. This makes it straightforward to add multi-tenant isolation without spinning up separate vector database instances.

For production deployments, ChromaDB supports a client-server mode (chromadb.HttpClient) so your ingestion pipeline and query service can connect to the same remote store. For a deeper look at how ChromaDB fits into a production pipeline, see the dedicated post linked in Related Posts below.

📚 Lessons Learned from Real RAG Deployments

Evaluate retrieval before the LLM. A common trap is treating the whole pipeline as a black box and tuning prompts when the real culprit is poor retrieval. Measure precision@k independently: given a query, do the top-k retrieved chunks actually contain the answer? If not, fix chunking and embedding before touching the LLM layer.
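
A precision@k harness needs only a few lines once you have labeled relevant chunks per query (the queries, chunk IDs, and data below are made up for illustration):

```python
# Minimal precision@k evaluation for the retriever, independent of the LLM.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

# Labeled eval set: query -> (what the retriever returned, what is relevant)
evals = {
    "remote work policy?": (["policy-3", "faq-1", "policy-7"], {"policy-3", "policy-7"}),
    "free tier limits?":   (["org-2", "faq-1", "faq-4"], {"faq-1", "faq-4"}),
}

for query, (retrieved, relevant) in evals.items():
    print(query, round(precision_at_k(retrieved, relevant, k=3), 3))
```

If these numbers are low, no amount of prompt engineering downstream will recover the answer — fix chunking or embeddings first.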

Chunk size matters more than model choice. Developers often debate GPT-4 vs Claude while ignoring that 1024-character chunks with zero overlap silently discard the context the LLM needs. In our testing, reducing chunk size from 1024 to 400 characters with 15% overlap improved answer accuracy more than switching to a larger embedding model did.

Make retrieval transparent to users. Surfacing the source chunks alongside the answer — "Answer sourced from company_policy.txt, section 3.2" — builds user trust and makes errors visible. Users self-correct when they can see the retrieved evidence.

Plan for index drift. Documents change. Build a lightweight update pipeline from day one. With Chroma, a file watcher that calls vectorstore.add_documents() on new or modified files keeps the index fresh with minimal overhead.
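
One lightweight way to detect which files need re-ingestion is content hashing (an illustrative sketch; a production setup might rely on file-watcher events or webhooks instead):

```python
import hashlib
from pathlib import Path

# Hash each file and re-index only those whose hash changed since the last
# run; feed the returned paths into the splitter + vectorstore.add_documents.

def changed_files(paths: list[Path], seen: dict[str, str]) -> list[Path]:
    stale = []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen.get(str(path)) != digest:
            stale.append(path)
            seen[str(path)] = digest  # record so the next run skips it
    return stale
```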

System prompt is load-bearing. The instruction "answer only from the context" is not optional flavor text. Without it, strong models like GPT-4 will confidently override retrieved evidence with their pretrained knowledge, especially when context is ambiguous or contradictory.

🧭 What RAG + LangGraph Looks Like

The RAG chain you built here is a fixed pipeline: every query always retrieves from the same store using the same strategy. In agentic systems built with LangGraph, the agent itself decides when to retrieve, which collection to query, and how many retrieval rounds to perform based on conversation state. If the first retrieval returns low-confidence chunks, the agent can reformulate the query or switch to a different retrieval strategy — all within a stateful graph where each node is an explicit decision point. See the LangGraph 101 post in Related Posts for the foundation needed to build that pattern.
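
The decide-then-retrieve loop can be sketched with plain functions standing in for graph nodes (illustrative stubs, not the LangGraph API):

```python
# Agentic retrieval sketch: retry with a reformulated query when the first
# retrieval round comes back with low confidence.

def retrieve(query: str) -> tuple[list[str], float]:
    # Stub retriever: only queries mentioning "policy" hit with confidence.
    if "policy" in query:
        return (["Remote work policy chunk"], 0.9)
    return ([], 0.2)

def reformulate(query: str) -> str:
    return query + " policy"  # stub query rewrite

def agentic_retrieve(query: str, max_rounds: int = 2) -> list[str]:
    chunks: list[str] = []
    for _ in range(max_rounds):
        chunks, confidence = retrieve(query)
        if confidence >= 0.5:          # confident enough: stop retrieving
            return chunks
        query = reformulate(query)     # low confidence: rewrite and retry
    return chunks

print(agentic_retrieve("working from home"))  # second round succeeds
```

In a real LangGraph build, each of these functions becomes a node and the confidence check becomes a conditional edge in the state graph.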

📌 Summary & Key Takeaways

  • RAG solves the LLM knowledge-cutoff problem by retrieving relevant document chunks at query time and injecting them as context before generation.
  • The ingestion pipeline (load → split → embed → index) runs offline; the retrieval pipeline (embed query → search → augment prompt → generate) runs per request.
  • Chunk size and overlap are the highest-leverage tuning parameters — measure retrieval precision@k before tuning the LLM.
  • FAISS is the fastest option for local, in-memory indexes; ChromaDB adds persistence, upserts, and multi-collection support with no extra serialization code.
  • MMR retrieval prevents redundant chunks; contextual compression strips irrelevant sentences from retrieved chunks to improve answer quality at the cost of an extra LLM call.
  • The most common RAG failure mode is not the LLM — it is the retriever returning wrong or stale chunks. Evaluate the retriever independently.
  • One-liner to remember: "RAG turns an LLM into a reading-comprehension engine — give it the right passage and it will ace the test."

📝 Practice Quiz

  1. Which component in the RAG ingestion pipeline converts raw text chunks into numeric vectors?

    • A) The RecursiveCharacterTextSplitter
    • B) The embedding model
    • C) The vector store
    • D) The StrOutputParser

  Correct Answer: B
  2. A RAG system consistently retrieves chunks about the wrong subtopic. The most likely root cause is:

    • A) The LLM temperature is set too high
    • B) The system prompt does not mention RAG
    • C) Chunk size is too large, diluting the semantic signal per chunk
    • D) The FAISS index is stored in memory instead of on disk

  Correct Answer: C
  3. Your RAG pipeline retrieves five chunks that all say essentially the same thing. Which retrieval strategy best addresses this?

    • A) Increase k to fetch more chunks
    • B) Switch from similarity search to MMR (Maximal Marginal Relevance)
    • C) Use a larger embedding model
    • D) Decrease chunk_overlap to zero

  Correct Answer: B
  4. A legal firm wants to query both case summaries and HR policies but keep them isolated so a policy query never retrieves case text. Which ChromaDB feature enables this?

    • A) persist_directory partitioning
    • B) collection_name namespacing
    • C) Metadata filtering with source key
    • D) Both B and C are valid approaches

  Correct Answer: D
  5. Open-ended challenge: A RAG pipeline for a customer support bot passes all unit tests but consistently gives wrong answers in production. The retrieved chunks look correct when inspected manually. What are three different root causes you would investigate, and what mitigation would you apply to each? (No single correct answer — reason through the evidence.)

Written by Abstract Algorithms (@abstractalgorithms)