
Guide to Using RAG with LangChain and ChromaDB/FAISS

Build a 'Chat with PDF' app in 10 minutes. We walk through the code for loading documents, creati...

Abstract Algorithms · 13 min read

AI-assisted content.

TLDR: RAG (Retrieval-Augmented Generation) gives an LLM access to your private documents at query time. You chunk and embed documents into a vector store (ChromaDB or FAISS), retrieve the relevant chunks at query time, and inject them into the LLM's prompt. The model answers from real data instead of hallucinating.



📖 Why LLMs Hallucinate and How RAG Fixes It

An LLM's knowledge is frozen at its training cutoff. Ask it about your internal documentation, your product catalog, or a document uploaded today, and it will generate a plausible-sounding answer from its weights, not from your data.

RAG bridges this gap:

  • No fine-tuning required (fine-tuning is expensive and won't help for dynamic data)
  • Works with any LLM
  • Updates in real time: add new documents to the vector store and they're immediately searchable

🔍 Key Concepts: Embeddings and Vector Similarity

When you search for "refund policy" in a document, a keyword search returns results containing those exact words. But what if your policy document says "money-back guarantee"? Keyword search misses it. Embeddings solve this.

An embedding is a list of numbers: a high-dimensional vector (e.g., 1,536 numbers for OpenAI's text-embedding-3-small) that encodes the semantic meaning of a piece of text. Similar meanings produce vectors that point in similar directions in that space.

Cosine similarity measures the angle between two vectors. A score near 1.0 means "very similar meaning"; near 0 means "unrelated." This is how "money-back guarantee" and "refund policy" score as highly similar: they occupy neighboring regions in embedding space even though they share no keywords.
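To make this concrete, here is cosine similarity computed by hand on toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the math is identical; the vectors below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real embeddings
refund = [0.9, 0.1, 0.2]        # "refund policy"
guarantee = [0.85, 0.15, 0.25]  # "money-back guarantee" -- points the same way
weather = [0.1, 0.9, 0.3]       # "tomorrow's weather" -- unrelated

print(cosine_similarity(refund, guarantee))  # close to 1.0
print(cosine_similarity(refund, weather))    # much lower
```

Notice the similarity score ignores vector length and depends only on direction, which is why it works well for comparing embeddings of texts of different sizes.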

The RAG pipeline at a conceptual level:

  1. Embed: convert every document chunk into a vector
  2. Store: save those vectors in a vector database (ChromaDB, FAISS, Pinecone)
  3. Retrieve: embed the user's query and find the top-K closest vectors by cosine similarity
  4. Inject: paste the matched chunks into the LLM's prompt as context

Why RAG beats fine-tuning for dynamic/private data: Fine-tuning bakes knowledge into model weights. It costs hundreds of dollars per training run, takes hours, and requires full retraining every time your data changes. RAG adds new documents to the vector store in seconds with no model retraining. You also get source citations for free: you know exactly which chunks the answer came from.
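The four steps can be sketched end to end in plain Python. The vectors here are hand-made stand-ins for real embeddings (the dimensions loosely mean "money / time / location"), so the semantics are toy-grade, but the embed → store → retrieve → inject flow is exactly what a real pipeline does:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Steps 1 + 2: "embed" each chunk and store (vector, chunk) pairs.
# Vectors are invented for illustration -- a real system calls an embedding model.
store = [
    ([0.9, 0.2, 0.1], "Refunds are issued within 30 days of purchase."),
    ([0.1, 0.3, 0.9], "Our office is closed on public holidays."),
    ([0.2, 0.9, 0.3], "Shipping takes 3-5 business days."),
]

# Step 3: embed the query and rank chunks by cosine similarity, keep top-K
query = "What is the refund policy?"
qvec = [0.85, 0.25, 0.15]  # pretend embedding of the query
top_k = sorted(store, key=lambda pair: cosine(qvec, pair[0]), reverse=True)[:2]

# Step 4: inject the matched chunks into the LLM prompt as context
context = "\n".join(chunk for _, chunk in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The refund chunk ranks first because its vector points the same way as the query vector; the "holidays" chunk never enters the prompt.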

📊 Full RAG Pipeline Sequence

sequenceDiagram
    participant D as Documents
    participant S as TextSplitter
    participant E as EmbeddingModel
    participant V as ChromaDB
    participant U as User
    participant R as Retriever
    participant L as LLM

    D->>S: Split into chunks
    S->>E: Embed each chunk
    E->>V: Store vectors + metadata
    U->>R: Submit query
    R->>E: Embed query
    E->>V: Cosine similarity search
    V-->>R: Top-K chunks
    R->>L: Inject chunks into prompt
    L-->>U: Grounded answer + citations

This sequence diagram shows the two-phase RAG lifecycle in one view. The top half is the offline indexing phase: documents are split, embedded, and stored in ChromaDB before any user query arrives. The bottom half is the online retrieval phase: the user's query is embedded, the top-K most similar chunks are retrieved by cosine similarity, injected into the LLM's prompt, and a grounded answer is returned with source citations. The key takeaway is that retrieval quality, not generation quality, is the dominant factor in RAG answer quality; bad context guarantees bad answers regardless of model size.

📊 LangChain RAG Chain Components

flowchart LR
    DL["DocumentLoader (PDF, TXT, HTML)"]
    TS["TextSplitter chunk_size=500"]
    EM["Embeddings (OpenAI / HuggingFace)"]
    VS["VectorStore (ChromaDB / FAISS)"]
    RT["Retriever top_k=4"]
    LLM["LLM (gpt-4o-mini)"]
    ANS[Answer]

    DL --> TS --> EM --> VS --> RT --> LLM --> ANS

This flowchart maps each LangChain component to its stage in the RAG pipeline, showing the linear dependency chain from raw document to final answer. The DocumentLoader ingests source files, the TextSplitter creates manageable chunks, the Embeddings model converts text to vectors, the VectorStore indexes and retrieves them, and the Retriever feeds the top-K chunks to the LLM. Understanding this component chain makes debugging straightforward: isolate which stage is broken by inspecting intermediate outputs (chunks, retrieved docs, injected prompt) before assuming the model is at fault.


🔢 Step 1: Chunking and Embedding Your Documents

Before storing, split your documents into chunks and convert each to an embedding.

Why chunking?

| Chunk too small | Chunk too large |
| --- | --- |
| Missing context | Too much noise injected into LLM |
| Loses coherence across sentences | Uses up context window budget |
| Many irrelevant chunks returned | Retrieval quality degrades |

Overlap (typically 10–20% of chunk size) preserves context across chunk boundaries.
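A quick way to see what overlap buys you: split a string with a naive character windower (a deliberately simplified stand-in for LangChain's splitter) and check that the tail of each chunk reappears at the head of the next:

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive character-based splitter: each new chunk starts `overlap`
    characters before the previous chunk ended, so boundary text appears twice."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The refund window is 30 days. Items must be unused. Contact support to start a return."
chunks = split_with_overlap(text, chunk_size=40, overlap=8)

for i, c in enumerate(chunks):
    print(f"chunk {i}: {c!r}")

# The last 8 characters of chunk 0 are also the first 8 of chunk 1,
# so a sentence cut at the boundary survives intact in one of the two chunks.
assert chunks[1].startswith(chunks[0][-8:])
```

Real splitters like RecursiveCharacterTextSplitter prefer breaking on paragraph and sentence boundaries before falling back to raw characters, but the overlap mechanism is the same.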

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # ~500 tokens: self-contained enough for focused retrieval without losing sentence coherence
    chunk_overlap=50,  # 10% overlap ensures sentences split at boundaries appear in both adjacent chunks
)
chunks = splitter.split_documents(docs)

⚙️ Step 2: Storing and Searching with ChromaDB

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Retrieve top-4 most relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

ChromaDB stores embeddings locally (persisted to disk). FAISS is an alternative: in-memory only, faster for pure search, no built-in persistence.

| Feature | ChromaDB | FAISS |
| --- | --- | --- |
| Persistence | Yes (disk) | No (in-memory, manual save) |
| Metadata filtering | Yes | Limited |
| Latency | Slightly higher | Very low |
| Best for | Prototyping + production | High-throughput search, research |
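Metadata filtering deserves a closer look, since it is the feature that keeps multi-project or multi-tenant setups safe. Conceptually it is a pre-filter applied before similarity ranking, which this plain-Python sketch illustrates (ChromaDB's actual `where` filter is applied inside the index, but the effect is the same; the vectors and data here are invented):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Each entry: (embedding, text, metadata) -- toy 2-d vectors for illustration
store = [
    ([0.9, 0.1], "Project A refund rules", {"project": "A"}),
    ([0.8, 0.2], "Project B refund rules", {"project": "B"}),
    ([0.1, 0.9], "Project A holiday schedule", {"project": "A"}),
]

def search(qvec, where, k=1):
    # 1. Pre-filter by metadata (the role of ChromaDB's where={"project": "A"})
    candidates = [(v, t) for v, t, m in store if m["project"] == where["project"]]
    # 2. Rank only the survivors by cosine similarity
    return sorted(candidates, key=lambda p: cosine(qvec, p[0]), reverse=True)[:k]

qvec = [0.85, 0.15]  # pretend embedding of "refund rules"
print(search(qvec, where={"project": "A"}))  # Project A's rules, never Project B's
```

Without the filter, "Project B refund rules" would score nearly as high as Project A's and could leak into the wrong tenant's answer; the filter makes that impossible by construction.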

🧠 Deep Dive: The RetrievalQA Chain in LangChain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # "stuff" = inject all context at once
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])

chain_type="stuff" injects all retrieved chunks into a single prompt. When the retrieved context plus the question exceeds the model's context window, switch to "map_reduce" (summarize each chunk separately, then combine) or "refine" (build up the answer chunk by chunk).
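To see what map-reduce actually does, here is the idea reduced to plain Python with a stub summarizer standing in for the per-chunk LLM calls (the real chain makes one LLM call per chunk in the map step, then one combining call in the reduce step):

```python
def summarize(text: str, question: str) -> str:
    # Stub for an LLM call: in the real chain this is one API call per chunk.
    # Here we just keep sentences that share a word with the question.
    qwords = set(question.lower().split())
    kept = [s for s in text.split(". ") if qwords & set(s.lower().split())]
    return ". ".join(kept)

def map_reduce_answer(chunks: list[str], question: str) -> str:
    # Map: compress each retrieved chunk independently
    partials = [summarize(c, question) for c in chunks]
    # Reduce: combine the partial summaries into one final context
    combined = " ".join(p for p in partials if p)
    return f"Answer from combined context: {combined}"

chunks = [
    "Refunds take 30 days. The office dog is named Rex.",
    "Shipping is free. Refunds require a receipt.",
]
print(map_reduce_answer(chunks, "how do refunds work"))
```

The map step is what lets the chain handle more context than fits in one prompt; the cost is one extra LLM call per chunk, which is why "stuff" remains the default when everything fits.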

flowchart LR
    Q[User Question] --> Embed[Embed question]
    Embed --> Retrieve[Retrieve top-k chunks from ChromaDB]
    Retrieve --> Prompt[Inject chunks into LLM prompt]
    Prompt --> LLM[LLM call]
    LLM --> A[Answer with citations]

🔬 Internals

RAG pipelines split documents into overlapping chunks (typically 512–1024 tokens with 10–20% overlap), embed each chunk with a sentence encoder, and store the embeddings in a vector index. At query time, the query is embedded and the top-k chunks are retrieved by approximate nearest-neighbor search (HNSW or IVF). The retrieved chunks are injected into the LLM prompt as context, constraining generation to retrieved evidence.

⚡ Performance Analysis

ChromaDB with an HNSW index retrieves top-5 results from 100K documents in under 10ms on CPU. Embedding 1M tokens with text-embedding-ada-002 costs ~$0.10 and takes ~5 minutes. End-to-end RAG latency on a 100K-document corpus is typically 200–500ms (embedding + retrieval + LLM), versus 50–150ms for a pure LLM call: a 3–5× latency trade-off that sharply reduces hallucination on factual queries.

⚖️ Trade-offs & Failure Modes: Retrieval Quality Traps

| Problem | Symptom | Fix |
| --- | --- | --- |
| Chunks too large | Irrelevant content in context | Reduce chunk size, test retrieval quality |
| Top-k too low | Answer misses key details | Increase k, or use a reranker |
| Embedding model mismatch | Poor retrieval | Use same model for indexing and querying |
| No metadata filtering | Returns documents from wrong project | Add where filters in ChromaDB |
| Chain type wrong for large docs | Context overflow | Switch to map_reduce or refine |

📊 The RAG Pipeline: End-to-End Flow

The complete RAG workflow has two phases: an offline indexing phase that builds the vector store, and an online retrieval phase that grounds each LLM response in retrieved evidence.

flowchart TD
    A[Documents] --> B[Chunking]
    B --> C[Embedding Model]
    C --> D[ChromaDB Vector Index]
    E[User Query] --> F[Query Embedding]
    F --> G[ANN Search]
    G --> D
    G --> H[Top-k Chunks]
    H --> I[LLM + Context Window]
    I --> J[Grounded Answer]

🧭 Decision Guide: RAG Architecture End-to-End

RAG splits neatly into two phases that run at different times:

  • Indexing phase (offline, runs once per document set): Load documents → split into chunks → embed each chunk → persist vectors in ChromaDB. This is a batch process you rerun only when documents change. It can be slow (thousands of embedding API calls), so run it ahead of time.
  • Retrieval phase (online, runs per query): Embed the user question → search the vector store for the top-K closest chunks → inject chunks into the LLM prompt → return the answer. This must be fast (under 200 ms for good UX); the vector search itself is typically sub-millisecond.

Understanding this split helps you debug: if answers are bad, ask first "is retrieval returning the right chunks?" before blaming the LLM.

flowchart TD
    Docs["Source Documents (PDFs, TXT, HTML)"]
    Split["Text Splitter chunk_size=500, overlap=50"]
    Embed["Embedding Model (text-embedding-3-small)"]
    Store["Vector Store (ChromaDB / FAISS)"]
    Query["User Query"]
    QEmbed["Embed Query"]
    Retrieve["Retrieve Top-K Chunks"]
    Prompt["Inject Chunks into LLM Prompt"]
    LLM["LLM Call (gpt-4o-mini)"]
    Answer[Answer + Source Citations]

    Docs --> Split --> Embed --> Store
    Query --> QEmbed --> Retrieve
    Store --> Retrieve
    Retrieve --> Prompt --> LLM --> Answer

🌍 Real-World Applications of RAG

RAG powers document Q&A systems across every industry. The core pattern is identical; the differences are in which documents get indexed and what guardrails are required.

| Use Case | Documents Indexed | Special Consideration |
| --- | --- | --- |
| Chat with PDF | Uploaded PDFs, reports | Per-user isolation; never mix tenants' data in one collection |
| Customer support KB | Help articles, FAQs | High update frequency; re-index on every content publish |
| Legal document search | Contracts, case law, filings | Citation accuracy is critical; always surface source chunks |
| Medical records Q&A | Clinical notes, research papers | Strict access control; HIPAA/GDPR compliance required |
| Code documentation search | API docs, READMEs, changelogs | Split on function/class boundaries, not character counts |
| Enterprise wiki Q&A | Confluence, Notion, internal wikis | Metadata filtering by team, project, or date is essential |

Hybrid search in production: Pure semantic search sometimes misses exact terms: product codes, error numbers, names. Production systems often combine keyword search (BM25) with vector search, then rerank with a cross-encoder model. LangChain supports this via EnsembleRetriever, which merges both result sets before passing them to the LLM.
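A common way to merge the two result lists is reciprocal rank fusion (RRF): a document's fused score is the sum of 1/(k + rank) over every list it appears in, so documents found by both retrievers rise to the top. This sketch assumes you already have ranked document IDs from a BM25 search and from a vector search (the IDs below are invented):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists. Each appearance of a doc at
    0-based position `rank` contributes 1 / (k + rank + 1) to its score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_err_404", "doc_intro", "doc_faq"]      # keyword hits (exact codes)
vector_results = ["doc_faq", "doc_err_404", "doc_pricing"]  # semantic hits

fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused)  # docs found by BOTH retrievers rank first
```

The constant k=60 is the conventional default from the RRF literature; it damps the advantage of a single first-place hit so that agreement between retrievers outweighs any one retriever's top pick.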


🧪 Practical Exercises

The best way to build intuition for RAG is to break things intentionally. These three exercises take you from raw text to a working Q&A chain.

Exercise 1 โ€” Load and Chunk a Local File

Load a .txt file, split it into small chunks, and inspect what the splitter produces. Print the first three chunks to see exactly where boundaries fall; this makes chunking strategy concrete.

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("my_notes.txt")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = splitter.split_documents(docs)

print(f"Total chunks: {len(chunks)}")
for chunk in chunks[:3]:
    print(chunk.page_content)
    print("---")

Exercise 2 โ€” Index into ChromaDB and Run a Similarity Search

Take the chunks from Exercise 1, index them into a persisted ChromaDB collection, and run raw similarity search before attaching an LLM. This isolates retrieval quality from answer quality.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    chunks, embeddings, persist_directory="./test_chroma"
)

results = vectorstore.similarity_search("What is the main topic?", k=3)
for r in results:
    print(r.page_content)

Exercise 3 โ€” Wire Up RetrievalQA and Test Edge Cases

Wrap everything in a RetrievalQA chain and test three questions: one the document covers well, one it covers partially, and one it doesn't cover at all. Observe how the model behaves when relevant context is absent; this reveals whether it hallucinates or correctly says "I don't know."

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

for q in ["[question your doc covers well]", "[partial topic]", "[topic not in doc]"]:
    result = qa.invoke({"query": q})
    print(result["result"])
    print(f"Sources: {len(result['source_documents'])} chunks\n")

🛠️ LangChain, ChromaDB, and FAISS: The Three-Library RAG Stack

The RAG pipeline described above maps exactly onto three open-source libraries that the community has standardized on for production systems.

LangChain is the orchestration layer: it provides document loaders, text splitters, retriever abstractions, and ready-made RAG chains (RetrievalQA, ConversationalRetrievalChain) that wire embedding models, vector stores, and LLMs together with a consistent API.

ChromaDB is an open-source, embeddable vector database optimized for developer iteration: it persists vectors to disk, supports metadata filtering, and runs in-process (no separate server required), making it the default choice for prototyping and small-to-medium production RAG systems.

FAISS (Facebook AI Similarity Search) is a high-performance similarity search library optimized for billion-scale vector collections; it runs entirely in memory and is the go-to choice when retrieval latency and throughput are the primary constraints.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma, FAISS
from langchain.chains import RetrievalQA

# Step 1: Load and chunk the document
loader = PyPDFLoader("product_manual.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Step 2a: ChromaDB - persistent, metadata-filterable (recommended for most use cases)
chroma_store = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
chroma_retriever = chroma_store.as_retriever(search_kwargs={"k": 4})

# Step 2b: FAISS - in-memory, ultra-fast (recommended for high-throughput batch queries)
faiss_store = FAISS.from_documents(chunks, embeddings)
faiss_store.save_local("./faiss_index")    # persist manually
faiss_retriever = faiss_store.as_retriever(search_kwargs={"k": 4})

# Step 3: Wire the retriever into a QA chain โ€” identical API for both stores
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

for name, retriever in [("ChromaDB", chroma_retriever), ("FAISS", faiss_retriever)]:
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm, retriever=retriever, return_source_documents=True
    )
    result = qa_chain.invoke({"query": "What is the warranty period?"})
    print(f"[{name}] {result['result']}")

LangChain's retriever abstraction means the RetrievalQA chain is identical whether the backing store is ChromaDB, FAISS, Pinecone, or any other provider; the swap is one line.

For full deep-dives on LangChain, ChromaDB, and FAISS, dedicated follow-up posts are planned.


📚 Key Lessons Learned Building RAG Systems

  1. Chunking strategy is the single most important RAG parameter. Test chunk_size values of 200, 500, and 1000 on your own documents. What matters is whether each chunk contains a self-contained, answerable unit of information, not an arbitrary number of characters.

  2. Always use the same embedding model for indexing and querying. If you index with text-embedding-3-small and later query with text-embedding-3-large, the vector spaces are incompatible and retrieval quality collapses silently. Lock the model name in a config constant.

  3. Monitor retrieval quality separately from answer quality. Print the source_documents returned by the chain. If retrieval returns irrelevant chunks, fix chunking and top-K before touching the LLM prompt; the model cannot answer well from bad context.

  4. ChromaDB vs. FAISS is a prototyping vs. throughput choice. ChromaDB persists to disk and supports metadata filtering, which is ideal for iterating on a real project. FAISS is in-memory and extremely fast, better for high-throughput batch processing or research benchmarks where persistence is handled separately.

  5. Start with chain_type="stuff" and only switch when necessary. "stuff" injects all chunks at once and is simplest to debug. Switch to "map_reduce" only when your retrieved context plus the question exceeds the model's context window; map-reduce adds latency and API cost.


📌 TLDR: Summary & Key Takeaways

  • RAG = retrieve relevant document chunks + inject them into the LLM prompt at query time.
  • Chunking strategy is critical: too small loses context, too large adds noise.
  • ChromaDB handles persistence and metadata filtering; FAISS is faster but in-memory only.
  • The RetrievalQA chain is the standard LangChain building block for RAG.
  • Monitor retrieval quality separately from answer quality; bad retrieval means bad answers regardless of model.
