
Guide to Using RAG with LangChain and ChromaDB/FAISS

Build a 'Chat with PDF' app in 10 minutes.

Abstract Algorithms · 4 min read

TL;DR: RAG (Retrieval-Augmented Generation) gives an LLM access to your private documents at query time. You chunk and embed documents into a vector store (ChromaDB or FAISS), retrieve the most relevant chunks for each query, and inject them into the LLM's prompt. The model answers from real data instead of hallucinating.


📖 Why LLMs Hallucinate and How RAG Fixes It

An LLM's knowledge is frozen at its training cutoff. Ask it about your internal documentation, your product catalog, or a document uploaded today: it will generate a plausible-sounding answer from its weights, not from your data.

RAG bridges this gap:

  • No fine-tuning required (fine-tuning is expensive and won't help for dynamic data)
  • Works with any LLM
  • Updates in real time: add new documents to the vector store and they're immediately searchable
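The core mechanism is simple: retrieved text gets pasted into the prompt before the model is called. Here is a dependency-free sketch of that prompt-assembly step (the function name and prompt wording are illustrative, not a LangChain API):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite its sources.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase.",
     "Shipping takes 3-5 business days."],
)
print(prompt)
```

Because the answer must come from the supplied context, the model is grounded in your data rather than its training weights.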

🔢 Step 1: Chunking and Embedding Your Documents

Before storing, split your documents into chunks and convert each to an embedding.

Why chunking?

| Chunk too small | Chunk too large |
| --- | --- |
| Missing context | Too much noise injected into LLM |
| Loses coherence across sentences | Uses up context window budget |
| Many irrelevant chunks returned | Retrieval quality degrades |

Overlap (typically 10โ€“20% of chunk size) preserves context across chunk boundaries.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)
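What overlap does is easiest to see without LangChain. Here is a simplified character-window sketch (the real RecursiveCharacterTextSplitter additionally tries to break at separators like paragraphs and sentences):

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each window starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghijklmnop", chunk_size=8, chunk_overlap=2)
print(chunks)  # → ['abcdefgh', 'ghijklmn', 'mnop']
```

The shared boundary characters are what keep a sentence that straddles two chunks retrievable from either side.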

โš™๏ธ Step 2: Storing and Searching with ChromaDB

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Retrieve top-4 most relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

ChromaDB stores embeddings locally (persisted to disk). FAISS is an alternative: in-memory only, faster for pure search, no built-in persistence.

| Feature | ChromaDB | FAISS |
| --- | --- | --- |
| Persistence | Yes (disk) | No (in-memory, manual save) |
| Metadata filtering | Yes | Limited |
| Latency | Slightly higher | Very low |
| Best for | Prototyping + production | High-throughput search, research |
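Under the hood, both stores answer the same question: which stored vectors are nearest the query embedding? A dependency-free sketch of flat (exact) inner-product search, the same idea as a FAISS flat index, using toy 3-dimensional vectors in place of real embeddings:

```python
def top_k(query: list[float], index: list[list[float]], k: int) -> list[int]:
    # Score every stored vector by inner product with the query,
    # then return the indices of the k highest-scoring vectors.
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in index]
    return sorted(range(len(index)), key=lambda i: scores[i], reverse=True)[:k]

index = [[1.0, 0.0, 0.0],   # chunk 0
         [0.0, 1.0, 0.0],   # chunk 1
         [0.9, 0.1, 0.0]]   # chunk 2, similar to chunk 0
print(top_k([1.0, 0.0, 0.0], index, k=2))  # → [0, 2]
```

FAISS's speed comes from doing exactly this over millions of vectors with optimized (and optionally approximate) index structures.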

🧠 Step 3: The RetrievalQA Chain in LangChain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # "stuff" = inject all context at once
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])

The chain_type="stuff" option injects all retrieved chunks into one prompt. When the retrieved content is too large to fit in the context window, use "map_reduce" (summarize each chunk separately, then combine the summaries) or "refine" (build up the answer one chunk at a time).
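The difference between the two strategies can be sketched with a stand-in `llm` function (a toy truncating "model" here, in place of a real API call):

```python
def llm(prompt: str) -> str:
    # Stand-in for a real model call with a 60-character "context window".
    return prompt[:60]

def stuff(chunks: list[str], question: str) -> str:
    # One call: all chunks concatenated into a single prompt.
    return llm("\n".join(chunks) + "\n\nQ: " + question)

def map_reduce(chunks: list[str], question: str) -> str:
    # One call per chunk, then a final call over the partial answers.
    partials = [llm(c + "\n\nQ: " + question) for c in chunks]
    return llm("\n".join(partials) + "\n\nQ: " + question)
```

"stuff" is one cheap call but is bounded by the context window; "map_reduce" trades more calls (and higher cost) for the ability to cover arbitrarily many chunks.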

flowchart LR
    Q[User Question] --> Embed[Embed question]
    Embed --> Retrieve[Retrieve top-k chunks\nfrom ChromaDB]
    Retrieve --> Prompt[Inject chunks into\nLLM prompt]
    Prompt --> LLM[LLM call]
    LLM --> A[Answer with citations]

โš–๏ธ What Can Go Wrong: Retrieval Quality Traps

| Problem | Symptom | Fix |
| --- | --- | --- |
| Chunks too large | Irrelevant content in context | Reduce chunk size, test retrieval quality |
| Top-k too low | Answer misses key details | Increase k, or use a reranker |
| Embedding model mismatch | Poor retrieval | Use the same model for indexing and querying |
| No metadata filtering | Returns documents from the wrong project | Add where filters in ChromaDB |
| Chain type wrong for large docs | Context overflow | Switch to map_reduce or refine |

📌 Key Takeaways

  • RAG = retrieve relevant document chunks + inject them into the LLM prompt at query time.
  • Chunking strategy is critical: too small loses context, too large adds noise.
  • ChromaDB handles persistence and metadata filtering; FAISS is faster but in-memory only.
  • The RetrievalQA chain is the standard LangChain building block for RAG.
  • Monitor retrieval quality separately from answer quality: bad retrieval means bad answers regardless of model.
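One simple way to monitor retrieval on its own is hit rate at k: for test questions with known source chunks, measure how often the right chunk appears in the top-k results. A minimal sketch, where `retrieve` is a hypothetical stand-in for your real retriever:

```python
def hit_rate_at_k(eval_set, retrieve, k: int = 4) -> float:
    # eval_set: list of (question, expected_chunk_id) pairs.
    # retrieve(question, k) -> list of chunk ids, best match first.
    hits = sum(expected in retrieve(q, k) for q, expected in eval_set)
    return hits / len(eval_set)

# Toy retriever over a fake index, just to show the shape of the metric.
fake_index = {"refunds": ["c1", "c7"], "shipping": ["c2", "c3"]}
retrieve = lambda q, k: fake_index.get(q, [])[:k]

print(hit_rate_at_k([("refunds", "c1"), ("shipping", "c9")], retrieve))  # → 0.5
```

If this number is low, no amount of prompt engineering or model upgrading will fix your answers; go back to chunking, embeddings, or filtering first.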

🧩 Test Your Understanding

  1. Why can't you just fine-tune an LLM instead of using RAG for private documents?
  2. You set chunk_overlap=0 and notice answers miss context from the boundary of two chunks. What should you change?
  3. When should you use chain_type="map_reduce" instead of "stuff"?
  4. What is the risk of using different embedding models for indexing and querying?

Written by Abstract Algorithms (@abstractalgorithms)