Guide to Using RAG with LangChain and ChromaDB/FAISS
Build a 'Chat with PDF' app in 10 minutes. We walk through the code for loading documents, creati...
Abstract Algorithms
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: RAG (Retrieval-Augmented Generation) gives an LLM access to your private documents at query time. You chunk and embed documents into a vector store (ChromaDB or FAISS), retrieve the relevant chunks at query time, and inject them into the LLM's prompt. The model answers from real data instead of hallucinating.
Why LLMs Hallucinate and How RAG Fixes It
An LLM's knowledge is frozen at its training cutoff. Ask it about your internal documentation, your product catalog, or a document uploaded today, and it will generate a plausible-sounding answer from its weights, not from your data.
RAG bridges this gap:
- No fine-tuning required (fine-tuning is expensive and won't help for dynamic data)
- Works with any LLM
- Updates in real time: add new documents to the vector store and they're immediately searchable
Key Concepts: Embeddings and Vector Similarity
When you search for "refund policy" in a document, a keyword search returns results containing those exact words. But what if your policy document says "money-back guarantee"? Keyword search misses it. Embeddings solve this.
An embedding is a list of numbers: a high-dimensional vector (e.g., 1,536 numbers for OpenAI's text-embedding-3-small) that encodes the semantic meaning of a piece of text. Similar meanings produce vectors that point in similar directions in that space.
Cosine similarity measures the angle between two vectors. A score near 1.0 means "very similar meaning"; near 0 means "unrelated." This is how "money-back guarantee" and "refund policy" score highly similar: they occupy neighboring regions in embedding space even though they share no keywords.
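To make this concrete, here is a minimal sketch of cosine similarity on toy 4-dimensional vectors. Real embeddings have hundreds or thousands of dimensions; the numbers below are invented purely for illustration.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: imagine these came from an embedding model
refund = [0.9, 0.1, 0.8, 0.2]        # "refund policy"
guarantee = [0.85, 0.15, 0.75, 0.3]  # "money-back guarantee"
weather = [0.1, 0.9, 0.05, 0.7]      # "tomorrow's weather"

print(cosine_similarity(refund, guarantee))  # close to 1.0: similar meaning
print(cosine_similarity(refund, weather))    # much lower: unrelated
```

Identical vectors score exactly 1.0; orthogonal ones score 0. Note the two "similar" vectors above share no components with the query text itself, which is the whole point: similarity lives in the vector space, not in the surface words.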
The RAG pipeline at a conceptual level:
- Embed: convert every document chunk into a vector
- Store: save those vectors in a vector database (ChromaDB, FAISS, Pinecone)
- Retrieve: embed the user's query and find the top-K closest vectors by cosine similarity
- Inject: paste the matched chunks into the LLM's prompt as context
Why RAG beats fine-tuning for dynamic/private data: Fine-tuning bakes knowledge into model weights. It costs hundreds of dollars per training run, takes hours, and requires full retraining every time your data changes. RAG adds new documents to the vector store in seconds with no model retraining. You also get source citations for free: you know exactly which chunks the answer came from.
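The four conceptual steps above can be sketched end to end in plain Python. The bag-of-words "embedding" below is a toy stand-in for a real embedding model, chosen only so the example runs without any API; everything else (the vocabulary, the chunks, the prompt template) is made up for illustration.

```python
import math

def embed(text, vocab=("refund", "guarantee", "shipping", "warranty")):
    # Toy embedding: count vocabulary words. Real systems use a neural model.
    words = text.lower().split()
    return [words.count(term) for term in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Embed + 2. Store
chunks = [
    "Our refund guarantee lasts 30 days.",
    "Standard shipping takes 5 business days.",
    "The warranty covers manufacturing defects.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 3. Retrieve the top-K closest chunks for a query
query = "What is the refund guarantee?"
qvec = embed(query)
ranked = sorted(store, key=lambda item: cosine(qvec, item[1]), reverse=True)
top_k = [chunk for chunk, _ in ranked[:2]]

# 4. Inject the matched chunks into the LLM prompt as context
prompt = "Answer using only this context:\n" + "\n".join(top_k) + f"\nQuestion: {query}"
print(top_k[0])  # the refund chunk ranks first
```

Swap the toy `embed` for a real embedding model and the list for a vector database, and this is structurally the same pipeline the rest of the post builds with LangChain.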
Full RAG Pipeline Sequence
sequenceDiagram
participant D as Documents
participant S as TextSplitter
participant E as EmbeddingModel
participant V as ChromaDB
participant U as User
participant R as Retriever
participant L as LLM
D->>S: Split into chunks
S->>E: Embed each chunk
E->>V: Store vectors + metadata
U->>R: Submit query
R->>E: Embed query
E->>V: Cosine similarity search
V-->>R: Top-K chunks
R->>L: Inject chunks into prompt
L-->>U: Grounded answer + citations
This sequence diagram shows the two-phase RAG lifecycle in one view. The top half is the offline indexing phase: documents are split, embedded, and stored in ChromaDB before any user query arrives. The bottom half is the online retrieval phase: the user's query is embedded, the top-K most similar chunks are retrieved by cosine similarity, injected into the LLM's prompt, and a grounded answer is returned with source citations. The key takeaway is that retrieval quality, not generation quality, is the dominant factor in RAG answer quality; bad context guarantees bad answers regardless of model size.
LangChain RAG Chain Components
flowchart LR
DL["DocumentLoader (PDF, TXT, HTML)"]
TS["TextSplitter chunk_size=500"]
EM["Embeddings (OpenAI / HuggingFace)"]
VS["VectorStore (ChromaDB / FAISS)"]
RT["Retriever top_k=4"]
LLM["LLM (gpt-4o-mini)"]
ANS[Answer]
DL --> TS --> EM --> VS --> RT --> LLM --> ANS
This flowchart maps each LangChain component to its stage in the RAG pipeline, showing the linear dependency chain from raw document to final answer. The DocumentLoader ingests source files, the TextSplitter creates manageable chunks, the Embeddings model converts text to vectors, the VectorStore indexes and retrieves them, and the Retriever feeds the top-K chunks to the LLM. Understanding this component chain makes debugging straightforward: isolate which stage is broken by inspecting intermediate outputs (chunks, retrieved docs, injected prompt) before assuming the model is at fault.
Step 1: Chunking and Embedding Your Documents
Before storing, split your documents into chunks and convert each to an embedding.
Why chunking?
| Chunk too small | Chunk too large |
| --- | --- |
| Missing context | Too much noise injected into LLM |
| Loses coherence across sentences | Uses up context window budget |
| Many irrelevant chunks returned | Retrieval quality degrades |
Overlap (typically 10–20% of chunk size) preserves context across chunk boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,  # ~500 characters per chunk: self-contained enough for focused retrieval without losing sentence coherence
chunk_overlap=50,  # 10% overlap so sentences split at a boundary appear in both adjacent chunks
)
chunks = splitter.split_documents(docs)
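To see what overlap actually does, here is a deliberately naive character-window chunker. This is not LangChain's algorithm (RecursiveCharacterTextSplitter splits on separators like paragraphs and sentences first); it is a hypothetical sketch that makes the overlap mechanics visible.

```python
def chunk_text(text, chunk_size=40, chunk_overlap=10):
    # Naive sliding window: each chunk starts (chunk_size - chunk_overlap)
    # characters after the previous one, so adjacent chunks share the
    # overlap region at their boundary.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Refunds are available within 30 days of purchase for any unused item."
chunks = chunk_text(text)
for c in chunks:
    print(repr(c))

# The tail of chunk N is the head of chunk N+1:
assert chunks[0][-10:] == chunks[1][:10]
```

A sentence that straddles a boundary now appears (at least partially) in both chunks, so a query matching it can retrieve either one with usable context.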
Step 2: Storing and Searching with ChromaDB
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
)
# Retrieve top-4 most relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
ChromaDB stores embeddings locally (persisted to disk). FAISS is an alternative: in-memory only, faster for pure search, no built-in persistence.
| Feature | ChromaDB | FAISS |
| --- | --- | --- |
| Persistence | Yes (disk) | No (in-memory, manual save) |
| Metadata filtering | Yes | Limited |
| Latency | Slightly higher | Very low |
| Best for | Prototyping + production | High-throughput search, research |
Deep Dive: The RetrievalQA Chain in LangChain
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" = inject all context at once
retriever=retriever,
return_source_documents=True,
)
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])
chain_type="stuff" injects all retrieved chunks into a single prompt. When the retrieved context plus the question exceeds the model's context window, switch to "map_reduce" (summarize each chunk separately, then combine) or "refine" (build up the answer chunk by chunk).
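The difference between the two strategies is easiest to see in miniature. The sketch below uses a fake `call_llm` function as a stand-in for the real model call (all names here are hypothetical, not LangChain internals) and simply counts how many LLM calls each strategy makes.

```python
def call_llm(prompt):
    # Stand-in for a real LLM call; we only record that a call happened.
    call_llm.calls += 1
    return f"answer based on: {prompt[:30]}..."
call_llm.calls = 0

chunks = ["chunk one text", "chunk two text", "chunk three text"]
question = "What is the refund policy?"

# "stuff": one call, with every chunk pasted into a single prompt
def stuff_chain(chunks, question):
    context = "\n\n".join(chunks)
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")

# "map_reduce": one call per chunk (map), then one call to combine (reduce)
def map_reduce_chain(chunks, question):
    partials = [call_llm(f"Summarize for '{question}':\n{c}") for c in chunks]
    return call_llm("Combine these partial answers:\n" + "\n".join(partials))

stuff_chain(chunks, question)
print("stuff calls:", call_llm.calls)   # 1 call total
map_reduce_chain(chunks, question)
print("total calls:", call_llm.calls)   # 1 + (3 map + 1 reduce) = 5
```

This is why the post recommends starting with "stuff": map-reduce multiplies API calls (and therefore latency and cost) by roughly the number of retrieved chunks.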
flowchart LR
Q[User Question] --> Embed[Embed question]
Embed --> Retrieve[Retrieve top-k chunks from ChromaDB]
Retrieve --> Prompt[Inject chunks into LLM prompt]
Prompt --> LLM[LLM call]
LLM --> A[Answer with citations]
Internals
RAG pipelines split documents into overlapping chunks (typically 512–1024 tokens with 10–20% overlap), embed each chunk with a sentence encoder, and store embeddings in a vector index. At query time, the query is embedded and the top-k chunks are retrieved by approximate nearest-neighbor search (HNSW or IVF). The retrieved chunks are injected into the LLM prompt as context, constraining generation to retrieved evidence.
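HNSW and IVF are approximate indexes; the baseline they approximate is exact brute-force search, which is easy to sketch. The 3-dimensional vectors and IDs below are invented for illustration.

```python
import math

def top_k_exact(query_vec, index, k=2):
    # Brute-force nearest neighbours by cosine similarity: O(n) per query.
    # HNSW/IVF trade a little recall for sub-linear query time at large n.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scored = sorted(index, key=lambda item: cos(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = [
    ("chunk-a", [0.9, 0.1, 0.0]),
    ("chunk-b", [0.0, 1.0, 0.1]),
    ("chunk-c", [0.8, 0.2, 0.1]),
]
print(top_k_exact([1.0, 0.0, 0.0], index))  # chunk-a and chunk-c score highest
```

On a few thousand vectors, exact search like this is perfectly fine; the approximate indexes earn their complexity only as the collection grows toward millions of vectors.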
Performance Analysis
ChromaDB with an HNSW index retrieves top-5 results from 100K documents in under 10 ms on CPU. Embedding 1M tokens with text-embedding-ada-002 costs ~$0.10 and takes ~5 minutes. End-to-end RAG latency on a 100K-document corpus is typically 200–500 ms (embedding + retrieval + LLM), versus 50–150 ms for a pure LLM call: a 3–5× latency trade-off that largely eliminates hallucination on factual queries.
Trade-offs & Failure Modes: Retrieval Quality Traps
| Problem | Symptom | Fix |
| --- | --- | --- |
| Chunks too large | Irrelevant content in context | Reduce chunk size, test retrieval quality |
| Top-k too low | Answer misses key details | Increase k, or use reranker |
| Embedding model mismatch | Poor retrieval | Use same model for indexing and querying |
| No metadata filtering | Returns documents from wrong project | Add where filters in ChromaDB |
| Chain type wrong for large docs | Context overflow | Switch to map_reduce or refine |
The RAG Pipeline: End-to-End Flow
The complete RAG workflow has two phases: an offline indexing phase that builds the vector store, and an online retrieval phase that grounds each LLM response in retrieved evidence.
flowchart TD
A[Documents] --> B[Chunking]
B --> C[Embedding Model]
C --> D[ChromaDB Vector Index]
E[User Query] --> F[Query Embedding]
F --> G[ANN Search]
G --> D
G --> H[Top-k Chunks]
H --> I[LLM + Context Window]
I --> J[Grounded Answer]
Decision Guide: RAG Architecture End-to-End
RAG splits neatly into two phases that run at different times:
- Indexing phase (offline, runs once per document set): Load documents → split into chunks → embed each chunk → persist vectors in ChromaDB. This is a batch process you rerun only when documents change. It can be slow (thousands of embedding API calls), so run it ahead of time.
- Retrieval phase (online, runs per query): Embed the user question → search the vector store for the top-K closest chunks → inject chunks into the LLM prompt → return the answer. This must be fast (under 200 ms for good UX); the vector search itself is typically sub-millisecond.
Understanding this split helps you debug: if answers are bad, ask first "is retrieval returning the right chunks?" before blaming the LLM.
flowchart TD
Docs["Source Documents (PDFs, TXT, HTML)"]
Split["Text Splitter chunk_size=500, overlap=50"]
Embed["Embedding Model (text-embedding-3-small)"]
Store["Vector Store (ChromaDB / FAISS)"]
Query[User Query]
QEmbed[Embed Query]
Retrieve[Retrieve Top-K Chunks]
Prompt[Inject Chunks into LLM Prompt]
LLM["LLM Call (gpt-4o-mini)"]
Answer[Answer + Source Citations]
Docs --> Split --> Embed --> Store
Query --> QEmbed --> Retrieve
Store --> Retrieve
Retrieve --> Prompt --> LLM --> Answer
Real-World Applications of RAG
RAG powers document Q&A systems across every industry. The core pattern is identical; the differences are in which documents get indexed and what guardrails are required.
| Use Case | Documents Indexed | Special Consideration |
| --- | --- | --- |
| Chat with PDF | Uploaded PDFs, reports | Per-user isolation; never mix tenants' data in one collection |
| Customer support KB | Help articles, FAQs | High update frequency; re-index on every content publish |
| Legal document search | Contracts, case law, filings | Citation accuracy is critical; always surface source chunks |
| Medical records Q&A | Clinical notes, research papers | Strict access control; HIPAA/GDPR compliance required |
| Code documentation search | API docs, READMEs, changelogs | Split on function/class boundaries, not character counts |
| Enterprise wiki Q&A | Confluence, Notion, internal wikis | Metadata filtering by team, project, or date is essential |
Hybrid search in production: Pure semantic search sometimes misses exact terms such as product codes, error numbers, and names. Production systems often combine keyword search (BM25) with vector search, then rerank with a cross-encoder model. LangChain supports this via EnsembleRetriever, which merges both result sets before passing them to the LLM.
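A common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), the default scoring used by hybrid retrievers such as LangChain's EnsembleRetriever. Here is a minimal sketch, assuming each retriever returns document IDs already ranked best-first; the document IDs are invented for illustration.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each list holds document IDs ranked best-first. A doc's fused score is
    # the sum of 1 / (k + rank) over every list it appears in; k=60 is the
    # constant from the original RRF paper and damps the impact of rank 1.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc-err-4042", "doc-intro", "doc-faq"]    # exact keyword hits
vector_results = ["doc-faq", "doc-refunds", "doc-intro"]   # semantic hits
fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused)  # doc-faq ranks first: it scores well in both lists
```

Documents that appear in both result lists float to the top, which is exactly the behavior you want: an exact product-code match that is also semantically relevant beats a match found by only one retriever.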
Practical Exercises
The best way to build intuition for RAG is to break things intentionally. These three exercises take you from raw text to a working Q&A chain.
Exercise 1: Load and Chunk a Local File
Load a .txt file, split it into small chunks, and inspect what the splitter produces. Print the first three chunks to see exactly where boundaries fall; this makes chunking strategy concrete.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("my_notes.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = splitter.split_documents(docs)
print(f"Total chunks: {len(chunks)}")
for chunk in chunks[:3]:
print(chunk.page_content)
print("---")
Exercise 2: Index into ChromaDB and Run a Similarity Search
Take the chunks from Exercise 1, index them into a persisted ChromaDB collection, and run raw similarity search before attaching an LLM. This isolates retrieval quality from answer quality.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
chunks, embeddings, persist_directory="./test_chroma"
)
results = vectorstore.similarity_search("What is the main topic?", k=3)
for r in results:
print(r.page_content)
Exercise 3: Wire Up RetrievalQA and Test Edge Cases
Wrap everything in a RetrievalQA chain and test three questions: one the document covers well, one it covers partially, and one it doesn't cover at all. Observe how the model behaves when relevant context is absent; this reveals whether it hallucinates or correctly says "I don't know."
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True,
)
for q in ["[question your doc covers well]", "[partial topic]", "[topic not in doc]"]:
result = qa.invoke({"query": q})
print(result["result"])
print(f"Sources: {len(result['source_documents'])} chunks\n")
LangChain, ChromaDB, and FAISS: The Three-Library RAG Stack
The RAG pipeline described above maps exactly onto three open-source libraries that the community has standardized on for production systems.
LangChain is the orchestration layer: it provides document loaders, text splitters, retriever abstractions, and ready-made RAG chains (RetrievalQA, ConversationalRetrievalChain) that wire embedding models, vector stores, and LLMs together with a consistent API.
ChromaDB is an open-source, embeddable vector database optimized for developer iteration: it persists vectors to disk, supports metadata filtering, and runs in-process (no separate server required), making it the default choice for prototyping and small-to-medium production RAG systems.
FAISS (Facebook AI Similarity Search) is a high-performance similarity search library optimized for billion-scale vector collections; it runs entirely in memory and is the go-to choice when retrieval latency and throughput are the primary constraints.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma, FAISS
from langchain.chains import RetrievalQA
# Step 1: Load and chunk the document
loader = PyPDFLoader("product_manual.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Step 2a: ChromaDB - persistent, metadata-filterable (recommended for most use cases)
chroma_store = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
chroma_retriever = chroma_store.as_retriever(search_kwargs={"k": 4})
# Step 2b: FAISS - in-memory, ultra-fast (recommended for high-throughput batch queries)
faiss_store = FAISS.from_documents(chunks, embeddings)
faiss_store.save_local("./faiss_index") # persist manually
faiss_retriever = faiss_store.as_retriever(search_kwargs={"k": 4})
# Step 3: Wire the retriever into a QA chain (identical API for both stores)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
for name, retriever in [("ChromaDB", chroma_retriever), ("FAISS", faiss_retriever)]:
qa_chain = RetrievalQA.from_chain_type(
llm=llm, retriever=retriever, return_source_documents=True
)
result = qa_chain.invoke({"query": "What is the warranty period?"})
print(f"[{name}] {result['result']}")
LangChain's retriever abstraction means the RetrievalQA chain is identical whether the backing store is ChromaDB, FAISS, Pinecone, or any other provider: the swap is one line.
For full deep-dives on LangChain, ChromaDB, and FAISS, dedicated follow-up posts are planned.
Key Lessons Learned Building RAG Systems
- Chunking strategy is the single most important RAG parameter. Test `chunk_size` values of 200, 500, and 1000 on your own documents. What matters is whether each chunk contains a self-contained, answerable unit of information, not an arbitrary number of characters.
- Always use the same embedding model for indexing and querying. If you index with `text-embedding-3-small` and later query with `text-embedding-3-large`, the vector spaces are incompatible and retrieval quality collapses silently. Lock the model name in a config constant.
- Monitor retrieval quality separately from answer quality. Print the `source_documents` returned by the chain. If retrieval returns irrelevant chunks, fix chunking and top-K before touching the LLM prompt; the model cannot answer well from bad context.
- ChromaDB vs. FAISS is a prototyping vs. throughput choice. ChromaDB persists to disk and supports metadata filtering, ideal for iterating on a real project. FAISS is in-memory and extremely fast, better for high-throughput batch processing or research benchmarks where persistence is handled separately.
- Start with `chain_type="stuff"` and only switch when necessary. `"stuff"` injects all chunks at once and is simplest to debug. Switch to `"map_reduce"` only when your retrieved context plus the question exceeds the model's context window; map-reduce adds latency and API cost.
TLDR: Summary & Key Takeaways
- RAG = retrieve relevant document chunks + inject them into the LLM prompt at query time.
- Chunking strategy is critical: too small loses context, too large adds noise.
- ChromaDB handles persistence and metadata filtering; FAISS is faster but in-memory only.
- The `RetrievalQA` chain is the standard LangChain building block for RAG.
- Monitor retrieval quality separately from answer quality: bad retrieval = bad answers regardless of model.

Written by
Abstract Algorithms
@abstractalgorithms
More Posts
RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)
TLDR: RAG gives LLMs access to current knowledge at inference time; fine-tuning changes how they reason and write. Use RAG when your data changes. Use fine-tuning when you need consistent style, tone, or domain reasoning. Use both for production assi...
Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive
TLDR: LoRA freezes the base model and trains two tiny matrices per layer: 0.1% of parameters, 70% less GPU memory, near-identical quality. QLoRA adds 4-bit NF4 quantization of the frozen base, enabling 70B fine-tuning on 2× A100 80 GB instead of 8...
Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs
TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M tokens/day with a dedicated MLOps team. The build ...
Watermarking and Late Data Handling in Spark Structured Streaming
TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global minimum across all partitions, subtracts the thresho...
