RAG Explained: How to Give Your LLM a Brain Upgrade
LLMs hallucinate. RAG fixes that. Learn how Retrieval-Augmented Generation connects ChatGPT to your private data.
Abstract Algorithms
TLDR: LLMs have a training cut-off and no access to private data. RAG (Retrieval-Augmented Generation) solves both problems by retrieving relevant documents from an external store and injecting them into the prompt before generation. No retraining required.
The Open-Book Exam Analogy
A standard LLM is like a student who has memorized everything in a textbook but cannot consult notes during the exam. Helpful for general questions; unreliable when the answer changed after the book was printed.
A RAG-enhanced LLM is like the same student with an open-book policy. Before answering, they quickly scan for the relevant pages, read them, and incorporate those facts into the answer.
Why this matters:
| Property | Standard LLM | RAG-enhanced LLM |
| --- | --- | --- |
| Knowledge source | Training data only (static) | Training data + external index (dynamic) |
| Private/proprietary data | No access | Yes, via your vector store |
| Hallucination risk | Higher (guesses from patterns) | Lower (grounded in retrieved docs) |
| Update cost | Full retraining | Update the index only |
The Three-Step RAG Pipeline
Every RAG system, regardless of framework, follows the same three steps:
- Retrieve: Convert the query to an embedding vector. Search a vector database for the nearest stored document embeddings. Return the top-N chunks.
- Augment: Inject the retrieved chunks into the prompt as context.
- Generate: The LLM generates a response grounded in the provided context.
```mermaid
graph TD
    A[User Query] --> B[Embed Query<br/>vec = embed_model]
    B --> C[Vector DB Similarity Search<br/>top-k cosine nearest neighbors]
    C --> D[Retrieved Document Chunks]
    D --> E[Augmented Prompt<br/>System + Context + Query]
    E --> F[LLM]
    F --> G[Grounded Response]
```
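The three steps above can be sketched in plain Python. The `embed`, `retrieve`, and `augment` helpers here are illustrative stand-ins (a toy bag-of-words "embedding" instead of a real dense vector model), not a real library API:

```python
# Toy RAG pipeline: retrieve -> augment -> (generate is left to the LLM).
# All helpers below are hypothetical stand-ins for real components.
DOCS = [
    "Redis is an in-memory key-value store.",
    "PostgreSQL supports ACID transactions.",
]

def embed(text: str) -> set:
    """Stand-in embedding: a set of lowercase words (real systems use dense vectors)."""
    return set(text.lower().replace(".", "").replace("?", "").split())

def retrieve(query: str, top_n: int = 1) -> list:
    """Score each document by word overlap with the query; return the top N chunks."""
    scored = sorted(DOCS, key=lambda d: len(embed(d) & embed(query)), reverse=True)
    return scored[:top_n]

def augment(query: str, chunks: list) -> str:
    """Inject retrieved chunks into the prompt ahead of the user question."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "What supports ACID transactions?"
prompt = augment(query, retrieve(query))
print(prompt)  # the augmented prompt that would be sent to the LLM
```

The generate step is then a single LLM call with `prompt` as input; everything RAG-specific happens before the model ever sees the question.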
How Retrieval Actually Works: Embeddings and Cosine Similarity
Every piece of text, document or query alike, is transformed into a dense vector by an embedding model (e.g., text-embedding-3-small, nomic-embed-text).
Semantic similarity between query vector $q$ and document vector $d_i$ is measured by cosine similarity:
$$\text{sim}(q, d_i) = \frac{q \cdot d_i}{\|q\|\,\|d_i\|}$$
The vector store returns the top-k document chunks with the highest similarity scores.
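As a minimal illustration of the formula and the top-k step (pure Python, with toy 3-dimensional vectors standing in for real embeddings):

```python
import math

def cosine_similarity(q, d):
    """sim(q, d) = (q . d) / (||q|| * ||d||)"""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)

query = [0.9, 0.1, 0.0]
docs = {
    "redis":    [0.8, 0.2, 0.1],  # points in nearly the same direction as the query
    "postgres": [0.1, 0.9, 0.2],
    "kafka":    [0.0, 0.2, 0.9],
}

# Rank document chunks by similarity, highest first; take the top-k (k = 2 here).
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked[:2])  # → ['redis', 'postgres']
```

Real vector stores do exactly this ranking, just over millions of vectors with approximate-nearest-neighbor indexes instead of a full sort.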
Minimal Python RAG skeleton (LangChain):

```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Build index from documents
texts = ["Redis is an in-memory key-value store.", "PostgreSQL supports ACID transactions."]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(texts, embeddings)

# 2. Build retrieval chain
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# 3. Query
result = qa_chain.invoke({"query": "What is Redis used for?"})
print(result["result"])
# → "Redis is an in-memory key-value store used for caching..."
```
RAG Query Pipeline
```mermaid
sequenceDiagram
    participant U as User
    participant E as Embed Model
    participant V as Vector DB
    participant L as LLM
    U->>E: Query text
    E->>V: Query embedding
    V-->>E: Top-K chunks
    E-->>L: Augmented prompt
    L-->>U: Grounded response
```
End-to-End RAG Data Flow
This diagram shows the complete data journey, from document ingestion through query answering, in a single view.
```mermaid
graph TD
    subgraph Offline Indexing
        A[Raw Documents] --> B[Chunker: split into 300-500 token pieces]
        B --> C[Embedding Model: chunk → float32 vector]
        C --> D[Vector Store: FAISS / Pinecone / pgvector]
    end
    subgraph Online Query
        E[User Query] --> F[Embed query with same model]
        F --> G[Cosine similarity search in Vector Store]
        G --> H[Top-k relevant chunks retrieved]
        H --> I[Augmented Prompt: system + chunks + query]
        I --> J[LLM generates grounded response]
    end
    D --> G
```
The offline indexing pipeline and the online query pipeline share exactly one thing: the embedding model. Using different models for indexing and querying is a common mistake that causes retrieval to silently fail, because the vector spaces will not align.
Critical constraint: Always use the same embedding model version for both indexing and querying. Upgrading the embedding model requires re-indexing all documents.
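One way to enforce this constraint is to store the embedding model name as index metadata and fail fast on mismatch. The sketch below is a hypothetical guard, not a feature of any particular vector store; the names and values are illustrative:

```python
# Hypothetical guard: record the embedding model alongside the index and
# refuse to serve queries embedded with a different model.
INDEX_METADATA = {"embedding_model": "text-embedding-3-small", "dimensions": 1536}

def check_query_model(query_model: str) -> None:
    """Raise if the query-time embedding model differs from the one used to index."""
    indexed = INDEX_METADATA["embedding_model"]
    if query_model != indexed:
        raise ValueError(
            f"Query model {query_model!r} != index model {indexed!r}: "
            "the vectors live in different spaces; re-index before querying."
        )

check_query_model("text-embedding-3-small")  # passes silently
```

A mismatched model here raises immediately instead of silently returning low-quality neighbors, which is the failure mode described above.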
Deep Dive: Vector Search and Embedding Space
Every chunk and query is projected into the same high-dimensional vector space by the embedding model. Cosine similarity measures the angle between two vectors, not their length, so short and long chunks are compared fairly. Vector databases like FAISS and Pinecone use approximate nearest-neighbor (ANN) algorithms (e.g., HNSW) to search millions of vectors in milliseconds, trading a tiny recall loss for a 100×+ speed gain over exact exhaustive search.
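The "angle, not length" property is easy to verify: scaling a vector changes its length but not its direction, so its cosine similarity to any query is unchanged. A quick check with toy vectors:

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

v = [0.3, 0.4, 0.5]  # e.g. a short chunk's embedding
w = [0.6, 0.8, 1.0]  # same direction, twice the length (a "longer" chunk)
q = [0.1, 0.9, 0.2]  # a query vector

# Doubling a vector's length leaves the angle to q (and thus the score) identical.
print(abs(cos_sim(q, v) - cos_sim(q, w)) < 1e-9)  # → True
```

This is why cosine similarity is preferred over a raw dot product when chunk lengths vary: the dot product would reward longer vectors regardless of direction.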
RAG in Production: The Indexing Pipeline
Before queries can be answered, documents must be indexed. The indexing pipeline runs offline (and on updates):
```mermaid
graph LR
    A[Raw Documents<br/>.pdf, .md, .html] --> B[Chunker<br/>split into 300-500 token chunks]
    B --> C[Embedding Model<br/>chunk → float32 vector]
    C --> D[Vector Store<br/>FAISS, Pinecone, Weaviate, pgvector]
    D --> E[Index ready for retrieval]
```
Chunking strategy matters. Too large: retrieval returns diluted context. Too small: chunks lose semantic coherence.
| Parameter | Typical value | Effect |
| --- | --- | --- |
| Chunk size | 300–500 tokens | Larger = more context, noisier retrieval |
| Chunk overlap | 50–100 tokens | Avoids cutting key facts at boundaries |
| Top-k retrieved | 3–8 | More chunks = richer context but longer prompt |
| Similarity threshold | > 0.75 (cosine) | Filters weak matches |
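The chunk size and overlap parameters from the table can be sketched with a simple word-based splitter. Real pipelines split on tokens, not words, so this is an illustrative approximation rather than a production chunker:

```python
def chunk_words(words, chunk_size=400, overlap=80):
    """Split a word list into chunks of `chunk_size`, repeating `overlap` words
    at each boundary so a fact straddling the boundary appears in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1000)]  # stand-in for a 1,000-word document
chunks = chunk_words(words, chunk_size=400, overlap=80)
print(len(chunks), [len(c) for c in chunks])  # → 3 [400, 400, 360]
```

Note how the last 80 words of each chunk reappear as the first 80 words of the next; that duplicated window is exactly what prevents a key sentence from being cut in half at a boundary.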
Document Ingestion Pipeline
```mermaid
flowchart TD
    D[Raw Documents] --> C[Chunking]
    C --> EM[Embedding Model]
    EM --> VS[Vector Store]
    VS --> IDX[Indexed & Searchable]
```
Trade-offs & Failure Modes: When RAG Works Well and When It Doesn't
RAG excels when:
- Data changes frequently: update the index without retraining.
- Queries require private/proprietary context only you have.
- You need traceable source attribution.
- Hallucination risk is unacceptable (medical, legal, financial).
RAG struggles when:
- Retrieved chunks are irrelevant: garbage in, garbage out.
- The answer requires multi-document reasoning across many chunks.
- Latency budget is tight: retrieval adds roughly 50–200 ms.
- The knowledge is stable and general; fine-tuning is cheaper at that point.
| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Irrelevant retrieval | LLM ignores context or hallucinates anyway | Better embeddings; re-rank retrieved chunks |
| Context too long | LLM truncates or loses focus | Reduce top-k; better chunking |
| Stale index | Answers based on outdated info | Incremental index updates + TTL policies |
| Keyword mismatch | Query words don't match doc words semantically | Use dense (semantic) + sparse (BM25) hybrid retrieval |
Decision Guide: Choosing Between RAG, Fine-tuning, and Prompt Engineering
| Approach | When to use | Cost | Freshness |
| --- | --- | --- | --- |
| Prompt engineering | Task format/style adjustment | Lowest | Static data only |
| RAG | Dynamic, private, or frequently changing data | Medium | Real-time via index updates |
| Fine-tuning | Domain vocabulary, tone, or format at scale | High (GPU + data) | Frozen at training time |
| RAG + fine-tuning | Best retrieval AND specialized behavior | Highest | Real-time data + domain adaptation |
What to Learn Next
- Tokenization Explained: How LLMs Understand Text
- LLM Terms: A Helpful Glossary
- Advanced AI Agents: RAG and the Future of Intelligence
Hands-On: Build and Query a Minimal RAG System
The fastest way to internalize RAG is to run a working system locally. The following walkthrough uses FAISS (in-memory) and LangChain so there are no external API dependencies beyond an OpenAI key.
Prerequisites:

```shell
pip install langchain langchain-openai faiss-cpu tiktoken
export OPENAI_API_KEY=sk-...
```
Step 1: Create documents and build the index:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

docs = [
    Document(page_content="Redis is an in-memory key-value store used for caching and pub/sub."),
    Document(page_content="PostgreSQL is a relational database supporting ACID transactions and JSONB."),
    Document(page_content="Kafka is a distributed event-streaming platform built for high-throughput."),
]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
```
Step 2: Query the index directly (retrieval only):

```python
results = vectorstore.similarity_search("What database supports transactions?", k=2)
for r in results:
    print(r.page_content)
# → PostgreSQL is a relational database...
# → Redis is an in-memory key-value store...
```
Step 3: Connect retrieval to generation:

```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 2}))
answer = qa.invoke({"query": "Which database should I use for a cache layer?"})
print(answer["result"])
# → "Based on the context, Redis is the right choice for a cache layer..."
```
What to observe: Ask a question whose answer is NOT in the documents (e.g., "What is MongoDB?") and note that the model either says it does not have context or falls back to its training knowledge. This is the expected behavior, and it is why production RAG systems often include a fallback instruction in the system prompt.
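One hedged example of such a fallback instruction. The wording below is illustrative, not a canonical prompt; tune it for your own model and domain:

```python
# Illustrative system prompt with an explicit fallback instruction, so the
# model declines rather than improvising when retrieval comes up empty.
SYSTEM_PROMPT = """You are a question-answering assistant.
Answer ONLY from the context below. If the context does not contain
the answer, reply exactly: "I don't have enough context to answer that."

Context:
{context}
"""

def build_prompt(chunks, question):
    """Join retrieved chunks and place them inside the system prompt."""
    context = "\n---\n".join(chunks)
    return SYSTEM_PROMPT.format(context=context) + f"\nQuestion: {question}"

prompt = build_prompt(["Redis is an in-memory key-value store."], "What is MongoDB?")
print(prompt)
```

With this instruction in place, a question like "What is MongoDB?" should yield the explicit refusal string rather than an answer drawn from pretraining.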
LangChain + ChromaDB: A Persistent Local RAG Stack
LangChain is an open-source Python orchestration framework for building LLM pipelines: it provides document loaders, text splitters, retrieval chains, and prompt templates that wire together the three RAG steps (retrieve → augment → generate) with minimal boilerplate. ChromaDB is a lightweight, persistent vector store that runs locally (no cloud account needed) and integrates natively with LangChain, making it the fastest way to run a production-realistic RAG pipeline on a laptop.
Together they solve the key RAG problems from this post: consistent embedding models at index and query time, configurable chunk size and overlap, top-k retrieval, and a RetrievalQA chain that injects retrieved context into the prompt automatically.
```python
# pip install langchain langchain-openai langchain-community chromadb tiktoken
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.schema import Document

# ── Step 1: Prepare documents ────────────────────────────────────────────────
docs_raw = [
    Document(page_content="Redis is an in-memory key-value store used for caching and real-time leaderboards."),
    Document(page_content="PostgreSQL is a relational database supporting ACID transactions, JSONB, and full-text search."),
    Document(page_content="Kafka is a distributed event-streaming platform built for high-throughput, fault-tolerant pipelines."),
    Document(page_content="ChromaDB is an open-source vector database designed for embedding storage and similarity search."),
]

# ── Step 2: Split into chunks (400 chars, 80-char overlap) ───────────────────
# Note: RecursiveCharacterTextSplitter counts characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_documents(docs_raw)
print(f"Chunks created: {len(chunks)}")

# ── Step 3: Embed and store in ChromaDB (persisted to disk) ──────────────────
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",  # survives process restarts
    collection_name="tech_docs",
)
print("Index size:", vectorstore._collection.count())  # → 4

# ── Step 4: Build RetrievalQA chain ──────────────────────────────────────────
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # top-2 chunks
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # low temperature for grounded answers
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = inject all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,
)

# ── Step 5: Query ────────────────────────────────────────────────────────────
result = qa_chain.invoke({"query": "Which database should I use for a real-time leaderboard?"})
print("\nAnswer:", result["result"])
# → "Based on the context, Redis is the best choice for a real-time leaderboard..."
print("\nSources used:")
for doc in result["source_documents"]:
    print(" -", doc.page_content[:80], "...")

# ── Step 6: Reload persisted index on next run ───────────────────────────────
# (no re-indexing needed: ChromaDB loads from disk)
vectorstore_reload = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_model,
    collection_name="tech_docs",
)
```
The `persist_directory` parameter makes ChromaDB durable across restarts, a key production requirement. The `return_source_documents=True` flag enables source attribution (Lesson 4 from the lessons section), letting you show users which document chunk grounded each answer.
For a full deep-dive on LangChain and ChromaDB for production RAG pipelines, a dedicated follow-up post is planned.
Lessons from RAG in Production
Lesson 1: Retrieval quality is the ceiling for answer quality. No matter how powerful the LLM, it cannot synthesize a correct answer from irrelevant chunks. Invest in retrieval before tuning generation parameters.
Lesson 2: Chunk size is the most impactful tuning lever. Chunks that are too large retrieve diluted context ("the answer is somewhere in this 1,000-token chunk"). Chunks that are too small lose the surrounding sentences that give them meaning. Start at 400 tokens with 80-token overlap and measure retrieval recall.
Lesson 3: Hybrid retrieval (dense + sparse) outperforms pure semantic search. BM25 keyword search catches exact product names and identifiers that semantic embeddings miss. Reciprocal Rank Fusion combines both result lists without requiring score normalization.
Lesson 4: Always attribute sources. Return the document source (URL, page number, filename) alongside the answer. Source attribution converts a hallucination risk into a verifiable fact. It also builds user trust and enables debugging.
Lesson 5: Treat the RAG pipeline as a data pipeline. Index freshness, embedding model versioning, and chunk metadata management are engineering problems, not ML problems. Apply the same observability practices you would to any production data pipeline: monitor indexing lag, set alerts on retrieval latency, and version your embedding models.
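Reciprocal Rank Fusion itself is only a few lines. A sketch of standard RRF (with the conventional k = 60 smoothing constant; the document IDs below are placeholders):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists: each document scores sum(1 / (k + rank)),
    so documents ranked highly by multiple retrievers rise to the top.
    No score normalization needed because only ranks are used."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]  # semantic (embedding) ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 keyword ranking
print(reciprocal_rank_fusion([dense, sparse]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because both retrievers rank it near the top, even though neither ranks it first; that agreement bonus is exactly why RRF works without comparing raw cosine and BM25 scores.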
TLDR: Summary & Key Takeaways
- RAG grounds LLM responses in external documents without any model retraining.
- The pipeline: embed query → nearest-neighbor search → inject chunks → generate.
- Cosine similarity measures how semantically close a query is to each stored document chunk.
- Chunk size, top-k, and similarity threshold are the three tuning levers for retrieval quality.
- RAG is the right default for private, domain-specific, or frequently updated knowledge bases.
Practice Quiz
What is the primary advantage of RAG over fine-tuning for keeping an LLM up to date?
- A) RAG is always cheaper to run
- B) RAG retrieves current documents at inference time without retraining
- C) RAG improves mathematical reasoning
- D) RAG increases the model's context window
Correct Answer: B. Fine-tuned knowledge is frozen at training time. RAG queries a live index, so the model can answer questions about documents added after its training cutoff.
What does cosine similarity measure in the context of RAG retrieval?
- A) The exact word overlap between query and document
- B) The angle between two embedding vectors (a higher score means more semantically similar)
- C) The number of tokens shared between query and chunk
- D) The distance in physical storage between vectors
Correct Answer: B. Cosine similarity measures the angle between vectors in high-dimensional space. Vectors pointing in the same direction (similar meaning) have a similarity score close to 1.0.
A RAG system returns highly relevant chunks, but the LLM response still contains wrong facts. What is the most likely cause?
- A) The embedding model is too large
- B) The LLM is ignoring the provided context (context-faithfulness failure)
- C) The vector database index is corrupted
- D) The chunk size is too small
Correct Answer: B. This is a context-faithfulness failure. The model "knows" a conflicting fact from pretraining and overrides the retrieved context. Mitigation: add explicit instructions like "Answer only using the provided context. If the context does not contain the answer, say so."
Related Posts
- Tokenization Explained: How LLMs Understand Text
- LLM Terms: A Helpful Glossary
- RAG with LangChain and ChromaDB Guide