All Posts

RAG Explained: How to Give Your LLM a Brain Upgrade

LLMs hallucinate. RAG fixes that. Learn how Retrieval-Augmented Generation connects ChatGPT to your private data.

Abstract Algorithms · 12 min read

TLDR: LLMs have a training cut-off and no access to private data. RAG (Retrieval-Augmented Generation) solves both problems by retrieving relevant documents from an external store and injecting them into the prompt before generation. No retraining required.


📖 The Open-Book Exam Analogy

A standard LLM is like a student who has memorized everything in a textbook but cannot consult notes during the exam. Helpful for general questions; unreliable when the answer changed after the book was printed.

A RAG-enhanced LLM is like the same student with an open-book policy. Before answering, they quickly scan for the relevant pages, read them, and incorporate those facts into the answer.

Why this matters:

| Property | Standard LLM | RAG-enhanced LLM |
| --- | --- | --- |
| Knowledge source | Training data only (static) | Training data + external index (dynamic) |
| Private/proprietary data | No access | Yes, via your vector store |
| Hallucination risk | Higher (guesses from patterns) | Lower (grounded in retrieved docs) |
| Update cost | Full retraining | Update the index only |

๐Ÿ” The Three-Step RAG Pipeline

Every RAG system, regardless of framework, follows the same three steps:

  1. Retrieve: convert the query to an embedding vector, search a vector database for the nearest stored document embeddings, and return the top-k chunks.
  2. Augment: inject the retrieved chunks into the prompt as context.
  3. Generate: the LLM produces a response grounded in the provided context.

graph TD
    A[User Query] --> B[Embed query with embedding model]
    B --> C[Vector DB similarity search: top-k cosine nearest neighbors]
    C --> D[Retrieved document chunks]
    D --> E[Augmented prompt: system + context + query]
    E --> F[LLM]
    F --> G[Grounded response]
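Under the hood, the loop is simple enough to sketch without a framework. In this illustration, embed, search, and llm_complete are hypothetical placeholders for your embedding model, vector store client, and LLM client; the function just wires the three steps together:

```python
# Framework-free sketch of the three RAG steps.
# embed(), search(), and llm_complete() are hypothetical placeholders.

def rag_answer(query, index, embed, search, llm_complete, k=4):
    # 1. Retrieve: embed the query, fetch the k nearest chunks.
    query_vec = embed(query)
    chunks = search(index, query_vec, k=k)

    # 2. Augment: inject the retrieved chunks into the prompt as context.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generate: the LLM answers grounded in the retrieved context.
    return llm_complete(prompt)
```

Every framework in this post (LangChain included) is, at its core, an elaboration of this one function.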

โš™๏ธ How Retrieval Actually Works: Embeddings and Cosine Similarity

Every piece of text, document or query, is transformed into a dense vector by an embedding model (e.g., text-embedding-3-small, nomic-embed-text).

Semantic similarity between query vector $q$ and document vector $d_i$ is measured by cosine similarity:

$$\mathrm{sim}(q, d_i) = \frac{q \cdot d_i}{\|q\|\,\|d_i\|}$$

The vector store returns the top-k document chunks with the highest similarity scores.
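The formula is a one-liner in code. A toy illustration with made-up 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_sim(q, d):
    # sim(q, d) = (q . d) / (||q|| * ||d||)
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

q  = [0.9, 0.1, 0.0]   # query embedding (toy 3-D example)
d1 = [0.8, 0.2, 0.1]   # document close in meaning to the query
d2 = [0.0, 0.1, 0.9]   # unrelated document

print(round(cosine_sim(q, d1), 3))  # high, close to 1.0
print(round(cosine_sim(q, d2), 3))  # low, close to 0.0
```

Vectors pointing in the same direction score near 1.0; orthogonal (unrelated) vectors score near 0.0.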

Minimal Python RAG skeleton (LangChain):

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Build index from documents
texts = ["Redis is an in-memory key-value store.", "PostgreSQL supports ACID transactions."]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(texts, embeddings)

# 2. Build retrieval chain
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# 3. Query
result = qa_chain.invoke("What is Redis used for?")
print(result["result"])
# → "Redis is an in-memory key-value store used for caching..."

📊 RAG Query Pipeline

sequenceDiagram
    participant U as User
    participant E as Embed Model
    participant V as Vector DB
    participant L as LLM
    U->>E: Query text
    E->>V: Query embedding
    V-->>E: Top-K chunks
    E-->>L: Augmented prompt
    L-->>U: Grounded response

📊 End-to-End RAG Data Flow

This diagram shows the complete data journey, from document ingestion through query answering, in a single view.

graph TD
    subgraph Offline Indexing
        A[Raw Documents] --> B[Chunker: split into 300-500 token pieces]
        B --> C[Embedding Model: chunk → float32 vector]
        C --> D[Vector Store: FAISS / Pinecone / pgvector]
    end
    subgraph Online Query
        E[User Query] --> F[Embed query with same model]
        F --> G[Cosine similarity search in Vector Store]
        G --> H[Top-k relevant chunks retrieved]
        H --> I[Augmented Prompt: system + chunks + query]
        I --> J[LLM generates grounded response]
    end
    D --> G

The offline indexing pipeline and the online query pipeline share exactly one thing: the embedding model. Using different models for indexing and querying is a common mistake that causes retrieval to silently fail, because the vector spaces will not align.

Critical constraint: Always use the same embedding model version for both indexing and querying. Upgrading the embedding model requires re-indexing all documents.


🧠 Deep Dive: Vector Search and Embedding Space

Every chunk and query is projected into the same high-dimensional vector space by the embedding model. Cosine similarity measures the angle between two vectors, not their length, so short and long chunks are compared fairly. Vector databases like FAISS and Pinecone use approximate nearest-neighbor (ANN) algorithms (e.g., HNSW) to search millions of vectors in milliseconds, trading a tiny recall loss for a 100× or greater speed gain over exact exhaustive search.
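For intuition, here is the exact exhaustive top-k search that those ANN indexes approximate, sketched in pure Python over toy 2-D vectors. A real store replaces the full scan with an HNSW-style graph traversal:

```python
import math

def top_k(query_vec, doc_vecs, k=2):
    """Exact (exhaustive) top-k cosine search: the result ANN indexes approximate."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    scored = [(cos(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)        # scores every stored vector
    return scored[:k]                # ANN (e.g. HNSW) skips most of this scan

docs = [[1.0, 0.0], [0.9, 0.4], [0.0, 1.0]]   # toy 2-D "embeddings"
print(top_k([1.0, 0.1], docs, k=2))           # nearest two: indices 0 and 1
```

Exhaustive search is exact but linear in the number of vectors; that linear scan is precisely the cost ANN structures amortize away.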


๐ŸŒ Real-World Applications: RAG in Production: Indexing Pipeline

Before queries can be answered, documents must be indexed. The indexing pipeline runs offline (and on updates):

graph LR
    A[Raw Documents: .pdf, .md, .html] --> B[Chunker: split into 300-500 token chunks]
    B --> C[Embedding Model: chunk → float32 vector]
    C --> D[Vector Store: FAISS, Pinecone, Weaviate, pgvector]
    D --> E[Index ready for retrieval]

Chunking strategy matters. Too large: retrieval returns diluted context. Too small: chunks lose semantic coherence.

| Parameter | Typical value | Effect |
| --- | --- | --- |
| Chunk size | 300–500 tokens | Larger = more context, noisier retrieval |
| Chunk overlap | 50–100 tokens | Avoids cutting key facts at boundaries |
| Top-k retrieved | 3–8 | More chunks = richer context but longer prompt |
| Similarity threshold | > 0.75 (cosine) | Filters weak matches |
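A minimal sketch of the sliding-window chunker these parameters describe. It splits on words as a stand-in for tokens (production chunkers count model tokens, e.g. with tiktoken):

```python
def chunk_text(text, chunk_size=400, overlap=80):
    """Split text into overlapping windows. Words stand in for tokens here."""
    words = text.split()
    step = chunk_size - overlap              # each window advances by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                            # last window already covers the tail
    return chunks

doc = "w " * 1000                            # toy 1000-word document
pieces = chunk_text(doc, chunk_size=400, overlap=80)
print(len(pieces))                           # 3 windows: [0:400], [320:720], [640:1000]
```

The overlap means the last 80 words of each chunk reappear at the start of the next, so a fact straddling a boundary is always intact in at least one chunk.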

📊 Document Ingestion Pipeline

flowchart TD
    D[Raw Documents] --> C[Chunking]
    C --> EM[Embedding Model]
    EM --> VS[Vector Store]
    VS --> IDX[Indexed & Searchable]

โš–๏ธ Trade-offs & Failure Modes: When RAG Works Well โ€” and When It Doesn't

RAG excels when:

  • Data changes frequently: update the index without retraining.
  • Queries require private/proprietary context that only you have.
  • You need traceable source attribution.
  • Hallucination risk is unacceptable (medical, legal, financial).

RAG struggles when:

  • Retrieved chunks are irrelevant: garbage in, garbage out.
  • The answer requires multi-document reasoning across many chunks.
  • The latency budget is tight: retrieval adds roughly 50–200 ms.
  • The knowledge is stable and general: at that point fine-tuning is cheaper.

| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Irrelevant retrieval | LLM ignores context or hallucinates anyway | Better embeddings; re-rank retrieved chunks |
| Context too long | LLM truncates or loses focus | Reduce top-k; better chunking |
| Stale index | Answers based on outdated info | Incremental index updates + TTL policies |
| Keyword mismatch | Query words don't match doc words semantically | Use dense (semantic) + sparse (BM25) hybrid retrieval |

🧭 Decision Guide: Choosing Between RAG, Fine-tuning, and Prompt Engineering

| Approach | When to use | Cost | Freshness |
| --- | --- | --- | --- |
| Prompt engineering | Task format/style adjustment | Lowest | Static data only |
| RAG | Dynamic, private, or frequently changing data | Medium | Real-time via index updates |
| Fine-tuning | Domain vocabulary, tone, or format at scale | High (GPU + data) | Frozen at training time |
| RAG + fine-tuning | Best retrieval AND specialized behavior | Highest | Real-time data + domain adaptation |


🧪 Hands-On: Build and Query a Minimal RAG System

The fastest way to internalize RAG is to run a working system locally. The following walkthrough uses FAISS (in-memory) and LangChain so there are no external API dependencies beyond an OpenAI key.

Prerequisites:

pip install langchain langchain-openai faiss-cpu tiktoken
export OPENAI_API_KEY=sk-...

Step 1 โ€” Create documents and build the index:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

docs = [
    Document(page_content="Redis is an in-memory key-value store used for caching and pub/sub."),
    Document(page_content="PostgreSQL is a relational database supporting ACID transactions and JSONB."),
    Document(page_content="Kafka is a distributed event-streaming platform built for high-throughput."),
]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)

Step 2 โ€” Query the index directly (retrieval only):

results = vectorstore.similarity_search("What database supports transactions?", k=2)
for r in results:
    print(r.page_content)
# → PostgreSQL is a relational database...
# → Redis is an in-memory key-value store...

Step 3 โ€” Connect retrieval to generation:

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 2}))

answer = qa.invoke("Which database should I use for a cache layer?")
print(answer["result"])
# → "Based on the context, Redis is the right choice for a cache layer..."

What to observe: ask a question whose answer is NOT in the documents (e.g., "What is MongoDB?") and note that the model either says it lacks context or falls back to its training knowledge. This is expected behavior, and it is why production RAG systems often include a fallback instruction in the system prompt.
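Such a fallback instruction is just a few lines in the system prompt. A sketch of one possible template (the wording and the build_prompt helper are illustrative, not a standard API):

```python
# Fallback instruction template; the exact wording is illustrative.
SYSTEM_PROMPT = """You are a helpful assistant. Answer ONLY from the context below.
If the context does not contain the answer, reply exactly:
"I don't have enough information in the provided documents."

Context:
{context}
"""

def build_prompt(context_chunks, question):
    # Fill the template with retrieved chunks; the question goes in the user turn.
    system = SYSTEM_PROMPT.format(context="\n\n".join(context_chunks))
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

msgs = build_prompt(["Redis is an in-memory key-value store."], "What is Redis?")
print(msgs[0]["content"])
```

Pinning the refusal string makes "no answer" responses easy to detect and measure downstream.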


๐Ÿ› ๏ธ LangChain + ChromaDB: A Persistent Local RAG Stack

LangChain is an open-source Python orchestration framework for building LLM pipelines: it provides document loaders, text splitters, retrieval chains, and prompt templates that wire together the three RAG steps (retrieve → augment → generate) with minimal boilerplate. ChromaDB is a lightweight, persistent vector store that runs locally (no cloud account needed) and integrates natively with LangChain, making it one of the fastest ways to run a production-realistic RAG pipeline on a laptop.

Together they solve the key RAG problems from this post: consistent embedding models at index and query time, configurable chunk size and overlap, top-k retrieval, and a RetrievalQA chain that injects retrieved context into the prompt automatically.

# pip install langchain langchain-openai langchain-community chromadb tiktoken

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.schema import Document

# ── Step 1: Prepare documents ─────────────────────────────────────────────────
docs_raw = [
    Document(page_content="Redis is an in-memory key-value store used for caching and real-time leaderboards."),
    Document(page_content="PostgreSQL is a relational database supporting ACID transactions, JSONB, and full-text search."),
    Document(page_content="Kafka is a distributed event-streaming platform built for high-throughput, fault-tolerant pipelines."),
    Document(page_content="ChromaDB is an open-source vector database designed for embedding storage and similarity search."),
]

# ── Step 2: Split into chunks (400 characters, 80-character overlap) ──────────
# Note: chunk_size here counts characters; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder() to count tokens instead.
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_documents(docs_raw)
print(f"Chunks created: {len(chunks)}")

# ── Step 3: Embed and store in ChromaDB (persisted to disk) ───────────────────
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",   # survives process restarts
    collection_name="tech_docs",
)
print("Index size:", vectorstore._collection.count())  # → 4

# ── Step 4: Build RetrievalQA chain ───────────────────────────────────────────
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # top-2 chunks
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)          # temperature 0 for grounded answers

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",          # "stuff" = inject all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,
)

# ── Step 5: Query ─────────────────────────────────────────────────────────────
result = qa_chain.invoke({"query": "Which database should I use for a real-time leaderboard?"})
print("\nAnswer:", result["result"])
# → "Based on the context, Redis is the best choice for a real-time leaderboard..."
print("\nSources used:")
for doc in result["source_documents"]:
    print(" -", doc.page_content[:80], "...")

# ── Step 6: Reload the persisted index on the next run ────────────────────────
# (no re-indexing needed; ChromaDB loads from disk)
vectorstore_reload = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_model,
    collection_name="tech_docs",
)

The persist_directory parameter makes ChromaDB durable across restarts, a key production requirement. The return_source_documents=True flag enables source attribution (Lesson 4 in the lessons below), letting you show users which document chunk grounded each answer.

For a full deep-dive on LangChain and ChromaDB for production RAG pipelines, a dedicated follow-up post is planned.


📚 Lessons from RAG in Production

Lesson 1: Retrieval quality is the ceiling for answer quality. No matter how powerful the LLM, it cannot synthesize a correct answer from irrelevant chunks. Invest in retrieval before tuning generation parameters.

Lesson 2: Chunk size is the most impactful tuning lever. Chunks that are too large retrieve diluted context ("the answer is somewhere in this 1,000-token chunk"). Chunks that are too small lose the surrounding sentences that give meaning. Start at 400 tokens with 80-token overlap and measure retrieval recall.

Lesson 3: Hybrid retrieval (dense + sparse) outperforms pure semantic search. BM25 keyword search catches exact product names and identifiers that semantic embeddings miss. Reciprocal Rank Fusion combines both result lists without requiring score normalization.
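Reciprocal Rank Fusion itself is only a few lines. A sketch assuming two ranked lists of document IDs, with the conventional k=60 smoothing constant:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; no score normalization needed across lists.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc3", "doc1", "doc7"]   # semantic (embedding) ranking
sparse = ["doc1", "doc9", "doc3"]   # BM25 keyword ranking
print(reciprocal_rank_fusion([dense, sparse]))
# doc1 and doc3 appear in both lists, so they rise to the top
```

Because RRF uses only ranks, it sidesteps the fact that cosine scores and BM25 scores live on incomparable scales.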

Lesson 4: Always attribute sources. Return the document source (URL, page number, filename) alongside the answer. Source attribution converts a hallucination risk into a verifiable fact. It also builds user trust and enables debugging.

Lesson 5: Treat the RAG pipeline as a data pipeline. Index freshness, embedding model versioning, and chunk metadata management are engineering problems, not ML problems. Apply the same observability practices you would to any production data pipeline: monitor indexing lag, set alerts on retrieval latency, and version your embedding models.


📌 TLDR: Summary & Key Takeaways

  • RAG grounds LLM responses in external documents without any model retraining.
  • The pipeline: embed query → nearest-neighbor search → inject chunks → generate.
  • Cosine similarity measures how semantically close a query is to each stored document chunk.
  • Chunk size, top-k, and similarity threshold are the three tuning levers for retrieval quality.
  • RAG is the right default for private, domain-specific, or frequently updated knowledge bases.

๐Ÿ“ Practice Quiz

  1. What is the primary advantage of RAG over fine-tuning for keeping an LLM up to date?

    • A) RAG is always cheaper to run
    • B) RAG retrieves current documents at inference time without retraining
    • C) RAG improves mathematical reasoning
    • D) RAG increases the model's context window

    Correct Answer: B. Fine-tuned knowledge is frozen at training time. RAG queries a live index, so the model can answer questions about documents added after its training cutoff.

  2. What does cosine similarity measure in the context of RAG retrieval?

    • A) The exact word overlap between query and document
    • B) The semantic angle between two embedding vectors โ€” higher score means more similar
    • C) The number of tokens shared between query and chunk
    • D) The distance in physical storage between vectors

    Correct Answer: B. Cosine similarity measures the angle between vectors in high-dimensional space. Vectors pointing in the same direction (similar meaning) have a similarity score close to 1.0.

  3. A RAG system returns highly relevant chunks, but the LLM response still contains wrong facts. What is the most likely cause?

    • A) The embedding model is too large
    • B) The LLM is ignoring the provided context (context-faithfulness failure)
    • C) The vector database index is corrupted
    • D) The chunk size is too small

    Correct Answer: B. This is a context-faithfulness failure: the model "knows" a conflicting fact from pretraining and overrides the retrieved context. Mitigation: add explicit instructions like "Answer only using the provided context. If the context does not contain the answer, say so."



Written by Abstract Algorithms (@abstractalgorithms)