Advanced AI: Agents, RAG, and the Future of Intelligence
Is an LLM a brain in a jar? To make it truly useful, we need to give it access to the world. This guide explains RAG and Agents.
Abstract Algorithms
TLDR: Large Language Models are brilliant "brains in a jar." Retrieval-Augmented Generation (RAG) hands them a constantly refreshed memory, while AI Agents give them tools to act in the world. Combined, they turn static knowledge into dynamic, goal-directed intelligence: the next frontier of practical AI.
🧭 Decision Guide: From Smart Chatbots to Systems That Actually Act
Klarna's AI support bot answered 2.3 million customer queries in its first month, but reportedly hallucinated refund policies roughly 15% of the time because it had no access to Klarna's current policy documents. The model was brilliant at generating fluent, confident text; it simply didn't know what the actual rules were on any given day.
RAG (Retrieval-Augmented Generation) fixes this by fetching relevant documents from a live knowledge base before the LLM generates its response, so it answers with actual policy, not a statistical guess. AI Agents go further: they give the model tools to act, like querying a database, calling an API, or updating a ticket record.
Advanced AI = a Large Language Model augmented with:
- Retrieval-Augmented Generation (RAG): up-to-date knowledge beyond the training cutoff, and
- AI Agents: tools to invoke external APIs and perform real-world actions.
| Feature | Plain LLM | LLM + RAG | LLM + Agents | LLM + RAG + Agents |
| --- | --- | --- | --- | --- |
| Access to real-time data | ❌ | ✅ | ✅ via tools | ✅ |
| Ability to execute commands | ❌ | ❌ | ✅ | ✅ |
| Hallucination mitigation | Low | Medium | Medium | High |
| Use-case complexity | Simple Q&A | Knowledge-heavy Q&A | Automation | End-to-end autonomous workflows |
🏛️ The Two Pillars: RAG and Agents Explained
Retrieval-Augmented Generation (RAG)
Problem: LLMs are trained on a static snapshot of the internet. Anything beyond that cutoff (stock prices, internal docs, breaking news) is invisible to them.
Idea: Before generating a response, a retriever fetches the most relevant passages from an external corpus (vector store, database, search engine). The LLM then conditions on those passages, effectively reading fresh material.
AI Agents
Problem: An LLM can only output text. It cannot click a button, call an API, or run code.
Idea: Wrap the LLM in a controller that can decide which tool (web browser, calculator, database client, custom function) to invoke. The tool runs, returns a result, and the LLM incorporates that into the next reasoning step.
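That controller loop can be sketched in a few lines of Python. Everything here is an illustrative stand-in: a dict registry of mock tools, and a crude heuristic playing the role of the LLM's tool-selection decision.

```python
# Minimal tool-dispatch controller: a registry maps tool names to callables,
# and a (mocked) policy decides which tool a query needs.
import re
from typing import Callable, Dict

def calculator(query: str) -> str:
    # Extract the first arithmetic expression and evaluate it (toy example).
    expr = re.search(r"\d[\d+\-*/ .()]*", query).group()
    return str(eval(expr, {"__builtins__": {}}))

def web_search(query: str) -> str:
    return f"[stub] top result for: {query}"  # stand-in for a real search API

TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": calculator,
    "web_search": web_search,
}

def select_tool(query: str) -> str:
    # In a real agent the LLM makes this decision; here, a crude heuristic.
    return "calculator" if any(ch.isdigit() for ch in query) else "web_search"

def run_controller(query: str) -> str:
    tool = select_tool(query)
    return f"{tool} -> {TOOLS[tool](query)}"

print(run_controller("What is 2 + 3?"))   # calculator -> 5
print(run_controller("Latest AI news"))   # web_search -> [stub] top result for: Latest AI news
```

In production, the "select then execute then feed back" cycle repeats until the controller decides no further tool call is needed.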
Toy Knowledge Base
| doc_id | title | content excerpt |
| --- | --- | --- |
| 1 | Q1 2024 Earnings | "Revenue grew 12% YoY, driven by cloud services." |
| 2 | Paris Weather 2024-03-06 | "Morning: 8 °C, Light rain; Afternoon: 14 °C, Sunny." |
| 3 | OpenAI API Pricing | "$0.02 per 1k tokens for gpt-4-turbo." |
A vector search on "What's the weather in Paris today?" retrieves doc 2, which the LLM uses to answer accurately instead of hallucinating or refusing.
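Here is a self-contained sketch of that lookup. Instead of learned embeddings it uses simple bag-of-words count vectors, so the cosine-similarity mechanics stay inspectable; a real system would use a sentence-embedding model over this same corpus.

```python
# Bag-of-words cosine similarity over the toy knowledge base.
# Real systems embed with a neural encoder; word counts keep the math visible.
import math
import re
from collections import Counter

KB = {
    1: "Q1 2024 Earnings: Revenue grew 12% YoY, driven by cloud services.",
    2: "Paris Weather 2024-03-06: Morning 8 C light rain; Afternoon 14 C sunny.",
    3: "OpenAI API Pricing: $0.02 per 1k tokens for gpt-4-turbo.",
}

def vectorize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str) -> int:
    q = vectorize(query)
    return max(KB, key=lambda doc_id: cosine(q, vectorize(KB[doc_id])))

print(retrieve("What's the weather in Paris today?"))  # 2
```

"weather" and "paris" overlap only with doc 2, so it scores highest; docs 1 and 3 score zero.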
📘 What You Need to Know First
RAG (Retrieval-Augmented Generation) is a technique that grounds LLM responses in external knowledge. Instead of relying solely on weights baked in during training, a RAG pipeline fetches relevant documents at query time and injects them into the model's context. This sharply reduces hallucination on factual questions and keeps answers current without retraining.
Agents are LLM-powered systems that can plan, make decisions, and use tools to accomplish multi-step goals. Unlike a simple prompt-response loop, an agent observes its environment, selects actions (API calls, code execution, searches), and iterates until the task is done. The LLM acts as the reasoning engine; tools extend what it can affect.
Together, RAG and agents form the backbone of most production AI applications: RAG supplies fresh, grounded context while agents orchestrate complex workflows across many steps and systems.
⚙️ How RAG and Agents Work Under the Hood
```mermaid
graph LR
    A[User Input] --> B{RAG needed?}
    B -- Yes --> C[Retriever: Top-k Passages]
    C --> D[LLM contextual generation]
    B -- No --> D
    D --> E{Agent needed?}
    E -- Yes --> F[Tool Selector]
    F --> G[Execute Tool]
    G --> D
    E -- No --> H[Final Answer]
```
Retriever: a bi-encoder (e.g., Sentence-Transformers) mapping queries and documents into the same embedding space, queried via FAISS or HNSW for sub-millisecond ANN lookup.
LLM Engine: a decoder-only transformer receiving a concatenated [SYSTEM][CONTEXT][USER] prompt, where the context is the retrieved passages.
Agent Controller: an orchestrator (LLM or rule-based) that decides which registered tool to fire. Tool outputs are fed back into the next generation step.
Retrieval scoring (cosine similarity) and agent action selection formulas are derived in full in the Deep Dive section below.
🧠 Deep Dive: RAG and Agent System Internals
Internals
A production RAG pipeline has four core modules working in concert:
- Embedding Model: encodes both query and documents into a shared vector space. Bi-encoders (e.g., all-MiniLM-L6-v2) process documents at index time and queries at runtime, keeping online latency low.
- Vector Index: stores document embeddings for approximate nearest neighbor (ANN) lookup. FAISS (flat or HNSW), Pinecone, and Weaviate serve different scale requirements.
- Reranker (optional): a cross-encoder that re-scores the top-k candidates for higher precision at the cost of added latency.
- LLM Engine: receives a structured [SYSTEM][CONTEXT][USER] prompt and generates the final response conditioned on retrieved passages.
Agent internals layer on a tool registry (mapping intent labels to callable functions), a scratchpad (intermediate reasoning trace), and a step budget (maximum tool invocations to prevent infinite loops).
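Those three agent internals (tool registry, scratchpad, step budget) fit in one small skeleton. The `policy` function below is a hard-coded stand-in for the LLM's action choice, and the tool is a mock; the structure, not the logic, is the point.

```python
# Agent skeleton: tool registry + scratchpad + hard step budget.
from typing import List, Tuple

TOOLS = {
    "lookup_order": lambda arg: f"Order {arg}: shipped",  # mock tool
}

def policy(question: str, scratchpad: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return ('tool_name', arg) or ('ANSWER', text). Mocked LLM decision."""
    if not scratchpad:                      # nothing observed yet: call the tool
        return ("lookup_order", "ORD-42")
    last_observation = scratchpad[-1][1]
    return ("ANSWER", last_observation)     # enough info: answer from observation

def run_agent(question: str, max_steps: int = 5) -> str:
    scratchpad: List[Tuple[str, str]] = []  # (action, observation) reasoning trace
    for _ in range(max_steps):              # hard step budget stops runaway loops
        action, arg = policy(question, scratchpad)
        if action == "ANSWER":
            return arg
        observation = TOOLS[action](arg)
        scratchpad.append((action, observation))
    # Budget exhausted: return best available partial answer and halt.
    return "Step budget exhausted; partial trace: " + repr(scratchpad)

print(run_agent("Where is order ORD-42?"))  # Order ORD-42: shipped
```

Swapping `policy` for an LLM call and `TOOLS` for real API clients turns this skeleton into a production loop; the budget check is what guarantees termination.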
Performance Analysis
| Component | Typical latency | Scaling bottleneck |
| --- | --- | --- |
| Embedding query | 5–20 ms (GPU) | Model size |
| ANN vector search | 1–10 ms | Index size (HNSW: O(log N)) |
| LLM generation | 500–3000 ms | Token count × model size |
| Tool execution | 50–5000 ms | External API speed |
End-to-end latency is dominated by LLM generation and slow tool calls. Cache frequently-used query embeddings and parallelize independent tool invocations to reduce p99 latency.
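Both optimizations are a few lines each. This sketch simulates the expensive calls with `time.sleep`; the fake 4-dimensional "embedding" and the tool names are placeholders for real model and API calls.

```python
# Two latency optimizations: memoize query embeddings, parallelize tool calls.
import time
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed(query: str) -> tuple:
    time.sleep(0.2)                     # simulate a 200 ms embedding call
    return tuple(hash(query) % 100 for _ in range(4))  # fake 4-dim vector

def slow_tool(name: str) -> str:
    time.sleep(0.2)                     # simulate a 200 ms external API
    return f"{name}: ok"

embed("return policy?")                 # cold call: pays the full 200 ms
start = time.perf_counter()
embed("return policy?")                 # warm call: served from the LRU cache
cached = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor() as pool:      # independent tools run concurrently
    results = list(pool.map(slow_tool, ["weather_api", "order_db"]))
parallel = time.perf_counter() - start

print(f"cached embed: {cached * 1000:.1f} ms")      # near zero
print(f"2 tools in parallel: {parallel:.2f} s")     # ~0.2 s, not 0.4 s
print(results)
```

The same pattern applies at scale: a Redis cache in front of the embedding model and `asyncio.gather` over async tool clients.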
Mathematical Model
Retrieval scoring uses cosine similarity between query $\mathbf{q}$ and document $\mathbf{d}_i$:
$$s_i = \frac{\mathbf{q} \cdot \mathbf{d}_i}{\|\mathbf{q}\| \, \|\mathbf{d}_i\|}$$
Agent action selection at each reasoning step $t$:
$$\hat{a}_t = \arg\max_{a \in \mathcal{A}} P_\theta(a \mid \text{context}_t, \text{scratchpad}_t)$$
A hard step budget enforces termination: if $t > T_{\max}$, the agent returns the best available partial answer and halts.
🏗️ Advanced Architecture: Multi-Agent Orchestration
Production RAG+Agent systems rarely rely on a single agent. Common advanced patterns:
Multi-agent pipelines: a planner LLM decomposes tasks into subtasks dispatched to specialized sub-agents (researcher, coder, verifier). Results are merged by an aggregator LLM.
ReAct (Reasoning + Acting): the agent interleaves reasoning traces with tool calls, producing a scratchpad that a supervisor can inspect and override in real time.
Self-RAG: the LLM learns to decide when to retrieve using a special token, avoiding unnecessary latency when parametric knowledge is already sufficient.
| Pattern | Best for | Trade-off |
| --- | --- | --- |
| Single RAG agent | Knowledge-heavy Q&A | Simple to debug |
| Multi-agent pipeline | Complex multi-step tasks | Higher latency, coordination overhead |
| ReAct | Interleaved reasoning + tool use | Verbose scratchpad; harder to parallelize |
| Self-RAG | Mixed knowledge/retrieval tasks | Requires fine-tuning to learn the retrieve token |
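The planner/sub-agent/aggregator pipeline is structurally simple. In this sketch every LLM role is mocked with a plain function; the subtask labels and outputs are illustrative.

```python
# Multi-agent pipeline: planner -> specialized sub-agents -> aggregator.
from typing import Callable, Dict, List

def planner(task: str) -> List[str]:
    """Mocked planner LLM: decompose the task into labeled subtasks."""
    return ["research", "code", "verify"]

SUB_AGENTS: Dict[str, Callable[[str], str]] = {
    "research": lambda task: "found 3 relevant docs",
    "code": lambda task: "wrote parser.py",
    "verify": lambda task: "tests pass",
}

def aggregator(task: str, results: Dict[str, str]) -> str:
    """Mocked aggregator LLM: merge sub-agent outputs into one report."""
    return f"{task}: " + "; ".join(f"{k}={v}" for k, v in results.items())

def run_pipeline(task: str) -> str:
    subtasks = planner(task)
    results = {s: SUB_AGENTS[s](task) for s in subtasks}
    return aggregator(task, results)

print(run_pipeline("build CSV parser"))
```

The coordination overhead the table warns about lives in exactly these seams: the planner can mis-decompose, and the aggregator must reconcile conflicting sub-agent outputs.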
🔄 The Orchestration Flow: Node-by-Node
| Node | Purpose | Typical Implementation |
| --- | --- | --- |
| User Input | Raw natural-language query | API endpoint, voice assistant |
| RAG? | Does the query need external knowledge? | Heuristic: entity/date presence |
| Retriever | Fetch top-k relevant documents | FAISS, Pinecone, Elasticsearch |
| LLM Generation | Context-conditioned answer | GPT-4, Claude, Llama-2 |
| Agent Needed? | Does the answer require tool execution? | Prompt flag or rule-based check |
| Tool Selector | Pick the right tool | Registry mapping intents to callables |
| Execute Tool | Run and capture output | LangChain Tool, custom wrappers |
| Final Answer | Return polished response | Post-processing: dedup, formatting |
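The "RAG?" gate in the table can be as simple as a regex heuristic. This is a minimal sketch under that assumption; the entity list and date patterns are illustrative, and production systems often replace this with a small classifier.

```python
# Heuristic "RAG needed?" gate: retrieve when the query mentions dates,
# recent-time words, or entities the model's weights are unlikely to cover.
import re

KNOWN_ENTITIES = {"paris", "klarna", "q1", "q3"}   # illustrative entity list

def rag_needed(query: str) -> bool:
    q = query.lower()
    has_date = bool(re.search(r"\b(today|yesterday|20\d\d|next week)\b", q))
    has_entity = any(entity in q for entity in KNOWN_ENTITIES)
    return has_date or has_entity

print(rag_needed("What's the weather in Paris next week?"))  # True
print(rag_needed("Explain what a vector space is"))          # False
```

A false negative here only costs answer quality; a false positive only costs one retrieval round-trip, so the heuristic can afford to be generous.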
Example end-to-end trace
Query: "What's the weather forecast for Paris next week?"
- RAG? → Yes (contains location + time reference)
- Retriever pulls latest meteorological reports
- LLM drafts forecast conditioned on retrieved context
- Agent Needed? → No (no external API call needed)
- Final Answer returned
🌳 Agent Decision Tree
```mermaid
flowchart TD
    Q[Receive Question] --> K{Enough info?}
    K -- Yes --> A[Generate Answer]
    K -- No --> T[Select Tool]
    T --> C[Call Tool]
    C --> O[Observe Result]
    O --> K
```
This diagram shows the agent's core decision loop: it checks whether it already has enough information to answer, then selects and calls a tool if not, incorporating the observation before re-evaluating. The backward arrow from Observe Result to Enough info? is the iterative reasoning cycle: the loop continues until the question is answered or the hard step budget is reached.
🌍 Real-World Applications of RAG and Agents
| Domain | RAG use | Agent use |
| --- | --- | --- |
| Enterprise search | Retrieve internal docs, policies, runbooks | Execute API calls, update tickets |
| Customer support | Retrieve FAQs, product documentation | Book returns, check order status |
| Code assistants | Retrieve relevant codebase context | Run tests, apply patches |
| Medical research | Retrieve clinical literature | Query structured databases, flag contradictions |
| Financial analysis | Retrieve earnings reports and filings | Pull live market data, generate summaries |
The most successful production deployments combine a tight retrieval corpus (well-scoped and frequently refreshed) with conservative agent permissions (read-only by default; write actions require explicit confirmation gates).
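A confirmation gate for write actions can be a small wrapper around each tool. This is a minimal sketch: the two tools are mocks, and `confirmed` stands in for whatever human-approval mechanism (UI prompt, ticket approval) a real deployment would use.

```python
# Permission wrapper: read tools run freely; write tools require explicit
# confirmation before executing.
from typing import Callable

def gated(tool: Callable[[str], str], writes: bool) -> Callable[..., str]:
    """Wrap a tool so that write actions are blocked without confirmation."""
    def wrapper(arg: str, confirmed: bool = False) -> str:
        if writes and not confirmed:
            return f"BLOCKED: '{tool.__name__}' writes data; confirmation required"
        return tool(arg)
    return wrapper

def _get_order(order_id: str) -> str:       # mock read-only lookup
    return f"Order {order_id}: shipped"

def _issue_refund(order_id: str) -> str:    # mock write action
    return f"Refund issued for {order_id}"

get_order = gated(_get_order, writes=False)
issue_refund = gated(_issue_refund, writes=True)

print(get_order("ORD-42"))                     # Order ORD-42: shipped
print(issue_refund("ORD-42"))                  # BLOCKED: ... confirmation required
print(issue_refund("ORD-42", confirmed=True))  # Refund issued for ORD-42
```

Defaulting every tool to read-only and opting individual tools into write access keeps the blast radius of a misbehaving agent small.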
🧪 Building a RAG+Agent Handler in Python
This example implements the full RAG + Agent architecture described in this post: cosine similarity retrieval, an LLM conditioned on the retrieved passage, and a ReAct agent governed by a hard step budget. Each of the three code sections maps directly to a concept covered above (the retrieval scoring formula, grounded generation, and the runaway-agent failure mode), so you can see exactly where each mechanism lives. As you read it, note how the three stages stay independent: retrieval fires before generation, and tool execution happens only when the LLM explicitly decides a call is needed.
```python
# pip install sentence-transformers langchain langchain-openai langchainhub numpy
import numpy as np
from sentence_transformers import SentenceTransformer
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import create_react_agent, AgentExecutor
from langchain import hub

# ── Step 1: RAG: retrieve the most relevant doc via cosine similarity ─────────
# This implements the retrieval scoring formula: s_i = (q · d_i) / (||q|| ||d_i||)
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Refund policy: 30-day returns on all items.",
    "Shipping: free on orders over $50.",
    "Account: password reset via settings > security.",
]
doc_embeddings = embed_model.encode(docs)  # shape: (3, 384); indexed at startup

def retrieve(query: str) -> str:
    """Cosine similarity retrieval: higher score = more relevant passage."""
    q_vec = embed_model.encode([query])                # (1, 384)
    sims = (q_vec @ doc_embeddings.T)[0]               # dot products
    sims /= (np.linalg.norm(q_vec) * np.linalg.norm(doc_embeddings, axis=1))
    return docs[int(sims.argmax())]                    # top-1 retrieved passage

# ── Step 2: LLM conditioned on retrieved context ──────────────────────────────
# The LLM never guesses policy: it reasons only over the retrieved passage
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
query = "What is the return policy?"
context = retrieve(query)  # -> "Refund policy: 30-day returns on all items."
answer = llm.invoke(f"Context: {context}\nQuestion: {query}")
print("RAG answer:", answer.content)  # grounded in the retrieved fact, not hallucinated

# ── Step 3: Agent: LLM + tool + hard step budget ──────────────────────────────
# The @tool decorator registers check_order_status in the agent's tool registry
@tool
def check_order_status(order_id: str) -> str:
    """Look up the current status of a customer order by ID."""
    return f"Order {order_id}: shipped, arrives 2025-08-10."  # mock DB lookup

tools = [check_order_status]
prompt = hub.pull("hwchase17/react")  # ReAct template: Thought → Action → Observation
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent, tools=tools,
    max_iterations=5,  # hard step budget: prevents the runaway-agent failure mode
    verbose=True,      # prints the full reasoning scratchpad at each iteration
)
result = executor.invoke({"input": "What is the status of order ORD-42?"})
print("Agent answer:", result["output"])

# verbose=True reveals the ReAct loop:
#   Thought: I need to look up order ORD-42.
#   Action: check_order_status
#   Action Input: ORD-42
#   Observation: Order ORD-42: shipped, arrives 2025-08-10.
#   Final Answer: Your order ORD-42 has shipped and arrives 2025-08-10.
```
🔁 ReAct Agent Loop
```mermaid
sequenceDiagram
    participant U as User
    participant L as LLM
    participant T as Tool
    U->>L: Task / Question
    L-->>L: Thought: plan action
    L->>T: Tool call
    T-->>L: Observation
    L-->>L: Next thought
    L-->>U: Final Answer
```
This sequence diagram traces the ReAct loop from the code example above: the LLM alternates between internal Thought steps (reasoning) and outbound Tool calls (acting), incorporating each Observation before deciding the next action. The key takeaway is that the LLM never answers directly from its weights alone: it reads the tool's Observation before producing the Final Answer, grounding the response in verified external output rather than parametric memory.
⚖️ Trade-offs and Operational Failure Modes
| Failure Mode | Symptom | Mitigation |
| --- | --- | --- |
| Ambiguous entity | "Paris" = city or person | Prompt LLM to disambiguate |
| Empty retrieval | Retriever returns no results | Fall back to generic or trigger web-search tool |
| Tool timeout | API rate-limit exceeded | Exponential back-off, provider fallback |
| Hallucination | LLM fabricates facts | Source-check: cross-validate with retrieved passages |
| Runaway agent | Infinite tool-call loop | Hard step limit, circuit breaker |
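The tool-timeout mitigation from the table (exponential back-off plus a provider fallback) fits in one helper. This is a sketch: `flaky_api` simulates a rate-limited provider, and the delays are shortened so the example runs instantly.

```python
# Exponential back-off with provider fallback: retry a flaky tool call,
# doubling the wait each attempt; fall back if all retries fail.
import time

def call_with_backoff(tool, *args, retries=3, base_delay=0.01, fallback=None):
    delay = base_delay
    for attempt in range(retries):
        try:
            return tool(*args)
        except TimeoutError:
            if attempt == retries - 1:
                break                  # out of retries
            time.sleep(delay)          # wait before the next attempt
            delay *= 2                 # exponential back-off: d, 2d, 4d, ...
    return fallback(*args) if fallback else "tool unavailable"

calls = {"n": 0}
def flaky_api(query):                  # fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("rate limited")
    return f"result for {query}"

print(call_with_backoff(flaky_api, "paris weather"))  # result for paris weather
```

In production the delays would be hundreds of milliseconds with jitter, and `fallback` would point at a secondary provider rather than a string.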
Performance at scale:
| Metric | RAG Only | Agents Only | RAG + Agents |
| --- | --- | --- | --- |
| Latency per request | $O(d{\cdot}k + T_{\text{LLM}})$ | $O(T_{\text{LLM}} + m\hat{T})$ | Combined; network I/O dominates |
| Vector store space | $O(N{\cdot}d)$ | $O(1)$ | $O(N{\cdot}d)$ plus tool state |
Cache hot queries and parallelize independent tool calls to reduce tail latency.
🧭 Decision Guide: Choosing the Right Mix
| Situation | Recommendation |
| --- | --- |
| Need up-to-date factual answers | Deploy RAG (vector store + fresh corpus) |
| Need to perform actions (book, query DB) | Add Agents with a well-defined tool registry |
| Both knowledge freshness and actions | RAG + Agents; start with a simple planner LLM |
| Low latency, high volume | Lightweight retriever (BM25) + smaller model |
| Complex multi-step workflows | Stateful orchestrator (LangChain, LlamaIndex) |
| Confidential data | On-prem vector store, encrypted tool calls, audit logs |
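The "low latency, high volume" row recommends BM25, a purely lexical scorer with no embedding model in the hot path. Here is a minimal self-contained sketch of the standard BM25 formula over the toy policy corpus; the k1 and b values are the common defaults, and real deployments would use a tuned library implementation instead.

```python
# Minimal BM25 scorer: lexical retrieval with no neural encoder.
import math
import re

DOCS = [
    "Refund policy: 30-day returns on all items.",
    "Shipping: free on orders over $50.",
    "Account: password reset via settings > security.",
]
K1, B = 1.5, 0.75  # common default BM25 parameters

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

corpus = [tokens(d) for d in DOCS]
avgdl = sum(len(d) for d in corpus) / len(corpus)   # average doc length
N = len(corpus)

def idf(term):
    df = sum(term in d for d in corpus)             # document frequency
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for t in set(tokens(query)):
        f = doc.count(t)                            # term frequency in doc
        score += idf(t) * f * (K1 + 1) / (f + K1 * (1 - B + B * len(doc) / avgdl))
    return score

best = max(range(N), key=lambda i: bm25("how do I reset my password?", corpus[i]))
print(DOCS[best])  # Account: password reset via settings > security.
```

Because scoring is pure term arithmetic, BM25 serves queries in microseconds and needs no GPU, which is exactly the trade the decision guide is pointing at.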
🛠️ LangChain: Composable RAG and Agent Pipelines in Python
LangChain is an open-source Python (and JavaScript) framework that provides chains, retrievers, tools, and agent executors as composable building blocks, letting you wire together the RAG + Agent architecture described in this post without rebuilding every component from scratch.
Its RetrievalQA chain connects a vector store retriever to an LLM in one call; its AgentExecutor gives the LLM a registered tool registry with automatic loop control, solving the runaway-agent failure mode via a configurable max_iterations budget.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import Tool
from langchain import hub

# --- RAG pipeline ---
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(
    ["Refund policy: 30-day returns on all items.",
     "Shipping: free on orders over $50."],
    embedding=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
rag_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(rag_chain.invoke("What is the return policy?"))
# {'result': 'All items can be returned within 30 days.'}

# --- Agent with tool ---
def check_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped, arrives 2025-08-10."

tools = [Tool(name="OrderStatus",
              func=check_order_status,
              description="Look up an order by ID.")]
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, max_iterations=5)
print(executor.invoke({"input": "What is the status of order ORD-42?"}))
```
LangChain's max_iterations=5 guard is the production-safe implementation of the step-budget concept from the Deep Dive section.
For a full deep-dive on LangChain, a dedicated follow-up post is planned.
🛠️ LlamaIndex: Structured Document Ingestion for RAG
LlamaIndex (formerly GPT Index) is an open-source Python framework specializing in data ingestion, indexing, and querying for RAG systems, particularly for structured document hierarchies like PDFs, Notion pages, and SQL databases that LangChain handles less naturally.
Where LangChain shines on agent orchestration, LlamaIndex shines on building multi-document corpora with hierarchical summaries, metadata filters, and re-ranking, directly addressing the "tight retrieval corpus" requirement from the decision guide.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure models
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Ingest a directory of documents (PDFs, .txt, .md)
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query with automatic retrieval -> generation
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarise the Q3 2024 earnings highlights.")
print(response)  # grounded answer with source citations
```
LlamaIndex can persist the index to disk with index.storage_context.persist("./index_store"), enabling production RAG without re-embedding documents on every restart.
For a full deep-dive on LlamaIndex, a dedicated follow-up post is planned.
📚 What to Learn Next
- Neural Networks Explained: From Neurons to Deep Learning
- RAG Explained: How to Give Your LLM a Brain Upgrade
- Large Language Models: The Generative AI Revolution
📝 TLDR: Summary & Key Takeaways
- A bare LLM has no memory beyond its training cutoff and cannot take actions; RAG and Agents fix both.
- RAG supplies fresh knowledge via vector retrieval; Agents supply action capability via tools.
- Latency is dominated by the slowest external API, so cache and parallelize aggressively.
- Tool contracts (strict input/output schemas) prevent the LLM from inventing malformed API calls.
- Log everything: query embeddings, retrieved doc IDs, tool invocations, and token usage. Observability is non-negotiable.
🎯 Practice Quiz
What problem does RAG solve that a plain LLM cannot?
A) It makes the model faster B) It gives the model access to knowledge beyond its training cutoff C) It reduces model size D) It eliminates hallucinations entirely
Correct Answer: B. RAG retrieves fresh passages from an external corpus at query time, letting the LLM answer questions about events and documents it never saw during training.
Why must tool input/output schemas be strictly defined in an agent system?
A) To reduce token costs B) To prevent the LLM from generating malformed API calls C) Schemas are optional; the agent infers them dynamically D) To enforce latency SLAs
Correct Answer: B. A strict schema constrains the LLM's output to valid tool parameters, preventing runtime failures from incorrectly structured calls.
What is the primary latency bottleneck in a RAG + Agent system?
A) Embedding generation is always the slowest step B) The slowest external API or tool call dominates end-to-end latency C) Vector search is O(N) and always slow D) The reranker adds the majority of latency
Correct Answer: B. LLM generation and slow external tools dominate wall-clock time. ANN search is typically sub-10 ms and rarely the bottleneck.
You are designing a RAG+Agent system for a hospital that needs to answer clinical questions and update patient records. What architectural safeguards would you implement, and why? (Open-ended โ no single correct answer)
Consider: read-only vs. write tool permissions, audit logging of all tool invocations, human-confirmation gates before write actions, retrieval corpus freshness guarantees, hallucination mitigation through source attribution, and regulatory compliance (HIPAA).
🔗 Related Posts
- RAG Explained: How to Give Your LLM a Brain Upgrade
- Large Language Models: The Generative AI Revolution
- Tokenization Explained: How LLMs Understand Text
Written by
Abstract Algorithms
@abstractalgorithms