
AI Agents Explained: When LLMs Start Using Tools

An LLM can talk, but an AI Agent can *act*. We explain how Agents use the ReAct framework to brow...

Abstract Algorithms · 12 min read

AI-assisted content.

TLDR: A standard LLM is a brain in a jar — it can reason but cannot act. An AI Agent connects that brain to tools (web search, code execution, APIs). Instead of just answering a question, an agent executes a loop of Thought → Action → Observation until the goal is reached.


📖 Brain in a Jar vs Brain with Arms

A plain LLM generates text. Give it "What is the weather in Tokyo today?" and it will either:

  • Answer from training data (which is months or years old), or
  • Confidently hallucinate a plausible-sounding answer.

An AI agent would:

  1. Recognize it needs current weather data.
  2. Call a weather API tool.
  3. Return the real, live answer.

The difference: the agent can act on the world, not just describe it.


🔍 The Basics: What Is an AI Agent?

An AI agent is a program that uses a large language model as its reasoning engine and connects it to external tools so it can take real actions in the world. Where a plain LLM is purely generative — text in, text out — an agent augments the LLM with capabilities like web search, code execution, file reading, database queries, and API calls.

Three components make up almost every agent:

  1. The LLM — the reasoning core that decides what to do next.
  2. Tools — callable functions the LLM is allowed to invoke (e.g., search_web, run_python).
  3. The loop — the agent keeps taking steps (Thought → Action → Observation) until it decides the goal is complete.

The key insight is that the LLM is never given raw access to tools. Instead, the tool descriptions (names and docstrings) are injected into the system prompt. The model reads those descriptions and decides — at each loop step — which tool to call, with what arguments, and why.
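To make this concrete, here is a framework-free sketch of how tool descriptions might be rendered into a system prompt. The function name and prompt format are illustrative, not any specific library's real template.

```python
def render_tool_prompt(tool_specs: dict) -> str:
    """Serialise tool names and descriptions into a system-prompt fragment."""
    lines = ["You have access to the following tools:"]
    for name, description in tool_specs.items():
        lines.append(f"- {name}: {description}")
    lines.append("At each step, reply with a tool call or a final answer.")
    return "\n".join(lines)

specs = {
    "search_web": "Search the web for current information.",
    "run_python": "Execute Python code and return its stdout.",
}
print(render_tool_prompt(specs))
```

A real framework does the same thing with richer schemas (argument names and types), but the principle is identical: the model only ever sees this text, never the tool implementations.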

This architecture is called ReAct (Reasoning + Acting), introduced by Yao et al. in a 2022 paper. ReAct showed that interleaving reasoning traces with tool calls outperforms either reasoning alone (chain-of-thought prompting) or acting alone (tool calls without explicit reasoning).

How is an agent different from a simple API call? A single API call is one-shot: input → output. An agent is iterative: it observes the result of each action and can decide to take another action, fix an error, try a different tool, or produce a final answer — all without human intervention in the loop.


⚙️ The ReAct Loop: Thought → Action → Observation

The dominant pattern for agents is ReAct (Reasoning + Acting). The model cycles through three steps until the task is complete:

| Step | Type | Content |
|---|---|---|
| 1 | Thought | I need to find out when the movie Titanic was released. |
| 2 | Action | search("Titanic movie release date") |
| 3 | Observation | "Titanic was released in December 1997." |
| 4 | Thought | Now I need to find who was US president in December 1997. |
| 5 | Action | search("US President December 1997") |
| 6 | Observation | "Bill Clinton was US President in December 1997." |
| 7 | Thought | I have all the information. I can answer. |
| 8 | Final Answer | Bill Clinton was president when Titanic was released. |
```mermaid
flowchart TD
    Start([User Goal]) --> T["Thought: What do I need?"]
    T --> A["Action: Call a Tool"]
    A --> O["Observation: Tool Result"]
    O --> D{Goal reached?}
    D -- No --> T
    D -- Yes --> Answer([Return Final Answer])
```

This loop continues until the model decides it has enough information to answer.
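The loop above can be sketched in a few lines of plain Python. Everything here is a stand-in: `llm_step` would be a real model call in practice, and the scripted policy exists only so the example runs deterministically.

```python
def react_loop(llm_step, tools, goal, max_iterations=10):
    """Minimal ReAct skeleton. `llm_step` inspects the history and returns
    either ("act", tool_name, argument) or ("answer", text)."""
    history = [("goal", goal)]
    for _ in range(max_iterations):
        decision = llm_step(history)                  # Thought
        if decision[0] == "answer":
            return decision[1]                        # Final Answer: exit loop
        _, name, arg = decision
        observation = tools[name](arg)                # Action
        history.append(("observation", observation))  # Observation fed back
    raise RuntimeError("max_iterations reached without a final answer")

def scripted_llm(history):
    # Stand-in policy: search once, then answer from the observation.
    for kind, content in history:
        if kind == "observation":
            return ("answer", content.split(" was ")[0])
    return ("act", "search", "US President December 1997")

tools = {"search": lambda q: "Bill Clinton was US President in December 1997."}
print(react_loop(scripted_llm, tools, "Who was president when Titanic came out?"))
# prints: Bill Clinton
```

Note that termination is the model's decision (the "answer" branch), with `max_iterations` as the hard safety net.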

📊 ReAct Loop Sequence

```mermaid
sequenceDiagram
    participant U as User
    participant L as LLM
    participant T as Tool
    U->>L: User request
    L-->>L: Thought: what to do
    L->>T: Action: call tool
    T-->>L: Observation: result
    L-->>L: Next thought
    L-->>U: Final Answer
```

📊 How an Agent Processes a Request

When a user sends a question to an agent-powered system, several layers of logic activate before any answer is returned. Understanding this flow helps you reason about latency, costs, and failure points.

```mermaid
flowchart TD
    U([User Request]) --> SYS[Inject Tool Schemas into LLM Context]
    SYS --> LLM1["LLM Reasoning (Thought): What do I need?"]
    LLM1 --> TC["Tool Selection: Which tool and args?"]
    TC --> TE["Tool Execution: Call external function"]
    TE --> OBS["Observation: Capture tool output"]
    OBS --> CHK{Goal reached?}
    CHK -- No --> LLM1
    CHK -- Yes --> ANS([Return Final Answer to User])
```

What happens at each node:

  • Inject Tool Schemas — before the first LLM call, all tool names, descriptions, and input types are serialised into the system prompt so the model knows what it can use.
  • LLM Reasoning — the model outputs a structured "Thought" explaining its next move. This trace is not shown to the user but is the core of the agent's transparency.
  • Tool Selection — based on the Thought, the model emits a structured tool call (name + JSON arguments). Modern LLMs use function-calling APIs to enforce valid JSON here.
  • Tool Execution — your application code runs the actual function, calling the real API or executing code in a sandbox.
  • Observation — the result is appended to the conversation history so the model can read it on the next iteration.
  • Goal check — the model itself decides when it has enough information. If it emits a "Final Answer", the loop exits.
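A structured tool call from the Tool Selection step typically arrives as a small JSON payload. The exact shape varies by provider; this payload is illustrative, but the validation step (reject any tool name you did not register) applies everywhere.

```python
import json

# Illustrative payload: what a function-calling model typically emits.
raw = '{"name": "search_web", "arguments": {"query": "Titanic release date"}}'
call = json.loads(raw)

ALLOWED_TOOLS = {"search_web", "run_python"}
# Validate before executing: reject hallucinated tool names outright.
if call["name"] not in ALLOWED_TOOLS:
    raise ValueError(f"unknown tool: {call['name']}")
print(call["name"], call["arguments"])
```

This check is the first line of defence against the "hallucinated tool call" failure mode discussed later.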

🔢 Tool Definitions: How an Agent Knows What It Can Do

A tool is a function the model can call. In LangChain you define tools with a name, description, and input schema:

```python
from langchain.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for current information. Use this for recent events or facts."""
    return web_search_api(query)   # placeholder for your real search client

@tool
def run_python(code: str) -> str:
    """Execute Python code and return the output. Use this for calculations."""
    return exec_sandbox(code)      # placeholder for your sandboxed executor
```

The model receives the tool descriptions in its system prompt and decides which to call (and with what arguments) based on the task. It never sees the implementation — only the name and docstring.
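In plain Python terms, the metadata that reaches the model is just the function's name and docstring. A framework-free sketch:

```python
def search_web(query: str) -> str:
    """Search the web for current information."""
    raise NotImplementedError  # the body is irrelevant to the model

# Only this reaches the model's context, never the implementation:
spec = {"name": search_web.__name__, "description": search_web.__doc__}
print(spec)
```

This is why the docstring is effectively a prompt: it is the only evidence the model has about when and how to use the tool.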

📊 Agent Tool Decision Flow

```mermaid
flowchart TD
    I[Input from User] --> LM[LLM Reasoning]
    LM --> D{Need a tool?}
    D -- No --> R[Return Answer]
    D -- Yes --> TS[Select Tool]
    TS --> EX[Execute Tool]
    EX --> OB[Observe Output]
    OB --> LM
```

🧠 Deep Dive: Building a Simple Agent with LangChain

```python
from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain import hub

llm = ChatOpenAI(model="gpt-4o-mini")
tools = [search_web, run_python]

prompt = hub.pull("hwchase17/react")          # standard ReAct prompt
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "Who was US president when Titanic was released?"})
print(result["output"])
```

The verbose=True flag shows you the full Thought/Action/Observation chain — invaluable for debugging.


🌍 Real-World Applications: Agent Use Cases

| Use case | Tools used |
|---|---|
| Customer support triage | CRM lookup, ticket creation, knowledge base search |
| Data analyst bot | SQL runner, Python executor, chart renderer |
| Code reviewer agent | GitHub file reader, linter, test runner |
| Travel booking | Flight search API, hotel API, calendar API |
| Research assistant | Web search, PDF reader, citation manager |

🧪 Practical: Debugging Agent Behavior

Agents fail in opaque ways that are hard to catch without systematic debugging. Here is a practical workflow for diagnosing misbehaving agents.

Step 1 — Enable verbose output. In LangChain, set verbose=True on the AgentExecutor. This prints the full Thought/Action/Observation chain to stdout, letting you see exactly what the model decided at each step and what each tool returned.

```python
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=10)
```

Step 2 — Inspect the tool call arguments. The most common failure is the model calling the right tool with the wrong arguments. Look at the Action Input in the verbose log. If the model is passing None, an empty string, or a hallucinated value, the docstring on your tool is probably too vague.

Step 3 — Check for loops. If the agent keeps calling the same tool with the same arguments and getting no progress, it is stuck. Add max_iterations=10 (or lower) to break the loop and inspect the last observation. Usually the tool is returning an error the model does not know how to handle.
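Loop detection can also be automated. Here is a hypothetical heuristic, not part of any framework, that flags the agent as stuck when the last few (tool, arguments) pairs are identical:

```python
def is_stuck(recent_calls, window=3):
    """Hypothetical heuristic: flag a loop when the last `window`
    (tool, args) pairs are all identical."""
    if len(recent_calls) < window:
        return False
    return len(set(recent_calls[-window:])) == 1

calls = [("search_web", "foo")] * 3
print(is_stuck(calls))  # True: the same call three times in a row
```

Running this check between iterations lets you abort early with a readable error instead of burning the full iteration budget.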

Step 4 — Shrink the context. If the agent starts forgetting earlier observations or repeating itself, the context window may be filling up. Consider summarising older observations or limiting how many tool calls are preserved in history.
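One hypothetical pruning strategy: keep the system prompt plus only the most recent messages, replacing the rest with a stub. In a real system you would generate the summary with an LLM rather than a fixed placeholder.

```python
def prune_history(messages, keep_last=4):
    """Keep the system prompt plus the most recent `keep_last` messages,
    replacing everything in between with a summary stub."""
    system, rest = messages[:1], messages[1:]
    if len(rest) <= keep_last:
        return messages
    dropped = len(rest) - keep_last
    stub = {"role": "system", "content": f"[{dropped} earlier steps summarised]"}
    return system + [stub] + rest[-keep_last:]

msgs = [{"role": "system", "content": "tools..."}] + [
    {"role": "assistant", "content": f"step {i}"} for i in range(10)
]
print(len(prune_history(msgs)))  # 6: system prompt + stub + last 4 steps
```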

Common failure patterns at a glance:

| Symptom | Probable Cause | Fix |
|---|---|---|
| Tool called with empty args | Vague tool description | Improve docstring with examples |
| Infinite loop | Tool returns silent error | Return readable error message |
| Wrong tool chosen | Tool names too similar | Rename for clarity |
| Context window exceeded | Too many iterations | Summarise older observations |
| Hallucinated answer | No relevant tool available | Add the needed tool or constrain scope |

⚖️ Trade-offs & Failure Modes: Hallucinations, Loops, and Cost Blowouts

Agents introduce new failure modes beyond plain LLMs:

Hallucinated tool calls — the model invents arguments or calls a non-existent tool. Fix: validate tool schemas strictly; use structured outputs.

Infinite loops — the agent gets stuck in Thought→Action→Observation cycles with no progress. Fix: set a hard max_iterations limit.

Cost explosion — each loop iteration is an API call + tool call. A task that needs 15 iterations with GPT-4 can cost $1 per query. Fix: use cheaper models for planning steps; cache repeated tool results.

Context overflow — long observation histories can push earlier context out of the window. Fix: summarize or prune old observations periodically.
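Caching repeated tool results is a one-decorator fix for part of the cost problem. A sketch using the standard library (the counter exists only to make the saving visible):

```python
from functools import lru_cache

api_calls = {"count": 0}

@lru_cache(maxsize=256)
def cached_search(query: str) -> str:
    api_calls["count"] += 1              # stands in for a paid API request
    return f"result for {query!r}"

cached_search("Titanic release date")
cached_search("Titanic release date")    # identical query: served from cache
print(api_calls["count"])  # 1
```

`lru_cache` only helps with exact-match repeats within one process; cross-session caching needs an external store, and results with a freshness requirement (weather, prices) should carry a TTL.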


🧭 Decision Guide: When to Use an AI Agent

Use an agent when the task requires multiple steps, tool use, or dynamic decision-making that can't be scripted upfront. Prefer a simple LLM call when a single prompt is sufficient — agents add latency and cost. Use agents for research, code-generation loops, or multi-tool workflows; avoid them for single-turn Q&A or classification tasks.


🛠️ LangChain AgentExecutor: Wiring the ReAct Loop in Five Lines of Python

LangChain is an open-source Python framework for composing LLM-powered applications. Its agent module provides create_react_agent, AgentExecutor, and the @tool decorator, which together implement the Thought → Action → Observation ReAct loop with minimal boilerplate.

create_react_agent wires an LLM, a list of tools, and a ReAct prompt template into an executable agent. AgentExecutor runs the loop: it calls the LLM, parses tool invocations, executes the tools, appends observations to context, and stops when the model emits a final answer or max_iterations is reached.

```python
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain import hub

# 1. Define tools — the LLM only ever sees the name and docstring
@tool
def search_web(query: str) -> str:
    """Search the web for current facts, news, or recent events. Returns a text summary."""
    # In production: call Tavily, SerpAPI, or a custom search client
    return f"[Search result for '{query}': Titanic was released December 19, 1997]"

@tool
def run_python(code: str) -> str:
    """Execute Python code and return stdout. Use for math and data."""
    # WARNING: exec() is NOT a real sandbox. Use a containerised or
    # restricted executor before exposing this to untrusted input.
    import contextlib, io
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {"__builtins__": __builtins__})
    return buf.getvalue() or "(no output)"

# 2. Wire the agent — LLM + tools + standard ReAct prompt from LangChain Hub
llm   = ChatOpenAI(model="gpt-4o-mini", temperature=0)
tools = [search_web, run_python]
prompt = hub.pull("hwchase17/react")   # standard ReAct template

agent    = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,               # prints full Thought/Action/Observation chain
    max_iterations=10,          # hard cap — prevents runaway cost loops
    handle_parsing_errors=True, # recover gracefully from malformed tool calls
)

# 3. Run — the executor manages the full ReAct loop automatically
result = executor.invoke({
    "input": "What is the square root of the year Titanic was released?"
})
print(result["output"])

# Example verbose trace:
# Thought: I need the release year of Titanic.
# Action: search_web("Titanic movie release year")
# Observation: Released December 19, 1997
# Thought: Now compute sqrt(1997).
# Action: run_python("import math; print(math.sqrt(1997))")
# Observation: 44.688...
# Final Answer: approximately 44.69
```

verbose=True is the most important debugging flag — it exposes every Thought/Action/Observation step, making it straightforward to diagnose incorrect tool selection or malformed arguments. Set max_iterations before going to production: without it, a confused agent loops until the token budget is exhausted.

For a full deep-dive on LangChain multi-tool agents, memory integration, and production observability with LangSmith tracing, a dedicated follow-up post is planned.


📚 Key Lessons About AI Agents

Five concrete lessons from building and deploying agents in production:

  1. Tool descriptions are your most important prompt. The model never sees tool code — only the name and docstring. Write docstrings like user-facing documentation: what the tool does, when to use it, what to pass in, and what it returns. One ambiguous tool description causes more failures than any model limitation.

  2. Always set max_iterations. Without a hard cap, a confused agent will keep looping until your API budget is exhausted. In production, treat hitting the iteration limit as an error to alert on, not a silent fallback.

  3. Structured outputs improve reliability dramatically. Agents that return tool call arguments as free-form text hallucinate more often than agents using JSON-schema-validated function calling. Use OpenAI function calling or LangChain structured output tools wherever possible.

  4. Cost scales with loop depth. Every Thought → Action → Observation iteration is one full LLM inference plus one tool call. A task requiring 10 iterations with a frontier model can cost $0.50–$1.00 per query. Profile your agents before going to production and set budgets per session.

  5. Agents are hard to test deterministically. Unlike a pure function with fixed input/output, agents make probabilistic decisions at each step. Write integration tests using recorded tool responses (fixtures) rather than live tool calls, and explicitly test common failure paths such as tool errors and empty results.
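Lesson 5 in practice: replace live tools with a fixture store during integration tests. The store layout here is a hypothetical sketch, not any framework's API.

```python
# Hypothetical fixture store: (tool, argument) -> recorded response.
FIXTURES = {
    ("search_web", "Titanic release year"): "Released December 19, 1997",
}

def replay_tool(name, arg):
    """Deterministic tool stub for integration tests."""
    return FIXTURES.get((name, arg), "ERROR: no recorded fixture")

print(replay_tool("search_web", "Titanic release year"))
print(replay_tool("search_web", "unseen query"))  # exercises the error path
```

Because the stub also returns a readable error for unrecorded calls, the same fixtures let you test both the happy path and the tool-failure path.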


📌 TLDR: Summary & Key Takeaways

  • A plain LLM generates text; an agent generates text and calls tools to act.
  • The dominant loop is ReAct: Thought → Action → Observation, repeated until the task is complete.
  • Tools are functions with a name and description; the LLM decides when and how to call them.
  • Key failure modes: hallucinated tool calls, infinite loops, cost explosion, and context overflow.
  • Always set max_iterations and monitor tool call costs in production.

🧩 Test Your Understanding

  1. What is the difference between an LLM and an AI agent?
  2. In the ReAct pattern, what triggers the agent to stop looping?
  3. Why are tool descriptions (docstrings) so important for agent reliability?
  4. Name two ways to prevent an agent from running up an unexpectedly large API bill.

Written by Abstract Algorithms (@abstractalgorithms)