Multistep AI Agents: The Power of Planning
Simple AI agents react one step at a time. Multistep agents are different: they create a full plan up front, then execute it step by step.
Abstract Algorithms
TLDR: A simple ReAct agent reacts one tool call at a time. A multistep agent plans a complete task decomposition upfront, then executes each step sequentially, handling complex goals that require 5-10 interdependent actions without re-prompting the LLM for each step.
Line Cook vs. Head Chef
A line cook (simple ReAct agent): receives one ticket, cooks one dish, hands it over. Then the next ticket.
A head chef (multistep agent): receives the full dinner party menu, plans the entire 5-course sequence, coordinates prep timing for all dishes, anticipates which items can be done in parallel, and manages the full execution before the first guest is seated.
The difference: planning before acting. Complex goals require plans, not just reactions.
What Makes a Multistep Agent Different
A simple ReAct agent operates with a single-step planning horizon: it looks at the current state, picks the next action, observes the result, and repeats. This works well for short, open-ended tasks where the required steps are unknown ahead of time. But when a task involves 5–10 interdependent actions, this approach becomes inefficient: the LLM must rediscover context at every step, decisions made in step 6 can conflict with assumptions from step 2, and there is no mechanism to detect that a task is structurally impossible before spending tokens on execution.
A multistep agent solves this by introducing a distinct planning phase before any tool is called. During planning, the LLM receives the full goal and produces a structured decomposition, typically a JSON array of steps, each with an action name and arguments. This plan encodes the task dependency graph: which steps can run in parallel, which must complete before others begin, and which downstream steps each step's output feeds into.
Key concepts in multistep agent design:
- Planning horizon: how many steps ahead the agent reasons before committing to an action
- Task decomposition: breaking a complex goal into discrete, executable sub-tasks with clear inputs and outputs
- Step dependencies: the directed acyclic graph (DAG) of which steps must complete before others can begin
- Re-planning: when a step fails, generating a revised plan for the remaining steps based on current state
- Context management: summarizing intermediate results to prevent context window overflow across many steps
Understanding these concepts is essential before choosing whether to use a multistep agent or a simpler reactive loop for a given problem.
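To make task decomposition and step dependencies concrete, a plan can be held as a list of step records whose `depends_on` fields form the DAG; a topological sort then yields a valid execution order. The `Step` shape and field names below are illustrative assumptions, not any specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    id: int
    action: str                                       # tool name to invoke
    args: list                                        # arguments for the tool
    depends_on: list = field(default_factory=list)    # ids of prerequisite steps

def execution_order(steps):
    """Topologically sort steps so every step runs after its dependencies."""
    done, order = set(), []
    remaining = {s.id: s for s in steps}
    while remaining:
        ready = [s for s in remaining.values() if all(d in done for d in s.depends_on)]
        if not ready:
            raise ValueError("Cycle in step dependencies: plan is not a DAG")
        for s in ready:
            order.append(s)
            done.add(s.id)
            del remaining[s.id]
    return order

plan = [
    Step(1, "search", ["top AI papers"]),
    Step(2, "fetch_abstract", ["paper_id_1"], depends_on=[1]),
    Step(3, "summarize", ["abstract_1"], depends_on=[2]),
]
print([s.action for s in execution_order(plan)])  # ['search', 'fetch_abstract', 'summarize']
```

Steps that become `ready` in the same pass have no dependency between them, so they are exactly the ones a multistep agent could run in parallel.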
Simple ReAct vs. Plan-and-Execute: Core Difference
| Dimension | ReAct (Single-Step Loop) | Plan-and-Execute (Multistep) |
| --- | --- | --- |
| Planning | None; the LLM decides the next action after each observation | LLM creates a full plan upfront (JSON array of steps) |
| LLM calls | One per action (tight feedback loop) | One for planning; one per step for execution |
| Best for | Short, open-ended tasks with unknown required steps | Long tasks with a knowable step structure |
| Failure handling | Adapts after each observation | Re-plan on step failure |
| Token cost | Lower per step | Higher plan call; lower execution calls |
The Plan-and-Execute Architecture
Goal: "Research the top 3 AI papers from last month, summarize each, and draft a blog post."
Phase 1: Plan Call (one LLM call):
[
{ "step": 1, "action": "search", "args": ["top AI papers July 2025"] },
{ "step": 2, "action": "fetch_abstract", "args": ["paper_id_1"] },
{ "step": 3, "action": "summarize", "args": ["abstract_1"] },
{ "step": 4, "action": "fetch_abstract", "args": ["paper_id_2"] },
{ "step": 5, "action": "summarize", "args": ["abstract_2"] },
{ "step": 6, "action": "fetch_abstract", "args": ["paper_id_3"] },
{ "step": 7, "action": "summarize", "args": ["abstract_3"] },
{ "step": 8, "action": "write_post", "args": ["[summary_1, summary_2, summary_3]"] }
]
Phase 2: Execution Loop (the LLM is only called when a tool's output needs reasoning):
flowchart TD
Goal[Complex Goal] --> Planner["LLM Planner (one call, JSON plan)"]
Planner --> Loop[Executor Loop]
Loop --> Step["Execute Next Step (tool call or sub-LLM call)"]
Step --> Check{Last step?}
Check -->|No| Loop
Check -->|Yes| Result[Final Result]
Step -->|Failure| Replan[Re-plan remaining steps]
Replan --> Loop
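The executor loop in the diagram can be sketched in a few lines of Python. `TOOLS`, the step dict shape, and the `replan` callback below are hypothetical stand-ins for a real tool registry and planner call, not any framework's API:

```python
# Minimal executor loop sketch; TOOLS and the replan() contract are assumptions.
TOOLS = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda t: f"summary of {t[:30]}",
}

def execute_plan(plan, replan):
    """Run each step in order; on failure, splice in a revised remaining plan."""
    results = {}
    i = 0
    while i < len(plan):
        step = plan[i]
        try:
            results[step["step"]] = TOOLS[step["action"]](*step["args"])
            i += 1
        except Exception as exc:
            # Preserve completed results; only the remaining steps are re-planned
            plan = plan[:i] + replan(plan[i:], results, exc)
    return results

# A replan stub that fixes a hallucinated tool name in the failed step
fix = lambda remaining, results, exc: [{**remaining[0], "action": "summarize"}] + remaining[1:]
plan = [
    {"step": 1, "action": "search", "args": ["top AI papers"]},
    {"step": 2, "action": "sumarize", "args": ["abstract_1"]},  # typo: triggers re-plan
]
print(execute_plan(plan, fix))
# {1: 'results for top AI papers', 2: 'summary of abstract_1'}
```

A production loop would also cap re-plan attempts so a repeatedly failing step cannot spin forever.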
Multistep Agent Execution Flow
Understanding how a multistep agent transitions between planning and execution is critical for debugging and designing reliable pipelines. The flow is not a simple linear sequence: it includes conditional branches for parallel execution and failure recovery. The diagram below captures the full lifecycle from receiving a goal to delivering a final aggregated result.
flowchart TD
Goal[User Goal] --> Planner["LLM Planner (produce JSON step list)"]
Planner --> Validate["Validate Plan (tool names, schema)"]
Validate -->|Valid| Decompose[Decompose into Parallel + Sequential Steps]
Validate -->|Invalid| Planner
Decompose --> Parallel["Run Parallel Steps (independent actions)"]
Decompose --> Sequential["Run Sequential Steps (dependent actions)"]
Parallel --> Aggregate[Aggregate Intermediate Results]
Sequential --> Aggregate
Aggregate --> Check{All steps complete?}
Check -->|Yes| Final[Final Result Delivered]
Check -->|Step Failed| Replan[Re-plan Remaining Steps from Current State]
Replan --> Decompose
This flow highlights two key design decisions: plan validation happens before any tool call (catching hallucinated tool names early), and aggregation combines outputs from parallel and sequential branches before the next dependent step begins. The re-plan path preserves completed step results: there is no need to restart from the beginning when a single step fails mid-execution.
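A minimal version of that validation pass checks each step against the registered tool list and an expected argument count (a simplified stand-in for a full argument schema; the `registry` shape and step fields here are assumptions):

```python
def validate_plan(plan, tool_registry):
    """Reject plans that reference unknown tools or are missing required fields.

    tool_registry maps tool names to their expected argument count, a simplified
    stand-in for a full argument schema.
    """
    errors = []
    for step in plan:
        if not {"step", "action", "args"} <= step.keys():
            errors.append(f"step {step.get('step', '?')}: missing required fields")
        elif step["action"] not in tool_registry:
            errors.append(f"step {step['step']}: unknown tool '{step['action']}'")
        elif len(step["args"]) != tool_registry[step["action"]]:
            errors.append(f"step {step['step']}: wrong number of arguments")
    return errors  # an empty list means the plan is safe to execute

registry = {"search": 1, "summarize": 1}
plan = [
    {"step": 1, "action": "search", "args": ["top AI papers"]},
    {"step": 2, "action": "fetch_abstract", "args": ["paper_id_1"]},  # hallucinated tool
]
print(validate_plan(plan, registry))
# ["step 2: unknown tool 'fetch_abstract'"]
```

Running this one pass before execution costs nothing compared to discovering the bad tool name after several paid LLM calls.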
Real-World Use Cases
Multistep agents are not theoretical constructs; they power several categories of real production systems today.
Automated Research Pipelines A research agent might receive a goal like "produce a competitive analysis of the top 5 CRM tools." The plan includes: search for the top 5 tools (step 1), fetch the pricing page for each tool (steps 2–6 in parallel), extract key features from each page (steps 7–11), compare them across a structured rubric (step 12), and produce a formatted markdown report (step 13). No human re-prompts the system between steps; the plan drives the full execution.
Coding Agents Agents like GitHub Copilot Workspace and Devin use multistep planning to handle tasks like "add OAuth login to this codebase." The plan might include: reading existing auth files, identifying integration points, writing new code files, updating configuration, running tests, and fixing any test failures. Each step has dependencies on prior outputs, making reactive single-step agents impractical.
Business Process Automation Invoice processing pipelines, HR onboarding workflows, and compliance audits can all be modeled as multistep plans. An invoice agent might: extract line items (step 1), validate against purchase orders (step 2), route exceptions for human review (step 3), and post approved invoices to the accounting system (step 4). The structured plan makes auditing and failure recovery straightforward.
Data Analysis Workflows Data science agents receive goals like "analyze Q3 sales data and identify the top 3 regional trends." The plan includes: querying the data warehouse, cleaning the dataset, computing regional aggregates, ranking by growth rate, and generating a narrative summary. Each step produces a structured artifact consumed by the next step โ a pattern that maps directly to the plan-and-execute model.
In every case, the shared characteristic is a knowable step structure: the agent can enumerate the required actions before observing the results of any single action. This is the fundamental criterion for choosing a multistep agent over a reactive loop.
Multi-Step Planning Sequence
sequenceDiagram
participant U as User
participant P as LLM Planner
participant E as Executor
participant T as Tools
U->>P: Submit goal
P->>P: Decompose into step list
P-->>E: Plan [step1, step2, step3...]
E->>T: Execute step 1
T-->>E: Result 1
E->>P: Observe result 1
P->>P: Update plan if needed
E->>T: Execute step 2
T-->>E: Result 2
E->>T: Execute step 3
T-->>E: Result 3
E-->>U: Aggregated final answer
This sequence diagram traces the communication flow between the four key participants in a multistep agent run: the user who submits the goal, the LLM Planner that decomposes it into a numbered step list, the Executor that drives each tool call, and the Tools that perform the actual operations. Notice that the Planner is consulted again after Step 1 completes; this models the optional re-planning trigger, where intermediate results can cause the agent to revise remaining steps before continuing. This feedback loop is what distinguishes plan-and-execute agents from a static batch of sequential tool calls.
Planning Loop Decision Flow
flowchart TD
Goal[Receive Goal]
Decompose[LLM: Decompose into step list]
Queue["Queue Steps (DAG order)"]
Execute[Execute Next Step with Tools]
UpdateState[Update Shared State with Result]
Done{All Steps Complete?}
Replan{Step Failed?}
Final[Return Final Result]
ReplanStep[Re-plan Remaining Steps from State]
Goal --> Decompose --> Queue --> Execute --> UpdateState
UpdateState --> Done
Done -->|Yes| Final
Done -->|No| Replan
Replan -->|Yes| ReplanStep --> Queue
Replan -->|No| Execute
This flowchart details the executor's decision logic during the step-by-step execution phase. After each step's result is written to shared state, the executor checks two conditions: whether all steps are complete, and whether the last step failed. A successful completion routes to the final result. A step failure triggers the re-planner, which generates a revised plan for the remaining steps using current accumulated state, avoiding a full restart. If neither condition is met, the executor simply advances to the next queued step, forming the core execution loop of any plan-and-execute agent.
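The re-plan step itself can be sketched as a single planner call that receives the accumulated state. The `llm` callable and the prompt shape below are assumptions for illustration, not a specific library interface:

```python
import json

def replan_remaining(llm, goal, completed, remaining, failure):
    """Ask the planner for a revised plan covering only the remaining steps.

    `llm` is assumed to be a callable that takes a prompt string and returns a
    JSON array of steps -- a hypothetical interface, not a real library's API.
    """
    prompt = (
        f"Goal: {goal}\n"
        f"Completed step results: {json.dumps(completed)}\n"
        f"Failed step: {json.dumps(failure)}\n"
        f"Remaining steps before failure: {json.dumps(remaining)}\n"
        "Produce a revised JSON array of remaining steps that completes the "
        "goal, reusing the completed results. Do not repeat completed steps."
    )
    return json.loads(llm(prompt))

# Stubbed planner response for illustration
fake_llm = lambda prompt: '[{"step": 4, "action": "summarize", "args": ["abstract_2"]}]'
revised = replan_remaining(fake_llm, "draft a post", {1: "ok", 2: "ok", 3: "ok"},
                           [{"step": 4}], {"step": 4, "error": "timeout"})
print(revised)
# [{'step': 4, 'action': 'summarize', 'args': ['abstract_2']}]
```

Passing the completed results in the prompt is what lets the planner avoid re-issuing steps 1-3, which is exactly the "re-plan from current state" path in the diagram.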
Practical: Building a Simple Plan-and-Execute Agent
The fastest way to understand multistep agents is to trace how the planner and executor components interact in code. Below is a minimal Python example using LangChain's PlanAndExecute abstraction with custom tools.
Step 1: Define tools and the LLM:
from langchain_openai import ChatOpenAI
from langchain.tools import tool
@tool
def search_web(query: str) -> str:
"""Search the web for the given query."""
return f"Results for: {query}"
@tool
def summarize_text(text: str) -> str:
"""Summarize the provided text."""
return f"Summary of: {text[:50]}..."
tools = [search_web, summarize_text]
llm = ChatOpenAI(model="gpt-4o", temperature=0)
Step 2: Wire up planner and executor:
from langchain_experimental.plan_and_execute import (
PlanAndExecute, load_agent_executor, load_chat_planner
)
planner = load_chat_planner(llm) # single LLM call that generates the JSON step list
executor = load_agent_executor(llm, tools, verbose=True) # runs each step; verbose=True prints tool call traces
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)
Step 3: Invoke with a multi-step goal:
result = agent.invoke({
"input": "Find the top 2 benefits of multistep AI agents and summarize them."
})
print(result["output"])
When you run this with verbose=True, you will see the planner emit a JSON step list, then the executor call each tool in sequence. This trace is invaluable for debugging: if a step produces an unexpected output, you can inspect exactly where the plan diverged from reality and adjust either the planner prompt or the tool implementation accordingly.
LangGraph and AutoGen: How OSS Frameworks Implement Multistep Agent Orchestration
Two open-source frameworks have emerged as the leading ways to build multistep agents with explicit state management, and they take very different architectural approaches.
LangGraph is a graph-based agent orchestration library from LangChain that models an agent's execution as a directed graph of nodes (LLM calls, tool calls, or logic) and edges (conditional transitions). It gives you full control over state, supports cycles (loops), and is designed for durable, resumable workflows, making it the production-ready choice for complex multistep pipelines.
from langgraph.graph import StateGraph, END
from typing import TypedDict
# Define the shared state for the agent
class AgentState(TypedDict):
    task: str
    plan: list
    results: list
    current_step: int
# Node functions read the state and return the fields they update.
# The bodies below are stubs: a real agent would call an LLM in run_planner
# and dispatch tools in run_executor.
def run_planner(state: AgentState) -> dict:
    return {"plan": ["search", "summarize"], "current_step": 0}
def run_executor(state: AgentState) -> dict:
    step = state["plan"][state["current_step"]]
    return {"results": state["results"] + [f"ran {step}"],
            "current_step": state["current_step"] + 1}
def check_completion(state: AgentState) -> dict:
    return {}  # routing happens on the conditional edge below
# Build the graph
graph = StateGraph(AgentState)
graph.add_node("planner", run_planner)     # LLM call -> produces plan list
graph.add_node("executor", run_executor)   # Tool call -> appends to results
graph.add_node("checker", check_completion)
# Wire the execution flow
graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_conditional_edges(
    "executor",
    lambda state: "checker" if state["current_step"] < len(state["plan"]) else END,
)
graph.add_edge("checker", "executor")
agent = graph.compile()
result = agent.invoke({"task": "Research top 3 AI papers and draft a summary",
                       "plan": [], "results": [], "current_step": 0})
AutoGen (Microsoft) takes a conversation-centric approach: agents are modeled as participants in a multi-agent dialogue, where each agent has a role (AssistantAgent, UserProxyAgent) and can call tools or other agents as part of the conversation flow. It excels at collaborative multi-agent tasks where two or more LLMs reason together.
from autogen import AssistantAgent, UserProxyAgent
planner = AssistantAgent(
name="Planner",
system_message="You decompose goals into numbered steps and produce a JSON plan.",
llm_config={"model": "gpt-4o"},
)
executor = UserProxyAgent(
name="Executor",
human_input_mode="NEVER",
code_execution_config={"work_dir": "./workspace"},
)
# The executor drives the conversation; the planner produces the plan
executor.initiate_chat(
planner,
message="Research the top 3 AI papers from last month and summarize each.",
max_turns=10,
)
| Framework | Model | Best for |
| --- | --- | --- |
| LangGraph | Graph of nodes + state | Durable pipelines, resumable workflows, conditional branching |
| AutoGen | Multi-agent conversation | Collaborative reasoning, code generation, peer-review loops |
For a full deep-dive on LangGraph and AutoGen, dedicated follow-up posts are planned.
Key Lessons
Plan validation is non-negotiable. Always validate the generated plan against the registered tool list and expected argument schema before execution begins. Catching hallucinated tool names at plan time costs one validation pass; catching them at execution time costs the tokens from all preceding steps plus a re-plan.
Re-plan from failure state, not from scratch. When step 4 of a 10-step plan fails, the results of steps 1–3 are valid and should be preserved. A well-designed executor passes the current completed-steps state back to the planner, which then generates a revised plan for steps 4–10 only. Restarting from step 1 wastes tokens and time.
Summarize intermediate results aggressively. Long intermediate outputs from tool calls accumulate in the context window. For plans with more than 5–6 steps, summarize each step's output before passing it to the next step. This prevents context overflow and keeps the LLM focused on the current step rather than retrieving irrelevant prior results.
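One hedged sketch of this compaction step, with plain truncation as a fallback when no summarizer LLM is available (`compact` and its signature are illustrative, not a library function):

```python
def compact(result, limit=500, summarizer=None):
    """Compress a step's raw output before it enters the shared context.

    `summarizer` is an optional LLM call (a hypothetical callable); without
    one, fall back to plain truncation.
    """
    if len(result) <= limit:
        return result
    if summarizer is not None:
        return summarizer(f"Summarize in under {limit} characters:\n{result}")
    return result[:limit] + " [truncated]"

print(compact("short output"))          # short output
print(compact("x" * 2000)[-11:])        # [truncated]
```

Truncation is lossy, so an LLM summarizer is preferable for outputs whose tail matters (logs, stack traces); the fallback only guards against context overflow.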
Prefer parallel execution for independent steps. If steps 2, 3, and 4 each depend only on step 1's output (and not on each other), they can run in parallel. This reduces total wall-clock time significantly for I/O-bound steps like web fetches or database queries. The plan structure should encode this parallelism explicitly.
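For I/O-bound steps this parallelism maps naturally onto `asyncio.gather`. The sketch below simulates tool latency with `asyncio.sleep` rather than calling real tools:

```python
import asyncio

async def run_step(step):
    """Stand-in for an I/O-bound tool call such as a web fetch (simulated)."""
    await asyncio.sleep(0.01)  # simulated network latency
    return f"result of {step['action']} on {step['args'][0]}"

async def run_parallel(steps):
    """Execute independent steps concurrently; results keep the input order."""
    return await asyncio.gather(*(run_step(s) for s in steps))

# Steps that depend only on an earlier search result can run together
independent = [{"action": "fetch_abstract", "args": [f"paper_{i}"]} for i in (1, 2, 3)]
print(asyncio.run(run_parallel(independent)))
```

With real network latency, three concurrent fetches take roughly as long as the slowest single fetch instead of the sum of all three.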
Choose multistep agents only for knowable-structure tasks. If the required steps cannot be enumerated without observing intermediate results, a ReAct loop is more appropriate. Multistep planning adds overhead: one extra LLM call for planning and the cognitive cost of maintaining a step list. Use it when the structure is known; use ReAct when it is not.
TLDR: Summary & Key Takeaways
- Multistep agents plan the full task structure upfront, then execute step by step with minimal LLM calls during execution.
- Plan-and-Execute = one Planner LLM call → JSON step list → Executor loop using tools.
- Best for tasks with a knowable structure (reports, research pipelines, automated workflows).
- Failure handling: re-plan from failed step, not from scratch.
- LangChain's PlanAndExecute wraps this pattern in a few lines of Python.
Deep Dive: LangChain Plan-and-Execute Agent
from langchain_experimental.plan_and_execute import (
PlanAndExecute,
load_agent_executor,
load_chat_planner,
)
from langchain_openai import ChatOpenAI
from langchain_community.tools import WikipediaQueryRun, DuckDuckGoSearchRun
from langchain_community.utilities import WikipediaAPIWrapper
llm = ChatOpenAI(model="gpt-4o", temperature=0)  # temperature=0: deterministic planning produces more consistent JSON step lists
tools = [WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper()), DuckDuckGoSearchRun()]
planner = load_chat_planner(llm) # generates the structured JSON plan from the goal
executor = load_agent_executor(llm, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor)
# planner produces the step list once; executor calls tools for each step without re-running the planner
agent.invoke({
"input": "Research the top 3 AI papers from last month, summarize each, and draft a blog post."
})
The planner produces a step list; the executor runs each step with access to tools.
Internals
Multi-step agents implement a Thought-Action-Observation loop: the LLM emits a reasoning trace, selects a tool call with arguments, receives the result, and appends it to context before the next step. The agent loop runs until a terminal condition (task complete, max steps, or error). Tool dispatch is typically handled by structured output parsing (JSON function calling) rather than free-text extraction, reducing parse failures.
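The structured-output dispatch described above can be sketched as a defensive JSON parse. The `{"tool": ..., "arguments": ...}` schema below is an assumed shape; real function-calling APIs return a similar but provider-specific structure:

```python
import json

def parse_tool_call(raw):
    """Parse the model's structured tool-call output, rejecting malformed calls.

    Assumes a {"tool": ..., "arguments": {...}} schema -- an illustrative shape,
    not any specific provider's wire format.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # signal the agent loop to re-prompt instead of crashing
    if not isinstance(call, dict) or "tool" not in call:
        return None
    return call["tool"], call.get("arguments", {})

print(parse_tool_call('{"tool": "search", "arguments": {"query": "AI papers"}}'))
# ('search', {'query': 'AI papers'})
print(parse_tool_call("not json"))
# None
```

Returning `None` instead of raising is what turns a parse failure into a recoverable re-prompt inside the agent loop.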
Performance Analysis
A 5-step agent with GPT-4 averages 15–25 seconds end-to-end due to sequential LLM calls (~3s each) plus tool latency. Parallelizing independent tool calls (e.g., concurrent web searches) cuts wall time by 40–60%. Smaller orchestrator models (GPT-3.5 or 7B local) reduce per-step cost by 10–50× at the expense of ~15% more planning errors on complex tasks.
Trade-offs & Failure Modes: When to Use Multistep Agents vs Simple Agents
| Use Case | Simple ReAct | Multistep Plan-Execute |
| --- | --- | --- |
| Q&A with a single tool lookup | Good fit | Overkill |
| Writing a report with 8 research steps | Poor fit | Good fit |
| Interactive conversation with user feedback | Good fit | Awkward |
| Automated pipeline with known step structure | Poor fit | Good fit |
| Debugging code with back-and-forth tool calls | Good fit | Poor fit |
Critical failure modes for multistep agents:
- Stale plan: If step 3 fails, steps 4-8 may be based on incorrect assumptions. Solution: re-plan from the failure point.
- Context window overflow: 10-step plans with long intermediate outputs can exceed context length. Solution: summarize intermediate results before passing to the next step.
- Hallucinated tool calls: LLM may plan to call a tool that doesn't exist. Solution: validate the plan against available tools before execution begins.
Decision Guide: ReAct vs. Plan-and-Execute
| Situation | Use |
| --- | --- |
| Steps can be enumerated before starting | Plan-and-Execute |
| Each next step depends on observing prior results | ReAct |
| Task has 5+ interdependent actions | Plan-and-Execute |
| Interactive conversation with user feedback | ReAct |
| Automated pipeline with predictable structure | Plan-and-Execute |
