
Multistep AI Agents: The Power of Planning


Abstract Algorithms · 14 min read

AI-assisted content.

TLDR: A simple ReAct agent reacts one tool call at a time. A multistep agent plans a complete task decomposition upfront, then executes each step sequentially, handling complex goals that require 5-10 interdependent actions without re-prompting the LLM for each step.


📖 Line Cook vs. Head Chef

A line cook (simple ReAct agent): receives one ticket, cooks one dish, hands it over. Then the next ticket.

A head chef (multistep agent): receives the full dinner party menu, plans the entire 5-course sequence, coordinates prep timing for all dishes, anticipates which items can be done in parallel, and manages the full execution before the first guest is seated.

The difference: planning before acting. Complex goals require plans, not just reactions.


๐Ÿ” What Makes a Multistep Agent Different

A simple ReAct agent operates with a single-step planning horizon: it looks at the current state, picks the next action, observes the result, and repeats. This works well for short, open-ended tasks where the required steps are unknown ahead of time. But when a task involves 5–10 interdependent actions, this approach becomes inefficient: the LLM must rediscover context at every step, decisions made in step 6 can conflict with assumptions from step 2, and there is no mechanism to detect that a task is structurally impossible before spending tokens on execution.
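For contrast, the single-step loop can be sketched in a few lines. This is a minimal illustration, not any framework's implementation; `llm_choose_action` is a hypothetical stand-in for the LLM call that picks the next action (or returns `None` when the goal is met):

```python
# Minimal single-step (ReAct-style) loop: the model is re-consulted
# after every observation, and the full history is re-sent each time.
def react_loop(goal, tools, llm_choose_action, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = llm_choose_action(goal, history)  # one LLM call per step
        if action is None:
            break
        observation = tools[action["name"]](*action["args"])
        history.append((action, observation))      # context grows every step
    return history
```

Note that every iteration pays for an LLM call, which is exactly the cost the planning phase below amortizes.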

A multistep agent solves this by introducing a distinct planning phase before any tool is called. During planning, the LLM receives the full goal and produces a structured decomposition, typically a JSON array of steps, each with an action name and arguments. This plan encodes the task dependency graph: which steps can run in parallel, which must complete before others begin, and what the expected output of each step feeds into downstream steps.

Key concepts in multistep agent design:

  • Planning horizon: how many steps ahead the agent reasons before committing to an action
  • Task decomposition: breaking a complex goal into discrete, executable sub-tasks with clear inputs and outputs
  • Step dependencies: the directed acyclic graph (DAG) of which steps must complete before others can begin
  • Re-planning: when a step fails, generating a revised plan for the remaining steps based on current state
  • Context management: summarizing intermediate results to prevent context window overflow across many steps

Understanding these concepts is essential before choosing whether to use a multistep agent or a simpler reactive loop for a given problem.
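The dependency concepts above can be made concrete in a few lines of Python. `PlanStep` and `execution_order` are illustrative names, not part of any framework; the standard-library `graphlib` module (Python 3.9+) performs the topological sort over the step DAG:

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class PlanStep:
    step_id: int
    action: str
    args: list
    depends_on: set = field(default_factory=set)  # step_ids that must finish first

def execution_order(steps):
    """Return step_ids in an order that respects the dependency DAG."""
    ts = TopologicalSorter({s.step_id: s.depends_on for s in steps})
    return list(ts.static_order())

plan = [
    PlanStep(1, "search", ["top AI papers"]),
    PlanStep(2, "fetch_abstract", ["paper_1"], {1}),
    PlanStep(3, "fetch_abstract", ["paper_2"], {1}),
    PlanStep(4, "write_post", ["summaries"], {2, 3}),
]
print(execution_order(plan))  # step 1 first, step 4 last; 2 and 3 may run in parallel
```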


🔢 Simple ReAct vs. Plan-and-Execute: Core Difference

| Dimension | ReAct (Single-Step Loop) | Plan-and-Execute (Multistep) |
|---|---|---|
| Planning | None; LLM decides next action after each observation | LLM creates a full plan upfront (JSON array of steps) |
| LLM calls | One per action (tight feedback loop) | One for planning; one per step for execution |
| Best for | Short, open-ended tasks with unknown required steps | Long tasks with a knowable step structure |
| Failure handling | Adapts after each observation | Re-plans on step failure |
| Token cost | Lower per step | Higher plan call; lower execution calls |

โš™๏ธ The Plan-and-Execute Architecture

Goal: "Research the top 3 AI papers from last month, summarize each, and draft a blog post."

Phase 1 – Plan Call (one LLM call):

[
  { "step": 1, "action": "search", "args": ["top AI papers July 2025"] },
  { "step": 2, "action": "fetch_abstract", "args": ["paper_id_1"] },
  { "step": 3, "action": "summarize", "args": ["abstract_1"] },
  { "step": 4, "action": "fetch_abstract", "args": ["paper_id_2"] },
  { "step": 5, "action": "summarize", "args": ["abstract_2"] },
  { "step": 6, "action": "fetch_abstract", "args": ["paper_id_3"] },
  { "step": 7, "action": "summarize", "args": ["abstract_3"] },
  { "step": 8, "action": "write_post", "args": ["[summary_1, summary_2, summary_3]"] }
]

Phase 2 – Execution Loop (LLM only called when tool output needs reasoning):

flowchart TD
    Goal[Complex Goal] --> Planner["LLM Planner (one call → JSON plan)"]
    Planner --> Loop[Executor Loop]
    Loop --> Step["Execute Next Step (tool call or sub-LLM call)"]
    Step --> Check{Last step?}
    Check -->|No| Loop
    Check -->|Yes| Result[Final Result]
    Step -->|Failure| Replan[Re-plan remaining steps]
    Replan --> Loop
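The executor loop in the diagram can be sketched as a short Python function. The tool registry here is a toy stand-in (a real executor would wire step outputs into downstream step arguments, as step 8 of the plan above implies):

```python
# Toy executor for a JSON plan like the one above; the tool registry is a stand-in.
TOOLS = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: f"summary of {text[:20]}",
}

def execute_plan(plan):
    results = {}
    for step in plan:  # the JSON array from the single planner call
        output = TOOLS[step["action"]](*step["args"])
        results[step["step"]] = output  # kept for downstream steps
    return results

plan = [
    {"step": 1, "action": "search", "args": ["multistep agents"]},
    {"step": 2, "action": "summarize", "args": ["long article text"]},
]
print(execute_plan(plan))
```

Notice that no LLM is consulted inside the loop; the plan alone drives execution.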

📊 Multistep Agent Execution Flow

Understanding how a multistep agent transitions between planning and execution is critical for debugging and designing reliable pipelines. The flow is not a simple linear sequence โ€” it includes conditional branches for parallel execution and failure recovery. The diagram below captures the full lifecycle from receiving a goal to delivering a final aggregated result.

flowchart TD
    Goal[User Goal] --> Planner["LLM Planner (produce JSON step list)"]
    Planner --> Validate["Validate Plan (tool names, schema)"]
    Validate -->|Valid| Decompose[Decompose into Parallel + Sequential Steps]
    Validate -->|Invalid| Planner
    Decompose --> Parallel["Run Parallel Steps (independent actions)"]
    Decompose --> Sequential["Run Sequential Steps (dependent actions)"]
    Parallel --> Aggregate[Aggregate Intermediate Results]
    Sequential --> Aggregate
    Aggregate --> Check{All steps complete?}
    Check -->|Yes| Final[Final Result Delivered]
    Check -->|Step Failed| Replan[Re-plan Remaining Steps from Current State]
    Replan --> Decompose

This flow highlights two key design decisions: plan validation happens before any tool call (catching hallucinated tool names early), and aggregation combines outputs from parallel and sequential branches before the next dependent step begins. The re-plan path preserves completed step results; there is no need to restart from the beginning when a single step fails mid-execution.
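A validation pass of this kind is short to write. The sketch below checks only tool names and required keys; the step schema and error format are illustrative, not from any particular framework:

```python
def validate_plan(plan, registered_tools):
    """Reject plans that reference unknown tools or malformed steps before
    any execution tokens are spent."""
    errors = []
    for step in plan:
        if not {"step", "action", "args"} <= step.keys():
            errors.append(f"step {step.get('step', '?')}: missing required keys")
        elif step["action"] not in registered_tools:
            errors.append(f"step {step['step']}: unknown tool '{step['action']}'")
    return errors

plan = [
    {"step": 1, "action": "search", "args": ["query"]},
    {"step": 2, "action": "fetch_page", "args": ["url"]},  # hallucinated tool
]
print(validate_plan(plan, {"search", "summarize"}))
```

A non-empty error list routes back to the planner (the `Invalid` edge in the diagram) instead of starting execution.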


๐ŸŒ Real-World Applications: Real-World Use Cases

Multistep agents are not theoretical constructs; they power several categories of real production systems today.

Automated Research Pipelines: A research agent might receive a goal like "produce a competitive analysis of the top 5 CRM tools." The plan includes: search for the top 5 tools (step 1), fetch the pricing page for each tool (steps 2–6 in parallel), extract key features from each page (steps 7–11), compare them across a structured rubric (step 12), and produce a formatted markdown report (step 13). No human re-prompts the system between steps; the plan drives the full execution.

Coding Agents: Agents like GitHub Copilot Workspace and Devin use multistep planning to handle tasks like "add OAuth login to this codebase." The plan might include: reading existing auth files, identifying integration points, writing new code files, updating configuration, running tests, and fixing any test failures. Each step has dependencies on prior outputs, making reactive single-step agents impractical.

Business Process Automation: Invoice processing pipelines, HR onboarding workflows, and compliance audits can all be modeled as multistep plans. An invoice agent might: extract line items (step 1), validate against purchase orders (step 2), route exceptions for human review (step 3), and post approved invoices to the accounting system (step 4). The structured plan makes auditing and failure recovery straightforward.

Data Analysis Workflows: Data science agents receive goals like "analyze Q3 sales data and identify the top 3 regional trends." The plan includes: querying the data warehouse, cleaning the dataset, computing regional aggregates, ranking by growth rate, and generating a narrative summary. Each step produces a structured artifact consumed by the next step, a pattern that maps directly to the plan-and-execute model.

In every case, the shared characteristic is a knowable step structure: the agent can enumerate the required actions before observing the results of any single action. This is the fundamental criterion for choosing a multistep agent over a reactive loop.

📊 Multistep Planning Sequence

sequenceDiagram
    participant U as User
    participant P as LLM Planner
    participant E as Executor
    participant T as Tools

    U->>P: Submit goal
    P->>P: Decompose into step list
    P-->>E: Plan [step1, step2, step3...]
    E->>T: Execute step 1
    T-->>E: Result 1
    E->>P: Observe result 1
    P->>P: Update plan if needed
    E->>T: Execute step 2
    T-->>E: Result 2
    E->>T: Execute step 3
    T-->>E: Result 3
    E-->>U: Aggregated final answer

This sequence diagram traces the communication flow between the four key participants in a multistep agent run: the user who submits the goal, the LLM Planner that decomposes it into a numbered step list, the Executor that drives each tool call, and the Tools that perform the actual operations. Notice that the Planner is consulted again after step 1 completes; this models the optional re-planning trigger, where intermediate results can cause the agent to revise remaining steps before continuing. This feedback loop is what distinguishes plan-and-execute agents from a static batch of sequential tool calls.

📊 Planning Loop Decision Flow

flowchart TD
    Goal[Receive Goal]
    Decompose[LLM: Decompose into step list]
    Queue["Queue Steps (DAG order)"]
    Execute[Execute Next Step with Tools]
    UpdateState[Update Shared State with Result]
    Done{All Steps Complete?}
    Replan{Step Failed?}
    Final[Return Final Result]
    ReplanStep[Re-plan Remaining Steps from State]

    Goal --> Decompose --> Queue --> Execute --> UpdateState
    UpdateState --> Done
    Done -->|Yes| Final
    Done -->|No| Replan
    Replan -->|Yes| ReplanStep --> Queue
    Replan -->|No| Execute

This flowchart details the executor's decision logic during the step-by-step execution phase. After each step's result is written to shared state, the executor checks two conditions: whether all steps are complete, and whether the last step failed. A successful completion routes to the final result. A step failure triggers the re-planner, which generates a revised plan for the remaining steps using current accumulated state, avoiding a full restart. If neither condition is met, the executor simply advances to the next queued step, forming the core execution loop of any plan-and-execute agent.
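The decision loop described above can be sketched directly. In this illustration, `replan_fn` is a stand-in for the planner LLM call and the tool registry is hypothetical; the key property shown is that completed results survive a mid-plan failure:

```python
def run_with_replan(plan, tools, replan_fn, max_replans=2):
    """Execute queued steps; on a failure, ask replan_fn for revised steps
    covering only the remaining work."""
    state = {"completed": [], "results": {}}
    replans = 0
    queue = list(plan)
    while queue:
        step = queue.pop(0)
        try:
            state["results"][step["step"]] = tools[step["action"]](*step["args"])
            state["completed"].append(step["step"])
        except Exception:
            if replans >= max_replans:
                raise
            replans += 1
            # Completed results are preserved; only remaining steps are revised.
            queue = replan_fn(state, queue)
    return state

tools = {"fetch": lambda url: f"page at {url}", "parse": lambda _: 1 / 0}  # parse always fails
plan = [
    {"step": 1, "action": "fetch", "args": ["https://example.com"]},
    {"step": 2, "action": "parse", "args": ["html"]},
]
# The (stub) re-planner swaps the failing step for a working one:
final = run_with_replan(plan, tools, lambda state, rest: [{"step": 2, "action": "fetch", "args": ["retry"]}])
print(final["completed"])  # [1, 2]
```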


🧪 Practical: Building a Simple Plan-and-Execute Agent

The fastest way to understand multistep agents is to trace how the planner and executor components interact in code. Below is a minimal Python example using LangChain's PlanAndExecute abstraction with custom tools.

Step 1 – Define tools and the LLM:

from langchain_openai import ChatOpenAI
from langchain.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for the given query."""
    return f"Results for: {query}"

@tool
def summarize_text(text: str) -> str:
    """Summarize the provided text."""
    return f"Summary of: {text[:50]}..."

tools = [search_web, summarize_text]
llm = ChatOpenAI(model="gpt-4o", temperature=0)

Step 2 – Wire up planner and executor:

from langchain_experimental.plan_and_execute import (
    PlanAndExecute, load_agent_executor, load_chat_planner
)

planner  = load_chat_planner(llm)                          # single LLM call that generates the JSON step list
executor = load_agent_executor(llm, tools, verbose=True)   # runs each step; verbose=True prints tool call traces
agent    = PlanAndExecute(planner=planner, executor=executor, verbose=True)

Step 3 – Invoke with a multi-step goal:

result = agent.invoke({
    "input": "Find the top 2 benefits of multistep AI agents and summarize them."
})
print(result["output"])

When you run this with verbose=True, you will see the planner emit a JSON step list, then the executor call each tool in sequence. This trace is invaluable for debugging: if a step produces an unexpected output, you can inspect exactly where the plan diverged from reality and adjust either the planner prompt or the tool implementation accordingly.


๐Ÿ› ๏ธ LangGraph and AutoGen: How OSS Frameworks Implement Multistep Agent Orchestration

Two open-source frameworks have emerged as the leading ways to build multistep agents with explicit state management, and they take very different architectural approaches.

LangGraph is a graph-based agent orchestration library from LangChain that models an agent's execution as a directed graph of nodes (LLM calls, tool calls, or logic) and edges (conditional transitions). It gives you full control over state, supports cycles (loops), and is designed for durable, resumable workflows, making it the production-ready choice for complex multistep pipelines.

from langgraph.graph import StateGraph, END
from typing import TypedDict

# Define the shared state for the agent
class AgentState(TypedDict):
    task: str
    plan: list
    results: list
    current_step: int

# Node functions: each reads state and returns the keys it updates.
# These are minimal stubs; in practice planner/executor wrap LLM and tool calls.
def run_planner(state: AgentState) -> dict:
    return {"plan": ["search", "summarize", "draft"], "current_step": 0}

def run_executor(state: AgentState) -> dict:
    step = state["plan"][state["current_step"]]
    return {
        "results": state["results"] + [f"executed {step}"],
        "current_step": state["current_step"] + 1,
    }

def check_completion(state: AgentState) -> dict:
    return {}  # routing happens in the conditional edge below

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("planner", run_planner)      # LLM call → produces plan list
graph.add_node("executor", run_executor)    # Tool call → appends to results
graph.add_node("checker", check_completion) # Logic → routes to next step or END

# Wire the execution flow
graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_conditional_edges(
    "executor",
    lambda state: "checker" if state["current_step"] < len(state["plan"]) else END,
)
graph.add_edge("checker", "executor")

agent = graph.compile()
result = agent.invoke({"task": "Research top 3 AI papers and draft a summary",
                       "plan": [], "results": [], "current_step": 0})

AutoGen (Microsoft) takes a conversation-centric approach: agents are modeled as participants in a multi-agent dialogue, where each agent has a role (AssistantAgent, UserProxyAgent) and can call tools or other agents as part of the conversation flow. It excels at collaborative multi-agent tasks where two or more LLMs reason together.

from autogen import AssistantAgent, UserProxyAgent

planner = AssistantAgent(
    name="Planner",
    system_message="You decompose goals into numbered steps and produce a JSON plan.",
    llm_config={"config_list": [{"model": "gpt-4o"}]},
)

executor = UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "./workspace"},
)

# The executor drives the conversation; the planner produces the plan
executor.initiate_chat(
    planner,
    message="Research the top 3 AI papers from last month and summarize each.",
    max_turns=10,
)

| Framework | Model | Best for |
|---|---|---|
| LangGraph | Graph of nodes + state | Durable pipelines, resumable workflows, conditional branching |
| AutoGen | Multi-agent conversation | Collaborative reasoning, code generation, peer-review loops |

For a full deep-dive on LangGraph and AutoGen, dedicated follow-up posts are planned.


📚 Key Lessons

  1. Plan validation is non-negotiable. Always validate the generated plan against the registered tool list and expected argument schema before execution begins. Catching hallucinated tool names at plan time costs one validation pass; catching them at execution time costs the tokens from all preceding steps plus a re-plan.

  2. Re-plan from failure state, not from scratch. When step 4 of a 10-step plan fails, the results of steps 1–3 are valid and should be preserved. A well-designed executor passes the current completed-steps state back to the planner, which then generates a revised plan for steps 4–10 only. Restarting from step 1 wastes tokens and time.

  3. Summarize intermediate results aggressively. Long intermediate outputs from tool calls accumulate in the context window. For plans with more than 5–6 steps, summarize each step's output before passing it to the next step. This prevents context overflow and keeps the LLM focused on the current step rather than retrieving irrelevant prior results.

  4. Prefer parallel execution for independent steps. If steps 2, 3, and 4 each depend only on step 1's output (and not on each other), they can run in parallel. This reduces total wall-clock time significantly for I/O-bound steps like web fetches or database queries. The plan structure should encode this parallelism explicitly.

  5. Choose multistep agents only for knowable-structure tasks. If the required steps cannot be enumerated without observing intermediate results, a ReAct loop is more appropriate. Multistep planning adds overhead: one extra LLM call for planning and the cognitive cost of maintaining a step list. Use it when the structure is known; use ReAct when it is not.
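Lesson 4 is straightforward to apply with the standard library. In this sketch, `fetch_abstract` is a stub for an I/O-bound tool call (web fetch, database query); a thread pool runs the independent steps concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_abstract(paper_id):
    # stand-in for an I/O-bound tool call (web fetch, database query)
    return f"abstract of {paper_id}"

# Steps that depend only on an earlier step's output, and not on each other,
# can run concurrently; a thread pool suits I/O-bound tool calls.
paper_ids = ["paper_1", "paper_2", "paper_3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    abstracts = list(pool.map(fetch_abstract, paper_ids))  # order is preserved
print(abstracts)
```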


📌 TLDR: Summary & Key Takeaways

  • Multistep agents plan the full task structure upfront, then execute step by step with minimal LLM calls during execution.
  • Plan-and-Execute = one Planner LLM call → JSON step list → Executor loop using tools.
  • Best for tasks with a knowable structure (reports, research pipelines, automated workflows).
  • Failure handling: re-plan from failed step, not from scratch.
  • LangChain's PlanAndExecute wraps this pattern in a few lines of Python.

🧠 Deep Dive: LangChain Plan-and-Execute Agent

from langchain_experimental.plan_and_execute import (
    PlanAndExecute,
    load_agent_executor,
    load_chat_planner,
)
from langchain_openai import ChatOpenAI
from langchain_community.tools import WikipediaQueryRun, DuckDuckGoSearchRun
from langchain_community.utilities import WikipediaAPIWrapper

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # temperature=0: deterministic planning produces more consistent JSON step lists
tools = [WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper()), DuckDuckGoSearchRun()]

planner   = load_chat_planner(llm)   # generates the structured JSON plan from the goal
executor  = load_agent_executor(llm, tools, verbose=True)
agent     = PlanAndExecute(planner=planner, executor=executor)
# planner produces the step list once; executor calls tools for each step without re-running the planner

agent.invoke({
    "input": "Research the top 3 AI papers from last month, summarize each, and draft a blog post."
})

The planner produces a step list; the executor runs each step with access to tools.


🔬 Internals

Multistep agents implement a Thought-Action-Observation loop: the LLM emits a reasoning trace, selects a tool call with arguments, receives the result, and appends it to context before the next step. The agent loop runs until a terminal condition (task complete, max steps, or error). Tool dispatch is typically handled by structured output parsing (JSON function calling) rather than free-text extraction, reducing parse failures.
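Structured dispatch can be sketched as a small parsing layer. The tool registry and error format below are illustrative; the point is that a malformed or unknown call is surfaced as a parse error rather than misfired:

```python
import json

def dispatch(llm_output, tools):
    """Parse a structured tool call (JSON function-calling style) instead of
    scraping free text."""
    try:
        call = json.loads(llm_output)
        return tools[call["name"]](**call["arguments"])
    except (json.JSONDecodeError, KeyError) as exc:
        return f"dispatch error: {exc}"

tools = {"search": lambda query: f"results for {query}"}
print(dispatch('{"name": "search", "arguments": {"query": "agents"}}', tools))
```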

⚡ Performance Analysis

A 5-step agent with GPT-4 averages 15–25 seconds end-to-end due to sequential LLM calls (~3s each) plus tool latency. Parallelizing independent tool calls (e.g., concurrent web searches) cuts wall time by 40–60%. Smaller orchestrator models (GPT-3.5 or 7B local) reduce per-step cost by 10–50× at the expense of ~15% more planning errors on complex tasks.

โš–๏ธ Trade-offs & Failure Modes: When to Use Multistep Agents vs Simple Agents

| Use Case | Simple ReAct | Multistep Plan-Execute |
|---|---|---|
| Q&A with a single tool lookup | ✅ | Overkill |
| Writing a report with 8 research steps | – | ✅ |
| Interactive conversation with user feedback | ✅ | Awkward |
| Automated pipeline with known step structure | – | ✅ |
| Debugging code with back-and-forth tool calls | ✅ | – |

Critical failure modes for multistep agents:

  • Stale plan: If step 3 fails, steps 4-8 may be based on incorrect assumptions. Solution: re-plan from the failure point.
  • Context window overflow: 10-step plans with long intermediate outputs can exceed context length. Solution: summarize intermediate results before passing to the next step.
  • Hallucinated tool calls: LLM may plan to call a tool that doesn't exist. Solution: validate the plan against available tools before execution begins.
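The context-overflow mitigation can be as simple as a bounded compression helper. This naive head-and-tail truncation is a stand-in for an LLM summarization call; the name and cutoff are illustrative:

```python
def compress_result(text, max_chars=400):
    """Bound what each step's output contributes to the context window.
    A real agent would replace this with an LLM summarization call."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + " ...[truncated]... " + text[-half:]

print(len(compress_result("x" * 5000)))  # bounded, regardless of tool output size
```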

🧭 Decision Guide: ReAct vs. Plan-and-Execute

| Situation | Use |
|---|---|
| Steps can be enumerated before starting | Plan-and-Execute |
| Each next step depends on observing prior results | ReAct |
| Task has 5+ interdependent actions | Plan-and-Execute |
| Interactive conversation with user feedback | ReAct |
| Automated pipeline with predictable structure | Plan-and-Execute |

