Home/Blog/Java/Low-Level Design for an AI Agent Orchestration Engine: Designing a Stateful Execution Framework
JavaAdvancedβ€’23 min readβ€’

Low-Level Design for an AI Agent Orchestration Engine: Designing a Stateful Execution Framework

Learn how to design and implement a stateful, graph-based agent orchestration engine in Java with full OOP design.

Abstract Algorithms

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: In this guide, we design a stateful, graph-based AI agent execution engine in Java using clean object-oriented principles. By structuring execution as nodes and edges over a shared state, we prevent memory leaks, manage complex routing, and create a highly extensible framework.


πŸ“– The Design Challenge: Imperative Spaghetti vs. Graph-Based Orchestration

Imagine you are building a customer support automation platform. Your agent needs to read incoming emails, decide whether to fetch data from an internal billing system, call a validation model to ensure compliance, and either reply automatically or escalate to a human.

When you start, you might implement this as a series of nested service class calls:

public class AgentService {
    public void processTicket(Ticket ticket) {
        Classification classification = llmClient.classify(ticket);
        if (classification.isBillingRelated()) {
            BillingInfo info = billingClient.fetch(ticket.getUserId());
            String draft = llmClient.generateDraft(ticket, info);
            if (complianceClient.validate(draft)) {
                emailClient.send(ticket.getUserEmail(), draft);
            } else {
                escalationService.escalate(ticket, "Compliance validation failed.");
            }
        } else {
            // More nested branches...
        }
    }
}

This nested imperative code works fine for simple trees. But what happens when the agent needs to loop? For instance, if the compliance check fails, you might want to send the compliance error back to the LLM agent, ask it to revise the draft, and try again, capping the loop at three attempts.

As soon as you introduce loops, retry policies, state history, and asynchronous human-in-the-loop approvals, this nested class structure collapses. You end up with:

  • Spaghetti Callbacks: Different services calling each other cyclically, creating stack overflows.
  • Volatile State: Thread-local contexts that disappear when a container restarts, causing loss of execution progress.
  • Tight Coupling: Adding a new safety guardrail or tool requires modifying the core orchestration logic.

To solve this, we need a graph-based stateful orchestration engine. By modeling our agentic workflow as a directed graph where nodes represent execution steps and edges define transition logic over a shared, serializable state, we decouple the execution mechanism from individual tasks.


πŸ‘₯ Use Cases & Actors

Before diving into code, we must define the functional boundary of our engine.

Actors

  • End User: Submits a natural language query or task to the system.
  • Orchestrator Platform: The core engine that loads the execution graph, tracks history, and manages state transitions.
  • AI Model (LLM): The reasoning provider that determines planning steps or formatting.
  • Tool Executor: An external integration (e.g., database, API client) invoked by the agent during execution.
  • Human Reviewer: An operator who inspects proposed high-risk actions (e.g., initiating a refund) and approves or rejects them.

Functional Scope (Use Cases)

  1. Define Execution Graph: Programmatically build a directed graph containing execution nodes (e.g., Tool call, LLM prompt, human review) and transition edges.
  2. Execute Stateful Workflow: Run a graph execution from a starting node, updating a thread-safe, persisted state context at each step.
  3. Handle Loops and Decisions: Allow nodes to return state updates that dictate conditional routing to downstream nodes.
  4. Pause for Human Verification: Support interrupting the workflow at a node, saving the state, and resuming when an external actor submits approval.
  5. Register and Invoke Tools: Register custom tools dynamically and allow the execution loop to invoke them based on model output.

βš™οΈ Core Mechanics: Class Relationships and State Transitions

To build a reliable graph orchestration engine, we must separate the structural configuration of our graph from the dynamic execution state. This distinction is captured in the class diagram below, detailing how static node models coordinate with thread-safe data context structures.

classDiagram
    class StateGraph {
        -Map~String, Node~ nodes
        -Map~String, List~Edge~~ edges
        +addNode(String id, Node node)
        +addEdge(String fromNode, String toNode)
        +addConditionalEdge(String fromNode, String toNode, String stateKey, String expectedValue)
        +getNodes() Map~String, Node~
        +getTransitions(String nodeId) List~Edge~
    }

    class Node {
        <>
        +execute(AgentState state) NodeResult
    }

    class BaseNode {
        <>
        -String name
        +getName() String
    }

    class LLMReasoningNode {
        -LLMClient llmClient
        -String promptTemplate
        +execute(AgentState state) NodeResult
    }

    class ToolExecutionNode {
        -ToolRegistry toolRegistry
        +execute(AgentState state) NodeResult
    }

    class HumanApprovalNode {
        +execute(AgentState state) NodeResult
    }

    class Edge {
        -String fromNode
        -String toNode
        -TransitionCondition condition
        +getFrom() String
        +getTo() String
        +evaluate(AgentState state) boolean
    }

    class AgentState {
        -String executionId
        -Map~String, Object~ values
        -List~String~ history
        +get(String key) Object
        +put(String key, Object val)
        +recordVisit(String nodeId)
        +getHistory() List~String~
    }

    class AgentExecutor {
        -StateGraph graph
        -StateStore stateStore
        +run(String executionId, String startNodeId) AgentState
        +resume(String executionId, Map~String, Object~ humanInput) AgentState
    }

    class Tool {
        <>
        +execute(Map~String, Object~ params) Map~String, Object~
    }

    class ToolRegistry {
        -Map~String, Tool~ tools
        +register(String name, Tool tool)
        +getTool(String name) Tool
    }

    StateGraph "1" *-- "*" Node : contains
    StateGraph "1" *-- "*" Edge : contains
    BaseNode ..|> Node : implements
    LLMReasoningNode --|> BaseNode : extends
    ToolExecutionNode --|> BaseNode : extends
    HumanApprovalNode --|> BaseNode : extends
    AgentExecutor --> StateGraph : executes
    AgentExecutor --> AgentState : modifies
    ToolExecutionNode --> ToolRegistry : uses
    ToolRegistry "1" *-- "*" Tool : manages

This structural blueprint explains how the graph metadata is decoupled from runtime context. The StateGraph acts as a composite holding a map of Node interfaces and transition Edge instances. Concrete nodes such as LLMReasoningNode, ToolExecutionNode, and HumanApprovalNode extend BaseNode which implements the Node interface. The AgentExecutor coordinates the execution flow, operating directly on the StateGraph and updating AgentState. The ToolExecutionNode delegates execution to specific Tool instances managed by the ToolRegistry.


🧩 OOP Pillars Mapping

Our design enforces the core object-oriented programming pillars to ensure clean separation of concerns:

  • Encapsulation: The AgentState class encapsulates the internal map storing execution parameters and a list representing the step execution history. Access to this state is restricted via getters and mutators, protecting the integrity of thread-local state during runs.
  • Abstraction: The Node interface abstracts the behavior of graph execution blocks. The executor does not need to know whether a node calls an LLM, invokes a local database query, or prompts for human input; it simply invokes execute(AgentState).
  • Inheritance: We define an abstract class BaseNode containing shared properties like name and helper logger methods. Concrete nodes, such as LLMReasoningNode and ToolExecutionNode, inherit these baseline configurations, minimizing code duplication.
  • Polymorphism: Polymorphism is realized at run-time when the AgentExecutor loops through nodes. Calling node.execute(state) executes model routing in LLMReasoningNode or reflection-based API execution in ToolExecutionNode depending on the actual instance subtype.

πŸ“ SOLID Principles Walkthrough

  • Single Responsibility Principle (SRP):
    • StateGraph is responsible only for maintaining the topology (nodes and edges) of the graph. It does not manage active thread executions.
    • AgentExecutor handles the runtime lifecycle, coordinate state persistence, and manages step advancement.
    • ToolRegistry is responsible solely for indexing and looking up executable tools.
  • Open-Closed Principle (OCP):
    • The execution framework is open for extension but closed for modification. If we need to support a new step typeβ€”for instance, a node that translates responses to multiple languagesβ€”we implement the Node interface and add it to the graph without changing the execution code inside AgentExecutor.
  • Liskov Substitution Principle (LSP):
    • Any concrete subclass of BaseNode can be substituted wherever a Node interface is expected. The AgentExecutor executes all nodes uniformly through the Node reference, without requiring typecasting or instanceof checks.
  • Interface Segregation Principle (ISP):
    • Rather than creating a single massive class with registration, execution, and validation routines, we define small focused interfaces: Node (containing only the execution contract) and Tool (containing only the parameters execution model).
  • Dependency Inversion Principle (DIP):
    • The AgentExecutor depends entirely on high-level abstractions (Node interface) rather than concrete implementations like LLMReasoningNode. We inject these dependencies at runtime.

πŸ”— Interface Contracts Inventory

Below, we detail the interfaces that serve as boundaries between different subsystems.

1. Node

  • Purpose: Represents a single logical execution block inside the graph.
  • Method Signature:
    • NodeResult execute(AgentState state): Executes the node operation using the current workflow state.
  • Parameters: AgentState state (read-write access to current execution state).
  • Returns: NodeResult containing state updates and status (e.g., SUCCESS, PAUSED, FAILED).

2. Tool

  • Purpose: Represents an external system action that an LLM can invoke.
  • Method Signature:
    • Map<String, Object> execute(Map<String, Object> params): Runs the business logic of the tool with the extracted prompt arguments.
  • Parameters: Map of inputs parsed from the LLM JSON output.
  • Returns: Map representing the payload returned by the external service.

3. LLMClient

  • Purpose: Abstracts model interactions, allowing easy swapping between open-source models and managed cloud APIs.
  • Method Signature:
    • String generate(String prompt): Submits the formatted prompt to the configured LLM and returns the raw response text.
  • Parameters: Compiled prompt string containing system directives and context.
  • Returns: Text output generated by the model.

🧠 Deep Dive: Stateful Execution Internals

Orchestrator engines operate like low-overhead virtual machines running steps sequentially, writing to a persistent ledger at each transition boundary. Here, we analyze the runtime mechanics, mathematical models, and latency characteristics under high concurrencies.

The Internals

The core execution path inside the AgentExecutor is a sequential state transition loop. When execution is initiated, the engine reads the active node configuration, executes the business logic inside a database transaction sandbox, and checks the conditional edge evaluate functions to calculate the next target node identifier.

This step execution must be sandboxed:

  • If node execution fails, we roll back state changes and marker updates to keep the state clean.
  • If node execution requests suspension (e.g., waiting for human review), we save the intermediate state with a transition resume pointer and yield control back to the thread pool.

Mathematical Model

We define the execution loop mathematically as: Given a directed graph $G = (V, E)$, where:

  • $V = {v_1, v_2, \dots, v_n}$ is the set of execution nodes, where each node $v_i$ represents a function $f_i: S \to (S', \text{status})$ mapping the current state $S \in \mathcal{S}$ to a modified state $S'$ and an execution status.
  • $E \subseteq V \times V \times \mathcal{C}$ is the set of directed transition edges, where each edge $e = (v_a, v_b, c)$ contains a condition predicate function $c: S \to {\text{true}, \text{false}}$.

The state transition function $\delta: V \times \mathcal{S} \to V \times \mathcal{S}$ at step $t$ is computed as: $$ (S{t+1}, \text{status}{t+1}) = f_{\text{curr}}(St) $$ If $\text{status}{t+1} = \text{SUCCESS}$, the next node is selected by: $$ v{t+1} = \text{arg first } e = (v{\text{curr}}, v{\text{next}}, c) \in E \text{ s.t. } c(S{t+1}) = \text{true} $$

Performance Analysis

  • Time Complexity: The execution logic has an algorithmic cost of $O(N \times D)$ where $N$ is the number of visited nodes in the workflow path and $D$ is the maximum out-degree (transitions) of any node. Evaluating condition edges requires evaluating predicates, which are $O(1)$ dictionary checks.
  • Space Complexity: Graph definition requires $O(|V| + |E|)$ space to store the configurations. Dynamic runtime space is $O(H + K)$ where $H$ is the history path list size and $K$ is the number of variables in the AgentState bag.
  • Bottlenecks: Because nodes are generally synchronous, model inference (LLM latency ~1-3s) dominates runtime execution. Thread context switching can become a bottleneck under high concurrent request bursts, requiring virtual thread pools or asynchronous reactive executors.

πŸ—οΈ Advanced Concepts: Thread Concurrency and State Persistence

When running at enterprise scale, you must address distributed systems concerns to ensure state durability and execution safety.

1. Optimistic Locking and Concurrency

If multiple instances of the orchestrator API attempt to resume or tick the same execution ID simultaneously, race conditions can corrupt the event trace. To prevent this, include a version field inside the database row mapping the execution. Every update statement must execute a check-and-swap:

UPDATE agent_state SET values = :newValues, version = version + 1 WHERE execution_id = :id AND version = :currentVersion;

2. Transaction Demarcation

To avoid database connection starvation, do not keep transactions open while waiting for external LLM API calls. LLM calls can take several seconds. Open the transaction, read the state, run the inference outside the transaction, and open a new short-lived transaction to write the updates and edge transitions.


πŸ“Š Visualizing the Execution Flow

The flowchart below demonstrates the runtime execution loop inside the AgentExecutor, handling node transitions, pauses, and error states.

flowchart TD
    Start([Start Run]) --> Load[Load or Create AgentState]
    Load --> RunLoop{Current Node exists?}
    RunLoop -- Yes --> Visit[Record Node in Visit History]
    Visit --> Execute[Execute Node using current AgentState]
    Execute --> Evaluate{NodeResult status?}
    Evaluate -- FAILED --> Fail[Save State & Return State with Error]
    Evaluate -- PAUSED --> Pause[Save State & Suspend Execution]
    Evaluate -- SUCCESS --> Merge[Merge stateUpdates into AgentState]
    Merge --> FindEdge[Evaluate transition edges for match]
    FindEdge --> NextNode{Matched nextNode?}
    NextNode -- Yes --> UpdateNext[Set nextNodeId]
    UpdateNext --> RunLoop
    NextNode -- No --> RunLoop
    RunLoop -- No --> End([End Run & Return State])

This state lifecycle flow visually charts how a run progresses from the start trigger. If a node fails, execution terminates immediately to prevent cascading updates. If a node returns a PAUSED state, the orchestrator suspends thread execution, saves the current history to database records, and awaits external resume triggers (e.g., human-in-the-loop input).


🌍 Real-World Applications: Customer Support Automation

A classic real-world use case for this architecture is a support triage and refund processor. Under this scenario, the orchestrator routes tickets through intent classifiers and safety checkers.

For instance, when a ticket requests a refund over $100:

  1. An LLM classifier node identifies the user request.
  2. A database query tool pulls customer order data.
  3. The graph routes the flow to a HumanApprovalNode.
  4. The system pauses, returns a tracking token to the Web UI, and queues a review task for support representatives.
  5. Once approved, the REST API invokes resume(executionId, approvalInput) to complete the bank transaction.

βš–οΈ Architectural Trade-offs and Failure Modes

  • State Durability vs. Latency: Writing state snapshots to database disks at every node step ensures recovery after hardware failure, but adds database write latency (~10-20ms per step). You must trade off persistence frequency for path performance.
  • Dynamic Autonomy vs. Predictability: Letting an LLM decide edge targets dynamically makes workflows highly flexible but hard to validate. We trade this off by forcing edge paths to be statically declared in StateGraph and conditionalized on predictable state values rather than raw text routing.
  • Sandbox Security vs. Cost: Running tool executions in isolated Docker containers prevents host execution escapes but increases latency and resource overhead.

🧭 Decision Guide: State Storage Framework Comparison

When designing your agent database layer, choose your storage system based on operational latency constraints and auditing requirements.

Storage TypeRead/Write LatencyReliability / DurabilityBest For
In-Memory StoreSub-millisecondLow (Lost on restart)local testing, transient workflows
Redis Cache Store~1-5 msMedium (AOF/RDB dependent)high-QPS stateless execution
Relational Database (PostgreSQL)~10-50 msHigh (ACID transaction isolation)high-compliance, audit-logged flows
Durable Workflow Engine (Temporal)~50-100 msVery High (Event-sourced, replayable)long-running multi-step agents

πŸ§ͺ Custom Node Extension: Value Threshold Routing

This demonstration illustrates how to write a custom threshold router that routes flow to different nodes depending on numeric thresholds in the state without editing our execution engine code.

We implement a custom routing node that reads a numeric balance or score and branches accordingly:

package dev.abstractalgorithms.agent.orchestrator;

import java.util.HashMap;
import java.util.Map;

public class ValueThresholdRouterNode extends BaseNode {
    private final String stateKeyToCheck;
    private final double threshold;
    private final String highValueTarget;
    private final String lowValueTarget;

    public ValueThresholdRouterNode(String name, String stateKeyToCheck, double threshold, 
                                    String highValueTarget, String lowValueTarget) {
        super(name);
        this.stateKeyToCheck = stateKeyToCheck;
        this.threshold = threshold;
        this.highValueTarget = highValueTarget;
        this.lowValueTarget = lowValueTarget;
    }

    @Override
    public NodeResult execute(AgentState state) {
        Object value = state.get(stateKeyToCheck);
        double numericVal = 0.0;

        if (value instanceof Number) {
            numericVal = ((Number) value).doubleValue();
        } else if (value instanceof String) {
            try {
                numericVal = Double.parseDouble((String) value);
            } catch (NumberFormatException ignored) {
                return new NodeResult(ExecutionStatus.FAILED, "State value is not a parsable double: " + value);
            }
        }

        Map<String, Object> updates = new HashMap<>();
        if (numericVal >= threshold) {
            updates.put("routeDecision", highValueTarget);
        } else {
            updates.put("routeDecision", lowValueTarget);
        }

        return new NodeResult(ExecutionStatus.SUCCESS, updates);
    }
}

We can plug this node directly into our StateGraph and define conditional edges that inspect the routeDecision property to jump to different processing queues. This ensures that adding custom routing logic does not require changing a single line inside the AgentExecutor or StateGraph core libraries.


πŸ’» Full Java Domain Model

Here is the complete Java implementation of the stateful orchestration domain model. This code represents the core library containing the graph definition, execution model, and state persistence.

package dev.abstractalgorithms.agent.orchestrator;

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// =========================================================================
// Core Domain Types & Enums
// =========================================================================

public enum ExecutionStatus {
    SUCCESS,
    PAUSED,
    FAILED
}

public class NodeResult {
    private final ExecutionStatus status;
    private final Map<String, Object> stateUpdates;
    private final String errorMessage;

    public NodeResult(ExecutionStatus status, Map<String, Object> stateUpdates) {
        this.status = status;
        this.stateUpdates = stateUpdates != null ? stateUpdates : new HashMap<>();
        this.errorMessage = null;
    }

    public NodeResult(ExecutionStatus status, String errorMessage) {
        this.status = status;
        this.stateUpdates = new HashMap<>();
        this.errorMessage = errorMessage;
    }

    public ExecutionStatus getStatus() { return status; }
    public Map<String, Object> getStateUpdates() { return stateUpdates; }
    public String getErrorMessage() { return errorMessage; }
}

// =========================================================================
// The State Class
// =========================================================================

public class AgentState {
    private final String executionId;
    private final Map<String, Object> values;
    private final List<String> visitHistory;

    public AgentState(String executionId) {
        this.executionId = executionId;
        this.values = new ConcurrentHashMap<>();
        this.visitHistory = Collections.synchronizedList(new ArrayList<>());
    }

    public String getExecutionId() { return executionId; }

    public Object get(String key) {
        return values.get(key);
    }

    public void put(String key, Object val) {
        if (val == null) {
            values.remove(key);
        } else {
            values.put(key, val);
        }
    }

    public void recordVisit(String nodeId) {
        visitHistory.add(nodeId);
    }

    public List<String> getVisitHistory() {
        return new ArrayList<>(visitHistory);
    }

    public Map<String, Object> getSnapshot() {
        return new HashMap<>(values);
    }

    public void mergeUpdates(Map<String, Object> updates) {
        if (updates != null) {
            values.putAll(updates);
        }
    }
}

// =========================================================================
// Graph Structure Primitives
// =========================================================================

public interface Node {
    NodeResult execute(AgentState state);
    String getName();
}

public abstract class BaseNode implements Node {
    protected final String name;

    protected BaseNode(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }
}

public class Edge {
    private final String fromNode;
    private final String toNode;
    private final String conditionStateKey;
    private final Object expectedStateValue;

    // Standard static transition edge
    public Edge(String fromNode, String toNode) {
        this.fromNode = fromNode;
        this.toNode = toNode;
        this.conditionStateKey = null;
        this.expectedStateValue = null;
    }

    // Conditional transition edge
    public Edge(String fromNode, String toNode, String conditionStateKey, Object expectedStateValue) {
        this.fromNode = fromNode;
        this.toNode = toNode;
        this.conditionStateKey = conditionStateKey;
        this.expectedStateValue = expectedStateValue;
    }

    public String getFromNode() { return fromNode; }
    public String getToNode() { return toNode; }

    public boolean evaluate(AgentState state) {
        if (conditionStateKey == null) {
            return true; // Static transition always evaluates to true
        }
        Object currentValue = state.get(conditionStateKey);
        return Objects.equals(currentValue, expectedStateValue);
    }
}

public class StateGraph {
    private final Map<String, Node> nodes = new HashMap<>();
    private final Map<String, List<Edge>> edges = new HashMap<>();

    public void addNode(String id, Node node) {
        nodes.put(id, node);
    }

    public void addEdge(String fromNode, String toNode) {
        edges.computeIfAbsent(fromNode, k -> new ArrayList<>()).add(new Edge(fromNode, toNode));
    }

    public void addConditionalEdge(String fromNode, String toNode, String stateKey, Object expectedValue) {
        edges.computeIfAbsent(fromNode, k -> new ArrayList<>()).add(new Edge(fromNode, toNode, stateKey, expectedValue));
    }

    public Node getNode(String id) {
        return nodes.get(id);
    }

    public List<Edge> getTransitions(String nodeId) {
        return edges.getOrDefault(nodeId, Collections.emptyList());
    }
}

// =========================================================================
// Node Implementations & Client Abstractions
// =========================================================================

public interface LLMClient {
    String generate(String prompt);
}

public class LLMReasoningNode extends BaseNode {
    private final LLMClient llmClient;
    private final String systemPrompt;

    public LLMReasoningNode(String name, LLMClient llmClient, String systemPrompt) {
        super(name);
        this.llmClient = llmClient;
        this.systemPrompt = systemPrompt;
    }

    @Override
    public NodeResult execute(AgentState state) {
        String input = (String) state.get("userInput");
        String fullPrompt = systemPrompt + "\nUser Input: " + input + "\nContext: " + state.get("context");

        try {
            String response = llmClient.generate(fullPrompt);
            Map<String, Object> updates = new HashMap<>();
            updates.put("llmResponse", response);

            // Basic router analysis
            if (response.toUpperCase().contains("CALL_TOOL:")) {
                updates.put("nextAction", "RUN_TOOL");
                String toolCallDetails = response.substring(response.toUpperCase().indexOf("CALL_TOOL:"));
                updates.put("toolCallRequest", toolCallDetails);
            } else {
                updates.put("nextAction", "RESPOND");
            }
            return new NodeResult(ExecutionStatus.SUCCESS, updates);
        } catch (Exception e) {
            return new NodeResult(ExecutionStatus.FAILED, "LLM invocation failed: " + e.getMessage());
        }
    }
}

public interface Tool {
    Map<String, Object> execute(Map<String, Object> params);
}

public class ToolRegistry {
    private final Map<String, Tool> tools = new ConcurrentHashMap<>();

    public void register(String name, Tool tool) {
        tools.put(name, tool);
    }

    public Tool getTool(String name) {
        return tools.get(name);
    }
}

public class ToolExecutionNode extends BaseNode {
    private final ToolRegistry toolRegistry;

    public ToolExecutionNode(String name, ToolRegistry toolRegistry) {
        super(name);
        this.toolRegistry = toolRegistry;
    }

    @Override
    public NodeResult execute(AgentState state) {
        String toolCall = (String) state.get("toolCallRequest");
        if (toolCall == null) {
            return new NodeResult(ExecutionStatus.FAILED, "No tool call request available in state");
        }

        try {
            // Simplistic parse: e.g., "CALL_TOOL: billing-fetch {userId=123}"
            String clean = toolCall.replace("CALL_TOOL:", "").trim();
            String[] parts = clean.split(" ", 2);
            String toolName = parts[0];

            Tool tool = toolRegistry.getTool(toolName);
            if (tool == null) {
                return new NodeResult(ExecutionStatus.FAILED, "Tool not registered: " + toolName);
            }

            Map<String, Object> params = new HashMap<>();
            if (parts.length > 1) {
                params.put("rawParams", parts[1]);
            }

            Map<String, Object> result = tool.execute(params);
            Map<String, Object> stateUpdates = new HashMap<>();
            stateUpdates.put("context", result.toString());
            stateUpdates.put("nextAction", "EVALUATE");

            return new NodeResult(ExecutionStatus.SUCCESS, stateUpdates);
        } catch (Exception e) {
            return new NodeResult(ExecutionStatus.FAILED, "Tool execution failed: " + e.getMessage());
        }
    }
}

public class HumanApprovalNode extends BaseNode {
    public HumanApprovalNode(String name) {
        super(name);
    }

    @Override
    public NodeResult execute(AgentState state) {
        Object approval = state.get("humanApproval");
        if (approval == null) {
            // Interrupt execution, wait for external input
            return new NodeResult(ExecutionStatus.PAUSED, Collections.emptyMap());
        }

        Map<String, Object> updates = new HashMap<>();
        if (Boolean.TRUE.equals(approval)) {
            updates.put("nextAction", "EXECUTE_WRITE");
        } else {
            updates.put("nextAction", "REJECT");
        }
        return new NodeResult(ExecutionStatus.SUCCESS, updates);
    }
}

// =========================================================================
// Executor Engine
// =========================================================================

public interface StateStore {
    void save(AgentState state);
    AgentState load(String executionId);
}

public class MemoryStateStore implements StateStore {
    private final Map<String, AgentState> store = new ConcurrentHashMap<>();

    @Override
    public void save(AgentState state) {
        store.put(state.getExecutionId(), state);
    }

    @Override
    public AgentState load(String executionId) {
        return store.get(executionId);
    }
}

public class AgentExecutor {
    private final StateGraph graph;
    private final StateStore stateStore;

    public AgentExecutor(StateGraph graph, StateStore stateStore) {
        this.graph = graph;
        this.stateStore = stateStore;
    }

    public AgentState run(String executionId, String startNodeId) {
        AgentState state = stateStore.load(executionId);
        if (state == null) {
            state = new AgentState(executionId);
        }
        return executeLoop(state, startNodeId);
    }

    public AgentState resume(String executionId, Map<String, Object> humanInput) {
        AgentState state = stateStore.load(executionId);
        if (state == null) {
            throw new IllegalArgumentException("Execution ID not found: " + executionId);
        }
        state.mergeUpdates(humanInput);

        // Find where we left off (the last node in visit history that paused)
        List<String> history = state.getVisitHistory();
        if (history.isEmpty()) {
            throw new IllegalStateException("Execution history is empty, cannot resume");
        }
        String lastNodeId = history.get(history.size() - 1);
        return executeLoop(state, lastNodeId);
    }

    private AgentState executeLoop(AgentState state, String currentNodeId) {
        String nextNode = currentNodeId;

        while (nextNode != null) {
            Node node = graph.getNode(nextNode);
            if (node == null) {
                throw new IllegalStateException("Graph is missing defined node: " + nextNode);
            }

            state.recordVisit(nextNode);
            NodeResult result = node.execute(state);

            if (result.getStatus() == ExecutionStatus.FAILED) {
                state.put("executionError", result.getErrorMessage());
                stateStore.save(state);
                return state;
            }

            state.mergeUpdates(result.getStateUpdates());
            stateStore.save(state);

            if (result.getStatus() == ExecutionStatus.PAUSED) {
                return state; // Suspend processing
            }

            // Find transition edge
            List<Edge> transitions = graph.getTransitions(nextNode);
            String matchedNextNode = null;
            for (Edge edge : transitions) {
                if (edge.evaluate(state)) {
                    matchedNextNode = edge.getToNode();
                    break;
                }
            }
            nextNode = matchedNextNode;
        }
        return state;
    }
}

🌐 Spring Boot REST API Controller

To expose this engine to external clients and user interfaces, we expose it using a Spring Boot @RestController. The controller handles starting execution and resuming when a human submits validation inputs.

package dev.abstractalgorithms.agent.orchestrator;

import org.springframework.web.bind.annotation.*;
import org.springframework.http.HttpStatus;
import org.springframework.web.server.ResponseStatusException;
import java.util.*;

@RestController
@RequestMapping("/api/v1/agent-executions")
public class AgentExecutionController {
    private final AgentExecutor agentExecutor;
    private final StateStore stateStore;

    public AgentExecutionController(AgentExecutor agentExecutor, StateStore stateStore) {
        this.agentExecutor = agentExecutor;
        this.stateStore = stateStore;
    }

    @PostMapping("/start")
    public AgentStateResponse startExecution(
            @RequestParam String executionId, 
            @RequestParam String startNodeId, 
            @RequestBody Map<String, Object> initialInput) {

        AgentState state = new AgentState(executionId);
        state.mergeUpdates(initialInput);
        stateStore.save(state);

        AgentState result = agentExecutor.run(executionId, startNodeId);
        return mapToResponse(result);
    }

    @PostMapping("/{executionId}/resume")
    public AgentStateResponse resumeExecution(
            @PathVariable String executionId, 
            @RequestBody Map<String, Object> humanInput) {

        AgentState result = agentExecutor.resume(executionId, humanInput);
        return mapToResponse(result);
    }

    @GetMapping("/{executionId}")
    public AgentStateResponse getExecutionState(@PathVariable String executionId) {
        AgentState state = stateStore.load(executionId);
        if (state == null) {
            throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Execution " + executionId + " not found");
        }
        return mapToResponse(state);
    }

    private AgentStateResponse mapToResponse(AgentState state) {
        return new AgentStateResponse(
            state.getExecutionId(),
            state.getSnapshot(),
            state.getVisitHistory()
        );
    }
}

class AgentStateResponse {
    private final String executionId;
    private final Map<String, Object> stateSnapshot;
    private final List<String> visitHistory;

    public AgentStateResponse(String executionId, Map<String, Object> stateSnapshot, List<String> visitHistory) {
        this.executionId = executionId;
        this.stateSnapshot = stateSnapshot;
        this.visitHistory = visitHistory;
    }

    public String getExecutionId() { return executionId; }
    public Map<String, Object> getStateSnapshot() { return stateSnapshot; }
    public List<String> getVisitHistory() { return visitHistory; }
}

πŸ› οΈ Open-Source Implementations: How Frameworks Solve This

In production enterprise applications, developers rarely write custom state machines from scratch. Instead, they rely on mature orchestration libraries:

  • LangChain4j: The standard framework for Java developers. It utilizes "AI Services" where routing and tool execution are configured using declarative interfaces and annotations, hiding prompt assembly under dynamic proxies.
  • Spring AI: An extension of the Spring ecosystem that provides abstract interfaces for chat clients, memory retention, vector databases, and tools.
  • Temporal: For long-running, complex, state-resilient agent workflows. Temporal handles state persistence, timeout budgets, and crash recovery out of the box using event sourcing.

For a full deep-dive on configuring LangChain4j with Spring Boot, see our companion post on LangChain Development Guide.


πŸ“š Lessons Learned: Designing for Correctness

  1. Beware Infinite Loop Cycles: If an LLM node continuously triggers a tool node, and the tool outputs keep failing validation, the agent can enter an infinite loop, draining API budgets. Always declare an iteration counter limit within the state properties and force termination if it is exceeded.
  2. Avoid Large Thread-Local Bloat: Serializing the entire chat history context on every step node execution can bottleneck memory. Cache large document context tables separately and pass only lightweight keys or embedding references inside the AgentState object.
  3. Handle Concurrency Explicitly: If a user submits input to a paused workflow multiple times, double-invocations can corrupt the history list. Use optimistic locking on your database state layer to reject duplicate execution triggers.

πŸ“Œ Key Takeaways

  • State-Graph vs Callbacks: Avoid nested class calls when building agent loops; model them as nodes (execution blocks) and edges (transition paths) over a shared, serializable state context.
  • Pause and Resume: Asynchronous loops require state suspension models. Model checkpoints via explicit nodes returning a PAUSED execution flag.
  • Encapsulate Domain State: Protect the runtime variables inside thread-safe models (AgentState) separate from the static topology definitions (StateGraph).
  • Enforce OCP: Design nodes as abstract executions. Add custom routing or safety guardrails by implementing the Node interface, avoiding modifications to core engine executors.

Article tools

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.