
Human-in-the-Loop Workflows with LangGraph: Interrupts, Approvals, and Async Execution

Pause LangGraph agents mid-run for human approval: interrupt(), Command, update_state(), and async resume patterns.

Abstract Algorithms
18 min read

TLDR: Pause LangGraph agents mid-run with interrupt(), get human approval, resume with Command.


📖 The Autonomous Agent Risk: When Acting Without Permission Goes Wrong

Your autonomous coding agent refactored the authentication module while you were in a meeting. It looked right to the LLM. It broke production.

This is not a hypothetical. As LangGraph agents gain access to real tools — GitHub APIs, database write operations, email senders, cloud billing — each step they take can have irreversible real-world consequences. An agent that deletes the wrong S3 bucket, commits a breaking change to main, or triggers a $2,000 API bill does not stop to ask for permission. It just acts.

Human-in-the-Loop (HITL) is the architectural answer to this problem. Instead of letting the agent run to completion unchecked, you insert deliberate pause points where a human can inspect the proposed action, approve it, reject it, or correct it before execution continues.

| Mode | When It Acts | Who Can Stop It | Suitable For |
| --- | --- | --- | --- |
| Fully Autonomous | Immediately | Nobody | Low-stakes, reversible tasks |
| Interrupt-on-Action | Before irreversible steps | Human at decision points | API calls, file writes, deployments |
| Step-by-Step Approval | After every node | Human at each step | Sensitive pipelines, audited workflows |

LangGraph provides first-class support for all three modes through interrupt(), NodeInterrupt, Command, and update_state(). This post walks through each piece, shows you how they connect, and ends with a complete PR review agent that pauses before applying any code change.


πŸ” HITL Fundamentals: interrupt(), NodeInterrupt, and the Checkpointer Requirement

Three primitives make human-in-the-loop possible in LangGraph. Understanding what each one does β€” and what it requires β€” prevents the most common configuration mistakes.

interrupt() — The Primary Pause Mechanism

interrupt() is a function you call inside a node to pause the graph and return control to the caller with a payload. The graph stops at exactly that line. The return value of interrupt() is whatever the human sends back when they resume.

from langgraph.types import interrupt

def approval_node(state: AgentState):
    # Graph pauses here; payload is sent to the caller
    human_decision = interrupt({
        "question": "Approve this action?",
        "proposed_action": state["proposed_action"],
    })
    # Execution resumes here after the human responds
    return {"approved": human_decision == "approve"}

interrupt() is the recommended pattern in current LangGraph versions. It reads naturally — pause, collect input, continue — and integrates cleanly with the async execution model.

NodeInterrupt — The Exception-Based Older Pattern

NodeInterrupt is a special exception you raise from inside a node. The graph catches it, stores the interrupt payload, and halts. The human then calls update_state() to inject their response into the graph state and re-invokes the graph (passing None as the input rather than a Command) to resume.

from langgraph.errors import NodeInterrupt

def review_node(state: AgentState):
    if not state.get("human_reviewed"):
        # Older pattern: raise an exception to interrupt
        raise NodeInterrupt(f"Please review proposed changes: {state['proposed_action']}")
    return {"status": "reviewed"}
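To make the exception-based flow concrete without the library, here is a minimal pure-Python stand-in (not the real LangGraph executor): a tiny runner catches the interrupt exception and surfaces it, the caller patches the state (simulating update_state()), and a second invocation passes the guard.

```python
class FakeNodeInterrupt(Exception):
    """Stand-in for langgraph.errors.NodeInterrupt."""

def review_node(state: dict) -> dict:
    if not state.get("human_reviewed"):
        raise FakeNodeInterrupt(f"Please review: {state['proposed_action']}")
    return {"status": "reviewed"}

def run(state: dict) -> dict:
    """Tiny executor: catch the interrupt and surface it instead of crashing."""
    try:
        return review_node(state)
    except FakeNodeInterrupt as exc:
        return {"__interrupt__": str(exc)}

state = {"proposed_action": "delete temp branch"}
first = run(state)              # paused: payload surfaced to the caller

state["human_reviewed"] = True  # simulates the update_state() step
second = run(state)             # guard passes; node completes normally
```

The two calls to run() mirror the two graph invocations in the real pattern; the state mutation in between is what update_state() does against the checkpoint.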

When to use each:

| | interrupt() | NodeInterrupt |
| --- | --- | --- |
| Resume mechanism | Command(resume=value) | update_state() + re-invoke |
| Return value in node | The human's response directly | Read from state after update |
| Recommended | ✅ Yes (current LangGraph) | ⚠️ Legacy; still supported |

The Checkpointer: Non-Negotiable

Neither pattern works without a checkpointer. When interrupt() halts the graph, the entire state snapshot — including which node was running, what values are in every state key, and where in the node body execution stopped — must be persisted somewhere so it can be restored when the human responds.

Without a checkpointer, the graph has no memory between the pause and the resume. LangGraph will raise an error at runtime if interrupt() fires in a graph compiled without one.

from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()  # In-memory; use SqliteSaver or RedisSaver in production
graph = builder.compile(checkpointer=checkpointer)

Every invocation also requires a thread ID in the config so the checkpointer knows which conversation's state to restore:

config = {"configurable": {"thread_id": "pr-review-session-42"}}
graph.invoke(initial_state, config)

βš™οΈ Building an Approval Workflow: Pause, Inspect, Approve, Resume

The full HITL lifecycle has four phases. Here is the complete flow in code.

graph TD
    A([Start: invoke graph]) --> B[analyze_node runs]
    B --> C[approval_node hits interrupt]
    C --> D{Checkpointer saves\nfrozen state}
    D --> E([Return __interrupt__ to caller])
    E --> F[Human reviews payload]
    F --> G{Decision}
    G -- approve --> H[invoke Command resume=approve]
    G -- reject --> I[invoke Command resume=reject]
    H --> J[Checkpointer restores state]
    I --> J
    J --> K[approval_node resumes;\ninterrupt returns decision]
    K --> L[apply_node executes]
    L --> M([Graph completes])

The checkpointer (D → J) is the invisible bridge that makes both invoke() calls part of the same workflow.

Phase 1 — Invoke the graph (first run):

config = {"configurable": {"thread_id": "deploy-approval-001"}}

result = graph.invoke({"task": "deploy service to production"}, config)

The graph runs until it hits interrupt(). At that point, execution halts and result contains the interrupt payload under result["__interrupt__"]:

# result["__interrupt__"] == [
#   Interrupt(value={"question": "Approve deploy?", "service": "auth-api"}, ...)
# ]
interrupt_payload = result["__interrupt__"][0].value

Phase 2 — Surface the question to the user:

Your application layer (CLI, web UI, Slack bot) presents interrupt_payload to the human. This is pure application code — LangGraph has no opinion on how you collect the human's response.
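As a sketch of that application-layer step (the function name and prompt format are mine, not a LangGraph API), a CLI might render the payload like this:

```python
def render_interrupt(payload: dict) -> str:
    """Format an interrupt payload as a human-readable approval prompt."""
    lines = [payload.get("question", "Action requires approval:")]
    for key, value in payload.items():
        if key != "question":
            lines.append(f"  {key}: {value}")
    lines.append("Reply 'approve' or 'reject':")
    return "\n".join(lines)

prompt = render_interrupt({"question": "Approve deploy?", "service": "auth-api"})
# A CLI would then collect the decision with: decision = input(prompt)
```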

Phase 3 — Resume with Command:

from langgraph.types import Command

# Human typed "approve" in the UI
resumed = graph.invoke(Command(resume="approve"), config)

Command(resume=user_input) tells the graph: the human responded with this value; resume from where you paused and give this back as the return value of interrupt().

Phase 4 — Execution continues from the frozen point:

The node that called interrupt() receives "approve" as the return value, evaluates it, and the graph continues to the next node as normal.

Key insight: The same config (same thread_id) must be used for both the initial invoke and the resume. This is how the checkpointer knows which frozen state to restore.


🧠 Deep Dive: How LangGraph Freezes and Resumes Graph Execution

The Internals

When interrupt() is called inside a node, three things happen in sequence:

  1. Signal to the executor. interrupt() raises a special internal exception (GraphInterrupt) that the LangGraph executor catches at the top of the execution loop. This unwinds the call stack cleanly without corrupting state.

  2. Checkpoint capture. Before returning to the caller, the executor serializes the full state snapshot — including the exact values of all state keys at the moment of interruption — and writes it to the checkpointer under the current thread_id. The checkpoint also records which node interrupted and the interrupt payload.

  3. Control returned to caller. The graph's invoke() returns to your application code with the interrupt payload visible. The graph is now "frozen" — persisted in the checkpointer, waiting.

When Command(resume=value) arrives:

  1. The executor looks up the thread's latest checkpoint in the checkpointer.
  2. It restores the node that interrupted and re-enters the node's function body from the top — but this time, interrupt() returns value immediately instead of pausing again.
  3. The node completes normally and the graph advances.

This re-entry model means the code before interrupt() in your node function runs twice: once on the way to the pause, and once on the way back. Keep any side effects (API calls, database writes) after the interrupt() call, not before.

def deploy_node(state: AgentState):
    # ✅ Safe: pure computation before interrupt
    proposed = build_deploy_plan(state["task"])

    # Graph pauses here — code above runs again on resume (idempotent is fine)
    decision = interrupt({"plan": proposed})

    # ✅ Safe: side effects AFTER interrupt, only execute once (after resume)
    if decision == "approve":
        call_deploy_api(proposed)  # Only called post-resume
    return {"deployed": decision == "approve"}

Performance Analysis

| Concern | Detail | Mitigation |
| --- | --- | --- |
| Human latency | Graphs can be paused for seconds, hours, or days | Use persistent checkpointers (SQLite, Redis, Postgres), not MemorySaver |
| Checkpoint storage cost | Each interrupt serializes full state; large states (long message histories) grow fast | Trim state before interrupt; store only the delta needed for resume |
| Timeout risk | If the human never responds, the thread is stranded | Implement a background job that expires stale threads after N hours |
| Concurrent interrupts | Multiple threads interrupted simultaneously require unique thread IDs and isolated checkpoint rows | Use UUIDs for thread IDs; never share them across users |

MemorySaver is only suitable for development and testing. In production, use SqliteSaver for single-process deployments or LangGraph Cloud's built-in Postgres checkpointer for distributed multi-instance setups.
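To see why durability matters, here is a toy file-backed checkpoint store (a stand-in for illustration, not the SqliteSaver API): because each snapshot lives on disk keyed by thread ID, a brand-new process can still restore it after a restart.

```python
import json
import os
import tempfile

class FileCheckpointStore:
    """Toy durable checkpoint store: one JSON file per thread_id."""

    def __init__(self, directory: str):
        self.directory = directory

    def save(self, thread_id: str, snapshot: dict) -> None:
        with open(os.path.join(self.directory, f"{thread_id}.json"), "w") as f:
            json.dump(snapshot, f)

    def load(self, thread_id: str) -> dict:
        with open(os.path.join(self.directory, f"{thread_id}.json")) as f:
            return json.load(f)

workdir = tempfile.mkdtemp()
FileCheckpointStore(workdir).save("pr-42", {"node": "approval", "approved": False})

# A brand-new store object (standing in for a restarted process) still finds it
restored = FileCheckpointStore(workdir).load("pr-42")
```

With MemorySaver the equivalent of `restored` would be gone after the restart; that is the whole argument for a persistent backend.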


📊 The Interrupt-Resume Cycle: Full Lifecycle in One Diagram

sequenceDiagram
    participant App as Application Layer
    participant Graph as LangGraph Executor
    participant Node as Agent Node
    participant CP as Checkpointer
    participant Human as Human Reviewer

    App->>Graph: invoke(initial_state, config)
    Graph->>Node: execute analyze_node()
    Node-->>Graph: returns state update
    Graph->>Node: execute approval_node()
    Node->>Graph: interrupt(proposed_action)
    Graph->>CP: save_checkpoint(thread_id, state_snapshot)
    Graph-->>App: return {"__interrupt__": [payload]}

    App->>Human: display(payload["proposed_action"])
    Human-->>App: "approve" / "reject" / edited_value

    App->>Graph: invoke(Command(resume=decision), config)
    Graph->>CP: restore_checkpoint(thread_id)
    Graph->>Node: re-enter approval_node(); interrupt() returns decision
    Node-->>Graph: returns {"approved": True}
    Graph->>Node: execute apply_node()
    Node-->>Graph: returns {"applied": True}
    Graph-->>App: final state

The diagram makes two non-obvious points explicit: the checkpointer is the bridge between the two invoke() calls, and approval_node is re-entered from its beginning on resume — it does not continue from a saved instruction pointer.


🌍 Real-World Applications: Where Human-in-the-Loop Is Non-Negotiable

Autonomous DevOps Pipelines

A LangGraph agent monitors production metrics, detects anomalies, and generates a runbook of remediation steps. Steps like "restart service" might auto-execute, but "scale down the database replica" triggers an interrupt(). The on-call engineer receives the proposed change in Slack, approves or rejects it with one click, and the agent resumes.

Input: Prometheus alert payload, current service topology
Interrupt payload: {"action": "scale_down", "target": "db-replica-3", "reason": "p99 latency normal"}
Human decision: Approve or reject with a comment added via update_state()

Financial Transaction Approval

An expense automation agent categorizes invoices and schedules payments. Payments under $500 auto-approve; anything above interrupts the workflow and routes to the finance manager's approval queue. The graph is paused indefinitely until the manager acts — potentially overnight — thanks to the persistent checkpointer.
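The routing logic for such a threshold gate can be as simple as a conditional-edge predicate (the function name and node names here are illustrative, not from the scenario's codebase):

```python
APPROVAL_THRESHOLD_USD = 500  # threshold from the scenario above

def route_payment(state: dict) -> str:
    """Conditional-edge predicate: send large payments to the human gate."""
    if state["amount_usd"] > APPROVAL_THRESHOLD_USD:
        return "human_approval"  # a node that calls interrupt()
    return "auto_pay"            # auto-approved path

small = route_payment({"amount_usd": 120})   # routed to "auto_pay"
large = route_payment({"amount_usd": 2400})  # routed to "human_approval"
```

In a real graph this function would be registered with add_conditional_edges, so only the expensive path ever reaches the interrupt.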

Legal Contract Drafting

An AI drafts contract clauses based on deal terms. Before finalizing, the graph pauses at every clause that modifies liability language. A lawyer reviews each proposed clause in a web UI, edits the text via update_state(), and marks it approved. The agent then formats the final document with the reviewed text.

These three patterns share a common architecture: the agent handles the mechanical labor (analysis, proposal generation, formatting) while the human handles the judgment calls (approve, edit, reject). LangGraph's HITL primitives make this division of responsibility a first-class design choice rather than an afterthought.


βš–οΈ Trade-offs and Failure Modes: Deadlocked Graphs, Stale State, and Human Latency

Performance vs. Safety

Adding an interrupt to a workflow introduces unbounded latency. A fully autonomous graph completes in seconds; a human-gated one can sit frozen for hours. If your system has SLAs, you need to decide which nodes are worth gating. The rule of thumb: interrupt on irreversible, high-blast-radius actions only — not on every step.

Failure Mode 1 — The Deadlocked Graph

A thread is interrupted and nobody resumes it. The checkpointer holds the frozen state indefinitely. In production, implement a TTL-based expiry job that scans for threads not resumed within a threshold (e.g., 24 hours) and marks them as abandoned. Without this, you accumulate orphaned threads that silently consume checkpoint storage.
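A minimal version of that expiry scan might look like this (the record shape and helper name are assumptions for illustration, not a LangGraph API):

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=24)

def find_abandoned(threads: dict, now: datetime) -> list:
    """Return IDs of interrupted threads not touched within the TTL."""
    return [
        thread_id
        for thread_id, record in threads.items()
        if record["interrupted"] and now - record["updated_at"] > TTL
    ]

now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
threads = {
    "fresh": {"interrupted": True, "updated_at": now - timedelta(hours=2)},
    "stale": {"interrupted": True, "updated_at": now - timedelta(hours=30)},
    "done":  {"interrupted": False, "updated_at": now - timedelta(hours=48)},
}
abandoned = find_abandoned(threads, now)  # only "stale" exceeds the TTL
```

A background job would run this scan on a schedule and mark or auto-reject the returned threads.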

Failure Mode 2 — Stale State on Late Resume

The LLM proposed changes based on a codebase snapshot from 09:00. The human approves at 17:00. In the meantime, three other PRs merged. The proposed changes now conflict. HITL does not automatically re-validate the proposal against the new world state — you must build that check into your apply_changes node explicitly.

def apply_changes(state: AgentState):
    if state["approved"]:
        # Re-validate: is the proposal still valid in current state?
        if is_proposal_stale(state["proposed_changes"]):
            return {"applied": False, "error": "state changed while waiting for approval"}
        execute_changes(state["proposed_changes"])
    return {"applied": state["approved"]}

Failure Mode 3 — Multiple Interrupts in One Thread

If a graph has two interrupt() calls in sequence, each one halts and resumes independently, in order. This is intentional — LangGraph queues them. But if your UI assumes only one interrupt per thread, the second interrupt will surface unexpectedly. Design your UI to handle any number of sequential interrupts on the same thread ID.
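A caller that tolerates any number of sequential interrupts is just a loop: keep resuming until the result carries no __interrupt__ key. Here is the shape of that loop against a stub graph (the stub stands in for graph.invoke; a real caller would pass Command(resume=decision)):

```python
def make_stub_graph(pauses: int):
    """Stub for graph.invoke: interrupts `pauses` times, then finishes."""
    remaining = {"n": pauses}

    def invoke(_resume_value):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return {"__interrupt__": [{"question": "Approve this step?"}]}
        return {"status": "done"}

    return invoke

def run_to_completion(invoke, get_decision) -> dict:
    result = invoke(None)                 # initial invocation
    while "__interrupt__" in result:      # handle every pause, not just one
        decision = get_decision(result["__interrupt__"][0])
        result = invoke(decision)         # Command(resume=decision) in real code
    return result

final = run_to_completion(make_stub_graph(pauses=2), lambda payload: "approve")
```

The loop, not a single if-check, is what makes the UI robust to a graph that pauses twice on the same thread.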

Mitigation Summary

| Failure Mode | Mitigation |
| --- | --- |
| Deadlocked graph | TTL expiry job; dashboard to surface stalled threads |
| Stale state | Re-validate proposal after resume before executing |
| Unexpected second interrupt | UI must handle the interrupt queue, not assume a single pause |
| Lost checkpoint (MemorySaver restart) | Use a persistent checkpointer (SQLite/Redis/Postgres) in all non-dev environments |

🧭 Decision Guide: Full Autonomy vs Interrupt-on-Action vs Step-by-Step Approval

| Situation | Recommendation |
| --- | --- |
| Use full autonomy when | Actions are reversible (read-only queries, draft creation, summarization) and failure cost is low |
| Use interrupt-on-action when | Specific nodes perform irreversible external calls (API writes, deploys, payments); the rest of the graph can run freely |
| Use step-by-step approval when | The domain is regulated (legal, medical, financial), every output needs an audit trail, or the agent is new and untested in production |
| Avoid HITL entirely when | Sub-second latency is required (real-time inference, streaming) or human availability cannot be guaranteed (overnight batch jobs) |
| Edge cases | Parallel branches with multiple interrupt() calls require careful thread management; one interrupt per branch per execution step |
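One way to encode the table above as executable policy is a small helper (the attribute names and mode strings are mine, chosen for illustration):

```python
def choose_mode(*, reversible: bool, regulated: bool, latency_critical: bool) -> str:
    """Map action properties to a HITL mode, following the decision guide."""
    if latency_critical:
        return "no_hitl"              # humans cannot meet sub-second SLAs
    if regulated:
        return "step_by_step"         # every output needs an audit trail
    if reversible:
        return "full_autonomy"        # failure cost is low
    return "interrupt_on_action"      # gate only the irreversible steps

mode = choose_mode(reversible=False, regulated=False, latency_critical=False)
# an irreversible, unregulated, latency-tolerant action gets interrupt-on-action
```

Encoding the policy as a function keeps the gating decision reviewable and testable rather than scattered across node definitions.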

🧪 Practical Example: PR Review Agent That Asks Before Applying Changes

This agent demonstrates the complete HITL lifecycle in a scenario where the stakes are high enough to justify every pause: code diffs that, if applied incorrectly, break production. The PR review scenario maps cleanly to the three primitives covered in this post — interrupt() for pausing, Command(resume=...) for resuming, and update_state() for human correction — each triggered at a distinct point in the workflow. As you read through the nodes, watch for the interrupt() call inside human_approval and trace how the payload flows back as the return value when the graph resumes; that handoff is the mechanism that makes the entire pattern work.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI

# --- State schema ---
class PRReviewState(TypedDict):
    pr_number: int
    diff: str
    analysis: str
    proposed_changes: list[str]
    approved: bool
    applied: bool

llm = ChatOpenAI(model="gpt-4o")

# --- Node 1: Analyze the diff ---
def analyze_pr(state: PRReviewState) -> dict:
    response = llm.invoke(
        f"Analyze this code diff and identify specific issues:\n{state['diff']}"
    )
    return {"analysis": response.content}

# --- Node 2: Generate concrete change proposals ---
def propose_changes(state: PRReviewState) -> dict:
    response = llm.invoke(
        f"Based on this analysis, write 3 specific, actionable code changes:\n{state['analysis']}"
    )
    changes = [line for line in response.content.split("\n") if line.strip()]
    return {"proposed_changes": changes[:3]}

# --- Node 3: Interrupt for human approval ---
def human_approval(state: PRReviewState) -> dict:
    # Graph pauses here — payload goes to the caller
    decision = interrupt({
        "message": "Review proposed changes. Reply 'approve' to proceed or 'reject' to cancel.",
        "pr_number": state["pr_number"],
        "proposed_changes": state["proposed_changes"],
    })
    return {"approved": decision.strip().lower() == "approve"}

# --- Node 4: Apply approved changes ---
def apply_changes(state: PRReviewState) -> dict:
    if not state["approved"]:
        print(f"PR #{state['pr_number']}: changes rejected by reviewer.")
        return {"applied": False}

    # Re-validate before committing (guard against stale state)
    print(f"Applying {len(state['proposed_changes'])} changes to PR #{state['pr_number']}:")
    for change in state["proposed_changes"]:
        print(f"  → {change}")
    # In production: call GitHub API, run git apply, post review comment, etc.
    return {"applied": True}

# --- Build the graph ---
builder = StateGraph(PRReviewState)
builder.add_node("analyze_pr", analyze_pr)
builder.add_node("propose_changes", propose_changes)
builder.add_node("human_approval", human_approval)
builder.add_node("apply_changes", apply_changes)

builder.add_edge(START, "analyze_pr")
builder.add_edge("analyze_pr", "propose_changes")
builder.add_edge("propose_changes", "human_approval")
builder.add_edge("human_approval", "apply_changes")
builder.add_edge("apply_changes", END)

checkpointer = MemorySaver()
graph = builder.compile(checkpointer=checkpointer)

Running the agent — Phase 1 (invoke and receive interrupt):

config = {"configurable": {"thread_id": "pr-42-review"}}
sample_diff = "- def login(username, password):\n+ def login(username: str, password: str) -> bool:"

result = graph.invoke(
    {"pr_number": 42, "diff": sample_diff, "approved": False, "applied": False},
    config
)

# The graph paused at human_approval
payload = result["__interrupt__"][0].value
print(payload["message"])           # "Review proposed changes..."
print(payload["proposed_changes"])  # ["Add type annotations", ...]

Phase 2 (human reviews and resumes):

# Simulating a human typing "approve" in a UI
resumed = graph.invoke(Command(resume="approve"), config)

print(resumed["approved"])  # True
print(resumed["applied"])   # True

The thread ID "pr-42-review" binds the two invoke() calls together. The checkpointer restores the frozen state from Phase 1, interrupt() returns "approve" inside human_approval, and the graph completes through apply_changes.


πŸ› οΈ LangGraph's update_state(): Editing Agent Memory Before Resuming

Command(resume=value) answers the interrupt but leaves the rest of the state unchanged. Sometimes the human doesn't just want to approve — they want to fix what the agent got wrong. That's what graph.update_state() is for.

# After the graph interrupts, inspect the current state snapshot
snapshot = graph.get_state(config)
print(snapshot.values["proposed_changes"])
# ["1. Add type hints", "2. Remove unused import", "3. Rename variable x to user_id"]

# The human wants to override change #3 before approving
corrected_changes = [
    "1. Add type hints",
    "2. Remove unused import",
    "3. Extract login logic into a dedicated AuthService class",  # Human edited this
]

graph.update_state(
    config,
    {"proposed_changes": corrected_changes}
)

# Now resume — the graph will apply the corrected list, not the LLM's original
resumed = graph.invoke(Command(resume="approve"), config)

update_state() writes directly into the persisted checkpoint under the given thread_id. The next time the graph reads state["proposed_changes"], it gets the human's corrected version, not the LLM's original.

What you can edit:

| State key type | Can update_state() modify it? | Notes |
| --- | --- | --- |
| Simple values (str, int, bool) | ✅ Yes | Replaces the value directly |
| Lists with Annotated[list, operator.add] | ✅ Yes | Appends; pass new items only |
| Plain lists | ✅ Yes | Replaces the whole list |
| LangGraph message history | ✅ Yes | Pass {"messages": [HumanMessage(...)]} |

This makes update_state() a powerful tool not just for HITL corrections, but for injecting external context into a running agent — for example, appending a new document retrieved from a database while the agent is paused.
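The reducer behavior in the table can be simulated directly: a key declared with an operator.add reducer merges an update by concatenation, while a plain key is replaced. This is a stand-in for how LangGraph applies channel reducers, not its actual internals:

```python
import operator

REDUCERS = {"documents": operator.add}  # key declared Annotated[list, operator.add]

def apply_update(state: dict, update: dict) -> dict:
    """Merge an update the way channel reducers do: reduce if declared, else replace."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        merged[key] = reducer(state.get(key, []), value) if reducer else value
    return merged

state = {"documents": ["doc-1"], "status": "paused"}
state = apply_update(state, {"documents": ["doc-2"], "status": "resumed"})
# documents grows to ["doc-1", "doc-2"]; status is simply replaced
```

This is why the table says "pass new items only" for reduced keys: the existing list is kept and your update is appended to it.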


📚 Lessons Learned

1. Always persist your checkpointer before going to production. MemorySaver lives in RAM. A process restart during an interrupt wipes all frozen thread states. Use SqliteSaver or a Redis/Postgres backend in any environment where the human might take longer than the process uptime.

2. Keep code before interrupt() idempotent. Because interrupt() causes the node function to re-execute from the top on resume, any side effects (writes, API calls) placed before interrupt() will run twice. Pure computation before, side effects after.

3. Design for the "never resumes" case. Not every human responds promptly — or at all. Build a thread expiry mechanism. Query the checkpointer for threads whose updated_at is older than your TTL and either auto-reject them or notify the human again.

4. One thread ID per independent conversation. Thread IDs must be unique per user session. Never reuse a thread ID across different users or different task instances. Collisions mean one user's approval resumes another's graph.
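A simple way to guarantee uniqueness is to mint the thread ID per session with a UUID (the helper below is a sketch; embedding the user ID is an optional convention for traceability, not a LangGraph requirement):

```python
import uuid

def new_thread_config(user_id: str) -> dict:
    """Mint a unique thread_id per session; embed the user for traceability."""
    return {"configurable": {"thread_id": f"{user_id}-{uuid.uuid4()}"}}

a = new_thread_config("alice")
b = new_thread_config("alice")
# Even the same user gets a distinct thread per session, so approvals never collide
```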

5. Don't interrupt on reversible steps. HITL is a cost: human time, latency, and coordination overhead. Reserve it for the actions that truly warrant it — irreversible, high-blast-radius, or regulated steps. A graph that interrupts on every LLM call trains humans to rubber-stamp approvals, defeating the purpose.


📌 TLDR: Summary and Key Takeaways

TLDR: Pause LangGraph agents mid-run with interrupt(), get human approval, resume with Command.

  • interrupt(payload) pauses graph execution inside a node and returns the payload to the caller; the node re-enters from the top on resume.
  • Command(resume=value) is the mechanism to resume a paused graph; the value becomes the return of interrupt().
  • A checkpointer is mandatory — without one, the frozen state cannot be persisted and HITL will not work.
  • update_state() lets humans correct the agent's state before resuming, not just approve or reject it.
  • Stale state is a real failure mode: validate the proposal against current world state after every resume, not just before the interrupt.
  • Interrupt-on-action (pause only at destructive operations) is the practical sweet spot between full autonomy and step-by-step human approval.
  • The memorable rule: agents decide what to do; humans decide whether to let it happen.

πŸ“ Practice Quiz

  1. Which function do you call inside a LangGraph node to pause execution and return a payload to the caller?

    • A) NodeInterrupt(payload)
    • B) interrupt(payload)
    • C) graph.pause(payload)

    Correct Answer: B
  2. You have a LangGraph agent that interrupts for human approval. After the interrupt fires, you restart your Python process. When the human responds, what happens?

    • A) The graph resumes normally β€” the state is in memory
    • B) The graph raises a GraphInterrupt error and discards the state
    • C) The graph cannot resume because MemorySaver state was lost on restart

    Correct Answer: C
  3. A human reviewer wants to change one of the agent's proposed values before approving. Which tool do they use?

    • A) graph.update_state(config, {"key": new_value})
    • B) Command(resume={"key": new_value})
    • C) Re-invoke the graph from the beginning with corrected input

    Correct Answer: A
  4. Open-ended challenge: You are designing a CI/CD agent with a 20-step pipeline that includes unit tests, lint, build, integration tests, and a final production deploy. Which of these should trigger an interrupt()? Consider the trade-offs between safety, latency, and human fatigue, and explain how you would decide which steps deserve a human gate versus which should run autonomously.



Written by Abstract Algorithms (@abstractalgorithms)