AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails
Production AI needs explicit routing, memory, execution, and evaluation layers rather than one loop.
Abstract Algorithms · Intermediate
TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation. Those layers determine safety, latency, cost, and traceability far more than model choice alone. In practice this is mostly a routing and control problem: send each request through only the layers it needs, then prove output quality before exposure.
A customer support copilot worked great in demos but hallucinated in 30% of live tickets. The fix was not a better model; it was adding an explicit routing layer (classify intent first, so billing questions never hit the expensive reasoning path), a memory layer (store resolved tickets so the model stops confabulating policy), and an evaluation layer (score every response before the user sees it, escalate failures to a human queue). Hallucination rate dropped from 30% to under 2% in six weeks.
Here is the pattern in three lines: request arrives → router classifies intent and picks the cheapest safe path → evaluator scores the answer before it leaves the system. Everything else in this post is how to build and operate those three steps reliably.
Why AI Pattern Choice Matters More Than Prompt Tuning
Teams usually start with one model and one prompt. That works for demos, then fails in production for predictable reasons: request mix broadens, tool calls fail, costs spike, and bad answers become operational incidents.
Architecture patterns solve this by separating responsibilities:
- routing chooses the cheapest safe path,
- planning decomposes tasks that need multiple steps,
- memory controls what context can be trusted,
- evaluation guards output quality and policy safety.
| Production symptom | Pattern response |
| --- | --- |
| Every request is expensive | Add routing and cheaper direct paths |
| Tool-heavy tasks are brittle | Add planner-worker orchestration |
| Answers cite stale policy | Add layered memory freshness controls |
| Hallucinations reach users | Add inline evaluation and escalation |
When to Use Each AI Pattern (and When Not To)
| Pattern | Use when | Avoid when | First implementation move |
| --- | --- | --- | --- |
| Router | Request types and risk levels vary | Product has one narrow use case | Start with 3-5 route classes only |
| Planner-worker | Tasks need stepwise tool usage | Most tasks are one-shot Q&A | Restrict planner to bounded workflows |
| Layered memory | Multi-turn context and policy docs matter | Session-only Q&A with no persistence | Separate session memory from durable retrieval |
| Runtime evaluator | Wrong answers are costly or regulated | Low-stakes experimentation | Add pass/fail guard before final response |
Quick practical rule
- Start with router + evaluator for most production copilots.
- Add planner only for workflows with measurable multi-step value.
- Add richer memory only after freshness and ownership are defined.
How the AI Runtime Works in Practice
- Classify request intent and risk.
- Route to direct-answer path or workflow path.
- If workflow path, generate a bounded plan.
- Retrieve scoped memory with freshness checks.
- Execute tools/workers with trace logging.
- Evaluate answer quality and policy compliance.
- Return answer, fallback, or escalate to human.
| Stage | Practical control | Common failure |
| --- | --- | --- |
| Route | Intent + risk classifier | Overfitted route taxonomy |
| Plan | Max steps, allowed tools | Planner loop runs too long |
| Memory | Source trust tier + TTL | Stale documents outrank newer policy |
| Execute | Per-tool timeout and retry budget | Tool failures cascade into hallucinated answers |
| Evaluate | Rubric checks + policy checks | Evaluator too weak or too permissive |
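The seven stages above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: `classify`, `make_plan`, `retrieve`, `run_tools`, `generate`, and `evaluate` are hypothetical stand-ins with hard-coded behavior, and the keyword-based classifier exists only to make the control flow visible.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Records what each stage decided so the request can be audited later."""
    events: list = field(default_factory=list)

    def log(self, stage: str, detail: str) -> None:
        self.events.append((stage, detail))


# Hypothetical stand-ins for the real classifier, planner, retriever, tools, and model.
def classify(request: str) -> tuple[str, str]:
    text = request.lower()
    if "incident" in text:
        return "complex_workflow", "high"
    if "refund" in text:
        return "account_action", "high"
    return "faq", "low"

def make_plan(request: str, max_steps: int) -> list[str]:
    return ["query_logs", "query_metrics", "read_runbook", "recommend"][:max_steps]

def retrieve(request: str) -> list[str]:
    return ["runbook-v7: restart the worker pool"]

def run_tools(plan: list[str]) -> list[str]:
    return [f"{step}: ok" for step in plan]

def generate(request: str, context: list[str]) -> str:
    return f"Answer based on {len(context)} context item(s)."

def evaluate(answer: str, context: list[str], risk: str) -> bool:
    # High-risk answers must be backed by at least one piece of context.
    return risk == "low" or len(context) > 0


def handle(request: str, max_steps: int = 4) -> tuple[str, Trace]:
    trace = Trace()
    intent, risk = classify(request)                   # 1. classify intent and risk
    trace.log("route", f"intent={intent} risk={risk}")
    context: list[str] = []
    if intent == "complex_workflow":                   # 2. direct path or workflow path
        plan = make_plan(request, max_steps)           # 3. bounded plan
        trace.log("plan", ",".join(plan))
        context += retrieve(request)                   # 4. scoped memory
        context += run_tools(plan)                     # 5. tool execution
        trace.log("execute", f"{len(context)} context items")
    answer = generate(request, context)
    passed = evaluate(answer, context, risk)           # 6. inline evaluation
    trace.log("evaluate", "pass" if passed else "fail")
    if not passed:                                     # 7. return or escalate
        return "Escalated to a human agent.", trace
    return answer, trace
```

In this sketch a high-risk request that reaches the direct path without any supporting context is escalated rather than answered, and the trace shows which stages ran.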
How to Implement: 10-Step Rollout Checklist
- Define request classes (faq, account_action, policy_sensitive, complex_workflow).
- Create router policy mapping each class to a path.
- Set latency and cost budget per path.
- Implement planner only for one complex class first.
- Split memory into session context, task memory, and durable retrieval.
- Add document freshness metadata (source, version, updated_at).
- Add evaluator with explicit pass/fail rubric and escalation reason codes.
- Instrument traces for route choice, tool calls, retrieval IDs, and evaluator decision.
- Run offline replay tests against historical incidents.
- Launch with kill switch and fallback model path.
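The first three checklist items (request classes, a router policy, and per-path budgets) can live in one small table. The sketch below is one possible shape, assuming hypothetical path names, model tiers, and budget numbers; none of these values come from a real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePolicy:
    path: str             # which execution path serves this class
    model_tier: str       # hypothetical tier label, e.g. "small" or "large"
    max_latency_ms: int   # latency budget for the path
    max_cost_usd: float   # cost budget per request


# Illustrative values only; real budgets come from your own traffic and pricing.
ROUTE_POLICY = {
    "faq": RoutePolicy("direct", "small", 2500, 0.005),
    "account_action": RoutePolicy("workflow", "large", 6000, 0.02),
    "policy_sensitive": RoutePolicy("workflow", "large", 6000, 0.02),
    "complex_workflow": RoutePolicy("planner", "large", 15000, 0.05),
}

# Unknown classes fall back to the most conservative policy.
DEFAULT_POLICY = ROUTE_POLICY["policy_sensitive"]


def policy_for(request_class: str) -> RoutePolicy:
    return ROUTE_POLICY.get(request_class, DEFAULT_POLICY)


def within_budget(request_class: str, latency_ms: float, cost_usd: float) -> bool:
    """Check one observed request against the budget of its class."""
    policy = policy_for(request_class)
    return latency_ms <= policy.max_latency_ms and cost_usd <= policy.max_cost_usd
```

Keeping the policy as data rather than prompt text makes it reviewable in code review and easy to replay against historical traffic.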
Done criteria:
| Gate | Pass condition |
| --- | --- |
| Safety | High-risk outputs are blocked or escalated |
| Cost | p50 cost per successful task remains in budget |
| Reliability | Tool failure does not produce fabricated final answers |
| Explainability | Every final answer has a route + evidence trace |
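The Reliability gate is the one most often violated in practice, so a small sketch may help. It assumes a hypothetical tool callable and a fixed retry budget; the point is that a failed tool yields an explicit escalation instead of letting the model improvise an answer.

```python
from typing import Callable


def call_with_retry_budget(tool: Callable[[], str], max_attempts: int = 2) -> tuple[bool, str]:
    """Try a tool a bounded number of times; report failure instead of guessing."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            return True, tool()
        except Exception as exc:  # broad on purpose: any tool error counts against the budget
            last_error = str(exc)
    return False, last_error


def answer_or_escalate(tool: Callable[[], str]) -> str:
    ok, result = call_with_retry_budget(tool)
    if not ok:
        # Never let the model fill the gap left by a failed tool.
        return f"escalated: tool unavailable ({result})"
    return f"answer grounded in: {result}"
```

A production version would also enforce a per-call timeout, which is omitted here to keep the sketch short.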
Deep Dive: Latency, Traceability, and Memory Quality
The Internals: Route Policy, Memory Boundaries, and Eval Enforcement
Routing should use explicit features: intent, risk class, required tools, and user tier. Avoid free-form prompt-only routing for critical paths.
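As a rough sketch of what routing on explicit features can look like, the function below takes the four features named above and returns the route together with the rule that fired and the feature values, so the decision can be logged. The route names and rule order are assumptions for illustration, and `user_tier` is logged but not used by these sample rules.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RouteFeatures:
    intent: str            # e.g. "faq", "account_action"
    risk_class: str        # "low" or "high"
    required_tools: tuple  # tools the request would need, possibly empty
    user_tier: str         # e.g. "free", "enterprise"


def choose_route(f: RouteFeatures) -> dict:
    """Return the chosen route plus the rule and feature values that produced it."""
    if f.risk_class == "high":
        route, rule = "workflow_strict_eval", "high risk always gets strict evaluation"
    elif len(f.required_tools) > 1:
        route, rule = "planner", "multiple tools require a bounded plan"
    elif f.required_tools:
        route, rule = "workflow", "single tool call"
    else:
        route, rule = "direct", "no tools and low risk"
    return {"route": route, "rule": rule, "features": asdict(f)}
```

Because the rules are ordinary code, a misroute can be reproduced from the logged feature values alone.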
Memory should be layered and owned:
- Session memory: short-lived dialogue context.
- Task memory: state for one ongoing workflow.
- Durable retrieval: policy docs, runbooks, knowledge base.
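The three layers can be kept apart with very little code. The sketch below is an assumption-laden illustration: the `Document` fields mirror the freshness metadata from the checklist (source, version, updated_at), while the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Document:
    source: str
    version: int
    updated_at: datetime
    text: str


def fresh_documents(docs: list[Document], now: datetime, ttl: timedelta) -> list[Document]:
    """Keep only the newest version per source, and only if it is within the TTL."""
    newest: dict[str, Document] = {}
    for doc in docs:
        current = newest.get(doc.source)
        if current is None or doc.version > current.version:
            newest[doc.source] = doc
    return [d for d in newest.values() if now - d.updated_at <= ttl]


class LayeredMemory:
    def __init__(self) -> None:
        self.session: list[str] = []       # short-lived dialogue context
        self.task: dict[str, str] = {}     # state for one ongoing workflow
        self.durable: list[Document] = []  # policy docs, runbooks, knowledge base

    def close_task(self) -> None:
        # Task memory must not leak into the next workflow.
        self.task.clear()

    def context(self, now: datetime, ttl: timedelta) -> dict:
        return {
            "session": list(self.session),
            "task": dict(self.task),
            "durable": fresh_documents(self.durable, now, ttl),
        }
```

Note the design choice in `fresh_documents`: if the newest version of a source is stale, the source is dropped entirely rather than falling back to an older version.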
Evaluation must run inline for risky paths. Treat it as a runtime gate, not a dashboard-only metric.
| Control | What good looks like |
| --- | --- |
| Route explainability | Logs include route decision and feature values |
| Memory provenance | Every cited fact links to source ID/version |
| Eval actionability | Fail result includes reason + fallback action |
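To make "Eval actionability" concrete, here is a minimal sketch of an evaluator whose failures carry a reason code and a fallback action. The reason codes and action names are hypothetical, and the email regex is a toy stand-in for a real PII detector.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Toy PII check: flags anything that looks like an email address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class EvalResult:
    passed: bool
    reason_code: Optional[str] = None      # why the answer failed, if it did
    fallback_action: Optional[str] = None  # what the runtime should do next


def evaluate(answer: str, evidence_ids: list[str], high_risk: bool) -> EvalResult:
    if EMAIL_PATTERN.search(answer):
        return EvalResult(False, "pii_detected", "escalate_to_human")
    if high_risk and not evidence_ids:
        return EvalResult(False, "missing_evidence", "escalate_to_human")
    if not answer.strip():
        return EvalResult(False, "empty_answer", "retry_with_fallback_model")
    return EvalResult(True)
```

Returning a structured result rather than a bare boolean lets the runtime act on the failure and gives the trace something specific to record.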
Performance Analysis: What to Measure Weekly
| Metric | Why it matters |
| --- | --- |
| Route misclassification rate | Measures cost and behavior drift |
| End-to-end p95 latency by path | Prevents hidden latency stacking |
| Retrieval freshness failure rate | Detects stale-memory risk |
| Eval false-negative rate | Detects unsafe answers slipping through |
| Cost per accepted response | Measures architecture sustainability |
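Three of these metrics can be computed directly from trace records. The sketch below assumes a hypothetical record shape (`route`, `true_route`, `latency_ms`, `cost_usd`, `accepted`), where `true_route` is a human label present only on a reviewed sample, and it charges the cost of rejected responses to the accepted ones.

```python
import math


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def weekly_metrics(records: list[dict]) -> dict:
    # Only records with a human-reviewed label count toward misclassification.
    labeled = [r for r in records if r.get("true_route") is not None]
    misrouted = sum(1 for r in labeled if r["route"] != r["true_route"])
    accepted = [r for r in records if r["accepted"]]
    total_cost = sum(r["cost_usd"] for r in records)
    latency_by_path: dict[str, list[float]] = {}
    for r in records:
        latency_by_path.setdefault(r["route"], []).append(r["latency_ms"])
    return {
        "route_misclassification_rate": misrouted / len(labeled) if labeled else None,
        "p95_latency_ms_by_path": {k: p95(v) for k, v in latency_by_path.items()},
        "cost_per_accepted_response": total_cost / len(accepted) if accepted else None,
    }
```

Freshness failure rate and evaluator false-negative rate need labeled incident data, so they are left out of this sketch.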
Debug order for incidents:
- Was route choice correct?
- Was retrieval scoped and fresh?
- Did tool execution succeed within budget?
- Did evaluator correctly gate output?
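The debug order above can be encoded so that incident review always starts at the earliest stage that misbehaved. This is a sketch under an assumed trace shape; the field names (`expected_route`, `retrieved`, `tool_calls`, `should_have_passed`) are hypothetical and would be filled in by whoever reviews the incident.

```python
from typing import Optional

# Each check returns True when that stage behaved correctly for the given trace.
DEBUG_ORDER = [
    ("route", lambda t: t["route"] == t["expected_route"]),
    ("retrieval", lambda t: all(d["fresh"] and d["in_scope"] for d in t["retrieved"])),
    ("tools", lambda t: all(c["ok"] and c["latency_ms"] <= c["budget_ms"] for c in t["tool_calls"])),
    ("evaluator", lambda t: t["eval_passed"] == t["should_have_passed"]),
]


def first_failing_stage(trace: dict) -> Optional[str]:
    """Walk the stages in debug order and return the first one that misbehaved."""
    for stage, check in DEBUG_ORDER:
        if not check(trace):
            return stage
    return None
```

Stopping at the first failing stage matters because later failures are often consequences: a stale document upstream can make a correct evaluator look wrong.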
AI Runtime Flow: Route, Plan, Retrieve, Execute, and Guard
flowchart TD
A[User request] --> B[Risk and intent router]
B --> C{Direct path or workflow path?}
C -->|Direct| D[Answer model with minimal context]
C -->|Workflow| E[Planner with bounded steps]
E --> F[Tool workers]
F --> G[Layered memory retrieval]
D --> H[Runtime evaluator]
G --> H
H --> I{Pass rubric and policy?}
I -->|Yes| J[Return answer with trace metadata]
I -->|No| K[Fallback model or human escalation]
This diagram maps the complete runtime flow of a production AI system from raw user input to guarded response delivery. Requests enter a risk-and-intent router that splits traffic between a direct path (single model call) and a workflow path (planner with bounded steps, tool workers, and layered memory retrieval). Both paths converge at a runtime evaluator that checks the answer against a rubric and policy: passing responses carry trace metadata, while failing ones escalate to a fallback model or human queue, ensuring no unsafe output reaches the user regardless of which path was taken.
Routing Pattern: Intent to Specialized Agent
flowchart TD
A[Incoming Request] --> B[Intent Classifier]
B --> C{Risk Class}
C -->|faq / low-risk| D[Direct Answer Agent]
C -->|account_action| E[Workflow Agent]
C -->|complex_workflow| F[Planner-Worker Agent]
D --> G[Runtime Evaluator]
E --> G
F --> G
G -->|Pass| H[Return Answer + Trace]
G -->|Fail| I[Human Escalation Queue]
This flowchart shows how an intent classifier routes each incoming request to the right specialized agent tier. Low-risk FAQ requests go directly to a lightweight Direct Answer Agent, standard account actions route to a Workflow Agent, and complex multi-step requests flow to the Planner-Worker Agent. All three paths converge at a shared Runtime Evaluator, so regardless of routing path, every answer must pass the same policy gate before reaching the user; answers that fail go to a human escalation queue.
Memory and Planning Loop: Agent Observe-Plan-Act
sequenceDiagram
participant U as User
participant A as Agent
participant M as Memory Layer
participant T as Tool
participant E as Evaluator
U->>A: Request
A->>M: Retrieve context
M-->>A: Session + durable docs
A->>A: Plan steps (max 4)
loop Execute tools
A->>T: Invoke tool
T-->>A: Observation
A->>A: Update plan
end
A->>E: Evaluate answer
E-->>A: Pass/Fail + reason code
A-->>U: Final answer or escalate
This sequence diagram traces the observe-plan-act loop at the heart of the planner-worker pattern. The agent first retrieves scoped session context and durable documents from the Memory Layer, then decomposes the request into a bounded plan of at most four steps, executing each tool call and updating the plan with each observation before proceeding. The final answer passes through an Evaluator that returns a pass/fail verdict with a reason code, making every agent decision auditable and the escalation path deterministic rather than ad hoc.
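The bounded loop in the diagram can be sketched in a few lines. The code below is illustrative only: `tools` is a hypothetical allowlist mapping tool names to callables, and `replan` stands in for whatever planner updates the remaining steps after each observation.

```python
from typing import Callable

MAX_STEPS = 4  # matches the "max 4" bound in the diagram above


def run_plan(
    plan: list[str],
    tools: dict[str, Callable[[], str]],
    replan: Callable[[list[str], str], list[str]],
) -> list[tuple[str, str]]:
    """Execute at most MAX_STEPS steps, updating the plan after each observation."""
    observations: list[tuple[str, str]] = []
    remaining = list(plan)
    while remaining and len(observations) < MAX_STEPS:
        step = remaining.pop(0)
        if step not in tools:
            # Tool allowlist: unknown tools are rejected but still consume a step.
            observations.append((step, "rejected: tool not allowed"))
            continue
        observation = tools[step]()
        observations.append((step, observation))
        remaining = replan(remaining, observation)  # planner may add or drop steps
    return observations
```

Because the step budget is enforced by the loop and not by the planner, a planner that keeps appending steps still terminates after four observations.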
Real-World Applications: Realistic Scenario: Support Copilot With Compliance Constraints
Constraints:
- 600k monthly chats across billing and account security.
- 2.5 second p95 response target for simple questions.
- PII policy violations must be <0.1%.
- Cost cap of $0.015 per accepted answer.
Practical architecture:
- Router sends faq traffic to cheaper direct path.
- account_security routes to workflow path with strict evaluator.
- Planner used only for incident and account-action workflows.
- Memory retrieval restricted to policy version matching current quarter.
- Any failed evaluator check escalates to human queue.
| Constraint | Architecture decision | Why it helps |
| --- | --- | --- |
| Tight latency budget | Direct route for simple intents | Avoids planner/tool overhead |
| Compliance risk | Inline evaluator with policy rubric | Blocks unsafe output before user sees it |
| Cost cap | Path-specific model tiers | Prevents expensive model overuse |
| Audit need | Route + evidence trace logs | Makes incidents diagnosable |
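The cost cap is easier to reason about with the arithmetic written out. At 600k chats and $0.015 per accepted answer, spend tops out at 600,000 × $0.015 = $9,000 per month if every chat yields an accepted answer. The sketch below checks a blended cost against the cap; the traffic mix, per-path costs, and acceptance rate are hypothetical placeholders, not figures from the scenario.

```python
MONTHLY_CHATS = 600_000
COST_CAP_PER_ACCEPTED = 0.015  # USD, from the constraints above

# Hypothetical traffic mix and per-path costs; replace with measured values.
PATHS = {
    "direct": {"share": 0.80, "cost_usd": 0.004},
    "workflow": {"share": 0.20, "cost_usd": 0.050},
}
ACCEPT_RATE = 0.95  # hypothetical share of answers that pass the evaluator


def blended_cost_per_request(paths: dict) -> float:
    """Average cost of one request, weighted by how traffic splits across paths."""
    return sum(p["share"] * p["cost_usd"] for p in paths.values())


def cost_per_accepted(paths: dict, accept_rate: float) -> float:
    # Rejected answers still cost money, so spread total cost over accepted ones.
    return blended_cost_per_request(paths) / accept_rate


# Monthly spend allowed by the cap, given how many answers are accepted.
monthly_budget = MONTHLY_CHATS * COST_CAP_PER_ACCEPTED * ACCEPT_RATE
```

With these assumed numbers the blended cost is about $0.0132 per request and about $0.0139 per accepted answer, which is under the cap. Raising the workflow share to 30% gives 0.70 × 0.004 + 0.30 × 0.050 = $0.0178 per request, which breaks the cap before acceptance is even considered, so the router's traffic split is the main cost lever.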
Trade-offs & Failure Modes: Pros, Cons, and Risks by Pattern Layer
| Layer | Pros | Cons | Key risk | Mitigation |
| --- | --- | --- | --- | --- |
| Router | Controls cost and latency | Extra classification complexity | Misrouting high-risk tasks | Keep route classes simple and monitored |
| Planner-worker | Better handling of complex tasks | Adds latency and orchestration work | Unbounded loops | Enforce max steps and tool allowlist |
| Layered memory | Better context relevance | More data governance work | Stale policy leakage | Freshness TTL + source version checks |
| Evaluator | Prevents unsafe or low-quality output | Additional runtime overhead | False confidence from weak rubric | Regularly calibrate with failure replay |
Decision Guide: What to Add First
| Situation | Recommendation |
| --- | --- |
| Mostly simple Q&A with occasional risky answers | Add runtime evaluator first |
| Many intents and uneven cost profile | Add router next |
| Complex workflows need tools and decomposition | Add planner-worker only for those paths |
| Stale citations and context drift incidents | Add layered memory governance |
If you can only ship one control in the next sprint, ship the evaluator on high-risk paths first.
Practical Example: Incident Assistant Architecture Slice
Minimal design for an SRE incident assistant:
- Router identifies incident_triage requests.
- Planner creates max 4-step plan (logs, metrics, runbook, recommendation).
- Workers query approved observability tools only.
- Memory is task-scoped and expires after incident closure.
- Evaluator rejects recommendations lacking supporting evidence links.
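# planner, workers, model, evaluator, and escalate_to_human are placeholders for your own components.
def handle_incident(route, policy, tool_allowlist):
    if route == "incident_triage":
        plan = planner.create(max_steps=4)
        evidence = workers.execute(plan, tool_allowlist)
        response = model.summarize(evidence)
        # `pass` is a reserved word in Python, so the check is named `passes`.
        if evaluator.passes(response, evidence, policy):
            return response
        return escalate_to_human(reason="insufficient evidence")
    return None  # other routes are handled elsewhere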
Operator Field Note: What Fails First in Production
A recurring pattern from postmortems is that incidents in systems built on these patterns start with weak signals long before a full outage.
- Early warning signal: one guardrail metric drifts (error rate, lag, divergence, or stale-read ratio) while dashboards still look mostly green.
- First containment move: freeze rollout, route to the last known safe path, and cap retries to avoid amplification.
- Escalate immediately when: customer-visible impact persists for two monitoring windows or recovery automation fails once.
15-Minute SRE Drill
- Replay one bounded failure case in staging.
- Capture one metric, one trace, and one log that prove the guardrail worked.
- Update the runbook with exact rollback command and owner on call.
LangGraph and LangSmith: Stateful Agent Graphs with Built-In Evaluation
LangGraph is a Python library from LangChain that models AI agent workflows as directed graphs (StateGraph), where each node is a callable function and edges encode conditional branching, which matches the router → planner → evaluator topology described in this post. LangSmith provides observability and automated evaluation for LangGraph workflows in production.
How it solves the problem: Rather than writing custom orchestration code for routing, planning, memory, and evaluation, LangGraph encodes each layer as a typed graph node. Memory state flows between nodes via a shared TypedDict schema; LangSmith traces every node invocation, tool call, and evaluation decision, which makes the "Debug order for incidents" checklist above practical rather than theoretical.
from typing import TypedDict, Literal

from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage

# NOTE: llm, classify_intent, classify_risk, generate_plan, execute_tools,
# evidence_prompt, evaluate_answer, and queue_for_human are application-level
# placeholders. Define them before building the graph.

# -- Shared agent state --
class AgentState(TypedDict):
    request: str
    intent: str          # router output: "faq" | "account_action" | "complex_workflow"
    risk_level: str      # router output: "low" | "high"
    plan: list[str]      # planner output: ordered steps (empty for direct path)
    evidence: list[str]  # tool worker output: supporting facts
    answer: str          # model output
    eval_pass: bool      # evaluator output

# -- Node: intent + risk router --
def router_node(state: AgentState) -> AgentState:
    """Classify intent and risk class; choose direct or workflow path."""
    # In production, use a fast fine-tuned classifier or prompt
    intent = classify_intent(state["request"])  # returns "faq" | "account_action" | ...
    risk = classify_risk(state["request"])      # returns "low" | "high"
    return {**state, "intent": intent, "risk_level": risk, "plan": []}

# -- Conditional edge: route to direct answer or planner --
def route_decision(state: AgentState) -> Literal["direct_answer", "planner"]:
    return "planner" if state["intent"] == "complex_workflow" else "direct_answer"

# -- Node: direct answer (low-cost path) --
def direct_answer_node(state: AgentState) -> AgentState:
    answer = llm.invoke([HumanMessage(content=state["request"])]).content
    return {**state, "answer": answer, "evidence": []}

# -- Node: planner (bounded step decomposition) --
def planner_node(state: AgentState) -> AgentState:
    plan = generate_plan(state["request"], max_steps=4)
    evidence = execute_tools(plan, tool_allowlist=["logs", "metrics", "runbook"])
    answer = llm.invoke(evidence_prompt(state["request"], evidence)).content
    return {**state, "plan": plan, "evidence": evidence, "answer": answer}

# -- Node: runtime evaluator --
def evaluator_node(state: AgentState) -> AgentState:
    passes = evaluate_answer(
        answer=state["answer"],
        evidence=state["evidence"],
        rubric=["no_pii", "evidence_linked", "policy_compliant"],
    )
    return {**state, "eval_pass": passes}

# -- Conditional edge: pass -> return, fail -> escalate --
def eval_decision(state: AgentState) -> Literal["return_answer", "escalate"]:
    return "return_answer" if state["eval_pass"] else "escalate"

def escalate_node(state: AgentState) -> AgentState:
    queue_for_human(state["request"], reason="evaluator_failed")
    return {**state, "answer": "Your request has been escalated to our team."}

# -- Build the graph --
workflow = StateGraph(AgentState)
workflow.add_node("router", router_node)
workflow.add_node("direct_answer", direct_answer_node)
workflow.add_node("planner", planner_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("escalate", escalate_node)
workflow.set_entry_point("router")
workflow.add_conditional_edges("router", route_decision)
workflow.add_edge("direct_answer", "evaluator")
workflow.add_edge("planner", "evaluator")
# "return_answer" is not a node, so map it to END explicitly.
workflow.add_conditional_edges(
    "evaluator",
    eval_decision,
    {"return_answer": END, "escalate": "escalate"},
)
workflow.add_edge("escalate", END)
agent = workflow.compile()
LangSmith traces every node call, tool invocation, and evaluator decision automatically when LANGCHAIN_TRACING_V2=true is set in the environment (together with a LangSmith API key), providing the route + evidence audit trail required by the compliance constraints in the real-world scenario above.
For a full deep-dive on LangGraph and LangSmith in production AI systems, a dedicated follow-up post is planned.
Lessons Learned
- Route fewer paths well instead of many paths poorly.
- Planner value comes from bounded execution, not autonomous sprawl.
- Memory quality is about freshness and ownership, not vector size.
- Evaluation must block unsafe output in real time.
- Traceability is the key to debugging AI incidents quickly.
TLDR: Summary & Key Takeaways
- Production AI patterns should be selected by risk, latency, and cost profile.
- Use routers to control path selection and spending.
- Use planner-worker only where decomposition materially improves outcomes.
- Use layered memory with freshness metadata and provenance.
- Use runtime evaluation as the final guard before answer exposure.