AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails
Production AI needs explicit routing, memory, execution, and evaluation layers rather than one loop.
TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation; those layers determine safety, latency, cost, and traceability far more than model choice alone. Put differently, production AI architecture is mostly a routing and control problem: send each request through only the layers it needs, then prove output quality before exposure.
A customer support copilot worked great in demos but hallucinated in 30% of live tickets. The fix was not a better model — it was adding an explicit routing layer (classify intent first, so billing questions never hit the expensive reasoning path), a memory layer (store resolved tickets so the model stops confabulating policy), and an evaluation layer (score every response before the user sees it, escalate failures to a human queue). Hallucination rate dropped from 30% to under 2% in six weeks.
Here is the pattern in three lines: request arrives → router classifies intent and picks the cheapest safe path → evaluator scores the answer before it leaves the system. Everything else in this post is how to build and operate those three steps reliably.
📖 Why AI Pattern Choice Matters More Than Prompt Tuning
Teams usually start with one model and one prompt. That works for demos, then fails in production for predictable reasons: request mix broadens, tool calls fail, costs spike, and bad answers become operational incidents.
Architecture patterns solve this by separating responsibilities:
- Routing chooses the cheapest safe path.
- Planning decomposes tasks that need multiple steps.
- Memory controls what context can be trusted.
- Evaluation guards output quality and policy safety.
| Production symptom | Pattern response |
| --- | --- |
| Every request is expensive | Add routing and cheaper direct paths |
| Tool-heavy tasks are brittle | Add planner-worker orchestration |
| Answers cite stale policy | Add layered memory freshness controls |
| Hallucinations reach users | Add inline evaluation and escalation |
🔍 When to Use Each AI Pattern (and When Not To)
| Pattern | Use when | Avoid when | First implementation move |
| --- | --- | --- | --- |
| Router | Request types and risk levels vary | Product has one narrow use case | Start with 3-5 route classes only |
| Planner-worker | Tasks need stepwise tool usage | Most tasks are one-shot Q&A | Restrict planner to bounded workflows |
| Layered memory | Multi-turn context and policy docs matter | Session-only Q&A with no persistence | Separate session memory from durable retrieval |
| Runtime evaluator | Wrong answers are costly or regulated | Low-stakes experimentation | Add pass/fail guard before final response |
Quick practical rule
- Start with router + evaluator for most production copilots.
- Add planner only for workflows with measurable multi-step value.
- Add richer memory only after freshness and ownership are defined.
⚙️ How the AI Runtime Works in Practice
- Classify request intent and risk.
- Route to direct-answer path or workflow path.
- If workflow path, generate a bounded plan.
- Retrieve scoped memory with freshness checks.
- Execute tools/workers with trace logging.
- Evaluate answer quality and policy compliance.
- Return answer, fallback, or escalate to human.
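As a concrete skeleton, the seven stages reduce to one function per request. The sketch below is illustrative only: every helper is a stub standing in for your real classifier, planner, retriever, tool workers, and evaluator.

from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    reason: str = ""

# Stubs standing in for real components (assumptions for this sketch, not a real API)
def classify(request):                 # 1. intent + risk classifier
    return ("complex_workflow" if "refund" in request else "faq", "low")

def make_plan(request, max_steps):     # 3. bounded plan for workflow paths
    return ["fetch_invoice", "check_policy", "draft_reply"][:max_steps]

def retrieve(request, max_age_days):   # 4. freshness-scoped memory
    return ["policy v3 (current quarter)"]

def run_tools(plan, context):          # 5. trace-logged tool workers
    return [f"ok:{step}" for step in plan]

def generate(request, context, evidence):
    return "drafted answer"

def evaluate(answer, evidence, risk):  # 6. inline quality + policy gate
    return Verdict(passed=bool(evidence) or risk == "low", reason="no evidence")

def escalate(request, reason):         # 7. human fallback
    return f"escalated to human queue: {reason}"

def handle_request(request: str) -> str:
    intent, risk = classify(request)
    plan = make_plan(request, max_steps=4) if intent == "complex_workflow" else []  # 2. route
    context = retrieve(request, max_age_days=90)
    evidence = run_tools(plan, context) if plan else []
    answer = generate(request, context, evidence)
    verdict = evaluate(answer, evidence, risk)
    return answer if verdict.passed else escalate(request, verdict.reason)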
| Stage | Practical control | Common failure |
| --- | --- | --- |
| Route | Intent + risk classifier | Overfitted route taxonomy |
| Plan | Max steps, allowed tools | Planner loop runs too long |
| Memory | Source trust tier + TTL | Stale documents outrank newer policy |
| Execute | Per-tool timeout and retry budget | Tool failures cascade into hallucinated answers |
| Evaluate | Rubric checks + policy checks | Evaluator too weak or too permissive |
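The Execute row deserves code: a per-tool timeout plus a bounded retry budget is what keeps one hung tool from cascading. A minimal sketch follows; thread-based isolation is a simplification, and production systems typically isolate tool calls in processes or async tasks.

import time
from concurrent.futures import ThreadPoolExecutor

def call_tool(tool, args: dict, timeout_s: float = 3.0, retries: int = 2):
    """Run one tool call under a hard timeout with a bounded retry budget.

    On exhaustion this raises instead of returning None, so the caller can
    degrade to "evidence unavailable" rather than letting the model fabricate
    an answer around missing data."""
    last_err = None
    for attempt in range(retries + 1):
        # Fresh pool per attempt: retries never queue behind a hung call.
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(tool, **args).result(timeout=timeout_s)
        except Exception as err:
            last_err = err
            time.sleep(0.2 * (2 ** attempt))  # exponential backoff between attempts
        finally:
            pool.shutdown(wait=False, cancel_futures=True)  # never block on the hung call
    raise RuntimeError(f"tool failed after {retries + 1} attempts") from last_err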
🛠️ How to Implement: 10-Step Rollout Checklist
- Define request classes (`faq`, `account_action`, `policy_sensitive`, `complex_workflow`).
- Create router policy mapping each class to a path (see the policy sketch after this checklist).
- Set latency and cost budget per path.
- Implement planner only for one complex class first.
- Split memory into session context, task memory, and durable retrieval.
- Add document freshness metadata (`source`, `version`, `updated_at`).
- Add evaluator with explicit pass/fail rubric and escalation reason codes.
- Instrument traces for route choice, tool calls, retrieval IDs, and evaluator decision.
- Run offline replay tests against historical incidents.
- Launch with kill switch and fallback model path.
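Steps 1-3 of the checklist fit in one declarative policy table, and keeping it as data makes route changes reviewable. All class names, tiers, and budget numbers below are illustrative placeholders:

ROUTE_POLICY = {
    # request class        path          model tier    p95 budget    cost budget
    "faq":              {"path": "direct",   "model": "small",  "p95_ms": 1500, "max_usd": 0.002},
    "account_action":   {"path": "workflow", "model": "medium", "p95_ms": 4000, "max_usd": 0.008},
    "policy_sensitive": {"path": "workflow", "model": "large",  "p95_ms": 6000, "max_usd": 0.015},
    "complex_workflow": {"path": "planner",  "model": "large",  "p95_ms": 8000, "max_usd": 0.015},
}

def route(request_class: str) -> dict:
    # Unknown classes fail closed to the strictest path, never the cheapest one.
    return ROUTE_POLICY.get(request_class, ROUTE_POLICY["policy_sensitive"])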
Done criteria:
| Gate | Pass condition |
| --- | --- |
| Safety | High-risk outputs are blocked or escalated |
| Cost | p50 cost per successful task remains in budget |
| Reliability | Tool failure does not produce fabricated final answers |
| Explainability | Every final answer has a route + evidence trace |
🧠 Deep Dive: Latency, Traceability, and Memory Quality
The Internals: Route Policy, Memory Boundaries, and Eval Enforcement
Routing should use explicit features: intent, risk class, required tools, and user tier. Avoid free-form prompt-only routing for critical paths.
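Concretely, "explicit features" means the decision is a plain function of structured fields that can be logged verbatim, not text buried in a prompt. A hypothetical sketch:

from dataclasses import dataclass

@dataclass
class RouteFeatures:
    intent: str                # from the intent classifier
    risk_class: str            # "low" | "high"
    required_tools: list[str]  # tools the request implies
    user_tier: str             # e.g. "free" | "enterprise"

def choose_path(f: RouteFeatures) -> str:
    # Deterministic rules for critical paths; log `f` next to the decision.
    if f.risk_class == "high":
        return "workflow_strict"  # always evaluated, never the cheap path
    if f.required_tools:
        return "planner"
    return "direct"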
Memory should be layered and owned:
- Session memory: short-lived dialogue context.
- Task memory: state for one ongoing workflow.
- Durable retrieval: policy docs, runbooks, knowledge base.
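One way to make those boundaries enforceable is to stamp every item with its layer and provenance, then apply a per-layer TTL before anything reaches the prompt. The field names follow the rollout checklist (`source`, `version`, `updated_at`); the TTL values are illustrative:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryItem:
    text: str
    source: str          # provenance: where this fact came from
    version: str         # e.g. policy document version
    updated_at: datetime
    scope: str           # "session" | "task" | "durable"

TTL = {
    "session": timedelta(hours=1),   # dialogue context dies with the session
    "task": timedelta(days=1),       # workflow state expires at closure
    "durable": timedelta(days=90),   # policy docs re-verified quarterly
}

def is_fresh(item: MemoryItem) -> bool:
    """Reject anything older than its layer's TTL before it reaches the prompt."""
    return datetime.now(timezone.utc) - item.updated_at <= TTL[item.scope]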
Evaluation must run inline for risky paths. Treat it as a runtime gate, not a dashboard-only metric.
| Control | What good looks like |
| --- | --- |
| Route explainability | Logs include route decision and feature values |
| Memory provenance | Every cited fact links to source ID/version |
| Eval actionability | Fail result includes reason + fallback action |
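The eval-actionability row is worth encoding in the evaluator's return type, since a bare boolean cannot drive an escalation queue. A sketch with illustrative reason codes:

from dataclasses import dataclass, field

@dataclass
class EvalResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)  # e.g. ["missing_evidence", "pii_detected"]
    fallback: str = "none"                            # "retry_cheaper_model" | "human_queue" | ...

def gate(answer: str, evidence: list[str]) -> EvalResult:
    # Illustrative rubric; real gates combine rule checks with model judges.
    reasons = []
    if not evidence:
        reasons.append("missing_evidence")
    if "SSN" in answer:  # stand-in for a real PII detector
        reasons.append("pii_detected")
    return EvalResult(passed=not reasons, reasons=reasons,
                      fallback="human_queue" if reasons else "none")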
Performance Analysis: What to Measure Weekly
| Metric | Why it matters |
| --- | --- |
| Route misclassification rate | Measures cost and behavior drift |
| End-to-end p95 latency by path | Prevents hidden latency stacking |
| Retrieval freshness failure rate | Detects stale-memory risk |
| Eval false-negative rate | Detects unsafe answers slipping through |
| Cost per accepted response | Measures architecture sustainability |
Debug order for incidents:
- Was route choice correct?
- Was retrieval scoped and fresh?
- Did tool execution succeed within budget?
- Did evaluator correctly gate output?
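With structured traces, that debug order can literally be a function over the trace record. A hypothetical sketch, assuming trace fields that mirror the instrumentation checklist above:

def first_failing_stage(trace: dict) -> str:
    """Walk the incident-debug order and return the first stage that failed.

    Assumes a trace shaped like the rollout-checklist instrumentation:
    route choice, retrieval docs, tool calls, and the evaluator verdict."""
    if trace["route"]["chosen"] != trace["route"]["expected"]:  # expected label from incident review
        return "route"
    if any(doc["stale"] for doc in trace["retrieval"]["docs"]):
        return "retrieval"
    if any(call["status"] != "ok" for call in trace["tools"]):
        return "execution"
    if trace["eval"]["passed"]:  # the incident happened anyway: the gate was too permissive
        return "evaluator"
    return "unclassified"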
📊 AI Runtime Flow: Route, Plan, Retrieve, Execute, and Guard
flowchart TD
A[User request] --> B[Risk and intent router]
B --> C{Direct path or workflow path?}
C -->|Direct| D[Answer model with minimal context]
C -->|Workflow| E[Planner with bounded steps]
E --> F[Tool workers]
F --> G[Layered memory retrieval]
D --> H[Runtime evaluator]
G --> H
H --> I{Pass rubric and policy?}
I -->|Yes| J[Return answer with trace metadata]
I -->|No| K[Fallback model or human escalation]
This diagram maps the complete runtime flow of a production AI system from raw user input to guarded response delivery. Requests enter a risk-and-intent router that splits traffic between a direct path (single model call) and a workflow path (planner with bounded steps, tool workers, and layered memory retrieval). Both paths converge at a runtime evaluator that checks the answer against a rubric and policy — passing responses carry trace metadata while failing ones escalate to a fallback model or human queue, ensuring no unsafe output reaches the user regardless of which path was taken.
📊 Routing Pattern: Intent to Specialized Agent
flowchart TD
A[Incoming Request] --> B[Intent Classifier]
B --> C{Risk Class}
C -->|faq / low-risk| D[Direct Answer Agent]
C -->|account_action| E[Workflow Agent]
C -->|complex_workflow| F[Planner-Worker Agent]
D --> G[Runtime Evaluator]
E --> G
F --> G
G -->|Pass| H[Return Answer + Trace]
G -->|Fail| I[Human Escalation Queue]
This flowchart shows how an intent classifier routes each incoming request to the right specialized agent tier. Low-risk FAQ requests go directly to a lightweight Direct Answer Agent, standard account actions route to a Workflow Agent, and complex multi-step requests flow to the Planner-Worker Agent. All three paths converge at a shared Runtime Evaluator, ensuring that regardless of routing path, every answer must pass the same policy gate before reaching the user or escalating to a human queue.
📊 Memory and Planning Loop: Agent Observe-Plan-Act
sequenceDiagram
participant U as User
participant A as Agent
participant M as Memory Layer
participant T as Tool
participant E as Evaluator
U->>A: Request
A->>M: Retrieve context
M-->>A: Session + durable docs
A->>A: Plan steps (max 4)
loop Execute tools
A->>T: Invoke tool
T-->>A: Observation
A->>A: Update plan
end
A->>E: Evaluate answer
E-->>A: Pass/Fail + reason code
A-->>U: Final answer or escalate
This sequence diagram traces the observe-plan-act loop at the heart of the planner-worker pattern. The agent first retrieves scoped session context and durable documents from the Memory Layer, then decomposes the request into a bounded plan of at most four steps, executing each tool call and updating the plan with each observation before proceeding. The final answer passes through an Evaluator that returns a pass/fail verdict with a reason code — making every agent decision auditable and the escalation path deterministic rather than ad hoc.
🌍 Real-World Application: Support Copilot With Compliance Constraints
Constraints:
- 600k monthly chats across billing and account security.
- 2.5 second p95 response target for simple questions.
- PII policy violations must be <0.1%.
- Cost cap of $0.015 per accepted answer.
Practical architecture:
- Router sends `faq` traffic to the cheaper direct path.
- `account_security` routes to the workflow path with a strict evaluator.
- Planner is used only for incident and account-action workflows.
- Memory retrieval restricted to policy version matching current quarter.
- Any failed evaluator check escalates to human queue.
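Under these constraints the decisions compress into a small config. Every identifier and number below is a placeholder wired to the scenario's budgets, not a recommendation:

SUPPORT_COPILOT = {
    "routes": {
        "faq":              {"path": "direct", "model": "tier-small"},  # 2.5 s p95 target
        "account_security": {"path": "workflow", "model": "tier-large", "evaluator": "strict"},
        "incident":         {"path": "planner", "max_steps": 4},
    },
    "memory": {"policy_version": "current-quarter", "reject_older": True},  # quarter-pinned retrieval
    "budgets": {"p95_ms_simple": 2500, "max_usd_per_accept": 0.015},
    "on_eval_fail": "human_queue",  # compliance failures never auto-retry
}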
| Constraint | Architecture decision | Why it helps |
| --- | --- | --- |
| Tight latency budget | Direct route for simple intents | Avoids planner/tool overhead |
| Compliance risk | Inline evaluator with policy rubric | Blocks unsafe output before user sees it |
| Cost cap | Path-specific model tiers | Prevents expensive model overuse |
| Audit need | Route + evidence trace logs | Makes incidents diagnosable |
⚖️ Trade-offs & Failure Modes: Pros, Cons, and Risks by Pattern Layer
| Layer | Pros | Cons | Key risk | Mitigation |
| --- | --- | --- | --- | --- |
| Router | Controls cost and latency | Extra classification complexity | Misrouting high-risk tasks | Keep route classes simple and monitored |
| Planner-worker | Better handling of complex tasks | Adds latency and orchestration work | Unbounded loops | Enforce max steps and tool allowlist |
| Layered memory | Better context relevance | More data governance work | Stale policy leakage | Freshness TTL + source version checks |
| Evaluator | Prevents unsafe or low-quality output | Additional runtime overhead | False confidence from weak rubric | Regularly calibrate with failure replay |
🧭 Decision Guide: What to Add First
| Situation | Recommendation |
| --- | --- |
| Mostly simple Q&A with occasional risky answers | Add runtime evaluator first |
| Many intents and uneven cost profile | Add router next |
| Complex workflows need tools and decomposition | Add planner-worker only for those paths |
| Stale citations and context drift incidents | Add layered memory governance |
If you can only ship one control in the next sprint, ship the evaluator on high-risk paths first.
🧪 Practical Example: Incident Assistant Architecture Slice
Minimal design for an SRE incident assistant:
- Router identifies `incident_triage` requests.
- Planner creates a plan of at most 4 steps (logs, metrics, runbook, recommendation).
- Workers query approved observability tools only.
- Memory is task-scoped and expires after incident closure.
- Evaluator rejects recommendations lacking supporting evidence links.
if route == "incident_triage":
plan = planner.create(max_steps=4)
evidence = workers.execute(plan, tool_allowlist)
response = model.summarize(evidence)
if evaluator.pass(response, evidence, policy):
return response
return escalate_to_human(reason="insufficient evidence")
Operator Field Note: What Fails First in Production
A recurring pattern from postmortems: incidents in these layered AI runtimes start with weak signals long before a full outage.
- Early warning signal: one guardrail metric drifts (error rate, lag, divergence, or stale-read ratio) while dashboards still look mostly green.
- First containment move: freeze rollout, route to the last known safe path, and cap retries to avoid amplification.
- Escalate immediately when: customer-visible impact persists for two monitoring windows or recovery automation fails once.
15-Minute SRE Drill
- Replay one bounded failure case in staging.
- Capture one metric, one trace, and one log that prove the guardrail worked.
- Update the runbook with exact rollback command and owner on call.
🛠️ LangGraph and LangSmith: Stateful Agent Graphs with Built-In Evaluation
LangGraph is a Python library from LangChain that models AI agent workflows as directed graphs (StateGraph), where each node is a callable function and edges encode conditional branching — exactly the router → planner → evaluator topology described in this post. LangSmith provides observability and automated evaluation for LangGraph workflows in production.
How it solves the problem: Rather than writing custom orchestration code for routing, planning, memory, and evaluation, LangGraph encodes each layer as a typed graph node. Memory state flows between nodes via a shared TypedDict schema; LangSmith traces every node invocation, tool call, and evaluation decision — making the debugging workflow from the "debug order for incidents" table above practical rather than theoretical.
from typing import TypedDict, Literal
from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage
# ── Shared agent state ────────────────────────────────────────────────────────
class AgentState(TypedDict):
request: str
intent: str # router output: "faq" | "account_action" | "complex_workflow"
risk_level: str # router output: "low" | "high"
plan: list[str] # planner output: ordered steps (empty for direct path)
evidence: list[str] # tool worker output: supporting facts
answer: str # model output
eval_pass: bool # evaluator output
# ── Node: intent + risk router ────────────────────────────────────────────────
def router_node(state: AgentState) -> AgentState:
"""Classify intent and risk class; choose direct or workflow path."""
# In production, use a fast fine-tuned classifier or prompt
intent = classify_intent(state["request"]) # returns "faq" | "account_action" | ...
risk = classify_risk(state["request"]) # returns "low" | "high"
return {**state, "intent": intent, "risk_level": risk, "plan": []}
# ── Conditional edge: route to direct answer or planner ──────────────────────
def route_decision(state: AgentState) -> Literal["direct_answer", "planner"]:
return "planner" if state["intent"] == "complex_workflow" else "direct_answer"
# ── Node: direct answer (low-cost path) ──────────────────────────────────────
def direct_answer_node(state: AgentState) -> AgentState:
answer = llm.invoke([HumanMessage(content=state["request"])]).content
return {**state, "answer": answer, "evidence": []}
# ── Node: planner (bounded step decomposition) ────────────────────────────────
def planner_node(state: AgentState) -> AgentState:
plan = generate_plan(state["request"], max_steps=4)
evidence = execute_tools(plan, tool_allowlist=["logs", "metrics", "runbook"])
answer = llm.invoke(evidence_prompt(state["request"], evidence)).content
return {**state, "plan": plan, "evidence": evidence, "answer": answer}
# ── Node: runtime evaluator ────────────────────────────────────────────────────
def evaluator_node(state: AgentState) -> AgentState:
passes = evaluate_answer(
answer = state["answer"],
evidence = state["evidence"],
rubric = ["no_pii", "evidence_linked", "policy_compliant"],
)
return {**state, "eval_pass": passes}
# ── Conditional edge: pass → return, fail → escalate ─────────────────────────
def eval_decision(state: AgentState) -> Literal["return_answer", "escalate"]:
return "return_answer" if state["eval_pass"] else "escalate"
def escalate_node(state: AgentState) -> AgentState:
queue_for_human(state["request"], reason="evaluator_failed")
return {**state, "answer": "Your request has been escalated to our team."}
# ── Build the graph ────────────────────────────────────────────────────────────
workflow = StateGraph(AgentState)
workflow.add_node("router", router_node)
workflow.add_node("direct_answer", direct_answer_node)
workflow.add_node("planner", planner_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("escalate", escalate_node)
workflow.set_entry_point("router")
workflow.add_conditional_edges("router", route_decision)
workflow.add_edge("direct_answer", "evaluator")
workflow.add_edge("planner", "evaluator")
workflow.add_conditional_edges("evaluator", eval_decision)
workflow.add_edge("return_answer", END)
workflow.add_edge("escalate", END)
agent = workflow.compile()
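Invoking the compiled graph is then a single call; the initial state only needs the request, since each node fills in its own fields (this assumes the placeholder helpers above are wired to real components):

# Sketch: run one request through the graph end to end.
result = agent.invoke({"request": "Why was my card charged twice?"})
print(result["intent"], result["eval_pass"])  # route decision and gate verdict
print(result["answer"])                       # final answer or escalation message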
LangSmith traces every node call, tool invocation, and evaluator decision automatically when `LANGCHAIN_TRACING_V2=true` is set in the environment — providing the route + evidence audit trail required by the compliance constraints in the real-world scenario above.
For a full deep-dive on LangGraph and LangSmith in production AI systems, a dedicated follow-up post is planned.
📚 Lessons Learned
- Route fewer paths well instead of many paths poorly.
- Planner value comes from bounded execution, not autonomous sprawl.
- Memory quality is about freshness and ownership, not vector size.
- Evaluation must block unsafe output in real time.
- Traceability is the key to debugging AI incidents quickly.
📌 TLDR: Summary & Key Takeaways
- Production AI patterns should be selected by risk, latency, and cost profile.
- Use routers to control path selection and spending.
- Use planner-worker only where decomposition materially improves outcomes.
- Use layered memory with freshness metadata and provenance.
- Use runtime evaluation as the final guard before answer exposure.