LLM Skill Registries, Routing Policies, and Evaluation for Production Agents
After tools and skills, this is the control plane: registry design, routing rules, and evaluation loops.
TLDR: If tools are primitives and skills are reusable routines, then the skill registry + router + evaluator is your production control plane. This layer decides which skill runs, under what constraints, and how you detect regressions before users do.
Why a Skill Registry Becomes the Agent Control Plane
In small demos, the model picks a tool and returns a decent answer. In production, that is not enough.
You need answers to operational questions:
- Which skills are currently active?
- Which team owns each skill and guardrail policy?
- Which skills are safe for high-risk intents?
- What changed between yesterday's and today's routing behavior?
That is what a skill registry solves. It is not just a list of skill names. It is the source of truth for execution behavior.
Example: one registry entry in practice:
```json
{
  "skill_id": "sql_query_v2",
  "input_schema": { "query": "string", "db": "string" },
  "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
  "eval_hook": "sql_accuracy_v1"
}
```
When a user asks "Show me all orders over $500", the router matches intent data_lookup, confirms risk_level == 'low', selects sql_query_v2, and tags the response for evaluation via sql_accuracy_v1. That four-field entry is the minimum viable registry contract.
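A minimal sketch of how a router might evaluate that entry's `routing_condition`. The `matches` helper and the AND-joined equality grammar are illustrative assumptions, not a real condition language:

```python
# Sketch: evaluating a registry entry's routing_condition against request context.
# The condition grammar (AND-joined "key == 'value'" clauses) is an assumption
# for illustration, not a production parser.

def matches(condition: str, context: dict) -> bool:
    """Return True only if every AND-joined clause holds in the context."""
    for clause in condition.split(" AND "):
        key, _, value = clause.partition(" == ")
        if str(context.get(key.strip())) != value.strip().strip("'"):
            return False
    return True

entry = {
    "skill_id": "sql_query_v2",
    "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
}

# "Show me all orders over $500" classifies as a low-risk data lookup.
request = {"intent": "data_lookup", "risk_level": "low"}
print(matches(entry["routing_condition"], request))  # -> True
```

A real router would compile these conditions once at registry load time rather than re-parsing them per request.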
| Capability | Without registry | With registry |
| --- | --- | --- |
| Skill discovery | Prompt memory or hardcoded list | Queryable metadata |
| Governance | Ad hoc docs | Owner, risk level, policy fields |
| Routing consistency | Prompt-dependent | Deterministic + scored selection |
| Incident triage | Slow transcript digging | Versioned skill and route traces |
A practical architecture has three pieces:
- Registry: skill metadata and contracts.
- Router: selects the best skill for a request.
- Evaluator: measures quality, safety, latency, and drift.
Skill Routing Sequence
```mermaid
sequenceDiagram
    participant U as User
    participant IC as Intent Classifier
    participant SR as Skill Registry
    participant PG as Policy Gate
    participant E as Executor
    participant T as Skill Tracer
    U->>IC: User query
    IC->>SR: intent + entities
    SR->>SR: Embed + cosine match
    SR-->>PG: Candidate skills
    PG->>PG: Check risk / permissions
    PG-->>E: Approved skill
    E->>T: Execute skill (step trace)
    T-->>U: Response + route metadata
```
This sequence traces a user query through the full skill routing pipeline: from intent classification and embedding-based registry lookup, through a policy gate that enforces risk and permission checks, to skill execution with a step trace returned alongside the response. Notice how the Policy Gate acts as a hard filter before any skill runs: queries that fail eligibility never reach the executor, regardless of embedding match score. Focus on how each layer narrows the candidate set progressively rather than making a single monolithic routing decision.
Skill Registry Lookup Flow
```mermaid
flowchart TD
    Query[User Query]
    Embed["Embed Query (intent vector)"]
    Cos["Cosine similarity vs skill descriptions"]
    TopK[Top-K Candidate Skills]
    Filter["Policy Gate (risk, permissions, domain)"]
    Score["Score: fit, latency, risk"]
    Select[Select Top Skill]
    Invoke[Invoke Skill Runtime]
    Query --> Embed --> Cos --> TopK --> Filter --> Score --> Select --> Invoke
```
This post is the operational follow-up to LLM Skills vs Tools: The Missing Layer in Agent Design.
Designing a Registry That Humans and Routers Can Trust
A useful registry entry is both machine-readable and operator-readable.
Minimum fields per skill:
| Field | Example | Why it matters |
| --- | --- | --- |
| `skill_id` | `incident_triage_v3` | Stable reference in traces |
| `description` | "Investigate alerts and create tickets" | Helps intent matching |
| `input_schema` | JSON Schema | Prevents malformed runs |
| `output_schema` | JSON Schema | Stabilizes downstream integrations |
| `risk_level` | `low`, `medium`, `high` | Enables policy gating |
| `allowed_data_domains` | `logs`, `tickets` | Limits data exposure |
| `owner_team` | `sre-platform` | Accountability |
| `slo` | p95 < 4s | Runtime expectations |
| `version` | `3.2.1` | Safe rollouts and rollbacks |
Dense systems should also include:
- deprecation status,
- fallback skill id,
- required approvals (for sensitive actions),
- evaluation baseline hash.
A registry is a product artifact. Treat it like API surface area, not internal trivia.
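If the registry is API surface area, it deserves contract tests. A sketch of a CI gate over registry entries; the `REQUIRED` set mirrors the minimum fields table above, and treating exactly those fields as mandatory is an assumption about team policy:

```python
# Sketch: a contract test that gates registry changes in CI.
# Field names follow the minimum-fields table; which fields are mandatory
# is an illustrative assumption.

REQUIRED = {"skill_id", "description", "input_schema", "output_schema",
            "risk_level", "owner_team", "version"}
VALID_RISK = {"low", "medium", "high"}

def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if entry.get("risk_level") not in VALID_RISK:
        problems.append(f"invalid risk_level: {entry.get('risk_level')!r}")
    return problems

good = {"skill_id": "incident_triage_v3", "description": "Triage alerts",
        "input_schema": {}, "output_schema": {}, "risk_level": "medium",
        "owner_team": "sre-platform", "version": "3.2.1"}
print(validate_entry(good))  # -> []

bad = {k: v for k, v in good.items() if k != "owner_team"}
print(validate_entry(bad))  # -> ['missing field: owner_team']
```

Running this on every registry pull request catches drift before the router ever sees a malformed entry.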
Routing Pipeline: From User Intent to Skill Selection
A production router should be explicit about stages.
```mermaid
flowchart TD
    A[User request] --> B[Intent and entity extraction]
    B --> C[Candidate skill retrieval from registry]
    C --> D["Policy gate: data / risk / compliance"]
    D --> E["Score candidates: fit, cost, risk, freshness"]
    E --> F{Confidence above threshold?}
    F -- Yes --> G[Select top skill]
    F -- No --> H["Fallback: safe default or human review"]
    G --> I[Execute skill with trace]
    H --> I
    I --> J[Return response + route metadata]
```
This pipeline prevents a common failure mode: the model picks a "kind of related" skill because a keyword looked similar.
| Stage | Typical failure if skipped | Fix |
| --- | --- | --- |
| Candidate retrieval | Wrong skill family selected | Embedding + keyword hybrid retrieval |
| Policy gate | Unsafe skill selected | Hard allow/deny rules before scoring |
| Confidence threshold | Overconfident wrong execution | Fallback path when confidence is low |
| Trace capture | No root cause during outages | Persist route id, candidate scores, policy decisions |
Router quality is usually more important than incremental prompt tweaks once you scale beyond a few skills.
Deep Dive: Scoring, Constraints, and Runtime Guarantees
Internals: hybrid routing usually beats single-strategy routing
Most robust systems combine three routing signals:
- Rule-based filters for non-negotiable constraints (risk, permissions, domain).
- Semantic match for intent-to-skill relevance.
- Operational priors from latency, error rate, and freshness.
| Router signal | Strength | Weakness |
| --- | --- | --- |
| Rules | Deterministic safety | Can be rigid |
| Semantic score | Flexible intent fit | Can over-match vague text |
| Operational priors | Production-aware decisions | Needs telemetry quality |
A pure LLM router is fast to prototype but hard to govern. A pure rules engine is predictable but brittle. The hybrid path tends to be the practical middle ground.
Mathematical model: route score with explicit penalties
A common scoring objective:
$$ RouteScore(s \mid q) = w_f \cdot Fit(s, q) - w_l \cdot Latency(s) - w_r \cdot Risk(s) + w_o \cdot Reliability(s) $$
Where:
- Fit: intent coverage confidence,
- Latency: normalized expected runtime,
- Risk: policy and safety risk score,
- Reliability: historical success and schema-valid output rate.
Add hard constraints before scoring:
$$ Allowed(s, q) = Permission(s, q) \land DataPolicy(s, q) \land RegionPolicy(s, q) $$
Then choose:
$$ s^* = \arg\max_{s \in S, Allowed(s,q)} RouteScore(s \mid q) $$
This separates policy from optimization, which keeps audits and incident reviews much cleaner.
Performance analysis: what to measure in routing systems
| Metric | Why it matters | Target style |
| --- | --- | --- |
| Route accuracy | Correct skill chosen | Task-dependent baseline |
| Fallback rate | Router uncertainty / poor coverage | Low and stable |
| Schema-valid output rate | Downstream integration health | Very high |
| p95 route+execution latency | User experience and SLA risk | Within product SLO |
| Safety violation rate | Compliance and trust | Near zero |
A strong sign your registry is healthy: new skills can be added without a disproportionate rise in fallback rate or safety incidents.
Internals
A skill registry maps intent signatures to executable handlers via a routing layer, typically a classifier or an embedding similarity lookup over skill descriptions. At query time, the router embeds the user intent, retrieves the top-k skill candidates by cosine similarity, and optionally re-ranks with a small cross-encoder. Skill versioning (semver tags on handlers) allows A/B testing and gradual rollout without changing the routing API.
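The retrieval step described above can be sketched in a few lines. The three-dimensional vectors and skill names here are toy stand-ins; a production system would use a real embedding model and precompute the skill vectors at registry load time:

```python
# Sketch of embedding-based registry lookup: cosine similarity between a
# query vector and precomputed skill-description vectors, then top-k.
# The tiny 3-d vectors are illustrative toys, not real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Precomputed at registry load time; cached in memory for fast lookup.
skill_vectors = {
    "sql_query_v2":       [0.9, 0.1, 0.0],
    "incident_triage_v3": [0.1, 0.9, 0.2],
    "doc_summarizer_v1":  [0.0, 0.2, 0.9],
}

def top_k(query_vec, k=2):
    """Rank skills by cosine similarity to the query and keep the top k."""
    ranked = sorted(skill_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [skill_id for skill_id, _ in ranked[:k]]

# A query pointing in the "data lookup" direction retrieves sql_query_v2 first.
print(top_k([1.0, 0.0, 0.1]))  # -> ['sql_query_v2', 'incident_triage_v3']
```

The top-k list is only a candidate set: the policy gate and scoring stages still run before anything executes.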
Performance Analysis
Embedding-based routing over 100 skills adds 5-15ms of latency using precomputed skill embeddings cached in memory. A BERT-base cross-encoder re-ranker adds another 10-30ms but reduces the misrouting rate from ~8% to ~2% on ambiguous queries. End-to-end agent request latency with a registry lookup is typically 50-100ms before the LLM call, negligible compared to the 1-3s LLM response time.
Evaluation Loop: Offline Replay, Shadow Routing, and Live Gates
Evaluation is not one number. It is a loop.
```mermaid
sequenceDiagram
    participant D as Dataset Store
    participant R as Router
    participant E as Evaluator
    participant P as Prod Traffic
    D->>R: historical requests replay
    R-->>E: selected skill + confidence + trace
    E->>E: compute quality/safety/latency metrics
    E-->>R: threshold updates and alerts
    P->>R: live requests (shadow mode)
    R-->>E: shadow route decisions
    E-->>R: promote or rollback recommendation
```
Recommended evaluation layers:
| Layer | Input | Output |
| --- | --- | --- |
| Offline replay | Curated request set | Route accuracy, regression diffs |
| Shadow mode | Live traffic copy | Real-world drift signals |
| Online canary | Small user slice | Business-safe rollout confidence |
For intermediate maturity, start with one robust offline suite and one shadow dashboard before touching canary automation.
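The offline replay layer reduces to a small harness. A sketch under stated assumptions: the `route_fn` signature and the dataset shape (`request` plus an `expected_skill` label) are illustrative, not a fixed interface:

```python
# Sketch of an offline replay check: run a curated request set through a
# router function and diff the chosen skill against expected labels.
# route_fn and the dataset record shape are illustrative assumptions.

def replay(route_fn, dataset):
    """Return (route accuracy, list of mismatches) for regression triage."""
    mismatches = []
    for case in dataset:
        chosen = route_fn(case["request"])
        if chosen != case["expected_skill"]:
            mismatches.append((case["request"], chosen, case["expected_skill"]))
    accuracy = 1 - len(mismatches) / len(dataset)
    return accuracy, mismatches

dataset = [
    {"request": {"intent": "data_lookup"},    "expected_skill": "sql_query_v2"},
    {"request": {"intent": "incident"},       "expected_skill": "incident_triage_v3"},
    {"request": {"intent": "summarize_docs"}, "expected_skill": "doc_summarizer_v1"},
]

# Toy router with a coverage gap: it has no route for summarize_docs.
toy_router = lambda req: {"data_lookup": "sql_query_v2",
                          "incident": "incident_triage_v3"}.get(req["intent"], "fallback")

accuracy, misses = replay(toy_router, dataset)
print(round(accuracy, 2), len(misses))  # -> 0.67 1
```

Wiring this into CI with a stored baseline turns every router or registry change into a diffable regression report.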
Real-World Applications: Rollout Patterns That Work in Real Teams
Pattern 1: New skill onboarding checklist
- Add skill metadata and policy fields to registry.
- Add at least 20 representative replay prompts.
- Verify schema-valid output rate and safety checks.
- Enable shadow routing before any user-facing traffic.
Pattern 2: Risk-tiered routing
| Intent class | Route policy |
| --- | --- |
| Informational Q&A | Standard skill routing |
| Data mutation | High-confidence threshold + stricter policy gate |
| Regulated output | Human approval or signed workflow |
Pattern 3: Progressive promotion
- `dev` registry namespace,
- `staging` with replay and shadow tests,
- `prod-canary` for 1-5% traffic,
- full promotion if metrics pass.
This avoids the all-or-nothing rollout trap that causes noisy incidents.
Trade-offs: Failure Modes and Mitigations in Skill Routing Systems
| Failure mode | Typical symptom | Mitigation |
| --- | --- | --- |
| Registry drift | Skill docs and behavior diverge | Contract tests + version pinning |
| Overlapping skills | Router flips between near-identical skills | Capability taxonomy + ownership boundaries |
| Silent policy gaps | Unexpected sensitive actions | Deny-by-default policy design |
| Score overfitting | Good replay metrics, bad live behavior | Shadow routing with live telemetry |
| Evaluation blind spots | Regressions after release | Include adversarial and long-tail test sets |
Also watch for this anti-pattern: using one global confidence threshold for every intent type. High-risk intents need stricter thresholds.
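The fix is a per-intent threshold map. A minimal sketch; the tier names echo the risk-tiered routing table above, and the specific cutoff values are illustrative assumptions:

```python
# Sketch of intent-aware confidence thresholds, replacing one global cutoff.
# Tier names follow the risk-tiered routing pattern above; the numeric
# cutoffs are illustrative assumptions, not recommendations.

THRESHOLDS = {
    "informational": 0.35,  # loose: wrong answers are cheap to correct
    "data_mutation": 0.70,  # strict: wrong writes are costly
    "regulated":     0.90,  # strictest: compliance exposure
}

def accept_route(intent_class: str, score: float) -> bool:
    """Unknown intent classes fall into the strictest tier (deny-by-default)."""
    return score >= THRESHOLDS.get(intent_class, 0.90)

print(accept_route("informational", 0.40))  # -> True
print(accept_route("data_mutation", 0.40))  # -> False: same score, stricter tier
```

The same router score leads to different outcomes per tier, which is exactly the behavior a single global threshold cannot express.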
Decision Guide: What to Build First at Your Current Maturity
| Team situation | Build first | Build second |
| --- | --- | --- |
| 3-5 skills, early product stage | Basic registry with owners and schemas | Deterministic policy gate |
| 10-20 skills, multiple teams | Hybrid router with scoring + traces | Offline replay regression suite |
| Regulated domain or high-risk actions | Strict policy engine + approvals | Canary automation with rollback |
| Frequent model or prompt updates | Evaluation harness with drift alerts | Route score calibration tooling |
| Decision question | Recommendation |
| --- | --- |
| Should routing live in prompts only? | No, keep prompts as one signal, not the sole control plane |
| Should every skill have full autonomy? | No, route through centralized policy + registry metadata |
| Should evaluation be periodic only? | No, combine continuous shadow metrics with scheduled replays |
| Should fallback be generic? | No, define intent-aware fallbacks per risk tier |
Practical Examples: Registry and Router Skeleton
Example 1: Registry document shape (JSON)
These examples provide a concrete skill registry document shape and a hybrid route selection function that together form the skeleton of a production routing system. The JSON registry entry was chosen because it is the minimum viable artifact that makes routing decisions both machine-readable and operator-auditable: humans and automated routers share the same source of truth. When reading the Python routing function, focus on how eligibility filtering happens before score computation: this separation is what keeps policy enforcement from leaking into the scoring math and makes each layer independently testable.
```json
{
  "skill_id": "incident_triage_v3",
  "version": "3.2.1",
  "description": "Analyze outage alerts, summarize impact, and create incident tickets.",
  "risk_level": "medium",
  "owner_team": "sre-platform",
  "allowed_data_domains": ["logs", "incidents"],
  "input_schema": {
    "type": "object",
    "required": ["service", "time_range", "alert_id"]
  },
  "output_schema": {
    "type": "object",
    "required": ["summary", "severity", "ticket_id"]
  },
  "slo": {
    "p95_latency_ms": 4000,
    "schema_valid_rate": 0.99
  },
  "fallback_skill_id": "incident_triage_safe_v1"
}
```
This is enough to power both routing decisions and operator dashboards.
Example 2: Hybrid route selection sketch (Python)
```python
from typing import Dict, List

def allowed(skill: Dict, request: Dict) -> bool:
    # Hard policy gate: high-risk skills are never auto-routed for
    # high-risk request tiers; those paths require explicit approval.
    if request.get("risk_tier") == "high" and skill.get("risk_level") == "high":
        return False
    required_domain = request.get("domain")
    return required_domain in skill.get("allowed_data_domains", [])

def route_score(skill: Dict, fit: float, latency_ms: float, reliability: float) -> float:
    risk_penalty = {"low": 0.05, "medium": 0.15, "high": 0.40}[skill.get("risk_level", "medium")]
    return 0.55 * fit - 0.20 * (latency_ms / 5000.0) - 0.15 * risk_penalty + 0.10 * reliability

def choose_skill(request: Dict, candidates: List[Dict]) -> Dict:
    scored = []
    for skill in candidates:
        if not allowed(skill, request):
            continue
        # Placeholder signals. In production these come from embedding match and telemetry.
        fit = skill.get("fit", 0.0)
        latency_ms = skill.get("p95_latency_ms", 3000)
        reliability = skill.get("schema_valid_rate", 0.95)
        scored.append((route_score(skill, fit, latency_ms, reliability), skill))
    if not scored:
        return {"action": "fallback", "reason": "no_allowed_candidate"}
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_skill = scored[0]
    if best_score < 0.35:
        return {"action": "fallback", "reason": "low_confidence", "score": best_score}
    return {"action": "execute", "skill_id": best_skill["skill_id"], "score": best_score}
```
Even this simple structure gives you deterministic policy handling and explainable route selection.
LangChain and LangGraph: Building the Routing and Execution Stack
LangChain is a Python/TypeScript framework providing composable building blocks for LLM applications: tools, chains, memory, and output parsers. LangGraph extends LangChain with stateful, cyclic graphs for multi-step agent workflows where routing decisions must react to intermediate outputs rather than executing a fixed sequence of steps.
```python
# pip install langchain langchain-openai langgraph
from typing import TypedDict, Literal, Optional

from langchain_core.tools import tool
from langgraph.graph import StateGraph, END

# --- Registry-backed tools (one action each) ---
@tool
def fetch_service_metrics(service: str, window_minutes: int = 15) -> dict:
    """Retrieve error rate and p95 latency for a service over the given window."""
    # Replace with real observability API call
    return {"service": service, "error_rate": 0.082, "p95_latency_ms": 1340}

@tool
def create_incident_ticket(summary: str, severity: str) -> str:
    """Create a production incident ticket and return the ticket ID."""
    return f"INC-{abs(hash(summary)) % 9999:04d}"

# --- LangGraph stateful skill: routes through nodes based on intermediate state ---
class IncidentState(TypedDict):
    service: str
    metrics: Optional[dict]
    incident_id: Optional[str]
    status: Literal["pending", "escalated", "done"]

def fetch_node(state: IncidentState) -> IncidentState:
    metrics = fetch_service_metrics.invoke(
        {"service": state["service"], "window_minutes": 15}
    )
    return {**state, "metrics": metrics}

def ticket_node(state: IncidentState) -> IncidentState:
    summary = (f"High error rate on {state['service']}: "
               f"{state['metrics']['error_rate']:.1%}")
    ticket_id = create_incident_ticket.invoke(
        {"summary": summary, "severity": "high"}
    )
    return {**state, "incident_id": ticket_id, "status": "escalated"}

def route(state: IncidentState) -> str:
    """Policy gate: only escalate when error rate exceeds threshold."""
    if state.get("metrics", {}).get("error_rate", 0) >= 0.05:
        return "ticket"
    return END

graph = StateGraph(IncidentState)
graph.add_node("fetch", fetch_node)
graph.add_node("ticket", ticket_node)
graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch", route)
graph.add_edge("ticket", END)

skill = graph.compile()
result = skill.invoke({"service": "payments-svc", "metrics": None,
                       "incident_id": None, "status": "pending"})
print(result)
```
LangGraph's StateGraph maps directly onto the registry-router-evaluator control plane described earlier: each node is a step, each edge is a conditional routing decision, and the route function encodes the policy gate, separate from execution logic, independently testable, and auditable in the state trace.
For a full deep-dive on LangChain tool schemas, LangGraph multi-agent orchestration, and checkpoint-based resumability, a dedicated follow-up post is planned.
Lessons Learned from Scaling Agent Skill Systems
- Registry design quality determines routing quality more than people expect.
- If metadata ownership is unclear, incident resolution slows down dramatically.
- Route traces are as important as model logs for debugging production failures.
- Always separate policy eligibility from score ranking.
- Evaluation must include long-tail and adversarial queries, not only happy-path prompts.
- Fallback quality matters as much as primary route quality for user trust.
TLDR: Summary & Key Takeaways
- Production agents need a control plane: registry, router, and evaluator.
- A strong registry captures contracts, ownership, risk, and runtime expectations.
- Hybrid routing (rules + semantic fit + telemetry priors) is usually the best practical approach.
- Policy constraints should be hard gates before scoring.
- Evaluation should be continuous: replay, shadow, and canary.
- Reliable fallbacks turn router uncertainty into safe user outcomes.
One-line takeaway: Great agent behavior is rarely accidental; it is routed, constrained, and continuously evaluated.
Written by
Abstract Algorithms
@abstractalgorithms