Deploying LangGraph Agents: LangServe, Docker, LangGraph Platform, and Production Observability
Deploy LangGraph agents to production: LangServe, Docker, PostgresSaver, LangGraph Platform, and LangSmith observability.
Abstract Algorithms
TLDR: Swap InMemorySaver → PostgresSaver, add LangServe + Docker, trace with LangSmith.
The Demo-to-Production Gap: Why Notebook Agents Fail at Scale
Your LangGraph agent works perfectly in the demo. You deploy it to a single FastAPI instance. Two users hit it at the same time. They see each other's conversation history. The InMemorySaver you used in development doesn't know about concurrency, and it never will, because it was never designed for it.
This is the most common silent failure mode in LangGraph deployments, but it is only one of five production requirements that notebook-style agents don't satisfy:
| Requirement | Dev/Notebook | Production need |
| --- | --- | --- |
| State persistence | InMemorySaver (per-process, lost on restart) | External checkpointer (PostgresSaver, Redis) |
| Concurrency safety | Single-user loop | Row-level-locked checkpoints per thread_id |
| Authentication | None; any caller can invoke | API key middleware, JWT, or OAuth |
| Observability | print() statements | Distributed traces in LangSmith or OpenTelemetry |
| Horizontal scaling | 1 uvicorn worker | N stateless workers + shared external state store |
The rest of this post covers exactly how to close each of these gaps: LangServe for the HTTP layer, Docker and docker-compose for packaging, PostgresSaver for multi-worker state, and LangSmith for end-to-end observability. It ends with a worked deployment of the ReAct agent from earlier in this series.
Deployment Options Compared: LangServe vs LangGraph Platform vs Self-Hosted
Before writing a single line of deployment code, choose the right layer. The three main options have fundamentally different trade-off profiles.
| Aspect | LangServe | LangGraph Platform / Cloud | Self-Hosted (K8s) |
| --- | --- | --- | --- |
| Setup complexity | FastAPI wrapper, ~20 lines | langgraph.json + langgraph deploy | Full DevOps pipeline |
| Auth | Add middleware yourself | Built-in API keys + JWT | Your choice |
| Persistence | PostgresSaver (you manage) | Managed checkpointer | You manage |
| Background runs | ❌ Request/response only | ✅ Yes | ✅ Yes |
| Cron-triggered agents | ❌ No | ✅ Yes | Custom scheduler |
| Multi-tenant isolation | Manual (thread_id partitioning) | Per-user thread isolation built-in | Manual |
| Vendor lock-in | None (pure FastAPI) | LangChain Inc. deployment format | None |
| Cost model | Your infra + LLM | Platform fee + LLM | Infra + DevOps labor |
| Observability | Add LangSmith env vars | LangSmith integrated | LangSmith or OTel |
When to use which: LangServe for rapid prototyping and internal tools. LangGraph Platform when you need background tasks, cron scheduling, and minimal ops overhead. Self-hosted Kubernetes when you have data residency requirements, regulated workloads, or traffic volumes that make the platform fee uneconomic.
LangServe: Wrapping a LangGraph Agent as a FastAPI App
LangServe wraps any LangChain Runnable, including a compiled LangGraph, as a FastAPI application. Its core API is a single function: add_routes.
# app.py
import os
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langserve import add_routes
from agent import build_agent  # your compiled LangGraph

app = FastAPI(title="LangGraph Agent API", version="1.0")

# --- API key middleware ---
@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    if request.url.path.startswith("/agent"):
        key = request.headers.get("X-API-Key")
        if key != os.environ["AGENT_API_KEY"]:
            # HTTPException raised inside middleware bypasses FastAPI's
            # exception handlers, so return the 401 response directly
            return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
    return await call_next(request)
# --- Mount the graph ---
agent = build_agent()
add_routes(
app,
agent,
path="/agent",
enable_feedback_endpoint=True,
enable_public_trace_link_endpoint=True,
)
@app.get("/health")
async def health():
return {"status": "ok"}
add_routes auto-generates six endpoints from the graph's input/output schema:
| Endpoint | Method | Purpose |
| --- | --- | --- |
| /agent/invoke | POST | Single synchronous invocation |
| /agent/stream | POST | Server-sent event stream of intermediate state |
| /agent/batch | POST | Multiple inputs processed in parallel |
| /agent/input_schema | GET | JSON Schema for the request body |
| /agent/output_schema | GET | JSON Schema for the response body |
| /agent/playground | GET | Browser-based interactive playground |
RemoteRunnable client: callers can use a typed Python client instead of raw HTTP:
from langserve import RemoteRunnable
agent = RemoteRunnable("http://localhost:8000/agent")
# Single turn
result = agent.invoke(
{"messages": [{"role": "user", "content": "Summarize Q4 results"}]},
config={"configurable": {"thread_id": "user-42-session-7"}},
)
# Streaming
for chunk in agent.stream(
{"messages": [{"role": "user", "content": "Now compare to Q3"}]},
config={"configurable": {"thread_id": "user-42-session-7"}},
):
print(chunk)
Limitation: LangServe is in maintenance mode as of 2025. LangChain Inc. directs new projects toward LangGraph Platform for managed deployments. LangServe remains the right choice for self-hosted scenarios, but you own all the operational scaffolding.
Deep Dive: Concurrency, Persistence, and Production-Grade Checkpointing
The Internals
When a POST /agent/invoke request arrives at a LangServe worker, here is the exact execution path:
1. Deserialize: LangServe uses the graph's `InputType` schema to parse the JSON body.
2. Config extraction: the `config.configurable.thread_id` is extracted and passed to the checkpointer.
3. Checkpoint read: `AsyncPostgresSaver` opens a connection from its psycopg connection pool and runs a `SELECT` for the latest checkpoint row matching `thread_id`. This read acquires a row-level shared lock.
4. Graph execution: nodes run in topological order. After each node that modifies state, the checkpointer writes a new `UPSERT` into the `checkpoints` table, keyed by `(thread_id, checkpoint_id)`. Large state blobs overflow into `checkpoint_blobs`.
5. Checkpoint write safety: PostgreSQL's row-level locking prevents two workers from writing conflicting checkpoints for the same `thread_id`. Two workers processing different thread IDs have zero contention.
6. Serialize and respond: the final `MessagesState` is serialized to JSON and returned.
Why InMemorySaver silently breaks with multiple workers:
Worker 1 memory: { "user-42": [msg1, msg2] }
Worker 2 memory: { }   <- empty, knows nothing about msg1 or msg2

User's turn 3 routes to Worker 2: fresh conversation, no history
There is no error and no exception, just wrong behavior. The production fix is to replace InMemorySaver before any multi-worker deploy:
# Development (single process only)
from langgraph.checkpoint.memory import MemorySaver

graph = builder.compile(checkpointer=MemorySaver())

# Production (multi-worker safe)
import os

from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

async def build_production_graph():
    # AsyncPostgresSaver is built on psycopg; autocommit + dict_row
    # are required for setup() and checkpoint reads to work
    pool = AsyncConnectionPool(
        conninfo=os.environ["DATABASE_URL"],
        min_size=2,
        max_size=10,
        open=False,
        kwargs={"autocommit": True, "row_factory": dict_row},
    )
    await pool.open()
    checkpointer = AsyncPostgresSaver(pool)
    await checkpointer.setup()  # idempotent: creates tables if absent
    return builder.compile(checkpointer=checkpointer)
Performance Analysis
The LLM API call dominates latency by two to three orders of magnitude. Every other component (checkpoint I/O, LangSmith trace upload) is negligible by comparison at moderate scale.
| Metric | 1 uvicorn worker | 4 uvicorn workers (under load) |
| --- | --- | --- |
| Throughput (simple ReAct, 2-turn) | ~8 RPS | ~28–32 RPS |
| PostgresSaver write latency (p50) | ~8 ms | ~12 ms |
| PostgresSaver write latency (p99) | ~35 ms | ~80 ms |
| LangSmith trace upload overhead | ~3 ms (async, non-blocking) | ~3 ms |
| GPT-4o API call latency (p50) | ~1.2 s | ~1.2 s |
| Total round-trip p50 (1 LLM call) | ~1.4 s | ~1.4 s |
Where real bottlenecks appear:
- Hot thread_ids: if many workers retry the same session simultaneously (e.g., a polling client), write-lock contention on a single `thread_id` row surfaces at high concurrency. Fix: enforce one-writer-per-thread via application-level routing or advisory locks.
- Checkpoint table growth: a single 20-turn session produces 20+ rows. At 100k daily sessions, the table reaches millions of rows within weeks. Add `created_at` indexes and an archival TTL policy.
- Connection pool exhaustion: each uvicorn worker opens its own pool. With 4 workers × `max_size=10`, you have 40 PostgreSQL connections. Size the DB accordingly; use PgBouncer in transaction mode for >100 connections.
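The connection math above is easy to get wrong once replica counts change. A quick sanity-check helper (the function name is illustrative, not from any library):

```python
def total_db_connections(replicas: int, workers_per_replica: int, pool_max_size: int) -> int:
    """Each uvicorn worker process opens its own pool, so connections multiply."""
    return replicas * workers_per_replica * pool_max_size

print(total_db_connections(1, 4, 10))  # 40 -- the single-host example above
print(total_db_connections(3, 4, 10))  # 120 -- already near PgBouncer territory
```

If the result approaches PostgreSQL's default `max_connections` of 100, put PgBouncer in front before scaling out further.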
Mathematical Model
Capacity planning β workers needed:
$$W = \left\lceil \frac{RPS \times \bar{L}}{u} \right\rceil$$
Where:
$RPS$ = target requests per second
$\bar{L}$ = average graph invocation latency in seconds (dominated by LLM calls)
$u$ = target worker utilization (0.7 is a safe default; above 0.85 causes queue buildup)
Worked example: Target 10 RPS, average latency 2.5 s (one GPT-4o call + tool calls), 70% utilization:
$$W = \left\lceil \frac{10 \times 2.5}{0.7} \right\rceil = \lceil 35.7 \rceil = 36 \text{ workers}$$
That is 9 container replicas × 4 uvicorn workers each.
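The capacity formula is small enough to keep next to your deployment scripts; this sketch just transcribes the equation above:

```python
import math

def workers_needed(rps: float, avg_latency_s: float, utilization: float = 0.7) -> int:
    """W = ceil(RPS * L / u) from the capacity model above."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return math.ceil(rps * avg_latency_s / utilization)

print(workers_needed(10, 2.5))        # 36 -- the worked example
print(workers_needed(10, 2.5, 0.85))  # 30 -- running hotter saves fewer workers than you'd hope
```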
Monthly LLM cost estimate:
$$C_{monthly} = U \times S \times T_{turns} \times T_{tokens} \times \frac{P}{1000} \times 30$$
Where:
$U$ = daily active users
$S$ = sessions per user per day
$T_{turns}$ = average turns per session
$T_{tokens}$ = average tokens per turn (input + output combined)
$P$ = model price per 1 000 tokens (e.g., GPT-4o: $0.005/1k output tokens)
Worked example: 1 000 DAU, 3 sessions/day, 5 turns/session, 800 tokens/turn:
$$C_{monthly} = 1000 \times 3 \times 5 \times 800 \times \frac{0.005}{1000} \times 30 = \$1{,}800/\text{month}$$
Doubling turns/session (to 10) doubles the cost to $3,600/month: turn count is the primary cost lever, not user count.
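The cost formula is equally mechanical; a hedged transcription, useful for spreadsheeting scenarios before committing to a model tier:

```python
def monthly_llm_cost(dau: int, sessions: float, turns: float,
                     tokens_per_turn: float, price_per_1k: float, days: int = 30) -> float:
    """C = U * S * T_turns * T_tokens * (P / 1000) * days."""
    return dau * sessions * turns * tokens_per_turn * price_per_1k / 1000 * days

print(monthly_llm_cost(1000, 3, 5, 800, 0.005))   # 1800.0
print(monthly_llm_cost(1000, 3, 10, 800, 0.005))  # 3600.0 -- doubling turns doubles cost
```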
Beyond Request-Response: Auth Patterns, Background Runs, and LangGraph Platform
Authentication and Per-User Thread Isolation
LangServe provides no authentication out of the box; everything added in the middleware snippet earlier is your responsibility. The three common production patterns are:
Pattern 1: Static API key (internal tools and service-to-service)
# In FastAPI middleware, shown earlier in the LangServe section
key = request.headers.get("X-API-Key")
if key != os.environ["AGENT_API_KEY"]:
    return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
Simple, effective for a single service identity. Does not support per-user tracking or revocation.
Pattern 2: JWT with per-user thread isolation
import os

import jwt  # PyJWT
from fastapi import Depends, HTTPException, Request

SECRET = os.environ["JWT_SECRET"]
async def get_current_user(request: Request) -> dict:
token = request.headers.get("Authorization", "").removeprefix("Bearer ")
try:
payload = jwt.decode(token, SECRET, algorithms=["HS256"])
except jwt.PyJWTError:
raise HTTPException(status_code=401, detail="Invalid token")
return payload # contains {"sub": "user-42", "tenant": "acme"}
@app.post("/agent/invoke")
async def invoke(body: dict, user: dict = Depends(get_current_user)):
thread_id = f"{user['tenant']}-{user['sub']}-{body['session_id']}"
return await graph.ainvoke(
body["input"],
config={"configurable": {"thread_id": thread_id}},
)
The thread_id construction here is the security boundary. A user who crafts an arbitrary thread_id in the request body cannot access another user's thread, because the server overwrites it from the verified JWT payload.
Pattern 3: LangGraph Platform built-in auth
LangGraph Platform handles token validation and thread isolation natively. You pass a user_id in the run config and the Platform enforces that each user can only access their own threads. No middleware required.
Background Runs and Cron-Triggered Agents with LangGraph Platform
Some agent workloads do not fit a request/response model:
- Nightly report generators: run for 30–60 seconds per user, triggered at a fixed time
- Monitoring agents: poll an external API every 5 minutes and write findings to a thread
- Reactive agents: triggered by a webhook (new Stripe event, new GitHub PR) rather than a user request
LangGraph Platform supports these with background runs and cron schedules:
// langgraph.json - LangGraph Platform deployment config
{
"dependencies": ["."],
"graphs": {
"react_agent": "./agent.py:graph",
"report_agent": "./report_agent.py:report_graph"
},
"env": ".env",
"http": {
"port": 8000
}
}
# Local dev server (hot reload, no Docker required)
langgraph dev
# Deploy to LangGraph Cloud
langgraph deploy --project my-agent-project
Background run (via SDK):
from langgraph_sdk import get_client
client = get_client(url="https://my-agent.langchain.app")
# Create a background run; returns immediately with a run_id
run = await client.runs.create(
thread_id=f"report-{user_id}-{today}",
assistant_id="report_agent",
input={"user_id": user_id, "date": today},
)
# Poll for completion
result = await client.runs.join(thread_id=run.thread_id, run_id=run.id)
Cron runs are configured via the Platform dashboard or API: specify a cron expression, the assistant, and the default input. The Platform creates a new background run on each trigger.
Edge Cases in Production Graph Design
| Edge case | Symptom | Guard pattern |
| --- | --- | --- |
| Tool returns empty result set | Agent loops indefinitely, asking the same question | Add a max_iterations counter to state; route to END on limit |
| LLM returns malformed JSON for structured output | Pydantic ValidationError on the node output | Wrap node in a try/except; write error to state and route to a fallback node |
| Checkpoint write fails mid-run (DB unavailable) | Next invocation replays from last committed checkpoint, may duplicate tool calls | Make tool calls idempotent; use run_id as idempotency key in external APIs |
| thread_id collision across tenants | User A sees User B's history | Namespace thread_ids: f"{tenant_id}::{user_id}::{session_id}" (colons are valid in thread_id strings) |
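The first guard in the table (a max_iterations counter checked by a conditional edge) can be sketched framework-free; the state shape and node names here are illustrative, not from the original agent:

```python
MAX_ITERATIONS = 8

def route_after_tools(state: dict) -> str:
    """Conditional-edge function: stop looping once the iteration budget is spent."""
    if state.get("iterations", 0) >= MAX_ITERATIONS:
        return "give_up"  # mapped to END or a fallback node in the graph
    return "model"        # loop back for another reasoning step

def tools_node(state: dict) -> dict:
    # each pass through the tool node increments the counter in state
    return {"iterations": state.get("iterations", 0) + 1}

state = {"iterations": 0}
while True:
    state.update(tools_node(state))
    if route_after_tools(state) == "give_up":
        break
print(state["iterations"])  # 8 -- the loop can never run away
```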
Production Architecture: From Load Balancer to LangSmith
The following diagram shows the full request path for a multi-worker LangServe deployment with PostgreSQL persistence and LangSmith observability. The dashed arrows are async/non-blocking paths that do not add to user-facing latency.
flowchart TD
A["Client / Browser\n(RemoteRunnable or HTTP)"] -->|"HTTPS POST /agent/invoke"| B["Load Balancer\nnginx / AWS ALB"]
B --> C1["FastAPI Worker 1\nLangServe + agent.py"]
B --> C2["FastAPI Worker 2\nLangServe + agent.py"]
B --> C3["FastAPI Worker N\nLangServe + agent.py"]
C1 & C2 & C3 -->|"psycopg pool\nthread_id checkpoint UPSERT"| D[("PostgreSQL 16\nAsyncPostgresSaver\ncheckpoints table")]
C1 & C2 & C3 -->|"HTTPS: LLM prompts"| E["OpenAI / Anthropic\nLLM API"]
C1 & C2 & C3 -.->|"async trace upload\n(non-blocking)"| F["LangSmith\nTracing & Evaluation"]
D -.->|"checkpoint read on session resume"| C1
style D fill:#336791,color:#ffffff
style F fill:#1a73e8,color:#ffffff
style E fill:#10a37f,color:#ffffff
style B fill:#f0a500,color:#000000
Figure: Each stateless FastAPI worker reads and writes checkpoints through a shared PostgreSQL pool. LangSmith receives traces asynchronously; its availability does not affect request-response latency.
Key architectural properties:
- Workers are stateless: any worker can handle any request for any thread_id, because state lives in PostgreSQL, not in process memory.
- The load balancer can use round-robin or least-connections routing; sticky sessions are not required.
- The LLM API is an external dependency: implement retry with exponential backoff and a circuit breaker on 429/503 responses.
- LangSmith is fire-and-forget from the worker's perspective. If LangSmith is unavailable, traces are queued locally and uploaded on recovery; no request fails.
Real-World Applications: How Teams Run LangGraph in Production
Case Study 1: Customer Support Agent at a SaaS Company
A B2B SaaS team deployed a LangGraph support agent handling 50 000 monthly tickets. The agent used a ReAct pattern: classify intent → search knowledge base → draft response → optionally escalate.
Deployment details:
- 3 ECS tasks × 4 uvicorn workers = 12 workers total
- PostgresSaver on RDS PostgreSQL `db.t3.medium`; average 8 ms checkpoint write latency
- `thread_id = f"tenant-{tenant_id}-ticket-{ticket_id}"`: one thread per ticket, strict tenant isolation
- LangSmith used for trace replay when a ticket was escalated but the customer reported the agent gave contradictory answers
Scaling lesson: The team initially used 2 RDS instances (one per region) and ran into cross-region checkpoint replication lag when the load balancer briefly routed a user's follow-up to the other region. Fix: pin sessions to a region via the load balancer's geo-routing rules, not the database layer.
Case Study 2: Daily Market Summary Report Agent
A fintech team needed a LangGraph agent to generate a personalized market summary for 1 200 subscribers every morning at 06:00 UTC. Each report took 25–40 seconds (10+ tool calls: price fetcher, news scraper, portfolio analyzer, chart renderer).
Why LangServe alone was insufficient: LangServe operates on a request/response model. A 40-second HTTP request is fragile: load balancers time out, clients disconnect, and there are no retry semantics. Background task runs were required.
Solution: LangGraph Platform with a cron-triggered background run:
// langgraph.json
{
"dependencies": ["."],
"graphs": {
"market_summary": "./agent.py:market_summary_graph"
},
"env": ".env"
}
The cron job created one background run per subscriber with thread_id = f"report-{user_id}-{date}". Results were written to the thread's final state and polled by the notification service via the Platform API.
Scaling lesson: Background run queues have depth limits on the managed platform. For more than 500 simultaneous background runs, teams either paginate cron triggers across time windows (06:00–06:30 in batches of 100) or move to self-hosted with a dedicated task queue (Celery, ARQ, or Temporal).
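The batching workaround can be sketched as a scheduling plan; the function and parameter names are illustrative:

```python
from datetime import datetime, timedelta

def schedule_batches(user_ids: list, batch_size: int = 100,
                     start: str = "06:00", gap_minutes: int = 3) -> list:
    """Spread cron-triggered runs across time windows to stay under queue depth limits."""
    t = datetime.strptime(start, "%H:%M")
    plan = []
    for i in range(0, len(user_ids), batch_size):
        plan.append((t.strftime("%H:%M"), user_ids[i:i + batch_size]))
        t += timedelta(minutes=gap_minutes)
    return plan

plan = schedule_batches([f"user-{n}" for n in range(1200)])
print(len(plan), plan[0][0], plan[-1][0])  # 12 06:00 06:33
```

Each tuple in the plan becomes one cron trigger; no single window ever exceeds the queue limit.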
Trade-offs and Failure Modes: LangServe Limits, Platform Lock-in, and Trace Storage Costs
LangServe Limitations
LangServe is a thin FastAPI adapter: it gives you the right endpoints fast, but every production concern is your responsibility:
- No built-in auth: the middleware example earlier is the minimal pattern; production systems need JWT validation, per-tenant rate limiting, and RBAC.
- No background runs: all invocations are synchronous HTTP. For long-running agents (>10 s), clients must poll or use `/stream` with SSE.
- No cron scheduling: you need an external scheduler (cron job, Celery beat, cloud scheduler) to trigger periodic agents.
- Maintenance mode: LangChain Inc. has publicly indicated LangServe receives only security patches; feature development moved to LangGraph Platform.
Platform Lock-in
LangGraph Platform's deployment contract (langgraph.json, the Platform SDK, the background run API) is proprietary. Your graph code is portable: it is pure Python with LangGraph core as the only dependency. Your deployment wrapper is not. If you later need to self-host, you rewrite the deployment scaffolding but keep the graph logic unchanged. Design your graph modules with zero imports of langgraph_sdk or platform-specific APIs.
PostgresSaver as a Bottleneck at Scale
PostgresSaver is well-suited for hundreds of concurrent users. At thousands of concurrent sessions:
| Scale | Symptom | Mitigation |
| --- | --- | --- |
| 100–500 concurrent sessions | Checkpoint write latency p99 climbs to 100–200 ms | Tune connection pool sizes; add PgBouncer |
| 500–2000 concurrent sessions | Table size grows to 10M+ rows; index scans slow | Partition by created_at; archive old threads |
| 2000+ concurrent sessions | PostgreSQL becomes the throughput bottleneck | Redis checkpointer for hot threads; PostgreSQL for cold/archived threads |
Trace Storage Costs
LangSmith charges per trace ingested and per trace stored. At high volume:
- A single GPT-4o ReAct run with 5 tool calls = 6–8 LangSmith spans
- 1 000 RPS × 8 spans = 8 000 spans/second ≈ 700 M spans/day
- At LangSmith pricing, this easily reaches $1 000–$3 000/month
Mitigation strategies:
- Trace sampling: `LANGCHAIN_TRACING_SAMPLE_RATE=0.10` traces 10% of requests. Always trace on error regardless of sample rate.
- TTL policy: set trace retention to 14 days for high-volume projects; 90 days for evaluation baseline projects.
- Disable verbose logging: set `LANGCHAIN_HIDE_INPUTS=true` and `LANGCHAIN_HIDE_OUTPUTS=true` for paths where input/output capture adds no debugging value.
Checkpoint Table Footprint
A subtle failure mode: the checkpoints table grows without bound unless you set a retention policy. One session with 20 turns writes 20+ rows. At 100 000 daily sessions:
$$\text{Rows/day} = 100{,}000 \times 20 = 2{,}000{,}000 \text{ rows/day}$$
Within 30 days: 60 million rows. Without proper indexing on (thread_id, created_at) and periodic archival, checkpoint reads degrade to full-table scans.
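The growth arithmetic is worth wiring into a monitoring check rather than recomputing by hand; a trivial estimator, assuming roughly one checkpoint row per turn:

```python
def checkpoint_rows(daily_sessions: int, turns_per_session: int, days: int) -> int:
    """Rows written to the checkpoints table, assuming ~1 row per turn."""
    return daily_sessions * turns_per_session * days

print(checkpoint_rows(100_000, 20, 1))   # 2000000 rows/day
print(checkpoint_rows(100_000, 20, 30))  # 60000000 rows after 30 days
```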
Decision Guide: LangServe vs LangGraph Platform vs Self-Hosted Kubernetes
| Situation | Recommendation |
| --- | --- |
| Internal tool, small team, fast prototype | LangServe + MemorySaver locally; swap to PostgresSaver before any team-shared deploy |
| Production web app, self-managed infra, simple request/response | LangServe + AsyncPostgresSaver + API key middleware + LangSmith tracing |
| Need background runs, cron agents, minimal ops overhead | LangGraph Platform (managed cloud); expect a $200–$2,000/month platform fee depending on usage |
| Regulated industry, strict data residency, or air-gapped environment | Self-hosted K8s + AsyncPostgresSaver + self-managed LangSmith (or OpenTelemetry export to Jaeger/Grafana) |
| Multi-tenant SaaS with strict user isolation | thread_id = f"{tenant_id}-{user_id}-{session_id}"; PostgreSQL row-level security per tenant schema |
| Very high throughput (>200 RPS) | Self-hosted K8s + HPA + PgBouncer + Redis checkpointer for hot threads; PostgreSQL as cold archive |
| Periodic batch agents (nightly reports, weekly summaries) | LangGraph Platform cron runs for simplicity; or a self-hosted ARQ/Temporal queue if you need retry semantics and audit logs |
Practical Example: Full Deployment with Docker Compose, PostgreSQL, and LangSmith
This section deploys the ReAct agent from earlier in the series end-to-end: AsyncPostgresSaver for state, LangServe for HTTP, docker-compose for local production parity, and LangSmith tracing enabled. The full-stack deployment scenario was chosen because it exercises every production requirement from the post's opening table (shared durable state, a real HTTP interface, container isolation, and observability) in a single runnable stack. As you read through the steps, watch how the PostgreSQL connection string flows from docker-compose.yml into the checkpointer's connection pool and into AsyncPostgresSaver: that three-layer chain is the only change between a dev InMemorySaver and a production-ready agent.
Step 1: Agent with PostgresSaver
# agent.py
import os

from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def call_model(state: MessagesState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("model", call_model)
builder.set_entry_point("model")
builder.add_edge("model", END)

_graph = None  # module-level singleton

async def get_graph():
    global _graph
    if _graph is None:
        # AsyncPostgresSaver is built on psycopg; autocommit + dict_row
        # are required for setup() and checkpoint reads to work
        pool = AsyncConnectionPool(
            conninfo=os.environ["DATABASE_URL"],
            min_size=2,
            max_size=10,
            open=False,
            kwargs={"autocommit": True, "row_factory": dict_row},
        )
        await pool.open()
        checkpointer = AsyncPostgresSaver(pool)
        await checkpointer.setup()  # idempotent table creation
        _graph = builder.compile(checkpointer=checkpointer)
    return _graph
Step 2: LangServe App with API Key Auth
# app.py
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langserve import add_routes
from agent import get_graph

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the graph once per process and mount the LangServe routes here:
    # @app.on_event("startup") handlers are ignored when a lifespan is set
    graph = await get_graph()
    add_routes(app, graph, path="/agent")
    yield

app = FastAPI(title="LangGraph ReAct Agent", version="1.0", lifespan=lifespan)

@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    if request.url.path.startswith("/agent"):
        key = request.headers.get("X-API-Key")
        if key != os.environ.get("AGENT_API_KEY", ""):
            # returning the response directly: exceptions raised in
            # middleware bypass FastAPI's exception handlers
            return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
    return await call_next(request)
@app.get("/health")
async def health():
return {"status": "ok"}
Step 3: Multi-Stage Dockerfile
# ---- Build stage: install deps into /install ----
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# ---- Runtime stage: minimal image ----
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
# Non-root user for security
RUN adduser --disabled-password --gecos "" appuser
USER appuser
EXPOSE 8000
# python:3.12-slim ships without curl, so probe /health with the stdlib instead
HEALTHCHECK --interval=10s --timeout=3s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Step 4: docker-compose with PostgreSQL and LangSmith
# docker-compose.yml
version: "3.9"
services:
db:
image: postgres:16-alpine
environment:
POSTGRES_DB: langgraph
POSTGRES_USER: langgraph
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U langgraph"]
interval: 5s
timeout: 3s
retries: 5
agent:
build: .
ports:
- "8000:8000"
environment:
DATABASE_URL: "postgresql://langgraph:${POSTGRES_PASSWORD}@db:5432/langgraph"
OPENAI_API_KEY: ${OPENAI_API_KEY}
AGENT_API_KEY: ${AGENT_API_KEY}
# LangSmith tracing
LANGCHAIN_TRACING_V2: "true"
LANGCHAIN_API_KEY: ${LANGSMITH_API_KEY}
LANGCHAIN_PROJECT: "react-agent-prod"
depends_on:
db:
condition: service_healthy
restart: unless-stopped
volumes:
pgdata:
Start the stack and run a smoke test:
docker compose up --build -d
curl -X POST http://localhost:8000/agent/invoke \
-H "Content-Type: application/json" \
-H "X-API-Key: $AGENT_API_KEY" \
-d '{
"input": {"messages": [{"role": "user", "content": "What is 12 * 8?"}]},
"config": {"configurable": {"thread_id": "smoke-test-001"}}
}'
Step 5: Load Test with k6
Validate concurrency before launch. Each virtual user gets its own thread_id so they exercise separate checkpoint paths independently.
// load-test.js
import http from "k6/http";
import { check, sleep } from "k6";
export const options = { vus: 20, duration: "60s" };
export default function () {
const payload = JSON.stringify({
input: { messages: [{ role: "user", content: "Summarize the benefits of async IO." }] },
config: { configurable: { thread_id: `loadtest-user-${__VU}` } },
});
const res = http.post("http://localhost:8000/agent/invoke", payload, {
headers: {
"Content-Type": "application/json",
"X-API-Key": __ENV.AGENT_API_KEY,
},
timeout: "30s",
});
check(res, {
"status 200": (r) => r.status === 200,
"response time < 10s": (r) => r.timings.duration < 10000,
});
sleep(2);
}
Run with k6 run -e AGENT_API_KEY=your-key load-test.js and watch the PostgreSQL connection pool and LangSmith trace dashboard in parallel.
LangSmith: Tracing, Evaluation, and Debugging Production Agent Runs
LangSmith is the observability layer for LangChain and LangGraph applications. When LANGCHAIN_TRACING_V2=true, every LLM call, every tool invocation, and every graph node traversal is automatically captured; no code changes required.
Enabling Tracing
# .env (or container environment)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...              # from smith.langchain.com
LANGCHAIN_PROJECT=react-agent-prod # project name in LangSmith dashboard
What is captured per run:
| Span type | Captured fields |
| --- | --- |
| LLM call | Model name, prompt, response, latency, input tokens, output tokens, cost |
| Tool call | Tool name, input arguments, output, latency, errors |
| Graph node | Node name, input state, output state, execution time |
| Chain | Parent/child span hierarchy, total latency, total cost |
Reading a Trace to Debug a Failed Run
1. Open the LangSmith dashboard and select your project.
2. Filter runs by Status: Error and Date: last 24h.
3. Click the failed run; the trace tree expands all spans in execution order.
4. Red-highlighted spans threw exceptions; click one to see the full stack trace and the exact input that triggered the failure.
5. Use "Run as experiment" to replay the exact same input with a patched version of your graph without touching production.
Example trace tree for a failed ReAct run:
graph.invoke (1.8s)
├── node: classify_intent (0.3s)
├── node: search_tool (0.6s)
└── node: format_response (0.2s)
    ↳ KeyError: 'product_name' not found in tool output
    ↳ Input state: {"search_results": [], "messages": [...]}
The trace immediately tells you which node failed, what state it received, and why, in seconds rather than hours of log mining.
Building an Evaluation Dataset for Regression Testing
from langsmith import Client
from langsmith.evaluation import aevaluate
from langserve import RemoteRunnable
client = Client()
agent = RemoteRunnable("http://localhost:8000/agent")
# Score whether the agent included a resolution in its response
def resolution_evaluator(run, example):
messages = run.outputs.get("messages", [])
last_message = messages[-1]["content"] if messages else ""
has_resolution = any(
kw in last_message.lower()
for kw in ["resolved", "completed", "refunded", "escalated"]
)
return {"key": "has_resolution", "score": int(has_resolution)}
# Run evaluation against the golden dataset
results = await aevaluate(
agent.ainvoke,
data="support-agent-golden-v1", # dataset created from production traces
evaluators=[resolution_evaluator],
experiment_prefix="deploy-v2",
max_concurrency=5,
)
print(f"Resolution rate: {results.results[0].score:.2%}")
Wire this evaluation into your CI/CD pipeline. If resolution_rate drops more than 5% from the baseline, block the deployment automatically.
Lessons Learned
- `InMemorySaver` is a deployment hazard, not just a development convenience. It works in every test and fails silently in every multi-worker deploy. Make `AsyncPostgresSaver` a hard requirement in your production deploy checklist, enforced at code review, not left to team memory.
- `thread_id` is the session boundary and the security boundary. Derive it on the server from verified identity; never trust it from the request body. Use a deterministic, scoped ID (`f"user-{user_id}-session-{session_id}"`) so users can resume sessions across workers and across restarts, while being unable to read each other's threads.
- LangSmith trace overhead is negligible; trace storage costs are not. Profile your span volume before going live. Enable `LANGCHAIN_TRACING_SAMPLE_RATE=0.10` for high-throughput paths and always trace on error regardless of sample rate.
- LangServe is a rapid onramp, not a full production platform. You own auth, rate limiting, background tasks, and monitoring. If you find yourself reimplementing all four, evaluate LangGraph Platform before finishing the fifth.
- Your graph code is portable; your deployment wrapper is not. Keep graph modules free of LangGraph Platform SDK imports. This makes it a one-day migration to move from managed to self-hosted, not a one-month rewrite.
- Load-test with realistic `thread_id` diversity and multi-turn conversations. A 5-tool agent that looks fine with 1 concurrent user saturates 4 workers at just 2 concurrent users if each tool call blocks. Use `k6` or `locust` with actual turn sequences before declaring production-ready.
- Checkpoint table growth is a sleeper issue. One 20-turn session generates 20+ rows. Set a TTL archival job from day one, preferably before your first real traffic, not after your first slow-query alert at 3 AM.
TLDR: Summary and Key Takeaways
TLDR: Swap InMemorySaver → PostgresSaver, add LangServe + Docker, trace with LangSmith.
- `InMemorySaver` silently corrupts multi-user state across workers; replace it with `AsyncPostgresSaver` before any multi-worker deployment, no exceptions.
- LangServe gives you `/invoke`, `/stream`, and `/batch` in ~20 lines of FastAPI setup, but you own auth, rate limiting, and background task support.
- LangGraph Platform handles auth, persistence, background runs, and cron out of the box, at the cost of deployment-format lock-in; your graph logic remains portable Python.
- `AsyncPostgresSaver` uses async connection pooling and PostgreSQL row-level locking for safe concurrent writes; checkpoint reads add ~8–15 ms of latency, dwarfed by LLM call latency.
- Capacity formula: $W = \lceil (RPS \times \bar{L}) / u \rceil$; LLM latency dominates, so you need far more workers than checkpoint I/O would suggest.
- LangSmith's `LANGCHAIN_TRACING_V2=true` is zero-code observability; use trace replay to debug production failures and `aevaluate()` to gate deployments on quality regressions.
- `thread_id` is both the session key and the isolation boundary; make it deterministic, derived server-side from verified identity, and scoped to the user + session pair.
One-liner to remember: Deploying LangGraph is a checklist: swap the saver, add auth, containerize, trace everything, and plan your checkpoint table for the scale you expect in 90 days, not the scale you have today.
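The capacity formula $W = \lceil (RPS \times \bar{L}) / u \rceil$ is easy to sanity-check numerically: $RPS \times \bar{L}$ is the number of requests in flight at once, and dividing by the target utilization $u$ leaves headroom. A worked example with illustrative traffic numbers:

```python
import math

def workers_needed(rps: float, mean_latency_s: float, target_utilization: float) -> int:
    # W = ceil((RPS * mean latency) / target utilization):
    # in-flight requests, divided by how "full" each worker is allowed to run.
    return math.ceil((rps * mean_latency_s) / target_utilization)

# Example: 10 req/s at 4 s mean end-to-end latency (LLM-dominated),
# targeting 70% worker utilization -> 40 in-flight requests / 0.7.
print(workers_needed(10, 4.0, 0.7))  # 58
```

Note how the 4-second LLM-dominated latency, not checkpoint I/O, is what drives the worker count into the dozens.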
Practice Quiz
You deploy a LangGraph agent with `InMemorySaver` behind a load balancer with 3 uvicorn workers. A user sends turn 1 (routed to Worker 1) and then turn 2 (routed to Worker 2). What happens?

A) The agent correctly resumes conversation history from Worker 1's memory
B) The agent starts a fresh conversation with no prior context and no error
C) The agent throws a `KeyError` because the `thread_id` is not found
D) The load balancer automatically synchronizes state between workers

Correct Answer: B
Your LangGraph app has 500 daily active users, each running 4 sessions per day, 6 turns per session, 600 tokens per turn (input + output combined), using GPT-4o at $0.005 per 1,000 tokens. What is the estimated monthly LLM cost?

A) $108
B) $1,080
C) $10,800
D) $108,000

Correct Answer: B (500 × 4 × 6 × 600 tokens/day × $0.000005/token × 30 days = $1,080)
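The arithmetic behind answer B can be checked directly; the usage and price figures are the ones given in the question:

```python
users, sessions_per_day, turns_per_session, tokens_per_turn = 500, 4, 6, 600
price_per_token = 0.005 / 1000  # $0.005 per 1,000 tokens

daily_tokens = users * sessions_per_day * turns_per_session * tokens_per_turn
monthly_cost = daily_tokens * price_per_token * 30  # 30-day month

print(daily_tokens)          # 7200000 tokens per day
print(round(monthly_cost))   # 1080 dollars per month
```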
Your PostgresSaver write latency spikes from 10 ms (p50) to 450 ms (p50) after three weeks in production with no code changes. Which root cause is most likely?

A) LangSmith trace uploads are blocking the asyncpg event loop
B) The `checkpoints` table has grown to 50+ million rows with no index on `thread_id`
C) The asyncpg connection pool is too large, causing PostgreSQL OOM
D) The OpenAI API rate limit is throttling checkpoint read requests

Correct Answer: B
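For the failure mode in option B, the usual mitigations are an index on the thread lookup path plus the TTL archival job mentioned in the takeaways. A minimal sketch of the maintenance statements, assuming a `checkpoints` table with `thread_id` and a `created_at` timestamp column; your checkpointer's actual schema may differ, so verify column names before running any DDL:

```python
# Hypothetical maintenance SQL; table and column names are assumptions,
# not the checkpointer's guaranteed schema.
CREATE_INDEX = """
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_checkpoints_thread_id
ON checkpoints (thread_id);
"""

# TTL archival: drop checkpoint rows older than the retention window.
ARCHIVE_OLD_CHECKPOINTS = """
DELETE FROM checkpoints
WHERE created_at < now() - interval '30 days';
"""
```

`CREATE INDEX CONCURRENTLY` avoids locking the table during the build, which matters once the table is already tens of millions of rows deep.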
Open-ended challenge: A team wants to deliver daily personalized market-summary reports for 1,200 subscribers, with each report taking 30–40 seconds to generate (multiple tool calls). They propose using LangServe as the delivery mechanism: trigger a `POST /agent/invoke` per subscriber at 06:00 UTC from a cron job. Explain why this architecture is fragile at scale and describe a more robust alternative, including which LangGraph deployment option you would choose and how you would handle failures mid-report. (No single correct answer; justify your design choices.)
Related Posts
AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails
LLM Skill Registries, Routing Policies, and Evaluation for Production Agents
Skills vs LangChain, LangGraph, MCP, and Tools: A Practical Architecture Guide
