
Deploying LangGraph Agents: LangServe, Docker, LangGraph Platform, and Production Observability

Deploy LangGraph agents to production: LangServe, Docker, PostgresSaver, LangGraph Platform, and LangSmith observability.

Abstract Algorithms · 27 min read

TLDR: Swap InMemorySaver → PostgresSaver, add LangServe + Docker, trace with LangSmith.


📖 The Demo-to-Production Gap: Why Notebook Agents Fail at Scale

Your LangGraph agent works perfectly in the demo. You deploy it to a single FastAPI instance. Two users hit it at the same time. They see each other's conversation history. The InMemorySaver you used in development doesn't know about concurrency, and it never will, because it was never designed for it.

This is the most common silent failure mode in LangGraph deployments, but it is only one of five production requirements that notebook-style agents don't satisfy:

| Requirement | Dev/Notebook | Production need |
|---|---|---|
| State persistence | InMemorySaver (per-process, lost on restart) | External checkpointer (PostgresSaver, Redis) |
| Concurrency safety | Single-user loop | Row-level-locked checkpoints per thread_id |
| Authentication | None (any caller can invoke) | API key middleware, JWT, or OAuth |
| Observability | print() statements | Distributed traces in LangSmith or OpenTelemetry |
| Horizontal scaling | 1 uvicorn worker | N stateless workers + shared external state store |

The rest of this post covers exactly how to close each of these gaps: LangServe for the HTTP layer, Docker and docker-compose for packaging, PostgresSaver for multi-worker state, and LangSmith for end-to-end observability. It ends with a worked deployment of the ReAct agent from earlier in this series.


🔍 Deployment Options Compared: LangServe vs LangGraph Platform vs Self-Hosted

Before writing a single line of deployment code, choose the right layer. The three main options have fundamentally different trade-off profiles.

| Aspect | LangServe | LangGraph Platform / Cloud | Self-Hosted (K8s) |
|---|---|---|---|
| Setup complexity | FastAPI wrapper, ~20 lines | langgraph.json + langgraph deploy | Full DevOps pipeline |
| Auth | Add middleware yourself | Built-in API keys + JWT | Your choice |
| Persistence | PostgresSaver (you manage) | Managed checkpointer | You manage |
| Background runs | ❌ Request/response only | ✅ Yes | ✅ Yes |
| Cron-triggered agents | ❌ No | ✅ Yes | Custom scheduler |
| Multi-tenant isolation | Manual (thread_id partitioning) | Per-user thread isolation built-in | Manual |
| Vendor lock-in | None (pure FastAPI) | LangChain Inc. deployment format | None |
| Cost model | Your infra + LLM | Platform fee + LLM | Infra + DevOps labor |
| Observability | Add LangSmith env vars | LangSmith integrated | LangSmith or OTel |

When to use which: LangServe for rapid prototyping and internal tools. LangGraph Platform when you need background tasks, cron scheduling, and minimal ops overhead. Self-hosted Kubernetes when you have data residency requirements, regulated workloads, or traffic volumes that make the platform fee uneconomic.


⚙️ LangServe: Wrapping a LangGraph Agent as a FastAPI App

LangServe wraps any LangChain Runnable, including a compiled LangGraph, as a FastAPI application. Its core API is a single function: add_routes.

```python
# app.py
import os
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langserve import add_routes
from agent import build_agent  # your compiled LangGraph

app = FastAPI(title="LangGraph Agent API", version="1.0")

# --- API key middleware ---
@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    if request.url.path.startswith("/agent"):
        key = request.headers.get("X-API-Key")
        if key != os.environ["AGENT_API_KEY"]:
            # Return a response directly: an HTTPException raised inside
            # Starlette middleware bypasses FastAPI's exception handlers
            # and surfaces as a 500, not a 401.
            return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
    return await call_next(request)

# --- Mount the graph ---
agent = build_agent()

add_routes(
    app,
    agent,
    path="/agent",
    enable_feedback_endpoint=True,
    enable_public_trace_link_endpoint=True,
)

@app.get("/health")
async def health():
    return {"status": "ok"}
```

add_routes auto-generates six endpoints from the graph's input/output schema:

| Endpoint | Method | Purpose |
|---|---|---|
| /agent/invoke | POST | Single synchronous invocation |
| /agent/stream | POST | Server-sent event stream of intermediate state |
| /agent/batch | POST | Multiple inputs processed in parallel |
| /agent/input_schema | GET | JSON Schema for the request body |
| /agent/output_schema | GET | JSON Schema for the response body |
| /agent/playground | GET | Browser-based interactive playground |

RemoteRunnable client: callers can use a typed Python client instead of raw HTTP:

```python
from langserve import RemoteRunnable

agent = RemoteRunnable("http://localhost:8000/agent")

# Single turn
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Summarize Q4 results"}]},
    config={"configurable": {"thread_id": "user-42-session-7"}},
)

# Streaming
for chunk in agent.stream(
    {"messages": [{"role": "user", "content": "Now compare to Q3"}]},
    config={"configurable": {"thread_id": "user-42-session-7"}},
):
    print(chunk)
```

Limitation: LangServe is in maintenance mode as of 2025. LangChain Inc. directs new projects toward LangGraph Platform for managed deployments. LangServe remains the right choice for self-hosted scenarios, but you own all the operational scaffolding.


🧠 Deep Dive: Concurrency, Persistence, and Production-Grade Checkpointing

The Internals

When a POST /agent/invoke request arrives at a LangServe worker, here is the exact execution path:

  1. Deserialize: LangServe uses the graph's InputType schema to parse the JSON body.

  2. Config extraction: the config.configurable.thread_id is extracted and passed to the checkpointer.

  3. Checkpoint read: AsyncPostgresSaver takes a connection from its psycopg connection pool and runs a SELECT for the latest checkpoint row matching thread_id. This read acquires a row-level shared lock.

  4. Graph execution: nodes run in topological order. After each node that modifies state, the checkpointer writes a new UPSERT into the checkpoints table, keyed by (thread_id, checkpoint_id). Large state blobs overflow into checkpoint_blobs.

  5. Checkpoint write safety: PostgreSQL's row-level locking prevents two workers from writing conflicting checkpoints for the same thread_id. Two workers processing different thread IDs have zero contention.

  6. Serialize and respond: the final MessagesState is serialized to JSON and returned.

Why InMemorySaver silently breaks with multiple workers:

Worker 1 memory: { "user-42": [msg1, msg2] }
Worker 2 memory: { }           ← empty, knows nothing about msg1 or msg2

User's turn 3 routes to Worker 2 → fresh conversation, no history

There is no error, no exception, just wrong behavior. The production fix is to replace InMemorySaver before any multi-worker deploy:

```python
# Development (single process only)
from langgraph.checkpoint.memory import MemorySaver
graph = builder.compile(checkpointer=MemorySaver())

# Production (multi-worker safe)
import os
from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

async def build_production_graph():
    # AsyncPostgresSaver is built on psycopg (psycopg 3), not asyncpg:
    # hand it an async connection pool and call setup() once.
    pool = AsyncConnectionPool(
        conninfo=os.environ["DATABASE_URL"], min_size=2, max_size=10, open=False
    )
    await pool.open()
    checkpointer = AsyncPostgresSaver(pool)
    await checkpointer.setup()   # idempotent: creates tables if absent
    return builder.compile(checkpointer=checkpointer)
```

Performance Analysis

The LLM API call dominates latency by two to three orders of magnitude. Every other component (checkpoint I/O, LangSmith trace upload) is negligible by comparison at moderate scale.

| Metric | 1 uvicorn worker | 4 uvicorn workers (under load) |
|---|---|---|
| Throughput (simple ReAct, 2-turn) | ~8 RPS | ~28–32 RPS |
| PostgresSaver write latency (p50) | ~8 ms | ~12 ms |
| PostgresSaver write latency (p99) | ~35 ms | ~80 ms |
| LangSmith trace upload overhead | ~3 ms (async, non-blocking) | ~3 ms |
| GPT-4o API call latency (p50) | ~1.2 s | ~1.2 s |
| Total round-trip p50 (1 LLM call) | ~1.4 s | ~1.4 s |

Where real bottlenecks appear:

  • Hot thread_ids: if many workers retry the same session simultaneously (e.g., a polling client), write lock contention on a single thread_id row surfaces at high concurrency. Fix: enforce one-writer-per-thread via application-level routing or advisory locks.

  • Checkpoint table growth: a single 20-turn session produces 20+ rows. At 100k daily sessions, the table reaches millions of rows within weeks. Add created_at indexes and an archival TTL policy.

  • Connection pool exhaustion: each LangServe worker shares one psycopg pool. With 4 workers × max_size=10, you have 40 PostgreSQL connections. Size the DB accordingly; use PgBouncer in transaction mode for >100 connections.

Mathematical Model

Capacity planning (workers needed):

$$W = \left\lceil \frac{RPS \times \bar{L}}{u} \right\rceil$$

Where:

  • $RPS$ = target requests per second

  • $\bar{L}$ = average graph invocation latency in seconds (dominated by LLM calls)

  • $u$ = target worker utilization (0.7 is a safe default; above 0.85 causes queue buildup)

Worked example: Target 10 RPS, average latency 2.5 s (one GPT-4o call + tool calls), 70% utilization:

$$W = \left\lceil \frac{10 \times 2.5}{0.7} \right\rceil = \lceil 35.7 \rceil = 36 \text{ workers}$$

That is 9 container replicas × 4 uvicorn workers each.

Monthly LLM cost estimate:

$$C_{monthly} = U \times S \times T_{turns} \times T_{tokens} \times \frac{P}{1000} \times 30$$

Where:

  • $U$ = daily active users

  • $S$ = sessions per user per day

  • $T_{turns}$ = average turns per session

  • $T_{tokens}$ = average tokens per turn (input + output combined)

  • $P$ = model price per 1 000 tokens (e.g., GPT-4o: $0.005/1k output tokens)

Worked example: 1 000 DAU, 3 sessions/day, 5 turns/session, 800 tokens/turn:

$$C_{monthly} = 1000 \times 3 \times 5 \times 800 \times \frac{0.005}{1000} \times 30 = \$1{,}800/\text{month}$$

Doubling turns/session (to 10) doubles the cost to $3 600/month: turn count is the primary cost lever, not user count.
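Both formulas drop into a few lines of Python for quick what-if planning. A convenience sketch: the function names are mine, not from any library.

```python
import math

def workers_needed(rps: float, avg_latency_s: float, utilization: float = 0.7) -> int:
    """W = ceil(RPS * L / u), the capacity formula above."""
    return math.ceil(rps * avg_latency_s / utilization)

def monthly_llm_cost(dau: int, sessions_per_day: float, turns_per_session: float,
                     tokens_per_turn: float, price_per_1k_tokens: float) -> float:
    """C = U * S * T_turns * T_tokens * (P / 1000) * 30, the cost formula above."""
    return (dau * sessions_per_day * turns_per_session * tokens_per_turn
            * price_per_1k_tokens / 1000 * 30)

print(workers_needed(10, 2.5))                   # 36
print(monthly_llm_cost(1000, 3, 5, 800, 0.005))  # 1800.0
```

Rerunning the worked examples this way makes it easy to see the sensitivity: halving latency halves worker count, while doubling turns doubles cost.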


πŸ—οΈ Beyond Request-Response: Auth Patterns, Background Runs, and LangGraph Platform

Authentication and Per-User Thread Isolation

LangServe provides no authentication out of the box; everything added in the middleware snippet earlier is your responsibility. The three common production patterns are:

Pattern 1: Static API key (internal tools and service-to-service)

```python
# In FastAPI middleware (the pattern shown in the LangServe section):
# return the 401 response directly rather than raising, since exceptions
# raised in middleware bypass FastAPI's HTTPException handlers.
key = request.headers.get("X-API-Key")
if key != os.environ["AGENT_API_KEY"]:
    return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
```

Simple, effective for a single service identity. Does not support per-user tracking or revocation.

Pattern 2: JWT with per-user thread isolation

```python
import os
import jwt  # PyJWT
from fastapi import Depends, HTTPException, Request

SECRET = os.environ["JWT_SECRET"]

async def get_current_user(request: Request) -> dict:
    token = request.headers.get("Authorization", "").removeprefix("Bearer ")
    try:
        payload = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")
    return payload  # contains {"sub": "user-42", "tenant": "acme"}

@app.post("/agent/invoke")
async def invoke(body: dict, user: dict = Depends(get_current_user)):
    # thread_id is derived server-side from the verified JWT claims
    thread_id = f"{user['tenant']}-{user['sub']}-{body['session_id']}"
    return await graph.ainvoke(
        body["input"],
        config={"configurable": {"thread_id": thread_id}},
    )
```

The thread_id construction here is the security boundary. A user who crafts an arbitrary thread_id in the request body cannot access another user's thread, because the server derives the real thread_id from the verified JWT payload.
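The server-side derivation can be factored into a small, testable helper. A sketch: the "::" delimiter matches the namespacing pattern used later in this post, but any reserved separator works, and the validation rule is my addition.

```python
def make_thread_id(tenant: str, user: str, session: str) -> str:
    """Build a tenant-scoped thread_id from server-verified identity.

    The client never supplies this value directly; it is derived from the
    JWT payload, so one user cannot address another user's thread.
    """
    for part in (tenant, user, session):
        if "::" in part:
            # Reject identity fields containing the delimiter so a crafted
            # value cannot splice itself into another tenant's namespace.
            raise ValueError("identity fields must not contain '::'")
    return f"{tenant}::{user}::{session}"

print(make_thread_id("acme", "user-42", "session-7"))  # acme::user-42::session-7
```

Keeping this in one function means the isolation rule is enforced in exactly one place, which is easy to audit at code review.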

Pattern 3: LangGraph Platform built-in auth

LangGraph Platform handles token validation and thread isolation natively. You pass a user_id in the run config and the Platform enforces that each user can only access their own threads. No middleware required.

Background Runs and Cron-Triggered Agents with LangGraph Platform

Some agent workloads do not fit a request/response model:

  • Nightly report generators: run for 30–60 seconds per user, triggered at a fixed time

  • Monitoring agents: poll an external API every 5 minutes and write findings to a thread

  • Reactive agents: triggered by a webhook (new Stripe event, new GitHub PR) rather than a user request

LangGraph Platform supports these with background runs and cron schedules:

langgraph.json (LangGraph Platform deployment config):

```json
{
  "dependencies": ["."],
  "graphs": {
    "react_agent": "./agent.py:graph",
    "report_agent": "./report_agent.py:report_graph"
  },
  "env": ".env",
  "http": {
    "port": 8000
  }
}
```

```bash
# Local dev server (hot reload, no Docker required)
langgraph dev

# Deploy to LangGraph Cloud
langgraph deploy --project my-agent-project
```

Background run (via SDK):

```python
from langgraph_sdk import get_client

client = get_client(url="https://my-agent.langchain.app")

# Create a background run; returns immediately with run metadata
run = await client.runs.create(
    thread_id=f"report-{user_id}-{today}",
    assistant_id="report_agent",
    input={"user_id": user_id, "date": today},
)

# Poll for completion (the SDK returns runs as dicts)
result = await client.runs.join(thread_id=run["thread_id"], run_id=run["run_id"])
```

Cron runs are configured via the Platform dashboard or API: specify a cron expression, the assistant, and the default input. The Platform creates a new background run on each trigger.

Edge Cases in Production Graph Design

| Edge case | Symptom | Guard pattern |
|---|---|---|
| Tool returns empty result set | Agent loops indefinitely, asking the same question | Add a max_iterations counter to state; route to END on limit |
| LLM returns malformed JSON for structured output | Pydantic ValidationError on the node output | Wrap node in a try/except; write error to state and route to a fallback node |
| Checkpoint write fails mid-run (DB unavailable) | Next invocation replays from last committed checkpoint, may duplicate tool calls | Make tool calls idempotent; use run_id as idempotency key in external APIs |
| thread_id collision across tenants | User A sees User B's history | Namespace thread_ids: f"{tenant_id}::{user_id}::{session_id}" (colons are valid in thread_id strings) |
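The first guard in the table reduces to a counter plus a conditional route. A minimal pure-Python sketch of that logic (in a real graph this would be a conditional edge; the state field name `iterations` is my choice):

```python
MAX_ITERATIONS = 5

def should_continue(state: dict) -> str:
    """Route to 'end' once the loop counter hits the cap, else loop again."""
    return "end" if state.get("iterations", 0) >= MAX_ITERATIONS else "continue"

# Simulate an agent that never finds a useful tool result and would
# otherwise loop forever: each pass increments the counter, then the
# conditional edge decides whether to run another pass.
state = {"iterations": 0}
route = "continue"
while route == "continue":
    state["iterations"] += 1        # the agent node runs
    route = should_continue(state)  # the conditional edge decides

print(state["iterations"], route)   # 5 end
```

The same predicate, registered via add_conditional_edges with a mapping like {"continue": "agent", "end": END}, bounds the loop inside the compiled graph.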

📊 Production Architecture: From Load Balancer to LangSmith

The following diagram shows the full request path for a multi-worker LangServe deployment with PostgreSQL persistence and LangSmith observability. The dashed arrows are async/non-blocking paths that do not add to user-facing latency.

```mermaid
flowchart TD
    A["Client / Browser\n(RemoteRunnable or HTTP)"] -->|"HTTPS POST /agent/invoke"| B["Load Balancer\nnginx / AWS ALB"]
    B --> C1["FastAPI Worker 1\nLangServe + agent.py"]
    B --> C2["FastAPI Worker 2\nLangServe + agent.py"]
    B --> C3["FastAPI Worker N\nLangServe + agent.py"]

    C1 & C2 & C3 -->|"psycopg pool\nthread_id checkpoint UPSERT"| D[("PostgreSQL 16\nAsyncPostgresSaver\ncheckpoints table")]
    C1 & C2 & C3 -->|"HTTPS: LLM prompts"| E["OpenAI / Anthropic\nLLM API"]
    C1 & C2 & C3 -.->|"async trace upload\n(non-blocking)"| F["LangSmith\nTracing & Evaluation"]
    D -.->|"checkpoint read on session resume"| C1

    style D fill:#336791,color:#ffffff
    style F fill:#1a73e8,color:#ffffff
    style E fill:#10a37f,color:#ffffff
    style B fill:#f0a500,color:#000000
```

Figure: Each stateless FastAPI worker reads and writes checkpoints through a shared PostgreSQL pool. LangSmith receives traces asynchronously; its availability does not affect request-response latency.

Key architectural properties:

  • Workers are stateless: any worker can handle any request for any thread_id, because state lives in PostgreSQL, not in process memory.

  • The load balancer can use round-robin or least-connections routing; sticky sessions are not required.

  • The LLM API is an external dependency: implement retry with exponential backoff and a circuit breaker on 429/503 responses.

  • LangSmith is fire-and-forget from the worker's perspective. If LangSmith is unavailable, traces are queued locally and uploaded on recovery; no request fails.
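The retry-with-backoff property above can be sketched in a few lines. A hedged sketch, not a library recipe: `TransientError` stands in for an HTTP 429/503, and the injectable `sleep` exists only so the example is testable.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (HTTP 429/503 from the LLM API)."""

def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry fn with exponential backoff plus jitter; re-raise on exhaustion."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            # Delays grow 0.5 s, 1 s, 2 s, ... with up to 100 ms of jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Example: a fake LLM call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

result = call_with_backoff(flaky_call, sleep=lambda s: None)
print(result, attempts["n"])  # ok 3
```

A production version would also cap the total delay and trip a circuit breaker after repeated exhaustion, so a degraded LLM API sheds load instead of queueing it.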


🌍 Real-World Applications: How Teams Run LangGraph in Production

Case Study 1: Customer Support Agent at a SaaS Company

A B2B SaaS team deployed a LangGraph support agent handling 50 000 monthly tickets. The agent used a ReAct pattern: classify intent → search knowledge base → draft response → optionally escalate.

Deployment details:

  • 3 ECS tasks × 4 uvicorn workers = 12 workers total

  • PostgresSaver on RDS PostgreSQL db.t3.medium, averaging 8 ms checkpoint write latency

  • thread_id = f"tenant-{tenant_id}-ticket-{ticket_id}": one thread per ticket, strict tenant isolation

  • LangSmith used for trace replay when a ticket was escalated but the customer reported the agent gave contradictory answers

Scaling lesson: The team initially used 2 RDS instances (one per region) and ran into cross-region checkpoint replication lag when the load balancer briefly routed a user's follow-up to the other region. Fix: pin sessions to a region via the load balancer's geo-routing rules, not the database layer.

Case Study 2: Daily Market Summary Report Agent

A fintech team needed a LangGraph agent to generate a personalized market summary for 1 200 subscribers every morning at 06:00 UTC. Each report took 25–40 seconds (10+ tool calls: price fetcher, news scraper, portfolio analyzer, chart renderer).

Why LangServe alone was insufficient: LangServe operates on a request/response model. A 40-second HTTP request is fragile: load balancers time out, clients disconnect, and there are no retry semantics. Background task runs were required.

Solution: LangGraph Platform with a cron-triggered background run:

langgraph.json:

```json
{
  "dependencies": ["."],
  "graphs": {
    "market_summary": "./agent.py:market_summary_graph"
  },
  "env": ".env"
}
```

The cron job created one background run per subscriber with thread_id = f"report-{user_id}-{date}". Results were written to the thread's final state and polled by the notification service via the Platform API.

Scaling lesson: Background run queues have depth limits on the managed platform. For more than 500 simultaneous background runs, teams either paginate cron triggers across time windows (06:00–06:30 in batches of 100) or move to self-hosted with a dedicated task queue (Celery, ARQ, or Temporal).


⚖️ Trade-offs and Failure Modes: LangServe Limits, Platform Lock-in, and Trace Storage Costs

LangServe Limitations

LangServe is a thin FastAPI adapter: it gives you the right endpoints fast, but every production concern is your responsibility:

  • No built-in auth: the middleware example earlier is the minimal pattern; production systems need JWT validation, per-tenant rate limiting, and RBAC.

  • No background runs: all invocations are synchronous HTTP. For long-running agents (>10 s), clients must poll or use /stream with SSE.

  • No cron scheduling: you need an external scheduler (cron job, Celery beat, cloud scheduler) to trigger periodic agents.

  • Maintenance mode: LangChain Inc. has publicly indicated LangServe receives only security patches; feature development moved to LangGraph Platform.

Platform Lock-in

LangGraph Platform's deployment contract (langgraph.json, the Platform SDK, the background run API) is proprietary. Your graph code is portable: it is pure Python with LangGraph core as the only dependency. Your deployment wrapper is not. If you later need to self-host, you rewrite the deployment scaffolding but keep the graph logic unchanged. Design your graph modules with zero imports of langgraph_sdk or platform-specific APIs.

PostgresSaver as a Bottleneck at Scale

PostgresSaver is well-suited for hundreds of concurrent users. At thousands of concurrent sessions:

| Scale | Symptom | Mitigation |
|---|---|---|
| 100–500 concurrent sessions | Checkpoint write latency p99 climbs to 100–200 ms | Tune the connection pool; add PgBouncer |
| 500–2000 concurrent sessions | Table size grows to 10M+ rows; index scans slow | Partition by created_at; archive old threads |
| 2000+ concurrent sessions | PostgreSQL becomes throughput bottleneck | Redis checkpointer for hot threads; PostgreSQL for cold/archived threads |

Trace Storage Costs

LangSmith charges per trace ingested and per trace stored. At high volume:

  • A single GPT-4o ReAct run with 5 tool calls = 6–8 LangSmith spans

  • 1 000 RPS × 8 spans = 8 000 spans/second ≈ 700 M spans/day

  • At LangSmith pricing, this easily reaches $1 000–$3 000/month

Mitigation strategies:

  1. Trace sampling: LANGCHAIN_TRACING_SAMPLE_RATE=0.10 traces 10% of requests. Always trace on error regardless of sample rate.

  2. TTL policy: set trace retention to 14 days for high-volume projects; 90 days for evaluation baseline projects.

  3. Disable verbose logging: set LANGCHAIN_HIDE_INPUTS=true and LANGCHAIN_HIDE_OUTPUTS=true for paths where input/output capture adds no debugging value.

Checkpoint Table Footprint

A subtle failure mode: the checkpoints table grows without bound unless you set a retention policy. One session with 20 turns writes 20+ rows. At 100 000 daily sessions:

$$\text{Rows/day} = 100{,}000 \times 20 = 2{,}000{,}000 \text{ rows/day}$$

Within 30 days: 60 million rows. Without proper indexing on (thread_id, created_at) and periodic archival, checkpoint reads degrade to full-table scans.


🧭 Decision Guide: LangServe vs LangGraph Platform vs Self-Hosted Kubernetes

| Situation | Recommendation |
|---|---|
| Internal tool, small team, fast prototype | LangServe + MemorySaver locally; swap to PostgresSaver before any team-shared deploy |
| Production web app, self-managed infra, simple request/response | LangServe + AsyncPostgresSaver + API key middleware + LangSmith tracing |
| Need background runs, cron agents, minimal ops overhead | LangGraph Platform (managed cloud); expect $200–$2000/month platform fee depending on usage |
| Regulated industry, strict data residency, or air-gapped environment | Self-hosted K8s + AsyncPostgresSaver + self-managed LangSmith (or OpenTelemetry export to Jaeger/Grafana) |
| Multi-tenant SaaS with strict user isolation | thread_id = f"{tenant_id}-{user_id}-{session_id}"; PostgreSQL row-level security per tenant schema |
| Very high throughput (>200 RPS) | Self-hosted K8s + HPA + PgBouncer + Redis checkpointer for hot threads; PostgreSQL as cold archive |
| Periodic batch agents (nightly reports, weekly summaries) | LangGraph Platform cron runs for simplicity; or self-hosted ARQ/Temporal queue if you need retry semantics and audit logs |

🧪 Practical Example: Full Deployment with Docker Compose, PostgreSQL, and LangSmith

This section deploys the ReAct agent from earlier in the series end-to-end: AsyncPostgresSaver for state, LangServe for HTTP, docker-compose for local production parity, and LangSmith tracing enabled. The full-stack deployment scenario was chosen because it exercises every production requirement from the post's opening table (shared durable state, a real HTTP interface, container isolation, and observability) in a single runnable stack. As you read through the steps, watch how the PostgreSQL connection string flows from docker-compose.yml into the psycopg connection pool into AsyncPostgresSaver: that three-layer chain is the only change between a dev InMemorySaver and a production-ready agent.

Step 1: Agent with PostgresSaver

```python
# agent.py
import os
from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import END, MessagesState, StateGraph
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def call_model(state: MessagesState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("model", call_model)
builder.set_entry_point("model")
builder.add_edge("model", END)

_graph = None  # module-level singleton

async def get_graph():
    global _graph
    if _graph is None:
        # AsyncPostgresSaver runs on a psycopg async connection pool
        pool = AsyncConnectionPool(
            conninfo=os.environ["DATABASE_URL"],
            min_size=2,
            max_size=10,
            open=False,
        )
        await pool.open()
        checkpointer = AsyncPostgresSaver(pool)
        await checkpointer.setup()   # idempotent table creation
        _graph = builder.compile(checkpointer=checkpointer)
    return _graph
```

Step 2: LangServe App with API Key Auth

```python
# app.py
import os
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langserve import add_routes
from agent import get_graph

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the graph (and its connection pool) once per worker, then
    # mount the LangServe routes on the live app. Doing both in the
    # lifespan avoids mixing it with the deprecated on_event("startup").
    graph = await get_graph()
    add_routes(app, graph, path="/agent")
    yield

app = FastAPI(title="LangGraph ReAct Agent", version="1.0", lifespan=lifespan)

@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    if request.url.path.startswith("/agent"):
        key = request.headers.get("X-API-Key")
        if key != os.environ.get("AGENT_API_KEY", ""):
            # Return the response directly: exceptions raised inside
            # middleware bypass FastAPI's HTTPException handlers.
            return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
    return await call_next(request)

@app.get("/health")
async def health():
    return {"status": "ok"}
```

Step 3: Multi-Stage Dockerfile

```dockerfile
# ---- Build stage: install deps into /install ----
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Runtime stage: minimal image ----
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .

# Non-root user for security
RUN adduser --disabled-password --gecos "" appuser
USER appuser

EXPOSE 8000
# Probe with Python's stdlib: curl is not installed in slim images
HEALTHCHECK --interval=10s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

Step 4: docker-compose with PostgreSQL and LangSmith

```yaml
# docker-compose.yml
version: "3.9"

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: langgraph
      POSTGRES_USER: langgraph
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U langgraph"]
      interval: 5s
      timeout: 3s
      retries: 5

  agent:
    build: .
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: "postgresql://langgraph:${POSTGRES_PASSWORD}@db:5432/langgraph"
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      AGENT_API_KEY: ${AGENT_API_KEY}
      # LangSmith tracing
      LANGCHAIN_TRACING_V2: "true"
      LANGCHAIN_API_KEY: ${LANGSMITH_API_KEY}
      LANGCHAIN_PROJECT: "react-agent-prod"
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped

volumes:
  pgdata:
```

Start the stack and run a smoke test:

```bash
docker compose up --build -d

curl -X POST http://localhost:8000/agent/invoke \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $AGENT_API_KEY" \
  -d '{
    "input": {"messages": [{"role": "user", "content": "What is 12 * 8?"}]},
    "config": {"configurable": {"thread_id": "smoke-test-001"}}
  }'
```

Step 5: Load Test with k6

Validate concurrency before launch. Each virtual user gets its own thread_id so they exercise separate checkpoint paths independently.

```javascript
// load-test.js
import http from "k6/http";
import { check, sleep } from "k6";

export const options = { vus: 20, duration: "60s" };

export default function () {
  const payload = JSON.stringify({
    input: { messages: [{ role: "user", content: "Summarize the benefits of async IO." }] },
    config: { configurable: { thread_id: `loadtest-user-${__VU}` } },
  });

  const res = http.post("http://localhost:8000/agent/invoke", payload, {
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": __ENV.AGENT_API_KEY,
    },
    timeout: "30s",
  });

  check(res, {
    "status 200": (r) => r.status === 200,
    "response time < 10s": (r) => r.timings.duration < 10000,
  });

  sleep(2);
}
```

Run with k6 run -e AGENT_API_KEY=your-key load-test.js and watch the PostgreSQL connection pool and LangSmith trace dashboard in parallel.


🛠️ LangSmith: Tracing, Evaluation, and Debugging Production Agent Runs

LangSmith is the observability layer for LangChain and LangGraph applications. When LANGCHAIN_TRACING_V2=true, every LLM call, every tool invocation, and every graph node traversal is automatically captured, with no code changes required.

Enabling Tracing

```bash
# .env (or container environment)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...            # from smith.langchain.com
LANGCHAIN_PROJECT=react-agent-prod   # project name in LangSmith dashboard
```

What is captured per run:

| Span type | Captured fields |
|---|---|
| LLM call | Model name, prompt, response, latency, input tokens, output tokens, cost |
| Tool call | Tool name, input arguments, output, latency, errors |
| Graph node | Node name, input state, output state, execution time |
| Chain | Parent/child span hierarchy, total latency, total cost |

Reading a Trace to Debug a Failed Run

  1. Open the LangSmith dashboard and select your project.

  2. Filter runs by Status: Error and Date: last 24h.

  3. Click the failed run; the trace tree expands all spans in execution order.

  4. Red-highlighted spans threw exceptions: click one to see the full stack trace and the exact input that triggered the failure.

  5. Use "Run as experiment" to replay the exact same input with a patched version of your graph without touching production.

Example trace tree for a failed ReAct run:

```text
✅ graph.invoke (1.8s)
  ✅ node: classify_intent (0.3s)
  ✅ node: search_tool (0.6s)
  ❌ node: format_response (0.2s)
       ↳ KeyError: 'product_name' not found in tool output
       ↳ Input state: {"search_results": [], "messages": [...]}
```

The trace immediately tells you which node failed, what state it received, and why, in seconds rather than hours of log mining.

Building an Evaluation Dataset for Regression Testing

```python
from langsmith import Client
from langsmith.evaluation import aevaluate
from langserve import RemoteRunnable

client = Client()
agent = RemoteRunnable("http://localhost:8000/agent")

# Score whether the agent included a resolution in its response
def resolution_evaluator(run, example):
    messages = run.outputs.get("messages", [])
    last_message = messages[-1]["content"] if messages else ""
    has_resolution = any(
        kw in last_message.lower()
        for kw in ["resolved", "completed", "refunded", "escalated"]
    )
    return {"key": "has_resolution", "score": int(has_resolution)}

# Run evaluation against the golden dataset (inside an async context)
results = await aevaluate(
    agent.ainvoke,
    data="support-agent-golden-v1",     # dataset created from production traces
    evaluators=[resolution_evaluator],
    experiment_prefix="deploy-v2",
    max_concurrency=5,
)

# aevaluate yields result rows asynchronously; inspect per-example scores
async for row in results:
    print(row["evaluation_results"])
```

Wire this evaluation into your CI/CD pipeline. If resolution_rate drops more than 5% from the baseline, block the deployment automatically.
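The gate itself is a one-line comparison. A hedged sketch of the check a CI step might run after aggregating the scores (names are illustrative, not from the LangSmith SDK):

```python
def gate_deploy(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Return True when the deploy may proceed.

    baseline/current are resolution rates in [0, 1]; a drop larger than
    max_drop (5 points by default) blocks the deployment.
    """
    return (baseline - current) <= max_drop

print(gate_deploy(0.92, 0.90))  # True: a 2-point drop is within tolerance
print(gate_deploy(0.92, 0.85))  # False: a 7-point drop blocks the deploy
```

In CI, a False return would translate to a non-zero exit code so the pipeline fails before the image is promoted.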


📚 Lessons Learned

  1. InMemorySaver is a deployment hazard, not just a development convenience. It works in every test, fails silently in every multi-worker deploy. Make AsyncPostgresSaver a hard requirement in your production deploy checklist β€” enforced at code review, not left to team memory.

  2. thread_id is the session boundary and the security boundary. Set it in the caller, not the server. Use a deterministic, scoped ID (f"user-{user_id}-session-{session_id}") so users can resume sessions across workers and across restarts, while being unable to read each other's threads.

  3. LangSmith trace overhead is negligible; trace storage costs are not. Profile your span volume before going live. Enable LANGCHAIN_TRACING_SAMPLE_RATE=0.10 for high-throughput paths and always trace on error regardless of sample rate.

  4. LangServe is a rapid onramp, not a full production platform. You own auth, rate limiting, background tasks, and monitoring. If you find yourself reimplementing all four, evaluate LangGraph Platform before finishing the fifth.

  5. Your graph code is portable; your deployment wrapper is not. Keep graph modules free of LangGraph Platform SDK imports. This makes it a one-day migration to move from managed to self-hosted, not a one-month rewrite.

  6. Load-test with realistic thread_id diversity and multi-turn conversations. A 5-tool agent that looks fine with 1 concurrent user saturates 4 workers at just 2 concurrent users if each tool call blocks. Use k6 or locust with actual turn sequences before declaring production-ready.

  7. Checkpoint table growth is a sleeper issue. One 20-turn session generates 20+ rows. Set a TTL archival job from day one β€” preferably before your first real traffic, not after your first slow-query alert at 3 AM.
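The thread_id convention from lesson 2 fits in a few lines. A sketch, assuming the caller has already authenticated the user (`make_thread_id` is a hypothetical helper, not a LangGraph API):

```python
def make_thread_id(user_id: str, session_id: str) -> str:
    # Deterministic and scoped: the same user + session pair always maps
    # to the same thread (resumable across workers and restarts), and a
    # caller authenticated as one user cannot construct another user's ID.
    return f"user-{user_id}-session-{session_id}"

# Passed through to LangGraph as the configurable thread_id:
config = {"configurable": {"thread_id": make_thread_id("42", "a1b2")}}
```

The key design choice is that the ID is derived from authenticated identity on the caller side, never accepted as a free-form string from the client.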


πŸ“Œ TLDR: Summary and Key Takeaways

TLDR: Swap InMemorySaver β†’ PostgresSaver, add LangServe + Docker, trace with LangSmith.

  • InMemorySaver silently corrupts multi-user state across workers β€” replace with AsyncPostgresSaver before any multi-worker deployment, no exceptions.

  • LangServe gives you /invoke, /stream, /batch in ~20 lines of FastAPI setup, but you own auth, rate limiting, and background task support.

  • LangGraph Platform handles auth, persistence, background runs, and cron out of the box β€” at the cost of deployment format lock-in; your graph logic remains portable Python.

  • AsyncPostgresSaver uses asyncpg connection pooling and PostgreSQL row-level locking for safe concurrent writes; checkpoint reads add ~8–15 ms latency, dwarfed by LLM call latency.

  • Capacity formula: $W = \lceil (RPS \times \bar{L}) / u \rceil$ β€” LLM latency dominates; you need far more workers than checkpoint I/O would suggest.

  • LangSmith's LANGCHAIN_TRACING_V2=true is zero-code observability β€” use trace replay to debug production failures and aevaluate() to gate deployments on quality regressions.

  • thread_id is both the session key and the isolation boundary β€” make it deterministic, caller-controlled, and scoped to the user + session pair.

One-liner to remember: Deploying LangGraph is a checklist β€” swap the saver, add auth, containerize, trace everything, and plan your checkpoint table for the scale you expect in 90 days, not the scale you have today.
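The capacity formula in the bullets above is worth running with real numbers. A sketch with illustrative values (the latencies below are examples, not measurements):

```python
import math

def workers_needed(rps: float, mean_latency_s: float, utilization: float) -> int:
    """W = ceil(RPS * L / u): workers required to serve `rps` requests
    per second at mean request latency `mean_latency_s`, targeting a
    `utilization` busy-fraction per worker (0.7 leaves 30% headroom)."""
    return math.ceil(rps * mean_latency_s / utilization)

# 3 req/s with a 6 s mean agent turn (LLM latency dominates), 70% target:
workers_needed(3, 6.0, 0.7)     # -> 26 workers

# The same load if only checkpoint I/O (~15 ms) mattered:
workers_needed(3, 0.015, 0.7)   # -> 1 worker
```

The two calls make the bullet's point concrete: sizing from checkpoint I/O alone would underestimate the worker count by more than an order of magnitude.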


πŸ“ Practice Quiz

  1. You deploy a LangGraph agent with InMemorySaver behind a load balancer with 3 uvicorn workers. A user sends turn 1 (routed to Worker 1) and then turn 2 (routed to Worker 2). What happens?

    • A) The agent correctly resumes conversation history from Worker 1's memory

    • B) The agent starts a fresh conversation with no prior context and no error

    • C) The agent throws a KeyError because the thread_id is not found

    • D) The load balancer automatically synchronizes state between workers

    Correct Answer: B

  2. Your LangGraph app has 500 daily active users, each running 4 sessions per day, 6 turns per session, 600 tokens per turn (input + output combined), using GPT-4o at $0.005 per 1 000 tokens. What is the estimated monthly LLM cost?

    • A) $108

    • B) $1 080

    • C) $10 800

    • D) $108 000

    Correct Answer: B β€” (500 Γ— 4 Γ— 6 Γ— 600 Γ— $0.000005 Γ— 30 = $1 080)

  3. Your PostgresSaver write latency spikes from 10 ms (p50) to 450 ms (p50) after three weeks in production with no code changes. Which root cause is most likely?

    • A) LangSmith trace uploads are blocking the asyncpg event loop

    • B) The checkpoints table has grown to 50+ million rows with no index on thread_id

    • C) The asyncpg connection pool is too large, causing PostgreSQL OOM

    • D) The OpenAI API rate limit is throttling checkpoint read requests

    Correct Answer: B

  4. Open-ended challenge: A team wants to deliver daily personalized market-summary reports for 1 200 subscribers, with each report taking 30–40 seconds to generate (multiple tool calls). They propose using LangServe as the delivery mechanism: trigger a POST /agent/invoke per subscriber at 06:00 UTC from a cron job. Explain why this architecture is fragile at scale and describe a more robust alternative, including which LangGraph deployment option you would choose and how you would handle failures mid-report. (No single correct answer β€” justify your design choices.)


