
Deploying LangGraph Agents: LangServe, Docker, LangGraph Platform, and Production Observability

Deploy LangGraph agents to production: LangServe, Docker, PostgresSaver, LangGraph Platform, and LangSmith observability.

Abstract Algorithms · 27 min read

TLDR: Swap InMemorySaver → PostgresSaver, add LangServe + Docker, trace with LangSmith.


📖 The Demo-to-Production Gap: Why Notebook Agents Fail at Scale

Your LangGraph agent works perfectly in the demo. You deploy it to a single FastAPI instance. Two users hit it at the same time. They see each other's conversation history. The InMemorySaver you used in development doesn't know about concurrency, and it never will, because it was never designed for it.

This is the most common silent failure mode in LangGraph deployments, but it is only one of five production requirements that notebook-style agents don't satisfy:

| Requirement | Dev/Notebook | Production need |
|---|---|---|
| State persistence | InMemorySaver (per-process, lost on restart) | External checkpointer (PostgresSaver, Redis) |
| Concurrency safety | Single-user loop | Row-level-locked checkpoints per thread_id |
| Authentication | None (any caller can invoke) | API key middleware, JWT, or OAuth |
| Observability | print() statements | Distributed traces in LangSmith or OpenTelemetry |
| Horizontal scaling | 1 uvicorn worker | N stateless workers + shared external state store |

The rest of this post covers exactly how to close each of these gaps: LangServe for the HTTP layer, Docker and docker-compose for packaging, PostgresSaver for multi-worker state, and LangSmith for end-to-end observability. It ends with a worked deployment of the ReAct agent from earlier in this series.


🔍 Deployment Options Compared: LangServe vs LangGraph Platform vs Self-Hosted

Before writing a single line of deployment code, choose the right layer. The three main options have fundamentally different trade-off profiles.

| Aspect | LangServe | LangGraph Platform / Cloud | Self-Hosted (K8s) |
|---|---|---|---|
| Setup complexity | FastAPI wrapper, ~20 lines | langgraph.json + langgraph deploy | Full DevOps pipeline |
| Auth | Add middleware yourself | Built-in API keys + JWT | Your choice |
| Persistence | PostgresSaver (you manage) | Managed checkpointer | You manage |
| Background runs | ❌ Request/response only | ✅ Yes | ✅ Yes |
| Cron-triggered agents | ❌ No | ✅ Yes | Custom scheduler |
| Multi-tenant isolation | Manual (thread_id partitioning) | Per-user thread isolation built-in | Manual |
| Vendor lock-in | None (pure FastAPI) | LangChain Inc. deployment format | None |
| Cost model | Your infra + LLM | Platform fee + LLM | Infra + DevOps labor |
| Observability | Add LangSmith env vars | LangSmith integrated | LangSmith or OTel |

When to use which: LangServe for rapid prototyping and internal tools. LangGraph Platform when you need background tasks, cron scheduling, and minimal ops overhead. Self-hosted Kubernetes when you have data residency requirements, regulated workloads, or traffic volumes that make the platform fee uneconomic.


⚙️ LangServe: Wrapping a LangGraph Agent as a FastAPI App

LangServe wraps any LangChain Runnable, including a compiled LangGraph, as a FastAPI application. Its core API is a single function: add_routes.

```python
# app.py
import os
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langserve import add_routes
from agent import build_agent  # your compiled LangGraph

app = FastAPI(title="LangGraph Agent API", version="1.0")

# --- API key middleware ---
@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    if request.url.path.startswith("/agent"):
        key = request.headers.get("X-API-Key")
        if key != os.environ["AGENT_API_KEY"]:
            # Return a response directly: an HTTPException raised inside
            # Starlette middleware bypasses FastAPI's exception handlers
            # and surfaces as a 500, not a 401.
            return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
    return await call_next(request)

# --- Mount the graph ---
agent = build_agent()

add_routes(
    app,
    agent,
    path="/agent",
    enable_feedback_endpoint=True,
    enable_public_trace_link_endpoint=True,
)

@app.get("/health")
async def health():
    return {"status": "ok"}
```

add_routes auto-generates six endpoints from the graph's input/output schema:

| Endpoint | Method | Purpose |
|---|---|---|
| /agent/invoke | POST | Single synchronous invocation |
| /agent/stream | POST | Server-sent event stream of intermediate state |
| /agent/batch | POST | Multiple inputs processed in parallel |
| /agent/input_schema | GET | JSON Schema for the request body |
| /agent/output_schema | GET | JSON Schema for the response body |
| /agent/playground | GET | Browser-based interactive playground |

RemoteRunnable client: callers can use a typed Python client instead of raw HTTP:

```python
from langserve import RemoteRunnable

agent = RemoteRunnable("http://localhost:8000/agent")

# Single turn
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Summarize Q4 results"}]},
    config={"configurable": {"thread_id": "user-42-session-7"}},
)

# Streaming
for chunk in agent.stream(
    {"messages": [{"role": "user", "content": "Now compare to Q3"}]},
    config={"configurable": {"thread_id": "user-42-session-7"}},
):
    print(chunk)
```

Limitation: LangServe is in maintenance mode as of 2025. LangChain Inc. directs new projects toward LangGraph Platform for managed deployments. LangServe remains the right choice for self-hosted scenarios, but you own all the operational scaffolding.


🧠 Deep Dive: Concurrency, Persistence, and Production-Grade Checkpointing

The Internals

When a POST /agent/invoke request arrives at a LangServe worker, here is the exact execution path:

  1. Deserialize: LangServe uses the graph's InputType schema to parse the JSON body.

  2. Config extraction: the config.configurable.thread_id is extracted and passed to the checkpointer.

  3. Checkpoint read: AsyncPostgresSaver takes a connection from its psycopg connection pool and runs a SELECT for the latest checkpoint row matching thread_id. This read acquires a row-level shared lock.

  4. Graph execution: nodes run in topological order. After each node that modifies state, the checkpointer writes a new UPSERT into the checkpoints table, keyed by (thread_id, checkpoint_id). Large state blobs overflow into checkpoint_blobs.

  5. Checkpoint write safety: PostgreSQL's row-level locking prevents two workers from writing conflicting checkpoints for the same thread_id. Two workers processing different thread IDs have zero contention.

  6. Serialize and respond: the final MessagesState is serialized to JSON and returned.

Why InMemorySaver silently breaks with multiple workers:

Worker 1 memory: { "user-42": [msg1, msg2] }
Worker 2 memory: { }           ← empty, knows nothing about msg1 or msg2

User's turn 3 routes to Worker 2 → fresh conversation, no history

There is no error, no exception, just wrong behavior. The production fix is to replace InMemorySaver before any multi-worker deploy:

```python
# Development (single process only)
from langgraph.checkpoint.memory import MemorySaver
graph = builder.compile(checkpointer=MemorySaver())

# Production (multi-worker safe)
import os
from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

async def build_production_graph():
    # AsyncPostgresSaver is built on psycopg (psycopg 3), not asyncpg:
    # hand it an async connection pool and call setup() once.
    pool = AsyncConnectionPool(
        conninfo=os.environ["DATABASE_URL"], min_size=2, max_size=10, open=False
    )
    await pool.open()
    checkpointer = AsyncPostgresSaver(pool)
    await checkpointer.setup()   # idempotent: creates tables if absent
    return builder.compile(checkpointer=checkpointer)
```

Performance Analysis

The LLM API call dominates latency by two to three orders of magnitude. Every other component (checkpoint I/O, LangSmith trace upload) is negligible by comparison at moderate scale.

| Metric | 1 uvicorn worker | 4 uvicorn workers (under load) |
|---|---|---|
| Throughput (simple ReAct, 2-turn) | ~8 RPS | ~28–32 RPS |
| PostgresSaver write latency (p50) | ~8 ms | ~12 ms |
| PostgresSaver write latency (p99) | ~35 ms | ~80 ms |
| LangSmith trace upload overhead | ~3 ms (async, non-blocking) | ~3 ms |
| GPT-4o API call latency (p50) | ~1.2 s | ~1.2 s |
| Total round-trip p50 (1 LLM call) | ~1.4 s | ~1.4 s |

Where real bottlenecks appear:

  • Hot thread_ids: if many workers retry the same session simultaneously (e.g., a polling client), write lock contention on a single thread_id row surfaces at high concurrency. Fix: enforce one-writer-per-thread via application-level routing or advisory locks.

  • Checkpoint table growth: a single 20-turn session produces 20+ rows. At 100k daily sessions, the table reaches millions of rows within weeks. Add created_at indexes and an archival TTL policy.

  • Connection pool exhaustion: each LangServe worker shares one psycopg pool. With 4 workers × max_size=10, you have 40 PostgreSQL connections. Size the DB accordingly; use PgBouncer in transaction mode for >100 connections.

Mathematical Model

Capacity planning (workers needed):

$$W = \left\lceil \frac{RPS \times \bar{L}}{u} \right\rceil$$

Where:

  • $RPS$ = target requests per second

  • $\bar{L}$ = average graph invocation latency in seconds (dominated by LLM calls)

  • $u$ = target worker utilization (0.7 is a safe default; above 0.85 causes queue buildup)

Worked example: Target 10 RPS, average latency 2.5 s (one GPT-4o call + tool calls), 70% utilization:

$$W = \left\lceil \frac{10 \times 2.5}{0.7} \right\rceil = \lceil 35.7 \rceil = 36 \text{ workers}$$

That is 9 container replicas × 4 uvicorn workers each.

Monthly LLM cost estimate:

$$C_{monthly} = U \times S \times T_{turns} \times T_{tokens} \times \frac{P}{1000} \times 30$$

Where:

  • $U$ = daily active users

  • $S$ = sessions per user per day

  • $T_{turns}$ = average turns per session

  • $T_{tokens}$ = average tokens per turn (input + output combined)

  • $P$ = model price per 1 000 tokens (e.g., GPT-4o: $0.005/1k output tokens)

Worked example: 1 000 DAU, 3 sessions/day, 5 turns/session, 800 tokens/turn:

$$C_{monthly} = 1000 \times 3 \times 5 \times 800 \times \frac{0.005}{1000} \times 30 = \$1{,}800/\text{month}$$

Doubling turns/session (to 10) doubles the cost to $3 600/month: turn count is the primary cost lever, not user count.
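Both formulas drop into a few lines of Python for quick what-if planning. A convenience sketch: the function names are mine, not from any library.

```python
import math

def workers_needed(rps: float, avg_latency_s: float, utilization: float = 0.7) -> int:
    """W = ceil(RPS * L / u), the capacity formula above."""
    return math.ceil(rps * avg_latency_s / utilization)

def monthly_llm_cost(dau: int, sessions_per_day: float, turns_per_session: float,
                     tokens_per_turn: float, price_per_1k_tokens: float) -> float:
    """C = U * S * T_turns * T_tokens * (P / 1000) * 30, the cost formula above."""
    return (dau * sessions_per_day * turns_per_session * tokens_per_turn
            * price_per_1k_tokens / 1000 * 30)

print(workers_needed(10, 2.5))                   # 36
print(monthly_llm_cost(1000, 3, 5, 800, 0.005))  # 1800.0
```

Rerunning the worked examples this way makes it easy to see the sensitivity: halving latency halves worker count, while doubling turns doubles cost.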


πŸ—οΈ Beyond Request-Response: Auth Patterns, Background Runs, and LangGraph Platform

Authentication and Per-User Thread Isolation

LangServe provides no authentication out of the box; everything added in the middleware snippet earlier is your responsibility. The three common production patterns are:

Pattern 1: Static API key (internal tools and service-to-service)

```python
# In FastAPI middleware (the pattern shown in the LangServe section):
# return the 401 response directly rather than raising, since exceptions
# raised in middleware bypass FastAPI's HTTPException handlers.
key = request.headers.get("X-API-Key")
if key != os.environ["AGENT_API_KEY"]:
    return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
```

Simple, effective for a single service identity. Does not support per-user tracking or revocation.

Pattern 2: JWT with per-user thread isolation

```python
import os
import jwt  # PyJWT
from fastapi import Depends, HTTPException, Request

SECRET = os.environ["JWT_SECRET"]

async def get_current_user(request: Request) -> dict:
    token = request.headers.get("Authorization", "").removeprefix("Bearer ")
    try:
        payload = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")
    return payload  # contains {"sub": "user-42", "tenant": "acme"}

@app.post("/agent/invoke")
async def invoke(body: dict, user: dict = Depends(get_current_user)):
    # thread_id is derived server-side from the verified JWT claims
    thread_id = f"{user['tenant']}-{user['sub']}-{body['session_id']}"
    return await graph.ainvoke(
        body["input"],
        config={"configurable": {"thread_id": thread_id}},
    )
```

The thread_id construction here is the security boundary. A user who crafts an arbitrary thread_id in the request body cannot access another user's thread, because the server derives the real thread_id from the verified JWT payload.
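The server-side derivation can be factored into a small, testable helper. A sketch: the "::" delimiter matches the namespacing pattern used later in this post, but any reserved separator works, and the validation rule is my addition.

```python
def make_thread_id(tenant: str, user: str, session: str) -> str:
    """Build a tenant-scoped thread_id from server-verified identity.

    The client never supplies this value directly; it is derived from the
    JWT payload, so one user cannot address another user's thread.
    """
    for part in (tenant, user, session):
        if "::" in part:
            # Reject identity fields containing the delimiter so a crafted
            # value cannot splice itself into another tenant's namespace.
            raise ValueError("identity fields must not contain '::'")
    return f"{tenant}::{user}::{session}"

print(make_thread_id("acme", "user-42", "session-7"))  # acme::user-42::session-7
```

Keeping this in one function means the isolation rule is enforced in exactly one place, which is easy to audit at code review.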

Pattern 3: LangGraph Platform built-in auth

LangGraph Platform handles token validation and thread isolation natively. You pass a user_id in the run config and the Platform enforces that each user can only access their own threads. No middleware required.

Background Runs and Cron-Triggered Agents with LangGraph Platform

Some agent workloads do not fit a request/response model:

  • Nightly report generators: run for 30–60 seconds per user, triggered at a fixed time

  • Monitoring agents: poll an external API every 5 minutes and write findings to a thread

  • Reactive agents: triggered by a webhook (new Stripe event, new GitHub PR) rather than a user request

LangGraph Platform supports these with background runs and cron schedules:

langgraph.json (LangGraph Platform deployment config):

```json
{
  "dependencies": ["."],
  "graphs": {
    "react_agent": "./agent.py:graph",
    "report_agent": "./report_agent.py:report_graph"
  },
  "env": ".env",
  "http": {
    "port": 8000
  }
}
```

```bash
# Local dev server (hot reload, no Docker required)
langgraph dev

# Deploy to LangGraph Cloud
langgraph deploy --project my-agent-project
```

Background run (via SDK):

```python
from langgraph_sdk import get_client

client = get_client(url="https://my-agent.langchain.app")

# Create a background run; returns immediately with run metadata
run = await client.runs.create(
    thread_id=f"report-{user_id}-{today}",
    assistant_id="report_agent",
    input={"user_id": user_id, "date": today},
)

# Poll for completion (the SDK returns runs as dicts)
result = await client.runs.join(thread_id=run["thread_id"], run_id=run["run_id"])
```

Cron runs are configured via the Platform dashboard or API: specify a cron expression, the assistant, and the default input. The Platform creates a new background run on each trigger.

Edge Cases in Production Graph Design

| Edge case | Symptom | Guard pattern |
|---|---|---|
| Tool returns empty result set | Agent loops indefinitely, asking the same question | Add a max_iterations counter to state; route to END on limit |
| LLM returns malformed JSON for structured output | Pydantic ValidationError on the node output | Wrap node in a try/except; write error to state and route to a fallback node |
| Checkpoint write fails mid-run (DB unavailable) | Next invocation replays from last committed checkpoint, may duplicate tool calls | Make tool calls idempotent; use run_id as idempotency key in external APIs |
| thread_id collision across tenants | User A sees User B's history | Namespace thread_ids: f"{tenant_id}::{user_id}::{session_id}" (colons are valid in thread_id strings) |
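The first guard in the table reduces to a counter plus a conditional route. A minimal pure-Python sketch of that logic (in a real graph this would be a conditional edge; the state field name `iterations` is my choice):

```python
MAX_ITERATIONS = 5

def should_continue(state: dict) -> str:
    """Route to 'end' once the loop counter hits the cap, else loop again."""
    return "end" if state.get("iterations", 0) >= MAX_ITERATIONS else "continue"

# Simulate an agent that never finds a useful tool result and would
# otherwise loop forever: each pass increments the counter, then the
# conditional edge decides whether to run another pass.
state = {"iterations": 0}
route = "continue"
while route == "continue":
    state["iterations"] += 1        # the agent node runs
    route = should_continue(state)  # the conditional edge decides

print(state["iterations"], route)   # 5 end
```

The same predicate, registered via add_conditional_edges with a mapping like {"continue": "agent", "end": END}, bounds the loop inside the compiled graph.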

📊 Production Architecture: From Load Balancer to LangSmith

The following diagram shows the full request path for a multi-worker LangServe deployment with PostgreSQL persistence and LangSmith observability. The dashed arrows are async/non-blocking paths that do not add to user-facing latency.

```mermaid
flowchart TD
    A["Client / Browser\n(RemoteRunnable or HTTP)"] -->|"HTTPS POST /agent/invoke"| B["Load Balancer\nnginx / AWS ALB"]
    B --> C1["FastAPI Worker 1\nLangServe + agent.py"]
    B --> C2["FastAPI Worker 2\nLangServe + agent.py"]
    B --> C3["FastAPI Worker N\nLangServe + agent.py"]

    C1 & C2 & C3 -->|"psycopg pool\nthread_id checkpoint UPSERT"| D[("PostgreSQL 16\nAsyncPostgresSaver\ncheckpoints table")]
    C1 & C2 & C3 -->|"HTTPS: LLM prompts"| E["OpenAI / Anthropic\nLLM API"]
    C1 & C2 & C3 -.->|"async trace upload\n(non-blocking)"| F["LangSmith\nTracing & Evaluation"]
    D -.->|"checkpoint read on session resume"| C1

    style D fill:#336791,color:#ffffff
    style F fill:#1a73e8,color:#ffffff
    style E fill:#10a37f,color:#ffffff
    style B fill:#f0a500,color:#000000
```

Figure: Each stateless FastAPI worker reads and writes checkpoints through a shared PostgreSQL pool. LangSmith receives traces asynchronously; its availability does not affect request-response latency.

Key architectural properties:

  • Workers are stateless: any worker can handle any request for any thread_id, because state lives in PostgreSQL, not in process memory.

  • The load balancer can use round-robin or least-connections routing; sticky sessions are not required.

  • The LLM API is an external dependency: implement retry with exponential backoff and a circuit breaker on 429/503 responses.

  • LangSmith is fire-and-forget from the worker's perspective. If LangSmith is unavailable, traces are queued locally and uploaded on recovery; no request fails.
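The retry-with-backoff property above can be sketched in a few lines. A hedged sketch, not a library recipe: `TransientError` stands in for an HTTP 429/503, and the injectable `sleep` exists only so the example is testable.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (HTTP 429/503 from the LLM API)."""

def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry fn with exponential backoff plus jitter; re-raise on exhaustion."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            # Delays grow 0.5 s, 1 s, 2 s, ... with up to 100 ms of jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Example: a fake LLM call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

result = call_with_backoff(flaky_call, sleep=lambda s: None)
print(result, attempts["n"])  # ok 3
```

A production version would also cap the total delay and trip a circuit breaker after repeated exhaustion, so a degraded LLM API sheds load instead of queueing it.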


🌍 Real-World Applications: How Teams Run LangGraph in Production

Case Study 1: Customer Support Agent at a SaaS Company

A B2B SaaS team deployed a LangGraph support agent handling 50 000 monthly tickets. The agent used a ReAct pattern: classify intent → search knowledge base → draft response → optionally escalate.

Deployment details:

  • 3 ECS tasks × 4 uvicorn workers = 12 workers total

  • PostgresSaver on RDS PostgreSQL db.t3.medium, averaging 8 ms checkpoint write latency

  • thread_id = f"tenant-{tenant_id}-ticket-{ticket_id}": one thread per ticket, strict tenant isolation

  • LangSmith used for trace replay when a ticket was escalated but the customer reported the agent gave contradictory answers

Scaling lesson: The team initially used 2 RDS instances (one per region) and ran into cross-region checkpoint replication lag when the load balancer briefly routed a user's follow-up to the other region. Fix: pin sessions to a region via the load balancer's geo-routing rules, not the database layer.

Case Study 2: Daily Market Summary Report Agent

A fintech team needed a LangGraph agent to generate a personalized market summary for 1 200 subscribers every morning at 06:00 UTC. Each report took 25–40 seconds (10+ tool calls: price fetcher, news scraper, portfolio analyzer, chart renderer).

Why LangServe alone was insufficient: LangServe operates on a request/response model. A 40-second HTTP request is fragile: load balancers time out, clients disconnect, and there are no retry semantics. Background task runs were required.

Solution: LangGraph Platform with a cron-triggered background run:

langgraph.json:

```json
{
  "dependencies": ["."],
  "graphs": {
    "market_summary": "./agent.py:market_summary_graph"
  },
  "env": ".env"
}
```

The cron job created one background run per subscriber with thread_id = f"report-{user_id}-{date}". Results were written to the thread's final state and polled by the notification service via the Platform API.

Scaling lesson: Background run queues have depth limits on the managed platform. For more than 500 simultaneous background runs, teams either paginate cron triggers across time windows (06:00–06:30 in batches of 100) or move to self-hosted with a dedicated task queue (Celery, ARQ, or Temporal).


⚖️ Trade-offs and Failure Modes: LangServe Limits, Platform Lock-in, and Trace Storage Costs

LangServe Limitations

LangServe is a thin FastAPI adapter: it gives you the right endpoints fast, but every production concern is your responsibility:

  • No built-in auth: the middleware example earlier is the minimal pattern; production systems need JWT validation, per-tenant rate limiting, and RBAC.

  • No background runs: all invocations are synchronous HTTP. For long-running agents (>10 s), clients must poll or use /stream with SSE.

  • No cron scheduling: you need an external scheduler (cron job, Celery beat, cloud scheduler) to trigger periodic agents.

  • Maintenance mode: LangChain Inc. has publicly indicated LangServe receives only security patches; feature development moved to LangGraph Platform.

Platform Lock-in

LangGraph Platform's deployment contract (langgraph.json, the Platform SDK, the background run API) is proprietary. Your graph code is portable: it is pure Python with LangGraph core as the only dependency. Your deployment wrapper is not. If you later need to self-host, you rewrite the deployment scaffolding but keep the graph logic unchanged. Design your graph modules with zero imports of langgraph_sdk or platform-specific APIs.

PostgresSaver as a Bottleneck at Scale

PostgresSaver is well-suited for hundreds of concurrent users. At thousands of concurrent sessions:

| Scale | Symptom | Mitigation |
|---|---|---|
| 100–500 concurrent sessions | Checkpoint write latency p99 climbs to 100–200 ms | Tune the connection pool; add PgBouncer |
| 500–2000 concurrent sessions | Table size grows to 10M+ rows; index scans slow | Partition by created_at; archive old threads |
| 2000+ concurrent sessions | PostgreSQL becomes throughput bottleneck | Redis checkpointer for hot threads; PostgreSQL for cold/archived threads |

Trace Storage Costs

LangSmith charges per trace ingested and per trace stored. At high volume:

  • A single GPT-4o ReAct run with 5 tool calls = 6–8 LangSmith spans

  • 1 000 RPS × 8 spans = 8 000 spans/second ≈ 700 M spans/day

  • At LangSmith pricing, this easily reaches $1 000–$3 000/month

Mitigation strategies:

  1. Trace sampling: LANGCHAIN_TRACING_SAMPLE_RATE=0.10 traces 10% of requests. Always trace on error regardless of sample rate.

  2. TTL policy: set trace retention to 14 days for high-volume projects; 90 days for evaluation baseline projects.

  3. Disable verbose logging: set LANGCHAIN_HIDE_INPUTS=true and LANGCHAIN_HIDE_OUTPUTS=true for paths where input/output capture adds no debugging value.

Checkpoint Table Footprint

A subtle failure mode: the checkpoints table grows without bound unless you set a retention policy. One session with 20 turns writes 20+ rows. At 100 000 daily sessions:

$$\text{Rows/day} = 100{,}000 \times 20 = 2{,}000{,}000 \text{ rows/day}$$

Within 30 days: 60 million rows. Without proper indexing on (thread_id, created_at) and periodic archival, checkpoint reads degrade to full-table scans.


🧭 Decision Guide: LangServe vs LangGraph Platform vs Self-Hosted Kubernetes

| Situation | Recommendation |
|---|---|
| Internal tool, small team, fast prototype | LangServe + MemorySaver locally; swap to PostgresSaver before any team-shared deploy |
| Production web app, self-managed infra, simple request/response | LangServe + AsyncPostgresSaver + API key middleware + LangSmith tracing |
| Need background runs, cron agents, minimal ops overhead | LangGraph Platform (managed cloud); expect $200–$2000/month platform fee depending on usage |
| Regulated industry, strict data residency, or air-gapped environment | Self-hosted K8s + AsyncPostgresSaver + self-managed LangSmith (or OpenTelemetry export to Jaeger/Grafana) |
| Multi-tenant SaaS with strict user isolation | thread_id = f"{tenant_id}-{user_id}-{session_id}"; PostgreSQL row-level security per tenant schema |
| Very high throughput (>200 RPS) | Self-hosted K8s + HPA + PgBouncer + Redis checkpointer for hot threads; PostgreSQL as cold archive |
| Periodic batch agents (nightly reports, weekly summaries) | LangGraph Platform cron runs for simplicity; or self-hosted ARQ/Temporal queue if you need retry semantics and audit logs |

🧪 Practical Example: Full Deployment with Docker Compose, PostgreSQL, and LangSmith

This section deploys the ReAct agent from earlier in the series end-to-end: AsyncPostgresSaver for state, LangServe for HTTP, docker-compose for local production parity, and LangSmith tracing enabled. The full-stack deployment scenario was chosen because it exercises every production requirement from the post's opening table (shared durable state, a real HTTP interface, container isolation, and observability) in a single runnable stack. As you read through the steps, watch how the PostgreSQL connection string flows from docker-compose.yml into the psycopg connection pool into AsyncPostgresSaver: that three-layer chain is the only change between a dev InMemorySaver and a production-ready agent.

Step 1: Agent with PostgresSaver

```python
# agent.py
import os
from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import END, MessagesState, StateGraph
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def call_model(state: MessagesState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("model", call_model)
builder.set_entry_point("model")
builder.add_edge("model", END)

_graph = None  # module-level singleton

async def get_graph():
    global _graph
    if _graph is None:
        # AsyncPostgresSaver runs on a psycopg async connection pool
        pool = AsyncConnectionPool(
            conninfo=os.environ["DATABASE_URL"],
            min_size=2,
            max_size=10,
            open=False,
        )
        await pool.open()
        checkpointer = AsyncPostgresSaver(pool)
        await checkpointer.setup()   # idempotent table creation
        _graph = builder.compile(checkpointer=checkpointer)
    return _graph
```

Step 2: LangServe App with API Key Auth

```python
# app.py
import os
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langserve import add_routes
from agent import get_graph

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the graph (and its connection pool) once per worker, then
    # mount the LangServe routes on the live app. Doing both in the
    # lifespan avoids mixing it with the deprecated on_event("startup").
    graph = await get_graph()
    add_routes(app, graph, path="/agent")
    yield

app = FastAPI(title="LangGraph ReAct Agent", version="1.0", lifespan=lifespan)

@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    if request.url.path.startswith("/agent"):
        key = request.headers.get("X-API-Key")
        if key != os.environ.get("AGENT_API_KEY", ""):
            # Return the response directly: exceptions raised inside
            # middleware bypass FastAPI's HTTPException handlers.
            return JSONResponse(status_code=401, content={"detail": "Invalid API key"})
    return await call_next(request)

@app.get("/health")
async def health():
    return {"status": "ok"}
```

Step 3: Multi-Stage Dockerfile

```dockerfile
# ---- Build stage: install deps into /install ----
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Runtime stage: minimal image ----
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .

# Non-root user for security
RUN adduser --disabled-password --gecos "" appuser
USER appuser

EXPOSE 8000
# Probe with Python's stdlib: curl is not installed in slim images
HEALTHCHECK --interval=10s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

Step 4: docker-compose with PostgreSQL and LangSmith

```yaml
# docker-compose.yml
version: "3.9"

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: langgraph
      POSTGRES_USER: langgraph
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U langgraph"]
      interval: 5s
      timeout: 3s
      retries: 5

  agent:
    build: .
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: "postgresql://langgraph:${POSTGRES_PASSWORD}@db:5432/langgraph"
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      AGENT_API_KEY: ${AGENT_API_KEY}
      # LangSmith tracing
      LANGCHAIN_TRACING_V2: "true"
      LANGCHAIN_API_KEY: ${LANGSMITH_API_KEY}
      LANGCHAIN_PROJECT: "react-agent-prod"
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped

volumes:
  pgdata:
```

Start the stack and run a smoke test:

```bash
docker compose up --build -d

curl -X POST http://localhost:8000/agent/invoke \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $AGENT_API_KEY" \
  -d '{
    "input": {"messages": [{"role": "user", "content": "What is 12 * 8?"}]},
    "config": {"configurable": {"thread_id": "smoke-test-001"}}
  }'
```

Step 5: Load Test with k6

Validate concurrency before launch. Each virtual user gets its own thread_id so they exercise separate checkpoint paths independently.

```javascript
// load-test.js
import http from "k6/http";
import { check, sleep } from "k6";

export const options = { vus: 20, duration: "60s" };

export default function () {
  const payload = JSON.stringify({
    input: { messages: [{ role: "user", content: "Summarize the benefits of async IO." }] },
    config: { configurable: { thread_id: `loadtest-user-${__VU}` } },
  });

  const res = http.post("http://localhost:8000/agent/invoke", payload, {
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": __ENV.AGENT_API_KEY,
    },
    timeout: "30s",
  });

  check(res, {
    "status 200": (r) => r.status === 200,
    "response time < 10s": (r) => r.timings.duration < 10000,
  });

  sleep(2);
}
```

Run with k6 run -e AGENT_API_KEY=your-key load-test.js and watch the PostgreSQL connection pool and LangSmith trace dashboard in parallel.


🛠️ LangSmith: Tracing, Evaluation, and Debugging Production Agent Runs

LangSmith is the observability layer for LangChain and LangGraph applications. When LANGCHAIN_TRACING_V2=true, every LLM call, every tool invocation, and every graph node traversal is automatically captured, with no code changes required.

Enabling Tracing

```bash
# .env (or container environment)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...            # from smith.langchain.com
LANGCHAIN_PROJECT=react-agent-prod   # project name in LangSmith dashboard
```

What is captured per run:

| Span type | Captured fields |
|---|---|
| LLM call | Model name, prompt, response, latency, input tokens, output tokens, cost |
| Tool call | Tool name, input arguments, output, latency, errors |
| Graph node | Node name, input state, output state, execution time |
| Chain | Parent/child span hierarchy, total latency, total cost |

Reading a Trace to Debug a Failed Run

  1. Open the LangSmith dashboard and select your project.

  2. Filter runs by Status: Error and Date: last 24h.

  3. Click the failed run; the trace tree expands all spans in execution order.

  4. Red-highlighted spans threw exceptions: click one to see the full stack trace and the exact input that triggered the failure.

  5. Use "Run as experiment" to replay the exact same input with a patched version of your graph without touching production.

Example trace tree for a failed ReAct run:

```text
✅ graph.invoke (1.8s)
  ✅ node: classify_intent (0.3s)
  ✅ node: search_tool (0.6s)
  ❌ node: format_response (0.2s)
       ↳ KeyError: 'product_name' not found in tool output
       ↳ Input state: {"search_results": [], "messages": [...]}
```

The trace immediately tells you which node failed, what state it received, and why, in seconds rather than hours of log mining.

Building an Evaluation Dataset for Regression Testing

```python
from langsmith import Client
from langsmith.evaluation import aevaluate
from langserve import RemoteRunnable

client = Client()
agent = RemoteRunnable("http://localhost:8000/agent")

# Score whether the agent included a resolution in its response
def resolution_evaluator(run, example):
    messages = run.outputs.get("messages", [])
    last_message = messages[-1]["content"] if messages else ""
    has_resolution = any(
        kw in last_message.lower()
        for kw in ["resolved", "completed", "refunded", "escalated"]
    )
    return {"key": "has_resolution", "score": int(has_resolution)}

# Run evaluation against the golden dataset (inside an async context)
results = await aevaluate(
    agent.ainvoke,
    data="support-agent-golden-v1",     # dataset created from production traces
    evaluators=[resolution_evaluator],
    experiment_prefix="deploy-v2",
    max_concurrency=5,
)

# aevaluate yields result rows asynchronously; inspect per-example scores
async for row in results:
    print(row["evaluation_results"])
```

Wire this evaluation into your CI/CD pipeline. If resolution_rate drops more than 5% from the baseline, block the deployment automatically.
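The gate itself is a one-line comparison. A hedged sketch of the check a CI step might run after aggregating the scores (names are illustrative, not from the LangSmith SDK):

```python
def gate_deploy(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Return True when the deploy may proceed.

    baseline/current are resolution rates in [0, 1]; a drop larger than
    max_drop (5 points by default) blocks the deployment.
    """
    return (baseline - current) <= max_drop

print(gate_deploy(0.92, 0.90))  # True: a 2-point drop is within tolerance
print(gate_deploy(0.92, 0.85))  # False: a 7-point drop blocks the deploy
```

In CI, a False return would translate to a non-zero exit code so the pipeline fails before the image is promoted.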


📚 Lessons Learned

  1. InMemorySaver is a deployment hazard, not just a development convenience. It works in every test, fails silently in every multi-worker deploy. Make AsyncPostgresSaver a hard requirement in your production deploy checklist β€” enforced at code review, not left to team memory.

  2. thread_id is the session boundary and the security boundary. Set it in the caller, not the server. Use a deterministic, scoped ID (f"user-{user_id}-session-{session_id}") so users can resume sessions across workers and across restarts, while being unable to read each other's threads.

  3. LangSmith trace overhead is negligible; trace storage costs are not. Profile your span volume before going live. Enable LANGCHAIN_TRACING_SAMPLE_RATE=0.10 for high-throughput paths and always trace on error regardless of sample rate.

  4. LangServe is a rapid onramp, not a full production platform. You own auth, rate limiting, background tasks, and monitoring. If you find yourself reimplementing all four, evaluate LangGraph Platform before finishing the fifth.

  5. Your graph code is portable; your deployment wrapper is not. Keep graph modules free of LangGraph Platform SDK imports. This makes it a one-day migration to move from managed to self-hosted, not a one-month rewrite.

  6. Load-test with realistic thread_id diversity and multi-turn conversations. A 5-tool agent that looks fine with 1 concurrent user saturates 4 workers at just 2 concurrent users if each tool call blocks. Use k6 or locust with actual turn sequences before declaring production-ready.

  7. Checkpoint table growth is a sleeper issue. One 20-turn session generates 20+ rows. Set a TTL archival job from day one β€” preferably before your first real traffic, not after your first slow-query alert at 3 AM.
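The thread_id convention from lesson 2 fits in a few lines. A sketch, assuming the caller has already authenticated the user (`make_thread_id` is a hypothetical helper, not a LangGraph API):

```python
def make_thread_id(user_id: str, session_id: str) -> str:
    # Deterministic and scoped: the same user + session pair always maps
    # to the same thread (resumable across workers and restarts), and a
    # caller authenticated as one user cannot construct another user's ID.
    return f"user-{user_id}-session-{session_id}"

# Passed through to LangGraph as the configurable thread_id:
config = {"configurable": {"thread_id": make_thread_id("42", "a1b2")}}
```

The key design choice is that the ID is derived from authenticated identity on the caller side, never accepted as a free-form string from the client.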


πŸ“Œ TLDR: Summary and Key Takeaways

TLDR: Swap InMemorySaver β†’ PostgresSaver, add LangServe + Docker, trace with LangSmith.

  • InMemorySaver silently corrupts multi-user state across workers β€” replace with AsyncPostgresSaver before any multi-worker deployment, no exceptions.

  • LangServe gives you /invoke, /stream, /batch in ~20 lines of FastAPI setup, but you own auth, rate limiting, and background task support.

  • LangGraph Platform handles auth, persistence, background runs, and cron out of the box β€” at the cost of deployment format lock-in; your graph logic remains portable Python.

  • AsyncPostgresSaver uses asyncpg connection pooling and PostgreSQL row-level locking for safe concurrent writes; checkpoint reads add ~8–15 ms latency, dwarfed by LLM call latency.

  • Capacity formula: $W = \lceil (RPS \times \bar{L}) / u \rceil$ β€” LLM latency dominates; you need far more workers than checkpoint I/O would suggest.

  • LangSmith's LANGCHAIN_TRACING_V2=true is zero-code observability β€” use trace replay to debug production failures and aevaluate() to gate deployments on quality regressions.

  • thread_id is both the session key and the isolation boundary β€” make it deterministic, caller-controlled, and scoped to the user + session pair.

One-liner to remember: Deploying LangGraph is a checklist β€” swap the saver, add auth, containerize, trace everything, and plan your checkpoint table for the scale you expect in 90 days, not the scale you have today.
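The capacity formula in the bullets above is worth running with real numbers. A sketch with illustrative values (the latencies below are examples, not measurements):

```python
import math

def workers_needed(rps: float, mean_latency_s: float, utilization: float) -> int:
    """W = ceil(RPS * L / u): workers required to serve `rps` requests
    per second at mean request latency `mean_latency_s`, targeting a
    `utilization` busy-fraction per worker (0.7 leaves 30% headroom)."""
    return math.ceil(rps * mean_latency_s / utilization)

# 3 req/s with a 6 s mean agent turn (LLM latency dominates), 70% target:
workers_needed(3, 6.0, 0.7)     # -> 26 workers

# The same load if only checkpoint I/O (~15 ms) mattered:
workers_needed(3, 0.015, 0.7)   # -> 1 worker
```

The two calls make the bullet's point concrete: sizing from checkpoint I/O alone would underestimate the worker count by more than an order of magnitude.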


πŸ“ Practice Quiz

  1. You deploy a LangGraph agent with InMemorySaver behind a load balancer with 3 uvicorn workers. A user sends turn 1 (routed to Worker 1) and then turn 2 (routed to Worker 2). What happens?

    • A) The agent correctly resumes conversation history from Worker 1's memory

    • B) The agent starts a fresh conversation with no prior context and no error

    • C) The agent throws a KeyError because the thread_id is not found

    • D) The load balancer automatically synchronizes state between workers

    Correct Answer: B

  2. Your LangGraph app has 500 daily active users, each running 4 sessions per day, 6 turns per session, 600 tokens per turn (input + output combined), using GPT-4o at $0.005 per 1 000 tokens. What is the estimated monthly LLM cost?

    • A) $108

    • B) $1 080

    • C) $10 800

    • D) $108 000

    Correct Answer: B β€” (500 Γ— 4 Γ— 6 Γ— 600 Γ— $0.000005 Γ— 30 = $1 080)

  3. Your PostgresSaver write latency spikes from 10 ms (p50) to 450 ms (p50) after three weeks in production with no code changes. Which root cause is most likely?

    • A) LangSmith trace uploads are blocking the asyncpg event loop

    • B) The checkpoints table has grown to 50+ million rows with no index on thread_id

    • C) The asyncpg connection pool is too large, causing PostgreSQL OOM

    • D) The OpenAI API rate limit is throttling checkpoint read requests

    Correct Answer: B

  4. Open-ended challenge: A team wants to deliver daily personalized market-summary reports for 1 200 subscribers, with each report taking 30–40 seconds to generate (multiple tool calls). They propose using LangServe as the delivery mechanism: trigger a POST /agent/invoke per subscriber at 06:00 UTC from a cron job. Explain why this architecture is fragile at scale and describe a more robust alternative, including which LangGraph deployment option you would choose and how you would handle failures mid-report. (No single correct answer β€” justify your design choices.)


