
Headless Agents: How to Deploy Your Skills as an MCP Server

Build a Python MCP server once and call it from Cursor, Claude Desktop, and GitHub Copilot with no code changes.

Abstract Algorithms · 17 min read

TLDR: Deploy once, call everywhere: MCP turns Python skills into headless servers any AI client can call.


📖 The Trapped Skill Problem: When Your Best LLM Tool Works Everywhere But Here

You spent an afternoon building a beautiful skill inside GitHub Copilot CLI: given a repository URL, it summarises the codebase, identifies the top changed files, and drafts a pull-request description. It works every time you run it. You feel good.

Then your teammate on Cursor asks if they can use it. Another colleague on Claude Desktop wants access too. You look at your implementation — a tightly coupled async function registered directly inside Copilot's extension API — and realise there is no clean way to share it. You would have to rewrite it for Cursor's tool format, then rewrite it again for Claude's function-calling schema, and maintain three versions forever.

This is the trapped skill problem: a useful LLM capability locked inside one tool's runtime.

The Model Context Protocol (MCP) is the solution. MCP is an open standard — originally developed by Anthropic and now implemented by Cursor, Claude Desktop, GitHub Copilot, and VS Code agent mode — that defines a single wire format for exposing tools, resources, and prompts from a server process. Write your skill as an MCP server once, and any MCP-aware client can discover and invoke it. No rewrites. No per-client adapters.

This post walks you through exactly how to do that: understanding MCP's three-layer model, building a server with the Python SDK, choosing the right transport for local vs. remote deployment, and running a real "repo summarizer" skill simultaneously from two different clients.


πŸ” MCP Fundamentals: Protocol, Transport Types, and the Server Lifecycle

MCP has three moving parts: a client (the AI assistant — Cursor, Claude Desktop, Copilot), a server (your Python process exposing tools), and a transport (the channel that connects them).

The protocol is intentionally thin. At its core, MCP defines:

  • Tool registration — the server advertises a list of callable tools with JSON Schema-typed parameters.
  • Resource registration — the server can expose read-only data sources (files, database rows, API responses) the client can fetch.
  • Prompt templates — reusable prompt fragments the client can request.

MCP transports come in two flavours:

| Transport | Mechanism | Best for |
| --- | --- | --- |
| stdio | Client spawns the server as a child process; messages flow over stdin/stdout | Local, single-client, same machine |
| HTTP + SSE | Server runs as an HTTP daemon; client POSTs requests and receives Server-Sent Events back | Remote, multi-client, Docker/cloud |

The server lifecycle is predictable: on startup, the server sends a capabilities object declaring which tools it supports. The client stores this manifest and routes user-intent to the right tool function. On each tool call, the server receives a JSON-RPC 2.0 request, executes the handler, and streams or returns the result.
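To make that round trip concrete, here is a sketch of one tool call expressed as Python dicts. The `tools/call` method name follows the MCP specification; the tool name, arguments, and result text are placeholder values for illustration.

```python
import json

# A client -> server tool invocation (JSON-RPC 2.0 request).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "summarize_repo",
        "arguments": {"repo_url": "https://github.com/octocat/hello-world"},
    },
}

# The server -> client response echoes the request id so the client
# can correlate it with the pending call.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "Example summary."}]},
}

# On stdio, each message travels as one newline-terminated JSON object.
frame = json.dumps(request) + "\n"
```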


βš™οΈ Building Your First MCP Server with the Python SDK

Install the SDK:

pip install mcp

A minimal server with one tool looks like this:

# server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp import types

app = Server("repo-summarizer")

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="summarize_repo",
            description="Summarise a GitHub repository and draft a PR description.",
            inputSchema={
                "type": "object",
                "properties": {
                    "repo_url": {
                        "type": "string",
                        "description": "Full HTTPS URL of the repository."
                    },
                    "max_files": {
                        "type": "integer",
                        "description": "Maximum changed files to include.",
                        "default": 10
                    }
                },
                "required": ["repo_url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "summarize_repo":
        repo_url = arguments["repo_url"]
        max_files = arguments.get("max_files", 10)
        # --- your skill logic goes here ---
        summary = await _summarize(repo_url, max_files)
        return [types.TextContent(type="text", text=summary)]
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as streams:
        await app.run(*streams, app.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Three things to notice here. First, @app.list_tools() is the capability announcement — it tells the client exactly what parameters to expect. Second, @app.call_tool() is the dispatcher — every tool call routes through this single handler. Third, the transport is wired in main() — swap stdio_server for the HTTP+SSE transport and the tool logic is untouched.

Registering the Server with a Client

For stdio transport (local use with Cursor or Claude Desktop), add an entry to the client's mcp.json or settings file:

{
  "mcpServers": {
    "repo-summarizer": {
      "command": "python",
      "args": ["/path/to/server.py"]
    }
  }
}

The client will spawn the process on demand and terminate it when the session ends.


🧠 Deep Dive: How MCP Actually Routes Messages

The Internals: JSON-RPC 2.0, Capability Negotiation, and Message Framing

MCP's wire format is JSON-RPC 2.0 — the same protocol powering the Language Server Protocol (LSP). Every message has three fields: jsonrpc: "2.0", a method string, and either params (for requests) or result/error (for responses).

The initialization handshake is a two-step exchange:

  1. Client sends initialize with its own capabilities (protocol version, supported features).
  2. Server replies with InitializeResult containing its capabilities (tools, resources, prompts it can serve).

This capability negotiation means clients never need to hard-code what a server can do. They discover it at runtime. If a server is updated to expose a new tool, any connected client sees it on the next session without configuration changes.

Message framing on stdio uses newline-delimited JSON (NDJSON): each message is a single JSON object terminated by \n. The SDK handles framing automatically, but understanding it matters when debugging: you can attach a simple pipe logger between client and server to inspect raw traffic.
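A minimal version of such a pipe logger might look like the sketch below. The `pump` helper and log format are illustrative, not part of the SDK; in a real tap you would spawn the server with `subprocess.Popen` and run `pump` on two threads, one per direction (client stdin to server, server stdout back to client).

```python
import io

def pump(src, dst, log, label):
    """Relay newline-delimited JSON frames from src to dst, logging each one."""
    for line in src:
        log.write(f"{label} {line.decode(errors='replace')}")
        dst.write(line)
        dst.flush()

# Demonstration with in-memory streams standing in for the real pipes:
raw = io.BytesIO(b'{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}\n')
relayed, log = io.BytesIO(), io.StringIO()
pump(raw, relayed, log, ">>")
```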

Message framing on HTTP+SSE is slightly different. The client POSTs a JSON-RPC request to /message. The server writes data: <json>\n\n chunks to the SSE stream. The connection stays open for the life of the session, which means long-running tool calls stream progress updates back incrementally rather than blocking until completion.
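A few lines of Python are enough to decode that framing, which is handy when inspecting a raw SSE stream captured with curl or in tests. This parser is a sketch of the framing described above, not an SDK API:

```python
import json

def parse_sse(stream_text: str) -> list:
    """Decode JSON payloads from raw SSE text ('data: <json>' blocks)."""
    events = []
    for block in stream_text.split("\n\n"):
        data_lines = [
            line[len("data: "):]
            for line in block.splitlines()
            if line.startswith("data: ")
        ]
        if data_lines:
            # Multi-line data fields are joined with newlines per the SSE spec.
            events.append(json.loads("\n".join(data_lines)))
    return events
```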

Performance Analysis: stdio vs HTTP Latency and Concurrency Limits

The performance characteristics of the two transports are meaningfully different:

| Metric | stdio | HTTP + SSE |
| --- | --- | --- |
| Cold-start latency | ~100–300 ms (process spawn) | ~5–20 ms (HTTP connect to running process) |
| Per-call overhead | Negligible (IPC) | ~1–5 ms (TCP + HTTP headers) |
| Max concurrent clients | 1 (the spawning process) | Limited by your server's asyncio event loop (hundreds) |
| Long-running tool support | ✅ (streaming supported) | ✅ (SSE chunking) |
| Auth surface | None (OS process isolation) | Requires token/mTLS (exposed over network) |

The main bottleneck in any MCP server is the tool handler itself, not the transport. An LLM API call inside summarize_repo that takes 2 seconds dominates a sub-millisecond transport overhead by a factor of 1000. Optimize the tool logic — use async HTTP clients, cache repeated LLM calls — before worrying about transport tuning.

For HTTP+SSE servers under real load, the Python asyncio event loop is single-threaded. CPU-bound work inside a handler will block other concurrent requests. The fix is either asyncio.to_thread() for synchronous blocking calls, or splitting compute-heavy work into a background task queue (Celery, ARQ) that the handler awaits.
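A sketch of the `asyncio.to_thread` pattern, with a hypothetical `parse_large_diff` standing in for the CPU-bound work:

```python
import asyncio
import time

def parse_large_diff(diff_text: str) -> int:
    """Stand-in for CPU-bound work that would otherwise block the event loop."""
    time.sleep(0.05)  # simulate heavy computation
    return len(diff_text.splitlines())

async def handle_tool_call(diff_text: str) -> int:
    # Runs in a worker thread; the loop keeps serving other requests meanwhile.
    return await asyncio.to_thread(parse_large_diff, diff_text)
```

The trade-off is thread-pool pressure: `to_thread` uses the default executor, so very high concurrency of CPU-bound calls still warrants a proper task queue.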


📊 From Client Request to Tool Result: The MCP Call Flow

The diagram below shows the complete round-trip from a user's intent in Cursor to the tool result returned by your Python server. Note the two possible transport paths — stdio for local use and HTTP+SSE for headless remote deployment — and how they converge at the same tool handler.

flowchart TD
    A["User types intent in Cursor / Claude Desktop"] --> B["Client LLM resolves tool name + params"]
    B --> C{Transport?}
    C -->|stdio local| D["Client spawns server as child process"]
    C -->|HTTP+SSE remote| E["Client POSTs to /message endpoint"]
    D --> F["JSON-RPC request over stdin"]
    E --> F
    F --> G["MCP Server: capability check + dispatch"]
    G --> H["Tool handler: call_tool()"]
    H --> I["Skill logic: LLM call / API / file I/O"]
    I --> J["TextContent result"]
    J --> K["JSON-RPC response over stdout / SSE stream"]
    K --> L["Client renders result to user"]

Reading the diagram: the left branch (stdio) means the client manages the server's entire process lifecycle — it starts and stops your Python script. The right branch (HTTP+SSE) means your server runs independently as a daemon; the client simply calls it over HTTP. Both paths deliver results through the same call_tool() handler, which is why switching transports requires zero changes to your business logic.


🌍 Real-World Applications: How Cursor, Claude Desktop, and GitHub Copilot Use MCP Today

MCP has moved from experimental to default in most major AI coding tools. Here is how each client uses it in practice.

Cursor uses MCP to give its inline chat access to custom project tools. A common pattern: a team deploys an MCP server that wraps their internal code-search index (not exposed to GitHub Copilot's cloud) and registers it in each developer's Cursor config. The LLM can now answer "where is the payment retry logic?" against private code without sending the full codebase to an external API.

Claude Desktop ships with an mcp.json registry in its application config directory. When you add an MCP server entry, Claude immediately sees the new tools. Anthropic's own reference implementations include servers for filesystem access, PostgreSQL, Brave Search, and Google Maps — all wired via the same protocol.

GitHub Copilot (VS Code agent mode, 2025+) added MCP server discovery as a first-class feature. The extensions.json or workspace-level mcp.json file registers servers that become available to Copilot's /agent commands. A CI/CD team at a mid-size fintech uses this to expose a "lint findings summarizer" server: Copilot's agent calls it as a tool during code review, and the result appears inline as a Copilot suggestion — without any GitHub App installation or cloud infrastructure beyond a single Railway-hosted container.


βš–οΈ Trade-offs and Failure Modes: stdio vs SSE, Auth Complexity, and Cold Starts

Every architectural choice in MCP deployment involves real trade-offs. Knowing the failure modes before you hit them saves debugging time.

stdio — simplicity at the cost of scale. stdio is the easiest transport to get running and the safest from a security perspective (no network surface). The failure mode is isolation: each client spawns its own copy of the server process. If five developers on a team each open Claude Desktop simultaneously, you get five Python processes, five sets of cold-start LLM calls, five cached states. If your tool has warm-up cost (model loading, database connection pooling), stdio amplifies it linearly.

HTTP+SSE — power at the cost of operational complexity. The SSE connection is persistent, which means network interruptions (firewalls closing idle connections, load balancer timeouts) will silently drop the stream. Clients must implement reconnect logic. The MCP SDK handles basic reconnection, but misconfigured reverse proxies with short idle timeouts are a common production surprise.

Authentication. stdio has no auth — OS process isolation is the security boundary. HTTP+SSE needs explicit auth. The three practical options are: bearer tokens (simple, rotate via secrets manager), mutual TLS (strong, complex to set up), or OAuth 2.0 with a short-lived access token exchanged at session start. A common mistake is exposing an HTTP MCP server on 0.0.0.0 without any auth during local testing and then forgetting to add it before deploying to a shared cloud environment.
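A minimal bearer-token check might look like the sketch below. It is framework-agnostic and assumes header names arrive lower-cased (as ASGI normalises them); wire it into your HTTP layer's middleware and load the expected token from a secrets manager, not from source.

```python
import hmac

def is_authorized(headers: dict, expected_token: str) -> bool:
    """Validate 'Authorization: Bearer <token>' with a constant-time compare."""
    auth = headers.get("authorization", "")
    scheme, _, supplied = auth.partition(" ")
    if scheme != "Bearer" or not supplied:
        return False
    # hmac.compare_digest avoids leaking token contents via timing differences.
    return hmac.compare_digest(supplied, expected_token)
```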

Versioning and schema drift. Because tools are discovered at runtime, a server upgrade that removes or renames a tool will silently break any client that cached the old capability manifest. Use semantic versioning in your server name (repo-summarizer-v2) and maintain backwards-compatible aliases during transition windows.

Cold starts on serverless. AWS Lambda and Google Cloud Run both support HTTP-based MCP servers, but the first request after a cold start will include the initialization handshake latency on top of the function cold-start time. For latency-sensitive tools, prefer a container-based always-on deployment (Railway, Fly.io) or set a minimum instance count.


🧭 Decision Guide: Choosing Your Transport and Deployment Pattern

| Recommendation | When |
| --- | --- |
| Use stdio | Your tool is for local personal use, runs on the same machine as the client, and multi-client access is not required |
| Use HTTP+SSE | Multiple developers need simultaneous access, the server lives on a remote host, or you need persistent state between calls |
| Go headless (container/serverless) | The skill must be available 24/7 without an engineer running it manually, or it serves production workflows (CI, review automation) |
| Stay stdio + local | Security requirements prohibit external network exposure, or the tool uses local filesystem or internal-only APIs |
| Consider serverless (Lambda/Cloud Run) | Call volume is low and sporadic, you want zero-infrastructure idle cost, and cold-start latency of ~500 ms is acceptable |
| Avoid serverless | The tool has long warm-up time (model loading), requires a persistent TCP connection to a database, or latency SLOs are under 200 ms |

🧪 Practical Example: The Repo Summarizer — One Python File, Three Clients

Here is the complete path from a single Python file to a skill callable from Cursor, Claude Desktop, and GitHub Copilot simultaneously.

The Server File

# repo_summarizer_server.py
import os
from datetime import datetime, timedelta, timezone

import httpx
from starlette.applications import Starlette
from starlette.routing import Mount, Route

from mcp import types
from mcp.server import Server
from mcp.server.sse import SseServerTransport

app = Server("repo-summarizer")

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="summarize_repo",
            description="Fetch recent commits and open PRs for a GitHub repo, then return a structured summary.",
            inputSchema={
                "type": "object",
                "properties": {
                    "repo": {"type": "string", "description": "owner/repo (e.g. anthropics/mcp)"},
                    "days": {"type": "integer", "default": 7}
                },
                "required": ["repo"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name != "summarize_repo":
        raise ValueError(f"Unknown tool: {name}")

    repo = arguments["repo"]
    days = arguments.get("days", 7)
    # Honour the advertised `days` window instead of silently ignoring it.
    since = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    token = os.environ.get("GITHUB_TOKEN", "")
    headers = {"Authorization": f"Bearer {token}"} if token else {}

    async with httpx.AsyncClient(headers=headers) as client:
        commits_resp = await client.get(
            f"https://api.github.com/repos/{repo}/commits",
            params={"per_page": 20, "since": since}
        )
        prs_resp = await client.get(
            f"https://api.github.com/repos/{repo}/pulls",
            params={"state": "open", "per_page": 10}
        )

    commits = [c["commit"]["message"].splitlines()[0] for c in commits_resp.json()[:10]]
    prs = [f"#{p['number']}: {p['title']}" for p in prs_resp.json()[:5]]

    summary = (
        f"## {repo} — last {days} days\n\n"
        f"**Recent commits ({len(commits)}):**\n"
        + "\n".join(f"- {c}" for c in commits)
        + f"\n\n**Open PRs ({len(prs)}):**\n"
        + "\n".join(f"- {p}" for p in prs)
    )
    return [types.TextContent(type="text", text=summary)]

# Wire the SSE transport into an ASGI app that uvicorn can serve.
# (This follows the SDK's SSE example; adjust if your SDK version differs.)
sse = SseServerTransport("/messages/")

async def handle_sse(request):
    async with sse.connect_sse(request.scope, request.receive, request._send) as streams:
        await app.run(*streams, app.create_initialization_options())

asgi_app = Starlette(routes=[
    Route("/sse", endpoint=handle_sse),
    Mount("/messages/", app=sse.handle_post_message),
])

Dockerfile for Headless Deployment

FROM python:3.12-slim
WORKDIR /app
RUN pip install mcp httpx starlette uvicorn
COPY repo_summarizer_server.py .
ENV PORT=8080
CMD ["sh", "-c", "uvicorn repo_summarizer_server:asgi_app --host 0.0.0.0 --port ${PORT}"]

Deploy to Railway with railway up. The resulting URL (https://repo-summarizer.railway.app) is the endpoint you register in each client.

Registering in Three Clients

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "repo-summarizer": {
      "url": "https://repo-summarizer.railway.app/sse",
      "headers": { "Authorization": "Bearer YOUR_TOKEN" }
    }
  }
}

Cursor (.cursor/mcp.json in the project root):

{
  "mcpServers": {
    "repo-summarizer": {
      "url": "https://repo-summarizer.railway.app/sse"
    }
  }
}

VS Code / GitHub Copilot (.vscode/mcp.json):

{
  "servers": {
    "repo-summarizer": {
      "type": "sse",
      "url": "https://repo-summarizer.railway.app/sse"
    }
  }
}

All three now call the same Docker container. One deploy, three clients, zero duplicate skill code.


πŸ› οΈ FastMCP: How It Simplifies MCP Server Development

The raw mcp SDK is explicit and flexible, but its decorator pattern can feel ceremonial for small servers. FastMCP (jlowin/fastmcp) provides a @mcp.tool() decorator that mirrors FastAPI's ergonomics — Python type hints become the JSON Schema automatically, and you skip the manual list_tools / call_tool split.

The same repo summarizer in FastMCP:

# fast_server.py
import os
from datetime import datetime, timedelta, timezone

import httpx
from fastmcp import FastMCP

mcp = FastMCP("repo-summarizer")

@mcp.tool()
async def summarize_repo(repo: str, days: int = 7) -> str:
    """Summarise recent commits and open PRs for a GitHub repository."""
    token = os.environ.get("GITHUB_TOKEN", "")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    since = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    async with httpx.AsyncClient(headers=headers) as client:
        commits = (await client.get(
            f"https://api.github.com/repos/{repo}/commits",
            params={"per_page": 20, "since": since}
        )).json()
        prs = (await client.get(
            f"https://api.github.com/repos/{repo}/pulls",
            params={"state": "open", "per_page": 10}
        )).json()
    lines = [f"## {repo} (last {days} days)", "", "**Commits:**"]
    lines += [f"- {c['commit']['message'].splitlines()[0]}" for c in commits[:10]]
    lines += ["", "**Open PRs:**"]
    lines += [f"- #{p['number']}: {p['title']}" for p in prs[:5]]
    return "\n".join(lines)

if __name__ == "__main__":
    mcp.run()

FastMCP extracts the function signature, docstring, and type hints to build the full JSON Schema automatically. The mcp.run() call defaults to stdio but accepts a transport="sse" argument for HTTP deployment. For a full deep-dive on FastMCP's advanced features (context injection, middleware, resource endpoints), see the FastMCP documentation.

Also worth knowing: Anthropic maintains modelcontextprotocol/servers, a reference implementations repo with production-ready servers for Postgres, filesystem, GitHub, Slack, and more — useful both as callable tools and as code templates for building your own.


📚 Lessons Learned: What Actually Breaks When You Deploy MCP in Production

Don't treat stdio as "just for testing." For a single-developer workflow, stdio is the correct long-term choice. It is simpler, more secure, and has lower operational overhead than an SSE server. Reach for HTTP+SSE only when multi-client access or remote deployment is actually needed.

The JSON Schema is your API contract. Clients cache the capability manifest. If you change a parameter name between deploys, existing client sessions will send the old parameter name and your handler will receive None. Always validate inputs defensively: arguments.get("repo_url") or arguments.get("repo") as a migration shim costs nothing and prevents silent failures.
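That shim can live in a small helper so every tool handler shares the same migration logic. A sketch, using the parameter names from the repo summarizer example:

```python
def extract_repo(arguments: dict) -> str:
    """Accept the new 'repo' name and the legacy 'repo_url' during migration."""
    repo = arguments.get("repo") or arguments.get("repo_url")
    if repo is None:
        # Fail loudly rather than passing None into the skill logic.
        raise ValueError("Missing required argument: 'repo' (or legacy 'repo_url')")
    return repo
```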

Instrument your tool handlers like microservices. Add structured logging, duration metrics, and error tracking (Sentry, OpenTelemetry) from day one. When a tool is called from three different clients, correlating which client triggered a failure becomes much harder without request IDs.
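One lightweight way to get there is a wrapper around the dispatcher that attaches a request id and duration to every call. A sketch (the log field names are illustrative, not a standard):

```python
import logging
import time
import uuid

logger = logging.getLogger("mcp.tools")

async def instrumented_call(handler, name: str, arguments: dict):
    """Wrap a tool dispatch with a request id and a duration metric."""
    request_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        result = await handler(name, arguments)
        logger.info("tool=%s request_id=%s status=ok duration_ms=%.1f",
                    name, request_id, (time.perf_counter() - start) * 1000)
        return result
    except Exception:
        # logger.exception records the traceback alongside the request id.
        logger.exception("tool=%s request_id=%s status=error", name, request_id)
        raise
```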

Test with two clients simultaneously before declaring success. It is easy to make a server work from one client by accident (global state, un-thread-safe caches). Running Cursor and Claude Desktop against the same HTTP+SSE server in parallel during local testing catches concurrency bugs before they reach production.

Cold-start surprises are transport-agnostic. Even with HTTP+SSE, if your handler initialises a heavy ML model on first call, the first user will wait. Move expensive initialisation to server startup — outside the handler — so it runs once at boot, not once per request.
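The fix is structural: run expensive setup at import time (server boot) and keep handlers thin. A sketch with a hypothetical `load_model` standing in for the heavy step:

```python
import time

def load_model() -> dict:
    """Stand-in for slow warm-up work (model weights, connection pools)."""
    time.sleep(0.05)  # simulated one-time cost
    return {"ready": True}

# Paid once when the server process starts, not once per tool call.
MODEL = load_model()

async def handle_request(arguments: dict) -> str:
    # The handler only touches the already-warm singleton.
    return "model ready" if MODEL["ready"] else "warming"
```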


📌 TLDR: Summary and Key Takeaways

  • MCP is a universal adapter: any skill exposed as an MCP server is automatically callable from any MCP-aware client — Cursor, Claude Desktop, GitHub Copilot, VS Code agent mode — without per-client rewrites.
  • Two transports, two use cases: stdio for local/personal use (simpler, more secure); HTTP+SSE for remote/multi-client deployment (more powerful, more operational surface).
  • The Python SDK wires tools in three parts: list_tools() for schema advertisement, call_tool() for dispatch, and a transport context for the wire layer.
  • FastMCP removes boilerplate: type hints become JSON Schema automatically; @mcp.tool() is the only decorator you need for simple servers.
  • Deploy headlessly on Railway or Fly.io with a five-line Dockerfile; register the same SSE URL in all three client config files.
  • Production pitfalls: schema drift breaks clients silently, SSE connections drop under idle load-balancer timeouts, and uninstrumented tool handlers are impossible to debug at scale.
  • One-liner to remember: Write the skill once, expose it over MCP, and let every AI assistant in your team call it — that is what "deploy once, call everywhere" means in practice.

πŸ“ Practice Quiz

  1. Which MCP transport is most appropriate for a shared team server that multiple developers access simultaneously from different machines?

    • A) stdio, because it has lower per-call overhead
    • B) HTTP + SSE, because it supports multiple concurrent clients over the network
    • C) WebSocket, because MCP requires a bidirectional connection
    • D) gRPC, because JSON-RPC is too slow for production

    Correct Answer: B

  2. A developer updates their MCP server to rename the repo_url parameter to repo in the tool schema, but forgets to update the corresponding client config file. What is the most likely result?

    • A) The client will automatically detect the schema change and update its config
    • B) The client will throw a JSON parse error on every call
    • C) The server will receive None for repo_url because the client still sends the old parameter name
    • D) The MCP handshake will fail and the tool will disappear from the client

    Correct Answer: C

  3. In the FastMCP framework, how does the server know the JSON Schema for a registered tool's input parameters?

    • A) The developer writes a separate schema YAML file and registers it at startup
    • B) FastMCP reads the Python function's type hints and docstring to generate the schema automatically
    • C) The client sends its expected schema to the server during initialization and the server adapts
    • D) FastMCP requires a Pydantic model class for every tool β€” plain function signatures are not supported

    Correct Answer: B

  4. Open-ended challenge: You are deploying a repo summarizer MCP server that calls the GitHub API and an LLM for each request. After going live, you notice that the first call after each deploy takes 8 seconds while subsequent calls take 1.5 seconds. Describe at least two strategies to reduce that first-call latency without switching to a different transport, and explain any trade-offs each strategy introduces.


Written by Abstract Algorithms (@abstractalgorithms)