
Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs

A practical decision framework with cost analysis, latency benchmarks, and Python code for both paths — so you pick right the first time.

Abstract Algorithms · 34 min read

AI-assisted content.

TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M tokens/day with a dedicated MLOps team. The build vs buy decision is a spreadsheet problem, not an engineering identity problem.

📖 The $47,000 Monthly Bill That Killed the Roadmap

Month one, a seed-stage startup routes every user query to GPT-4 Turbo. The product is a document analysis tool — users paste contracts, the model summarises obligations, highlights risk clauses, and answers follow-up questions. The team ships fast, investors are happy, and the demo is genuinely impressive. Month two: $18,000 on OpenAI. Month three: the invoice is $47,000. The co-founders freeze new feature work and spend two weeks hunting optimisations — shorter prompts, reduced context windows, aggressive caching. They get the bill down to $31,000. Still existential. Meanwhile, a competitor with shallower pockets has deployed Llama 3 70B (4-bit quantised) on two A100 nodes. Their all-in monthly infrastructure cost, including storage, networking, and a small reserved-instance discount, is $4,800. They serve the same contract-analysis use case at 40% lower latency because there is no round-trip to a third-party API. The startup cannot pivot fast enough, burns through its runway buffer in month five, and is forced into a down-round to cover operations.

Now flip the scenario. A three-person team building an internal HR chatbot reads this cautionary tale and decides to self-host from day one. They provision two A40 GPUs, spend four weeks setting up vLLM, writing inference pipelines, writing evals, and debugging CUDA memory errors. The chatbot goes live. It gets 200 queries per day — a load that would cost roughly $90 per month on GPT-4o. The GPU spend is $6,000 in cloud time. The model hallucinates on HR policy questions because the team cannot get fine-tuning to converge on their 400-row example dataset. They end up using the API anyway, and the whole self-hosting effort is sunk cost.

This is the build versus buy problem for LLMs. The wrong call in either direction is expensive — sometimes catastrophically so. Neither self-hosting nor the API is automatically better. The decision depends on eight variables that most teams do not measure before they commit. This post gives you the framework to measure them, the cost formulas to run the spreadsheet, and a production-ready Python router so you can hedge your bets even after you decide.

🔍 Why This Decision Is Harder Than Most Engineering Trade-offs

Teams that have navigated cloud database vendor lock-in or build-vs-buy for analytics pipelines think they understand this decision. LLMs add five dimensions that those playbooks do not cover.

Cost has three layers, not one. When engineers compare API pricing to GPU hourly rates, they typically compare layer one: token price versus compute price. But layer two — engineering time — dwarfs it for most teams. Fine-tuning requires data curation, hyperparameter search, evaluation harnesses, and red-teaming. Ongoing model operations require on-call rotations, version management, and performance monitoring. A senior ML engineer costs $200,000+ per year fully loaded. Three months of that engineer's time on self-hosting infrastructure is $50,000 that never appears in the GPU invoice. Layer three is model iteration cost: every time a new base model is released (GPT-4o → o1 → o3, all in 18 months), self-hosted teams must re-evaluate, re-fine-tune, and re-deploy. API teams get the upgrade for free.

Latency is two distinct numbers. Time-to-first-token (TTFT) — the delay before the user sees any output — determines perceived responsiveness. Throughput — tokens per second sustained — determines how many users you can serve concurrently. A self-hosted vLLM deployment can achieve 30–60ms TTFT on a well-provisioned A100, versus 200–800ms for a busy GPT-4o API call. But throughput depends on batching strategy, not model location. Teams optimising for TTFT and throughput need very different solutions, and confusing the two is a common source of bad self-hosting decisions.

Data privacy is not binary. Sending data to the OpenAI API does not automatically violate HIPAA or GDPR. OpenAI offers Business Associate Agreements for HIPAA-covered entities, data processing addendums for GDPR, and zero-data-retention options. The real question is whether your legal team, your customers' legal teams, and your compliance auditors will accept those agreements. In healthcare, financial services, and government, the answer is often no — not because the provider is untrustworthy, but because regulations require demonstrable control over where data is processed, not just contractual assurances. This is a compliance question before it is an engineering question.

Fine-tuning has a ceiling below model scale. Teams often assume they can self-host a 7B model, fine-tune it on their domain data, and match GPT-4o quality. In practice, many emergent capabilities — complex multi-step reasoning, robust instruction following across diverse tasks, reliable tool use — only appear at 70B parameters or above. Fine-tuning a 7B model on domain-specific data can improve narrow task performance significantly, but it will not give you GPT-4-level general reasoning. If your use case requires that level of reasoning, you either self-host a 70B-class model (with corresponding hardware cost) or you use the API.

Model velocity is asymmetric. API providers ship new model families every few months. Self-hosted teams must decide when (and whether) to upgrade base models, retrain adapters, re-run evaluations, and update serving configurations. Teams that self-host because they want stability often get it — but stability means missing capability improvements that API-first competitors adopt in a single config change.

⚙️ The 7-Factor Framework for Choosing Your LLM Deployment Model

Rather than making this decision by intuition, score your use case against seven factors. Each factor scores 0, 1, or 2 points. Sum the scores for a decision tier.

| Factor | 0 — Strong API signal | 1 — Neutral / hybrid | 2 — Strong self-host signal |
| --- | --- | --- | --- |
| Daily token volume | Under 5M tokens/day | 5M–50M tokens/day | Over 50M tokens/day |
| Latency SLA | Over 500ms acceptable | 200–500ms acceptable | Under 200ms TTFT required |
| Data privacy | No regulated PII/PHI; standard DPA acceptable | Internal data, informal policy | HIPAA/GDPR/SOC 2 with strict residency; data must not leave premises |
| Customisation need | Prompt engineering achieves target quality | Lightweight LoRA adapter sufficient | Full fine-tuning on proprietary corpus required; behaviours not achievable by prompting |
| Team ML capability | No MLOps function; < 2 ML engineers | Small ML team, some infra experience | Dedicated ML platform team; proven model serving track record |
| Budget predictability | Variable spend is acceptable; usage-based billing preferred | Mixed — some fixed infrastructure tolerable | Fixed infrastructure budget; variable API spend creates forecasting problems |
| Model freshness | Need latest model capabilities (o3, GPT-5, Claude 4) | Model version stability acceptable for 6 months | Stable model version required for 12+ months; upgrade cadence owned internally |

Scoring rubric:

  • 0–4 points → API-first. Use managed APIs. Invest savings into prompt engineering, evaluation harnesses, and semantic caching. Revisit when volume grows.
  • 5–9 points → Hybrid. Route high-volume, simpler tasks to a self-hosted model; keep complex reasoning and low-volume tasks on API. Implement a routing layer.
  • 10–14 points → Self-host first. The unit economics, compliance requirements, or customisation needs justify full self-hosting. API remains the safety net for capability gaps.

Score your use case honestly before reading the rest of this post. Most early-stage teams score 0–4. Most growth-stage companies with specific verticals score 5–9. Only high-volume, compliance-constrained, or deeply specialised deployments reliably score 10 or above.
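The rubric is mechanical enough to encode directly. A minimal sketch of the scorer (the class and field names are illustrative, not part of any published tooling):

```python
from dataclasses import dataclass, astuple

@dataclass
class DeploymentScore:
    """Each field holds 0 (API signal), 1 (neutral), or 2 (self-host signal),
    matching the seven-factor rubric above."""
    daily_token_volume: int
    latency_sla: int
    data_privacy: int
    customisation_need: int
    team_ml_capability: int
    budget_predictability: int
    model_freshness: int

    def tier(self) -> str:
        total = sum(astuple(self))
        if total <= 4:
            return "Tier 1 - API-first"
        if total <= 9:
            return "Tier 2 - Hybrid"
        return "Tier 3 - Self-host first"

# Early-stage team: low volume, relaxed SLA, no MLOps, wants latest models
print(DeploymentScore(0, 0, 0, 1, 0, 0, 0).tier())  # Tier 1 - API-first
```

The point of writing it down is that the score goes in the architecture review document, where the inputs can be challenged one factor at a time.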

🧠 Running the Numbers: A Cost Model That Actually Holds Up

Before any architecture decision, open a spreadsheet. The numbers are not intuitive — and many teams that made the wrong call did so because they only looked at the first layer of the cost model. This section breaks the economics into two sub-questions: what does the pricing layer actually hide, and where does the break-even actually land once you use realistic throughput numbers?

Internals: What the Pricing Layer Hides From You

When you look at API token prices in isolation, you are seeing one dimension of a three-layer cost structure. Layer one is the raw token price: dollars per million input and output tokens. This is the only number most teams compare. Layer two is the engineering cost to set up, operate, and maintain the path you choose — semantic caching, eval harnesses, fallback logic, on-call coverage. For self-hosted deployments, this layer includes GPU provisioning, vLLM configuration, model version management, and incident response. A fully loaded senior ML engineer at $200,000/year adds $50,000 in labour cost per quarter of infrastructure work — a number that dwarfs GPU invoices at most team sizes. Layer three is the opportunity cost of model velocity: API providers ship new model families roughly every six months. Self-hosted teams must re-evaluate, re-fine-tune, and re-deploy for every model upgrade they want to adopt; API teams inherit improvements transparently.

The practical implication: before you compare $3.00/M tokens (Claude) against roughly $0.50/M tokens (the amortised GPU figure of $0.0005 per 1K tokens), add the engineering labour multiplier. Most teams undercount it by 3–5×.

API pricing (approximate 2026 rates):

| Model | Input ($/M tokens) | Output ($/M tokens) | Best for |
| --- | --- | --- | --- |
| GPT-4o | ~$5.00 | ~$15.00 | Complex reasoning, multimodal |
| GPT-4o-mini | ~$0.15 | ~$0.60 | High-volume simple tasks |
| Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | Long context, instruction following |
| Gemini 1.5 Pro | ~$3.50 | ~$10.50 | Long context, document analysis |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 | Ultra-high-volume, latency-tolerant |

The "hidden gem" insight here is GPT-4o-mini and Gemini Flash. At $0.30–0.60 per million output tokens (and under $0.15 per million input tokens), they obliterate the economics of self-hosting for all but the very highest volumes. Before you provision a GPU, check whether a smaller API model meets your quality bar — many teams discover it does for 60–70% of their query types.

Self-hosting cost model:

| Component | Cloud on-demand | Reserved 1-year |
| --- | --- | --- |
| A100 80GB (single GPU) | ~$3.50/hr | ~$2.10/hr |
| A40 48GB (single GPU) | ~$1.10/hr | ~$0.65/hr |
| Llama 3 70B (4-bit quant) hardware | 2× A40 or 1× A100 | |
| Throughput (Llama 3 70B, batch 32) | ~1,500 tokens/sec | |
| Overhead (storage, networking, monitoring) | +20% on GPU cost | |

Performance Analysis: Throughput, Latency, and Where the Break-Even Actually Lands

The throughput figure in the table above — 1,500 tokens/sec at batch size 32 — is the critical number for break-even analysis, and it is frequently misused. Single-request throughput (measuring one prompt at a time) is much lower: roughly 40–80 tokens/sec for a 70B model on a single A100. Naive cost estimates use single-request throughput and dramatically overestimate GPU requirements. vLLM's continuous batching fills the GPU with requests from multiple concurrent users, achieving much higher aggregate throughput. At batch size 32–64 in a production serving environment, the effective tokens/sec approaches 1,200–1,800 — which is the number you should use for break-even calculations.
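To make the batching point concrete, here is a back-of-envelope capacity sketch; the 3× peak-to-mean traffic ratio is an assumption you should replace with your own traffic shape:

```python
import math

def gpus_needed(daily_tokens: int,
                tokens_per_sec_per_gpu: float,
                peak_to_mean: float = 3.0) -> int:
    """GPUs required to serve daily_tokens, sized for peak-hour load."""
    mean_tps = daily_tokens / 86_400        # average tokens/sec over a day
    peak_tps = mean_tps * peak_to_mean      # assumed peak-hour rate
    return math.ceil(peak_tps / tokens_per_sec_per_gpu)

# 50M tokens/day sized two ways:
print(gpus_needed(50_000_000, 1_500))  # batched vLLM throughput -> 2 GPUs
print(gpus_needed(50_000_000, 60))     # naive single-request figure -> 29 GPUs
```

Using the single-request number inflates the GPU estimate by an order of magnitude, which is exactly how teams talk themselves out of (or into) the wrong tier.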

Latency also deserves a precise definition. Time-to-first-token (TTFT) — how quickly the user sees the first word of a response — is determined primarily by prompt processing time and model size. A self-hosted Llama 3 70B on a well-provisioned A100 achieves 30–60ms TTFT. The Claude or GPT-4o API typically ranges from 200–800ms TTFT depending on load. For streaming UIs where the user sees tokens as they arrive, the gap is perceptible. Throughput tokens/sec — how fast the full response arrives after first token — is a function of GPU compute and batching. For use cases where latency is the primary driver, self-hosting with vLLM is genuinely superior to managed APIs. For use cases where quality is the primary driver, the API advantage in model capability often outweighs the latency difference.

The Python cost calculator. Run this with your actual token volumes before making any architecture decision:

def monthly_cost_api(
    daily_input_tokens: int,
    daily_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Compute monthly API cost in USD."""
    monthly_input = daily_input_tokens * 30
    monthly_output = daily_output_tokens * 30
    return (
        monthly_input / 1_000_000 * input_price_per_million
        + monthly_output / 1_000_000 * output_price_per_million
    )

def monthly_cost_self_host(
    gpu_hourly_rate: float = 2.10,   # A40 reserved 1-year rate
    num_gpus: int = 2,
) -> float:
    """Compute monthly self-hosting cost (GPU + 20% overhead for storage/networking).

    Reserved instances bill around the clock, so spend is independent of
    utilisation; utilisation affects break-even throughput, not this cost.
    """
    gpu_monthly = gpu_hourly_rate * num_gpus * 24 * 30
    overhead = gpu_monthly * 0.20
    return gpu_monthly + overhead

def break_even_daily_tokens(
    self_host_monthly: float,
    input_price_per_million: float,
    output_price_per_million: float,
    output_ratio: float = 0.20,  # output tokens as fraction of total
) -> int:
    """
    Estimate the daily token volume at which self-hosting becomes cheaper.
    Assumes output_ratio fraction of tokens are output tokens.
    """
    blended_price = (
        (1 - output_ratio) * input_price_per_million
        + output_ratio * output_price_per_million
    ) / 1_000_000  # cost per token
    monthly_tokens_needed = self_host_monthly / blended_price
    return int(monthly_tokens_needed / 30)

# --- Example: 10M input + 2M output tokens/day vs Claude 3.5 Sonnet ---
api_cost = monthly_cost_api(
    daily_input_tokens=10_000_000,
    daily_output_tokens=2_000_000,
    input_price_per_million=3.0,
    output_price_per_million=15.0,
)
self_host_cost = monthly_cost_self_host(gpu_hourly_rate=2.10, num_gpus=2)
be = break_even_daily_tokens(self_host_cost, 3.0, 15.0)

print(f"API (Claude 3.5 Sonnet):     ${api_cost:>10,.0f}/month")
print(f"Self-hosted Llama 3 70B:     ${self_host_cost:>10,.0f}/month")
print(f"Break-even daily volume:     {be / 1_000_000:.1f}M tokens/day")
print()

# --- Scenario: GPT-4o-mini (the hidden gem tier) ---
mini_cost = monthly_cost_api(
    daily_input_tokens=10_000_000,
    daily_output_tokens=2_000_000,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
)
be_mini = break_even_daily_tokens(self_host_cost, 0.15, 0.60)

print(f"API (GPT-4o-mini):           ${mini_cost:>10,.0f}/month")
print(f"Self-hosted Llama 3 70B:     ${self_host_cost:>10,.0f}/month")
print(f"Break-even vs mini:          {be_mini / 1_000_000:.0f}M tokens/day")

Running this for 10M input + 2M output tokens per day produces:

API (Claude 3.5 Sonnet):     $     1,800/month
Self-hosted Llama 3 70B:     $     3,629/month
Break-even daily volume:     22.4M tokens/day

API (GPT-4o-mini):           $        81/month
Self-hosted Llama 3 70B:     $     3,629/month
Break-even vs mini:          504M tokens/day

The implication is stark. Against Claude Sonnet, self-hosting breaks even at roughly 22M tokens per day on GPU cost alone, a volume a mid-size product can reach; at the example volume of 12M tokens per day, the API is still cheaper. Against GPT-4o-mini, you need over 500M tokens per day before self-hosting is cheaper. If your use case can tolerate GPT-4o-mini quality, the API is almost certainly the right answer until you are at Netflix scale.

The self-host calculation also does not include the ML engineering labour cost. Add one engineer at $200K/year fully loaded and the break-even shifts dramatically. Include two engineers (model training + infrastructure) and self-hosting rarely wins until you are well above 100M tokens per day with a significant output ratio.
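To quantify that shift, extend the calculator with a labour line item; the $200K fully loaded figure is the one used throughout this post, and the FTE count is yours to adjust:

```python
def monthly_cost_self_host_with_labour(
    gpu_monthly: float = 3_628.80,       # output of monthly_cost_self_host() above
    fte_count: float = 1.0,              # engineers dedicated to model serving
    fully_loaded_annual: float = 200_000.0,
) -> float:
    """Self-hosting cost including engineering labour."""
    return gpu_monthly + fte_count * fully_loaded_annual / 12

total = monthly_cost_self_host_with_labour(fte_count=1.0)
blended = (0.8 * 3.0 + 0.2 * 15.0) / 1_000_000   # Claude $/token, 20% output ratio
print(f"${total:,.0f}/month -> break-even {total / blended / 30 / 1e6:.0f}M tokens/day")
```

With one dedicated engineer on the books, the break-even against Claude Sonnet lands above 100M tokens per day — a very different conversation than the GPU-only number.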

📊 Three Deployment Tiers and When Each One Fits

The decision is not binary. Most production systems at scale use a tiered model. The following flowchart maps the key decision points to the three tiers. Read it top-to-bottom: each diamond is a yes/no question; the answers route you toward API-only, hybrid, or self-host-first.

```mermaid
graph TD
    A[Start: Evaluate Your Use Case] --> B{Daily volume above 50M tokens?}
    B -->|No| C{PII or PHI in prompts?}
    B -->|Yes| D{Dedicated MLOps team?}
    C -->|No| E{Latency SLA under 200ms?}
    C -->|Yes| F{Data can leave premises?}
    E -->|No| G[Tier 1 - API Only]
    E -->|Yes| H[Tier 2 - Hybrid with vLLM for latency]
    F -->|No| I[Tier 3 - Self-Host or Private Endpoint]
    F -->|Yes| G
    D -->|No| J[Tier 2 - Hybrid]
    D -->|Yes| K{Fine-tuning on proprietary corpus needed?}
    K -->|No| J
    K -->|Yes| L[Tier 3 - Self-Host First]
```


The flowchart shows that volume is the first filter but not the only one. A team with 5M tokens per day and strict PHI data residency requirements bypasses the volume question entirely and lands in Tier 3. A team with 200M tokens per day but no MLOps function will find Tier 2 or even Tier 1 (via aggressive mini-model routing) more practical than a full Tier 3 self-hosted deployment. Read the entire path, not just the first branch.

Tier 1 — API-First (0–10M tokens/day)

At this volume, the API is almost always cheaper when you account for engineering labour. The architecture is simple: a request router dispatches to a primary model (GPT-4o or Claude) and a fallback (GPT-4o-mini), with a semantic cache layer (Redis + embedding similarity) in front to absorb repeated queries.

```mermaid
graph TD
    App[Application Layer] --> Cache[Semantic Cache - Redis plus Embeddings]
    Cache -->|Cache hit| Response[Return Cached Response]
    Cache -->|Cache miss| Router[Model Router]
    Router --> Primary[Primary Model - GPT-4o or Claude 3.5 Sonnet]
    Router --> Fallback[Fallback Model - GPT-4o-mini or Gemini Flash]
    Primary --> Store[Store in Cache]
    Fallback --> Store
    Store --> Response
```

This diagram shows the Tier 1 request path. The semantic cache is placed before the router — even before model selection — because a cache hit is infinitely cheaper than any model call. Semantic caching with cosine similarity on query embeddings typically reduces API calls by 30–60% for use cases with natural query repetition (support chatbots, FAQ systems, search assistants).

The key optimisations in Tier 1: route queries semantically (similar questions share cache entries), use the mini model for short/simple queries, and set hard monthly spend alerts at 80% of budget before you hit overage.

Tier 2 — Hybrid Routing (10M–200M tokens/day)

The hybrid tier adds a self-hosted serving layer for high-volume, lower-complexity tasks. Simple completions, classification, short summarisation, and slot-filling go to the self-hosted model. Complex reasoning, multi-step planning, code generation, and long-context tasks go to the API. A complexity classifier (heuristic or trained) sits in the router and makes the dispatch decision per request.

The self-hosted component is typically Llama 3 8B or Mistral 7B — small enough to fit on a single A10 GPU with excellent throughput, cheap enough that the break-even against API pricing is around 5–10M tokens per day. The 70B model is rarely necessary in Tier 2 because the complex tasks that need it go to the API anyway.

Tier 3 — Self-Host First (200M+ tokens/day)

At this scale, infrastructure economics decisively favour self-hosting for most query types. The serving layer is vLLM with Llama 3 70B or Mistral Large, deployed on Kubernetes with horizontal pod autoscaling keyed on inference request queue depth. The API becomes the safety net: tasks that require the absolute latest model capabilities, or tasks where the self-hosted model's quality falls meaningfully below API quality on your eval suite, fall back to the API.

The Tier 3 operational burden is substantial: model version management, eval harnesses that run on every model update, GPU capacity planning, on-call rotations for inference outages, and integration with your existing observability stack. Budget at least 1.5 FTE dedicated to model serving infrastructure before committing to this tier.

🌍 Real Deployments, Real Numbers: Who Chose What and Why

Harvey AI — self-hosted, compliance-driven. Harvey builds legal AI for law firms and corporate legal departments. Their regulatory environment is non-negotiable: client privileged communications cannot traverse a third-party API. Harvey self-hosts fine-tuned models on their own infrastructure and uses their own legal corpus to improve model performance on contracts, case law, and regulatory filings. The compliance requirement alone — not the economics — made self-hosting mandatory. Engineering cost was secondary.

Cursor — API-first, quality-driven. The AI-powered IDE from Anysphere routes its core code completion and chat to Claude and GPT-4o via APIs. At Cursor's scale and use case, model quality matters more than infrastructure cost: developers notice immediately when code suggestions degrade. Claude's long context window and strong code understanding are core product features that Cursor cannot easily replicate with an open-weight model. They optimise cost through aggressive caching and smart context management, not by switching to self-hosted inference.

Perplexity AI — hybrid, performance-optimised. Perplexity uses a fine-tuned Mistral variant for search grounding — the retrieval and citation task that runs on every query. For complex, multi-step reasoning answers on hard queries, they route to Claude or GPT-4o through their API. The hybrid architecture allows them to run extremely high query volumes cost-effectively while maintaining top-tier answer quality on the queries where it matters. Their search grounding model runs as a proprietary fine-tune on their own infrastructure; the reasoning model is rented.

Notion AI — API-first, iteration-speed-driven. Notion launched their AI features on OpenAI's API because shipping speed mattered more than optimisation at that stage. They iterated on prompts, UX patterns, and feature scope for six months before thinking seriously about infrastructure. They subsequently added a semantic caching layer to manage cost as usage grew. Their lesson: API-first let them discover what features users actually wanted before locking into an infrastructure decision.

Enterprise migration at 500M+ tokens/day. Multiple enterprise teams running document processing pipelines at 500M+ tokens per day have migrated from API-first to Mistral self-hosted after monthly token bills exceeded $100,000. The trigger is almost always financial — the monthly API invoice becomes a budget line item large enough to fund two additional ML engineers plus the GPU infrastructure to replace it. The migration is painful (2–4 months of work), but the economics after migration are compelling: a Tier 3 self-hosted deployment at that volume typically costs 70–85% less than API pricing for the same workload.

⚖️ Hidden Risks and Failure Modes in Both Directions

Knowing which tier is right does not guarantee a good outcome. Each path has failure modes that are invisible until they hit production.

| Decision | Hidden risk | Mitigation |
| --- | --- | --- |
| API-only at scale without a gateway | Vendor lock-in; price changes or model deprecations require emergency refactors | Abstract all LLM calls behind an LLM gateway (LiteLLM, OpenRouter). One config change to switch providers. |
| Self-host without automated evals | Quality regression goes undetected after model updates or infra changes | Run evaluation harness on every model deploy. Automated regression tests on a golden set of 200+ prompt/response pairs. |
| Fine-tuning without a baseline eval | Cannot measure whether fine-tuning helped or hurt | Run baseline evals on the untuned model before training. Compare on identical eval set post-training. |
| Hybrid routing without a classifier | Complex tasks hit the cheap model; quality degrades silently | Task complexity classifier with logged confidence scores. Alerting on output quality metrics by routing path. |
| Self-host on a single GPU node | Single point of failure; no autoscaling; queue saturation under load | Kubernetes + vLLM replicas; horizontal pod autoscaler keyed on llm_queue_depth; circuit breaker to API fallback. |
| Switching providers mid-product without testing | Subtle output format differences break downstream parsers | Shadow traffic testing: run new provider in parallel, log disagreements, before cutting over. |
| Skipping semantic cache | Paying for repeated identical API calls | Implement Redis + embedding similarity cache before any other cost optimisation. 30–60% call reduction is typical. |
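The shadow-traffic mitigation is straightforward to wire up. A sketch of the comparison step, using difflib as a crude stand-in for the embedding or LLM-judge similarity you would use in production:

```python
import difflib

def shadow_compare(prompt: str, primary_fn, shadow_fn,
                   similarity_floor: float = 0.85) -> dict:
    """Serve the primary provider's response; log the shadow provider's
    response and flag divergent pairs for offline review."""
    primary = primary_fn(prompt)
    shadow = shadow_fn(prompt)
    sim = difflib.SequenceMatcher(None, primary, shadow).ratio()
    return {
        "served": primary,                      # always serve primary
        "shadow": shadow,
        "similarity": round(sim, 3),
        "flag_for_review": sim < similarity_floor,
    }

# Same answer, different format - exactly the divergence that breaks parsers
r = shadow_compare("2+2?", lambda p: "4", lambda p: "The answer is 4.")
print(r["flag_for_review"])  # True
```

Run this on a sample of live traffic for a week before any provider cutover; the flagged pairs tell you which downstream parsers will break.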

🧭 The One-Page Decision Checklist

Use this table in your architecture review. Each question maps to a recommendation. If your answers put you in two different tiers, the stronger constraint wins (data privacy and volume tend to dominate).

| Question | Yes → | No → |
| --- | --- | --- |
| Do you process > 50M tokens/day on this use case? | Evaluate self-hosting (Tier 2 or 3) | Start with API (Tier 1) |
| Do you have PII, PHI, or regulated data in prompts? | Self-host or use private API endpoint (Tier 3) | API is compliant with standard DPA |
| Can you staff a dedicated MLOps engineer? | Self-hosting is operationally viable | API-only; self-hosting ops burden is too high |
| Do you need model behaviours not achievable by prompting alone? | Fine-tuning required → self-host (Tier 3) | Prompt engineering + API is sufficient |
| Is your use case latency-sensitive (< 200ms TTFT)? | Self-host with vLLM + speculative decoding (Tier 2/3) | API latency acceptable (Tier 1) |
| Is your token volume predictable month-to-month? | Reserved GPU instances; self-host economics are favourable | API with usage alerts; variable billing acceptable |
| Do you need the latest model capabilities as they ship? | Managed API — new models arrive there first (Tier 1) | Self-host with controlled upgrade cycle viable |
| Is your monthly API bill already above $10K? | Run the cost model; evaluate Tier 2 hybrid | API economics still favourable |
| Is your eval infrastructure mature and automated? | Self-hosting is viable; without regression detection it is risky | Build evals before self-hosting |

The single most important insight from this checklist: build your evaluation infrastructure before you build your serving infrastructure. Teams that cannot measure model quality reliably cannot safely operate self-hosted models. Every self-hosted deployment decision should be preceded by an eval investment, not followed by one.
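A golden-set harness does not need to be sophisticated to be useful. A sketch of the minimum viable version; the keyword check is a placeholder for your real metrics (LLM-as-judge, task-specific scoring):

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str
    required_keywords: list[str]   # every keyword must appear in the response

def run_regression(model_fn, golden: list[GoldenCase],
                   min_pass_rate: float = 0.95) -> tuple[bool, float]:
    """Run model_fn over the golden set; gate the deploy on the pass rate."""
    passed = 0
    for case in golden:
        out = model_fn(case.prompt).lower()
        if all(kw.lower() in out for kw in case.required_keywords):
            passed += 1
    rate = passed / len(golden)
    return rate >= min_pass_rate, rate

golden = [GoldenCase("What is the notice period clause?", ["notice", "days"])]
ok, rate = run_regression(lambda p: "The notice period is 30 days.", golden)
print(ok, rate)  # True 1.0
```

Wire this into the deploy pipeline so a model or provider change cannot ship without clearing the golden set.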

🧪 Building a Production LLM Router in Python

The following router implements the core Tier 2 pattern: classify task complexity from prompt features, dispatch simple tasks to a self-hosted vLLM endpoint (served with an OpenAI-compatible API surface), route complex tasks to the API, track cost per request, and fall back to the API automatically if the self-hosted endpoint misses its latency SLA.

The complexity classifier here is a heuristic — fast and transparent, but limited. In a production Tier 2 deployment, replace it with a trained classifier (a fine-tuned BERT or a small LLM that predicts routing category) for higher accuracy. The rest of the router structure remains the same.

import os
import time
from dataclasses import dataclass, field
from openai import OpenAI

@dataclass
class RoutingDecision:
    model: str
    endpoint: str
    estimated_cost_usd: float
    reason: str
    complexity: str = "unknown"

@dataclass
class RouterStats:
    total_requests: int = 0
    self_hosted_requests: int = 0
    api_requests: int = 0
    sla_fallbacks: int = 0
    total_estimated_cost_usd: float = 0.0

    def summary(self) -> str:
        pct_self = (
            100 * self.self_hosted_requests / self.total_requests
            if self.total_requests
            else 0
        )
        return (
            f"Requests: {self.total_requests} total | "
            f"{self.self_hosted_requests} self-hosted ({pct_self:.0f}%) | "
            f"{self.api_requests} API | "
            f"{self.sla_fallbacks} SLA fallbacks | "
            f"Est. cost: ${self.total_estimated_cost_usd:.4f}"
        )

class LLMRouter:
    """
    Routes LLM requests between a self-hosted vLLM endpoint and a managed API
    based on heuristic task complexity and a latency SLA guard.

    Replace _estimate_complexity() with a trained classifier in production.
    Replace the Anthropic call with LiteLLM proxy for multi-provider support.
    """

    SELF_HOST_ENDPOINT = os.getenv("VLLM_ENDPOINT", "http://localhost:8000/v1")
    SELF_HOST_MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

    # For the API path, use LiteLLM proxy (see the LiteLLM section below)
    # or the provider's native SDK. This example uses an OpenAI-compat layer.
    API_MODEL = "claude-3-5-sonnet-20241022"
    API_BASE_URL = os.getenv("LLM_GATEWAY_URL", "https://api.anthropic.com/v1")
    LATENCY_SLA_MS = 3_000  # 3-second SLA for self-hosted path

    # Approximate amortised costs (USD per 1K tokens)
    SELF_HOST_COST_PER_1K = 0.0005   # GPU cost amortised at 70% utilisation
    API_COST_INPUT_PER_1K = 0.003    # Claude Sonnet input
    API_COST_OUTPUT_PER_1K = 0.015   # Claude Sonnet output

    def __init__(self) -> None:
        self.self_host_client = OpenAI(
            base_url=self.SELF_HOST_ENDPOINT,
            api_key="not-needed",  # vLLM does not require an API key
        )
        self.api_client = OpenAI(
            api_key=os.environ["LLM_API_KEY"],
            base_url=self.API_BASE_URL,
        )
        self.stats = RouterStats()

    def _estimate_complexity(self, prompt: str) -> str:
        """
        Heuristic complexity classifier.
        Categories: simple | medium | complex
        Replace with a trained classifier for production accuracy.
        """
        words = prompt.split()
        word_count = len(words)

        has_code_request = any(
            kw in prompt.lower()
            for kw in ["implement", "write code", "debug", "refactor", "write a function"]
        )
        has_reasoning_request = any(
            kw in prompt.lower()
            for kw in ["explain why", "analyse", "compare", "design", "evaluate", "trade-off"]
        )
        has_multi_step = any(
            kw in prompt.lower()
            for kw in ["step by step", "plan", "first", "then", "finally", "outline"]
        )

        if word_count < 40 and not has_code_request and not has_reasoning_request:
            return "simple"
        if has_code_request or has_reasoning_request or has_multi_step or word_count > 250:
            return "complex"
        return "medium"

    def _build_routing_decision(self, prompt: str) -> RoutingDecision:
        complexity = self._estimate_complexity(prompt)
        # Rough token estimate: 1 word ≈ 1.3 tokens
        estimated_tokens = len(prompt.split()) * 1.3

        if complexity == "simple":
            return RoutingDecision(
                model=self.SELF_HOST_MODEL,
                endpoint=self.SELF_HOST_ENDPOINT,
                estimated_cost_usd=estimated_tokens / 1_000 * self.SELF_HOST_COST_PER_1K,
                reason=f"simple task ({len(prompt.split())} words) → self-hosted Llama 3",
                complexity=complexity,
            )
        # medium and complex → API
        cost = (
            estimated_tokens / 1_000 * self.API_COST_INPUT_PER_1K
            + (estimated_tokens * 0.5) / 1_000 * self.API_COST_OUTPUT_PER_1K
        )
        return RoutingDecision(
            model=self.API_MODEL,
            endpoint="api",
            estimated_cost_usd=cost,
            reason=f"{complexity} task → managed API ({self.API_MODEL})",
            complexity=complexity,
        )

    def _call_self_hosted(self, prompt: str, decision: RoutingDecision) -> str | None:
        """
        Call the self-hosted vLLM endpoint.
        Returns the response text, or None if the call fails or exceeds the SLA.
        """
        start = time.monotonic()
        try:
            resp = self.self_host_client.chat.completions.create(
                model=self.SELF_HOST_MODEL,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
                timeout=self.LATENCY_SLA_MS / 1_000,
            )
            elapsed_ms = (time.monotonic() - start) * 1_000
            if elapsed_ms > self.LATENCY_SLA_MS:
                return None  # SLA breach — fall back to API
            return resp.choices[0].message.content
        except Exception:
            return None  # Any error → fall back to API

    def _call_api(self, prompt: str) -> str:
        resp = self.api_client.chat.completions.create(
            model=self.API_MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return resp.choices[0].message.content

    def complete(self, prompt: str) -> tuple[str, RoutingDecision]:
        """
        Route a prompt to the appropriate model and return (response, decision).
        Updates internal stats for cost and routing telemetry.
        """
        decision = self._build_routing_decision(prompt)
        self.stats.total_requests += 1

        if decision.endpoint == self.SELF_HOST_ENDPOINT:
            response = self._call_self_hosted(prompt, decision)
            if response is not None:
                self.stats.self_hosted_requests += 1
                self.stats.total_estimated_cost_usd += decision.estimated_cost_usd
                return response, decision
            # SLA breach or error: fall back to API
            decision.reason += " [self-host error or SLA breach — fell back to API]"
            self.stats.sla_fallbacks += 1

        # API path (direct route or fallback)
        response = self._call_api(prompt)
        self.stats.api_requests += 1
        if decision.endpoint == self.SELF_HOST_ENDPOINT:
            # Fallback: re-estimate at API rates so cost telemetry isn't undercounted
            est = len(prompt.split()) * 1.3
            decision.estimated_cost_usd = (
                est / 1_000 * self.API_COST_INPUT_PER_1K
                + est * 0.5 / 1_000 * self.API_COST_OUTPUT_PER_1K
            )
        self.stats.total_estimated_cost_usd += decision.estimated_cost_usd
        return response, decision

# --- Demo: route a mixed workload ---
if __name__ == "__main__":
    router = LLMRouter()

    test_prompts = [
        "Summarise this PR title: Fix null pointer in UserService",
        "What is the capital of France?",
        "Implement a thread-safe LRU cache in Python with O(1) get and put operations.",
        "Compare the trade-offs between eventual consistency and strong consistency in distributed databases.",
        "Translate: Hello, how are you?",
    ]

    for prompt in test_prompts:
        # complete() is synchronous; for concurrent production workloads,
        # run it in a thread pool or port the router to async clients
        response, dec = router.complete(prompt)
        print(f"[{dec.complexity.upper()}] {dec.reason}")
        print(f"  Est. cost: ${dec.estimated_cost_usd:.5f}")
        print(f"  Response:  {response[:100]}...")
        print()

    print("--- Session Summary ---")
    print(router.stats.summary())

The router's key design decisions are worth calling out. The SLA guard (LATENCY_SLA_MS) is a first-class citizen — if the self-hosted endpoint is slow (cold start, high load, GPU memory pressure), the router automatically falls back to the API rather than degrading the user experience. The RouterStats object gives you the telemetry data you need to tune the routing thresholds over time: if SLA fallbacks are consistently above 5%, your self-hosted capacity is under-provisioned. If more than 80% of requests are going to the API, your complexity classifier is over-classifying tasks as complex and you are overpaying.
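Those two tuning thresholds can be checked mechanically from the stats counters. A minimal sketch, assuming counter fields like the ones the router updates above (the field and function names here are illustrative, not part of the router's API):

```python
from dataclasses import dataclass

@dataclass
class RouterStats:
    # Illustrative counters mirroring what the router updates per request.
    total_requests: int = 0
    self_hosted_requests: int = 0
    api_requests: int = 0
    sla_fallbacks: int = 0
    total_estimated_cost_usd: float = 0.0

def tuning_signals(stats: RouterStats) -> list[str]:
    """Turn raw routing telemetry into capacity/threshold tuning advice."""
    if stats.total_requests == 0:
        return []
    signals = []
    if stats.sla_fallbacks / stats.total_requests > 0.05:
        signals.append("fallbacks > 5%: self-hosted capacity is under-provisioned")
    if stats.api_requests / stats.total_requests > 0.80:
        signals.append("API share > 80%: classifier over-classifies as complex")
    return signals

stats = RouterStats(total_requests=100, self_hosted_requests=10,
                    api_requests=90, sla_fallbacks=8)
for signal in tuning_signals(stats):
    print(signal)
```

Run this against a snapshot of the stats object on a schedule — or export the same two ratios to your metrics system — so threshold drift surfaces before the invoice does.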

🛠️ LiteLLM: The Gateway Layer That Makes the Decision Reversible

One of the most expensive mistakes in LLM deployment is hard-coding API client calls throughout your codebase. When you need to switch from GPT-4o to Claude (for cost reasons) or from Claude to a self-hosted Llama endpoint (for compliance reasons), a hard-coded integration means touching every call site. LiteLLM solves this by providing a single OpenAI-compatible interface over 100+ LLM providers and self-hosted endpoints.

LiteLLM handles:

  • A unified API surface (swap GPT-4o for Claude with a single config line change — no code changes)
  • Automatic request/response logging and per-request cost tracking
  • Fallback chains: if the primary provider is unavailable, automatically route to the fallback
  • Rate limiting, budget guardrails, and spend alerts per user or team
  • Caching integration (Redis) for semantic deduplication

Minimal LiteLLM proxy configuration (litellm_config.yaml):

model_list:
  - model_name: primary                  # logical name your app calls
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: primary                  # same logical name → the proxy load-balances and retries across deployments
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fast                     # low-cost path for simple tasks
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: self-hosted              # vLLM endpoint, OpenAI-compat
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-70B-Instruct
      api_base: http://vllm-service:8000/v1
      api_key: not-needed

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # Postgres for cost tracking

litellm_settings:
  fallbacks:
    - primary:
        - fast
  cache: true
  cache_params:
    type: redis
    host: redis-service
    port: 6379

Calling the LiteLLM proxy from Python — identical code regardless of backend:

import os
from openai import OpenAI

# Point the standard OpenAI client at your LiteLLM proxy.
# Swap models by changing "primary" to "self-hosted" or "fast" — zero other changes.
client = OpenAI(
    api_key=os.environ["LITELLM_MASTER_KEY"],
    base_url=os.environ.get("LITELLM_PROXY_URL", "http://localhost:4000"),
)

def call_llm(prompt: str, model: str = "primary", max_tokens: int = 512) -> str:
    """
    All LLM calls in your application route through this single function.
    Changing providers, adding fallbacks, or switching to self-hosted
    requires only a litellm_config.yaml change — no application code changes.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

# Simple task → cheap model
answer = call_llm("What is the boiling point of water?", model="fast")

# Complex task → primary (Claude or GPT-4o with automatic fallback)
analysis = call_llm(
    "Analyse the trade-offs between CQRS and traditional CRUD for a high-write financial ledger.",
    model="primary",
)

# Force self-hosted for a privacy-sensitive prompt
sensitive = call_llm(
    "Summarise the following patient note: ...",
    model="self-hosted",
)

The key insight: when your entire codebase calls call_llm(prompt, model="primary"), you can migrate from GPT-4o to Claude to a self-hosted Llama endpoint by editing litellm_config.yaml and redeploying the proxy — your application does not change at all. This is the architectural hedge that makes the build-vs-buy decision reversible. For a full deep-dive on LiteLLM in production agent routing, including budget guardrails and multi-tenant cost attribution, see the LLM skill registry and routing post.

📚 Lessons Learned

Six non-obvious lessons from teams that have navigated this decision in both directions.

1. Start API-first, self-host at a specific measurable threshold — never speculatively. The threshold should be defined in writing before you build: "When our monthly API cost exceeds $15,000 for three consecutive months, we evaluate Tier 2 hybrid routing." Speculative self-hosting (building the infrastructure before you hit the threshold) almost always results in sunk cost and under-utilised GPUs.

2. Semantic caching reduces API costs 30–60% and should come before any infrastructure change. A Redis cache with embedding-based similarity lookup is a 2-week engineering project. It consistently reduces API call volume by 30–60% for use cases with natural query repetition. Do this before you consider a GPU. Many teams discover after implementing caching that they never hit their self-hosting threshold.

3. If your fine-tuning dataset has fewer than 1,000 examples, you need better prompts, not fine-tuning. Fine-tuning on small datasets rarely achieves the hoped-for improvement and frequently hurts performance on out-of-distribution inputs. The minimum viable fine-tuning dataset for a domain-adaptation task is typically 5,000–10,000 diverse, high-quality examples. Below that threshold, prompt engineering with a few-shot example library almost always beats fine-tuning.
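A "few-shot example library" can be as simple as ranking stored examples by relevance to the incoming task and prepending the top k. A sketch with hypothetical contract-clause examples and naive word-overlap ranking (a production system would rank by embedding similarity):

```python
EXAMPLES = [  # hypothetical curated (input, output) pairs for contract analysis
    {"input": "Summarise: the employee may not disclose confidential information...",
     "output": "Confidentiality clause: bars disclosure of company information."},
    {"input": "Summarise: either party may terminate this agreement with 30 days notice...",
     "output": "Termination clause: 30-day notice, either party."},
]

def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_few_shot_prompt(task: str, k: int = 2) -> str:
    # Rank the library by naive word overlap with the incoming task;
    # swap in embedding similarity for production relevance ranking.
    ranked = sorted(EXAMPLES, key=lambda ex: overlap(ex["input"], task), reverse=True)
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in ranked[:k]
    )
    return f"{shots}\n\nInput: {task}\nOutput:"

prompt = build_few_shot_prompt("Summarise: the employee may not solicit clients...")
print(prompt.count("Input:"))  # → 3 (two shots plus the task itself)
```

A 400-row dataset that is too small to fine-tune on is usually an excellent few-shot library.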

4. vLLM's continuous batching changes the economics significantly at scale. Naive estimates compare single-request throughput. vLLM's continuous batching at batch size 64+ achieves throughput-to-cost ratios that beat API pricing sooner than naive estimates suggest — sometimes at 30–40M tokens per day rather than 50–80M. If you are running GPU cost estimates, use vLLM benchmark numbers at realistic batch sizes, not theoretical single-request throughput.

5. GPT-4o-mini at $0.60/M output tokens often outperforms fine-tuned 7B models on real tasks. This is the uncomfortable truth that derails many self-hosting plans. A carefully prompted GPT-4o-mini outperforms a fine-tuned Llama 3 7B on the majority of production tasks we have seen. The 7B model only wins when fine-tuned on a very large, clean, domain-specific dataset (> 50,000 examples) for a narrow, well-defined task. Validate with evals before committing.
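"Validate with evals" does not require heavy tooling to start. A minimal sketch of a model-versus-model comparison on a labelled set — the two model callables are stand-ins for real API/self-hosted clients, and exact match is the crudest possible scorer:

```python
from typing import Callable

def run_eval(model_fn: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs the model answers exactly.
    Real production evals would use semantic metrics or an LLM judge."""
    correct = sum(
        model_fn(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)

# Hypothetical stand-ins for a managed API model and a fine-tuned 7B model.
dataset = [("2+2?", "4"), ("Capital of France?", "Paris")]
api_model = lambda p: {"2+2?": "4", "Capital of France?": "Paris"}[p]
tuned_7b = lambda p: {"2+2?": "4", "Capital of France?": "Lyon"}[p]

print(run_eval(api_model, dataset))  # → 1.0
print(run_eval(tuned_7b, dataset))   # → 0.5
```

The point is the shape, not the scorer: the same harness runs against both candidates on the same dataset, and the self-hosting decision waits for the numbers.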

6. Spend alerts save more money than architecture changes, and you should set them on Day 1. Set a hard alert at 80% of your monthly LLM budget the day you integrate your first API call. Most overspend incidents are detectable 10–14 days before month end if you have alerting. Most teams that get surprised by a $47,000 invoice did not have alerts configured. The infrastructure decision comes after the alert fires; the alert should be there from the beginning.
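The Day-1 alert needs only month-to-date spend, the budget, and a linear projection. A sketch — the 80% fraction and message strings are illustrative; in practice, wire the return value to your paging or chat tooling:

```python
import calendar
from datetime import date

def check_spend(mtd_spend_usd: float, monthly_budget_usd: float,
                today: date, alert_fraction: float = 0.8) -> list[str]:
    """Return alert messages for budget-threshold and projected overspend."""
    alerts = []
    if mtd_spend_usd >= alert_fraction * monthly_budget_usd:
        alerts.append(
            f"HARD ALERT: ${mtd_spend_usd:,.0f} spent, "
            f">= {alert_fraction:.0%} of ${monthly_budget_usd:,.0f} budget"
        )
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = mtd_spend_usd / today.day * days_in_month
    if projected > monthly_budget_usd:
        alerts.append(
            f"Projected month-end spend ${projected:,.0f} exceeds "
            f"${monthly_budget_usd:,.0f} budget"
        )
    return alerts

# Day 12 of May: $9,400 spent against a $15,000 budget —
# the linear projection fires more than two weeks before month end.
for alert in check_spend(9_400, 15_000, date(2024, 5, 12)):
    print(alert)
```

The projection check is what buys you the 10–14 days of lead time: the hard threshold alone fires only after most of the budget is already gone.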

📌 TLDR

TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M tokens/day with a dedicated MLOps team. The build vs buy decision is a spreadsheet problem, not an engineering identity problem.

Key takeaways from this post:

  • 7-factor scoring framework: Volume, latency SLA, data privacy, customisation need, team capability, budget predictability, and model freshness each score 0–2. Sum determines your tier (API / Hybrid / Self-host).
  • Break-even formula: Against Claude Sonnet pricing, self-hosting a Llama 3 70B deployment breaks even at ~8.5M tokens/day. Against GPT-4o-mini, break-even is ~242M tokens/day. GPT-4o-mini is the "hidden gem" that extends API economics much further than most teams assume.
  • Three deployment tiers: API-only (Tier 1, 0–10M tokens/day) → Hybrid with complexity-based routing (Tier 2, 10–200M tokens/day) → Self-host first (Tier 3, > 200M tokens/day).
  • LiteLLM as the reversibility layer: Abstract all LLM calls behind an LLM gateway from day one. The build-vs-buy decision becomes a config file change, not a code rewrite.
  • Eval infrastructure before serving infrastructure: You cannot safely operate self-hosted models without automated regression detection. Build evals first.
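The break-even arithmetic behind these takeaways fits in a few lines. A sketch using the cost model from this post (720 GPU-hours/month and a 20% overhead factor are the post's assumptions; exact figures shift with utilisation and current pricing):

```python
def breakeven_tokens_per_day(gpu_hourly_usd: float, n_gpus: int,
                             api_blended_usd_per_m: float,
                             overhead: float = 0.20) -> float:
    """Daily token volume at which reserved-GPU serving matches API spend."""
    monthly_infra_usd = gpu_hourly_usd * n_gpus * 720 * (1 + overhead)
    monthly_tokens = monthly_infra_usd / api_blended_usd_per_m * 1_000_000
    return monthly_tokens / 30

# 2x A40 at $0.65/hr vs Claude Sonnet at a blended ~$4.80/M tokens
print(f"{breakeven_tokens_per_day(0.65, 2, 4.80):,.0f}")  # → 7,800,000
# The same formula with GPT-4o-mini's far lower blended rate pushes
# break-even into the hundreds of millions of tokens per day.
```

When comparing against a continuous-batching deployment, feed in a blended rate derived from vLLM benchmark throughput at realistic batch sizes rather than single-request numbers.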

📝 Practice Quiz

  1. At what approximate daily token volume does self-hosting a Llama 3 70B deployment (2× A40 reserved GPUs) typically break even against Claude 3.5 Sonnet API pricing?

  - A) ~5M tokens/day
  - B) ~8–10M tokens/day ✅
  - C) ~50M tokens/day
  - D) ~200M tokens/day

Correct Answer: B — approximately 8–10M tokens per day. Using the cost model in this post: 2× A40 reserved at $0.65/hr each = $1.30/hr = $936/month, plus 20% overhead = ~$1,123/month. Against Claude Sonnet at $3/M input + $15/M output (assuming an 85/15 input/output split, blended ~$4.80/M tokens), break-even is $1,123 ÷ $4.80 per million tokens ÷ 30 days ≈ 7.8M tokens/day. The exact figure varies with utilisation and pricing, but the 8–10M range is closest. Against GPT-4o-mini, break-even is ~242M tokens/day — so A (5M), C (50M), and D (200M) are all wrong for the Sonnet comparison, and every option is wrong for mini.

  2. Which factor most strongly favours self-hosting over a managed API for a healthcare AI startup?

  - A) The startup processes 15M tokens per day
  - B) They want lower response latency
  - C) HIPAA compliance requires all PHI to remain on-premises ✅
  - D) They want to fine-tune on 500 domain-specific examples

Correct Answer: C — HIPAA compliance with strict data residency requirements. Token volume (A) influences the economics but does not mandate self-hosting; the API can serve high volumes. Lower latency (B) is achievable with self-hosting but also with edge deployments or regional API endpoints. HIPAA with strict residency (C) often makes managed API endpoints unacceptable regardless of cost — the compliance requirement is binary, not economic. Fine-tuning on a small dataset (D) is a poor reason to self-host; fewer than 1,000 examples rarely justifies the infrastructure cost.

  3. What does vLLM's continuous batching primarily improve, compared to naive single-request inference?

  - A) Time-to-first-token for individual requests
  - B) GPU throughput by batching multiple in-flight requests ✅
  - C) Model output quality through better decoding
  - D) Peak GPU memory usage through weight sharing

Correct Answer: B — GPU throughput by processing multiple in-flight requests in parallel. Continuous batching (also called iteration-level scheduling) lets vLLM add new requests to the batch mid-generation, rather than waiting for every request in a batch to complete before starting new ones. This dramatically improves GPU utilisation and tokens-per-second throughput. It does not reduce time-to-first-token for individual requests (A) — TTFT is primarily a function of model size and hardware. It does not improve output quality (C) or reduce memory usage (D); quantisation handles memory reduction.

  4. A startup processes 5M tokens per day with no PII data, no regulatory constraints, and a 2-person engineering team (no ML specialist). What is the best recommendation?

  - A) API-first (Tier 1) with semantic caching layer ✅
  - B) Self-host Llama 3 70B to avoid future cost growth
  - C) Hybrid routing with self-hosted 7B model for simple tasks
  - D) Implement semantic caching only, no model serving changes

Correct Answer: A — API-first with semantic caching. At 5M tokens per day with no compliance constraints and no MLOps capability, the 7-factor score is low (likely 1–3). Self-hosting (B) requires at minimum one dedicated ML engineer and produces no economic benefit at this volume compared to GPT-4o-mini. A hybrid setup (C) adds operational complexity without a clear benefit at this scale. Semantic caching alone (D) is valuable but should augment API usage, not replace a deliberate serving strategy. The correct path is Tier 1: API with semantic caching to reduce call volume, spend alerts, and a plan to re-evaluate if volume grows 10× in the next 6 months.

  5. (Open-ended) Your team uses the Claude API at $25,000/month for a document summarisation pipeline. You are considering a hybrid routing approach: route short, simple documents to a self-hosted Llama 3 8B model, keep long and complex documents on the Claude API. Describe the steps you would take to evaluate whether this approach would reduce cost without degrading quality. Include the metrics you would track and the threshold that would trigger a rollback.

A strong answer should include:

Step 1 — Establish a baseline eval. Before changing anything, build an evaluation dataset of 300–500 representative (prompt, expected_output) pairs drawn from production traffic. Include examples across the full complexity distribution — short documents, medium documents, long documents, edge cases. Run Claude on all of them and record scores on your quality metrics.

Step 2 — Define quality metrics. For summarisation, suitable metrics include ROUGE-L (lexical overlap), BERTScore (semantic similarity), and a human-preference eval (20 examples rated by a domain expert). Set a minimum acceptable score for each metric based on the baseline — for example: "BERTScore must not drop below baseline − 0.03."

Step 3 — Build the classifier and shadow-run. Implement the complexity classifier (heuristic or trained) and deploy the self-hosted model. Run the classifier in shadow mode: all production traffic still goes to Claude, but the classifier logs which requests it would have routed to Llama 3. Measure agreement between Claude and Llama 3 outputs on the shadowed simple-category requests using your eval metrics.

Step 4 — Define the routing threshold and rollback trigger. Set a specific threshold, for example: "Route to self-hosted if complexity_score < 0.3 AND document_word_count < 500." Track per-routing-path quality scores in production dashboards. Define a rollback trigger: "If BERTScore on self-hosted-routed requests drops more than 0.05 below the Claude baseline, or if the user-reported error rate on the self-hosted path exceeds 2%, revert all routing to Claude within 30 minutes."

Step 5 — Gradual cutover with canary traffic. Route 5% of production simple-category traffic to the self-hosted model. Monitor quality metrics and user signals (session abandonment, re-query rate, explicit feedback) for 72 hours. If metrics hold, increase to 25%, 50%, then 100%, with 72-hour monitoring windows between each step.

Metrics to track: BERTScore per routing path, ROUGE-L per routing path, user re-query rate (a proxy for perceived quality), self-hosted SLA compliance rate (p95 latency), cost per request by routing path, and API fallback rate from SLA breaches.

Rollback trigger: any of (a) a quality-metric breach, (b) user-facing error rate > 2%, (c) SLA fallback rate > 10% (indicating a capacity problem), or (d) an A/B test showing a statistically significant increase in re-query rate on the self-hosted path. Rollback means setting routing back to 100% API immediately; investigate before re-attempting cutover.
Written by

Abstract Algorithms
@abstractalgorithms