
Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs

A practical decision framework with cost analysis, latency benchmarks, and Python code for both paths — so you pick right the first time.

Abstract Algorithms · 34 min read

AI-assisted content.

TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M tokens/day with a dedicated MLOps team. The build vs buy decision is a spreadsheet problem, not an engineering identity problem.

📖 The $47,000 Monthly Bill That Killed the Roadmap

Month one, a seed-stage startup routes every user query to GPT-4 Turbo. The product is a document analysis tool — users paste contracts, the model summarises obligations, highlights risk clauses, and answers follow-up questions. The team ships fast, investors are happy, and the demo is genuinely impressive. Month two: $18,000 on OpenAI. Month three: the invoice is $47,000. The co-founders freeze new feature work and spend two weeks hunting optimisations — shorter prompts, reduced context windows, aggressive caching. They get the bill down to $31,000. Still existential. Meanwhile, a competitor with shallower pockets has deployed Llama 3 70B (4-bit quantised) on two A100 nodes. Their all-in monthly infrastructure cost, including storage, networking, and a small reserved-instance discount, is $4,800. They serve the same contract-analysis use case at 40% lower latency because there is no round-trip to a third-party API. The startup cannot pivot fast enough, burns through its runway buffer in month five, and is forced into a down-round to cover operations.

Now flip the scenario. A three-person team building an internal HR chatbot reads this cautionary tale and decides to self-host from day one. They provision two A40 GPUs, spend four weeks setting up vLLM, writing inference pipelines, writing evals, and debugging CUDA memory errors. The chatbot goes live. It gets 200 queries per day — a load that would cost roughly $90 per month on GPT-4o. The GPU spend is $6,000 in cloud time. The model hallucinates on HR policy questions because the team cannot get fine-tuning to converge on their 400-row example dataset. They end up using the API anyway, and the whole self-hosting effort is sunk cost.

This is the build versus buy problem for LLMs. The wrong call in either direction is expensive — sometimes catastrophically so. Neither self-hosting nor the API is automatically better. The decision depends on eight variables that most teams do not measure before they commit. This post gives you the framework to measure them, the cost formulas to run the spreadsheet, and a production-ready Python router so you can hedge your bets even after you decide.

🔍 Why This Decision Is Harder Than Most Engineering Trade-offs

Teams that have navigated cloud database vendor lock-in or build-vs-buy for analytics pipelines think they understand this decision. LLMs add five dimensions that those playbooks do not cover.

Cost has three layers, not one. When engineers compare API pricing to GPU hourly rates, they typically compare layer one: token price versus compute price. But layer two — engineering time — dwarfs it for most teams. Fine-tuning requires data curation, hyperparameter search, evaluation harnesses, and red-teaming. Ongoing model operations require on-call rotations, version management, and performance monitoring. A senior ML engineer costs $200,000+ per year fully loaded. Three months of that engineer's time on self-hosting infrastructure is $50,000 that never appears in the GPU invoice. Layer three is model iteration cost: every time a new base model is released (GPT-4o → o1 → o3, all in 18 months), self-hosted teams must re-evaluate, re-fine-tune, and re-deploy. API teams get the upgrade for free.

Latency is two distinct numbers. Time-to-first-token (TTFT) — the delay before the user sees any output — determines perceived responsiveness. Throughput — tokens per second sustained — determines how many users you can serve concurrently. A self-hosted vLLM deployment can achieve 30–60ms TTFT on a well-provisioned A100, versus 200–800ms for a busy GPT-4o API call. But throughput depends on batching strategy, not model location. Teams optimising for TTFT and throughput need very different solutions, and confusing the two is a common source of bad self-hosting decisions.

Data privacy is not binary. Sending data to the OpenAI API does not automatically violate HIPAA or GDPR. OpenAI offers Business Associate Agreements for HIPAA-covered entities, data processing addendums for GDPR, and zero-data-retention options. The real question is whether your legal team, your customers' legal teams, and your compliance auditors will accept those agreements. In healthcare, financial services, and government, the answer is often no — not because the provider is untrustworthy, but because regulations require demonstrable control over where data is processed, not just contractual assurances. This is a compliance question before it is an engineering question.

Fine-tuning has a ceiling below model scale. Teams often assume they can self-host a 7B model, fine-tune it on their domain data, and match GPT-4o quality. In practice, many emergent capabilities — complex multi-step reasoning, robust instruction following across diverse tasks, reliable tool use — only appear at 70B parameters or above. Fine-tuning a 7B model on domain-specific data can improve narrow task performance significantly, but it will not give you GPT-4-level general reasoning. If your use case requires that level of reasoning, you either self-host a 70B-class model (with corresponding hardware cost) or you use the API.

Model velocity is asymmetric. API providers ship new model families every few months. Self-hosted teams must decide when (and whether) to upgrade base models, retrain adapters, re-run evaluations, and update serving configurations. Teams that self-host because they want stability often get it — but stability means missing capability improvements that API-first competitors adopt in a single config change.

⚙️ The 7-Factor Framework for Choosing Your LLM Deployment Model

Rather than making this decision by intuition, score your use case against seven factors. Each factor scores 0, 1, or 2 points. Sum the scores for a decision tier.

| Factor | 0 — Strong API signal | 1 — Neutral / hybrid | 2 — Strong self-host signal |
| --- | --- | --- | --- |
| Daily token volume | Under 5M tokens/day | 5M–50M tokens/day | Over 50M tokens/day |
| Latency SLA | Over 500ms acceptable | 200–500ms acceptable | Under 200ms TTFT required |
| Data privacy | No regulated PII/PHI; standard DPA acceptable | Internal data, informal policy | HIPAA/GDPR/SOC 2 with strict residency; data must not leave premises |
| Customisation need | Prompt engineering achieves target quality | Lightweight LoRA adapter sufficient | Full fine-tuning on proprietary corpus required; behaviours not achievable by prompting |
| Team ML capability | No MLOps function; < 2 ML engineers | Small ML team, some infra experience | Dedicated ML platform team; proven model serving track record |
| Budget predictability | Variable spend is acceptable; usage-based billing preferred | Mixed — some fixed infrastructure tolerable | Fixed infrastructure budget; variable API spend creates forecasting problems |
| Model freshness | Need latest model capabilities (o3, GPT-5, Claude 4) | Model version stability acceptable for 6 months | Stable model version required for 12+ months; upgrade cadence owned internally |

Scoring rubric:

  • 0–4 points → API-first. Use managed APIs. Invest savings into prompt engineering, evaluation harnesses, and semantic caching. Revisit when volume grows.
  • 5–9 points → Hybrid. Route high-volume, simpler tasks to a self-hosted model; keep complex reasoning and low-volume tasks on API. Implement a routing layer.
  • 10–14 points → Self-host first. The unit economics, compliance requirements, or customisation needs justify full self-hosting. API remains the safety net for capability gaps.

Score your use case honestly before reading the rest of this post. Most early-stage teams score 0–4. Most growth-stage companies with specific verticals score 5–9. Only high-volume, compliance-constrained, or deeply specialised deployments reliably score 10 or above.
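The rubric is mechanical enough to encode directly. A minimal sketch of the scorer (the class and field names are illustrative, not part of any published tooling):

```python
from dataclasses import dataclass, astuple

@dataclass
class DeploymentScore:
    """Each field holds 0 (API signal), 1 (neutral), or 2 (self-host signal),
    matching the seven-factor rubric above."""
    daily_token_volume: int
    latency_sla: int
    data_privacy: int
    customisation_need: int
    team_ml_capability: int
    budget_predictability: int
    model_freshness: int

    def tier(self) -> str:
        total = sum(astuple(self))
        if total <= 4:
            return "Tier 1 - API-first"
        if total <= 9:
            return "Tier 2 - Hybrid"
        return "Tier 3 - Self-host first"

# Early-stage team: low volume, relaxed SLA, no MLOps, wants latest models
print(DeploymentScore(0, 0, 0, 1, 0, 0, 0).tier())  # Tier 1 - API-first
```

The point of writing it down is that the score goes in the architecture review document, where the inputs can be challenged one factor at a time.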

🧠 Running the Numbers: A Cost Model That Actually Holds Up

Before any architecture decision, open a spreadsheet. The numbers are not intuitive — and many teams that made the wrong call did so because they only looked at the first layer of the cost model. This section breaks the economics into two sub-questions: what does the pricing layer actually hide, and where does the break-even actually land once you use realistic throughput numbers?

Internals: What the Pricing Layer Hides From You

When you look at API token prices in isolation, you are seeing one dimension of a three-layer cost structure. Layer one is the raw token price: dollars per million input and output tokens. This is the only number most teams compare. Layer two is the engineering cost to set up, operate, and maintain the path you choose — semantic caching, eval harnesses, fallback logic, on-call coverage. For self-hosted deployments, this layer includes GPU provisioning, vLLM configuration, model version management, and incident response. A fully loaded senior ML engineer at $200,000/year adds $50,000 in labour cost per quarter of infrastructure work — a number that dwarfs GPU invoices at most team sizes. Layer three is the opportunity cost of model velocity: API providers ship new model families roughly every six months. Self-hosted teams must re-evaluate, re-fine-tune, and re-deploy for every model upgrade they want to adopt; API teams inherit improvements transparently.

The practical implication: before you compare $3.00/M tokens (Claude) against roughly $0.50/M tokens (the amortised GPU figure of $0.0005 per 1K tokens), add the engineering labour multiplier. Most teams undercount it by 3–5×.

API pricing (approximate 2026 rates):

| Model | Input ($/M tokens) | Output ($/M tokens) | Best for |
| --- | --- | --- | --- |
| GPT-4o | ~$5.00 | ~$15.00 | Complex reasoning, multimodal |
| GPT-4o-mini | ~$0.15 | ~$0.60 | High-volume simple tasks |
| Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | Long context, instruction following |
| Gemini 1.5 Pro | ~$3.50 | ~$10.50 | Long context, document analysis |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 | Ultra-high-volume, latency-tolerant |

The "hidden gem" insight here is GPT-4o-mini and Gemini Flash. At $0.30–0.60 per million output tokens (and under $0.15 per million input tokens), they obliterate the economics of self-hosting for all but the very highest volumes. Before you provision a GPU, check whether a smaller API model meets your quality bar — many teams discover it does for 60–70% of their query types.

Self-hosting cost model:

| Component | Cloud on-demand | Reserved 1-year |
| --- | --- | --- |
| A100 80GB (single GPU) | ~$3.50/hr | ~$2.10/hr |
| A40 48GB (single GPU) | ~$1.10/hr | ~$0.65/hr |
| Llama 3 70B (4-bit quant) hardware | 2× A40 or 1× A100 | |
| Throughput (Llama 3 70B, batch 32) | ~1,500 tokens/sec | |
| Overhead (storage, networking, monitoring) | +20% on GPU cost | |

Performance Analysis: Throughput, Latency, and Where the Break-Even Actually Lands

The throughput figure in the table above — 1,500 tokens/sec at batch size 32 — is the critical number for break-even analysis, and it is frequently misused. Single-request throughput (measuring one prompt at a time) is much lower: roughly 40–80 tokens/sec for a 70B model on a single A100. Naive cost estimates use single-request throughput and dramatically overestimate GPU requirements. vLLM's continuous batching fills the GPU with requests from multiple concurrent users, achieving much higher aggregate throughput. At batch size 32–64 in a production serving environment, the effective tokens/sec approaches 1,200–1,800 — which is the number you should use for break-even calculations.
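To make the batching point concrete, here is a back-of-envelope capacity sketch; the 3× peak-to-mean traffic ratio is an assumption you should replace with your own traffic shape:

```python
import math

def gpus_needed(daily_tokens: int,
                tokens_per_sec_per_gpu: float,
                peak_to_mean: float = 3.0) -> int:
    """GPUs required to serve daily_tokens, sized for peak-hour load."""
    mean_tps = daily_tokens / 86_400        # average tokens/sec over a day
    peak_tps = mean_tps * peak_to_mean      # assumed peak-hour rate
    return math.ceil(peak_tps / tokens_per_sec_per_gpu)

# 50M tokens/day sized two ways:
print(gpus_needed(50_000_000, 1_500))  # batched vLLM throughput -> 2 GPUs
print(gpus_needed(50_000_000, 60))     # naive single-request figure -> 29 GPUs
```

Using the single-request number inflates the GPU estimate by an order of magnitude, which is exactly how teams talk themselves out of (or into) the wrong tier.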

Latency also deserves a precise definition. Time-to-first-token (TTFT) — how quickly the user sees the first word of a response — is determined primarily by prompt processing time and model size. A self-hosted Llama 3 70B on a well-provisioned A100 achieves 30–60ms TTFT. The Claude or GPT-4o API typically ranges from 200–800ms TTFT depending on load. For streaming UIs where the user sees tokens as they arrive, the gap is perceptible. Throughput tokens/sec — how fast the full response arrives after first token — is a function of GPU compute and batching. For use cases where latency is the primary driver, self-hosting with vLLM is genuinely superior to managed APIs. For use cases where quality is the primary driver, the API advantage in model capability often outweighs the latency difference.

The Python cost calculator. Run this with your actual token volumes before making any architecture decision:

def monthly_cost_api(
    daily_input_tokens: int,
    daily_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Compute monthly API cost in USD."""
    monthly_input = daily_input_tokens * 30
    monthly_output = daily_output_tokens * 30
    return (
        monthly_input / 1_000_000 * input_price_per_million
        + monthly_output / 1_000_000 * output_price_per_million
    )

def monthly_cost_self_host(
    gpu_hourly_rate: float = 2.10,   # A40 reserved 1-year rate
    num_gpus: int = 2,
) -> float:
    """Compute monthly self-hosting cost (GPU + 20% overhead for storage/networking).

    Reserved instances bill around the clock, so spend is independent of
    utilisation; utilisation affects break-even throughput, not this cost.
    """
    gpu_monthly = gpu_hourly_rate * num_gpus * 24 * 30
    overhead = gpu_monthly * 0.20
    return gpu_monthly + overhead

def break_even_daily_tokens(
    self_host_monthly: float,
    input_price_per_million: float,
    output_price_per_million: float,
    output_ratio: float = 0.20,  # output tokens as fraction of total
) -> int:
    """
    Estimate the daily token volume at which self-hosting becomes cheaper.
    Assumes output_ratio fraction of tokens are output tokens.
    """
    blended_price = (
        (1 - output_ratio) * input_price_per_million
        + output_ratio * output_price_per_million
    ) / 1_000_000  # cost per token
    monthly_tokens_needed = self_host_monthly / blended_price
    return int(monthly_tokens_needed / 30)

# --- Example: 10M input + 2M output tokens/day vs Claude 3.5 Sonnet ---
api_cost = monthly_cost_api(
    daily_input_tokens=10_000_000,
    daily_output_tokens=2_000_000,
    input_price_per_million=3.0,
    output_price_per_million=15.0,
)
self_host_cost = monthly_cost_self_host(gpu_hourly_rate=2.10, num_gpus=2)
be = break_even_daily_tokens(self_host_cost, 3.0, 15.0)

print(f"API (Claude 3.5 Sonnet):     ${api_cost:>10,.0f}/month")
print(f"Self-hosted Llama 3 70B:     ${self_host_cost:>10,.0f}/month")
print(f"Break-even daily volume:     {be / 1_000_000:.1f}M tokens/day")
print()

# --- Scenario: GPT-4o-mini (the hidden gem tier) ---
mini_cost = monthly_cost_api(
    daily_input_tokens=10_000_000,
    daily_output_tokens=2_000_000,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
)
be_mini = break_even_daily_tokens(self_host_cost, 0.15, 0.60)

print(f"API (GPT-4o-mini):           ${mini_cost:>10,.0f}/month")
print(f"Self-hosted Llama 3 70B:     ${self_host_cost:>10,.0f}/month")
print(f"Break-even vs mini:          {be_mini / 1_000_000:.0f}M tokens/day")

Running this for 10M input + 2M output tokens per day produces:

API (Claude 3.5 Sonnet):     $     1,800/month
Self-hosted Llama 3 70B:     $     3,629/month
Break-even daily volume:     22.4M tokens/day

API (GPT-4o-mini):           $        81/month
Self-hosted Llama 3 70B:     $     3,629/month
Break-even vs mini:          504M tokens/day

The implication is stark. Against Claude Sonnet, self-hosting breaks even at roughly 22M tokens per day on GPU cost alone, a volume a mid-size product can reach; at the example volume of 12M tokens per day, the API is still cheaper. Against GPT-4o-mini, you need over 500M tokens per day before self-hosting is cheaper. If your use case can tolerate GPT-4o-mini quality, the API is almost certainly the right answer until you are at Netflix scale.

The self-host calculation also does not include the ML engineering labour cost. Add one engineer at $200K/year fully loaded and the break-even shifts dramatically. Include two engineers (model training + infrastructure) and self-hosting rarely wins until you are well above 100M tokens per day with a significant output ratio.
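To quantify that shift, extend the calculator with a labour line item; the $200K fully loaded figure is the one used throughout this post, and the FTE count is yours to adjust:

```python
def monthly_cost_self_host_with_labour(
    gpu_monthly: float = 3_628.80,       # output of monthly_cost_self_host() above
    fte_count: float = 1.0,              # engineers dedicated to model serving
    fully_loaded_annual: float = 200_000.0,
) -> float:
    """Self-hosting cost including engineering labour."""
    return gpu_monthly + fte_count * fully_loaded_annual / 12

total = monthly_cost_self_host_with_labour(fte_count=1.0)
blended = (0.8 * 3.0 + 0.2 * 15.0) / 1_000_000   # Claude $/token, 20% output ratio
print(f"${total:,.0f}/month -> break-even {total / blended / 30 / 1e6:.0f}M tokens/day")
```

With one dedicated engineer on the books, the break-even against Claude Sonnet lands above 100M tokens per day — a very different conversation than the GPU-only number.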

📊 Three Deployment Tiers and When Each One Fits

The decision is not binary. Most production systems at scale use a tiered model. The following flowchart maps the key decision points to the three tiers. Read it top-to-bottom: each diamond is a yes/no question; the answers route you toward API-only, hybrid, or self-host-first.

```mermaid
graph TD
    A[Start: Evaluate Your Use Case] --> B{Daily volume above 50M tokens?}
    B -->|No| C{PII or PHI in prompts?}
    B -->|Yes| D{Dedicated MLOps team?}
    C -->|No| E{Latency SLA under 200ms?}
    C -->|Yes| F{Data can leave premises?}
    E -->|No| G[Tier 1 - API Only]
    E -->|Yes| H[Tier 2 - Hybrid with vLLM for latency]
    F -->|No| I[Tier 3 - Self-Host or Private Endpoint]
    F -->|Yes| G
    D -->|No| J[Tier 2 - Hybrid]
    D -->|Yes| K{Fine-tuning on proprietary corpus needed?}
    K -->|No| J
    K -->|Yes| L[Tier 3 - Self-Host First]
```


The flowchart shows that volume is the first filter but not the only one. A team with 5M tokens per day and strict PHI data residency requirements bypasses the volume question entirely and lands in Tier 3. A team with 200M tokens per day but no MLOps function will find Tier 2 or even Tier 1 (via aggressive mini-model routing) more practical than a full Tier 3 self-hosted deployment. Read the entire path, not just the first branch.

Tier 1 — API-First (0–10M tokens/day)

At this volume, the API is almost always cheaper when you account for engineering labour. The architecture is simple: a request router dispatches to a primary model (GPT-4o or Claude) and a fallback (GPT-4o-mini), with a semantic cache layer (Redis + embedding similarity) in front to absorb repeated queries.

```mermaid
graph TD
    App[Application Layer] --> Cache[Semantic Cache - Redis plus Embeddings]
    Cache -->|Cache hit| Response[Return Cached Response]
    Cache -->|Cache miss| Router[Model Router]
    Router --> Primary[Primary Model - GPT-4o or Claude 3.5 Sonnet]
    Router --> Fallback[Fallback Model - GPT-4o-mini or Gemini Flash]
    Primary --> Store[Store in Cache]
    Fallback --> Store
    Store --> Response
```

This diagram shows the Tier 1 request path. The semantic cache is placed before the router — even before model selection — because a cache hit is infinitely cheaper than any model call. Semantic caching with cosine similarity on query embeddings typically reduces API calls by 30–60% for use cases with natural query repetition (support chatbots, FAQ systems, search assistants).

The key optimisations in Tier 1: route queries semantically (similar questions share cache entries), use the mini model for short/simple queries, and set hard monthly spend alerts at 80% of budget before you hit overage.

Tier 2 — Hybrid Routing (10M–200M tokens/day)

The hybrid tier adds a self-hosted serving layer for high-volume, lower-complexity tasks. Simple completions, classification, short summarisation, and slot-filling go to the self-hosted model. Complex reasoning, multi-step planning, code generation, and long-context tasks go to the API. A complexity classifier (heuristic or trained) sits in the router and makes the dispatch decision per request.

The self-hosted component is typically Llama 3 8B or Mistral 7B — small enough to fit on a single A10 GPU with excellent throughput, cheap enough that the break-even against API pricing is around 5–10M tokens per day. The 70B model is rarely necessary in Tier 2 because the complex tasks that need it go to the API anyway.

Tier 3 — Self-Host First (200M+ tokens/day)

At this scale, infrastructure economics decisively favour self-hosting for most query types. The serving layer is vLLM with Llama 3 70B or Mistral Large, deployed on Kubernetes with horizontal pod autoscaling keyed on inference request queue depth. The API becomes the safety net: tasks that require the absolute latest model capabilities, or tasks where the self-hosted model's quality falls meaningfully below API quality on your eval suite, fall back to the API.

The Tier 3 operational burden is substantial: model version management, eval harnesses that run on every model update, GPU capacity planning, on-call rotations for inference outages, and integration with your existing observability stack. Budget at least 1.5 FTE dedicated to model serving infrastructure before committing to this tier.

🌍 Real Deployments, Real Numbers: Who Chose What and Why

Harvey AI — self-hosted, compliance-driven. Harvey builds legal AI for law firms and corporate legal departments. Their regulatory environment is non-negotiable: client privileged communications cannot traverse a third-party API. Harvey self-hosts fine-tuned models on their own infrastructure and uses their own legal corpus to improve model performance on contracts, case law, and regulatory filings. The compliance requirement alone — not the economics — made self-hosting mandatory. Engineering cost was secondary.

Cursor — API-first, quality-driven. The AI-powered IDE from Anysphere routes its core code completion and chat to Claude and GPT-4o via APIs. At Cursor's scale and use case, model quality matters more than infrastructure cost: developers notice immediately when code suggestions degrade. Claude's long context window and strong code understanding are core product features that Cursor cannot easily replicate with an open-weight model. They optimise cost through aggressive caching and smart context management, not by switching to self-hosted inference.

Perplexity AI — hybrid, performance-optimised. Perplexity uses a fine-tuned Mistral variant for search grounding — the retrieval and citation task that runs on every query. For complex, multi-step reasoning answers on hard queries, they route to Claude or GPT-4o through their API. The hybrid architecture allows them to run extremely high query volumes cost-effectively while maintaining top-tier answer quality on the queries where it matters. Their search grounding model runs as a proprietary fine-tune on their own infrastructure; the reasoning model is rented.

Notion AI — API-first, iteration-speed-driven. Notion launched their AI features on OpenAI's API because shipping speed mattered more than optimisation at that stage. They iterated on prompts, UX patterns, and feature scope for six months before thinking seriously about infrastructure. They subsequently added a semantic caching layer to manage cost as usage grew. Their lesson: API-first let them discover what features users actually wanted before locking into an infrastructure decision.

Enterprise migration at 500M+ tokens/day. Multiple enterprise teams running document processing pipelines at 500M+ tokens per day have migrated from API-first to Mistral self-hosted after monthly token bills exceeded $100,000. The trigger is almost always financial — the monthly API invoice becomes a budget line item large enough to fund two additional ML engineers plus the GPU infrastructure to replace it. The migration is painful (2–4 months of work), but the economics after migration are compelling: a Tier 3 self-hosted deployment at that volume typically costs 70–85% less than API pricing for the same workload.

⚖️ Hidden Risks and Failure Modes in Both Directions

Knowing which tier is right does not guarantee a good outcome. Each path has failure modes that are invisible until they hit production.

| Decision | Hidden risk | Mitigation |
| --- | --- | --- |
| API-only at scale without a gateway | Vendor lock-in; price changes or model deprecations require emergency refactors | Abstract all LLM calls behind an LLM gateway (LiteLLM, OpenRouter). One config change to switch providers. |
| Self-host without automated evals | Quality regression goes undetected after model updates or infra changes | Run evaluation harness on every model deploy. Automated regression tests on a golden set of 200+ prompt/response pairs. |
| Fine-tuning without a baseline eval | Cannot measure whether fine-tuning helped or hurt | Run baseline evals on the untuned model before training. Compare on identical eval set post-training. |
| Hybrid routing without a classifier | Complex tasks hit the cheap model; quality degrades silently | Task complexity classifier with logged confidence scores. Alerting on output quality metrics by routing path. |
| Self-host on a single GPU node | Single point of failure; no autoscaling; queue saturation under load | Kubernetes + vLLM replicas; horizontal pod autoscaler keyed on llm_queue_depth; circuit breaker to API fallback. |
| Switching providers mid-product without testing | Subtle output format differences break downstream parsers | Shadow traffic testing: run new provider in parallel, log disagreements, before cutting over. |
| Skipping semantic cache | Paying for repeated identical API calls | Implement Redis + embedding similarity cache before any other cost optimisation. 30–60% call reduction is typical. |
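The shadow-traffic mitigation is straightforward to wire up. A sketch of the comparison step, using difflib as a crude stand-in for the embedding or LLM-judge similarity you would use in production:

```python
import difflib

def shadow_compare(prompt: str, primary_fn, shadow_fn,
                   similarity_floor: float = 0.85) -> dict:
    """Serve the primary provider's response; log the shadow provider's
    response and flag divergent pairs for offline review."""
    primary = primary_fn(prompt)
    shadow = shadow_fn(prompt)
    sim = difflib.SequenceMatcher(None, primary, shadow).ratio()
    return {
        "served": primary,                      # always serve primary
        "shadow": shadow,
        "similarity": round(sim, 3),
        "flag_for_review": sim < similarity_floor,
    }

# Same answer, different format - exactly the divergence that breaks parsers
r = shadow_compare("2+2?", lambda p: "4", lambda p: "The answer is 4.")
print(r["flag_for_review"])  # True
```

Run this on a sample of live traffic for a week before any provider cutover; the flagged pairs tell you which downstream parsers will break.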

🧭 The One-Page Decision Checklist

Use this table in your architecture review. Each question maps to a recommendation. If your answers put you in two different tiers, the stronger constraint wins (data privacy and volume tend to dominate).

| Question | Yes → | No → |
| --- | --- | --- |
| Do you process > 50M tokens/day on this use case? | Evaluate self-hosting (Tier 2 or 3) | Start with API (Tier 1) |
| Do you have PII, PHI, or regulated data in prompts? | Self-host or use private API endpoint (Tier 3) | API is compliant with standard DPA |
| Can you staff a dedicated MLOps engineer? | Self-hosting is operationally viable | API-only; self-hosting ops burden is too high |
| Do you need model behaviours not achievable by prompting alone? | Fine-tuning required → self-host (Tier 3) | Prompt engineering + API is sufficient |
| Is your use case latency-sensitive (< 200ms TTFT)? | Self-host with vLLM + speculative decoding (Tier 2/3) | API latency acceptable (Tier 1) |
| Is your token volume predictable month-to-month? | Reserved GPU instances; self-host economics are favourable | API with usage alerts; variable billing acceptable |
| Do you need the latest model capabilities as they ship? | Managed API — new models arrive there first (Tier 1) | Self-host with controlled upgrade cycle viable |
| Is your monthly API bill already above $10K? | Run the cost model; evaluate Tier 2 hybrid | API economics still favourable |
| Is your eval infrastructure mature and automated? | Self-hosting is viable; without regression detection it is risky | Build evals before self-hosting |

The single most important insight from this checklist: build your evaluation infrastructure before you build your serving infrastructure. Teams that cannot measure model quality reliably cannot safely operate self-hosted models. Every self-hosted deployment decision should be preceded by an eval investment, not followed by one.
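A golden-set harness does not need to be sophisticated to be useful. A sketch of the minimum viable version; the keyword check is a placeholder for your real metrics (LLM-as-judge, task-specific scoring):

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str
    required_keywords: list[str]   # every keyword must appear in the response

def run_regression(model_fn, golden: list[GoldenCase],
                   min_pass_rate: float = 0.95) -> tuple[bool, float]:
    """Run model_fn over the golden set; gate the deploy on the pass rate."""
    passed = 0
    for case in golden:
        out = model_fn(case.prompt).lower()
        if all(kw.lower() in out for kw in case.required_keywords):
            passed += 1
    rate = passed / len(golden)
    return rate >= min_pass_rate, rate

golden = [GoldenCase("What is the notice period clause?", ["notice", "days"])]
ok, rate = run_regression(lambda p: "The notice period is 30 days.", golden)
print(ok, rate)  # True 1.0
```

Wire this into the deploy pipeline so a model or provider change cannot ship without clearing the golden set.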

🧪 Building a Production LLM Router in Python

The following router implements the core Tier 2 pattern: classify task complexity from prompt features, dispatch simple tasks to a self-hosted vLLM endpoint (served with an OpenAI-compatible API surface), route complex tasks to the API, track cost per request, and fall back to the API automatically if the self-hosted endpoint misses its latency SLA.

The complexity classifier here is a heuristic — fast and transparent, but limited. In a production Tier 2 deployment, replace it with a trained classifier (a fine-tuned BERT or a small LLM that predicts routing category) for higher accuracy. The rest of the router structure remains the same.

import os
import time
from dataclasses import dataclass, field
from openai import OpenAI

@dataclass
class RoutingDecision:
    model: str
    endpoint: str
    estimated_cost_usd: float
    reason: str
    complexity: str = "unknown"

@dataclass
class RouterStats:
    total_requests: int = 0
    self_hosted_requests: int = 0
    api_requests: int = 0
    sla_fallbacks: int = 0
    total_estimated_cost_usd: float = 0.0

    def summary(self) -> str:
        pct_self = (
            100 * self.self_hosted_requests / self.total_requests
            if self.total_requests
            else 0
        )
        return (
            f"Requests: {self.total_requests} total | "
            f"{self.self_hosted_requests} self-hosted ({pct_self:.0f}%) | "
            f"{self.api_requests} API | "
            f"{self.sla_fallbacks} SLA fallbacks | "
            f"Est. cost: ${self.total_estimated_cost_usd:.4f}"
        )

class LLMRouter:
    """
    Routes LLM requests between a self-hosted vLLM endpoint and a managed API
    based on heuristic task complexity and a latency SLA guard.

    Replace _estimate_complexity() with a trained classifier in production.
    Replace the Anthropic call with LiteLLM proxy for multi-provider support.
    """

    SELF_HOST_ENDPOINT = os.getenv("VLLM_ENDPOINT", "http://localhost:8000/v1")
    SELF_HOST_MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

    # For the API path, use LiteLLM proxy (see the LiteLLM section below)
    # or the provider's native SDK. This example uses an OpenAI-compat layer.
    API_MODEL = "claude-3-5-sonnet-20241022"
    API_BASE_URL = os.getenv("LLM_GATEWAY_URL", "https://api.anthropic.com/v1")
    LATENCY_SLA_MS = 3_000  # 3-second SLA for self-hosted path

    # Approximate amortised costs (USD per 1K tokens)
    SELF_HOST_COST_PER_1K = 0.0005   # GPU cost amortised at 70% utilisation
    API_COST_INPUT_PER_1K = 0.003    # Claude Sonnet input
    API_COST_OUTPUT_PER_1K = 0.015   # Claude Sonnet output

    def __init__(self) -> None:
        self.self_host_client = OpenAI(
            base_url=self.SELF_HOST_ENDPOINT,
            api_key="not-needed",  # vLLM does not require an API key
        )
        self.api_client = OpenAI(
            api_key=os.environ["LLM_API_KEY"],
            base_url=self.API_BASE_URL,
        )
        self.stats = RouterStats()

    def _estimate_complexity(self, prompt: str) -> str:
        """
        Heuristic complexity classifier.
        Categories: simple | medium | complex
        Replace with a trained classifier for production accuracy.
        """
        words = prompt.split()
        word_count = len(words)

        has_code_request = any(
            kw in prompt.lower()
            for kw in ["implement", "write code", "debug", "refactor", "write a function"]
        )
        has_reasoning_request = any(
            kw in prompt.lower()
            for kw in ["explain why", "analyse", "compare", "design", "evaluate", "trade-off"]
        )
        has_multi_step = any(
            kw in prompt.lower()
            for kw in ["step by step", "plan", "first", "then", "finally", "outline"]
        )

        if word_count < 40 and not has_code_request and not has_reasoning_request:
            return "simple"
        if has_code_request or has_reasoning_request or has_multi_step or word_count > 250:
            return "complex"
        return "medium"

    def _build_routing_decision(self, prompt: str) -> RoutingDecision:
        complexity = self._estimate_complexity(prompt)
        # Rough token estimate: 1 word ≈ 1.3 tokens
        estimated_tokens = len(prompt.split()) * 1.3

        if complexity == "simple":
            return RoutingDecision(
                model=self.SELF_HOST_MODEL,
                endpoint=self.SELF_HOST_ENDPOINT,
                estimated_cost_usd=estimated_tokens / 1_000 * self.SELF_HOST_COST_PER_1K,
                reason=f"simple task ({len(prompt.split())} words) → self-hosted Llama 3",
                complexity=complexity,
            )
        # medium and complex → API
        cost = (
            estimated_tokens / 1_000 * self.API_COST_INPUT_PER_1K
            + (estimated_tokens * 0.5) / 1_000 * self.API_COST_OUTPUT_PER_1K
        )
        return RoutingDecision(
            model=self.API_MODEL,
            endpoint="api",
            estimated_cost_usd=cost,
            reason=f"{complexity} task → managed API ({self.API_MODEL})",
            complexity=complexity,
        )

    def _call_self_hosted(self, prompt: str, decision: RoutingDecision) -> str | None:
        """
        Call the self-hosted vLLM endpoint.
        Returns the response text, or None if the call fails or exceeds the SLA.
        """
        start = time.monotonic()
        try:
            resp = self.self_host_client.chat.completions.create(
                model=self.SELF_HOST_MODEL,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
                timeout=self.LATENCY_SLA_MS / 1_000,
            )
            elapsed_ms = (time.monotonic() - start) * 1_000
            if elapsed_ms > self.LATENCY_SLA_MS:
                return None  # SLA breach — fall back to API
            return resp.choices[0].message.content
        except Exception:
            return None  # Any error → fall back to API

    def _call_api(self, prompt: str) -> str:
        resp = self.api_client.chat.completions.create(
            model=self.API_MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return resp.choices[0].message.content

    def complete(self, prompt: str) -> tuple[str, RoutingDecision]:
        """
        Route a prompt to the appropriate model and return (response, decision).
        Updates internal stats for cost and routing telemetry.
        """
        decision = self._build_routing_decision(prompt)
        self.stats.total_requests += 1

        if decision.endpoint == self.SELF_HOST_ENDPOINT:
            response = self._call_self_hosted(prompt, decision)
            if response is not None:
                self.stats.self_hosted_requests += 1
                self.stats.total_estimated_cost_usd += decision.estimated_cost_usd
                return response, decision
            # SLA breach or error: fall back to API
            decision.reason += " [self-host error or SLA breach — fell back to API]"
            self.stats.sla_fallbacks += 1

        # API path (direct route or fallback)
        response = self._call_api(prompt)
        self.stats.api_requests += 1
        if decision.endpoint == self.SELF_HOST_ENDPOINT:
            # Fallback: re-estimate at API rates so cost telemetry isn't undercounted
            est = len(prompt.split()) * 1.3
            decision.estimated_cost_usd = (
                est / 1_000 * self.API_COST_INPUT_PER_1K
                + est * 0.5 / 1_000 * self.API_COST_OUTPUT_PER_1K
            )
        self.stats.total_estimated_cost_usd += decision.estimated_cost_usd
        return response, decision

# --- Demo: route a mixed workload ---
if __name__ == "__main__":
    router = LLMRouter()

    test_prompts = [
        "Summarise this PR title: Fix null pointer in UserService",
        "What is the capital of France?",
        "Implement a thread-safe LRU cache in Python with O(1) get and put operations.",
        "Compare the trade-offs between eventual consistency and strong consistency in distributed databases.",
        "Translate: Hello, how are you?",
    ]

    for prompt in test_prompts:
        # complete() is synchronous; for concurrent production workloads,
        # run it in a thread pool or port the router to async clients
        response, dec = router.complete(prompt)
        print(f"[{dec.complexity.upper()}] {dec.reason}")
        print(f"  Est. cost: ${dec.estimated_cost_usd:.5f}")
        print(f"  Response:  {response[:100]}...")
        print()

    print("--- Session Summary ---")
    print(router.stats.summary())

The router's key design decisions are worth calling out. The SLA guard (LATENCY_SLA_MS) is a first-class citizen — if the self-hosted endpoint is slow (cold start, high load, GPU memory pressure), the router automatically falls back to the API rather than degrading the user experience. The RouterStats object gives you the telemetry data you need to tune the routing thresholds over time: if SLA fallbacks are consistently above 5%, your self-hosted capacity is under-provisioned. If more than 80% of requests are going to the API, your complexity classifier is over-classifying tasks as complex and you are overpaying.
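Those two tuning thresholds can be checked mechanically from the stats counters. A minimal sketch, assuming counter fields like the ones the router updates above (the field and function names here are illustrative, not part of the router's API):

```python
from dataclasses import dataclass

@dataclass
class RouterStats:
    # Illustrative counters mirroring what the router updates per request.
    total_requests: int = 0
    self_hosted_requests: int = 0
    api_requests: int = 0
    sla_fallbacks: int = 0
    total_estimated_cost_usd: float = 0.0

def tuning_signals(stats: RouterStats) -> list[str]:
    """Turn raw routing telemetry into capacity/threshold tuning advice."""
    if stats.total_requests == 0:
        return []
    signals = []
    if stats.sla_fallbacks / stats.total_requests > 0.05:
        signals.append("fallbacks > 5%: self-hosted capacity is under-provisioned")
    if stats.api_requests / stats.total_requests > 0.80:
        signals.append("API share > 80%: classifier over-classifies as complex")
    return signals

stats = RouterStats(total_requests=100, self_hosted_requests=10,
                    api_requests=90, sla_fallbacks=8)
for signal in tuning_signals(stats):
    print(signal)
```

Run this against a snapshot of the stats object on a schedule — or export the same two ratios to your metrics system — so threshold drift surfaces before the invoice does.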

🛠️ LiteLLM: The Gateway Layer That Makes the Decision Reversible

One of the most expensive mistakes in LLM deployment is hard-coding API client calls throughout your codebase. When you need to switch from GPT-4o to Claude (for cost reasons) or from Claude to a self-hosted Llama endpoint (for compliance reasons), a hard-coded integration means touching every call site. LiteLLM solves this by providing a single OpenAI-compatible interface over 100+ LLM providers and self-hosted endpoints.

LiteLLM handles:

  • A unified API surface (swap GPT-4o for Claude with a single config line change — no code changes)
  • Automatic request/response logging and per-request cost tracking
  • Fallback chains: if the primary provider is unavailable, automatically route to the fallback
  • Rate limiting, budget guardrails, and spend alerts per user or team
  • Caching integration (Redis) for semantic deduplication

Minimal LiteLLM proxy configuration (litellm_config.yaml):

model_list:
  - model_name: primary                  # logical name your app calls
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: primary                  # same logical name → the proxy load-balances and retries across deployments
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fast                     # low-cost path for simple tasks
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: self-hosted              # vLLM endpoint, OpenAI-compat
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-70B-Instruct
      api_base: http://vllm-service:8000/v1
      api_key: not-needed

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # Postgres for cost tracking

litellm_settings:
  fallbacks:
    - primary:
        - fast
  cache: true
  cache_params:
    type: redis
    host: redis-service
    port: 6379

Calling the LiteLLM proxy from Python — identical code regardless of backend:

import os
from openai import OpenAI

# Point the standard OpenAI client at your LiteLLM proxy.
# Swap models by changing "primary" to "self-hosted" or "fast" — zero other changes.
client = OpenAI(
    api_key=os.environ["LITELLM_MASTER_KEY"],
    base_url=os.environ.get("LITELLM_PROXY_URL", "http://localhost:4000"),
)

def call_llm(prompt: str, model: str = "primary", max_tokens: int = 512) -> str:
    """
    All LLM calls in your application route through this single function.
    Changing providers, adding fallbacks, or switching to self-hosted
    requires only a litellm_config.yaml change — no application code changes.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

# Simple task → cheap model
answer = call_llm("What is the boiling point of water?", model="fast")

# Complex task → primary (Claude or GPT-4o with automatic fallback)
analysis = call_llm(
    "Analyse the trade-offs between CQRS and traditional CRUD for a high-write financial ledger.",
    model="primary",
)

# Force self-hosted for a privacy-sensitive prompt
sensitive = call_llm(
    "Summarise the following patient note: ...",
    model="self-hosted",
)

The key insight: when your entire codebase calls call_llm(prompt, model="primary"), you can migrate from GPT-4o to Claude to a self-hosted Llama endpoint by editing litellm_config.yaml and redeploying the proxy — your application does not change at all. This is the architectural hedge that makes the build-vs-buy decision reversible. For a full deep-dive on LiteLLM in production agent routing, including budget guardrails and multi-tenant cost attribution, see the LLM skill registry and routing post.

📚 Lessons Learned

Six non-obvious lessons from teams that have navigated this decision in both directions.

1. Start API-first, self-host at a specific measurable threshold — never speculatively. The threshold should be defined in writing before you build: "When our monthly API cost exceeds $15,000 for three consecutive months, we evaluate Tier 2 hybrid routing." Speculative self-hosting (building the infrastructure before you hit the threshold) almost always results in sunk cost and under-utilised GPUs.

2. Semantic caching reduces API costs 30–60% and should come before any infrastructure change. A Redis cache with embedding-based similarity lookup is a 2-week engineering project. It consistently reduces API call volume by 30–60% for use cases with natural query repetition. Do this before you consider a GPU. Many teams discover after implementing caching that they never hit their self-hosting threshold.

3. If your fine-tuning dataset has fewer than 1,000 examples, you need better prompts, not fine-tuning. Fine-tuning on small datasets rarely achieves the hoped-for improvement and frequently hurts performance on out-of-distribution inputs. The minimum viable fine-tuning dataset for a domain-adaptation task is typically 5,000–10,000 diverse, high-quality examples. Below that threshold, prompt engineering with a few-shot example library almost always beats fine-tuning.
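A "few-shot example library" can be as simple as ranking stored examples by relevance to the incoming task and prepending the top k. A sketch with hypothetical contract-clause examples and naive word-overlap ranking (a production system would rank by embedding similarity):

```python
EXAMPLES = [  # hypothetical curated (input, output) pairs for contract analysis
    {"input": "Summarise: the employee may not disclose confidential information...",
     "output": "Confidentiality clause: bars disclosure of company information."},
    {"input": "Summarise: either party may terminate this agreement with 30 days notice...",
     "output": "Termination clause: 30-day notice, either party."},
]

def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_few_shot_prompt(task: str, k: int = 2) -> str:
    # Rank the library by naive word overlap with the incoming task;
    # swap in embedding similarity for production relevance ranking.
    ranked = sorted(EXAMPLES, key=lambda ex: overlap(ex["input"], task), reverse=True)
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in ranked[:k]
    )
    return f"{shots}\n\nInput: {task}\nOutput:"

prompt = build_few_shot_prompt("Summarise: the employee may not solicit clients...")
print(prompt.count("Input:"))  # → 3 (two shots plus the task itself)
```

A 400-row dataset that is too small to fine-tune on is usually an excellent few-shot library.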

4. vLLM's continuous batching changes the economics significantly at scale. Naive estimates compare single-request throughput. vLLM's continuous batching at batch size 64+ achieves throughput-to-cost ratios that beat API pricing sooner than naive estimates suggest — sometimes at 30–40M tokens per day rather than 50–80M. If you are running GPU cost estimates, use vLLM benchmark numbers at realistic batch sizes, not theoretical single-request throughput.

5. GPT-4o-mini at $0.60/M output tokens often outperforms fine-tuned 7B models on real tasks. This is the uncomfortable truth that derails many self-hosting plans. A carefully prompted GPT-4o-mini outperforms a fine-tuned Llama 3 7B on the majority of production tasks we have seen. The 7B model only wins when fine-tuned on a very large, clean, domain-specific dataset (> 50,000 examples) for a narrow, well-defined task. Validate with evals before committing.
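"Validate with evals" does not require heavy tooling to start. A minimal sketch of a model-versus-model comparison on a labelled set — the two model callables are stand-ins for real API/self-hosted clients, and exact match is the crudest possible scorer:

```python
from typing import Callable

def run_eval(model_fn: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs the model answers exactly.
    Real production evals would use semantic metrics or an LLM judge."""
    correct = sum(
        model_fn(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)

# Hypothetical stand-ins for a managed API model and a fine-tuned 7B model.
dataset = [("2+2?", "4"), ("Capital of France?", "Paris")]
api_model = lambda p: {"2+2?": "4", "Capital of France?": "Paris"}[p]
tuned_7b = lambda p: {"2+2?": "4", "Capital of France?": "Lyon"}[p]

print(run_eval(api_model, dataset))  # → 1.0
print(run_eval(tuned_7b, dataset))   # → 0.5
```

The point is the shape, not the scorer: the same harness runs against both candidates on the same dataset, and the self-hosting decision waits for the numbers.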

6. Spend alerts save more money than architecture changes, and you should set them on Day 1. Set a hard alert at 80% of your monthly LLM budget the day you integrate your first API call. Most overspend incidents are detectable 10–14 days before month end if you have alerting. Most teams that get surprised by a $47,000 invoice did not have alerts configured. The infrastructure decision comes after the alert fires; the alert should be there from the beginning.
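The Day-1 alert needs only month-to-date spend, the budget, and a linear projection. A sketch — the 80% fraction and message strings are illustrative; in practice, wire the return value to your paging or chat tooling:

```python
import calendar
from datetime import date

def check_spend(mtd_spend_usd: float, monthly_budget_usd: float,
                today: date, alert_fraction: float = 0.8) -> list[str]:
    """Return alert messages for budget-threshold and projected overspend."""
    alerts = []
    if mtd_spend_usd >= alert_fraction * monthly_budget_usd:
        alerts.append(
            f"HARD ALERT: ${mtd_spend_usd:,.0f} spent, "
            f">= {alert_fraction:.0%} of ${monthly_budget_usd:,.0f} budget"
        )
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = mtd_spend_usd / today.day * days_in_month
    if projected > monthly_budget_usd:
        alerts.append(
            f"Projected month-end spend ${projected:,.0f} exceeds "
            f"${monthly_budget_usd:,.0f} budget"
        )
    return alerts

# Day 12 of May: $9,400 spent against a $15,000 budget —
# the linear projection fires more than two weeks before month end.
for alert in check_spend(9_400, 15_000, date(2024, 5, 12)):
    print(alert)
```

The projection check is what buys you the 10–14 days of lead time: the hard threshold alone fires only after most of the budget is already gone.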

📌 TLDR

TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M tokens/day with a dedicated MLOps team. The build vs buy decision is a spreadsheet problem, not an engineering identity problem.

Key takeaways from this post:

  • 7-factor scoring framework: Volume, latency SLA, data privacy, customisation need, team capability, budget predictability, and model freshness each score 0–2. Sum determines your tier (API / Hybrid / Self-host).
  • Break-even formula: Against Claude Sonnet pricing, self-hosting a Llama 3 70B deployment breaks even at ~8.5M tokens/day. Against GPT-4o-mini, break-even is ~242M tokens/day. GPT-4o-mini is the "hidden gem" that extends API economics much further than most teams assume.
  • Three deployment tiers: API-only (Tier 1, 0–10M tokens/day) → Hybrid with complexity-based routing (Tier 2, 10–200M tokens/day) → Self-host first (Tier 3, > 200M tokens/day).
  • LiteLLM as the reversibility layer: Abstract all LLM calls behind an LLM gateway from day one. The build-vs-buy decision becomes a config file change, not a code rewrite.
  • Eval infrastructure before serving infrastructure: You cannot safely operate self-hosted models without automated regression detection. Build evals first.
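The break-even arithmetic behind these takeaways fits in a few lines. A sketch using the cost model from this post (720 GPU-hours/month and a 20% overhead factor are the post's assumptions; exact figures shift with utilisation and current pricing):

```python
def breakeven_tokens_per_day(gpu_hourly_usd: float, n_gpus: int,
                             api_blended_usd_per_m: float,
                             overhead: float = 0.20) -> float:
    """Daily token volume at which reserved-GPU serving matches API spend."""
    monthly_infra_usd = gpu_hourly_usd * n_gpus * 720 * (1 + overhead)
    monthly_tokens = monthly_infra_usd / api_blended_usd_per_m * 1_000_000
    return monthly_tokens / 30

# 2x A40 at $0.65/hr vs Claude Sonnet at a blended ~$4.80/M tokens
print(f"{breakeven_tokens_per_day(0.65, 2, 4.80):,.0f}")  # → 7,800,000
# The same formula with GPT-4o-mini's far lower blended rate pushes
# break-even into the hundreds of millions of tokens per day.
```

When comparing against a continuous-batching deployment, feed in a blended rate derived from vLLM benchmark throughput at realistic batch sizes rather than single-request numbers.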

📝 Practice Quiz

  1. At what approximate daily token volume does self-hosting a Llama 3 70B deployment (2× A40 reserved GPUs) typically break even against Claude 3.5 Sonnet API pricing?

  - A) ~5M tokens/day
  - B) ~8–10M tokens/day ✅
  - C) ~50M tokens/day
  - D) ~200M tokens/day

Correct Answer: B — approximately 8–10M tokens per day. Using the cost model in this post: 2× A40 reserved at $0.65/hr each = $1.30/hr = $936/month, plus 20% overhead = ~$1,123/month. Against Claude Sonnet at $3/M input + $15/M output (assuming an 85/15 input/output split, blended ~$4.80/M tokens), break-even is $1,123 ÷ $4.80 per million tokens ÷ 30 days ≈ 7.8M tokens/day. The exact figure varies with utilisation and pricing, but the 8–10M range is closest. Against GPT-4o-mini, break-even is ~242M tokens/day — so A (5M), C (50M), and D (200M) are all wrong for the Sonnet comparison, and every option is wrong for mini.

  2. Which factor most strongly favours self-hosting over a managed API for a healthcare AI startup?

  - A) The startup processes 15M tokens per day
  - B) They want lower response latency
  - C) HIPAA compliance requires all PHI to remain on-premises ✅
  - D) They want to fine-tune on 500 domain-specific examples

Correct Answer: C — HIPAA compliance with strict data residency requirements. Token volume (A) influences the economics but does not mandate self-hosting; the API can serve high volumes. Lower latency (B) is achievable with self-hosting but also with edge deployments or regional API endpoints. HIPAA with strict residency (C) often makes managed API endpoints unacceptable regardless of cost — the compliance requirement is binary, not economic. Fine-tuning on a small dataset (D) is a poor reason to self-host; fewer than 1,000 examples rarely justifies the infrastructure cost.

  3. What does vLLM's continuous batching primarily improve, compared to naive single-request inference?

  - A) Time-to-first-token for individual requests
  - B) GPU throughput by batching multiple in-flight requests ✅
  - C) Model output quality through better decoding
  - D) Peak GPU memory usage through weight sharing

Correct Answer: B — GPU throughput by processing multiple in-flight requests in parallel. Continuous batching (also called iteration-level scheduling) lets vLLM add new requests to the batch mid-generation, rather than waiting for every request in a batch to complete before starting new ones. This dramatically improves GPU utilisation and tokens-per-second throughput. It does not reduce time-to-first-token for individual requests (A) — TTFT is primarily a function of model size and hardware. It does not improve output quality (C) or reduce memory usage (D); quantisation handles memory reduction.

  4. A startup processes 5M tokens per day with no PII data, no regulatory constraints, and a 2-person engineering team (no ML specialist). What is the best recommendation?

  - A) API-first (Tier 1) with semantic caching layer ✅
  - B) Self-host Llama 3 70B to avoid future cost growth
  - C) Hybrid routing with self-hosted 7B model for simple tasks
  - D) Implement semantic caching only, no model serving changes

Correct Answer: A — API-first with semantic caching. At 5M tokens per day with no compliance constraints and no MLOps capability, the 7-factor score is low (likely 1–3). Self-hosting (B) requires at minimum one dedicated ML engineer and produces no economic benefit at this volume compared to GPT-4o-mini. A hybrid setup (C) adds operational complexity without a clear benefit at this scale. Semantic caching alone (D) is valuable but should augment API usage, not replace a deliberate serving strategy. The correct path is Tier 1: API with semantic caching to reduce call volume, spend alerts, and a plan to re-evaluate if volume grows 10× in the next 6 months.

  5. (Open-ended) Your team uses the Claude API at $25,000/month for a document summarisation pipeline. You are considering a hybrid routing approach: route short, simple documents to a self-hosted Llama 3 8B model, keep long and complex documents on the Claude API. Describe the steps you would take to evaluate whether this approach would reduce cost without degrading quality. Include the metrics you would track and the threshold that would trigger a rollback.

A strong answer should include:

Step 1 — Establish a baseline eval. Before changing anything, build an evaluation dataset of 300–500 representative (prompt, expected_output) pairs drawn from production traffic. Include examples across the full complexity distribution — short documents, medium documents, long documents, edge cases. Run Claude on all of them and record scores on your quality metrics.

Step 2 — Define quality metrics. For summarisation, suitable metrics include ROUGE-L (lexical overlap), BERTScore (semantic similarity), and a human-preference eval (20 examples rated by a domain expert). Set a minimum acceptable score for each metric based on the baseline — for example: "BERTScore must not drop below baseline − 0.03."

Step 3 — Build the classifier and shadow-run. Implement the complexity classifier (heuristic or trained) and deploy the self-hosted model. Run the classifier in shadow mode: all production traffic still goes to Claude, but the classifier logs which requests it would have routed to Llama 3. Measure agreement between Claude and Llama 3 outputs on the shadowed simple-category requests using your eval metrics.

Step 4 — Define the routing threshold and rollback trigger. Set a specific threshold, for example: "Route to self-hosted if complexity_score < 0.3 AND document_word_count < 500." Track per-routing-path quality scores in production dashboards. Define a rollback trigger: "If BERTScore on self-hosted-routed requests drops more than 0.05 below the Claude baseline, or if the user-reported error rate on the self-hosted path exceeds 2%, revert all routing to Claude within 30 minutes."

Step 5 — Gradual cutover with canary traffic. Route 5% of production simple-category traffic to the self-hosted model. Monitor quality metrics and user signals (session abandonment, re-query rate, explicit feedback) for 72 hours. If metrics hold, increase to 25%, 50%, then 100%, with 72-hour monitoring windows between each step.

Metrics to track: BERTScore per routing path, ROUGE-L per routing path, user re-query rate (a proxy for perceived quality), self-hosted SLA compliance rate (p95 latency), cost per request by routing path, and API fallback rate from SLA breaches.

Rollback trigger: any of (a) a quality-metric breach, (b) user-facing error rate > 2%, (c) SLA fallback rate > 10% (indicating a capacity problem), or (d) an A/B test showing a statistically significant increase in re-query rate on the self-hosted path. Rollback means setting routing back to 100% API immediately; investigate before re-attempting cutover.
Written by

Abstract Algorithms
@abstractalgorithms