
LLM Software Development Pitfalls: What to Avoid and When to Simplify

A practical field guide to prompt spaghetti, missing evals, runaway costs, and the moment an LLM feature becomes too much.

Abstract Algorithms · 20 min read

TLDR: Most bad LLM products do not fail because the model is weak. They fail because teams wrap a maybe-useful model in too much architecture: prompt spaghetti, no eval harness, weak tool schemas, huge context windows, agent chains nobody can explain, and zero cost ceilings. Use an LLM only when the problem is genuinely ambiguous, keep the architecture linear for as long as possible, and remove or simplify the model when deterministic software already solves the job with lower cost and higher trust.

📖 Four Ordinary LLM Projects That Drift Into Complexity

The pattern usually starts with a perfectly reasonable idea. A support team wants a copilot that drafts replies. An internal platform team wants a search assistant over docs and runbooks. Ops wants a ticket triage bot. Engineering wants a code review helper that points out risky changes before humans look at them.

None of those ideas are bad. The trouble starts when teams assume that because an LLM can say something useful, it should also become the main control plane for the workflow. A simple support copilot turns into autonomous replying. A search assistant turns into "memory" plus five prompt templates plus a retrieval chain that dumps half the wiki into context. A ticket triage bot becomes a planner, a router, three specialist agents, and a database of long-lived conversation state nobody can justify.

The practical question is not, "Can the model do this?" It is, "What part of this job is genuinely ambiguous, and what part is just software?"

| Project | Deterministic baseline that often works | Where the LLM adds real value | Common overkill move |
| --- | --- | --- | --- |
| Support copilot | Retrieval + canned policy snippets | Tone adaptation and draft synthesis | Letting the bot send final replies without policy checks |
| Internal search assistant | Search, ranking, filters, links | Summarizing top hits into one answer | Stuffing 30 documents into context and calling it RAG |
| Ticket triage bot | Rules, keywords, metadata, routing tables | Summarizing messy issue descriptions | Using an agent swarm to classify three ticket types |
| Code review helper | Linters, static analysis, tests | Explaining risk in plain English | Treating the LLM as a merge gate with no eval history |

This post is an opinionated field guide for teams that want useful LLM-powered software, not a demo that looks smart for two days and becomes a reliability project for six months.

๐Ÿ” The First Question: Do You Need Probabilistic Software at All?

LLMs are strongest when the task is fuzzy: summarizing long text, rewriting for tone, extracting meaning from messy language, or generating a first draft from incomplete context. They are weakest when the job is bounded, rule-heavy, or directly tied to a source of truth your software already owns.

That means you should actively try to route around the model.

Use deterministic software first when:

  • the valid output space is small and known in advance
  • the answer must be exact, auditable, or regulation-sensitive
  • the task is really lookup plus formatting
  • a rules engine or classifier can already hit your acceptance threshold
  • failure costs are higher than the value of flexible language generation

An easy test is this: if you can explain the task as "if these conditions hold, do one of seven things," you probably do not need an LLM in the critical path.
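To make that test concrete, the "one of seven things" shape is just a dispatch table. The intents and actions below are hypothetical placeholders, but the structure is the point: every valid output is enumerated in advance, so there is nothing for a model to guess.

```python
# A bounded task expressed as a plain dispatch table -- no model needed.
# Intent and action names are hypothetical placeholders for real workflow steps.
ACTIONS = {
    ("refund_request", "low"): "auto_refund",
    ("refund_request", "high"): "escalate_to_agent",
    ("password_reset", "low"): "send_reset_link",
}

def decide(intent: str, risk: str) -> str:
    # Fall back to a human whenever the combination is not enumerated.
    return ACTIONS.get((intent, risk), "human_review")

print(decide("refund_request", "low"))  # auto_refund
print(decide("unknown_intent", "low"))  # human_review
```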

| Signal | Keep it deterministic | LLM is justified |
| --- | --- | --- |
| Output format | Fixed labels, IDs, actions | Open-ended explanation or synthesis |
| Source of truth | Single database or policy file | Multiple messy documents or human language inputs |
| Tolerance for variance | Near zero | Some variation is acceptable |
| Auditability | Every decision must be replayable | Drafting or assistive output is acceptable |
| User expectation | Precise answer | Helpful first pass |

The biggest beginner mistake in LLM product work is not bad prompting. It is using a stochastic component to avoid writing clear product logic.

โš™๏ธ How Prompt Spaghetti and Architecture Inflation Happen

Most production LLM failures are software design failures wearing AI clothes. The model becomes the excuse for avoiding boundaries, schemas, tests, and fallback paths that normal engineering would demand.

Prompt spaghetti is just hidden business logic

If your application behavior depends on seven prompt variants, two invisible system messages, and a few "temporary" string concatenations spread across the codebase, you do not have prompt engineering. You have undocumented business logic.

This becomes dangerous because nobody can answer simple questions anymore:

  • Which instruction actually changed the output?
  • Which retrieved chunk caused the bad answer?
  • Which version of the prompt was in production during the incident?

Version prompts like code, keep them few, and prefer explicit state over magic wording.
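One lightweight way to do that is a single prompt registry where every prompt has a name, a version, and a content hash that goes into your logs. The names and wording below are hypothetical; the habit is what matters.

```python
import hashlib

# All prompts live in one versioned registry, maintained like code.
# The prompt names and wording here are hypothetical illustrations.
PROMPTS = {
    ("summarize_ticket", "v2"): "Summarize the ticket in two sentences. Cite the ticket ID.",
    ("draft_reply", "v1"): "Draft a polite reply using only the provided policy snippets.",
}

def get_prompt(name: str, version: str) -> tuple[str, str]:
    """Return the prompt text plus a short content hash for logging and tracing."""
    text = PROMPTS[(name, version)]
    digest = hashlib.sha256(text.encode()).hexdigest()[:12]
    return text, digest

text, digest = get_prompt("summarize_ticket", "v2")
print(f"summarize_ticket v2 sha:{digest}")
```

With the hash in every request trace, "which prompt was live during the incident" becomes a lookup instead of an archaeology project.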

No eval harness means you are shipping vibes

A team says, "We tested twenty examples and it felt good." That is not evaluation. That is a mood.

Without a lightweight eval harness, every model swap, prompt tweak, and retrieval change is guesswork. You cannot tell whether quality improved, whether the system got more expensive, or whether a "small" prompt edit silently broke edge cases in finance, policy, or multilingual inputs.

At minimum, keep a golden dataset, score expected behavior, and run it on every change that touches prompts, retrieval, model choice, or tool definitions.

Memory is usually oversold and often harmful

Persistent conversation memory sounds user-friendly, but it is frequently the fastest route to irrelevant context, privacy headaches, and bizarre failures.

Most product flows do not need rich long-term memory. They need:

  • the current user request
  • a few recent turns if follow-up context matters
  • durable business state from real systems of record

When teams stuff every prior message into every prompt, the model starts anchoring on stale information. Cost goes up, latency goes up, and correctness often goes down.
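A sketch of that discipline: pass only a bounded window of recent turns instead of the whole transcript, and fetch durable facts from real systems of record at request time.

```python
def recent_turns(history: list[dict], max_turns: int = 4) -> list[dict]:
    """Keep only the last few turns instead of the full transcript."""
    return history[-max_turns:]

history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
print(len(recent_turns(history)))  # 4
```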

Weak tool schemas create fake autonomy

If a tool call returns loose JSON, optional fields, or ambiguous action names, your agent is guessing at the interface. That is not a reasoning problem; that is a contract problem.

Bad tool schemas cause three painful failure modes:

  1. the model chooses the wrong tool because names overlap
  2. the model calls the right tool with invalid arguments
  3. the model gets valid data back but cannot reliably parse what matters

Strong names, strong types, strong validation. The more expensive the downstream action, the stricter the schema must be.
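A minimal sketch of what "strict" means in practice: required fields and closed enums, checked before anything executes. The tool and its fields are hypothetical.

```python
# A strict tool contract: one unambiguous name, required fields, closed enums.
# The tool and its fields are hypothetical illustrations.
ISSUE_ROUTER_TOOL = {
    "name": "route_issue",
    "required": {"queue", "severity"},
    "enums": {
        "queue": {"billing", "support", "security"},
        "severity": {"low", "medium", "high"},
    },
}

def validate_call(tool: dict, args: dict) -> list[str]:
    # Reject missing required fields and values outside the allowed enums.
    errors = [f"missing field: {f}" for f in tool["required"] - args.keys()]
    for field, allowed in tool["enums"].items():
        if field in args and args[field] not in allowed:
            errors.append(f"invalid {field}: {args[field]!r}")
    return errors

print(validate_call(ISSUE_ROUTER_TOOL, {"queue": "billing", "severity": "urgent-ish"}))
```

Any non-empty error list means the action never runs; the request falls back or escalates instead.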

Too many chains and agents multiply uncertainty

A lot of "advanced" LLM architecture is just latency and debugging overhead disguised as sophistication.

If one model call cannot do the job, that does not automatically mean you need a planner, a router, four specialists, a critic, and a summarizer. Every extra step adds:

  • another prompt surface
  • another failure mode
  • another cost multiplier
  • another place where monitoring can go dark

The default architecture for most LLM features should be boring: one decision point, one retrieval step if needed, one model call, one validator, one fallback.
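That boring default fits in a dozen lines. Every function below is a hypothetical stand-in for a real component, but the shape is the point: one straight line from request to answer, with a deterministic exit and a fallback.

```python
# Hypothetical stand-ins for real components; the linear shape is what matters.
def is_bounded(request: str) -> bool:
    return request.startswith("status:")

def deterministic_answer(request: str) -> str:
    return "status=ok"

def retrieve(request: str) -> list[str]:
    return ["policy snippet"]

def generate(request: str, context: list[str]) -> str:
    return f"draft answer using {len(context)} snippet(s)"

def validate(draft: str) -> bool:
    return "draft" in draft

def fallback(request: str) -> str:
    return "escalate_to_human"

def handle(request: str) -> str:
    # One decision point, one retrieval, one model call, one validator, one fallback.
    if is_bounded(request):
        return deterministic_answer(request)
    context = retrieve(request)
    draft = generate(request, context)
    return draft if validate(draft) else fallback(request)

print(handle("status: build 42"))
print(handle("why did my deploy fail?"))
```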

The missing operational controls are not optional

Teams also underinvest in the non-model parts that actually make the system survivable:

| Pitfall | Why it feels acceptable at first | What breaks later | Safer default |
| --- | --- | --- | --- |
| No cost ceiling | Early traffic is tiny | A prompt bug or loop burns budget fast | Per-request and per-user spend caps |
| Unbounded context stuffing | "More context means better answers" | Token waste, latency spikes, lower signal | Ranked retrieval plus hard token budgets |
| No human fallback | Full automation looks impressive | High-risk cases fail silently | Escalate uncertain or sensitive cases |
| No abuse handling | Internal users seem safe | Prompt injection, toxic inputs, data leakage | Input filters, policy checks, safe completion paths |
| No monitoring | Demo looks fine | Production drift is invisible | Trace prompts, costs, latency, schema failures |

The practical rule: if you would never ship a payment system, search system, or workflow engine without guardrails, do not ship the LLM wrapper without them either.
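As a sketch of the first guardrail, a spend cap can be a few lines of state rather than a platform project. The dollar limits below are made-up illustration values.

```python
from collections import defaultdict

class SpendCap:
    """Hard per-request and per-user spend ceilings (limits are hypothetical)."""

    def __init__(self, per_request_usd: float = 0.05, per_user_daily_usd: float = 1.00):
        self.per_request_usd = per_request_usd
        self.per_user_daily_usd = per_user_daily_usd
        self.spent_today = defaultdict(float)

    def allow(self, user: str, estimated_cost_usd: float) -> bool:
        # Reject any single request over the per-request cap,
        # and any request that would push the user past the daily cap.
        if estimated_cost_usd > self.per_request_usd:
            return False
        return self.spent_today[user] + estimated_cost_usd <= self.per_user_daily_usd

    def record(self, user: str, cost_usd: float) -> None:
        self.spent_today[user] += cost_usd

cap = SpendCap()
print(cap.allow("u1", 0.02))  # True
print(cap.allow("u1", 0.20))  # False: exceeds per-request cap
```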

🧠 Why Over-Engineered LLM Systems Rot Faster Than Normal Integrations

The reason LLM architecture gets messy so quickly is that uncertainty can enter at several layers, and teams often treat every bad outcome as a prompting issue even when the real cause is orchestration.

The Internals of an Over-Engineered LLM Request

A single user message can fan out through multiple uncertain steps:

  1. a router decides whether the request is search, chat, classification, or tool use
  2. retrieval selects a few chunks from a larger corpus
  3. prompt assembly decides how much prior conversation and policy text to include
  4. the model generates output or a tool call
  5. the tool layer validates arguments and executes side effects
  6. a parser turns the output back into application state

Each step can be individually "pretty good" and the whole request can still fail. A slightly bad router sends the task to the wrong path. A slightly noisy retriever adds irrelevant context. A weak tool schema makes the generated action unsafe. By the time a user sees a wrong answer, the error is several layers removed from the visible symptom.
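The compounding is easy to quantify. Assuming, for illustration, that each of the six stages above independently succeeds 97% of the time:

```python
# End-to-end success rate when every stage is individually "pretty good".
stage_success = 0.97
stages = 6
end_to_end = stage_success ** stages
print(f"{end_to_end:.3f}")  # ~0.833
```

Roughly one request in six fails somewhere, even though every individual stage looks healthy on its own dashboard.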

That is why LLM systems need narrower interfaces than normal prose suggests. The model should operate inside a small, well-defined envelope, not across the entire product surface area.

Performance Analysis: Latency, Token, and Retry Multipliers

The cost of over-design is multiplicative, not additive.

In plain language:

  • every extra retrieval chunk raises prompt tokens
  • every extra tool or agent step adds latency
  • every retry increases both cost and queue pressure
  • every long memory window makes bad answers more expensive, not just slower

You can think of the request budget like this:

\[
\text{request cost} \approx \sum \left( \text{prompt tokens} + \text{completion tokens} \right) \times \text{model price} + \text{retry overhead}
\]

\[
\text{request latency} \approx \text{routing} + \text{retrieval} + \text{generation} + \text{tool fanout} + \text{retries}
\]

The equations are simple on purpose. The important point is not formal math; it is that every architectural flourish has a measurable operational multiplier.
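The cost equation translates directly into code. The token counts and per-1k price below are made-up illustration values; a retry reruns the whole call, which is why it multiplies rather than adds.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_per_1k: float, retries: int = 0) -> float:
    """Cost of one request; each retry pays the full per-attempt price again."""
    per_attempt = (prompt_tokens + completion_tokens) / 1000 * price_per_1k
    return per_attempt * (1 + retries)

# Hypothetical numbers: 3000 prompt tokens, 500 completion, $0.01 per 1k tokens.
print(request_cost(3000, 500, 0.01, retries=1))  # one retry doubles the spend
```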

| Design choice | Quality upside | Cost impact | Reliability impact |
| --- | --- | --- | --- |
| Add more retrieved chunks | Sometimes improves coverage | High token growth | More noise in context |
| Add persistent memory | Helps a few long workflows | Medium to high | Stale context and privacy risk |
| Add a second model call | Can improve structure or critique | Doubles spend on that path | More timeout and parsing failures |
| Add multi-agent planning | Useful for truly open-ended workflows | Very high | Harder tracing and debugging |

📊 A Keep-Simplify-Remove Flow for LLM Features

Before you optimize prompts, decide whether the model should stay in the path at all. The diagram below is intentionally simple: it treats the LLM as one component in a product system, not the center of the universe. Read it from top to bottom and notice how often the safe answer is "use normal software first."

```mermaid
graph TD
    A[Incoming task] --> B{Deterministic path good enough?}
    B -- Yes --> C[Use rules search or workflow code]
    B -- No --> D{Is user value in synthesis or ambiguity?}
    D -- No --> C
    D -- Yes --> E[Single LLM call with tight prompt]
    E --> F{Need external action or data?}
    F -- No --> G[Validate response and return]
    F -- Yes --> H[Use strict tool schema]
    H --> I[Validate tool result]
    I --> G
    G --> J{Below quality cost or safety threshold?}
    J -- Yes --> K[Add monitoring and human fallback]
    J -- No --> L[Simplify or remove LLM]
```

This flow is also a debugging aid. If your design jumps straight from "incoming task" to "agent planner," you skipped the most important engineering question. The cleanest LLM systems are usually the ones that earn each extra step with evidence, not enthusiasm.

๐ŸŒ What These Pitfalls Look Like in Real Teams

The anti-patterns above become easier to see when you look at common product shapes instead of abstract architecture diagrams.

Case Study: Support Copilot

Input: a customer asks for a refund after a delayed shipment.
Better process: retrieve the current refund policy, draft a response, and require a human or a policy validator before sending.
Bad process: let the LLM answer from memory, mix old conversation history with partial policy context, and send automatically.

The support copilot should help a human move faster, not invent policy under pressure.

Case Study: Internal Search Assistant

Input: an engineer asks, "How do I rotate Kafka credentials in staging?"
Better process: search docs, rank results, summarize the top two, and include links.
Bad process: dump fifteen chunks plus months of chat memory into context, then blame the model for a wrong answer.

The user mainly needs retrieval discipline, not simulated memory.

Case Study: Ticket Triage Bot

Input: a bug report arrives with messy text, logs, and screenshots.
Better process: use rules for obvious routing, then let the LLM summarize ambiguous tickets and suggest severity.
Bad process: ask three agents to debate ownership, urgency, and probable root cause before filing the issue.

Triage is a great example of selective LLM use. Summarization helps. Autonomous workflow orchestration often does not.

Case Study: Code Review Helper

Input: a pull request changes auth middleware and a feature flag.
Better process: combine static analysis, test results, diff metadata, and an LLM explanation layer for reviewers.
Bad process: treat the LLM as an authoritative reviewer with no benchmark against known defects.

Code review helpers work best as explainers and risk annotators, not as sole approvers.

โš–๏ธ The Trade-Offs That Quietly Turn an LLM Feature Into a Liability

Every LLM feature carries hidden trade-offs, but some are especially easy to ignore during prototype week.

| Decision | Immediate gain | Hidden cost | Failure mode to watch |
| --- | --- | --- | --- |
| Let the model see more context | Higher recall on some cases | Token bloat and weaker focus | Relevant answer buried in irrelevant text |
| Add conversation memory | More continuity | Privacy, stale facts, unpredictable anchoring | User gets answer based on old state |
| Use more agents | Better decomposition on paper | More latency and debugging surfaces | Nobody knows which step failed |
| Skip evals | Faster iteration | No regression signal | Quality drifts until users complain |
| Skip abuse handling | Faster launch | Safety and injection risk | Tool misuse or leakage of sensitive data |
| Skip monitoring | Cleaner MVP | Invisible cost and quality drift | Incidents become anecdotal and slow to triage |

Two trade-offs matter more than teams admit:

  1. Flexibility vs accountability. LLMs are attractive because they absorb messy human input. But the more flexibly you use them, the more you need explicit validation and escalation paths.
  2. Magic vs maintainability. A surprisingly good demo often leads teams to add layers instead of constraints. Production software usually needs the opposite move.

🧭 When Is It Too Much? A Decision Guide for Simplifying or Removing the LLM

Here is the blunt answer: it is too much when the LLM is no longer the smallest useful source of intelligence in the system.

Use this scorecard before adding more prompts, memory, or agents:

| Question | If the answer is mostly "no" |
| --- | --- |
| Does the task truly require ambiguity handling or synthesis? | Remove the LLM and use rules, search, or templates |
| Can you define success with an eval set, not vibes? | Simplify until you can measure it |
| Can you cap cost and latency per request? | Do not expand the architecture yet |
| Do you have strict schemas for every tool and output? | Fix contracts before adding autonomy |
| Is there a safe human fallback for high-risk cases? | Keep the feature assistive, not autonomous |
| Does memory measurably improve quality? | Remove or sharply limit memory |
| Can you explain the full path of one request on one page? | You already have too much orchestration |

A practical threshold:

  • 0-2 yes answers: remove the LLM from the critical path
  • 3-5 yes answers: keep the LLM, but simplify to one model call with validation
  • 6-7 yes answers: keep it, but only with monitoring, budgets, and eval gates
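The scorecard and threshold can even be mechanized as a pre-flight check before any architecture discussion. The keys below paraphrase the seven questions.

```python
# The seven scorecard questions as booleans; key names paraphrase the questions.
answers = {
    "needs_ambiguity_handling": True,
    "measurable_eval_set": True,
    "cost_latency_capped": False,
    "strict_schemas": True,
    "human_fallback": False,
    "memory_earns_its_keep": False,
    "one_page_request_path": True,
}

def verdict(answers: dict[str, bool]) -> str:
    yes = sum(answers.values())
    if yes <= 2:
        return "remove LLM from critical path"
    if yes <= 5:
        return "simplify to one validated model call"
    return "keep, with monitoring, budgets, and eval gates"

print(verdict(answers))  # simplify to one validated model call
```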

You should seriously consider removing or simplifying the model when:

  • a deterministic baseline solves most requests
  • users mainly need retrieval, filtering, or workflow automation
  • incidents come from prompt routing, not from lack of model capability
  • the team cannot explain or reproduce failures quickly
  • long-term memory adds cost but not measurable user benefit
  • tool execution is the real value and the LLM is just a fragile wrapper around it

The hardest but healthiest engineering move is sometimes this: replace an "AI feature" with normal software and call it a win.

🧪 Five Small Python Guards That Make an LLM Feature Behave More Like Software

These examples are intentionally small and runnable. They are not framework demos; they are engineering habits you can drop into a prototype today. Read them as guardrails for pruning complexity, not as reasons to add more layers.

Example 1: Route Around the Model When the Task Is Bounded

```python
from dataclasses import dataclass

@dataclass
class Request:
    task: str
    text: str
    risk: str = "low"

DETERMINISTIC_TASKS = {"ticket_triage", "faq_lookup", "status_check"}
KEYWORDS = {"refund", "password reset", "invoice"}

def should_bypass_llm(request: Request) -> bool:
    short_text = len(request.text.split()) <= 30
    keyword_match = any(k in request.text.lower() for k in KEYWORDS)
    return request.task in DETERMINISTIC_TASKS and (short_text or keyword_match)

def route_request(request: Request) -> str:
    if should_bypass_llm(request):
        return "rules_or_search"
    if request.risk == "high":
        return "human_review"
    return "llm"

sample = Request(task="ticket_triage", text="Invoice PDF missing for April order")
print(route_request(sample))
```

If this kind of router handles a large share of traffic, that is good news. It means you can reserve the LLM for the ambiguous tail instead of paying model cost for work a workflow engine can do.

Example 2: Enforce a Token Budget Before You Assemble Context

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    max_prompt_tokens: int = 1200
    reserve_for_completion: int = 300

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def build_context(chunks: list[str], budget: TokenBudget) -> list[str]:
    chosen = []
    used = 0
    limit = budget.max_prompt_tokens - budget.reserve_for_completion

    for chunk in chunks:
        chunk_tokens = estimate_tokens(chunk)
        if used + chunk_tokens > limit:
            break
        chosen.append(chunk)
        used += chunk_tokens

    return chosen

chunks = [
    "Refund policy: items can be returned within 30 days.",
    "Shipping FAQ: tracking updates may lag by 24 hours.",
    "Old holiday policy from 2023 that should not be in every prompt.",
]

print(build_context(chunks, TokenBudget()))
```

Hard budgets prevent the most common RAG failure mode: treating the context window like free real estate.

Example 3: Keep a Lightweight Eval Harness Instead of Trusting Vibe Checks

```python
from statistics import mean

goldens = [
    {"name": "refund_policy", "predicted": "billing", "expected": "billing", "latency_ms": 220},
    {"name": "password_reset", "predicted": "support", "expected": "support", "latency_ms": 180},
    {"name": "cancel_order", "predicted": "shipping", "expected": "billing", "latency_ms": 260},
]

def score(case: dict) -> dict:
    correct = int(case["predicted"] == case["expected"])
    latency_ok = int(case["latency_ms"] <= 250)
    return {"name": case["name"], "correct": correct, "latency_ok": latency_ok}

results = [score(case) for case in goldens]
accuracy = mean(r["correct"] for r in results)
latency_pass_rate = mean(r["latency_ok"] for r in results)

print({"accuracy": accuracy, "latency_pass_rate": latency_pass_rate, "results": results})
```

This is deliberately simple, but it is already far better than, "We tried it and it seemed fine." Add domain-specific assertions over time. If you want the production-grade version of this discipline, see LLM Evaluation Frameworks: How to Measure Model Quality.

Example 4: Use Retries and a Circuit Breaker, Not Infinite Hope

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_seconds: int = 5):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return (time.time() - self.opened_at) >= self.reset_after_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()

def flaky_llm_call() -> str:
    raise TimeoutError("model timeout")

def call_with_guardrails(fn, retries: int = 2) -> str:
    # In production, share one breaker across requests so repeated failures
    # actually open the circuit; a fresh breaker per call is demo-only.
    breaker = CircuitBreaker()
    for _ in range(retries + 1):
        if not breaker.allow():
            return "fallback_path"
        try:
            result = fn()
            breaker.record_success()
            return result
        except TimeoutError:
            breaker.record_failure()
            time.sleep(0.1)
    return "fallback_path"

print(call_with_guardrails(flaky_llm_call))
```

An LLM dependency should fail like any other dependency: bounded retries, explicit fallback, and no heroics.

๐Ÿ› ๏ธ Pydantic: How Strict Schemas Stop Tool and Output Drift in Practice

Pydantic is an open-source Python validation framework that turns loose JSON into explicit contracts. In LLM systems, that matters because structured output is where "looks correct" often becomes "silently wrong." If the model says a ticket severity is "urgent-ish" when your workflow expects low | medium | high, you want a hard validation failure, not a soft shrug.

This snippet keeps the model on a short leash. It validates queue, urgency, and confidence before the rest of the application trusts the output.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class TicketDecision(BaseModel):
    queue: Literal["billing", "support", "security"]
    urgency: Literal["low", "medium", "high"]
    confidence: float = Field(ge=0.0, le=1.0)

raw_payload = {
    "queue": "billing",
    "urgency": "high",
    "confidence": 0.92,
}

try:
    decision = TicketDecision.model_validate(raw_payload)
    print(decision.model_dump())
except ValidationError as exc:
    print("route_to_human_review")
    print(exc)
```

Pairing a single model call with strict validation is often better than adding another agent whose only job is to clean up the first agent's mess. For a full deep-dive on Pydantic, see a planned follow-up.

📚 Lessons Learned From Shipping LLM Features That Stay Small on Purpose

  • The best LLM architecture is usually the smallest one that survives contact with production.
  • If a deterministic path handles the request well enough, that is not a compromise. That is good engineering.
  • Memory should be earned with measured lift, not added because chat apps have memory.
  • Tool schemas are product contracts, not prompt accessories.
  • If you cannot evaluate, budget, trace, and fail safely, you are not ready to add more agents.

📌 TLDR Summary and Key Takeaways

  • Avoid using an LLM for bounded tasks that rules, search, or workflow code already solve well.
  • Treat prompt sprawl, context stuffing, and multi-agent inflation as architecture smells.
  • Never rely on vibe checks; keep a lightweight eval harness from day one.
  • Put hard ceilings on tokens, retries, spend, and memory scope.
  • It is "too much" when the LLM is no longer the smallest useful intelligence layer in the system.

๐Ÿ“ Practice Quiz

  1. A ticket triage system already routes 92% of tickets correctly using metadata and rules. When should you still add an LLM?

    • Correct Answer: Add it only for the ambiguous remainder where summarization or messy language interpretation materially improves outcomes.
  2. Why is prompt spaghetti a software design problem, not just a prompting problem?

    • Correct Answer: Because hidden prompt variants become undocumented business logic that is hard to version, test, debug, and audit.
  3. What is the main engineering reason to cap context size instead of always sending more documents?

    • Correct Answer: Unbounded context increases cost and latency while often reducing answer quality by burying relevant information in noise.
  4. Open-ended challenge: your support assistant uses retrieval, memory, two agents, and three tools, but user value mainly comes from finding the right policy article. What would you simplify first, and why?

    • Correct Answer: No single correct answer, but a strong response should remove or sharply limit memory, collapse the agents into one validated call, and preserve deterministic retrieval plus human fallback.

Written by

Abstract Algorithms

@abstractalgorithms