
The Developer's Guide: When to Use Code, ML, LLMs, or Agents

Stop trying to solve everything with ChatGPT. We provide a decision framework for modern developers.

Abstract Algorithms
· 14 min read

AI-assisted content.

TLDR: AI is a tool, not a religion. Use Code for deterministic logic (banking, math). Use Traditional ML for structured predictions (fraud, recommendations). Use LLMs for unstructured text (summarization, chat). Use Agents only when a task genuinely requires multi-step planning and external tool calls.


📖 One Codebase, Four Paradigms: Know Before You Reach for the LLM

The most expensive mistake in modern software is using an LLM for a problem deterministic code solves in 5 lines.

Before adding an AI component, ask two questions:

  1. Is the output deterministic? If yes, write code.
  2. Does the input have known structure? If yes, use ML.

If both answers are no and the input is natural language, then LLMs are the right tool. Agents are warranted only when the task requires multiple steps with external tool calls to complete.


πŸ” Deep Dive: The Four Paradigms at a Glance

A quick reference before diving deeper. Each paradigm has a distinct role; picking the wrong one costs time, money, and unnecessary complexity.

| Paradigm | One-Line Definition | Key Use Case |
|---|---|---|
| Code | Explicit rules the developer writes | Tax calculation, email validation, data parsing |
| Traditional ML | A model trained to find patterns in labeled data | Fraud scoring, churn prediction, recommendations |
| LLM | A large language model that reasons over free-form text | Summarization, intent classification, code generation |
| Agent | An LLM that orchestrates multi-step actions with external tools | Booking flows, CI debug bots, autonomous research |

Think of it as a filter: start at Code and only move down the list when the previous paradigm genuinely cannot handle the problem.
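As a rough sketch, the filter can be written in a few lines of Python. The argument names and returned labels are illustrative, not an API defined in this post:

```python
def choose_paradigm(
    rule_writable: bool,
    has_labeled_structured_data: bool,
    needs_language_understanding: bool,
    needs_tools_or_multi_step: bool,
) -> str:
    """Walk the Code -> ML -> LLM -> Agent ladder, stopping at the first fit."""
    if rule_writable:
        return "code"
    if has_labeled_structured_data and not needs_language_understanding:
        return "traditional_ml"
    if needs_language_understanding and not needs_tools_or_multi_step:
        return "llm"
    return "agent"

# Examples from the table above:
print(choose_paradigm(True, False, False, False))   # tax calculation -> "code"
print(choose_paradigm(False, True, False, False))   # fraud scoring -> "traditional_ml"
```

The ordering of the `if` branches is the point: each paradigm is considered only after the cheaper one above it has been ruled out.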


📊 The Decision Flow: From Problem to Paradigm

Choosing the right approach starts by analyzing your problem's input structure, output requirements, and tolerance for non-determinism.

```mermaid
flowchart TD
    A[New Problem] --> B{Needs language understanding?}
    B -->|No| C{Has labeled training data?}
    B -->|Yes| D{Multi-step reasoning or tool use?}
    C -->|No| E[Pure Code / Heuristics]
    C -->|Yes| F[Traditional ML]
    D -->|Yes| G[Agents]
    D -->|No| H[LLM API Call]
```

🧭 Decision Guide: Choosing the Right Tool

The flowchart below turns the two intro questions into a repeatable process you can run mentally for any new backlog item.

```mermaid
flowchart TD
    Feature([New backlog item]) --> D1{Can you write an explicit rule?}
    D1 -- Yes --> UseCode[Use Code: if/else, regex, math]
    D1 -- No --> D2{Structured input with labeled data?}
    D2 -- Yes --> UseML[Use Traditional ML: classifier, regressor]
    D2 -- No --> D3{Single generation enough?}
    D3 -- Yes --> UseLLM[Use an LLM: OpenAI, Anthropic, Gemini]
    D3 -- No --> UseAgent[Use an Agent: ReAct, tool calls, loops]
```

The ladder rule: Code → ML → LLM → Agent is also a cost and complexity ladder. Climb only as high as the task demands; every rung you skip saves latency, dollars, and debugging hours.


🔒 Pure Code: When Determinism Is Non-Negotiable

Any operation where you can write an explicit rule belongs here.

| Use case | Code approach |
|---|---|
| Calculate tax = subtotal * 0.08 | 1 line of arithmetic |
| Validate email format | Regex |
| Parse a known JSON schema | json.loads() |
| Sort a list by timestamp | sorted(items, key=lambda x: x.ts) |
| Route a payment to the right processor | If-else / pattern matching |

When code beats AI: Banking transactions, data migrations, format validation, mathematical computations, protocol parsing. The rule: if a junior developer could write a test that covers every case, write code.
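A minimal sketch of the first two rows of the table above. The flat tax rate and the email regex are illustrative assumptions; production email validation may need to track RFC 5322 more closely:

```python
import re

TAX_RATE = 0.08  # illustrative flat rate, not a real jurisdiction's rate

def calculate_tax(subtotal: float) -> float:
    """One line of arithmetic: no model required."""
    return round(subtotal * TAX_RATE, 2)

# A deliberately simple email-format check for illustration.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def is_valid_email(addr: str) -> bool:
    return EMAIL_RE.fullmatch(addr) is not None

print(calculate_tax(100.0))             # 8.0
print(is_valid_email("a@b.com"))        # True
print(is_valid_email("not-an-email"))   # False
```

Both functions are deterministic and fully coverable by unit tests, which is exactly the property the junior-developer rule above is testing for.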


βš™οΈ Traditional ML: Patterns in Structured Tabular Data

Use ML when the rule is too complex to write by hand, but the input is structured (rows and columns with known features).

```mermaid
flowchart LR
    Features[Structured Features: age, amount, location] --> Model[ML Model: XGBoost / Random Forest]
    Model --> Prediction[Score or Label: fraud probability]
```

| Use case | Features | Model |
|---|---|---|
| Fraud detection | Amount, merchant, velocity | Gradient boosting (XGBoost) |
| Churn prediction | Login frequency, support tickets | Logistic regression |
| Product recommendations | Purchase history, ratings | Collaborative filtering / matrix factorization |
| House price estimation | sq ft, location, year | Linear regression |
| Spam filter (classic) | Word frequencies (TF-IDF) | Naive Bayes / SVM |

ML requires: labeled training data, feature engineering, model evaluation, and a retraining pipeline. If you don't have those, use rules instead.
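As a hedged sketch of the fraud-scoring row, here is scikit-learn's GradientBoostingClassifier standing in for XGBoost, trained on a handful of synthetic transactions. All feature values and labels below are made up for illustration:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Structured features per transaction: [amount, merchant_risk, tx_velocity]
X_train = [
    [12.0, 0.10, 1], [25.0, 0.20, 2], [8.0, 0.10, 1],        # legitimate
    [950.0, 0.90, 12], [1200.0, 0.80, 15], [700.0, 0.95, 9],  # fraud
]
y_train = [0, 0, 0, 1, 1, 1]  # 1 = fraud

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Score a new transaction: probability of the fraud class.
fraud_prob = model.predict_proba([[1100.0, 0.85, 11]])[0][1]
print(f"fraud probability: {fraud_prob:.2f}")
```

A real deployment would add the pieces listed above (feature engineering, a hold-out evaluation, and a retraining pipeline); this sketch only shows the fit/score shape of the paradigm.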


🧠 LLMs: When the Input Is Unstructured Text

LLMs excel at tasks where the input is free-form text and the output is also text (or a structured schema derived from text).

| Task | Why LLM | Why not code/ML |
|---|---|---|
| Summarize a 20-page PDF | Understands context and importance | Rules can't; ML needs fine-tuning |
| Classify support ticket intent | Handles natural language variation | Rules miss edge cases; ML needs labeled data |
| Generate code from a description | Trained on vast code corpus | Impossible with deterministic rules |
| Extract entities from unstructured text | Flexible to schema variation | Classic NER models need annotation per domain |
| Answer questions about a document (RAG) | Combines retrieval + reasoning | Rules don't reason; classic ML doesn't generalize here |

Cost reminder: Every LLM call costs money and adds latency. Never use an LLM for tasks that code or a simple ML model can solve.
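As an illustration of a single-generation LLM task, here is a hedged sketch of the summarization row using the OpenAI Python SDK. The model name, prompt wording, and temperature choice are assumptions for illustration, not recommendations from this post:

```python
def build_summary_messages(document: str) -> list[dict]:
    """Construct a chat payload asking for a short summary."""
    return [
        {"role": "system",
         "content": "Summarize the user's document in three bullet points."},
        {"role": "user", "content": document},
    ]

def summarize(document: str) -> str:
    # pip install openai; requires the OPENAI_API_KEY environment variable.
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                       # assumed model choice
        messages=build_summary_messages(document),
        temperature=0,                             # steer toward determinism
    )
    return resp.choices[0].message.content

msgs = build_summary_messages("Q3 revenue grew 12% while churn fell to 2%.")
print(msgs[0]["role"], len(msgs))  # system 2
```

Note that only `build_summary_messages` runs without a network call; `summarize` needs an API key and incurs exactly the cost and latency this reminder warns about.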


🔬 Internals

Traditional software follows deterministic control flow: the same input always produces the same output via explicit conditional branches. ML models encode statistical patterns into weight matrices; inference is a series of matrix multiplications that approximate a learned function. LLMs combine both: they are probabilistic sequence models that can be steered toward determinism via temperature=0 and constrained decoding.

⚡ Performance Analysis

A GPT-4-class API call averages 1–3 seconds for a 500-token response at standard load, 100–1000× slower than a local function call. Fine-tuned smaller models (7B) can match GPT-4 on narrow tasks at 10–20× lower cost and <200ms latency on GPU. The break-even point for building a custom ML pipeline versus using an LLM API is typically ~10M requests/month.
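The break-even claim can be sanity-checked with back-of-the-envelope arithmetic. Every dollar amount below is an assumed example, not a quoted price:

```python
# Rough break-even sketch: hosted LLM API vs a self-hosted custom pipeline.
# All numbers are illustrative assumptions.
requests_per_month = 10_000_000
api_cost_per_request = 0.002        # e.g. ~500 tokens at ~$0.004/1K tokens
selfhost_fixed_monthly = 12_000.0   # GPUs + MLOps time, amortized per month
selfhost_cost_per_request = 0.0002  # assumed ~10x cheaper inference

api_total = requests_per_month * api_cost_per_request
selfhost_total = selfhost_fixed_monthly + requests_per_month * selfhost_cost_per_request

print(f"API:       ${api_total:,.0f}/month")       # $20,000/month
print(f"Self-host: ${selfhost_total:,.0f}/month")  # $14,000/month

# Volume at which the two cost lines cross:
break_even = selfhost_fixed_monthly / (api_cost_per_request - selfhost_cost_per_request)
print(f"Break-even: {break_even:,.0f} requests/month")  # ~6.7M under these assumptions
```

Under these assumed numbers the crossover lands in the same order of magnitude as the ~10M figure above; plug in your own prices to get a figure you can defend.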

🤖 Agents: For Multi-Step Goals That Require External Tools

Use agents when completing the task requires:

  1. Multiple actions (not just one generation)
  2. Calling external APIs or tools (not just text transformation)
  3. Adapting plans based on intermediate results

| Task | Agent needed? | Why |
|---|---|---|
| "Summarize this document" | No | Single LLM call |
| "Book the cheapest flight to Paris next Tuesday" | Yes | Needs search API, calendar check, payment API |
| "Send a weekly report email" | No | Code + cron job |
| "Debug this CI failure and open a PR with the fix" | Yes | Needs GitHub API, test runner, code editor |
| "What's 2 + 2?" | No | Code |

Red flag: If you're describing your agent as "it just generates text and returns it," you needed a plain LLM call, not an agent.
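To make the distinction concrete, here is a minimal ReAct-style loop with the planner LLM and both tools stubbed out so the control flow is visible. The tool names and scripted responses are hypothetical, chosen to mirror the flight-booking row above:

```python
def fake_llm(history: list[str]) -> str:
    """Stand-in planner: scripted responses instead of a real model call."""
    script = ["CALL search_flights Paris Tuesday",
              "CALL book_flight FL123",
              "DONE booked FL123"]
    return script[len([h for h in history if h.startswith("CALL")])]

# Hypothetical tools; in a real agent these hit external APIs.
TOOLS = {
    "search_flights": lambda *a: "cheapest: FL123 at $89",
    "book_flight":    lambda *a: "confirmation #C42",
}

def run_agent(goal: str, max_iters: int = 5) -> str:
    history = [goal]
    for _ in range(max_iters):  # hard iteration cap
        action = fake_llm(history)
        if action.startswith("DONE"):
            return action.removeprefix("DONE ").strip()
        _, tool, *args = action.split()
        history.append(action)
        history.append(f"OBSERVATION: {TOOLS[tool](*args)}")
    raise RuntimeError("iteration cap hit")

print(run_agent("Book the cheapest flight to Paris next Tuesday"))  # booked FL123
```

The defining features are all here: multiple model invocations, tool dispatch, observations fed back into the plan, and a loop bound. If your "agent" never takes the tool-dispatch branch, it was a plain LLM call.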


βš–οΈ Trade-offs & Failure Modes: When Each Paradigm Breaks Down

```mermaid
flowchart TD
    Start([New requirement]) --> Q1{Is the output deterministic?}
    Q1 -- Yes --> Code[Write Code: if/else, math, regex]
    Q1 -- No --> Q2{Is input structured data?}
    Q2 -- Yes --> ML[Traditional ML: XGBoost, sklearn]
    Q2 -- No --> Q3{Is a single generation enough?}
    Q3 -- Yes --> LLM[LLM Call: OpenAI, Anthropic, Gemini]
    Q3 -- No --> Agent[AI Agent: ReAct + tools]
```

This flowchart maps any new requirement through a four-question decision tree that routes it to the right paradigm. Starting from whether the output is deterministic, it forks through questions about data structure and generation complexity, landing at code, traditional ML, a plain LLM call, or an agent. Use it as a quick sanity check before reaching for AI: if the path resolves to "Write Code," no machine learning or language model is justified.

| Paradigm | Latency | Cost | Predictability | Best for |
|---|---|---|---|---|
| Code | Microseconds | Free | 100% deterministic | Rules, math, format |
| ML | Milliseconds | Low inference cost | High with good data | Structured predictions |
| LLM | 500ms–3s | $0.001–$0.06/1K tokens | Variable (hallucination risk) | Unstructured text |
| Agent | Seconds–minutes | Multiplied by iterations | Low without guardrails | Multi-step tool tasks |

📊 LLM vs Code vs ML Decision

```mermaid
flowchart TD
    TK[Task Type] --> UN{Unstructured text?}
    UN -- Yes --> LM[Use LLM]
    UN -- No --> LB{Labeled data?}
    LB -- Yes --> ML[Train ML model]
    LB -- No --> RU{Rule-based ok?}
    RU -- Yes --> CD[Write code rules]
    RU -- No --> FT[Fine-tune LLM]
```

This second decision flowchart narrows the choice specifically between LLM, ML, and code once you know the task involves language or prediction. If the input is unstructured text, use an LLM; if you have labeled structured data, train an ML model; if rules will cover every case, write code; only fine-tune an LLM when off-the-shelf models fall short on domain-specific vocabulary or reasoning. The four leaves represent the minimum-viable path, and the rule is to start from the leftmost applicable branch: never skip to fine-tuning before trying a base model.


🌍 Real-World Applications: Examples Across Industries

The four paradigms rarely appear alone in a production system; real products layer them across different sub-tasks. Here is how the split looks in three common domains.

E-Commerce

  • Code: checkout math, total = (price × qty) - discount + tax. Deterministic, unit-testable, zero AI needed.
  • Traditional ML: product recommendations. A collaborative-filtering model ranks items based on purchase history and ratings.
  • LLM: product description drafting. Turns a JSON of attributes (name, category, specs) into polished marketing copy.
  • Agent: order support bot. A user asks "Where is my order?" The agent queries the orders API, calls the shipping API, and drafts a personalized reply.

Fintech

  • Code: interest calculation and fee rounding (must be auditable to the cent).
  • Traditional ML: credit-risk scoring from structured applicant features (income, debt ratio, history).
  • LLM: contract clause extraction from uploaded PDF documents.
  • Agent: automated reconciliation. Pulls daily transaction exports, flags discrepancies, and opens a Jira ticket for human review.

Healthcare and Devtools

| Domain | Code | Traditional ML | LLM | Agent |
|---|---|---|---|---|
| Healthcare | Dosage formula | Readmission risk model | Clinical note summarization | Prior-auth agent that fills forms and emails payers |
| Devtools | Test runner logic | Bug-priority prediction | Docstring generation | CI debug bot: reads logs, writes a fix, opens a PR |

No domain is owned by a single paradigm. The right tool depends on the specific sub-task, not the industry.


🧪 Practical Decision Checklist

Before adding any AI component, run through these five questions. It takes under two minutes and prevents months of over-engineering.

1. Could a developer write a rule that covers every case? → If yes, write code. No ML, no LLM.

2. Is the input structured (rows, columns, known schema) and do you have labeled training data? → If yes, explore gradient boosting or a simple classifier before reaching for an LLM.

3. Does the task require understanding natural language input or producing fluent text output? → If yes, an LLM call is likely appropriate. Check latency budget and token cost first.

4. Does completing the task require calling external APIs, reading live data, or executing multiple dependent steps? → If yes, consider an agent. If the task is still a single text transformation, use a plain LLM call.

5. Does the output need to be 100% reproducible and auditable? → If yes, stay in Code or ML territory. LLMs and agents are non-deterministic by default.

Red flags for AI over-engineering:

  • "We just need to validate the zip code format." → Write a regex.
  • "It needs to understand intent for a dropdown with 4 options." → Write the dropdown with 4 options.
  • "The agent just calls one API and returns the result." → That is an LLM call with a function call, not an agent.
  • Latency budget is under 100 ms. → Neither LLMs nor agents can reliably meet this; use code or a cached ML model.

📊 AI Tool Selection Map

```mermaid
flowchart LR
    UC[Use Case] --> GEN[General QA or chat]
    UC --> DOM[Domain-specific task]
    UC --> STR[Structured prediction]
    GEN --> LLM[LLM API call]
    DOM --> FTLLM[Fine-tuned LLM]
    STR --> SML[Traditional ML model]
```

This diagram maps three high-level use-case categories (general Q&A or chat, domain-specific tasks, and structured prediction) to the appropriate tool. General Q&A flows directly to a standard LLM API call; domain-specific tasks where the base model lacks specialized vocabulary or reasoning justify a fine-tuned LLM; structured prediction (classification, regression, ranking from tabular data) should use a traditional ML model. The takeaway is that fine-tuning is never the default starting point: it is only justified when a domain gap is confirmed with a base-model evaluation.


πŸ› οΈ HuggingFace Transformers & scikit-learn: LLM Inference vs Traditional ML Pipeline Side by Side

HuggingFace Transformers is the go-to Python library for loading and running open-source LLMs (GPT-2, LLaMA, Mistral). Its pipeline() abstraction wraps tokenization, model inference, and decoding into a single callable, making it easy to benchmark an LLM approach alongside a traditional ML baseline before committing either to production.

The snippet below applies the decision framework from this post to a real task, intent classification, solved two ways: a scikit-learn TF-IDF + Logistic Regression classifier (Traditional ML: sub-millisecond, trained on your labeled data) and a zero-shot HuggingFace pipeline (LLM: no training data required, but slower and heavier).

```python
# pip install scikit-learn transformers torch

# ── Traditional ML: scikit-learn text classifier ─────────────────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts  = ["cancel my order", "where is my package", "I want a refund"]
train_labels = ["cancel", "track", "refund"]

ml_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr",    LogisticRegression()),
])
ml_clf.fit(train_texts, train_labels)

# The query shares the token "refund" with the training data. A query with no
# vocabulary overlap (e.g. "I'd like to return this item") would yield an
# unreliable prediction with only three training examples.
print(ml_clf.predict(["Give me a refund for this item"]))  # → ['refund']

# ── LLM: zero-shot classification via HuggingFace Transformers ────────────────
from transformers import pipeline

llm_clf = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",   # runs locally: no API key required
)

result = llm_clf(
    "I'd like to return this item",     # no token overlap needed here
    candidate_labels=["cancel", "track", "refund"],
)
print(result["labels"][0])              # → 'refund'
```

| Approach | Training data needed | Inference latency | Best when |
|---|---|---|---|
| scikit-learn (Traditional ML) | Yes (labeled examples) | < 1 ms | Stable label set, data already available |
| HuggingFace zero-shot (LLM) | No | 200–800 ms (CPU) | Low-data phase or evolving label schema |

The decision framework maps directly here: start with the ML classifier once you have 50+ labeled examples. Use zero-shot only while your label schema is still evolving; it is the LLM rung on the ladder, justified when no labeled data exists.
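One way to decide when to graduate is a tiny evaluation harness that scores both approaches on the same held-out labeled set. The two classifier callables below are stubs so the sketch runs standalone; in practice you would pass ml_clf.predict and the zero-shot pipeline from the snippet above:

```python
def accuracy(classify, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of held-out examples a classifier callable gets right."""
    correct = sum(1 for text, label in eval_set if classify(text) == label)
    return correct / len(eval_set)

# Hypothetical held-out examples (made up for illustration).
eval_set = [
    ("please cancel order 991", "cancel"),
    ("has my parcel shipped yet", "track"),
    ("money back please", "refund"),
    ("stop my subscription", "cancel"),
]

# Stubs standing in for the two real classifiers: the keyword "ML" stub
# misses the paraphrase "money back", the zero-shot stub handles it.
stub_ml  = lambda t: ("cancel" if "cancel" in t or "stop" in t
                      else "refund" if "refund" in t else "track")
stub_llm = lambda t: ("refund" if "back" in t or "refund" in t
                      else "cancel" if "cancel" in t or "stop" in t else "track")

for name, clf in [("ml", stub_ml), ("zero-shot", stub_llm)]:
    print(f"{name}: {accuracy(clf, eval_set):.2f}")  # ml: 0.75, zero-shot: 1.00
```

Once the trained classifier matches or beats zero-shot on this harness, switch to it and pocket the latency and cost savings.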

For a full deep-dive on HuggingFace Transformers and scikit-learn pipelines, a dedicated follow-up post is planned.


📚 Key Lessons from the Field

These are the hard-won lessons from teams that have shipped all four paradigms in production.

1. The biggest mistake is using LLMs for everything. Every project that over-indexes on AI ends up rewriting core logic in plain code six months later. LLM calls replacing if/else blocks are the most expensive technical debt in modern software.

2. Agents cost far more than a single LLM call. A ReAct loop that makes five tool calls before returning an answer can cost 10–50× what a direct LLM call costs in tokens and latency. Add hard iteration caps and cost circuit-breakers before going to production.

3. Hybrid approaches often outperform pure-AI solutions. A fraud system that uses XGBoost to score 95% of transactions (fast, cheap, auditable) and escalates borderline cases to an LLM for explanation frequently beats a pure-LLM pipeline on cost, latency, and explainability.

4. Testing strategies differ across paradigms.

  • Code: Unit tests with full coverage.
  • ML: Hold-out test set, confusion matrix, data-drift monitoring.
  • LLM: Prompt regression tests (golden prompt–output pairs), LLM-as-judge evaluations.
  • Agent: End-to-end integration tests with tool mocking, timeout validation, and cost-per-run tracking.

5. Start simple and graduate up on evidence. Deploy the rule-based solution first. Measure where it fails. Introduce ML or an LLM only for the specific failure mode you can prove with data. This approach is faster to ship, easier to debug, and cheaper to run.
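A sketch of the cost circuit-breaker from lesson 2: a budget object charged on every model or tool call that aborts the run once spend crosses a cap. The price per 1K tokens and the simulated step sizes are illustrative assumptions:

```python
class BudgetExceeded(RuntimeError):
    pass

class CostBudget:
    """Tracks cumulative spend and trips once a hard cap is crossed."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, tokens: int, usd_per_1k: float = 0.01) -> None:
        # usd_per_1k is an assumed illustrative price, not a real rate card.
        self.spent += tokens / 1000 * usd_per_1k
        if self.spent > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} > cap ${self.max_usd:.2f}")

budget = CostBudget(max_usd=0.05)
try:
    for step_tokens in [1500, 2200, 1800, 2500]:  # simulated agent iterations
        budget.charge(step_tokens)
        print(f"ok, spent ${budget.spent:.4f}")
except BudgetExceeded as e:
    print("aborted:", e)
```

Calling `budget.charge(...)` before every LLM or tool invocation turns a runaway agent loop into a bounded, predictable cost, which pairs naturally with the hard iteration cap from the same lesson.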


📌 TLDR: Summary & Key Takeaways

  • Code for deterministic logic you can express as explicit rules (banking, math, validation).
  • Traditional ML for structured data with learnable patterns (fraud, churn, recs).
  • LLMs for unstructured text tasks: summarization, classification, generation.
  • Agents only when the task is multi-step and requires external tool calls.
  • Cost and latency scale: Code < ML < LLM < Agent. Use the cheapest tool that solves the problem.

Written by Abstract Algorithms (@abstractalgorithms)