The Developer's Guide: When to Use Code, ML, LLMs, or Agents
Stop trying to solve everything with ChatGPT. We provide a decision framework for modern developers.
Abstract Algorithms
TLDR: AI is a tool, not a religion. Use Code for deterministic logic (banking, math). Use Traditional ML for structured predictions (fraud, recommendations). Use LLMs for unstructured text (summarization, chat). Use Agents only when a task genuinely requires multi-step planning and external tool calls.
One Codebase, Four Paradigms: Know Before You Reach for the LLM
The most expensive mistake in modern software is using an LLM for a problem deterministic code solves in 5 lines.
Before adding an AI component, ask two questions:
- Is the output deterministic? If yes, write code.
- Does the input have known structure? If yes, use ML.
If both answers are no and the input is natural language, then LLMs are the right tool. Agents are warranted only when the task requires multiple steps with external tool calls to complete.
Deep Dive: The Four Paradigms at a Glance
A quick reference before diving deeper. Each paradigm has a distinct role; picking the wrong one costs time, money, and unnecessary complexity.
| Paradigm | One-Line Definition | Key Use Case |
| --- | --- | --- |
| Code | Explicit rules the developer writes | Tax calculation, email validation, data parsing |
| Traditional ML | A model trained to find patterns in labeled data | Fraud scoring, churn prediction, recommendations |
| LLM | A large language model that reasons over free-form text | Summarization, intent classification, code generation |
| Agent | An LLM that orchestrates multi-step actions with external tools | Booking flows, CI debug bots, autonomous research |
Think of it as a filter: start at Code and only move down the list when the previous paradigm genuinely cannot handle the problem.
The Decision Flow: From Problem to Paradigm
Choosing the right approach starts by analyzing your problem's input structure, output requirements, and tolerance for non-determinism.
flowchart TD
A[New Problem] --> B{Needs language understanding?}
B -->|No| C{Has labeled training data?}
B -->|Yes| D{Multi-step reasoning or tool use?}
C -->|No| E[Pure Code / Heuristics]
C -->|Yes| F[Traditional ML]
D -->|Yes| G[Agents]
D -->|No| H[LLM API Call]
🧭 Decision Guide: Choosing the Right Tool
The flowchart below turns the two intro questions into a repeatable process you can run mentally for any new backlog item.
flowchart TD
Feature([New backlog item]) --> D1{Can you write an explicit rule?}
D1 -- Yes --> UseCode[Use code: if/else, regex, math]
D1 -- No --> D2{Structured input with labeled data?}
D2 -- Yes --> UseML[Use traditional ML: classifier / regressor]
D2 -- No --> D3{Single generation enough?}
D3 -- Yes --> UseLLM[Use an LLM: OpenAI, Anthropic, Gemini]
D3 -- No --> UseAgent[Use an agent: ReAct, tool calls, loops]
The ladder rule: Code → ML → LLM → Agent is also a cost and complexity ladder. Climb only as high as the task demands; every rung you skip saves latency, dollars, and debugging hours.
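The ladder can be sketched as a tiny routing helper. This is an illustration of the decision tree above, not a library API; the function name and return labels are made up for the example.

```python
def choose_paradigm(explicit_rule: bool,
                    labeled_structured_data: bool,
                    single_generation_enough: bool) -> str:
    """Route a backlog item down the Code -> ML -> LLM -> Agent ladder."""
    if explicit_rule:
        return "code"            # if/else, regex, math
    if labeled_structured_data:
        return "traditional_ml"  # classifier / regressor
    if single_generation_enough:
        return "llm"             # one API call, no loop
    return "agent"               # multi-step, tool-using loop

print(choose_paradigm(True, False, False))   # code (e.g. tax calculation)
print(choose_paradigm(False, True, False))   # traditional_ml (fraud scoring)
print(choose_paradigm(False, False, True))   # llm (summarization)
print(choose_paradigm(False, False, False))  # agent (booking flow)
```

The ordering of the checks is the point: the function cannot return "agent" until every cheaper rung has been ruled out.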
🔢 Pure Code: When Determinism Is Non-Negotiable
Any operation where you can write an explicit rule belongs here.
| Use case | Code approach |
| --- | --- |
| Calculate tax = subtotal * 0.08 | 1 line of arithmetic |
| Validate email format | Regex |
| Parse a known JSON schema | json.loads() |
| Sort a list by timestamp | sorted(items, key=lambda x: x.ts) |
| Route a payment to the right processor | If-else / pattern matching |
When code beats AI: Banking transactions, data migrations, format validation, mathematical computations, protocol parsing. The rule: if a junior developer could write a test that covers every case, write code.
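A few of the table's rows as runnable one-liners. The 8% rate comes from the table's own example; the email regex is a deliberately simple sketch, not a full RFC 5322 validator.

```python
import re
from dataclasses import dataclass

def calculate_tax(subtotal: float) -> float:
    return subtotal * 0.08  # one line of arithmetic, fully deterministic

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simple sketch, not RFC-complete
def is_valid_email(addr: str) -> bool:
    return bool(EMAIL_RE.match(addr))

@dataclass
class Item:
    ts: int  # timestamp

def sort_by_timestamp(items):
    return sorted(items, key=lambda x: x.ts)

print(round(calculate_tax(100.0), 2))   # 8.0
print(is_valid_email("a@example.com"))  # True
print(is_valid_email("not-an-email"))   # False
```

Every one of these functions can be covered by an exhaustive unit test, which is exactly the signal that no AI belongs here.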
⚙️ Traditional ML: Patterns in Structured Tabular Data
Use ML when the rule is too complex to write by hand, but the input is structured (rows and columns with known features).
flowchart LR
Features[Structured features: age, amount, location] --> Model[ML model: XGBoost / Random Forest]
Model --> Prediction[Score or label: fraud probability]
| Use case | Features | Model |
| --- | --- | --- |
| Fraud detection | Amount, merchant, velocity | Gradient boosting (XGBoost) |
| Churn prediction | Login frequency, support tickets | Logistic regression |
| Product recommendations | Purchase history, ratings | Collaborative filtering / Matrix factorization |
| House price estimation | sq ft, location, year | Linear regression |
| Spam filter (classic) | Word frequencies (TF-IDF) | Naive Bayes / SVM |
ML requires: labeled training data, feature engineering, model evaluation, and a retraining pipeline. If you don't have those, use rules instead.
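A minimal sketch of the structured-prediction setup with scikit-learn. The two features and toy labels are invented for illustration; a real fraud model needs far more data, feature engineering, and a proper hold-out evaluation.

```python
from sklearn.linear_model import LogisticRegression

# Structured features: [transaction_amount, transactions_last_hour]
X_train = [
    [12.0, 1], [30.0, 2], [8.0, 1], [25.0, 1],          # legitimate (0)
    [900.0, 9], [1200.0, 12], [750.0, 8], [990.0, 10],  # fraudulent (1)
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(X_train, y_train)

print(int(model.predict([[15.0, 1]])[0]))     # 0: low amount, low velocity
print(int(model.predict([[1000.0, 11]])[0]))  # 1: high amount, high velocity
```

The rule too complex to hand-write (how amount and velocity jointly signal fraud) is learned from labels instead; the cost is the labeled data and retraining pipeline the paragraph above lists.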
🧠 Deep Dive: LLMs When the Input Is Unstructured Text
LLMs excel at tasks where the input is free-form text and the output is also text (or a structured schema derived from text).
| Task | Why LLM | Why not code/ML |
| --- | --- | --- |
| Summarize a 20-page PDF | Understands context and importance | Rules can't, ML needs fine-tuning |
| Classify support ticket intent | Handles natural language variation | Rules miss edge cases, ML needs labeled data |
| Generate code from a description | Trained on vast code corpus | Impossible with deterministic rules |
| Extract entities from unstructured text | Flexible to schema variation | Classic NER models need annotation per domain |
| Answer questions about a document (RAG) | Combines retrieval + reasoning | Rules don't reason; classic ML doesn't generalize here |
Cost reminder: Every LLM call costs money and adds latency. Never use an LLM for tasks that code or a simple ML model can solve.
🔬 Internals
Traditional software follows deterministic control flow: the same input always produces the same output via explicit conditional branches. ML models encode statistical patterns into weight matrices; inference is a series of matrix multiplications that approximate a learned function. LLMs combine both: they are probabilistic sequence models that can be steered toward determinism via temperature=0 and constrained decoding.
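The temperature point can be illustrated with a toy next-token distribution (pure Python, no real model): dividing logits by a temperature before the softmax flattens or sharpens the distribution, and greedy argmax, the temperature → 0 limit, is fully deterministic.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature = sharper distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"Paris": 3.0, "London": 2.0, "Berlin": 1.0}  # toy next-token scores

probs_hot = softmax(list(logits.values()), temperature=2.0)   # flat: sampling is random
probs_cold = softmax(list(logits.values()), temperature=0.1)  # sharp: near-greedy
greedy = max(logits, key=logits.get)                          # temperature -> 0 limit

print(greedy)                   # Paris: always the same token, every run
print(round(probs_cold[0], 3))  # ~1.0: cold sampling is effectively deterministic
```

This is why temperature=0 is the standard setting when an LLM call must behave like a function rather than a writer.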
⚡ Performance Analysis
A GPT-4-class API call averages 1–3 seconds for a 500-token response at standard load: 100–1000× slower than a local function call. Fine-tuned smaller models (7B) can match GPT-4 on narrow tasks at 10–20× lower cost and <200 ms latency on GPU. The break-even point for building a custom ML pipeline versus using an LLM API is typically ~10M requests/month.
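The break-even claim can be sanity-checked with a back-of-envelope calculation. All constants below are illustrative assumptions, not vendor pricing: a per-request API cost and a fixed monthly cost for a self-hosted pipeline with negligible marginal cost.

```python
# Back-of-envelope: per-request API cost vs fixed self-hosted pipeline cost.
# All figures are illustrative assumptions, not real vendor pricing.
API_COST_PER_REQUEST = 0.002        # e.g. ~500 tokens at ~$0.004 per 1K tokens
PIPELINE_FIXED_MONTHLY = 20_000.0   # GPU serving + amortized MLOps effort
PIPELINE_COST_PER_REQUEST = 0.0     # marginal cost assumed negligible

def monthly_cost_api(requests: int) -> float:
    return requests * API_COST_PER_REQUEST

def monthly_cost_pipeline(requests: int) -> float:
    return PIPELINE_FIXED_MONTHLY + requests * PIPELINE_COST_PER_REQUEST

break_even = PIPELINE_FIXED_MONTHLY / API_COST_PER_REQUEST
print(f"{break_even:,.0f} requests/month")  # 10,000,000 requests/month
```

Under these assumptions the crossover lands at 10M requests/month, consistent with the ~10M figure above; your own numbers will shift it, but the shape (linear API cost vs flat pipeline cost) is the same.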
🤖 Agents: For Multi-Step Goals That Require External Tools
Use agents when completing the task requires:
- Multiple actions (not just one generation)
- Calling external APIs or tools (not just text transformation)
- Adapting plans based on intermediate results
| Task | Agent needed? | Why |
| --- | --- | --- |
| "Summarize this document" | No | Single LLM call |
| "Book the cheapest flight to Paris next Tuesday" | Yes | Needs search API, calendar check, payment API |
| "Send a weekly report email" | No | Code + cron job |
| "Debug this CI failure and open a PR with the fix" | Yes | Needs GitHub API, test runner, code editor |
| "What's 2 + 2?" | No | Code |
Red flag: If you're describing your agent as "it just generates text and returns it," you needed a plain LLM call, not an agent.
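To make the distinction concrete, here is a minimal ReAct-style agent skeleton for the flight-booking row above, with the tools mocked as local functions and the "LLM planner" replaced by a stub policy so the control flow is visible. Everything here (function names, prices, destinations) is invented for illustration; the defining agent features are the tool calls, the observe-then-replan loop, and the hard iteration cap — not the text generation.

```python
# Minimal agent skeleton: plan -> act -> observe, with a hard iteration cap.

def search_flights(dest, day):
    """Mocked flight-search API."""
    return [{"dest": dest, "day": day, "price": p} for p in (420, 310, 505)]

def book_flight(flight):
    """Mocked payment/booking API."""
    return f"booked {flight['dest']} on {flight['day']} for ${flight['price']}"

def policy(state):
    """Stub for the LLM planner: pick the next action from observations so far."""
    if "flights" not in state:
        return ("search_flights", {"dest": "Paris", "day": "Tuesday"})
    cheapest = min(state["flights"], key=lambda f: f["price"])
    return ("book_flight", {"flight": cheapest})

def run_agent(max_iters=5):
    state = {}
    for _ in range(max_iters):          # hard cap: the cost circuit-breaker
        action, args = policy(state)
        if action == "search_flights":
            state["flights"] = search_flights(**args)   # act, then observe
        elif action == "book_flight":
            return book_flight(**args)                  # goal reached
    raise RuntimeError("iteration cap hit without completing the task")

print(run_agent())  # booked Paris on Tuesday for $310
```

If the loop body collapses to a single generate-and-return, the red flag above applies: you wanted a plain LLM call.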
⚖️ Trade-offs & Failure Modes: When Each Paradigm Breaks Down
flowchart TD
Start([New requirement]) --> Q1{Is the output deterministic?}
Q1 -- Yes --> Code[Write code: if/else, math, regex]
Q1 -- No --> Q2{Is input structured data?}
Q2 -- Yes --> ML[Traditional ML: XGBoost, sklearn]
Q2 -- No --> Q3{Is a single generation enough?}
Q3 -- Yes --> LLM[LLM call: OpenAI, Anthropic, Gemini]
Q3 -- No --> Agent[AI agent: ReAct + tools]
This flowchart maps any new requirement through a four-question decision tree that routes it to the right paradigm. Starting from whether the output is deterministic, it forks through questions about data structure and generation complexity, landing at code, traditional ML, a plain LLM call, or an agent. Use it as a quick sanity check before reaching for AI: if the path resolves to "Write Code," no machine learning or language model is justified.
| Paradigm | Latency | Cost | Predictability | Best for |
| --- | --- | --- | --- | --- |
| Code | Microseconds | Free | 100% deterministic | Rules, math, format |
| ML | Milliseconds | Low inference cost | High with good data | Structured predictions |
| LLM | 500 ms–3 s | $0.001–$0.06/1K tokens | Variable (hallucination risk) | Unstructured text |
| Agent | Seconds–minutes | Multiplied by iterations | Low without guardrails | Multi-step tool tasks |
LLM vs Code vs ML Decision
flowchart TD
TK[Task Type] --> UN{Unstructured text?}
UN -- Yes --> LM[Use LLM]
UN -- No --> LB{Labeled data?}
LB -- Yes --> ML[Train ML model]
LB -- No --> RU{Rule-based ok?}
RU -- Yes --> CD[Write code rules]
RU -- No --> FT[Fine-tune LLM]
This second decision flowchart narrows the choice specifically between LLM, ML, and code once you know the task involves language or prediction. If the input is unstructured text, use an LLM; if you have labeled structured data, train an ML model; if rules will cover every case, write code; only fine-tune an LLM when off-the-shelf models fall short on domain-specific vocabulary or reasoning. The four leaves represent the minimum-viable path, and the rule is to start from the leftmost applicable branch: never skip to fine-tuning before trying a base model.
Real-World Applications: Examples Across Industries
The four paradigms rarely appear alone in a production system; real products layer them across different sub-tasks. Here is how the split looks in three common domains.
E-Commerce
- Code → checkout math: total = (price × qty) - discount + tax. Deterministic, unit-testable, zero AI needed.
- Traditional ML → product recommendations: a collaborative-filtering model ranks items based on purchase history and ratings.
- LLM → product description drafting: turns a JSON of attributes (name, category, specs) into polished marketing copy.
- Agent → order support bot: a user asks "Where is my order?" The agent queries the orders API, calls the shipping API, and drafts a personalised reply.
Fintech
- Code → interest calculation and fee rounding (must be auditable to the cent).
- Traditional ML → credit-risk scoring from structured applicant features (income, debt ratio, history).
- LLM → contract clause extraction from uploaded PDF documents.
- Agent → automated reconciliation: pulls daily transaction exports, flags discrepancies, and opens a Jira ticket for human review.
Healthcare and Devtools
| Domain | Code | Traditional ML | LLM | Agent |
| --- | --- | --- | --- | --- |
| Healthcare | Dosage formula | Readmission risk model | Clinical note summarization | Prior-auth agent that fills forms and emails payers |
| Devtools | Test runner logic | Bug-priority prediction | Docstring generation | CI debug bot: reads logs, writes a fix, opens a PR |
No domain is owned by a single paradigm. The right tool depends on the specific sub-task, not the industry.
🧪 Practical Decision Checklist
Before adding any AI component, run through these five questions. It takes under two minutes and prevents months of over-engineering.
1. Could a developer write a rule that covers every case? → If yes, write code. No ML, no LLM.
2. Is the input structured (rows, columns, known schema) and do you have labeled training data? → If yes, explore gradient boosting or a simple classifier before reaching for an LLM.
3. Does the task require understanding natural language input or producing fluent text output? → If yes, an LLM call is likely appropriate. Check latency budget and token cost first.
4. Does completing the task require calling external APIs, reading live data, or executing multiple dependent steps? → If yes, consider an agent. If the task is still a single text transformation, use a plain LLM call.
5. Does the output need to be 100% reproducible and auditable? → If yes, stay in Code or ML territory. LLMs and agents are non-deterministic by default.
Red flags for AI over-engineering:
- "We just need to validate the zip code format." β Write a regex.
- "It needs to understand intent for a dropdown with 4 options." β Write the dropdown with 4 options.
- "The agent just calls one API and returns the result." β That is an LLM call with a function call, not an agent.
- Latency budget is under 100 ms. β Neither LLMs nor agents can reliably meet this; use code or a cached ML model.
AI Tool Selection Map
flowchart LR
UC[Use Case] --> GEN[General QA or chat]
UC --> DOM[Domain-specific task]
UC --> STR[Structured prediction]
GEN --> LLM[LLM API call]
DOM --> FTLLM[Fine-tuned LLM]
STR --> SML[Traditional ML model]
This diagram maps three high-level use-case categories (general Q&A or chat, domain-specific tasks, and structured prediction) to the appropriate tool. General Q&A flows directly to a standard LLM API call; domain-specific tasks where the base model lacks specialized vocabulary or reasoning justify a fine-tuned LLM; structured prediction (classification, regression, ranking from tabular data) should use a traditional ML model. The takeaway is that fine-tuning is never the default starting point; it is only justified when a domain gap is confirmed with a base-model evaluation.
🛠️ HuggingFace Transformers & scikit-learn: LLM Inference vs Traditional ML Pipeline Side by Side
HuggingFace Transformers is the go-to Python library for loading and running open-source LLMs (GPT-2, LLaMA, Mistral). Its pipeline() abstraction wraps tokenization, model inference, and decoding into a single callable, making it easy to benchmark an LLM approach alongside a traditional ML baseline before committing either to production.
The snippet below applies the decision framework from this post to a real task (intent classification) solved two ways: a scikit-learn TF-IDF + Logistic Regression classifier (Traditional ML, sub-millisecond, trained on your labeled data) and a zero-shot HuggingFace pipeline (LLM, no training data required, but slower and heavier).
# pip install scikit-learn transformers torch
# ── Traditional ML: scikit-learn text classifier ─────────────────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
train_texts = [
    "cancel my order",
    "where is my package",
    "I want a refund",
    "how do I return this item",  # gives the refund class vocabulary overlap with the test phrase
]
train_labels = ["cancel", "track", "refund", "refund"]
ml_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression()),
])
ml_clf.fit(train_texts, train_labels)
print(ml_clf.predict(["I'd like to return this item"]))  # → ['refund']
# ── LLM: zero-shot classification via HuggingFace Transformers ───────────────
from transformers import pipeline
llm_clf = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # runs locally, no API key required
)
result = llm_clf(
    "I'd like to return this item",
    candidate_labels=["cancel", "track", "refund"],
)
print(result["labels"][0])  # → 'refund'
| Approach | Training data needed | Inference latency | Best when |
| --- | --- | --- | --- |
| scikit-learn (Traditional ML) | Yes, labeled examples | < 1 ms | Stable label set, data already available |
| HuggingFace zero-shot (LLM) | No | 200–800 ms (CPU) | Low-data phase or evolving label schema |
The decision framework maps directly here: switch to the ML classifier once you have 50+ labeled examples, and use zero-shot only while your label schema is still evolving; it is the LLM rung on the ladder, justified when no labeled data exists.
For a full deep-dive on HuggingFace Transformers and scikit-learn pipelines, a dedicated follow-up post is planned.
Key Lessons from the Field
These are the hard-won lessons from teams that have shipped all four paradigms in production.
1. The biggest mistake is using LLMs for everything.
Every project that over-indexes on AI ends up rewriting core logic in plain code six months later. LLM calls replacing if/else blocks are the most expensive technical debt in modern software.
2. Agents cost far more than a single LLM call. A ReAct loop that makes five tool calls before returning an answer can cost 10–50× what a direct LLM call costs in tokens and latency. Add hard iteration caps and cost circuit-breakers before going to production.
3. Hybrid approaches often outperform pure-AI solutions. A fraud system that uses XGBoost to score 95% of transactions (fast, cheap, auditable) and escalates borderline cases to an LLM for explanation frequently beats a pure-LLM pipeline on cost, latency, and explainability.
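The hybrid pattern in lesson 3 can be sketched as a confidence-gated router. The scoring function, thresholds, and escalation stub below are all illustrative stand-ins, not a real fraud model.

```python
def fast_model_score(txn):
    """Stand-in for a fast XGBoost-style fraud score in [0, 1]."""
    return min(1.0, txn["amount"] / 1000.0)

def llm_explain(txn):
    """Stub for the expensive LLM escalation path (explanation for a human reviewer)."""
    return f"escalated: review ${txn['amount']} at {txn['merchant']}"

def route(txn, low=0.2, high=0.8):
    score = fast_model_score(txn)
    if score < low:
        return "approve"        # bulk of traffic: fast, cheap, auditable
    if score > high:
        return "block"          # clearly fraudulent: no LLM needed either
    return llm_explain(txn)     # borderline only: pay for the LLM here

print(route({"amount": 50, "merchant": "grocer"}))        # approve
print(route({"amount": 950, "merchant": "luxury"}))       # block
print(route({"amount": 500, "merchant": "electronics"}))  # escalated: review $500 ...
```

The design choice is that the expensive, non-deterministic component only ever sees the narrow band of cases the cheap model cannot decide, which is why the hybrid wins on cost, latency, and explainability at once.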
4. Testing strategies differ across paradigms.
- Code: Unit tests with full coverage.
- ML: Hold-out test set, confusion matrix, data-drift monitoring.
- LLM: Prompt regression tests (golden prompt–output pairs), LLM-as-judge evaluations.
- Agent: End-to-end integration tests with tool mocking, timeout validation, and cost-per-run tracking.
5. Start simple and graduate up on evidence. Deploy the rule-based solution first. Measure where it fails. Introduce ML or an LLM only for the specific failure mode you can prove with data. This approach is faster to ship, easier to debug, and cheaper to run.
TLDR: Summary & Key Takeaways
- Code for deterministic logic you can state as explicit rules (banking math, validation, parsing).
- Traditional ML for structured data with learnable patterns (fraud, churn, recs).
- LLMs for unstructured text tasks: summarization, classification, generation.
- Agents only when the task is multi-step and requires external tool calls.
- Cost and latency scale: Code < ML < LLM < Agent. Use the cheapest tool that solves the problem.
Written by
Abstract Algorithms
@abstractalgorithms