The Developer's Guide: When to Use Code, ML, LLMs, or Agents
Stop trying to solve everything with ChatGPT. We provide a decision framework for modern developers.
Abstract Algorithms
TLDR: AI is a tool, not a religion. Use Code for deterministic logic (banking, math). Use Traditional ML for structured predictions (fraud, recommendations). Use LLMs for unstructured text (summarization, chat). Use Agents only when a task genuinely requires multi-step planning and external tool calls.
One Codebase, Four Paradigms: Know Before You Reach for the LLM
The most expensive mistake in modern software is using an LLM for a problem deterministic code solves in 5 lines.
Before adding an AI component, ask two questions:
- Is the output deterministic? If yes, write code.
- Does the input have known structure? If yes, use ML.
If both answers are no and the input is natural language, then LLMs are the right tool. Agents are warranted only when the task requires multiple steps with external tool calls to complete.
Deep Dive: The Four Paradigms at a Glance
A quick reference before diving deeper. Each paradigm has a distinct role; picking the wrong one costs time, money, and unnecessary complexity.
| Paradigm | One-Line Definition | Key Use Case |
| --- | --- | --- |
| Code | Explicit rules the developer writes | Tax calculation, email validation, data parsing |
| Traditional ML | A model trained to find patterns in labeled data | Fraud scoring, churn prediction, recommendations |
| LLM | A large language model that reasons over free-form text | Summarization, intent classification, code generation |
| Agent | An LLM that orchestrates multi-step actions with external tools | Booking flows, CI debug bots, autonomous research |
Think of it as a filter: start at Code and only move down the list when the previous paradigm genuinely cannot handle the problem.
The Decision Flow: From Problem to Paradigm
Choosing the right approach starts by analyzing your problem's input structure, output requirements, and tolerance for non-determinism.
flowchart TD
A[New Problem] --> B{Needs language understanding?}
B -->|No| C{Has labeled training data?}
B -->|Yes| D{Multi-step reasoning or tool use?}
C -->|No| E[Pure Code / Heuristics]
C -->|Yes| F[Traditional ML]
D -->|Yes| G[Agents]
D -->|No| H[LLM API Call]
🧭 Decision Guide: Choosing the Right Tool
The flowchart below turns the two intro questions into a repeatable process you can run mentally for any new backlog item.
flowchart TD
Feature([New backlog item]) --> D1{Can you write an explicit rule?}
D1 -- Yes --> UseCode[Use code: if/else, regex, math]
D1 -- No --> D2{Structured input with labeled data?}
D2 -- Yes --> UseML[Use traditional ML: classifier / regressor]
D2 -- No --> D3{Single generation enough?}
D3 -- Yes --> UseLLM[Use an LLM: OpenAI, Anthropic, Gemini]
D3 -- No --> UseAgent[Use an agent: ReAct, tool calls, loops]
The ladder rule: Code → ML → LLM → Agent is also a cost and complexity ladder. Climb only as high as the task demands; every rung you skip saves latency, dollars, and debugging hours.
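The ladder can be sketched as a tiny routing helper. This is an illustration of the decision tree above, not a library API; the function name and return labels are made up for the example.

```python
def choose_paradigm(explicit_rule: bool,
                    labeled_structured_data: bool,
                    single_generation_enough: bool) -> str:
    """Route a backlog item down the Code -> ML -> LLM -> Agent ladder."""
    if explicit_rule:
        return "code"            # if/else, regex, math
    if labeled_structured_data:
        return "traditional_ml"  # classifier / regressor
    if single_generation_enough:
        return "llm"             # one API call, no loop
    return "agent"               # multi-step, tool-using loop

print(choose_paradigm(True, False, False))   # code (e.g. tax calculation)
print(choose_paradigm(False, True, False))   # traditional_ml (fraud scoring)
print(choose_paradigm(False, False, True))   # llm (summarization)
print(choose_paradigm(False, False, False))  # agent (booking flow)
```

The ordering of the checks is the point: the function cannot return "agent" until every cheaper rung has been ruled out.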
🔢 Pure Code: When Determinism Is Non-Negotiable
Any operation where you can write an explicit rule belongs here.
| Use case | Code approach |
| --- | --- |
| Calculate tax = subtotal * 0.08 | 1 line of arithmetic |
| Validate email format | Regex |
| Parse a known JSON schema | json.loads() |
| Sort a list by timestamp | sorted(items, key=lambda x: x.ts) |
| Route a payment to the right processor | If-else / pattern matching |
When code beats AI: Banking transactions, data migrations, format validation, mathematical computations, protocol parsing. The rule: if a junior developer could write a test that covers every case, write code.
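A few of the table's rows as runnable one-liners. The 8% rate comes from the table's own example; the email regex is a deliberately simple sketch, not a full RFC 5322 validator.

```python
import re
from dataclasses import dataclass

def calculate_tax(subtotal: float) -> float:
    return subtotal * 0.08  # one line of arithmetic, fully deterministic

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simple sketch, not RFC-complete
def is_valid_email(addr: str) -> bool:
    return bool(EMAIL_RE.match(addr))

@dataclass
class Item:
    ts: int  # timestamp

def sort_by_timestamp(items):
    return sorted(items, key=lambda x: x.ts)

print(round(calculate_tax(100.0), 2))   # 8.0
print(is_valid_email("a@example.com"))  # True
print(is_valid_email("not-an-email"))   # False
```

Every one of these functions can be covered by an exhaustive unit test, which is exactly the signal that no AI belongs here.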
⚙️ Traditional ML: Patterns in Structured Tabular Data
Use ML when the rule is too complex to write by hand, but the input is structured (rows and columns with known features).
flowchart LR
Features[Structured features: age, amount, location] --> Model[ML model: XGBoost / Random Forest]
Model --> Prediction[Score or label: fraud probability]
| Use case | Features | Model |
| --- | --- | --- |
| Fraud detection | Amount, merchant, velocity | Gradient boosting (XGBoost) |
| Churn prediction | Login frequency, support tickets | Logistic regression |
| Product recommendations | Purchase history, ratings | Collaborative filtering / Matrix factorization |
| House price estimation | sq ft, location, year | Linear regression |
| Spam filter (classic) | Word frequencies (TF-IDF) | Naive Bayes / SVM |
ML requires: labeled training data, feature engineering, model evaluation, and a retraining pipeline. If you don't have those, use rules instead.
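A minimal sketch of the structured-prediction setup with scikit-learn. The two features and toy labels are invented for illustration; a real fraud model needs far more data, feature engineering, and a proper hold-out evaluation.

```python
from sklearn.linear_model import LogisticRegression

# Structured features: [transaction_amount, transactions_last_hour]
X_train = [
    [12.0, 1], [30.0, 2], [8.0, 1], [25.0, 1],          # legitimate (0)
    [900.0, 9], [1200.0, 12], [750.0, 8], [990.0, 10],  # fraudulent (1)
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(X_train, y_train)

print(int(model.predict([[15.0, 1]])[0]))     # 0: low amount, low velocity
print(int(model.predict([[1000.0, 11]])[0]))  # 1: high amount, high velocity
```

The rule too complex to hand-write (how amount and velocity jointly signal fraud) is learned from labels instead; the cost is the labeled data and retraining pipeline the paragraph above lists.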
🧠 Deep Dive: LLMs When the Input Is Unstructured Text
LLMs excel at tasks where the input is free-form text and the output is also text (or a structured schema derived from text).
| Task | Why LLM | Why not code/ML |
| --- | --- | --- |
| Summarize a 20-page PDF | Understands context and importance | Rules can't, ML needs fine-tuning |
| Classify support ticket intent | Handles natural language variation | Rules miss edge cases, ML needs labeled data |
| Generate code from a description | Trained on vast code corpus | Impossible with deterministic rules |
| Extract entities from unstructured text | Flexible to schema variation | Classic NER models need annotation per domain |
| Answer questions about a document (RAG) | Combines retrieval + reasoning | Rules don't reason; classic ML doesn't generalize here |
Cost reminder: Every LLM call costs money and adds latency. Never use an LLM for tasks that code or a simple ML model can solve.
🔬 Internals
Traditional software follows deterministic control flow: the same input always produces the same output via explicit conditional branches. ML models encode statistical patterns into weight matrices; inference is a series of matrix multiplications that approximate a learned function. LLMs combine both: they are probabilistic sequence models that can be steered toward determinism via temperature=0 and constrained decoding.
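The temperature point can be illustrated with a toy next-token distribution (pure Python, no real model): dividing logits by a temperature before the softmax flattens or sharpens the distribution, and greedy argmax, the temperature → 0 limit, is fully deterministic.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature = sharper distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"Paris": 3.0, "London": 2.0, "Berlin": 1.0}  # toy next-token scores

probs_hot = softmax(list(logits.values()), temperature=2.0)   # flat: sampling is random
probs_cold = softmax(list(logits.values()), temperature=0.1)  # sharp: near-greedy
greedy = max(logits, key=logits.get)                          # temperature -> 0 limit

print(greedy)                   # Paris: always the same token, every run
print(round(probs_cold[0], 3))  # ~1.0: cold sampling is effectively deterministic
```

This is why temperature=0 is the standard setting when an LLM call must behave like a function rather than a writer.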
⚡ Performance Analysis
A GPT-4-class API call averages 1–3 seconds for a 500-token response at standard load: 100–1000× slower than a local function call. Fine-tuned smaller models (7B) can match GPT-4 on narrow tasks at 10–20× lower cost and <200 ms latency on GPU. The break-even point for building a custom ML pipeline versus using an LLM API is typically ~10M requests/month.
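The break-even claim can be sanity-checked with a back-of-envelope calculation. All constants below are illustrative assumptions, not vendor pricing: a per-request API cost and a fixed monthly cost for a self-hosted pipeline with negligible marginal cost.

```python
# Back-of-envelope: per-request API cost vs fixed self-hosted pipeline cost.
# All figures are illustrative assumptions, not real vendor pricing.
API_COST_PER_REQUEST = 0.002        # e.g. ~500 tokens at ~$0.004 per 1K tokens
PIPELINE_FIXED_MONTHLY = 20_000.0   # GPU serving + amortized MLOps effort
PIPELINE_COST_PER_REQUEST = 0.0     # marginal cost assumed negligible

def monthly_cost_api(requests: int) -> float:
    return requests * API_COST_PER_REQUEST

def monthly_cost_pipeline(requests: int) -> float:
    return PIPELINE_FIXED_MONTHLY + requests * PIPELINE_COST_PER_REQUEST

break_even = PIPELINE_FIXED_MONTHLY / API_COST_PER_REQUEST
print(f"{break_even:,.0f} requests/month")  # 10,000,000 requests/month
```

Under these assumptions the crossover lands at 10M requests/month, consistent with the ~10M figure above; your own numbers will shift it, but the shape (linear API cost vs flat pipeline cost) is the same.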
🤖 Agents: For Multi-Step Goals That Require External Tools
Use agents when completing the task requires:
- Multiple actions (not just one generation)
- Calling external APIs or tools (not just text transformation)
- Adapting plans based on intermediate results
| Task | Agent needed? | Why |
| --- | --- | --- |
| "Summarize this document" | No | Single LLM call |
| "Book the cheapest flight to Paris next Tuesday" | Yes | Needs search API, calendar check, payment API |
| "Send a weekly report email" | No | Code + cron job |
| "Debug this CI failure and open a PR with the fix" | Yes | Needs GitHub API, test runner, code editor |
| "What's 2 + 2?" | No | Code |
Red flag: If you're describing your agent as "it just generates text and returns it," you needed a plain LLM call, not an agent.
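To make the distinction concrete, here is a minimal ReAct-style agent skeleton for the flight-booking row above, with the tools mocked as local functions and the "LLM planner" replaced by a stub policy so the control flow is visible. Everything here (function names, prices, destinations) is invented for illustration; the defining agent features are the tool calls, the observe-then-replan loop, and the hard iteration cap — not the text generation.

```python
# Minimal agent skeleton: plan -> act -> observe, with a hard iteration cap.

def search_flights(dest, day):
    """Mocked flight-search API."""
    return [{"dest": dest, "day": day, "price": p} for p in (420, 310, 505)]

def book_flight(flight):
    """Mocked payment/booking API."""
    return f"booked {flight['dest']} on {flight['day']} for ${flight['price']}"

def policy(state):
    """Stub for the LLM planner: pick the next action from observations so far."""
    if "flights" not in state:
        return ("search_flights", {"dest": "Paris", "day": "Tuesday"})
    cheapest = min(state["flights"], key=lambda f: f["price"])
    return ("book_flight", {"flight": cheapest})

def run_agent(max_iters=5):
    state = {}
    for _ in range(max_iters):          # hard cap: the cost circuit-breaker
        action, args = policy(state)
        if action == "search_flights":
            state["flights"] = search_flights(**args)   # act, then observe
        elif action == "book_flight":
            return book_flight(**args)                  # goal reached
    raise RuntimeError("iteration cap hit without completing the task")

print(run_agent())  # booked Paris on Tuesday for $310
```

If the loop body collapses to a single generate-and-return, the red flag above applies: you wanted a plain LLM call.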
⚖️ Trade-offs & Failure Modes: When Each Paradigm Breaks Down
flowchart TD
Start([New requirement]) --> Q1{Is the output deterministic?}
Q1 -- Yes --> Code[Write code: if/else, math, regex]
Q1 -- No --> Q2{Is input structured data?}
Q2 -- Yes --> ML[Traditional ML: XGBoost, sklearn]
Q2 -- No --> Q3{Is a single generation enough?}
Q3 -- Yes --> LLM[LLM call: OpenAI, Anthropic, Gemini]
Q3 -- No --> Agent[AI agent: ReAct + tools]
This flowchart maps any new requirement through a four-question decision tree that routes it to the right paradigm. Starting from whether the output is deterministic, it forks through questions about data structure and generation complexity, landing at code, traditional ML, a plain LLM call, or an agent. Use it as a quick sanity check before reaching for AI: if the path resolves to "Write Code," no machine learning or language model is justified.
| Paradigm | Latency | Cost | Predictability | Best for |
| --- | --- | --- | --- | --- |
| Code | Microseconds | Free | 100% deterministic | Rules, math, format |
| ML | Milliseconds | Low inference cost | High with good data | Structured predictions |
| LLM | 500 ms–3 s | $0.001–$0.06/1K tokens | Variable (hallucination risk) | Unstructured text |
| Agent | Seconds–minutes | Multiplied by iterations | Low without guardrails | Multi-step tool tasks |
LLM vs Code vs ML Decision
flowchart TD
TK[Task Type] --> UN{Unstructured text?}
UN -- Yes --> LM[Use LLM]
UN -- No --> LB{Labeled data?}
LB -- Yes --> ML[Train ML model]
LB -- No --> RU{Rule-based ok?}
RU -- Yes --> CD[Write code rules]
RU -- No --> FT[Fine-tune LLM]
This second decision flowchart narrows the choice specifically between LLM, ML, and code once you know the task involves language or prediction. If the input is unstructured text, use an LLM; if you have labeled structured data, train an ML model; if rules will cover every case, write code; only fine-tune an LLM when off-the-shelf models fall short on domain-specific vocabulary or reasoning. The four leaves represent the minimum-viable path, and the rule is to start from the leftmost applicable branch: never skip to fine-tuning before trying a base model.
Real-World Applications: Examples Across Industries
The four paradigms rarely appear alone in a production system; real products layer them across different sub-tasks. Here is how the split looks in three common domains.
E-Commerce
- Code → checkout math: total = (price × qty) - discount + tax. Deterministic, unit-testable, zero AI needed.
- Traditional ML → product recommendations: a collaborative-filtering model ranks items based on purchase history and ratings.
- LLM → product description drafting: turns a JSON of attributes (name, category, specs) into polished marketing copy.
- Agent → order support bot: a user asks "Where is my order?" The agent queries the orders API, calls the shipping API, and drafts a personalised reply.
Fintech
- Code → interest calculation and fee rounding (must be auditable to the cent).
- Traditional ML → credit-risk scoring from structured applicant features (income, debt ratio, history).
- LLM → contract clause extraction from uploaded PDF documents.
- Agent → automated reconciliation: pulls daily transaction exports, flags discrepancies, and opens a Jira ticket for human review.
Healthcare and Devtools
| Domain | Code | Traditional ML | LLM | Agent |
| --- | --- | --- | --- | --- |
| Healthcare | Dosage formula | Readmission risk model | Clinical note summarization | Prior-auth agent that fills forms and emails payers |
| Devtools | Test runner logic | Bug-priority prediction | Docstring generation | CI debug bot: reads logs, writes a fix, opens a PR |
No domain is owned by a single paradigm. The right tool depends on the specific sub-task, not the industry.
🧪 Practical Decision Checklist
Before adding any AI component, run through these five questions. It takes under two minutes and prevents months of over-engineering.
1. Could a developer write a rule that covers every case? → If yes, write code. No ML, no LLM.
2. Is the input structured (rows, columns, known schema) and do you have labeled training data? → If yes, explore gradient boosting or a simple classifier before reaching for an LLM.
3. Does the task require understanding natural language input or producing fluent text output? → If yes, an LLM call is likely appropriate. Check latency budget and token cost first.
4. Does completing the task require calling external APIs, reading live data, or executing multiple dependent steps? → If yes, consider an agent. If the task is still a single text transformation, use a plain LLM call.
5. Does the output need to be 100% reproducible and auditable? → If yes, stay in Code or ML territory. LLMs and agents are non-deterministic by default.
Red flags for AI over-engineering:
- "We just need to validate the zip code format." β Write a regex.
- "It needs to understand intent for a dropdown with 4 options." β Write the dropdown with 4 options.
- "The agent just calls one API and returns the result." β That is an LLM call with a function call, not an agent.
- Latency budget is under 100 ms. β Neither LLMs nor agents can reliably meet this; use code or a cached ML model.
AI Tool Selection Map
flowchart LR
UC[Use Case] --> GEN[General QA or chat]
UC --> DOM[Domain-specific task]
UC --> STR[Structured prediction]
GEN --> LLM[LLM API call]
DOM --> FTLLM[Fine-tuned LLM]
STR --> SML[Traditional ML model]
This diagram maps three high-level use-case categories (general Q&A or chat, domain-specific tasks, and structured prediction) to the appropriate tool. General Q&A flows directly to a standard LLM API call; domain-specific tasks where the base model lacks specialized vocabulary or reasoning justify a fine-tuned LLM; structured prediction (classification, regression, ranking from tabular data) should use a traditional ML model. The takeaway is that fine-tuning is never the default starting point; it is only justified when a domain gap is confirmed with a base-model evaluation.
🛠️ HuggingFace Transformers & scikit-learn: LLM Inference vs Traditional ML Pipeline Side by Side
HuggingFace Transformers is the go-to Python library for loading and running open-source LLMs (GPT-2, LLaMA, Mistral). Its pipeline() abstraction wraps tokenization, model inference, and decoding into a single callable, making it easy to benchmark an LLM approach alongside a traditional ML baseline before committing either to production.
The snippet below applies the decision framework from this post to a real task (intent classification) solved two ways: a scikit-learn TF-IDF + Logistic Regression classifier (Traditional ML, sub-millisecond, trained on your labeled data) and a zero-shot HuggingFace pipeline (LLM, no training data required, but slower and heavier).
# pip install scikit-learn transformers torch
# ── Traditional ML: scikit-learn text classifier ─────────────────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
train_texts = [
    "cancel my order",
    "where is my package",
    "I want a refund",
    "how do I return this item",  # gives the refund class vocabulary overlap with the test phrase
]
train_labels = ["cancel", "track", "refund", "refund"]
ml_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression()),
])
ml_clf.fit(train_texts, train_labels)
print(ml_clf.predict(["I'd like to return this item"]))  # → ['refund']
# ── LLM: zero-shot classification via HuggingFace Transformers ───────────────
from transformers import pipeline
llm_clf = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # runs locally, no API key required
)
result = llm_clf(
    "I'd like to return this item",
    candidate_labels=["cancel", "track", "refund"],
)
print(result["labels"][0])  # → 'refund'
| Approach | Training data needed | Inference latency | Best when |
| --- | --- | --- | --- |
| scikit-learn (Traditional ML) | Yes, labeled examples | < 1 ms | Stable label set, data already available |
| HuggingFace zero-shot (LLM) | No | 200–800 ms (CPU) | Low-data phase or evolving label schema |
The decision framework maps directly here: switch to the ML classifier once you have 50+ labeled examples, and use zero-shot only while your label schema is still evolving; it is the LLM rung on the ladder, justified when no labeled data exists.
For a full deep-dive on HuggingFace Transformers and scikit-learn pipelines, a dedicated follow-up post is planned.
Key Lessons from the Field
These are the hard-won lessons from teams that have shipped all four paradigms in production.
1. The biggest mistake is using LLMs for everything.
Every project that over-indexes on AI ends up rewriting core logic in plain code six months later. LLM calls replacing if/else blocks are the most expensive technical debt in modern software.
2. Agents cost far more than a single LLM call. A ReAct loop that makes five tool calls before returning an answer can cost 10–50× what a direct LLM call costs in tokens and latency. Add hard iteration caps and cost circuit-breakers before going to production.
3. Hybrid approaches often outperform pure-AI solutions. A fraud system that uses XGBoost to score 95% of transactions (fast, cheap, auditable) and escalates borderline cases to an LLM for explanation frequently beats a pure-LLM pipeline on cost, latency, and explainability.
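The hybrid pattern in lesson 3 can be sketched as a confidence-gated router. The scoring function, thresholds, and escalation stub below are all illustrative stand-ins, not a real fraud model.

```python
def fast_model_score(txn):
    """Stand-in for a fast XGBoost-style fraud score in [0, 1]."""
    return min(1.0, txn["amount"] / 1000.0)

def llm_explain(txn):
    """Stub for the expensive LLM escalation path (explanation for a human reviewer)."""
    return f"escalated: review ${txn['amount']} at {txn['merchant']}"

def route(txn, low=0.2, high=0.8):
    score = fast_model_score(txn)
    if score < low:
        return "approve"        # bulk of traffic: fast, cheap, auditable
    if score > high:
        return "block"          # clearly fraudulent: no LLM needed either
    return llm_explain(txn)     # borderline only: pay for the LLM here

print(route({"amount": 50, "merchant": "grocer"}))        # approve
print(route({"amount": 950, "merchant": "luxury"}))       # block
print(route({"amount": 500, "merchant": "electronics"}))  # escalated: review $500 ...
```

The design choice is that the expensive, non-deterministic component only ever sees the narrow band of cases the cheap model cannot decide, which is why the hybrid wins on cost, latency, and explainability at once.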
4. Testing strategies differ across paradigms.
- Code: Unit tests with full coverage.
- ML: Hold-out test set, confusion matrix, data-drift monitoring.
- LLM: Prompt regression tests (golden prompt–output pairs), LLM-as-judge evaluations.
- Agent: End-to-end integration tests with tool mocking, timeout validation, and cost-per-run tracking.
5. Start simple and graduate up on evidence. Deploy the rule-based solution first. Measure where it fails. Introduce ML or an LLM only for the specific failure mode you can prove with data. This approach is faster to ship, easier to debug, and cheaper to run.
TLDR: Summary & Key Takeaways
- Code for deterministic logic you can state as explicit rules (banking math, validation, parsing).
- Traditional ML for structured data with learnable patterns (fraud, churn, recs).
- LLMs for unstructured text tasks: summarization, classification, generation.
- Agents only when the task is multi-step and requires external tool calls.
- Cost and latency scale: Code < ML < LLM < Agent. Use the cheapest tool that solves the problem.
Written by
Abstract Algorithms
@abstractalgorithms