How AI Coding Agents Work: Models, Context, Sessions, and Memory
Inside the architecture of Claude Code, GitHub Copilot, and Cursor: how agents use LLMs, manage context windows, handle sessions, and tier memory.
Abstract Algorithms
TLDR: An AI coding agent is an LLM stapled to a tool registry, wrapped in an orchestration loop that painstakingly rebuilds state on every single API call, because the model itself is completely stateless. Understanding the context window, the ReAct loop, and the four-tier memory stack explains every hallucination, forgotten variable, and mid-refactor context dropout you've ever blamed on "the AI."
The 500-Line Refactor That Lost Its Mind
You're 45 minutes into a refactor with Claude Code. You've established the ground rules: use the PaymentProcessor interface, never call LegacyBillingClient directly, all new services must implement Retryable. The agent acknowledged all of this, generated a clean StripePaymentProcessor, and was halfway through PayPalPaymentProcessor.
Then the context window filled up.
The next response confidently imports LegacyBillingClient, the exact class you flagged as forbidden. It ignores the Retryable constraint. It doesn't remember the interface contract at all. The agent hasn't gone dumb; it has literally never seen your earlier instructions. Those tokens were evicted when the window overflowed.
This is not a bug in the product. It is the fundamental architecture of every AI coding agent working exactly as designed. To work around it, and to use these tools well, you need to understand what is actually happening under the hood.
LLM + Tool Use + Orchestration Loop = AI Coding Agent
A plain language model like GPT-4 or Claude 3 Sonnet is a stateless text transformer: tokens go in, tokens come out. It has no persistent memory, no ability to read files, and no awareness of your codebase beyond what you paste into the prompt. It cannot run your tests or check if the method it just invented actually exists.
An AI coding agent adds three things on top of that:
- A tool registry: file read/write, terminal execution, semantic search, linter invocation. The agent can take actions, not just generate text.
- An orchestration loop: logic that runs repeatedly, routing LLM output to tools, injecting tool results back into context, and deciding when the goal is complete.
- A context manager: code that carefully assembles what the model should "see" on each API call, because the window is finite and every token costs both money and latency.
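To make the tool-registry layer concrete, here is a deliberately minimal sketch: a dictionary mapping tool names to plain functions, with a single dispatch entry point. Every name here is illustrative, not any vendor's actual API.

```python
from pathlib import Path
import subprocess

# Illustrative tool registry: tool name -> plain Python callable.
TOOL_REGISTRY = {
    "read_file": lambda path: Path(path).read_text(),
    "write_file": lambda path, content: Path(path).write_text(content),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def execute_tool(name: str, **kwargs):
    """Invoke a tool by name with the arguments the model supplied."""
    if name not in TOOL_REGISTRY:
        # Unknown tools become error strings fed back to the model as observations.
        return f"ERROR: unknown tool '{name}'"
    return TOOL_REGISTRY[name](**kwargs)
```

The orchestration loop calls `execute_tool` whenever the model emits a tool call, and the context manager decides how much of the returned string actually makes it into the next request.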
Key players in the space:
| Agent | Underlying Model(s) | Primary Integration |
|---|---|---|
| Claude Code | Claude 3.5 Sonnet / Claude 3 Opus | Terminal, CLI, direct codebase access |
| GitHub Copilot | GPT-4o / OpenAI o1-mini | VS Code, JetBrains, vim plugins |
| Cursor | GPT-4o, Claude 3.5 Sonnet (user choice) | Custom fork of VS Code |
| Aider | GPT-4, Claude, local Ollama models | Terminal, Git-native workflow |
| Devin / SWE-agent | GPT-4 Turbo / Claude | Sandboxed browser + terminal |
| Cody (Sourcegraph) | Claude 3, Mixtral | Enterprise codebase search + edit |
The differences between these agents are mostly about how they manage the three layers above โ not about the underlying model capabilities.
Tokens, Turns, and Statelessness: The Baseline Every Developer Needs
Before diving into agent architecture, three foundational concepts unlock everything that follows.
Tokens: the currency of LLMs. LLMs do not operate on words; they operate on tokens, subword chunks produced by a tokenizer. The word "refactoring" may be a single token in one tokenizer and several in another. The code snippet getPaymentProcessor() might tokenize as 4-5 tokens. As a rough rule, 1,000 tokens ≈ 750 English words ≈ 50-80 lines of code. Every pricing tier, context limit, and rate limit in the LLM ecosystem is denominated in tokens.
The context window: the model's entire working memory. An LLM can only "see" what fits inside its current context window. GPT-4o's 128,000-token window sounds large until you fill it with a system prompt, conversation history, and several source files. Anything outside the window does not exist to the model; there is no "background awareness" of earlier turns.
Statelessness: the LLM has no memory across calls. This is the single most counterintuitive thing about coding agents. Each API call is completely independent. When you send a follow-up message, you are not "continuing a conversation"; you are sending a brand new request that happens to include the text of all previous turns. The agent's orchestration layer fabricates the illusion of memory by reconstructing context from its own storage on every call. When context fills up and older turns must be dropped, the session genuinely loses that information.
These three facts (token-based budgeting, finite context windows, and stateless API calls) are the root cause of every confusing agent behavior: forgotten constraints, stale edits, and mid-session context loss.
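The statelessness point is easy to demonstrate with a toy client, where `call_llm` stands in for any real chat-completion API:

```python
# The "conversation" lives entirely client-side: every call re-sends the
# whole transcript, because the model retains nothing between requests.
history = []

def send(user_message, call_llm):
    history.append({"role": "user", "content": user_message})
    reply = call_llm(messages=list(history))  # the model sees ONLY this list
    history.append({"role": "assistant", "content": reply})
    return reply

# A fake model that proves the point: it "remembers" earlier turns only
# because they are literally present in the messages it receives.
def fake_llm(messages):
    return f"I can see {len(messages)} message(s)."
```

Drop a turn from `history` and the model has no way to know it ever existed, which is exactly what happens on context overflow.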
High-Level Architecture of an AI Coding Agent
Every AI coding agent, regardless of vendor, shares the same skeleton. The diagram below shows the primary data flows: user input flows through the orchestrator into the context builder, the assembled context gets sent to the LLM, the response is parsed for tool calls or final output, and tool results are fed back into the loop.
```mermaid
graph TD
    User[User Request] --> Orch[Agent Orchestrator]
    Orch --> CB[Context Builder]
    CB --> LLMAPI[LLM API - GPT-4o or Claude]
    LLMAPI --> RP[Response Parser]
    RP -->|Tool call detected| TE[Tool Executor]
    TE -->|Tool result| Orch
    RP -->|Final answer| Out[Output to User]
    subgraph MemoryTiers[Memory Tiers]
        ICM[In-Context Window - ephemeral]
        SC[Session Cache - Redis]
        VS[Vector Store - Chroma - Pinecone]
        PW[Parametric - Model Weights]
    end
    subgraph ToolRegistry[Tool Registry]
        FS[File Read and Write]
        Term[Terminal and Shell]
        Srch[Semantic Search]
        Lint[Linter and Test Runner]
    end
    CB --> MemoryTiers
    TE --> ToolRegistry
```
The orchestrator is the control loop: it decides whether to call a tool, call the LLM, or return a final answer. The context builder is responsible for fitting the right information into the model's finite attention window. The tool executor wraps side-effectful actions (writing files, running commands) in a layer the orchestrator can supervise. Memory tiers are discussed in detail in a later section.
The Agent Loop: How Claude Code Decides What to Do Next
The behavioral engine at the heart of every coding agent is the ReAct pattern (Reason + Act), introduced by Yao et al. in 2022. Rather than doing a single LLM call and returning the answer, the agent cycles through three phases until the goal is satisfied.
```mermaid
flowchart TD
    Start([Task Received]) --> Reason[Reason about current state]
    Reason --> Decide{Tool call needed?}
    Decide -->|Yes| Pick[Select tool and arguments]
    Pick --> Exec[Execute tool]
    Exec --> Observe[Observe tool result]
    Observe --> Inject[Inject result into context]
    Inject --> Reason
    Decide -->|No - answer ready| Generate[Generate final response]
    Generate --> Done([Return to User])
```
Here is what each phase looks like in practice when you ask Cursor to "add retry logic to the uploadFile method":
- Reason: "I need to see the current implementation of `uploadFile` before I can add retry logic. I should read the file."
- Action: Call `read_file("src/storage/FileUploader.java")`.
- Observation: The file content appears in the context: the current method, its imports, the class signature.
- Reason: "The method uses `HttpClient`. I should check whether a `RetryPolicy` abstraction already exists in this codebase."
- Action: Call `search_codebase("RetryPolicy")`.
- Observation: `RetryPolicy.java` exists in `src/core/`.
- Reason: "Good. I can implement the retry logic using the existing `RetryPolicy`. I have enough context to generate the edit."
- Generate: Emit the diff.
Each tool call is a full round-trip: the orchestrator sends a new LLM request containing the accumulated context, the model emits a structured tool-call JSON block, the orchestrator executes the tool, injects the result, and calls the LLM again. A simple "add retry logic" request might require 3-6 LLM API calls before producing a final answer.
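A stripped-down version of that loop, including the iteration guard that production agents add, might look like this. The shapes are hypothetical: real agents parse structured tool-call blocks from the API response rather than plain dicts.

```python
MAX_ITERS = 10  # guard against infinite tool loops

def agent_loop(task, llm, tools):
    """Reason/Act/Observe until the model produces a final answer."""
    context = [{"role": "user", "content": task}]
    for _ in range(MAX_ITERS):
        step = llm(context)                       # Reason: one full LLM round-trip
        if "answer" in step:                      # no tool needed: done
            return step["answer"]
        result = tools[step["tool"]](**step["args"])               # Act
        context.append({"role": "tool", "content": str(result)})  # Observe + inject
    return "ERROR: iteration limit reached"
```

Note that the context list grows by one observation per iteration; this is exactly the accumulation that eventually pressures the context window.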
How Coding Agents Invoke LLMs: System Prompts, Tool Schemas, and Streaming
The API call anatomy
Every agent-to-LLM interaction follows the same structure, regardless of whether it targets OpenAI or Anthropic:
```
system:   [repository context, style guide, tool schemas, agent persona]
messages: [conversation history: user turns and assistant turns]
tools:    [JSON schema of available tools]
user:     [current user message]
```
The system prompt does the heavy lifting. In Cursor, it contains:
- The contents of `.cursorrules` (project-specific instructions)
- The currently open file and a few related files
- The schema definitions for every tool the agent can call (read_file, edit_file, run_terminal, search)
- Language and style constraints
A typical system prompt in a coding agent runs 2,000-4,000 tokens before any conversation starts. This is the "fixed cost" that every request pays before a single line of user input is processed.
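A sketch of how an orchestrator might assemble that request body; field names here are generic, and real OpenAI and Anthropic payloads differ in detail:

```python
def build_request(system_prompt, project_rules, tool_schemas, history, user_msg):
    """Assemble the per-call payload; the system block is the fixed cost."""
    return {
        "system": system_prompt + "\n\n" + project_rules,  # paid on every call
        "tools": tool_schemas,           # JSON schemas of available tools
        "messages": history + [{"role": "user", "content": user_msg}],
    }
```

Because the system block and tool schemas are resent verbatim on every call, providers now offer prompt caching specifically to discount this repeated prefix.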
Function calling / tool use
Modern LLMs expose a tools parameter that lets the orchestrator declare available actions as JSON schemas. The model can then emit a structured tool_use block instead of free text, signaling that it wants to invoke a specific function with specific arguments.
The schema for a read_file tool looks like this (this is a config/protocol definition, not application code):
```json
{
  "name": "read_file",
  "description": "Read the contents of a file at the given path",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Relative path to the file from the repository root"
      }
    },
    "required": ["path"]
  }
}
```
When Claude 3 emits a tool_use block for this schema, the orchestrator parses it, calls the actual file system, and returns the contents as a tool_result message. The model never touches the file system directly; it only emits intent.
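A minimal dispatcher for such a block might validate the model's arguments against the declared schema before executing anything. The dict shapes are simplified stand-ins for the real API's block types:

```python
def validate_and_dispatch(tool_use, schema, registry):
    """Check required args against the input_schema, then execute the tool."""
    missing = [k for k in schema["input_schema"]["required"]
               if k not in tool_use["input"]]
    if missing:
        # Malformed calls go back to the model as error observations,
        # giving it a chance to self-correct on the next turn.
        return {"type": "tool_result", "is_error": True,
                "content": f"missing required argument(s): {missing}"}
    output = registry[schema["name"]](**tool_use["input"])
    return {"type": "tool_result", "content": str(output)}
```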
Why agents stream tokens
Coding agents stream LLM responses token-by-token for two practical reasons. First, large diffs can take 20-40 seconds to generate, and streaming lets the UI show progress rather than a blank screen. Second, and more importantly, the orchestrator needs to detect when the model starts emitting a tool-call block so it can stop streaming, extract the structured JSON, and execute the tool, rather than waiting for the full response.
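The detection logic can be sketched as a filter over the token stream. The `<tool_use>` marker is invented for illustration; real APIs emit typed stream events rather than an in-band text marker.

```python
def consume_stream(token_stream):
    """Forward tokens to the UI until a tool-call marker appears."""
    visible, buffered, in_tool_call = [], [], False
    for token in token_stream:
        if "<tool_use>" in token:
            in_tool_call = True  # stop showing tokens; start capturing JSON
        (buffered if in_tool_call else visible).append(token)
    return "".join(visible), "".join(buffered)
```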
Context Window Management: The Core Constraint Driving Agent Design
This is the section that explains the refactor failure from the opening story. Everything that follows describes the single biggest engineering challenge in building a coding agent.
The Internals of Context Prioritization and Window Assembly
What is a context window? A context window is the maximum number of tokens (roughly 0.75 English words per token) that an LLM can attend to in a single API call. Anything outside the window is completely invisible to the model.
Current token budgets by model:
| Model | Context Window | Practical Code Limit |
|---|---|---|
| GPT-4o | 128,000 tokens | ~90,000 tokens of actual code |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 tokens |
| Claude 3 Opus | 200,000 tokens | ~150,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens | ~700,000 tokens |
| GPT-4 (original) | 8,192 tokens | ~5,000 tokens |
A token budget of 128K sounds enormous until you account for what must fit inside it.
What consumes the context window?
On a real coding task, the context is subdivided roughly like this:
| Content | Typical Token Cost |
|---|---|
| System prompt (agent persona, tools, style guide) | 2,000-4,000 |
| `.cursorrules` / project config | 500-1,500 |
| Open files and related files | 5,000-30,000 |
| Conversation history (all prior turns) | 10,000-60,000 |
| Tool results injected into context | 2,000-20,000 |
| Current user message | 50-500 |
For a long refactor session, the conversation history alone can consume 60,000 tokens. Add two or three large source files and you've used 100K of GPT-4o's 128K budget, and the session is only 30 minutes old.
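A budget check before each call can be approximated with the common 4-characters-per-token heuristic; a real orchestrator would use the model's actual tokenizer, and the constants here are illustrative:

```python
WINDOW = 128_000             # GPT-4o-class context window
RESERVED_FOR_OUTPUT = 4_000  # headroom for the model's reply

def estimate_tokens(text):
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def fits(system_prompt, history, files):
    """True if the assembled context still leaves room for a response."""
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(t) for t in history + files)
    return used <= WINDOW - RESERVED_FOR_OUTPUT
```

When `fits` returns False, the orchestrator has to reach for one of the truncation strategies described below rather than silently sending an over-long request.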
RAG: Retrieval-Augmented Generation for codebase context
A naive approach would be to dump the entire codebase into the context. That fails for any repository larger than a few thousand lines. The solution is RAG: Retrieval-Augmented Generation.
Here is how Cody, Cursor, and Claude Code all approach codebase context:
- Embedding: When you open a project, the agent indexes every source file by converting it into a dense vector embedding using a smaller embedding model (e.g., OpenAI `text-embedding-3-small` or a local model).
- Storage: Embeddings are stored in a local vector database (Chroma, LanceDB, or an in-process FAISS index).
- Query: When you submit a request, the agent embeds your query and runs a nearest-neighbor search over the codebase embeddings to retrieve the top-k most semantically similar code chunks.
- Injection: Those chunks are inserted into the context alongside your message.
The result: instead of loading 500 files, the agent loads the 10-20 files most relevant to your specific question. This is how Copilot can answer questions about a 200,000-line monorepo without blowing the context budget on every request.
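The retrieval step reduces to nearest-neighbor search over embedding vectors. A toy top-k over cosine similarity shows the core math; real agents delegate this to Chroma, LanceDB, or FAISS, and the chunk ids here are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=5):
    """index maps chunk id -> embedding; returns the k best-matching ids."""
    return sorted(index, key=lambda cid: cosine(query_vec, index[cid]),
                  reverse=True)[:k]
```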
Context assembly for a code edit request
The following sequence diagram traces exactly how the context is assembled when you ask an agent to edit a specific method. Each step is a deliberate budget decision: the orchestrator is choosing what information the model needs versus what it can afford to include.
```mermaid
sequenceDiagram
    participant U as User
    participant Orch as Orchestrator
    participant CB as Context Builder
    participant VS as Vector Store
    participant LLM as LLM API
    U->>Orch: Fix the retry logic in uploadFile
    Orch->>CB: Assemble context for this request
    CB->>VS: Embed query - retrieve relevant file chunks
    VS-->>CB: Top-5 chunks: FileUploader, RetryPolicy, HttpClient
    CB->>CB: Load open files from editor state
    CB->>CB: Prepend system prompt and tool schemas
    CB->>CB: Append conversation history up to token budget
    Note over CB: Budget check - 128K total - 12K used so far
    CB->>LLM: Send assembled context with user message
    LLM-->>Orch: Thought plus tool call - read_file FileUploader
    Orch->>CB: Inject tool result into context
    CB->>LLM: Re-send with tool result appended
    LLM-->>Orch: Final code edit diff
    Orch->>U: Display diff for approval
```
The "lost in the middle" problem
Research from Liu et al. (2023) demonstrated that LLMs are significantly better at recalling information at the beginning and end of their context window than in the middle. Content placed in the middle of a 128K context receives systematically less attention during generation.
This has practical consequences: if an agent places your critical interface contract in the middle of a long context with many files, the model may effectively "ignore" it even though the tokens are technically present. Well-designed agents place the most important constraints (the user's current message, the file being edited, the key instructions) at the top or bottom of the assembled context for exactly this reason.
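A context builder can account for this by construction, pinning constraints to the top and the live request to the bottom. This is a sketch of the layout idea, not any specific product's implementation:

```python
def order_context(constraints, file_chunks, history, user_message):
    """Exploit primacy/recency: critical items at the edges, bulk in the middle."""
    return (
        [("system", c) for c in constraints]   # top: well-attended
        + [("file", f) for f in file_chunks]   # middle: weakest attention anyway
        + [("history", h) for h in history]
        + [("user", user_message)]             # bottom: well-attended
    )
```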
Sliding window and truncation strategies
When the context window fills up, agents have four strategies:
- Hard truncation: Drop the oldest conversation turns. Simple but causes the "forgot the rules" problem.
- Summarization: Call the LLM to summarize the conversation history into a compressed representation before truncating. More expensive but preserves semantic content.
- Priority-based eviction: Evict tool results and intermediate steps first; preserve user instructions and the original task description. This is what Claude Code does.
- Re-chunking with RAG: Re-run the semantic search to refresh which file chunks are in context. Cursor does this on every request to ensure context relevance.
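The priority-based strategy can be sketched as an eviction pass that drops low-priority turn types first and never touches pinned user instructions. This illustrates the idea only; it is not Claude Code's actual code, and the cost heuristic is the rough 4-chars-per-token approximation:

```python
EVICTION_ORDER = ["tool_result", "assistant", "user"]  # user turns go last

def evict_until_fits(turns, budget, cost=lambda t: len(t["content"]) // 4):
    """Drop the oldest turns of each role, in priority order, until under budget."""
    turns = list(turns)
    for role in EVICTION_ORDER:
        while sum(cost(t) for t in turns) > budget:
            victims = [t for t in turns
                       if t["role"] == role and not t.get("pinned")]
            if not victims:
                break  # nothing evictable at this priority; try the next role
            turns.remove(victims[0])  # oldest matching turn goes first
    return turns
```

Pinning the original task description and hard constraints is what keeps a session coherent even after heavy eviction.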
Performance Analysis of Context Operations
Context management is not free: it adds measurable latency and cost to every agent turn. Understanding these numbers helps set realistic expectations.
| Operation | Latency | Token Cost |
|---|---|---|
| Embedding a query (RAG) | 50-200ms | ~50 tokens |
| Vector search (top-k retrieval) | 10-100ms | 0 tokens |
| Context assembly (in-memory) | 5-30ms | 0 tokens |
| Summarization of 20K-token history | 3-8s extra | 500-2,000 tokens |
| LLM API call (first token) | 500ms-3s | N/A |
The RAG retrieval path (embedding + vector search + inject) adds ~200-400ms of overhead per request but saves 10,000-50,000 tokens of context budget that would otherwise be consumed by loading all files. At $0.01 per 1K input tokens, a 40,000-token reduction per request saves $0.40 per call, which is significant at scale.
The summarization path trades increased latency (3-8 extra seconds) for session continuity. Teams using Aider on long refactoring sessions often enable --max-chat-history-tokens to trigger automatic summarization before the context window forces hard truncation, accepting the latency cost to preserve the session's logical thread.
Session Management: Keeping State Alive Between Turns
The LLM has no memory. Zero. Each API call is independent; the model has no idea what happened in the previous turn unless you include that history in the current request. The session is entirely an agent-side construct.
Session lifecycle
```
START -> ACTIVE -> IDLE -> EXPIRED
  |        |        |        |
  |        |        |        +-- History cleared; must restart task
  |        |        +-- No activity for N minutes; state serialized to cache
  |        +-- User is actively sending messages; full state in memory
  +-- Empty context; system prompt + empty history
```
When you close VS Code with Copilot Chat open, the session state (your conversation history) is serialized to a local cache file. When you reopen VS Code, the agent deserializes that history and injects it back into the next LLM request. This is why Copilot "remembers" your last conversation: it is replaying the entire history, not because the model retained it.
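The mechanism reduces to plain serialization. A sketch follows; the file name and format are invented for illustration, as Copilot's actual cache layout is internal:

```python
import json
from pathlib import Path

def save_session(history, path=".agent_session.json"):
    """Persist the conversation log when the editor closes."""
    Path(path).write_text(json.dumps(history))

def load_session(path=".agent_session.json"):
    """Restore it on reopen; an empty list means a fresh session."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []
```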
Turn-by-turn serialization
Each agent maintains a conversation log that grows with every turn:
```
Turn 1: User: "Use PaymentProcessor interface everywhere"
        Assistant: "Understood. I'll ensure all new payment services implement PaymentProcessor."
Turn 2: User: "Now write StripePaymentProcessor"
        Assistant: [tool calls + code] -> StripePaymentProcessor.java
Turn 3: User: "Now write PayPalPaymentProcessor"
...
```
This entire log is sent in every LLM request. By turn 20, the log itself may be the largest single item in the context. Agents like Aider include a --max-chat-history-tokens flag that triggers summarization when the history exceeds a threshold.
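The trigger logic, in the spirit of that flag, can be sketched as follows; `summarize` stands in for an LLM call that compresses the oldest turns, and the threshold and token estimate are illustrative:

```python
MAX_HISTORY_TOKENS = 8_000  # threshold before compressing

def maybe_compress(history, summarize, est=lambda t: len(t) // 4):
    """Replace the oldest half of an oversized history with a summary."""
    if sum(est(turn) for turn in history) <= MAX_HISTORY_TOKENS:
        return history
    keep_from = len(history) // 2             # most recent half stays verbatim
    summary = summarize(history[:keep_from])  # oldest half gets compressed
    return [f"[summary of earlier turns] {summary}"] + history[keep_from:]
```

The trade-off is exactly the one described above: an extra LLM call now, in exchange for a session that keeps its logical thread instead of silently losing its oldest turns.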
Multi-agent sessions: how Devin delegates
More advanced agents like Devin and OpenAI's Deep Research go beyond a single agent loop. They spawn sub-agents: separate orchestrator instances with their own context windows. A supervisor agent breaks the task into subtasks, each assigned to a sub-agent. Sub-agents return results that the supervisor injects into its own context as tool results.
Shared context between sub-agents is mediated through the external short-term cache (Redis or similar), not through a shared context window. Sub-agent A's conversation history is never directly visible to sub-agent B; only the summary output is passed.
Memory Architecture: The Four Tiers Every Agent Navigates
Coding agents operate across four fundamentally different types of memory, each with its own access speed, capacity, and persistence characteristics.
```mermaid
graph TD
    subgraph T1[Tier 1 - In-Context Memory]
        ICM[Active context window - fast - ephemeral - 128K to 200K tokens]
    end
    subgraph T2[Tier 2 - External Short-Term Cache]
        STC[Redis - session state - tool outputs - preferences - minutes to hours]
    end
    subgraph T3[Tier 3 - Long-Term Vector Store]
        LTV[Chroma or Pinecone - codebase embeddings - past sessions - days to permanent]
    end
    subgraph T4[Tier 4 - Parametric Memory]
        PW[Model weights - pre-training and fine-tuning - static until retrain]
    end
    T1 -->|evicted when window fills| T2
    T2 -->|persisted after session expires| T3
    T3 -->|retrieved via semantic search| T1
    T4 -->|always available to model| T1
```
Tier 1 - In-context memory (ephemeral): The current context window. Everything in the active LLM call. This is the fastest (zero retrieval latency) and most expensive (every token costs money). It is completely lost when the session ends or the window overflows. This is the tier that failed in the opening scenario.
Tier 2 - External short-term cache (Redis or equivalent): Session state that survives between individual LLM calls but is scoped to the current working session. Tool outputs (file contents retrieved in turn 3 can be cached so they are not re-retrieved in turn 4), user preferences ("always use tabs, not spaces"), and intermediate task state live here. GitHub Copilot serializes session history to a local JSON cache file in the user's .vscode directory. Aider writes session history to a .aider.chat.history.md file.
Tier 3 - Long-term vector store (Chroma, Pinecone, LanceDB): Persistent embeddings of the codebase, past conversation summaries, and learned user patterns. This tier survives across sessions indefinitely. When Cursor indexes your repository on first open, it is populating Tier 3. When you ask "where is the authentication logic?", Cursor queries Tier 3 with a semantic search, retrieves the most relevant file chunks, and injects them into Tier 1 (the current context window) for the model to reason about.
Tier 4 - Parametric memory (model weights): Everything the model learned during pre-training and fine-tuning. This is static knowledge baked into the model's 70 billion+ parameters. GPT-4's knowledge of Python syntax, common design patterns, the Java standard library (all of that lives in Tier 4). It requires no retrieval and is always available, but it is frozen at the model's training cutoff date and cannot be updated without retraining.
The key insight: the orchestrator's primary job is moving information between these tiers at the right time. RAG moves Tier 3 content into Tier 1. Session serialization moves Tier 1 content into Tier 2. Summarization compresses Tier 1 overflow into Tier 3 for future retrieval.
Tool Use and Code Execution: How Agents Take Real Actions
An agent without tools is a very expensive autocomplete. The tool layer is what separates "AI coding assistant" from "AI coding agent."
Standard tool registry in a modern coding agent:
| Tool | What it does | Used by |
|---|---|---|
| `read_file(path)` | Returns file contents as a string | All agents |
| `write_file(path, content)` | Creates or overwrites a file | Cursor, Aider, Claude Code |
| `edit_file(path, old, new)` | Line-level diff application | Claude Code, Copilot |
| `run_command(cmd)` | Executes a shell command; returns stdout/stderr | Devin, Aider, Claude Code |
| `search_codebase(query)` | Semantic or keyword search over the repo | Cody, Cursor, Copilot |
| `run_tests(suite)` | Runs a test suite; returns pass/fail + output | Devin, SWE-agent |
| `browse(url)` | Fetches a web page | Devin, Claude Code |
Sandboxing: how Devin runs code safely
When an agent calls run_command, it is executing arbitrary code on a machine. Devin and E2B (the sandboxing infrastructure used by several agents) address this by running each agent session inside a per-session container (Docker or Firecracker microVM). The container has:
- A clean copy of the repository
- No access to the host network or host filesystem beyond the project
- A resource budget (CPU, memory, disk I/O)
- An automatic teardown timer to prevent runaway processes
The tool result (stdout, stderr, exit code) is returned to the orchestrator, which injects it into context as a tool_result message. The model sees the output but never directly executes code; it only observes results.
Tool results and context bloat
Every tool result is injected into the context window. A run_tests call that returns 2,000 lines of Maven Surefire output can consume 4,000+ tokens in a single turn. Well-designed agents apply result compression: truncate verbose output to the most salient lines (first N lines, last N lines, lines containing "ERROR" or "FAILED"), and summarize the rest. Claude Code applies a 4K token cap per tool result by default, with an option to raise the limit.
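A result compressor along those lines keeps the head, the tail, and any error lines from the middle; the thresholds here are illustrative, not any product's defaults:

```python
def compress_output(text, head=10, tail=10):
    """Truncate verbose tool output while preserving errors and context."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # already small enough
    # Keep error/failure lines from the middle; drop everything else there.
    errors = [l for l in lines[head:-tail] if "ERROR" in l or "FAILED" in l]
    dropped = len(lines) - head - tail - len(errors)
    middle = errors + [f"... [{dropped} lines truncated] ..."]
    return "\n".join(lines[:head] + middle + lines[-tail:])
```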
Performance Characteristics: Latency, Throughput, and the Bottlenecks
Understanding where time is spent explains why agents feel slow on complex tasks.
Latency breakdown for a single agent turn:
| Step | Typical Duration |
|---|---|
| Context assembly (RAG query + file loads) | 100-500ms |
| LLM API call (time to first token) | 500ms-3s |
| LLM generation (full response, streaming) | 2s-30s |
| Tool execution (file read) | 1-10ms |
| Tool execution (run_command, test suite) | 5s-120s |
| Tool result injection + re-prompt | 200-800ms |
For a task requiring 5 tool calls before the final answer, total latency is often 30-90 seconds. The bottleneck is almost always the LLM generation time, not the orchestration overhead.
Throughput limits: Most OpenAI and Anthropic API tiers enforce rate limits of 500,000-1,000,000 tokens per minute (TPM). A single agent session generating 10,000 tokens per turn at 10 turns per minute approaches these limits quickly in team environments, which is why tools like Cursor have per-user API key configurations rather than sharing a pool.
Failure Modes: Why Agents Hallucinate, Loop, and Forget
Understanding failure modes is the most practical takeaway from studying agent internals.
Context overflow: the silent truncation. When the conversation history grows beyond the context window, older turns are silently dropped. The model never signals this; it simply operates without information it once had. The "forgot the rules" scenario from the opening story is always a context overflow event. Mitigation: keep sessions short and focused; use .cursorrules to inject critical constraints into the system prompt so they are always present.
Hallucinated file paths and method names. The model generates code that references com.example.payment.v2.StripeAdapter, a class that does not exist. Why? The model extrapolated from patterns in its training data. RAG helps but does not fully prevent this: if the relevant file was not retrieved (low semantic similarity to the query), the model will fill the gap from Tier 4 (parametric memory), which may suggest patterns from other codebases. Mitigation: run a linter as a tool in the loop and inject compile errors back as observations, forcing the model to self-correct.
Infinite tool loops. The model calls search_codebase("RetryPolicy"), does not find a satisfying result, and calls it again with slightly different phrasing. Loop. An orchestrator without a turn limit will keep calling tools indefinitely. All production agents implement a max-iterations guard (Claude Code: 10 tool calls per session turn; Aider: 5 retries per edit block).
Stale context: editing a file that has already changed. The agent read FileUploader.java in turn 3 and cached its contents. In turn 8, after the user manually edited the file, the agent still operates on the turn-3 snapshot. The generated diff applies to a file version that no longer exists, producing merge conflicts. Mitigation: re-read files immediately before editing them (Claude Code does this automatically); use file hashes to detect staleness.
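The hash-based check is only a few lines. This is a sketch of the technique, not any agent's actual implementation:

```python
import hashlib
from pathlib import Path

snapshots = {}  # path -> hash recorded when the agent last read the file

def file_hash(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def remember_read(path):
    """Read a file for the context and record its content hash."""
    snapshots[path] = file_hash(path)
    return Path(path).read_text()

def is_stale(path):
    """True if the file changed since the agent last read it."""
    return snapshots.get(path) != file_hash(path)
```

Before emitting an `edit_file` call, the orchestrator checks `is_stale` and re-reads the file if needed, so the diff always targets the version that is actually on disk.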
Cascading hallucination across tool calls. The model generates a tool call with a bad argument (e.g., edit_file("src/storage/FileUploadr.java", ...), a typo in the path). The tool returns a "file not found" error. The model tries to create the file from scratch. Now there are two files: the real one and a hallucinated duplicate. Each subsequent call to the wrong file builds on the hallucination, drifting further from reality.
How GitHub Copilot, Cursor, Devin, and Claude Code Apply These Internals in Production
The internals described above are not theoretical; they manifest as concrete product design decisions in every major coding agent.
GitHub Copilot is the most conservative implementor. It maintains a tight, editor-scoped context window: the currently open file, a few recently viewed files, and a short conversation history. Its inline suggestion mode doesn't use a full ReAct loop; it's a single-shot LLM call with heavily optimized context assembly (open file + neighboring code + editor cursor position). Copilot Chat adds a multi-turn session layer with a local JSON cache for history persistence, but stops short of full tool use beyond code search.
Cursor takes a RAG-first approach. On every project open, it indexes the full codebase into a local vector store (LanceDB). On every request, it runs a semantic search before assembling context, ensuring that the most relevant code, not just what is open, arrives in the model's window. Cursor supports model routing: it sends short inline completions to a fast model (GPT-4o mini or Haiku) and routes complex refactoring requests to the more capable model (Claude 3.5 Sonnet or GPT-4o). The .cursorrules file in the repository root acts as a persistent system prompt injection, surviving context window overflow.
Claude Code (Anthropic's terminal agent) uses the full 200K-token Claude 3.5 Sonnet context window aggressively, loading more files than most agents before triggering RAG. Its tool registry is terminal-native: Bash, Read, Write, Edit, Search. Rather than sandboxing, Claude Code runs directly in your local shell, giving it low latency at the cost of requiring the developer to supervise destructive operations. Its session management relies on the 200K window plus in-session file re-reads to detect stale context.
Devin / SWE-agent represents the most autonomous end of the spectrum. Rather than assisting a human developer in an editor, Devin receives a task (e.g., "fix GitHub issue #1234") and operates independently in a sandboxed cloud VM for hours. Its ReAct loop includes web browsing, test suite execution, and PR creation. Context management becomes critical at this scale: Devin uses hierarchical summarization (summarize sub-tasks before continuing) and writes intermediate plans to a scratchpad file that it reads back into context when needed, a primitive form of Tier 2 (external cache) coordination.
| Agent | Context Strategy | Tool Depth | Session Persistence |
|---|---|---|---|
| Copilot | Editor-scoped, short history | Code search only | Local JSON file |
| Cursor | RAG-first, model routing | File + search | Local LanceDB |
| Claude Code | Large window, aggressive loading | Full terminal | In-session only |
| Aider | Git-native, explicit file selection | Git + terminal | .aider.chat.history.md |
| Devin | Sandboxed VM, hierarchical summary | Browser + terminal + tests | Cloud session store |
Tracing a Complete Agent Session: From Feature Request to Committed Edit
The best way to make the internals concrete is to trace a realistic request end-to-end. This example walks through what happens when a developer asks Cursor: "Add pagination to the GET /users endpoint; use cursor-based pagination, not offset."
Turn 1: Context assembly. The orchestrator fires: embed the query, then retrieve the top-5 file chunks from the vector store (UserController.java, UserRepository.java, PageRequest.java, UserDto.java, and a test file). System prompt + .cursorrules (500 tokens) + retrieved chunks (8,200 tokens) + user message (25 tokens) = ~8,725 tokens sent to GPT-4o. Well within budget.
Turn 1: LLM response. The model emits a tool call: read_file("src/main/java/com/example/UserRepository.java"). It wants the full file, not just the retrieved chunk, to understand the existing query methods. The orchestrator executes the read (230ms), appends the result (1,850 tokens) to context, and re-prompts.
Turn 2: LLM generates the edit. With the full repository file in context, the model emits an edit_file tool call with the complete diff: adding a findByCursorId JPA method, updating the controller endpoint signature, and adding a CursorPageResponse DTO. The orchestrator applies the diff via patch.
Turn 3: Validation. The model emits run_command("./mvnw test -pl api-service -Dtest=UserControllerTest"). The test runner returns a failure: CursorPageResponse is missing a @JsonProperty("next_cursor") annotation. The model observes the error, reasons about the fix, and emits a second edit_file call for the DTO. Tests pass on rerun.
Total cost for this session: 4 LLM calls, 14,800 tokens consumed, ~45 seconds wall clock time. The key insight: the agent didn't just write code. It read the relevant files, ran tests, observed a failure, self-corrected, and produced working code. Each of these steps is a separate LLM call, not a single generation.
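The loop that drives a trace like this can be sketched in plain Python. Everything here is illustrative: the message format is simplified, and the llm and tools callables stand in for the real model API and tool executors.

```python
def run_agent(user_msg, llm, tools, max_steps=10):
    """Minimal ReAct-style orchestration loop: prompt, execute tool, observe, repeat."""
    messages = [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": user_msg},
    ]
    for _ in range(max_steps):  # hard step limit guards against infinite tool loops
        reply = llm(messages)   # one full API call: the model sees the whole history
        if reply.get("tool") is None:
            return reply["content"]            # no tool call means a final answer
        result = tools[reply["tool"]](**reply["args"])  # orchestrator executes, not the model
        messages.append({"role": "assistant", "content": repr(reply)})
        messages.append({"role": "tool", "content": result})  # observation fed back in
    return "step limit reached"
```

Each iteration is a separate, stateless LLM call; the growing messages list is the only thing carrying the session forward.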
🧭 Choosing Between Long Context, RAG, and Summarization: A Decision Guide
Not every project needs every technique. The right context management strategy depends on repository size, session length, and task type.
| Situation | Recommended Strategy | Why |
| --- | --- | --- |
| Small focused repo (< 100K tokens total) | Load all files into context | No retrieval overhead; model has full picture |
| Large monorepo (> 500K tokens) | RAG-first: embed + retrieve top-k chunks | Prevents context overflow; focuses model on relevant code |
| Long refactor session (> 30 turns) | Summarization + .cursorrules for constraints | History will overflow; constraints must survive truncation |
| One-shot code generation task | Single-shot LLM call, no agent loop | Agent loop overhead (3–6 extra API calls) is wasteful |
| Multi-step autonomous task (fix a bug + run tests) | Full ReAct loop with tool use | Multi-step requires observation/correction cycles |
| Team environment with shared codebase | Persistent vector store (Tier 3) + short sessions | Keeps individual context windows small; leverages shared index |
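The decision table can be condensed into a routing function. The thresholds come from the table; the function itself is an illustrative sketch, not any product's actual logic (the "selective file loading" middle case is an assumption for repos between the two thresholds).

```python
def pick_context_strategy(repo_tokens, expected_turns, multi_step):
    """Map repo size, session length, and task type to a context strategy."""
    if not multi_step:
        return "single-shot LLM call, no agent loop"
    if repo_tokens < 100_000:
        strategy = "load all files into context"
    elif repo_tokens > 500_000:
        strategy = "RAG-first: embed + retrieve top-k chunks"
    else:
        strategy = "selective file loading"  # middle ground not covered by the table
    if expected_turns > 30:
        # long sessions overflow history, so constraints must live in the system prompt
        strategy += ", plus summarization and rules in the system prompt"
    return strategy
```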
Anti-patterns to avoid:
- Using an agent for everything. For "explain this function," a single LLM call is faster and cheaper than a 5-tool-call agent session. Reserve agent loops for tasks that genuinely require multiple tool interactions.
- Trusting long sessions without .cursorrules. After 30+ turns, your early constraints are likely evicted. Any rule that must hold for the whole session belongs in the system prompt, not a conversation turn.
- Ignoring the "lost in the middle" effect. If you have 10 critical constraints, inject them as a numbered list at the beginning of the system prompt; do not bury them in the middle of a large context block.
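The positioning rule can be made concrete with a small assembly helper. This is a sketch, not any agent's real code: constraints numbered at the very start, the bulk of retrieved code in the middle, the user's question at the very end.

```python
def assemble_context(constraints, retrieved_chunks, question):
    """Order context to exploit the positions the model recalls best."""
    # Numbered rules go first: the start of the window gets high recall
    rules = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(constraints))
    # Retrieved code fills the middle, the weakest-recall region
    code = "\n---\n".join(retrieved_chunks)
    # The actual question goes last, the other high-recall position
    return f"CRITICAL RULES:\n{rules}\n\nRELEVANT CODE:\n{code}\n\nQUESTION: {question}"
```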
🛠️ Configuring Real Agents: Cursor Rules, Aider Flags, and OpenAI Tool Schemas
Cursor: injecting constraints via .cursorrules
Cursor reads a .cursorrules file from the repository root and prepends its content to every system prompt. This is the most direct way to influence what the agent "always knows" regardless of context window state.
```
# .cursorrules
You are working in a Spring Boot 3.x codebase.
Rules:
- All payment services must implement the PaymentProcessor interface
- Never use LegacyBillingClient; it is deprecated and must be removed
- Use constructor injection, not field injection
- All new services must be annotated with @Retryable(maxAttempts = 3)
- Target Java 21 record types for DTOs
Codebase layout:
- src/main/java/com/example/payment: payment domain
- src/main/java/com/example/core: shared infrastructure
- src/test: unit and integration tests (TestContainers)
```
Because .cursorrules is in the system prompt, it survives context window overflow. It is the single most effective mitigation against the "forgot the rules" problem.
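Mechanically, the orchestrator's side of this is simple. Roughly (an illustrative sketch, not Cursor's actual code): read the file and prepend it to the system prompt on every single call, so it can never be evicted along with conversation history.

```python
from pathlib import Path

def build_system_prompt(repo_root, base_prompt):
    """Prepend .cursorrules (if present) to the system prompt on every call."""
    rules_file = Path(repo_root) / ".cursorrules"
    if rules_file.exists():
        # Rules go first, so they sit at the high-recall start of the window
        return rules_file.read_text() + "\n\n" + base_prompt
    return base_prompt
```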
Aider: model selection and context configuration
Aider is a terminal-first coding agent that integrates directly with Git. It supports explicit model selection and history management:
```bash
# Use Claude 3.5 Sonnet with a 150K token context budget
aider --model claude-3-5-sonnet-20241022 \
  --context-window 150000 \
  --max-chat-history-tokens 30000 \
  src/payment/FileUploader.java src/core/RetryPolicy.java

# Enable auto-commits after each accepted edit
aider --auto-commits --model gpt-4o
```
The --max-chat-history-tokens flag triggers automatic summarization when the conversation history exceeds the threshold, preventing silent truncation from corrupting session state.
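The behavior behind such a threshold can be sketched as follows. This is illustrative, not Aider's implementation: count_tokens is a crude word count standing in for a real tokenizer, and summarize stands in for an LLM summarization call.

```python
def count_tokens(msg):
    return len(msg["content"].split())  # crude stand-in for a real tokenizer

def compact_history(history, budget, summarize):
    """Fold older turns into one summary message once the budget is exceeded."""
    if sum(count_tokens(m) for m in history) <= budget:
        return history                    # under budget: leave history intact
    keep = history[-2:]                   # always keep the newest exchange verbatim
    summary = summarize([m["content"] for m in history[:-2]])
    return [{"role": "system", "content": "Summary of earlier turns: " + summary}] + keep
```

Summarization is lossy by design; that is why rules that must hold for the whole session belong in the system prompt rather than in the turns being folded away.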
OpenAI function calling: the tool schema contract
This is the JSON schema that an orchestrator registers with the OpenAI API to expose a run_tests tool. The model reads this schema as part of the system prompt and emits structured function_call JSON when it wants to invoke the tool. This is protocol-level config, not application code:
```json
{
  "type": "function",
  "function": {
    "name": "run_tests",
    "description": "Execute the test suite for the specified module and return results",
    "parameters": {
      "type": "object",
      "properties": {
        "module": {
          "type": "string",
          "description": "Maven module name or path, e.g. 'payment-service'"
        },
        "test_filter": {
          "type": "string",
          "description": "Optional regex filter for test class names"
        }
      },
      "required": ["module"]
    }
  }
}
```
The orchestrator parses the model's function_call response, maps it to the actual mvn test -pl payment-service invocation, captures stdout/stderr, and returns it as a tool message in the next API call.
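The mapping step can be sketched like this. The tool name and the mvn invocation follow the schema above; the dispatcher itself is illustrative, and run is a stub so the sketch stays self-contained. Note that in the real OpenAI API the arguments field arrives as a JSON-encoded string, which is why it is decoded separately.

```python
import json

def dispatch_tool_call(function_call, run):
    """Translate a model-emitted run_tests call into an actual mvn command."""
    if function_call["name"] != "run_tests":
        raise ValueError(f"unknown tool: {function_call['name']}")
    args = function_call["arguments"]
    if isinstance(args, str):      # the real API returns arguments as a JSON string
        args = json.loads(args)
    cmd = ["mvn", "test", "-pl", args["module"]]
    if "test_filter" in args:      # optional parameter from the schema
        cmd.append(f"-Dtest={args['test_filter']}")
    # Captured output goes back to the model as a tool message on the next call
    return {"role": "tool", "name": "run_tests", "content": run(cmd)}
```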
For a full deep-dive on building LangGraph-based agent loops with tool calling, see LangChain Tools and Agents: The Classic Agent Loop and LangGraph ReAct Agent Pattern.
📌 Hard-Won Lessons from Production AI Coding Agent Deployments
The context window is your #1 operational constraint. Every architectural decision in a coding agent (RAG, summarization, tool result truncation, priority-based eviction) exists to serve one goal: fitting the right information into a finite window. When an agent "goes dumb," check the context, not the model.
.cursorrules is your safety net. The system prompt is the only part of context that survives a window overflow intact. Any constraint you genuinely need the agent to respect (interface contracts, deprecated class avoidance, style rules) belongs in the system prompt, not in a one-time message at the start of the session.
Statelessness is a feature, not a bug. The model's statelessness makes it deterministic (modulo temperature) and parallelizable. The difficulty of maintaining session state is an orchestration problem, not an LLM problem. Treating it as an LLM problem leads to frustration; treating it as an engineering problem leads to solutions.
Tool loops are a prompt engineering problem. When an agent gets stuck in a tool loop, the fix is almost always a better system prompt: explicit stop conditions ("if search returns empty, report that to the user and stop"), step limits in the orchestrator, and cleaner tool descriptions that reduce ambiguity.
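Alongside the prompt fixes, the orchestrator-side guards mentioned here (a hard step limit and repeated-call detection) can be sketched as follows; both thresholds are illustrative defaults, not values from any product.

```python
def guard(tool_calls, max_steps=15, repeat_window=3):
    """Return a stop reason, or None if the agent loop may continue.

    tool_calls is the session's history of (tool_name, args) tuples.
    """
    if len(tool_calls) >= max_steps:
        return "max steps exceeded"
    recent = tool_calls[-repeat_window:]
    # The same call repeated back-to-back signals a stuck agent
    if len(recent) == repeat_window and len(set(recent)) == 1:
        return "tool loop detected: same call repeated"
    return None
```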
The "lost in the middle" effect is real and measurable. In controlled experiments, models recall content at context positions 0–10% and 90–100% at significantly higher accuracy than positions 40–60%. For long contexts, structure your system prompt so the highest-priority constraints appear at the very beginning, and place the user's actual question at the very end of the assembled context.
Multi-agent architectures multiply context management complexity. Each sub-agent in a Devin-style system has its own finite context window. Passing state between sub-agents via a shared Redis cache introduces serialization overhead and information loss. More agents does not mean more memory โ it means more orchestration surface area to get wrong.
Streaming is not just UX. Streaming tokens back to the client allows the orchestrator to detect and interrupt tool calls mid-stream, implement real-time safety checks on generated code, and update the UI incrementally. Batch-mode LLM calls make all of this harder.
📝 TLDR Summary and Key Takeaways: What Every Developer Should Know About AI Coding Agents
- An AI coding agent is an LLM plus tool registry plus orchestration loop plus context management. The model itself is the smallest part of the system.
- The context window is finite and every token in it has been placed there by deliberate engineering decisions. When an agent forgets something, tokens were evicted.
- The ReAct loop (Reason → Act → Observe) drives every non-trivial agent action. A single user request can trigger 5–10 LLM API calls before producing a final answer.
- RAG (Retrieval-Augmented Generation) is how agents reference large codebases without exceeding the context window: embed → store → retrieve → inject.
- Memory operates across four tiers: in-context (ephemeral, fastest), external cache (Redis, session-scoped), vector store (persistent, semantic), and parametric (model weights, static).
- Sessions are entirely an agent-side abstraction. The LLM sees a fresh slate every call; the orchestrator reconstructs the session from its serialized history.
- Tool calls are structured JSON emitted by the model. The model never directly executes code: it expresses intent; the orchestrator acts.
- Critical constraints belong in .cursorrules or the system prompt, not in early conversation turns that will eventually be evicted.
- The "lost in the middle" problem means context position matters: put the most important instructions at the beginning and end of the assembled context.
🎯 Practice Quiz: Test Your Understanding of AI Coding Agent Internals
A developer reports that Claude Code "forgot" a critical design rule 40 minutes into a refactoring session. What is the most likely cause?
- A) The model was fine-tuned on different code and lost the instruction
- B) The conversation history grew large enough to push the instruction out of the context window via truncation
- C) Claude Code does not support multi-turn sessions
- D) The instruction was too short to be stored in the vector database
Correct Answer: B
Which component in the agent architecture is responsible for deciding whether to call a tool or generate a final answer?
- A) The LLM API
- B) The vector store
- C) The agent orchestrator running the ReAct loop
- D) The streaming response parser
Correct Answer: C
A coding agent needs to reference a 200,000-line monorepo but the model only has a 128K token context window. Which technique allows the agent to answer questions about the full codebase?
- A) Fine-tuning the model on the entire codebase
- B) Retrieval-Augmented Generation โ embedding files, semantic search, injecting top-k relevant chunks
- C) Splitting the codebase into 128K chunks and calling the model once per chunk
- D) Asking the user to paste the relevant files manually
Correct Answer: B
An agent generates a read_file tool call. Which description is most accurate?
- A) The LLM directly accesses the file system via an OS system call
- B) The LLM emits a structured JSON block expressing intent; the orchestrator executes the actual file read and injects the result back into context
- C) The LLM generates an approximation of the file contents from its training data
- D) The agent opens a subprocess that reads the file and streams characters to the model
Correct Answer: B
According to the "lost in the middle" finding, where in the context window does an LLM attend to content most reliably?
- A) The middle, because it is equidistant from both ends
- B) Uniformly across all positions โ attention is evenly distributed
- C) The beginning and end of the context window
- D) Wherever the system prompt is positioned
Correct Answer: C
Which of the four memory tiers is static โ meaning it cannot be updated without retraining the model?
- A) In-context memory (Tier 1)
- B) External short-term cache (Tier 2)
- C) Long-term vector store (Tier 3)
- D) Parametric memory: the model weights (Tier 4)
Correct Answer: D
An agent is stuck calling search_codebase repeatedly with slightly different queries without making progress. This is an example of which failure mode?
- A) Context overflow
- B) Stale context
- C) Infinite tool loop
- D) Cascading hallucination
Correct Answer: C
Why is .cursorrules more reliable than stating a constraint in an early conversation message?
- A) .cursorrules uses a different file format that the model processes differently
- B) .cursorrules is injected into the system prompt, which survives context window overflow; early conversation turns are evicted first
- C) .cursorrules bypasses the vector store and writes directly to model weights
- D) Early conversation messages are encrypted and not visible to the model
Correct Answer: B
Devin and E2B run tool execution in per-session containers. What is the primary reason for this sandboxing?
- A) To give the model direct access to the file system without latency
- B) To prevent arbitrary code executed by the agent from accessing host resources or persisting beyond the session
- C) To improve LLM token generation speed
- D) To allow multiple agents to share the same context window
Correct Answer: B
Open-ended challenge: You are building a coding agent for a 2-million-line enterprise Java monorepo. The repo has 4,000 source files. Your model has a 128K token context window and you expect sessions to run 60–90 minutes with 40–60 turns. Describe your context management architecture: which memory tiers you would use, how you would handle context overflow, what you would put in the system prompt versus retrieve via RAG, and how you would ensure that critical architectural constraints are never evicted. There is no single correct answer; justify your trade-offs. (Consider: a persistent vector store for the codebase, Redis for session state, summarization at 25K history tokens, critical rules in .cursorrules, file re-reads before edits to prevent stale context, and a max-turns guard with task checkpointing.)
🔗 Related Posts
- How GPT (LLM) Works: The Next Word Predictor
- AI Agents Explained: When LLMs Start Using Tools
- Context Window Management Strategies for Long Documents
- LangChain Tools and Agents: The Classic Agent Loop
- LangGraph ReAct Agent Pattern
- Multi-Step AI Agents: The Power of Planning
- RAG Explained: How to Give Your LLM a Brain Upgrade
- Embeddings Explained

Written by
Abstract Algorithms
@abstractalgorithms