How AI Coding Agents Work: Models, Context, Sessions, and Memory
Inside the architecture of Claude Code, GitHub Copilot, and Cursor: how agents use LLMs, manage context windows, handle sessions, and tier memory.
Abstract Algorithms
TLDR: An AI coding agent is an LLM stapled to a tool registry, wrapped in an orchestration loop that painstakingly rebuilds state on every single API call, because the model itself is completely stateless. Understanding the context window, the ReAct loop, and the four-tier memory stack explains every hallucination, forgotten variable, and mid-refactor context dropout you've ever blamed on "the AI."
The 500-Line Refactor That Lost Its Mind
You're 45 minutes into a refactor with Claude Code. You've established the ground rules: use the PaymentProcessor interface, never call LegacyBillingClient directly, all new services must implement Retryable. The agent acknowledged all of this, generated a clean StripePaymentProcessor, and was halfway through PayPalPaymentProcessor.
Then the context window filled up.
The next response confidently imports LegacyBillingClient, the exact class you flagged as forbidden. It ignores the Retryable constraint. It doesn't remember the interface contract at all. The agent hasn't gone dumb; it has literally never seen your earlier instructions. Those tokens were evicted when the window overflowed.
This is not a bug in the product. It is the fundamental architecture of every AI coding agent working exactly as designed. To work around it, and to use these tools well, you need to understand what is actually happening under the hood.
LLM + Tool Use + Orchestration Loop = AI Coding Agent
A plain language model like GPT-4 or Claude 3 Sonnet is a stateless text transformer: tokens go in, tokens come out. It has no persistent memory, no ability to read files, and no awareness of your codebase beyond what you paste into the prompt. It cannot run your tests or check if the method it just invented actually exists.
An AI coding agent adds three things on top of that:
- A tool registry: file read/write, terminal execution, semantic search, linter invocation. The agent can take actions, not just generate text.
- An orchestration loop: logic that runs repeatedly, routing LLM output to tools, injecting tool results back into context, and deciding when the goal is complete.
- A context manager: code that carefully assembles what the model should "see" on each API call, because the window is finite and every token costs both money and latency.
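To make the tool-registry layer concrete, here is a deliberately minimal sketch: a dictionary mapping tool names to plain functions, with a single dispatch entry point. Every name here is illustrative, not any vendor's actual API.

```python
from pathlib import Path
import subprocess

# Illustrative tool registry: tool name -> plain Python callable.
TOOL_REGISTRY = {
    "read_file": lambda path: Path(path).read_text(),
    "write_file": lambda path, content: Path(path).write_text(content),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def execute_tool(name: str, **kwargs):
    """Invoke a tool by name with the arguments the model supplied."""
    if name not in TOOL_REGISTRY:
        # Unknown tools become error strings fed back to the model as observations.
        return f"ERROR: unknown tool '{name}'"
    return TOOL_REGISTRY[name](**kwargs)
```

The orchestration loop calls `execute_tool` whenever the model emits a tool call, and the context manager decides how much of the returned string actually makes it into the next request.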
Key players in the space:
| Agent | Underlying Model(s) | Primary Integration |
|---|---|---|
| Claude Code | Claude 3.5 Sonnet / Claude 3 Opus | Terminal, CLI, direct codebase access |
| GitHub Copilot | GPT-4o / OpenAI o1-mini | VS Code, JetBrains, vim plugins |
| Cursor | GPT-4o, Claude 3.5 Sonnet (user choice) | Custom fork of VS Code |
| Aider | GPT-4, Claude, local Ollama models | Terminal, Git-native workflow |
| Devin / SWE-agent | GPT-4 Turbo / Claude | Sandboxed browser + terminal |
| Cody (Sourcegraph) | Claude 3, Mixtral | Enterprise codebase search + edit |
The differences between these agents are mostly about how they manage the three layers above โ not about the underlying model capabilities.
Tokens, Turns, and Statelessness: The Baseline Every Developer Needs
Before diving into agent architecture, three foundational concepts unlock everything that follows.
Tokens: the currency of LLMs. LLMs do not operate on words; they operate on tokens, subword chunks produced by a tokenizer. The word "refactoring" may be a single token in one tokenizer and several in another. The code snippet getPaymentProcessor() might tokenize as 4-5 tokens. As a rough rule, 1,000 tokens ≈ 750 English words ≈ 50-80 lines of code. Every pricing tier, context limit, and rate limit in the LLM ecosystem is denominated in tokens.
The context window: the model's entire working memory. An LLM can only "see" what fits inside its current context window. GPT-4o's 128,000-token window sounds large until you fill it with a system prompt, conversation history, and several source files. Anything outside the window does not exist to the model; there is no "background awareness" of earlier turns.
Statelessness: the LLM has no memory across calls. This is the single most counterintuitive thing about coding agents. Each API call is completely independent. When you send a follow-up message, you are not "continuing a conversation"; you are sending a brand new request that happens to include the text of all previous turns. The agent's orchestration layer fabricates the illusion of memory by reconstructing context from its own storage on every call. When context fills up and older turns must be dropped, the session genuinely loses that information.
These three facts (token-based budgeting, finite context windows, and stateless API calls) are the root cause of every confusing agent behavior: forgotten constraints, stale edits, and mid-session context loss.
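The statelessness point is easy to demonstrate with a toy client, where `call_llm` stands in for any real chat-completion API:

```python
# The "conversation" lives entirely client-side: every call re-sends the
# whole transcript, because the model retains nothing between requests.
history = []

def send(user_message, call_llm):
    history.append({"role": "user", "content": user_message})
    reply = call_llm(messages=list(history))  # the model sees ONLY this list
    history.append({"role": "assistant", "content": reply})
    return reply

# A fake model that proves the point: it "remembers" earlier turns only
# because they are literally present in the messages it receives.
def fake_llm(messages):
    return f"I can see {len(messages)} message(s)."
```

Drop a turn from `history` and the model has no way to know it ever existed, which is exactly what happens on context overflow.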
High-Level Architecture of an AI Coding Agent
Every AI coding agent, regardless of vendor, shares the same skeleton. The diagram below shows the primary data flows: user input flows through the orchestrator into the context builder, the assembled context gets sent to the LLM, the response is parsed for tool calls or final output, and tool results are fed back into the loop.
```mermaid
graph TD
    User[User Request] --> Orch[Agent Orchestrator]
    Orch --> CB[Context Builder]
    CB --> LLMAPI[LLM API - GPT-4o or Claude]
    LLMAPI --> RP[Response Parser]
    RP -->|Tool call detected| TE[Tool Executor]
    TE -->|Tool result| Orch
    RP -->|Final answer| Out[Output to User]
    subgraph MemoryTiers[Memory Tiers]
        ICM[In-Context Window - ephemeral]
        SC[Session Cache - Redis]
        VS[Vector Store - Chroma - Pinecone]
        PW[Parametric - Model Weights]
    end
    subgraph ToolRegistry[Tool Registry]
        FS[File Read and Write]
        Term[Terminal and Shell]
        Srch[Semantic Search]
        Lint[Linter and Test Runner]
    end
    CB --> MemoryTiers
    TE --> ToolRegistry
```
The orchestrator is the control loop: it decides whether to call a tool, call the LLM, or return a final answer. The context builder is responsible for fitting the right information into the model's finite attention window. The tool executor wraps side-effectful actions (writing files, running commands) in a layer the orchestrator can supervise. Memory tiers are discussed in detail in a later section.
The Agent Loop: How Claude Code Decides What to Do Next
The behavioral engine at the heart of every coding agent is the ReAct pattern (Reason + Act), introduced by Yao et al. in 2022. Rather than doing a single LLM call and returning the answer, the agent cycles through three phases until the goal is satisfied.
```mermaid
flowchart TD
    Start([Task Received]) --> Reason[Reason about current state]
    Reason --> Decide{Tool call needed?}
    Decide -->|Yes| Pick[Select tool and arguments]
    Pick --> Exec[Execute tool]
    Exec --> Observe[Observe tool result]
    Observe --> Inject[Inject result into context]
    Inject --> Reason
    Decide -->|No - answer ready| Generate[Generate final response]
    Generate --> Done([Return to User])
```
Here is what each phase looks like in practice when you ask Cursor to "add retry logic to the uploadFile method":
- Reason: "I need to see the current implementation of `uploadFile` before I can add retry logic. I should read the file."
- Action: Call `read_file("src/storage/FileUploader.java")`.
- Observation: The file content appears in the context: the current method, its imports, the class signature.
- Reason: "The method uses `HttpClient`. I should check whether a `RetryPolicy` abstraction already exists in this codebase."
- Action: Call `search_codebase("RetryPolicy")`.
- Observation: `RetryPolicy.java` exists in `src/core/`.
- Reason: "Good. I can implement the retry logic using the existing `RetryPolicy`. I have enough context to generate the edit."
- Generate: Emit the diff.
Each tool call is a full round-trip: the orchestrator sends a new LLM request containing the accumulated context, the model emits a structured tool-call JSON block, the orchestrator executes the tool, injects the result, and calls the LLM again. A simple "add retry logic" request might require 3-6 LLM API calls before producing a final answer.
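A stripped-down version of that loop, including the iteration guard that production agents add, might look like this. The shapes are hypothetical: real agents parse structured tool-call blocks from the API response rather than plain dicts.

```python
MAX_ITERS = 10  # guard against infinite tool loops

def agent_loop(task, llm, tools):
    """Reason/Act/Observe until the model produces a final answer."""
    context = [{"role": "user", "content": task}]
    for _ in range(MAX_ITERS):
        step = llm(context)                       # Reason: one full LLM round-trip
        if "answer" in step:                      # no tool needed: done
            return step["answer"]
        result = tools[step["tool"]](**step["args"])               # Act
        context.append({"role": "tool", "content": str(result)})  # Observe + inject
    return "ERROR: iteration limit reached"
```

Note that the context list grows by one observation per iteration; this is exactly the accumulation that eventually pressures the context window.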
How Coding Agents Invoke LLMs: System Prompts, Tool Schemas, and Streaming
The API call anatomy
Every agent-to-LLM interaction follows the same structure, regardless of whether it targets OpenAI or Anthropic:
```
system:   [repository context, style guide, tool schemas, agent persona]
messages: [conversation history: user turns and assistant turns]
tools:    [JSON schema of available tools]
user:     [current user message]
```
The system prompt does the heavy lifting. In Cursor, it contains:
- The contents of `.cursorrules` (project-specific instructions)
- The currently open file and a few related files
- The schema definitions for every tool the agent can call (read_file, edit_file, run_terminal, search)
- Language and style constraints
A typical system prompt in a coding agent runs 2,000-4,000 tokens before any conversation starts. This is the "fixed cost" that every request pays before a single line of user input is processed.
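A sketch of how an orchestrator might assemble that request body; field names here are generic, and real OpenAI and Anthropic payloads differ in detail:

```python
def build_request(system_prompt, project_rules, tool_schemas, history, user_msg):
    """Assemble the per-call payload; the system block is the fixed cost."""
    return {
        "system": system_prompt + "\n\n" + project_rules,  # paid on every call
        "tools": tool_schemas,           # JSON schemas of available tools
        "messages": history + [{"role": "user", "content": user_msg}],
    }
```

Because the system block and tool schemas are resent verbatim on every call, providers now offer prompt caching specifically to discount this repeated prefix.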
Function calling / tool use
Modern LLMs expose a tools parameter that lets the orchestrator declare available actions as JSON schemas. The model can then emit a structured tool_use block instead of free text, signaling that it wants to invoke a specific function with specific arguments.
The schema for a read_file tool looks like this (this is a config/protocol definition, not application code):
```json
{
  "name": "read_file",
  "description": "Read the contents of a file at the given path",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Relative path to the file from the repository root"
      }
    },
    "required": ["path"]
  }
}
```
When Claude 3 emits a tool_use block for this schema, the orchestrator parses it, calls the actual file system, and returns the contents as a tool_result message. The model never touches the file system directly; it only emits intent.
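A minimal dispatcher for such a block might validate the model's arguments against the declared schema before executing anything. The dict shapes are simplified stand-ins for the real API's block types:

```python
def validate_and_dispatch(tool_use, schema, registry):
    """Check required args against the input_schema, then execute the tool."""
    missing = [k for k in schema["input_schema"]["required"]
               if k not in tool_use["input"]]
    if missing:
        # Malformed calls go back to the model as error observations,
        # giving it a chance to self-correct on the next turn.
        return {"type": "tool_result", "is_error": True,
                "content": f"missing required argument(s): {missing}"}
    output = registry[schema["name"]](**tool_use["input"])
    return {"type": "tool_result", "content": str(output)}
```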
Why agents stream tokens
Coding agents stream LLM responses token-by-token for two practical reasons. First, large diffs can take 20-40 seconds to generate, and streaming lets the UI show progress rather than a blank screen. Second, and more importantly, the orchestrator needs to detect when the model starts emitting a tool-call block so it can stop streaming, extract the structured JSON, and execute the tool, rather than waiting for the full response.
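The detection logic can be sketched as a filter over the token stream. The `<tool_use>` marker is invented for illustration; real APIs emit typed stream events rather than an in-band text marker.

```python
def consume_stream(token_stream):
    """Forward tokens to the UI until a tool-call marker appears."""
    visible, buffered, in_tool_call = [], [], False
    for token in token_stream:
        if "<tool_use>" in token:
            in_tool_call = True  # stop showing tokens; start capturing JSON
        (buffered if in_tool_call else visible).append(token)
    return "".join(visible), "".join(buffered)
```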
Context Window Management: The Core Constraint Driving Agent Design
This is the section that explains the refactor failure from the opening story. Everything that follows describes the single biggest engineering challenge in building a coding agent.
The Internals of Context Prioritization and Window Assembly
What is a context window? A context window is the maximum number of tokens (roughly 0.75 English words per token) that an LLM can attend to in a single API call. Anything outside the window is completely invisible to the model.
Current token budgets by model:
| Model | Context Window | Practical Code Limit |
|---|---|---|
| GPT-4o | 128,000 tokens | ~90,000 tokens of actual code |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 tokens |
| Claude 3 Opus | 200,000 tokens | ~150,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens | ~700,000 tokens |
| GPT-4 (original) | 8,192 tokens | ~5,000 tokens |
A token budget of 128K sounds enormous until you account for what must fit inside it.
What consumes the context window?
On a real coding task, the context is subdivided roughly like this:
| Content | Typical Token Cost |
|---|---|
| System prompt (agent persona, tools, style guide) | 2,000-4,000 |
| `.cursorrules` / project config | 500-1,500 |
| Open files and related files | 5,000-30,000 |
| Conversation history (all prior turns) | 10,000-60,000 |
| Tool results injected into context | 2,000-20,000 |
| Current user message | 50-500 |
For a long refactor session, the conversation history alone can consume 60,000 tokens. Add two or three large source files and you've used 100K of GPT-4o's 128K budget, and the session is only 30 minutes old.
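A budget check before each call can be approximated with the common 4-characters-per-token heuristic; a real orchestrator would use the model's actual tokenizer, and the constants here are illustrative:

```python
WINDOW = 128_000             # GPT-4o-class context window
RESERVED_FOR_OUTPUT = 4_000  # headroom for the model's reply

def estimate_tokens(text):
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def fits(system_prompt, history, files):
    """True if the assembled context still leaves room for a response."""
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(t) for t in history + files)
    return used <= WINDOW - RESERVED_FOR_OUTPUT
```

When `fits` returns False, the orchestrator has to reach for one of the truncation strategies described below rather than silently sending an over-long request.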
RAG: Retrieval-Augmented Generation for codebase context
A naive approach would be to dump the entire codebase into the context. That fails for any repository larger than a few thousand lines. The solution is RAG: Retrieval-Augmented Generation.
Here is how Cody, Cursor, and Claude Code all approach codebase context:
- Embedding: When you open a project, the agent indexes every source file by converting it into a dense vector embedding using a smaller embedding model (e.g., OpenAI `text-embedding-3-small` or a local model).
- Storage: Embeddings are stored in a local vector database (Chroma, LanceDB, or an in-process FAISS index).
- Query: When you submit a request, the agent embeds your query and runs a nearest-neighbor search over the codebase embeddings to retrieve the top-k most semantically similar code chunks.
- Injection: Those chunks are inserted into the context alongside your message.
The result: instead of loading 500 files, the agent loads the 10-20 files most relevant to your specific question. This is how Copilot can answer questions about a 200,000-line monorepo without blowing the context budget on every request.
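The retrieval step reduces to nearest-neighbor search over embedding vectors. A toy top-k over cosine similarity shows the core math; real agents delegate this to Chroma, LanceDB, or FAISS, and the chunk ids here are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=5):
    """index maps chunk id -> embedding; returns the k best-matching ids."""
    return sorted(index, key=lambda cid: cosine(query_vec, index[cid]),
                  reverse=True)[:k]
```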
Context assembly for a code edit request
The following sequence diagram traces exactly how the context is assembled when you ask an agent to edit a specific method. Each step is a deliberate budget decision: the orchestrator is choosing what information the model needs versus what it can afford to include.
```mermaid
sequenceDiagram
    participant U as User
    participant Orch as Orchestrator
    participant CB as Context Builder
    participant VS as Vector Store
    participant LLM as LLM API
    U->>Orch: Fix the retry logic in uploadFile
    Orch->>CB: Assemble context for this request
    CB->>VS: Embed query - retrieve relevant file chunks
    VS-->>CB: Top-5 chunks: FileUploader, RetryPolicy, HttpClient
    CB->>CB: Load open files from editor state
    CB->>CB: Prepend system prompt and tool schemas
    CB->>CB: Append conversation history up to token budget
    Note over CB: Budget check - 128K total - 12K used so far
    CB->>LLM: Send assembled context with user message
    LLM-->>Orch: Thought plus tool call - read_file FileUploader
    Orch->>CB: Inject tool result into context
    CB->>LLM: Re-send with tool result appended
    LLM-->>Orch: Final code edit diff
    Orch->>U: Display diff for approval
```
The "lost in the middle" problem
Research from Liu et al. (2023) demonstrated that LLMs are significantly better at recalling information at the beginning and end of their context window than in the middle. Content placed in the middle of a 128K context receives systematically less attention during generation.
This has practical consequences: if an agent places your critical interface contract in the middle of a long context with many files, the model may effectively "ignore" it even though the tokens are technically present. Well-designed agents place the most important constraints (the user's current message, the file being edited, the key instructions) at the top or bottom of the assembled context for exactly this reason.
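A context builder can account for this by construction, pinning constraints to the top and the live request to the bottom. This is a sketch of the layout idea, not any specific product's implementation:

```python
def order_context(constraints, file_chunks, history, user_message):
    """Exploit primacy/recency: critical items at the edges, bulk in the middle."""
    return (
        [("system", c) for c in constraints]   # top: well-attended
        + [("file", f) for f in file_chunks]   # middle: weakest attention anyway
        + [("history", h) for h in history]
        + [("user", user_message)]             # bottom: well-attended
    )
```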
Sliding window and truncation strategies
When the context window fills up, agents have four strategies:
- Hard truncation: Drop the oldest conversation turns. Simple but causes the "forgot the rules" problem.
- Summarization: Call the LLM to summarize the conversation history into a compressed representation before truncating. More expensive but preserves semantic content.
- Priority-based eviction: Evict tool results and intermediate steps first; preserve user instructions and the original task description. This is what Claude Code does.
- Re-chunking with RAG: Re-run the semantic search to refresh which file chunks are in context. Cursor does this on every request to ensure context relevance.
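The priority-based strategy can be sketched as an eviction pass that drops low-priority turn types first and never touches pinned user instructions. This illustrates the idea only; it is not Claude Code's actual code, and the cost heuristic is the rough 4-chars-per-token approximation:

```python
EVICTION_ORDER = ["tool_result", "assistant", "user"]  # user turns go last

def evict_until_fits(turns, budget, cost=lambda t: len(t["content"]) // 4):
    """Drop the oldest turns of each role, in priority order, until under budget."""
    turns = list(turns)
    for role in EVICTION_ORDER:
        while sum(cost(t) for t in turns) > budget:
            victims = [t for t in turns
                       if t["role"] == role and not t.get("pinned")]
            if not victims:
                break  # nothing evictable at this priority; try the next role
            turns.remove(victims[0])  # oldest matching turn goes first
    return turns
```

Pinning the original task description and hard constraints is what keeps a session coherent even after heavy eviction.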
Performance Analysis of Context Operations
Context management is not free: it adds measurable latency and cost to every agent turn. Understanding these numbers helps set realistic expectations.
| Operation | Latency | Token Cost |
|---|---|---|
| Embedding a query (RAG) | 50-200ms | ~50 tokens |
| Vector search (top-k retrieval) | 10-100ms | 0 tokens |
| Context assembly (in-memory) | 5-30ms | 0 tokens |
| Summarization of 20K-token history | 3-8s extra | 500-2,000 tokens |
| LLM API call (first token) | 500ms-3s | N/A |
The RAG retrieval path (embedding + vector search + inject) adds ~200-400ms of overhead per request but saves 10,000-50,000 tokens of context budget that would otherwise be consumed by loading all files. At $0.01 per 1K input tokens, a 40,000-token reduction per request saves $0.40 per call, which is significant at scale.
The summarization path trades increased latency (3-8 extra seconds) for session continuity. Teams using Aider on long refactoring sessions often enable --max-chat-history-tokens to trigger automatic summarization before the context window forces hard truncation, accepting the latency cost to preserve the session's logical thread.
Session Management: Keeping State Alive Between Turns
The LLM has no memory. Zero. Each API call is independent; the model has no idea what happened in the previous turn unless you include that history in the current request. The session is entirely an agent-side construct.
Session lifecycle
```
START -> ACTIVE -> IDLE -> EXPIRED
  |        |        |        |
  |        |        |        +-- History cleared; must restart task
  |        |        +-- No activity for N minutes; state serialized to cache
  |        +-- User is actively sending messages; full state in memory
  +-- Empty context; system prompt + empty history
```
When you close VS Code with Copilot Chat open, the session state (your conversation history) is serialized to a local cache file. When you reopen VS Code, the agent deserializes that history and injects it back into the next LLM request. This is why Copilot "remembers" your last conversation: it is replaying the entire history, not because the model retained it.
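The mechanism reduces to plain serialization. A sketch follows; the file name and format are invented for illustration, as Copilot's actual cache layout is internal:

```python
import json
from pathlib import Path

def save_session(history, path=".agent_session.json"):
    """Persist the conversation log when the editor closes."""
    Path(path).write_text(json.dumps(history))

def load_session(path=".agent_session.json"):
    """Restore it on reopen; an empty list means a fresh session."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []
```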
Turn-by-turn serialization
Each agent maintains a conversation log that grows with every turn:
```
Turn 1: User: "Use PaymentProcessor interface everywhere"
        Assistant: "Understood. I'll ensure all new payment services implement PaymentProcessor."
Turn 2: User: "Now write StripePaymentProcessor"
        Assistant: [tool calls + code] -> StripePaymentProcessor.java
Turn 3: User: "Now write PayPalPaymentProcessor"
...
```
This entire log is sent in every LLM request. By turn 20, the log itself may be the largest single item in the context. Agents like Aider include a --max-chat-history-tokens flag that triggers summarization when the history exceeds a threshold.
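The trigger logic, in the spirit of that flag, can be sketched as follows; `summarize` stands in for an LLM call that compresses the oldest turns, and the threshold and token estimate are illustrative:

```python
MAX_HISTORY_TOKENS = 8_000  # threshold before compressing

def maybe_compress(history, summarize, est=lambda t: len(t) // 4):
    """Replace the oldest half of an oversized history with a summary."""
    if sum(est(turn) for turn in history) <= MAX_HISTORY_TOKENS:
        return history
    keep_from = len(history) // 2             # most recent half stays verbatim
    summary = summarize(history[:keep_from])  # oldest half gets compressed
    return [f"[summary of earlier turns] {summary}"] + history[keep_from:]
```

The trade-off is exactly the one described above: an extra LLM call now, in exchange for a session that keeps its logical thread instead of silently losing its oldest turns.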
Multi-agent sessions: how Devin delegates
More advanced agents like Devin and OpenAI's Deep Research go beyond a single agent loop. They spawn sub-agents: separate orchestrator instances with their own context windows. A supervisor agent breaks the task into subtasks, each assigned to a sub-agent. Sub-agents return results that the supervisor injects into its own context as tool results.
Shared context between sub-agents is mediated through the external short-term cache (Redis or similar), not through a shared context window. Sub-agent A's conversation history is never directly visible to sub-agent B; only the summary output is passed.
Memory Architecture: The Four Tiers Every Agent Navigates
Coding agents operate across four fundamentally different types of memory, each with its own access speed, capacity, and persistence characteristics.
```mermaid
graph TD
    subgraph T1[Tier 1 - In-Context Memory]
        ICM[Active context window - fast - ephemeral - 128K to 200K tokens]
    end
    subgraph T2[Tier 2 - External Short-Term Cache]
        STC[Redis - session state - tool outputs - preferences - minutes to hours]
    end
    subgraph T3[Tier 3 - Long-Term Vector Store]
        LTV[Chroma or Pinecone - codebase embeddings - past sessions - days to permanent]
    end
    subgraph T4[Tier 4 - Parametric Memory]
        PW[Model weights - pre-training and fine-tuning - static until retrain]
    end
    T1 -->|evicted when window fills| T2
    T2 -->|persisted after session expires| T3
    T3 -->|retrieved via semantic search| T1
    T4 -->|always available to model| T1
```
Tier 1 - In-context memory (ephemeral): The current context window. Everything in the active LLM call. This is the fastest (zero retrieval latency) and most expensive (every token costs money). It is completely lost when the session ends or the window overflows. This is the tier that failed in the opening scenario.
Tier 2 - External short-term cache (Redis or equivalent): Session state that survives between individual LLM calls but is scoped to the current working session. Tool outputs (file contents retrieved in turn 3 can be cached so they are not re-retrieved in turn 4), user preferences ("always use tabs, not spaces"), and intermediate task state live here. GitHub Copilot serializes session history to a local JSON cache file in the user's .vscode directory. Aider writes session history to a .aider.chat.history.md file.
Tier 3 - Long-term vector store (Chroma, Pinecone, LanceDB): Persistent embeddings of the codebase, past conversation summaries, and learned user patterns. This tier survives across sessions indefinitely. When Cursor indexes your repository on first open, it is populating Tier 3. When you ask "where is the authentication logic?", Cursor queries Tier 3 with a semantic search, retrieves the most relevant file chunks, and injects them into Tier 1 (the current context window) for the model to reason about.
Tier 4 - Parametric memory (model weights): Everything the model learned during pre-training and fine-tuning. This is static knowledge baked into the model's 70 billion+ parameters. GPT-4's knowledge of Python syntax, common design patterns, the Java standard library (all of that lives in Tier 4). It requires no retrieval and is always available, but it is frozen at the model's training cutoff date and cannot be updated without retraining.
The key insight: the orchestrator's primary job is moving information between these tiers at the right time. RAG moves Tier 3 content into Tier 1. Session serialization moves Tier 1 content into Tier 2. Summarization compresses Tier 1 overflow into Tier 3 for future retrieval.
Tool Use and Code Execution: How Agents Take Real Actions
An agent without tools is a very expensive autocomplete. The tool layer is what separates "AI coding assistant" from "AI coding agent."
Standard tool registry in a modern coding agent:
| Tool | What it does | Used by |
|---|---|---|
| `read_file(path)` | Returns file contents as a string | All agents |
| `write_file(path, content)` | Creates or overwrites a file | Cursor, Aider, Claude Code |
| `edit_file(path, old, new)` | Line-level diff application | Claude Code, Copilot |
| `run_command(cmd)` | Executes a shell command; returns stdout/stderr | Devin, Aider, Claude Code |
| `search_codebase(query)` | Semantic or keyword search over the repo | Cody, Cursor, Copilot |
| `run_tests(suite)` | Runs a test suite; returns pass/fail + output | Devin, SWE-agent |
| `browse(url)` | Fetches a web page | Devin, Claude Code |
Sandboxing: how Devin runs code safely
When an agent calls run_command, it is executing arbitrary code on a machine. Devin and E2B (the sandboxing infrastructure used by several agents) address this by running each agent session inside a per-session container (Docker or Firecracker microVM). The container has:
- A clean copy of the repository
- No access to the host network or host filesystem beyond the project
- A resource budget (CPU, memory, disk I/O)
- An automatic teardown timer to prevent runaway processes
The tool result (stdout, stderr, exit code) is returned to the orchestrator, which injects it into context as a tool_result message. The model sees the output but never directly executes code; it only observes results.
Tool results and context bloat
Every tool result is injected into the context window. A run_tests call that returns 2,000 lines of Maven Surefire output can consume 4,000+ tokens in a single turn. Well-designed agents apply result compression: truncate verbose output to the most salient lines (first N lines, last N lines, lines containing "ERROR" or "FAILED"), and summarize the rest. Claude Code applies a 4K token cap per tool result by default, with an option to raise the limit.
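A result compressor along those lines keeps the head, the tail, and any error lines from the middle; the thresholds here are illustrative, not any product's defaults:

```python
def compress_output(text, head=10, tail=10):
    """Truncate verbose tool output while preserving errors and context."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # already small enough
    # Keep error/failure lines from the middle; drop everything else there.
    errors = [l for l in lines[head:-tail] if "ERROR" in l or "FAILED" in l]
    dropped = len(lines) - head - tail - len(errors)
    middle = errors + [f"... [{dropped} lines truncated] ..."]
    return "\n".join(lines[:head] + middle + lines[-tail:])
```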
Performance Characteristics: Latency, Throughput, and the Bottlenecks
Understanding where time is spent explains why agents feel slow on complex tasks.
Latency breakdown for a single agent turn:
| Step | Typical Duration |
|---|---|
| Context assembly (RAG query + file loads) | 100-500ms |
| LLM API call (time to first token) | 500ms-3s |
| LLM generation (full response, streaming) | 2s-30s |
| Tool execution (file read) | 1-10ms |
| Tool execution (run_command, test suite) | 5s-120s |
| Tool result injection + re-prompt | 200-800ms |
For a task requiring 5 tool calls before the final answer, total latency is often 30-90 seconds. The bottleneck is almost always the LLM generation time, not the orchestration overhead.
Throughput limits: Most OpenAI and Anthropic API tiers enforce rate limits of 500,000-1,000,000 tokens per minute (TPM). A single agent session generating 10,000 tokens per turn at 10 turns per minute approaches these limits quickly in team environments, which is why tools like Cursor have per-user API key configurations rather than sharing a pool.
Failure Modes: Why Agents Hallucinate, Loop, and Forget
Understanding failure modes is the most practical takeaway from studying agent internals.
Context overflow: the silent truncation. When the conversation history grows beyond the context window, older turns are silently dropped. The model never signals this; it simply operates without information it once had. The "forgot the rules" scenario from the opening story is always a context overflow event. Mitigation: keep sessions short and focused; use .cursorrules to inject critical constraints into the system prompt so they are always present.
Hallucinated file paths and method names. The model generates code that references com.example.payment.v2.StripeAdapter, a class that does not exist. Why? The model extrapolated from patterns in its training data. RAG helps but does not fully prevent this: if the relevant file was not retrieved (low semantic similarity to the query), the model will fill the gap from Tier 4 (parametric memory), which may suggest patterns from other codebases. Mitigation: run a linter as a tool in the loop and inject compile errors back as observations, forcing the model to self-correct.
Infinite tool loops. The model calls search_codebase("RetryPolicy"), does not find a satisfying result, and calls it again with slightly different phrasing. Loop. An orchestrator without a turn limit will keep calling tools indefinitely. All production agents implement a max-iterations guard (Claude Code: 10 tool calls per session turn; Aider: 5 retries per edit block).
Stale context: editing a file that has already changed. The agent read FileUploader.java in turn 3 and cached its contents. In turn 8, after the user manually edited the file, the agent still operates on the turn-3 snapshot. The generated diff applies to a file version that no longer exists, producing merge conflicts. Mitigation: re-read files immediately before editing them (Claude Code does this automatically); use file hashes to detect staleness.
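The hash-based check is only a few lines. This is a sketch of the technique, not any agent's actual implementation:

```python
import hashlib
from pathlib import Path

snapshots = {}  # path -> hash recorded when the agent last read the file

def file_hash(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def remember_read(path):
    """Read a file for the context and record its content hash."""
    snapshots[path] = file_hash(path)
    return Path(path).read_text()

def is_stale(path):
    """True if the file changed since the agent last read it."""
    return snapshots.get(path) != file_hash(path)
```

Before emitting an `edit_file` call, the orchestrator checks `is_stale` and re-reads the file if needed, so the diff always targets the version that is actually on disk.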
Cascading hallucination across tool calls. The model generates a tool call with a bad argument (e.g., edit_file("src/storage/FileUploadr.java", ...), a typo in the path). The tool returns a "file not found" error. The model tries to create the file from scratch. Now there are two files: the real one and a hallucinated duplicate. Each subsequent call to the wrong file builds on the hallucination, drifting further from reality.
How GitHub Copilot, Cursor, Devin, and Claude Code Apply These Internals in Production
The internals described above are not theoretical; they manifest as concrete product design decisions in every major coding agent.
GitHub Copilot is the most conservative implementor. It maintains a tight, editor-scoped context window: the currently open file, a few recently viewed files, and a short conversation history. Its inline suggestion mode doesn't use a full ReAct loop; it's a single-shot LLM call with heavily optimized context assembly (open file + neighboring code + editor cursor position). Copilot Chat adds a multi-turn session layer with a local JSON cache for history persistence, but stops short of full tool use beyond code search.
Cursor takes a RAG-first approach. On every project open, it indexes the full codebase into a local vector store (LanceDB). On every request, it runs a semantic search before assembling context, ensuring that the most relevant code, not just what is open, arrives in the model's window. Cursor supports model routing: it sends short inline completions to a fast model (GPT-4o mini or Haiku) and routes complex refactoring requests to the more capable model (Claude 3.5 Sonnet or GPT-4o). The .cursorrules file in the repository root acts as a persistent system prompt injection, surviving context window overflow.
Claude Code (Anthropic's terminal agent) uses the full 200K-token Claude 3.5 Sonnet context window aggressively, loading more files than most agents before triggering RAG. Its tool registry is terminal-native: Bash, Read, Write, Edit, Search. Rather than sandboxing, Claude Code runs directly in your local shell, giving it low latency at the cost of requiring the developer to supervise destructive operations. Its session management relies on the 200K window plus in-session file re-reads to detect stale context.
Devin / SWE-agent represents the most autonomous end of the spectrum. Rather than assisting a human developer in an editor, Devin receives a task (e.g., "fix GitHub issue #1234") and operates independently in a sandboxed cloud VM for hours. Its ReAct loop includes web browsing, test suite execution, and PR creation. Context management becomes critical at this scale: Devin uses hierarchical summarization (summarize sub-tasks before continuing) and writes intermediate plans to a scratchpad file that it reads back into context when needed, a primitive form of Tier 2 (external cache) coordination.
| Agent | Context Strategy | Tool Depth | Session Persistence |
|---|---|---|---|
| Copilot | Editor-scoped, short history | Code search only | Local JSON file |
| Cursor | RAG-first, model routing | File + search | Local LanceDB |
| Claude Code | Large window, aggressive loading | Full terminal | In-session only |
| Aider | Git-native, explicit file selection | Git + terminal | .aider.chat.history.md |
| Devin | Sandboxed VM, hierarchical summary | Browser + terminal + tests | Cloud session store |
Tracing a Complete Agent Session: From Feature Request to Committed Edit
The best way to make the internals concrete is to trace a realistic request end-to-end. This example walks through what happens when a developer asks Cursor: "Add pagination to the GET /users endpoint; use cursor-based pagination, not offset."
Turn 1: Context assembly. The orchestrator fires: embed the query, then retrieve the top-5 file chunks from the vector store (UserController.java, UserRepository.java, PageRequest.java, UserDto.java, and a test file). System prompt + .cursorrules (500 tokens) + retrieved chunks (8,200 tokens) + user message (25 tokens) = ~8,725 tokens sent to GPT-4o. Well within budget.
Turn 1: LLM response. The model emits a tool call: read_file("src/main/java/com/example/UserRepository.java"). It wants the full file, not just the retrieved chunk, to understand the existing query methods. The orchestrator executes the read (230ms), appends the result (1,850 tokens) to context, and re-prompts.
Turn 2: LLM generates the edit. With the full repository file in context, the model emits an edit_file tool call with the complete diff: adding a findByCursorId JPA method, updating the controller endpoint signature, and adding a CursorPageResponse DTO. The orchestrator applies the diff via patch.
Turn 3: Validation. The model emits run_command("./mvnw test -pl api-service -Dtest=UserControllerTest"). The test runner returns a failure: CursorPageResponse is missing a @JsonProperty("next_cursor") annotation. The model observes the error, reasons about the fix, and emits a second edit_file call for the DTO. Tests pass on rerun.
Total cost for this session: 4 LLM calls, 14,800 tokens consumed, ~45 seconds wall clock time. The key insight: the agent didn't just write code. It read the relevant files, ran tests, observed a failure, self-corrected, and produced working code. Each of these steps is a separate LLM call, not a single generation.
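The loop that drives a trace like this can be sketched in plain Python. Everything here is illustrative: the message format is simplified, and the llm and tools callables stand in for the real model API and tool executors.

```python
def run_agent(user_msg, llm, tools, max_steps=10):
    """Minimal ReAct-style orchestration loop: prompt, execute tool, observe, repeat."""
    messages = [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": user_msg},
    ]
    for _ in range(max_steps):  # hard step limit guards against infinite tool loops
        reply = llm(messages)   # one full API call: the model sees the whole history
        if reply.get("tool") is None:
            return reply["content"]            # no tool call means a final answer
        result = tools[reply["tool"]](**reply["args"])  # orchestrator executes, not the model
        messages.append({"role": "assistant", "content": repr(reply)})
        messages.append({"role": "tool", "content": result})  # observation fed back in
    return "step limit reached"
```

Each iteration is a separate, stateless LLM call; the growing messages list is the only thing carrying the session forward.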
🧭 Choosing Between Long Context, RAG, and Summarization: A Decision Guide
Not every project needs every technique. The right context management strategy depends on repository size, session length, and task type.
| Situation | Recommended Strategy | Why |
| --- | --- | --- |
| Small focused repo (< 100K tokens total) | Load all files into context | No retrieval overhead; model has full picture |
| Large monorepo (> 500K tokens) | RAG-first: embed + retrieve top-k chunks | Prevents context overflow; focuses model on relevant code |
| Long refactor session (> 30 turns) | Summarization + .cursorrules for constraints | History will overflow; constraints must survive truncation |
| One-shot code generation task | Single-shot LLM call, no agent loop | Agent loop overhead (3–6 extra API calls) is wasteful |
| Multi-step autonomous task (fix a bug + run tests) | Full ReAct loop with tool use | Multi-step requires observation/correction cycles |
| Team environment with shared codebase | Persistent vector store (Tier 3) + short sessions | Keeps individual context windows small; leverages shared index |
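The decision table can be condensed into a routing function. The thresholds come from the table; the function itself is an illustrative sketch, not any product's actual logic (the "selective file loading" middle case is an assumption for repos between the two thresholds).

```python
def pick_context_strategy(repo_tokens, expected_turns, multi_step):
    """Map repo size, session length, and task type to a context strategy."""
    if not multi_step:
        return "single-shot LLM call, no agent loop"
    if repo_tokens < 100_000:
        strategy = "load all files into context"
    elif repo_tokens > 500_000:
        strategy = "RAG-first: embed + retrieve top-k chunks"
    else:
        strategy = "selective file loading"  # middle ground not covered by the table
    if expected_turns > 30:
        # long sessions overflow history, so constraints must live in the system prompt
        strategy += ", plus summarization and rules in the system prompt"
    return strategy
```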
Anti-patterns to avoid:
- Using an agent for everything. For "explain this function," a single LLM call is faster and cheaper than a 5-tool-call agent session. Reserve agent loops for tasks that genuinely require multiple tool interactions.
- Trusting long sessions without .cursorrules. After 30+ turns, your early constraints are likely evicted. Any rule that must hold for the whole session belongs in the system prompt, not a conversation turn.
- Ignoring the "lost in the middle" effect. If you have 10 critical constraints, inject them as a numbered list at the beginning of the system prompt; do not bury them in the middle of a large context block.
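The positioning rule can be made concrete with a small assembly helper. This is a sketch, not any agent's real code: constraints numbered at the very start, the bulk of retrieved code in the middle, the user's question at the very end.

```python
def assemble_context(constraints, retrieved_chunks, question):
    """Order context to exploit the positions the model recalls best."""
    # Numbered rules go first: the start of the window gets high recall
    rules = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(constraints))
    # Retrieved code fills the middle, the weakest-recall region
    code = "\n---\n".join(retrieved_chunks)
    # The actual question goes last, the other high-recall position
    return f"CRITICAL RULES:\n{rules}\n\nRELEVANT CODE:\n{code}\n\nQUESTION: {question}"
```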
🛠️ Configuring Real Agents: Cursor Rules, Aider Flags, and OpenAI Tool Schemas
Cursor: injecting constraints via .cursorrules
Cursor reads a .cursorrules file from the repository root and prepends its content to every system prompt. This is the most direct way to influence what the agent "always knows" regardless of context window state.
```
# .cursorrules
You are working in a Spring Boot 3.x codebase.
Rules:
- All payment services must implement the PaymentProcessor interface
- Never use LegacyBillingClient; it is deprecated and must be removed
- Use constructor injection, not field injection
- All new services must be annotated with @Retryable(maxAttempts = 3)
- Target Java 21 record types for DTOs
Codebase layout:
- src/main/java/com/example/payment: payment domain
- src/main/java/com/example/core: shared infrastructure
- src/test: unit and integration tests (TestContainers)
```
Because .cursorrules is in the system prompt, it survives context window overflow. It is the single most effective mitigation against the "forgot the rules" problem.
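Mechanically, the orchestrator's side of this is simple. Roughly (an illustrative sketch, not Cursor's actual code): read the file and prepend it to the system prompt on every single call, so it can never be evicted along with conversation history.

```python
from pathlib import Path

def build_system_prompt(repo_root, base_prompt):
    """Prepend .cursorrules (if present) to the system prompt on every call."""
    rules_file = Path(repo_root) / ".cursorrules"
    if rules_file.exists():
        # Rules go first, so they sit at the high-recall start of the window
        return rules_file.read_text() + "\n\n" + base_prompt
    return base_prompt
```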
Aider: model selection and context configuration
Aider is a terminal-first coding agent that integrates directly with Git. It supports explicit model selection and history management:
```bash
# Use Claude 3.5 Sonnet with a 150K token context budget
aider --model claude-3-5-sonnet-20241022 \
  --context-window 150000 \
  --max-chat-history-tokens 30000 \
  src/payment/FileUploader.java src/core/RetryPolicy.java

# Enable auto-commits after each accepted edit
aider --auto-commits --model gpt-4o
```
The --max-chat-history-tokens flag triggers automatic summarization when the conversation history exceeds the threshold, preventing silent truncation from corrupting session state.
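The behavior behind such a threshold can be sketched as follows. This is illustrative, not Aider's implementation: count_tokens is a crude word count standing in for a real tokenizer, and summarize stands in for an LLM summarization call.

```python
def count_tokens(msg):
    return len(msg["content"].split())  # crude stand-in for a real tokenizer

def compact_history(history, budget, summarize):
    """Fold older turns into one summary message once the budget is exceeded."""
    if sum(count_tokens(m) for m in history) <= budget:
        return history                    # under budget: leave history intact
    keep = history[-2:]                   # always keep the newest exchange verbatim
    summary = summarize([m["content"] for m in history[:-2]])
    return [{"role": "system", "content": "Summary of earlier turns: " + summary}] + keep
```

Summarization is lossy by design; that is why rules that must hold for the whole session belong in the system prompt rather than in the turns being folded away.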
OpenAI function calling: the tool schema contract
This is the JSON schema that an orchestrator registers with the OpenAI API to expose a run_tests tool. The model reads this schema as part of the system prompt and emits structured function_call JSON when it wants to invoke the tool. This is protocol-level config, not application code:
```json
{
  "type": "function",
  "function": {
    "name": "run_tests",
    "description": "Execute the test suite for the specified module and return results",
    "parameters": {
      "type": "object",
      "properties": {
        "module": {
          "type": "string",
          "description": "Maven module name or path, e.g. 'payment-service'"
        },
        "test_filter": {
          "type": "string",
          "description": "Optional regex filter for test class names"
        }
      },
      "required": ["module"]
    }
  }
}
```
The orchestrator parses the model's function_call response, maps it to the actual mvn test -pl payment-service invocation, captures stdout/stderr, and returns it as a tool message in the next API call.
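The mapping step can be sketched like this. The tool name and the mvn invocation follow the schema above; the dispatcher itself is illustrative, and run is a stub so the sketch stays self-contained. Note that in the real OpenAI API the arguments field arrives as a JSON-encoded string, which is why it is decoded separately.

```python
import json

def dispatch_tool_call(function_call, run):
    """Translate a model-emitted run_tests call into an actual mvn command."""
    if function_call["name"] != "run_tests":
        raise ValueError(f"unknown tool: {function_call['name']}")
    args = function_call["arguments"]
    if isinstance(args, str):      # the real API returns arguments as a JSON string
        args = json.loads(args)
    cmd = ["mvn", "test", "-pl", args["module"]]
    if "test_filter" in args:      # optional parameter from the schema
        cmd.append(f"-Dtest={args['test_filter']}")
    # Captured output goes back to the model as a tool message on the next call
    return {"role": "tool", "name": "run_tests", "content": run(cmd)}
```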
For a full deep-dive on building LangGraph-based agent loops with tool calling, see LangChain Tools and Agents: The Classic Agent Loop and LangGraph ReAct Agent Pattern.
📌 Hard-Won Lessons from Production AI Coding Agent Deployments
The context window is your #1 operational constraint. Every architectural decision in a coding agent (RAG, summarization, tool result truncation, priority-based eviction) exists to serve one goal: fitting the right information into a finite window. When an agent "goes dumb," check the context, not the model.
.cursorrules is your safety net. The system prompt is the only part of context that survives a window overflow intact. Any constraint you genuinely need the agent to respect (interface contracts, deprecated class avoidance, style rules) belongs in the system prompt, not in a one-time message at the start of the session.
Statelessness is a feature, not a bug. The model's statelessness makes it deterministic (modulo temperature) and parallelizable. The difficulty of maintaining session state is an orchestration problem, not an LLM problem. Treating it as an LLM problem leads to frustration; treating it as an engineering problem leads to solutions.
Tool loops are a prompt engineering problem. When an agent gets stuck in a tool loop, the fix is almost always a better system prompt: explicit stop conditions ("if search returns empty, report that to the user and stop"), step limits in the orchestrator, and cleaner tool descriptions that reduce ambiguity.
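Alongside the prompt fixes, the orchestrator-side guards mentioned here (a hard step limit and repeated-call detection) can be sketched as follows; both thresholds are illustrative defaults, not values from any product.

```python
def guard(tool_calls, max_steps=15, repeat_window=3):
    """Return a stop reason, or None if the agent loop may continue.

    tool_calls is the session's history of (tool_name, args) tuples.
    """
    if len(tool_calls) >= max_steps:
        return "max steps exceeded"
    recent = tool_calls[-repeat_window:]
    # The same call repeated back-to-back signals a stuck agent
    if len(recent) == repeat_window and len(set(recent)) == 1:
        return "tool loop detected: same call repeated"
    return None
```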
The "lost in the middle" effect is real and measurable. In controlled experiments, models recall content at context positions 0–10% and 90–100% at significantly higher accuracy than positions 40–60%. For long contexts, structure your system prompt so the highest-priority constraints appear at the very beginning, and place the user's actual question at the very end of the assembled context.
Multi-agent architectures multiply context management complexity. Each sub-agent in a Devin-style system has its own finite context window. Passing state between sub-agents via a shared Redis cache introduces serialization overhead and information loss. More agents does not mean more memory โ it means more orchestration surface area to get wrong.
Streaming is not just UX. Streaming tokens back to the client allows the orchestrator to detect and interrupt tool calls mid-stream, implement real-time safety checks on generated code, and update the UI incrementally. Batch-mode LLM calls make all of this harder.
📝 TLDR Summary and Key Takeaways: What Every Developer Should Know About AI Coding Agents
- An AI coding agent is an LLM plus tool registry plus orchestration loop plus context management. The model itself is the smallest part of the system.
- The context window is finite and every token in it has been placed there by deliberate engineering decisions. When an agent forgets something, tokens were evicted.
- The ReAct loop (Reason → Act → Observe) drives every non-trivial agent action. A single user request can trigger 5–10 LLM API calls before producing a final answer.
- RAG (Retrieval-Augmented Generation) is how agents reference large codebases without exceeding the context window: embed → store → retrieve → inject.
- Memory operates across four tiers: in-context (ephemeral, fastest), external cache (Redis, session-scoped), vector store (persistent, semantic), and parametric (model weights, static).
- Sessions are entirely an agent-side abstraction. The LLM sees a fresh slate every call; the orchestrator reconstructs the session from its serialized history.
- Tool calls are structured JSON emitted by the model. The model never directly executes code: it expresses intent; the orchestrator acts.
- Critical constraints belong in .cursorrules or the system prompt, not in early conversation turns that will eventually be evicted.
- The "lost in the middle" problem means context position matters: put the most important instructions at the beginning and end of the assembled context.
🎯 Practice Quiz: Test Your Understanding of AI Coding Agent Internals
A developer reports that Claude Code "forgot" a critical design rule 40 minutes into a refactoring session. What is the most likely cause?
- A) The model was fine-tuned on different code and lost the instruction
- B) The conversation history grew large enough to push the instruction out of the context window via truncation
- C) Claude Code does not support multi-turn sessions
- D) The instruction was too short to be stored in the vector database
Correct Answer: B
Which component in the agent architecture is responsible for deciding whether to call a tool or generate a final answer?
- A) The LLM API
- B) The vector store
- C) The agent orchestrator running the ReAct loop
- D) The streaming response parser
Correct Answer: C
A coding agent needs to reference a 200,000-line monorepo but the model only has a 128K token context window. Which technique allows the agent to answer questions about the full codebase?
- A) Fine-tuning the model on the entire codebase
- B) Retrieval-Augmented Generation โ embedding files, semantic search, injecting top-k relevant chunks
- C) Splitting the codebase into 128K chunks and calling the model once per chunk
- D) Asking the user to paste the relevant files manually
Correct Answer: B
An agent generates a read_file tool call. Which description is most accurate?
- A) The LLM directly accesses the file system via an OS system call
- B) The LLM emits a structured JSON block expressing intent; the orchestrator executes the actual file read and injects the result back into context
- C) The LLM generates an approximation of the file contents from its training data
- D) The agent opens a subprocess that reads the file and streams characters to the model
Correct Answer: B
According to the "lost in the middle" finding, where in the context window does an LLM attend to content most reliably?
- A) The middle, because it is equidistant from both ends
- B) Uniformly across all positions โ attention is evenly distributed
- C) The beginning and end of the context window
- D) Wherever the system prompt is positioned
Correct Answer: C
Which of the four memory tiers is static โ meaning it cannot be updated without retraining the model?
- A) In-context memory (Tier 1)
- B) External short-term cache (Tier 2)
- C) Long-term vector store (Tier 3)
- D) Parametric memory: the model weights (Tier 4)
Correct Answer: D
An agent is stuck calling search_codebase repeatedly with slightly different queries without making progress. This is an example of which failure mode?
- A) Context overflow
- B) Stale context
- C) Infinite tool loop
- D) Cascading hallucination
Correct Answer: C
Why is .cursorrules more reliable than stating a constraint in an early conversation message?
- A) .cursorrules uses a different file format that the model processes differently
- B) .cursorrules is injected into the system prompt, which survives context window overflow; early conversation turns are evicted first
- C) .cursorrules bypasses the vector store and writes directly to model weights
- D) Early conversation messages are encrypted and not visible to the model
Correct Answer: B
Devin and E2B run tool execution in per-session containers. What is the primary reason for this sandboxing?
- A) To give the model direct access to the file system without latency
- B) To prevent arbitrary code executed by the agent from accessing host resources or persisting beyond the session
- C) To improve LLM token generation speed
- D) To allow multiple agents to share the same context window
Correct Answer: B
Open-ended challenge: You are building a coding agent for a 2-million-line enterprise Java monorepo. The repo has 4,000 source files. Your model has a 128K token context window and you expect sessions to run 60–90 minutes with 40–60 turns. Describe your context management architecture: which memory tiers you would use, how you would handle context overflow, what you would put in the system prompt versus retrieve via RAG, and how you would ensure that critical architectural constraints are never evicted. There is no single correct answer; justify your trade-offs. (Consider: a persistent vector store for the codebase, Redis for session state, summarization at 25K history tokens, critical rules in .cursorrules, file re-reads before edits to prevent stale context, and a max-turns guard with task checkpointing.)
🔗 Related Posts
- How GPT (LLM) Works: The Next Word Predictor
- AI Agents Explained: When LLMs Start Using Tools
- Context Window Management Strategies for Long Documents
- LangChain Tools and Agents: The Classic Agent Loop
- LangGraph ReAct Agent Pattern
- Multi-Step AI Agents: The Power of Planning
- RAG Explained: How to Give Your LLM a Brain Upgrade
- Embeddings Explained

Written by
Abstract Algorithms
@abstractalgorithms