PagedAttention & KV-Cache Optimization: How vLLM Handles Large Scale Inference
How virtual memory paged block mapping solves GPU memory bottleneck and fragmentation in LLM inference.

Abstract Algorithms
Helping engineers master software engineering topics.
TLDR: Large Language Model (LLM) serving is heavily bound by GPU memory capacity due to the Key-Value (KV) cache. Traditional serving frameworks allocate contiguous memory based on maximum sequence length, leading to severe fragmentation and wasted space. This guide explores how PagedAttention, inspired by operating system virtual memory, partitions the KV cache into small logical blocks to enable dynamic, non-contiguous allocations, prefix sharing, and massive throughput gains.
π Concept: GPU Memory Bottlenecks in LLM Inference
When running inference on large transformer models, the primary bottleneck is not computing capability, but memory bandwidth and memory capacity. This is particularly true during the autoregressive generation phase, where the model outputs tokens one by one.
At each step of generation, the self-attention mechanism requires the Query vector of the current token to be multiplied by the Key vectors of all previous tokens in the sequence, followed by soft-max scaling and multiplication with the Value vectors of all previous tokens. To avoid recalculating these Key and Value vectors at every single step, inference engines cache them in GPU high-bandwidth memory (HBM). This cache is known as the Key-Value (KV) Cache.
While caching keys and values dramatically reduces CPU and GPU compute cycles, it shifts the bottleneck to memory capacity. As sequence lengths grow and batch sizes increase, the KV cache scales linearly:
$$ \text{KV Cache Size} = 2 \times \text{Batch Size} \times \text{Sequence Length} \times \text{Layers} \times \text{Attention Heads} \times \text{Head Dimension} \times \text{Precision Bytes} $$
For a Llama-3-70B model running at FP16 precision with a batch size of 32 and a context length of 4096 tokens, the KV cache alone requires tens of gigabytes of GPU HBM.
Traditionally, serving frameworks allocated a single contiguous chunk of HBM for each request's KV cache. Since the engine cannot predict the final output sequence length in advance, it pre-allocated contiguous memory corresponding to the maximum possible sequence length (e.g., 2048 or 4096 tokens). If a request terminated early after generating only 50 tokens, the memory allocated for the remaining 4046 tokens remained reserved and completely unused. This design caused severe internal fragmentation (reserved memory that goes unused) and external fragmentation (free memory split into small segments, making it impossible to schedule new requests).
βοΈ Mechanics: Key-Value (KV) Cache and Memory Fragmentation
To understand why contiguous allocations degrade serving throughput, we must look at how memory is utilized over the lifetime of an active generation query.
When a prompt is first sent to the inference engine (the prefill phase), the model processes all prompt tokens in parallel, generating the initial KV cache tensors. Once the prefill phase is complete, the model enters the decode phase, generating one token at a time. Each generated token adds a new vector to the Key cache and another vector to the Value cache.
Under traditional contiguous memory systems, memory is reserved using static boundaries. The diagram below illustrates how this creates structural gaps. A request is allocated a fixed region matching the maximum context length:
| Request | Allocated Tokens | Max Context | Unused / Reserved Memory | Actual Utilization |
| Request A | 120 | 2048 | 1928 slots | 5.8% |
| Request B | 800 | 2048 | 1248 slots | 39.0% |
| Request C | 2000 | 2048 | 48 slots | 97.6% |
Because of these pre-allocations, the serving system can run out of GPU memory even when the actual memory utilized by active tokens is below 20%. The system must refuse incoming requests or queue them, lowering overall throughput. Additionally, external fragmentation prevents the engine from utilizing scattered memory slots since the underlying tensors require strict physical contiguity.
π§ Deep Dive: PagedAttention Architecture and Block Mapping
PagedAttention solves these fragmentation issues by decoupling logical sequence coordinates from physical memory locations. Instead of requesting a single contiguous memory block for a sequence, the engine divides the KV cache of each sequence into small, fixed-size logical blocks.
Internals: Virtual Memory Analogy and Page Tables
Similar to virtual memory in operating systems, PagedAttention represents a sequence's KV cache as a series of logical blocks. Each logical block contains the Key and Value tensors for a fixed number of tokens (typically 16 or 32). These logical blocks do not need to be mapped to contiguous physical blocks in GPU memory.
The engine maintains a Block Table to map logical blocks to physical blocks. During self-attention computation, the PagedAttention CUDA kernel reads the block table mapping to retrieve keys and values from non-contiguous HBM addresses, performing the attention math dynamically.
Logical Space (Sequence):
[ Block 0 (Tokens 0-15) ] -> [ Block 1 (Tokens 16-31) ] -> [ Block 2 (Tokens 32-47) ]
| | |
Block Table Mapping:
Logical Block 0 -------------> Physical Block 104 (HBM)
Logical Block 1 -------------> Physical Block 87 (HBM)
Logical Block 2 -------------> Physical Block 212 (HBM)
By using this lookup mechanism, physical blocks are allocated dynamically on-demand. When a sequence generates a new token that exceeds the capacity of its current block, the engine grabs a free physical block from the global pool and appends it to the sequence's block table.
Performance Analysis: Memory Utilization and Throughput Dynamics
Since physical blocks are small and allocated only when needed, internal fragmentation is limited to at most the fraction of the last block. For a block size of 16, the maximum wasted space per sequence is 15 tokens. Empirically, this reduces memory waste to under 4%, allowing serving systems to increase batch sizes by 2x to 4x compared to contiguous baselines.
Larger batch sizes allow GPU tensor cores to achieve higher occupancy, significantly improving decoding throughput. The table below compares the throughput and memory characteristics of static contiguous allocation against dynamic PagedAttention:
| Performance Metric | Contiguous Static Allocation | PagedAttention (vLLM) |
| Memory Waste | 60% β 80% (high fragmentation) | < 4% (block boundary limit) |
| Effective Batch Size | Low (constrained by worst-case max length) | High (dynamic scaling based on usage) |
| Latency Variance | Low | Low to Moderate (minor page table overhead) |
| Prefix Sharing | Impossible (requires copying cache) | Native (multiple mapping references) |
Mathematical Model: Dynamic KV-Cache Space and Allocation Efficiency
Let $N$ be the total number of active sequences in a batch. Let $L_i(t)$ be the length of sequence $i$ at decode step $t$. Let $B$ be the block size (number of tokens per block).
The number of logical blocks $M_i(t)$ required by sequence $i$ at step $t$ is:
$$ M_i(t) = \lceil \frac{L_i(t)}{B} \rceil $$
Let $C_{token}$ represent the memory footprint (in bytes) of the Key and Value tensors for a single token across all layers, heads, and head dimensions:
$$ C_{token} = 2 \times \text{Layers} \times \text{Heads} \times \text{Head Dimension} \times \text{Precision Bytes} $$
The total physical memory $M_{alloc}(t)$ allocated by PagedAttention at step $t$ is:
$$ M_{alloc}(t) = \sum_{i=1}^{N} M_i(t) \times B \times C_{token} $$
In contrast, static contiguous allocation must pre-allocate space for the maximum sequence length $L_{max}$:
$$ M_{static} = N \times L_{max} \times C_{token} $$
The allocation efficiency ratio $\eta(t)$ can be modeled as:
$$ \eta(t) = \frac{M_{alloc}(t)}{M_{static}} = \frac{\sum_{i=1}^{N} \lceil \frac{L_i(t)}{B} \rceil \times B}{N \times L_{max}} $$
Because $Li(t) \ll L{max}$ for the majority of the decode steps, $\eta(t)$ is typically between $0.15$ and $0.40$, indicating that PagedAttention utilizes only a fraction of the memory required by static allocation methods to support the same concurrent workload.
ποΈ Advanced Concepts: Copy-on-Write (CoW) and Prefix Sharing
Beyond eliminating fragmentation, decoupling logical and physical memory blocks enables advanced memory-sharing patterns, such as Prefix Sharing and Parallel Sampling.
When prompting an LLM with common system prompts, long instructions, or document contexts, multiple requests share the same initial prefix. In a contiguous allocation system, this prefix must be evaluated and stored separately for each request.
With PagedAttention, the physical memory blocks storing the KV cache of the shared prefix can be mapped to the block tables of multiple requests simultaneously. The physical blocks are marked read-only.
Sequence A Block Table: [ Block 10 (Prefix) ] -> [ Block 11 (Prefix) ] -> [ Block 45 (Private A) ]
Sequence B Block Table: [ Block 10 (Prefix) ] -> [ Block 11 (Prefix) ] -> [ Block 78 (Private B) ]
If one of the sequences needs to modify a shared block (for example, in multi-turn chatting or branch decoding), the engine performs a Copy-on-Write (CoW) operation. It allocates a new physical block, copies the contents of the shared block, updates that sequence's block table to point to the new block, and performs the write operation. This mechanism keeps shared blocks safe and isolates individual generation streams.
π Flow: Token Generation Lifecycle with PagedAttention
The flowchart below traces how the inference engine processes incoming tokens, maps virtual indices using page lookup structures, and dynamically assigns GPU physical memory during execution:
flowchart TD
Start[New Token Request Received] --> Prefill[Generate Initial KV Cache for Prompt]
Prefill --> MapBlocks[Divide KV Cache into Blocks of Size B]
MapBlocks --> Allocate[Allocate Physical Blocks from Free Pool]
Allocate --> MapTable[Update Block Table: Logical -> Physical]
MapTable --> Decode[Generate Next Token]
Decode --> Check{Exceeds Current Block Space?}
Check -->|No| Store[Store KV Tensors in Current Physical Block]
Check -->|Yes| RequestBlock[Request New Block from Free Pool]
RequestBlock --> AllocSuccess{Free Block Available?}
AllocSuccess -->|Yes| Assign[Assign Block & Update Block Table]
Assign --> Store
AllocSuccess -->|No| Evict[Preempt/Swap Out Lowest Priority KV Cache]
Evict --> Assign
Store --> NextToken{End of Sequence?}
NextToken -->|No| Decode
NextToken -->|Yes| Free[Release Physical Blocks to Free Pool]
π Applications: Production Inference Engines and Model Serving
- Enterprise LLM APIs: Frameworks like vLLM, TensorRT-LLM, and TGI use PagedAttention to serve thousands of concurrent users efficiently.
- Multi-Agent Systems: Agent frameworks generate multiple reasoning paths in parallel. Shared prefix mapping speeds up these parallel runs.
- Retrieval-Augmented Generation (RAG): When injecting large context files, shared document caches reduce GPU memory usage.
βοΈ Trade-offs and Failure Modes
- Block Lookup Overhead: Accessing non-contiguous memory blocks in HBM introduces minor latency penalties compared to reading flat contiguous memory. However, this is offset by the throughput benefits of larger batch sizes.
- Preemption and Swapping: If the physical block pool is completely exhausted, the engine must swap blocks out to host CPU memory or evict queries entirely, introducing latency spikes.
- Metadata Overhead: Tracking thousands of page tables across multiple layers and sequences adds CPU bookkeeping work.
π§ Decision Guide: Custom Memory Allocations vs. Page Tables
The table below contrasts PagedAttention against other memory management techniques for LLM serving:
| Management Technique | GPU Memory Waste | Prefix Sharing | Setup Overhead | Ideal Use Case |
| Static Contiguous | Very High (60-80%) | Impossible | Zero | Single-user local inference |
| Dynamic Contiguous | High (internal gaps) | Difficult | Moderate | Low-concurrency APIs |
| PagedAttention | Minimal (< 4%) | Native (CoW) | High (needs block table) | High-throughput APIs |
| Host Virtual Memory Swap | Low | Dynamic | Very High (PCIe latency) | Edge devices running large models |
π§ͺ Practical Implementation: Simulated Block Allocation Lookup Table in Python
Below is a complete, runnable Python script simulating a virtual page manager that allocates token KV tensors to logical blocks, updates lookup tables, and maps virtual sequence addresses:
import math
from typing import Dict, List, Optional
class KVBlockManager:
def __init__(self, block_size: int, total_physical_blocks: int):
self.block_size = block_size
self.total_blocks = total_physical_blocks
self.free_blocks: List[int] = list(range(total_physical_blocks))
# Maps sequence_id -> List of physical block IDs
self.block_tables: Dict[str, List[int]] = {}
# Maps sequence_id -> current token count
self.sequence_token_counts: Dict[str, int] = {}
def register_sequence(self, seq_id: str, prompt_len: int) -> bool:
if seq_id in self.block_tables:
raise ValueError(f"Sequence {seq_id} is already registered.")
needed_blocks = math.ceil(prompt_len / self.block_size)
if len(self.free_blocks) < needed_blocks:
print(f"Memory allocation failed: Not enough free blocks for prompt. Required: {needed_blocks}, Free: {len(self.free_blocks)}")
return False
allocated = []
for _ in range(needed_blocks):
allocated.append(self.free_blocks.pop(0))
self.block_tables[seq_id] = allocated
self.sequence_token_counts[seq_id] = prompt_len
print(f"Registered sequence {seq_id} ({prompt_len} tokens). Allocated blocks: {allocated}")
return True
def append_token(self, seq_id: str) -> bool:
if seq_id not in self.block_tables:
raise ValueError(f"Sequence {seq_id} is not registered.")
current_tokens = self.sequence_token_counts[seq_id]
current_blocks = self.block_tables[seq_id]
# Check if we need to allocate a new physical block
if current_tokens % self.block_size == 0:
if not self.free_blocks:
print(f"Memory Full! Cannot allocate block for sequence {seq_id}. Triggering preemption...")
return False
new_block = self.free_blocks.pop(0)
current_blocks.append(new_block)
print(f"Allocated new block {new_block} for sequence {seq_id} at token {current_tokens}")
self.sequence_token_counts[seq_id] = current_tokens + 1
return True
def free_sequence(self, seq_id: str):
if seq_id not in self.block_tables:
return
released_blocks = self.block_tables.pop(seq_id)
self.sequence_token_counts.pop(seq_id)
self.free_blocks.extend(released_blocks)
# Keep physical blocks sorted to minimize memory fragmentation
self.free_blocks.sort()
print(f"Released blocks {released_blocks} from sequence {seq_id}. Free blocks remaining: {len(self.free_blocks)}")
def get_physical_address(self, seq_id: str, token_index: int) -> Optional[tuple]:
if seq_id not in self.block_tables:
return None
if token_index >= self.sequence_token_counts[seq_id]:
return None
logical_block_index = token_index // self.block_size
block_offset = token_index % self.block_size
physical_block_id = self.block_tables[seq_id][logical_block_index]
return (physical_block_id, block_offset)
def run_paged_attention_simulation():
print("--- Starting PagedAttention Simulator ---")
# Initialize manager: 16 tokens per block, 10 physical blocks total
manager = KVBlockManager(block_size=16, total_physical_blocks=10)
print(f"Initial Free Blocks: {manager.free_blocks}\n")
# 1. Register two sequences
seq_a = "req_001"
seq_b = "req_002"
# Prompt A has 24 tokens (needs 2 blocks)
manager.register_sequence(seq_a, prompt_len=24)
# Prompt B has 35 tokens (needs 3 blocks)
manager.register_sequence(seq_b, prompt_len=35)
print(f"Free Blocks remaining: {len(manager.free_blocks)}\n")
# 2. Map logical token index to physical memory block coordinates
target_token = 20
address = manager.get_physical_address(seq_a, target_token)
if address:
block_id, offset = address
print(f"Token {target_token} of {seq_a} is mapped to Block {block_id} at offset {offset}.")
# 3. Simulate generation steps (appending tokens)
print("\n--- Simulating Decode Iterations ---")
# Generate 10 tokens for Sequence A (from 24 to 34 tokens)
for _ in range(10):
if not manager.append_token(seq_a):
break
# Trace mapping after generation
new_target_token = 33
address = manager.get_physical_address(seq_a, new_target_token)
if address:
block_id, offset = address
print(f"After generation: Token {new_target_token} of {seq_a} mapped to Block {block_id} at offset {offset}.")
# 4. Clean up sequences
print("\n--- Cleaning Up Resources ---")
manager.free_sequence(seq_a)
manager.free_sequence(seq_b)
print(f"Final Free Blocks: {manager.free_blocks}")
if __name__ == "__main__":
run_paged_attention_simulation()
π Lessons Learned: Production Integration Gotchas
- Tune the Block Size Carefully: Small blocks (e.g., 8 tokens) reduce fragmentation but increase block lookup overhead. Large blocks (e.g., 64 tokens) reduce lookup latency but increase internal fragmentation. Most production systems use a block size of 16 or 32 tokens.
- Handle Swap Space Latency: When HBM is full, swapping blocks to CPU RAM (using pinned memory) prevents server crashes, but the PCIe transmission latency during decode runs can cause sudden response stalls.
- Warm Up Block Pools: Create and allocate the full pool of physical blocks at server startup rather than executing dynamic allocations during inference requests. This avoids memory allocation overhead inside active CUDA kernels.
π Summary: The LLM Memory Cheatsheet
- Memory Bottleneck: LLM generation throughput is bound by GPU memory capacity, driven by the linear growth of the Key-Value (KV) cache.
- PagedAttention Solution: Divides the KV cache into fixed-size blocks (16 or 32 tokens), mapping logical blocks to non-contiguous physical memory via page tables.
- Zero Fragmentation: Eliminates internal and external fragmentation, allowing batch sizes to scale 2x to 4x.
- Prefix Sharing: Enables shared read-only blocks for common system prompts, copying blocks only when written to (Copy-on-Write).
- Block Allocators: Allocate the full physical block pool once at startup to avoid runtime allocation overhead.
π Related Posts
AI-generated article quiz
Test your understanding
Ready to test what you just learned?
Generate four focused questions from this article. Answers include immediate explanations.
Guided series path
LLM Engineering
Reader feedback
Was this article useful?
Rate it if it helped, then continue with the next deep dive when you are ready.
Sign in to save your rating.
Article metadata