Home/Blog/Llm/PagedAttention & KV-Cache Optimization: How vLLM Handles Large Scale Inference
LlmAdvancedβ€’14 min readβ€’

PagedAttention & KV-Cache Optimization: How vLLM Handles Large Scale Inference

How virtual memory paged block mapping solves GPU memory bottleneck and fragmentation in LLM inference.

Abstract Algorithms

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: Large Language Model (LLM) serving is heavily bound by GPU memory capacity due to the Key-Value (KV) cache. Traditional serving frameworks allocate contiguous memory based on maximum sequence length, leading to severe fragmentation and wasted space. This guide explores how PagedAttention, inspired by operating system virtual memory, partitions the KV cache into small logical blocks to enable dynamic, non-contiguous allocations, prefix sharing, and massive throughput gains.


πŸ“– Concept: GPU Memory Bottlenecks in LLM Inference

When running inference on large transformer models, the primary bottleneck is not computing capability, but memory bandwidth and memory capacity. This is particularly true during the autoregressive generation phase, where the model outputs tokens one by one.

At each step of generation, the self-attention mechanism requires the Query vector of the current token to be multiplied by the Key vectors of all previous tokens in the sequence, followed by soft-max scaling and multiplication with the Value vectors of all previous tokens. To avoid recalculating these Key and Value vectors at every single step, inference engines cache them in GPU high-bandwidth memory (HBM). This cache is known as the Key-Value (KV) Cache.

While caching keys and values dramatically reduces CPU and GPU compute cycles, it shifts the bottleneck to memory capacity. As sequence lengths grow and batch sizes increase, the KV cache scales linearly:

$$ \text{KV Cache Size} = 2 \times \text{Batch Size} \times \text{Sequence Length} \times \text{Layers} \times \text{Attention Heads} \times \text{Head Dimension} \times \text{Precision Bytes} $$

For a Llama-3-70B model running at FP16 precision with a batch size of 32 and a context length of 4096 tokens, the KV cache alone requires tens of gigabytes of GPU HBM.

Traditionally, serving frameworks allocated a single contiguous chunk of HBM for each request's KV cache. Since the engine cannot predict the final output sequence length in advance, it pre-allocated contiguous memory corresponding to the maximum possible sequence length (e.g., 2048 or 4096 tokens). If a request terminated early after generating only 50 tokens, the memory allocated for the remaining 4046 tokens remained reserved and completely unused. This design caused severe internal fragmentation (reserved memory that goes unused) and external fragmentation (free memory split into small segments, making it impossible to schedule new requests).


βš™οΈ Mechanics: Key-Value (KV) Cache and Memory Fragmentation

To understand why contiguous allocations degrade serving throughput, we must look at how memory is utilized over the lifetime of an active generation query.

When a prompt is first sent to the inference engine (the prefill phase), the model processes all prompt tokens in parallel, generating the initial KV cache tensors. Once the prefill phase is complete, the model enters the decode phase, generating one token at a time. Each generated token adds a new vector to the Key cache and another vector to the Value cache.

Under traditional contiguous memory systems, memory is reserved using static boundaries. The diagram below illustrates how this creates structural gaps. A request is allocated a fixed region matching the maximum context length:

RequestAllocated TokensMax ContextUnused / Reserved MemoryActual Utilization
Request A12020481928 slots5.8%
Request B80020481248 slots39.0%
Request C2000204848 slots97.6%

Because of these pre-allocations, the serving system can run out of GPU memory even when the actual memory utilized by active tokens is below 20%. The system must refuse incoming requests or queue them, lowering overall throughput. Additionally, external fragmentation prevents the engine from utilizing scattered memory slots since the underlying tensors require strict physical contiguity.


🧠 Deep Dive: PagedAttention Architecture and Block Mapping

PagedAttention solves these fragmentation issues by decoupling logical sequence coordinates from physical memory locations. Instead of requesting a single contiguous memory block for a sequence, the engine divides the KV cache of each sequence into small, fixed-size logical blocks.

Internals: Virtual Memory Analogy and Page Tables

Similar to virtual memory in operating systems, PagedAttention represents a sequence's KV cache as a series of logical blocks. Each logical block contains the Key and Value tensors for a fixed number of tokens (typically 16 or 32). These logical blocks do not need to be mapped to contiguous physical blocks in GPU memory.

The engine maintains a Block Table to map logical blocks to physical blocks. During self-attention computation, the PagedAttention CUDA kernel reads the block table mapping to retrieve keys and values from non-contiguous HBM addresses, performing the attention math dynamically.

Logical Space (Sequence):
[ Block 0 (Tokens 0-15) ] -> [ Block 1 (Tokens 16-31) ] -> [ Block 2 (Tokens 32-47) ]
      |                             |                             |
Block Table Mapping:
Logical Block 0 -------------> Physical Block 104 (HBM)
Logical Block 1 -------------> Physical Block 87  (HBM)
Logical Block 2 -------------> Physical Block 212 (HBM)

By using this lookup mechanism, physical blocks are allocated dynamically on-demand. When a sequence generates a new token that exceeds the capacity of its current block, the engine grabs a free physical block from the global pool and appends it to the sequence's block table.

Performance Analysis: Memory Utilization and Throughput Dynamics

Since physical blocks are small and allocated only when needed, internal fragmentation is limited to at most the fraction of the last block. For a block size of 16, the maximum wasted space per sequence is 15 tokens. Empirically, this reduces memory waste to under 4%, allowing serving systems to increase batch sizes by 2x to 4x compared to contiguous baselines.

Larger batch sizes allow GPU tensor cores to achieve higher occupancy, significantly improving decoding throughput. The table below compares the throughput and memory characteristics of static contiguous allocation against dynamic PagedAttention:

Performance MetricContiguous Static AllocationPagedAttention (vLLM)
Memory Waste60% – 80% (high fragmentation)< 4% (block boundary limit)
Effective Batch SizeLow (constrained by worst-case max length)High (dynamic scaling based on usage)
Latency VarianceLowLow to Moderate (minor page table overhead)
Prefix SharingImpossible (requires copying cache)Native (multiple mapping references)

Mathematical Model: Dynamic KV-Cache Space and Allocation Efficiency

Let $N$ be the total number of active sequences in a batch. Let $L_i(t)$ be the length of sequence $i$ at decode step $t$. Let $B$ be the block size (number of tokens per block).

The number of logical blocks $M_i(t)$ required by sequence $i$ at step $t$ is:

$$ M_i(t) = \lceil \frac{L_i(t)}{B} \rceil $$

Let $C_{token}$ represent the memory footprint (in bytes) of the Key and Value tensors for a single token across all layers, heads, and head dimensions:

$$ C_{token} = 2 \times \text{Layers} \times \text{Heads} \times \text{Head Dimension} \times \text{Precision Bytes} $$

The total physical memory $M_{alloc}(t)$ allocated by PagedAttention at step $t$ is:

$$ M_{alloc}(t) = \sum_{i=1}^{N} M_i(t) \times B \times C_{token} $$

In contrast, static contiguous allocation must pre-allocate space for the maximum sequence length $L_{max}$:

$$ M_{static} = N \times L_{max} \times C_{token} $$

The allocation efficiency ratio $\eta(t)$ can be modeled as:

$$ \eta(t) = \frac{M_{alloc}(t)}{M_{static}} = \frac{\sum_{i=1}^{N} \lceil \frac{L_i(t)}{B} \rceil \times B}{N \times L_{max}} $$

Because $Li(t) \ll L{max}$ for the majority of the decode steps, $\eta(t)$ is typically between $0.15$ and $0.40$, indicating that PagedAttention utilizes only a fraction of the memory required by static allocation methods to support the same concurrent workload.


πŸ—οΈ Advanced Concepts: Copy-on-Write (CoW) and Prefix Sharing

Beyond eliminating fragmentation, decoupling logical and physical memory blocks enables advanced memory-sharing patterns, such as Prefix Sharing and Parallel Sampling.

When prompting an LLM with common system prompts, long instructions, or document contexts, multiple requests share the same initial prefix. In a contiguous allocation system, this prefix must be evaluated and stored separately for each request.

With PagedAttention, the physical memory blocks storing the KV cache of the shared prefix can be mapped to the block tables of multiple requests simultaneously. The physical blocks are marked read-only.

Sequence A Block Table: [ Block 10 (Prefix) ] -> [ Block 11 (Prefix) ] -> [ Block 45 (Private A) ]
Sequence B Block Table: [ Block 10 (Prefix) ] -> [ Block 11 (Prefix) ] -> [ Block 78 (Private B) ]

If one of the sequences needs to modify a shared block (for example, in multi-turn chatting or branch decoding), the engine performs a Copy-on-Write (CoW) operation. It allocates a new physical block, copies the contents of the shared block, updates that sequence's block table to point to the new block, and performs the write operation. This mechanism keeps shared blocks safe and isolates individual generation streams.


πŸ“Š Flow: Token Generation Lifecycle with PagedAttention

The flowchart below traces how the inference engine processes incoming tokens, maps virtual indices using page lookup structures, and dynamically assigns GPU physical memory during execution:

flowchart TD
    Start[New Token Request Received] --> Prefill[Generate Initial KV Cache for Prompt]
    Prefill --> MapBlocks[Divide KV Cache into Blocks of Size B]
    MapBlocks --> Allocate[Allocate Physical Blocks from Free Pool]
    Allocate --> MapTable[Update Block Table: Logical -> Physical]
    MapTable --> Decode[Generate Next Token]
    Decode --> Check{Exceeds Current Block Space?}
    Check -->|No| Store[Store KV Tensors in Current Physical Block]
    Check -->|Yes| RequestBlock[Request New Block from Free Pool]
    RequestBlock --> AllocSuccess{Free Block Available?}
    AllocSuccess -->|Yes| Assign[Assign Block & Update Block Table]
    Assign --> Store
    AllocSuccess -->|No| Evict[Preempt/Swap Out Lowest Priority KV Cache]
    Evict --> Assign
    Store --> NextToken{End of Sequence?}
    NextToken -->|No| Decode
    NextToken -->|Yes| Free[Release Physical Blocks to Free Pool]

🌍 Applications: Production Inference Engines and Model Serving

  1. Enterprise LLM APIs: Frameworks like vLLM, TensorRT-LLM, and TGI use PagedAttention to serve thousands of concurrent users efficiently.
  2. Multi-Agent Systems: Agent frameworks generate multiple reasoning paths in parallel. Shared prefix mapping speeds up these parallel runs.
  3. Retrieval-Augmented Generation (RAG): When injecting large context files, shared document caches reduce GPU memory usage.

βš–οΈ Trade-offs and Failure Modes

  • Block Lookup Overhead: Accessing non-contiguous memory blocks in HBM introduces minor latency penalties compared to reading flat contiguous memory. However, this is offset by the throughput benefits of larger batch sizes.
  • Preemption and Swapping: If the physical block pool is completely exhausted, the engine must swap blocks out to host CPU memory or evict queries entirely, introducing latency spikes.
  • Metadata Overhead: Tracking thousands of page tables across multiple layers and sequences adds CPU bookkeeping work.

🧭 Decision Guide: Custom Memory Allocations vs. Page Tables

The table below contrasts PagedAttention against other memory management techniques for LLM serving:

Management TechniqueGPU Memory WastePrefix SharingSetup OverheadIdeal Use Case
Static ContiguousVery High (60-80%)ImpossibleZeroSingle-user local inference
Dynamic ContiguousHigh (internal gaps)DifficultModerateLow-concurrency APIs
PagedAttentionMinimal (< 4%)Native (CoW)High (needs block table)High-throughput APIs
Host Virtual Memory SwapLowDynamicVery High (PCIe latency)Edge devices running large models

πŸ§ͺ Practical Implementation: Simulated Block Allocation Lookup Table in Python

Below is a complete, runnable Python script simulating a virtual page manager that allocates token KV tensors to logical blocks, updates lookup tables, and maps virtual sequence addresses:

import math
from typing import Dict, List, Optional

class KVBlockManager:
    def __init__(self, block_size: int, total_physical_blocks: int):
        self.block_size = block_size
        self.total_blocks = total_physical_blocks
        self.free_blocks: List[int] = list(range(total_physical_blocks))

        # Maps sequence_id -> List of physical block IDs
        self.block_tables: Dict[str, List[int]] = {}
        # Maps sequence_id -> current token count
        self.sequence_token_counts: Dict[str, int] = {}

    def register_sequence(self, seq_id: str, prompt_len: int) -> bool:
        if seq_id in self.block_tables:
            raise ValueError(f"Sequence {seq_id} is already registered.")

        needed_blocks = math.ceil(prompt_len / self.block_size)
        if len(self.free_blocks) < needed_blocks:
            print(f"Memory allocation failed: Not enough free blocks for prompt. Required: {needed_blocks}, Free: {len(self.free_blocks)}")
            return False

        allocated = []
        for _ in range(needed_blocks):
            allocated.append(self.free_blocks.pop(0))

        self.block_tables[seq_id] = allocated
        self.sequence_token_counts[seq_id] = prompt_len
        print(f"Registered sequence {seq_id} ({prompt_len} tokens). Allocated blocks: {allocated}")
        return True

    def append_token(self, seq_id: str) -> bool:
        if seq_id not in self.block_tables:
            raise ValueError(f"Sequence {seq_id} is not registered.")

        current_tokens = self.sequence_token_counts[seq_id]
        current_blocks = self.block_tables[seq_id]

        # Check if we need to allocate a new physical block
        if current_tokens % self.block_size == 0:
            if not self.free_blocks:
                print(f"Memory Full! Cannot allocate block for sequence {seq_id}. Triggering preemption...")
                return False

            new_block = self.free_blocks.pop(0)
            current_blocks.append(new_block)
            print(f"Allocated new block {new_block} for sequence {seq_id} at token {current_tokens}")

        self.sequence_token_counts[seq_id] = current_tokens + 1
        return True

    def free_sequence(self, seq_id: str):
        if seq_id not in self.block_tables:
            return

        released_blocks = self.block_tables.pop(seq_id)
        self.sequence_token_counts.pop(seq_id)
        self.free_blocks.extend(released_blocks)
        # Keep physical blocks sorted to minimize memory fragmentation
        self.free_blocks.sort()
        print(f"Released blocks {released_blocks} from sequence {seq_id}. Free blocks remaining: {len(self.free_blocks)}")

    def get_physical_address(self, seq_id: str, token_index: int) -> Optional[tuple]:
        if seq_id not in self.block_tables:
            return None

        if token_index >= self.sequence_token_counts[seq_id]:
            return None

        logical_block_index = token_index // self.block_size
        block_offset = token_index % self.block_size

        physical_block_id = self.block_tables[seq_id][logical_block_index]
        return (physical_block_id, block_offset)

def run_paged_attention_simulation():
    print("--- Starting PagedAttention Simulator ---")
    # Initialize manager: 16 tokens per block, 10 physical blocks total
    manager = KVBlockManager(block_size=16, total_physical_blocks=10)
    print(f"Initial Free Blocks: {manager.free_blocks}\n")

    # 1. Register two sequences
    seq_a = "req_001"
    seq_b = "req_002"

    # Prompt A has 24 tokens (needs 2 blocks)
    manager.register_sequence(seq_a, prompt_len=24)
    # Prompt B has 35 tokens (needs 3 blocks)
    manager.register_sequence(seq_b, prompt_len=35)
    print(f"Free Blocks remaining: {len(manager.free_blocks)}\n")

    # 2. Map logical token index to physical memory block coordinates
    target_token = 20
    address = manager.get_physical_address(seq_a, target_token)
    if address:
        block_id, offset = address
        print(f"Token {target_token} of {seq_a} is mapped to Block {block_id} at offset {offset}.")

    # 3. Simulate generation steps (appending tokens)
    print("\n--- Simulating Decode Iterations ---")
    # Generate 10 tokens for Sequence A (from 24 to 34 tokens)
    for _ in range(10):
        if not manager.append_token(seq_a):
            break

    # Trace mapping after generation
    new_target_token = 33
    address = manager.get_physical_address(seq_a, new_target_token)
    if address:
        block_id, offset = address
        print(f"After generation: Token {new_target_token} of {seq_a} mapped to Block {block_id} at offset {offset}.")

    # 4. Clean up sequences
    print("\n--- Cleaning Up Resources ---")
    manager.free_sequence(seq_a)
    manager.free_sequence(seq_b)
    print(f"Final Free Blocks: {manager.free_blocks}")

if __name__ == "__main__":
    run_paged_attention_simulation()

πŸ“š Lessons Learned: Production Integration Gotchas

  1. Tune the Block Size Carefully: Small blocks (e.g., 8 tokens) reduce fragmentation but increase block lookup overhead. Large blocks (e.g., 64 tokens) reduce lookup latency but increase internal fragmentation. Most production systems use a block size of 16 or 32 tokens.
  2. Handle Swap Space Latency: When HBM is full, swapping blocks to CPU RAM (using pinned memory) prevents server crashes, but the PCIe transmission latency during decode runs can cause sudden response stalls.
  3. Warm Up Block Pools: Create and allocate the full pool of physical blocks at server startup rather than executing dynamic allocations during inference requests. This avoids memory allocation overhead inside active CUDA kernels.

πŸ“Œ Summary: The LLM Memory Cheatsheet

  • Memory Bottleneck: LLM generation throughput is bound by GPU memory capacity, driven by the linear growth of the Key-Value (KV) cache.
  • PagedAttention Solution: Divides the KV cache into fixed-size blocks (16 or 32 tokens), mapping logical blocks to non-contiguous physical memory via page tables.
  • Zero Fragmentation: Eliminates internal and external fragmentation, allowing batch sizes to scale 2x to 4x.
  • Prefix Sharing: Enables shared read-only blocks for common system prompts, copying blocks only when written to (Copy-on-Write).
  • Block Allocators: Allocate the full physical block pool once at startup to avoid runtime allocation overhead.

AI-generated article quiz

Test your understanding

🧠

Ready to test what you just learned?

Generate four focused questions from this article. Answers include immediate explanations.

Guided series path

LLM Engineering

View all lessons β†’
Lesson 48 of 50

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.

Sign in to save your rating.