Dense LLM Architecture: How Every Parameter Works on Every Token
Inside the dense transformer block — attention heads, FFN layers, and why scaling dense models hits a wall
TLDR: In a dense LLM every single parameter is active for every token in every forward pass — no routing, no selection. A transformer block runs multi-head self-attention (Q, K, V) followed by a feed-forward network (FFN) with roughly 4× the hidden dimension. Stack 32–96 of those blocks and you have GPT-3, LLaMA-2, or Mistral. The compute cost scales with the product of parameters and tokens, which is why models like Mixtral replaced the dense FFN layers with Mixture-of-Experts routing.
📖 The Universal Participation Rule: What Makes a Model "Dense"
You are teaching a class of 175 billion students. Every time a question comes in, every student raises their hand. Not just the specialists, not just the ones who slept through yesterday's lecture — every single one. That is how a dense language model works.
In deep learning, dense means every parameter participates in every forward pass for every input token. When GPT-3 processes the word "Paris" in the sentence "The capital of France is Paris", all 175 billion weights activate, compute something, and contribute to the output. Nothing sits idle. Nothing is routed away.
This is the direct opposite of a sparse model. Sparse architectures — most famously Mixture-of-Experts (MoE) — maintain a much larger pool of parameters but only activate a small fraction of them per token, selecting the most relevant expert sub-networks via a learned router. In a dense model, that router does not exist. Every neuron participates unconditionally.
Why does this matter? Because it has direct consequences for:
- Compute cost — more active parameters means more floating-point operations per token
- Memory bandwidth — every weight must be loaded from GPU memory for every forward pass
- Scaling behavior — accuracy predictably improves with parameter count and training tokens (Chinchilla), but cost rises proportionally
Understanding the dense architecture is the prerequisite for understanding why the field moved to MoE at scale. Before you can appreciate the router, you need to understand what it routes around.
⚙️ The Transformer Block: Anatomy of a Single Dense Layer
Every large language model is built by stacking identical transformer blocks. Within each block, every token passes through the same two major sub-layers in sequence: multi-head self-attention and a feed-forward network (FFN). Residual connections and layer norms wrap each sub-layer to stabilize gradients during training.
Here is the precise data flow through one transformer block:
- Input: A tensor of shape `[batch, sequence_length, model_dim]` arrives — one embedding vector per token.
- Layer Norm 1: Each token's embedding is normalized before attention. Pre-norm (applied before the sub-layer) is the modern convention used by LLaMA, GPT-NeoX, and Mistral.
- Multi-Head Self-Attention: Each token queries all other tokens, producing a context-enriched representation.
- Residual Add: The original (pre-attention) embedding is added back to the attention output, preserving the gradient path.
- Layer Norm 2: The residual-summed tensor is normalized again before the FFN.
- Feed-Forward Network: Two linear projections with a non-linear activation in between expand and compress the representation.
- Residual Add: The pre-FFN embedding is added back, completing the block.
- Output: The same shape `[batch, sequence_length, model_dim]` passes to the next block.
The diagram below traces this data flow. Notice that the residual connections create two parallel paths — one through the sub-layer and one that bypasses it — which sum at the add nodes. This is what allows very deep stacks (32–96 blocks) to train without vanishing gradients.
```mermaid
graph TD
    A[Input Token Embeddings] --> LN1[Layer Norm 1]
    LN1 --> MHA[Multi-Head Self-Attention Q K V]
    MHA --> ADD1[Residual Add]
    A --> ADD1
    ADD1 --> LN2[Layer Norm 2]
    LN2 --> FFN[Feed-Forward Network - two linear layers with activation]
    FFN --> ADD2[Residual Add]
    ADD1 --> ADD2
    ADD2 --> OUT[Output to Next Transformer Block]
```
The residual path (the direct arrows from A to ADD1 and from ADD1 to ADD2) is not decorative — it is structurally essential. Without residuals, gradients from the loss function would shrink exponentially as they back-propagate through dozens of blocks, making training impossible. The residual shortcut provides a gradient highway that bypasses each sub-layer entirely.
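The data flow above can be sketched end to end in NumPy. This is a toy version with made-up dimensions, a single attention head, and no causal mask, meant only to show the pre-norm layout and the two residual adds:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head, mask-free self-attention kept minimal for clarity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)          # row-wise softmax
    return (w @ v) @ Wo

def ffn(x, W_up, W_down):
    # Expand to 4x d_model, apply ReLU, compress back.
    return np.maximum(x @ W_up, 0.0) @ W_down

def transformer_block(x, p):
    # Pre-norm layout: norm -> sublayer -> residual add, twice.
    x = x + attention(layer_norm(x), *p["attn"])
    x = x + ffn(layer_norm(x), *p["ffn"])
    return x

rng = np.random.default_rng(0)
d, seq = 64, 5                                # toy sizes, not a real model
p = {
    "attn": [rng.normal(0, 0.02, (d, d)) for _ in range(4)],
    "ffn": [rng.normal(0, 0.02, (d, 4 * d)),
            rng.normal(0, 0.02, (4 * d, d))],
}
x = rng.normal(size=(seq, d))
y = transformer_block(x, p)
print(y.shape)                                # (5, 64): shape is preserved
```

Real implementations add causal masking, many heads, and learned norm scales, but the residual structure is exactly this: each sub-layer computes a delta that is added onto an untouched copy of its input.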
🔍 How Attention Heads Divide and Conquer Token Relationships
Self-attention allows every token to look at every other token in the sequence simultaneously. But running attention just once would force a single set of weights to capture every type of linguistic relationship — syntax, semantics, coreference, positional proximity — all at once. Multi-head attention solves this by running several independent attention operations in parallel.
The model dimension d_model is split evenly across h heads. Each head works in a lower-dimensional space of size d_head = d_model / h. This means a 4096-dim model with 32 heads gives each head a 128-dim space to work in. Within that space, a head learns its own Q, K, V projection matrices and discovers its own relationship pattern. The outputs of all heads are concatenated and projected back to d_model.
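The split itself is just dimension bookkeeping. A shape-only NumPy sketch (the token count of 5 is arbitrary, and real attention applies the Q, K, V projections before reshaping):

```python
import numpy as np

d_model, n_heads = 4096, 32
d_head = d_model // n_heads              # 128, as stated in the text

x = np.zeros((5, d_model))               # five token vectors
heads = x.reshape(5, n_heads, d_head)    # each head gets its own 128-dim slice
merged = heads.reshape(5, d_model)       # concatenation restores d_model

print(d_head, heads.shape, merged.shape) # 128 (5, 32, 128) (5, 4096)
```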
In practice, heads specialize on their own without being explicitly told to:
| Head type | Typical behavior |
| --- | --- |
| Syntactic head | Attends from a verb to its subject; captures grammatical agreement |
| Coreference head | Attends from a pronoun to its antecedent across many tokens |
| Positional head | Attends strongly to adjacent tokens; captures local n-gram context |
| Semantic head | Attends to thematically related tokens regardless of distance |
| Delimiter head | Attends to sentence boundaries and punctuation anchors |
The table below shows how head count and model dimension scale across four well-known dense models:
| Model | Parameters | Layers | Heads (Q) | KV Heads | d_model | d_head |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 (Large) | 774M | 36 | 20 | 20 | 1280 | 64 |
| GPT-3 | 175B | 96 | 96 | 96 | 12,288 | 128 |
| LLaMA-2 7B | 7B | 32 | 32 | 32 | 4,096 | 128 |
| LLaMA-2 70B | 70B | 80 | 64 | 8 | 8,192 | 128 |
The LLaMA-2 70B row is notable: it uses Grouped-Query Attention (GQA), where 64 query heads share only 8 key-value head pairs. This dramatically reduces the size of the KV cache during inference without measurably degrading quality, because K and V representations are less sensitive to per-head specialization than Q projections.
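The KV cache saving follows directly from the head counts. A quick sketch, assuming FP16 (2 bytes per value) and the LLaMA-2 70B dimensions from the table:

```python
def kv_bytes_per_token(layers, n_kv_heads, d_head, dtype_bytes=2):
    # One K and one V vector cached per layer; size depends on KV heads only.
    return 2 * layers * n_kv_heads * d_head * dtype_bytes

mha = kv_bytes_per_token(layers=80, n_kv_heads=64, d_head=128)  # full MHA
gqa = kv_bytes_per_token(layers=80, n_kv_heads=8, d_head=128)   # GQA
print(mha // gqa)  # 8: the cache shrinks by the Q-to-KV head ratio
```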
The attention mechanism is O(n²) in sequence length because every token must compute a score against every other token. A sequence of 4,096 tokens requires 4,096 × 4,096 = ~16.7 million attention score computations per head per layer. This quadratic cost is the primary motivation for techniques like sliding window attention (Mistral) and attention approximations (Longformer, BigBird).
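Counting score computations makes the quadratic cost, and the sliding-window escape hatch, concrete (a back-of-the-envelope sketch):

```python
def full_attention_scores(n):
    # Every token scores against every token: an n x n matrix per head.
    return n * n

def sliding_window_scores(n, window=4096):
    # Mistral-style: each token attends to at most the previous `window` tokens.
    return sum(min(i + 1, window) for i in range(n))

print(full_attention_scores(4096))     # 16777216 (~16.7M, as in the text)
print(full_attention_scores(32_768))   # 1073741824: quadratic growth
print(sliding_window_scores(32_768))   # 125831168: linear once past the window
```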
🧠 The Feed-Forward Network: Where the Transformer Stores World Knowledge
If attention decides which tokens to look at, the FFN decides what to do with the information once it has been gathered. The FFN is often described as the model's long-term memory because it stores factual associations learned during pre-training — for example, that "Eiffel Tower" is associated with "Paris" and "iron lattice construction". This is not a metaphor: probing experiments on frozen FFN weights can successfully decode factual associations without any attention context, confirming that knowledge is encoded directly in the FFN weight matrices.
Internals: Structure, Activation Functions, and Parameter Distribution
The canonical FFN is a two-layer structure:
- Up-projection: A linear layer expands the `d_model`-dimensional vector to roughly `4 × d_model` (the canonical 4,096 → 16,384; LLaMA-2 7B's SwiGLU variant uses 11,008)
- Activation: A non-linear function applied element-wise
- Down-projection: A linear layer compresses back to `d_model`
The 4× expansion factor is not arbitrary. Wider intermediate layers store more distinct patterns simultaneously, but they must be compressed back to model dimension so the next block can accept the output. The expansion-compression bottleneck forces the model to select and distill the most relevant features from the broader space.
Activation functions matter. The original transformer used ReLU. Modern dense models use SwiGLU (Swish Gated Linear Unit), which adds a learned gating mechanism. SwiGLU introduces a third matrix in the FFN (a "gate" projection), multiplied element-wise with the up-projection before activation. This gating allows the model to suppress certain neurons selectively, giving it more fine-grained control over what information flows through. LLaMA, Mistral, and most post-2022 dense models use SwiGLU.
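A minimal NumPy sketch of the SwiGLU three-matrix structure, with toy dimensions standing in for 4,096 and 11,008:

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # The activated gate path multiplies the up-projection element-wise,
    # then the result is compressed back to model dimension.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 64, 172                       # toy stand-ins for 4096 / 11008
x = rng.normal(size=(5, d_model))
y = swiglu_ffn(x,
               rng.normal(0, 0.02, (d_model, d_ff)),
               rng.normal(0, 0.02, (d_model, d_ff)),
               rng.normal(0, 0.02, (d_ff, d_model)))
print(y.shape)                                # (5, 64)
```

The gate path is what lets the model zero out individual intermediate neurons per token, which plain ReLU or GELU FFNs cannot do as selectively.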
The parameter breakdown below shows how weight counts are distributed for LLaMA-2 7B (approximate values):
| Component | Formula | Approximate Params |
| --- | --- | --- |
| Token embedding table | vocab_size × d_model = 32,000 × 4,096 | ~131M |
| Attention (per layer) | 4 × d_model² = 4 × 4,096² | ~67M |
| FFN with SwiGLU (per layer) | 3 × d_model × d_ffn = 3 × 4,096 × 11,008 | ~135M |
| All 32 attention layers | 32 × 67M | ~2.1B |
| All 32 FFN layers | 32 × 135M | ~4.3B |
| Output projection + norms | — | ~131M |
| Total | — | ~7B |
FFN layers account for roughly 60% of total parameters in most dense models. This is why quantizing or pruning FFN weights is the primary lever for compressing dense models, and why MoE replaces FFN layers (not attention layers) with routed expert networks.
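The table's per-layer and total counts can be reproduced with a few lines of arithmetic (norms omitted as negligible; an untied output projection assumed, as in LLaMA-2):

```python
def llama2_7b_param_breakdown(vocab=32_000, d=4_096, d_ffn=11_008, layers=32):
    emb = vocab * d            # token embedding table
    attn = 4 * d * d           # W_Q, W_K, W_V, W_O per layer
    ffn = 3 * d * d_ffn        # gate, up, down projections (SwiGLU) per layer
    lm_head = vocab * d        # untied output projection
    total = emb + layers * (attn + ffn) + lm_head
    return attn, ffn, total

attn, ffn, total = llama2_7b_param_breakdown()
print(f"attn/layer {attn/1e6:.0f}M, ffn/layer {ffn/1e6:.0f}M, "
      f"total {total/1e9:.2f}B")   # attn/layer 67M, ffn/layer 135M, total 6.74B
```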
Performance Analysis: The FFN as the Inference Memory Bottleneck
The FFN is the single largest contributor to both parameter count and inference memory pressure in dense models. At batch size 1 during autoregressive token generation, the GPU must load all FFN weight matrices from High Bandwidth Memory (HBM) for every new token — there is no weight reuse across decode steps because each step computes a fresh forward pass.
For LLaMA-2 7B, ~4.3B FFN parameters × 2 bytes (FP16) = ~8.6 GB of FFN weights loaded per token generated. The A100 SXM provides 2 TB/s of HBM bandwidth — that means the FFN load alone takes approximately 4.3ms per token before a single multiply-add executes. Arithmetic intensity (FLOPs per byte loaded) sits around 1–2 FLOP/byte at batch size 1, far below the 300+ FLOP/byte compute roofline.
Increasing batch size to 32 amortizes those 8.6 GB of FFN weights across 32 simultaneous tokens. FFN load time per token drops to ~0.13ms. This is the single biggest reason production serving systems like vLLM implement continuous batching — spreading the dense model's fixed weight-load cost across more useful work per GPU cycle.
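The amortization math as a sketch, assuming the 2 TB/s HBM bandwidth and FP16 weights cited above:

```python
def ffn_load_ms_per_token(ffn_params=4.3e9, bytes_per_param=2,
                          hbm_bw_bytes=2e12, batch=1):
    # Time to stream all FFN weights from HBM, amortized over the batch.
    seconds = ffn_params * bytes_per_param / hbm_bw_bytes / batch
    return seconds * 1e3

print(round(ffn_load_ms_per_token(batch=1), 2))   # 4.3 ms at batch size 1
print(round(ffn_load_ms_per_token(batch=32), 2))  # 0.13 ms at batch size 32
```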
🏗️ Stacking Layers: How Depth Transforms Syntax into Reasoning
A single transformer block produces a context-enriched token representation — but a single block is not an LLM. The power of large models comes from stacking many blocks in sequence, with each block receiving the output of the previous one. Depth creates qualitative capability.
The representation of a token changes character as it moves through the stack:
- Early blocks (layers 1–8 in a 32-layer model): Representations reflect surface patterns — tokenization artifacts, punctuation roles, and basic syntactic structure. Probing classifiers can reliably decode part-of-speech tags from these layers.
- Middle blocks (layers 9–24): Representations capture semantic relationships — coreference chains, entity types, paraphrase equivalences. Named entity recognition probes perform best at these depths.
- Late blocks (layers 25–32): Representations encode task-level reasoning — whether a statement is a question, a command, or a factual claim. Instruction-following behavior and chain-of-thought emerge primarily in these layers.
The diagram below shows how representations at each layer grow progressively richer as they ascend the stack:
```mermaid
graph TD
    EMB[Input Embeddings - Surface token identity] --> B1[Block 1-8 - Syntax and Surface Structure]
    B1 --> B2[Block 9-16 - Local Semantic Relationships]
    B2 --> B3[Block 17-24 - Entity Types and Coreference]
    B3 --> B4[Block 25-32 - Task Reasoning and World Knowledge]
    B4 --> LM[Language Model Head - Next Token Logits]
```
This layer-depth specialization has practical consequences. When fine-tuning a dense model with LoRA (Low-Rank Adaptation), researchers typically inject trainable adapters into the middle and late blocks, because that is where task-relevant representations live. Adapting only the early blocks recovers little performance because syntactic representations are already close to optimal after pre-training.
The total parameter count of a dense model scales roughly as:
N ≈ 12 × num_layers × d_model²
This approximation holds because each layer contributes four attention matrices (W_Q, W_K, W_V, W_O) and three FFN matrices (up, gate, down), each of size d_model × d_model or similar. Doubling d_model quadruples the parameter count per layer; doubling num_layers doubles it linearly.
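The approximation can be checked against the two dense models discussed in this post:

```python
def approx_params(num_layers, d_model):
    # N ≈ 12 * L * d^2: four attention matrices plus ~8 d^2 of FFN weights.
    return 12 * num_layers * d_model ** 2

print(f"{approx_params(96, 12_288) / 1e9:.0f}B")  # 174B, close to GPT-3's 175B
print(f"{approx_params(32, 4_096) / 1e9:.1f}B")   # 6.4B, close to LLaMA-2 7B
```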
📊 The Dense Scaling Wall: When More Parameters Becomes Prohibitively Expensive
The dense architecture has a clean, predictable scaling story — and a brutal cost curve.
Chinchilla scaling laws (Hoffmann et al., 2022) established that a compute-optimal dense model should train on approximately 20 tokens per parameter. A 7B model is optimally trained on ~140 billion tokens; a 70B model on ~1.4 trillion tokens. Training models that are too large for the token budget underperforms smaller models trained longer. GPT-3 (175B, trained on ~300B tokens) was significantly undertrained by Chinchilla standards.
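The rule of thumb in code form:

```python
def chinchilla_tokens(params):
    # Compute-optimal training budget: roughly 20 tokens per parameter.
    return 20 * params

print(f"{chinchilla_tokens(7e9) / 1e9:.0f}B tokens")    # 140B for a 7B model
print(f"{chinchilla_tokens(70e9) / 1e12:.1f}T tokens")  # 1.4T for a 70B model
```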
The FLOPs required for a single forward pass through a dense model scale as:
FLOPs ≈ 2 × N × T
where N is the number of active parameters and T is the number of input tokens in the batch. In a dense model N is the total parameter count — every parameter is active. In a MoE model, N is only the active expert parameters per token, which can be 2–8× smaller than total model size.
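Plugging the formula in for a dense 70B model versus a Mixtral-style MoE with roughly 13B active parameters:

```python
def forward_flops(active_params, tokens):
    # FLOPs ≈ 2 * N * T: one multiply and one add per weight per token.
    return 2 * active_params * tokens

dense_70b = forward_flops(70e9, 1e6)   # dense: all 70B params are active
moe_13b = forward_flops(13e9, 1e6)     # Mixtral-style: ~13B active of 46B total
print(f"{dense_70b/1e15:.0f} vs {moe_13b/1e15:.0f} PFLOPs")  # 140 vs 26 PFLOPs
```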
The table below shows how this plays out at model scale for a single forward pass over one million tokens:
| Model | Parameters (active) | Approx FLOPs (1M tokens) | Peak GPU Memory (FP16) | Notes |
| --- | --- | --- | --- | --- |
| GPT-3 175B | 175B (all) | ~350 PFLOPs | ~350 GB | Requires ~5× A100 80GB |
| LLaMA-2 70B | 70B (all) | ~140 PFLOPs | ~140 GB | Requires ~2× A100 80GB |
| LLaMA-2 13B | 13B (all) | ~26 PFLOPs | ~26 GB | Fits on a single A100 |
| LLaMA-2 7B | 7B (all) | ~14 PFLOPs | ~14 GB | Fits on a single A100 40GB |
The memory bandwidth bottleneck is often more limiting than raw FLOPs at inference time. During autoregressive generation, the model processes one new token at a time. For each token, every weight in the model must be loaded from High Bandwidth Memory (HBM) on the GPU. GPT-3 at 175B parameters in FP16 weighs ~350 GB. The A100 SXM has 2 TB/s of memory bandwidth — meaning loading all GPT-3 weights takes ~175ms per token even before a single multiply-add is performed. Arithmetic intensity (FLOPs per byte) is extremely low at batch size 1, making these inference workloads memory-bound rather than compute-bound.
This is the dense scaling wall: every parameter you add costs you inference memory and bandwidth unconditionally, regardless of whether that parameter is useful for the current token. The MoE architecture was invented specifically to escape this wall — route tokens to a few relevant experts, keep most parameters cold, and lower the effective inference cost per token while maintaining model capacity.
🌍 Real Dense Models: GPT-3, LLaMA-2, and Mistral as Case Studies
All three of the most influential open and closed dense model families share the same core architecture but diverge in the engineering choices that determine efficiency:
GPT-3 (175B, OpenAI, 2020): The canonical large dense model. 96 transformer blocks, 96 attention heads, 12,288 model dimension, learned positional embeddings, Pre-LN layout. All 96 heads maintain full KV pairs, meaning the KV cache at inference scales as num_layers × sequence_length × 2 × d_model. At long context lengths, this cache alone can exceed available GPU memory.
LLaMA-2 (7B–70B, Meta, 2023): Open-weight dense models with several engineering refinements. Rotary Positional Embeddings (RoPE) replace learned position tables — RoPE encodes position via rotation of Q and K vectors, generalizing better to sequence lengths longer than those seen during training. RMSNorm replaces LayerNorm (removes mean subtraction, 7–10% faster). Grouped-Query Attention in the 70B model reduces KV cache by 8× with negligible quality loss. SwiGLU activation throughout.
Mistral 7B (Mistral AI, 2023): A dense model with one structural efficiency trick: sliding window attention (SWA). Rather than attending to all previous tokens, each token attends only to the most recent 4,096 tokens. This reduces attention from O(n²) to O(n × window_size) for long sequences, while information from earlier tokens still propagates via the layer stack. Crucially, Mistral 7B is still a dense model — every FFN parameter is active for every token. SWA reduces the attention computation overhead, but FFN participation remains universal.
```mermaid
graph LR
    subgraph GPT3[GPT-3 175B]
        A1[Learned Positional Encoding]
        A2[Full Multi-Head Attention - 96 heads - 96 KV heads]
        A3[LayerNorm - Pre-LN]
        A4[GELU FFN - 4x expansion]
    end
    subgraph LLaMA2[LLaMA-2 70B]
        B1[RoPE Rotary Positional Encoding]
        B2[GQA - 64 Q heads - 8 KV heads]
        B3[RMSNorm]
        B4[SwiGLU FFN]
    end
    subgraph Mistral[Mistral 7B]
        C1[RoPE Rotary Positional Encoding]
        C2[Sliding Window Attention - 4096 token window]
        C3[RMSNorm]
        C4[SwiGLU FFN - dense - all params active]
    end
```
This comparison reveals the direction of the field: positional encoding moved from learned tables to RoPE, normalization moved from LayerNorm to RMSNorm, activation moved from GELU (and the original transformer's ReLU) to SwiGLU, and KV heads moved from full to grouped. The fundamental dense contract — every FFN parameter participates unconditionally — remained intact in all three.
⚖️ Dense vs. Sparse: The Real Trade-offs Every LLM Engineer Must Weigh
Dense and sparse (MoE) architectures represent different engineering bets. Neither is universally superior — the right answer depends on your compute budget, latency requirements, and whether you are training or serving.
| Trade-off dimension | Dense (e.g., LLaMA-2 70B) | Sparse MoE (e.g., Mixtral 8×7B) |
| --- | --- | --- |
| Active params per token | 100% of total params | ~12–25% of total params |
| Inference memory | Must load all weights every token | Only active experts loaded |
| Inference latency (batch=1) | Memory-bound; lower throughput | Lower active param load; faster |
| Training cost | Straightforward; all params train on all tokens | Expert routing adds instability; load-balancing loss needed |
| Quality ceiling | Scales predictably with total params | Can match larger dense models at lower active-param cost |
| Fine-tuning | Well-understood; LoRA and full fine-tuning both work | Expert routing can shift during fine-tuning; needs care |
| Quantization | Every weight matters; quality degradation is uniform | Inactive experts tolerate more aggressive quantization |
| Debugging & interpretability | Every activation traceable; no routing to reason about | Router behavior adds a layer of non-determinism |
The key insight: a dense model's cost scales with total parameter count unconditionally. A MoE model's inference cost scales with active parameter count, while training cost scales more with total parameters. Mixtral 8×7B (46B total parameters, ~13B active per token) serves responses with the compute cost of a 13B dense model while approaching the quality of a 70B dense model.
Dense architecture remains the dominant choice for:
- Models below ~20B parameters where MoE routing overhead is disproportionate
- Fine-tuning scenarios requiring full gradient flow across all weights
- Research settings where reproducible, interpretable behavior matters more than raw throughput
🧭 Choosing the Right Dense Model: A Decision Guide for Production and Research
The single most important variable when choosing a dense LLM is the GPU memory envelope. Every byte of the model must fit in VRAM before inference begins. The table below maps use case to model size based on typical hardware constraints:
| Use case | Recommended size | Min VRAM (FP16) | Min VRAM (INT4) | Hardware example |
| --- | --- | --- | --- | --- |
| Personal chatbot / dev testing | 7B | ~16 GB | ~6 GB | RTX 4090, single A10 |
| Quality-first local assistant | 13B | ~28 GB | ~9 GB | 2× RTX 3090 or A10G |
| Production API (low latency) | 7B–13B + vLLM | ~16–28 GB | ~6–9 GB | Single A100 40GB |
| Production API (high quality) | 70B | ~140 GB FP16 | ~40 GB INT4 | 2–4× A100 80GB |
| Fine-tuning on domain data | 7B (full) or 70B (LoRA) | ~60 GB (full 7B) | ~20 GB (LoRA 70B) | A100 or H100 |
| Research / ablation studies | 7B (controlled) | ~16 GB | ~6 GB | Single A100 40GB |
Decision rules:
If you need results today with one GPU: LLaMA-3 8B or Mistral 7B at INT4. Both run on a single consumer GPU and deliver production-grade quality for most NLP tasks.
If quality must be maximized and cost is secondary: LLaMA-2 70B or LLaMA-3 70B in FP16 on four A100s. This is the dense model performance ceiling at open-weight scale.
If you are fine-tuning: Start with 7B full fine-tuning (fits in ~60 GB with gradient checkpointing) or LoRA on 70B (adapters add ~1% of parameter count, fitting in ~40 GB INT4). See the LoRA companion post for full details.
If you need serving throughput over single-request latency: Deploy with vLLM and continuous batching. Throughput scales near-linearly with batch size for dense models; latency degrades gracefully.
If your context regularly exceeds 4,096 tokens: Prefer LLaMA-3 (128k context window) or Mistral's sliding-window variants over GPT-3 class models with fixed 2k–4k context.
🧪 Tracing "The Eiffel Tower is in" Through LLaMA-2 7B: A Layer-by-Layer Walkthrough
The best way to internalize what "dense" means is to follow a concrete token through every stage of a real model. Consider the 5-token input sequence "The Eiffel Tower is in" being processed by LLaMA-2 7B.
Input preparation (pre-block): The tokenizer converts the sentence into token IDs. "Eiffel" and "Tower" may be represented as a single token or split into subwords depending on the vocabulary. Each token ID is looked up in the embedding table (shape: 32,000 × 4,096), producing a 4,096-dimensional vector. Rotary Positional Embeddings are then applied to the Q and K vectors inside each attention head — not to the embeddings directly.
Block 1–8 (early layers — syntax):
The five token vectors pass through eight transformer blocks. Multi-head attention in these layers primarily captures short-range dependencies: which article ("The") modifies which noun, verb tense, and adjacent token relationships. The FFN in each block applies non-linear transforms that activate patterns learned during pre-training — pattern families like "determiner → noun" fire here. After block 8, each of the five token vectors is still shape [4096], but now enriched with syntactic context.
Block 9–24 (middle layers — semantics and world knowledge):
These are the layers where the model's stored world knowledge becomes most active. For the token "Tower", the FFN neurons associated with architectural landmarks, French geography, and iron construction history all activate with non-zero weights — because they do so for every token, always, in a dense model. Attention heads in these layers resolve longer-range entity relationships: "Eiffel Tower" is being treated as a single named entity despite spanning two positions.
Block 25–32 (late layers — task and output prediction):
The representation of the last token "in" is now the model's prediction target context. Late-layer attention heads weight heavily on "Eiffel Tower" and "is" to predict what comes after "in". The FFN activations at layer 30 are strongly correlated with "Paris" as the next token. After block 32, the final token representation passes through the language model head (a linear projection from 4,096 to 32,000), and softmax over the vocabulary gives "Paris" its highest logit.
What makes this "dense": At every one of the 32 blocks, every FFN weight matrix (all 135M parameters per block, 4.3B total) participated in the computation for all five tokens. No parameter was skipped. No weight was cold. Following FLOPs ≈ 2 × N × T, the model paid roughly 2 × 7B × 5 ≈ 70 GFLOPs for this 5-token sequence, with every parameter active.
🚀 LLaMA-3, Mistral, and Falcon in Practice
The most practical dense models available today are LLaMA-3 (Meta, 8B and 70B), Mistral 7B (Mistral AI), and Falcon 40B (TII). All three are open-weight, meaning the parameters are publicly downloadable and can be run on your own hardware.
LLaMA-3 8B fits in 16 GB of GPU memory at FP16 (or 8 GB at 4-bit quantization), making it runnable on a single consumer GPU like an RTX 4090. The architecture is a dense transformer with GQA, RoPE, and SwiGLU — a direct descendant of LLaMA-2 with expanded vocabulary (128k tokens) and longer context training.
Ollama provides the simplest local inference path. Install Ollama, then pull and run a model with two commands: `ollama pull llama3:8b` followed by `ollama run llama3:8b`. A Modelfile customizes generation parameters and the system prompt:

```
# Ollama model configuration (Modelfile)
FROM llama3:8b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a helpful assistant."
```
HuggingFace Transformers supports all three families natively. Load any of them with AutoModelForCausalLM.from_pretrained and specify device_map="auto" to distribute across available GPUs.
vLLM is the production-grade serving engine for dense models. Its PagedAttention memory manager partitions the KV cache into non-contiguous pages, eliminating memory fragmentation and enabling 3–5× higher throughput than naive implementations. For any dense model serving more than a handful of concurrent users, vLLM is the default choice.
For a full deep-dive on production LLM deployment, see the companion posts on quantization and LoRA fine-tuning linked in Related Posts.
📚 Production Lessons from Running Dense Models at Scale
After deploying dense transformer models in production, the same failure patterns emerge repeatedly:
KV cache is the silent memory budget killer. The key-value cache stores the attention keys and values for all previous tokens so they do not need to be recomputed during generation. For a 7B model with 32 layers and a 4,096 model dimension, a single 4,096-token context occupies roughly 2 GB of GPU memory in FP16 (2 × 32 × 4,096 × 2 bytes ≈ 0.5 MB per token). Multiply by 10 concurrent requests, add the ~14 GB of weights, and you have exhausted a 40 GB A100 before a single new token has been generated. Batching strategy, maximum context length, and KV cache eviction policies must be sized together.
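The cache size follows directly from the model dimensions. A calculator for the 7B case, assuming full multi-head attention (no GQA) and FP16 values:

```python
def kv_cache_bytes(layers=32, d_model=4096, seq_len=4096, dtype_bytes=2):
    # One K and one V vector (d_model each) cached per layer per token.
    return 2 * layers * d_model * seq_len * dtype_bytes

gib = kv_cache_bytes() / 2**30
print(f"{gib:.1f} GiB per 4K-token session")  # 2.0 GiB
```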
Quantization degrades quality non-uniformly across a dense model. Because every weight is active for every token, quantizing to INT4 introduces rounding error into every computation in every forward pass. There are no inactive weights to "absorb" errors. The impact is particularly pronounced in late layers, which encode the most fragile task-specific representations. 4-bit NormalFloat (NF4) with double quantization (used by bitsandbytes) is the current best practice for quality-preserving compression — but it is not lossless.
Batch size vs. latency is a fundamental tradeoff, not a tunable parameter. Dense model inference is memory-bandwidth-bound at batch size 1 (each new token requires loading all model weights). Increasing batch size amortizes weight loading across multiple requests, raising throughput at the cost of latency. There is no configuration that achieves both minimum latency and maximum throughput simultaneously — choose based on your SLA.
Layer-depth matters for fine-tuning targeting. If you are adapting a dense model with LoRA for a downstream task, adding adapters only to early attention layers wastes capacity — syntactic features are already well-learned. Target middle and late layers (60–100% of depth) where task-relevant and world-knowledge representations are concentrated.
📌 TLDR: Dense Architecture in Five Rules
- Every parameter activates for every token. No routing, no gating at the model level — 100% parameter utilization per forward pass.
- A transformer block = attention + FFN, wrapped in residuals. Attention gathers context; FFN stores and applies patterns; residuals preserve gradient flow.
- FFN holds most of the parameters. Roughly 60% of weights in a standard dense model live in the FFN layers — this is where factual associations are stored.
- Scaling cost is proportional to total parameter count. FLOPs ≈ 2 × N × T; memory = all weights, always loaded. There is no "inactive" parameter relief at inference time.
- Chinchilla says: match training tokens to model size. ~20 tokens per parameter for compute-optimal training. Bigger is not better if the token budget is fixed.
📝 Practice Quiz: Dense Transformer Architecture
In a dense LLM, how many parameters activate when the model processes a single input token?
- A) Only the parameters in the currently-executing layer
- B) Only the attention layer parameters
- C) All parameters in the entire model
- D) A dynamically routed subset selected per token
Correct Answer: C — Dense means every weight participates in every forward pass unconditionally. Selective activation based on routing is the defining property of sparse MoE architectures, not dense ones.
A transformer block uses a 4× FFN expansion. For LLaMA-2 7B with model dimension 4,096, what is the FFN intermediate dimension?
- A) 1,024
- B) 4,096
- C) 8,192
- D) 16,384
Correct Answer: D — The canonical 4× rule gives 4 × 4,096 = 16,384. (LLaMA-2's SwiGLU FFN actually uses 11,008, about 8/3 × d_model, to keep its three-matrix parameter count comparable.) This expansion provides the FFN with capacity to store a wide variety of factual patterns before compressing back to model dimension.
Why is standard self-attention described as O(n²) with respect to sequence length n?
- A) Weight matrix size grows quadratically with model dimension
- B) Each of the n tokens computes an attention score against every other n token
- C) The FFN expansion ratio scales with n²
- D) Residual connections double computation at each layer
Correct Answer: B — Attention produces an n × n score matrix. A 4,096-token sequence requires ~16.7M score computations per head per layer. This quadratic cost is why sliding window attention, sparse attention approximations, and FlashAttention exist.
Open-ended challenge: Your team must serve LLaMA-2 70B to 100 concurrent users with p99 latency under 3 seconds per response. Name the two largest bottlenecks specific to dense model inference and propose a concrete mitigation for each.
No single correct answer. Strong responses identify: (1) memory bandwidth — 70B × 2 bytes FP16 = 140 GB of weights loaded per token per request, mitigated by tensor parallelism across multiple A100s or by quantizing to INT8/NF4 with bitsandbytes; (2) KV cache memory growth — 100 concurrent sessions with 2K context, 80 layers, and GQA's 8 KV heads (1,024-dim KV per layer) work out to roughly 67 GB of KV cache alone, mitigated by PagedAttention (vLLM) and strict max-context limits per session. Bonus: continuous batching to amortize weight loads across many users simultaneously.
According to Chinchilla scaling laws, approximately how many training tokens are compute-optimal for a 7B parameter dense model?
- A) 7 billion tokens
- B) 70 billion tokens
- C) 140 billion tokens
- D) 1 trillion tokens
Correct Answer: C — Chinchilla recommends ~20 tokens per parameter. 7B × 20 = 140B tokens. LLaMA-2 7B was trained on 2T tokens (significantly over-trained by Chinchilla standards), which generally improves downstream benchmark quality but at much higher training compute cost.
🔗 Related Posts

Written by Abstract Algorithms (@abstractalgorithms)

More Posts:

- Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF
- Chain of Thought Prompting: Teaching LLMs to Think Step by Step
- Transfer Learning Explained: Standing on the Shoulders of Pretrained Models
- LLM Hallucinations: Causes, Detection, and Mitigation Strategies