Dense LLM Architecture: How Every Parameter Works on Every Token
Inside the dense transformer block — attention heads, FFN layers, and why scaling dense models hits a wall
TLDR: In a dense LLM every single parameter is active for every token in every forward pass — no routing, no selection. A transformer block runs multi-head self-attention (Q, K, V) followed by a feed-forward network (FFN) with roughly 4× the hidden dimension. Stack 32–96 of those blocks and you have GPT-3, LLaMA-2, or Mistral. The compute cost scales with the product of parameters and tokens, which is why models like Mixtral replaced the dense FFN layers with Mixture-of-Experts routing.
📖 The Universal Participation Rule: What Makes a Model "Dense"
You are teaching a class of 175 billion students. Every time a question comes in, every student raises their hand. Not just the specialists, not just the ones who slept through yesterday's lecture — every single one. That is how a dense language model works.
In deep learning, dense means every parameter participates in every forward pass for every input token. When GPT-3 processes the word "Paris" in the sentence "The capital of France is Paris", all 175 billion weights activate, compute something, and contribute to the output. Nothing sits idle. Nothing is routed away.
This is the direct opposite of a sparse model. Sparse architectures — most famously Mixture-of-Experts (MoE) — maintain a much larger pool of parameters but only activate a small fraction of them per token, selecting the most relevant expert sub-networks via a learned router. In a dense model, that router does not exist. Every neuron participates unconditionally.
Why does this matter? Because it has direct consequences for:
- Compute cost — more active parameters means more floating-point operations per token
- Memory bandwidth — every weight must be loaded from GPU memory for every forward pass
- Scaling behavior — accuracy predictably improves with parameter count and training tokens (Chinchilla), but cost rises proportionally
Understanding the dense architecture is the prerequisite for understanding why the field moved to MoE at scale. Before you can appreciate the router, you need to understand what it routes around.
⚙️ The Transformer Block: Anatomy of a Single Dense Layer
Every large language model is built by stacking identical transformer blocks. Within each block, every token passes through the same two major sub-layers in sequence: multi-head self-attention and a feed-forward network (FFN). Residual connections and layer norms wrap each sub-layer to stabilize gradients during training.
Here is the precise data flow through one transformer block:
- Input: A tensor of shape `[batch, sequence_length, model_dim]` arrives — one embedding vector per token.
- Layer Norm 1: Each token's embedding is normalized before attention. Pre-norm (applied before the sub-layer) is the modern convention used by LLaMA, GPT-NeoX, and Mistral.
- Multi-Head Self-Attention: Each token queries all other tokens, producing a context-enriched representation.
- Residual Add: The original (pre-attention) embedding is added back to the attention output, preserving the gradient path.
- Layer Norm 2: The residual-summed tensor is normalized again before the FFN.
- Feed-Forward Network: Two linear projections with a non-linear activation in between expand and compress the representation.
- Residual Add: The pre-FFN embedding is added back, completing the block.
- Output: The same shape `[batch, sequence_length, model_dim]` passes to the next block.
The diagram below traces this data flow. Notice that the residual connections create two parallel paths — one through the sub-layer and one that bypasses it — which sum at the add nodes. This is what allows very deep stacks (32–96 blocks) to train without vanishing gradients.
```mermaid
graph TD
    A[Input Token Embeddings] --> LN1[Layer Norm 1]
    LN1 --> MHA[Multi-Head Self-Attention Q K V]
    MHA --> ADD1[Residual Add]
    A --> ADD1
    ADD1 --> LN2[Layer Norm 2]
    LN2 --> FFN[Feed-Forward Network - two linear layers with activation]
    FFN --> ADD2[Residual Add]
    ADD1 --> ADD2
    ADD2 --> OUT[Output to Next Transformer Block]
```
The residual path (the direct arrows from A to ADD1 and from ADD1 to ADD2) is not decorative — it is structurally essential. Without residuals, gradients from the loss function would shrink exponentially as they back-propagate through dozens of blocks, making training impossible. The residual shortcut provides a gradient highway that bypasses each sub-layer entirely.
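The data flow above can be sketched end to end in NumPy. This is a toy version with made-up dimensions, a single attention head, and no causal mask, meant only to show the pre-norm layout and the two residual adds:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head, mask-free self-attention kept minimal for clarity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)          # row-wise softmax
    return (w @ v) @ Wo

def ffn(x, W_up, W_down):
    # Expand to 4x d_model, apply ReLU, compress back.
    return np.maximum(x @ W_up, 0.0) @ W_down

def transformer_block(x, p):
    # Pre-norm layout: norm -> sublayer -> residual add, twice.
    x = x + attention(layer_norm(x), *p["attn"])
    x = x + ffn(layer_norm(x), *p["ffn"])
    return x

rng = np.random.default_rng(0)
d, seq = 64, 5                                # toy sizes, not a real model
p = {
    "attn": [rng.normal(0, 0.02, (d, d)) for _ in range(4)],
    "ffn": [rng.normal(0, 0.02, (d, 4 * d)),
            rng.normal(0, 0.02, (4 * d, d))],
}
x = rng.normal(size=(seq, d))
y = transformer_block(x, p)
print(y.shape)                                # (5, 64): shape is preserved
```

Real implementations add causal masking, many heads, and learned norm scales, but the residual structure is exactly this: each sub-layer computes a delta that is added onto an untouched copy of its input.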
🔍 How Attention Heads Divide and Conquer Token Relationships
Self-attention allows every token to look at every other token in the sequence simultaneously. But running attention just once would force a single set of weights to capture every type of linguistic relationship — syntax, semantics, coreference, positional proximity — all at once. Multi-head attention solves this by running several independent attention operations in parallel.
The model dimension d_model is split evenly across h heads. Each head works in a lower-dimensional space of size d_head = d_model / h. This means a 4096-dim model with 32 heads gives each head a 128-dim space to work in. Within that space, a head learns its own Q, K, V projection matrices and discovers its own relationship pattern. The outputs of all heads are concatenated and projected back to d_model.
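The split itself is just dimension bookkeeping. A shape-only NumPy sketch (the token count of 5 is arbitrary, and real attention applies the Q, K, V projections before reshaping):

```python
import numpy as np

d_model, n_heads = 4096, 32
d_head = d_model // n_heads              # 128, as stated in the text

x = np.zeros((5, d_model))               # five token vectors
heads = x.reshape(5, n_heads, d_head)    # each head gets its own 128-dim slice
merged = heads.reshape(5, d_model)       # concatenation restores d_model

print(d_head, heads.shape, merged.shape) # 128 (5, 32, 128) (5, 4096)
```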
In practice, heads specialize on their own without being explicitly told to:
| Head type | Typical behavior |
| --- | --- |
| Syntactic head | Attends from a verb to its subject; captures grammatical agreement |
| Coreference head | Attends from a pronoun to its antecedent across many tokens |
| Positional head | Attends strongly to adjacent tokens; captures local n-gram context |
| Semantic head | Attends to thematically related tokens regardless of distance |
| Delimiter head | Attends to sentence boundaries and punctuation anchors |
The table below shows how head count and model dimension scale across four well-known dense models:
| Model | Parameters | Layers | Heads (Q) | KV Heads | d_model | d_head |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 (Large) | 774M | 36 | 20 | 20 | 1280 | 64 |
| GPT-3 | 175B | 96 | 96 | 96 | 12,288 | 128 |
| LLaMA-2 7B | 7B | 32 | 32 | 32 | 4,096 | 128 |
| LLaMA-2 70B | 70B | 80 | 64 | 8 | 8,192 | 128 |
The LLaMA-2 70B row is notable: it uses Grouped-Query Attention (GQA), where 64 query heads share only 8 key-value head pairs. This dramatically reduces the size of the KV cache during inference without measurably degrading quality, because K and V representations are less sensitive to per-head specialization than Q projections.
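The KV cache saving follows directly from the head counts. A quick sketch, assuming FP16 (2 bytes per value) and the LLaMA-2 70B dimensions from the table:

```python
def kv_bytes_per_token(layers, n_kv_heads, d_head, dtype_bytes=2):
    # One K and one V vector cached per layer; size depends on KV heads only.
    return 2 * layers * n_kv_heads * d_head * dtype_bytes

mha = kv_bytes_per_token(layers=80, n_kv_heads=64, d_head=128)  # full MHA
gqa = kv_bytes_per_token(layers=80, n_kv_heads=8, d_head=128)   # GQA
print(mha // gqa)  # 8: the cache shrinks by the Q-to-KV head ratio
```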
The attention mechanism is O(n²) in sequence length because every token must compute a score against every other token. A sequence of 4,096 tokens requires 4,096 × 4,096 = ~16.7 million attention score computations per head per layer. This quadratic cost is the primary motivation for techniques like sliding window attention (Mistral) and attention approximations (Longformer, BigBird).
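Counting score computations makes the quadratic cost, and the sliding-window escape hatch, concrete (a back-of-the-envelope sketch):

```python
def full_attention_scores(n):
    # Every token scores against every token: an n x n matrix per head.
    return n * n

def sliding_window_scores(n, window=4096):
    # Mistral-style: each token attends to at most the previous `window` tokens.
    return sum(min(i + 1, window) for i in range(n))

print(full_attention_scores(4096))     # 16777216 (~16.7M, as in the text)
print(full_attention_scores(32_768))   # 1073741824: quadratic growth
print(sliding_window_scores(32_768))   # 125831168: linear once past the window
```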
🧠 The Feed-Forward Network: Where the Transformer Stores World Knowledge
If attention decides which tokens to look at, the FFN decides what to do with the information once it has been gathered. The FFN is often described as the model's long-term memory because it stores factual associations learned during pre-training — for example, that "Eiffel Tower" is associated with "Paris" and "iron lattice construction". This is not a metaphor: probing experiments on frozen FFN weights can successfully decode factual associations without any attention context, confirming that knowledge is encoded directly in the FFN weight matrices.
Internals: Structure, Activation Functions, and Parameter Distribution
The canonical FFN is a two-layer structure:
- Up-projection: A linear layer expands the `d_model`-dimensional vector to roughly `4 × d_model` (the canonical 4,096 → 16,384; LLaMA-2 7B's SwiGLU variant uses 11,008)
- Activation: A non-linear function applied element-wise
- Down-projection: A linear layer compresses back to `d_model`
The 4× expansion factor is not arbitrary. Wider intermediate layers store more distinct patterns simultaneously, but they must be compressed back to model dimension so the next block can accept the output. The expansion-compression bottleneck forces the model to select and distill the most relevant features from the broader space.
Activation functions matter. The original transformer used ReLU. Modern dense models use SwiGLU (Swish Gated Linear Unit), which adds a learned gating mechanism. SwiGLU introduces a third matrix in the FFN (a "gate" projection), multiplied element-wise with the up-projection before activation. This gating allows the model to suppress certain neurons selectively, giving it more fine-grained control over what information flows through. LLaMA, Mistral, and most post-2022 dense models use SwiGLU.
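A minimal NumPy sketch of the SwiGLU three-matrix structure, with toy dimensions standing in for 4,096 and 11,008:

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # The activated gate path multiplies the up-projection element-wise,
    # then the result is compressed back to model dimension.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 64, 172                       # toy stand-ins for 4096 / 11008
x = rng.normal(size=(5, d_model))
y = swiglu_ffn(x,
               rng.normal(0, 0.02, (d_model, d_ff)),
               rng.normal(0, 0.02, (d_model, d_ff)),
               rng.normal(0, 0.02, (d_ff, d_model)))
print(y.shape)                                # (5, 64)
```

The gate path is what lets the model zero out individual intermediate neurons per token, which plain ReLU or GELU FFNs cannot do as selectively.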
The parameter breakdown below shows how weight counts are distributed for LLaMA-2 7B (approximate values):
| Component | Formula | Approximate Params |
| --- | --- | --- |
| Token embedding table | vocab_size × d_model = 32,000 × 4,096 | ~131M |
| Attention (per layer) | 4 × d_model² = 4 × 4,096² | ~67M |
| FFN with SwiGLU (per layer) | 3 × d_model × d_ffn = 3 × 4,096 × 11,008 | ~135M |
| All 32 attention layers | 32 × 67M | ~2.1B |
| All 32 FFN layers | 32 × 135M | ~4.3B |
| Output projection + norms | — | ~131M |
| Total | — | ~7B |
FFN layers account for roughly 60% of total parameters in most dense models. This is why quantizing or pruning FFN weights is the primary lever for compressing dense models, and why MoE replaces FFN layers (not attention layers) with routed expert networks.
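The table's per-layer and total counts can be reproduced with a few lines of arithmetic (norms omitted as negligible; an untied output projection assumed, as in LLaMA-2):

```python
def llama2_7b_param_breakdown(vocab=32_000, d=4_096, d_ffn=11_008, layers=32):
    emb = vocab * d            # token embedding table
    attn = 4 * d * d           # W_Q, W_K, W_V, W_O per layer
    ffn = 3 * d * d_ffn        # gate, up, down projections (SwiGLU) per layer
    lm_head = vocab * d        # untied output projection
    total = emb + layers * (attn + ffn) + lm_head
    return attn, ffn, total

attn, ffn, total = llama2_7b_param_breakdown()
print(f"attn/layer {attn/1e6:.0f}M, ffn/layer {ffn/1e6:.0f}M, "
      f"total {total/1e9:.2f}B")   # attn/layer 67M, ffn/layer 135M, total 6.74B
```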
Performance Analysis: The FFN as the Inference Memory Bottleneck
The FFN is the single largest contributor to both parameter count and inference memory pressure in dense models. At batch size 1 during autoregressive token generation, the GPU must load all FFN weight matrices from High Bandwidth Memory (HBM) for every new token — there is no weight reuse across decode steps because each step computes a fresh forward pass.
For LLaMA-2 7B, ~4.3B FFN parameters × 2 bytes (FP16) = ~8.6 GB of FFN weights loaded per token generated. The A100 SXM provides 2 TB/s of HBM bandwidth — that means the FFN load alone takes approximately 4.3ms per token before a single multiply-add executes. Arithmetic intensity (FLOPs per byte loaded) sits around 1–2 FLOP/byte at batch size 1, far below the 300+ FLOP/byte compute roofline.
Increasing batch size to 32 amortizes those 8.6 GB of FFN weights across 32 simultaneous tokens. FFN load time per token drops to ~0.13ms. This is the single biggest reason production serving systems like vLLM implement continuous batching — spreading the dense model's fixed weight-load cost across more useful work per GPU cycle.
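The amortization math as a sketch, assuming the 2 TB/s HBM bandwidth and FP16 weights cited above:

```python
def ffn_load_ms_per_token(ffn_params=4.3e9, bytes_per_param=2,
                          hbm_bw_bytes=2e12, batch=1):
    # Time to stream all FFN weights from HBM, amortized over the batch.
    seconds = ffn_params * bytes_per_param / hbm_bw_bytes / batch
    return seconds * 1e3

print(round(ffn_load_ms_per_token(batch=1), 2))   # 4.3 ms at batch size 1
print(round(ffn_load_ms_per_token(batch=32), 2))  # 0.13 ms at batch size 32
```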
🏗️ Stacking Layers: How Depth Transforms Syntax into Reasoning
A single transformer block produces a context-enriched token representation — but a single block is not an LLM. The power of large models comes from stacking many blocks in sequence, with each block receiving the output of the previous one. Depth creates qualitative capability.
The representation of a token changes character as it moves through the stack:
- Early blocks (layers 1–8 in a 32-layer model): Representations reflect surface patterns — tokenization artifacts, punctuation roles, and basic syntactic structure. Probing classifiers can reliably decode part-of-speech tags from these layers.
- Middle blocks (layers 9–24): Representations capture semantic relationships — coreference chains, entity types, paraphrase equivalences. Named entity recognition probes perform best at these depths.
- Late blocks (layers 25–32): Representations encode task-level reasoning — whether a statement is a question, a command, or a factual claim. Instruction-following behavior and chain-of-thought emerge primarily in these layers.
The diagram below shows how representations at each layer grow progressively richer as they ascend the stack:
```mermaid
graph TD
    EMB[Input Embeddings - Surface token identity] --> B1[Block 1-8 - Syntax and Surface Structure]
    B1 --> B2[Block 9-16 - Local Semantic Relationships]
    B2 --> B3[Block 17-24 - Entity Types and Coreference]
    B3 --> B4[Block 25-32 - Task Reasoning and World Knowledge]
    B4 --> LM[Language Model Head - Next Token Logits]
```
This layer-depth specialization has practical consequences. When fine-tuning a dense model with LoRA (Low-Rank Adaptation), researchers typically inject trainable adapters into the middle and late blocks, because that is where task-relevant representations live. Adapting only the early blocks recovers little performance because syntactic representations are already close to optimal after pre-training.
The total parameter count of a dense model scales roughly as:
N ≈ 12 × num_layers × d_model²
This approximation holds because each layer contributes four attention matrices (W_Q, W_K, W_V, W_O) and three FFN matrices (up, gate, down), each of size d_model × d_model or similar. Doubling d_model quadruples the parameter count per layer; doubling num_layers doubles it linearly.
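The approximation can be checked against the two dense models discussed in this post:

```python
def approx_params(num_layers, d_model):
    # N ≈ 12 * L * d^2: four attention matrices plus ~8 d^2 of FFN weights.
    return 12 * num_layers * d_model ** 2

print(f"{approx_params(96, 12_288) / 1e9:.0f}B")  # 174B, close to GPT-3's 175B
print(f"{approx_params(32, 4_096) / 1e9:.1f}B")   # 6.4B, close to LLaMA-2 7B
```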
📊 The Dense Scaling Wall: When More Parameters Becomes Prohibitively Expensive
The dense architecture has a clean, predictable scaling story — and a brutal cost curve.
Chinchilla scaling laws (Hoffmann et al., 2022) established that a compute-optimal dense model should train on approximately 20 tokens per parameter. A 7B model is optimally trained on ~140 billion tokens; a 70B model on ~1.4 trillion tokens. Training models that are too large for the token budget underperforms smaller models trained longer. GPT-3 (175B, trained on ~300B tokens) was significantly undertrained by Chinchilla standards.
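The rule of thumb in code form:

```python
def chinchilla_tokens(params):
    # Compute-optimal training budget: roughly 20 tokens per parameter.
    return 20 * params

print(f"{chinchilla_tokens(7e9) / 1e9:.0f}B tokens")    # 140B for a 7B model
print(f"{chinchilla_tokens(70e9) / 1e12:.1f}T tokens")  # 1.4T for a 70B model
```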
The FLOPs required for a single forward pass through a dense model scale as:
FLOPs ≈ 2 × N × T
where N is the number of active parameters and T is the number of input tokens in the batch. In a dense model N is the total parameter count — every parameter is active. In a MoE model, N is only the active expert parameters per token, which can be 2–8× smaller than total model size.
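Plugging the formula in for a dense 70B model versus a Mixtral-style MoE with roughly 13B active parameters:

```python
def forward_flops(active_params, tokens):
    # FLOPs ≈ 2 * N * T: one multiply and one add per weight per token.
    return 2 * active_params * tokens

dense_70b = forward_flops(70e9, 1e6)   # dense: all 70B params are active
moe_13b = forward_flops(13e9, 1e6)     # Mixtral-style: ~13B active of 46B total
print(f"{dense_70b/1e15:.0f} vs {moe_13b/1e15:.0f} PFLOPs")  # 140 vs 26 PFLOPs
```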
The table below shows how this plays out at model scale for a single forward pass over one million tokens:
| Model | Parameters (active) | Approx FLOPs (1M tokens) | Peak GPU Memory (FP16) | Notes |
| --- | --- | --- | --- | --- |
| GPT-3 175B | 175B (all) | ~350 PFLOPs | ~350 GB | Requires ~5× A100 80GB |
| LLaMA-2 70B | 70B (all) | ~140 PFLOPs | ~140 GB | Requires ~2× A100 80GB |
| LLaMA-2 13B | 13B (all) | ~26 PFLOPs | ~26 GB | Fits on a single A100 |
| LLaMA-2 7B | 7B (all) | ~14 PFLOPs | ~14 GB | Fits on a single A100 40GB |
The memory bandwidth bottleneck is often more limiting than raw FLOPs at inference time. During autoregressive generation, the model processes one new token at a time. For each token, every weight in the model must be loaded from High Bandwidth Memory (HBM) on the GPU. GPT-3 at 175B parameters in FP16 weighs ~350 GB. The A100 SXM has 2 TB/s of memory bandwidth — meaning loading all GPT-3 weights takes ~175ms per token even before a single multiply-add is performed. Arithmetic intensity (FLOPs per byte) is extremely low at batch size 1, making these inference workloads memory-bound rather than compute-bound.
This is the dense scaling wall: every parameter you add costs you inference memory and bandwidth unconditionally, regardless of whether that parameter is useful for the current token. The MoE architecture was invented specifically to escape this wall — route tokens to a few relevant experts, keep most parameters cold, and lower the effective inference cost per token while maintaining model capacity.
🌍 Real Dense Models: GPT-3, LLaMA-2, and Mistral as Case Studies
All three of the most influential open and closed dense model families share the same core architecture but diverge in the engineering choices that determine efficiency:
GPT-3 (175B, OpenAI, 2020): The canonical large dense model. 96 transformer blocks, 96 attention heads, 12,288 model dimension, learned positional embeddings, Pre-LN layout. All 96 heads maintain full KV pairs, meaning the KV cache at inference scales as num_layers × sequence_length × 2 × d_model. At long context lengths, this cache alone can exceed available GPU memory.
LLaMA-2 (7B–70B, Meta, 2023): Open-weight dense models with several engineering refinements. Rotary Positional Embeddings (RoPE) replace learned position tables — RoPE encodes position via rotation of Q and K vectors, generalizing better to sequence lengths longer than those seen during training. RMSNorm replaces LayerNorm (removes mean subtraction, 7–10% faster). Grouped-Query Attention in the 70B model reduces KV cache by 8× with negligible quality loss. SwiGLU activation throughout.
Mistral 7B (Mistral AI, 2023): A dense model with one structural efficiency trick: sliding window attention (SWA). Rather than attending to all previous tokens, each token attends only to the most recent 4,096 tokens. This reduces attention from O(n²) to O(n × window_size) for long sequences, while information from earlier tokens still propagates via the layer stack. Crucially, Mistral 7B is still a dense model — every FFN parameter is active for every token. SWA reduces the attention computation overhead, but FFN participation remains universal.
```mermaid
graph LR
    subgraph GPT3[GPT-3 175B]
        A1[Learned Positional Encoding]
        A2[Full Multi-Head Attention - 96 heads - 96 KV heads]
        A3[LayerNorm - Pre-LN]
        A4[GELU FFN - 4x expansion]
    end
    subgraph LLaMA2[LLaMA-2 70B]
        B1[RoPE Rotary Positional Encoding]
        B2[GQA - 64 Q heads - 8 KV heads]
        B3[RMSNorm]
        B4[SwiGLU FFN]
    end
    subgraph Mistral[Mistral 7B]
        C1[RoPE Rotary Positional Encoding]
        C2[Sliding Window Attention - 4096 token window]
        C3[RMSNorm]
        C4[SwiGLU FFN - dense - all params active]
    end
```
This comparison reveals the direction of the field: positional encoding moved from learned tables to RoPE, normalization moved from LayerNorm to RMSNorm, activation moved from GELU (and the original transformer's ReLU) to SwiGLU, and KV heads moved from full to grouped. The fundamental dense contract — every FFN parameter participates unconditionally — remained intact in all three.
⚖️ Dense vs. Sparse: The Real Trade-offs Every LLM Engineer Must Weigh
Dense and sparse (MoE) architectures represent different engineering bets. Neither is universally superior — the right answer depends on your compute budget, latency requirements, and whether you are training or serving.
| Trade-off dimension | Dense (e.g., LLaMA-2 70B) | Sparse MoE (e.g., Mixtral 8×7B) |
| --- | --- | --- |
| Active params per token | 100% of total params | ~12–25% of total params |
| Inference memory | Must load all weights every token | Only active experts loaded |
| Inference latency (batch=1) | Memory-bound; lower throughput | Lower active param load; faster |
| Training cost | Straightforward; all params train on all tokens | Expert routing adds instability; load-balancing loss needed |
| Quality ceiling | Scales predictably with total params | Can match larger dense models at lower active-param cost |
| Fine-tuning | Well-understood; LoRA and full fine-tuning both work | Expert routing can shift during fine-tuning; needs care |
| Quantization | Every weight matters; quality degradation is uniform | Inactive experts tolerate more aggressive quantization |
| Debugging & interpretability | Every activation traceable; no routing to reason about | Router behavior adds a layer of non-determinism |
The key insight: a dense model's cost scales with total parameter count unconditionally. A MoE model's inference cost scales with active parameter count, while training cost scales more with total parameters. Mixtral 8×7B (46B total parameters, ~13B active per token) serves responses with the compute cost of a 13B dense model while approaching the quality of a 70B dense model.
Dense architecture remains the dominant choice for:
- Models below ~20B parameters where MoE routing overhead is disproportionate
- Fine-tuning scenarios requiring full gradient flow across all weights
- Research settings where reproducible, interpretable behavior matters more than raw throughput
🧭 Choosing the Right Dense Model: A Decision Guide for Production and Research
The single most important variable when choosing a dense LLM is the GPU memory envelope. Every byte of the model must fit in VRAM before inference begins. The table below maps use case to model size based on typical hardware constraints:
| Use case | Recommended size | Min VRAM (FP16) | Min VRAM (INT4) | Hardware example |
| --- | --- | --- | --- | --- |
| Personal chatbot / dev testing | 7B | ~16 GB | ~6 GB | RTX 4090, single A10 |
| Quality-first local assistant | 13B | ~28 GB | ~9 GB | 2× RTX 3090 or A10G |
| Production API (low latency) | 7B–13B + vLLM | ~16–28 GB | ~6–9 GB | Single A100 40GB |
| Production API (high quality) | 70B | ~140 GB FP16 | ~40 GB INT4 | 2–4× A100 80GB |
| Fine-tuning on domain data | 7B (full) or 70B (LoRA) | ~60 GB (full 7B) | ~20 GB (LoRA 70B) | A100 or H100 |
| Research / ablation studies | 7B (controlled) | ~16 GB | ~6 GB | Single A100 40GB |
Decision rules:
If you need results today with one GPU: LLaMA-3 8B or Mistral 7B at INT4. Both run on a single consumer GPU and deliver production-grade quality for most NLP tasks.
If quality must be maximized and cost is secondary: LLaMA-2 70B or LLaMA-3 70B in FP16 on four A100s. This is the dense model performance ceiling at open-weight scale.
If you are fine-tuning: Start with 7B full fine-tuning (fits in ~60 GB with gradient checkpointing) or LoRA on 70B (adapters add ~1% of parameter count, fitting in ~40 GB INT4). See the LoRA companion post for full details.
If you need serving throughput over single-request latency: Deploy with vLLM and continuous batching. Throughput scales near-linearly with batch size for dense models; latency degrades gracefully.
If your context regularly exceeds 4,096 tokens: Prefer LLaMA-3 (128k context window) or Mistral's sliding-window variants over GPT-3 class models with fixed 2k–4k context.
🧪 Tracing "The Eiffel Tower is in" Through LLaMA-2 7B: A Layer-by-Layer Walkthrough
The best way to internalize what "dense" means is to follow a concrete token through every stage of a real model. Consider the 5-token input sequence "The Eiffel Tower is in" being processed by LLaMA-2 7B.
Input preparation (pre-block): The tokenizer converts the sentence into token IDs. "Eiffel" and "Tower" may be represented as a single token or split into subwords depending on the vocabulary. Each token ID is looked up in the embedding table (shape: 32,000 × 4,096), producing a 4,096-dimensional vector. Rotary Positional Embeddings are then applied to the Q and K vectors inside each attention head — not to the embeddings directly.
Block 1–8 (early layers — syntax):
The five token vectors pass through eight transformer blocks. Multi-head attention in these layers primarily captures short-range dependencies: which article ("The") modifies which noun, verb tense, and adjacent token relationships. The FFN in each block applies non-linear transforms that activate patterns learned during pre-training — pattern families like "determiner → noun" fire here. After block 8, each of the five token vectors is still shape [4096], but now enriched with syntactic context.
Block 9–24 (middle layers — semantics and world knowledge):
These are the layers where the model's stored world knowledge becomes most active. For the token "Tower", the FFN neurons associated with architectural landmarks, French geography, and iron construction history all activate with non-zero weights — because they do so for every token, always, in a dense model. Attention heads in these layers resolve longer-range entity relationships: "Eiffel Tower" is being treated as a single named entity despite spanning two positions.
Block 25–32 (late layers — task and output prediction):
The representation of the last token "in" is now the model's prediction target context. Late-layer attention heads weight heavily on "Eiffel Tower" and "is" to predict what comes after "in". The FFN activations at layer 30 are strongly correlated with "Paris" as the next token. After block 32, the final token representation passes through the language model head (a linear projection from 4,096 to 32,000), and softmax over the vocabulary gives "Paris" its highest logit.
What makes this "dense": At every one of the 32 blocks, every FFN weight matrix (all 135M parameters per block, 4.3B total) participated in the computation for all five tokens. No parameter was skipped. No weight was cold. Following FLOPs ≈ 2 × N × T, the model paid roughly 2 × 7B × 5 ≈ 70 GFLOPs for this 5-token sequence, with every parameter active.
🚀 LLaMA-3, Mistral, and Falcon in Practice
The most practical dense models available today are LLaMA-3 (Meta, 8B and 70B), Mistral 7B (Mistral AI), and Falcon 40B (TII). All three are open-weight, meaning the parameters are publicly downloadable and can be run on your own hardware.
LLaMA-3 8B fits in 16 GB of GPU memory at FP16 (or 8 GB at 4-bit quantization), making it runnable on a single consumer GPU like an RTX 4090. The architecture is a dense transformer with GQA, RoPE, and SwiGLU — a direct descendant of LLaMA-2 with expanded vocabulary (128k tokens) and longer context training.
Ollama provides the simplest local inference path. Install Ollama, then pull and run a model with two commands: `ollama pull llama3:8b` followed by `ollama run llama3:8b`. A Modelfile customizes generation parameters and the system prompt:

```
# Ollama model configuration (Modelfile)
FROM llama3:8b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a helpful assistant."
```
HuggingFace Transformers supports all three families natively. Load any of them with AutoModelForCausalLM.from_pretrained and specify device_map="auto" to distribute across available GPUs.
vLLM is the production-grade serving engine for dense models. Its PagedAttention memory manager partitions the KV cache into non-contiguous pages, eliminating memory fragmentation and enabling 3–5× higher throughput than naive implementations. For any dense model serving more than a handful of concurrent users, vLLM is the default choice.
For a full deep-dive on production LLM deployment, see the companion posts on quantization and LoRA fine-tuning linked in Related Posts.
📚 Production Lessons from Running Dense Models at Scale
After deploying dense transformer models in production, the same failure patterns emerge repeatedly:
KV cache is the silent memory budget killer. The key-value cache stores the attention keys and values for all previous tokens so they do not need to be recomputed during generation. For a 7B model with 32 layers and a 4,096 model dimension, a single 4,096-token context occupies roughly 2 GB of GPU memory in FP16 (2 × 32 × 4,096 × 2 bytes ≈ 0.5 MB per token). Multiply by 10 concurrent requests, add the ~14 GB of weights, and you have exhausted a 40 GB A100 before a single new token has been generated. Batching strategy, maximum context length, and KV cache eviction policies must be sized together.
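The cache size follows directly from the model dimensions. A calculator for the 7B case, assuming full multi-head attention (no GQA) and FP16 values:

```python
def kv_cache_bytes(layers=32, d_model=4096, seq_len=4096, dtype_bytes=2):
    # One K and one V vector (d_model each) cached per layer per token.
    return 2 * layers * d_model * seq_len * dtype_bytes

gib = kv_cache_bytes() / 2**30
print(f"{gib:.1f} GiB per 4K-token session")  # 2.0 GiB
```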
Quantization degrades quality non-uniformly across a dense model. Because every weight is active for every token, quantizing to INT4 introduces rounding error into every computation in every forward pass. There are no inactive weights to "absorb" errors. The impact is particularly pronounced in late layers, which encode the most fragile task-specific representations. 4-bit NormalFloat (NF4) with double quantization (used by bitsandbytes) is the current best practice for quality-preserving compression — but it is not lossless.
Batch size vs. latency is a fundamental tradeoff, not a tunable parameter. Dense model inference is memory-bandwidth-bound at batch size 1 (each new token requires loading all model weights). Increasing batch size amortizes weight loading across multiple requests, raising throughput at the cost of latency. There is no configuration that achieves both minimum latency and maximum throughput simultaneously — choose based on your SLA.
Layer-depth matters for fine-tuning targeting. If you are adapting a dense model with LoRA for a downstream task, adding adapters only to early attention layers wastes capacity — syntactic features are already well-learned. Target middle and late layers (60–100% of depth) where task-relevant and world-knowledge representations are concentrated.
📌 TLDR: Dense Architecture in Five Rules
- Every parameter activates for every token. No routing, no gating at the model level — 100% parameter utilization per forward pass.
- A transformer block = attention + FFN, wrapped in residuals. Attention gathers context; FFN stores and applies patterns; residuals preserve gradient flow.
- FFN holds most of the parameters. Roughly 60% of weights in a standard dense model live in the FFN layers — this is where factual associations are stored.
- Scaling cost is proportional to total parameter count. FLOPs ≈ 2 × N × T; memory = all weights, always loaded. There is no "inactive" parameter relief at inference time.
- Chinchilla says: match training tokens to model size. ~20 tokens per parameter for compute-optimal training. Bigger is not better if the token budget is fixed.
📝 Practice Quiz: Dense Transformer Architecture
In a dense LLM, how many parameters activate when the model processes a single input token?
- A) Only the parameters in the currently-executing layer
- B) Only the attention layer parameters
- C) All parameters in the entire model
- D) A dynamically routed subset selected per token
Correct Answer: C — Dense means every weight participates in every forward pass unconditionally. Selective activation based on routing is the defining property of sparse MoE architectures, not dense ones.
A transformer block uses a 4× FFN expansion. For LLaMA-2 7B with model dimension 4,096, what is the FFN intermediate dimension?
- A) 1,024
- B) 4,096
- C) 8,192
- D) 16,384
Correct Answer: D — The canonical 4× rule gives 4 × 4,096 = 16,384. (LLaMA-2's SwiGLU FFN actually uses 11,008, about 8/3 × d_model, to keep its three-matrix parameter count comparable.) This expansion provides the FFN with capacity to store a wide variety of factual patterns before compressing back to model dimension.
Why is standard self-attention described as O(n²) with respect to sequence length n?
- A) Weight matrix size grows quadratically with model dimension
- B) Each of the n tokens computes an attention score against every other n token
- C) The FFN expansion ratio scales with n²
- D) Residual connections double computation at each layer
Correct Answer: B — Attention produces an n × n score matrix. A 4,096-token sequence requires ~16.7M score computations per head per layer. This quadratic cost is why sliding window attention, sparse attention approximations, and FlashAttention exist.
Open-ended challenge: Your team must serve LLaMA-2 70B to 100 concurrent users with p99 latency under 3 seconds per response. Name the two largest bottlenecks specific to dense model inference and propose a concrete mitigation for each.
No single correct answer. Strong responses identify: (1) memory bandwidth — 70B × 2 bytes FP16 = 140 GB of weights loaded per token per request, mitigated by tensor parallelism across multiple A100s or by quantizing to INT8/NF4 with bitsandbytes; (2) KV cache memory growth — 100 concurrent sessions with 2K context, 80 layers, and GQA's 8 KV heads (1,024-dim KV per layer) work out to roughly 67 GB of KV cache alone, mitigated by PagedAttention (vLLM) and strict max-context limits per session. Bonus: continuous batching to amortize weight loads across many users simultaneously.
According to Chinchilla scaling laws, approximately how many training tokens are compute-optimal for a 7B parameter dense model?
- A) 7 billion tokens
- B) 70 billion tokens
- C) 140 billion tokens
- D) 1 trillion tokens
Correct Answer: C — Chinchilla recommends ~20 tokens per parameter. 7B × 20 = 140B tokens. LLaMA-2 7B was trained on 2T tokens (significantly over-trained by Chinchilla standards), which generally improves downstream benchmark quality but at much higher training compute cost.
🔗 Related Posts

Written by Abstract Algorithms (@abstractalgorithms)

More Posts:

- Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF
- Chain of Thought Prompting: Teaching LLMs to Think Step by Step
- Transfer Learning Explained: Standing on the Shoulders of Pretrained Models
- LLM Hallucinations: Causes, Detection, and Mitigation Strategies