How Transformer Architecture Works: A Deep Dive
The Transformer is the engine behind ChatGPT, BERT, and Claude. We break down Self-Attention, Multi-Head Attention, and the full encoder-decoder architecture.
TLDR: The Transformer is the architecture behind every major LLM (GPT, BERT, Claude, Gemini). Its core innovation is Self-Attention: a mechanism that lets the model weigh relationships between all tokens in a sequence simultaneously, regardless of distance. This enables parallelism that RNNs could not achieve.
The Cocktail Party Listener: Attention as a Mental Model
Before Transformers, RNNs read text left-to-right, one word at a time, like reading a book. By the time the model reached the end of a long sentence, earlier words had faded from its hidden state.
Transformers are different. They stand in the center of a cocktail party and listen to everyone simultaneously:
- 80% attention on the friend telling a story (relevant context).
- 10% on the background music (modifier words).
- 10% on the waiter (punctuation, structure).
The model builds this attention map over the entire sentence at once, with no sequential dependency. This is why Transformers train faster on modern GPUs: every token can be processed in parallel.
From Tokens to Embeddings: Preparing the Input
Before any attention computation, input text is converted to tensors:
- Tokenization: Split text into subword tokens, e.g. "unhappiness" → ["un", "##happ", "##iness"] (WordPiece / BPE).
- Token Embeddings: Map each token ID to a learned dense vector (dimension = 768 for BERT-base, 12,288 for GPT-3).
- Positional Encodings: Attention has no inherent sense of order. A positional encoding vector is added to each token embedding so the model knows position 0, position 1, etc.

Final Input = Token Embedding + Positional Encoding
| Component | Dimension | Learnable? |
|---|---|---|
| Token embedding | 768 (BERT-base) | Yes |
| Positional encoding (sinusoidal) | 768 | No (fixed) |
| Positional encoding (learned) | 768 | Yes (GPT-2+) |
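The input-preparation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real tokenizer: the vocabulary size, token IDs, and random embedding table are all hypothetical stand-ins, while the sinusoidal formula follows the original paper.

```python
import numpy as np

d_model, vocab_size, seq_len = 768, 30522, 4  # BERT-base-like sizes

# Learned token embedding table (randomly initialized stand-in).
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model)) * 0.02

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos encodings from Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

token_ids = np.array([101, 2023, 2003, 102])      # hypothetical IDs
x = embedding_table[token_ids] + sinusoidal_positional_encoding(seq_len, d_model)
print(x.shape)  # (4, 768)
```

Note that the sum (rather than concatenation) works because the network can learn to keep positional and semantic information in different subspaces of the 768-dim vector.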
Self-Attention: How Every Token Reads the Room
The Q, K, V Framework
For each token, the model creates three vectors via learned linear projections:
- Q (Query): "What am I looking for?"
- K (Key): "What do I advertise as my content?"
- V (Value): "What do I contribute if selected?"
The attention score between token $i$ and token $j$ is:
$$\text{score}(i, j) = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}}$$
Divide by $\sqrt{d_k}$ (square root of key dimension) to prevent dot products from growing so large that softmax saturates.
Softmax normalizes scores into weights that sum to 1:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
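The formula above maps directly to a few lines of NumPy. This is a minimal single-head sketch with random Q, K, V matrices standing in for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

n, d_k = 5, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 64): one weighted mix of value vectors per token
```

Each output row is a convex combination of the value vectors, with weights given by how well that token's query matches every key.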
Concrete Example
Sentence: "The animal didn't cross the street because it was too tired."
When the model processes the token "it", self-attention computes:
- High score to "animal" → "it" refers to the animal.
- Low score to "street" → not the referent.
Without attention, a model might incorrectly resolve the pronoun based on proximity.
```mermaid
flowchart LR
    subgraph Self-Attention for "it"
    it["Token: it"] -->|high score| animal["Token: animal"]
    it -->|low score| street["Token: street"]
    end
```
Multi-Head Attention: Learning Parallel Relationship Types
Running attention once captures one type of relationship. Multi-Head Attention runs $h$ parallel attention mechanisms (heads) and concatenates their outputs.
| Head | What it tends to learn |
|---|---|
| Head 1 | Syntactic relations (subject → verb) |
| Head 2 | Coreference resolution (pronoun → noun) |
| Head 3 | Semantic similarity (synonyms) |
| Head 4+ | Long-range dependencies, modifier attachment, etc. |
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$
$$\text{where head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
GPT-3: 96 heads × 128 dimensions each = 12,288-dim model.
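A minimal NumPy sketch of the head-splitting mechanics, assuming one shared projection matrix per role (Wq, Wk, Wv, Wo are random stand-ins for learned weights; real implementations add masking, batching, and dropout):

```python
import numpy as np

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    n, d_model = x.shape
    d_head = d_model // num_heads
    # Project once, then split into heads: (num_heads, n, d_head).
    def split(W):
        return (x @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    # Scaled dot-product attention, batched over heads.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)       # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                 # (h, n, d_head)
    # Concatenate heads back to (n, d_model), then apply output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
n, d_model, h = 6, 64, 8
x = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, h, *Ws)
print(out.shape)  # (6, 64)
```

The key point: splitting does not add parameters relative to one big head; it partitions the same d_model budget into h independent attention patterns.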
The Full Encoder-Decoder Architecture
The original "Attention Is All You Need" (Vaswani et al., 2017) paper used an Encoder-Decoder structure:
```mermaid
flowchart TD
    Input["Input Tokens\n(source language)"] --> Encoder["Encoder Stack\n(N × {Self-Attention + FFN})"]
    Encoder --> CrossAttn["Cross-Attention\n(decoder reads encoder output)"]
    Output["Output Tokens\n(target language, shifted right)"] --> Decoder["Decoder Stack\n(N × {Masked Self-Attention + Cross-Attention + FFN})"]
    Decoder --> CrossAttn
    CrossAttn --> Linear["Linear + Softmax"]
    Linear --> Prediction["Next Token"]
```
Encoder-only (BERT): Reads the full sequence bidirectionally. Best for classification, NER, embeddings.
Decoder-only (GPT): Autoregressive: each token attends only to past tokens (causal mask). Best for generation.
Encoder-Decoder (T5, BART): Input → encoder; generated output → decoder. Best for translation, summarization.
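The causal mask that separates GPT-style decoders from BERT-style encoders is just a lower-triangular matrix applied before the softmax. A minimal sketch with random scores:

```python
import numpy as np

n = 5
# Causal mask: position i may attend only to positions j <= i.
mask = np.tril(np.ones((n, n), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))
scores = np.where(mask, scores, -np.inf)  # blocked positions -> -inf
# Softmax turns -inf entries into exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # first token attends only to itself: [1. 0. 0. 0. 0.]
```

An encoder-only model simply skips the mask, so every token sees the full bidirectional context.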
Transformer Scaling Laws and Limitations
Scaling Laws (Chinchilla, 2022)
Training loss decreases predictably with model size and data:
$$L \propto N^{-\alpha} \cdot D^{-\beta}$$
Where $N$ = parameters, $D$ = training tokens. Chinchilla showed that many "large" models were undertrained: compute-optimal training scales model size and data roughly in tandem (about 20 training tokens per parameter).
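The scaling relation can be made concrete with the parametric fit from the Chinchilla paper. The constants below are the published Hoffmann et al. (2022) estimates; treat them as illustrative, since the exact fit depends on architecture and dataset:

```python
# Chinchilla parametric loss fit: L(N, D) = E + A/N^alpha + B/D^beta
# Constants as reported by Hoffmann et al. (2022); illustrative only.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted training loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

# A 70B model on 1.4T tokens (~20 tokens/parameter, Chinchilla-style)
# versus a 175B model on only 300B tokens (GPT-3-style):
print(loss(70e9, 1.4e12))   # compute-optimal split -> lower loss
print(loss(175e9, 300e9))   # larger but undertrained -> higher loss
```

Under this fit, the smaller, longer-trained model ends up with the lower predicted loss, which is exactly the "undertrained" argument above.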
Quadratic Attention Complexity
Self-attention computes dot products between all pairs of tokens:
$$\text{Complexity} = O(n^2 \cdot d)$$
For a 4096-token context, that's ~16.8M attention scores per head, per layer. This is why long-context models (100K+ tokens) require approximations:
- Sparse attention (Longformer, BigBird): attend only to local windows + global tokens.
- Flash Attention: I/O-aware CUDA kernel that avoids materializing the full $n \times n$ attention matrix in HBM.
- RoPE + ALiBi: Positional encodings that generalize better to unseen context lengths.
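The quadratic blow-up is easy to quantify with back-of-the-envelope arithmetic. A small sketch (assuming fp32 and a hypothetical per-head dimension of 128):

```python
def attention_cost(n, d, bytes_per_float=4):
    """Rough per-layer cost of materializing the full n x n attention matrix."""
    flops = n * n * d                      # QK^T dot products (one head)
    matrix_bytes = n * n * bytes_per_float  # the attention matrix itself
    return flops, matrix_bytes

for n in (4_096, 100_000):
    flops, mem = attention_cost(n, d=128)
    print(f"n={n:>7}: {flops/1e9:8.1f} GFLOPs, {mem/1e9:6.1f} GB attention matrix")
```

At 100K tokens the fp32 attention matrix alone is 40 GB per layer, which is why Flash Attention's refusal to materialize it is such a large win.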
Layer Normalization and Residual Connections
Each sub-layer (attention, FFN) is wrapped in:
output = LayerNorm(x + Sublayer(x))
The residual connection (x +) prevents gradient vanishing in deep stacks (GPT-4 reportedly uses ~120 layers). LayerNorm stabilizes training without batch statistics.
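The wrapping above (the original paper's post-LN arrangement) is a one-liner in NumPy. A minimal sketch, using a scaled identity as a stand-in sublayer and omitting LayerNorm's learned gain/bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features independently: no batch statistics.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-LN wrapping from the original paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))
out = post_norm_block(x, lambda h: 0.1 * h)  # stand-in for attention/FFN
print(out.shape)  # (4, 768); each row now has ~zero mean, ~unit variance
```

Many modern decoder-only models instead use pre-LN, `x + Sublayer(LayerNorm(x))`, which tends to be more stable in very deep stacks.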
Inside the FFN: Position-wise Feed-Forward Network
After attention, each token passes independently through a 2-layer MLP:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$
The intermediate dimension is typically 4ร the model dimension (e.g., 3072 hidden for 768-dim BERT). This FFN is where much of the model's factual knowledge is believed to be stored โ key-value memories for facts like "Paris is the capital of France."
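The FFN formula translates directly to two matrix multiplies around a GELU. A minimal sketch with random stand-in weights at BERT-base sizes (the tanh GELU approximation below is the variant BERT and GPT-2 use):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (used by BERT / GPT-2).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, nonlinearity, project back; applied per token.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 768, 3072  # BERT-base: 4x expansion
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.standard_normal((3, d_model))   # 3 tokens
print(ffn(x, W1, b1, W2, b2).shape)     # (3, 768)
```

Because the same weights are applied to every position independently, the FFN mixes information across feature dimensions but never across tokens; cross-token mixing happens only in attention.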
Summary
- Self-Attention computes pairwise importance scores across all tokens using Q/K/V projections. Complexity is $O(n^2)$.
- Multi-Head Attention runs $h$ parallel attention streams to capture different relationship types.
- Positional Encoding injects token order information because attention itself is permutation-invariant.
- Encoder-only (BERT) = bidirectional; decoder-only (GPT) = autoregressive causal; encoder-decoder (T5) = seq2seq.
- Flash Attention + sparse patterns address the quadratic memory bottleneck at long contexts.
- FFN layers act as key-value memories; residual connections + LayerNorm enable very deep stacks.
Practice Quiz
Why is self-attention divided by $\sqrt{d_k}$ before the softmax?
- A) To normalize token embeddings to unit length.
- B) To prevent dot products from growing so large that softmax gradients vanish (saturation).
- C) To apply weight decay during training.
Answer: B
A model must process a 100,000-token book. What is the main bottleneck with standard self-attention?
- A) Tokenization speed.
- B) $O(n^2)$ memory — the 100K × 100K attention matrix requires ~40 GB of memory per layer.
- C) The FFN cannot handle sequences longer than 4096 tokens.
Answer: B
What is the key difference between an encoder-only model (BERT) and a decoder-only model (GPT)?
- A) BERT uses more parameters.
- B) BERT attends bidirectionally (full context); GPT uses a causal mask (only past tokens) for autoregressive generation.
- C) GPT uses positional encodings; BERT does not.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms
