How Transformer Architecture Works: A Deep Dive
The Transformer is the engine behind ChatGPT, BERT, and Claude. We break down Self-Attention, Multi-Head Attention, and the full encoder-decoder architecture.
TLDR: The Transformer is the architecture behind every major LLM (GPT, BERT, Claude, Gemini). Its core innovation is Self-Attention: a mechanism that lets the model weigh relationships between all tokens in a sequence simultaneously, regardless of distance. This enables parallelism that RNNs could not achieve.
The Cocktail Party Listener: Attention as a Mental Model
Before Transformers, RNNs read text left-to-right, one word at a time, like reading a book. By the time the model reached the end of a long sentence, earlier words had faded from its hidden state.
Transformers are different. They stand in the center of a cocktail party and listen to everyone simultaneously:
- 80% attention on the friend telling a story (relevant context).
- 10% on the background music (modifier words).
- 10% on the waiter (punctuation, structure).
The model builds this attention map over the entire sentence at once, with no sequential dependency. This is why Transformers train faster on modern GPUs: every token can be processed in parallel.
From Tokens to Embeddings: Preparing the Input
Before any attention computation, input text is converted to tensors:
- Tokenization: Split text into subword tokens, e.g. "unhappiness" → ["un", "##happ", "##iness"] (WordPiece / BPE).
- Token Embeddings: Map each token ID to a learned dense vector (dimension = 768 for BERT-base, 12,288 for GPT-3).
- Positional Encodings: Attention has no inherent sense of order. A positional encoding vector is added to each token embedding so the model knows position 0, position 1, etc.

Final Input = Token Embedding + Positional Encoding
| Component | Dimension | Learnable? |
|---|---|---|
| Token embedding | 768 (BERT-base) | Yes |
| Positional encoding (sinusoidal) | 768 | No (fixed) |
| Positional encoding (learned) | 768 | Yes (GPT-2+) |
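The input-preparation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real tokenizer: the vocabulary size, token IDs, and random embedding table are all hypothetical stand-ins, while the sinusoidal formula follows the original paper.

```python
import numpy as np

d_model, vocab_size, seq_len = 768, 30522, 4  # BERT-base-like sizes

# Learned token embedding table (randomly initialized stand-in).
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model)) * 0.02

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos encodings from Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

token_ids = np.array([101, 2023, 2003, 102])      # hypothetical IDs
x = embedding_table[token_ids] + sinusoidal_positional_encoding(seq_len, d_model)
print(x.shape)  # (4, 768)
```

Note that the sum (rather than concatenation) works because the network can learn to keep positional and semantic information in different subspaces of the 768-dim vector.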
Self-Attention: How Every Token Reads the Room
The Q, K, V Framework
For each token, the model creates three vectors via learned linear projections:
- Q (Query): "What am I looking for?"
- K (Key): "What do I advertise as my content?"
- V (Value): "What do I contribute if selected?"
The attention score between token $i$ and token $j$ is:
$$\text{score}(i, j) = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}}$$
Divide by $\sqrt{d_k}$ (square root of key dimension) to prevent dot products from growing so large that softmax saturates.
Softmax normalizes scores into weights that sum to 1:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
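The formula above maps directly to a few lines of NumPy. This is a minimal single-head sketch with random Q, K, V matrices standing in for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

n, d_k = 5, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 64): one weighted mix of value vectors per token
```

Each output row is a convex combination of the value vectors, with weights given by how well that token's query matches every key.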
Concrete Example
Sentence: "The animal didn't cross the street because it was too tired."
When the model processes the token "it", self-attention computes:
- High score to "animal" → "it" refers to the animal.
- Low score to "street" → not the referent.
Without attention, a model might incorrectly resolve the pronoun based on proximity.
```mermaid
flowchart LR
    subgraph Self-Attention for "it"
    it["Token: it"] -->|high score| animal["Token: animal"]
    it -->|low score| street["Token: street"]
    end
```
Multi-Head Attention: Learning Parallel Relationship Types
Running attention once captures one type of relationship. Multi-Head Attention runs $h$ parallel attention mechanisms (heads) and concatenates their outputs.
| Head | What it tends to learn |
|---|---|
| Head 1 | Syntactic relations (subject → verb) |
| Head 2 | Coreference resolution (pronoun → noun) |
| Head 3 | Semantic similarity (synonyms) |
| Head 4+ | Long-range dependencies, modifier attachment, etc. |
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$
$$\text{where head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
GPT-3: 96 heads × 128 dimensions each = 12,288-dim model.
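A minimal NumPy sketch of the head-splitting mechanics, assuming one shared projection matrix per role (Wq, Wk, Wv, Wo are random stand-ins for learned weights; real implementations add masking, batching, and dropout):

```python
import numpy as np

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    n, d_model = x.shape
    d_head = d_model // num_heads
    # Project once, then split into heads: (num_heads, n, d_head).
    def split(W):
        return (x @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    # Scaled dot-product attention, batched over heads.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)       # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                 # (h, n, d_head)
    # Concatenate heads back to (n, d_model), then apply output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
n, d_model, h = 6, 64, 8
x = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, h, *Ws)
print(out.shape)  # (6, 64)
```

The key point: splitting does not add parameters relative to one big head; it partitions the same d_model budget into h independent attention patterns.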
The Full Encoder-Decoder Architecture
The original "Attention Is All You Need" (Vaswani et al., 2017) paper used an Encoder-Decoder structure:
```mermaid
flowchart TD
    Input["Input Tokens\n(source language)"] --> Encoder["Encoder Stack\n(N × {Self-Attention + FFN})"]
    Encoder --> CrossAttn["Cross-Attention\n(decoder reads encoder output)"]
    Output["Output Tokens\n(target language, shifted right)"] --> Decoder["Decoder Stack\n(N × {Masked Self-Attention + Cross-Attention + FFN})"]
    Decoder --> CrossAttn
    CrossAttn --> Linear["Linear + Softmax"]
    Linear --> Prediction["Next Token"]
```
Encoder-only (BERT): Reads the full sequence bidirectionally. Best for classification, NER, embeddings.
Decoder-only (GPT): Autoregressive: each token attends only to past tokens (causal mask). Best for generation.
Encoder-Decoder (T5, BART): Input → encoder; generated output → decoder. Best for translation, summarization.
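The causal mask that separates GPT-style decoders from BERT-style encoders is just a lower-triangular matrix applied before the softmax. A minimal sketch with random scores:

```python
import numpy as np

n = 5
# Causal mask: position i may attend only to positions j <= i.
mask = np.tril(np.ones((n, n), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))
scores = np.where(mask, scores, -np.inf)  # blocked positions -> -inf
# Softmax turns -inf entries into exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # first token attends only to itself: [1. 0. 0. 0. 0.]
```

An encoder-only model simply skips the mask, so every token sees the full bidirectional context.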
Transformer Scaling Laws and Limitations
Scaling Laws (Chinchilla, 2022)
Training loss decreases predictably with model size and data:
$$L \propto N^{-\alpha} \cdot D^{-\beta}$$
Where $N$ = parameters, $D$ = training tokens. Chinchilla showed that many "large" models were undertrained: compute-optimal training scales model size and data roughly in tandem (about 20 training tokens per parameter).
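The scaling relation can be made concrete with the parametric fit from the Chinchilla paper. The constants below are the published Hoffmann et al. (2022) estimates; treat them as illustrative, since the exact fit depends on architecture and dataset:

```python
# Chinchilla parametric loss fit: L(N, D) = E + A/N^alpha + B/D^beta
# Constants as reported by Hoffmann et al. (2022); illustrative only.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted training loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

# A 70B model on 1.4T tokens (~20 tokens/parameter, Chinchilla-style)
# versus a 175B model on only 300B tokens (GPT-3-style):
print(loss(70e9, 1.4e12))   # compute-optimal split -> lower loss
print(loss(175e9, 300e9))   # larger but undertrained -> higher loss
```

Under this fit, the smaller, longer-trained model ends up with the lower predicted loss, which is exactly the "undertrained" argument above.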
Quadratic Attention Complexity
Self-attention computes dot products between all pairs of tokens:
$$\text{Complexity} = O(n^2 \cdot d)$$
For a 4096-token context, that's ~16.8M attention scores per head, per layer. This is why long-context models (100K+ tokens) require approximations:
- Sparse attention (Longformer, BigBird): attend only to local windows + global tokens.
- Flash Attention: I/O-aware CUDA kernel that avoids materializing the full $n \times n$ attention matrix in HBM.
- RoPE + ALiBi: Positional encodings that generalize better to unseen context lengths.
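The quadratic blow-up is easy to quantify with back-of-the-envelope arithmetic. A small sketch (assuming fp32 and a hypothetical per-head dimension of 128):

```python
def attention_cost(n, d, bytes_per_float=4):
    """Rough per-layer cost of materializing the full n x n attention matrix."""
    flops = n * n * d                      # QK^T dot products (one head)
    matrix_bytes = n * n * bytes_per_float  # the attention matrix itself
    return flops, matrix_bytes

for n in (4_096, 100_000):
    flops, mem = attention_cost(n, d=128)
    print(f"n={n:>7}: {flops/1e9:8.1f} GFLOPs, {mem/1e9:6.1f} GB attention matrix")
```

At 100K tokens the fp32 attention matrix alone is 40 GB per layer, which is why Flash Attention's refusal to materialize it is such a large win.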
Layer Normalization and Residual Connections
Each sub-layer (attention, FFN) is wrapped in:
output = LayerNorm(x + Sublayer(x))
The residual connection (x +) prevents gradient vanishing in deep stacks (GPT-4 reportedly uses ~120 layers). LayerNorm stabilizes training without batch statistics.
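The wrapping above (the original paper's post-LN arrangement) is a one-liner in NumPy. A minimal sketch, using a scaled identity as a stand-in sublayer and omitting LayerNorm's learned gain/bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features independently: no batch statistics.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-LN wrapping from the original paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))
out = post_norm_block(x, lambda h: 0.1 * h)  # stand-in for attention/FFN
print(out.shape)  # (4, 768); each row now has ~zero mean, ~unit variance
```

Many modern decoder-only models instead use pre-LN, `x + Sublayer(LayerNorm(x))`, which tends to be more stable in very deep stacks.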
Inside the FFN: Position-wise Feed-Forward Network
After attention, each token passes independently through a 2-layer MLP:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$
The intermediate dimension is typically 4ร the model dimension (e.g., 3072 hidden for 768-dim BERT). This FFN is where much of the model's factual knowledge is believed to be stored โ key-value memories for facts like "Paris is the capital of France."
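The FFN formula translates directly to two matrix multiplies around a GELU. A minimal sketch with random stand-in weights at BERT-base sizes (the tanh GELU approximation below is the variant BERT and GPT-2 use):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (used by BERT / GPT-2).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, nonlinearity, project back; applied per token.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 768, 3072  # BERT-base: 4x expansion
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.standard_normal((3, d_model))   # 3 tokens
print(ffn(x, W1, b1, W2, b2).shape)     # (3, 768)
```

Because the same weights are applied to every position independently, the FFN mixes information across feature dimensions but never across tokens; cross-token mixing happens only in attention.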
Summary
- Self-Attention computes pairwise importance scores across all tokens using Q/K/V projections. Complexity is $O(n^2)$.
- Multi-Head Attention runs $h$ parallel attention streams to capture different relationship types.
- Positional Encoding injects token order information because attention itself is permutation-invariant.
- Encoder-only (BERT) = bidirectional; decoder-only (GPT) = autoregressive causal; encoder-decoder (T5) = seq2seq.
- Flash Attention + sparse patterns address the quadratic memory bottleneck at long contexts.
- FFN layers act as key-value memories; residual connections + LayerNorm enable very deep stacks.
Practice Quiz
Why is self-attention divided by $\sqrt{d_k}$ before the softmax?
- A) To normalize token embeddings to unit length.
- B) To prevent dot products from growing so large that softmax gradients vanish (saturation).
- C) To apply weight decay during training.
Answer: B
A model must process a 100,000-token book. What is the main bottleneck with standard self-attention?
- A) Tokenization speed.
- B) $O(n^2)$ memory — the 100K × 100K attention matrix requires ~40 GB of memory per layer.
- C) The FFN cannot handle sequences longer than 4096 tokens.
Answer: B
What is the key difference between an encoder-only model (BERT) and a decoder-only model (GPT)?
- A) BERT uses more parameters.
- B) BERT attends bidirectionally (full context); GPT uses a causal mask (only past tokens) for autoregressive generation.
- C) GPT uses positional encodings; BERT does not.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms
