Large Language Models (LLMs): The Generative AI Revolution
TL;DR
This guide explains how LLMs are built, how they are trained, and why they sometimes hallucinate.

Introduction: Scale Changes Everything
We learned about Transformers in previous posts. An LLM is just a Transformer... but BIG.
- Big Data: Trained on trillions of tokens of text (books, websites, code).
- Big Parameters: Hundreds of billions of learnable weights.
- Big Compute: Trained on thousands of GPUs for months.
When you scale a model this much, it doesn't just get better at grammar; it develops Emergent Properties—abilities it wasn't explicitly trained for, like solving math problems or writing Python code.
1. The Architecture of Giants (GPT, BERT, T5)
While all LLMs use Transformers, they use them differently.
A. GPT (Generative Pre-trained Transformer)
- Type: Decoder-Only.
- Goal: Predict the next word (Causal Language Modeling).
- Mechanism: It can only see words to the left of the current position.
- Input: "The cat sat on the..."
- Prediction: "mat".
- Best For: Writing, coding, creative generation.
- Examples: ChatGPT, Llama 3, Claude.
B. BERT (Bidirectional Encoder Representations from Transformers)
- Type: Encoder-Only.
- Goal: Predict a masked word in the middle of a sentence (Masked Language Modeling).
- Mechanism: It sees words on both sides (left and right) simultaneously.
- Input: "The [MASK] sat on the mat."
- Prediction: "cat".
- Best For: Understanding, classification, search engines (Google Search).
- Examples: BERT, RoBERTa.
C. T5 (Text-to-Text Transfer Transformer)
- Type: Encoder-Decoder.
- Goal: Translate input text to output text.
- Mechanism: The Encoder reads the input (bidirectional), and the Decoder generates the output (unidirectional).
- Input: "translate English to German: That is good."
- Output: "Das ist gut."
- Best For: Translation, summarization.
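Under the hood, the key mechanical difference between these three designs is the attention mask. Here is a minimal sketch in PyTorch (sizes illustrative, not from any real model):

```python
import torch

seq_len = 5

# GPT-style (Decoder-Only): position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# BERT-style (Encoder-Only): every position attends to every position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```

A T5-style Encoder-Decoder combines both: the bidirectional mask in the Encoder, the causal one in the Decoder.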
2. How LLMs Are Built: The Training Pipeline
Building an LLM is a multi-stage process.
Stage 1: Pre-Training (The "Bookworm" Phase)
The model reads the internet. Its only goal is: Predict the next word.
- Self-Supervised Learning: We don't need humans to label data. We just take a sentence, hide the last word, and ask the model to guess it.
- The Math (Cross-Entropy Loss):
$$ L = -\sum_i y_i \log(\hat{y}_i) $$
- With a one-hot target, this collapses to $-\log$ of the probability the model assigned to the correct word.
- If the next word is "cat" and the model predicts "cat" with probability 0.9, the loss is low ($-\log 0.9 \approx 0.1$).
- If it puts 0.8 on "dog" and only 0.2 on "cat", the loss is high ($-\log 0.2 \approx 1.6$).
- Result: A "Base Model" that knows facts and grammar but is unruly. (If you ask "How to make a bomb?", it might answer because it read a chemistry book.)
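To make the loss concrete, here is a tiny PyTorch sketch; the vocabulary and logits are invented for illustration:

```python
import torch
import torch.nn.functional as F

# Toy vocabulary; the correct next word ("cat") sits at index 1.
vocab = ["the", "cat", "dog", "mat"]
target = torch.tensor([1])

# Hypothetical model outputs (logits) for the next token.
confident = torch.tensor([[0.1, 3.0, 0.5, 0.2]])  # most mass on "cat"
wrong     = torch.tensor([[0.1, 0.5, 3.0, 0.2]])  # most mass on "dog"

print(F.cross_entropy(confident, target))  # low loss (~0.18)
print(F.cross_entropy(wrong, target))      # high loss (~2.7)
```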
Stage 2: Fine-Tuning (The "Specialist" Phase)
We take the Base Model and train it on a smaller, high-quality dataset of "Instruction -> Answer" pairs.
- Input: "Summarize this article."
- Target: [A perfect summary].
- Technique (LoRA - Low-Rank Adaptation): Instead of updating all 100B weights (expensive), we freeze the main model and only train tiny "adapter" layers. This makes fine-tuning cheap and fast.
- Result: An "Instruct Model" that follows orders.
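To see why LoRA is cheap, here is a minimal sketch of the idea in PyTorch (the dimensions, rank, and the `LoRALinear` name are illustrative, not from any particular library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained weight plus a trainable low-rank update (B @ A)."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False             # freeze the big matrix
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zeros => no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,} trainable vs {layer.base.weight.numel():,} frozen")
# 65,536 trainable vs 16,777,216 frozen
```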
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
This is the "Polishing" phase to make it safe and helpful.
- Human Ranking: Humans chat with the AI and rate two answers (A vs B).
- Prompt: "Write a poem about love."
- Answer A: [Good poem] (Winner).
- Answer B: [Bad poem].
- Reward Model: We train a separate AI to predict what humans like. It learns to output a score (e.g., 7/10).
- PPO (Proximal Policy Optimization): We use Reinforcement Learning to tune the LLM to maximize that reward score.
- Goal: Maximize Reward minus a Penalty for drifting too far from the original model (see the objective below).
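In equation form, the standard InstructGPT-style objective looks roughly like this (a simplified sketch; $\beta$ is a tunable coefficient):

$$ \max_{\theta} \; \mathbb{E}\left[ r_\phi(x, y) \right] - \beta \, D_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right) $$

Here $r_\phi$ is the Reward Model's score, $\pi_\theta$ is the model being tuned, and $\pi_{\mathrm{ref}}$ is the frozen pre-RLHF model that the KL term keeps it anchored to.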
3. Modern Tricks for Speed and Smarts
Training these giants is hard. Engineers invented clever tricks to make them work.
A. Attention Optimizations
- Multi-Head Attention (Standard): Every "head" has its own Query, Key, and Value matrices. (Slow, high memory).
- Multi-Query Attention (MQA): All heads share the same Key and Value matrices.
- Benefit: Drastically reduces memory usage (KV Cache) during chat.
- Grouped Query Attention (GQA): A middle ground (used in Llama 2/3). Heads are grouped, and each group shares Keys/Values.
- Benefit: Good balance of speed and quality.
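A shape-level sketch of GQA in PyTorch (head counts are illustrative; MQA is just the special case of a single KV head):

```python
import torch

batch, seq, head_dim = 1, 10, 128
n_q_heads, n_kv_heads = 32, 8        # GQA: every 4 query heads share one KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # KV cache is 4x smaller
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand K/V so each group of query heads reads the same keys/values.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 32, 10, 128])
```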
B. Positional Encodings 2.0
- RoPE (Rotary Positional Embeddings): Instead of adding a position signal to the vector, we rotate pairs of its dimensions (sketched in code after this list).
- Math: To encode position $m$, we rotate the vector by angle $m\theta$.
- Benefit: Relative distance is preserved perfectly. If word A and B are 5 steps apart, the rotation difference is always the same, regardless of where they are in the sentence.
- ALiBi (Attention with Linear Biases): Adds a penalty to attention scores that grows linearly with the distance between tokens.
- Benefit: Allows models to read longer texts than they were trained on.
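Here is a simplified sketch of the RoPE rotation (real implementations precompute the angles and handle batching; this shows just the core idea):

```python
import torch

def rope(x, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half) / half)            # one theta per pair
    angles = torch.arange(seq_len)[:, None] * freqs[None]   # angle = m * theta
    x1, x2 = x[:, 0::2], x[:, 1::2]                         # split into 2D pairs
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                      # standard 2D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(6, 8)  # 6 positions, 8-dimensional vectors
print(rope(x).shape)   # torch.Size([6, 8])
```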
C. Activation Functions
- ReLU (Old): $f(x) = \max(0, x)$. (Simple, but "dead neurons" problem).
- GELU (Gaussian Error Linear Unit): $f(x) = x \Phi(x)$, where $\Phi$ is the Gaussian CDF. (Smoother curve).
- SwiGLU (Swish-Gated Linear Unit): Used in Llama and PaLM.
- Math: $\text{SwiGLU}(x) = (\text{Swish}(xW) \odot xV)W_2$, where $\odot$ is element-wise multiplication.
- Benefit: Empirically learns better in deep networks.
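A minimal sketch of the SwiGLU feed-forward block (the hidden size here is arbitrary; `F.silu` is PyTorch's name for Swish):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Llama-style feed-forward: (Swish(xW) * xV) W2, with biases omitted."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)   # gate projection
        self.v = nn.Linear(dim, hidden, bias=False)   # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w(x)) * self.v(x))

ffn = SwiGLU(dim=512, hidden=1376)
print(ffn(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```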
D. Normalization
- LayerNorm (Standard): Normalizes inputs to have mean 0 and variance 1. Requires calculating mean and variance.
- RMSNorm (Root Mean Square Normalization): Only scales by the root mean square (ignores mean).
- Math: $\bar{a}_i = \frac{a_i}{\text{RMS}(a)} g_i$, where $\text{RMS}(a) = \sqrt{\frac{1}{n}\sum_{j} a_j^2}$
- Benefit: Faster to calculate, same performance.
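RMSNorm is only a few lines; here is a sketch matching the formula above:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by the root mean square only -- no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))  # learned gain g_i
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.g

norm = RMSNorm(512)
print(norm(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```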
E. KV Cache (Key-Value Cache)
- The Problem: When generating a 1000-word essay, for the 1001st word, the model has to re-calculate attention for the first 1000 words.
- The Solution: Save (Cache) the Key and Value matrices of the past tokens in GPU memory.
- The Result: We only compute the Query/Key/Value for the new token and attend against the cache. Each new word costs roughly $O(N)$ work instead of $O(N^2)$ re-computation.
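A toy single-head sketch of the cache (shapes illustrative; real engines keep one cache per layer and per head):

```python
import torch

head_dim = 64
k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(q_new, k_new, v_new):
    """Append the new token's K/V, then attend against the whole cache."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = torch.stack(k_cache)                    # (steps, head_dim), never recomputed
    V = torch.stack(v_cache)
    scores = (q_new @ K.T) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ V

for _ in range(5):                              # generate 5 tokens
    q = k = v = torch.randn(head_dim)
    out = decode_step(q, k, v)
print(out.shape, len(k_cache))                  # torch.Size([64]) 5
```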
Summary & Key Takeaways
- LLMs are massive Transformers trained to predict the next word.
- Architectures: GPT (Decoder) for generation, BERT (Encoder) for understanding, T5 (Encoder-Decoder) for text-to-text.
- Training: Pre-training (Knowledge) -> Fine-Tuning (Behavior) -> RLHF (Safety).
- Optimization: Tricks like RoPE, RMSNorm, and KV Cache make these massive models fast enough to use.
What's Next?
We have these powerful brains, but they are trapped in a chat window. How do we give them memory, tools, and the ability to act in the real world? In the final post of this series, we'll explore Advanced AI: Agents and RAG.
Ready for the cutting edge? Subscribe to the series!

Written by
Abstract Algorithms
@abstractalgorithms
