
How GPT (LLM) Works: The Next Word Predictor

ChatGPT isn't magic; it's math. We explain the core mechanism of Generative Pre-trained Transformers (GPT) and how they predict the next word.

Abstract Algorithms · 4 min read

TLDR: At its core, GPT asks one question, repeated: "Given everything so far, what is the most likely next token?" Tokens are not words; they're subword units. The Transformer architecture uses self-attention to weigh how much each token should influence the prediction. Sampling strategies (temperature, top-p) control how creative or deterministic the output is.


📖 The One Question GPT Asks a Trillion Times

Everything ChatGPT, Claude, and Gemini do comes down to a single repeated operation:

Given the sequence of tokens seen so far, predict a probability distribution over the next token.

Sample one token from that distribution. Append it to the sequence. Repeat.

That's it. Repeated over billions of training examples, this single objective teaches the model to generate coherent paragraphs, working code, and useful answers. At inference time, the same loop runs once per generated token.
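The predict-sample-append loop can be sketched in a few lines of Python. The real network is replaced here by a hypothetical `next_token_distribution` stub that returns a fixed toy distribution over four tokens:

```python
import random

def next_token_distribution(tokens):
    # Stand-in for the real model: a fixed probability
    # distribution over a toy 4-token vocabulary.
    vocab = ["the", "cat", "sat", "<end>"]
    probs = [0.1, 0.2, 0.3, 0.4]
    return vocab, probs

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        vocab, probs = next_token_distribution(tokens)   # 1. predict distribution
        token = random.choices(vocab, weights=probs)[0]  # 2. sample one token
        if token == "<end>":
            break
        tokens.append(token)                             # 3. append and repeat
    return tokens

print(generate(["the"]))
```

The real model's distribution depends on the whole token sequence so far; the stub ignores it, but the control flow is the same.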


🔢 Tokenization: Words Are Not the Input

GPT doesn't process words; it processes tokens, which are roughly 3–4 characters each.

"Hello, world!" → ["Hello", ",", " world", "!"]
"unbelievable"  → ["un", "bel", "iev", "able"]
"ChatGPT"       → ["Chat", "G", "PT"]

Subword tokenization (BPE, Byte Pair Encoding) allows the model to:

  • Handle any word, including rare or invented ones
  • Represent a vocabulary of 100K tokens that covers millions of possible words
  • Share representations between related words ("run", "running", "runner")
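The shared representations come from BPE's merge rule: repeatedly fuse the most frequent adjacent symbol pair. A minimal sketch of a single merge step on a toy corpus (not the actual GPT tokenizer, which learns its merges from web-scale text):

```python
from collections import Counter

def most_common_pair(words):
    # Count adjacent symbol pairs across all tokenized words.
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with one merged symbol.
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

words = [list("running"), list("runner"), list("run")]
pair = most_common_pair(words)   # "r"+"u" and "u"+"n" both occur in all three words
words = merge_pair(words, pair)
print(pair, words)
```

Run enough merge steps on enough text and frequent fragments like "run" become single tokens, which is exactly why "run", "running", and "runner" share a prefix token.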

Practical impact: GPT-4 Turbo has a 128K-token context window. At 1 token ≈ 0.75 English words, that's roughly 96K words, about one full novel.


⚙️ Predicting the Next Token: Logits, Softmax, and Sampling

After processing the input, GPT outputs a logit for every token in its vocabulary:

Raw logits (not probabilities):
"Paris"   →  8.3
"London"  →  6.1
"apple"   → -1.2
...

These are converted to probabilities via softmax:

$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
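Plugging the logits above into softmax is a quick sanity check in plain Python:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then exponentiate
    # and normalize so the values sum to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([8.3, 6.1, -1.2])    # "Paris", "London", "apple"
print([round(p, 3) for p in probs])  # "Paris" gets ~90% of the probability mass
```

Note how a 2.2-point logit gap becomes a 9x probability gap: softmax is exponential, so small logit differences translate into large preferences.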

Then one token is sampled from that distribution. How you sample controls the trade-off between coherence and creativity:

| Strategy | How it works | Result |
| --- | --- | --- |
| Greedy | Always pick the highest-probability token | Repetitive, safe, boring |
| Temperature < 1 | Sharpen the distribution | More confident, less diverse |
| Temperature > 1 | Flatten the distribution | More creative, less reliable |
| Top-k | Sample only from the k most probable tokens | Cuts off the long tail |
| Top-p (nucleus) | Sample from the smallest set of tokens whose probabilities sum to p | Dynamic top-k |
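These strategies compose. A minimal sketch of temperature scaling plus nucleus (top-p) sampling, applied to the toy logits from earlier:

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales logits before softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    probs = softmax([z / temperature for z in logits])
    # Nucleus (top-p): keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    weights = [probs[i] / total for i in kept]
    return random.choices(kept, weights=weights)[0]

logits = [8.3, 6.1, -1.2]  # "Paris", "London", "apple"
print(sample(logits, temperature=0.7, top_p=0.9))
```

With these logits, temperature 0.7 pushes "Paris" above 95% probability, so top-p 0.9 keeps only "Paris" in the nucleus and the sample is effectively greedy; flatter distributions would keep more candidates.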

🧠 The Transformer Under the Hood: Attention in Plain Language

GPT is built from stacked Transformer decoder blocks. Each block runs multi-head self-attention.

Self-attention answers: "For each position in the sequence, how much should I attend to every other position?"

Example: "The animal didn't cross the street because it was too tired."

  • Self-attention figures out that "it" refers to "animal," not "street."
  • It does this by learning attention weights between all token pairs.

```mermaid
flowchart LR
    Token1[The] --> Attn[Self-Attention Layer]
    Token2[animal] --> Attn
    Token3["didn't"] --> Attn
    Token4[cross] --> Attn
    Token5[it] --> Attn
    Attn --> Rep["Contextual<br/>representation<br/>of each token"]
    Rep --> FF[Feed-Forward Layer]
    FF --> Next["Next-token<br/>distribution"]
```

The attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

  • $Q$ (Query): "What am I looking for?"
  • $K$ (Key): "What do I offer?"
  • $V$ (Value): "What information do I contain?"
  • $\sqrt{d_k}$: scaling factor to prevent softmax saturation for large dimensions
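The formula translates almost line for line into NumPy. A single-head sketch with toy dimensions, including the causal mask a decoder block uses so a position cannot attend to future positions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # scores[i, j]: how strongly position i attends to position j.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: a decoder may not look at future positions.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8  # 5 tokens, 8-dimensional head
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

In a real model, Q, K, and V are learned linear projections of the token embeddings, and many such heads run in parallel per layer; here they are random toy matrices just to exercise the formula.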

⚖️ GPT's Known Limitations

| Limitation | What it means in practice |
| --- | --- |
| Knowledge cutoff | Cannot know events that happened after its training data ended |
| Context window | Cannot "remember" earlier parts of a conversation once the window is full |
| Hallucination | Generates plausible-sounding but factually wrong content |
| No true reasoning | Pattern matching at scale, not symbolic logic |
| Token budget | Long inputs are expensive in both cost ($) and latency |

📌 Key Takeaways

  • GPT = autoregressive next-token prediction, repeated at inference time.
  • Input is tokenized into subwords (BPE); tokens ≈ 0.75 words.
  • Raw logits → softmax → probability distribution → sample a token.
  • Self-attention allows GPT to model relationships between any two tokens in the context window.
  • Key limitations: knowledge cutoff, context window ceiling, hallucination, and no symbolic reasoning.

🧩 Test Your Understanding

  1. GPT outputs a logit of 8.3 for "Paris" and 6.1 for "London." Which is more probable after softmax? Why isn't it just "pick 8.3"?
  2. What would temperature=0 mean operationally?
  3. In the self-attention formula, what do Q, K, and V represent in intuitive terms?
  4. Why doesn't GPT know about an event from last week?

Written by Abstract Algorithms (@abstractalgorithms)