
Tokenization Explained: How LLMs Understand Text


Abstract Algorithms · 12 min read

TLDR: LLMs don't read words — they read tokens. A token is roughly 4 characters of English. Byte Pair Encoding (BPE) builds an efficient subword vocabulary by iteratively merging frequent character pairs. Tokenization choices directly affect cost and context limits, and explain why LLMs struggle with counting and spelling.


📖 Why LLMs Can't Just Read Words

Computers only work with numbers. Before an LLM can process the sentence "Hello world", it must first map that text to a sequence of integers. That mapping is tokenization.

Analogy: Teaching a dog commands by assigning each sound a unique signal ID. "Sit" = 42, "down" = 17, "fetch" = 91. The dog doesn't understand English — it responds to the number sequence.

There are three approaches, each with a trade-off:

| Approach | Example | Vocabulary size | Sequence length |
| --- | --- | --- | --- |
| Word tokenization | ["Hello", "world"] | Very large (every word form) | Short |
| Character tokenization | ["H","e","l","l","o"," ","w","o","r","l","d"] | Tiny (256 for byte-level) | Very long |
| Subword tokenization | ["Hello", " world"] | Moderate (~50k) | Moderate |

Modern LLMs (GPT-4, Claude, Llama) use subword tokenization for the best balance.


πŸ” How Byte Pair Encoding Builds a Vocabulary

BPE starts with individual characters and iteratively merges the most frequent adjacent pair, growing the vocabulary until a target size is reached.

Step-by-step trace on "low lower lowest":

Initial characters: ['l','o','w',' ','l','o','w','e','r',' ','l','o','w','e','s','t']

Round 1: Most frequent pair = ('l','o') → merge to 'lo'
→ ['lo','w',' ','lo','w','e','r',' ','lo','w','e','s','t']

Round 2: Most frequent pair = ('lo','w') → merge to 'low'
→ ['low',' ','low','e','r',' ','low','e','s','t']

Round 3: Most frequent pair = ('low','e') → merge to 'lowe'
→ ['low',' ','lowe','r',' ','lowe','s','t']

...continues until vocabulary size target is reached
graph TD
    A[All characters in training corpus] --> B[Find most frequent adjacent pair]
    B --> C[Merge pair into a new token]
    C --> D[Update frequency counts]
    D --> E{Vocabulary size reached?}
    E -->|No| B
    E -->|Yes| F[Final vocabulary]
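The merge loop in the diagram can be written as a toy Python trainer. This is an illustration only: real BPE implementations operate on bytes, pre-split the corpus into words, and use explicit tie-breaking rules for equally frequent pairs.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE trainer: start from characters, repeatedly merge the
    most frequent adjacent pair, and record each merge rule."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(pair)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])  # apply the merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 2)
print(merges)  # [('l', 'o'), ('lo', 'w')], the first two merges from the trace
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

Two merges reproduce Rounds 1 and 2 of the hand trace above; from Round 3 on, the result depends on how ties between equally frequent pairs are broken.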

Result: Common words stay whole (the, in, is). Rare or new words are split into known subword pieces (tokenization → token, ization).

📊 BPE Tokenization Steps

flowchart LR
    TX[Raw Text] --> CS[Character Splits]
    CS --> MP[Merge Frequent Pairs]
    MP --> VC[Vocabulary Built]
    VC --> TK[Token IDs]

βš™οΈ What a Token Actually Looks Like in Practice

import tiktoken                    # OpenAI's tokenizer library

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokenization is fascinating!"
tokens = enc.encode(text)
print(tokens)
# → [8908, 2065, 374, 41353, 0]   (5 tokens)

decoded = [enc.decode([t]) for t in tokens]
print(decoded)
# → ['Token', 'ization', ' is', ' fascinating', '!']

Key observations:

  • tokenization → 2 tokens
  • The space before is is part of the is token
  • ! is its own token
  • Cost = number of tokens, not number of words

Token counting rule of thumb (English): 1 token ≈ 4 characters ≈ 0.75 words.
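The rule of thumb above is easy to turn into a quick estimator. This is a rough heuristic only (the function names are ours); for anything billing-related, count with a real tokenizer such as tiktoken.

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: 1 token is about 4 characters."""
    return max(1, round(len(text) / 4))

def estimate_words(token_count: int) -> float:
    """Inverse rule of thumb: 1 token is about 0.75 words."""
    return token_count * 0.75

print(estimate_tokens("Tokenization is fascinating!"))  # 7 (actual gpt-4o count: 5)
print(estimate_words(128_000))                          # 96000.0 words in a 128k window
```

As the first print shows, the heuristic can overshoot short, subword-friendly text; it is for ballpark budgeting, not billing.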

📊 Encode-Decode Roundtrip

sequenceDiagram
    participant T as Text Input
    participant TZ as Tokenizer
    participant M as LLM
    participant D as Detokenizer
    T->>TZ: raw string
    TZ->>M: token ID sequence
    M->>D: output token IDs
    D-->>T: decoded text output

📊 BPE Training and Encoding Flow

The BPE algorithm has two distinct phases: a one-time vocabulary training phase (offline) and a per-text encoding phase (online at inference time).

graph TD
    subgraph TP["Training Phase (offline)"]
        A[Large text corpus] --> B[Initialize: each character is a token]
        B --> C[Count all adjacent token pairs]
        C --> D[Merge the most frequent pair into a new token]
        D --> E[Update pair counts]
        E --> F{Vocab size target reached?}
        F -->|No| C
        F -->|Yes| G[Final vocabulary saved]
    end
    subgraph EP["Encoding Phase (online)"]
        H[Input text] --> I[Split into characters]
        I --> J[Apply learned merge rules in order]
        J --> K[Token sequence produced]
        K --> L[Convert tokens to integer IDs]
        L --> M[Integer sequence fed to LLM]
    end
    G --> J

The critical insight is that the merge rules learned during training are applied in the same order during encoding. Rule 1 (the first merge) is always applied before Rule 2 (the second merge). This determinism ensures that the same text always tokenizes to the same integer sequence — a requirement for reproducible model behavior.
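Assuming the tiny merge list learned from "low lower lowest" earlier, the ordered replay during encoding can be sketched as follows (real tokenizers add pre-tokenization and byte fallback on top of this core loop):

```python
def bpe_encode(text, merges):
    """Encode text by replaying learned merges in training order."""
    tokens = list(text)
    for a, b in merges:                     # rule order matters: Rule 1 before Rule 2
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)        # apply this merge rule everywhere
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

merges = [('l', 'o'), ('lo', 'w')]          # rules from the "low lower lowest" trace
print(bpe_encode("lowest", merges))         # ['low', 'e', 's', 't']
print(bpe_encode("slow", merges))           # ['s', 'low']
```

Because the rules replay in a fixed order, both calls are fully deterministic: the same text and merge list always yield the same token sequence.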

Why vocabulary size matters:

  • Too small (< 10k): Many words split into many short subwords → long sequences → expensive attention computation.
  • Too large (> 200k): Rare tokens appear so infrequently in training that their embeddings are poorly learned → poor generalization.
  • Sweet spot: 32k–128k tokens balances sequence length, embedding quality, and memory usage.

🧠 Deep Dive: Why Subword Tokenization Beats Words and Characters

Word tokenization creates a vocabulary explosion — every inflected form needs its own entry, and unseen words become [UNK]. Character tokenization avoids that but makes sequences 5–10× longer, overwhelming the attention computation. Subword tokenization (BPE, WordPiece) finds the middle path: common words stay whole, rare words decompose into known pieces. The vocabulary tops out at 32k–128k — small enough for good embeddings, large enough to keep sequence lengths practical.


🌍 Real-World Applications: Why Tokenization Explains LLM Quirks

Many surprising LLM behaviors are tokenization artifacts:

| LLM behavior | Tokenization cause |
| --- | --- |
| Struggles to count letters in a word | Characters aren't the input unit — tokens are |
| Can't easily reverse a word | Reversal would break token boundaries |
| Different behavior for "color" vs "colour" | They may map to different token IDs |
| Token budget runs out faster with code/JSON | Code symbols ≈ 1 token each → more tokens per byte |
| Non-English text costs more tokens | Languages with fewer BPE merges have longer token sequences |
# Why "9.11 > 9.9" confuses some models (exact splits vary by tokenizer):
print([enc.decode([t]) for t in enc.encode("9.11")])  # e.g. ['9', '.', '11']  (three tokens)
print([enc.decode([t]) for t in enc.encode("9.9")])   # e.g. ['9', '.9']      (two tokens, a different split)

The model sees these as sequences of integers — the decimal arithmetic intuition doesn't directly apply.


βš–οΈ Trade-offs & Failure Modes: Comparing Tokenization Algorithms

| Algorithm | Used by | Key idea | Vocabulary size |
| --- | --- | --- | --- |
| BPE (Byte Pair Encoding) | GPT-4, Llama, Mistral | Merge most frequent char pairs | ~50k–128k |
| WordPiece | BERT, DistilBERT | Maximize likelihood of vocabulary given corpus | ~30k |
| Unigram LM | T5, ALBERT | Probabilistic — prune vocab by minimizing likelihood loss | ~32k |
| SentencePiece | T5, XLM-R | Language-agnostic; operates on raw text without pre-tokenization | Configurable |

All modern subword algorithms solve the same core problem differently: balance vocabulary size, sequence length, and handling of unseen (OOV) words.


🧭 Decision Guide: Practical Tokenization Guidelines

| Situation | Guideline |
| --- | --- |
| Estimating API cost | Tokens / 1000 × price-per-1k. Use tiktoken to count before sending |
| Fitting in context window | Count prompt tokens; leave room for completion |
| Multilingual content | Expect 2–4× more tokens than English for the same semantic content |
| Structured data (JSON/XML) | Each bracket, key, and value = 1–2 tokens; JSON is expensive |
| Sensitive exact-character tasks | Don't rely on the LLM for char-level operations (spelling, reversal) |
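The cost guideline from the first row translates directly into code. The price used here is a made-up example (check current provider pricing), and the function name is ours:

```python
def estimate_cost_usd(token_count: int, price_per_1k_usd: float) -> float:
    """Guideline formula: tokens / 1000 * price-per-1k."""
    return token_count / 1000 * price_per_1k_usd

# A 350-token prompt at a hypothetical $0.005 per 1k input tokens
print(f"${estimate_cost_usd(350, 0.005):.6f}")  # $0.001750
```

Small per-call numbers like this become significant at scale, which is why the decision guide says to count tokens before sending.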


🧪 Hands-On: Explore Tokenization with tiktoken

The best way to build intuition for tokenization is to run experiments yourself. The tiktoken library (used by OpenAI's GPT models) is available as a Python package.

Install and import:

pip install tiktoken

Experiment 1 — See how your prompt is tokenized:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

prompt = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(prompt)
print(f"Token count: {len(tokens)}")
print([enc.decode([t]) for t in tokens])
# Token count: 10
# ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']

Experiment 2 — Compare English vs. a non-Latin script:

english = "Artificial intelligence is transforming software."
japanese = "人工知能はソフトウェアを変革しています。"

print(len(enc.encode(english)))   # → 7
print(len(enc.encode(japanese)))  # → 17

The same semantic content uses 2.4× more tokens in Japanese — directly increasing API cost and consuming more of the context window.

Experiment 3 — Verify the "strawberry" counting problem:

word = "strawberry"
tokens = enc.encode(word)
pieces = [enc.decode([t]) for t in tokens]
print(pieces)   # → ['straw', 'berry']

The model sees two token IDs, not ten characters. When asked "How many r's are in strawberry?", it must reason about subword pieces rather than individual letters — which is why it sometimes miscounts.

Challenge: Count the tokens in your most-used system prompt. Calculate the monthly token cost at GPT-4o pricing ($5 / 1M tokens) if this prompt is sent 100,000 times per day.
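One way to set up the challenge arithmetic (the 60-token prompt size below is a placeholder; measure your real prompt with len(enc.encode(prompt))):

```python
tokens_per_prompt = 60                  # placeholder: substitute your measured count
calls_per_day     = 100_000
price_per_token   = 5.00 / 1_000_000    # $5 per 1M input tokens (GPT-4o pricing above)

daily_usd   = tokens_per_prompt * calls_per_day * price_per_token
monthly_usd = daily_usd * 30
print(f"Daily: ${daily_usd:.2f}, Monthly: ${monthly_usd:.2f}")  # Daily: $30.00, Monthly: $900.00
```

Even a short 60-token system prompt costs hundreds of dollars a month at this volume, so trimming a few tokens from a hot prompt pays off quickly.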


πŸ› οΈ HuggingFace Tokenizers: BPE and WordPiece in Two Lines of Code

HuggingFace Tokenizers is the open-source, Rust-backed library that powers tokenization for GPT-2, BERT, LLaMA, Mistral, and virtually every modern LLM. Together with the high-level transformers API, it is the standard tool for inspecting exactly how any model splits text into token IDs.

The library makes every concept in this post directly observable: you can load real BPE (GPT-2) and WordPiece (BERT) tokenizers, inspect split boundaries, count tokens, and measure the cost difference between languages in under 20 lines of code.

# pip install transformers tokenizers

from transformers import AutoTokenizer

# ── BPE tokenizer (GPT-2 style) ───────────────────────────────────────────────
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is fascinating!"
gpt2_ids   = gpt2_tok.encode(text)
gpt2_split = gpt2_tok.convert_ids_to_tokens(gpt2_ids)
print("GPT-2 BPE:")
print("  IDs:    ", gpt2_ids)       # → [30642, 1634, 318, 17049, 0]
print("  Tokens: ", gpt2_split)     # → ['Token', 'ization', 'Ġis', 'Ġfascinating', '!']
print("  Count:  ", len(gpt2_ids))  # → 5

# ── WordPiece tokenizer (BERT style) ──────────────────────────────────────────
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

bert_ids   = bert_tok.encode(text, add_special_tokens=False)
bert_split = bert_tok.convert_ids_to_tokens(bert_ids)
print("\nBERT WordPiece:")
print("  IDs:    ", bert_ids)       # → [19204, 3989, 2003, 18895, 999]
print("  Tokens: ", bert_split)     # → ['token', '##ization', 'is', 'fascinating', '!']
print("  Count:  ", len(bert_ids))  # → 5
# Note: '##' prefix marks continuation sub-words in WordPiece

# ── Token cost comparison: English vs. Japanese ───────────────────────────────
english  = "Artificial intelligence is transforming software."
japanese = "人工知能はソフトウェアを変革しています。"

print("\nToken counts (GPT-2 BPE):")
print(f"  English:  {len(gpt2_tok.encode(english))} tokens")   # → ~8
print(f"  Japanese: {len(gpt2_tok.encode(japanese))} tokens")  # → ~25+ (2-3× more)

# ── Reproduce the "strawberry" counting problem ───────────────────────────────
word = "strawberry"
pieces = gpt2_tok.convert_ids_to_tokens(gpt2_tok.encode(word))
print(f"\n'strawberry' splits to: {pieces}")
# → ['straw', 'berry']  — model sees 2 tokens, not 10 characters

# ── Count prompt tokens before sending to API ────────────────────────────────
# Use the model-specific tokenizer to get an accurate count
system_prompt = "You are a helpful assistant. Always answer concisely."
user_message  = "Explain the difference between SQL and NoSQL databases."
full_prompt   = system_prompt + "\n" + user_message

token_count = len(gpt2_tok.encode(full_prompt))
cost_usd    = token_count / 1_000_000 * 5.00  # GPT-4o input pricing
print(f"\nPrompt token count: {token_count}")
print(f"Cost per 1 call: ${cost_usd:.6f}")

The ## prefix in BERT WordPiece tokens marks a continuation sub-word (part of a longer word), while GPT-2 BPE uses the Ġ prefix to mark a token that follows a space. Both are subword algorithms but produce different split boundaries — a subtle source of model-specific behavior differences.

For a full deep-dive on HuggingFace Tokenizers and vocabulary training, a dedicated follow-up post is planned.


📚 Lessons from Understanding Tokenization

Lesson 1 — Token count, not word count, determines cost. Every LLM API charges per token. A 500-word prompt in English is ~667 tokens. The same content in a morphologically rich language (Finnish, Turkish) may be 1,200 tokens. Always budget in tokens, not words.

Lesson 2 — The context window is a token budget, not a word budget. A 128k-token context window holds roughly 96,000 English words. But a JSON-heavy system prompt with brackets, colons, and quoted keys burns tokens at 1.5–2× the prose rate. Compact your structured data.

Lesson 3 — Tokenization artifacts explain model quirks at the character level. Letter counting, palindrome detection, reversing strings, and rhyme finding all require character-level reasoning — but the model only sees token IDs. These tasks are genuinely harder for LLMs, not symptoms of low intelligence.

Lesson 4 — Consistent tokenization is a stability guarantee. The same text always produces the same token sequence with the same tokenizer version. If you cache LLM responses keyed by the token sequence, you get deterministic cache hits. If you upgrade the tokenizer, the cache is invalidated.
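A minimal sketch of that caching idea (cache_key is a hypothetical helper name; any stable hash over the ID sequence works):

```python
import hashlib

def cache_key(token_ids: list[int]) -> str:
    """Deterministic cache key derived from a token ID sequence."""
    payload = ",".join(map(str, token_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The same token sequence always yields the same key;
# any change to the sequence (or the tokenizer that produced it) invalidates it.
assert cache_key([8908, 2065, 374]) == cache_key([8908, 2065, 374])
assert cache_key([8908, 2065, 374]) != cache_key([8908, 2065, 375])
```

In practice you would also fold the tokenizer name and version into the key, so a tokenizer upgrade invalidates the cache explicitly rather than silently.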

Lesson 5 — Use tiktoken for cost estimation before shipping a prompt to production. Never estimate token count by eye. A prompt that looks like 200 words may be 350 tokens once you count the system prompt, conversation history, tool definitions, and output format instructions. Measure first.


📌 TLDR: Summary & Key Takeaways

  • Tokenization converts text to integer sequences — LLMs work with tokens, not words or characters.
  • BPE builds a vocabulary by iteratively merging the most frequent adjacent character pairs.
  • 1 token ≈ 4 English characters ≈ 0.75 words — use this to estimate context and cost.
  • Tokenization boundaries explain why LLMs struggle with spelling, letter counting, and exact character operations.
  • Modern tokenizers (tiktoken, SentencePiece) expose these internals; always count tokens before sending expensive prompts.

πŸ“ Practice Quiz

  1. What is the first step of BPE tokenization?

    • A) Splitting text into sentences
    • B) Starting with individual characters and counting pair frequencies
    • C) Building a large word dictionary
    • D) Removing stop words

    Correct Answer: B — BPE bootstraps from the smallest unit (individual characters or bytes) and iteratively merges the most frequent adjacent pair until the vocabulary size target is reached.

  2. Why does "tokenization" get split into two tokens by BPE?

    • A) It is too long to fit in one token
    • B) The full word sequence is not in the learned vocabulary; it gets split at a subword boundary
    • C) BPE always splits words longer than 8 characters
    • D) The word contains an irregular character pattern

    Correct Answer: B — BPE's vocabulary is built from the training corpus. "tokenization" as a whole unit may be rare enough that the merge that would create it was never reached. The split into "token" + "ization" reflects two high-frequency subwords that were merged earlier.

  3. Why do LLMs struggle to count the letters in a word like "strawberry"?

    • A) LLMs are not trained on spelling tasks
    • B) Characters are not the input unit — the model sees token IDs, not individual letters
    • C) "Strawberry" contains an irregular letter pattern
    • D) The model only processes words up to 6 characters

    Correct Answer: B — "strawberry" tokenizes to ["straw", "berry"] in GPT-4o. The model must reason about character structure from within each subword piece — a task that its training data provides little direct signal for.



Written by Abstract Algorithms (@abstractalgorithms)