Tokenization Explained: How LLMs Understand Text
Computers don't read words; they read numbers. Tokenization is the process of converting text into these numbers. Learn about BPE, WordPiece, and why tokenization matters.
Abstract Algorithms
TLDR: LLMs don't read words; they read tokens. A token is roughly 4 characters. Byte Pair Encoding (BPE) builds an efficient subword vocabulary by iteratively merging frequent character pairs. Tokenization choices directly affect cost, context limits, and why LLMs struggle with counting and spelling.
Why LLMs Can't Just Read Words
Computers only work with numbers. Before an LLM can process the sentence "Hello world", it must first map that text to a sequence of integers. That mapping is tokenization.
Analogy: Teaching a dog commands by assigning each sound a unique signal ID. "Sit" = 42, "down" = 17, "fetch" = 91. The dog doesn't understand English; it responds to the number sequence.
There are three approaches, each with a trade-off:
| Approach | Example | Vocabulary size | Sequence length |
| --- | --- | --- | --- |
| Word tokenization | ["Hello", "world"] | Very large (every word form) | Short |
| Character tokenization | ["H","e","l","l","o"," ","w","o","r","l","d"] | Tiny (~256 for byte-level) | Very long |
| Subword tokenization | ["Hello", " world"] | Moderate (~50k) | Moderate |
Modern LLMs (GPT-4, Claude, Llama) use subword tokenization for the best balance.
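The first two rows of the table are easy to reproduce with plain Python (a quick sketch; subword splitting needs a trained vocabulary, so only the word and character approaches are shown):

```python
text = "Hello world"

# Word tokenization: split on whitespace
word_tokens = text.split()
print(word_tokens)        # ['Hello', 'world']

# Character tokenization: one token per character
char_tokens = list(text)
print(char_tokens)        # ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

# The trade-off in miniature: 2 tokens vs. 11 tokens for the same text
print(len(word_tokens), len(char_tokens))   # 2 11
```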
How Byte Pair Encoding Builds a Vocabulary
BPE starts with individual characters and iteratively merges the most frequent adjacent pair, growing the vocabulary until a target size is reached.
Step-by-step trace on "low lower lowest":
Initial characters: ['l','o','w',' ','l','o','w','e','r',' ','l','o','w','e','s','t']
Round 1: Most frequent pair = ('l','o') → merge to 'lo'
→ ['lo','w',' ','lo','w','e','r',' ','lo','w','e','s','t']
Round 2: Most frequent pair = ('lo','w') → merge to 'low'
→ ['low',' ','low','e','r',' ','low','e','s','t']
Round 3: Most frequent pair = ('low','e') → merge to 'lowe'
→ ['low',' ','lowe','r',' ','lowe','s','t']
...continues until the vocabulary size target is reached
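The trace above can be reproduced with a minimal BPE trainer (a didactic sketch in plain Python, not a production tokenizer; ties between equally frequent pairs are broken arbitrarily, so Round 3 may pick a different pair than the trace shows):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count every adjacent token pair and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for round_no in range(1, 4):
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    print(f"Round {round_no}: merged {pair} -> {tokens}")
```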
graph TD
A[All characters in training corpus] --> B[Find most frequent adjacent pair]
B --> C[Merge pair into a new token]
C --> D[Update frequency counts]
D --> E{Vocabulary size reached?}
E -->|No| B
E -->|Yes| F[Final vocabulary]
Result: Common words stay whole (the, in, is). Rare or new words are split into known subword pieces (tokenization → token + ization).
BPE Tokenization Steps
flowchart LR
TX[Raw Text] --> CS[Character Splits]
CS --> MP[Merge Frequent Pairs]
MP --> VC[Vocabulary Built]
VC --> TK[Token IDs]
What a Token Actually Looks Like in Practice
import tiktoken # OpenAI's tokenizer library
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Tokenization is fascinating!"
tokens = enc.encode(text)
print(tokens)
# → [8908, 2065, 374, 41353, 0] (5 tokens)
decoded = [enc.decode([t]) for t in tokens]
print(decoded)
# → ['Token', 'ization', ' is', ' fascinating', '!']
Key observations:
- "tokenization" → 2 tokens ("Token" + "ization")
- The space before "is" is part of the " is" token
- "!" is its own token
- Cost = number of tokens, not number of words
Token counting rule of thumb (English): 1 token ≈ 4 characters ≈ 0.75 words.
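That rule of thumb is enough for a back-of-envelope estimator (an approximation only; real counts come from the model's tokenizer, and the helper names here are our own):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: about 4 characters per token."""
    return max(1, round(len(text) / 4))

def estimate_words(token_count: int) -> float:
    """About 0.75 words per token."""
    return token_count * 0.75

prompt = "Tokenization is fascinating!"    # 28 characters
print(estimate_tokens(prompt))             # 7 (the real gpt-4o count above is 5)
print(estimate_words(128_000))             # 96000.0 words fit a 128k context
```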
Encode-Decode Roundtrip
sequenceDiagram
participant T as Text Input
participant TZ as Tokenizer
participant M as LLM
participant D as Detokenizer
T->>TZ: raw string
TZ->>M: token ID sequence
M->>D: output token IDs
D-->>T: decoded text output
BPE Training and Encoding Flow
The BPE algorithm has two distinct phases: a one-time vocabulary training phase (offline) and a per-text encoding phase (online at inference time).
graph TD
subgraph Training["Training Phase (offline)"]
A[Large text corpus] --> B[Initialize: each character is a token]
B --> C[Count all adjacent token pairs]
C --> D[Merge the most frequent pair into a new token]
D --> E[Update pair counts]
E --> F{Vocab size target reached?}
F -->|No| C
F -->|Yes| G[Final vocabulary saved]
end
subgraph Encoding["Encoding Phase (online)"]
H[Input text] --> I[Split into characters]
I --> J[Apply learned merge rules in order]
J --> K[Token sequence produced]
K --> L[Convert tokens to integer IDs]
L --> M[Integer sequence fed to LLM]
end
G --> J
The critical insight is that the merge rules learned during training are applied in the same order during encoding. Rule 1 (the first merge) is always applied before Rule 2 (the second). This determinism ensures that the same text always tokenizes to the same integer sequence, a requirement for reproducible model behavior.
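That ordered replay can be sketched in a few lines (a didactic sketch; the rule list below is the one learned from the "low lower lowest" example earlier):

```python
def apply_merges(text, merge_rules):
    """Encode text by replaying merge rules in the order they were learned."""
    tokens = list(text)
    for left, right in merge_rules:       # Rule 1 always runs before Rule 2
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

rules = [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(apply_merges("lowest", rules))   # ['lowe', 's', 't']
print(apply_merges("lowest", rules))   # same input -> same tokens, every time
```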
Why vocabulary size matters:
- Too small (< 10k): Many words split into many short subwords → long sequences → expensive attention computation.
- Too large (> 200k): Rare tokens appear so infrequently in training that their embeddings are poorly learned → poor generalization.
- Sweet spot: 32k–128k tokens balances sequence length, embedding quality, and memory usage.
Deep Dive: Why Subword Tokenization Beats Words and Characters
Word tokenization creates a vocabulary explosion: every inflected form needs its own entry, and unseen words become [UNK]. Character tokenization avoids that but makes sequences 5–10× longer, overwhelming the attention computation. Subword tokenization (BPE, WordPiece) finds the middle path: common words stay whole, rare words decompose into known pieces. The vocabulary tops out at 32k–128k, small enough for good embeddings, large enough to keep sequence lengths practical.
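A toy version of that middle path (greedy longest-match splitting against a made-up vocabulary; real WordPiece adds ## markers and likelihood-based training, so this is only an illustration):

```python
def greedy_subword(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])              # fall back to a single character
            i += 1
    return pieces

vocab = {"token", "izer", "ization", "low", "est"}
print(greedy_subword("tokenizer", vocab))      # ['token', 'izer']
print(greedy_subword("tokenization", vocab))   # ['token', 'ization']
print(greedy_subword("lowest", vocab))         # ['low', 'est']
```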
Real-World Applications: Why Tokenization Explains LLM Quirks
Many surprising LLM behaviors are tokenization artifacts:
| LLM behavior | Tokenization cause |
| --- | --- |
| Struggles to count letters in a word | Characters aren't the input unit; tokens are |
| Can't easily reverse a word | Reversal would break token boundaries |
| Different behavior for "color" vs "colour" | They may map to different token IDs |
| Token budget runs out faster with code/JSON | Code symbols ≈ 1 token each → more tokens per byte |
| Non-English text costs more tokens | Languages with fewer BPE merges get longer token sequences |
# Why "9.11 > 9.9" confuses some models:
enc.encode("9.11") # β ['9', '.', '11'] β three tokens
enc.encode("9.9") # β ['9', '.9'] β two tokens, different internal splits
The model sees these as sequences of integers β the decimal arithmetic intuition doesn't directly apply.
Trade-offs & Failure Modes: Comparing Tokenization Algorithms
| Algorithm | Used by | Key idea | Vocabulary size |
| --- | --- | --- | --- |
| BPE (Byte Pair Encoding) | GPT-4, Llama, Mistral | Merge most frequent adjacent pairs | ~50k–128k |
| WordPiece | BERT, DistilBERT | Maximize likelihood of the corpus under the vocabulary | ~30k |
| Unigram LM | T5, ALBERT | Probabilistic; prune vocab by minimizing likelihood loss | ~32k |
| SentencePiece | T5, XLM-R | Language-agnostic framework (BPE or Unigram) over raw text, no pre-tokenization | Configurable |
All modern subword algorithms attack the same core problem from different angles: balancing vocabulary size, sequence length, and handling of unseen (out-of-vocabulary, OOV) words.
Decision Guide: Practical Tokenization Guidelines
| Situation | Guideline |
| --- | --- |
| Estimating API cost | Tokens / 1000 × price-per-1k. Use tiktoken to count before sending |
| Fitting in context window | Count prompt tokens; leave room for completion |
| Multilingual content | Expect 2–4× more tokens than English for the same semantic content |
| Structured data (JSON/XML) | Each bracket, key, and value = 1–2 tokens; JSON is expensive |
| Exact-character tasks | Don't rely on the LLM for char-level operations (spelling, reversal) |
What to Learn Next
- LLM Terms: A Helpful Glossary
- LLM Hyperparameters: Temperature, Top-p, and Top-k
- RAG Explained: How to Give Your LLM a Brain Upgrade
Hands-On: Explore Tokenization with tiktoken
The best way to build intuition for tokenization is to run experiments yourself. The tiktoken library (used by OpenAI's GPT models) is available as a Python package.
Install and import:
pip install tiktoken
Experiment 1: See how your prompt is tokenized:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
prompt = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(prompt)
print(f"Token count: {len(tokens)}")
print([enc.decode([t]) for t in tokens])
# Token count: 10
# ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']
Experiment 2: Compare English vs. a non-Latin script:
english = "Artificial intelligence is transforming software."
japanese = "δΊΊε·₯η₯θ½γ―γ½γγγ¦γ§γ’γε€ι©γγ¦γγΎγγ"
print(len(enc.encode(english)))   # → 7
print(len(enc.encode(japanese)))  # → 17
The same semantic content uses about 2.4× more tokens in Japanese, directly increasing API cost and consuming more of the context window.
Experiment 3: Verify the "strawberry" counting problem:
word = "strawberry"
tokens = enc.encode(word)
pieces = [enc.decode([t]) for t in tokens]
print(pieces) # → ['straw', 'berry']
The model sees two token IDs, not ten characters. When asked "How many r's are in strawberry?", it must reason about subword pieces rather than individual letters β which is why it sometimes miscounts.
Challenge: Count the tokens in your most-used system prompt. Calculate the monthly token cost at GPT-4o pricing ($5 / 1M tokens) if this prompt is sent 100,000 times per day.
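A worked version of the challenge with a made-up prompt size (the 400-token figure is an assumption; substitute your own tiktoken count):

```python
PRICE_PER_MILLION = 5.00        # USD per 1M input tokens (from the challenge)
prompt_tokens = 400             # hypothetical system-prompt size
calls_per_day = 100_000

monthly_tokens = prompt_tokens * calls_per_day * 30
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_MILLION

print(f"Monthly tokens: {monthly_tokens:,}")    # Monthly tokens: 1,200,000,000
print(f"Monthly cost:   ${monthly_cost:,.2f}")  # Monthly cost:   $6,000.00
```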
HuggingFace Tokenizers: BPE and WordPiece in Two Lines of Code
HuggingFace Tokenizers is the open-source tokenization library (Rust-backed, with Python bindings) behind GPT-2, BERT, LLaMA, Mistral, and virtually every modern open model; together with the high-level transformers API, it is the standard tool for inspecting exactly how any model splits text into token IDs.
The library makes every concept in this post directly observable: you can load real BPE (GPT-2) and WordPiece (BERT) tokenizers, inspect split boundaries, count tokens, and measure the cost difference between languages in under 20 lines of code.
# pip install transformers tokenizers
from transformers import AutoTokenizer
# ── BPE tokenizer (GPT-2 style) ──────────────────────────────────────────────
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is fascinating!"
gpt2_ids = gpt2_tok.encode(text)
gpt2_split = gpt2_tok.convert_ids_to_tokens(gpt2_ids)
print("GPT-2 BPE:")
print(" IDs: ", gpt2_ids) # β [30642, 1634, 318, 17049, 0]
print(" Tokens: ", gpt2_split) # β ['Token', 'ization', 'Δ is', 'Δ fascinating', '!']
print(" Count: ", len(gpt2_ids)) # β 5
# ── WordPiece tokenizer (BERT style) ─────────────────────────────────────────
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_ids = bert_tok.encode(text, add_special_tokens=False)
bert_split = bert_tok.convert_ids_to_tokens(bert_ids)
print("\nBERT WordPiece:")
print(" IDs: ", bert_ids) # β [19204, 3989, 2003, 18895, 999]
print(" Tokens: ", bert_split) # β ['token', '##ization', 'is', 'fascinating', '!']
print(" Count: ", len(bert_ids)) # β 5
# Note: '##' prefix marks continuation sub-words in WordPiece
# ── Token cost comparison: English vs. Japanese ──────────────────────────────
english = "Artificial intelligence is transforming software."
japanese = "δΊΊε·₯η₯θ½γ―γ½γγγ¦γ§γ’γε€ι©γγ¦γγΎγγ"
print("\nToken counts (GPT-2 BPE):")
print(f" English: {len(gpt2_tok.encode(english))} tokens") # β ~8
print(f" Japanese: {len(gpt2_tok.encode(japanese))} tokens") # β ~25+ (2-3Γ more)
# ── Reproduce the "strawberry" counting problem ──────────────────────────────
word = "strawberry"
pieces = gpt2_tok.convert_ids_to_tokens(gpt2_tok.encode(word))
print(f"\n'strawberry' splits to: {pieces}")
# → ['straw', 'berry']: model sees 2 tokens, not 10 characters
# ── Count prompt tokens before sending to API ────────────────────────────────
# Use the model-specific tokenizer for an accurate count (GPT-2's is used here as a stand-in)
system_prompt = "You are a helpful assistant. Always answer concisely."
user_message = "Explain the difference between SQL and NoSQL databases."
full_prompt = system_prompt + "\n" + user_message
token_count = len(gpt2_tok.encode(full_prompt))
cost_usd = token_count / 1_000_000 * 5.00 # GPT-4o input pricing
print(f"\nPrompt token count: {token_count}")
print(f"Cost per 1 call: ${cost_usd:.6f}")
The ## prefix in BERT WordPiece tokens marks a continuation sub-word (part of a longer word), while GPT-2 BPE uses the Ġ prefix to mark a token that follows a space. Both are subword algorithms but produce different split boundaries, a subtle source of model-specific behavior differences.
For a full deep-dive on HuggingFace Tokenizers and vocabulary training, a dedicated follow-up post is planned.
Lessons from Understanding Tokenization
Lesson 1: Token count, not word count, determines cost. Every LLM API charges per token. A 500-word prompt in English is ~667 tokens. The same content in a morphologically rich language (Finnish, Turkish) may be 1,200 tokens. Always budget in tokens, not words.
Lesson 2: The context window is a token budget, not a word budget. A 128k-token context window holds roughly 96,000 English words. But a JSON-heavy system prompt with brackets, colons, and quoted keys burns tokens at 1.5–2× the prose rate. Compact your structured data.
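One concrete way to compact structured data before it hits the tokenizer: strip the whitespace that pretty-printed JSON carries (a small sketch; under the ~4-characters-per-token rule, every character saved is roughly a quarter of a token).

```python
import json

payload = {"user": {"name": "Ada", "roles": ["admin", "editor"], "active": True}}

pretty = json.dumps(payload, indent=2)                 # human-friendly, token-hungry
compact = json.dumps(payload, separators=(",", ":"))   # no spaces after , and :

print(len(pretty), len(compact))   # compact is meaningfully shorter
print(compact)
# {"user":{"name":"Ada","roles":["admin","editor"],"active":true}}
```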
Lesson 3: Tokenization artifacts explain model quirks at the character level. Letter counting, palindrome detection, reversing strings, and rhyme finding all require character-level reasoning, but the model only sees token IDs. These tasks are genuinely harder for LLMs, not symptoms of low intelligence.
Lesson 4: Consistent tokenization is a stability guarantee. The same text always produces the same token sequence with the same tokenizer version. If you cache LLM responses keyed by the token sequence, you get deterministic cache hits. If you upgrade the tokenizer, the cache is invalidated.
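A minimal response cache keyed by token IDs plus a tokenizer version tag (a sketch: `fake_encode` is a hypothetical stand-in for a real tokenizer, and bumping TOKENIZER_VERSION invalidates every key, mirroring a tokenizer upgrade).

```python
import hashlib

TOKENIZER_VERSION = "v1"   # bump on tokenizer upgrade -> every cached key changes

def fake_encode(text):
    """Stand-in tokenizer: deterministic byte-level IDs, for illustration only."""
    return list(text.encode("utf-8"))

def cache_key(text):
    """Hash the tokenizer version plus the token-ID sequence."""
    ids = fake_encode(text)
    raw = TOKENIZER_VERSION + ":" + ",".join(map(str, ids))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

cache = {}
cache[cache_key("Explain BPE briefly.")] = "BPE merges frequent pairs..."

# Same text + same tokenizer version -> same key -> deterministic cache hit
print(cache[cache_key("Explain BPE briefly.")])
```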
Lesson 5: Use tiktoken for cost estimation before shipping a prompt to production. Never estimate token count by eye. A prompt that looks like 200 words may be 350 tokens once you count the system prompt, conversation history, tool definitions, and output format instructions. Measure first.
TLDR: Summary & Key Takeaways
- Tokenization converts text to integer sequences; LLMs work with tokens, not words or characters.
- BPE builds a vocabulary by iteratively merging the most frequent adjacent character pairs.
- 1 token ≈ 4 English characters ≈ 0.75 words; use this to estimate context and cost.
- Tokenization boundaries explain why LLMs struggle with spelling, letter counting, and exact character operations.
- Modern tokenizers (tiktoken, SentencePiece) expose these internals; always count tokens before sending expensive prompts.
Practice Quiz
What is the first step of BPE tokenization?
- A) Splitting text into sentences
- B) Starting with individual characters and counting pair frequencies
- C) Building a large word dictionary
- D) Removing stop words
Correct Answer: B. BPE bootstraps from the smallest unit (individual characters or bytes) and iteratively merges the most frequent adjacent pair until the vocabulary size target is reached.
Why does "tokenization" get split into two tokens by BPE?
- A) It is too long to fit in one token
- B) The full word sequence is not in the learned vocabulary; it gets split at a subword boundary
- C) BPE always splits words longer than 8 characters
- D) The word contains an irregular character pattern
Correct Answer: B. BPE's vocabulary is built from the training corpus. "tokenization" as a whole unit may be rare enough that the merge that would create it was never reached. The split into "token" + "ization" reflects two high-frequency subwords that were merged earlier.
Why do LLMs struggle to count the letters in a word like "strawberry"?
- A) LLMs are not trained on spelling tasks
- B) Characters are not the input unit β the model sees token IDs, not individual letters
- C) "Strawberry" contains an irregular letter pattern
- D) The model only processes words up to 6 characters
Correct Answer: B β "strawberry" tokenizes to ["straw", "berry"] in GPT-4o. The model must reason about character structure from within each subword piece β a task that its training data provides little direct signal for.
Related Posts
- LLM Terms: A Helpful Glossary
- RAG Explained: How to Give Your LLM a Brain Upgrade
- Embeddings Explained

Written by
Abstract Algorithms
@abstractalgorithms