
Tokenization Explained: How LLMs Understand Text


Abstract Algorithms · 12 min read

TLDR: LLMs don't read words — they read tokens. A token is roughly 4 characters of English. Byte Pair Encoding (BPE) builds an efficient subword vocabulary by iteratively merging frequent character pairs. Tokenization choices directly affect cost and context limits, and explain why LLMs struggle with counting and spelling.


📖 Why LLMs Can't Just Read Words

Computers only work with numbers. Before an LLM can process the sentence "Hello world", it must first map that text to a sequence of integers. That mapping is tokenization.

Analogy: Teaching a dog commands by assigning each sound a unique signal ID. "Sit" = 42, "down" = 17, "fetch" = 91. The dog doesn't understand English — it responds to the number sequence.

There are three approaches, each with a trade-off:

| Approach | Example | Vocabulary size | Sequence length |
| --- | --- | --- | --- |
| Word tokenization | ["Hello", "world"] | Very large (every word form) | Short |
| Character tokenization | ["H","e","l","l","o"," ","w","o","r","l","d"] | Tiny (256 for byte-level) | Very long |
| Subword tokenization | ["Hello", " world"] | Moderate (~50k) | Moderate |

Modern LLMs (GPT-4, Claude, Llama) use subword tokenization for the best balance.


πŸ” How Byte Pair Encoding Builds a Vocabulary

BPE starts with individual characters and iteratively merges the most frequent adjacent pair, growing the vocabulary until a target size is reached.

Step-by-step trace on "low lower lowest":

Initial characters: ['l','o','w',' ','l','o','w','e','r',' ','l','o','w','e','s','t']

Round 1: Most frequent pair = ('l','o') → merge to 'lo'
→ ['lo','w',' ','lo','w','e','r',' ','lo','w','e','s','t']

Round 2: Most frequent pair = ('lo','w') → merge to 'low'
→ ['low',' ','low','e','r',' ','low','e','s','t']

Round 3: Most frequent pair = ('low','e') → merge to 'lowe'
→ ['low',' ','lowe','r',' ','lowe','s','t']

...continues until vocabulary size target is reached
graph TD
    A[All characters in training corpus] --> B[Find most frequent adjacent pair]
    B --> C[Merge pair into a new token]
    C --> D[Update frequency counts]
    D --> E{Vocabulary size reached?}
    E -->|No| B
    E -->|Yes| F[Final vocabulary]
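The merge loop in the diagram can be written as a toy Python trainer. This is an illustration only: real BPE implementations operate on bytes, pre-split the corpus into words, and use explicit tie-breaking rules for equally frequent pairs.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE trainer: start from characters, repeatedly merge the
    most frequent adjacent pair, and record each merge rule."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(pair)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])  # apply the merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 2)
print(merges)  # [('l', 'o'), ('lo', 'w')], the first two merges from the trace
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

Two merges reproduce Rounds 1 and 2 of the hand trace above; from Round 3 on, the result depends on how ties between equally frequent pairs are broken.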

Result: Common words stay whole (the, in, is). Rare or new words are split into known subword pieces (tokenization → token, ization).

📊 BPE Tokenization Steps

flowchart LR
    TX[Raw Text] --> CS[Character Splits]
    CS --> MP[Merge Frequent Pairs]
    MP --> VC[Vocabulary Built]
    VC --> TK[Token IDs]

βš™οΈ What a Token Actually Looks Like in Practice

import tiktoken                    # OpenAI's tokenizer library

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokenization is fascinating!"
tokens = enc.encode(text)
print(tokens)
# → [8908, 2065, 374, 41353, 0]   (5 tokens)

decoded = [enc.decode([t]) for t in tokens]
print(decoded)
# → ['Token', 'ization', ' is', ' fascinating', '!']

Key observations:

  • tokenization → 2 tokens
  • The space before is is part of the is token
  • ! is its own token
  • Cost = number of tokens, not number of words

Token counting rule of thumb (English): 1 token ≈ 4 characters ≈ 0.75 words.
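The rule of thumb above is easy to turn into a quick estimator. This is a rough heuristic only (the function names are ours); for anything billing-related, count with a real tokenizer such as tiktoken.

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: 1 token is about 4 characters."""
    return max(1, round(len(text) / 4))

def estimate_words(token_count: int) -> float:
    """Inverse rule of thumb: 1 token is about 0.75 words."""
    return token_count * 0.75

print(estimate_tokens("Tokenization is fascinating!"))  # 7 (actual gpt-4o count: 5)
print(estimate_words(128_000))                          # 96000.0 words in a 128k window
```

As the first print shows, the heuristic can overshoot short, subword-friendly text; it is for ballpark budgeting, not billing.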

📊 Encode-Decode Roundtrip

sequenceDiagram
    participant T as Text Input
    participant TZ as Tokenizer
    participant M as LLM
    participant D as Detokenizer
    T->>TZ: raw string
    TZ->>M: token ID sequence
    M->>D: output token IDs
    D-->>T: decoded text output

📊 BPE Training and Encoding Flow

The BPE algorithm has two distinct phases: a one-time vocabulary training phase (offline) and a per-text encoding phase (online at inference time).

graph TD
    subgraph TP["Training Phase (offline)"]
        A[Large text corpus] --> B[Initialize: each character is a token]
        B --> C[Count all adjacent token pairs]
        C --> D[Merge the most frequent pair into a new token]
        D --> E[Update pair counts]
        E --> F{Vocab size target reached?}
        F -->|No| C
        F -->|Yes| G[Final vocabulary saved]
    end
    subgraph EP["Encoding Phase (online)"]
        H[Input text] --> I[Split into characters]
        I --> J[Apply learned merge rules in order]
        J --> K[Token sequence produced]
        K --> L[Convert tokens to integer IDs]
        L --> M[Integer sequence fed to LLM]
    end
    G --> J

The critical insight is that the merge rules learned during training are applied in the same order during encoding. Rule 1 (the first merge) is always applied before Rule 2 (the second merge). This determinism ensures that the same text always tokenizes to the same integer sequence — a requirement for reproducible model behavior.
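Assuming the tiny merge list learned from "low lower lowest" earlier, the ordered replay during encoding can be sketched as follows (real tokenizers add pre-tokenization and byte fallback on top of this core loop):

```python
def bpe_encode(text, merges):
    """Encode text by replaying learned merges in training order."""
    tokens = list(text)
    for a, b in merges:                     # rule order matters: Rule 1 before Rule 2
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)        # apply this merge rule everywhere
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

merges = [('l', 'o'), ('lo', 'w')]          # rules from the "low lower lowest" trace
print(bpe_encode("lowest", merges))         # ['low', 'e', 's', 't']
print(bpe_encode("slow", merges))           # ['s', 'low']
```

Because the rules replay in a fixed order, both calls are fully deterministic: the same text and merge list always yield the same token sequence.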

Why vocabulary size matters:

  • Too small (< 10k): Many words split into many short subwords → long sequences → expensive attention computation.
  • Too large (> 200k): Rare tokens appear so infrequently in training that their embeddings are poorly learned → poor generalization.
  • Sweet spot: 32k–128k tokens balances sequence length, embedding quality, and memory usage.

🧠 Deep Dive: Why Subword Tokenization Beats Words and Characters

Word tokenization creates a vocabulary explosion — every inflected form needs its own entry, and unseen words become [UNK]. Character tokenization avoids that but makes sequences 5–10× longer, overwhelming the attention computation. Subword tokenization (BPE, WordPiece) finds the middle path: common words stay whole, rare words decompose into known pieces. The vocabulary tops out at 32k–128k — small enough for good embeddings, large enough to keep sequence lengths practical.


🌍 Real-World Applications: Why Tokenization Explains LLM Quirks

Many surprising LLM behaviors are tokenization artifacts:

| LLM behavior | Tokenization cause |
| --- | --- |
| Struggles to count letters in a word | Characters aren't the input unit — tokens are |
| Can't easily reverse a word | Reversal would break token boundaries |
| Different behavior for "color" vs "colour" | They may map to different token IDs |
| Token budget runs out faster with code/JSON | Code symbols ≈ 1 token each → more tokens per byte |
| Non-English text costs more tokens | Languages with fewer BPE merges have longer token sequences |
# Why "9.11 > 9.9" confuses some models (exact splits vary by tokenizer):
print([enc.decode([t]) for t in enc.encode("9.11")])  # e.g. ['9', '.', '11']  (three tokens)
print([enc.decode([t]) for t in enc.encode("9.9")])   # e.g. ['9', '.9']      (two tokens, a different split)

The model sees these as sequences of integers — the decimal arithmetic intuition doesn't directly apply.


βš–οΈ Trade-offs & Failure Modes: Comparing Tokenization Algorithms

| Algorithm | Used by | Key idea | Vocabulary size |
| --- | --- | --- | --- |
| BPE (Byte Pair Encoding) | GPT-4, Llama, Mistral | Merge most frequent char pairs | ~50k–128k |
| WordPiece | BERT, DistilBERT | Maximize likelihood of vocabulary given corpus | ~30k |
| Unigram LM | T5, ALBERT | Probabilistic — prune vocab by minimizing likelihood loss | ~32k |
| SentencePiece | T5, XLM-R | Language-agnostic; operates on raw text without pre-tokenization | Configurable |

All modern subword algorithms solve the same core problem differently: balance vocabulary size, sequence length, and handling of unseen (OOV) words.


🧭 Decision Guide: Practical Tokenization Guidelines

| Situation | Guideline |
| --- | --- |
| Estimating API cost | Tokens / 1000 × price-per-1k. Use tiktoken to count before sending |
| Fitting in context window | Count prompt tokens; leave room for completion |
| Multilingual content | Expect 2–4× more tokens than English for the same semantic content |
| Structured data (JSON/XML) | Each bracket, key, and value = 1–2 tokens; JSON is expensive |
| Sensitive exact-character tasks | Don't rely on the LLM for char-level operations (spelling, reversal) |
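The cost guideline from the first row translates directly into code. The price used here is a made-up example (check current provider pricing), and the function name is ours:

```python
def estimate_cost_usd(token_count: int, price_per_1k_usd: float) -> float:
    """Guideline formula: tokens / 1000 * price-per-1k."""
    return token_count / 1000 * price_per_1k_usd

# A 350-token prompt at a hypothetical $0.005 per 1k input tokens
print(f"${estimate_cost_usd(350, 0.005):.6f}")  # $0.001750
```

Small per-call numbers like this become significant at scale, which is why the decision guide says to count tokens before sending.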


🧪 Hands-On: Explore Tokenization with tiktoken

The best way to build intuition for tokenization is to run experiments yourself. The tiktoken library (used by OpenAI's GPT models) is available as a Python package.

Install and import:

pip install tiktoken

Experiment 1 — See how your prompt is tokenized:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

prompt = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(prompt)
print(f"Token count: {len(tokens)}")
print([enc.decode([t]) for t in tokens])
# Token count: 10
# ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']

Experiment 2 — Compare English vs. a non-Latin script:

english = "Artificial intelligence is transforming software."
japanese = "人工知能はソフトウェアを変革しています。"

print(len(enc.encode(english)))   # → 7
print(len(enc.encode(japanese)))  # → 17

The same semantic content uses 2.4× more tokens in Japanese — directly increasing API cost and consuming more of the context window.

Experiment 3 — Verify the "strawberry" counting problem:

word = "strawberry"
tokens = enc.encode(word)
pieces = [enc.decode([t]) for t in tokens]
print(pieces)   # → ['straw', 'berry']

The model sees two token IDs, not ten characters. When asked "How many r's are in strawberry?", it must reason about subword pieces rather than individual letters — which is why it sometimes miscounts.

Challenge: Count the tokens in your most-used system prompt. Calculate the monthly token cost at GPT-4o pricing ($5 / 1M tokens) if this prompt is sent 100,000 times per day.
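One way to set up the challenge arithmetic (the 60-token prompt size below is a placeholder; measure your real prompt with len(enc.encode(prompt))):

```python
tokens_per_prompt = 60                  # placeholder: substitute your measured count
calls_per_day     = 100_000
price_per_token   = 5.00 / 1_000_000    # $5 per 1M input tokens (GPT-4o pricing above)

daily_usd   = tokens_per_prompt * calls_per_day * price_per_token
monthly_usd = daily_usd * 30
print(f"Daily: ${daily_usd:.2f}, Monthly: ${monthly_usd:.2f}")  # Daily: $30.00, Monthly: $900.00
```

Even a short 60-token system prompt costs hundreds of dollars a month at this volume, so trimming a few tokens from a hot prompt pays off quickly.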


πŸ› οΈ HuggingFace Tokenizers: BPE and WordPiece in Two Lines of Code

HuggingFace Tokenizers is the open-source, Rust-backed library that powers tokenization for GPT-2, BERT, LLaMA, Mistral, and virtually every modern LLM. Together with the high-level transformers API, it is the standard tool for inspecting exactly how any model splits text into token IDs.

The library makes every concept in this post directly observable: you can load real BPE (GPT-2) and WordPiece (BERT) tokenizers, inspect split boundaries, count tokens, and measure the cost difference between languages in under 20 lines of code.

# pip install transformers tokenizers

from transformers import AutoTokenizer

# ── BPE tokenizer (GPT-2 style) ───────────────────────────────────────────────
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is fascinating!"
gpt2_ids   = gpt2_tok.encode(text)
gpt2_split = gpt2_tok.convert_ids_to_tokens(gpt2_ids)
print("GPT-2 BPE:")
print("  IDs:    ", gpt2_ids)       # → [30642, 1634, 318, 17049, 0]
print("  Tokens: ", gpt2_split)     # → ['Token', 'ization', 'Ġis', 'Ġfascinating', '!']
print("  Count:  ", len(gpt2_ids))  # → 5

# ── WordPiece tokenizer (BERT style) ──────────────────────────────────────────
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

bert_ids   = bert_tok.encode(text, add_special_tokens=False)
bert_split = bert_tok.convert_ids_to_tokens(bert_ids)
print("\nBERT WordPiece:")
print("  IDs:    ", bert_ids)       # → [19204, 3989, 2003, 18895, 999]
print("  Tokens: ", bert_split)     # → ['token', '##ization', 'is', 'fascinating', '!']
print("  Count:  ", len(bert_ids))  # → 5
# Note: '##' prefix marks continuation sub-words in WordPiece

# ── Token cost comparison: English vs. Japanese ───────────────────────────────
english  = "Artificial intelligence is transforming software."
japanese = "人工知能はソフトウェアを変革しています。"

print("\nToken counts (GPT-2 BPE):")
print(f"  English:  {len(gpt2_tok.encode(english))} tokens")   # → ~8
print(f"  Japanese: {len(gpt2_tok.encode(japanese))} tokens")  # → ~25+ (2-3× more)

# ── Reproduce the "strawberry" counting problem ───────────────────────────────
word = "strawberry"
pieces = gpt2_tok.convert_ids_to_tokens(gpt2_tok.encode(word))
print(f"\n'strawberry' splits to: {pieces}")
# → ['straw', 'berry']  — model sees 2 tokens, not 10 characters

# ── Count prompt tokens before sending to API ────────────────────────────────
# Use the model-specific tokenizer to get an accurate count
system_prompt = "You are a helpful assistant. Always answer concisely."
user_message  = "Explain the difference between SQL and NoSQL databases."
full_prompt   = system_prompt + "\n" + user_message

token_count = len(gpt2_tok.encode(full_prompt))
cost_usd    = token_count / 1_000_000 * 5.00  # GPT-4o input pricing
print(f"\nPrompt token count: {token_count}")
print(f"Cost per 1 call: ${cost_usd:.6f}")

The ## prefix in BERT WordPiece tokens marks a continuation sub-word (part of a longer word), while GPT-2 BPE uses the Ġ prefix to mark a token that follows a space. Both are subword algorithms but produce different split boundaries — a subtle source of model-specific behavior differences.

For a full deep-dive on HuggingFace Tokenizers and vocabulary training, a dedicated follow-up post is planned.


📚 Lessons from Understanding Tokenization

Lesson 1 — Token count, not word count, determines cost. Every LLM API charges per token. A 500-word prompt in English is ~667 tokens. The same content in a morphologically rich language (Finnish, Turkish) may be 1,200 tokens. Always budget in tokens, not words.

Lesson 2 — The context window is a token budget, not a word budget. A 128k-token context window holds roughly 96,000 English words. But a JSON-heavy system prompt with brackets, colons, and quoted keys burns tokens at 1.5–2× the prose rate. Compact your structured data.

Lesson 3 — Tokenization artifacts explain model quirks at the character level. Letter counting, palindrome detection, reversing strings, and rhyme finding all require character-level reasoning — but the model only sees token IDs. These tasks are genuinely harder for LLMs, not symptoms of low intelligence.

Lesson 4 — Consistent tokenization is a stability guarantee. The same text always produces the same token sequence with the same tokenizer version. If you cache LLM responses keyed by the token sequence, you get deterministic cache hits. If you upgrade the tokenizer, the cache is invalidated.
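A minimal sketch of that caching idea (cache_key is a hypothetical helper name; any stable hash over the ID sequence works):

```python
import hashlib

def cache_key(token_ids: list[int]) -> str:
    """Deterministic cache key derived from a token ID sequence."""
    payload = ",".join(map(str, token_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The same token sequence always yields the same key;
# any change to the sequence (or the tokenizer that produced it) invalidates it.
assert cache_key([8908, 2065, 374]) == cache_key([8908, 2065, 374])
assert cache_key([8908, 2065, 374]) != cache_key([8908, 2065, 375])
```

In practice you would also fold the tokenizer name and version into the key, so a tokenizer upgrade invalidates the cache explicitly rather than silently.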

Lesson 5 — Use tiktoken for cost estimation before shipping a prompt to production. Never estimate token count by eye. A prompt that looks like 200 words may be 350 tokens once you count the system prompt, conversation history, tool definitions, and output format instructions. Measure first.


📌 TLDR: Summary & Key Takeaways

  • Tokenization converts text to integer sequences — LLMs work with tokens, not words or characters.
  • BPE builds a vocabulary by iteratively merging the most frequent adjacent character pairs.
  • 1 token ≈ 4 English characters ≈ 0.75 words — use this to estimate context and cost.
  • Tokenization boundaries explain why LLMs struggle with spelling, letter counting, and exact character operations.
  • Modern tokenizers (tiktoken, SentencePiece) expose these internals; always count tokens before sending expensive prompts.

πŸ“ Practice Quiz

  1. What is the first step of BPE tokenization?

    • A) Splitting text into sentences
    • B) Starting with individual characters and counting pair frequencies
    • C) Building a large word dictionary
    • D) Removing stop words

    Correct Answer: B — BPE bootstraps from the smallest unit (individual characters or bytes) and iteratively merges the most frequent adjacent pair until the vocabulary size target is reached.

  2. Why does "tokenization" get split into two tokens by BPE?

    • A) It is too long to fit in one token
    • B) The full word sequence is not in the learned vocabulary; it gets split at a subword boundary
    • C) BPE always splits words longer than 8 characters
    • D) The word contains an irregular character pattern

    Correct Answer: B — BPE's vocabulary is built from the training corpus. "tokenization" as a whole unit may be rare enough that the merge that would create it was never reached. The split into "token" + "ization" reflects two high-frequency subwords that were merged earlier.

  3. Why do LLMs struggle to count the letters in a word like "strawberry"?

    • A) LLMs are not trained on spelling tasks
    • B) Characters are not the input unit — the model sees token IDs, not individual letters
    • C) "Strawberry" contains an irregular letter pattern
    • D) The model only processes words up to 6 characters

    Correct Answer: B — "strawberry" tokenizes to ["straw", "berry"] in GPT-4o. The model must reason about character structure from within each subword piece — a task that its training data provides little direct signal for.



Written by Abstract Algorithms (@abstractalgorithms)