
Tokenization Explained: How LLMs Understand Text

Abstract Algorithms · 4 min read

TL;DR


Computers don't read words; they read numbers. Tokenization is the process of breaking text down into smaller pieces (tokens) and converting them into numerical IDs that a Large Language Model can process. It's the foundational first step for any NLP task.


1. What is Tokenization? (The "No-Jargon" Explanation)

Imagine you are a chef preparing a meal. You don't throw a whole carrot into the pot; you first chop it into smaller, manageable pieces.

  • The Text: The raw ingredient (the carrot).
  • The Tokenizer: The knife.
  • The Tokens: The small, chopped pieces.

A tokenizer is a "knife" for text. It breaks down sentences and words into smaller units that the AI can understand. The model doesn't see "Hello World"; it sees [15496, 2159].
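You can see this for yourself with the open-source tiktoken library (one of several tokenizer libraries). The sketch below uses the GPT-2 vocabulary purely as an illustration; different models use different tokenizers and will produce different IDs.

```python
# A minimal sketch using the `tiktoken` library (pip install tiktoken).
# The exact IDs depend on which tokenizer you load; GPT-2's vocabulary
# is used here purely as an illustration.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Hello World")
print(ids)              # e.g. [15496, 2159] with the GPT-2 vocabulary
print(enc.decode(ids))  # "Hello World" -- decoding reverses the mapping
```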


2. Tokenization Strategies: The Good, The Bad, and The Ugly

There are three main ways to chop up text.

A. Word-Based Tokenization

  • How: Split text by spaces. ["Hello", "world", "!"]
  • Problem: What about "don't"? Is it one word or two? What about rare words like "epistemology"? The vocabulary would be enormous.
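To make the problem concrete, here is a rough sketch of a word-level tokenizer; real word tokenizers need far more rules than this toy version.

```python
# A toy word-level tokenizer: runs of word characters become tokens,
# and each punctuation mark becomes its own token.
import re

def word_tokenize(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello world!"))  # ['Hello', 'world', '!']
print(word_tokenize("don't"))         # ['don', "'", 't'] -- contractions get mangled
print(word_tokenize("epistemology"))  # ['epistemology'] -- every rare word needs its own ID
```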

B. Character-Based Tokenization

  • How: Split text into individual characters. ['H', 'e', 'l', 'l', 'o']
  • Problem: The meaning is lost. "H" has no semantic value on its own. The model has to work much harder to learn what words mean.
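Character-level splitting needs no special library at all, which also shows why it is so inefficient:

```python
# Character-level tokenization needs no vocabulary beyond the alphabet itself.
text = "Hello"
print(list(text))  # ['H', 'e', 'l', 'l', 'o']

# The downside: even a single word becomes a long sequence of low-meaning symbols.
print(len("antidisestablishmentarianism"))  # 28 tokens for one word
```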

C. Subword-Based Tokenization (The Winner)

  • How: A hybrid approach. Common words are single tokens ("the", "a"). Rare words are broken into meaningful sub-parts ("unbelievably" -> ["un", "believ", "ably"]).
  • Benefit: It balances vocabulary size and meaning. It can represent any word, even ones it has never seen before.
  • Algorithm: Byte-Pair Encoding (BPE) is the most common.
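The sketch below runs a few words through a real subword tokenizer, again using tiktoken's GPT-2 encoding as an illustration; the exact splits depend on the merges each tokenizer has learned.

```python
# Subword tokenization in practice. Common words stay whole; rare or invented
# words fall apart into known subwords, so nothing is ever "out of vocabulary".
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for word in ["the", "unbelievably", "flibbertigibbet"]:
    pieces = [enc.decode([i]) for i in enc.encode(word)]
    print(word, "->", pieces)
```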

3. Deep Dive: How Byte-Pair Encoding (BPE) Works

BPE is a simple but powerful algorithm that learns the most efficient way to "chop" text.

The Goal: Start with characters and merge the most frequent pairs until you reach a desired vocabulary size.

Toy Example: Learning a Vocabulary

Let's say our entire training data is:

  • "hug" (5 times)
  • "pug" (3 times)
  • "pun" (4 times)
  • "bun" (2 times)

Step 1: Initial Vocabulary (Characters). Our starting vocabulary is the set of individual characters: ['b', 'g', 'h', 'n', 'p', 'u'].

Step 2: Find the Most Frequent Pair. The pair "ug" appears 5 + 3 = 8 times (in "hug" and "pug"), making it the most common pair.

Step 3: Create a Merge Rule. We create a new token "ug" and add it to our vocabulary.

  • New Vocabulary: ['b', 'g', 'h', 'n', 'p', 'u', 'ug']

Step 4: Repeat. The most frequent remaining pair is now "un", which appears 4 + 2 = 6 times (in "pun" and "bun").

  • New Vocabulary: ['b', 'g', 'h', 'n', 'p', 'u', 'ug', 'un']

The Result: After training, when we see a new word like "bug", the tokenizer knows to split it into ["b", "ug"]. It can handle a new word without having seen it before!
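To make the merge loop concrete, here is a minimal sketch of BPE vocabulary learning on the same toy corpus. It is deliberately simplified: real implementations (e.g. SentencePiece or Hugging Face tokenizers) add byte-level fallbacks, end-of-word markers, and careful tie-breaking.

```python
# A minimal sketch of BPE training on the toy corpus above.
from collections import Counter

corpus = {"hug": 5, "pug": 3, "pun": 4, "bun": 2}   # word -> frequency
words = {w: list(w) for w in corpus}                # each word as a list of symbols

def most_frequent_pair(words, corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for w, symbols in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += corpus[w]
    return pairs.most_common(1)[0] if pairs else None

num_merges = 2
for _ in range(num_merges):
    best = most_frequent_pair(words, corpus)
    if best is None:
        break
    (a, b), count = best
    print(f"merge {a!r}+{b!r} (seen {count} times) -> {a + b!r}")
    # Apply the merge rule to every word in the corpus.
    for w, symbols in words.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        words[w] = merged

print(words)
# {'hug': ['h', 'ug'], 'pug': ['p', 'ug'], 'pun': ['p', 'un'], 'bun': ['b', 'un']}
```

Running it reproduces the walkthrough: the first merge creates "ug" (seen 8 times), the second creates "un" (seen 6 times), and a new word like "bug" would then split into ["b", "ug"] using those learned merges.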


4. Real-World Application: Why Tokens Matter

Tokenization isn't just an academic detail; it has huge practical implications.

A. Context Window Limits

  • The Problem: An LLM like Llama 3 has a context window of 8,192 tokens. This is not 8,192 words.
  • Example: The phrase "Tokenization is unbelievably important" is only 4 words, but it could be 7 tokens: ["Token", "ization", " is", " un", "believ", "ably", " important"].
  • Impact: Complex words use up your context window faster, as the token-counting sketch below shows.
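A quick way to protect your budget is to count tokens before sending text to the model. The sketch below uses tiktoken's GPT-2 encoding as a stand-in (counts will differ with other tokenizers), and the 8,192-token limit is just the example figure from above.

```python
# Check how much of a context window a piece of text consumes.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "Tokenization is unbelievably important"
n_tokens = len(enc.encode(text))
print(f"{len(text.split())} words -> {n_tokens} tokens")
print(f"fraction of an 8,192-token window: {n_tokens / 8192:.2%}")
```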

B. API Costs

  • The Problem: OpenAI, Anthropic, and Google all charge you per token, not per word.
  • Example: Sending a long, technical document with many rare words will cost more than sending a simple story of the same word count.
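A back-of-the-envelope estimate makes this concrete. The price and the tokens-per-word ratio below are illustrative assumptions, not any vendor's actual rates.

```python
# Rough cost estimate: words -> tokens -> dollars.
def estimate_cost(word_count: int, tokens_per_word: float = 1.3,
                  price_per_1k_tokens: float = 0.01) -> float:
    tokens = word_count * tokens_per_word
    return tokens / 1000 * price_per_1k_tokens

print(f"${estimate_cost(1000):.4f}")  # a 1,000-word essay -> roughly $0.013
```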

C. Multilingual Challenges

  • The Problem: A tokenizer trained mostly on English will be inefficient for other languages.
  • Example: A Japanese character might be broken into many meaningless byte tokens, making the model perform poorly on Japanese text. This is why multilingual models use much larger vocabularies.
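You can see the imbalance by tokenizing the same kind of sentence in two languages. The comparison below uses an English-heavy tokenizer (tiktoken's GPT-2 encoding, as an illustration); exact counts will vary with other tokenizers, but the non-English text typically needs noticeably more tokens.

```python
# Compare tokenizer efficiency across languages with an English-trained encoding.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["Hello, how are you?", "こんにちは、お元気ですか？"]:
    print(f"{text!r}: {len(enc.encode(text))} tokens")
```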

Summary & Key Takeaways

  • Tokenization: The process of breaking text into numerical IDs for an LLM.
  • Subword Tokenization (BPE): The standard method. It balances vocabulary size and the ability to represent any word.
  • Practical Impact: Tokens determine your API costs and how much text you can fit into a model's context window.
  • Rule of Thumb: 1 word ≈ 1.3 tokens in English.

Practice Quiz: Test Your Knowledge

  1. Scenario: Why is subword tokenization (like BPE) generally preferred over word-based tokenization?

    • A) It is faster to train.
    • B) It can handle rare or made-up words without failing.
    • C) It always results in fewer tokens.
  2. Scenario: You are using an LLM API that costs $0.01 per 1,000 tokens. You send a 1,000-word essay. What is the most likely cost?

    • A) Exactly $0.01.
    • B) Less than $0.01.
    • C) More than $0.01.
  3. Scenario: The word "antidisestablishmentarianism" is not in the tokenizer's vocabulary. What will a BPE tokenizer likely do?

    • A) Return an "Unknown Word" error.
    • B) Break it into smaller, known subwords like ["anti", "dis", "establish", ...].
    • C) Ignore the word completely.

(Answers: 1-B, 2-C, 3-B)

Written by Abstract Algorithms (@abstractalgorithms)