
Natural Language Processing (NLP): Teaching Computers to Read

Abstract Algorithms · 6 min read

TL;DR

This guide explains how we turn words into numbers (Embeddings) and how machines learn to understand grammar, sentiment, and meaning.


Introduction: The Language Barrier

To a computer, the word "Apple" is just a string of bytes (01000001...). It has no concept of fruit, technology, or pie. Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, generate, and even reason with human language in a way that's useful and natural.

NLP powers everything from chatbots and search engines to translation apps, voice assistants, sentiment analysis on reviews, and even medical report summarization. Today, NLP has moved far beyond basic word matching — thanks to massive scaling, transformers, and multimodal integration (text + images/video).


Step 1: Preprocessing (Cleaning the Text)

Before a model can "read," we must clean and prepare messy human text.

1. Tokenization: The Art of Breaking Words

Tokenization is the process of breaking text into smaller chunks called tokens.

A. Word-based Tokenization (Classic)

  • Input: "I'm learning."
  • Output: ["I", "'m", "learning", "."]
  • Issue: Huge vocabulary size. "Run", "Running", "Runs" are all separate IDs. If the model sees "Runnning" (typo), it fails (Out of Vocabulary).

B. Subword Tokenization (The Modern Standard)

Modern LLMs use subword algorithms to balance vocabulary size and meaning: they break rare words into smaller, meaningful pieces. A quick code comparison follows the list below.

  • Byte-Pair Encoding (BPE) (Used by GPT-2, GPT-4, RoBERTa):

    • How it works: Iteratively merges the most frequent pair of adjacent characters/tokens.
    • Example: The word "smartest" might be split into ["smart", "est"].
    • Real-world: a GPT-style BPE tokenizer might split "unbelievable" into pieces like ["un", "believ", "able"] (the exact split depends on the trained vocabulary).
  • WordPiece (Used by BERT, DistilBERT):

    • How it works: Similar to BPE but selects merges that maximize the likelihood of the training data. Uses ## to mark parts of a word.
    • Example: "playing" → ["play", "##ing"].
    • Why: The ## tells the model "this attaches to the previous token."
  • SentencePiece (Used by T5, Llama, ALBERT):

    • How it works: Treats the input as a raw stream of Unicode characters, including spaces, and replaces each space with a special visible character, ▁ (U+2581).
    • Example: "Hello World" → ["▁Hello", "▁World"].
    • Benefit: Fully reversible (you don't lose track of spaces) and language-agnostic.
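
To see the three schemes side by side, here is a minimal sketch, assuming the Hugging Face transformers library (plus sentencepiece for T5) is installed and the public "gpt2", "bert-base-uncased", and "t5-small" checkpoints can be downloaded:

```python
# Compare BPE, WordPiece, and SentencePiece on the same sentence.
from transformers import AutoTokenizer

text = "I'm learning tokenization."

bpe = AutoTokenizer.from_pretrained("gpt2")                     # Byte-Pair Encoding
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
sentencepiece = AutoTokenizer.from_pretrained("t5-small")       # SentencePiece

print(bpe.tokenize(text))            # byte-level BPE marks a following space with 'Ġ'
print(wordpiece.tokenize(text))      # WordPiece marks word continuations with '##'
print(sentencepiece.tokenize(text))  # SentencePiece marks word starts with '▁'
```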

2. Stemming vs. Lemmatization

Reducing words to their root form to simplify the vocabulary.

Example Sentence: "The wolves are running faster."

  • Stemming (Porter Stemmer):
    • Output: "The wolv are run faster."
    • Logic: Strips suffixes by rule ("wolves" → "wolv", "running" → "run"). Fast, but often produces non-words.
  • Lemmatization (WordNet Lemmatizer):
    • Output: "The wolf be run fast."
    • Logic: Uses a dictionary to map "wolves" → "wolf", "are" → "be", "faster" → "fast". Slower, but linguistically correct.
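
A minimal sketch of both approaches with NLTK (an assumed dependency; the lemmatizer also needs the WordNet data via nltk.download("wordnet")):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes by rule and can leave non-words.
print(stemmer.stem("wolves"), stemmer.stem("running"))  # -> wolv run

# Lemmatization looks words up in a dictionary and needs a part-of-speech hint.
print(lemmatizer.lemmatize("wolves", pos="n"),           # -> wolf
      lemmatizer.lemmatize("running", pos="v"))          # -> run
```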

Note: In modern Deep Learning (Transformers), these are often skipped because subword tokenization handles variations automatically.


Step 2: Vectorization (Words to Numbers)

Models can't do math on strings — we convert text into numbers.

1. The Old Way: TF-IDF (Term Frequency - Inverse Document Frequency)

Instead of just counting words (Bag of Words), we weigh them by how "unique" they are.

  • Formula: $TF \times IDF$
    • TF: How often the word appears in this document.
    • IDF: $\log(\frac{\text{Total Docs}}{\text{Docs with word}})$. Punishes words like "the" that appear everywhere.

Example:

  • Doc A: "The cat ran."
  • Doc B: "The dog ran."
  • Word "cat":
    • TF (in A): 1 (appears once).
    • IDF: $\log_{10}(2/1) \approx 0.3$ (appears in 1 of 2 docs).
    • Score: 0.3 (High importance).
  • Word "The":
    • TF (in A): 1.
    • IDF: $\log(2/2) = 0$ (appears in all docs).
    • Score: 0 (Ignored).
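
The same arithmetic as a small Python sketch (text lowercased, base-10 log, no smoothing; library implementations such as scikit-learn's TfidfVectorizer add smoothing terms, so their exact numbers differ):

```python
import math

docs = {
    "A": "the cat ran".split(),
    "B": "the dog ran".split(),
}

def tf_idf(word, doc_id):
    tf = docs[doc_id].count(word)                         # term frequency in this doc
    docs_with_word = sum(word in doc for doc in docs.values())
    idf = math.log10(len(docs) / docs_with_word)          # penalizes ubiquitous words
    return tf * idf

print(round(tf_idf("cat", "A"), 2))  # 0.3 -> informative
print(round(tf_idf("the", "A"), 2))  # 0.0 -> ignored
```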

2. Static Embeddings (Word2Vec, GloVe, FastText)

The breakthrough: Represent words as dense vectors in a high-dimensional space (e.g., 300 dimensions).

A. Word2Vec (2013): Learning Context

  • Skip-Gram: Tries to predict context words from a target word.
    • Input: "fox"
    • Target: Predict ["quick", "brown", "jumps"].
  • CBOW (Continuous Bag of Words): Tries to predict the target word from context.
    • Input: ["quick", "brown", "jumps"]
    • Target: Predict "fox".
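
A minimal training sketch with gensim (an assumed dependency) on a toy corpus; the sg flag switches between Skip-Gram (sg=1) and CBOW (sg=0):

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "quick", "red", "fox", "runs", "past", "the", "sleepy", "dog"],
]

skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # Skip-Gram
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)      # CBOW

print(skipgram.wv["fox"][:5])               # a 50-dimensional dense vector (first 5 values)
print(cbow.wv.most_similar("fox", topn=2))  # nearest neighbours in the toy space
```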

B. FastText (2016): Handling Unknown Words

What if the model sees "Googleplex" but only knows "Google"?

  • Word2Vec: Fails (Out of Vocabulary).
  • FastText: Breaks it down into n-grams: <go, goo, oog, ... lex>.
  • It sums the vectors of the parts to understand the whole. It guesses "Googleplex" is related to "Google".
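
A minimal sketch of the same idea with gensim's FastText (assumed installed): "googleplex" never appears in the toy training data, but its character n-grams overlap with "google", so it still gets a usable vector:

```python
from gensim.models import FastText

sentences = [
    ["google", "search", "engine"],
    ["google", "maps", "application"],
]
model = FastText(sentences, vector_size=32, window=2, min_count=1, min_n=3, max_n=5)

# "googleplex" is out of vocabulary, but its n-grams (<go, goo, oog, ...)
# overlap heavily with "google", so a vector can still be composed.
print(model.wv.similarity("google", "googleplex"))
```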

Famous proof-of-concept (the Word2Vec analogy): $$ V(King) - V(Man) + V(Woman) \approx V(Queen) $$
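
A minimal sketch of that arithmetic with gensim's pretrained vectors (assuming the small public "glove-wiki-gigaword-50" download is available; any pretrained word vectors would do):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Closest word to V(king) - V(man) + V(woman); "queen" is typically the top hit.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```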

3. Contextual Embeddings (BERT & GPT)

Static embeddings have a flaw: The word "Bank" has the same vector in "River bank" and "Bank deposit".

  • ELMo / BERT: Generate vectors dynamically.
  • Sentence A: "I went to the bank to deposit money." → "bank" vector leans toward finance.
  • Sentence B: "I sat on the river bank fishing." → "bank" vector leans toward nature.
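
A minimal sketch of this effect, assuming transformers and torch are installed and the public "bert-base-uncased" checkpoint is available:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual vector of the token "bank" in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]     # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)
    return hidden[position]

finance = bank_vector("I went to the bank to deposit money.")
river = bank_vector("I sat on the river bank fishing.")

# Same word, different vectors: cosine similarity is noticeably below 1.0.
print(torch.cosine_similarity(finance, river, dim=0).item())
```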

Step 3: Understanding Context (The Transformer Era)

The Transformer ("Attention is All You Need", 2017) changed everything by using Self-Attention.

Deep Dive: Self-Attention Example

How does a model understand pronouns?

Sentence: "The animal didn't cross the street because it was too tired."

The Calculation:

  1. The model takes the vector for "it".
  2. It calculates a dot product (similarity score) against every other word in the sentence.
    • Score("it", "The") = 0.01
    • Score("it", "animal") = 0.95 (High match!)
    • Score("it", "street") = 0.05
    • Score("it", "tired") = 0.60
  3. It updates the representation of "it" to include information from "animal".
  4. Now, the vector for "it" effectively means "The tired animal".

This allows the model to resolve ambiguity instantly, regardless of how far apart the words are.
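
A toy version of that calculation in NumPy, using made-up 4-dimensional vectors (real models use learned query/key/value projections and hundreds of dimensions):

```python
import numpy as np

tokens = ["animal", "street", "it", "tired"]
vectors = np.array([
    [0.9, 0.1, 0.0, 0.3],   # animal
    [0.1, 0.8, 0.2, 0.0],   # street
    [0.8, 0.2, 0.1, 0.4],   # it (made up to sit close to "animal")
    [0.2, 0.0, 0.9, 0.5],   # tired
])

query = vectors[tokens.index("it")]
scores = vectors @ query / np.sqrt(vectors.shape[1])  # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum()       # softmax -> attention weights

print(dict(zip(tokens, weights.round(2))))            # "it" attends most strongly to "animal"
contextual_it = weights @ vectors                     # "it" becomes a weighted mix of all tokens
```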


Common NLP Tasks (With Examples)

NLP tasks have exploded with LLMs. Here is what they look like in practice, with a quick code sketch after the list.

  1. Sentiment Analysis:

    • Input: "The battery life is terrible, but the screen is great."
    • Output: {"battery": "Negative", "screen": "Positive"} (Aspect-based sentiment).
  2. Named Entity Recognition (NER):

    • Input: "Elon Musk bought Twitter for $44 billion in 2022."
    • Output: [("Elon Musk", PERSON), ("Twitter", ORG), ("$44 billion", MONEY), ("2022", DATE)].
  3. Machine Translation:

    • Input: "Hello world."
    • Output: "Bonjour le monde." (French).
  4. Text Summarization:

    • Input: [A 5-page news article about a hurricane].
    • Output: "Hurricane X hit Florida yesterday, causing power outages for 1M people. Relief efforts are underway."
  5. Question Answering (RAG):

    • Input: "Based on the company PDF, what is the sick leave policy?"
    • Output: "Employees are entitled to 10 days of paid sick leave per year."

Summary & Key Takeaways

  • Preprocessing: Modern models use Subword Tokenization (BPE/WordPiece) to handle any word in any language.
  • Vectorization: We moved from counting words (TF-IDF) to Contextual Embeddings (BERT) where math equals meaning.
  • Attention: The mechanism that lets models understand relationships ("it" = "animal") across long distances.

What's Next?

We've covered the building blocks of language AI. Now we're ready for the main event: Large Language Models (LLMs) that scale these techniques to trillions of parameters. In the next post, we'll explore the giants — GPT, Llama, Grok, Claude, Gemini — and how they work at massive scale.

Ready to meet the LLMs? Subscribe to the series!

Written by Abstract Algorithms (@abstractalgorithms)