
Why Embeddings Matter: Solving Key Issues in Data Representation

How do computers understand that 'King' - 'Man' + 'Woman' = 'Queen'? Embeddings convert words into dense numerical vectors where semantic similarity becomes geometric proximity.

Abstract Algorithms · 5 min read

TLDR: Embeddings convert words (and images, users, products) into dense numerical vectors in a geometric space where semantic similarity = geometric proximity. "King - Man + Woman ≈ Queen" is not magic — it is the arithmetic property of well-trained embeddings.


📖 The One-Hot Problem: Numbers That Know Nothing

Before embeddings, machines represented words as one-hot vectors:

Vocabulary: [cat, dog, fish, car, truck]
"cat"  = [1, 0, 0, 0, 0]
"dog"  = [0, 1, 0, 0, 0]
"fish" = [0, 0, 1, 0, 0]

Problems:

  1. Sparse: 50,000-word vocab = 50,000-dimensional vectors that are 99.998% zeros.
  2. No similarity: Every pair of one-hot vectors is equally distant, so the machine sees cat and dog as no more similar than cat and car. Nothing in the representation captures that cats and dogs are both pets.

Embeddings solve both problems.
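Both problems are easy to see in code. The sketch below builds one-hot vectors for the toy vocabulary above and shows that every pair of distinct words has cosine similarity exactly 0 — one-hot geometry carries no information about meaning:

```python
import numpy as np

# Toy vocabulary and its one-hot encoding (illustrative only).
vocab = ["cat", "dog", "fish", "car", "truck"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Distinct one-hot vectors are always orthogonal:
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0 — "equally distant"
print(cosine(one_hot["cat"], one_hot["car"]))  # 0.0 — from everything
```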


🔢 Dense Vectors: Coordinates in Meaning Space

An embedding represents a word as a dense low-dimensional vector (e.g., 300 dimensions):

"cat"  → [0.8, -0.3,  0.6, 0.1, ...]   (300 values, all non-zero)
"dog"  → [0.7, -0.2,  0.5, 0.2, ...]   (similar to cat)
"car"  → [-0.1, 0.9, -0.4, 0.8, ...]   (different region of space)

Cosine similarity between cat and dog: ~0.92 (very close).
Cosine similarity between cat and car: ~0.1 (far apart).

The model has learned that cats and dogs live in the same region of the 300D semantic space.
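You can reproduce the effect with the truncated 4-dimensional vectors shown above (the values are invented for illustration, not taken from a trained model, so the exact similarities differ from the 300-D figures quoted):

```python
import numpy as np

# Hypothetical 4-D embeddings; real models use hundreds of dimensions.
emb = {
    "cat": np.array([0.8, -0.3, 0.6, 0.1]),
    "dog": np.array([0.7, -0.2, 0.5, 0.2]),
    "car": np.array([-0.1, 0.9, -0.4, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # high: same region of space
print(cosine(emb["cat"], emb["car"]))  # low/negative: different region
```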


⚙️ The Learning Principle: "You Shall Know a Word by the Company It Keeps"

This is Firth's distributional hypothesis (1957), and it is the foundation of word2vec, GloVe, and modern LLM embeddings.

Training signal: Predict surrounding words.

Context window: "... feeds the ___ every morning ..."
Target word: "cat"

Words that appear in similar contexts get similar representations. Cat and dog both appear near "pet," "feed," "vet," "collar" → their vectors converge in the training process.

Word2Vec (skip-gram) objective: Given a word $w$, maximize the probability of observing its context words $c_i$:

$$\max \sum_{(w, c) \in \text{corpus}} \log P(c | w)$$
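A minimal sketch of the data side of this objective: given a tokenized sentence and a window size, extract the (target, context) pairs that a skip-gram trainer would feed into the loss above. A real word2vec implementation processes millions of such pairs; the helper name here is my own:

```python
# Extract (target, context) skip-gram training pairs from one sentence.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Every word within `window` positions of the target is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "she feeds the cat every morning".split()
print(skipgram_pairs(sentence, window=2))
# "cat" pairs with "feeds", "the", "every", "morning" — exactly the
# company it keeps, which is what shapes its embedding.
```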


🧠 Vector Arithmetic: Why "King - Man + Woman ≈ Queen"

Because semantically coherent relationships are encoded as directions in embedding space:

direction("Man" → "King") ≈ direction("Woman" → "Queen")
i.e., vector("King") - vector("Man") ≈ vector("Queen") - vector("Woman")

Rearranging: $$\text{vector("Queen")} \approx \text{vector("King")} - \text{vector("Man")} + \text{vector("Woman")}$$

The "royalty" concept is a direction. The "gender" flip is another direction. These directions are geometrically consistent across thousands of analogies because they were learned from the same statistical patterns.

flowchart LR
    King["King"] -->|subtract| ManAxis["— Man direction"]
    ManAxis -->|add| WomanAxis["+ Woman direction"]
    WomanAxis --> Queen["≈ Queen\n(nearest neighbor in embedding space)"]
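A toy construction makes the geometry concrete. Here the embeddings are built by hand from two explicit directions ("royalty" and "gender") — real models learn such directions from data, and these values are invented — but the arithmetic behaves exactly as described:

```python
import numpy as np

# Two hand-picked semantic directions (invented for illustration).
royalty = np.array([1.0, 0.0])
female  = np.array([0.0, 1.0])

emb = {
    "man":   np.zeros(2),
    "woman": female,
    "king":  royalty,
    "queen": royalty + female,
}

result = emb["king"] - emb["man"] + emb["woman"]
# Nearest neighbor in the (tiny) embedding space:
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # queen
```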

⚙️ Embeddings in Production: Not Just Words

Modern embeddings go far beyond words:

| Input Type | Embedding Model | Application |
| --- | --- | --- |
| Text | BERT, Sentence-BERT, OpenAI text-embedding-3 | Semantic search, RAG, classification |
| Images | CLIP, ViT | Image search, visual Q&A, multimodal retrieval |
| Users | Collaborative filtering embeddings | Recommendation systems (Netflix, Spotify) |
| Products | Catalog embeddings | "Customers who bought X also bought Y" |
| Code | OpenAI Codex embeddings | Semantic code search |

Vector databases (Pinecone, Weaviate, Milvus, pgvector) store billions of embedding vectors and support approximate nearest neighbor (ANN) search — the "find the most semantically similar documents" query at scale.
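At small scale, that query is just a brute-force similarity scan — the sketch below does exact top-k cosine search over random stand-in vectors, which is what ANN indexes (HNSW, IVF) approximate at billion-vector scale:

```python
import numpy as np

# 1,000 random unit vectors standing in for document embeddings.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 128))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 42.
query = docs[42] + 0.01 * rng.normal(size=128)
query /= np.linalg.norm(query)

# For unit vectors, the dot product IS cosine similarity.
scores = docs @ query
top_k = np.argsort(scores)[::-1][:10]  # indices of 10 most similar docs
print(top_k[0])  # 42 — the near-duplicate ranks first
```

Vector databases keep results like this while replacing the exhaustive `docs @ query` scan with sub-linear index lookups.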


⚖️ One-Hot vs. Dense Embeddings: The Full Picture

| Property | One-Hot | Dense Embedding |
| --- | --- | --- |
| Dimensionality | Equal to vocabulary size (50K+) | Fixed low dimension (128–1536) |
| Sparsity | ~100% sparse | Dense (all values non-zero) |
| Semantic similarity | Not captured | Captured via geometric distance |
| Computation | High (huge sparse vectors) | Efficient (small dense vectors) |
| Supports analogies (King - Man + Woman) | ❌ | ✅ |
| Requires training | ❌ (constructed) | ✅ (learned from data) |

📌 Summary

  • One-hot encoding is sparse and captures no semantic similarity.
  • Embeddings are dense vectors learned from co-occurrence statistics; semantically similar items are geometrically close.
  • Firth's hypothesis: "A word is known by the company it keeps" — context predicts embeddings.
  • Vector arithmetic (King - Man + Woman ≈ Queen) works because semantic relationships are consistent directions in the embedding space.
  • Vector databases (Pinecone, pgvector) serve nearest-neighbor queries over billions of embeddings for RAG, recommendation, and semantic search.

📝 Practice Quiz

  1. Two words have a cosine similarity of 0.95 in embedding space. What does this indicate?

    • A) Their one-hot vectors overlap in 95% of positions.
    • B) The words appear in very similar contexts and are semantically close (e.g., "cat" and "kitten").
    • C) One word contains 95% of the letters of the other.
      Answer: B
  2. Why does "King - Man + Woman ≈ Queen" work with word embeddings?

    • A) It's a hard-coded rule in the embedding model.
    • B) The model learned consistent semantic directions from co-occurrence data — "royalty" and "gender" are separate geometric directions in the embedding space.
    • C) It only works for those four words specifically.
      Answer: B
  3. You need to find the 10 most semantically similar documents to a query in a corpus of 100 million documents. Which tool is designed for this?

    • A) A relational database with LIKE '%query%' search.
    • B) A vector database (e.g., Pinecone, Weaviate, pgvector) with approximate nearest-neighbor (ANN) search over embedding vectors.
    • C) An inverted index with TF-IDF ranking.
      Answer: B

Written by Abstract Algorithms (@abstractalgorithms)