
How Apache Lucene Works: The Engine Behind Elasticsearch

Elasticsearch and Solr are built on Lucene. We explain the Inverted Index, Segments, and TF-IDF scoring.

Abstract Algorithms · 5 min read

TLDR: Lucene is a search library. Its core innovation is the inverted index: a reverse map from words to documents, like the index at the back of a textbook. Documents are stored in immutable segments that Lucene merges in the background to keep queries fast. Elasticsearch is Lucene distributed across a cluster.


📖 The Book Index That Elasticsearch Is Built On

Imagine a 1,000-page technical book. You want to find every page that mentions "distributed lock." The slow way: read every page. The fast way: check the index at the back.

Lucene's inverted index works exactly like that book index:

"distributed" → pages 42, 87, 203, 456
"lock"        → pages 87, 203, 312
"distributed" AND "lock" → intersection → pages 87, 203

This structure is built at index time and stays on disk. Queries read it without scanning every document.
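The idea fits in a few lines of Python. This is a toy sketch, not Lucene's on-disk format; the document texts are invented so the posting lists match the page numbers above.

```python
# Toy inverted index: term -> sorted posting list of document IDs.
docs = {
    42: "distributed tracing basics",
    87: "a distributed lock service",
    203: "distributed lock ordering",
    312: "lock-free queues",
    456: "distributed consensus",
}

index = {}
for doc_id, text in docs.items():
    for term in set(text.replace("-", " ").split()):
        index.setdefault(term, []).append(doc_id)

def search_and(*terms):
    """AND query: intersect the posting lists of every term."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_and("distributed", "lock"))  # [87, 203]
```

The query never looks at the document texts; it only intersects the precomputed lists, which is why it stays fast as the corpus grows.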


🔢 The Inverted Index: From Words to Posting Lists

The core data structure:

Term         Posting List (doc IDs)
---------    ----------------------
"error"    →  [3, 7, 12, 45]
"timeout"  →  [7, 12, 89]
"database" →  [3, 12, 203]

Each posting also stores:

  • Term frequency (TF): how many times the term appears in that document
  • Position offsets: for phrase queries like "database error" in sequence
  • Document norms: document length for relevance scoring

Building this at write time is expensive, but queries become near-instant.
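Storing positions alongside each posting is what makes phrase queries possible: "database error" matches only when the two terms occur at consecutive positions. A minimal sketch of that idea (the helper names and sample documents are made up for illustration):

```python
from collections import defaultdict

# index[term][doc_id] -> list of token positions where the term occurs.
index = defaultdict(lambda: defaultdict(list))

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term][doc_id].append(pos)

def phrase_search(phrase):
    """Match documents where the terms appear consecutively, in order."""
    terms = phrase.lower().split()
    first, rest = terms[0], terms[1:]
    hits = []
    for doc_id, positions in index[first].items():
        for p in positions:
            # Every following term must occur at the next position.
            if all(p + i + 1 in index[t].get(doc_id, []) for i, t in enumerate(rest)):
                hits.append(doc_id)
                break
    return sorted(hits)

add_document(1, "the database error log")
add_document(2, "an error in the database")

tf = len(index["database"][1])            # term frequency of "database" in doc 1
print(phrase_search("database error"))    # [1] -- only doc 1 has them in sequence
```

Both documents contain "database" and "error", so a plain AND query would return both; the position check is what narrows the phrase match to doc 1.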


⚙️ Segments: Why Lucene Writes Immutable Files

Lucene never updates an existing index file. Instead:

  1. Buffer: New documents accumulate in memory.
  2. Flush: The buffer is written as a new, immutable segment (a mini-index with its own inverted index).
  3. Search: All segments are queried and results are merged.
  4. Merge: Background threads merge small segments into larger ones to reduce query overhead.

```mermaid
flowchart TD
    Docs[New Documents] --> Buffer[In-Memory Buffer]
    Buffer -->|Full| Seg1[Segment 1\nimmutable]
    Buffer -->|Full| Seg2[Segment 2\nimmutable]
    Buffer -->|Full| Seg3[Segment 3\nimmutable]
    Seg1 & Seg2 & Seg3 -->|Background merge| BigSeg[Merged Segment]
    Query[Search Query] --> Seg1 & Seg2 & Seg3 & BigSeg
    Seg1 & Seg2 & Seg3 & BigSeg --> Merge[Merge results]
    Merge --> Results[Ranked results]
```

Deletes: Lucene doesn't delete from segments (they're immutable). When you delete a document, Lucene writes its doc ID to a .del file. The segment still contains the document, but it is filtered out at query time. The data is physically removed only during a segment merge.
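The whole lifecycle, including soft deletes, can be sketched in a few dozen lines. This is an in-memory illustration under simplified assumptions (tiny buffer, whitespace tokenization); real Lucene segments are sets of on-disk files.

```python
# Sketch of the segment lifecycle: buffer -> flush -> soft delete -> merge.

class Segment:
    def __init__(self, docs):
        self.docs = dict(docs)   # immutable after flush: never modified
        self.deleted = set()     # the ".del" sidecar: IDs filtered at query time

    def search(self, term):
        return [doc_id for doc_id, text in self.docs.items()
                if term in text.split() and doc_id not in self.deleted]

class Index:
    def __init__(self, buffer_size=2):
        self.buffer, self.segments, self.buffer_size = {}, [], buffer_size

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_size:   # flush: buffer -> new segment
            self.segments.append(Segment(self.buffer))
            self.buffer = {}

    def delete(self, doc_id):
        for seg in self.segments:                  # soft delete: mark, don't rewrite
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def merge(self):
        """Background merge: rewrite live docs into one segment, dropping deletes."""
        live = {d: t for seg in self.segments
                for d, t in seg.docs.items() if d not in seg.deleted}
        self.segments = [Segment(live)]

    def search(self, term):                        # query every segment, merge hits
        return sorted(h for seg in self.segments for h in seg.search(term))

idx = Index()
idx.add(1, "error timeout"); idx.add(2, "error database")   # flush -> segment 1
idx.add(3, "database error"); idx.add(4, "timeout")         # flush -> segment 2
idx.delete(2)
print(idx.search("error"))   # [1, 3] -- doc 2 filtered by the soft delete
idx.merge()                  # physical removal happens only here
```

Note that `delete` never touches a segment's `docs`: the document stays on "disk" and is merely masked at query time, exactly the `.del` behavior described above.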


🧠 BM25 Scoring: TF, IDF, and Why "The" Gets Zero Points

When you search "dragon lore", Lucene ranks matching documents by relevance. The scoring formula is BM25 (the successor to TF-IDF):

$$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{TF}(t,d) \cdot (k_1 + 1)}{\text{TF}(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$$

In plain English:

  • TF (Term Frequency): "dragon" appearing 5 times in a document is more relevant than appearing once. Diminishing returns: going from 1 to 5 helps more than 50 to 55.
  • IDF (Inverse Document Frequency): "dragon" appearing in 10 out of 1M documents is rare (high IDF = high importance). "the" appears in every document (IDF ≈ 0 = useless signal).
  • Length normalization ($b$): A short document with "dragon" once is often more focused than a long document with it once.

Term       TF in doc   IDF (rarity)      Score contribution
--------   ---------   ---------------   ------------------
"dragon"   5           high (rare)       high
"the"      50          ~0 (everywhere)   ~0
"lore"     2           medium            medium
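The formula is easy to evaluate directly. The sketch below uses the common BM25 defaults k1 = 1.2 and b = 0.75 and a Lucene-style IDF; the corpus statistics (N, document frequencies, lengths) are invented to mirror the table above.

```python
import math

k1, b = 1.2, 0.75  # standard BM25 defaults

def idf(N, df):
    """BM25 IDF: rare terms score high, ubiquitous terms score ~0."""
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def bm25_term(tf, doc_len, avg_len, N, df):
    """Score contribution of one query term for one document."""
    return idf(N, df) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Hypothetical corpus: 1M docs, this doc is 120 tokens vs. a 100-token average.
N, doc_len, avg_len = 1_000_000, 120, 100
for term, tf, df in [("dragon", 5, 10), ("the", 50, 1_000_000), ("lore", 2, 5_000)]:
    print(f"{term:8s} tf={tf:3d} df={df:9,d} score={bm25_term(tf, doc_len, avg_len, N, df):.4f}")
```

Running this reproduces the table's ordering: "dragon" dominates despite "the" having ten times its term frequency, because with df = N the IDF term collapses to nearly zero.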

⚖️ What Lucene Cannot Do

  • No ACID transactions: a write is durable once the segment is flushed; partial flushes are not atomic across documents.
  • Slow real-time visibility: documents are not visible until the segment is refreshed (~1 second in Elasticsearch).
  • No joins: Lucene matches single documents; cross-document relationships need denormalization.
  • High update cost: an update is a delete plus a re-index, so frequently updated fields are expensive.
  • Not a data store: you cannot rely on Lucene as a source of truth; it's a search index over another store.

📌 Key Takeaways

  • Lucene's core is the inverted index: maps terms → documents for near-instant full-text search.
  • Documents are written as immutable segments; background merges keep query speed high.
  • Deletes use a .del file (soft delete); physical removal happens during merges.
  • BM25 scores relevance using TF (frequency), IDF (rarity), and document length normalization.
  • Lucene is not a database: use it as a search layer over a primary store (PostgreSQL, DynamoDB, etc.).

🧩 Test Your Understanding

  1. What is the difference between an inverted index and a forward index?
  2. Why does Lucene use immutable segments instead of updating documents in place?
  3. You index 10,000 documents and delete 2,000. How does Lucene handle the deletes physically?
  4. Why does the word "the" get almost zero score in BM25?
