How Apache Lucene Works: The Engine Behind Elasticsearch
Elasticsearch and Solr are built on Lucene. We explain the inverted index, segments, and BM25 scoring.
TLDR: Lucene is a search library. Its core innovation is the inverted index: a reverse map from words to documents, like the index at the back of a textbook. Documents are stored in immutable segments that Lucene merges in the background to keep queries fast. Elasticsearch is Lucene distributed across a cluster.
The Book Index That Elasticsearch Is Built On
Imagine a 1,000-page technical book. You want to find every page that mentions "distributed lock." The slow way: read every page. The fast way: check the index at the back.
Lucene's inverted index works exactly like that book index:
"distributed" → pages 42, 87, 203, 456
"lock" → pages 87, 203, 312
"distributed" AND "lock" → intersection → pages 87, 203
This structure is built at index time and stays on disk. Queries read it without scanning every document.
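An AND query over two posting lists is just a sorted-list intersection. Here is a toy sketch (not Lucene's actual API) of the two-pointer walk that answers the book-index example above without scanning every document:

```python
# Illustrative sketch: intersecting two sorted posting lists with a
# two-pointer walk, the way a conjunctive (AND) query is answered.
def intersect(postings_a, postings_b):
    """Return doc IDs present in both sorted posting lists."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# "distributed" AND "lock" from the book-index example above
print(intersect([42, 87, 203, 456], [87, 203, 312]))  # [87, 203]
```

Because both lists are kept sorted by doc ID, the intersection runs in linear time in the lengths of the lists, never touching documents that match neither term.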
The Inverted Index: From Words to Posting Lists
The core data structure:
Term        Posting List (doc IDs)
---------   ----------------------
"error"    → [3, 7, 12, 45]
"timeout"  → [7, 12, 89]
"database" → [3, 12, 203]
Each posting also stores:
- Term frequency (TF): how many times the term appears in that document
- Position offsets: where the term occurs, needed for phrase queries like "database error" in sequence
- Document norms: document length information used for relevance scoring
Building this at write time is expensive but queries become near-instant.
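A minimal sketch of that write-time work (illustrative, not Lucene's real on-disk layout): for each term we record which documents it appears in and at which positions, which gives us both term frequency and phrase-query support for free.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            # Append this occurrence's position under (term, doc_id).
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    3: "database error on write",
    7: "timeout error",
    12: "database timeout error",
}
index = build_inverted_index(docs)
print(sorted(index["error"]))     # posting list for "error": [3, 7, 12]
print(len(index["error"][3]))     # term frequency of "error" in doc 3: 1
```

The position lists are what make a phrase query like "database error" answerable: the engine checks whether "error" occurs at a position exactly one greater than "database" in the same document.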
Segments: Why Lucene Writes Immutable Files
Lucene never updates an existing index file. Instead:
- Buffer: New documents accumulate in memory.
- Flush: The buffer is written as a new, immutable segment (a mini-index with its own inverted index).
- Search: All segments are queried and results are merged.
- Merge: Background threads merge small segments into larger ones to reduce query overhead.
flowchart TD
Docs[New Documents] --> Buffer[In-Memory Buffer]
Buffer -->|Full| Seg1[Segment 1\nimmutable]
Buffer -->|Full| Seg2[Segment 2\nimmutable]
Buffer -->|Full| Seg3[Segment 3\nimmutable]
Seg1 & Seg2 & Seg3 -->|Background merge| BigSeg[Merged Segment]
Query[Search Query] --> Seg1 & Seg2 & Seg3 & BigSeg
Seg1 & Seg2 & Seg3 & BigSeg --> Merge[Merge results]
Merge --> Results[Ranked results]
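The buffer/flush/search/merge lifecycle above can be sketched in a few lines. This is a simplification with invented class and method names, not Lucene's API; the one property it preserves is that a segment is never modified after it is flushed.

```python
# Illustrative sketch of the segment lifecycle.
class SegmentedIndex:
    def __init__(self, buffer_limit=2):
        self.buffer = {}        # doc_id -> set of terms (in memory)
        self.segments = []      # list of immutable {term: [doc_ids]}
        self.buffer_limit = buffer_limit

    def add(self, doc_id, text):
        self.buffer[doc_id] = set(text.lower().split())
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Write the buffer out as a new segment, then never touch it again.
        segment = {}
        for doc_id, terms in sorted(self.buffer.items()):
            for term in terms:
                segment.setdefault(term, []).append(doc_id)
        self.segments.append(segment)
        self.buffer = {}

    def search(self, term):
        # Query every segment and merge the per-segment results.
        hits = []
        for segment in self.segments:
            hits.extend(segment.get(term, []))
        return sorted(hits)

    def merge_segments(self):
        # Background merge: fold all small segments into one big one.
        merged = {}
        for segment in self.segments:
            for term, doc_ids in segment.items():
                merged.setdefault(term, []).extend(doc_ids)
        self.segments = [merged]

idx = SegmentedIndex()
idx.add(1, "error timeout")
idx.add(2, "database error")   # buffer full -> flushed as segment 1
idx.add(3, "error")
idx.flush()                    # segment 2
print(idx.search("error"))     # [1, 2, 3], merged across two segments
```

Note the query cost: every segment must be consulted, which is exactly why the background merge step exists. Fewer, larger segments mean fewer per-query lookups.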
Deletes: Lucene doesn't delete from segments (they're immutable). When you delete a document, Lucene writes its doc ID to a .del file. The segment still contains the document, but it is filtered out at query time. The data is physically removed only during a segment merge.
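A simplified sketch of that soft-delete behavior: the posting list still contains the deleted doc ID, and a separate set (playing the role of the .del file) filters it out at query time until a merge physically rewrites the data.

```python
# The segment is untouched; only the tombstone set changes on delete.
segment_postings = {"error": [3, 7, 12, 45]}
deleted_doc_ids = {7}   # the role of the .del file

def search(term):
    # Deleted docs are filtered at query time, not removed from disk.
    return [d for d in segment_postings.get(term, [])
            if d not in deleted_doc_ids]

print(search("error"))  # [3, 12, 45]: doc 7 is hidden, not yet removed

# A merge rewrites only the surviving docs and drops the tombstones:
segment_postings = {t: [d for d in ids if d not in deleted_doc_ids]
                    for t, ids in segment_postings.items()}
deleted_doc_ids = set()
```

This is why a heavily-deleted index can temporarily occupy more disk than its live data: the space is reclaimed only when merges run.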
BM25 Scoring: TF, IDF, and Why "The" Gets Zero Points
When you search "dragon lore", Lucene ranks matching documents by relevance. The scoring formula is BM25 (the successor to TF-IDF):
$$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{TF}(t,d) \cdot (k_1 + 1)}{\text{TF}(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$$
In plain English:
- TF (Term Frequency): "dragon" appearing 5 times in a document is more relevant than appearing once. Diminishing returns: going from 1 to 5 helps more than 50 to 55.
- IDF (Inverse Document Frequency): "dragon" appearing in 10 out of 1M documents is rare (high IDF = high importance). "the" appears in every document (IDF ≈ 0 = useless signal).
- Length normalization ($b$): A short document with "dragon" once is often more focused than a long document with it once.
| Term | TF in doc | IDF (rarity) | Score contribution |
|------|-----------|--------------|--------------------|
| "dragon" | 5 | high (rare) | high |
| "the" | 50 | ~0 (everywhere) | ~0 |
| "lore" | 2 | medium | medium |
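The table's point can be checked numerically. Below is a sketch of the BM25 formula above using Lucene's IDF variant, log(1 + (N − df + 0.5) / (df + 0.5)); k1 = 1.2 and b = 0.75 are the usual defaults, and the corpus numbers are invented for illustration.

```python
import math

def bm25(tf, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """One term's score contribution, per the formula above."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

N, dl, avgdl = 1_000_000, 100, 100          # hypothetical corpus stats
rare = bm25(tf=5, doc_freq=10, num_docs=N, doc_len=dl, avg_doc_len=avgdl)
common = bm25(tf=50, doc_freq=N, num_docs=N, doc_len=dl, avg_doc_len=avgdl)
print(rare, common)  # "dragon" scores high; "the" is vanishingly small
```

Even with TF = 50, the common term contributes essentially nothing because its IDF collapses toward zero. Note also the TF saturation built into the fraction: as TF grows, the term's contribution approaches IDF × (k1 + 1) rather than growing without bound.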
What Lucene Cannot Do
| Limitation | Detail |
|------------|--------|
| No ACID transactions | A write is durable once the segment is flushed; partial flushes are not atomic across documents |
| Delayed visibility | Documents are not visible until the segment is refreshed (~1 second in Elasticsearch) |
| No joins | Lucene matches single documents; cross-document relationships need denormalization |
| High update cost | Updates = delete + re-index; updates to high-frequency fields are expensive |
| Not a data store | You cannot rely on Lucene as a source of truth; it's a search index over another store |
Key Takeaways
- Lucene's core is the inverted index: it maps terms → documents for near-instant full-text search.
- Documents are written as immutable segments; background merges keep query speed high.
- Deletes use a .del file (soft delete); physical removal happens during merges.
- BM25 scores relevance using TF (frequency), IDF (rarity), and document length normalization.
- Lucene is not a database; use it as a search layer over a primary store (PostgreSQL, DynamoDB, etc.).
Test Your Understanding
- What is the difference between an inverted index and a forward index?
- Why does Lucene use immutable segments instead of updating documents in place?
- You index 10,000 documents and delete 2,000. How does Lucene handle the deletes physically?
- Why does the word "the" get almost zero score in BM25?
Written by
Abstract Algorithms
@abstractalgorithms