Understanding Inverted Index and Its Benefits in Software Development
Why is Google so fast? It doesn't scan every webpage. It uses an Inverted Index.
Abstract Algorithms
TLDR: An Inverted Index maps every word to the list of documents containing it, the same structure as the back-of-the-book index. It is the core data structure behind every full-text search engine, including Elasticsearch, Lucene, and PostgreSQL full-text search.
The Textbook Index You Already Know
Open any textbook. Flip to the index at the back. Under "Binary Tree" you see: pages 42, 78, 203. That listing tells you exactly which pages to go to without reading every page.
A database full-text search without an inverted index would mean reading every document to find which ones contain "Binary Tree." For a billion documents, that's O(N).
An inverted index turns that into O(1) lookup followed by O(k log k) sort, where k = number of matching documents.
How an Inverted Index Is Built
Given three documents:
Doc 1: "Apple is fruit"
Doc 2: "Banana is fruit"
Doc 3: "Apple and Banana are both fruit"
Step 1: Tokenize. Split each document into tokens. Apply stemming ("fruits" → "fruit") and stop word removal ("is", "are", "and" removed).
Step 2: Build posting lists.

| Word   | Posting List |
|--------|--------------|
| apple  | [1, 3]       |
| banana | [2, 3]       |
| fruit  | [1, 2, 3]    |
Each posting list entry also stores the position of the word in the document (for phrase queries) and the frequency (for TF-IDF ranking).
Step 3: Sort posting lists by document ID. Sorted order is critical for fast intersection.
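The three steps above can be sketched in Python. This is a minimal illustration, not a production indexer: the "analyzer" is just a lower-case split plus the article's three stop words, and stemming is omitted because none of the example words need it. Note that positions are the token offsets *before* stop word removal, so gaps remain where stop words used to sit.

```python
from collections import defaultdict

# Stop words from Step 1 of the example.
STOP_WORDS = {"is", "are", "and"}

docs = {
    1: "Apple is fruit",
    2: "Banana is fruit",
    3: "Apple and Banana are both fruit",
}

def build_index(docs):
    """Map term -> posting list of (doc_id, position), sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting IDs in order keeps each posting list sorted
        tokens = docs[doc_id].lower().split()
        for pos, token in enumerate(tokens):
            if token not in STOP_WORDS:
                index[token].append((doc_id, pos))
    return dict(index)

index = build_index(docs)
# index["apple"]  == [(1, 0), (3, 0)]
# index["banana"] == [(2, 0), (3, 2)]
# index["fruit"]  == [(1, 2), (2, 2), (3, 5)]
```

Because document IDs are visited in ascending order, Step 3 (sorting) comes for free here; a real engine that ingests documents out of order sorts or merge-builds the lists explicitly.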
Executing a Query: AND, OR, and Phrase
AND Query: "Apple AND Banana"
Intersect the two posting lists:
apple: [1, 3]
banana: [2, 3]
intersection: [3]
Since both lists are sorted, a simple two-pointer merge achieves O(m + n) intersection, with no nested loops.
i → 1, j → 2: 1 < 2, advance i
i → 3, j → 2: 3 > 2, advance j
i → 3, j → 3: MATCH → output Doc 3
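The walkthrough above can be written as a short Python function (illustrative only; it operates on plain sorted lists of document IDs):

```python
def intersect(a, b):
    """Two-pointer intersection of two sorted posting lists, O(m + n)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # match: document contains both terms
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1            # advance the pointer at the smaller value
        else:
            j += 1
    return out

intersect([1, 3], [2, 3])  # -> [3]
```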
OR Query: "Apple OR Banana"
Union: [1, 2, 3], all documents containing either word.
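OR uses the same two-pointer merge, except both sides are emitted instead of only matches. A minimal sketch on plain sorted lists:

```python
def union(a, b):
    """Sorted merge (union) of two sorted posting lists, O(m + n)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1   # emit once, advance both
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])  # one list is exhausted; append the remainder of the other
    out.extend(b[j:])
    return out

union([1, 3], [2, 3])  # -> [1, 2, 3]
```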
Phrase Query: "Apple Banana" (adjacent)
Use position data: find documents in both lists where apple's position + 1 = banana's position.
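A sketch of that position check, using the (doc_id, position) postings from the example index. Because stop words are removed but positions are preserved, "and" still occupies position 1 in Doc 3, so "Apple Banana" correctly fails to match "Apple and Banana":

```python
def phrase_match(postings_a, postings_b):
    """Return doc IDs where term A occurs at position p and term B at p + 1.

    Each postings list holds (doc_id, position) pairs.
    """
    positions_a = {}
    for doc_id, pos in postings_a:
        positions_a.setdefault(doc_id, set()).add(pos)
    hits = set()
    for doc_id, pos in postings_b:
        if pos - 1 in positions_a.get(doc_id, set()):
            hits.add(doc_id)
    return sorted(hits)

apple  = [(1, 0), (3, 0)]   # positions from the example documents
banana = [(2, 0), (3, 2)]
phrase_match(apple, banana)  # -> []: in Doc 3, "and" sits between them, so not adjacent
```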
Skip Pointers: Speeding Up Large Posting List Intersection
When posting lists are large (millions of documents), pointer-by-pointer traversal is slow.
Skip pointers are placed every $\sqrt{n}$ entries in the list, allowing the traversal to jump ahead:
apple: [1, 3, 7, 11, 15, 22, ...] with skip pointers placed at regular intervals
If the pointer into apple sits at 3 and the pointer into banana sits at 22, skip pointers carry the apple pointer toward the first entry ≥ 22 without comparing 7, 11, and 15 one at a time.
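A minimal sketch of skip-assisted intersection on in-memory lists. Real indexes store the skip data alongside the on-disk postings; here the skip distance is simply computed as √n and a skip is taken whenever it is safe (it never overshoots the other list's current value):

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted posting lists using skips of ~sqrt(n) entries."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a   # take the skip: the landing entry is still <= b[j]
            else:
                i += 1        # fall back to a single step
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return out

intersect_with_skips([1, 3, 7, 11, 15, 22], [3, 22])  # -> [3, 22]
```

On the example, once the banana pointer reaches 22 the apple pointer skips from 3 to 15 in one move instead of stepping through 7 and 11.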
TF-IDF: Why "the" Ranks Lower Than "quantum"
A search for "quantum computing" should rank documents that specifically discuss quantum computing higher than documents that just happen to contain the word "quantum" once.
TF-IDF (Term Frequency × Inverse Document Frequency):
$$\text{score}(d, t) = TF(d, t) \times IDF(t)$$
$$TF(d, t) = \frac{\text{count of } t \text{ in doc } d}{\text{total tokens in doc } d}$$
$$IDF(t) = \log \frac{N}{\text{docs containing } t}$$
Intuition:
- TF: High if the term appears often in this document.
- IDF: High if the term appears in few documents overall (rare = discriminative).
"The" appears in every document → IDF ≈ 0 → score ≈ 0. "Quantum" appears in 100 of 1M documents → high IDF → high score in documents with high TF.
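The two formulas can be computed directly on a toy corpus (the three sentences below are made up for illustration). The score for "the" collapses to zero exactly because IDF = log(3/3) = 0, no matter how high its TF is:

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term / total tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)  # assumes the term occurs somewhere

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "the quantum computer runs the quantum algorithm".split(),
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
]
tf_idf("the", corpus[0], corpus)      # -> 0.0  (in all 3 docs: IDF = log(3/3) = 0)
tf_idf("quantum", corpus[0], corpus)  # ~0.31   (TF = 2/7, IDF = log 3 ~ 1.10)
```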
Inverted Index in Elasticsearch / Lucene
Elasticsearch (built on Lucene) stores inverted indexes per shard:
```mermaid
flowchart LR
    Doc["New Document\n(JSON)"] --> Analyze["Text Analyzer\n(tokenize, stem, stop-words)"]
    Analyze --> Index["Inverted Index\n(term → [docId, position, freq])"]
    Index --> Segment["Immutable Segment\nflushed to disk"]
    Segment --> Merge["Merge Segments\n(background, periodic)"]
```
Segments are immutable: new documents go into a new segment, not into existing ones. Periodic merging consolidates segments for query efficiency.
Summary
- An Inverted Index maps term → sorted list of document IDs (posting list).
- AND queries use two-pointer sorted intersection (O(m+n)).
- Skip pointers enable O(√n) jumps in large posting lists.
- TF-IDF scores documents by term frequency within the document × rarity across all documents.
- Elasticsearch stores indexes in immutable segments: new writes go to new segments, and background merging consolidates them.
Practice Quiz
You search for "Apple AND Banana." Apple's posting list is [1, 3] and Banana's is [2, 3]. What is the result?
- A) [1, 2, 3]: all documents mentioning either word.
- B) [3]: only the document containing both words.
- C) [1]: the first document in the Apple list.
Answer: B
Why does "the" score near zero in TF-IDF, even if it appears 100 times in a document?
- A) Stop word removal always filters "the" before indexing.
- B) IDF is near zero because "the" appears in nearly every document, making it non-discriminative; even high TF can't rescue it.
- C) TF penalizes words that appear too frequently.
Answer: B
Why does Elasticsearch write new documents to new immutable segments instead of updating the existing segment?
- A) To avoid disk fragmentation.
- B) Immutability means no locking during reads โ existing segments can be searched concurrently while new segments are written.
- C) To support time-series indexing.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms