
System Design HLD Example: Search Autocomplete (Google/Amazon)

Sub-10ms prefix lookup, ranking, typo tolerance, and language-partitioned indexes at Google scale.


TLDR: Search autocomplete must respond in sub-10ms to feel "instant." The core trade-off is Latency vs. Data Freshness: we use an offline pipeline (Spark) to pre-calculate prefix-to-suggestion mappings and store them in Redis Sorted Sets (or a specialized Trie) for near-instant retrieval, while sacrificing real-time updates for most queries.

⌨️ The "Keystroke" Performance Gap

Imagine you are typing in a search bar: "best res..." You expect the system to suggest "best restaurants" before you even finish the word "restaurants." An average person types at about 60-80 words per minute, which works out to roughly 150-200 milliseconds between keystrokes; fast typists narrow that gap to about 100 milliseconds.

In a poorly designed system, every keystroke triggers a SQL query: SELECT query FROM search_logs WHERE query LIKE 'best res%' ORDER BY count DESC LIMIT 5. Under the weight of 100,000 concurrent users, this LIKE query will crush your database. By the time the results for "best res" come back, the user has already typed "best restaurants near me," and the suggestions are irrelevant. Autocomplete isn't just a feature; it's a race against human typing speed. If you can't respond in under 20ms (leaving 80ms for network travel), you’ve already lost.

📖 Search Autocomplete: Use Cases & Requirements

Actors

  • Search User: Expects instant, relevant suggestions as they type.
  • Data Scientist / Analyst: Fine-tunes ranking algorithms based on click-through rates.
  • Admin: Manages blacklists for inappropriate or harmful suggestions.

Functional Requirements

  • Prefix Matching: Given a prefix (e.g., "ap"), return the top 5-10 most popular suggestions (e.g., "apple", "apple store").
  • Ranking: Suggestions must be sorted by popularity (frequency) and potentially freshness.
  • Typo Tolerance: Handle minor typos (e.g., "appple" -> "apple").
  • Personalization: (Bonus) Show suggestions based on the user's previous searches.

Non-Functional Requirements

  • Ultra-Low Latency: Response time must be < 10ms for the backend lookup.
  • High Availability: If the autocomplete service is down, the search bar should still work (graceful degradation).
  • Scalability: Support 100k+ queries per second (QPS).
  • Eventual Consistency: Suggestions don't need to update the millisecond a new term becomes popular (daily/hourly updates are fine).

🔍 Basics: The Trie Data Structure

At the heart of most autocomplete systems is a Trie (Prefix Tree). A Trie is a tree-like data structure where each node represents a character.

  • To find suggestions for "app", you traverse the tree: a -> p -> p.
  • Every node can store the top 10 most popular queries that pass through it.
  • The Catch: Building and updating a Trie in real-time at Google-scale is difficult. We need a way to serve these results without rebuilding a massive tree for every keystroke.
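
To make the node layout concrete, here is a minimal Python sketch of a Trie with the top-K suggestions pre-stored at every node. This is an illustration under assumptions, not a production structure; it assumes each query is inserted once with its already-aggregated frequency.

```python
from dataclasses import dataclass, field

TOP_K = 10  # suggestions pre-stored at every node

@dataclass
class TrieNode:
    children: dict[str, "TrieNode"] = field(default_factory=dict)
    # Pre-computed (frequency, query) pairs for all queries passing through
    # this node, kept sorted by frequency, descending.
    top_queries: list[tuple[int, str]] = field(default_factory=list)

class SuggestionTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, query: str, frequency: int) -> None:
        # Assumes one insert per query with its aggregated frequency.
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_queries.append((frequency, query))
            node.top_queries.sort(reverse=True)
            del node.top_queries[TOP_K:]  # keep only the top K per node

    def suggest(self, prefix: str) -> list[str]:
        node = self.root
        for ch in prefix:  # O(M) traversal, M = len(prefix)
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top_queries]  # O(K) retrieval
```

Traversing to "ap" takes two pointer hops regardless of dictionary size, which is exactly the property the serving path relies on.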

⚙️ Mechanics: Distribution & Frequency Aggregation

We don't compute suggestions on-the-fly. Instead, we use a Data Ingestion Pipeline:

  1. Log Collection: Every time a user submits a full search, we log it to a message queue (Kafka).
  2. Aggregation: A background job (Spark/Flink) aggregates these logs hourly/daily to calculate the "Frequency" of each query.
  3. Index Building: We take these frequencies and build our suggestion index (Trie or Redis).
  4. Serving: The API reads from the pre-computed index.

This "Offline" approach is what allows us to hit sub-10ms latencies.

📐 Estimations & Design Goals

  • Daily Active Users (DAU): 100 Million.
  • Searches per User: 10.
  • Keystrokes per Search: 20.
  • Total Autocomplete Requests: 100M × 10 × 20 = 20 Billion per day.
  • Average QPS: 20B / 86,400 ≈ 230,000 requests per second.

Design Goal: The system must be stateless and horizontally scalable. We will use Redis Sorted Sets as our primary serving layer because they are optimized for ranked range queries.

📊 High-Level Design: The Two-Path Architecture

The system is split into the "Fast Path" (Serving) and the "Slow Path" (Analytics).

```mermaid
graph TD
    User((User)) --> LB[Load Balancer]
    LB --> AS[Autocomplete Service]

    subgraph Serving_Fast_Path
        AS --> RC[(Suggestion Cache: Redis)]
    end

    subgraph Ingestion_Slow_Path
        AS --> MQ[Kafka]
        MQ --> Agg[Aggregation Service: Spark]
        Agg --> PDB[(Search History: Postgres)]
        Agg --> IB[Index Builder]
        IB --> RC
    end
```

The diagram shows the two-path architecture clearly. The Fast Path (top) serves user keystrokes directly from the Redis Sorted Set cache with a single ZREVRANGEBYSCORE call — the entire backend lookup completes in under 2ms. The Slow Path (bottom) records completed searches to Kafka asynchronously, aggregates frequency counts in Spark on an hourly or daily schedule, and rebuilds the suggestion index in Redis. This separation is what makes sub-10ms latency achievable: serving and indexing never compete for the same resources.

Suggestion Index Schema

The Redis Sorted Set structure for the suggestion index. Each prefix key maps to a set of candidate query strings, scored by their computed popularity weight.

| Redis Key | Member | Score | Notes |
| --- | --- | --- | --- |
| suggest:a | amazon | 9500 | Top query starting with "a" |
| suggest:a | apple | 9200 | Second query starting with "a" |
| suggest:ap | apple | 9200 | Top query starting with "ap" |
| suggest:ap | apple store | 4100 | Second query for "ap" |
| suggest:app | apple | 9200 | Top query for "app" |
| suggest:app | apple store | 4100 | Second for "app" |
| suggest:app | apple tv | 3800 | Third for "app" |

The score combines frequency count with a freshness decay factor: score = frequency * (1 - days_since_peak * 0.01). This ensures that trending queries from last week don't permanently dominate suggestions over newly popular terms.
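
A minimal redis-py sketch of the same schema: the index builder writes scored members with ZADD, and the serving path issues a single ZREVRANGEBYSCORE. The decay formula is the one above; clamping it at zero is my assumption, since the raw formula goes negative after 100 days.

```python
import redis

r = redis.Redis()  # assumed local instance

def score(frequency: int, days_since_peak: int) -> float:
    # score = frequency * (1 - days_since_peak * 0.01), clamped at zero
    # (clamping is an assumption, not part of the article's formula).
    return max(0.0, frequency * (1 - days_since_peak * 0.01))

# Index build side: one ZADD per prefix key, mirroring the table above.
r.zadd("suggest:app", {
    "apple": score(9200, 0),
    "apple store": score(4100, 0),
    "apple tv": score(3800, 0),
})

# Serving side: one ranked range query per keystroke.
top = r.zrevrangebyscore("suggest:app", "+inf", "-inf", start=0, num=10)
print([m.decode() for m in top])  # ['apple', 'apple store', 'apple tv']
```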

🧠 Deep Dive: Trie vs. Redis Sorted Sets, Typo Tolerance, and Personalization Under 10ms

The three hardest engineering problems in search autocomplete are choosing the right data structure for sub-10ms prefix lookup, handling typos without a full-text index, and personalizing results without violating the latency budget.

Internals: Trie vs. Redis Sorted Set — When Each Wins

A Trie (prefix tree) stores suggestions in a tree where each node represents a character. Traversing from root to "ap" takes exactly two pointer hops regardless of the dictionary size. The top-K suggestions for a prefix can be pre-stored at each node, making retrieval O(K) after the O(M) traversal (M = prefix length). The weakness is memory: a Trie for 100 million query strings with frequency counts stored at every prefix node requires hundreds of gigabytes of RAM and is complex to update incrementally.

Redis Sorted Sets (ZSETs) trade memory efficiency for operational simplicity. For every possible prefix (all prefixes of all queries up to length 10–15), a ZSET key stores the top-K candidate queries scored by popularity. The lookup is a single ZREVRANGEBYSCORE suggest:{prefix} +inf -inf LIMIT 0 10 call, which Redis executes as a skip-list traversal in O(log N + K) time. In practice, N is always bounded (you only store the top 200 candidates per prefix), making this effectively O(1). The trade-off is storage: a ZSET for every prefix of every query multiplies the storage by a factor of the average query length (roughly 8–12×). For 1 million unique queries of average length 8, you need roughly 8 million ZSET keys — manageable in a 16GB Redis instance.
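
A quick sanity check of that storage claim, as pure arithmetic; the 100 bytes-per-member overhead is a rough assumption, and this sketch assumes only the served top 10 are stored per key.

```python
# Back-of-envelope check of the "8 million keys in a 16GB instance" claim.
unique_queries = 1_000_000
avg_query_len = 8        # characters, so ~8 prefix keys per query
members_per_key = 10     # served top-K candidates stored per prefix
bytes_per_member = 100   # member string + score + skip-list overhead (assumed)

zset_keys = unique_queries * avg_query_len                      # 8,000,000
total_gb = zset_keys * members_per_key * bytes_per_member / 1e9
print(f"{zset_keys:,} keys ≈ {total_gb:.0f} GB")                # ≈ 8 GB
```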

When to use a Trie: When the dictionary fits in memory, you control the serving infrastructure, and you need latency strictly under 1ms (avoiding a network hop to Redis). Companies like Google and Bing use distributed Tries sharded by language and first character.

When to use Redis Sorted Sets: When the team is small, operational simplicity matters, and 2–5ms latency (including the Redis network hop) meets the latency budget. Redis is the right default for any system that doesn't have a dedicated search infrastructure team.

Performance Analysis: Offline Aggregation Pipeline and Index Freshness Budget

The serving latency of under 2ms is achievable only because the suggestion index is pre-computed offline. The aggregation pipeline runs in three stages. Stage 1: Kafka collects a stream of completed search events (not keystrokes — only the final submitted query). Stage 2: a Spark or Flink job aggregates these events over a configurable window (hourly for breaking news sites, daily for general search). The job computes, for each query string, the total frequency count, the geographic distribution, and the timestamp of peak popularity. Stage 3: an index builder writes the top-K suggestions per prefix into Redis Sorted Sets, applying the freshness decay formula to compute scores.

The freshness budget is the critical design parameter: how stale can suggestions be? At Google scale, suggestions update within 15–30 minutes to surface trending events (breaking news, product launches, celebrity mentions). At Amazon scale, product suggestions update hourly because inventory and pricing changes drive what is "best" to suggest. For most systems, a 1-hour update cycle is both technically achievable and business-acceptable. A 24-hour cycle is simpler to operate and sufficient for stable-catalog searches (job listings, recipe sites, academic papers).

🌍 Real-World Autocomplete Systems: Google, Amazon, and Elasticsearch

Google Suggest operates at 230,000 QPS with suggestions updated within 15 minutes of trending events. Google's implementation uses a distributed Trie sharded by geographic region and query language. Each Trie shard serves a subset of the prefix space (e.g., all prefixes starting with characters A–M in English for the US West Coast region). This limits Trie size to under 10GB per shard while achieving sub-1ms traversal latency within the data center.

Amazon A9's autocomplete prioritizes purchasable products and inventory availability alongside query frequency. A suggestion with high search frequency but zero inventory is ranked lower than a less-searched product that is in stock. Amazon's index builder runs a Spark job that joins search frequency data with real-time inventory signals from their product catalog, recomputing suggestion scores every 30 minutes.

Elasticsearch's completion suggester implements a variant of the Trie as an in-memory FST (Finite State Transducer) — a highly compressed trie-like structure that stores millions of suggestions in a fraction of the RAM a traditional trie requires. The FST is built at index time and loaded into the JVM heap for O(M) prefix traversal, where M is the prefix length. Elasticsearch's completion suggester is used by Yelp, Booking.com, and GitHub for their search bars.

⚖️ Trade-offs and Failure Modes in Search Autocomplete Design

| Dimension | Trade-off | Recommendation |
| --- | --- | --- |
| Trie vs. Redis Sorted Set | Trie: lower latency (<1ms), complex ops; Redis: simpler ops, 2–5ms latency | Redis ZSET for most systems; Trie for dedicated search teams at Google scale |
| Real-time vs. offline index | Real-time: always fresh, Kafka consumer on every keystroke (expensive); Offline: 15min–1hr stale, 99% cheaper | Offline pipeline for general search; real-time layer only for trending/news contexts |
| Exact prefix vs. fuzzy match | Exact: O(1) Redis lookup; Fuzzy (Levenshtein): 10–100ms full-text search | Exact match for the serving layer; fallback to Elasticsearch fuzzy search only on zero-result prefix |
| Global vs. personalized index | Global: single Redis cluster; Personalized: per-user model, 100× storage | Global index as the baseline; personalization as a re-ranking layer on top of global suggestions |

Zero-Result Prefix: When a user types a prefix that has no matching entry in the suggestion index (either because it is genuinely rare or because the index hasn't been updated yet), the system must gracefully degrade. The recommended fallback sequence is: (1) check the global Redis index, (2) if empty, check a smaller "top-100K queries" fallback index, (3) if still empty, return an empty suggestion list and log the prefix for inclusion in the next index build.
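
A sketch of that three-step degradation in Python with redis-py; the `fallback:` and `missed_prefixes` key names are illustrative assumptions, while `suggest:` follows the scheme used above.

```python
import redis

r = redis.Redis()

def suggest_with_fallback(prefix: str, k: int = 10) -> list[str]:
    # (1) Primary per-prefix index.
    members = r.zrevrangebyscore(f"suggest:{prefix}", "+inf", "-inf",
                                 start=0, num=k)
    if members:
        return [m.decode() for m in members]
    # (2) Smaller fallback index built from only the top-100K queries.
    members = r.zrevrangebyscore(f"fallback:{prefix}", "+inf", "-inf",
                                 start=0, num=k)
    if members:
        return [m.decode() for m in members]
    # (3) Degrade gracefully: return nothing now, log the miss so the
    # prefix can be considered in the next offline index build.
    r.lpush("missed_prefixes", prefix)
    return []
```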

Hot Key Amplification: At 230,000 QPS, the single-character prefix "a" receives every request for any query beginning with "a" — a potential hot-key bottleneck in Redis. Mitigations include read replicas for short-prefix keys, local in-process caching on the Autocomplete Service nodes (with a 1-second TTL for single-character prefixes where suggestions almost never change), and consistent hashing to spread prefix keys across a Redis Cluster.
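
The local cache is only a few lines; here is a minimal sketch with a 1-second TTL for the hottest (shortest) prefixes, assuming one cache per worker process.

```python
import time
import redis

r = redis.Redis()

# Per-process cache: prefix -> (timestamp, results). Single-threaded access
# per worker is assumed; add locking for multi-threaded servers.
_cache: dict[str, tuple[float, list[str]]] = {}

def cached_suggest(prefix: str, k: int = 10) -> list[str]:
    if len(prefix) <= 2:  # only cache the hottest (shortest) prefixes
        hit = _cache.get(prefix)
        if hit is not None and time.monotonic() - hit[0] < 1.0:
            return hit[1]  # served without touching Redis
    members = r.zrevrangebyscore(f"suggest:{prefix}", "+inf", "-inf",
                                 start=0, num=k)
    results = [m.decode() for m in members]
    if len(prefix) <= 2:
        _cache[prefix] = (time.monotonic(), results)
    return results
```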

🧭 Decision Guide: Serving Strategy by System Scale

| Query Volume (QPS) | Dataset Size | Recommended Serving Strategy |
| --- | --- | --- |
| Under 10,000 QPS | Under 1 million unique queries | Redis Sorted Sets, single instance, 1-hour update cycle |
| 10,000–100,000 QPS | 1–100 million queries | Redis Cluster, sharded by prefix first character, hourly index rebuild |
| 100,000–1M QPS | 100M+ queries | Distributed Trie per language, geographically sharded, 15-minute refresh cycle |
| Over 1M QPS, real-time trending | Any | Hybrid: offline Trie for base suggestions + real-time trending layer from Kafka Streams |
| Personalization required | Any | Global suggestion index + per-user embedding-based re-ranking (model served separately) |

Typo tolerance should never be in the critical serving path. Add a fuzzy-match fallback (Elasticsearch BM25 or Damerau-Levenshtein distance) only when the primary Redis lookup returns zero suggestions, and serve it with a relaxed latency budget (50–100ms). Do not attempt fuzzy matching on every keystroke.
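
A hedged sketch of that fallback using the Elasticsearch 8.x Python client; the `queries` index and `query_text` field are hypothetical names, not part of the article's design.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def fuzzy_fallback(prefix: str, k: int = 5) -> list[str]:
    # Only called when the primary Redis lookup returned zero suggestions;
    # served under a relaxed 50-100ms latency budget.
    resp = es.search(
        index="queries",          # hypothetical index name
        size=k,
        query={"match": {"query_text": {"query": prefix,
                                        "fuzziness": "AUTO"}}},
    )
    return [hit["_source"]["query_text"] for hit in resp["hits"]["hits"]]
```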

🧪 Interview Delivery Example: Designing Google-Scale Autocomplete in 45 Minutes

Minutes 1–5 — Frame the Human Constraint: Open with the typing speed math. Roughly 100ms between keystrokes for a fast typist. 80ms of that is network. You have 20ms for the backend. The entire system must fit in that budget. This framing immediately makes the "why offline index?" decision self-evident.

Minutes 6–15 — Two-Path Architecture: Draw the Fast Path and Slow Path separation. Explain that serving and indexing are completely decoupled. The Autocomplete Service only reads from Redis — it never writes. The Index Builder only writes to Redis — it never serves users. This isolation is what makes the system horizontally scalable on both axes independently.

Minutes 16–30 — Redis Sorted Set Serving Layer: Explain the key naming scheme (suggest:{prefix}), the ZREVRANGEBYSCORE query, and the score formula (frequency × freshness decay). Show the hot-key problem for single-character prefixes and propose read replicas plus local in-process caching with a 1-second TTL.

Minutes 31–40 — Offline Aggregation Pipeline: Walk through Kafka → Spark → Index Builder. Explain the freshness budget trade-off (15 min for news, 1 hour for general search, 24 hours for stable catalogs). Show how the index builder computes scores for each prefix from the aggregated frequency data.

Minutes 41–45 — Trade-offs and Failure Modes: Compare Trie vs. Redis Sorted Set. Address typo tolerance as a fallback layer, not a primary path. Address personalization as a re-ranking layer. Mention the zero-result prefix fallback and the hot-key amplification problem with mitigations.

🛠️ Open-Source Autocomplete Infrastructure Worth Knowing

  • Elasticsearch Completion Suggester: FST-based in-memory prefix index. Production-ready for up to ~50,000 QPS per node. Used by GitHub, Yelp, and Booking.com.
  • Redis: The default serving layer for custom-built autocomplete systems. Redis Sorted Sets provide the exact primitives needed (ranked range queries) with single-digit millisecond latency.
  • Apache Flink: Preferred for real-time frequency aggregation pipelines where suggestion freshness under 5 minutes is required. Maintains running frequency counts over a configurable time window.
  • Apache Spark: Preferred for hourly or daily batch index rebuilds. Spark's shuffle-based aggregation handles hundreds of terabytes of search log data efficiently.
  • OpenSearch (AWS): Fork of Elasticsearch with the completion suggester built-in, managed as a service. Reduces operational overhead for teams that don't want to run Elasticsearch clusters.

📚 Lessons Learned from Autocomplete Failures at Scale

A Single Hot Key Took Down the Suggestion Service During a Product Launch. An e-commerce platform's autocomplete Redis cluster received 40,000 QPS all targeting the key suggest:i (for "iPhone") the second a major phone launch was announced. A single Redis shard hit 100% CPU, causing 5-second response times across all suggestions — not just those starting with "i." The fix was a combination of local in-process suggestion caching on each Autocomplete Service pod (1-second TTL, covering the top 26 single-character prefix keys) and Redis Cluster with explicit slot assignment to distribute hot prefix keys across separate shards.

Offensive Content Appeared in Suggestions Because the Blacklist Was Applied Only at Index Build Time. A content platform's blacklist was applied by the Index Builder — it filtered out blacklisted queries before writing to Redis. However, a batch build job failed silently, and the previous day's index (which was built before a newly banned term was added to the blacklist) remained live. The banned term appeared in suggestions for 18 hours. The fix: apply the blacklist at both index build time and at serving time. The Autocomplete Service checks every candidate suggestion against an in-memory blacklist set (loaded from a configuration service) before returning results to the user.
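
A minimal sketch of the serving-time half of that fix; the refresh mechanism and all names here are illustrative assumptions, and the banned terms are assumed to be stored lowercase.

```python
import threading

# In-memory blacklist, refreshed periodically from a configuration service.
_blacklist: set[str] = set()
_lock = threading.Lock()

def refresh_blacklist(fetch_terms) -> None:
    """Called on a timer; fetch_terms() returns the current banned set."""
    global _blacklist
    terms = set(fetch_terms())
    with _lock:
        _blacklist = terms

def filter_suggestions(candidates: list[str]) -> list[str]:
    # Applied on every request, so an index built before a term was banned
    # can never surface that term to the user.
    with _lock:
        banned = _blacklist
    return [c for c in candidates if c.lower() not in banned]
```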

Stale Suggestions During a Crisis Led to User Safety Concerns. During a natural disaster, users searched for emergency services. The 24-hour index cycle meant suggestions still showed the previous day's top queries rather than emergency-relevant terms. The lesson: implement a real-time override channel — a small Redis hash of manually curated high-priority suggestions that the Autocomplete Service checks before the main ZSET index. This override set is writable by an admin tool and bypasses the offline pipeline for time-sensitive content curation.
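
A sketch of that override check; the `overrides` hash key and the "|" separator are illustrative assumptions for how the admin tool might store curated entries.

```python
import redis

r = redis.Redis()

def suggest_with_override(prefix: str, k: int = 10) -> list[str]:
    # Manually curated, time-sensitive suggestions live in a small Redis
    # hash written by an admin tool, checked before the main ZSET index.
    raw = r.hget("overrides", prefix)
    curated = raw.decode().split("|") if raw else []
    organic = [m.decode() for m in r.zrevrangebyscore(
        f"suggest:{prefix}", "+inf", "-inf", start=0, num=k)]
    # Curated entries rank first; de-duplicate while preserving order.
    merged: list[str] = []
    seen: set[str] = set()
    for s in curated + organic:
        if s not in seen:
            seen.add(s)
            merged.append(s)
    return merged[:k]
```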

📌 TLDR & Key Takeaways

  • Sub-10ms autocomplete is achievable only with a pre-computed offline index. Do not compute suggestions on the fly from a database LIKE query — this will not survive production traffic.
  • Use Redis Sorted Sets (ZREVRANGEBYSCORE suggest:{prefix}) as the default serving layer. The key naming scheme creates one ZSET per prefix; the score encodes popularity with a freshness decay factor.
  • The offline aggregation pipeline (Kafka → Spark/Flink → Index Builder) separates serving from indexing, allowing each to scale independently. The freshness budget (15 minutes to 24 hours) is a business decision driven by how quickly suggestions must reflect trending content.
  • Hot keys on short prefixes (single characters) are the primary scaling bottleneck. Mitigate with local in-process caching (1-second TTL) plus read replicas or Redis Cluster slot assignment.
  • Typo tolerance belongs in a fallback layer only — triggered when the primary index returns zero results. Never run fuzzy matching on every keystroke.
  • Apply the content blacklist at both index build time and serving time to prevent stale blacklist state from surfacing inappropriate suggestions.