
System Design HLD Example: Ride Sharing (Uber)

A practical interview-ready HLD for a ride-sharing platform with real-time matching and location tracking.


TLDR: A ride-sharing platform is a real-time geospatial matching system. Drivers push GPS location every 5 seconds into a Redis geospatial index. When a rider requests a trip, the Matching Service runs a GEORADIUS query to find nearby available drivers, selects the best candidate, and notifies both parties via push. A Trip State Machine (REQUESTED → MATCHED → PICKED_UP → IN_PROGRESS → COMPLETED) enforces consistency. During the trip, WebSockets stream live driver location to the rider app. The hardest part is not matching — it is keeping 500K concurrent driver locations fresh at scale.

Uber processes roughly 5 million ride requests per day. That is about 58 per second on average — but during peak demand (Friday evening, rain, end of concerts), that number spikes 10x. Every second a rider waits for a match is a second they might open a competitor's app instead. Uber's internal target: match a rider to a driver in under 2 seconds from the moment the request lands.

How do you architect a system that does this reliably? The answer is not a clever algorithm — it is a set of deliberate architectural decisions: geospatial indexing in Redis, geohash-based partitioning, a strict trip state machine, and WebSocket-based location streaming. This post walks through each decision and explains why it was made.

By the end of this walkthrough you will understand why Redis outperforms a relational database for driver location storage, why geohashing beats a k-d tree for geographic partitioning at scale, and why WebSockets are the right protocol for in-trip location streaming rather than polling.

📖 The Matching Problem — Why 5M Rides/Day Is Hard

Use Cases

| Actor | Role |
| --- | --- |
| Rider | Opens app, sets destination, requests a ride |
| Driver | Accepts or declines match; updates location every 5 seconds |
| Matching Service | Finds nearby available drivers and assigns the best candidate |
| Location Service | Ingests driver GPS pings; maintains the live geospatial index |
| Trip Service | Owns the trip state machine; is the source of truth for trip state |
| Notification Service | Pushes match confirmation and status updates to rider/driver apps |
| Payment Service | Charges the rider after trip completion; processes driver payout |

Core user journeys in scope:

  • Rider requests a ride → Matching Service finds nearest available driver → both parties are notified → driver picks up rider → trip ends → payment is processed → ride appears in history.
  • Driver app sends a GPS ping every 5 seconds → Location Service writes to Redis geospatial index → stale entries expire automatically.
  • Rider app streams live driver location on a map during the trip via WebSocket.

Out of Scope (v1 boundary)

  • Surge pricing algorithm internals (supply/demand ratio per geohash cell is noted in the design; the pricing model itself is a separate post).
  • Driver background check and onboarding pipeline.
  • Carpooling / shared ride matching.
  • Driver earnings settlement and tax reporting.

Naming non-goals explicitly is one of the highest-signal moves in an interview. It shows architectural maturity — you are scoping the problem, not avoiding it.

πŸ” Functional Requirements and Design Goals

Non-Functional Requirements

| Dimension | Target | Why it matters |
| --- | --- | --- |
| Matching latency | p99 < 2s from ride request to driver notification | Rider cancels after ~2s without response |
| Location freshness | Driver position no older than 10s in the index | Stale positions cause wrong ETAs and failed matches |
| Availability | 99.99% for trip creation; 99.9% for ride history | A failed trip creation is revenue lost |
| Scalability | 500K concurrent active drivers; 5M rides/day | Location index must handle ~100K writes/sec at peak |
| Consistency | Trip state transitions must be idempotent — no double-booking | Double-matching one driver to two riders destroys trust |

Design Goals (specific and measurable)

  • Driver location index must serve nearest-driver queries in < 50ms for 500K active drivers.
  • Trip state machine must guarantee at-most-once driver assignment — a driver is locked to one rider at a time.
  • Location updates from 500K drivers arriving every 5 seconds must sustain 100K writes/sec without degrading read latency.
  • WebSocket connections must be maintained for all active trips concurrently — peak estimate: ~80K concurrent trips during Friday evening.

βš™οΈ Capacity Estimation β€” From 5M Rides to Concrete QPS Numbers

Sizing before designing is what separates interview answers from guesses.

| Metric | Assumption | Derivation |
| --- | --- | --- |
| Rides per day | 5M | Given |
| Peak rides/sec | ~580 | 5M / 86400 × 10x peak factor |
| Active drivers (peak) | 500K | ~10% of 5M daily riders × driver:rider ratio of ~1:5 |
| Location update frequency | Every 5s per driver | GPS ping interval in driver app |
| Location write QPS | ~100K writes/sec | 500K drivers / 5s |
| Matching requests/sec | ~580/sec | One per ride request (same as peak rides/sec) |
| Driver record size | ~100 bytes | driver_id (16B) + lat/lng (16B) + timestamp (8B) + status (4B) + padding |
| Active driver index size | ~50MB | 500K drivers × 100 bytes |
| Trip record size | ~512 bytes | IDs, status, timestamps, route metadata |
| Trips stored (1 year) | ~0.9TB | 5M/day × 365 × 512B |

Key insight: The location index is tiny (~50MB) but write-heavy (~100K/sec). This points directly at Redis, not PostgreSQL — the index must live entirely in memory, and Redis geospatial commands are purpose-built for this workload.
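
The arithmetic behind the table fits in a few lines of plain Java if you want to sanity-check the derivations yourself (no external dependencies; numbers are the assumptions from the table above):

// Back-of-the-envelope check of the capacity table.
public class CapacityEstimate {
    public static void main(String[] args) {
        long ridesPerDay = 5_000_000L;
        double avgRidesPerSec = ridesPerDay / 86_400.0;           // ~58/sec average
        double peakRidesPerSec = avgRidesPerSec * 10;             // ~580/sec with a 10x peak factor

        long activeDrivers = 500_000L;
        double locationWritesPerSec = activeDrivers / 5.0;        // one GPS ping per driver every 5s -> ~100K/sec

        long driverRecordBytes = 100;
        double indexSizeMB = activeDrivers * driverRecordBytes / 1e6;          // ~50 MB in-memory index

        long tripRecordBytes = 512;
        double tripStorageTBPerYear = ridesPerDay * 365.0 * tripRecordBytes / 1e12;   // ~0.9 TB/year

        System.out.printf("avg rides/sec      ~%.0f%n", avgRidesPerSec);
        System.out.printf("peak rides/sec     ~%.0f%n", peakRidesPerSec);
        System.out.printf("location writes/s  ~%.0f%n", locationWritesPerSec);
        System.out.printf("geo index size     ~%.0f MB%n", indexSizeMB);
        System.out.printf("trip storage/year  ~%.2f TB%n", tripStorageTBPerYear);
    }
}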

📊 High-Level Architecture — How the Core Services Fit Together

The system decomposes into six services with clear ownership boundaries. No service owns two distinct concerns.

flowchart TD
    RA[Rider App] -->|POST /trips| AG[API Gateway]
    DA[Driver App] -->|PUT /location| AG
    AG -->|Route ride request| MS[Matching Service]
    AG -->|Route location update| LS[Location Service]
    MS -->|GEORADIUS query| LS
    MS -->|Lock driver + create trip| TS[Trip Service]
    TS -->|Trigger notification| NS[Notification Service]
    TS -->|Charge on completion| PS[Payment Service]
    LS -->|GEOADD driver position| RD[(Redis Geo Index)]
    TS -->|Trip state writes| DB[(PostgreSQL)]
    NS -->|Push to app| FCM[FCM / APNs]
    MS -->|ETA calculation| ETA[ETA Service]

This diagram shows the write path (driver location update → Redis) and the read path (ride request → Matching Service → Redis GEORADIUS → Trip Service → notification) running through independent service paths. The API Gateway routes the two most frequent request types — location updates and ride requests — to separate services, so a spike in one does not degrade the other.

Service ownership summary:

| Service | Owns | Storage |
| --- | --- | --- |
| Location Service | Driver GPS index; geospatial writes/reads | Redis Geo |
| Matching Service | Nearest-driver query; driver selection logic; driver lock | Calls Location Service + Trip Service |
| Trip Service | Trip state machine; source of truth for all trip state | PostgreSQL |
| Notification Service | Push delivery to rider/driver apps | Kafka + FCM/APNs |
| Payment Service | Post-trip charge; driver payout | PostgreSQL + Stripe |
| ETA Service | Real-time ETA from driver to pickup point | Map routing API |

🧠 Deep Dive: Location Tracking — Geohash and the Redis Geo Index

The location index is the most performance-sensitive component in the system. At 100K writes/second, it must be entirely in memory — a relational database with disk I/O cannot sustain this write rate without horizontal sharding complexity.

Internals: How Redis Geo Stores Driver Positions

Redis Geo is built on a sorted set. Every driver's position is stored as a member whose score is a 52-bit integer derived from a Geohash of the latitude/longitude pair. Because Geohash encodes nearby points with similar integer values, Redis can answer radius queries with sorted-set range operations β€” scanning only the integer range that covers the target cell rather than the entire keyspace.

When a driver sends GEOADD drivers:active -122.4194 37.7749 drv_abc123, Redis:

  1. Encodes (-122.4194, 37.7749) into a 52-bit Geohash integer.
  2. Calls ZADD drivers:active <score> drv_abc123 internally.
  3. Returns in under 1ms (in-memory operation, no disk I/O).

GEORADIUS reverses the process: it converts the center point and radius into a set of contiguous Geohash integer ranges and runs sorted-set range scans over each range, collecting members whose decoded coordinates fall within the actual radius. This is O(N+log(M)) where M is total members and N is the result count β€” fast enough for 500K drivers.
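
Because GEOADD is just a specialised ZADD, the stored 52-bit score is visible through ordinary sorted-set commands. A minimal sketch, assuming a local Redis and the Jedis client (exact package paths vary by client version):

import redis.clients.jedis.Jedis;

// Illustrative: GEOADD stores the member in a plain sorted set, so ZSCORE
// exposes the 52-bit geohash integer Redis derived from the coordinates.
public class GeoScoreDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // GEOADD key longitude latitude member
            jedis.geoadd("drivers:active", -122.4194, 37.7749, "drv_abc123");

            // Same member, read back through a non-geo sorted-set command
            Double score = jedis.zscore("drivers:active", "drv_abc123");
            System.out.println("52-bit geohash score: " + score.longValue());
        }
    }
}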

Performance Analysis: Sustaining 100K Location Writes per Second

| Scenario | Write QPS | p99 write latency | Viable? |
| --- | --- | --- | --- |
| Single Redis node | 100K | < 1ms | ✅ Yes (Redis handles 200K+ simple ops/sec on modern hardware) |
| PostgreSQL PostGIS single node | 100K | 15–40ms | ❌ No — index maintenance under this write rate blocks read queries |
| Redis cluster (geohash-sharded) | 1M+ | < 1ms | ✅ Yes — horizontal scale at geographic boundaries |
A single Redis node comfortably handles the average case (100K writes/sec). The peak spike risk (10x = 1M writes/sec) motivates the geohash-sharded cluster described in the Location Service Bottleneck section below. The key insight: the location index sees few reads during matching (one GEORADIUS per ride request ≈ 580/sec at peak) but heavy writes for location updates (100K/sec). Separating the two workloads is not necessary: Redis executes commands on a single thread with sub-millisecond latency, so GEOADD writes and GEORADIUS reads interleave on the same sorted set without contention, and the combined rate stays within a single node's capacity.

Why Geohash?

A geohash encodes a latitude/longitude pair into a short alphanumeric string. Strings with the same prefix share the same geographic region. This makes range queries — "find all drivers within 1km of this point" — reducible to string prefix lookups rather than two-dimensional range scans.

# Geohash example
San Francisco (37.7749° N, 122.4194° W)  →  geohash: "9q8yy"
1 block north (37.7759° N, 122.4194° W)  →  geohash: "9q8yy"  (same cell)
5km east     (37.7749° N, 122.3694° W)   →  geohash: "9q8yz"  (adjacent cell, same prefix)

Precision 5 (5-char geohash) → ~4.9km × 4.9km cell
Precision 6 (6-char geohash) → ~1.2km × 0.6km cell  ← good for city-block resolution

Redis implements geohashing internally when you store a point with GEOADD. The stored score in the sorted set is the 52-bit integer geohash. Proximity queries (GEORADIUS) use this encoding to scan only the relevant cells instead of the entire keyspace.
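
To make the prefix property concrete, here is a minimal string-geohash encoder in plain Java. It is an illustrative implementation of the standard base32 interleaving scheme, not the binary form Redis computes internally:

// Minimal geohash encoder: even bits refine longitude, odd bits refine latitude,
// five bits per base32 character.
public class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lng, int precision) {
        double latMin = -90, latMax = 90, lngMin = -180, lngMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true;
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lngMin + lngMax) / 2;
                if (lng >= mid) { ch = (ch << 1) | 1; lngMin = mid; } else { ch = ch << 1; lngMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; } else { ch = ch << 1; latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(37.7749, -122.4194, 5));  // "9q8yy" -- downtown San Francisco
        System.out.println(encode(37.7759, -122.4194, 5));  // nearby point, same cell and prefix
    }
}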

Write Path: Driver Location Update

Every 5 seconds, the driver app sends a GPS ping:

PUT /location
{ "driver_id": "drv_abc123", "lat": 37.7749, "lng": -122.4194 }

The Location Service handles this with two Redis operations:

# Store or update driver position in the geospatial index
GEOADD drivers:active -122.4194 37.7749 drv_abc123

# Set an expiry key so stale drivers auto-evict after 30s of silence
SET driver:heartbeat:drv_abc123 1 EX 30

A background job watches for expired heartbeat keys and removes the corresponding driver from drivers:active with ZREM. This prevents stale entries from appearing in match queries.

Why EX 30 and not EX 5? A single missed GPS ping (network hiccup, app backgrounding) should not evict an active driver. 30 seconds = 6 missed pings β€” if a driver hasn't pinged in 30 seconds they have genuinely gone offline.
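
Put together, the write path is a handful of lines. A sketch assuming the Jedis client; the class and method names are illustrative:

import redis.clients.jedis.Jedis;

// Illustrative write-path handler: one GEOADD plus one heartbeat key per GPS ping.
public class LocationService {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void handlePing(String driverId, double lat, double lng) {
        // GEOADD key longitude latitude member -- updates the driver's position in place
        jedis.geoadd("drivers:active", lng, lat, driverId);
        // Heartbeat key: if no ping arrives for 30s, the key expires and the sweeper evicts the driver
        jedis.setex("driver:heartbeat:" + driverId, 30, "1");
    }

    // Called by the background job when a heartbeat key has expired
    public void evict(String driverId) {
        jedis.zrem("drivers:active", driverId);
    }
}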

Read Path: Nearest-Driver Query During Matching

# Find all available drivers within 2km of the rider's pickup point
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10

# Returns: ["drv_abc123", "drv_xyz789", ...]
# ASC = sorted by distance; COUNT 10 = return up to 10 candidates

The Matching Service takes these candidates, filters to status = AVAILABLE (checked in the Trip Service), requests ETAs from the ETA Service, and picks the driver with the shortest ETA to the pickup point — not the geographically nearest one, because road topology matters.
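
A sketch of that candidate-selection step, again assuming the Jedis client (package paths vary by version); EtaClient and isAvailable() stand in for the ETA Service and Trip Service calls:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import redis.clients.jedis.GeoUnit;
import redis.clients.jedis.GeoRadiusResponse;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.GeoRadiusParam;

// Illustrative: radius query in Redis, then ETA-based ranking of the candidates.
public class MatchingService {
    interface EtaClient { int etaSeconds(String driverId, double lat, double lng); }

    private final Jedis jedis;
    private final EtaClient eta;

    MatchingService(Jedis jedis, EtaClient eta) { this.jedis = jedis; this.eta = eta; }

    public Optional<String> pickDriver(double pickupLat, double pickupLng) {
        // GEORADIUS drivers:active <lng> <lat> 2 km ASC COUNT 10
        List<GeoRadiusResponse> candidates = jedis.georadius(
                "drivers:active", pickupLng, pickupLat, 2, GeoUnit.KM,
                GeoRadiusParam.geoRadiusParam().sortAscending().count(10));

        return candidates.stream()
                .map(GeoRadiusResponse::getMemberByString)
                .filter(this::isAvailable)                     // Trip Service status check
                .min(Comparator.comparingInt(d -> eta.etaSeconds(d, pickupLat, pickupLng)));
    }

    private boolean isAvailable(String driverId) { /* call Trip Service */ return true; }
}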

🧠 Deep Dive: Driver Matching — Locking the Driver Atomically

Selecting a driver from the GEORADIUS result is not enough. Two concurrent ride requests can both select the same driver. The system needs an atomic driver lock before assigning the trip.

sequenceDiagram
    participant R as Rider App
    participant MS as Matching Service
    participant LS as Location Service
    participant TS as Trip Service
    participant RD as Redis

    R->>MS: POST /trips (pickup, destination)
    MS->>LS: GEORADIUS pickup 2km ASC COUNT 10
    LS->>RD: GEORADIUS drivers:active
    RD-->>LS: [drv_abc123, drv_xyz789]
    LS-->>MS: candidates list
    MS->>TS: lockDriver(drv_abc123, tripId)
    TS->>RD: SET driver:lock:drv_abc123 tripId NX EX 30
    alt Lock acquired
        TS-->>MS: locked
        MS->>TS: createTrip(tripId, riderId, driverId)
    else Lock already held
        TS-->>MS: conflict
        MS->>TS: lockDriver(drv_xyz789, tripId)
    end

This sequence shows why atomicity matters: SET ... NX EX 30 (set if not exists, expire in 30 seconds) is Redis's atomic compare-and-set. If two Matching Service instances race to lock the same driver, exactly one wins. The loser falls through to the next candidate in the list. The 30-second expiry ensures a crashed Matching Service instance does not hold a lock forever.
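
A sketch of the lock-and-fallback loop, assuming the Jedis client; class and method names are illustrative:

import java.util.List;
import java.util.Optional;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Illustrative: the first candidate whose SET NX EX succeeds wins the lock;
// a crashed Matching Service instance releases its lock automatically after 30s.
public class DriverLocker {
    private final Jedis jedis;

    DriverLocker(Jedis jedis) { this.jedis = jedis; }

    public Optional<String> lockFirstAvailable(List<String> candidates, String tripId) {
        for (String driverId : candidates) {
            // SET driver:lock:{driverId} tripId NX EX 30 -- "OK" only if the key did not exist
            String reply = jedis.set("driver:lock:" + driverId, tripId,
                                     SetParams.setParams().nx().ex(30));
            if ("OK".equals(reply)) {
                return Optional.of(driverId);   // lock acquired; proceed to createTrip()
            }
            // otherwise another instance won the race; fall through to the next candidate
        }
        return Optional.empty();                 // all candidates locked; widen the radius or retry
    }
}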

🧠 Deep Dive: Trip Lifecycle — The State Machine That Keeps Rides Consistent

Every trip passes through a strict set of states. The Trip Service enforces valid transitions — no trip can skip from REQUESTED directly to COMPLETED, and no state transition can run twice (idempotency).

// Trip state machine — only the transitions shown are valid
public enum TripStatus {
    REQUESTED,   // Rider has submitted request; no driver assigned yet
    MATCHED,     // Driver accepted and is en route to pickup
    ARRIVED,     // Driver is at the pickup location
    PICKED_UP,   // Rider is in the vehicle
    IN_PROGRESS, // En route to destination
    COMPLETED,   // Trip ended; awaiting payment
    CANCELLED    // Cancelled by rider, driver, or system timeout
}

// Valid transitions (enforced in TripService.transition())
private static final Map<TripStatus, Set<TripStatus>> VALID_TRANSITIONS = Map.of(
    REQUESTED,   Set.of(MATCHED, CANCELLED),
    MATCHED,     Set.of(ARRIVED, CANCELLED),
    ARRIVED,     Set.of(PICKED_UP, CANCELLED),
    PICKED_UP,   Set.of(IN_PROGRESS),
    IN_PROGRESS, Set.of(COMPLETED),
    COMPLETED,   Set.of(),   // terminal
    CANCELLED,   Set.of()    // terminal
);

Idempotency key: Every state transition request carries a client-generated idempotency_key. The Trip Service stores processed keys in Redis (SET trip:idem:{key} 1 EX 86400). A retry of the same transition returns the same response without re-processing — this prevents double-booking when the Matching Service retries a failed createTrip call.
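
A sketch of the transition guard, reusing the TripStatus enum and VALID_TRANSITIONS map above; the jedis field, the SetParams import, and the elided PostgreSQL write are assumptions:

// Inside TripService, alongside VALID_TRANSITIONS. Assumes a Jedis client field named jedis.
public boolean transition(String tripId, TripStatus from, TripStatus to, String idempotencyKey) {
    // Replay protection: SET ... NX EX 86400 succeeds only the first time this key is seen
    String firstTime = jedis.set("trip:idem:" + idempotencyKey, tripId,
                                 SetParams.setParams().nx().ex(86_400));
    if (firstTime == null) {
        return true;   // retry of an already-processed transition; skip re-processing
    }
    if (!VALID_TRANSITIONS.getOrDefault(from, Set.of()).contains(to)) {
        throw new IllegalStateException(from + " -> " + to + " is not a valid transition");
    }
    // ... persist the new status to PostgreSQL and refresh trip:status:{tripId} in Redis ...
    return true;
}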

State transition table:

| From | To | Trigger | Who calls it |
| --- | --- | --- | --- |
| REQUESTED | MATCHED | Driver accepts | Driver App → Trip Service |
| MATCHED | ARRIVED | Driver marks arrival | Driver App → Trip Service |
| ARRIVED | PICKED_UP | Driver starts trip | Driver App → Trip Service |
| PICKED_UP | IN_PROGRESS | Trip started | Driver App → Trip Service |
| IN_PROGRESS | COMPLETED | Driver ends trip | Driver App → Trip Service |
| REQUESTED / MATCHED / ARRIVED | CANCELLED | Rider/driver cancels or timeout | Rider/Driver App or Scheduler |

🧠 Deep Dive: Real-Time Location Sharing During the Trip

Once a trip is MATCHED, the rider app needs a live map showing the driver's position. Polling every 3 seconds would work but creates unnecessary HTTP overhead: every poll carries a full request/response cycle, plus a fresh TCP and TLS handshake whenever the connection is not reused. WebSockets keep the connection open so the server can push updates as they arrive.

// WebSocket pseudocode — server-side location push handler

onDriverLocationUpdate(driverId, lat, lng):
    tripId = getTripIdForDriver(driverId)          // O(1) Redis lookup
    if tripId is not null and trip.status in [MATCHED, IN_PROGRESS]:
        riderId = getRiderForTrip(tripId)           // O(1) Redis lookup
        wsConn = activeConnections.get(riderId)     // in-memory connection registry
        if wsConn is open:
            wsConn.send({ lat, lng, timestamp })    // push immediately; no polling

// Rider app initiates connection when trip enters MATCHED state
// Connection is closed when trip reaches COMPLETED or CANCELLED

sequenceDiagram
    participant DA as Driver App
    participant LS as Location Service
    participant WS as WebSocket Gateway
    participant RA as Rider App

    DA->>LS: PUT /location (lat, lng) every 5s
    LS->>LS: GEOADD drivers:active
    LS->>WS: publishLocationUpdate(driverId, lat, lng)
    WS->>WS: look up riderId for active trip
    WS-->>RA: push { lat, lng } over WebSocket

The WebSocket Gateway maintains a mapping from tripId → riderWebSocketConnection in memory. When a location update arrives, it looks up the rider's connection and pushes the coordinates. The lookup is O(1). For 80K concurrent active trips, the in-memory table is tiny (~10MB) and the push is sub-millisecond.

Why not Server-Sent Events (SSE)? SSE is unidirectional (server → client only) which suits this use case, but WebSockets are already required for the driver app to send confirmations back. Unifying on WebSockets avoids operating two connection protocols.
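
A sketch of that in-memory registry; the Connection interface stands in for whatever session type the WebSocket framework provides, so the class is illustrative rather than tied to a specific gateway library:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative connection registry for the WebSocket Gateway: tripId -> riderId -> open connection.
public class LocationPushGateway {
    interface Connection { boolean isOpen(); void send(String json); }

    private final Map<String, Connection> connectionsByRider = new ConcurrentHashMap<>();
    private final Map<String, String> riderByTrip = new ConcurrentHashMap<>();

    // Rider app connects when the trip enters MATCHED; entries are removed on COMPLETED/CANCELLED
    public void register(String tripId, String riderId, Connection conn) {
        riderByTrip.put(tripId, riderId);
        connectionsByRider.put(riderId, conn);
    }

    // Called for every driver location update that belongs to an active trip
    public void onDriverLocation(String tripId, double lat, double lng) {
        String riderId = riderByTrip.get(tripId);                        // O(1) lookup
        Connection conn = (riderId == null) ? null : connectionsByRider.get(riderId);
        if (conn != null && conn.isOpen()) {
            conn.send(String.format("{\"lat\":%.6f,\"lng\":%.6f,\"ts\":%d}",
                                    lat, lng, System.currentTimeMillis()));
        }
    }
}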

🌍 Real-World Applications: Uber, Lyft, and DiDi at Scale

The same core patterns — Redis geospatial index, geohash partitioning, trip state machines — appear across all major ride-sharing platforms, each adapted to their specific operational constraints.

| Platform | Location Index | Matching Approach | Notable Design Decision |
| --- | --- | --- | --- |
| Uber | Redis Geo with geohash-sharded clusters per city | ETA-based matching (not pure distance) | H3 hexagonal grid (open-sourced) for surge pricing zone calculations |
| Lyft | Redis Geo + PostgreSQL PostGIS for cold path | Supply/demand heat map per geohash cell | Separate services for airport queue matching vs. standard matching |
| DiDi | Distributed Redis cluster across regions | AI-assisted predictive positioning | Pre-positions drivers near predicted demand zones 5–10 minutes ahead |

Real-world incident: Uber's geohash boundary bug. In early deployments, riders near the edge of a geohash cell (e.g., exactly on the boundary between 9q8yy and 9q8yz cells) were sometimes unmatched because a GEORADIUS query on one shard did not include drivers in the adjacent cell. The fix: Matching Service queries the target cell and all 8 adjacent geohash cells for riders within 200m of a cell boundary. This "boundary fan-out" adds at most one extra Redis query for ~15% of ride requests.

Uber H3 (Hexagonal Hierarchical Spatial Index): Uber open-sourced their H3 library specifically because traditional rectangular geohash cells have a distance distortion problem — corner-to-corner distance is 41% longer than center-to-center distance within the same cell. H3 hexagons are more uniform. In practice, H3 is used for surge pricing zone calculations; the matching layer still uses Redis Geo's built-in Geohash because the GEORADIUS command handles the boundary overlap automatically.

Input:  Rider at (37.7749, -122.4194) requests a ride
Process:
  1. GEORADIUS on primary shard (9q8) → 8 candidate drivers
  2. Boundary check: rider is 180m from 9q8/9q9 cell boundary
  3. Fan-out: GEORADIUS on adjacent shard (9q9) → 2 additional candidates
  4. Merge candidates, calculate ETAs, select driver with min ETA
Output: drv_abc123 matched (0.4km away, ETA 2 min)

πŸ—ƒοΈ Data Model β€” Tables, Indexes, and Sharding Keys

The relational database owns durable, auditable records. The Redis layer owns ephemeral, high-throughput state.

PostgreSQL Schema

CREATE TABLE trips (
    trip_id         UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    rider_id        UUID         NOT NULL,
    driver_id       UUID,        -- NULL until MATCHED
    status          TEXT         NOT NULL DEFAULT 'REQUESTED',
    pickup_lat      DECIMAL(9,6) NOT NULL,
    pickup_lng      DECIMAL(9,6) NOT NULL,
    dest_lat        DECIMAL(9,6) NOT NULL,
    dest_lng        DECIMAL(9,6) NOT NULL,
    requested_at    TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    matched_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    fare_cents      INT,         -- final fare in cents; NULL until COMPLETED
    idempotency_key TEXT         UNIQUE
);
-- Shard key: rider_id — keeps all trips for a rider on one shard for history queries
CREATE INDEX idx_trips_rider     ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver    ON trips (driver_id, requested_at DESC);
CREATE INDEX idx_trips_status    ON trips (status) WHERE status NOT IN ('COMPLETED','CANCELLED');

CREATE TABLE drivers (
    driver_id       UUID         PRIMARY KEY,
    name            TEXT         NOT NULL,
    phone           TEXT         NOT NULL,
    vehicle_id      UUID         NOT NULL,
    rating          DECIMAL(3,2) DEFAULT 5.00,
    status          TEXT         NOT NULL DEFAULT 'OFFLINE'  -- OFFLINE | AVAILABLE | ON_TRIP
);

CREATE TABLE riders (
    rider_id        UUID         PRIMARY KEY,
    name            TEXT         NOT NULL,
    phone           TEXT         NOT NULL,
    email           TEXT         UNIQUE NOT NULL,
    default_payment TEXT         -- payment method token
);

CREATE TABLE location_snapshots (
    snapshot_id     BIGSERIAL    PRIMARY KEY,
    driver_id       UUID         NOT NULL,
    trip_id         UUID,        -- NULL for idle pings
    lat             DECIMAL(9,6) NOT NULL,
    lng             DECIMAL(9,6) NOT NULL,
    recorded_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
-- Partitioned by recorded_at (monthly); retained for 90 days for dispute resolution
CREATE INDEX idx_location_driver ON location_snapshots (driver_id, recorded_at DESC);

Redis Key Layout

| Key Pattern | Value | TTL | Purpose |
| --- | --- | --- | --- |
| drivers:active | Geo sorted set (all active drivers) | No TTL (ZREM on eviction) | GEORADIUS queries |
| driver:heartbeat:{driverId} | "1" | 30s | Auto-evicts stale drivers |
| driver:lock:{driverId} | tripId | 30s | Atomic driver assignment lock |
| trip:idem:{idempotencyKey} | tripId | 24h | Idempotent trip creation |
| trip:status:{tripId} | "MATCHED" etc. | 2h | Fast status reads for WebSocket gateway |

βš–οΈ Trade-offs, Failure Modes, and Scaling Decisions

The Location Service Bottleneck

At 100K writes/second, a single Redis instance is the obvious bottleneck. Redis can handle ~200K-400K simple SET operations per second on modern hardware, so a single node can survive normal load — but not a 10x peak spike.

Mitigation: Geohash-based partitioning. Shard the drivers:active key by geohash prefix. Cities map naturally to geohash cells at precision 3–4 (roughly 156km × 156km down to 39km × 20km cells). Each shard owns a geographic region:

| Shard | Geohash prefix | Region |
| --- | --- | --- |
| redis-shard-0 | 9q8 | San Francisco |
| redis-shard-1 | dr5 | New York City |
| redis-shard-2 | gcp | London |

A GEORADIUS query only touches the shard(s) for the relevant geographic cell(s), and writes are naturally distributed across shards by city. Cross-cell queries (a rider near the boundary of two geohash cells) require queries to adjacent shards — the Matching Service handles this with a two-shard fan-out for border cases.
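
A sketch of the shard routing, reusing the Geohash.encode() helper sketched earlier; the shard hostnames and prefixes mirror the example table above and are illustrative:

import java.util.Map;
import redis.clients.jedis.Jedis;

// Illustrative router: the first 3 geohash characters pick the Redis shard that owns the city.
public class GeoShardRouter {
    private final Map<String, Jedis> shardsByPrefix = Map.of(
        "9q8", new Jedis("redis-shard-0", 6379),   // San Francisco
        "dr5", new Jedis("redis-shard-1", 6379),   // New York City
        "gcp", new Jedis("redis-shard-2", 6379));  // London

    public Jedis shardFor(double lat, double lng) {
        String prefix = Geohash.encode(lat, lng, 3);
        Jedis shard = shardsByPrefix.get(prefix);
        if (shard == null) throw new IllegalArgumentException("no shard owns prefix " + prefix);
        return shard;
    }

    public void writeDriverPosition(String driverId, double lat, double lng) {
        shardFor(lat, lng).geoadd("drivers:active", lng, lat, driverId);
    }
}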

Trade-off Summary

| Decision | Chosen | Alternative | Why |
| --- | --- | --- | --- |
| Driver location storage | Redis Geo | PostgreSQL PostGIS | PostGIS can't sustain 100K writes/sec without heavy connection pooling and sharding; Redis keeps the index in memory |
| Geographic indexing | Geohash (Redis native) | k-d tree | k-d trees require re-balancing under concurrent inserts — operationally expensive at 500K concurrent writers; geohash partitioning is stateless and trivially shardable |
| Matching notification | Push (FCM/APNs) | SMS | Sub-second push latency; SMS has 1-10s delivery variance |
| In-trip location updates | WebSocket | HTTP polling | Polling at 5s interval = 16K+ HTTP requests/sec for 80K active trips; WebSocket collapses this to one connection per trip |
| Trip state storage | PostgreSQL | Redis | Trip history is auditable, financial-grade data — it must survive cache eviction |

Surge Pricing — A Brief Note

The Location Service continuously maintains a supply/demand ratio per geohash cell: active_drivers / open_requests in each cell. When this ratio drops below a threshold (e.g., 0.5 available drivers per pending request), the Pricing Service multiplies the base fare. The cell granularity (precision 5, ~5km cells) is coarser than matching (precision 6) so surge zones cover meaningful geographic areas rather than single blocks.
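
A minimal sketch of that check; the threshold is the example value from the paragraph above, and the actual multiplier model stays out of scope:

// Illustrative supply/demand check for one precision-5 geohash cell.
public class SurgeSignal {
    private static final double SUPPLY_THRESHOLD = 0.5;   // available drivers per pending request

    // True when the cell should be flagged to the Pricing Service for surge pricing
    public boolean isSurging(int activeDrivers, int openRequests) {
        if (openRequests == 0) return false;               // no demand, no surge
        double supplyRatio = (double) activeDrivers / openRequests;
        return supplyRatio < SUPPLY_THRESHOLD;
    }
}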

🧭 Architecture Decision Guide — Choosing the Right Tool for Each Concern

| Situation | Recommendation |
| --- | --- |
| High-frequency ephemeral updates (location, session state) | Redis — in-memory, sub-millisecond latency, TTL-based auto-eviction |
| Durable financial or auditable records (trips, payments) | PostgreSQL — ACID, point-in-time recovery, audit log |
| Geographic radius queries at 100K+ writes/sec | Redis GEORADIUS — native geospatial, sorted-set backed, no separate spatial index |
| Complex polygon queries or persistent geo data (routing, boundaries) | PostGIS — full spatial SQL, persistent, ACID |
| Real-time server→client data push (live map, status updates) | WebSockets — bidirectional, persistent connection, eliminates polling overhead |
| Occasional status checks (ride history, profile) | REST HTTP — stateless, cacheable, operationally simpler than WebSockets |
| Driver assignment under concurrent Matching Service instances | Redis SET NX EX (atomic lock) — compare-and-set in one round-trip, no distributed lock manager needed |
| Scale beyond single Redis node for location index | Geohash-based sharding (by city/region) — stateless partitioning, no cross-shard coordination for radius queries within a city |

When NOT to use Redis for primary data:

  • Any data that must survive a Redis eviction or crash without a separate durable write path.
  • Trip state, payment records, and driver earnings — these must be in PostgreSQL even if cached in Redis.
  • Data requiring complex relational queries (joins, aggregations over months of data) — Redis is a cache and index, not a query engine.

🧪 Interview Walkthrough — How to Deliver This Design in 45 Minutes

A repeatable framework for presenting this design under interview time pressure:

Minute 0–5: Scope and requirements

  • State the scale: 5M rides/day, 500K active drivers, location updates every 5 seconds.
  • Identify the single hardest constraint: "keeping 500K driver positions fresh at 100K writes/sec while serving sub-50ms radius queries."
  • Name non-goals explicitly: surge pricing internals, driver onboarding, carpooling.

Minute 5–10: Capacity estimation (3 numbers that matter)

  • 100K location writes/sec → need in-memory storage.
  • ~580 ride requests/sec at peak → matching is fast but needs atomic driver locking.
  • ~80K concurrent WebSocket connections during Friday evening peak → manageable for a mid-size gateway cluster.

Minute 10–20: Draw the HLD

  • Six services: Location, Matching, Trip, Notification, Payment, ETA.
  • Two write paths: (1) driver GPS → Location Service → Redis, (2) ride request → Matching Service → Trip Service → DB.
  • One read path: Matching Service → Redis GEORADIUS → driver candidates.

Minute 20–35: Walk through three critical decisions

  1. Why Redis Geo? GEORADIUS on 500K drivers in < 5ms; PostgreSQL PostGIS cannot sustain 100K writes/sec without blocking reads.
  2. Why geohash partitioning? Stateless sharding by city prefix; k-d trees need re-balancing under concurrent inserts.
  3. Why WebSockets for in-trip tracking? 80K concurrent trips × 5s polling = 16K HTTP requests/sec just for location; WebSockets collapse this to 80K persistent connections.

Minute 35–40: Trip state machine and consistency

  • Draw the state diagram: REQUESTED → MATCHED → ARRIVED → PICKED_UP → IN_PROGRESS → COMPLETED.
  • Emphasise: each transition is idempotent (idempotency key in Redis), and driver assignment uses SET NX EX to prevent double-booking.

Minute 40–45: Scaling evolution

  • Phase 1: Single Redis node per region. Handles average load.
  • Phase 2: Geohash-sharded Redis cluster. Handles 10x peak.
  • Phase 3: Separate WebSocket Gateway cluster with horizontal scaling. Handles 500K+ concurrent connections.

A strong closing line: "The core architectural principle is separation of concerns between ephemeral, high-frequency state (Redis) and durable, auditable state (PostgreSQL) — every component decision flows from that split."

πŸ› οΈ Redis Geo: How It Powers Driver Location Matching

Redis Geo is a set of commands layered on top of Redis Sorted Sets. Every member's score is a 52-bit Geohash integer derived from the latitude/longitude pair. This lets Redis answer proximity queries using sorted-set range operations internally — no separate spatial index is needed.

Core commands:

# Add or update a driver's position
GEOADD drivers:active -122.4194 37.7749 drv_abc123
# Syntax: GEOADD key longitude latitude member
# Returns: 1 (added) or 0 (updated existing)

# Find drivers within 2km of a rider's pickup point, sorted by distance
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10 WITHDIST
# Returns:
# 1) 1) "drv_abc123"
#    2) "0.1832"       <- distance in km
# 2) 1) "drv_xyz789"
#    2) "1.4210"

# Get stored position of a specific driver
GEOPOS drivers:active drv_abc123
# Returns: [[-122.41940081119537, 37.774898917926234]]

# Get distance between two drivers
GEODIST drivers:active drv_abc123 drv_xyz789 km
# Returns: "1.2748"

# Remove a driver from the index (goes offline or heartbeat expired)
ZREM drivers:active drv_abc123

Why Redis Geo over a purpose-built spatial database (PostGIS, Elasticsearch Geo)?

| Criterion | Redis Geo | PostGIS | Elasticsearch Geo |
| --- | --- | --- | --- |
| Write latency | < 1ms (in-memory) | 5–20ms (disk) | 5–15ms (index overhead) |
| Read latency (GEORADIUS) | 1–5ms | 10–50ms | 10–30ms |
| Throughput | 200K+ ops/sec/node | ~10K writes/sec | ~20K writes/sec |
| Persistence | Optional RDB/AOF | Full ACID | Full with replicas |
| Best for | Ephemeral, high-freq updates | Persistent geo data, complex queries | Full-text + geo hybrid |

For driver location — ephemeral, high-frequency writes paired with simple radius reads — Redis Geo is the correct tool. For permanent route geometry or complex polygon queries, PostGIS is the right choice.

For a deeper dive on Redis commands and data structures, see the companion post System Design Caching and Asynchronism.

📚 Lessons Learned

  • The location index is tiny but write-intensive. 500K drivers × 100 bytes = ~50MB. The challenge is not storage — it is sustaining 100K writes/second without degrading read latency. Memory-first storage (Redis) is the only viable choice at this write rate without complex multi-node PostgreSQL sharding.
  • Geohash partitioning is operationally simpler than k-d trees. K-d trees are theoretically faster for nearest-neighbor queries, but they require re-balancing during concurrent inserts. Geohash cells are stateless — each shard owns a fixed geographic prefix, and no shard needs to know about its neighbours.
  • Driver locking must be atomic. The single biggest consistency risk in the matching flow is two Matching Service instances selecting the same driver. SET ... NX EX in Redis gives you an atomic compare-and-set that costs one network round-trip. Without this, double-booking is a race condition that is very hard to reproduce in testing and very visible to users in production.
  • Trip state machines prevent invalid state transitions. Without an enforced state machine, a driver can accidentally send a COMPLETED event before PICKED_UP, or a retry can double-process a MATCHED event. The state machine is the contract between every service that touches a trip.
  • WebSockets win over polling for in-trip updates at scale. At 80K concurrent trips with a 5-second update interval, HTTP polling would generate 16K+ requests per second purely for location updates. WebSockets compress that to 80K persistent connections — which modern gateway infrastructure handles trivially.
  • Heartbeat expiry is how you handle driver disconnections gracefully. If a driver's app crashes, the GPS pings stop. A 30-second heartbeat key expiry removes the driver from the active index without requiring an explicit OFFLINE event — the absence of pings is itself the signal.

📌 TLDR: Summary & Key Takeaways

  • Redis Geo (GEOADD / GEORADIUS) is the correct storage layer for the driver location index: in-memory, sub-millisecond write latency, and native radius queries. PostgreSQL cannot sustain 100K location writes/sec without aggressive sharding.
  • Geohash partitioning allows the location index to scale horizontally by geographic region. Each shard owns a geohash prefix cell; GEORADIUS queries only touch the relevant shard(s).
  • Atomic driver locking (SET driver:lock:{id} tripId NX EX 30) prevents double-booking when multiple Matching Service instances race to assign the same driver.
  • The Trip State Machine (REQUESTED → MATCHED → ARRIVED → PICKED_UP → IN_PROGRESS → COMPLETED) enforces valid transitions and, combined with idempotency keys, makes every state transition safe to retry.
  • WebSockets are the right protocol for in-trip location streaming: one persistent connection per rider replaces thousands of polling requests per second at peak load.
  • Heartbeat expiry (30-second TTL on driver:heartbeat:{id}) auto-evicts stale drivers from the index without requiring an explicit offline signal.
  • The hardest engineering problem in ride-sharing is not the matching algorithm — it is keeping half a million driver positions fresh in memory at 100K writes/second while serving sub-50ms GEORADIUS queries concurrently.

πŸ“ Practice Quiz

  1. Why is Redis preferred over a PostgreSQL PostGIS index for storing active driver locations?

    • A) Because PostgreSQL does not support geospatial data types
    • B) Because Redis keeps the entire index in memory, sustaining 100K+ writes/sec with sub-millisecond latency — PostGIS requires disk I/O and cannot match this write throughput without heavy sharding
    • C) Because Redis has better replication support than PostgreSQL
    Correct Answer: B
  2. Why is geohashing used for driver location partitioning instead of a k-d tree?

    • A) Because geohash strings are easier to display in a UI
    • B) Because k-d trees are not supported in distributed systems
    • C) Because k-d trees require re-balancing during concurrent inserts, making them operationally expensive at 500K concurrent writers; geohash cells are stateless and each shard owns a fixed geographic prefix with no coordination needed
    Correct Answer: C
  3. Why are WebSockets used for in-trip location streaming instead of HTTP long-polling or regular polling?

    • A) WebSockets have a lower per-message overhead once the connection is established; 80K active trips polling every 5 seconds would generate 16K+ HTTP requests/second — WebSockets reduce this to 80K persistent connections with server-initiated pushes
    • B) Because polling is not supported by mobile operating systems
    • C) Because WebSockets automatically re-connect after network failures
    Correct Answer: A
  4. What is the purpose of SET driver:lock:{driverId} tripId NX EX 30 in the driver matching flow?

    • A) To cache the driver's profile so the Matching Service can read it without hitting PostgreSQL
    • B) To atomically lock a driver to a single trip — the NX flag ensures only the first Matching Service instance to execute this command succeeds, preventing two concurrent requests from double-booking the same driver
    • C) To store the trip's state in Redis so the Trip Service can read it faster
    Correct Answer: B
  5. Open-ended: A rider in San Francisco requests a ride. The GEORADIUS query returns 3 candidate drivers at distances of 0.4km, 0.9km, and 1.7km. The Matching Service picks the closest one (0.4km), but the lock attempt fails because another ride request just locked that driver. Walk through exactly what happens next — lock fallback, ETA recalculation, trip creation, and driver notification. What happens if all 3 candidates are locked?

Written by Abstract Algorithms (@abstractalgorithms)