System Design HLD Example: Ride-Sharing (Uber/Lyft)
A practical interview-ready HLD for a ride-sharing platform with real-time matching and location tracking.
Abstract AlgorithmsIntermediate
For developers with some experience. Builds on fundamentals.
Estimated read time: 15 min
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: A ride-sharing platform is a high-velocity geospatial matching engine. Drivers stream GPS coordinates every 5 seconds into a Redis Geospatial Index. When a rider requests a trip, the Matching Service executes a
GEORADIUSquery to find the 10 closest available drivers. A distributed Trip State Machine ensures transactional consistency (no double-matching), while WebSockets provide the real-time "car on map" experience that users expect.
π The New Year's Eve Meltdown
Imagine itβs New Year's Eve in New York City. At 12:05 AM, one million people open their ride-sharing apps simultaneously. In the background, 50,000 drivers are moving through the city, sending GPS updates every few seconds.
Your matching algorithm is working perfectly, but your database is screaming. Every ride request triggers a massive SQL query: SELECT * FROM drivers WHERE ST_Distance(location, rider_location) < 2000. With 100,000 concurrent requests, the database's CPU hits 100%, lock contention skyrockets, and the "Nearest Driver" query that usually takes 10ms now takes 30 seconds. By the time a driver is "matched," theyβve already moved three blocks away or accepted a ride from a competitor.
This is the Geospatial Hotspot problem. In ride-sharing, the challenge isn't just matchingβit's maintaining a high-velocity, in-memory index of moving objects that can be queried hundreds of times per second without falling behind. At the scale of Uber or Lyft, you aren't just building a database; you are building a real-time reflection of the physical world in silicon.
π Ride-Sharing: Use Cases & Requirements
Actors & Journeys
- Rider: Requests a ride, views ETAs, tracks the driver in real-time, and pays for the trip.
- Driver: Updates their availability (Online/Offline), broadcasts their location, and accepts/rejects trip requests.
- System Admin: Monitors surge pricing, platform health, and fraud detection.
In/Out Scope
- In-Scope: Real-time driver tracking, geospatial matching, trip lifecycle management, and live position streaming.
- Out-of-Scope: Advanced driver background checks (manual workflows), automated vehicle maintenance scheduling, and complex airport queueing logic.
Functional Requirements
- Real-time Location Stream: Drivers must broadcast GPS coordinates every 5 seconds.
- Geospatial Matching: Efficiently find available drivers within a specific radius (e.g., 2km) of a rider.
- Trip Management: Manage the full lifecycle of a ride (Request $\rightarrow$ Match $\rightarrow$ Pickup $\rightarrow$ Dropoff).
- Live Tracking: Stream the driver's current position to the rider's map with low latency.
Non-Functional Requirements (NFRs)
- Ultra-Low Matching Latency: A rider should see a "Driver Found" confirmation in $< 2$ seconds.
- High Write Availability: The system must handle 100k+ GPS updates per second without dropping pings.
- Strict Consistency for Matching: One driver must never be matched to two riders simultaneously (Double Booking prevention).
- Scalability: The architecture must handle city-wide surges (e.g., New Year's Eve) without cascading failures.
π Foundations: How Geospatial Matching Works
At its core, a ride-sharing app is a high-frequency matching market.
The baseline architecture relies on three concepts:
- Geospatial Indexing: Traditional B-Tree indexes cannot efficiently handle 2D range queries. We use Geohashing or Quadtrees to convert 2D coordinates into a 1D searchable space.
- The Ping-Pong Protocol: Drivers "ping" the server with their location; the server "pongs" back with current demand or nearby rider requests.
- The Buffer Zone: We don't just match the "closest" driver; we match the "best" driver, which might consider ETA, driver rating, or heading.
Without these foundations, the system would require constant full-table scans of every driver in the world for every rider requestβa task that is computationally impossible at scale.
βοΈ The Mechanics of Geohashing
The mechanism of spatial search is powered by Geohashing.
- Geohash Encoding: A Geohash is a hierarchical spatial data structure that subdivides the world into a grid of cells. Each cell is assigned a unique alphanumeric string.
- The Prefix Property: Two points that are close together in the physical world will usually have Geohashes with a common prefix. For example,
9q8yyand9q8yzare both in San Francisco. - Radius Search: By finding the rider's Geohash and its 8 neighboring cells, we can perform a simple range query in a Sorted Set (like Redis) to find all drivers in the vicinity. This turns a $O(N^2)$ geometry problem into a $O(\log N)$ string comparison problem.
π Estimations & Design Goals
The Math of Real-Time Geospatial
- Total Drivers: 1,000,000 globally.
- Active Drivers (Peak): 500,000.
- Location Update Frequency: Every 5 seconds.
- Location Write QPS: $500,000 / 5 = \mathbf{100,000 \text{ writes/sec}}$.
- Ride Request QPS (Peak): $\approx \mathbf{500 \text{ requests/sec}}$.
- In-Memory Index Size: 500k drivers $\times$ 100 bytes (ID + Lat + Lng + Metadata) $\approx$ 50 MB. This is tiny enough for a single Redis node but needs sharding for QPS.
Design Goals
- In-Memory Indexing: Use Redis Sorted Sets for $O(\log N)$ spatial search.
- Decoupled Write Path: GPS updates are buffered through a high-throughput Location Service.
- Double-Booking Prevention: Use distributed locking for atomic matching.
π High-Level Design: The Real-Time Matching Flow
The architecture separates the high-frequency location stream from the transactional trip management logic.
graph TD
Driver -->|GPS 5s| LS[Location Service]
LS -->|GEOADD| Redis[(Redis Geo Index)]
Rider -->|Request Ride| MS[Matching Service]
MS -->|GEORADIUS| Redis
MS -->|Lock Driver| TS[Trip Service]
TS --> DB[(Trip DB: Postgres)]
TS -->|Match Event| NS[Notification Service]
NS -->|Push| Driver
NS -->|Push| Rider
LS -->|Stream| WS[WebSocket Gateway]
WS -->|Live Update| Rider
The diagram separates two fundamentally different traffic patterns. The Location Service receives 100,000 GPS writes per second from drivers and writes each update into a Redis Geospatial index using
GEOADD. The Matching Service handles approximately 500 ride requests per second β three orders of magnitude fewer β performing aGEORADIUSquery to find candidates, then acquiring a distributed lock before creating the trip record in Postgres. The WebSocket Gateway streams driver position updates to riders watching the live map, sourced from the same Location Service event stream via Redis Pub/Sub.
π§ Deep Dive: The Trip State Machine and Double-Booking Prevention
The most critical correctness requirement in ride-sharing is that a single driver is never simultaneously matched to two riders. Unlike an e-commerce inventory system where overselling is recoverable with a refund, a double-matched driver produces an irrecoverable user experience failure: one rider waits indefinitely for a driver who has already accepted a different trip.
The solution combines a distributed Trip State Machine with a Distributed Lock Manager (implemented using Redis Redisson or Zookeeper). When the Matching Service selects a driver candidate from the GEORADIUS result list, it atomically attempts to acquire a lock on driver_lock:{driver_id} with a 15-second TTL. If the lock is already held β meaning another Matching Service instance is simultaneously processing a match for the same driver β the attempt fails fast and the service moves to the next candidate. Only after acquiring the lock does the service write the trip record to Postgres and publish the match notification.
Internals: How Redis GEORADIUS Encodes Geospatial Data as Sorted Set Scores
Redis Geospatial commands are built on top of Sorted Sets. When a driver's location is stored with GEOADD, Redis internally encodes the latitude and longitude into a single 52-bit integer using the Geohash algorithm. This integer becomes the score of a Sorted Set entry where the member is the driver ID. The 52-bit score provides spatial resolution of approximately 0.6 meters at the equator β more than sufficient for ride-sharing matching purposes.
A GEORADIUS query is internally translated into a Sorted Set score range query. Redis computes the Geohash scores that correspond to the target radius, finds all scores in that range using a standard Sorted Set range scan (O(log N + M) where M is the result count), and then applies an exact distance filter to remove members whose actual distance exceeds the requested radius. This is architecturally identical to the two-phase Geohash + Haversine approach used in the Proximity Service, but implemented entirely within the Redis engine.
The Trip Lifecycle State Machine
Every ride progresses through a strictly ordered sequence of states. No state transition can be skipped. Each transition requires an atomic write to the Trip table in Postgres, ensuring the system has a durable audit trail even if the driver app crashes mid-trip.
graph TD
A[REQUESTED] --> B[MATCHED]
B --> C[DRIVER_EN_ROUTE]
C --> D[RIDER_PICKED_UP]
D --> E[TRIP_IN_PROGRESS]
E --> F[COMPLETED]
B --> G[CANCELLED]
C --> G
D --> G
The distributed lock is held only during the REQUESTED β MATCHED transition window (at most 15 seconds). Once the trip reaches MATCHED state and the driver receives the push notification, the lock is released. All subsequent state transitions β en route, pickup, dropoff β are driven by driver app events and do not require distributed locking, since the driver is exclusively bound to the trip record after the MATCHED state is written.
Driver and Trip Data Models
Driver Table (Postgres)
| Column | Type | Notes |
driver_id | UUID | Primary key |
status | ENUM | OFFLINE, AVAILABLE, ON_TRIP |
vehicle_type | VARCHAR | UberX, UberXL, Black, etc. |
rating | DECIMAL(3,2) | Rolling average from completed trip reviews |
city_id | VARCHAR | Used as Redis Geo Index sharding key |
Trip Table (Postgres)
| Column | Type | Notes |
trip_id | UUID | Primary key |
driver_id | UUID | FK to drivers; indexed for driver history |
rider_id | UUID | FK to riders; indexed for rider history |
status | ENUM | Current state in the state machine |
pickup_lat | DECIMAL(10,7) | Rider-requested pickup coordinates |
pickup_lng | DECIMAL(10,7) | Rider-requested pickup coordinates |
requested_at | TIMESTAMP | When the rider submitted the request |
completed_at | TIMESTAMP | NULL until trip reaches COMPLETED |
fare_cents | INTEGER | Final fare calculated at trip completion |
Performance Analysis: Write Throughput and Matching Latency at Peak Load
The system handles two fundamentally different traffic patterns at widely separated QPS levels. The Location Service absorbs 100,000 GPS writes per second from 500,000 active drivers pinging every 5 seconds. Redis GEOADD is an O(log N) operation on the Sorted Set, taking approximately 0.1 ms per write. At 100,000 writes/sec, a single Redis node at 100 K ops/sec capacity sits at its practical ceiling β hence the city-based sharding requirement. Each city shard handles a fraction of the global driver population, typically 1,000β50,000 active drivers, running each node at 10β20% capacity.
The Matching Service handles approximately 500 trip requests per second. Each match requires one GEORADIUS query (O(log N + M), approximately 1β2 ms) plus one distributed lock acquisition (approximately 1β3 ms round trip to Redis) plus one Postgres trip record write (approximately 5β10 ms with replication). Total matching latency is approximately 10β15 ms per request, well within the 2-second NFR. The matching latency is dominated by the Postgres write, making database write performance the primary scaling concern for the matching path.
Redis Geospatial Index Schema
| Key Pattern | Redis Data Structure | Notes |
drivers:active:{city_id} | Sorted Set (Geo) | Available drivers encoded by coordinates |
driver:status:{driver_id} | String | AVAILABLE or ON_TRIP; TTL 60 seconds |
driver_lock:{driver_id} | String (Distributed Lock) | Held during match window; TTL 15 seconds |
trip:ws:{rider_id} | Pub/Sub Channel | Driver location published here; rider subscribes |
π How Uber, Lyft, and DiDi Run This Architecture at Global Scale
The architecture described here maps closely to what major ride-sharing platforms publish from their engineering organizations.
Uber processes approximately 15 million trips per day globally. Their engineering blog describes a Location Service called DISCO (Driver Indexing Service for City Operations), which is architecturally equivalent to the Redis Geospatial layer described in this HLD. Each city is an independent Redis shard, isolating a New Year's Eve surge in New York from affecting throughput in London. Uber's matching algorithm, nicknamed "Dispatch," runs as a separate microservice that calls the equivalent of GEORADIUS on the city's driver index and applies a scoring function that considers driver heading, ETA estimate, and driver acceptance rate β not just raw distance.
Lyft uses a Kafka-based event stream as the backbone for driver location updates, feeding into an in-memory matching engine. Their published insight is that matching doesn't need to be instantaneous to satisfy user expectations β a 1β2 second matching window is perceptually instantaneous to the rider and allows the system to evaluate a richer candidate set rather than greedily matching the nearest driver regardless of heading or traffic.
DiDi operates at a scale that exceeds Uber and Lyft combined, processing over 10 million daily rides in a single country. Their published research describes a hierarchical spatial index that first partitions China into province-level macro cells and then into neighborhood-level micro cells. This two-level hierarchy reduces the GEORADIUS search space from millions of drivers to thousands before the distance filter is applied, enabling matching throughput that would be impossible with a flat geo index.
βοΈ Where the Ride-Sharing Architecture Shows Its Limits
Redis Single-Node Write Throughput: A single Redis node handles roughly 100,000 operations per second under production conditions. With 500,000 active drivers each pinging every 5 seconds, the location write QPS is exactly 100,000 β at the practical ceiling of a single node. The resolution is city-based sharding: drivers:active:{city_id} distributes writes across as many Redis nodes as there are active cities. City-based sharding is natural because a rider in San Francisco will never match a driver in Chicago, so cross-shard queries never occur.
WebSocket Connection State at Scale: At 1 million concurrent riders watching live driver maps, maintaining WebSocket connection state in a single gateway is impractical. The correct architecture uses a stateless WebSocket gateway fleet backed by Redis Pub/Sub, where each driver's location update is published to a channel keyed by trip:ws:{rider_id} and the rider's WebSocket connection subscribes to that channel regardless of which gateway node holds the connection. This decouples the connection lifetime from the gateway instance.
Dead Reckoning Between GPS Pings: A driver's GPS position updates every 5 seconds. At highway speed, a driver can move 70 meters between pings. The live map uses dead reckoning β interpolating position between GPS updates using the last known heading and speed β to create the illusion of smooth, continuous movement. Without dead reckoning, riders would see their driver teleporting every 5 seconds rather than moving fluidly.
π§ Matching Architecture Decision Guide
| Scenario | Best Architecture | Why |
| Fewer than 10 K drivers globally | SQL with PostGIS ST_DWithin | No Redis overhead justified |
| 100 Kβ1 M drivers, global | Redis GEORADIUS sharded by city | Optimal write and read QPS balance |
| Surge pricing heatmaps | Kafka consumer reading GEORADIUS density | Density map updated asynchronously |
| Multi-modal transport (cars, bikes, scooters) | Separate Redis key namespace per vehicle type | Prevents mixed-type result contamination |
| Scheduled rides booked in advance | Add PRE_SCHEDULED state to Trip State Machine | Deferred matching trigger at scheduled time |
π§ͺ Delivering the Ride-Sharing HLD in a System Design Interview
Start by anchoring on the two most technically interesting constraints: the high-frequency GPS write path and the double-booking prevention requirement. Naming these early signals that you understand what the interviewer is actually probing for.
Walk through the happy path first β driver pings location β rider requests ride β Matching Service runs GEORADIUS β distributed lock acquired β trip record created β push notifications sent β rider watches live map. Then proactively address failure modes before being prompted: "What if the driver's phone dies between MATCHED and DRIVER_EN_ROUTE? The trip stays in MATCHED state. We need a background job that times out stale MATCHED trips after 90 seconds and returns the driver to AVAILABLE."
When drawing the architecture diagram, explicitly separate the Location Service from the Trip Service and explain why: location writes arrive at 100,000 per second; trip matching occurs at 500 per second. Combining them in a single service means a GPS write surge can starve the trip matching thread pool β a failure mode that would be invisible in load testing but catastrophic on New Year's Eve.
π What Ride-Sharing Architecture Reveals About In-Memory Data Structures
The ride-sharing HLD is a masterclass in matching the data structure to the access pattern. Redis Sorted Sets β the data structure that powers the GEORADIUS command β internally encode latitude and longitude as a 52-bit integer using the Geohash algorithm, then store it as the score of a Sorted Set entry. A spatial radius query becomes a score range query, which is O(log N + M) where M is the result count. This is the bridge between the data structure theory you studied in algorithms class and the geospatial engineering that makes real-world ride-sharing possible.
The distributed lock pattern used for double-booking prevention is equally instructive. The lock is not about protecting a shared data structure from concurrent reads β Redis handles that natively. It is about enforcing a business invariant (one driver, one active trip) across a distributed system where multiple Matching Service instances operate without shared memory. Distributed locks are the mechanism for extending single-machine mutual exclusion semantics to a cluster.
π TLDR & Key Takeaways
- Ride-sharing is a real-time geospatial matching engine, not a conventional database application. The data access patterns demand in-memory spatial indexing.
- Driver location updates arrive at 100,000 writes/second and are stored in Redis Geospatial, not SQL.
- Double-booking prevention requires a distributed lock on the driver ID, held only during the REQUESTED β MATCHED transition window.
- The Trip State Machine enforces ordered state transitions with durable Postgres writes at each step.
- WebSocket connections backed by Redis Pub/Sub provide the live "car on map" experience for millions of concurrent riders.
- City-based Redis sharding isolates regional surge events and keeps each geo index small enough to fit on a single node.
- The GPS dead reckoning layer interpolates driver position between 5-second pings to produce smooth map animations.
π Related Posts
- System Design HLD: Proximity Service (Yelp/Google Places) β The static-object counterpart to ride-sharing's moving-object geospatial matching, using Geohash prefix queries on Postgres.
- System Design HLD: Notification Service β Deep dive into the push notification pipeline used to alert drivers and riders at each trip state transition.
- System Design HLD: Distributed Cache β Understanding Redis architecture, sharding, and eviction strategies that underpin the Geo Index in this design.
Test Your Knowledge
Ready to test what you just learned?
AI will generate 4 questions based on this article's content.

Written by
Abstract Algorithms
@abstractalgorithms
More Posts
Stale Reads and Cascading Failures in Distributed Systems
TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redistributed load. Both are preventable β stale reads...
NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data
TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node β virtual nodes (vnodes) make rebalancing smooth. DynamoDB mana...
Clock Skew and Causality Violations: Why Distributed Clocks Lie
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions β but under load, across datacenters, or after a VM pause, the drift can reach seconds. When s...
Split Brain Explained: When Two Nodes Both Think They Are Leader
TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader β each accepting writes the other never sees. Prevent it with quorum consensus (at least βN/2β+1 nodes must agree before leadership is g...
