System Design HLD Example: Ride Sharing (Uber)
A practical interview-ready HLD for a ride-sharing platform with real-time matching and location tracking.
TLDR: A ride-sharing platform is a real-time geospatial matching system. Drivers push a GPS location every 5 seconds into a Redis geospatial index. When a rider requests a trip, the Matching Service runs a GEORADIUS query to find nearby available drivers, selects the best candidate, and notifies both parties via push. A Trip State Machine (REQUESTED → MATCHED → PICKED_UP → IN_PROGRESS → COMPLETED) enforces consistency. During the trip, WebSockets stream live driver location to the rider app. The hardest part is not matching; it is keeping 500K concurrent driver locations fresh at scale.
Uber processes roughly 5 million ride requests per day. That is about 58 per second on average, but during peak demand (Friday evening, rain, end of concerts), that number spikes 10x. Every second a rider waits for a match is a second they might open a competitor's app instead. Uber's internal target: match a rider to a driver in under 2 seconds from the moment the request lands.
How do you architect a system that does this reliably? The answer is not a clever algorithm; it is a set of deliberate architectural decisions: geospatial indexing in Redis, geohash-based partitioning, a strict trip state machine, and WebSocket-based location streaming. This post walks through each decision and explains why it was made.
By the end of this walkthrough you will understand why Redis outperforms a relational database for driver location storage, why geohashing beats a k-d tree for geographic partitioning at scale, and why WebSockets are the right protocol for in-trip location streaming rather than polling.
The Matching Problem: Why 5M Rides/Day Is Hard
Use Cases
| Actor | Role |
| --- | --- |
| Rider | Opens app, sets destination, requests a ride |
| Driver | Accepts or declines match; updates location every 5 seconds |
| Matching Service | Finds nearby available drivers and assigns the best candidate |
| Location Service | Ingests driver GPS pings; maintains the live geospatial index |
| Trip Service | Owns the trip state machine; is the source of truth for trip state |
| Notification Service | Pushes match confirmation and status updates to rider/driver apps |
| Payment Service | Charges the rider after trip completion; processes driver payout |
Core user journeys in scope:
- Rider requests a ride → Matching Service finds nearest available driver → both parties are notified → driver picks up rider → trip ends → payment is processed → ride appears in history.
- Driver app sends a GPS ping every 5 seconds → Location Service writes to Redis geospatial index → stale entries expire automatically.
- Rider app streams live driver location on a map during the trip via WebSocket.
Out of Scope (v1 boundary)
- Surge pricing algorithm internals (supply/demand ratio per geohash cell is noted in the design; the pricing model itself is a separate post).
- Driver background check and onboarding pipeline.
- Carpooling / shared ride matching.
- Driver earnings settlement and tax reporting.
Naming non-goals explicitly is one of the highest-signal moves in an interview. It shows architectural maturity: you are scoping the problem, not avoiding it.
Functional Requirements and Design Goals
Non-Functional Requirements
| Dimension | Target | Why it matters |
| --- | --- | --- |
| Matching latency | p99 < 2s from ride request to driver notification | Rider cancels after ~2s without response |
| Location freshness | Driver position no older than 10s in the index | Stale positions cause wrong ETAs and failed matches |
| Availability | 99.99% for trip creation; 99.9% for ride history | A failed trip creation is revenue lost |
| Scalability | 500K concurrent active drivers; 5M rides/day | Location index must handle ~100K writes/sec at peak |
| Consistency | Trip state transitions must be idempotent; no double-booking | Double-matching one driver to two riders destroys trust |
Design Goals (specific and measurable)
- Driver location index must serve nearest-driver queries in < 50ms for 500K active drivers.
- Trip state machine must guarantee at-most-once driver assignment: a driver is locked to one rider at a time.
- Location updates from 500K drivers arriving every 5 seconds must sustain 100K writes/sec without degrading read latency.
- WebSocket connections must be maintained for all active trips concurrently; peak estimate: ~80K concurrent trips during Friday evening.
Capacity Estimation: From 5M Rides to Concrete QPS Numbers
Sizing before designing is what separates interview answers from guesses.
| Metric | Assumption | Derivation |
| --- | --- | --- |
| Rides per day | 5M | Given |
| Peak rides/sec | ~580 | 5M / 86400 × 10x peak factor |
| Active drivers (peak) | 500K | ~10% of the 5M daily ride volume online as drivers at peak |
| Location update frequency | Every 5s per driver | GPS ping interval in driver app |
| Location write QPS | ~100K writes/sec | 500K drivers / 5s |
| Matching requests/sec | ~580/sec | One per ride request (same as peak rides/sec) |
| Driver record size | ~100 bytes | driver_id (16B) + lat/lng (16B) + timestamp (8B) + status (4B) + key/structure overhead |
| Active driver index size | ~50MB | 500K drivers × 100 bytes |
| Trip record size | ~512 bytes | IDs, status, timestamps, route metadata |
| Trips stored (1 year) | ~0.9TB | 5M/day × 365 × 512B ≈ 934GB |
Key insight: The location index is tiny (~50MB) but write-heavy (~100K/sec). This points directly at Redis, not PostgreSQL: the index must live entirely in memory, and Redis geospatial commands are purpose-built for this workload.
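These derivations are simple enough to verify mechanically. A quick Python sketch, using the assumptions from the table above (10x peak factor, 5-second ping interval, ~100-byte driver records, ~512-byte trip records):

```python
# Sanity-check the capacity estimates from the table above.
RIDES_PER_DAY = 5_000_000
PEAK_FACTOR = 10
ACTIVE_DRIVERS = 500_000
PING_INTERVAL_S = 5
DRIVER_RECORD_BYTES = 100
TRIP_RECORD_BYTES = 512

avg_rides_per_sec = RIDES_PER_DAY / 86_400                      # ~58/sec
peak_rides_per_sec = avg_rides_per_sec * PEAK_FACTOR            # ~580/sec
location_write_qps = ACTIVE_DRIVERS / PING_INTERVAL_S           # 100K writes/sec
index_size_mb = ACTIVE_DRIVERS * DRIVER_RECORD_BYTES / 1e6      # ~50MB
trips_year_tb = RIDES_PER_DAY * 365 * TRIP_RECORD_BYTES / 1e12  # ~0.9TB

print(round(peak_rides_per_sec), location_write_qps, index_size_mb, round(trips_year_tb, 2))
```

The point of writing it out: every downstream decision (in-memory index, geohash sharding, WebSocket fan-out) traces back to one of these four numbers.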
High-Level Architecture: How the Core Services Fit Together
The system decomposes into six services with clear ownership boundaries. No service owns two distinct concerns.
flowchart TD
RA[Rider App] -->|POST /trips| AG[API Gateway]
DA[Driver App] -->|PUT /location| AG
AG -->|Route ride request| MS[Matching Service]
AG -->|Route location update| LS[Location Service]
MS -->|GEORADIUS query| LS
MS -->|Lock driver + create trip| TS[Trip Service]
TS -->|Trigger notification| NS[Notification Service]
TS -->|Charge on completion| PS[Payment Service]
LS -->|GEOADD driver position| RD[(Redis Geo Index)]
TS -->|Trip state writes| DB[(PostgreSQL)]
NS -->|Push to app| FCM[FCM / APNs]
MS -->|ETA calculation| ETA[ETA Service]
This diagram shows the write path (driver location update → Redis) and the read path (ride request → Matching Service → Redis GEORADIUS → Trip Service → notification) running through independent service paths. The API Gateway routes the two most frequent request types, location updates and ride requests, to separate services, so a spike in one does not degrade the other.
Service ownership summary:
| Service | Owns | Storage |
| --- | --- | --- |
| Location Service | Driver GPS index; geospatial writes/reads | Redis Geo |
| Matching Service | Nearest-driver query; driver selection logic; driver lock | Calls Location Service + Trip Service |
| Trip Service | Trip state machine; source of truth for all trip state | PostgreSQL |
| Notification Service | Push delivery to rider/driver apps | Kafka + FCM/APNs |
| Payment Service | Post-trip charge; driver payout | PostgreSQL + Stripe |
| ETA Service | Real-time ETA from driver to pickup point | Map routing API |
Deep Dive: Location Tracking - Geohash and the Redis Geo Index
The location index is the most performance-sensitive component in the system. At 100K writes/second, it must be entirely in memory; a relational database with disk I/O cannot sustain this write rate without horizontal sharding complexity.
Internals: How Redis Geo Stores Driver Positions
Redis Geo is built on a sorted set. Every driver's position is stored as a member whose score is a 52-bit integer derived from a Geohash of the latitude/longitude pair. Because Geohash encodes nearby points with similar integer values, Redis can answer radius queries with sorted-set range operations, scanning only the integer range that covers the target cell rather than the entire keyspace.
When a driver sends `GEOADD drivers:active -122.4194 37.7749 drv_abc123`, Redis:
1. Encodes (-122.4194, 37.7749) into a 52-bit Geohash integer.
2. Calls `ZADD drivers:active <score> drv_abc123` internally.
3. Returns in under 1ms (in-memory operation, no disk I/O).
GEORADIUS reverses the process: it converts the center point and radius into a set of contiguous Geohash integer ranges and runs sorted-set range scans over each range, collecting members whose decoded coordinates fall within the actual radius. This is O(N + log(M)), where M is the total member count and N is the result count; fast enough for 500K drivers.
Performance Analysis: Sustaining 100K Location Writes per Second
| Scenario | Write QPS | p99 write latency | Viable? |
| --- | --- | --- | --- |
| Single Redis node | 100K | < 1ms | Yes (Redis handles 200K+ simple ops/sec on modern hardware) |
| PostgreSQL PostGIS single node | 100K | 15-40ms | No; index maintenance under this write rate blocks read queries |
| Redis cluster (geohash-sharded) | 1M+ | < 1ms | Yes; horizontal scale at geographic boundaries |
A single Redis node comfortably handles the average case (100K writes/sec). The peak spike risk (10x = 1M writes/sec) motivates the geohash-sharded cluster described in the Location Service Bottleneck section below. The key insight: the location index is read-light during matching (one GEORADIUS per ride request, ~580/sec at peak) but write-heavy for location updates (100K/sec). Separating the two operations is not necessary: Redis executes commands on a single thread, so GEOADD writes and GEORADIUS reads on the same sorted set serialize cleanly, and both complete in well under a millisecond.
Why Geohash?
A geohash encodes a latitude/longitude pair into a short alphanumeric string. Strings with the same prefix share the same geographic region. This makes range queries ("find all drivers within 1km of this point") reducible to string prefix lookups rather than two-dimensional range scans.
# Geohash example
San Francisco (37.7749° N, 122.4194° W) → geohash: "9q8yy"
1 block north (37.7759° N, 122.4194° W) → geohash: "9q8yy" (same cell)
5km east (37.7749° N, 122.3694° W) → geohash: "9q8yz" (adjacent cell, same prefix)
Precision 5 (5-char geohash) → ~4.9km × 4.9km cell
Precision 6 (6-char geohash) → ~1.2km × 0.6km cell; good for city-block resolution
Redis implements geohashing internally when you store a point with GEOADD. The stored score in the sorted set is the 52-bit integer geohash. Proximity queries (GEORADIUS) use this encoding to scan only the relevant cells instead of the entire keyspace.
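To make the encoding concrete, here is a minimal Geohash encoder in Python. It is the standard interleaved-bisection algorithm: alternate between bisecting the longitude and latitude ranges, emit one bit per bisection, and map each 5-bit group to the base32 alphabet. Redis applies the same scheme internally, keeping the raw 52-bit integer as the sorted-set score instead of a base32 string.

```python
# Minimal Geohash encoder: interleave longitude/latitude bisection bits
# (longitude first), then map each 5-bit group to the geohash base32 alphabet.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lng: float, precision: int = 5) -> str:
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count = [], 0, 0
    even = True  # even-numbered bits bisect longitude
    while len(chars) < precision:
        rng, val = (lng_range, lng) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid        # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid        # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:      # 5 bits -> one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(37.7749, -122.4194))  # San Francisco -> "9q8yy"
print(geohash_encode(37.7749, -122.3694))  # ~5km east     -> "9q8yz"
```

Note the prefix property in action: extending the precision only appends characters, so a precision-6 hash of the same point still starts with "9q8yy".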
Write Path: Driver Location Update
Every 5 seconds, the driver app sends a GPS ping:
PUT /location
{ "driver_id": "drv_abc123", "lat": 37.7749, "lng": -122.4194 }
The Location Service handles this with two Redis operations:
# Store or update driver position in the geospatial index (note: longitude first)
GEOADD drivers:active -122.4194 37.7749 drv_abc123
# Set an expiry key so stale drivers auto-evict after 30s of silence
SET driver:heartbeat:drv_abc123 1 EX 30
A background job watches for expired heartbeat keys and removes the corresponding driver from drivers:active with ZREM. This prevents stale entries from appearing in match queries.
Why EX 30 and not EX 5? A single missed GPS ping (network hiccup, app backgrounding) should not evict an active driver. 30 seconds = 6 missed pings; if a driver hasn't pinged in 30 seconds they have genuinely gone offline.
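The heartbeat-plus-reaper pattern can be sketched with an in-memory stand-in for Redis. The `DriverIndex` class and the injected clock below are illustrative (the real system uses Redis key expiry and a background job), but the eviction rule is the one described above:

```python
import time

HEARTBEAT_TTL_S = 30  # 6 missed pings at a 5s interval

class DriverIndex:
    """In-memory stand-in for the Redis geo index plus heartbeat keys."""
    def __init__(self, clock=time.time):
        self.clock = clock
        self.last_ping = {}  # driver_id -> timestamp of last GPS ping

    def on_ping(self, driver_id):
        # Equivalent to GEOADD + SET driver:heartbeat:{id} 1 EX 30
        self.last_ping[driver_id] = self.clock()

    def evict_stale(self):
        # Background reaper: drop drivers whose heartbeat expired (ZREM)
        now = self.clock()
        stale = [d for d, ts in self.last_ping.items()
                 if now - ts > HEARTBEAT_TTL_S]
        for d in stale:
            del self.last_ping[d]
        return stale

# Simulated clock so the scenario doesn't actually sleep for 30 seconds
t = [1000.0]
idx = DriverIndex(clock=lambda: t[0])
idx.on_ping("drv_abc123"); idx.on_ping("drv_xyz789")
t[0] += 12; idx.on_ping("drv_abc123")   # drv_xyz789 goes silent
t[0] += 25
print(idx.evict_stale())  # -> ['drv_xyz789']
```

The design point: the absence of pings is itself the offline signal, so no explicit "going offline" message is needed from the driver app.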
Read Path: Nearest-Driver Query During Matching
# Find all available drivers within 2km of the rider's pickup point
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10
# Returns: ["drv_abc123", "drv_xyz789", ...]
# ASC = sorted by distance; COUNT 10 = return up to 10 candidates
The Matching Service takes these candidates, filters to status = AVAILABLE (checked in the Trip Service), requests ETAs from the ETA Service, and picks the driver with the shortest ETA to the pickup point, not the geographically nearest one, because road topology matters.
Deep Dive: Driver Matching - Locking the Driver Atomically
Selecting a driver from the GEORADIUS result is not enough. Two concurrent ride requests can both select the same driver. The system needs an atomic driver lock before assigning the trip.
sequenceDiagram
participant R as Rider App
participant MS as Matching Service
participant LS as Location Service
participant TS as Trip Service
participant RD as Redis
R->>MS: POST /trips (pickup, destination)
MS->>LS: GEORADIUS pickup 2km ASC COUNT 10
LS->>RD: GEORADIUS drivers:active
RD-->>LS: [drv_abc123, drv_xyz789]
LS-->>MS: candidates list
MS->>TS: lockDriver(drv_abc123, tripId)
TS->>RD: SET driver:lock:drv_abc123 tripId NX EX 30
alt Lock acquired
TS-->>MS: locked
MS->>TS: createTrip(tripId, riderId, driverId)
else Lock already held
TS-->>MS: conflict
MS->>TS: lockDriver(drv_xyz789, tripId)
end
This sequence shows why atomicity matters: SET ... NX EX 30 (set if not exists, expire in 30 seconds) is Redis's atomic set-if-absent. If two Matching Service instances race to lock the same driver, exactly one wins. The loser falls through to the next candidate in the list. The 30-second expiry ensures a crashed Matching Service instance does not hold a lock forever.
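The lock-and-fall-through loop can be sketched as follows. `LockStore` is an in-memory stand-in for Redis `SET key value NX` (the EX expiry is omitted for brevity), and the function names are illustrative:

```python
class LockStore:
    """Stand-in for Redis SET key value NX (expiry omitted for brevity)."""
    def __init__(self):
        self.locks = {}

    def set_nx(self, key, value):
        # Atomic in real Redis because the server executes commands one at a time
        if key in self.locks:
            return False
        self.locks[key] = value
        return True

def assign_driver(store, candidates, trip_id):
    # Walk GEORADIUS candidates in distance order; first successful lock wins.
    for driver_id in candidates:
        if store.set_nx(f"driver:lock:{driver_id}", trip_id):
            return driver_id
    return None  # every candidate already locked -> widen the radius / retry

store = LockStore()
print(assign_driver(store, ["drv_abc123", "drv_xyz789"], "trip_1"))  # drv_abc123
print(assign_driver(store, ["drv_abc123", "drv_xyz789"], "trip_2"))  # drv_xyz789
print(assign_driver(store, ["drv_abc123", "drv_xyz789"], "trip_3"))  # None
```

Two concurrent requests over the same candidate list end up with different drivers, which is exactly the at-most-once assignment guarantee the design goals call for.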
Deep Dive: Trip Lifecycle - The State Machine That Keeps Rides Consistent
Every trip passes through a strict set of states. The Trip Service enforces valid transitions: no trip can skip from REQUESTED directly to COMPLETED, and no state transition can run twice (idempotency).
// Trip state machine β only the transitions shown are valid
public enum TripStatus {
REQUESTED, // Rider has submitted request; no driver assigned yet
MATCHED, // Driver accepted and is en route to pickup
ARRIVED, // Driver is at the pickup location
PICKED_UP, // Rider is in the vehicle
IN_PROGRESS, // En route to destination
COMPLETED, // Trip ended; awaiting payment
CANCELLED // Cancelled by rider, driver, or system timeout
}
// Valid transitions (enforced in TripService.transition())
private static final Map<TripStatus, Set<TripStatus>> VALID_TRANSITIONS = Map.of(
REQUESTED, Set.of(MATCHED, CANCELLED),
MATCHED, Set.of(ARRIVED, CANCELLED),
ARRIVED, Set.of(PICKED_UP, CANCELLED),
PICKED_UP, Set.of(IN_PROGRESS),
IN_PROGRESS, Set.of(COMPLETED),
COMPLETED, Set.of(), // terminal
CANCELLED, Set.of() // terminal
);
Idempotency key: Every state transition request carries a client-generated idempotency_key. The Trip Service stores processed keys in Redis (SET trip:idem:{key} 1 EX 86400). A retry of the same transition returns the same response without re-processing; this prevents double-booking when the Matching Service retries a failed createTrip call.
State transition table:
| From | To | Trigger | Who calls it |
| --- | --- | --- | --- |
| REQUESTED | MATCHED | Driver accepts | Driver App → Trip Service |
| MATCHED | ARRIVED | Driver marks arrival | Driver App → Trip Service |
| ARRIVED | PICKED_UP | Driver starts trip | Driver App → Trip Service |
| PICKED_UP | IN_PROGRESS | Trip started | Driver App → Trip Service |
| IN_PROGRESS | COMPLETED | Driver ends trip | Driver App → Trip Service |
| Any | CANCELLED | Rider/driver cancels or timeout | Rider/Driver App or Scheduler |
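The transition check plus the idempotency replay can be sketched in a few lines (Python here for brevity; it mirrors the Java VALID_TRANSITIONS map above, with an in-memory dict standing in for the Redis idempotency keys):

```python
# Mirrors the Java VALID_TRANSITIONS map from the Trip Service.
VALID_TRANSITIONS = {
    "REQUESTED":   {"MATCHED", "CANCELLED"},
    "MATCHED":     {"ARRIVED", "CANCELLED"},
    "ARRIVED":     {"PICKED_UP", "CANCELLED"},
    "PICKED_UP":   {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETED"},
    "COMPLETED":   set(),  # terminal
    "CANCELLED":   set(),  # terminal
}

class TripService:
    def __init__(self):
        self.trips = {}           # trip_id -> current status
        self.processed_keys = {}  # idempotency_key -> cached response (Redis in prod)

    def transition(self, trip_id, to_status, idem_key):
        if idem_key in self.processed_keys:      # retry: replay the cached response
            return self.processed_keys[idem_key]
        current = self.trips.get(trip_id, "REQUESTED")
        if to_status not in VALID_TRANSITIONS[current]:
            resp = ("rejected", current)         # invalid transition; state unchanged
        else:
            self.trips[trip_id] = to_status
            resp = ("ok", to_status)
        self.processed_keys[idem_key] = resp
        return resp

ts = TripService()
print(ts.transition("t1", "MATCHED", "k1"))    # ('ok', 'MATCHED')
print(ts.transition("t1", "MATCHED", "k1"))    # retry -> same response, no re-processing
print(ts.transition("t1", "COMPLETED", "k2"))  # ('rejected', 'MATCHED') - can't skip states
```

Both guarantees fall out of two small checks: the transition map rejects state skips, and the key cache makes any retry a no-op replay.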
Deep Dive: Real-Time Location Sharing During the Trip
Once a trip is MATCHED, the rider app needs a live map showing the driver's position. Polling every 3 seconds would work but wastes resources: every poll pays a full request/response round trip, plus TCP and TLS handshakes whenever connections are not reused. WebSockets keep the connection open so the server can push updates as they arrive.
// WebSocket pseudocode β server-side location push handler
onDriverLocationUpdate(driverId, lat, lng):
tripId = getTripIdForDriver(driverId) // O(1) Redis lookup
if tripId is not null and trip.status in [MATCHED, IN_PROGRESS]:
riderId = getRiderForTrip(tripId) // O(1) Redis lookup
wsConn = activeConnections.get(riderId) // in-memory connection registry
if wsConn is open:
wsConn.send({ lat, lng, timestamp }) // push immediately; no polling
// Rider app initiates connection when trip enters MATCHED state
// Connection is closed when trip reaches COMPLETED or CANCELLED
sequenceDiagram
participant DA as Driver App
participant LS as Location Service
participant WS as WebSocket Gateway
participant RA as Rider App
DA->>LS: PUT /location (lat, lng) every 5s
LS->>LS: GEOADD drivers:active
LS->>WS: publishLocationUpdate(driverId, lat, lng)
WS->>WS: look up riderId for active trip
WS-->>RA: push { lat, lng } over WebSocket
The WebSocket Gateway maintains a mapping from tripId → riderWebSocketConnection in memory. When a location update arrives, it looks up the rider's connection and pushes the coordinates. The lookup is O(1). For 80K concurrent active trips, the in-memory table is tiny (~10MB) and the push is sub-millisecond.
Why not Server-Sent Events (SSE)? SSE is unidirectional (server → client only), which suits this use case, but WebSockets are already required for the driver app to send confirmations back. Unifying on WebSockets avoids operating two connection protocols.
Real-World Applications: Uber, Lyft, and DiDi at Scale
The same core patterns (Redis geospatial index, geohash partitioning, trip state machines) appear across all major ride-sharing platforms, each adapted to their specific operational constraints.
| Platform | Location Index | Matching Approach | Notable Design Decision |
| --- | --- | --- | --- |
| Uber | Redis Geo with geohash-sharded clusters per city | ETA-based matching (not pure distance) | H3 hexagonal grid (open-sourced) for surge pricing zone calculations |
| Lyft | Redis Geo + PostgreSQL PostGIS for cold path | Supply/demand heat map per geohash cell | Separate services for airport queue matching vs. standard matching |
| DiDi | Distributed Redis cluster across regions | AI-assisted predictive positioning | Pre-positions drivers near predicted demand zones 5-10 minutes ahead |
Real-world incident: Uber's geohash boundary bug. In early deployments, riders near the edge of a geohash cell (e.g., exactly on the boundary between 9q8yy and 9q8yz cells) were sometimes unmatched because a GEORADIUS query on one shard did not include drivers in the adjacent cell. The fix: Matching Service queries the target cell and all 8 adjacent geohash cells for riders within 200m of a cell boundary. This "boundary fan-out" adds at most one extra Redis query for ~15% of ride requests.
Uber H3 (Hexagonal Hierarchical Spatial Index): Uber open-sourced their H3 library specifically because traditional rectangular geohash cells have a distance distortion problem: the corner-to-corner distance of a square cell is ~41% longer than its edge-to-edge distance, so "nearby" means different things in different directions. H3 hexagons are more uniform. In practice, H3 is used for surge pricing zone calculations; the matching layer still uses Redis Geo's built-in Geohash because the GEORADIUS command handles the boundary overlap automatically.
Input: Rider at (37.7749, -122.4194) requests a ride
Process:
1. GEORADIUS on primary shard (9q8) → 8 candidate drivers
2. Boundary check: rider is 180m from 9q8/9q9 cell boundary
3. Fan-out: GEORADIUS on adjacent shard (9q9) → 2 additional candidates
4. Merge candidates, calculate ETAs, select driver with min ETA
Output: drv_abc123 matched (0.4km away, ETA 2 min)
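The merge-and-select step of that trace can be sketched as below. The function name, the `near_boundary` flag (set when the rider is within ~200m of a cell edge), and the hard-coded ETAs are illustrative; in the real flow the ETAs come from the ETA Service:

```python
def match_with_fanout(primary, adjacent, near_boundary, eta_min):
    """Merge GEORADIUS candidates from the primary shard (and, when the rider
    is near a cell boundary, the adjacent shard), then pick the min-ETA driver."""
    candidates = set(primary)
    if near_boundary:                # ~200m from the cell edge (~15% of requests)
        candidates |= set(adjacent)  # the one extra Redis query
    # ETA-based selection, not pure distance: road topology matters
    return min(candidates, key=eta_min, default=None)

etas = {"drv_a": 7, "drv_b": 2, "drv_c": 4}  # minutes to pickup (illustrative)
print(match_with_fanout(["drv_a", "drv_c"], ["drv_b"], True,  etas.get))  # drv_b
print(match_with_fanout(["drv_a", "drv_c"], ["drv_b"], False, etas.get))  # drv_c
```

The second call shows the boundary bug the fan-out fixes: without querying the adjacent cell, the genuinely best driver (`drv_b`) is invisible to the match.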
Data Model: Tables, Indexes, and Sharding Keys
The relational database owns durable, auditable records. The Redis layer owns ephemeral, high-throughput state.
PostgreSQL Schema
CREATE TABLE trips (
trip_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
rider_id UUID NOT NULL,
driver_id UUID, -- NULL until MATCHED
status TEXT NOT NULL DEFAULT 'REQUESTED',
pickup_lat DECIMAL(9,6) NOT NULL,
pickup_lng DECIMAL(9,6) NOT NULL,
dest_lat DECIMAL(9,6) NOT NULL,
dest_lng DECIMAL(9,6) NOT NULL,
requested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
matched_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
fare_cents INT, -- final fare in cents; NULL until COMPLETED
idempotency_key TEXT UNIQUE
);
-- Shard key: rider_id; keeps all trips for a rider on one shard for history queries
CREATE INDEX idx_trips_rider ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips (driver_id, requested_at DESC);
CREATE INDEX idx_trips_status ON trips (status) WHERE status NOT IN ('COMPLETED','CANCELLED');
CREATE TABLE drivers (
driver_id UUID PRIMARY KEY,
name TEXT NOT NULL,
phone TEXT NOT NULL,
vehicle_id UUID NOT NULL,
rating DECIMAL(3,2) DEFAULT 5.00,
status TEXT NOT NULL DEFAULT 'OFFLINE' -- OFFLINE | AVAILABLE | ON_TRIP
);
CREATE TABLE riders (
rider_id UUID PRIMARY KEY,
name TEXT NOT NULL,
phone TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
default_payment TEXT -- payment method token
);
CREATE TABLE location_snapshots (
snapshot_id BIGSERIAL PRIMARY KEY,
driver_id UUID NOT NULL,
trip_id UUID, -- NULL for idle pings
lat DECIMAL(9,6) NOT NULL,
lng DECIMAL(9,6) NOT NULL,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Partitioned by recorded_at (monthly); retained for 90 days for dispute resolution
CREATE INDEX idx_location_driver ON location_snapshots (driver_id, recorded_at DESC);
Redis Key Layout
| Key Pattern | Value | TTL | Purpose |
| --- | --- | --- | --- |
| `drivers:active` | Geo sorted set (all active drivers) | No TTL (ZREM on eviction) | GEORADIUS queries |
| `driver:heartbeat:{driverId}` | "1" | 30s | Auto-evicts stale drivers |
| `driver:lock:{driverId}` | tripId | 30s | Atomic driver assignment lock |
| `trip:idem:{idempotencyKey}` | tripId | 24h | Idempotent trip creation |
| `trip:status:{tripId}` | "MATCHED" etc. | 2h | Fast status reads for WebSocket gateway |
Trade-offs, Failure Modes, and Scaling Decisions
The Location Service Bottleneck
At 100K writes/second, a single Redis instance is the obvious bottleneck. Redis can handle ~200K-400K simple SET operations per second on modern hardware, so a single node can survive normal load, but not a 10x peak spike.
Mitigation: Geohash-based partitioning. Shard the drivers:active key by geohash prefix. Cities map naturally to geohash cells at precision 3 (~156km × 156km) or precision 4 (~39km × 20km). Each shard owns a geographic region:
| Shard | Geohash prefix | Region |
| --- | --- | --- |
| redis-shard-0 | 9q8 | San Francisco |
| redis-shard-1 | dr5 | New York City |
| redis-shard-2 | u33 | London |
A GEORADIUS query only touches the shard(s) for the relevant geographic cell(s), and writes are naturally distributed across shards by city. Cross-cell queries (a rider near the boundary of two geohash cells) require queries to adjacent shards; the Matching Service handles this with a two-shard fan-out for border cases.
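Routing a write or query to the right shard is nothing more than a prefix lookup. A minimal sketch (the shard map mirrors the table above and is illustrative):

```python
# Route a geohash to its Redis shard by precision-3 prefix (city-scale cells).
SHARD_MAP = {
    "9q8": "redis-shard-0",  # San Francisco
    "dr5": "redis-shard-1",  # New York City
    "u33": "redis-shard-2",  # Berlin-area cell used here as "London" stand-in? No:
                             # prefix taken verbatim from the table above.
}

def shard_for(geohash: str) -> str:
    """Stateless routing: no coordination, no lookup service, just a prefix."""
    return SHARD_MAP[geohash[:3]]

print(shard_for("9q8yy"))  # precision-5 SF geohash -> redis-shard-0
print(shard_for("dr5ru"))  # NYC geohash            -> redis-shard-1
```

Because the mapping is a pure function of the geohash, every Location Service instance can route independently; this is the "stateless partitioning" advantage over tree-based spatial indexes.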
Trade-off Summary
| Decision | Chosen | Alternative | Why |
| --- | --- | --- | --- |
| Driver location storage | Redis Geo | PostgreSQL PostGIS | PostGIS can't sustain 100K writes/sec without heavy connection pooling and sharding; Redis keeps the index in memory |
| Geographic indexing | Geohash (Redis native) | k-d tree | k-d trees require re-balancing under concurrent inserts, which is operationally expensive at 500K concurrent writers; geohash partitioning is stateless and trivially shardable |
| Matching notification | Push (FCM/APNs) | SMS | Sub-second push latency; SMS has 1-10s delivery variance |
| In-trip location updates | WebSocket | HTTP polling | Polling at a 5s interval = 16K+ HTTP requests/sec for 80K active trips; WebSocket collapses this to one connection per trip |
| Trip state storage | PostgreSQL | Redis | Trip history is auditable, financial-grade data; it must survive cache eviction |
Surge Pricing: A Brief Note
The Location Service continuously maintains a supply/demand ratio per geohash cell: active_drivers / open_requests in each cell. When this ratio drops below a threshold (e.g., 0.5 available drivers per pending request), the Pricing Service multiplies the base fare. The cell granularity (precision 5, ~5km cells) is coarser than matching (precision 6) so surge zones cover meaningful geographic areas rather than single blocks.
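A sketch of the surge rule described above. The 0.5 threshold comes from the text; the multiplier formula and the 3.0x cap are illustrative assumptions, not the real pricing model (which, as noted, is out of scope):

```python
def surge_multiplier(active_drivers: int, open_requests: int) -> float:
    """Illustrative surge rule per geohash cell: below 0.5 available drivers
    per pending request, scale the base fare by demand pressure, capped at 3.0x."""
    if open_requests == 0:
        return 1.0                       # no demand -> no surge
    ratio = active_drivers / open_requests
    if ratio >= 0.5:
        return 1.0                       # enough supply -> base fare
    return min(3.0, round(0.5 / ratio, 1))  # hypothetical formula + cap

print(surge_multiplier(40, 20))   # ratio 2.00 -> 1.0 (no surge)
print(surge_multiplier(10, 40))   # ratio 0.25 -> 2.0x
print(surge_multiplier(2, 100))   # ratio 0.02 -> 3.0x (capped)
```

The inputs (`active_drivers`, `open_requests` per cell) are exactly the two counters the Location Service already maintains, which is why surge computation piggybacks on the geohash index.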
Architecture Decision Guide: Choosing the Right Tool for Each Concern
| Situation | Recommendation |
| --- | --- |
| High-frequency ephemeral updates (location, session state) | Redis: in-memory, sub-millisecond latency, TTL-based auto-eviction |
| Durable financial or auditable records (trips, payments) | PostgreSQL: ACID, point-in-time recovery, audit log |
| Geographic radius queries at 100K+ writes/sec | Redis GEORADIUS: native geospatial, sorted-set backed, no separate spatial index |
| Complex polygon queries or persistent geo data (routing, boundaries) | PostGIS: full spatial SQL, persistent, ACID |
| Real-time server-to-client data push (live map, status updates) | WebSockets: bidirectional, persistent connection, eliminates polling overhead |
| Occasional status checks (ride history, profile) | REST HTTP: stateless, cacheable, operationally simpler than WebSockets |
| Driver assignment under concurrent Matching Service instances | Redis SET NX EX (atomic lock): set-if-absent in one round-trip, no distributed lock manager needed |
| Scale beyond single Redis node for location index | Geohash-based sharding (by city/region): stateless partitioning, no cross-shard coordination for radius queries within a city |
When NOT to use Redis for primary data:
- Any data that must survive a Redis eviction or crash without a separate durable write path.
- Trip state, payment records, and driver earnings: these must be in PostgreSQL even if cached in Redis.
- Data requiring complex relational queries (joins, aggregations over months of data): Redis is a cache and index, not a query engine.
Interview Walkthrough: How to Deliver This Design in 45 Minutes
A repeatable framework for presenting this design under interview time pressure:
Minute 0-5: Scope and requirements
- State the scale: 5M rides/day, 500K active drivers, location updates every 5 seconds.
- Identify the single hardest constraint: "keeping 500K driver positions fresh at 100K writes/sec while serving sub-50ms radius queries."
- Name non-goals explicitly: surge pricing internals, driver onboarding, carpooling.
Minute 5-10: Capacity estimation (3 numbers that matter)
- 100K location writes/sec → need in-memory storage.
- ~580 ride requests/sec at peak → matching is fast but needs atomic driver locking.
- ~80K concurrent WebSocket connections during Friday evening peak → manageable for a mid-size gateway cluster.
Minute 10-20: Draw the HLD
- Six services: Location, Matching, Trip, Notification, Payment, ETA.
- Two write paths: (1) driver GPS → Location Service → Redis, (2) ride request → Matching Service → Trip Service → DB.
- One read path: Matching Service → Redis GEORADIUS → driver candidates.
Minute 20-35: Walk through three critical decisions
- Why Redis Geo? GEORADIUS on 500K drivers in < 5ms; PostgreSQL PostGIS cannot sustain 100K writes/sec without blocking reads.
- Why geohash partitioning? Stateless sharding by city prefix; k-d trees need re-balancing under concurrent inserts.
- Why WebSockets for in-trip tracking? 80K concurrent trips × 5s polling = 16K HTTP requests/sec just for location; WebSockets collapse this to 80K persistent connections.
Minute 35-40: Trip state machine and consistency
- Draw the state diagram: REQUESTED → MATCHED → ARRIVED → PICKED_UP → IN_PROGRESS → COMPLETED.
- Emphasise: each transition is idempotent (idempotency key in Redis), and driver assignment uses `SET NX EX` to prevent double-booking.
Minute 40-45: Scaling evolution
- Phase 1: Single Redis node per region. Handles average load.
- Phase 2: Geohash-sharded Redis cluster. Handles 10x peak.
- Phase 3: Separate WebSocket Gateway cluster with horizontal scaling. Handles 500K+ concurrent connections.
A strong closing line: "The core architectural principle is separation of concerns between ephemeral, high-frequency state (Redis) and durable, auditable state (PostgreSQL); every component decision flows from that split."
Redis Geo: How It Powers Driver Location Matching
Redis Geo is a set of commands layered on top of Redis Sorted Sets. Every member's score is a 52-bit Geohash integer derived from the latitude/longitude pair. This lets Redis answer proximity queries using sorted-set range operations internally; no separate spatial index is needed.
Core commands:
# Add or update a driver's position
GEOADD drivers:active -122.4194 37.7749 drv_abc123
# Syntax: GEOADD key longitude latitude member
# Returns: 1 (added) or 0 (updated existing)
# Find drivers within 2km of a rider's pickup point, sorted by distance
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10 WITHDIST
# Returns:
# 1) 1) "drv_abc123"
# 2) "0.1832" <- distance in km
# 2) 1) "drv_xyz789"
# 2) "1.4210"
# Get stored position of a specific driver
GEOPOS drivers:active drv_abc123
# Returns: [[-122.41940081119537, 37.774898917926234]]
# Get distance between two drivers
GEODIST drivers:active drv_abc123 drv_xyz789 km
# Returns: "1.2748"
# Remove a driver from the index (goes offline or heartbeat expired)
ZREM drivers:active drv_abc123
Note: since Redis 6.2, GEORADIUS is deprecated in favor of GEOSEARCH, which offers the same functionality with a clearer argument syntax; the examples above use GEORADIUS for familiarity.
Why Redis Geo over a purpose-built spatial database (PostGIS, Elasticsearch Geo)?
| Criterion | Redis Geo | PostGIS | Elasticsearch Geo |
| --- | --- | --- | --- |
| Write latency | < 1ms (in-memory) | 5-20ms (disk) | 5-15ms (index overhead) |
| Read latency (GEORADIUS) | 1-5ms | 10-50ms | 10-30ms |
| Throughput | 200K+ ops/sec/node | ~10K writes/sec | ~20K writes/sec |
| Persistence | Optional RDB/AOF | Full ACID | Full with replicas |
| Best for | Ephemeral, high-freq updates | Persistent geo data, complex queries | Full-text + geo hybrid |
For driver location (ephemeral, high-frequency writes with simple radius reads) Redis Geo is the correct tool. For permanent route geometry or complex polygon queries, PostGIS is the right choice.
For a deeper dive on Redis commands and data structures, see the companion post System Design Caching and Asynchronism.
Lessons Learned
- The location index is tiny but write-intensive. 500K drivers × 100 bytes = ~50MB. The challenge is not storage; it is sustaining 100K writes/second without degrading read latency. Memory-first storage (Redis) is the only viable choice at this write rate without complex multi-node PostgreSQL sharding.
- Geohash partitioning is operationally simpler than k-d trees. K-d trees are theoretically faster for nearest-neighbor queries, but they require re-balancing during concurrent inserts. Geohash cells are stateless: each shard owns a fixed geographic prefix, and no shard needs to know about its neighbours.
- Driver locking must be atomic. The single biggest consistency risk in the matching flow is two Matching Service instances selecting the same driver. `SET ... NX EX` in Redis gives you an atomic set-if-absent that costs one network round-trip. Without this, double-booking is a race condition that is very hard to reproduce in testing and very visible to users in production.
- Trip state machines prevent invalid state transitions. Without an enforced state machine, a driver can accidentally send a `COMPLETED` event before `PICKED_UP`, or a retry can double-process a `MATCHED` event. The state machine is the contract between every service that touches a trip.
- WebSockets win over polling for in-trip updates at scale. At 80K concurrent trips with a 5-second update interval, HTTP polling would generate 16K+ requests per second purely for location updates. WebSockets compress that to 80K persistent connections, which modern gateway infrastructure handles trivially.
- Heartbeat expiry is how you handle driver disconnections gracefully. If a driver's app crashes, the GPS pings stop. A 30-second heartbeat key expiry removes the driver from the active index without requiring an explicit `OFFLINE` event; the absence of pings is itself the signal.
π TLDR: Summary & Key Takeaways
- Redis Geo (`GEOADD`/`GEORADIUS`) is the correct storage layer for the driver location index: in-memory, sub-millisecond write latency, and native radius queries. PostgreSQL cannot sustain 100K location writes/sec without aggressive sharding.
- Geohash partitioning allows the location index to scale horizontally by geographic region. Each shard owns a geohash prefix cell; `GEORADIUS` queries only touch the relevant shard(s).
- Atomic driver locking (`SET driver:lock:{id} tripId NX EX 30`) prevents double-booking when multiple Matching Service instances race to assign the same driver.
- The Trip State Machine (REQUESTED → MATCHED → ARRIVED → PICKED_UP → IN_PROGRESS → COMPLETED) enforces valid transitions and, combined with idempotency keys, makes every state transition safe to retry.
- WebSockets are the right protocol for in-trip location streaming: one persistent connection per rider replaces thousands of polling requests per second at peak load.
- Heartbeat expiry (30-second TTL on `driver:heartbeat:{id}`) auto-evicts stale drivers from the index without requiring an explicit offline signal.
- The hardest engineering problem in ride-sharing is not the matching algorithm; it is keeping half a million driver positions fresh in memory at 100K writes/second while serving sub-50ms `GEORADIUS` queries concurrently.
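To make the `GEOADD`/`GEORADIUS` takeaway concrete, here is a hedged sketch of what those commands compute: a haversine radius query over an in-memory table. This naive version scans every driver; real Redis stores members in a geohash-scored sorted set and prunes candidates by geohash range first, which is what makes it fast at scale. Driver IDs and coordinates below are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in km
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

drivers = {}  # driver_id -> (lat, lon); Redis keeps this in a geohash-scored zset

def geoadd(driver_id, lat, lon):
    drivers[driver_id] = (lat, lon)

def georadius(lat, lon, radius_km):
    # naive linear scan; real GEORADIUS prunes candidates by geohash range first
    hits = []
    for driver_id, (dlat, dlon) in drivers.items():
        km = haversine_km(lat, lon, dlat, dlon)
        if km <= radius_km:
            hits.append((driver_id, km))
    return sorted(hits, key=lambda h: h[1])  # nearest first

geoadd("d1", 37.7749, -122.4194)  # at the rider's location in San Francisco
geoadd("d2", 37.7800, -122.4200)  # a few blocks north
geoadd("d3", 37.9000, -122.3000)  # far across the bay
nearby = georadius(37.7749, -122.4194, 2.0)
assert [d for d, _ in nearby] == ["d1", "d2"]  # d3 falls outside the 2 km radius
```

Sharding by geohash prefix drops straight onto this model: each shard holds only the drivers whose geohash starts with its prefix, so a radius query touches one shard (or a handful at cell boundaries) instead of the global set.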
Practice Quiz
Why is Redis preferred over a PostgreSQL PostGIS index for storing active driver locations?
- A) Because PostgreSQL does not support geospatial data types
- B) Because Redis keeps the entire index in memory, sustaining 100K+ writes/sec with sub-millisecond latency; PostGIS requires disk I/O and cannot match this write throughput without heavy sharding
- C) Because Redis has better replication support than PostgreSQL

Correct Answer: B
Why is geohashing used for driver location partitioning instead of a k-d tree?
- A) Because geohash strings are easier to display in a UI
- B) Because k-d trees are not supported in distributed systems
- C) Because k-d trees require re-balancing during concurrent inserts, making them operationally expensive at 500K concurrent writers; geohash cells are stateless and each shard owns a fixed geographic prefix with no coordination needed

Correct Answer: C
Why are WebSockets used for in-trip location streaming instead of HTTP long-polling or regular polling?
- A) WebSockets have a lower per-message overhead once the connection is established; 80K active trips polling every 5 seconds would generate 16K+ HTTP requests/second, while WebSockets reduce this to 80K persistent connections with server-initiated pushes
- B) Because polling is not supported by mobile operating systems
- C) Because WebSockets automatically re-connect after network failures

Correct Answer: A
What is the purpose of `SET driver:lock:{driverId} tripId NX EX 30` in the driver matching flow?
- A) To cache the driver's profile so the Matching Service can read it without hitting PostgreSQL
- B) To atomically lock a driver to a single trip: the NX flag ensures only the first Matching Service instance to execute this command succeeds, preventing two concurrent requests from double-booking the same driver
- C) To store the trip's state in Redis so the Trip Service can read it faster

Correct Answer: B
Open-ended: A rider in San Francisco requests a ride. The GEORADIUS query returns 3 candidate drivers at distances of 0.4km, 0.9km, and 1.7km. The Matching Service picks the closest one (0.4km), but the lock attempt fails because another ride request just locked that driver. Walk through exactly what happens next β lock fallback, ETA recalculation, trip creation, and driver notification. What happens if all 3 candidates are locked?