
System Design HLD Example: Ride Sharing (Uber)

A practical interview-ready HLD for a ride-sharing platform with real-time matching and location tracking.


TLDR: A ride-sharing platform is a real-time geospatial matching system. Drivers push GPS location every 5 seconds into a Redis geospatial index. When a rider requests a trip, the Matching Service runs a GEORADIUS query to find nearby available drivers, selects the best candidate, and notifies both parties via push. A Trip State Machine (REQUESTED → MATCHED → PICKED_UP → IN_PROGRESS → COMPLETED) enforces consistency. During the trip, WebSockets stream live driver location to the rider app. The hardest part is not matching — it is keeping 500K concurrent driver locations fresh at scale.

Uber processes roughly 5 million ride requests per day. That is about 58 per second on average — but during peak demand (Friday evening, rain, end of concerts), that number spikes 10x. Every second a rider waits for a match is a second they might open a competitor's app instead. Uber's internal target: match a rider to a driver in under 2 seconds from the moment the request lands.

How do you architect a system that does this reliably? The answer is not a clever algorithm — it is a set of deliberate architectural decisions: geospatial indexing in Redis, geohash-based partitioning, a strict trip state machine, and WebSocket-based location streaming. This post walks through each decision and explains why it was made.

By the end of this walkthrough you will understand why Redis outperforms a relational database for driver location storage, why geohashing beats a k-d tree for geographic partitioning at scale, and why WebSockets are the right protocol for in-trip location streaming rather than polling.

📖 The Matching Problem — Why 5M Rides/Day Is Hard

Use Cases

| Actor | Role |
| --- | --- |
| Rider | Opens app, sets destination, requests a ride |
| Driver | Accepts or declines match; updates location every 5 seconds |
| Matching Service | Finds nearby available drivers and assigns the best candidate |
| Location Service | Ingests driver GPS pings; maintains the live geospatial index |
| Trip Service | Owns the trip state machine; is the source of truth for trip state |
| Notification Service | Pushes match confirmation and status updates to rider/driver apps |
| Payment Service | Charges the rider after trip completion; processes driver payout |

Core user journeys in scope:

  • Rider requests a ride → Matching Service finds nearest available driver → both parties are notified → driver picks up rider → trip ends → payment is processed → ride appears in history.
  • Driver app sends a GPS ping every 5 seconds → Location Service writes to Redis geospatial index → stale entries expire automatically.
  • Rider app streams live driver location on a map during the trip via WebSocket.

Out of Scope (v1 boundary)

  • Surge pricing algorithm internals (supply/demand ratio per geohash cell is noted in the design; the pricing model itself is a separate post).
  • Driver background check and onboarding pipeline.
  • Carpooling / shared ride matching.
  • Driver earnings settlement and tax reporting.

Naming non-goals explicitly is one of the highest-signal moves in an interview. It shows architectural maturity — you are scoping the problem, not avoiding it.

πŸ” Functional Requirements and Design Goals

Non-Functional Requirements

| Dimension | Target | Why it matters |
| --- | --- | --- |
| Matching latency | p99 < 2s from ride request to driver notification | Rider cancels after ~2s without response |
| Location freshness | Driver position no older than 10s in the index | Stale positions cause wrong ETAs and failed matches |
| Availability | 99.99% for trip creation; 99.9% for ride history | A failed trip creation is revenue lost |
| Scalability | 500K concurrent active drivers; 5M rides/day | Location index must handle ~100K writes/sec at peak |
| Consistency | Trip state transitions must be idempotent — no double-booking | Double-matching one driver to two riders destroys trust |

Design Goals (specific and measurable)

  • Driver location index must serve nearest-driver queries in < 50ms for 500K active drivers.
  • Trip state machine must guarantee at-most-once driver assignment — a driver is locked to one rider at a time.
  • Location updates from 500K drivers arriving every 5 seconds must sustain 100K writes/sec without degrading read latency.
  • WebSocket connections must be maintained for all active trips concurrently — peak estimate: ~80K concurrent trips during Friday evening.

βš™οΈ Capacity Estimation β€” From 5M Rides to Concrete QPS Numbers

Sizing before designing is what separates interview answers from guesses.

| Metric | Assumption | Derivation |
| --- | --- | --- |
| Rides per day | 5M | Given |
| Peak rides/sec | ~580 | 5M / 86400 × 10x peak factor |
| Active drivers (peak) | 500K | ~10% of 5M daily riders × driver:rider ratio of ~1:5 |
| Location update frequency | Every 5s per driver | GPS ping interval in driver app |
| Location write QPS | ~100K writes/sec | 500K drivers / 5s |
| Matching requests/sec | ~580/sec | One per ride request (same as peak rides/sec) |
| Driver record size | ~100 bytes | driver_id (16B) + lat/lng (16B) + timestamp (8B) + status (4B) + padding |
| Active driver index size | ~50MB | 500K drivers × 100 bytes |
| Trip record size | ~512 bytes | IDs, status, timestamps, route metadata |
| Trips stored (1 year) | ~0.9TB | 5M/day × 365 × 512B |

Key insight: The location index is tiny (~50MB) but write-heavy (~100K/sec). This points directly at Redis, not PostgreSQL — the index must live entirely in memory, and Redis geospatial commands are purpose-built for this workload.
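
The arithmetic behind the table fits in a few lines of plain Java if you want to sanity-check the derivations yourself (no external dependencies; numbers are the assumptions from the table above):

// Back-of-the-envelope check of the capacity table.
public class CapacityEstimate {
    public static void main(String[] args) {
        long ridesPerDay = 5_000_000L;
        double avgRidesPerSec = ridesPerDay / 86_400.0;           // ~58/sec average
        double peakRidesPerSec = avgRidesPerSec * 10;             // ~580/sec with a 10x peak factor

        long activeDrivers = 500_000L;
        double locationWritesPerSec = activeDrivers / 5.0;        // one GPS ping per driver every 5s -> ~100K/sec

        long driverRecordBytes = 100;
        double indexSizeMB = activeDrivers * driverRecordBytes / 1e6;          // ~50 MB in-memory index

        long tripRecordBytes = 512;
        double tripStorageTBPerYear = ridesPerDay * 365.0 * tripRecordBytes / 1e12;   // ~0.9 TB/year

        System.out.printf("avg rides/sec      ~%.0f%n", avgRidesPerSec);
        System.out.printf("peak rides/sec     ~%.0f%n", peakRidesPerSec);
        System.out.printf("location writes/s  ~%.0f%n", locationWritesPerSec);
        System.out.printf("geo index size     ~%.0f MB%n", indexSizeMB);
        System.out.printf("trip storage/year  ~%.2f TB%n", tripStorageTBPerYear);
    }
}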

📊 High-Level Architecture — How the Core Services Fit Together

The system decomposes into six services with clear ownership boundaries. No service owns two distinct concerns.

flowchart TD
    RA[Rider App] -->|POST /trips| AG[API Gateway]
    DA[Driver App] -->|PUT /location| AG
    AG -->|Route ride request| MS[Matching Service]
    AG -->|Route location update| LS[Location Service]
    MS -->|GEORADIUS query| LS
    MS -->|Lock driver + create trip| TS[Trip Service]
    TS -->|Trigger notification| NS[Notification Service]
    TS -->|Charge on completion| PS[Payment Service]
    LS -->|GEOADD driver position| RD[(Redis Geo Index)]
    TS -->|Trip state writes| DB[(PostgreSQL)]
    NS -->|Push to app| FCM[FCM / APNs]
    MS -->|ETA calculation| ETA[ETA Service]

This diagram shows the write path (driver location update → Redis) and the read path (ride request → Matching Service → Redis GEORADIUS → Trip Service → notification) running through independent service paths. The API Gateway routes the two most frequent request types — location updates and ride requests — to separate services, so a spike in one does not degrade the other.

Service ownership summary:

| Service | Owns | Storage |
| --- | --- | --- |
| Location Service | Driver GPS index; geospatial writes/reads | Redis Geo |
| Matching Service | Nearest-driver query; driver selection logic; driver lock | Calls Location Service + Trip Service |
| Trip Service | Trip state machine; source of truth for all trip state | PostgreSQL |
| Notification Service | Push delivery to rider/driver apps | Kafka + FCM/APNs |
| Payment Service | Post-trip charge; driver payout | PostgreSQL + Stripe |
| ETA Service | Real-time ETA from driver to pickup point | Map routing API |

🧠 Deep Dive: Location Tracking — Geohash and the Redis Geo Index

The location index is the most performance-sensitive component in the system. At 100K writes/second, it must be entirely in memory — a relational database with disk I/O cannot sustain this write rate without horizontal sharding complexity.

Internals: How Redis Geo Stores Driver Positions

Redis Geo is built on a sorted set. Every driver's position is stored as a member whose score is a 52-bit integer derived from a Geohash of the latitude/longitude pair. Because Geohash encodes nearby points with similar integer values, Redis can answer radius queries with sorted-set range operations β€” scanning only the integer range that covers the target cell rather than the entire keyspace.

When a driver sends GEOADD drivers:active -122.4194 37.7749 drv_abc123, Redis:

  1. Encodes (-122.4194, 37.7749) into a 52-bit Geohash integer.
  2. Calls ZADD drivers:active <score> drv_abc123 internally.
  3. Returns in under 1ms (in-memory operation, no disk I/O).

GEORADIUS reverses the process: it converts the center point and radius into a set of contiguous Geohash integer ranges and runs sorted-set range scans over each range, collecting members whose decoded coordinates fall within the actual radius. This is O(N+log(M)) where M is total members and N is the result count β€” fast enough for 500K drivers.
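
Because GEOADD is just a specialised ZADD, the stored 52-bit score is visible through ordinary sorted-set commands. A minimal sketch, assuming a local Redis and the Jedis client (exact package paths vary by client version):

import redis.clients.jedis.Jedis;

// Illustrative: GEOADD stores the member in a plain sorted set, so ZSCORE
// exposes the 52-bit geohash integer Redis derived from the coordinates.
public class GeoScoreDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // GEOADD key longitude latitude member
            jedis.geoadd("drivers:active", -122.4194, 37.7749, "drv_abc123");

            // Same member, read back through a non-geo sorted-set command
            Double score = jedis.zscore("drivers:active", "drv_abc123");
            System.out.println("52-bit geohash score: " + score.longValue());
        }
    }
}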

Performance Analysis: Sustaining 100K Location Writes per Second

| Scenario | Write QPS | p99 write latency | Viable? |
| --- | --- | --- | --- |
| Single Redis node | 100K | < 1ms | ✅ Yes (Redis handles 200K+ simple ops/sec on modern hardware) |
| PostgreSQL PostGIS single node | 100K | 15–40ms | ❌ No — index maintenance under this write rate blocks read queries |
| Redis cluster (geohash-sharded) | 1M+ | < 1ms | ✅ Yes — horizontal scale at geographic boundaries |
A single Redis node comfortably handles the average case (100K writes/sec). The peak spike risk (10x = 1M writes/sec) motivates the geohash-sharded cluster described in the Location Service Bottleneck section below. The key insight: the location index sees few reads during matching (one GEORADIUS per ride request ≈ 580/sec at peak) but heavy writes for location updates (100K/sec). Separating the two workloads is not necessary: Redis executes commands on a single thread with sub-millisecond latency, so GEOADD writes and GEORADIUS reads interleave on the same sorted set without contention, and the combined rate stays within a single node's capacity.

Why Geohash?

A geohash encodes a latitude/longitude pair into a short alphanumeric string. Strings with the same prefix share the same geographic region. This makes range queries — "find all drivers within 1km of this point" — reducible to string prefix lookups rather than two-dimensional range scans.

# Geohash example
San Francisco (37.7749° N, 122.4194° W)  →  geohash: "9q8yy"
1 block north (37.7759° N, 122.4194° W)  →  geohash: "9q8yy"  (same cell)
5km east     (37.7749° N, 122.3694° W)   →  geohash: "9q8yz"  (adjacent cell, same prefix)

Precision 5 (5-char geohash) → ~4.9km × 4.9km cell
Precision 6 (6-char geohash) → ~1.2km × 0.6km cell  ← good for city-block resolution

Redis implements geohashing internally when you store a point with GEOADD. The stored score in the sorted set is the 52-bit integer geohash. Proximity queries (GEORADIUS) use this encoding to scan only the relevant cells instead of the entire keyspace.
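
To make the prefix property concrete, here is a minimal string-geohash encoder in plain Java. It is an illustrative implementation of the standard base32 interleaving scheme, not the binary form Redis computes internally:

// Minimal geohash encoder: even bits refine longitude, odd bits refine latitude,
// five bits per base32 character.
public class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lng, int precision) {
        double latMin = -90, latMax = 90, lngMin = -180, lngMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true;
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lngMin + lngMax) / 2;
                if (lng >= mid) { ch = (ch << 1) | 1; lngMin = mid; } else { ch = ch << 1; lngMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; } else { ch = ch << 1; latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(37.7749, -122.4194, 5));  // "9q8yy" -- downtown San Francisco
        System.out.println(encode(37.7759, -122.4194, 5));  // nearby point, same cell and prefix
    }
}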

Write Path: Driver Location Update

Every 5 seconds, the driver app sends a GPS ping:

PUT /location
{ "driver_id": "drv_abc123", "lat": 37.7749, "lng": -122.4194 }

The Location Service handles this with two Redis operations:

# Store or update driver position in the geospatial index
GEOADD drivers:active -122.4194 37.7749 drv_abc123

# Set an expiry key so stale drivers auto-evict after 30s of silence
SET driver:heartbeat:drv_abc123 1 EX 30

A background job watches for expired heartbeat keys and removes the corresponding driver from drivers:active with ZREM. This prevents stale entries from appearing in match queries.

Why EX 30 and not EX 5? A single missed GPS ping (network hiccup, app backgrounding) should not evict an active driver. 30 seconds = 6 missed pings β€” if a driver hasn't pinged in 30 seconds they have genuinely gone offline.
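
Put together, the write path is a handful of lines. A sketch assuming the Jedis client; the class and method names are illustrative:

import redis.clients.jedis.Jedis;

// Illustrative write-path handler: one GEOADD plus one heartbeat key per GPS ping.
public class LocationService {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void handlePing(String driverId, double lat, double lng) {
        // GEOADD key longitude latitude member -- updates the driver's position in place
        jedis.geoadd("drivers:active", lng, lat, driverId);
        // Heartbeat key: if no ping arrives for 30s, the key expires and the sweeper evicts the driver
        jedis.setex("driver:heartbeat:" + driverId, 30, "1");
    }

    // Called by the background job when a heartbeat key has expired
    public void evict(String driverId) {
        jedis.zrem("drivers:active", driverId);
    }
}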

Read Path: Nearest-Driver Query During Matching

# Find all available drivers within 2km of the rider's pickup point
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10

# Returns: ["drv_abc123", "drv_xyz789", ...]
# ASC = sorted by distance; COUNT 10 = return up to 10 candidates

The Matching Service takes these candidates, filters to status = AVAILABLE (checked in the Trip Service), requests ETAs from the ETA Service, and picks the driver with the shortest ETA to the pickup point — not the geographically nearest one, because road topology matters.
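
A sketch of that candidate-selection step, again assuming the Jedis client (package paths vary by version); EtaClient and isAvailable() stand in for the ETA Service and Trip Service calls:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import redis.clients.jedis.GeoUnit;
import redis.clients.jedis.GeoRadiusResponse;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.GeoRadiusParam;

// Illustrative: radius query in Redis, then ETA-based ranking of the candidates.
public class MatchingService {
    interface EtaClient { int etaSeconds(String driverId, double lat, double lng); }

    private final Jedis jedis;
    private final EtaClient eta;

    MatchingService(Jedis jedis, EtaClient eta) { this.jedis = jedis; this.eta = eta; }

    public Optional<String> pickDriver(double pickupLat, double pickupLng) {
        // GEORADIUS drivers:active <lng> <lat> 2 km ASC COUNT 10
        List<GeoRadiusResponse> candidates = jedis.georadius(
                "drivers:active", pickupLng, pickupLat, 2, GeoUnit.KM,
                GeoRadiusParam.geoRadiusParam().sortAscending().count(10));

        return candidates.stream()
                .map(GeoRadiusResponse::getMemberByString)
                .filter(this::isAvailable)                     // Trip Service status check
                .min(Comparator.comparingInt(d -> eta.etaSeconds(d, pickupLat, pickupLng)));
    }

    private boolean isAvailable(String driverId) { /* call Trip Service */ return true; }
}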

🧠 Deep Dive: Driver Matching — Locking the Driver Atomically

Selecting a driver from the GEORADIUS result is not enough. Two concurrent ride requests can both select the same driver. The system needs an atomic driver lock before assigning the trip.

sequenceDiagram
    participant R as Rider App
    participant MS as Matching Service
    participant LS as Location Service
    participant TS as Trip Service
    participant RD as Redis

    R->>MS: POST /trips (pickup, destination)
    MS->>LS: GEORADIUS pickup 2km ASC COUNT 10
    LS->>RD: GEORADIUS drivers:active
    RD-->>LS: [drv_abc123, drv_xyz789]
    LS-->>MS: candidates list
    MS->>TS: lockDriver(drv_abc123, tripId)
    TS->>RD: SET driver:lock:drv_abc123 tripId NX EX 30
    alt Lock acquired
        TS-->>MS: locked
        MS->>TS: createTrip(tripId, riderId, driverId)
    else Lock already held
        TS-->>MS: conflict
        MS->>TS: lockDriver(drv_xyz789, tripId)
    end

This sequence shows why atomicity matters: SET ... NX EX 30 (set if not exists, expire in 30 seconds) is Redis's atomic compare-and-set. If two Matching Service instances race to lock the same driver, exactly one wins. The loser falls through to the next candidate in the list. The 30-second expiry ensures a crashed Matching Service instance does not hold a lock forever.
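
A sketch of the lock-and-fallback loop, assuming the Jedis client; class and method names are illustrative:

import java.util.List;
import java.util.Optional;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Illustrative: the first candidate whose SET NX EX succeeds wins the lock;
// a crashed Matching Service instance releases its lock automatically after 30s.
public class DriverLocker {
    private final Jedis jedis;

    DriverLocker(Jedis jedis) { this.jedis = jedis; }

    public Optional<String> lockFirstAvailable(List<String> candidates, String tripId) {
        for (String driverId : candidates) {
            // SET driver:lock:{driverId} tripId NX EX 30 -- "OK" only if the key did not exist
            String reply = jedis.set("driver:lock:" + driverId, tripId,
                                     SetParams.setParams().nx().ex(30));
            if ("OK".equals(reply)) {
                return Optional.of(driverId);   // lock acquired; proceed to createTrip()
            }
            // otherwise another instance won the race; fall through to the next candidate
        }
        return Optional.empty();                 // all candidates locked; widen the radius or retry
    }
}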

🧠 Deep Dive: Trip Lifecycle — The State Machine That Keeps Rides Consistent

Every trip passes through a strict set of states. The Trip Service enforces valid transitions — no trip can skip from REQUESTED directly to COMPLETED, and no state transition can run twice (idempotency).

// Trip state machine — only the transitions shown are valid
public enum TripStatus {
    REQUESTED,   // Rider has submitted request; no driver assigned yet
    MATCHED,     // Driver accepted and is en route to pickup
    ARRIVED,     // Driver is at the pickup location
    PICKED_UP,   // Rider is in the vehicle
    IN_PROGRESS, // En route to destination
    COMPLETED,   // Trip ended; awaiting payment
    CANCELLED    // Cancelled by rider, driver, or system timeout
}

// Valid transitions (enforced in TripService.transition())
private static final Map<TripStatus, Set<TripStatus>> VALID_TRANSITIONS = Map.of(
    REQUESTED,   Set.of(MATCHED, CANCELLED),
    MATCHED,     Set.of(ARRIVED, CANCELLED),
    ARRIVED,     Set.of(PICKED_UP, CANCELLED),
    PICKED_UP,   Set.of(IN_PROGRESS),
    IN_PROGRESS, Set.of(COMPLETED),
    COMPLETED,   Set.of(),   // terminal
    CANCELLED,   Set.of()    // terminal
);

Idempotency key: Every state transition request carries a client-generated idempotency_key. The Trip Service stores processed keys in Redis (SET trip:idem:{key} 1 EX 86400). A retry of the same transition returns the same response without re-processing — this prevents double-booking when the Matching Service retries a failed createTrip call.
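
A sketch of the transition guard, reusing the TripStatus enum and VALID_TRANSITIONS map above; the jedis field, the SetParams import, and the elided PostgreSQL write are assumptions:

// Inside TripService, alongside VALID_TRANSITIONS. Assumes a Jedis client field named jedis.
public boolean transition(String tripId, TripStatus from, TripStatus to, String idempotencyKey) {
    // Replay protection: SET ... NX EX 86400 succeeds only the first time this key is seen
    String firstTime = jedis.set("trip:idem:" + idempotencyKey, tripId,
                                 SetParams.setParams().nx().ex(86_400));
    if (firstTime == null) {
        return true;   // retry of an already-processed transition; skip re-processing
    }
    if (!VALID_TRANSITIONS.getOrDefault(from, Set.of()).contains(to)) {
        throw new IllegalStateException(from + " -> " + to + " is not a valid transition");
    }
    // ... persist the new status to PostgreSQL and refresh trip:status:{tripId} in Redis ...
    return true;
}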

State transition table:

| From | To | Trigger | Who calls it |
| --- | --- | --- | --- |
| REQUESTED | MATCHED | Driver accepts | Driver App → Trip Service |
| MATCHED | ARRIVED | Driver marks arrival | Driver App → Trip Service |
| ARRIVED | PICKED_UP | Driver starts trip | Driver App → Trip Service |
| PICKED_UP | IN_PROGRESS | Trip started | Driver App → Trip Service |
| IN_PROGRESS | COMPLETED | Driver ends trip | Driver App → Trip Service |
| REQUESTED / MATCHED / ARRIVED | CANCELLED | Rider/driver cancels or timeout | Rider/Driver App or Scheduler |

🧠 Deep Dive: Real-Time Location Sharing During the Trip

Once a trip is MATCHED, the rider app needs a live map showing the driver's position. Polling every 3 seconds would work but creates unnecessary HTTP overhead: every poll carries a full request/response cycle, plus a fresh TCP and TLS handshake whenever the connection is not reused. WebSockets keep the connection open so the server can push updates as they arrive.

// WebSocket pseudocode — server-side location push handler

onDriverLocationUpdate(driverId, lat, lng):
    tripId = getTripIdForDriver(driverId)          // O(1) Redis lookup
    if tripId is not null and trip.status in [MATCHED, IN_PROGRESS]:
        riderId = getRiderForTrip(tripId)           // O(1) Redis lookup
        wsConn = activeConnections.get(riderId)     // in-memory connection registry
        if wsConn is open:
            wsConn.send({ lat, lng, timestamp })    // push immediately; no polling

// Rider app initiates connection when trip enters MATCHED state
// Connection is closed when trip reaches COMPLETED or CANCELLED

sequenceDiagram
    participant DA as Driver App
    participant LS as Location Service
    participant WS as WebSocket Gateway
    participant RA as Rider App

    DA->>LS: PUT /location (lat, lng) every 5s
    LS->>LS: GEOADD drivers:active
    LS->>WS: publishLocationUpdate(driverId, lat, lng)
    WS->>WS: look up riderId for active trip
    WS-->>RA: push { lat, lng } over WebSocket

The WebSocket Gateway maintains a mapping from tripId → riderWebSocketConnection in memory. When a location update arrives, it looks up the rider's connection and pushes the coordinates. The lookup is O(1). For 80K concurrent active trips, the in-memory table is tiny (~10MB) and the push is sub-millisecond.

Why not Server-Sent Events (SSE)? SSE is unidirectional (server → client only) which suits this use case, but WebSockets are already required for the driver app to send confirmations back. Unifying on WebSockets avoids operating two connection protocols.
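
A sketch of that in-memory registry; the Connection interface stands in for whatever session type the WebSocket framework provides, so the class is illustrative rather than tied to a specific gateway library:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative connection registry for the WebSocket Gateway: tripId -> riderId -> open connection.
public class LocationPushGateway {
    interface Connection { boolean isOpen(); void send(String json); }

    private final Map<String, Connection> connectionsByRider = new ConcurrentHashMap<>();
    private final Map<String, String> riderByTrip = new ConcurrentHashMap<>();

    // Rider app connects when the trip enters MATCHED; entries are removed on COMPLETED/CANCELLED
    public void register(String tripId, String riderId, Connection conn) {
        riderByTrip.put(tripId, riderId);
        connectionsByRider.put(riderId, conn);
    }

    // Called for every driver location update that belongs to an active trip
    public void onDriverLocation(String tripId, double lat, double lng) {
        String riderId = riderByTrip.get(tripId);                        // O(1) lookup
        Connection conn = (riderId == null) ? null : connectionsByRider.get(riderId);
        if (conn != null && conn.isOpen()) {
            conn.send(String.format("{\"lat\":%.6f,\"lng\":%.6f,\"ts\":%d}",
                                    lat, lng, System.currentTimeMillis()));
        }
    }
}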

🌍 Real-World Applications: Uber, Lyft, and DiDi at Scale

The same core patterns — Redis geospatial index, geohash partitioning, trip state machines — appear across all major ride-sharing platforms, each adapted to their specific operational constraints.

| Platform | Location Index | Matching Approach | Notable Design Decision |
| --- | --- | --- | --- |
| Uber | Redis Geo with geohash-sharded clusters per city | ETA-based matching (not pure distance) | H3 hexagonal grid (open-sourced) for surge pricing zone calculations |
| Lyft | Redis Geo + PostgreSQL PostGIS for cold path | Supply/demand heat map per geohash cell | Separate services for airport queue matching vs. standard matching |
| DiDi | Distributed Redis cluster across regions | AI-assisted predictive positioning | Pre-positions drivers near predicted demand zones 5–10 minutes ahead |

Real-world incident: Uber's geohash boundary bug. In early deployments, riders near the edge of a geohash cell (e.g., exactly on the boundary between 9q8yy and 9q8yz cells) were sometimes unmatched because a GEORADIUS query on one shard did not include drivers in the adjacent cell. The fix: Matching Service queries the target cell and all 8 adjacent geohash cells for riders within 200m of a cell boundary. This "boundary fan-out" adds at most one extra Redis query for ~15% of ride requests.

Uber H3 (Hexagonal Hierarchical Spatial Index): Uber open-sourced their H3 library specifically because traditional rectangular geohash cells have a distance distortion problem — corner-to-corner distance is 41% longer than center-to-center distance within the same cell. H3 hexagons are more uniform. In practice, H3 is used for surge pricing zone calculations; the matching layer still uses Redis Geo's built-in Geohash because the GEORADIUS command handles the boundary overlap automatically.

Input:  Rider at (37.7749, -122.4194) requests a ride
Process:
  1. GEORADIUS on primary shard (9q8) → 8 candidate drivers
  2. Boundary check: rider is 180m from 9q8/9q9 cell boundary
  3. Fan-out: GEORADIUS on adjacent shard (9q9) → 2 additional candidates
  4. Merge candidates, calculate ETAs, select driver with min ETA
Output: drv_abc123 matched (0.4km away, ETA 2 min)

πŸ—ƒοΈ Data Model β€” Tables, Indexes, and Sharding Keys

The relational database owns durable, auditable records. The Redis layer owns ephemeral, high-throughput state.

PostgreSQL Schema

CREATE TABLE trips (
    trip_id         UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    rider_id        UUID         NOT NULL,
    driver_id       UUID,        -- NULL until MATCHED
    status          TEXT         NOT NULL DEFAULT 'REQUESTED',
    pickup_lat      DECIMAL(9,6) NOT NULL,
    pickup_lng      DECIMAL(9,6) NOT NULL,
    dest_lat        DECIMAL(9,6) NOT NULL,
    dest_lng        DECIMAL(9,6) NOT NULL,
    requested_at    TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    matched_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    fare_cents      INT,         -- final fare in cents; NULL until COMPLETED
    idempotency_key TEXT         UNIQUE
);
-- Shard key: rider_id — keeps all trips for a rider on one shard for history queries
CREATE INDEX idx_trips_rider     ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver    ON trips (driver_id, requested_at DESC);
CREATE INDEX idx_trips_status    ON trips (status) WHERE status NOT IN ('COMPLETED','CANCELLED');

CREATE TABLE drivers (
    driver_id       UUID         PRIMARY KEY,
    name            TEXT         NOT NULL,
    phone           TEXT         NOT NULL,
    vehicle_id      UUID         NOT NULL,
    rating          DECIMAL(3,2) DEFAULT 5.00,
    status          TEXT         NOT NULL DEFAULT 'OFFLINE'  -- OFFLINE | AVAILABLE | ON_TRIP
);

CREATE TABLE riders (
    rider_id        UUID         PRIMARY KEY,
    name            TEXT         NOT NULL,
    phone           TEXT         NOT NULL,
    email           TEXT         UNIQUE NOT NULL,
    default_payment TEXT         -- payment method token
);

CREATE TABLE location_snapshots (
    snapshot_id     BIGSERIAL    PRIMARY KEY,
    driver_id       UUID         NOT NULL,
    trip_id         UUID,        -- NULL for idle pings
    lat             DECIMAL(9,6) NOT NULL,
    lng             DECIMAL(9,6) NOT NULL,
    recorded_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
-- Partitioned by recorded_at (monthly); retained for 90 days for dispute resolution
CREATE INDEX idx_location_driver ON location_snapshots (driver_id, recorded_at DESC);

Redis Key Layout

| Key Pattern | Value | TTL | Purpose |
| --- | --- | --- | --- |
| drivers:active | Geo sorted set (all active drivers) | No TTL (ZREM on eviction) | GEORADIUS queries |
| driver:heartbeat:{driverId} | "1" | 30s | Auto-evicts stale drivers |
| driver:lock:{driverId} | tripId | 30s | Atomic driver assignment lock |
| trip:idem:{idempotencyKey} | tripId | 24h | Idempotent trip creation |
| trip:status:{tripId} | "MATCHED" etc. | 2h | Fast status reads for WebSocket gateway |

βš–οΈ Trade-offs, Failure Modes, and Scaling Decisions

The Location Service Bottleneck

At 100K writes/second, a single Redis instance is the obvious bottleneck. Redis can handle ~200K-400K simple SET operations per second on modern hardware, so a single node can survive normal load — but not a 10x peak spike.

Mitigation: Geohash-based partitioning. Shard the drivers:active key by geohash prefix. Cities map naturally to geohash cells at precision 3–4 (roughly 156km × 156km down to 39km × 20km cells). Each shard owns a geographic region:

| Shard | Geohash prefix | Region |
| --- | --- | --- |
| redis-shard-0 | 9q8 | San Francisco |
| redis-shard-1 | dr5 | New York City |
| redis-shard-2 | gcp | London |

A GEORADIUS query only touches the shard(s) for the relevant geographic cell(s), and writes are naturally distributed across shards by city. Cross-cell queries (a rider near the boundary of two geohash cells) require queries to adjacent shards — the Matching Service handles this with a two-shard fan-out for border cases.
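
A sketch of the shard routing, reusing the Geohash.encode() helper sketched earlier; the shard hostnames and prefixes mirror the example table above and are illustrative:

import java.util.Map;
import redis.clients.jedis.Jedis;

// Illustrative router: the first 3 geohash characters pick the Redis shard that owns the city.
public class GeoShardRouter {
    private final Map<String, Jedis> shardsByPrefix = Map.of(
        "9q8", new Jedis("redis-shard-0", 6379),   // San Francisco
        "dr5", new Jedis("redis-shard-1", 6379),   // New York City
        "gcp", new Jedis("redis-shard-2", 6379));  // London

    public Jedis shardFor(double lat, double lng) {
        String prefix = Geohash.encode(lat, lng, 3);
        Jedis shard = shardsByPrefix.get(prefix);
        if (shard == null) throw new IllegalArgumentException("no shard owns prefix " + prefix);
        return shard;
    }

    public void writeDriverPosition(String driverId, double lat, double lng) {
        shardFor(lat, lng).geoadd("drivers:active", lng, lat, driverId);
    }
}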

Trade-off Summary

| Decision | Chosen | Alternative | Why |
| --- | --- | --- | --- |
| Driver location storage | Redis Geo | PostgreSQL PostGIS | PostGIS can't sustain 100K writes/sec without heavy connection pooling and sharding; Redis keeps the index in memory |
| Geographic indexing | Geohash (Redis native) | k-d tree | k-d trees require re-balancing under concurrent inserts — operationally expensive at 500K concurrent writers; geohash partitioning is stateless and trivially shardable |
| Matching notification | Push (FCM/APNs) | SMS | Sub-second push latency; SMS has 1-10s delivery variance |
| In-trip location updates | WebSocket | HTTP polling | Polling at 5s interval = 16K+ HTTP requests/sec for 80K active trips; WebSocket collapses this to one connection per trip |
| Trip state storage | PostgreSQL | Redis | Trip history is auditable, financial-grade data — it must survive cache eviction |

Surge Pricing — A Brief Note

The Location Service continuously maintains a supply/demand ratio per geohash cell: active_drivers / open_requests in each cell. When this ratio drops below a threshold (e.g., 0.5 available drivers per pending request), the Pricing Service multiplies the base fare. The cell granularity (precision 5, ~5km cells) is coarser than matching (precision 6) so surge zones cover meaningful geographic areas rather than single blocks.
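
A minimal sketch of that check; the threshold is the example value from the paragraph above, and the actual multiplier model stays out of scope:

// Illustrative supply/demand check for one precision-5 geohash cell.
public class SurgeSignal {
    private static final double SUPPLY_THRESHOLD = 0.5;   // available drivers per pending request

    // True when the cell should be flagged to the Pricing Service for surge pricing
    public boolean isSurging(int activeDrivers, int openRequests) {
        if (openRequests == 0) return false;               // no demand, no surge
        double supplyRatio = (double) activeDrivers / openRequests;
        return supplyRatio < SUPPLY_THRESHOLD;
    }
}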

🧭 Architecture Decision Guide — Choosing the Right Tool for Each Concern

| Situation | Recommendation |
| --- | --- |
| High-frequency ephemeral updates (location, session state) | Redis — in-memory, sub-millisecond latency, TTL-based auto-eviction |
| Durable financial or auditable records (trips, payments) | PostgreSQL — ACID, point-in-time recovery, audit log |
| Geographic radius queries at 100K+ writes/sec | Redis GEORADIUS — native geospatial, sorted-set backed, no separate spatial index |
| Complex polygon queries or persistent geo data (routing, boundaries) | PostGIS — full spatial SQL, persistent, ACID |
| Real-time server→client data push (live map, status updates) | WebSockets — bidirectional, persistent connection, eliminates polling overhead |
| Occasional status checks (ride history, profile) | REST HTTP — stateless, cacheable, operationally simpler than WebSockets |
| Driver assignment under concurrent Matching Service instances | Redis SET NX EX (atomic lock) — compare-and-set in one round-trip, no distributed lock manager needed |
| Scale beyond single Redis node for location index | Geohash-based sharding (by city/region) — stateless partitioning, no cross-shard coordination for radius queries within a city |

When NOT to use Redis for primary data:

  • Any data that must survive a Redis eviction or crash without a separate durable write path.
  • Trip state, payment records, and driver earnings — these must be in PostgreSQL even if cached in Redis.
  • Data requiring complex relational queries (joins, aggregations over months of data) — Redis is a cache and index, not a query engine.

🧪 Interview Walkthrough — How to Deliver This Design in 45 Minutes

A repeatable framework for presenting this design under interview time pressure:

Minute 0–5: Scope and requirements

  • State the scale: 5M rides/day, 500K active drivers, location updates every 5 seconds.
  • Identify the single hardest constraint: "keeping 500K driver positions fresh at 100K writes/sec while serving sub-50ms radius queries."
  • Name non-goals explicitly: surge pricing internals, driver onboarding, carpooling.

Minute 5–10: Capacity estimation (3 numbers that matter)

  • 100K location writes/sec → need in-memory storage.
  • ~580 ride requests/sec at peak → matching is fast but needs atomic driver locking.
  • ~80K concurrent WebSocket connections during Friday evening peak → manageable for a mid-size gateway cluster.

Minute 10–20: Draw the HLD

  • Six services: Location, Matching, Trip, Notification, Payment, ETA.
  • Two write paths: (1) driver GPS → Location Service → Redis, (2) ride request → Matching Service → Trip Service → DB.
  • One read path: Matching Service → Redis GEORADIUS → driver candidates.

Minute 20–35: Walk through three critical decisions

  1. Why Redis Geo? GEORADIUS on 500K drivers in < 5ms; PostgreSQL PostGIS cannot sustain 100K writes/sec without blocking reads.
  2. Why geohash partitioning? Stateless sharding by city prefix; k-d trees need re-balancing under concurrent inserts.
  3. Why WebSockets for in-trip tracking? 80K concurrent trips × 5s polling = 16K HTTP requests/sec just for location; WebSockets collapse this to 80K persistent connections.

Minute 35–40: Trip state machine and consistency

  • Draw the state diagram: REQUESTED → MATCHED → ARRIVED → PICKED_UP → IN_PROGRESS → COMPLETED.
  • Emphasise: each transition is idempotent (idempotency key in Redis), and driver assignment uses SET NX EX to prevent double-booking.

Minute 40–45: Scaling evolution

  • Phase 1: Single Redis node per region. Handles average load.
  • Phase 2: Geohash-sharded Redis cluster. Handles 10x peak.
  • Phase 3: Separate WebSocket Gateway cluster with horizontal scaling. Handles 500K+ concurrent connections.

A strong closing line: "The core architectural principle is separation of concerns between ephemeral, high-frequency state (Redis) and durable, auditable state (PostgreSQL) — every component decision flows from that split."

πŸ› οΈ Redis Geo: How It Powers Driver Location Matching

Redis Geo is a set of commands layered on top of Redis Sorted Sets. Every member's score is a 52-bit Geohash integer derived from the latitude/longitude pair. This lets Redis answer proximity queries using sorted-set range operations internally — no separate spatial index is needed.

Core commands:

# Add or update a driver's position
GEOADD drivers:active -122.4194 37.7749 drv_abc123
# Syntax: GEOADD key longitude latitude member
# Returns: 1 (added) or 0 (updated existing)

# Find drivers within 2km of a rider's pickup point, sorted by distance
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10 WITHDIST
# Returns:
# 1) 1) "drv_abc123"
#    2) "0.1832"       <- distance in km
# 2) 1) "drv_xyz789"
#    2) "1.4210"

# Get stored position of a specific driver
GEOPOS drivers:active drv_abc123
# Returns: [[-122.41940081119537, 37.774898917926234]]

# Get distance between two drivers
GEODIST drivers:active drv_abc123 drv_xyz789 km
# Returns: "1.2748"

# Remove a driver from the index (goes offline or heartbeat expired)
ZREM drivers:active drv_abc123

Why Redis Geo over a purpose-built spatial database (PostGIS, Elasticsearch Geo)?

| Criterion | Redis Geo | PostGIS | Elasticsearch Geo |
| --- | --- | --- | --- |
| Write latency | < 1ms (in-memory) | 5–20ms (disk) | 5–15ms (index overhead) |
| Read latency (GEORADIUS) | 1–5ms | 10–50ms | 10–30ms |
| Throughput | 200K+ ops/sec/node | ~10K writes/sec | ~20K writes/sec |
| Persistence | Optional RDB/AOF | Full ACID | Full with replicas |
| Best for | Ephemeral, high-freq updates | Persistent geo data, complex queries | Full-text + geo hybrid |

For driver location — ephemeral, high-frequency writes paired with simple radius reads — Redis Geo is the correct tool. For permanent route geometry or complex polygon queries, PostGIS is the right choice.

For a deeper dive on Redis commands and data structures, see the companion post System Design Caching and Asynchronism.

📚 Lessons Learned

  • The location index is tiny but write-intensive. 500K drivers × 100 bytes = ~50MB. The challenge is not storage — it is sustaining 100K writes/second without degrading read latency. Memory-first storage (Redis) is the only viable choice at this write rate without complex multi-node PostgreSQL sharding.
  • Geohash partitioning is operationally simpler than k-d trees. K-d trees are theoretically faster for nearest-neighbor queries, but they require re-balancing during concurrent inserts. Geohash cells are stateless — each shard owns a fixed geographic prefix, and no shard needs to know about its neighbours.
  • Driver locking must be atomic. The single biggest consistency risk in the matching flow is two Matching Service instances selecting the same driver. SET ... NX EX in Redis gives you an atomic compare-and-set that costs one network round-trip. Without this, double-booking is a race condition that is very hard to reproduce in testing and very visible to users in production.
  • Trip state machines prevent invalid state transitions. Without an enforced state machine, a driver can accidentally send a COMPLETED event before PICKED_UP, or a retry can double-process a MATCHED event. The state machine is the contract between every service that touches a trip.
  • WebSockets win over polling for in-trip updates at scale. At 80K concurrent trips with a 5-second update interval, HTTP polling would generate 16K+ requests per second purely for location updates. WebSockets compress that to 80K persistent connections — which modern gateway infrastructure handles trivially.
  • Heartbeat expiry is how you handle driver disconnections gracefully. If a driver's app crashes, the GPS pings stop. A 30-second heartbeat key expiry removes the driver from the active index without requiring an explicit OFFLINE event — the absence of pings is itself the signal.

📌 TLDR: Summary & Key Takeaways

  • Redis Geo (GEOADD / GEORADIUS) is the correct storage layer for the driver location index: in-memory, sub-millisecond write latency, and native radius queries. PostgreSQL cannot sustain 100K location writes/sec without aggressive sharding.
  • Geohash partitioning allows the location index to scale horizontally by geographic region. Each shard owns a geohash prefix cell; GEORADIUS queries only touch the relevant shard(s).
  • Atomic driver locking (SET driver:lock:{id} tripId NX EX 30) prevents double-booking when multiple Matching Service instances race to assign the same driver.
  • The Trip State Machine (REQUESTED → MATCHED → ARRIVED → PICKED_UP → IN_PROGRESS → COMPLETED) enforces valid transitions and, combined with idempotency keys, makes every state transition safe to retry.
  • WebSockets are the right protocol for in-trip location streaming: one persistent connection per rider replaces thousands of polling requests per second at peak load.
  • Heartbeat expiry (30-second TTL on driver:heartbeat:{id}) auto-evicts stale drivers from the index without requiring an explicit offline signal.
  • The hardest engineering problem in ride-sharing is not the matching algorithm — it is keeping half a million driver positions fresh in memory at 100K writes/second while serving sub-50ms GEORADIUS queries concurrently.

πŸ“ Practice Quiz

  1. Why is Redis preferred over a PostgreSQL PostGIS index for storing active driver locations?

    • A) Because PostgreSQL does not support geospatial data types
    • B) Because Redis keeps the entire index in memory, sustaining 100K+ writes/sec with sub-millisecond latency — PostGIS requires disk I/O and cannot match this write throughput without heavy sharding
    • C) Because Redis has better replication support than PostgreSQL
    Correct Answer: B
  2. Why is geohashing used for driver location partitioning instead of a k-d tree?

    • A) Because geohash strings are easier to display in a UI
    • B) Because k-d trees are not supported in distributed systems
    • C) Because k-d trees require re-balancing during concurrent inserts, making them operationally expensive at 500K concurrent writers; geohash cells are stateless and each shard owns a fixed geographic prefix with no coordination needed
    Correct Answer: C
  3. Why are WebSockets used for in-trip location streaming instead of HTTP long-polling or regular polling?

    • A) WebSockets have a lower per-message overhead once the connection is established; 80K active trips polling every 5 seconds would generate 16K+ HTTP requests/second — WebSockets reduce this to 80K persistent connections with server-initiated pushes
    • B) Because polling is not supported by mobile operating systems
    • C) Because WebSockets automatically re-connect after network failures
    Correct Answer: A
  4. What is the purpose of SET driver:lock:{driverId} tripId NX EX 30 in the driver matching flow?

    • A) To cache the driver's profile so the Matching Service can read it without hitting PostgreSQL
    • B) To atomically lock a driver to a single trip — the NX flag ensures only the first Matching Service instance to execute this command succeeds, preventing two concurrent requests from double-booking the same driver
    • C) To store the trip's state in Redis so the Trip Service can read it faster
    Correct Answer: B
  5. Open-ended: A rider in San Francisco requests a ride. The GEORADIUS query returns 3 candidate drivers at distances of 0.4km, 0.9km, and 1.7km. The Matching Service picks the closest one (0.4km), but the lock attempt fails because another ride request just locked that driver. Walk through exactly what happens next — lock fallback, ETA recalculation, trip creation, and driver notification. What happens if all 3 candidates are locked?

Written by Abstract Algorithms (@abstractalgorithms)