All Posts

System Design HLD Example: Hotel Booking System (Airbnb)

A practical interview-ready HLD for a hotel booking platform handling availability, concurrency, and reservations.

Abstract AlgorithmsAbstract Algorithms
ยทยท24 min read
Share
Share on X / Twitter
Share on LinkedIn
Copy link

TLDR: Design a hotel booking system like Airbnb. This article covers availability tracking with date-level inventory, double-booking prevention via optimistic locking, two-phase reservation (hold โ†’ confirm), geospatial search with Elasticsearch, and payment pre-authorisation โ€” with concrete SQL schema, Redis key layout, and a targeted Java @Version snippet.

TLDR: Booking systems must solve one problem above all others: preventing two guests from successfully reserving the same room for the same night, even under high concurrency.

At 11:59 PM on New Year's Eve, two guests click "Book Now" on the same Manhattan apartment. Both see "Available." Both reach the payment screen. Both receive a confirmation email. Their hosts wake up to two strangers at the door on the same night. This is the double-booking race condition, and it is the reason hotel booking systems are a favourite system design interview question.

Airbnb's engineering team encountered this exact failure mode as the platform scaled beyond 100 million nights booked. The fix required rethinking availability tracking at the database row level, introducing a two-phase hold-then-confirm reservation model, and separating the read-heavy search workload from the write-critical booking path.

By the end of this walkthrough you will know why availability tracking uses a date-level inventory table rather than a simple "is booked" flag, why optimistic locking outperforms pessimistic locks for most booking loads, and why search and booking must never share the same data store.

๐Ÿ“– The Double-Booking Problem: Two Guests, One Room, One Disaster

Actors

ActorRole
GuestSearches for rooms, views availability, creates and cancels bookings
HostLists property, manages availability calendar, receives payments
Search ServiceHandles geo-location + date range queries against Elasticsearch
Availability ServiceTracks which dates are booked per room; enforces the hold and confirm lifecycle
Booking ServiceOrchestrates hold โ†’ payment โ†’ confirm with idempotency
Payment ServicePre-authorises card at booking creation; captures charge 48 h before check-in
Notification ServiceSends confirmation emails, host alerts, and cancellation notices

Use Cases

  • Primary interview prompt: Design a hotel or room booking platform like Airbnb or Booking.com.
  • Core user journeys:
    • Search โ€” guest queries rooms by city or coordinates, date range, guest count, and filters (price, amenities, rating)
    • View availability โ€” guest views the per-night calendar for a specific listing before committing
    • Create booking โ€” guest selects dates โ†’ system places a 15-minute hold โ†’ payment pre-auth โ†’ confirmed booking
    • Cancel booking โ€” guest cancels; system releases held or booked slots and triggers a refund per the host's cancellation policy
    • Host availability management โ€” host blocks or opens dates; sets nightly price and minimum stay rules
    • Payment lifecycle โ€” pre-authorise at booking creation; capture 48 h before check-in; refund on cancellation
    • Reviews โ€” post-checkout; guest and host each submit a rating
  • Read and write paths are explained separately to keep bottlenecks and consistency boundaries explicit.

The double-booking race condition is the reason this system appears in nearly every senior engineering interview: it looks like straightforward CRUD until you model two users simultaneously selecting the same room.

๐Ÿ” Functional Requirements: What the System Must Cover

In Scope

  • Room search โ€” GET /search with geo bounding box or geo_distance, date range, guest count, price ceiling
  • Availability calendar โ€” GET /rooms/{id}/availability?from=&to= returning per-date status
  • Create booking โ€” POST /bookings with Idempotency-Key; places hold then awaits payment confirmation
  • Cancel booking โ€” DELETE /bookings/{id}; configurable refund policy per host (flexible, moderate, strict)
  • Payment pre-auth and capture โ€” integrate via Payment Service; pre-auth at hold; capture before check-in
  • Host calendar management โ€” block dates, set nightly pricing, manage minimum stay and advance notice rules

Out of Scope (v1 boundary)

  • Dynamic pricing and revenue optimisation (ML-driven surge pricing algorithms)
  • Fraud scoring and risk model evaluation during booking
  • Multi-currency FX conversion
  • Loyalty programmes and reward redemption

Functional Breakdown

FeatureAPI ContractKey Decision
Room searchGET /search?lat=&lon=&radius=&from=&to=&guests=Elasticsearch for geo + date filtering; Availability Service for confirmed open slots
Availability calendarGET /rooms/{id}/availability?from=&to=availability_slots table; Redis cache with 30 s TTL
Create bookingPOST /bookings + Idempotency-KeyTwo-phase: HOLD slots (version check) โ†’ payment โ†’ CONFIRM; UNIQUE(room_id, slot_date) backstop
CancelDELETE /bookings/{id}Release held or booked slots atomically; enqueue refund per cancellation policy
Host calendarPATCH /rooms/{id}/availabilityBatch update availability_slots; invalidate Redis availability cache

A strong answer names non-goals explicitly. Interviewers use this to judge prioritisation quality and architectural maturity under time constraints.

โš™๏ธ Non-Functional Requirements and Design Goals

DimensionTargetWhy It Matters
CorrectnessZero double-bookings under 10K concurrent requests for the same roomThe financial and reputational cost of a double-booking is unbounded
Availability99.99 % search path; 99.9 % booking pathDowntime during peak booking windows (New Year's Eve, school holidays) directly costs revenue
PerformanceSearch p95 < 200 ms; booking confirmation p95 < 2 sGuests abandon checkout when confirmation is slow; search must feel instant
Scalability150M DAU; 10K search QPS; 500 booking TPS peakPeak booking windows produce sudden 10ร— write spikes on popular listings
ConsistencyStrong consistency for availability slots; eventual for reviews and analyticsA guest must never see "available" for a room that is already held or booked

๐Ÿง  Deep Dive: Estimations and Design Goals

The Internals: Data Model and Locking Strategy

Three tables form the availability backbone. The central insight is that availability is tracked at the date-slot level โ€” each individual night is an independent lockable unit. This design means holding December 31st does not require locking December 30th, which maximises write concurrency for partial-week bookings.

CREATE TABLE rooms (
    room_id         UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    host_id         UUID        NOT NULL,
    title           TEXT        NOT NULL,
    latitude        DECIMAL(9,6) NOT NULL,
    longitude       DECIMAL(9,6) NOT NULL,
    price_per_night INT         NOT NULL,  -- stored in cents; never FLOAT
    max_guests      SMALLINT    NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_rooms_host ON rooms(host_id);

-- One row per room per night โ€” the lockable unit of availability.
-- The `version` column is the optimistic locking counter.
CREATE TABLE availability_slots (
    slot_id     UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    room_id     UUID        NOT NULL REFERENCES rooms(room_id),
    slot_date   DATE        NOT NULL,
    status      TEXT        NOT NULL DEFAULT 'available'
                CHECK (status IN ('available', 'held', 'booked', 'blocked')),
    booking_id  UUID,                 -- NULL when available or blocked
    version     INT         NOT NULL DEFAULT 0,  -- incremented on every UPDATE
    held_until  TIMESTAMPTZ,          -- expiry for held slots; 15-minute TTL
    UNIQUE (room_id, slot_date)       -- one slot per room per night; DB backstop
);
CREATE INDEX idx_slots_room_date   ON availability_slots(room_id, slot_date);
CREATE INDEX idx_slots_held_expiry ON availability_slots(status, held_until)
    WHERE status = 'held';  -- efficient sweep for expired holds

-- Each booking spans one or more availability_slots rows.
CREATE TABLE bookings (
    booking_id      UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    idempotency_key TEXT        UNIQUE NOT NULL,  -- client-supplied dedup anchor
    room_id         UUID        NOT NULL REFERENCES rooms(room_id),
    guest_id        UUID        NOT NULL,
    check_in        DATE        NOT NULL,
    check_out       DATE        NOT NULL,
    status          TEXT        NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending','held','confirmed','cancelled','expired')),
    total_cents     BIGINT      NOT NULL,
    payment_auth_id TEXT,               -- pre-auth reference from Payment Service
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    confirmed_at    TIMESTAMPTZ,
    cancelled_at    TIMESTAMPTZ
);
CREATE INDEX idx_bookings_room_dates ON bookings(room_id, check_in, check_out);
CREATE INDEX idx_bookings_guest      ON bookings(guest_id, created_at DESC);

The version column is the double-booking prevention mechanism. Every UPDATE on a slot increments version by 1. A concurrent UPDATE that reads the old version returns 0 rows โ€” the application layer treats 0 rows as a conflict and retries or rejects the booking attempt. The UNIQUE(room_id, slot_date) constraint is the database-level backstop that catches any conflict the application layer misses.

๐Ÿ“Š Data Model: Key Relationships

erDiagram
    ROOMS {
        uuid room_id PK
        uuid host_id
        text title
        decimal latitude
        decimal longitude
        int price_per_night
        smallint max_guests
    }
    AVAILABILITY_SLOTS {
        uuid slot_id PK
        uuid room_id FK
        date slot_date
        text status
        uuid booking_id FK
        int version
        timestamptz held_until
    }
    BOOKINGS {
        uuid booking_id PK
        text idempotency_key UK
        uuid room_id FK
        uuid guest_id
        date check_in
        date check_out
        text status
        bigint total_cents
        text payment_auth_id
    }
    ROOMS ||--|{ AVAILABILITY_SLOTS : "one slot per night"
    BOOKINGS ||--|{ AVAILABILITY_SLOTS : "holds or books"

Redis key layout:

Key PatternValueTTL
availability:{roomId}:{from}:{to}JSON array of slot statuses30 seconds
booking:lock:{roomId}:{checkIn}:{checkOut}"1" (distributed lock)20 seconds
search:cache:{sha256(query)}JSON search result set5 minutes
idempotency:booking:{key}Serialised booking response JSON24 hours

Estimations

Assumptions for Airbnb-scale:

DimensionAssumptionDerivation
DAU150M~10 % of 1.5B MAU active daily
Search QPS~10K steady; ~26K peak150M ร— 5 searches/day รท 86,400 โ‰ˆ 8.7K steady; 3ร— holiday peak
Booking TPS~500Peak holiday windows; ~1 % of searches convert to a hold
Listings7M rooms worldwideAirbnb scale; grows ~10 % year-over-year
Availability rows~2.5B7M rooms ร— 365 days; pre-populated for 12-month rolling window
Redis availability cache~200 GB7M rooms ร— 28-day window ร— ~1 KB per window
Booking rows per year~180M500 bookings/sec ร— 3,600 s ร— ~100 peak hours/year

Key insight: The search-to-booking QPS ratio is roughly 20:1. This asymmetry drives the single most important architectural decision: Elasticsearch for search (optimised for geo and date read throughput) and PostgreSQL for bookings (optimised for ACID writes with row-level versioning). Mixing both workloads in one database would cause read-write contention during the booking peaks that matter most.

Design Goals

GoalWhy It MattersDecision It Drives
Zero double-bookingsTwo concurrent HOLD requests for the same room and date must not both succeedOptimistic locking (version column); UNIQUE(room_id, slot_date) as the DB backstop
15-minute hold windowGuest must complete checkout without the listing disappearing; abandoned holds must releaseheld_until timestamp + background cleanup job on a 5-minute cron schedule
Search must not query the booking DBBooking DB cannot absorb 10K read QPS from search without degrading write performanceElasticsearch index synced asynchronously from PostgreSQL via CDC (Debezium)
Idempotent booking creationNetwork retries must not create two confirmed bookings for the same checkout attemptidempotency_key UNIQUE constraint at DB level; Redis 24 h response cache as the fast path

Performance Analysis: Bottlenecks and Scaling Pressure

Pressure PointSymptomFirst ResponseSecond Response
Hot listing (popular room)Hundreds of concurrent HOLD attempts โ†’ OCC conflict rate spikesExponential backoff + 3 retries in Booking ServiceRedis SETNX lock per (roomId, dateRange) to serialise hold attempts before DB
Availability cache stalenessGuests see "available" for an already-held room during booking waveShorten cache TTL to 10 s during peak windowPublish slot.status_changed event; cache consumer invalidates the key immediately
Search QPS peakElasticsearch query latency climbs under holiday loadRead replicas behind a load balancerCache identical query hashes for 5 minutes at the Search Service layer
Expired hold accumulationheld rows accumulate and incorrectly suppress availabilityCron job every 5 minutes bulk-updates expired holds back to availableidx_slots_held_expiry partial index makes sweep fast even at 2.5B rows
Booking DB write contentionavailability_slots UPDATE p99 climbs under sustained 500 TPSPartition availability_slots by room_id rangeSeparate read replica for availability calendar reads; primary for writes only

๐Ÿ“Š High-Level Architecture

Five services handle the booking platform. The critical topology decision is that the Search Service and the Booking Service never share a database: Elasticsearch absorbs all geo and date read traffic; PostgreSQL handles only ACID availability writes. Kafka decouples booking confirmation events from notification delivery and Elasticsearch index updates.

graph TD
    Client["๐Ÿง‘ Guest / Host Client"]
    GW["API Gateway\n(Auth ยท Rate Limit ยท Routing)"]
    Search["Search Service\n(Elasticsearch)"]
    Avail["Availability Service\n(PostgreSQL)"]
    Booking["Booking Service\n(PostgreSQL)"]
    Payment["Payment Service\n(Stripe / Adyen)"]
    Notify["Notification Service\n(SES / FCM)"]
    Redis["Redis\n(Availability Cache\nIdempotency Store\nBooking Locks)"]
    ES["Elasticsearch\n(Geo + Date Index)"]
    DB[("PostgreSQL\nBookings + Availability")]
    Kafka["Kafka\n(Event Bus)"]

    Client --> GW
    GW --> Search
    GW --> Booking
    Search --> ES
    Search --> Redis
    Booking --> Avail
    Booking --> Redis
    Avail --> DB
    Booking --> DB
    Booking --> Payment
    Booking --> Kafka
    Kafka --> Notify
    Kafka --> ES

The API Gateway enforces authentication and rate limiting (guests are limited to 10 booking attempts per minute to prevent inventory squatting). It routes search traffic to the Search Service and booking traffic to the Booking Service, keeping the two latency profiles and consistency requirements fully isolated.

Component responsibilities at a glance:

ComponentResponsibilityTechnology
Search ServiceGeo + date + filter queries; returns listing metadataElasticsearch with geo_distance + range filters
Availability ServiceDate-level inventory; hold/book/release lifecyclePostgreSQL; availability_slots with @Version
Booking ServiceOrchestrate hold โ†’ payment โ†’ confirm; idempotencySpring Boot; PostgreSQL; Redis SETNX lock
Payment ServicePre-auth at booking; capture before check-in; refundStripe adapter with client-supplied idempotency keys
Notification ServiceGuest confirmation, host alert, cancellation noticesKafka consumer; async delivery via SES and FCM
RedisAvailability cache (30 s TTL); booking lock (SETNX); idempotency storeRedis Cluster
KafkaDecouple booking events from notifications and ES index updatesTopics: booking.confirmed, booking.cancelled, slot.status_changed

๐ŸŒ Real-World Application: How Airbnb Engineers Eliminated Double-Bookings

Airbnb's early architecture used a naive check-and-book pattern: read availability, then if available create the booking. This is a classic TOCTOU (time-of-check-time-of-use) race condition. Two guests reading at the same millisecond both see the room as available, both write a booking row, and both receive a confirmation email โ€” while the host faces an impossible situation.

The production fix was a two-phase model with optimistic locking on the availability slot row:

Phase 1 โ€” HOLD: The Booking Service attempts to UPDATE all availability_slots rows for the requested dates from available to held, checking the current version value. If any slot's UPDATE returns 0 rows affected (because another request already incremented that version), the hold fails immediately. The guest sees "Room no longer available for selected dates."

Phase 2 โ€” CONFIRM: After successful payment pre-authorisation, the service transitions slot status from held to booked, again with a version check. On cancellation or payment failure, it reverts to available.

// JPA @Version causes Hibernate to include the version in every UPDATE WHERE clause.
// If the version has changed (another request won the race), Hibernate throws
// OptimisticLockingFailureException โ€” the application layer retries up to 3 times.
@Entity
@Table(name = "availability_slots")
public class AvailabilitySlot {

    @Id
    private UUID slotId;

    @Column(nullable = false)
    private UUID roomId;

    @Column(nullable = false)
    private LocalDate slotDate;

    @Enumerated(EnumType.STRING)
    private SlotStatus status;   // AVAILABLE โ†’ HELD โ†’ BOOKED / back to AVAILABLE

    @Version                     // Hibernate increments this on every UPDATE
    private int version;         // Concurrent UPDATE with stale version โ†’ exception

    private Instant heldUntil;   // non-null only when status = HELD
}

// Booking Service: retry up to 3 times with exponential backoff on OCC conflict.
@Retryable(value = OptimisticLockingFailureException.class,
           maxAttempts = 3, backoff = @Backoff(delay = 100, multiplier = 2))
public BookingResult holdDates(UUID roomId, LocalDate checkIn,
                               LocalDate checkOut, UUID bookingId) {
    List<AvailabilitySlot> slots =
        slotRepository.findByRoomIdAndDateRange(roomId, checkIn, checkOut);
    slots.forEach(slot -> {
        if (slot.getStatus() != SlotStatus.AVAILABLE)
            throw new RoomNotAvailableException(slot.getSlotDate());
        slot.setStatus(SlotStatus.HELD);
        slot.setHeldUntil(Instant.now().plus(Duration.ofMinutes(15)));
    });
    return new BookingResult(slotRepository.saveAll(slots));  // version incremented here
}

For the most contested listings โ€” a city-centre apartment listed during a sold-out concert weekend โ€” optimistic locking alone can generate a thundering herd of retries. A Redis SETNX front-door lock on the (roomId, dateRange) key serialises concurrent HOLD attempts before they reach the database, eliminating unnecessary OCC conflicts:

// SETNX (SET if Not eXists) acquires a lock that expires automatically.
// If the lock is already held, the request is rejected immediately rather than
// hammering the DB with 200 concurrent optimistic-lock retries.
String lockKey = String.format("booking:lock:%s:%s:%s",
    roomId, checkIn, checkOut);
Boolean acquired = redisTemplate.opsForValue()
    .setIfAbsent(lockKey, "1", Duration.ofSeconds(20));
if (Boolean.FALSE.equals(acquired)) {
    throw new BookingLockConflictException(
        "Room is being reserved. Please retry in a moment.");
}
// Proceed to the database HOLD inside a try/finally that releases the lock.

This two-layer pattern โ€” Redis front-door lock for hot listings, OCC for the long tail โ€” is the industry-standard approach that allows a platform like Airbnb to absorb sudden booking spikes on popular properties without database deadlocks or runaway retry storms.

โš–๏ธ Trade-offs & Failure Modes in Booking Concurrency and Availability

Trade-offOption AOption BChosen Approach
Locking strategyPessimistic (SELECT FOR UPDATE) โ€” simple; no conflictsOptimistic (version column) โ€” high throughput; requires retry logicOCC for most rooms; Redis SETNX for top-1 % contested listings
Availability storeSame PostgreSQL as bookings โ€” always consistentSeparate read replica for calendar readsRead replica for calendar; all writes go to primary
Search freshnessQuery booking DB directly โ€” always accurateElasticsearch with CDC sync โ€” 1โ€“5 s lagES for search speed; Availability Service is authoritative at checkout
Hold window duration5 min โ€” fewer ghost holds; worse checkout UX30 min โ€” better UX; more inventory artificially suppressed15 min โ€” balances abandoned-checkout cleanup with realistic checkout time

Failure modes to discuss in an interview:

  • Hold never confirmed (guest abandons checkout): Guest initiates a hold, then closes the browser. The slot stays held and blocks other guests until held_until expires. Mitigation: a background cron job running every 5 minutes selects all availability_slots WHERE status = 'held' AND held_until < NOW() and bulk-updates them back to available. The partial index idx_slots_held_expiry makes this sweep fast even at 2.5 billion rows.
  • Payment succeeds but CONFIRM update fails: Guest is charged but booking status stays held. The guest has a pre-auth on their card and no confirmed booking. Mitigation: transactional outbox pattern โ€” write a booking_events row in the same transaction as the payment response; a background reconciliation job re-drives the CONFIRM step using the stored event if the booking is still held after payment succeeds.
  • Availability cache serves stale data during a booking wave: A room just held still shows "available" for up to 30 seconds for guests whose availability query was cached before the hold. Mitigation: publish a slot.status_changed Kafka event on every hold, confirm, and release; the cache consumer invalidates the Redis key immediately on receipt.
  • Elasticsearch out of sync with PostgreSQL: A host marks dates as blocked in PostgreSQL, but the ES document still shows those dates as available for up to 5 seconds. Mitigation: the Availability Service performs an authoritative availability re-check from PostgreSQL at the start of every HOLD attempt โ€” ES results are used only for display and filtering, never as the source of truth for the booking write.

๐Ÿงญ Decision Guide: Concurrency Control Strategy for Room Booking

SituationRecommendation
Use optimistic locking (OCC)Most listings: booking conflicts are statistically rare; OCC delivers high write throughput without blocking concurrent readers
Use pessimistic lockingUltra-high-value single-inventory items (charter flights, one-of-a-kind event venues) where a conflict retry is commercially unacceptable
Add Redis SETNX front-door lockTop 1 % of listings by booking attempt frequency; prevents thundering-herd OCC conflicts on hot-key rooms during peak windows
Avoid long hold windowsHold windows longer than 30 minutes artificially suppress availability and frustrate guests who see listings as "unavailable" when they are actually just held by an abandoned checkout
Separate search from booking DBAlways โ€” the 20:1 search-to-booking QPS ratio makes co-location a correctness and performance risk, not just a scaling concern
Multi-room (bundle) bookingsApply all-or-nothing HOLD across all rooms in a single transaction; any partial hold must be rolled back atomically, or you create partially-booked inventory that cannot be released cleanly

๐Ÿงช Practical Examples: Write Path and Read Path Walkthrough

Example 1: Guest Makes a Booking โ€” The Concurrency-Safe Write Path

The following sequence shows the full booking flow from "Book Now" to confirmation email, with concurrency control at every critical step.

sequenceDiagram
    participant G as Guest
    participant BK as Booking Service
    participant RD as Redis
    participant AV as Availability Service
    participant PY as Payment Service
    participant KF as Kafka

    G->>BK: POST /bookings (Idempotency-Key: xyz, roomId, check-in/out)
    BK->>RD: GET idempotency:booking:xyz
    RD-->>BK: Cache miss โ€” proceed
    BK->>RD: SETNX booking:lock:{roomId}:{dates} (TTL 20 s)
    RD-->>BK: Lock acquired
    BK->>AV: HOLD slots (version check; held_until = now + 15 min)
    AV-->>BK: Hold confirmed โ€” all slot versions incremented
    BK->>PY: Pre-authorise payment (amountCents, paymentMethodId)
    PY-->>BK: Auth code returned
    BK->>AV: CONFIRM slots (status: held โ†’ booked, version check)
    AV-->>BK: Booking confirmed
    BK->>RD: SET idempotency:booking:xyz = response (TTL 24 h)
    BK->>KF: Emit booking.confirmed event
    KF-->>G: Confirmation email (via Notification Service, async)
    BK-->>G: HTTP 201 { bookingId, status: "confirmed" }

If the HOLD step returns an OCC conflict, the Booking Service retries up to 3 times with exponential backoff (100 ms, 200 ms, 400 ms). If all retries fail โ€” indicating the room is genuinely being booked by another guest โ€” it returns HTTP 409 "room no longer available for selected dates".

If the Redis lock is already held on the first attempt, the service returns HTTP 409 immediately. This prevents the thundering herd from even reaching the database during a booking wave on a hot listing.

Example 2: Guest Searches for Available Rooms โ€” The Read Path

The read path deliberately avoids the booking database entirely. Elasticsearch handles geo and date filtering; Redis serves cached availability windows; only on a cache miss does the system touch PostgreSQL.

graph TD
    A["Guest: GET /search?lat=40.7&lon=-74.0\n&from=2025-12-31&to=2026-01-02&guests=2"]
    B["Search Service: hash query params โ†’ check Redis search cache"]
    C{Search cache hit?}
    D["Return cached search results\n(5-minute TTL)"]
    E["Elasticsearch: geo_distance + date range + price filter"]
    F["Availability Service: verify open slots for top results"]
    G{Availability cache hit?}
    H["Return from Redis availability cache\n(30-second TTL)"]
    I["PostgreSQL: query availability_slots\nfor matched room IDs and dates"]
    J["Merge ES results with confirmed availability\nCache result in Redis (30 s TTL)"]
    K["Return ranked results to guest\n(sorted by distance + rating)"]

    A --> B
    B --> C
    C -- Yes --> D
    C -- No --> E
    E --> F
    F --> G
    G -- Yes --> H
    G -- No --> I
    I --> J
    H --> K
    J --> K
    D --> K

The two-tier cache design reflects different freshness requirements: search results (hotel metadata, photos, ratings) change rarely and tolerate a 5-minute stale window; availability changes every few seconds during a booking wave and requires a 30-second cache with event-driven invalidation on every slot status change.

Elasticsearch is the only open-source technology that can answer "all available rooms within 5 km of Times Square for December 31st, under $300 per night, for 2 guests" in under 200 ms at 10K QPS. The reason is its inverted index combined with first-class geospatial query support via the geo_point field type and the geo_distance filter.

Hotel listings are indexed as ES documents. Each document contains a location field of type geo_point, nightly price, guest capacity, and a multi-value keyword field named available_dates containing every open date for the next 12 months. A CDC pipeline (Debezium consuming PostgreSQL WAL events via Kafka) updates this field within 1โ€“5 seconds whenever the Availability Service changes a slot status.

GET /hotels/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "5km",
            "location": { "lat": 40.7128, "lon": -74.0060 }
          }
        },
        { "range":  { "price_per_night": { "lte": 30000 } } },
        { "term":   { "max_guests": { "gte": 2 } } },
        { "terms":  { "available_dates": ["2025-12-31", "2026-01-01"] } }
      ]
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 40.7128, "lon": -74.0060 },
        "order": "asc",
        "unit": "km"
      }
    },
    { "rating": { "order": "desc" } }
  ],
  "size": 20
}

The available_dates field in ES is an approximate indicator. It is kept reasonably fresh by the CDC pipeline but is not the authoritative source. The Availability Service always re-confirms availability from PostgreSQL at the start of the HOLD step. This prevents a guest from completing a booking based on stale ES data if the slot was booked in the 1โ€“5 second CDC lag window.

For a full deep-dive on Elasticsearch indexing strategy, inverted index internals, and autocomplete design, see the companion post: System Design HLD Example: Search Autocomplete.

๐Ÿ“š Lessons Learned

  • Check-then-act without a lock is always wrong for inventory. The TOCTOU race condition is the most common root cause of double-bookings. Model availability as a lockable database row with a version counter โ€” not as a derived query result that any concurrent reader can also see as "available."
  • Optimistic locking beats pessimistic locking for most booking systems. Room booking conflicts are rare in aggregate. OCC delivers high write throughput and degrades gracefully with retries. SELECT FOR UPDATE serialises all writes to a room and becomes a bottleneck precisely during the high-traffic windows when you can least afford it.
  • The hold window is a product decision with architectural consequences. A 15-minute hold is long enough for checkout but short enough to limit ghost availability suppression. Hard-code it as a named constant (HOLD_DURATION_MINUTES = 15) and surface it as a tunable configuration value.
  • Search and booking must run on separate data stores. Running 10K search QPS against the booking database creates read-write contention and slows availability writes during booking waves โ€” exactly when correctness matters most.
  • Every write API that crosses a network boundary needs an idempotency key. The idempotency_key UNIQUE constraint is the last line of defence: even if the Redis cache is cold, a duplicate booking attempt at the database layer returns a conflict row rather than creating a second confirmed booking.
  • Model the payment lifecycle as two distinct phases. Pre-auth at hold creation; capture before check-in. This maps cleanly to the two-phase reservation model: a failed payment during the hold phase releases the slots without the guest ever seeing a charge.

๐Ÿ“Œ TLDR: Summary & Key Takeaways

  • Date-level inventory slots โ€” model each room-night as an independent availability_slots row with its own status and version; this is the foundation that makes per-night locking granular and efficient.
  • Optimistic locking + UNIQUE constraint โ€” @Version on the slot row prevents concurrent HOLDs from both succeeding; UNIQUE(room_id, slot_date) is the database-level backstop for any OCC gap.
  • Two-phase reservation โ€” HOLD (15-minute window) โ†’ payment pre-auth โ†’ CONFIRM; failure at any step releases the hold cleanly without stranded charges.
  • Redis SETNX for hot listings โ€” a front-door lock on contested (roomId, dateRange) keys prevents thundering-herd OCC conflicts on popular properties during peak booking windows.
  • Separate search from booking โ€” Elasticsearch for geo + date search at 10K QPS; PostgreSQL for ACID availability writes; never co-locate the two workloads.
  • Idempotency key on every booking write โ€” Redis 24 h cache + DB UNIQUE constraint together provide defence in depth against duplicate bookings from network retries.

๐Ÿ“ Practice Quiz

  1. Why does the hotel booking system use optimistic locking rather than SELECT FOR UPDATE (pessimistic locking) for availability slots?

    • A) PostgreSQL does not support row-level locking
    • B) Optimistic locking provides higher write throughput because it never blocks concurrent readers; conflicts are rare in aggregate and handled by retry logic
    • C) SELECT FOR UPDATE cannot be used on tables with UUID primary keys Correct Answer: B
  2. Three guests simultaneously submit booking requests for the same apartment on December 31st, which has exactly one availability_slots row for that date. What prevents all three from succeeding?

    • A) The API Gateway rate limiter blocks requests after the first one
    • B) The version column: only the first UPDATE that reads the original version succeeds; the other two return 0 rows affected and are retried or rejected
    • C) The Payment Service rejects duplicate pre-auth requests automatically Correct Answer: B
  3. Why does the system use Elasticsearch for search rather than querying the PostgreSQL bookings database directly?

    • A) PostgreSQL cannot store latitude/longitude coordinates
    • B) The search-to-booking QPS ratio is approximately 20:1; co-locating 10K search QPS with ACID booking writes would cause read-write contention and degrade booking throughput precisely during booking waves
    • C) Elasticsearch automatically prevents double-bookings via its document versioning Correct Answer: B
  4. Why must the booking API require a client-supplied Idempotency-Key header on POST /bookings?

    • A) To authenticate the guest and prevent anonymous bookings
    • B) To route the request to the correct regional availability replica
    • C) Network timeouts cause clients to retry; without idempotency the server would create a second confirmed booking from a retry of an already-succeeded request Correct Answer: C
  5. Open-ended challenge: A guest completes checkout successfully. The Payment Service returns a pre-auth code. The Booking Service then crashes before it can UPDATE availability_slots to booked. The guest has a pre-auth charge on their card but no confirmed booking, and the slot remains in held status. Walk through the full recovery strategy: what data would you inspect, what does the background reconciliation job do, how do you prevent the guest from being permanently charged without a confirmed booking, and what invariant must hold at the end of recovery?

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms