System Design HLD Example: Hotel Booking System (Airbnb)

A senior-level HLD for a hotel booking platform handling availability, concurrency, and reservations.

Abstract Algorithms

Intermediate

For developers with some experience. Builds on fundamentals.

Estimated read time: 14 min

AI-assisted content.

TLDR: A robust hotel booking system must update inventory atomically. The core trade-off is Consistency vs. Availability: we prioritize strong consistency for the booking path (PostgreSQL with row-level locking via SELECT FOR UPDATE SKIP LOCKED) while allowing eventual consistency and high availability for the search path (Elasticsearch). A two-phase "Hold-then-Confirm" model ensures that inventory isn't leaked during payment failures.

🛑 The New Year's Eve Nightmare

Imagine it's 11:59 PM on New Year's Eve. Two different travelers, one in London and one in New York, are looking at the exact same penthouse in Manhattan for the upcoming weekend. Both click "Book Now" at the same millisecond.

In a poorly designed system, the sequence of events looks like this:

  1. Request A checks the database: "Is the room available?" -> Yes.
  2. Request B checks the database: "Is the room available?" -> Yes.
  3. Request A writes a booking record: "Room booked for User A".
  4. Request B writes a booking record: "Room booked for User B".

Both users receive a confirmation email. Both pay their non-refundable deposits. On Friday, they both show up at the same door with their luggage. This is the Double-Booking Race Condition, and it is the single most important problem a booking system must solve. At scale, "rare" edge cases happen thousands of times a day. If you design for the average case, you fail at the edges.
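The four-step interleaving above can be replayed deterministically. This is a minimal Python sketch, not production code: `NaiveBookingStore` and `replay_race` are hypothetical names, and an in-memory list stands in for the database.

```python
class NaiveBookingStore:
    """In-memory stand-in for a bookings table with no locking at all."""

    def __init__(self) -> None:
        self.bookings: list[tuple[str, str]] = []  # (room_id, user_id)

    def is_available(self, room_id: str) -> bool:
        return all(room != room_id for room, _ in self.bookings)

    def write_booking(self, room_id: str, user_id: str) -> None:
        self.bookings.append((room_id, user_id))


def replay_race(store: NaiveBookingStore, room_id: str) -> list[str]:
    """Replay steps 1-4 in order: both checks run before either write."""
    a_sees_free = store.is_available(room_id)  # step 1
    b_sees_free = store.is_available(room_id)  # step 2
    if a_sees_free:
        store.write_booking(room_id, "user_a")  # step 3
    if b_sees_free:
        store.write_booking(room_id, "user_b")  # step 4
    return [user for room, user in store.bookings if room == room_id]


winners = replay_race(NaiveBookingStore(), "penthouse")
print(winners)  # ['user_a', 'user_b']: two bookings for one room
```

The bug is the gap between the check and the write; nothing ties the two together, so any fix has to make them a single atomic step.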

📖 Global Reservation Systems: Use Cases & Requirements

Actors

  • Guest / Traveler: Searches for rooms, views availability, and makes reservations.
  • Host / Property Manager: Manages inventory, sets pricing, and views upcoming bookings.
  • Admin: Handles disputes, refunds, and platform-wide monitoring.

Functional Requirements

  • Search: Users can search rooms by location (geo-coordinates), date range, and guest count.
  • Availability: Users see real-time availability for a listing before booking.
  • Reservation (Hold): Selecting a room places a temporary 15-minute hold.
  • Booking (Confirm): Successful payment converts a hold into a confirmed booking.
  • Cancellation: Releasing a booking restores inventory for those specific dates.

Non-Functional Requirements

  • Zero Double-Bookings: Strong consistency is non-negotiable for the final booking transaction.
  • High Search Availability: Search should remain functional even if the booking database is under heavy load.
  • Low Latency: Search results should return in < 200ms; booking confirmation in < 2s.
  • Scalability: Handle 100k searches/sec and 500 bookings/sec (peak holiday spikes).

🔍 Basics: Baseline Architecture

At its core, a booking system is an Inventory Management Engine. Unlike a standard e-commerce site where you might have 1,000 units of a SKU, a hotel booking system has "Perishable Inventory." A room night on December 31st is a different "product" than the same room on January 1st.

The baseline architecture involves:

  1. Inventory Generation: Pre-calculating available slots for every room for the next 365 days.
  2. The Lock Mechanism: Ensuring that only one user can transition a slot from available to booked.
  3. The Buffer (Hold): Providing a grace period for payment processing so the user doesn't lose the room mid-transaction.

Without these basics, you end up with "Phantom Inventory": rooms that appear available but are actually locked in failing payment processes.

⚙️ Mechanics: Distribution & Processing Logic

The distribution of inventory must be handled carefully. When a host adds a new listing, we don't just add one row. We must generate 365 rows in the availability_slots table.

  • Inventory Fan-out: Every update to a room's base availability (e.g., taking the room offline for maintenance) must propagate to all 365 days.
  • Search Synchronization: Since search is handled by Elasticsearch, we use an asynchronous pipeline. A write to the primary DB triggers a Kafka event, which is then indexed into ES. This introduces a 1-2 second lag, which is acceptable for search but not for booking.
  • State Machine: Every booking follows a strict state machine: Available -> Held -> Booked (or back to Available if the hold expires).
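The state machine in the last bullet can be made explicit as a transition table. The sketch below is one possible encoding, assuming a hypothetical `transition` helper; the BOOKED to AVAILABLE edge is added here to model the cancellation requirement and is not part of the bullet itself.

```python
from enum import Enum


class SlotStatus(Enum):
    AVAILABLE = "AVAILABLE"
    HELD = "HELD"
    BOOKED = "BOOKED"


# Allowed transitions. HELD -> AVAILABLE covers hold expiry;
# BOOKED -> AVAILABLE is an assumption that models cancellation.
ALLOWED = {
    SlotStatus.AVAILABLE: {SlotStatus.HELD},
    SlotStatus.HELD: {SlotStatus.BOOKED, SlotStatus.AVAILABLE},
    SlotStatus.BOOKED: {SlotStatus.AVAILABLE},
}


def transition(current: SlotStatus, target: SlotStatus) -> SlotStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting AVAILABLE to BOOKED outright forces every booking through a hold, which is what makes the hold the single place where contention is resolved.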

📐 Estimations & Design Goals

The Math of Inventory

  • Total Listings: 10 Million rooms.
  • Booking Window: 1 year (365 days).
  • Total Inventory Rows: 10M × 365 = 3.65 billion rows.
  • Search-to-Booking Ratio: 200:1. At the 100k searches/sec peak from the requirements, that is roughly 500 booking attempts/sec.
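The arithmetic behind these estimates is small enough to check directly. The helper names below are hypothetical; the ratio helper is included because the implied search-to-booking ratio shifts with whichever peak search rate is assumed.

```python
LISTINGS = 10_000_000
BOOKING_WINDOW_DAYS = 365


def total_inventory_rows(listings: int, days: int) -> int:
    """One availability row per room per night."""
    return listings * days


def search_to_booking_ratio(searches_per_sec: int, bookings_per_sec: int) -> float:
    """How many searches happen per booking attempt."""
    return searches_per_sec / bookings_per_sec


rows = total_inventory_rows(LISTINGS, BOOKING_WINDOW_DAYS)
print(f"{rows:,} rows")  # 3,650,000,000 rows

# 500 booking attempts/sec against two different assumed search rates.
print(search_to_booking_ratio(10_000, 500))   # 20.0
print(search_to_booking_ratio(100_000, 500))  # 200.0
```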

Design Goal: Decouple the "Read-Heavy" search path from the "Write-Heavy" booking path. We use a Command Query Responsibility Segregation (CQRS) inspired approach where Elasticsearch handles the searches and PostgreSQL handles the ACID transactions.

📊 High-Level Design: Separating Search from Booking

The following architecture ensures that high-volume search traffic never interferes with the critical booking path.

graph TD
    User((User)) --> LB[Load Balancer]
    LB --> AG[API Gateway]

    subgraph Search_Path
        AG --> SS[Search Service]
        SS --> ES[(Elasticsearch: Geo + Dates)]
        SS --> RC[(Search Cache: Redis)]
    end

    subgraph Booking_Path
        AG --> BS[Booking Service]
        BS --> AS[Availability Service]
        AS --> PDB[(Primary DB: Postgres)]
        BS --> PS[Payment Service]
    end

    subgraph Async_Sync
        PDB --> CDC[Debezium / CDC]
        CDC --> Kafka[Kafka]
        Kafka --> SS
        Kafka --> NS[Notification Service]
    end

The diagram captures the defining architectural decision: a hard separation between the Search Path (Elasticsearch + Redis) and the Booking Path (Postgres with SELECT FOR UPDATE SKIP LOCKED). The CDC pipeline via Debezium keeps the two paths synchronized without coupling them: a booking written to Postgres propagates to Elasticsearch within 1–2 seconds, keeping search results fresh while ensuring the booking path never touches the search cluster.
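To make the synchronization step concrete, here is a minimal sketch of the consumer side. It assumes change events have already been unwrapped to a Debezium-style envelope with `op`, `before`, and `after` fields, and it uses a plain dict as a stand-in for the Elasticsearch index. `apply_change_event` and the document shape are hypothetical.

```python
from typing import Any

# slot_id -> search document; a dict stands in for the Elasticsearch index.
SearchIndex = dict[str, dict[str, Any]]


def apply_change_event(index: SearchIndex, event: dict[str, Any]) -> None:
    """Apply one row-level change event to the search index."""
    op = event["op"]
    if op in ("c", "u", "r"):  # create, update, snapshot read
        row = event["after"]
        index[row["slot_id"]] = {
            "room_id": row["room_id"],
            "date": row["date"],
            "available": row["status"] == "AVAILABLE",
        }
    elif op == "d":  # delete
        index.pop(event["before"]["slot_id"], None)


index: SearchIndex = {}
row = {"slot_id": "s1", "room_id": "r1", "date": "2025-12-31", "status": "AVAILABLE"}
apply_change_event(index, {"op": "c", "before": None, "after": row})
apply_change_event(index, {"op": "u", "before": row, "after": {**row, "status": "HELD"}})
print(index["s1"]["available"])  # False
```

Because each event overwrites the whole document for a slot, replaying the same event twice leaves the index unchanged, which is useful when the consumer restarts and re-reads part of the topic.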

🧠 Deep Dive: How Postgres Atomically Prevents the Double-Booking Race Condition

The Hold mechanism is the most critical internal component. Understanding exactly how it works at the database level reveals why no amount of application-level locking can replace it: only the database can guarantee atomicity across concurrent transactions.

Internals: The Hold-then-Confirm State Machine

When a guest selects a room and date range, the Booking Service must atomically transition N rows in the availability_slots table (one per night) from AVAILABLE to HELD. The key word is "atomically": if any single night in the requested range is already HELD or BOOKED by another session, the entire operation must roll back with no writes committed.

This is implemented as a single Postgres transaction using SELECT FOR UPDATE SKIP LOCKED:

| Step | SQL Operation | Why This Mechanism |
| --- | --- | --- |
| 1. Lock target rows | SELECT … FOR UPDATE SKIP LOCKED | Non-blocking: if rows are locked by another session, returns fewer rows immediately |
| 2. Check completeness | Application checks all N nights returned | Missing row means another session already holds that night |
| 3. Update to HELD | UPDATE slots SET status='HELD', held_until=NOW()+interval '15 min', version=version+1 | Atomic state transition with optimistic version increment |
| 4. Create booking | INSERT INTO bookings (status='HELD') | Booking record created within the same transaction |
| 5. COMMIT | All-or-nothing guarantee | Postgres atomicity ensures no partial holds |

The SKIP LOCKED clause is the key insight. Without it, SELECT FOR UPDATE would block and wait for the competing transaction to release its lock, potentially for seconds. With SKIP LOCKED, if another session has the row locked, the query returns immediately and omits that row from the result. The application then detects the incomplete result and returns "unavailable" to the second guest without any waiting. Rows that another session has already committed as HELD or BOOKED are no longer locked, so the query must also filter on status = 'AVAILABLE' for the completeness check to catch them.
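The lock, check, and update steps can be sketched as follows. This assumes a DB-API cursor in the style of psycopg (named `%(name)s` parameters) and the availability_slots columns described in this section; `place_hold` and `nights` are hypothetical helpers, the booking INSERT is omitted for brevity, and the caller is assumed to own the transaction.

```python
from datetime import date, timedelta

# Step 1: lock only rows that are AVAILABLE and not locked by anyone else.
LOCK_SLOTS_SQL = """
    SELECT slot_id, date
      FROM availability_slots
     WHERE room_id = %(room_id)s
       AND date >= %(check_in)s AND date < %(check_out)s
       AND status = 'AVAILABLE'
       FOR UPDATE SKIP LOCKED
"""

# Step 3: flip the locked rows to HELD with a 15-minute expiry.
HOLD_SLOTS_SQL = """
    UPDATE availability_slots
       SET status = 'HELD',
           held_by = %(session_id)s,
           held_until = NOW() + INTERVAL '15 minutes',
           version = version + 1
     WHERE slot_id = ANY(%(slot_ids)s::uuid[])
"""


def nights(check_in: date, check_out: date) -> set[date]:
    """Every night of the stay; the check-out day itself is not a night."""
    return {check_in + timedelta(days=i) for i in range((check_out - check_in).days)}


def place_hold(cur, room_id: str, check_in: date, check_out: date, session_id: str) -> bool:
    """Return True if every night was locked and marked HELD.

    The caller should COMMIT on True and ROLLBACK on False.
    """
    cur.execute(LOCK_SLOTS_SQL, {"room_id": room_id, "check_in": check_in, "check_out": check_out})
    rows = cur.fetchall()
    # Step 2: any missing night means someone else locked, holds, or booked it.
    if {row[1] for row in rows} != nights(check_in, check_out):
        return False
    cur.execute(HOLD_SLOTS_SQL, {"session_id": session_id, "slot_ids": [row[0] for row in rows]})
    return True
```

Returning False before issuing any UPDATE means a failed attempt writes nothing, so the rollback only has to release the row locks it briefly acquired.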

| Field | Type | Description |
| --- | --- | --- |
| slot_id | UUID | Primary key for the availability slot |
| room_id | UUID | FK to rooms table |
| date | DATE | The specific night this slot represents |
| status | ENUM | AVAILABLE, HELD, BOOKED, BLOCKED |
| held_by | UUID | Guest session ID (null when AVAILABLE) |
| held_until | TIMESTAMP | Expiry time for the hold (15-minute TTL) |
| version | INTEGER | Optimistic lock counter |

Performance Analysis: Balancing Search Scale Against Booking Correctness

The CQRS-inspired architecture allows each path to scale completely independently.

| Path | Technology | Peak Throughput | Latency Target |
| --- | --- | --- | --- |
| Search (geo + date range) | Elasticsearch | 100,000 req/sec | < 200 ms |
| Availability pre-check | Redis bitmap cache | 10,000 req/sec | < 50 ms |
| Hold creation | Postgres SKIP LOCKED | 500 req/sec | < 500 ms |
| Booking confirmation | Postgres + payment gateway | 200 req/sec | < 2,000 ms |

The search path uses Elasticsearch with a geo-point mapping and a date-range filter on a denormalized availability index. Because this index is refreshed asynchronously (Debezium CDC → Kafka → Elasticsearch consumer), there is an intentional 1–2 second lag between a room becoming HELD and that change appearing in search results. This lag is acceptable because the Hold mechanism at the Booking Service provides the ultimate correctness guarantee: a guest who sees an "available" result in search but then gets an "unavailable" response at booking has simply encountered the propagation window. The system remains correct even during this lag.
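The availability pre-check row in the table mentions a bitmap cache. As a rough illustration of the idea only, the sketch below uses a Python integer as the bitmap, with bit i set when night i is taken. The function names are hypothetical; in Redis the same layout would map onto bit commands such as SETBIT and GETBIT.

```python
def set_unavailable(bitmap: int, day_index: int) -> int:
    """Mark one night as taken by setting its bit."""
    return bitmap | (1 << day_index)


def is_range_free(bitmap: int, start: int, nights: int) -> bool:
    """True if none of the nights in [start, start + nights) is taken."""
    mask = ((1 << nights) - 1) << start
    return (bitmap & mask) == 0


month = 0  # all nights free
month = set_unavailable(month, 2)
print(is_range_free(month, 0, 2))  # True: nights 0 and 1 are free
print(is_range_free(month, 1, 2))  # False: night 2 is taken
```

A pre-check like this is only a fast filter. A "free" answer can be stale, so the hold transaction in Postgres still makes the final decision.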

🌍 Real-World Booking Systems: Airbnb, Booking.com, and Expedia

Airbnb faced the double-booking problem at massive scale as "Instant Book" listings grew. Their solution is a multi-tier availability system: a fast read layer (Redis cache of per-room per-month availability bitmaps) for search, and a strong-consistency write layer (Postgres with row-level locking) for bookings. The Instant Book feature, where a guest can confirm immediately without waiting for host approval, was only possible after Airbnb built a hold mechanism capable of guaranteeing atomic availability from click to confirmation within 2 seconds.

Booking.com uses a date-level inventory system with one row per room per night, exactly as described in this guide. Their data engineering team processes over 1 billion availability updates per day as hotels worldwide manually manage their calendars through the Booking.com extranet. The Kafka pipeline ingesting these updates into Elasticsearch is one of the highest-throughput event streams in European tech infrastructure.

Expedia solved the meta-search aggregation problem differently: rather than holding inventory itself, Expedia passes the hold request directly to the supplier (hotel) API at booking time. This "pass-through" model shifts the hold complexity to the supplier but introduces latency and availability risk from external API calls, a trade-off Expedia accepts in exchange for avoiding the cost of maintaining 3.65 billion inventory rows.

⚖️ Consistency vs. Availability: Trade-offs in the Booking Path

| Design Decision | Advantage | Risk |
| --- | --- | --- |
| Date-level inventory (one row per night) | Precise partial-week bookings supported | 3.65B rows; requires date-partitioned table and composite index |
| SKIP LOCKED for holds | Non-blocking; competing holds fail fast | Requires robust retry logic in the application layer |
| ES for search, Postgres for booking | Search scales independently to 100k req/sec | 1–2 second search-to-reality propagation lag |
| 15-minute hold TTL | Graceful payment processing window | Popular rooms unavailable during hold if payment fails slowly |
| CQRS read/write separation | Zero cross-path interference | Data synchronization complexity via CDC pipeline |

Critical Failure Mode (The Hold-Expiry and Payment Gap): A guest places a hold, begins payment, and the payment takes 16 minutes (possible with 3D Secure strong authentication). The hold expires at 15 minutes. A background cleanup job reclaims the slot as AVAILABLE. Another guest immediately books the same room. The first guest's payment then succeeds, creating a double booking. Mitigation: The payment confirmation endpoint must re-validate that the hold is still active (status=HELD and held_until > NOW()) in the same transaction that converts the hold to BOOKED. If the hold has expired, the system must immediately refund and surface an "unable to confirm" message, then re-attempt the hold if inventory is still available.
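The re-validation rule can be sketched in isolation. The example below is an in-memory illustration under stated assumptions: `Slot` and `confirm_booking` are hypothetical, and in the real system the check and the update would run inside one Postgres transaction rather than over Python objects.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Slot:
    status: str
    held_by: Optional[str] = None
    held_until: Optional[datetime] = None


def confirm_booking(slots: list[Slot], session_id: str, now: datetime) -> bool:
    """Convert HELD -> BOOKED only if every slot is still held by this session."""
    still_held = all(
        s.status == "HELD"
        and s.held_by == session_id
        and s.held_until is not None
        and s.held_until > now
        for s in slots
    )
    if not still_held:
        return False  # caller refunds the payment and reports "unable to confirm"
    for s in slots:
        s.status, s.held_until = "BOOKED", None
    return True


placed_at = datetime(2025, 12, 31, 23, 59, tzinfo=timezone.utc)
stay = [Slot("HELD", "sess-1", placed_at + timedelta(minutes=15))]
print(confirm_booking(stay, "sess-1", placed_at + timedelta(minutes=16)))  # False
```

Checking `held_by` as well as the expiry matters: after a cleanup and a new hold, the slot may be HELD again, but by a different session.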

🧭 Choosing the Right Consistency Model for Your Booking System

Use Postgres with SKIP LOCKED when:

  • Inventory has natural row-level granularity (one row per night per room).
  • Concurrent booking attempts are moderate (under 1,000 concurrent holds per cluster).
  • Strong consistency is non-negotiable because the product being sold has real-world, non-refundable value.

Use Redis distributed locking (Redlock algorithm) when:

  • Hold operations span multiple services or databases that cannot participate in a single Postgres transaction.
  • Sub-millisecond lock acquisition is required and the Postgres round-trip overhead is prohibitive.
  • Inventory granularity is coarser: whole-room availability rather than per-night slots.

When to introduce Elasticsearch for search:

  • Total listing count exceeds 500,000, the point where Postgres full-text and geo queries begin to miss the 200 ms latency target.
  • Search requires compound filtering: amenities, ratings, geo-polygon boundaries, pet policies.
  • Read-to-write ratio for search queries exceeds 50:1, where Elasticsearch's read-optimized index layout provides far higher throughput.

🧪 Delivering This Design in a System Design Interview

Act 1: The Double-Booking Race Condition (2 minutes). Describe the New Year's Eve scenario from the introduction. Draw two concurrent requests both reading "Available" from the database and both successfully writing a booking record. Show the resulting state: two confirmed guests, one room, two deposit receipts. Grounding the conversation in a concrete failure scenario immediately demonstrates systems thinking.

Act 2: The CQRS-Inspired Architecture (5 minutes). Divide the whiteboard into a Search Path on the left and a Booking Path on the right. Show that search goes to Elasticsearch and booking goes to Postgres with SKIP LOCKED. Draw the CDC pipeline between them; this is the key architectural insight that allows the two paths to stay synchronized without coupling them. Explain that the 1–2 second lag in search is an intentional and acceptable trade-off.

Act 3: Scaling and Edge Cases (3 minutes).

| Interviewer Question | Strong Answer |
| --- | --- |
| How do you scale to 10 million listings? | Date-partitioned availability table in Postgres; Elasticsearch handles geo-search at full scale |
| How do you prevent hold abuse by bots? | Require valid payment method on file before granting a hold; rate-limit holds per user session |
| How does a host cancellation flow work? | Saga pattern: BOOKED → CANCELLED_BY_HOST triggers slot reversion, Kafka refund event, guest notification |
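For the bot-abuse answer, per-session rate limiting can be sketched with a sliding window. `HoldRateLimiter` is a hypothetical name, and the limit and window used below are illustrative values, not figures from this design. A multi-instance deployment would need shared state rather than a per-process dict.

```python
from collections import defaultdict, deque


class HoldRateLimiter:
    """Allow at most max_holds hold attempts per session per window."""

    def __init__(self, max_holds: int, window_seconds: float) -> None:
        self.max_holds = max_holds
        self.window_seconds = window_seconds
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, session_id: str, now: float) -> bool:
        events = self._events[session_id]
        # Drop attempts that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_holds:
            return False
        events.append(now)
        return True


limiter = HoldRateLimiter(max_holds=2, window_seconds=60)
print([limiter.allow("sess-1", t) for t in (0, 1, 2)])  # [True, True, False]
```

Passing `now` in explicitly keeps the limiter deterministic under test; a caller would typically supply a monotonic clock reading.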

🛠️ Open Source Components for Booking Platform Infrastructure

Debezium is the standard CDC connector used to stream Postgres write-ahead log changes into Kafka. It captures every INSERT, UPDATE, and DELETE from the availability_slots table and publishes them as structured events. The Elasticsearch sync consumer subscribes to these events and updates the search index in near-real-time.

Apache Kafka provides the durable event backbone for the entire async pipeline. The Notification Service and the Elasticsearch sync consumer both consume from the same Kafka topic with independent consumer group offsets, allowing each to process events at its own pace without impacting the other.

PostGIS (Postgres geographic extension) handles the geo-coordinate storage for the listing location. While Elasticsearch handles geo-search at scale, the canonical listing location is stored in Postgres with a PostGIS GEOGRAPHY column and a spatial index for administrative queries.

📚 Lessons Learned From Building and Operating Booking Systems

Lesson 1: The hold is your correctness anchor. Every architectural decision should be evaluated against one question: "Does this preserve the integrity of the hold?" Adding a caching layer between the Booking Service and Postgres is dangerous if the cached availability can be stale by more than a few milliseconds during the booking transaction.

Lesson 2: Generate availability rows lazily, not eagerly. Pre-generating 365 rows per room at listing creation time (3.65B rows for 10M listings) is an expensive bulk operation. Generate rows on demand when a search or booking request arrives for a date not yet in the table, and use a background job to pre-warm popular date windows.

Lesson 3: Monitor the hold abandonment rate. A high hold abandonment rate (guests placing holds and not completing payment) is both a business metric and a system health signal. A sudden spike may indicate that the payment page is slow, that the payment gateway is timing out, or that the hold window is too short for the typical checkout flow.

Lesson 4: The cancellation refund path is as complex as the booking path. Cancellations must atomically revert availability slots to AVAILABLE, issue a refund via the payment gateway, and notify the host and downstream analytics. Use the Saga pattern for the cancellation flow to ensure each step is idempotent and compensatable if a downstream service is unavailable.
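The compensation idea behind the Saga pattern can be shown with a small orchestrator. This is a sketch under assumptions: `run_saga` and the step names are hypothetical, the steps are plain callables rather than service calls, and a real cancellation flow would more likely retry a failed refund than undo earlier steps.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated:{prev_name}")
            return log
        done.append(step)
        log.append(f"done:{name}")
    return log


def refund_unavailable() -> None:
    raise RuntimeError("payment gateway unavailable")


result = run_saga([
    ("release_slots", lambda: None, lambda: None),
    ("issue_refund", refund_unavailable, lambda: None),
    ("notify_host", lambda: None, lambda: None),
])
print(result)  # ['done:release_slots', 'failed:issue_refund', 'compensated:release_slots']
```

Each action and compensation should be idempotent so that a retried saga does not release slots or refund a guest twice.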

📌 TLDR & Key Takeaways for Hotel Booking System Design

  • Core problem: The Double-Booking Race Condition, in which two concurrent requests both read "Available" and both write a booking for the same room and dates.
  • Solution: SELECT FOR UPDATE SKIP LOCKED in a single Postgres transaction atomically transitions N availability slots from AVAILABLE to HELD in an all-or-nothing operation.
  • Architecture: CQRS-inspired separation, with Elasticsearch for search (100k req/sec), Postgres for booking (500 req/sec), and Debezium CDC + Kafka for synchronization.
  • Hold model: 15-minute window allows payment processing before the slot is reclaimed by the cleanup job.
  • Key trade-off: Eventual consistency in search (1–2 second lag) is acceptable; strong consistency in the booking transaction is non-negotiable.
  • At scale: 3.65B availability rows require date-partitioned tables and a composite B-tree index on (room_id, date, status).