System Design HLD Example: Hotel Booking System (Airbnb)

A senior-level HLD for a hotel booking platform handling availability, concurrency, and reservations.

System Design Interview Prep

Abstract Algorithms

·Mar 28, 2026·14 min read

AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.

TLDR: A robust hotel booking system must decrement inventory atomically. The core trade-off is Consistency vs. Availability: we prioritize strong consistency for the booking path (PostgreSQL row locks via SELECT FOR UPDATE SKIP LOCKED, plus a per-slot version counter) while allowing eventual consistency and high availability for the search path (Elasticsearch). A two-phase "Hold-then-Confirm" model ensures that inventory isn't left stranded when a payment fails.

🛑 The New Year's Eve Nightmare

Imagine it’s 11:59 PM on New Year’s Eve. Two different travelers, one in London and one in New York, are looking at the exact same penthouse in Manhattan for the upcoming weekend. Both click "Book Now" at the same millisecond.

In a poorly designed system, the sequence of events looks like this:

Request A checks the database: "Is the room available?" -> Yes.
Request B checks the database: "Is the room available?" -> Yes.
Request A writes a booking record: "Room booked for User A".
Request B writes a booking record: "Room booked for User B".

Both users receive a confirmation email. Both pay their non-refundable deposits. On Friday, they both show up at the same door with their luggage. This is the Double-Booking Race Condition, and it is the single most important problem a booking system must solve. At scale, "rare" edge cases happen thousands of times a day. If you design for the average case, you fail at the edges.
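The interleaving above can be replayed deterministically. The sketch below is illustrative only: it uses a hypothetical in-memory list in place of a bookings table, and the helper names (`is_available`, `write_booking`) are assumptions, not part of any real system. It shows why a separate check followed by a write is unsafe.

```python
# Deterministic replay of the check-then-write interleaving.
# The in-memory "table" and helper names are hypothetical.

bookings: list[tuple[str, str]] = []  # rows of (room_id, user)


def is_available(room_id: str) -> bool:
    """Naive check: the room is free if no booking row exists for it."""
    return all(booked_room != room_id for booked_room, _ in bookings)


def write_booking(room_id: str, user: str) -> None:
    """Naive write: inserts a row without re-checking availability."""
    bookings.append((room_id, user))


room = "manhattan-penthouse"

# Both requests run their availability check before either one writes.
a_sees_free = is_available(room)  # Request A
b_sees_free = is_available(room)  # Request B

if a_sees_free:
    write_booking(room, "user-a")
if b_sees_free:
    write_booking(room, "user-b")

double_booked = len(bookings) > 1
print(bookings, double_booked)
```

Nothing links the check to the write, so both requests act on the same stale answer. The rest of the design closes that gap by making the check and the write one atomic database operation.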

📖 Global Reservation Systems: Use Cases & Requirements

Actors

Guest / Traveler: Searches for rooms, views availability, and makes reservations.
Host / Property Manager: Manages inventory, sets pricing, and views upcoming bookings.
Admin: Handles disputes, refunds, and platform-wide monitoring.

Functional Requirements

Search: Users can search rooms by location (geo-coordinates), date range, and guest count.
Availability: Users see real-time availability for a listing before booking.
Reservation (Hold): Selecting a room places a temporary 15-minute hold.
Booking (Confirm): Successful payment converts a hold into a confirmed booking.
Cancellation: Releasing a booking restores inventory for those specific dates.

Non-Functional Requirements

Zero Double-Bookings: Strong consistency is non-negotiable for the final booking transaction.
High Search Availability: Search should remain functional even if the booking database is under heavy load.
Low Latency: Search results should return in < 200ms; booking confirmation in < 2s.
Scalability: Handle 100k searches/sec and 500 bookings/sec (peak holiday spikes).

🔍 Basics: Baseline Architecture

At its core, a booking system is an Inventory Management Engine. Unlike a standard e-commerce site where you might have 1,000 units of a SKU, a hotel booking system has "Perishable Inventory." A room night on December 31st is a different "product" than the same room on January 1st.

The baseline architecture involves:

Inventory Generation: Pre-calculating available slots for every room for the next 365 days.
The Lock Mechanism: Ensuring that only one user can transition a slot from available to booked.
The Buffer (Hold): Providing a grace period for payment processing so the user doesn't lose the room mid-transaction.

Without these basics, you end up with "Phantom Inventory"—rooms that appear available but are actually locked in failing payment processes.

⚙️ Mechanics: Distribution & Processing Logic

The distribution of inventory must be handled carefully. When a host adds a new listing, we don't just add one row. We must generate 365 rows in the availability_slots table.

Inventory Fan-out: Every update to a room's base availability (e.g., taking the room offline for maintenance) must propagate to all 365 days.
Search Synchronization: Since search is handled by Elasticsearch, we use an asynchronous pipeline. A write to the primary DB triggers a Kafka event, which is then indexed into ES. This introduces a 1-2 second lag, which is acceptable for search but not for booking.
State Machine: Every booking follows a strict state machine: Available -> Held -> Booked (or back to Available if the hold expires).
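The state machine can be written down as an explicit transition table. This is a minimal sketch, assuming the three states named above plus a BOOKED to AVAILABLE edge for the cancellation requirement; the `transition` helper is a hypothetical name, and the BLOCKED status that hosts use for maintenance is left out for brevity.

```python
from enum import Enum


class SlotStatus(Enum):
    AVAILABLE = "AVAILABLE"
    HELD = "HELD"
    BOOKED = "BOOKED"


# Allowed transitions; anything not listed here is rejected.
ALLOWED_TRANSITIONS = {
    SlotStatus.AVAILABLE: {SlotStatus.HELD},                     # guest places a hold
    SlotStatus.HELD: {SlotStatus.BOOKED, SlotStatus.AVAILABLE},  # payment succeeds, or hold expires
    SlotStatus.BOOKED: {SlotStatus.AVAILABLE},                   # cancellation restores inventory
}


def transition(current: SlotStatus, target: SlotStatus) -> SlotStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Note that AVAILABLE to BOOKED is deliberately absent: every booking has to pass through a hold first.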

📐 Estimations & Design Goals

The Math of Inventory

Total Listings: 10 Million rooms.
Booking Window: 1 year (365 days).
Total Inventory Rows: 10M × 365 = 3.65 billion rows.
Search-to-Booking Ratio: roughly 200:1 at peak. With 100k searches/sec, we might see 500 booking attempts/sec.
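As a quick sanity check, the same arithmetic in code. The peak rates are taken from the non-functional requirements (100k searches/sec, 500 bookings/sec); the constant names are just labels for this sketch.

```python
LISTINGS = 10_000_000
BOOKING_WINDOW_DAYS = 365
PEAK_SEARCHES_PER_SEC = 100_000
PEAK_BOOKINGS_PER_SEC = 500

# One availability row per room per night in the booking window.
total_slot_rows = LISTINGS * BOOKING_WINDOW_DAYS

# How many searches arrive for every booking attempt at peak.
search_to_booking_ratio = PEAK_SEARCHES_PER_SEC // PEAK_BOOKINGS_PER_SEC

print(f"{total_slot_rows:,} rows")      # 3,650,000,000 rows
print(f"{search_to_booking_ratio}:1")   # 200:1
```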

Design Goal: Decouple the "Read-Heavy" search path from the "Write-Heavy" booking path. We use a Command Query Responsibility Segregation (CQRS) inspired approach where Elasticsearch handles the searches and PostgreSQL handles the ACID transactions.

📊 High-Level Design: Separating Search from Booking

The following architecture ensures that high-volume search traffic never interferes with the critical booking path.

graph TD
    User((User)) --> LB[Load Balancer]
    LB --> AG[API Gateway]

    subgraph Search_Path
        AG --> SS[Search Service]
        SS --> ES[(Elasticsearch: Geo + Dates)]
        SS --> RC[(Search Cache: Redis)]
    end

    subgraph Booking_Path
        AG --> BS[Booking Service]
        BS --> AS[Availability Service]
        AS --> PDB[(Primary DB: Postgres)]
        BS --> PS[Payment Service]
    end

    subgraph Async_Sync
        PDB --> CDC[Debezium / CDC]
        CDC --> Kafka[Kafka]
        Kafka --> SS
        Kafka --> NS[Notification Service]
    end

The diagram captures the defining architectural decision: a hard separation between the Search Path (Elasticsearch + Redis) and the Booking Path (Postgres with SELECT FOR UPDATE SKIP LOCKED). The CDC pipeline via Debezium keeps the two paths synchronized without coupling them — a booking written to Postgres propagates to Elasticsearch within 1–2 seconds, keeping search results fresh while ensuring the booking path never touches the search cluster.

🧠 Deep Dive: How Postgres Atomically Prevents the Double-Booking Race Condition

The Hold mechanism is the most critical internal component. Understanding exactly how it works at the database level reveals why no amount of application-level locking can replace it — only the database can guarantee atomicity across concurrent transactions.

Internals: The Hold-then-Confirm State Machine

When a guest selects a room and date range, the Booking Service must atomically transition N rows in the availability_slots table (one per night) from AVAILABLE to HELD. The key word is "atomically": if any single night in the requested range is already HELD or BOOKED by another session, the entire operation must roll back with no writes committed.

This is implemented as a single Postgres transaction using SELECT FOR UPDATE SKIP LOCKED:

Step	SQL Operation	Why This Mechanism
1. Lock target rows	SELECT … FOR UPDATE SKIP LOCKED	Non-blocking: if rows are locked by another session, returns fewer rows immediately
2. Check completeness	Application checks all N nights returned	A missing row means that night is not AVAILABLE or is locked by another in-flight session
3. Update to HELD	UPDATE availability_slots SET status='HELD', held_until=NOW()+interval '15 min', version=version+1	State transition plus version increment, performed under the row locks taken in step 1
4. Create booking	INSERT INTO bookings (…) VALUES (…, 'HELD')	Booking record created within the same transaction
5. Commit	COMMIT	All-or-nothing: Postgres atomicity ensures no partial holds

The SKIP LOCKED clause is the key insight. Without it, SELECT FOR UPDATE would block and wait for the competing transaction to release its lock, potentially for seconds. With SKIP LOCKED, if another session has the row locked, the query immediately omits that row from its result set. The application then detects the incomplete result and returns "unavailable" to the second guest without any waiting. A skipped row is not proof that the night is taken: if the competing transaction later rolls back, the slot becomes available again, which is why the caller may want to retry.

Field	Type	Description
slot_id	UUID	Primary key for the availability slot
room_id	UUID	FK to rooms table
date	DATE	The specific night this slot represents
status	ENUM	AVAILABLE, HELD, BOOKED, BLOCKED
held_by	UUID	Guest session ID (null when AVAILABLE)
held_until	TIMESTAMP	Expiry time for the hold (15-minute TTL)
version	INTEGER	Optimistic lock counter
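The five-step hold transaction and the slot schema can be put together in one function. This is a sketch under stated assumptions, not a definitive implementation: `conn` is any DB-API 2.0 connection to Postgres with autocommit off and `%s` placeholders (for example psycopg), there is exactly one slot row per (room_id, date), and the `bookings` column names and the `place_hold` helper are hypothetical.

```python
from datetime import date

LOCK_SLOTS_SQL = """
    SELECT slot_id
      FROM availability_slots
     WHERE room_id = %s
       AND date >= %s AND date < %s
       AND status = 'AVAILABLE'
       FOR UPDATE SKIP LOCKED
"""

HOLD_SLOTS_SQL = """
    UPDATE availability_slots
       SET status = 'HELD',
           held_by = %s,
           held_until = NOW() + interval '15 minutes',
           version = version + 1
     WHERE room_id = %s
       AND date >= %s AND date < %s
       AND status = 'AVAILABLE'
"""

# Column names below are assumptions; the source only states status='HELD'.
INSERT_BOOKING_SQL = """
    INSERT INTO bookings (room_id, guest_session_id, check_in, check_out, status)
    VALUES (%s, %s, %s, %s, 'HELD')
"""


def place_hold(conn, room_id, guest_session_id, check_in: date, check_out: date) -> bool:
    """Try to hold every night in [check_in, check_out). Returns False if any night is taken."""
    nights = (check_out - check_in).days
    if nights <= 0:
        raise ValueError("check_out must be after check_in")

    cur = conn.cursor()
    try:
        # Step 1: lock the target rows without waiting on competing sessions.
        cur.execute(LOCK_SLOTS_SQL, (room_id, check_in, check_out))
        locked = cur.fetchall()

        # Step 2: fewer rows than nights means some night is unavailable or locked.
        if len(locked) != nights:
            conn.rollback()
            return False

        # Steps 3 and 4: flip the slots to HELD and record the booking.
        cur.execute(HOLD_SLOTS_SQL, (guest_session_id, room_id, check_in, check_out))
        cur.execute(INSERT_BOOKING_SQL, (room_id, guest_session_id, check_in, check_out))

        # Step 5: commit everything or nothing.
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise
```

A caller would typically map `False` to an "unavailable" response, optionally retrying once in case the competing transaction rolled back.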

Performance Analysis: Balancing Search Scale Against Booking Correctness

The CQRS-inspired architecture allows each path to scale completely independently.

Path	Technology	Peak Throughput	Latency Target
Search (geo + date range)	Elasticsearch	100,000 req/sec	< 200 ms
Availability pre-check	Redis bitmap cache	10,000 req/sec	< 50 ms
Hold creation	Postgres SKIP LOCKED	500 req/sec	< 500 ms
Booking confirmation	Postgres + payment gateway	200 req/sec	< 2,000 ms

The search path uses Elasticsearch with a geo-point mapping and a date-range filter on a denormalized availability index. Because this index is refreshed asynchronously (Debezium CDC → Kafka → Elasticsearch consumer), there is an expected 1–2 second lag between a room becoming HELD and that change appearing in search results. This lag is acceptable because the hold transaction in the Booking Service is the final correctness check: a guest who sees an "available" result in search but then gets an "unavailable" response at booking has simply hit the propagation window. The system remains correct even during this lag.
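To make the search side concrete, here is one way the request body could be built. It is a sketch only: the field names (`location`, `max_guests`, `available_dates`) and the choice to store available nights as an array of dates on each listing document are assumptions, since the article does not specify the index mapping. The query clauses themselves (`bool`, `filter`, `geo_distance`, `range`, `term`) are standard Elasticsearch Query DSL.

```python
from datetime import date, timedelta


def build_search_query(lat: float, lon: float, radius_km: int,
                       check_in: date, check_out: date, guests: int) -> dict:
    """Build a filter-only query: near a point, enough capacity, every night available."""
    nights = [check_in + timedelta(days=i) for i in range((check_out - check_in).days)]

    filters = [
        # Listings within radius_km of the search point (geo_point field).
        {"geo_distance": {"distance": f"{radius_km}km",
                          "location": {"lat": lat, "lon": lon}}},
        # Listings that can host at least this many guests.
        {"range": {"max_guests": {"gte": guests}}},
    ]
    # One term filter per night, so a listing matches only if all nights are present.
    filters += [{"term": {"available_dates": night.isoformat()}} for night in nights]

    return {"query": {"bool": {"filter": filters}}}
```

Filter clauses skip relevance scoring and are cacheable, which suits a yes/no availability check. Results can still be stale by the propagation lag, so the booking path must not trust them.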

🌍 Real-World Booking Systems: Airbnb, Booking.com, and Expedia

Airbnb faced the double-booking problem at massive scale as "Instant Book" listings grew. Their solution is a multi-tier availability system: a fast read layer (Redis cache of per-room per-month availability bitmaps) for search, and a strong-consistency write layer (Postgres with row-level locking) for bookings. The Instant Book feature — where a guest can confirm immediately without waiting for host approval — was only possible after Airbnb built a hold mechanism capable of guaranteeing atomic availability from click to confirmation within 2 seconds.

Booking.com uses a date-level inventory system with one row per room per night, exactly as described in this guide. Their data engineering team processes over 1 billion availability updates per day as hotels worldwide manually manage their calendars through the Booking.com extranet. The Kafka pipeline ingesting these updates into Elasticsearch is one of the highest-throughput event streams in European tech infrastructure.

Expedia solved the meta-search aggregation problem differently: rather than holding inventory itself, Expedia passes the hold request directly to the supplier (hotel) API at booking time. This "pass-through" model shifts the hold complexity to the supplier but introduces latency and availability risk from external API calls. The trade-off is accepted in exchange for not maintaining a per-night inventory table of its own, like the 3.65 billion rows estimated earlier in this design.

⚖️ Consistency vs. Availability: Trade-offs in the Booking Path

Design Decision	Advantage	Risk
Date-level inventory (one row per night)	Precise partial-week bookings supported	3.65B rows; requires date-partitioned table and composite index
SKIP LOCKED for holds	Non-blocking; competing holds fail fast	Requires robust retry logic in the application layer
ES for search, Postgres for booking	Search scales independently to 100k req/sec	1–2 second search-to-reality propagation lag
15-minute hold TTL	Graceful payment processing window	Popular rooms unavailable during hold if payment fails slowly
CQRS read/write separation	Zero cross-path interference	Data synchronization complexity via CDC pipeline

Critical Failure Mode — The Hold-Expiry and Payment Gap: A guest places a hold, begins payment, and the payment takes 16 minutes (possible with 3D Secure strong authentication). The hold expires at 15 minutes. A background cleanup job reclaims the slot as AVAILABLE. Another guest immediately books the same room. The first guest's payment then succeeds, creating a double booking. Mitigation: The payment confirmation endpoint must re-validate that the hold is still active — with status=HELD and held_until > NOW() — in the same transaction that converts the hold to BOOKED. If the hold has expired, the system must immediately refund and surface an "unable to confirm" message, then re-attempt the hold if inventory is still available.
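The mitigation can be sketched as a single conditional UPDATE. The assumptions are the same as for any DB-API 2.0 Postgres connection with autocommit off (for example psycopg); in addition, each guest session holds at most one stay at a time, and the `booking_id` column and `confirm_booking` helper are hypothetical names.

```python
# Only rows whose hold is still live are converted. If the cleanup job already
# reclaimed a slot, it no longer matches and the row count comes up short.
CONFIRM_SLOTS_SQL = """
    UPDATE availability_slots
       SET status = 'BOOKED',
           version = version + 1
     WHERE held_by = %s
       AND status = 'HELD'
       AND held_until > NOW()
"""

CONFIRM_BOOKING_SQL = """
    UPDATE bookings
       SET status = 'BOOKED'
     WHERE booking_id = %s
       AND status = 'HELD'
"""


def confirm_booking(conn, booking_id, guest_session_id, nights: int) -> bool:
    """Convert a live hold to BOOKED. Returns False if the hold expired; the caller must refund."""
    cur = conn.cursor()
    try:
        cur.execute(CONFIRM_SLOTS_SQL, (guest_session_id,))
        if cur.rowcount != nights:
            # At least one night is no longer held by this guest: undo and refund.
            conn.rollback()
            return False
        cur.execute(CONFIRM_BOOKING_SQL, (booking_id,))
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise
```

Because the expiry check lives in the WHERE clause of the same statement that writes BOOKED, there is no window between "validate the hold" and "convert the hold" for the cleanup job to slip into.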

🧭 Choosing the Right Consistency Model for Your Booking System

Use Postgres with SKIP LOCKED when:

Inventory has natural row-level granularity (one row per night per room).
Concurrent booking attempts are moderate (under 1,000 concurrent holds per cluster).
Strong consistency is non-negotiable because the product being sold has real-world, non-refundable value.

Use Redis distributed locking (Redlock algorithm) when:

Hold operations span multiple services or databases that cannot participate in a single Postgres transaction.
Sub-millisecond lock acquisition is required and the Postgres round-trip overhead is prohibitive.
Inventory granularity is coarser — whole-room availability rather than per-night slots.

When to introduce Elasticsearch for search:

Total listing count exceeds 500,000, the point at which Postgres full-text and geo queries may start to exceed the 200 ms latency target.
Search requires compound filtering: amenities, ratings, geo-polygon boundaries, pet policies.
Read-to-write ratio for search queries exceeds 50:1 — Elasticsearch's read-optimized index layout provides far superior throughput.

🧪 Delivering This Design in a System Design Interview

Act 1 — The Double-Booking Race Condition (2 minutes): Describe the New Year's Eve scenario from the introduction. Draw two concurrent requests both reading "Available" from the database and both successfully writing a booking record. Show the resulting state: two confirmed guests, one room, two deposit receipts. Grounding the conversation in a concrete failure scenario immediately demonstrates systems thinking.

Act 2 — The CQRS-Inspired Architecture (5 minutes): Divide the whiteboard into a Search Path on the left and a Booking Path on the right. Show that search goes to Elasticsearch and booking goes to Postgres with SKIP LOCKED. Draw the CDC pipeline between them — this is the key architectural insight that allows the two paths to stay synchronized without coupling them. Explain that the 1–2 second lag in search is an intentional and acceptable trade-off.

Act 3 — Scaling and Edge Cases (3 minutes):

Interviewer Question	Strong Answer
How do you scale to 10 million listings?	Date-partitioned availability table in Postgres; Elasticsearch handles geo-search at full scale
How do you prevent hold abuse by bots?	Require valid payment method on file before granting a hold; rate-limit holds per user session
How does a host cancellation flow work?	Saga pattern: BOOKED → CANCELLED_BY_HOST triggers slot reversion, Kafka refund event, guest notification

🛠️ Open Source Components for Booking Platform Infrastructure

Debezium is a widely used open-source CDC connector that streams Postgres write-ahead log changes into Kafka. It captures every INSERT, UPDATE, and DELETE on the availability_slots table and publishes each as a structured change event. The Elasticsearch sync consumer subscribes to these events and updates the search index in near real time.

Apache Kafka provides the durable event backbone for the entire async pipeline. The Notification Service and the Elasticsearch sync consumer both consume from the same Kafka topic with independent consumer group offsets, allowing each to process events at its own pace without impacting the other.

PostGIS (Postgres geographic extension) handles the geo-coordinate storage for the listing location. While Elasticsearch handles geo-search at scale, the canonical listing location is stored in Postgres with a PostGIS GEOGRAPHY column and a spatial index for administrative queries.
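To illustrate what the Elasticsearch sync consumer does with a Debezium event, the sketch below maps one change-event payload to an index action. It assumes Debezium's default envelope fields (`op`, `before`, `after`), its default encoding of DATE columns as days since the Unix epoch, and a table configured so that delete events carry the full old row. The action names and the `to_index_action` helper are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

EPOCH = date(1970, 1, 1)


def to_index_action(payload: dict) -> Optional[dict]:
    """Translate one change-event payload for availability_slots into a search-index action."""
    op = payload.get("op")  # "c" create, "u" update, "d" delete, "r" snapshot read

    # Deletes only carry the old row; every other operation carries the new one.
    row = payload.get("before") if op == "d" else payload.get("after")
    if row is None:
        return None

    # DATE arrives as an integer day count under the assumed default encoding.
    night = EPOCH + timedelta(days=row["date"])

    is_available = op != "d" and row["status"] == "AVAILABLE"
    return {
        "room_id": row["room_id"],
        "date": night.isoformat(),
        "action": "add_available_date" if is_available else "remove_available_date",
    }
```

Keeping this mapping as a pure function makes it easy to replay events after a consumer restart, since applying the same action twice leaves the index in the same state.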

📚 Lessons Learned From Building and Operating Booking Systems

Lesson 1 — The hold is your correctness anchor. Every architectural decision should be evaluated against one question: "Does this preserve the integrity of the hold?" Adding a caching layer between the Booking Service and Postgres is dangerous if the cached availability can be stale by more than a few milliseconds during the booking transaction.

Lesson 2 — Generate availability rows lazily, not eagerly. Pre-generating 365 rows per room at listing creation time (3.65B rows for 10M listings) is an expensive bulk operation. Generate rows on demand when a search or booking request arrives for a date not yet in the table, and use a background job to pre-warm popular date windows.

Lesson 3 — Monitor the hold abandonment rate. A high hold abandonment rate (guests placing holds and not completing payment) is both a business metric and a system health signal. A sudden spike may indicate that the payment page is slow, that the payment gateway is timing out, or that the hold window is too short for the typical checkout flow.

Lesson 4 — The cancellation refund path is as complex as the booking path. Cancellations must atomically revert availability slots to AVAILABLE, issue a refund via the payment gateway, and notify the host and downstream analytics. Use the Saga pattern for the cancellation flow to ensure each step is idempotent and compensatable if a downstream service is unavailable.
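Lesson 2's on-demand generation interacts with the hold logic: the hold transaction treats a missing row as "taken", so any lazily generated rows have to exist before the hold runs. The sketch below shows one way to do that. It assumes a UNIQUE constraint on (room_id, date), PostgreSQL 13 or later for the built-in `gen_random_uuid()`, and a DB-API 2.0 connection with `%s` placeholders; the helper names are hypothetical.

```python
from datetime import date, timedelta

# ON CONFLICT makes the insert safe when two requests materialize the same night.
ENSURE_SLOT_SQL = """
    INSERT INTO availability_slots (slot_id, room_id, date, status)
    VALUES (gen_random_uuid(), %s, %s, 'AVAILABLE')
    ON CONFLICT (room_id, date) DO NOTHING
"""


def missing_nights(existing: set, check_in: date, check_out: date) -> list:
    """Nights in [check_in, check_out) that have no slot row yet."""
    requested = [check_in + timedelta(days=i) for i in range((check_out - check_in).days)]
    return [night for night in requested if night not in existing]


def ensure_slots(conn, room_id, existing: set, check_in: date, check_out: date) -> int:
    """Create any missing slot rows for the stay and return how many inserts were attempted."""
    to_create = missing_nights(existing, check_in, check_out)
    cur = conn.cursor()
    for night in to_create:
        cur.execute(ENSURE_SLOT_SQL, (room_id, night))
    conn.commit()
    return len(to_create)
```

Running `ensure_slots` in its own short transaction before the hold keeps the hold transaction itself unchanged.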

📌 TLDR & Key Takeaways for Hotel Booking System Design

Core problem: The Double-Booking Race Condition — two concurrent requests both reading "Available" and both writing a booking for the same room and dates.
Solution: SELECT FOR UPDATE SKIP LOCKED in a single Postgres transaction atomically transitions N availability slots from AVAILABLE to HELD in an all-or-nothing operation.
Architecture: CQRS-inspired separation — Elasticsearch for search (100k req/sec), Postgres for booking (500 req/sec), Debezium CDC + Kafka for synchronization.
Hold model: 15-minute window allows payment processing before the slot is reclaimed by the cleanup job.
Key trade-off: Eventual consistency in search (1–2 second lag) is acceptable; strong consistency in the booking transaction is non-negotiable.
At scale: 3.65B availability rows require date-partitioned tables and a composite B-tree index on (room_id, date, status).

System Design HLD: E-Commerce Platform — The Two-Phase Reservation pattern for inventory management shares deep architectural DNA with the Hold-then-Confirm booking model.
System Design HLD: Payment Processing — The payment gateway integration that powers the Confirm step and the refund Saga in the cancellation flow.
System Design HLD: Search Autocomplete — Elasticsearch design patterns that complement the geo-point and date-range search layer of the booking platform.
