System Design HLD Example: Hotel Booking System (Airbnb)
A senior-level HLD for a hotel booking platform handling availability, concurrency, and reservations.
Abstract Algorithms · Intermediate
TLDR: A robust hotel booking system must decrement inventory atomically. The core trade-off is consistency versus availability: we prioritize strong consistency on the booking path (PostgreSQL row locks via SELECT ... FOR UPDATE SKIP LOCKED, with a version counter on each slot) while accepting eventual consistency in exchange for high availability on the search path (Elasticsearch). A two-phase "Hold-then-Confirm" model ensures that inventory isn't leaked when payments fail.
The New Year's Eve Nightmare
Imagine it's 11:59 PM on New Year's Eve. Two different travelers, one in London and one in New York, are looking at the exact same penthouse in Manhattan for the upcoming weekend. Both click "Book Now" at the same millisecond.
In a poorly designed system, the sequence of events looks like this:
- Request A checks the database: "Is the room available?" -> Yes.
- Request B checks the database: "Is the room available?" -> Yes.
- Request A writes a booking record: "Room booked for User A".
- Request B writes a booking record: "Room booked for User B".
Both users receive a confirmation email. Both pay their non-refundable deposits. On Friday, they both show up at the same door with their luggage. This is the Double-Booking Race Condition, and it is the single most important problem a booking system must solve. At scale, "rare" edge cases happen thousands of times a day. If you design for the average case, you fail at the edges.
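The four-step sequence above can be replayed deterministically. The following is a minimal sketch, not the platform's code: `NaiveRoom` and `AtomicRoom` are hypothetical in-memory stand-ins for a database row, and the lock in `AtomicRoom` only models what a database row lock provides.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NaiveRoom:
    """Check-then-write: the availability check and the write are separate steps."""
    bookings: List[str] = field(default_factory=list)

    def is_available(self) -> bool:
        return not self.bookings

    def write_booking(self, user: str) -> None:
        self.bookings.append(user)


@dataclass
class AtomicRoom:
    """Check and write happen under one lock, as a single indivisible step."""
    booked_by: Optional[str] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def book_if_available(self, user: str) -> bool:
        with self._lock:
            if self.booked_by is None:
                self.booked_by = user
                return True
            return False


# Interleave the two requests exactly as in the sequence above.
naive = NaiveRoom()
a_sees_free = naive.is_available()   # Request A: "available?" -> yes
b_sees_free = naive.is_available()   # Request B: "available?" -> yes
if a_sees_free:
    naive.write_booking("user_a")
if b_sees_free:
    naive.write_booking("user_b")
print(naive.bookings)                # ['user_a', 'user_b'] (double booking)

atomic = AtomicRoom()
results = [atomic.book_if_available(u) for u in ("user_a", "user_b")]
print(results)                       # [True, False]
```

The naive version fails because both reads complete before either write. The atomic version makes the second request observe the first request's write.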
Global Reservation Systems: Use Cases & Requirements
Actors
- Guest / Traveler: Searches for rooms, views availability, and makes reservations.
- Host / Property Manager: Manages inventory, sets pricing, and views upcoming bookings.
- Admin: Handles disputes, refunds, and platform-wide monitoring.
Functional Requirements
- Search: Users can search rooms by location (geo-coordinates), date range, and guest count.
- Availability: Users see real-time availability for a listing before booking.
- Reservation (Hold): Selecting a room places a temporary 15-minute hold.
- Booking (Confirm): Successful payment converts a hold into a confirmed booking.
- Cancellation: Releasing a booking restores inventory for those specific dates.
Non-Functional Requirements
- Zero Double-Bookings: Strong consistency is non-negotiable for the final booking transaction.
- High Search Availability: Search should remain functional even if the booking database is under heavy load.
- Low Latency: Search results should return in < 200ms; booking confirmation in < 2s.
- Scalability: Handle 100k searches/sec and 500 bookings/sec (peak holiday spikes).
Basics: Baseline Architecture
At its core, a booking system is an Inventory Management Engine. Unlike a standard e-commerce site where you might have 1,000 units of a SKU, a hotel booking system has "Perishable Inventory." A room night on December 31st is a different "product" than the same room on January 1st.
The baseline architecture involves:
- Inventory Generation: Pre-calculating available slots for every room for the next 365 days.
- The Lock Mechanism: Ensuring that only one user can transition a slot from available to booked.
- The Buffer (Hold): Providing a grace period for payment processing so the user doesn't lose the room mid-transaction.
Without these basics, you end up with "Phantom Inventory": rooms that appear available but are actually locked in failing payment processes.
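The inventory-generation step in the list above can be sketched in a few lines. This is an illustration under assumptions: `generate_slots` and the dictionary keys are hypothetical names, and a real system would bulk-insert these rows into the database rather than build them in memory.

```python
from datetime import date, timedelta
from typing import Dict, List


def generate_slots(room_id: str, first_night: date, days: int = 365) -> List[Dict]:
    """Build one AVAILABLE slot per night, starting at first_night."""
    return [
        {
            "room_id": room_id,
            "date": first_night + timedelta(days=offset),
            "status": "AVAILABLE",
        }
        for offset in range(days)
    ]


slots = generate_slots("room-42", date(2025, 1, 1))
print(len(slots))          # 365
print(slots[-1]["date"])   # 2025-12-31
```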
Mechanics: Distribution & Processing Logic
The distribution of inventory must be handled carefully. When a host adds a new listing, we don't just add one row. We must generate 365 rows in the availability_slots table.
- Inventory Fan-out: Every update to a room's base availability (e.g., taking the room offline for maintenance) must propagate to all 365 days.
- Search Synchronization: Since search is handled by Elasticsearch, we use an asynchronous pipeline. A write to the primary DB triggers a Kafka event, which is then indexed into ES. This introduces a 1-2 second lag, which is acceptable for search but not for booking.
- State Machine: Every booking follows a strict state machine: Available -> Held -> Booked (or back to Available if the hold expires).
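One way to enforce the state machine in application code is an explicit transition table. The sketch below is an assumption about how that might look: `SlotStatus`, `ALLOWED`, and `transition` are illustrative names, and the BOOKED -> AVAILABLE edge is included only because the functional requirements say a cancellation restores inventory.

```python
from enum import Enum


class SlotStatus(Enum):
    AVAILABLE = "AVAILABLE"
    HELD = "HELD"
    BOOKED = "BOOKED"


# Every legal move; anything not listed here is rejected.
ALLOWED = {
    SlotStatus.AVAILABLE: {SlotStatus.HELD},
    SlotStatus.HELD: {SlotStatus.BOOKED, SlotStatus.AVAILABLE},  # confirm, or hold expiry
    SlotStatus.BOOKED: {SlotStatus.AVAILABLE},                   # cancellation
}


def transition(current: SlotStatus, target: SlotStatus) -> SlotStatus:
    """Return the new status, or raise if the move skips a state."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


status = transition(SlotStatus.AVAILABLE, SlotStatus.HELD)
status = transition(status, SlotStatus.BOOKED)
print(status.value)   # BOOKED
```

Attempting `transition(SlotStatus.AVAILABLE, SlotStatus.BOOKED)` raises `ValueError`, which is the point: a slot cannot be booked without first being held.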
Estimations & Design Goals
The Math of Inventory
- Total Listings: 10 Million rooms.
- Booking Window: 1 year (365 days).
- Total Inventory Rows: 10M × 365 = 3.65 billion rows.
- Search-to-Booking Ratio: 200:1. At the 100k searches/sec peak, we might see 500 booking attempts/sec.
Design Goal: Decouple the "Read-Heavy" search path from the "Write-Heavy" booking path. We use a Command Query Responsibility Segregation (CQRS) inspired approach where Elasticsearch handles the searches and PostgreSQL handles the ACID transactions.
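The inventory arithmetic can be checked with a short back-of-envelope script. The bytes-per-row and nights-per-booking figures below are assumptions chosen only for illustration; the listing count, booking window, and 500 holds/sec come from the estimates and requirements above.

```python
LISTINGS = 10_000_000
BOOKING_WINDOW_DAYS = 365
PEAK_HOLDS_PER_SEC = 500

# Assumptions for illustration only, not measured values.
ASSUMED_BYTES_PER_ROW = 100
ASSUMED_NIGHTS_PER_BOOKING = 3

total_rows = LISTINGS * BOOKING_WINDOW_DAYS
raw_table_gb = total_rows * ASSUMED_BYTES_PER_ROW / 1e9
slot_writes_per_sec = PEAK_HOLDS_PER_SEC * ASSUMED_NIGHTS_PER_BOOKING

print(f"{total_rows:,} rows")      # 3,650,000,000 rows
print(f"{raw_table_gb:.0f} GB")    # 365 GB
print(slot_writes_per_sec)         # 1500
```

Under these assumptions the raw table is a few hundred gigabytes before indexes, and the write rate on slot rows is modest. The row count matters far more than the write rate, which is why partitioning comes up later.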
High-Level Design: Separating Search from Booking
The following architecture ensures that high-volume search traffic never interferes with the critical booking path.
graph TD
User((User)) --> LB[Load Balancer]
LB --> AG[API Gateway]
subgraph Search_Path
AG --> SS[Search Service]
SS --> ES[(Elasticsearch: Geo + Dates)]
SS --> RC[(Search Cache: Redis)]
end
subgraph Booking_Path
AG --> BS[Booking Service]
BS --> AS[Availability Service]
AS --> PDB[(Primary DB: Postgres)]
BS --> PS[Payment Service]
end
subgraph Async_Sync
PDB --> CDC[Debezium / CDC]
CDC --> Kafka[Kafka]
Kafka --> SS
Kafka --> NS[Notification Service]
end
The diagram captures the defining architectural decision: a hard separation between the Search Path (Elasticsearch + Redis) and the Booking Path (Postgres with SELECT FOR UPDATE SKIP LOCKED). The CDC pipeline via Debezium keeps the two paths synchronized without coupling them: a booking written to Postgres propagates to Elasticsearch within 1-2 seconds, keeping search results fresh while ensuring the booking path never touches the search cluster.
Deep Dive: How Postgres Atomically Prevents the Double-Booking Race Condition
The Hold mechanism is the most critical internal component. Understanding exactly how it works at the database level reveals why no amount of application-level locking can replace it: only the database can guarantee atomicity across concurrent transactions.
Internals: The Hold-then-Confirm State Machine
When a guest selects a room and date range, the Booking Service must atomically transition N rows in the availability_slots table (one per night) from AVAILABLE to HELD. The key word is "atomically": if any single night in the requested range is already HELD or BOOKED by another session, the entire operation must roll back with no writes committed.
This is implemented as a single Postgres transaction using SELECT FOR UPDATE SKIP LOCKED:
| Step | SQL Operation | Why This Mechanism |
|---|---|---|
| 1. Lock target rows | SELECT ... WHERE status='AVAILABLE' FOR UPDATE SKIP LOCKED | Non-blocking: rows locked by another session are skipped, so the query returns fewer rows immediately |
| 2. Check completeness | Application checks that all N nights were returned | A missing row means another session has locked, holds, or has booked that night |
| 3. Update to HELD | UPDATE availability_slots SET status='HELD', held_until=NOW()+interval '15 min', version=version+1 | Atomic state transition with optimistic version increment |
| 4. Create booking | INSERT INTO bookings (status='HELD') | Booking record created within the same transaction |
| 5. COMMIT | All-or-nothing guarantee | Postgres atomicity ensures no partial holds |
The SKIP LOCKED clause is the key insight. Without it, SELECT FOR UPDATE would block and wait for the competing transaction to release its lock, potentially for seconds. With SKIP LOCKED, if another session has the row locked, the query returns immediately and simply omits that row. The application then detects the incomplete result and returns "unavailable" to the second guest without any waiting.
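The five steps can be sketched as a single function over a DB-API connection. This is a sketch under stated assumptions, not the definitive implementation: it assumes a psycopg2-style connection (`%s` placeholders, cursor usable as a context manager), one row per room per night, and a hypothetical column layout for `bookings`, since the source only says a HELD booking row is inserted. The `status = 'AVAILABLE'` filter is included so that nights already committed as HELD or BOOKED also come back as missing.

```python
from datetime import date

# Step 1: lock only the nights that are still AVAILABLE. Rows locked by a
# competing transaction are skipped instead of waited on.
LOCK_SQL = """
    SELECT slot_id
      FROM availability_slots
     WHERE room_id = %s AND date >= %s AND date < %s
       AND status = 'AVAILABLE'
       FOR UPDATE SKIP LOCKED
"""

# Step 3: safe to update by range because step 2 proved we hold every row in it.
HOLD_SQL = """
    UPDATE availability_slots
       SET status = 'HELD', held_by = %s,
           held_until = NOW() + INTERVAL '15 minutes',
           version = version + 1
     WHERE room_id = %s AND date >= %s AND date < %s
"""

# Step 4: hypothetical bookings columns, assumed for illustration.
BOOKING_SQL = """
    INSERT INTO bookings (room_id, guest_id, check_in, check_out, status)
    VALUES (%s, %s, %s, %s, 'HELD')
"""


def try_hold(conn, room_id: str, guest_id: str, check_in: date, check_out: date) -> bool:
    """Hold every night in [check_in, check_out) or none of them."""
    nights = (check_out - check_in).days
    with conn.cursor() as cur:
        cur.execute(LOCK_SQL, (room_id, check_in, check_out))
        # Step 2: fewer rows than nights means at least one night is taken.
        if len(cur.fetchall()) != nights:
            conn.rollback()
            return False
        cur.execute(HOLD_SQL, (guest_id, room_id, check_in, check_out))
        cur.execute(BOOKING_SQL, (room_id, guest_id, check_in, check_out))
    conn.commit()  # Step 5: all writes become visible together.
    return True
```

A caller would treat `False` as "unavailable" and respond immediately; nothing in this path waits on another guest's transaction.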
| Field | Type | Description |
|---|---|---|
| slot_id | UUID | Primary key for the availability slot |
| room_id | UUID | FK to rooms table |
| date | DATE | The specific night this slot represents |
| status | ENUM | AVAILABLE, HELD, BOOKED, BLOCKED |
| held_by | UUID | Guest session ID (null when AVAILABLE) |
| held_until | TIMESTAMP | Expiry time for the hold (15-minute TTL) |
| version | INTEGER | Optimistic lock counter |
Performance Analysis: Balancing Search Scale Against Booking Correctness
The CQRS-inspired architecture allows each path to scale completely independently.
| Path | Technology | Peak Throughput | Latency Target |
|---|---|---|---|
| Search (geo + date range) | Elasticsearch | 100,000 req/sec | < 200 ms |
| Availability pre-check | Redis bitmap cache | 10,000 req/sec | < 50 ms |
| Hold creation | Postgres SKIP LOCKED | 500 req/sec | < 500 ms |
| Booking confirmation | Postgres + payment gateway | 200 req/sec | < 2,000 ms |
The search path uses Elasticsearch with a geo-point mapping and a date-range filter on a denormalized availability index. Because this index is refreshed asynchronously (Debezium CDC -> Kafka -> Elasticsearch consumer), there is an intentional 1-2 second lag between a room becoming HELD and that change appearing in search results. This lag is acceptable because the Hold mechanism at the Booking Service provides the ultimate correctness guarantee: a guest who sees an "available" result in search but then gets an "unavailable" response at booking has simply encountered the propagation window. The system remains correct even during this lag.
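The "Redis bitmap cache" row in the table above can be illustrated with plain integers standing in for the bitmap. This is a sketch under assumptions: one bit per night, a set bit meaning "not available", and hypothetical helper names. In Redis the same bits could be written and read with the SETBIT and GETBIT commands; the cache is only a fast pre-check, and the Postgres hold remains the source of truth.

```python
def nights_mask(first_night: int, num_nights: int) -> int:
    """Bitmask with num_nights consecutive bits set, starting at bit first_night."""
    return ((1 << num_nights) - 1) << first_night


def mark_unavailable(bitmap: int, first_night: int, num_nights: int) -> int:
    """Return a new bitmap with the given nights flagged as taken."""
    return bitmap | nights_mask(first_night, num_nights)


def looks_available(bitmap: int, first_night: int, num_nights: int) -> bool:
    """True if none of the requested nights is flagged in the cached bitmap."""
    return (bitmap & nights_mask(first_night, num_nights)) == 0


# Nights are day offsets within the cached window, e.g. 0 = first day cached.
bitmap = mark_unavailable(0, first_night=10, num_nights=3)   # nights 10, 11, 12 taken

print(looks_available(bitmap, 13, 2))   # True  (nights 13-14 are free)
print(looks_available(bitmap, 8, 3))    # False (night 10 is taken)
```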
Real-World Booking Systems: Airbnb, Booking.com, and Expedia
Airbnb faced the double-booking problem at massive scale as "Instant Book" listings grew. Their solution is a multi-tier availability system: a fast read layer (Redis cache of per-room per-month availability bitmaps) for search, and a strong-consistency write layer (Postgres with row-level locking) for bookings. The Instant Book feature, where a guest can confirm immediately without waiting for host approval, was only possible after Airbnb built a hold mechanism capable of guaranteeing atomic availability from click to confirmation within 2 seconds.
Booking.com uses a date-level inventory system with one row per room per night, exactly as described in this guide. Their data engineering team processes over 1 billion availability updates per day as hotels worldwide manually manage their calendars through the Booking.com extranet. The Kafka pipeline ingesting these updates into Elasticsearch is one of the highest-throughput event streams in European tech infrastructure.
Expedia solved the meta-search aggregation problem differently: rather than holding inventory itself, Expedia passes the hold request directly to the supplier (hotel) API at booking time. This "pass-through" model shifts the hold complexity to the supplier but introduces latency and availability risk from external API calls, a trade-off Expedia accepts in exchange for avoiding the cost of maintaining 3.65 billion inventory rows.
Consistency vs. Availability: Trade-offs in the Booking Path
| Design Decision | Advantage | Risk |
|---|---|---|
| Date-level inventory (one row per night) | Precise partial-week bookings supported | 3.65B rows; requires date-partitioned table and composite index |
| SKIP LOCKED for holds | Non-blocking; competing holds fail fast | Requires robust retry logic in the application layer |
| ES for search, Postgres for booking | Search scales independently to 100k req/sec | 1-2 second search-to-reality propagation lag |
| 15-minute hold TTL | Graceful payment processing window | Popular rooms unavailable during hold if payment fails slowly |
| CQRS read/write separation | Zero cross-path interference | Data synchronization complexity via CDC pipeline |
Critical Failure Mode: The Hold-Expiry and Payment Gap. A guest places a hold, begins payment, and the payment takes 16 minutes (possible with 3D Secure strong authentication). The hold expires at 15 minutes. A background cleanup job reclaims the slot as AVAILABLE. Another guest immediately books the same room. The first guest's payment then succeeds, creating a double booking. Mitigation: The payment confirmation endpoint must re-validate that the hold is still active (status=HELD and held_until > NOW()) in the same transaction that converts the hold to BOOKED. If the hold has expired, the system must immediately refund and surface an "unable to confirm" message, then re-attempt the hold if inventory is still available.
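The re-validation rule in the mitigation can be sketched in memory. The names here (`Slot`, `confirm_booking`, the "BOOKED"/"REFUND" return values) are hypothetical. In Postgres the same check would typically be folded into the UPDATE itself, for example by adding `status = 'HELD' AND held_until > NOW()` to its WHERE clause and comparing the affected row count to the number of nights.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Slot:
    status: str = "AVAILABLE"
    held_by: Optional[str] = None
    held_until: Optional[datetime] = None


def confirm_booking(slots: List[Slot], guest_id: str, now: datetime) -> str:
    """Convert a hold to BOOKED only if every night is still held by this guest."""
    hold_is_live = all(
        s.status == "HELD"
        and s.held_by == guest_id
        and s.held_until is not None
        and s.held_until > now
        for s in slots
    )
    if not hold_is_live:
        return "REFUND"          # payment succeeded but the hold is gone
    for s in slots:
        s.status = "BOOKED"
        s.held_until = None
    return "BOOKED"


t0 = datetime(2025, 12, 31, 23, 0)


def two_held_nights() -> List[Slot]:
    return [Slot("HELD", "guest-1", t0 + timedelta(minutes=15)) for _ in range(2)]


on_time = confirm_booking(two_held_nights(), "guest-1", t0 + timedelta(minutes=10))
too_late = confirm_booking(two_held_nights(), "guest-1", t0 + timedelta(minutes=16))
print(on_time, too_late)   # BOOKED REFUND
```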
Choosing the Right Consistency Model for Your Booking System
Use Postgres with SKIP LOCKED when:
- Inventory has natural row-level granularity (one row per night per room).
- Concurrent booking attempts are moderate (under 1,000 concurrent holds per cluster).
- Strong consistency is non-negotiable because the product being sold has real-world, non-refundable value.
Use Redis distributed locking (Redlock algorithm) when:
- Hold operations span multiple services or databases that cannot participate in a single Postgres transaction.
- Sub-millisecond lock acquisition is required and the Postgres round-trip overhead is prohibitive.
- Inventory granularity is coarser: whole-room availability rather than per-night slots.
When to introduce Elasticsearch for search:
- Total listing count exceeds 500,000, where Postgres full-text and geo queries begin to exceed the 200 ms latency target.
- Search requires compound filtering: amenities, ratings, geo-polygon boundaries, pet policies.
- Read-to-write ratio for search queries exceeds 50:1, where Elasticsearch's read-optimized index layout provides far superior throughput.
Delivering This Design in a System Design Interview
Act 1: The Double-Booking Race Condition (2 minutes). Describe the New Year's Eve scenario from the introduction. Draw two concurrent requests both reading "Available" from the database and both successfully writing a booking record. Show the resulting state: two confirmed guests, one room, two deposit receipts. Grounding the conversation in a concrete failure scenario immediately demonstrates systems thinking.
Act 2: The CQRS-Inspired Architecture (5 minutes). Divide the whiteboard into a Search Path on the left and a Booking Path on the right. Show that search goes to Elasticsearch and booking goes to Postgres with SKIP LOCKED. Draw the CDC pipeline between them; this is the key architectural insight that allows the two paths to stay synchronized without coupling them. Explain that the 1-2 second lag in search is an intentional and acceptable trade-off.
Act 3: Scaling and Edge Cases (3 minutes):
| Interviewer Question | Strong Answer |
|---|---|
| How do you scale to 10 million listings? | Date-partitioned availability table in Postgres; Elasticsearch handles geo-search at full scale |
| How do you prevent hold abuse by bots? | Require valid payment method on file before granting a hold; rate-limit holds per user session |
| How does a host cancellation flow work? | Saga pattern: BOOKED -> CANCELLED_BY_HOST triggers slot reversion, Kafka refund event, guest notification |
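The "rate-limit holds per user session" answer in the table above can be sketched as a sliding-window limiter. This is a single-process illustration with hypothetical names and limits; a production deployment would more likely keep the counters in a shared store so that every Booking Service instance sees the same state.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class HoldRateLimiter:
    """Allow at most max_holds hold attempts per session within window_seconds."""

    def __init__(self, max_holds: int, window_seconds: float) -> None:
        self.max_holds = max_holds
        self.window_seconds = window_seconds
        self._attempts: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, session_id: str, now: float) -> bool:
        attempts = self._attempts[session_id]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_holds:
            return False
        attempts.append(now)
        return True


limiter = HoldRateLimiter(max_holds=2, window_seconds=60)
decisions = [limiter.allow("session-a", t) for t in (0, 1, 2, 60)]
print(decisions)   # [True, True, False, True]
```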
Open Source Components for Booking Platform Infrastructure
Debezium is the standard CDC connector used to stream Postgres write-ahead log changes into Kafka. It captures every INSERT, UPDATE, and DELETE from the availability_slots table and publishes them as structured events. The Elasticsearch sync consumer subscribes to these events and updates the search index in near-real-time.
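The sync consumer's core logic can be sketched as a function that applies one change event to an index. Debezium change events carry `op` (`c` create, `u` update, `d` delete, `r` snapshot read) together with `before` and `after` row images. Everything else here is an assumption for illustration: the dictionary stands in for the Elasticsearch index, the event is assumed to be the already-unwrapped payload, and the document shape is hypothetical.

```python
from typing import Dict


def apply_change(index: Dict[str, dict], event: dict) -> None:
    """Apply one Debezium-style change event to an in-memory search index."""
    if event["op"] == "d":
        # Deletes carry the old row image, which includes the primary key.
        index.pop(event["before"]["slot_id"], None)
        return
    row = event["after"]                      # c, u, and r all carry the new row
    index[row["slot_id"]] = {
        "room_id": row["room_id"],
        "date": row["date"],
        "available": row["status"] == "AVAILABLE",
    }


index: Dict[str, dict] = {}
slot = {"slot_id": "s1", "room_id": "r1", "date": "2025-12-31", "status": "AVAILABLE"}

apply_change(index, {"op": "c", "before": None, "after": slot})
apply_change(index, {"op": "u", "before": slot, "after": {**slot, "status": "HELD"}})
print(index["s1"]["available"])   # False
```

Because each event overwrites the document keyed by `slot_id`, replaying the same event after a consumer restart leaves the index unchanged, which suits Kafka's at-least-once delivery.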
Apache Kafka provides the durable event backbone for the entire async pipeline. The Notification Service and the Elasticsearch sync consumer both consume from the same Kafka topic with independent consumer group offsets, allowing each to process events at its own pace without impacting the other.
PostGIS (Postgres geographic extension) handles the geo-coordinate storage for the listing location. While Elasticsearch handles geo-search at scale, the canonical listing location is stored in Postgres with a PostGIS GEOGRAPHY column and a spatial index for administrative queries.
Lessons Learned From Building and Operating Booking Systems
Lesson 1: The hold is your correctness anchor. Every architectural decision should be evaluated against one question: "Does this preserve the integrity of the hold?" Adding a caching layer between the Booking Service and Postgres is dangerous if the cached availability can be stale by more than a few milliseconds during the booking transaction.
Lesson 2: Generate availability rows lazily, not eagerly. Pre-generating 365 rows per room at listing creation time (3.65B rows for 10M listings) is an expensive bulk operation. Generate rows on demand when a search or booking request arrives for a date not yet in the table, and use a background job to pre-warm popular date windows.
Lesson 3: Monitor the hold abandonment rate. A high hold abandonment rate (guests placing holds and not completing payment) is both a business metric and a system health signal. A sudden spike may indicate that the payment page is slow, that the payment gateway is timing out, or that the hold window is too short for the typical checkout flow.
Lesson 4: The cancellation refund path is as complex as the booking path. Cancellations must atomically revert availability slots to AVAILABLE, issue a refund via the payment gateway, and notify the host and downstream analytics. Use the Saga pattern for the cancellation flow to ensure each step is idempotent and compensatable if a downstream service is unavailable.
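The Saga idea from Lesson 4 can be sketched as an orchestrator that pairs each step with a compensation. The step names below are hypothetical, and the sketch shows backward recovery only; whether a failed refund should undo the slot release or simply be retried is a product decision this example does not settle.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]   # (action, compensation)


def run_saga(steps: Sequence[Step]) -> bool:
    """Run actions in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log: List[str] = []


def release_slots() -> None:
    log.append("slots -> AVAILABLE")


def restore_booking() -> None:
    log.append("slots -> BOOKED (compensation)")


def issue_refund() -> None:
    raise RuntimeError("payment gateway unavailable")   # simulated outage


def no_compensation() -> None:
    pass


succeeded = run_saga([(release_slots, restore_booking), (issue_refund, no_compensation)])
print(succeeded, log)
# False ['slots -> AVAILABLE', 'slots -> BOOKED (compensation)']
```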
TLDR & Key Takeaways for Hotel Booking System Design
- Core problem: The Double-Booking Race Condition, where two concurrent requests both read "Available" and both write a booking for the same room and dates.
- Solution: SELECT FOR UPDATE SKIP LOCKED in a single Postgres transaction atomically transitions N availability slots from AVAILABLE to HELD in an all-or-nothing operation.
- Architecture: CQRS-inspired separation, with Elasticsearch for search (100k req/sec), Postgres for booking (500 req/sec), and Debezium CDC + Kafka for synchronization.
- Hold model: 15-minute window allows payment processing before the slot is reclaimed by the cleanup job.
- Key trade-off: Eventual consistency in search (1-2 second lag) is acceptable; strong consistency in the booking transaction is non-negotiable.
- At scale: 3.65B availability rows require date-partitioned tables and a composite B-tree index on (room_id, date, status).
Related Posts
- System Design HLD: E-Commerce Platform. The Two-Phase Reservation pattern for inventory management shares deep architectural DNA with the Hold-then-Confirm booking model.
- System Design HLD: Payment Processing. The payment gateway integration that powers the Confirm step and the refund Saga in the cancellation flow.
- System Design HLD: Search Autocomplete. Elasticsearch design patterns that complement the geo-point and date-range search layer of the booking platform.