System Design HLD Example: Hotel Booking System (Airbnb)
A practical interview-ready HLD for a hotel booking platform handling availability, concurrency, and reservations.
Abstract AlgorithmsTLDR: Design a hotel booking system like Airbnb. This article covers availability tracking with date-level inventory, double-booking prevention via optimistic locking, two-phase reservation (hold โ confirm), geospatial search with Elasticsearch, and payment pre-authorisation โ with concrete SQL schema, Redis key layout, and a targeted Java
@Versionsnippet.TLDR: Booking systems must solve one problem above all others: preventing two guests from successfully reserving the same room for the same night, even under high concurrency.
At 11:59 PM on New Year's Eve, two guests click "Book Now" on the same Manhattan apartment. Both see "Available." Both reach the payment screen. Both receive a confirmation email. Their hosts wake up to two strangers at the door on the same night. This is the double-booking race condition, and it is the reason hotel booking systems are a favourite system design interview question.
Airbnb's engineering team encountered this exact failure mode as the platform scaled beyond 100 million nights booked. The fix required rethinking availability tracking at the database row level, introducing a two-phase hold-then-confirm reservation model, and separating the read-heavy search workload from the write-critical booking path.
By the end of this walkthrough you will know why availability tracking uses a date-level inventory table rather than a simple "is booked" flag, why optimistic locking outperforms pessimistic locks for most booking loads, and why search and booking must never share the same data store.
๐ The Double-Booking Problem: Two Guests, One Room, One Disaster
Actors
| Actor | Role |
| Guest | Searches for rooms, views availability, creates and cancels bookings |
| Host | Lists property, manages availability calendar, receives payments |
| Search Service | Handles geo-location + date range queries against Elasticsearch |
| Availability Service | Tracks which dates are booked per room; enforces the hold and confirm lifecycle |
| Booking Service | Orchestrates hold โ payment โ confirm with idempotency |
| Payment Service | Pre-authorises card at booking creation; captures charge 48 h before check-in |
| Notification Service | Sends confirmation emails, host alerts, and cancellation notices |
Use Cases
- Primary interview prompt: Design a hotel or room booking platform like Airbnb or Booking.com.
- Core user journeys:
- Search โ guest queries rooms by city or coordinates, date range, guest count, and filters (price, amenities, rating)
- View availability โ guest views the per-night calendar for a specific listing before committing
- Create booking โ guest selects dates โ system places a 15-minute hold โ payment pre-auth โ confirmed booking
- Cancel booking โ guest cancels; system releases held or booked slots and triggers a refund per the host's cancellation policy
- Host availability management โ host blocks or opens dates; sets nightly price and minimum stay rules
- Payment lifecycle โ pre-authorise at booking creation; capture 48 h before check-in; refund on cancellation
- Reviews โ post-checkout; guest and host each submit a rating
- Read and write paths are explained separately to keep bottlenecks and consistency boundaries explicit.
The double-booking race condition is the reason this system appears in nearly every senior engineering interview: it looks like straightforward CRUD until you model two users simultaneously selecting the same room.
๐ Functional Requirements: What the System Must Cover
In Scope
- Room search โ
GET /searchwith geo bounding box orgeo_distance, date range, guest count, price ceiling - Availability calendar โ
GET /rooms/{id}/availability?from=&to=returning per-date status - Create booking โ
POST /bookingswithIdempotency-Key; places hold then awaits payment confirmation - Cancel booking โ
DELETE /bookings/{id}; configurable refund policy per host (flexible, moderate, strict) - Payment pre-auth and capture โ integrate via Payment Service; pre-auth at hold; capture before check-in
- Host calendar management โ block dates, set nightly pricing, manage minimum stay and advance notice rules
Out of Scope (v1 boundary)
- Dynamic pricing and revenue optimisation (ML-driven surge pricing algorithms)
- Fraud scoring and risk model evaluation during booking
- Multi-currency FX conversion
- Loyalty programmes and reward redemption
Functional Breakdown
| Feature | API Contract | Key Decision |
| Room search | GET /search?lat=&lon=&radius=&from=&to=&guests= | Elasticsearch for geo + date filtering; Availability Service for confirmed open slots |
| Availability calendar | GET /rooms/{id}/availability?from=&to= | availability_slots table; Redis cache with 30 s TTL |
| Create booking | POST /bookings + Idempotency-Key | Two-phase: HOLD slots (version check) โ payment โ CONFIRM; UNIQUE(room_id, slot_date) backstop |
| Cancel | DELETE /bookings/{id} | Release held or booked slots atomically; enqueue refund per cancellation policy |
| Host calendar | PATCH /rooms/{id}/availability | Batch update availability_slots; invalidate Redis availability cache |
A strong answer names non-goals explicitly. Interviewers use this to judge prioritisation quality and architectural maturity under time constraints.
โ๏ธ Non-Functional Requirements and Design Goals
| Dimension | Target | Why It Matters |
| Correctness | Zero double-bookings under 10K concurrent requests for the same room | The financial and reputational cost of a double-booking is unbounded |
| Availability | 99.99 % search path; 99.9 % booking path | Downtime during peak booking windows (New Year's Eve, school holidays) directly costs revenue |
| Performance | Search p95 < 200 ms; booking confirmation p95 < 2 s | Guests abandon checkout when confirmation is slow; search must feel instant |
| Scalability | 150M DAU; 10K search QPS; 500 booking TPS peak | Peak booking windows produce sudden 10ร write spikes on popular listings |
| Consistency | Strong consistency for availability slots; eventual for reviews and analytics | A guest must never see "available" for a room that is already held or booked |
๐ง Deep Dive: Estimations and Design Goals
The Internals: Data Model and Locking Strategy
Three tables form the availability backbone. The central insight is that availability is tracked at the date-slot level โ each individual night is an independent lockable unit. This design means holding December 31st does not require locking December 30th, which maximises write concurrency for partial-week bookings.
CREATE TABLE rooms (
room_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
host_id UUID NOT NULL,
title TEXT NOT NULL,
latitude DECIMAL(9,6) NOT NULL,
longitude DECIMAL(9,6) NOT NULL,
price_per_night INT NOT NULL, -- stored in cents; never FLOAT
max_guests SMALLINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_rooms_host ON rooms(host_id);
-- One row per room per night โ the lockable unit of availability.
-- The `version` column is the optimistic locking counter.
CREATE TABLE availability_slots (
slot_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
room_id UUID NOT NULL REFERENCES rooms(room_id),
slot_date DATE NOT NULL,
status TEXT NOT NULL DEFAULT 'available'
CHECK (status IN ('available', 'held', 'booked', 'blocked')),
booking_id UUID, -- NULL when available or blocked
version INT NOT NULL DEFAULT 0, -- incremented on every UPDATE
held_until TIMESTAMPTZ, -- expiry for held slots; 15-minute TTL
UNIQUE (room_id, slot_date) -- one slot per room per night; DB backstop
);
CREATE INDEX idx_slots_room_date ON availability_slots(room_id, slot_date);
CREATE INDEX idx_slots_held_expiry ON availability_slots(status, held_until)
WHERE status = 'held'; -- efficient sweep for expired holds
-- Each booking spans one or more availability_slots rows.
CREATE TABLE bookings (
booking_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
idempotency_key TEXT UNIQUE NOT NULL, -- client-supplied dedup anchor
room_id UUID NOT NULL REFERENCES rooms(room_id),
guest_id UUID NOT NULL,
check_in DATE NOT NULL,
check_out DATE NOT NULL,
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','held','confirmed','cancelled','expired')),
total_cents BIGINT NOT NULL,
payment_auth_id TEXT, -- pre-auth reference from Payment Service
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
confirmed_at TIMESTAMPTZ,
cancelled_at TIMESTAMPTZ
);
CREATE INDEX idx_bookings_room_dates ON bookings(room_id, check_in, check_out);
CREATE INDEX idx_bookings_guest ON bookings(guest_id, created_at DESC);
The version column is the double-booking prevention mechanism. Every UPDATE on a slot increments version by 1. A concurrent UPDATE that reads the old version returns 0 rows โ the application layer treats 0 rows as a conflict and retries or rejects the booking attempt. The UNIQUE(room_id, slot_date) constraint is the database-level backstop that catches any conflict the application layer misses.
๐ Data Model: Key Relationships
erDiagram
ROOMS {
uuid room_id PK
uuid host_id
text title
decimal latitude
decimal longitude
int price_per_night
smallint max_guests
}
AVAILABILITY_SLOTS {
uuid slot_id PK
uuid room_id FK
date slot_date
text status
uuid booking_id FK
int version
timestamptz held_until
}
BOOKINGS {
uuid booking_id PK
text idempotency_key UK
uuid room_id FK
uuid guest_id
date check_in
date check_out
text status
bigint total_cents
text payment_auth_id
}
ROOMS ||--|{ AVAILABILITY_SLOTS : "one slot per night"
BOOKINGS ||--|{ AVAILABILITY_SLOTS : "holds or books"
Redis key layout:
| Key Pattern | Value | TTL |
availability:{roomId}:{from}:{to} | JSON array of slot statuses | 30 seconds |
booking:lock:{roomId}:{checkIn}:{checkOut} | "1" (distributed lock) | 20 seconds |
search:cache:{sha256(query)} | JSON search result set | 5 minutes |
idempotency:booking:{key} | Serialised booking response JSON | 24 hours |
Estimations
Assumptions for Airbnb-scale:
| Dimension | Assumption | Derivation |
| DAU | 150M | ~10 % of 1.5B MAU active daily |
| Search QPS | ~10K steady; ~26K peak | 150M ร 5 searches/day รท 86,400 โ 8.7K steady; 3ร holiday peak |
| Booking TPS | ~500 | Peak holiday windows; ~1 % of searches convert to a hold |
| Listings | 7M rooms worldwide | Airbnb scale; grows ~10 % year-over-year |
| Availability rows | ~2.5B | 7M rooms ร 365 days; pre-populated for 12-month rolling window |
| Redis availability cache | ~200 GB | 7M rooms ร 28-day window ร ~1 KB per window |
| Booking rows per year | ~180M | 500 bookings/sec ร 3,600 s ร ~100 peak hours/year |
Key insight: The search-to-booking QPS ratio is roughly 20:1. This asymmetry drives the single most important architectural decision: Elasticsearch for search (optimised for geo and date read throughput) and PostgreSQL for bookings (optimised for ACID writes with row-level versioning). Mixing both workloads in one database would cause read-write contention during the booking peaks that matter most.
Design Goals
| Goal | Why It Matters | Decision It Drives |
| Zero double-bookings | Two concurrent HOLD requests for the same room and date must not both succeed | Optimistic locking (version column); UNIQUE(room_id, slot_date) as the DB backstop |
| 15-minute hold window | Guest must complete checkout without the listing disappearing; abandoned holds must release | held_until timestamp + background cleanup job on a 5-minute cron schedule |
| Search must not query the booking DB | Booking DB cannot absorb 10K read QPS from search without degrading write performance | Elasticsearch index synced asynchronously from PostgreSQL via CDC (Debezium) |
| Idempotent booking creation | Network retries must not create two confirmed bookings for the same checkout attempt | idempotency_key UNIQUE constraint at DB level; Redis 24 h response cache as the fast path |
Performance Analysis: Bottlenecks and Scaling Pressure
| Pressure Point | Symptom | First Response | Second Response |
| Hot listing (popular room) | Hundreds of concurrent HOLD attempts โ OCC conflict rate spikes | Exponential backoff + 3 retries in Booking Service | Redis SETNX lock per (roomId, dateRange) to serialise hold attempts before DB |
| Availability cache staleness | Guests see "available" for an already-held room during booking wave | Shorten cache TTL to 10 s during peak window | Publish slot.status_changed event; cache consumer invalidates the key immediately |
| Search QPS peak | Elasticsearch query latency climbs under holiday load | Read replicas behind a load balancer | Cache identical query hashes for 5 minutes at the Search Service layer |
| Expired hold accumulation | held rows accumulate and incorrectly suppress availability | Cron job every 5 minutes bulk-updates expired holds back to available | idx_slots_held_expiry partial index makes sweep fast even at 2.5B rows |
| Booking DB write contention | availability_slots UPDATE p99 climbs under sustained 500 TPS | Partition availability_slots by room_id range | Separate read replica for availability calendar reads; primary for writes only |
๐ High-Level Architecture
Five services handle the booking platform. The critical topology decision is that the Search Service and the Booking Service never share a database: Elasticsearch absorbs all geo and date read traffic; PostgreSQL handles only ACID availability writes. Kafka decouples booking confirmation events from notification delivery and Elasticsearch index updates.
graph TD
Client["๐ง Guest / Host Client"]
GW["API Gateway\n(Auth ยท Rate Limit ยท Routing)"]
Search["Search Service\n(Elasticsearch)"]
Avail["Availability Service\n(PostgreSQL)"]
Booking["Booking Service\n(PostgreSQL)"]
Payment["Payment Service\n(Stripe / Adyen)"]
Notify["Notification Service\n(SES / FCM)"]
Redis["Redis\n(Availability Cache\nIdempotency Store\nBooking Locks)"]
ES["Elasticsearch\n(Geo + Date Index)"]
DB[("PostgreSQL\nBookings + Availability")]
Kafka["Kafka\n(Event Bus)"]
Client --> GW
GW --> Search
GW --> Booking
Search --> ES
Search --> Redis
Booking --> Avail
Booking --> Redis
Avail --> DB
Booking --> DB
Booking --> Payment
Booking --> Kafka
Kafka --> Notify
Kafka --> ES
The API Gateway enforces authentication and rate limiting (guests are limited to 10 booking attempts per minute to prevent inventory squatting). It routes search traffic to the Search Service and booking traffic to the Booking Service, keeping the two latency profiles and consistency requirements fully isolated.
Component responsibilities at a glance:
| Component | Responsibility | Technology |
| Search Service | Geo + date + filter queries; returns listing metadata | Elasticsearch with geo_distance + range filters |
| Availability Service | Date-level inventory; hold/book/release lifecycle | PostgreSQL; availability_slots with @Version |
| Booking Service | Orchestrate hold โ payment โ confirm; idempotency | Spring Boot; PostgreSQL; Redis SETNX lock |
| Payment Service | Pre-auth at booking; capture before check-in; refund | Stripe adapter with client-supplied idempotency keys |
| Notification Service | Guest confirmation, host alert, cancellation notices | Kafka consumer; async delivery via SES and FCM |
| Redis | Availability cache (30 s TTL); booking lock (SETNX); idempotency store | Redis Cluster |
| Kafka | Decouple booking events from notifications and ES index updates | Topics: booking.confirmed, booking.cancelled, slot.status_changed |
๐ Real-World Application: How Airbnb Engineers Eliminated Double-Bookings
Airbnb's early architecture used a naive check-and-book pattern: read availability, then if available create the booking. This is a classic TOCTOU (time-of-check-time-of-use) race condition. Two guests reading at the same millisecond both see the room as available, both write a booking row, and both receive a confirmation email โ while the host faces an impossible situation.
The production fix was a two-phase model with optimistic locking on the availability slot row:
Phase 1 โ HOLD: The Booking Service attempts to UPDATE all availability_slots rows for the requested dates from available to held, checking the current version value. If any slot's UPDATE returns 0 rows affected (because another request already incremented that version), the hold fails immediately. The guest sees "Room no longer available for selected dates."
Phase 2 โ CONFIRM: After successful payment pre-authorisation, the service transitions slot status from held to booked, again with a version check. On cancellation or payment failure, it reverts to available.
// JPA @Version causes Hibernate to include the version in every UPDATE WHERE clause.
// If the version has changed (another request won the race), Hibernate throws
// OptimisticLockingFailureException โ the application layer retries up to 3 times.
@Entity
@Table(name = "availability_slots")
public class AvailabilitySlot {
@Id
private UUID slotId;
@Column(nullable = false)
private UUID roomId;
@Column(nullable = false)
private LocalDate slotDate;
@Enumerated(EnumType.STRING)
private SlotStatus status; // AVAILABLE โ HELD โ BOOKED / back to AVAILABLE
@Version // Hibernate increments this on every UPDATE
private int version; // Concurrent UPDATE with stale version โ exception
private Instant heldUntil; // non-null only when status = HELD
}
// Booking Service: retry up to 3 times with exponential backoff on OCC conflict.
@Retryable(value = OptimisticLockingFailureException.class,
maxAttempts = 3, backoff = @Backoff(delay = 100, multiplier = 2))
public BookingResult holdDates(UUID roomId, LocalDate checkIn,
LocalDate checkOut, UUID bookingId) {
List<AvailabilitySlot> slots =
slotRepository.findByRoomIdAndDateRange(roomId, checkIn, checkOut);
slots.forEach(slot -> {
if (slot.getStatus() != SlotStatus.AVAILABLE)
throw new RoomNotAvailableException(slot.getSlotDate());
slot.setStatus(SlotStatus.HELD);
slot.setHeldUntil(Instant.now().plus(Duration.ofMinutes(15)));
});
return new BookingResult(slotRepository.saveAll(slots)); // version incremented here
}
For the most contested listings โ a city-centre apartment listed during a sold-out concert weekend โ optimistic locking alone can generate a thundering herd of retries. A Redis SETNX front-door lock on the (roomId, dateRange) key serialises concurrent HOLD attempts before they reach the database, eliminating unnecessary OCC conflicts:
// SETNX (SET if Not eXists) acquires a lock that expires automatically.
// If the lock is already held, the request is rejected immediately rather than
// hammering the DB with 200 concurrent optimistic-lock retries.
String lockKey = String.format("booking:lock:%s:%s:%s",
roomId, checkIn, checkOut);
Boolean acquired = redisTemplate.opsForValue()
.setIfAbsent(lockKey, "1", Duration.ofSeconds(20));
if (Boolean.FALSE.equals(acquired)) {
throw new BookingLockConflictException(
"Room is being reserved. Please retry in a moment.");
}
// Proceed to the database HOLD inside a try/finally that releases the lock.
This two-layer pattern โ Redis front-door lock for hot listings, OCC for the long tail โ is the industry-standard approach that allows a platform like Airbnb to absorb sudden booking spikes on popular properties without database deadlocks or runaway retry storms.
โ๏ธ Trade-offs & Failure Modes in Booking Concurrency and Availability
| Trade-off | Option A | Option B | Chosen Approach |
| Locking strategy | Pessimistic (SELECT FOR UPDATE) โ simple; no conflicts | Optimistic (version column) โ high throughput; requires retry logic | OCC for most rooms; Redis SETNX for top-1 % contested listings |
| Availability store | Same PostgreSQL as bookings โ always consistent | Separate read replica for calendar reads | Read replica for calendar; all writes go to primary |
| Search freshness | Query booking DB directly โ always accurate | Elasticsearch with CDC sync โ 1โ5 s lag | ES for search speed; Availability Service is authoritative at checkout |
| Hold window duration | 5 min โ fewer ghost holds; worse checkout UX | 30 min โ better UX; more inventory artificially suppressed | 15 min โ balances abandoned-checkout cleanup with realistic checkout time |
Failure modes to discuss in an interview:
- Hold never confirmed (guest abandons checkout): Guest initiates a hold, then closes the browser. The slot stays
heldand blocks other guests untilheld_untilexpires. Mitigation: a background cron job running every 5 minutes selects allavailability_slots WHERE status = 'held' AND held_until < NOW()and bulk-updates them back toavailable. The partial indexidx_slots_held_expirymakes this sweep fast even at 2.5 billion rows. - Payment succeeds but CONFIRM update fails: Guest is charged but booking status stays
held. The guest has a pre-auth on their card and no confirmed booking. Mitigation: transactional outbox pattern โ write abooking_eventsrow in the same transaction as the payment response; a background reconciliation job re-drives the CONFIRM step using the stored event if the booking is stillheldafter payment succeeds. - Availability cache serves stale data during a booking wave: A room just held still shows "available" for up to 30 seconds for guests whose availability query was cached before the hold. Mitigation: publish a
slot.status_changedKafka event on every hold, confirm, and release; the cache consumer invalidates the Redis key immediately on receipt. - Elasticsearch out of sync with PostgreSQL: A host marks dates as blocked in PostgreSQL, but the ES document still shows those dates as available for up to 5 seconds. Mitigation: the Availability Service performs an authoritative availability re-check from PostgreSQL at the start of every HOLD attempt โ ES results are used only for display and filtering, never as the source of truth for the booking write.
๐งญ Decision Guide: Concurrency Control Strategy for Room Booking
| Situation | Recommendation |
| Use optimistic locking (OCC) | Most listings: booking conflicts are statistically rare; OCC delivers high write throughput without blocking concurrent readers |
| Use pessimistic locking | Ultra-high-value single-inventory items (charter flights, one-of-a-kind event venues) where a conflict retry is commercially unacceptable |
| Add Redis SETNX front-door lock | Top 1 % of listings by booking attempt frequency; prevents thundering-herd OCC conflicts on hot-key rooms during peak windows |
| Avoid long hold windows | Hold windows longer than 30 minutes artificially suppress availability and frustrate guests who see listings as "unavailable" when they are actually just held by an abandoned checkout |
| Separate search from booking DB | Always โ the 20:1 search-to-booking QPS ratio makes co-location a correctness and performance risk, not just a scaling concern |
| Multi-room (bundle) bookings | Apply all-or-nothing HOLD across all rooms in a single transaction; any partial hold must be rolled back atomically, or you create partially-booked inventory that cannot be released cleanly |
๐งช Practical Examples: Write Path and Read Path Walkthrough
Example 1: Guest Makes a Booking โ The Concurrency-Safe Write Path
The following sequence shows the full booking flow from "Book Now" to confirmation email, with concurrency control at every critical step.
sequenceDiagram
participant G as Guest
participant BK as Booking Service
participant RD as Redis
participant AV as Availability Service
participant PY as Payment Service
participant KF as Kafka
G->>BK: POST /bookings (Idempotency-Key: xyz, roomId, check-in/out)
BK->>RD: GET idempotency:booking:xyz
RD-->>BK: Cache miss โ proceed
BK->>RD: SETNX booking:lock:{roomId}:{dates} (TTL 20 s)
RD-->>BK: Lock acquired
BK->>AV: HOLD slots (version check; held_until = now + 15 min)
AV-->>BK: Hold confirmed โ all slot versions incremented
BK->>PY: Pre-authorise payment (amountCents, paymentMethodId)
PY-->>BK: Auth code returned
BK->>AV: CONFIRM slots (status: held โ booked, version check)
AV-->>BK: Booking confirmed
BK->>RD: SET idempotency:booking:xyz = response (TTL 24 h)
BK->>KF: Emit booking.confirmed event
KF-->>G: Confirmation email (via Notification Service, async)
BK-->>G: HTTP 201 { bookingId, status: "confirmed" }
If the HOLD step returns an OCC conflict, the Booking Service retries up to 3 times with exponential backoff (100 ms, 200 ms, 400 ms). If all retries fail โ indicating the room is genuinely being booked by another guest โ it returns HTTP 409 "room no longer available for selected dates".
If the Redis lock is already held on the first attempt, the service returns HTTP 409 immediately. This prevents the thundering herd from even reaching the database during a booking wave on a hot listing.
Example 2: Guest Searches for Available Rooms โ The Read Path
The read path deliberately avoids the booking database entirely. Elasticsearch handles geo and date filtering; Redis serves cached availability windows; only on a cache miss does the system touch PostgreSQL.
graph TD
A["Guest: GET /search?lat=40.7&lon=-74.0\n&from=2025-12-31&to=2026-01-02&guests=2"]
B["Search Service: hash query params โ check Redis search cache"]
C{Search cache hit?}
D["Return cached search results\n(5-minute TTL)"]
E["Elasticsearch: geo_distance + date range + price filter"]
F["Availability Service: verify open slots for top results"]
G{Availability cache hit?}
H["Return from Redis availability cache\n(30-second TTL)"]
I["PostgreSQL: query availability_slots\nfor matched room IDs and dates"]
J["Merge ES results with confirmed availability\nCache result in Redis (30 s TTL)"]
K["Return ranked results to guest\n(sorted by distance + rating)"]
A --> B
B --> C
C -- Yes --> D
C -- No --> E
E --> F
F --> G
G -- Yes --> H
G -- No --> I
I --> J
H --> K
J --> K
D --> K
The two-tier cache design reflects different freshness requirements: search results (hotel metadata, photos, ratings) change rarely and tolerate a 5-minute stale window; availability changes every few seconds during a booking wave and requires a 30-second cache with event-driven invalidation on every slot status change.
๐ ๏ธ Elasticsearch: How It Powers Hotel Search
Elasticsearch is the only open-source technology that can answer "all available rooms within 5 km of Times Square for December 31st, under $300 per night, for 2 guests" in under 200 ms at 10K QPS. The reason is its inverted index combined with first-class geospatial query support via the geo_point field type and the geo_distance filter.
Hotel listings are indexed as ES documents. Each document contains a location field of type geo_point, nightly price, guest capacity, and a multi-value keyword field named available_dates containing every open date for the next 12 months. A CDC pipeline (Debezium consuming PostgreSQL WAL events via Kafka) updates this field within 1โ5 seconds whenever the Availability Service changes a slot status.
GET /hotels/_search
{
"query": {
"bool": {
"filter": [
{
"geo_distance": {
"distance": "5km",
"location": { "lat": 40.7128, "lon": -74.0060 }
}
},
{ "range": { "price_per_night": { "lte": 30000 } } },
{ "term": { "max_guests": { "gte": 2 } } },
{ "terms": { "available_dates": ["2025-12-31", "2026-01-01"] } }
]
}
},
"sort": [
{
"_geo_distance": {
"location": { "lat": 40.7128, "lon": -74.0060 },
"order": "asc",
"unit": "km"
}
},
{ "rating": { "order": "desc" } }
],
"size": 20
}
The available_dates field in ES is an approximate indicator. It is kept reasonably fresh by the CDC pipeline but is not the authoritative source. The Availability Service always re-confirms availability from PostgreSQL at the start of the HOLD step. This prevents a guest from completing a booking based on stale ES data if the slot was booked in the 1โ5 second CDC lag window.
For a full deep-dive on Elasticsearch indexing strategy, inverted index internals, and autocomplete design, see the companion post: System Design HLD Example: Search Autocomplete.
๐ Lessons Learned
- Check-then-act without a lock is always wrong for inventory. The TOCTOU race condition is the most common root cause of double-bookings. Model availability as a lockable database row with a version counter โ not as a derived query result that any concurrent reader can also see as "available."
- Optimistic locking beats pessimistic locking for most booking systems. Room booking conflicts are rare in aggregate. OCC delivers high write throughput and degrades gracefully with retries.
SELECT FOR UPDATEserialises all writes to a room and becomes a bottleneck precisely during the high-traffic windows when you can least afford it. - The hold window is a product decision with architectural consequences. A 15-minute hold is long enough for checkout but short enough to limit ghost availability suppression. Hard-code it as a named constant (
HOLD_DURATION_MINUTES = 15) and surface it as a tunable configuration value. - Search and booking must run on separate data stores. Running 10K search QPS against the booking database creates read-write contention and slows availability writes during booking waves โ exactly when correctness matters most.
- Every write API that crosses a network boundary needs an idempotency key. The
idempotency_key UNIQUEconstraint is the last line of defence: even if the Redis cache is cold, a duplicate booking attempt at the database layer returns a conflict row rather than creating a second confirmed booking. - Model the payment lifecycle as two distinct phases. Pre-auth at hold creation; capture before check-in. This maps cleanly to the two-phase reservation model: a failed payment during the hold phase releases the slots without the guest ever seeing a charge.
๐ TLDR: Summary & Key Takeaways
- Date-level inventory slots โ model each room-night as an independent
availability_slotsrow with its ownstatusandversion; this is the foundation that makes per-night locking granular and efficient. - Optimistic locking + UNIQUE constraint โ
@Versionon the slot row prevents concurrent HOLDs from both succeeding;UNIQUE(room_id, slot_date)is the database-level backstop for any OCC gap. - Two-phase reservation โ HOLD (15-minute window) โ payment pre-auth โ CONFIRM; failure at any step releases the hold cleanly without stranded charges.
- Redis SETNX for hot listings โ a front-door lock on contested
(roomId, dateRange)keys prevents thundering-herd OCC conflicts on popular properties during peak booking windows. - Separate search from booking โ Elasticsearch for geo + date search at 10K QPS; PostgreSQL for ACID availability writes; never co-locate the two workloads.
- Idempotency key on every booking write โ Redis 24 h cache + DB
UNIQUEconstraint together provide defence in depth against duplicate bookings from network retries.
๐ Practice Quiz
Why does the hotel booking system use optimistic locking rather than
SELECT FOR UPDATE(pessimistic locking) for availability slots?- A) PostgreSQL does not support row-level locking
- B) Optimistic locking provides higher write throughput because it never blocks concurrent readers; conflicts are rare in aggregate and handled by retry logic
- C)
SELECT FOR UPDATEcannot be used on tables with UUID primary keys Correct Answer: B
Three guests simultaneously submit booking requests for the same apartment on December 31st, which has exactly one
availability_slotsrow for that date. What prevents all three from succeeding?- A) The API Gateway rate limiter blocks requests after the first one
- B) The
versioncolumn: only the firstUPDATEthat reads the original version succeeds; the other two return 0 rows affected and are retried or rejected - C) The Payment Service rejects duplicate pre-auth requests automatically Correct Answer: B
Why does the system use Elasticsearch for search rather than querying the PostgreSQL bookings database directly?
- A) PostgreSQL cannot store latitude/longitude coordinates
- B) The search-to-booking QPS ratio is approximately 20:1; co-locating 10K search QPS with ACID booking writes would cause read-write contention and degrade booking throughput precisely during booking waves
- C) Elasticsearch automatically prevents double-bookings via its document versioning Correct Answer: B
Why must the booking API require a client-supplied
Idempotency-Keyheader onPOST /bookings?- A) To authenticate the guest and prevent anonymous bookings
- B) To route the request to the correct regional availability replica
- C) Network timeouts cause clients to retry; without idempotency the server would create a second confirmed booking from a retry of an already-succeeded request Correct Answer: C
Open-ended challenge: A guest completes checkout successfully. The Payment Service returns a pre-auth code. The Booking Service then crashes before it can
UPDATE availability_slotstobooked. The guest has a pre-auth charge on their card but no confirmed booking, and the slot remains inheldstatus. Walk through the full recovery strategy: what data would you inspect, what does the background reconciliation job do, how do you prevent the guest from being permanently charged without a confirmed booking, and what invariant must hold at the end of recovery?
๐ Related Posts on System Design

Written by
Abstract Algorithms
@abstractalgorithms
More Posts
LangChain Tools and Agents: The Classic Agent Loop
๐ฏ Quick TLDR: The Classic Agent Loop TLDR: LangChain's @tool decorator plus AgentExecutor give you a working tool-calling agent in about 30 lines of Python. The ReAct loop โ Thought โ Action โ Observation โ drives every reasoning step. For simple l...
LangChain RAG: Retrieval-Augmented Generation in Practice
โก TLDR: RAG in 30 Seconds TLDR: RAG (Retrieval-Augmented Generation) fixes the LLM knowledge-cutoff problem by fetching relevant documents at query time and injecting them as context. With LangChain you build the full pipeline โ load โ split โ embed...
LangChain Memory: Conversation History and Summarization
TLDR: LLMs are stateless โ every API call starts fresh. LangChain memory classes (Buffer, Window, Summary, SummaryBuffer) explicitly inject history into each call, and RunnableWithMessageHistory is the modern LCEL replacement for the legacy Conversat...

LangChain 101: Chains, Prompts, and LLM Integration
TLDR: LangChain's LCEL pipe operator (|) wires prompts, models, and output parsers into composable chains โ swap OpenAI for Anthropic or Ollama by changing one line without touching the rest of your c
