System Design HLD Example: E-Commerce Platform (Amazon)
A practical interview-ready HLD for a large-scale e-commerce system handling catalog, cart, inventory, and orders.
TLDR: A large-scale e-commerce platform separates catalog, cart, inventory, orders, and payments into independent services. The hardest correctness problem, in interviews and in production, is inventory under concurrent checkout: solved with a Redis atomic `DECR` for speed and an optimistic-lock SQL fallback for durability. Flash sale traffic is absorbed by API-gateway rate limiting, an in-memory inventory counter, and a request queue for overflow.
During Amazon Prime Day 2023, 375 million items were ordered in 48 hours, an average of roughly 2,170 orders per second. A single flash sale for AirPods can trigger 50,000 concurrent add-to-cart requests in 30 seconds. Overselling inventory by even 1 unit causes a cascading customer-service nightmare: the item ships, the warehouse discovers zero stock, the order is cancelled post-payment, the customer is angry, and the support cost exceeds the margin on the sale.
How do you build a system that handles both that scale and that correctness simultaneously? The answer is not one system; it is seven, each tuned for a different consistency and throughput profile, wired together by an event bus.
By the end of this walkthrough you will know: why the cart service lives in Redis while orders live in PostgreSQL; why DECR beats a SQL UPDATE for inventory reservation under a flash sale; why order creation must be idempotent; and how Kafka decouples the checkout write path from notifications, warehouse events, and analytics.
375 Million Orders in 48 Hours: Defining the Problem Space
Actors
| Actor | Role |
|---|---|
| Customer | Browses catalog, adds items to cart, places and tracks orders |
| Seller / Merchant | Lists products, manages inventory levels, fulfils shipments |
| Order Service | Orchestrates checkout: reserve inventory → process payment → create order |
| Warehouse System | Consumes order events; manages pick-pack-ship fulfilment |
| Notification Service | Sends order confirmation, shipment, and delivery emails/SMS |
| Analytics Pipeline | Consumes order events; populates dashboards and recommendation models |
Core Use Cases
- Browse catalog: paginated category browse and full-text product search with filters (price, rating, brand)
- View product detail: product page with images, description, seller info, pricing, and stock availability indicator
- Add to cart: unauthenticated (guest cart) or authenticated (persistent cart); cart stored 30 days
- Checkout: cart → reserve inventory → payment authorization → order confirmation
- Process payment: pre-authorize then capture; idempotent with client-supplied idempotency key
- Manage inventory: seller adjusts stock levels; system reserves and releases on checkout/cancellation
- Order tracking: state machine from PLACED through SHIPPED to DELIVERED; customer-visible status page
- Product reviews: submit, read, and aggregate (average rating) post-delivery
Read and write paths are analyzed separately to keep consistency boundaries and bottleneck profiles explicit.
Non-Functional Requirements and System Boundaries
In Scope
| Feature | Key Decision |
|---|---|
| Product catalog reads | Eventual consistency acceptable; strong CDN caching |
| Cart persistence | Redis with 30-day TTL; guest-to-user cart merge on login |
| Inventory reservation | Strong consistency required; Redis DECR + SQL fallback |
| Order creation | Idempotent; at-most-once payment charge |
| Flash sale traffic | Rate limit at gateway; queue overflow; atomic Redis counter |
| Notification delivery | Async via Kafka; at-least-once delivery acceptable |
Out of Scope (v1 Boundary)
- Seller onboarding: KYC, bank account verification, tax registration
- Fraud detection: ML scoring of orders; chargeback risk models
- Logistics and delivery routing: last-mile optimization, carrier selection, route planning
- Personalized recommendations: collaborative filtering, embedding-based retrieval
Non-Functional Requirements
| Dimension | Target | Rationale |
|---|---|---|
| Catalog read latency | p99 < 50 ms | CDN + Redis cache; database is cold path only |
| Cart add/update latency | p99 < 30 ms | Redis in-memory write |
| Checkout (reserve + payment) | p95 < 3 s | Dominated by payment gateway SLA |
| Inventory correctness | Zero oversells under 50K concurrent checkouts for the same SKU | Non-negotiable: 1 oversell = order cancellation + CS cost |
| Order read availability | 99.99% | Order status page must never show a 500 |
| Catalog availability | 99.95% | Brief degradation tolerated; cached content served stale |
| Notification delivery | At-least-once; no SLA on latency | Email/SMS can arrive seconds after order |
Deep Dive: Data Internals, Capacity Math, and Performance Boundaries
The Internals: Data Model and Order State Machine
The two most important tables are inventory and orders. The inventory table must support both optimistic locking (SQL path) and serve as the source of truth for periodic reconciliation against the Redis counter.
```sql
-- Inventory table with optimistic locking via version column
CREATE TABLE inventory (
    sku_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id   UUID NOT NULL REFERENCES products(product_id),
    stock_count  INT NOT NULL DEFAULT 0 CHECK (stock_count >= 0),
    reserved     INT NOT NULL DEFAULT 0 CHECK (reserved >= 0),
    version      BIGINT NOT NULL DEFAULT 0, -- bump on every write; OCC version
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_inventory_product ON inventory (product_id);

-- Orders table with idempotency anchor
CREATE TABLE orders (
    order_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    idempotency_key TEXT UNIQUE NOT NULL, -- client-supplied; dedup anchor
    customer_id     UUID NOT NULL,
    status          TEXT NOT NULL DEFAULT 'PLACED'
        CHECK (status IN (
            'PLACED','PAYMENT_PENDING','CONFIRMED',
            'SHIPPED','DELIVERED','CANCELLED','REFUNDED'
        )),
    total_cents     BIGINT NOT NULL,
    currency        CHAR(3) NOT NULL DEFAULT 'USD',
    payment_id      UUID,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at DESC);
CREATE UNIQUE INDEX idx_orders_idempotency ON orders (idempotency_key);

CREATE TABLE order_items (
    item_id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id   UUID NOT NULL REFERENCES orders(order_id),
    sku_id     UUID NOT NULL,
    quantity   INT NOT NULL CHECK (quantity > 0),
    unit_cents BIGINT NOT NULL
);

CREATE INDEX idx_order_items_order ON order_items (order_id);
```
The order status column is enforced by a Java enum that also defines allowed transitions, preventing illegal state changes:
```java
import java.util.Map;
import java.util.Set;

public enum OrderStatus {
    PLACED,
    PAYMENT_PENDING,
    CONFIRMED,
    SHIPPED,
    DELIVERED,
    CANCELLED,
    REFUNDED;

    private static final Map<OrderStatus, Set<OrderStatus>> ALLOWED_TRANSITIONS = Map.of(
        PLACED,          Set.of(PAYMENT_PENDING, CANCELLED),
        PAYMENT_PENDING, Set.of(CONFIRMED, CANCELLED),
        CONFIRMED,       Set.of(SHIPPED, CANCELLED),
        SHIPPED,         Set.of(DELIVERED),
        DELIVERED,       Set.of(REFUNDED),
        CANCELLED,       Set.of(),
        REFUNDED,        Set.of()
    );

    public boolean canTransitionTo(OrderStatus next) {
        return ALLOWED_TRANSITIONS.getOrDefault(this, Set.of()).contains(next);
    }
}
```
Mathematical Model: Throughput and Storage Sizing
Capacity estimation provides the numbers that drive every architectural decision: shard count, cache size, replica count, and storage tier selection.
Traffic assumptions:
| Variable | Value | Derivation |
|---|---|---|
| DAU | 300M | Given |
| Browse requests per user per day | ~20 | Typical e-commerce session depth |
| Total browse reads/day | 6B | 300M × 20 |
| Peak read multiplier | 5× daily average | Flash sales, Prime Day |
| Peak read QPS | ~350K req/s | 6B / 86,400 s × 5 |
| Orders per day | 30M | ~10% of DAU place an order |
| Peak order QPS | ~10K orders/s | 30M / 86,400 s × 30 (peak multiplier) |
| Inventory reservation QPS | ~10K/s | 1 reservation per checkout |
Storage sizing:
$$\text{Catalog storage} = N_{\text{products}} \times S_{\text{product\_row}}$$
With 1 million products, each product row ≈ 4 KB (description, attributes, JSONB metadata):
$$1\text{M} \times 4\text{ KB} = 4\text{ GB (DB)} + \text{image CDN (TB scale)}$$
Order storage grows at roughly 30M orders/day × 2 KB/order ≈ 60 GB/day. After 3 years that is ~65 TB, so partition by created_at month and archive cold partitions to object storage.
Redis inventory counter memory:
$$N_{\text{active\_SKUs}} \times 50\text{ bytes/key} = 10\text{M} \times 50\text{ B} \approx 500\text{ MB}$$
Performance Analysis: Bottleneck Profiles Under Peak Load
| Component | Pressure Point | Symptom | Mitigation |
|---|---|---|---|
| Inventory Redis | Hot SKU during flash sale: single key, 50K DECR/s | Redis CPU spike; latency spikes | Lua-guarded DECR; Redis Cluster shards the key space; shard the hot key across N sub-keys, sum for reads |
| Order DB write | 10K INSERT/s into a single PostgreSQL primary | WAL flush becomes the bottleneck | Write to Kafka first (order-events); async writer batch-inserts into the DB; async ACK pattern |
| Catalog DB read | 350K read QPS against product rows | DB CPU saturation | CDN (product images) + Redis read-through cache (product JSON, 1-hr TTL); DB is cold path only |
| Payment gateway | Third-party SLA variance (p99 > 5 s during Stripe incidents) | Checkout stalls | Circuit breaker; timeout at 8 s; async capture path after authorization |
| Search Elasticsearch | Complex filter queries at 10K req/s | Heap pressure; GC pauses | Dedicated coordinating nodes; cache common filter result sets in Redis (5-min TTL) |
High-Level Architecture: Seven Services Serving 300 Million Daily Active Users
The seven services are stateless except for the data stores they own. Each service owns exactly one data store; cross-service data access goes through APIs, never direct DB queries.
```mermaid
graph TD
    Client["Client (Web/Mobile)"]
    APIGW["API Gateway\n(Auth · Rate Limit · Route)"]
    PCS["Product Catalog Service\n(PostgreSQL + Elasticsearch)"]
    SS["Search Service\n(Elasticsearch)"]
    CS["Cart Service\n(Redis)"]
    IS["Inventory Service\n(Redis + PostgreSQL)"]
    OS["Order Service\n(PostgreSQL + Kafka)"]
    PS["Payment Service\n(PostgreSQL + Provider)"]
    NS["Notification Service\n(Kafka Consumer)"]
    CDN["CDN\n(Product Images)"]
    Kafka["Apache Kafka\n(order-events topic)"]
    Client --> CDN
    Client --> APIGW
    APIGW --> PCS
    APIGW --> SS
    APIGW --> CS
    APIGW --> IS
    APIGW --> OS
    OS --> IS
    OS --> PS
    OS --> Kafka
    Kafka --> NS
    Kafka --> IS
```
Every client request enters through the API Gateway, which handles authentication (JWT validation), per-user rate limiting, and routing. The gateway is the only public-facing layer; all services operate inside a private network. The Product Catalog Service and Search Service handle all read traffic; the Cart, Inventory, Order, and Payment services handle the write-heavy checkout flow. Kafka connects the Order Service to downstream consumers (Notification, Analytics, Warehouse) without coupling them to the checkout critical path.
Real-World Applications: Component Deep Dives from Product Catalog to Flash Sale
Product Catalog: Denormalized Cache + Elasticsearch for Filtering
The catalog database holds the master product record (PostgreSQL). Elasticsearch holds a denormalized copy of every product for full-text search and faceted filtering. The CDN holds product images. When a seller updates a product, a background job updates both the Elasticsearch document and invalidates the Redis product-page cache.
Cache hierarchy for a product page read:
- CDN edge cache (product page HTML fragment): TTL 1 hour; hit rate ~85%
- Redis product-page JSON cache (`product:{id}`, TTL 1 hour): hit rate ~12%
- PostgreSQL product row: <3% of requests reach here under normal load
Shopping Cart: Redis as the Session Store
Each cart is stored as a Redis Hash keyed by `cart:{user_id}`. Each field is a `sku_id` and the value is the quantity plus a price snapshot at the time of addition.
| Redis Key | Type | Value | TTL |
|---|---|---|---|
| `cart:{user_id}` | Hash | sku_id → {qty, unit_cents} | 30 days, reset on activity |
| `cart:guest:{session_id}` | Hash | Same structure for unauthenticated users | 24 hours |
On login, the guest cart is merged into the user cart (`HGETALL` guest cart → `HSET` user cart for each field → `DEL` guest cart). On checkout, the cart is not deleted immediately; it is marked as checked out by setting a `cart:{user_id}:status` key to `CHECKED_OUT`. Deletion happens asynchronously after order confirmation, so that an order cancellation can restore the cart.
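The merge semantics can be sketched in-process. This is an illustrative stand-in, not the production service: plain `HashMap`s replace the Redis hashes, and, mirroring `HSET` field-by-field, the guest snapshot wins on a `sku_id` collision.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the guest-to-user cart merge. Maps stand in for the
// Redis hashes cart:guest:{session_id} and cart:{user_id}; field = sku_id,
// value = "{qty,unit_cents}" snapshot string.
public class CartMerge {

    // HGETALL guest -> HSET user per field -> DEL guest.
    // Assumption (HSET semantics): the guest entry overwrites on collision.
    public static Map<String, String> mergeGuestIntoUser(
            Map<String, String> guestCart, Map<String, String> userCart) {
        Map<String, String> merged = new HashMap<>(userCart);
        merged.putAll(guestCart);   // HSET per field: guest snapshot wins
        guestCart.clear();          // DEL cart:guest:{session_id}
        return merged;
    }
}
```

A product decision hides here: summing quantities instead of overwriting is equally defensible; the `HGETALL → HSET` recipe in the text implies overwrite.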
Inventory Management: Preventing Overselling Under 50K Concurrent Checkouts
This is the hardest correctness problem in the system. Two concurrent checkouts for the same AirPods SKU must both see the exact available stock and one of them must fail cleanly if only 1 unit remains.
Two-phase reservation flow:
- Reserve: atomic `DECR` on the Redis key `inventory:{sku_id}`. Returns the new value. If the value drops below 0, `INCR` to roll back and return `OUT_OF_STOCK`.
- Confirm: after payment succeeds, write the reservation to PostgreSQL with an optimistic-lock update.
- Release: if payment fails or the order is cancelled, `INCR` the Redis counter and update the DB.
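The three phases can be sketched with an in-memory counter standing in for the Redis key. This is an illustrative model, not the production service; in the real system the reserve step is the Lua script shown in this section, and confirm is the optimistic-lock SQL update.

```java
import java.util.concurrent.atomic.AtomicInteger;

// In-process sketch of the two-phase reservation flow. An AtomicInteger
// stands in for the Redis counter inventory:{sku_id}; a plain int stands in
// for the DB "reserved" column.
public class Reservation {
    private final AtomicInteger counter;   // Redis-counter stand-in
    private int dbReserved = 0;            // DB reserved-column stand-in

    public Reservation(int initialStock) {
        counter = new AtomicInteger(initialStock);
    }

    // Phase 1 (reserve): DECRBY, then INCRBY rollback if we went negative.
    public boolean reserve(int qty) {
        if (counter.addAndGet(-qty) < 0) {
            counter.addAndGet(qty);        // rollback: OUT_OF_STOCK
            return false;
        }
        return true;
    }

    // Phase 2 (confirm): runs only after payment succeeds.
    public void confirm(int qty) { dbReserved += qty; }

    // Release: payment failed or order cancelled.
    public void release(int qty) { counter.addAndGet(qty); }

    public int remaining() { return counter.get(); }
}
```

Note that even this decrement-then-rollback shape never oversells; the Lua script improves on it only by rejecting without the transient negative value.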
Redis Lua script for atomic check-and-decrement:
```lua
-- inventory_reserve.lua
-- KEYS[1]: inventory:{sku_id}
-- ARGV[1]: quantity to reserve
local current = tonumber(redis.call('GET', KEYS[1]))
local qty = tonumber(ARGV[1])
if current == nil or current < qty then
    return -1 -- insufficient stock
end
return redis.call('DECRBY', KEYS[1], qty)
```
The Lua script runs atomically: Redis executes it as a single command with no interleaving. This is faster than a distributed lock and safe against race conditions.
The SQL optimistic-lock confirmation (only executed after payment succeeds):
```sql
UPDATE inventory
SET reserved   = reserved + :qty,
    version    = version + 1,
    updated_at = NOW()
WHERE sku_id = :sku_id
  AND version = :expected_version
  AND (stock_count - reserved) >= :qty;
-- 0 rows updated = concurrent conflict; retry or abort
```
If the DB update returns 0 rows, a concurrent checkout won the race at the DB layer. The inventory Redis counter was already decremented; the transaction is rolled back by issuing an INCRBY on the Redis key and returning a checkout failure.
Order Processing: Idempotent Creation and Event Publication
The Order Service receives the checkout request after inventory is reserved and payment is authorized. It creates the order row inside a database transaction, then publishes an `ORDER_CREATED` event to Kafka. Idempotency is enforced by the UNIQUE constraint on `idempotency_key`: a client retry with the same key hits the constraint, and the service returns the cached response from a Redis lookup keyed on `order:idempotency:{key}`.
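A minimal sketch of that dedup behavior, with a `ConcurrentHashMap` standing in for both the UNIQUE constraint and the Redis response cache (class and method names are illustrative, not the service's real API):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent order creation sketch: every retry carrying the same
// idempotency key observes the order_id created by the first attempt.
public class IdempotentOrders {
    private final Map<String, String> byIdempotencyKey = new ConcurrentHashMap<>();

    // computeIfAbsent plays the role of INSERT + UNIQUE constraint: exactly
    // one caller creates the order; everyone else gets the cached result.
    public String createOrder(String idempotencyKey) {
        return byIdempotencyKey.computeIfAbsent(
                idempotencyKey, k -> UUID.randomUUID().toString());
    }
}
```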
Payment Integration: Pre-Auth, Capture, and Idempotency
Payment flows as two phases: pre-authorization (hold funds on the card) during checkout, and capture (settle funds) after the warehouse confirms inventory pick. If the warehouse finds zero stock after order confirmation, the authorization is voided and the hold is released; the customer was never charged. Every payment API call includes the order's `idempotency_key` as the `Idempotency-Key` header sent to the payment provider (Stripe/Adyen), ensuring provider-side deduplication in addition to platform-side.
Trade-offs and Failure Modes: Overselling, Thundering Herds, and System Breaks
| Failure Mode | Trigger | Cascade | Mitigation |
|---|---|---|---|
| Redis inventory key eviction | Memory pressure flushes the key mid-sale | All checkouts bypass the counter; oversell risk | Set `maxmemory-policy noeviction` for the inventory keyspace; alert on eviction events |
| Redis-DB desync | Redis DECR succeeds but the DB reservation fails after a payment timeout | Redis shows 0 stock but the DB shows units available (or vice versa) | Periodic reconciliation job compares the Redis counter to `stock_count - reserved` in the DB; auto-corrects drift |
| Payment gateway brownout | Provider p99 spikes to 10 s | Checkout requests queue up; the cart service receives repeat retries | Circuit breaker opens after 5 consecutive timeouts; fail fast with 503 Service Unavailable; retry queue for background re-attempt |
| Kafka consumer lag | Notification consumer falls behind | Delivery emails delayed | Monitor consumer group lag; autoscale the consumer group; DLQ after 3 retries |
| Hot SKU during flash sale | 50K concurrent requests for the same SKU | The Redis key becomes a single bottleneck | Lua guard script; shard the hot inventory key as `inventory:{sku_id}:{shard_N}` across N sub-keys; aggregate for display, DECR individual shards |
| Order DB write bottleneck | 10K INSERT/s into a single primary | WAL flush latency increases | Write-ahead Kafka buffer; async DB insert from a Kafka consumer; decouple checkout latency from the DB write |
Decision Guide: Optimistic Locking vs. Redis Atomic Decrements
| Situation | Recommendation |
|---|---|
| < 100 concurrent checkouts for the same SKU | Optimistic locking on the DB is sufficient; lower operational complexity, no Redis dependency |
| 10K-50K concurrent checkouts for the same SKU (flash sale) | Redis Lua DECRBY is required; the DB cannot sustain the concurrent update rate without contention |
| Inventory correctness is the primary SLO | Two-phase: Redis DECR for speed + SQL confirmation for durability + periodic reconciliation for drift correction |
| Operational simplicity matters more than peak throughput | Single-path DB with row-level `SELECT ... FOR UPDATE`; horizontal scaling of the Inventory Service limits contention |
| System must continue if Redis is unavailable | Degrade to a DB-only path (slower, ~200 ms vs ~5 ms); circuit breaker pattern on the Redis dependency |
| SKU sells out regularly and inventory accuracy is auditable | Always persist the two-phase SQL confirmation; the Redis counter alone is not an audit trail |
Two Critical Request Paths: Flash Checkout Write and Category Browse Read
Example 1: The Full Checkout Write Path
This path must complete atomically or roll back completely. A partial completion (payment charged but order not created) is a business-critical failure.
```mermaid
sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant Cart as Cart Service
    participant Inv as Inventory Service
    participant Pay as Payment Service
    participant Ord as Order Service
    participant K as Kafka
    participant NS as Notification Service
    C->>GW: POST /checkout {cart_id, idempotency_key}
    GW->>Cart: GET cart:{user_id}
    Cart-->>GW: [{sku_id, qty, unit_cents}, ...]
    GW->>Inv: RESERVE inventory (Lua DECR)
    Inv-->>GW: reservation_token (or OUT_OF_STOCK)
    GW->>Pay: POST /payments (pre-authorize, idempotency_key)
    Pay-->>GW: {payment_id, status: authorized}
    GW->>Ord: POST /orders {cart, payment_id, idempotency_key}
    Ord->>Inv: CONFIRM reservation (SQL UPDATE + version check)
    Ord-->>GW: {order_id, status: CONFIRMED}
    Ord->>K: publish ORDER_CREATED event
    GW-->>C: 201 Created {order_id}
    K-->>NS: send confirmation email (async)
```
The sequence diagram shows the linear happy path. On any step failure: if inventory reservation fails, return 409 Out of Stock with no payment call; if payment fails, release the inventory reservation (`INCRBY` Redis + DB rollback); if order creation fails, void the payment authorization and release inventory. Each rollback is idempotent.
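The happy path plus those compensations can be modeled as a tiny saga trace. This sketch is illustrative (the step names are invented); its point is the rule that compensations run in reverse order of the steps that completed.

```java
import java.util.ArrayList;
import java.util.List;

// Checkout-saga sketch: each boolean simulates whether that step succeeds,
// and the returned trace records the actions and compensations taken.
public class CheckoutSaga {

    public static List<String> run(boolean reserveOk, boolean paymentOk, boolean orderOk) {
        List<String> trace = new ArrayList<>();
        if (!reserveOk) {                       // 409: payment is never called
            trace.add("409_OUT_OF_STOCK");
            return trace;
        }
        trace.add("RESERVE_INVENTORY");
        if (!paymentOk) {                       // undo step 1
            trace.add("RELEASE_INVENTORY");
            return trace;
        }
        trace.add("AUTHORIZE_PAYMENT");
        if (!orderOk) {                         // undo steps 2 then 1, in reverse
            trace.add("VOID_PAYMENT");
            trace.add("RELEASE_INVENTORY");
            return trace;
        }
        trace.add("CREATE_ORDER");
        trace.add("PUBLISH_ORDER_CREATED");
        return trace;
    }
}
```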
Example 2: The Category Browse Read Path
Read traffic dominates at 350K req/s and is served almost entirely from cache. The goal is that the database never sees more than ~3% of read traffic.
```mermaid
graph LR
    C["Client"] --> CDN["CDN Edge\n(HTML fragment)"]
    CDN -->|cache miss| SS["Search Service\n(Elasticsearch)"]
    SS -->|product IDs| RC["Redis Cache\n(product JSON TTL 1hr)"]
    RC -->|cache miss| DB["Product DB\n(PostgreSQL read replica)"]
    DB --> RC
    RC --> SS
    SS --> CDN
    CDN --> C
```
The CDN serves the rendered product-list HTML fragment on a hit. On a CDN miss, the Search Service queries Elasticsearch for matching product IDs, then batch-fetches product JSON from Redis. Only on a Redis miss does the system reach the database read replica. The database is never on the read hot path during normal operation.
Apache Kafka: How It Powers Order Event Processing
Apache Kafka decouples the Order Service's write path from every downstream consumer. Without Kafka, the Order Service would need to synchronously call Notification, Analytics, and Warehouse, adding latency, coupling failure domains, and making the Order Service brittle.
Topic design:
| Topic | Producers | Consumers | Retention |
|---|---|---|---|
| order-events | Order Service | Notification, Analytics, Warehouse | 7 days |
| inventory-events | Inventory Service | Analytics, Seller Dashboard | 3 days |
| payment-events | Payment Service | Order Service, Reconciliation Job | 7 days |
Kafka producer for order events:
```java
import java.time.Instant;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderEventProducer {

    private static final Logger log = LoggerFactory.getLogger(OrderEventProducer.class);

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public OrderEventProducer(KafkaTemplate<String, OrderEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishOrderCreated(Order order) {
        OrderEvent event = OrderEvent.builder()
            .orderId(order.getOrderId().toString())
            .customerId(order.getCustomerId().toString())
            .status(order.getStatus().name())
            .totalCents(order.getTotalCents())
            .eventType("ORDER_CREATED")
            .occurredAt(Instant.now())
            .build();

        // Partition key = order_id: all events for the same order land on the
        // same partition, preserving per-order ordering.
        kafkaTemplate.send("order-events", event.getOrderId(), event)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    log.error("Failed to publish ORDER_CREATED for order {}", event.getOrderId(), ex);
                    // Retry handled by Spring Kafka's RetryTemplate; DLQ after max retries
                }
            });
    }
}
```
For a full deep-dive on Kafka reliability patterns (consumer groups, exactly-once semantics, DLQ), see the System Design: Message Queues and Event-Driven Architecture companion post.
Flash Sale Hardening: Rate Limiting, Atomic Counters, and Queue Overflow
Flash sales are adversarial load scenarios: a predictable traffic spike to 50-100× normal, concentrated on a tiny subset of SKUs. Three layers of defence are required.
Layer 1, API Gateway rate limiting: limit each user to 5 add-to-cart requests per second for the flash SKU. This prevents abuse bots from exhausting inventory before real customers. Token-bucket enforcement at the gateway is O(1) per request using Redis Lua (see System Design HLD Example: Rate Limiter).
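A minimal in-process token bucket showing the refill arithmetic behind that limit. The article's gateway does this in Redis Lua; this Java version is an illustrative sketch with an injected clock so the behavior is deterministic.

```java
// Token bucket: capacity tokens, refilled continuously at refillPerSecond.
// Each admitted request consumes one token; an empty bucket rejects.
public class TokenBucket {
    private final double capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastNanos;

    public TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;       // start full: allows an initial burst
        this.lastNanos = nowNanos;
    }

    // nowNanos is injected (rather than calling System.nanoTime()) for testability.
    public synchronized boolean tryAcquire(long nowNanos) {
        tokens = Math.min(capacity, tokens + (nowNanos - lastNanos) * refillPerNano);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With capacity 5 and refill 5/s this matches the "5 add-to-cart requests per second" policy: a burst of 10 simultaneous requests admits exactly 5.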
Layer 2, inventory counter atomicity: the Redis Lua `DECRBY` script ensures only one request wins each decrement. The counter is pre-loaded at flash sale start with exactly the available quantity (for example, 10,000 AirPods units). Pre-loading prevents the counter from drifting from the DB at sale start.
Layer 3, queue for overflow requests: when the flash sale inventory reaches 0, subsequent checkout requests are written to a waiting queue (backed by Kafka or an SQS FIFO queue). If a reservation is released due to payment failure, the next request in the queue is processed automatically. This prevents the OUT_OF_STOCK surge from propagating to the client immediately; a configurable wait of up to 30 seconds is shown as a "joining waitlist" UX, which converts better than an instant failure.
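The waitlist hand-off can be sketched in a single process (illustrative only; the production queue is Kafka or SQS FIFO): a reservation attempt that finds no stock joins the queue, and a unit released by a failed payment goes to the head of the queue instead of back to general stock.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Flash-sale waitlist sketch: FIFO hand-off of released reservations.
public class FlashSaleWaitlist {
    private int stock;
    private final Queue<String> waitlist = new ArrayDeque<>();

    public FlashSaleWaitlist(int initialStock) {
        this.stock = initialStock;
    }

    // Returns true if reserved immediately; otherwise the request waits.
    public synchronized boolean tryReserve(String requestId) {
        if (stock > 0) {
            stock--;
            return true;
        }
        waitlist.add(requestId);
        return false;
    }

    // Called when a payment fails. The released unit is handed to the next
    // waiter if one exists; only otherwise does it return to stock.
    public synchronized String releaseToNextWaiter() {
        String next = waitlist.poll();
        if (next == null) {
            stock++;     // no one waiting: unit goes back to stock
        }
        return next;     // non-null: this request now holds the unit
    }
}
```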
Scaling the Inventory Service during flash sales:
| Action | When | Effect |
|---|---|---|
| Pre-load Redis counter from DB | 5 minutes before sale start | Eliminates DB read spike at T=0 |
| Scale Inventory Service pods to 20× | 10 minutes before sale start | Handles reservation RPC volume |
| Enable hot-key sharding for the flash SKU | At sale start | Distributes Redis traffic across N shards |
| Disable catalog cache TTL refresh for flash SKU | During sale | Prevents a stale stock indicator on the product page |
| Run reconciliation job immediately after sale | T + 5 minutes | Catches any Redis-DB drift before the next sale |
Lessons Learned from Building at Scale
1. Separate the inventory reservation from the payment; never do them in a single database transaction. The payment provider call takes 300 ms-3 s. Holding a DB row lock for that duration at 10K orders/sec causes lock contention that cascades into queue buildup and timeouts.
2. The idempotency key must be client-generated, not server-generated. If the server generates the key, a request that times out before returning the key leaves the client with no way to deduplicate the retry. Clients should generate a UUID before calling the checkout endpoint and retry with the same UUID.
3. Redis inventory counters will drift from the DB. Network partitions, application bugs, and rollbacks all create drift. A nightly (or post-flash-sale) reconciliation job that compares Redis `GET inventory:{sku_id}` with `stock_count - reserved` in the DB and auto-corrects the delta is non-negotiable in production.
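That reconciliation step can be sketched with maps standing in for both stores (illustrative; it follows the article in treating the DB's `stock_count - reserved` as the source of truth and correcting the Redis side toward it):

```java
import java.util.HashMap;
import java.util.Map;

// Reconciliation sketch: compare the Redis counter to DB availability and
// auto-correct the Redis side, reporting the drift per SKU.
public class InventoryReconciler {

    // redisCounters: sku -> Redis counter value.
    // db:            sku -> {stock_count, reserved}.
    // Returns sku -> drift (dbAvailable - redisValue) for every mismatched SKU,
    // and mutates redisCounters to match the DB.
    public static Map<String, Integer> reconcile(
            Map<String, Integer> redisCounters, Map<String, int[]> db) {
        Map<String, Integer> drift = new HashMap<>();
        for (Map.Entry<String, int[]> e : db.entrySet()) {
            int available = e.getValue()[0] - e.getValue()[1];  // stock_count - reserved
            int redisValue = redisCounters.getOrDefault(e.getKey(), 0);
            if (redisValue != available) {
                drift.put(e.getKey(), available - redisValue);
                redisCounters.put(e.getKey(), available);       // correct toward the DB
            }
        }
        return drift;
    }
}
```

Run against the open-ended-challenge numbers at the end of this post (Redis 0, DB 500/450), it reports a drift of +50 and restores the counter to 50.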
4. Order state machines must be enforced at the application layer, not just by the DB column constraint. The `canTransitionTo()` method on the enum rejects illegal transitions (for example, DELIVERED → PLACED) before the DB write is attempted, giving clear error messages and preventing partial-state corruption.
5. Read replicas are not enough for the catalog at 350K req/s. Most teams add a read replica and assume it will handle catalog reads. It will not. A layered cache (CDN + Redis + replica) is required; the replica is only for cache misses and admin queries.
6. Kafka decoupling is worth the operational cost. Teams that synchronously call Notification and Analytics from the Order Service during checkout discover this the hard way: when the notification service slows down during a surge, checkout latency degrades with it. Event-driven decoupling protects the checkout p95 from every downstream service's hiccups.
TLDR: Summary & Key Takeaways
- Seven services, seven data stores: each service owns exactly one store. No cross-service direct DB queries. Inter-service communication via APIs and Kafka events.
- Cart in Redis, orders in PostgreSQL: carts are ephemeral, latency-sensitive, and expire; orders are durable, auditable, and grow indefinitely. Different durability profiles demand different stores.
- Redis DECR for inventory speed, SQL for inventory durability: the atomic Lua script handles flash-sale concurrency; the optimistic-lock SQL update persists the reservation and provides the audit trail; periodic reconciliation fixes drift.
- Idempotency keys must be client-generated: server-generated keys cannot protect against the client's timed-out retry scenario.
- Kafka is the seam between transactional and analytical: order events flow downstream to Notification, Warehouse, Analytics, and ML pipelines without adding latency to the checkout write path.
- Flash sales require pre-loading: counter, pods, and sharding must be ready before the sale starts; reactive scaling at T=0 is too late.
- One-liner: inventory correctness under concurrent checkout is the hardest problem in e-commerce system design; solve it with a Redis Lua atomic decrement for speed and an optimistic-lock SQL confirm for durability.
Practice Quiz

Why does the inventory reservation use a Redis Lua script instead of a plain `DECR` followed by an `if` check in application code?

- A) Lua scripts are faster to write than application code
- B) A Lua script executes atomically: no interleaving between the GET and DECR, eliminating the TOCTOU race condition
- C) Redis does not support `DECR` without a Lua wrapper
- D) Application-level checks require a distributed lock that adds 200 ms latency

Correct Answer: B
During a flash sale, 50,000 checkout requests arrive simultaneously for a SKU with 1 unit of inventory. What happens under the two-phase reservation design?

- A) All 50,000 requests succeed because Redis queues them
- B) One request receives a decremented counter value ≥ 0 and proceeds; all others receive -1 (insufficient stock) and are rejected or queued
- C) The first 50,000 decrement the counter to -49,999; an out-of-sync DB reconciliation job corrects this overnight
- D) The API Gateway blocks all but 1 request via rate limiting before they reach the Inventory Service

Correct Answer: B
Why must idempotency keys for order creation be generated by the client rather than the server?

- A) Clients have faster UUIDs than servers
- B) Server-generated keys can only be returned in the HTTP response; if the response is lost in transit, the client cannot recover the key and will create a duplicate order on retry
- C) Server-generated keys violate the REST constraint of statelessness
- D) The payment provider requires a client-side UUID format

Correct Answer: B
An engineer proposes replacing Kafka in the order pipeline with synchronous HTTP calls from the Order Service to the Notification, Analytics, and Warehouse services. What is the primary risk of this approach?

- A) HTTP is not reliable enough for internal microservice calls
- B) The checkout write path latency becomes coupled to the slowest downstream service; a Notification Service slowdown will directly inflate order creation p99 latency
- C) Analytics data will become inconsistent because HTTP does not support transactions
- D) Warehouse systems do not support HTTP APIs

Correct Answer: B
Open-ended challenge: A post-sale reconciliation job finds that the Redis inventory counter for the AirPods SKU shows 0 remaining, but the PostgreSQL `inventory` table shows `stock_count = 500, reserved = 450`, meaning 50 units should still be available. Walk through the likely cause of this drift, the correct remediation, and how you would prevent it from happening again.
Written by Abstract Algorithms (@abstractalgorithms)