System Design HLD Example: E-Commerce Platform (Amazon)

A practical interview-ready HLD for a large-scale e-commerce system handling catalog, cart, inventory, and orders.

Abstract Algorithms · 22 min read

TLDR: A large-scale e-commerce platform separates catalog, cart, inventory, orders, and payments into independent services. The hardest correctness problem in every interview is inventory under concurrent checkout, solved with a Redis atomic decrement for speed and an optimistic-lock SQL fallback for durability. Flash sale traffic is absorbed by API-gateway rate limiting, an in-memory inventory counter, and a request queue for overflow.

During Amazon Prime Day 2023, 375 million items were ordered in 48 hours, roughly 2,170 orders per second on average. A single flash sale for AirPods can trigger 50,000 concurrent add-to-cart requests in 30 seconds. Overselling inventory by even 1 unit causes a cascading customer-service nightmare: the item ships, the warehouse discovers zero stock, the order is cancelled post-payment, the customer is angry, and the support cost exceeds the margin on the sale.

How do you build a system that handles both that scale and that correctness simultaneously? The answer is not one system; it is seven, each tuned for a different consistency and throughput profile, wired together by an event bus.

By the end of this walkthrough you will know: why the cart service lives in Redis while orders live in PostgreSQL; why DECR beats a SQL UPDATE for inventory reservation under a flash sale; why order creation must be idempotent; and how Kafka decouples the checkout write path from notifications, warehouse events, and analytics.

📖 375 Million Orders in 48 Hours: Defining the Problem Space

Actors

| Actor | Role |
| --- | --- |
| Customer | Browses catalog, adds items to cart, places and tracks orders |
| Seller / Merchant | Lists products, manages inventory levels, fulfils shipments |
| Order Service | Orchestrates checkout: reserve inventory → process payment → create order |
| Warehouse System | Consumes order events; manages pick-pack-ship fulfilment |
| Notification Service | Sends order confirmation, shipment, and delivery emails/SMS |
| Analytics Pipeline | Consumes order events; populates dashboards and recommendation models |

Core Use Cases

  • Browse catalog - paginated category browse and full-text product search with filters (price, rating, brand)
  • View product detail - product page with images, description, seller info, pricing, and stock availability indicator
  • Add to cart - unauthenticated (guest cart) or authenticated (persistent cart); cart stored 30 days
  • Checkout - cart → reserve inventory → payment authorization → order confirmation
  • Process payment - pre-authorize then capture; idempotent with client-supplied idempotency key
  • Manage inventory - seller adjusts stock levels; system reserves and releases on checkout/cancellation
  • Order tracking - state machine from PLACED through SHIPPED to DELIVERED; customer-visible status page
  • Product reviews - submit, read, and aggregate (average rating) post-delivery

Read and write paths are analyzed separately to keep consistency boundaries and bottleneck profiles explicit.

โš™๏ธ Non-Functional Requirements and System Boundaries

In Scope

| Feature | Key Decision |
| --- | --- |
| Product catalog reads | Eventual consistency acceptable; strong CDN caching |
| Cart persistence | Redis with 30-day TTL; guest-to-user cart merge on login |
| Inventory reservation | Strong consistency required; Redis DECR + SQL fallback |
| Order creation | Idempotent; at-most-once payment charge |
| Flash sale traffic | Rate limit at gateway; queue overflow; atomic Redis counter |
| Notification delivery | Async via Kafka; at-least-once delivery acceptable |

Out of Scope (v1 Boundary)

  • Seller onboarding - KYC, bank account verification, tax registration
  • Fraud detection - ML scoring of orders; chargeback risk models
  • Logistics and delivery routing - last-mile optimization, carrier selection, route planning
  • Personalized recommendations - collaborative filtering, embedding-based retrieval

Non-Functional Requirements

| Dimension | Target | Rationale |
| --- | --- | --- |
| Catalog read latency | p99 < 50 ms | CDN + Redis cache; database is cold path only |
| Cart add/update latency | p99 < 30 ms | Redis in-memory write |
| Checkout (reserve + payment) | p95 < 3 s | Dominated by payment gateway SLA |
| Inventory correctness | Zero oversells under 50K concurrent checkouts for same SKU | Non-negotiable: 1 oversell = order cancellation + CS cost |
| Order read availability | 99.99% | Order status page must never show 500 |
| Catalog availability | 99.95% | Brief degradation tolerated; cached content served stale |
| Notification delivery | At-least-once; no SLA on latency | Email/SMS can arrive seconds after order |

🧠 Deep Dive: Data Internals, Capacity Math, and Performance Boundaries

The Internals: Data Model and Order State Machine

The two most important tables are inventory and orders. The inventory table must support both optimistic locking (SQL path) and serve as the source of truth for periodic reconciliation against the Redis counter.

-- Inventory table with optimistic locking via version column
CREATE TABLE inventory (
    sku_id       UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id   UUID        NOT NULL REFERENCES products(product_id),
    stock_count  INT         NOT NULL DEFAULT 0 CHECK (stock_count >= 0),
    reserved     INT         NOT NULL DEFAULT 0 CHECK (reserved >= 0),
    version      BIGINT      NOT NULL DEFAULT 0,  -- bump on every write; OCC version
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_inventory_product ON inventory (product_id);

-- Orders table with idempotency anchor
CREATE TABLE orders (
    order_id          UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    idempotency_key   TEXT        UNIQUE NOT NULL,  -- client-supplied; dedup anchor
    customer_id       UUID        NOT NULL,
    status            TEXT        NOT NULL DEFAULT 'PLACED'
                      CHECK (status IN (
                          'PLACED','PAYMENT_PENDING','CONFIRMED',
                          'SHIPPED','DELIVERED','CANCELLED','REFUNDED'
                      )),
    total_cents       BIGINT      NOT NULL,
    currency          CHAR(3)     NOT NULL DEFAULT 'USD',
    payment_id        UUID,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at DESC);
-- (the UNIQUE constraint on idempotency_key already creates its own index)

CREATE TABLE order_items (
    item_id    UUID   PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id   UUID   NOT NULL REFERENCES orders(order_id),
    sku_id     UUID   NOT NULL,
    quantity   INT    NOT NULL CHECK (quantity > 0),
    unit_cents BIGINT NOT NULL
);
CREATE INDEX idx_order_items_order ON order_items (order_id);

The order status values are mirrored by a Java enum that also encodes the allowed transitions, preventing illegal state changes before they reach the database:

import java.util.Map;
import java.util.Set;

public enum OrderStatus {
    PLACED,
    PAYMENT_PENDING,
    CONFIRMED,
    SHIPPED,
    DELIVERED,
    CANCELLED,
    REFUNDED;

    private static final Map<OrderStatus, Set<OrderStatus>> ALLOWED_TRANSITIONS = Map.of(
        PLACED,           Set.of(PAYMENT_PENDING, CANCELLED),
        PAYMENT_PENDING,  Set.of(CONFIRMED, CANCELLED),
        CONFIRMED,        Set.of(SHIPPED, CANCELLED),
        SHIPPED,          Set.of(DELIVERED),
        DELIVERED,        Set.of(REFUNDED),
        CANCELLED,        Set.of(),
        REFUNDED,         Set.of()
    );

    public boolean canTransitionTo(OrderStatus next) {
        return ALLOWED_TRANSITIONS.getOrDefault(this, Set.of()).contains(next);
    }
}

Mathematical Model: Throughput and Storage Sizing

Capacity estimation provides the numbers that drive every architectural decision: shard count, cache size, replica count, and storage tier selection.

Traffic assumptions:

| Variable | Value | Derivation |
| --- | --- | --- |
| DAU | 300M | Given |
| Browse requests per user per day | ~20 | Typical e-commerce session depth |
| Total browse reads/day | 6B | 300M × 20 |
| Peak read multiplier | 5× daily average | Flash sales, Prime Day |
| Peak read QPS | ~350K req/s | 6B / 86,400 s × 5 |
| Orders per day | 30M | ~10% of DAU place an order |
| Peak order QPS | ~10K orders/s | 30M / 86,400 s × 30 (peak multiplier) |
| Inventory reservation QPS | ~10K/s | 1 reservation per checkout |

Storage sizing:

$$\text{Catalog storage} = N_{\text{products}} \times S_{\text{product\_row}}$$

With 1 million products, each product row ≈ 4 KB (description, attributes, JSONB metadata):

$$1\text{M} \times 4\text{ KB} = 4\text{ GB (DB)} + \text{image CDN (TB scale)}$$

Order storage grows at roughly 30M orders/day × 2 KB/order ≈ 60 GB/day. After 3 years that is ~65 TB; partition by created_at month and archive cold partitions to object storage.

Redis inventory counter memory:

$$N_{\text{active\_SKUs}} \times 50\text{ bytes/key} = 10\text{M} \times 50\text{ B} \approx 500\text{ MB}$$

Performance Analysis: Bottleneck Profiles Under Peak Load

| Component | Pressure Point | Symptom | Mitigation |
| --- | --- | --- | --- |
| Inventory Redis | Hot SKU during flash sale: single key, 50K DECR/s | Redis CPU spike; latency spikes | Lua-guarded DECR; Redis Cluster shards key space; shard hot key across N sub-keys, sum for reads |
| Order DB write | 10K INSERT/s into a single PostgreSQL primary | WAL flush becomes bottleneck | Write to Kafka first (order-events); async writer batch-inserts into DB; async ACK pattern |
| Catalog DB read | 350K read QPS against product rows | DB CPU saturation | CDN (product images) + Redis read-through cache (product JSON, 1-hr TTL); DB is cold path only |
| Payment gateway | Third-party SLA variance (p99 > 5 s during Stripe incidents) | Checkout stalls | Circuit breaker; timeout at 8 s; async capture path after authorization |
| Search Elasticsearch | Complex filter queries at 10K req/s | Heap pressure; GC pauses | Dedicated coordinating nodes; cache common filter result sets in Redis (5-min TTL) |

📊 High-Level Architecture: Seven Services Serving 300 Million Daily Active Users

The seven services are themselves stateless; all state lives in the data stores they own. Each service owns exactly one data store, and cross-service data access goes through APIs, never direct DB queries.

graph TD
    Client["🌐 Client (Web/Mobile)"]
    APIGW["API Gateway\n(Auth · Rate Limit · Route)"]
    PCS["Product Catalog Service\n(PostgreSQL + Elasticsearch)"]
    SS["Search Service\n(Elasticsearch)"]
    CS["Cart Service\n(Redis)"]
    IS["Inventory Service\n(Redis + PostgreSQL)"]
    OS["Order Service\n(PostgreSQL + Kafka)"]
    PS["Payment Service\n(PostgreSQL + Provider)"]
    NS["Notification Service\n(Kafka Consumer)"]
    CDN["CDN\n(Product Images)"]
    Kafka["Apache Kafka\n(order-events topic)"]

    Client --> CDN
    Client --> APIGW
    APIGW --> PCS
    APIGW --> SS
    APIGW --> CS
    APIGW --> IS
    APIGW --> OS
    OS --> IS
    OS --> PS
    OS --> Kafka
    Kafka --> NS
    Kafka --> IS

Every client request enters through the API Gateway, which handles authentication (JWT validation), per-user rate limiting, and routing. The gateway is the only public-facing layer; all services operate inside a private network. The Product Catalog Service and Search Service handle all read traffic; the Cart, Inventory, Order, and Payment services handle the write-heavy checkout flow. Kafka connects the Order Service to downstream consumers (Notification, Analytics, Warehouse) without coupling them to the checkout critical path.

๐ŸŒ Real-World Applications: Component Deep Dives from Product Catalog to Flash Sale

Product Catalog: Denormalized Cache + Elasticsearch for Filtering

The catalog database holds the master product record (PostgreSQL). Elasticsearch holds a denormalized copy of every product for full-text search and faceted filtering. The CDN holds product images. When a seller updates a product, a background job updates the Elasticsearch document and invalidates the Redis product-page cache.

Cache hierarchy for a product page read:

  1. CDN edge cache (product page HTML fragment) - TTL 1 hour; hit rate ~85%
  2. Redis product-page JSON cache (product:{id}, TTL 1 hour) - hit rate ~12%
  3. PostgreSQL product row - under 3% of requests reach here under normal load
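The three-tier lookup above can be sketched as a chain of read-through caches. This is an illustrative in-memory sketch only: the two Maps stand in for the real CDN edge and Redis tiers, and the class and method names are hypothetical, not part of the actual catalog service.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the CDN -> Redis -> DB read-through chain.
// Each tier is modeled as a Map; real tiers would be the CDN edge,
// a Redis client, and a JDBC read replica.
public class LayeredProductCache {
    private final Map<String, String> cdnTier = new HashMap<>();
    private final Map<String, String> redisTier = new HashMap<>();
    private final Function<String, String> dbLookup; // cold path

    public LayeredProductCache(Function<String, String> dbLookup) {
        this.dbLookup = dbLookup;
    }

    public String getProduct(String productId) {
        String page = cdnTier.get(productId);
        if (page != null) return page;            // tier 1 hit (~85%)
        page = redisTier.get(productId);
        if (page == null) {                       // tier 2 miss
            page = dbLookup.apply(productId);     // cold path: read replica
            redisTier.put(productId, page);       // populate Redis on miss
        }
        cdnTier.put(productId, page);             // populate edge on miss
        return page;
    }
}
```

The key property to notice: the database function runs at most once per product until the caches are invalidated, which is what keeps the replica off the hot path.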

Shopping Cart: Redis as the Session Store

Each cart is stored as a Redis Hash keyed by cart:{user_id}. Each hash field is a sku_id, and its value holds the quantity and a price snapshot taken at the time of addition.

| Redis Key | Type | Value | TTL |
| --- | --- | --- | --- |
| cart:{user_id} | Hash | sku_id → {qty, unit_cents} | 30 days, reset on activity |
| cart:guest:{session_id} | Hash | Same structure for unauthenticated users | 24 hours |

On login, the guest cart is merged into the user cart (HGETALL the guest cart, HSET each field into the user cart, then DEL the guest cart). On checkout, the cart is not deleted immediately; it is marked as checked out by setting a cart:{user_id}:status key to CHECKED_OUT. Deletion happens asynchronously after order confirmation, so that an order cancellation can restore the cart.
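The merge step can be sketched with each Redis Hash modeled as a Map. One detail the design leaves unspecified is collision handling; this sketch assumes (hypothetically) that quantities are summed and the user's price snapshot wins on a SKU collision.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the guest-to-user cart merge. Assumption (not stated in the
// design): on a SKU collision, quantities are summed and the user's
// price snapshot is kept.
public class CartMerge {
    record CartLine(int qty, long unitCents) {}

    static Map<String, CartLine> merge(Map<String, CartLine> userCart,
                                       Map<String, CartLine> guestCart) {
        Map<String, CartLine> merged = new HashMap<>(userCart);
        guestCart.forEach((sku, guestLine) ->
            merged.merge(sku, guestLine, (u, g) ->
                new CartLine(u.qty() + g.qty(), u.unitCents())));
        return merged; // caller then HSETs merged fields and DELs the guest key
    }
}
```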

Inventory Management: Preventing Overselling Under 50K Concurrent Checkouts

This is the hardest correctness problem in the system. Two concurrent checkouts for the same AirPods SKU must both observe the current available stock, and when only 1 unit remains, exactly one of them may succeed while the other fails cleanly.

Two-phase reservation flow:

  1. Reserve - an atomic check-and-decrement (Lua script) on the Redis key inventory:{sku_id}. The script returns the new counter value on success, or -1 when stock is insufficient, in which case the checkout fails with OUT_OF_STOCK.
  2. Confirm - after payment succeeds, write the reservation to PostgreSQL with an optimistic-lock update.
  3. Release - if payment fails or the order is cancelled, INCRBY the Redis counter and update the DB.

Redis Lua script for atomic check-and-decrement:

-- inventory_reserve.lua
-- KEYS[1]: inventory:{sku_id}
-- ARGV[1]: quantity to reserve
local current = tonumber(redis.call('GET', KEYS[1]))
local qty     = tonumber(ARGV[1])
if current == nil or current < qty then
  return -1   -- insufficient stock
end
return redis.call('DECRBY', KEYS[1], qty)

The Lua script runs atomically: Redis executes it as a single command with no interleaving. This is faster than a distributed lock and safe against race conditions.

The SQL optimistic-lock confirmation (only executed after payment succeeds):

UPDATE inventory
   SET reserved = reserved + :qty,
       version  = version + 1,
       updated_at = NOW()
 WHERE sku_id  = :sku_id
   AND version = :expected_version
   AND (stock_count - reserved) >= :qty;
-- 0 rows updated = concurrent conflict; retry or abort

If the DB update returns 0 rows, a concurrent checkout won the race at the DB layer. The inventory Redis counter was already decremented; the transaction is rolled back by issuing an INCRBY on the Redis key and returning a checkout failure.
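The reserve-and-release lifecycle can be simulated end to end without Redis. In this illustrative sketch an AtomicLong stands in for the inventory:{sku_id} counter and a compareAndSet loop plays the role of the Lua script's atomic check-and-decrement; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simulation of the two-phase reservation flow. The CAS loop mirrors the
// Lua script's check-then-decrement semantics; release() mirrors INCRBY.
public class ReservationFlow {
    private final AtomicLong counter; // stands in for inventory:{sku_id}

    public ReservationFlow(long initialStock) {
        this.counter = new AtomicLong(initialStock);
    }

    /** Phase 1: reserve. Returns false (OUT_OF_STOCK) if stock is short. */
    public boolean reserve(long qty) {
        while (true) {
            long current = counter.get();
            if (current < qty) return false;          // insufficient stock
            if (counter.compareAndSet(current, current - qty)) return true;
        }
    }

    /** Phase 3: release on payment failure or cancellation. */
    public void release(long qty) {
        counter.addAndGet(qty);                        // INCRBY equivalent
    }

    public long remaining() { return counter.get(); }
}
```

With 1 unit of stock, two concurrent reserve(1) calls behave exactly as the flash-sale requirement demands: one wins, the other gets a clean failure.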

Order Processing: Idempotent Creation and Event Publication

The Order Service receives the checkout request after inventory is reserved and payment is authorized. It creates the order row inside a database transaction, then publishes an ORDER_CREATED event to Kafka. Idempotency is enforced by the UNIQUE constraint on idempotency_key โ€” a client retry with the same key hits the constraint, and the service returns the cached response from a Redis lookup keyed on order:idempotency:{key}.
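The deduplication logic can be sketched in a few lines. Here a ConcurrentHashMap stands in for both the UNIQUE(idempotency_key) constraint and the Redis response cache; the class name and wiring are hypothetical, not the actual Order Service code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of idempotent order creation: the map lookup plays the role of
// the orders table's UNIQUE(idempotency_key) constraint plus the cached
// response in Redis. A retry with the same key gets the same order_id.
public class IdempotentOrderCreator {
    private final Map<String, String> ordersByKey = new ConcurrentHashMap<>();
    private long nextOrderId = 0;

    /** Returns the same order_id for every retry carrying the same key. */
    public synchronized String createOrder(String idempotencyKey) {
        String existing = ordersByKey.get(idempotencyKey);
        if (existing != null) return existing;   // retry: return cached response
        String orderId = "order-" + (++nextOrderId);
        ordersByKey.put(idempotencyKey, orderId); // "INSERT" guarded by key
        return orderId;
    }
}
```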

Payment Integration: Pre-Auth, Capture, and Idempotency

Payment flows as two phases: pre-authorization (hold funds on card) during checkout, and capture (settle funds) after the warehouse confirms inventory pick. If the warehouse finds zero stock after order confirmation, the authorization is voided and the hold is released; the customer was never charged. Every payment API call includes the order's idempotency_key as the Idempotency-Key header sent to the payment provider (Stripe/Adyen), ensuring provider-side deduplication in addition to platform-side.

โš–๏ธ Trade-offs and Failure Modes: Overselling, Thundering Herds, and System Breaks

| Failure Mode | Trigger | Cascade | Mitigation |
| --- | --- | --- | --- |
| Redis inventory key eviction | Memory pressure flushes key mid-sale | All checkouts bypass counter; oversell risk | Set maxmemory-policy noeviction for inventory keyspace; alert on eviction events |
| Redis-DB desync | Redis DECR succeeds but DB reservation fails after payment timeout | Redis shows 0 stock but DB shows units available (or vice versa) | Periodic reconciliation job compares Redis counter to stock_count - reserved in DB; auto-corrects drift |
| Payment gateway brownout | Provider p99 spikes to 10 s | Checkout requests queue up; cart service receives repeat retries | Circuit breaker opens after 5 consecutive timeouts; fail fast with 503 Service Unavailable; retry queue for background re-attempt |
| Kafka consumer lag | Notification consumer falls behind | Delivery emails delayed | Monitor consumer group lag; autoscale consumer group; DLQ after 3 retries |
| Hot SKU during flash sale | 50K concurrent requests for same SKU | Redis key becomes single bottleneck | Lua guard script; shard hot inventory key as inventory:{sku_id}:{shard_N} across N sub-keys; aggregate for display, DECR individual shards |
| Order DB write bottleneck | 10K INSERT/s into single primary | WAL flush latency increases | Write-ahead Kafka buffer; async DB insert from Kafka consumer; decouple checkout latency from DB write |
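The hot-SKU sharding mitigation deserves a concrete sketch. Here an AtomicLongArray stands in for the N Redis sub-keys inventory:{sku_id}:{shard_N}: decrements hit a random shard (probing others on a miss) and the display value sums all shards. The class is illustrative; names and probing strategy are assumptions, not the production design.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of hot-key sharding: the single flash-sale counter is split into
// N sub-counters; writes spread across shards, reads sum them.
public class ShardedInventoryCounter {
    private final AtomicLongArray shards;

    public ShardedInventoryCounter(int shardCount, long totalStock) {
        shards = new AtomicLongArray(shardCount);
        for (int i = 0; i < shardCount; i++) {        // pre-load evenly
            shards.set(i, totalStock / shardCount
                    + (i < totalStock % shardCount ? 1 : 0));
        }
    }

    /** Try to reserve one unit from a random shard; probe others on miss. */
    public boolean reserveOne() {
        int start = ThreadLocalRandom.current().nextInt(shards.length());
        for (int i = 0; i < shards.length(); i++) {
            int idx = (start + i) % shards.length();
            long cur = shards.get(idx);
            while (cur > 0) {
                if (shards.compareAndSet(idx, cur, cur - 1)) return true;
                cur = shards.get(idx);
            }
        }
        return false;                                  // every shard is empty
    }

    /** Display value: sum of all shards. */
    public long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) sum += shards.get(i);
        return sum;
    }
}
```

The trade-off: reads become O(N) and the sum is only approximate under concurrent writes, which is acceptable for a stock indicator but not for the reservation decision itself.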

🧭 Decision Guide: Optimistic Locking vs. Redis Atomic Decrements

| Situation | Recommendation |
| --- | --- |
| < 100 concurrent checkouts for the same SKU | Optimistic locking on the DB is sufficient; lower operational complexity, no Redis dependency |
| 10K–50K concurrent checkouts for the same SKU (flash sale) | Redis Lua DECRBY is required; the DB cannot sustain the concurrent update rate without contention |
| Inventory correctness is the primary SLO | Two-phase: Redis DECR for speed + SQL confirmation for durability + periodic reconciliation for drift correction |
| Operational simplicity matters more than peak throughput | Single-path DB with row-level SELECT ... FOR UPDATE; horizontal scaling of the Inventory Service limits contention |
| System must continue if Redis is unavailable | Degrade to DB-only path (slower, ~200 ms vs ~5 ms); circuit breaker pattern on the Redis dependency |
| SKU sells out regularly and inventory accuracy is auditable | Always persist the two-phase SQL confirmation; the Redis counter alone is not an audit trail |

🧪 Two Critical Request Paths: Flash Checkout Write and Category Browse Read

Example 1: The Full Checkout Write Path

This path must complete atomically or roll back completely. A partial completion (payment charged but order not created) is a business-critical failure.

sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant Cart as Cart Service
    participant Inv as Inventory Service
    participant Pay as Payment Service
    participant Ord as Order Service
    participant K as Kafka
    participant NS as Notification Service

    C->>GW: POST /checkout {cart_id, idempotency_key}
    GW->>Cart: GET cart:{user_id}
    Cart-->>GW: [{sku_id, qty, unit_cents}, ...]
    GW->>Inv: RESERVE inventory (Lua DECR)
    Inv-->>GW: reservation_token (or OUT_OF_STOCK)
    GW->>Pay: POST /payments (pre-authorize, idempotency_key)
    Pay-->>GW: {payment_id, status: authorized}
    GW->>Ord: POST /orders {cart, payment_id, idempotency_key}
    Ord->>Inv: CONFIRM reservation (SQL UPDATE + version check)
    Ord-->>GW: {order_id, status: CONFIRMED}
    Ord->>K: publish ORDER_CREATED event
    GW-->>C: 201 Created {order_id}
    K-->>NS: send confirmation email (async)

The sequence diagram shows the linear happy path. On any step failure: if inventory reservation fails, return 409 Conflict (OUT_OF_STOCK) with no payment call; if payment fails, release the inventory reservation (INCRBY on Redis plus DB rollback); if order creation fails, void the payment authorization and release inventory. Each rollback is idempotent.

Example 2: The Category Browse Read Path

Read traffic dominates at 350K req/s and is served almost entirely from cache. The goal is that the database never sees more than ~3% of read traffic.

graph LR
    C["Client"] --> CDN["CDN Edge\n(HTML fragment)"]
    CDN -->|cache miss| SS["Search Service\n(Elasticsearch)"]
    SS -->|product IDs| RC["Redis Cache\n(product JSON TTL 1hr)"]
    RC -->|cache miss| DB["Product DB\n(PostgreSQL read replica)"]
    DB --> RC
    RC --> SS
    SS --> CDN
    CDN --> C

The CDN serves the rendered product-list HTML fragment on a hit. On a CDN miss, the Search Service queries Elasticsearch for matching product IDs, then batch-fetches product JSON from Redis. Only on a Redis miss does the system reach the database read replica. The database is never on the read hot path during normal operation.

๐Ÿ› ๏ธ Apache Kafka: How It Powers Order Event Processing

Apache Kafka decouples the Order Service's write path from every downstream consumer. Without Kafka, the Order Service would need to synchronously call Notification, Analytics, and Warehouse, adding latency, coupling failure domains, and making the Order Service brittle.

Topic design:

| Topic | Producers | Consumers | Retention |
| --- | --- | --- | --- |
| order-events | Order Service | Notification, Analytics, Warehouse | 7 days |
| inventory-events | Inventory Service | Analytics, Seller Dashboard | 3 days |
| payment-events | Payment Service | Order Service, Reconciliation Job | 7 days |

Kafka producer for order events:

import java.time.Instant;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderEventProducer {

    private static final Logger log = LoggerFactory.getLogger(OrderEventProducer.class);

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public OrderEventProducer(KafkaTemplate<String, OrderEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishOrderCreated(Order order) {
        OrderEvent event = OrderEvent.builder()
            .orderId(order.getOrderId().toString())
            .customerId(order.getCustomerId().toString())
            .status(order.getStatus().name())
            .totalCents(order.getTotalCents())
            .eventType("ORDER_CREATED")
            .occurredAt(Instant.now())
            .build();

        // Partition key = order_id: all events for the same order go to the same partition (ordering guarantee)
        kafkaTemplate.send("order-events", event.getOrderId(), event)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    log.error("Failed to publish ORDER_CREATED for order {}", event.getOrderId(), ex);
                    // Retry handled by Spring Kafka's RetryTemplate; DLQ after max retries
                }
            });
    }
}

For a full deep-dive on Kafka reliability patterns (consumer groups, exactly-once semantics, DLQ), see the System Design: Message Queues and Event-Driven Architecture companion post.

๐Ÿ—๏ธ Flash Sale Hardening: Rate Limiting, Atomic Counters, and Queue Overflow

Flash sales are adversarial load scenarios: a predictable traffic spike to 50–100× normal, concentrated on a tiny subset of SKUs. Three layers of defence are required.

Layer 1 - API Gateway rate limiting: Limit each user to 5 add-to-cart requests per second for the flash SKU. This prevents abuse bots from exhausting inventory before real customers. Token-bucket enforcement at the gateway is O(1) per request using Redis Lua (see System Design HLD Example: Rate Limiter).
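The token-bucket algorithm itself is worth seeing in miniature. This is an in-memory sketch of the per-user bucket (capacity 5, refilling 5 tokens/second); a real gateway would run equivalent logic in a Redis Lua script keyed per user, and the time parameter is injected here purely for testability.

```java
// Sketch of the per-user token bucket: one token per add-to-cart request,
// refilled continuously at the configured rate, capped at capacity.
public class TokenBucket {
    private final double capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if the request is admitted. Time is injected, not read from a clock. */
    public synchronized boolean tryAcquire(long nowNanos) {
        double refilled = (nowNanos - lastRefillNanos) * refillPerNano;
        tokens = Math.min(capacity, tokens + refilled);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```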

Layer 2 - Inventory counter atomicity: The Redis Lua DECRBY script ensures only one request wins each decrement. The counter is pre-loaded at flash sale start with exactly the available quantity (for example, 10,000 AirPods units). Pre-loading prevents the counter from drifting from the DB at sale start.

Layer 3 - Queue for overflow requests: When the flash sale inventory reaches 0, subsequent checkout requests are written to a waiting queue (backed by Kafka or an SQS FIFO queue). If a reservation is released due to payment failure, the next request in the queue is processed automatically. This prevents the OUT_OF_STOCK surge from reaching clients immediately; a configurable wait of up to 30 seconds is shown as a "joining waitlist" UX, which converts better than an instant failure.
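The waitlist hand-off can be sketched with an in-memory queue standing in for the Kafka/SQS FIFO queue; the class name and the rule that a released unit goes straight to the oldest waiter are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Sketch of the overflow waitlist: when stock hits zero, checkout requests
// are enqueued; a released reservation admits the next waiter in FIFO order.
public class FlashSaleWaitlist {
    private long stock;
    private final Queue<String> waitlist = new ArrayDeque<>();

    public FlashSaleWaitlist(long stock) { this.stock = stock; }

    /** Returns true if reserved immediately, false if placed on the waitlist. */
    public synchronized boolean tryCheckout(String requestId) {
        if (stock > 0) {
            stock--;
            return true;
        }
        waitlist.add(requestId);
        return false;
    }

    /** Called when a reservation is released; hands the unit to the next waiter. */
    public synchronized Optional<String> release() {
        String next = waitlist.poll();
        if (next == null) {
            stock++;                  // no waiters: unit returns to stock
            return Optional.empty();
        }
        return Optional.of(next);     // unit goes straight to the queued request
    }
}
```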

Scaling the Inventory Service during flash sales:

| Action | When | Effect |
| --- | --- | --- |
| Pre-load Redis counter from DB | 5 minutes before sale start | Eliminates DB read spike at T=0 |
| Scale Inventory Service pods to 20× | 10 minutes before sale start | Handles reservation RPC volume |
| Enable hot-key sharding for the flash SKU | At sale start | Distributes Redis traffic across N shards |
| Disable catalog cache TTL refresh for flash SKU | During sale | Prevents stale stock indicator on product page |
| Run reconciliation job immediately after sale | T + 5 minutes | Catches any Redis-DB drift before next sale |

📚 Lessons Learned from Building at Scale

1. Separate the inventory reservation from the payment; never do them in a single database transaction. The payment provider call takes 300 ms to 3 s. Holding a DB row lock for that duration at 10K orders/s will cause lock contention that cascades into queue buildup and timeouts.

2. The idempotency key must be client-generated, not server-generated. If the server generates the key, a request that times out before returning the key leaves the client with no way to deduplicate the retry. Clients should generate a UUID before calling the checkout endpoint and retry with the same UUID.

3. Redis inventory counters will drift from the DB. Network partitions, application bugs, and rollbacks all create drift. A nightly (or post-flash-sale) reconciliation job that compares Redis GET inventory:{sku_id} with stock_count - reserved in the DB and auto-corrects the delta is non-negotiable in production.
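The reconciliation job described above is simple enough to sketch directly. Here two Maps stand in for the Redis keyspace and the inventory table, with the DB's stock_count - reserved taken as the source of truth; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reconciliation job: for each SKU, compute the DB truth
// (stock_count - reserved), overwrite the Redis counter when it drifts,
// and report every correction so it can be alerted on.
public class InventoryReconciler {
    record DbRow(long stockCount, long reserved) {}

    /** Returns a map of SKU -> corrected value for every drifted counter. */
    static Map<String, Long> reconcile(Map<String, Long> redisCounters,
                                       Map<String, DbRow> dbRows) {
        Map<String, Long> corrections = new HashMap<>();
        dbRows.forEach((sku, row) -> {
            long truth = row.stockCount() - row.reserved();
            Long counter = redisCounters.get(sku);
            if (counter == null || counter != truth) {
                redisCounters.put(sku, truth);   // SET inventory:{sku} to truth
                corrections.put(sku, truth);     // record the drift for alerting
            }
        });
        return corrections;
    }
}
```

Run against the scenario from the practice quiz below (Redis says 0, DB says 500 - 450 = 50 available), the job would restore the counter to 50 and flag the SKU.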

4. Order state machines must be enforced at the application layer, not just by the DB column constraint. The canTransitionTo() method on the enum prevents illegal transitions (for example, DELIVERED → PLACED) before the DB write is attempted, giving clear error messages and preventing partial-state corruption.

5. Read replicas are not enough for the catalog at 350K req/s. Most teams add a read replica and assume it will handle catalog reads. It will not. A layered cache (CDN + Redis + replica) is required; the replica is only for cache misses and admin queries.

6. Kafka decoupling is worth the operational cost. Teams that synchronously call Notification and Analytics from the Order Service during checkout discover this the hard way: when the notification service slows down during a surge, checkout latency degrades with it. Event-driven decoupling protects the checkout p95 from every downstream service's hiccups.

📌 TLDR: Summary & Key Takeaways

  • Seven services, seven data stores - each service owns exactly one store. No cross-service direct DB queries. Inter-service communication via APIs and Kafka events.
  • Cart in Redis, orders in PostgreSQL - carts are ephemeral, latency-sensitive, and expire; orders are durable, auditable, and grow indefinitely. Different durability profiles demand different stores.
  • Redis DECR for inventory speed, SQL for inventory durability - the atomic Lua script handles flash-sale concurrency; the optimistic-lock SQL update persists the reservation and provides the audit trail; periodic reconciliation fixes drift.
  • Idempotency keys must be client-generated - server-generated keys cannot protect against the client's timed-out retry scenario.
  • Kafka is the seam between transactional and analytical - order events flow downstream to Notification, Warehouse, Analytics, and ML pipelines without adding latency to the checkout write path.
  • Flash sales require pre-loading - counter, pods, and sharding must be ready before the sale starts; reactive scaling at T=0 is too late.
  • One-liner: Inventory correctness under concurrent checkout is the hardest problem in e-commerce system design; solve it with a Redis Lua atomic decrement for speed and an optimistic-lock SQL confirm for durability.

๐Ÿ“ Practice Quiz

  1. Why does the inventory reservation use a Redis Lua script instead of a plain DECR followed by an if check in application code?

    • A) Lua scripts are faster to write than application code
    • B) A Lua script executes atomically: no interleaving between the GET and DECR, eliminating the TOCTOU race condition
    • C) Redis does not support DECR without a Lua wrapper
    • D) Application-level checks require a distributed lock that adds 200 ms latency

    Correct Answer: B
  2. During a flash sale, 50,000 checkout requests arrive simultaneously for a SKU with 1 unit of inventory. What happens under the two-phase reservation design?

    • A) All 50,000 requests succeed because Redis queues them
    • B) One request receives a decremented counter value ≥ 0 and proceeds; all others receive -1 (insufficient stock) and are rejected or queued
    • C) The first 50,000 decrement the counter to -49,999; an out-of-sync DB reconciliation job corrects this overnight
    • D) The API Gateway blocks all but 1 request via rate limiting before they reach the Inventory Service

    Correct Answer: B
  3. Why must idempotency keys for order creation be generated by the client rather than the server?

    • A) Clients have faster UUIDs than servers
    • B) Server-generated keys can only be returned in the HTTP response; if the response is lost in transit, the client cannot recover the key and will create a duplicate order on retry
    • C) Server-generated keys violate the REST constraint of statelessness
    • D) The payment provider requires a client-side UUID format

    Correct Answer: B
  4. An engineer proposes replacing Kafka in the order pipeline with synchronous HTTP calls from the Order Service to the Notification, Analytics, and Warehouse services. What is the primary risk of this approach?

    • A) HTTP is not reliable enough for internal microservice calls
    • B) The checkout write path latency becomes coupled to the slowest downstream service; a Notification Service slowdown will directly inflate order creation p99 latency
    • C) Analytics data will become inconsistent because HTTP does not support transactions
    • D) Warehouse systems do not support HTTP APIs

    Correct Answer: B
  5. Open-ended challenge: A post-sale reconciliation job finds that the Redis inventory counter for the AirPods SKU shows 0 remaining, but the PostgreSQL inventory table shows stock_count = 500, reserved = 450, meaning 50 units should still be available. Walk through the likely cause of this drift, the correct remediation, and how you would prevent it happening again.
