System Design HLD Example: Payment Processing Platform

An interview-ready HLD for payments focusing on correctness, idempotency, and recovery.

Abstract Algorithms

·Mar 13, 2026·15 min read

TLDR: Payment systems optimize for correctness first, then throughput. This guide covers idempotency, double-entry ledgers, and reconciliation.

Stripe processes over 250 million API requests per day, and every single payment must be idempotent: a user clicking "Pay" twice must never produce two charges, even when the first HTTP call timed out and no confirmation reached the client. The Knight Capital incident of 2012, in which a faulty software deployment sent millions of unintended orders and cost the firm roughly $440 million in about 45 minutes, illustrates how quickly an automated financial system compounds errors when nothing guards against repeated or unintended operations.

Designing a payment system forces you to prioritise correctness above all else. It is the domain where eventual consistency is a business and regulatory risk rather than merely a technical trade-off, and where every architecture decision has a permanent audit trail.

Imagine a scenario where a user initiates a $100 payment. The request reaches your API, you call the payment provider (like Stripe), the provider successfully charges the card, but your network connection drops before you receive the "OK" response. If your system simply retries the request without an idempotency key, the user is charged twice. If your system fails to record the successful charge in its own ledger because of the crash, you have a "lost payment" that results in a customer support nightmare and a reconciliation imbalance.

By the end of this walkthrough you'll know why payment amounts must be stored in integer cents (not floating point decimals), why ledger writes use double-entry accounting for reconcilability, and why idempotency keys must be client-supplied and cached for 24 hours to survive network retries without producing duplicate charges.

📖 Payment Processing: Use Cases & Requirements

A robust payment platform must handle several distinct user journeys and operational requirements. In an interview, defining these boundaries early is critical.

Functional Requirements

Idempotent Authorization: Ensure that a payment request is processed exactly once, regardless of retries.
Two-Phase Settlement (Authorize & Capture): Support "holding" funds (authorization) and later "settling" them (capture), common in e-commerce and hospitality.
Refunds & Reversals: Handle the return of funds with full auditability.
Webhooks: Notify downstream services (like order fulfillment) asynchronously when payment status changes.
Reconciliation: A background process to verify that the internal ledger matches the external provider's reality.

Non-Functional Requirements (NFRs)

Correctness (Accuracy): 100% data integrity. No "ghost" payments or missing ledger entries.
Strong Consistency: Ledger reads and writes must be strongly consistent, so a balance never reflects a stale or partially applied transaction.
High Availability (for Intake): The API must be available to receive payment intents, while settlement runs asynchronously and can tolerate transient failures through retries.
Low Latency: Checkout experiences must be fast; p95 latency for authorization should be < 500ms.
PCI Compliance: The system must minimize the exposure of raw credit card data (handled via tokenization).

🔍 Basics: Baseline Architecture

At its core, a payment system is a coordinator between three entities: the User (Client), the Internal Ledger (Source of Truth), and the External Payment Service Provider (PSP).

The baseline architecture consists of:

Payment API: The entry point that receives requests and orchestrates the flow.
Idempotency Store: A fast, key-value store (like Redis) to track request keys.
Ledger Database: A relational database (PostgreSQL) using ACID transactions to record every movement of money.
Wallet/Account Service: Manages the balances of users and merchants.
Reconciliation Engine: An offline batch job that compares PSP logs with internal ledger rows.

⚙️ Mechanics: Key Logic

The most critical logic in a payment system is the Persist-Before-Call pattern combined with Double-Entry Bookkeeping.

1. The Persist-Before-Call Pattern

To prevent lost payments, you must record the intent to pay in your database before calling the external PSP. This ensures that if the call to the PSP times out or the server crashes, you have a record to reconcile against later.
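The ordering can be sketched in a few lines. This is a minimal illustration, not a production design: it uses an in-memory SQLite database in place of PostgreSQL, and `process_payment`, `create_schema`, and the `call_psp` callback are hypothetical names introduced here.

```python
import sqlite3
import uuid


def create_schema(db):
    # Simplified, assumed schema: only the columns this sketch needs.
    db.execute(
        "CREATE TABLE payment_intents ("
        " intent_id TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL,"
        " status TEXT NOT NULL,"
        " psp_reference TEXT)"
    )


def process_payment(db, amount_cents, call_psp):
    """Persist a PENDING intent, call the PSP, then record the outcome."""
    intent_id = str(uuid.uuid4())

    # Step 1: commit the intent BEFORE any external call is made.
    with db:
        db.execute(
            "INSERT INTO payment_intents (intent_id, amount_cents, status) "
            "VALUES (?, ?, 'PENDING')",
            (intent_id, amount_cents),
        )

    try:
        # Step 2: the external call, which may time out.
        psp_reference = call_psp(intent_id, amount_cents)
    except TimeoutError:
        # Outcome unknown: leave the row PENDING so reconciliation can resolve it.
        return intent_id, "PENDING"

    # Step 3: record the confirmed outcome.
    with db:
        db.execute(
            "UPDATE payment_intents SET status = 'AUTHORIZED', psp_reference = ? "
            "WHERE intent_id = ?",
            (psp_reference, intent_id),
        )
    return intent_id, "AUTHORIZED"
```

If the process dies anywhere after step 1, the PENDING row survives and gives a later reconciliation pass something to match against the PSP's records.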

2. Double-Entry Bookkeeping

Money is never simply "updated" in a row. Every transaction must have at least two entries: a debit from one account and a credit to another.

User Account: Debit $100
Merchant Account: Credit $100

The total of all debits must always equal the total of all credits, so the two net to zero. If they don't, you have a bug or a fraud event.
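The debit and credit pair, and the balance check over it, can be expressed as a small sketch. The `Entry` type and the `post_transfer` and `is_balanced` helpers are hypothetical names used only for illustration; a real ledger would write both rows inside one database transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account_id: str
    entry_type: str  # "DEBIT" or "CREDIT"
    amount_cents: int


def post_transfer(ledger, from_account, to_account, amount_cents):
    """Append a matching debit and credit; never a single-sided update."""
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    ledger.append(Entry(from_account, "DEBIT", amount_cents))
    ledger.append(Entry(to_account, "CREDIT", amount_cents))


def is_balanced(ledger):
    """The accounting invariant: total debits equal total credits."""
    debits = sum(e.amount_cents for e in ledger if e.entry_type == "DEBIT")
    credits = sum(e.amount_cents for e in ledger if e.entry_type == "CREDIT")
    return debits == credits
```

Because entries are only ever appended in pairs, an unbalanced ledger can only result from a write outside `post_transfer`, which is exactly the kind of event worth alerting on.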

3. Idempotency Flow

Every write API must accept an Idempotency-Key.

Check Redis: Has this key been processed in the last 24h?
If yes, return the cached response.
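If no, acquire a distributed lock on the key. If another request already holds the lock, reject the request or wait rather than executing in parallel.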
Execute the business logic.
Store the response in Redis and release the lock.
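The steps above can be sketched as a single request handler. To stay runnable without a Redis server, the sketch uses a hypothetical `InMemoryStore` stand-in and omits TTLs; `handle_request` and the 409 response shape are also assumptions, not a prescribed API.

```python
LOCKED = "LOCKED"


class InMemoryStore:
    """Stand-in for Redis offering only the operations this flow needs."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set_if_absent(self, key, value):
        # Mirrors the semantics of a set-if-not-exists write.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def set(self, key, value):
        self._data[key] = value


def handle_request(store, idempotency_key, execute):
    cached = store.get(idempotency_key)
    if cached is not None and cached != LOCKED:
        return cached  # already processed: replay the stored response

    if not store.set_if_absent(idempotency_key, LOCKED):
        # Another request holds the lock for this key.
        return {"status": 409, "error": "request in progress"}

    response = execute()  # the business logic runs at most once per key
    store.set(idempotency_key, response)  # the cached response replaces the lock
    return response
```

In this simplified version an exception inside `execute` would leave the lock in place forever; with a real store, an expiry on the lock is what bounds that window.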

📐 Estimations & Design Goals

Before diving into the high-level design, let's establish the scale.

Throughput: 1,000 Transactions Per Second (TPS) average, 5,000 TPS peak (e.g., Black Friday).
Storage: Each transaction record is ~2KB. At 1,000 TPS, that's 2MB/sec, or ~172GB/day. We need a sharding strategy for the ledger or a cold-storage archival plan for old transactions.
Availability: 99.99% for the intake API.
Consistency: Strong consistency for ledger writes; eventual consistency for analytics and webhooks.
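The storage figure follows from simple arithmetic, reproduced here as a quick check. The yearly projection is an extrapolation that assumes the average rate holds all year and uses decimal units (1 GB = 10^9 bytes).

```python
RECORD_BYTES = 2_000      # ~2KB per transaction record
AVG_TPS = 1_000           # average transactions per second
SECONDS_PER_DAY = 86_400

bytes_per_second = RECORD_BYTES * AVG_TPS            # 2 MB/sec
bytes_per_day = bytes_per_second * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9                     # ~172.8 GB/day
tb_per_year = gb_per_day * 365 / 1_000               # ~63 TB/year

print(f"{gb_per_day:.1f} GB/day, {tb_per_year:.1f} TB/year")
```

At tens of terabytes per year before indexes and replicas, the case for sharding or archiving old ledger rows becomes concrete.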

📊 High-Level Design

The following diagram illustrates the orchestrated flow of a payment transaction.

graph TD
    User[Customer/App] -->|POST /payments| API[Payment API]
    API -->|1. Check/Lock Key| Redis[(Idempotency Redis)]
    API -->|2. Create Intent| Ledger[(PostgreSQL Ledger)]
    API -->|3. Authorize| PSP[Payment Provider - Stripe/Adyen]
    PSP -->|4. Response| API
    API -->|5. Update Status| Ledger
    API -->|6. Cache Result| Redis
    Ledger -->|7. CDC Event| Kafka{Kafka}
    Kafka -->|8. Notify| Webhook[Webhook Service]
    Webhook -->|9. Callback| Merchant[Merchant Server]
    Recon[Reconciliation Job] -->|Daily Sync| PSP
    Recon -->|Verify| Ledger

The flow enforces the Persist-Before-Call pattern: the Payment API writes a PENDING intent record to the PostgreSQL ledger (step 2) before calling the external PSP (step 3). This ensures that even if the network drops between steps 3 and 4, the system has a record to reconcile against. The idempotency check in Redis (step 1) prevents a duplicate upstream call if the client retries with the same key. The CDC (Change Data Capture) event from the ledger drives the asynchronous webhook path — order fulfillment is notified only after the ledger confirms the charge, never before.

Ledger and Idempotency Data Model

These tables form the financial source of truth. All amounts are stored in integer cents — never floating-point — to eliminate floating-point arithmetic errors on financial values.

Table	Column	Type	Description
payment_intents	intent_id	UUID	Unique payment identifier
payment_intents	idempotency_key	VARCHAR(255)	Client-supplied deduplication key (unique index)
payment_intents	amount_cents	BIGINT	Payment amount in smallest currency unit (cents)
payment_intents	currency	CHAR(3)	ISO 4217 currency code (e.g., USD, EUR)
payment_intents	status	ENUM	One of PENDING, AUTHORIZED, CAPTURED, REFUNDED, FAILED
payment_intents	psp_reference	VARCHAR	External PSP transaction ID for reconciliation
payment_intents	created_at	TIMESTAMP	Intent creation time
ledger_entries	entry_id	UUID	Unique ledger entry identifier
ledger_entries	intent_id	UUID	Foreign key to the payment intent
ledger_entries	account_id	UUID	Account being debited or credited
ledger_entries	entry_type	ENUM	DEBIT or CREDIT
ledger_entries	amount_cents	BIGINT	Entry amount in integer cents
ledger_entries	created_at	TIMESTAMP	When this ledger line was written
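The status column is easier to reason about as a set of allowed transitions than as a flat list of values. The transition map below is one plausible reading of the lifecycle and is an assumption; the source names the statuses but does not specify which moves are legal.

```python
# Assumed transition rules; adjust to the actual business lifecycle.
ALLOWED_TRANSITIONS = {
    "PENDING": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "FAILED"},
    "CAPTURED": {"REFUNDED"},
    "REFUNDED": set(),  # terminal
    "FAILED": set(),    # terminal
}


def can_transition(current, new):
    """Return True only if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Enforcing a check like this on every status update rejects out-of-order writes, such as a late webhook trying to move a REFUNDED intent back to CAPTURED.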

🧠 Deep Dive: Double-Entry Bookkeeping, Idempotency Key Lifecycle, and the Reconciliation Engine

Three mechanisms define correctness in a payment system: double-entry bookkeeping that makes every financial state change auditable, idempotency that prevents duplicate charges from network retries, and reconciliation that catches discrepancies between internal records and external PSP reality.

Internals: Double-Entry Bookkeeping and Why Integer Cents Matter

In a double-entry ledger, money is never simply "updated in a row." Every movement of value produces at minimum two ledger entries: a debit from one account and a credit to another. When a customer pays $100 to a merchant, the system writes: DEBIT $100 from the customer's account, CREDIT $100 to the merchant's escrow account. The fundamental accounting invariant is that the sum of all debits must equal the sum of all credits across the entire ledger at all times. If this invariant is violated, there is either a bug or a fraud event — either way, it is a P0 incident.

Amounts must be stored as integer cents (or the smallest currency unit) rather than floating-point decimals. In IEEE 754 floating-point arithmetic, 0.1 + 0.2 = 0.30000000000000004. In a payment context, this rounding error compounds across millions of transactions into reconciliation imbalances. $100.00 is stored as 10000 cents. All arithmetic is integer arithmetic. Currency formatting (adding the decimal point for display) is a presentation-layer concern, never a storage or calculation concern.
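A short sketch makes the difference visible. It assumes a two-decimal currency such as USD; `to_cents` and `format_cents` are hypothetical helper names, and currencies with zero or three decimal places would need a different scale factor.

```python
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a decimal string such as "100.00" into integer cents."""
    cents = Decimal(amount) * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"more than two decimal places: {amount!r}")
    return int(cents)


def format_cents(cents: int) -> str:
    """Presentation-layer formatting; storage and arithmetic stay in integers."""
    sign = "-" if cents < 0 else ""
    whole, frac = divmod(abs(cents), 100)
    return f"{sign}{whole}.{frac:02d}"


# Binary floating point cannot represent 0.1 or 0.2 exactly.
print(0.1 + 0.2 == 0.3)                                         # False
# Integer cents add exactly.
print(to_cents("0.10") + to_cents("0.20") == to_cents("0.30"))  # True
print(format_cents(to_cents("100.00")))                         # 100.00
```

Parsing through `Decimal` from a string, rather than multiplying a float by 100, keeps the rounding error from entering at the input boundary.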

Performance Analysis: Idempotency Key Cache and the Distributed Lock Lifecycle

The idempotency key lifecycle is a four-stage state machine in Redis. Stage 1: the API receives a request with key idem_abc123. It checks Redis for an existing entry. If not present, the API acquires a distributed lock using SET idem_abc123 LOCKED NX EX 30 (set-if-not-exists with a 30-second expiry). Stage 2: the API executes the payment business logic. Stage 3: on completion (success or failure), the API writes the response body to Redis under key idem_abc123 with a 24-hour TTL and releases the lock. Stage 4: any subsequent request with the same key finds the cached response and returns it immediately without re-executing business logic.

The 30-second lock TTL handles the crash recovery case: if the API server crashes during execution (after acquiring the lock but before completing), the lock expires automatically. The next retry attempt finds no lock and no cached response, so it re-executes from scratch. Re-execution is safe only because the PSP call carries the same idempotency key, so a charge that already succeeded is not repeated. The lock TTL must also exceed the worst-case execution time; otherwise a retry can start while the original request is still running. The 24-hour response cache handles the user retry case: if the client retries hours later because they didn't receive a confirmation, they get the original response. This two-TTL design is the key insight: the lock TTL handles crash recovery; the response cache TTL handles user retries.
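The interaction of the two TTLs is easier to follow with a controllable clock. The sketch below uses a hypothetical in-memory `TtlStore` that imitates set-if-absent with expiry; it is a model of the behaviour for illustration, not a Redis client.

```python
LOCK_TTL_SECONDS = 30
RESPONSE_TTL_SECONDS = 24 * 60 * 60


class TtlStore:
    """In-memory model of a key-value store with per-key expiry."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._data = {}      # key -> (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired keys
            return None
        return value

    def set(self, key, value, ttl_seconds, only_if_absent=False):
        if only_if_absent and self.get(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl_seconds)
        return True


now = [0]
store = TtlStore(clock=lambda: now[0])
store.set("idem_abc123", "LOCKED", LOCK_TTL_SECONDS, only_if_absent=True)  # lock acquired
now[0] = 31                # the server crashed and 31 seconds have passed
store.get("idem_abc123")   # None: the lock expired, so a retry may proceed
```

Injecting the clock keeps the expiry behaviour testable without sleeping, which is a convenient property for any TTL-dependent logic.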

🌍 Real-World Payment Architectures: Stripe, PayPal, and Square

Stripe treats idempotency keys as a first-class API primitive. Its POST endpoints accept an Idempotency-Key header, and the result of the first request made with a given key is saved and replayed for later requests with the same key; keys become eligible for removal once they are at least 24 hours old. Critically, Stripe compares the parameters of a repeated request with those of the original and returns an error if they differ, preventing a class of bugs where a client reuses a key across conceptually different operations.
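One way to detect a key being reused with a different request is to store a fingerprint of the request next to the key. The sketch below is an illustration of that idea only and does not describe Stripe's internal implementation; `fingerprint`, `check_key`, and `KeyReuseError` are hypothetical names.

```python
import hashlib
import json


class KeyReuseError(Exception):
    """Raised when a known key arrives with a different request."""


def fingerprint(method, path, body):
    # Canonical JSON so that key order in the body does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    material = f"{method} {path}\n{canonical}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def check_key(seen, idempotency_key, method, path, body):
    """Return False on first use, True on an identical replay."""
    fp = fingerprint(method, path, body)
    if idempotency_key not in seen:
        seen[idempotency_key] = fp
        return False
    if seen[idempotency_key] != fp:
        raise KeyReuseError(f"key {idempotency_key!r} reused with a different request")
    return True
```

A replay returns `True`, signalling that the cached response should be served, while a mismatched body fails loudly instead of silently returning the wrong payment's result.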

PayPal's payment processing architecture uses a saga-based orchestration model for multi-party payments (buyer → PayPal escrow → seller). Each saga step — authorization, capture, settlement — is a separate idempotent API call with its own idempotency key derived from the parent order ID plus a step identifier. If any step fails, the saga orchestrator replays only the failed step rather than the entire payment flow. PayPal's reconciliation engine runs as a continuous streaming job (Apache Flink) rather than a nightly batch, reducing the window for detecting and correcting discrepancies from 24 hours to under 60 seconds.

Square's point-of-sale architecture must handle payments when the internet connection is unavailable (e.g., a food truck with spotty connectivity). Square's offline mode stores payment intents locally on the POS device, signs them with a device-level key for fraud prevention, and syncs to the central ledger when connectivity is restored. The idempotency key in Square's offline mode is device-generated and includes the device ID, timestamp, and a sequence number to prevent replay attacks.

⚖️ Trade-offs and Failure Modes in Payment System Design

Dimension	Trade-off	Recommendation
Synchronous PSP call vs. async queue	Sync: user gets immediate confirmation; Async: better resilience but user waits for webhook	Sync authorization for checkout UX; async capture and settlement for risk scoring
Authorize-then-capture vs. immediate charge	Auth-capture: supports hotel holds and ride-shares; immediate: simpler for most e-commerce	Use auth-capture for variable-amount flows; immediate charge for fixed-price products
Strong vs. eventual consistency for ledger	Strong: always correct; Eventual: risks temporary imbalances in the audit trail	Always strong consistency for ledger writes — eventual is not acceptable for financial records
Internal PSP vs. third-party PSP	Internal: lower fees at massive scale; Third-party: faster time-to-market, compliance handled	Third-party PSP (Stripe, Adyen) until you are processing billions annually

The Double-Charge Failure Mode: The API calls the PSP, the PSP charges the card and returns success, but the network drops before the response reaches the API. The API times out and the PSP call is retried, either by the API itself or because the client retried with the same idempotency key. If the API does not forward an idempotency key to the PSP, the PSP treats the retry as a new charge and the user is charged twice. A timeout means the outcome is unknown, so the intent should stay PENDING rather than be marked FAILED. The fix: always pass the intent's intent_id as the idempotency key to the PSP. When the PSP receives the same idempotency key twice, it returns the original response without processing a new charge.
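The effect of forwarding the key can be demonstrated against a fake provider. `FakePsp`, `LostResponsePsp`, and `authorize_with_retry` are hypothetical classes and functions written for this sketch; they model a PSP that deduplicates on the key it is given and a network that drops the first response.

```python
class FakePsp:
    """Toy PSP that deduplicates charges by idempotency key."""

    def __init__(self):
        self.charges = []     # every charge actually made
        self._responses = {}  # idempotency_key -> original response

    def charge(self, amount_cents, idempotency_key):
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # replay, no new charge
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        response = {"charge_id": charge_id, "status": "succeeded"}
        self._responses[idempotency_key] = response
        return response


class LostResponsePsp(FakePsp):
    """Charges successfully, but the first response never reaches the caller."""

    def __init__(self):
        super().__init__()
        self._dropped = False

    def charge(self, amount_cents, idempotency_key):
        response = super().charge(amount_cents, idempotency_key)
        if not self._dropped:
            self._dropped = True
            raise TimeoutError("response lost in transit")
        return response


def authorize_with_retry(psp, intent_id, amount_cents, attempts=3):
    """Retry on timeout, always reusing intent_id as the PSP idempotency key."""
    last_error = None
    for _ in range(attempts):
        try:
            return psp.charge(amount_cents, idempotency_key=intent_id)
        except TimeoutError as exc:
            last_error = exc
    raise last_error
```

Running `authorize_with_retry(LostResponsePsp(), "intent-1", 10000)` returns the original charge on the second attempt, and the provider's `charges` list still holds a single entry.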

The Zombie Transaction: The PSP charges the customer, but the API server crashes before recording the outcome. The customer is charged, but the internal ledger shows only a PENDING intent with no ledger entries. The reconciliation engine detects this as a discrepancy (the PSP shows a charge; the ledger shows no completed payment). The fix: the reconciliation engine creates the missing ledger entries and triggers a refund if the charge cannot be attributed to a valid order, then pages the on-call team.
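The detection half of reconciliation is a comparison of two record sets. The sketch below assumes both sides can be keyed by `intent_id`; the `reconcile` function, its input shapes, and the finding labels are all hypothetical and chosen for illustration.

```python
def reconcile(psp_charges, intents):
    """Compare PSP charges with ledger intents and list discrepancies.

    psp_charges: {intent_id: amount_cents} as reported by the PSP
    intents:     {intent_id: (status, amount_cents)} from the internal ledger
    """
    findings = []

    for intent_id, psp_amount in sorted(psp_charges.items()):
        intent = intents.get(intent_id)
        if intent is None:
            findings.append((intent_id, "UNKNOWN_TO_LEDGER"))
            continue
        status, amount = intent
        if amount != psp_amount:
            findings.append((intent_id, "AMOUNT_MISMATCH"))
        elif status in ("PENDING", "FAILED"):
            # The zombie case: money moved, but the ledger never recorded it.
            findings.append((intent_id, "CHARGED_BUT_NOT_RECORDED"))

    for intent_id, (status, _amount) in sorted(intents.items()):
        if status in ("AUTHORIZED", "CAPTURED") and intent_id not in psp_charges:
            findings.append((intent_id, "RECORDED_BUT_NOT_CHARGED"))

    return findings
```

The function only reports; deciding whether to backfill ledger entries, refund, or page someone is a separate policy step that can be reviewed independently.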

🧭 Decision Guide: Choosing the Right Payment Guarantee Level for Each Operation

Operation Type	Consistency Requirement	Recommended Pattern
Authorizing a card charge	Strong — must not double-charge	Idempotency key + persist-before-call + PSP idempotency key
Capturing a pre-authorized hold	Strong — amount may be adjusted	Two-phase: verify authorization still valid, then capture
Issuing a refund	Strong — must not double-refund	Separate idempotency key per refund request, distinct from the original charge
Sending a webhook to merchant	At-least-once — OK to retry	Kafka + exponential backoff; merchant endpoint must be idempotent
Daily reconciliation	Eventual — nightly batch is acceptable	Compare PSP settlement report with ledger entries; flag discrepancies for manual review
Real-time fraud scoring	Eventual — non-blocking	Async call to fraud service; result gates capture, not authorization
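For the webhook row, the retry schedule can be sketched as capped exponential backoff. The function names, the one-second base, and the cap are assumptions for illustration; production senders usually add random jitter as well, which is left out here to keep the output deterministic.

```python
def backoff_delays(base_seconds, cap_seconds, attempts):
    """Delays that double each time, up to a fixed cap."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(attempts)]


def deliver_with_retries(send, event, delays, sleep):
    """Try once, then retry after each delay; True on the first success."""
    if send(event):
        return True
    for delay in delays:
        sleep(delay)
        if send(event):
            return True
    return False  # retries exhausted; the caller might park the event for review


print(backoff_delays(1, 60, 8))  # [1, 2, 4, 8, 16, 32, 60, 60]
```

Because delivery is at-least-once, the same event can arrive more than once, which is why the merchant endpoint has to deduplicate on an event identifier.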

🧪 Interview Delivery Example: Designing a Payment System in 45 Minutes

Minutes 1–5 — Frame the Problem: Open with the double-charge scenario. The user clicks "Pay," the first request times out, the client retries, and the card is charged twice. This single failure mode motivates the entire architecture. Establish NFRs: 100% correctness (no ghost charges, no lost payments), p95 authorization latency under 500ms, 99.99% availability for the intake API.

Minutes 6–15 — Idempotency First: Before drawing any service boxes, explain the idempotency key lifecycle. Redis distributed lock prevents concurrent processing. 24-hour response cache handles user retries. PSP idempotency key propagation prevents double-charges at the provider level. This is the highest-signal explanation in the first 15 minutes.

Minutes 16–30 — Persist-Before-Call and Double-Entry Ledger: Draw the four steps: (1) check idempotency key, (2) write PENDING intent to ledger, (3) call PSP, (4) update intent status. Explain why step 2 must precede step 3. Draw the ledger_entries table and show the debit + credit pair for a $100 transaction. Mention integer cents storage.

Minutes 31–40 — Reconciliation Engine: Explain that even with idempotency, the PSP and the internal ledger can drift due to crashes, timeouts, and PSP-side failures. The reconciliation engine runs on a schedule, compares PSP settlement files against ledger rows, and flags discrepancies. Describe the zombie transaction failure mode and how reconciliation detects it.

Minutes 41–45 — Trade-offs: Compare synchronous PSP calls vs. async queuing. Recommend sync for authorization (user-facing latency) and async for capture and settlement (non-blocking background operations). Mention the authorize-then-capture pattern for hotel and ride-share use cases where the final amount is unknown at authorization time.

🛠️ Payment Infrastructure and Tools Worth Knowing

Stripe: Industry-standard PSP with first-class idempotency key support, hosted payment pages (Stripe Checkout), and Radar for ML-based fraud detection.
Adyen: Preferred by enterprise retailers for multi-currency settlement and for acting as both gateway and acquirer, with direct card network connections.
Temporal: Workflow orchestration engine used by payment platforms to model the saga of authorize → capture → settle as durable, retryable workflow steps that survive server crashes.
Apache Flink: Stream processing engine suited to real-time reconciliation, comparing live PSP event streams against the ledger within seconds rather than in nightly batches.

📚 Lessons Learned from Payment System Failures

The Knight Capital Incident: No Guard Against Repeated Orders. On August 1, 2012, a faulty software deployment left obsolete code active on one of Knight Capital's trading servers. Over roughly 45 minutes the system sent millions of unintended orders, and the firm lost about $440 million. The defective code kept sending orders because it did not register that earlier ones had already been filled, and no circuit breaker halted the abnormal order volume. The lesson for payment systems: idempotency keys are not a nice-to-have optimization; they are a safety mechanism that prevents your retry logic from becoming your worst attacker.

A Missing PSP Idempotency Key Caused 0.3% of Payments to Double-Charge. A payment platform passed its own internal idempotency key to the API layer but did not forward a corresponding idempotency key to the PSP. On network timeouts between the API and the PSP, the API retried the PSP call without a deduplication key. The PSP processed each retry as a new charge. At 1,000 TPS with a 0.3% timeout rate, this was 3 double-charges per second — 259,200 per day. The fix: always derive the PSP idempotency key deterministically from the internal intent_id, ensuring that every retry of the same intent uses the same PSP deduplication key.

Integer vs. Float Amounts Caused a $0.01 Reconciliation Imbalance at $10 Million Volume. An early-stage payment platform stored amounts as floating-point decimals. At $10 million in daily volume, floating-point rounding errors accumulated to $0.01 of unattributed funds in the daily reconciliation report. The compliance team filed an incident report. The fix took one month of careful data migration. Store amounts in integer cents from day one — this is not a premature optimization, it is a correctness requirement for financial software.

📌 TLDR & Key Takeaways

Idempotency first: every write endpoint must accept a client-supplied idempotency key. Redis holds the distributed lock (short TTL for crash recovery) and the cached response (24-hour TTL for user retries). Forward the intent_id as the PSP idempotency key to prevent double-charges at the provider level.
Persist-Before-Call: write the payment intent with status PENDING to the ledger before calling the external PSP. If the server crashes between the write and the PSP call, the reconciliation engine will detect and resolve the discrepancy.
Double-entry bookkeeping: every financial transaction produces a DEBIT and a CREDIT entry in the ledger. Total debits must equal total credits, so the entries net to zero. Violations indicate bugs or fraud.
Integer cents only: store all amounts in the smallest currency unit as a BIGINT. Never use floating-point arithmetic for financial calculations.
Reconciliation is the safety net: run it continuously (Flink) or at minimum nightly. It catches failure modes that idempotency alone misses, such as zombie transactions, PSP-side discrepancies, and partial ledger writes.
Prefer synchronous PSP calls for authorization (user-facing latency) and asynchronous processing for capture, settlement, and webhooks (non-blocking background operations).
