
System Design HLD Example: Payment Processing Platform

An interview-ready HLD for payments focusing on correctness, idempotency, and recovery.

Abstract Algorithms · 8 min read

TL;DR: Design a payment processing system for online checkout. This article follows a standard system design interview flow: use cases, requirements, estimations, design goals, high-level design, and a design deep dive. Payment systems optimize for correctness first and throughput second, because mistakes are expensive and must be auditable.

📖 Use Cases

Actors

  • End users (payers) completing checkout and viewing payment status.
  • Merchants and internal services that create charges, captures, and refunds.
  • Platform services enforcing policy, routing, and reliability controls.

Use Cases

  • Primary interview prompt: Design a payment processing system for online checkout.
  • Core user journeys: Authorize, capture, refund, idempotency, ledger writes, webhook processing, and reconciliation.
  • Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.

This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.

🔍 Functional Requirements

In Scope

  • Support the core product flow end-to-end with clear API contracts.
  • Preserve business correctness for critical operations.
  • Expose reliable read and write interfaces with predictable behavior.
  • Support an incremental scaling path instead of requiring a redesign.

Out of Scope (v1 boundary)

  • Full global active-active writes across every region.
  • Heavy analytical workloads mixed into latency-critical request paths.
  • Complex personalization experiments in the first architecture version.

Functional Breakdown

  • Prompt: Design a payment processing system for online checkout.
  • Focus: Authorize, capture, refund, idempotency, ledger writes, webhook processing, and reconciliation.
  • Initial building-block perspective: Payment API, idempotency store, ledger service, provider adapters, webhook worker, reconciliation jobs.

A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.

⚙️ Non-Functional Requirements

| Dimension | Target | Why it matters |
| --- | --- | --- |
| Scalability | Horizontal scale across services and workers | Handles growth without rewriting core flows |
| Availability | 99.9% baseline with a path to 99.99% | Reduces user-visible downtime |
| Performance | Clear p95 and p99 latency SLOs | Avoids average-latency blind spots |
| Consistency | Explicit strong vs. eventual boundaries | Prevents hidden correctness defects |
| Operability | Metrics, logs, traces, and runbooks | Speeds incident isolation and recovery |

Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.

🧠 Estimations and Design Goals

The Internals

  • Service boundaries should align with ownership and deployment isolation.
  • Data model choices should follow access patterns, not default preferences.
  • Retries, idempotency, and timeout budgets must be explicit before scale.
  • Dependency failure behavior should be defined before incidents happen.

Estimations

Use structured rough-order numbers in interviews:

  1. Read and write throughput (steady and peak).
  2. Read/write ratio and burst amplification factor.
  3. Typical payload size and large-object edge cases.
  4. Daily storage growth and retention horizon.
  5. Cache memory for hot keys and frequently accessed entities.

| Estimation axis | Question to answer early |
| --- | --- |
| Read QPS | Which read path saturates first at 10x? |
| Write QPS | Which state mutation becomes the first bottleneck? |
| Storage growth | When does repartitioning become mandatory? |
| Memory envelope | What hot set must remain in memory? |
| Network profile | Which hops create the highest latency variance? |
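
These axes can be turned into concrete numbers in a minute of arithmetic. The figures below (10M payments/day, 10x burst factor, ~2 KB per record, 7-year ledger retention) are assumed interview inputs, not measured data:

```python
# Back-of-envelope sizing for the payment write path.
# All inputs are assumed interview numbers, not measurements.

SECONDS_PER_DAY = 86_400

def estimate(payments_per_day: int, peak_factor: float,
             reads_per_write: int, payload_bytes: int,
             retention_days: int) -> dict:
    """Return steady/peak QPS and storage growth for rough sizing."""
    write_qps = payments_per_day / SECONDS_PER_DAY
    return {
        "write_qps_steady": write_qps,
        "write_qps_peak": write_qps * peak_factor,
        "read_qps_steady": write_qps * reads_per_write,
        "storage_gb_per_day": payments_per_day * payload_bytes / 1e9,
        "storage_gb_retained": payments_per_day * payload_bytes
                               * retention_days / 1e9,
    }

# 10M payments/day, 10x bursts, 5 reads per write, ~2 KB records, 7-year retention
sizing = estimate(10_000_000, 10, 5, 2_000, 365 * 7)
# Steady writes land near ~116 QPS; the peak (~1,160 QPS) and the ~51 TB
# retained ledger are what actually drive partitioning decisions.
```

The useful interview move is stating which derived number forces an architectural decision, not the arithmetic itself.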

Design Goals

  • Keep synchronous user-facing paths short and deterministic.
  • Shift heavy side effects and fan-out work to asynchronous channels.
  • Minimize coupling between control-plane and data-plane components.
  • Introduce complexity in phases tied to measurable bottlenecks.

Performance Analysis

| Pressure point | Symptom | First response | Second response |
| --- | --- | --- | --- |
| Hot partitions | Tail latency spikes | Key redesign | Repartition by load |
| Cache churn | Miss storms | TTL and key tuning | Multi-layer caching |
| Async backlog | Delayed downstream work | Worker scale-out | Priority queues |
| Dependency instability | Timeout cascades | Fail-fast budgets | Degraded fallback mode |

Metrics that should drive architecture evolution:

  • p95 and p99 latency by operation.
  • Error-budget burn by service and endpoint.
  • Queue lag, retry volume, and dead-letter trends.
  • Cache hit ratio by key family.
  • Partition or shard utilization skew.
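
The first metric on that list is worth making concrete, because averages hide exactly the tail these SLOs exist to catch. A minimal nearest-rank percentile sketch (an assumed helper, not any monitoring product's API):

```python
# Nearest-rank percentile over a window of latency samples.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# One window of request latencies with a slow tail (illustrative values).
latencies_ms = [12, 15, 14, 13, 250, 16, 14, 13, 15, 900]

p50 = percentile(latencies_ms, 50)   # typical request: ~14 ms
p95 = percentile(latencies_ms, 95)   # tail request: 900 ms
# The mean here is ~126 ms, which describes no real request at all --
# the reason the table above targets p95/p99 rather than averages.
```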

📊 High-Level Design: Architecture for Functional Requirements

Building Blocks

  • Payment API, idempotency store, ledger service, provider adapters, webhook worker, reconciliation jobs.
  • API edge layer for authentication, authorization, and policy checks.
  • Domain services for read and write responsibilities.
  • Durable storage plus cache for fast retrieval and controlled consistency.
  • Async event path for secondary processing and integrations.
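
One building block above, the reconciliation job, is easy to sketch: compare internal ledger entries against a provider settlement report and surface anything that disagrees. The record shapes here are assumptions for illustration:

```python
# Reconciliation sketch: diff the internal ledger against a provider
# settlement report. Maps of payment_id -> amount_cents are assumed shapes.

def reconcile(ledger: dict[str, int], provider: dict[str, int]) -> dict:
    """Return payments missing on either side or with amount mismatches."""
    missing_at_provider = sorted(set(ledger) - set(provider))
    missing_in_ledger = sorted(set(provider) - set(ledger))
    amount_mismatches = sorted(
        pid for pid in set(ledger) & set(provider)
        if ledger[pid] != provider[pid]
    )
    return {"missing_at_provider": missing_at_provider,
            "missing_in_ledger": missing_in_ledger,
            "amount_mismatches": amount_mismatches}

report = reconcile(
    ledger={"pay_1": 5000, "pay_2": 1200, "pay_3": 750},
    provider={"pay_1": 5000, "pay_2": 1250, "pay_4": 900},
)
# pay_3 never settled, pay_4 was never recorded, pay_2 disagrees on amount --
# each bucket feeds a different repair workflow.
```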

Design the APIs

  • Keep contracts explicit and version-friendly.
  • Use idempotency keys for retriable writes.
  • Return actionable error metadata for clients and retries.
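
The idempotency-key bullet is the one worth demonstrating, since it is what makes client retries safe. A minimal in-memory sketch, where the endpoint name and response shape are illustrative rather than any specific gateway's API:

```python
# Idempotency guard for a retriable write: the same key replays the original
# response instead of creating a second charge. In-memory store for brevity;
# production would use a durable keyed store with a TTL.
import uuid

idempotency_store: dict[str, dict] = {}  # key -> first response

def create_payment(idempotency_key: str, amount_cents: int) -> dict:
    if idempotency_key in idempotency_store:
        return idempotency_store[idempotency_key]   # replay, don't re-charge
    response = {"payment_id": f"pay_{uuid.uuid4().hex[:8]}",
                "amount_cents": amount_cents,
                "status": "authorized"}
    idempotency_store[idempotency_key] = response
    return response

first = create_payment("key-123", 5000)
retry = create_payment("key-123", 5000)  # network retry with the same key
# Both calls return the identical response; exactly one payment exists.
```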

Communication Between Components

  • Synchronous path for user-visible confirmation.
  • Asynchronous path for fan-out, indexing, notifications, and analytics.

Data Flow

  • Checkout -> idempotency guard -> provider authorization -> ledger write -> reconciliation pipeline.

```mermaid
flowchart TD
    A[Client or Producer] --> B[API and Policy Layer]
    B --> C[Core Domain Service]
    C --> D[Primary Data Store and Cache]
    C --> E[Async Event or Job Queue]
    D --> F[User-Facing Response]
    E --> G[Workers and Integrations]
    G --> H[State Update and Telemetry]
```

🌐 API Mapping and Real-World Applications

This architecture pattern appears in real production systems because traffic is bursty, dependencies fail partially, and correctness requirements vary by operation type.

Practical API mapping examples:

  • POST /resources for write operations with idempotency support.
  • GET /resources/{id} for low-latency object retrieval.
  • GET /resources?cursor= for scalable pagination and stable traversal.
  • Async event emissions for indexing, notifications, and reporting.

Real-world system behavior is defined during failure, not normal operation. Good designs clearly specify what can be stale, what must be exact, and what should fail fast to preserve reliability.

⚖️ Trade-offs & Failure Modes (Design Deep Dive for Non-Functional Requirements)

Scaling Strategy

  • Scale stateless services horizontally behind load balancing.
  • Partition stateful data by access-pattern-aware keys.
  • Add queue-based buffering where write bursts exceed synchronous capacity.

Availability and Resilience

  • Multi-instance deployment across failure domains.
  • Replication and failover planning for stateful systems.
  • Circuit breakers, retries with backoff, and bounded timeouts.
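
The retry-with-backoff and bounded-timeout bullets combine into one control loop. This sketch is deliberately deterministic (no sleeping, no jitter) so the budget logic is visible; production code would sleep with jitter and classify retriable vs. fatal errors:

```python
# Retries with exponential backoff under a bounded total time budget.

def call_with_retries(operation, budget_ms: int = 2_000,
                      base_delay_ms: int = 100, max_attempts: int = 5):
    elapsed_ms, delay_ms = 0, base_delay_ms
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TimeoutError as exc:
            last_error = exc
            elapsed_ms += delay_ms
            if elapsed_ms >= budget_ms:   # fail fast once the budget is spent
                break
            delay_ms *= 2                 # exponential backoff
    raise last_error

# A provider stub that times out twice, then succeeds (illustrative).
attempts = {"n": 0}
def flaky_provider():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("provider timeout")
    return "authorized"

result = call_with_retries(flaky_provider)
```

The budget matters more than the retry count: it is what prevents one slow dependency from consuming the caller's entire latency SLO.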

Storage and Caching

  • Cache-aside for read-heavy access paths.
  • Explicit invalidation and refresh policy.
  • Tiered storage for hot, warm, and cold access profiles.
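
The cache-aside and explicit-invalidation bullets fit in a few lines. In-memory dicts stand in for a real cache and database here:

```python
# Cache-aside with invalidate-on-write. Dicts stand in for cache and DB.

db = {"pay_1": {"status": "captured"}}
cache: dict[str, dict] = {}
db_reads = {"n": 0}

def get_payment(payment_id: str) -> dict:
    if payment_id in cache:
        return cache[payment_id]          # cache hit
    db_reads["n"] += 1                    # cache miss: read through to the store
    record = db[payment_id]
    cache[payment_id] = record
    return record

def update_payment(payment_id: str, record: dict) -> None:
    db[payment_id] = record
    cache.pop(payment_id, None)           # explicit invalidation on write

get_payment("pay_1")                      # miss: populates the cache
get_payment("pay_1")                      # hit: no extra store read
update_payment("pay_1", {"status": "refunded"})
after = get_payment("pay_1")              # miss again: sees the fresh record
```

Invalidating rather than updating the cache on writes trades one extra miss for a much smaller window of stale reads, which is usually the right trade for payment status.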

Consistency, Security, and Monitoring

  • Clear strong vs eventual consistency contracts per operation.
  • Authentication, authorization, and encryption in transit and at rest.
  • Monitoring stack with metrics, logs, traces, SLO dashboards, and alerting.

This section is the architecture-for-NFRs view from your template. It explains how the system remains stable under scale, failures, and incident pressure.

🧭 Decision Guide

| Situation | Recommendation |
| --- | --- |
| Early stage with moderate traffic | Keep architecture minimal and highly observable |
| Read-heavy workload dominates | Optimize cache and read model before complex rewrites |
| Write hotspots appear | Rework key strategy and partitioning plan |
| Incident frequency increases | Strengthen SLOs, runbooks, and fallback controls |

🧪 Practical Example for Interview Delivery

A repeatable way to deliver this design in interviews:

  1. Start with actors, use cases, and scope boundaries.
  2. State estimation assumptions (QPS, payload size, storage growth).
  3. Draw HLD and explain each component responsibility.
  4. Walk through one failure cascade and mitigation strategy.
  5. Describe phase-based evolution for 10x traffic.

Question-specific practical note:

  • Persist ledger intent before external calls, enforce idempotent APIs, and reconcile asynchronously.
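
That bullet can be sketched end to end: record durable intent first, then call out, then finalize. If the process dies between the two steps, reconciliation finds the pending row and resolves it against the provider. All names and states below are illustrative:

```python
# "Persist intent before the external call": a pending ledger row is written
# before the provider is contacted, so no attempt can be silently lost.

ledger: list[dict] = []

def authorize_payment(payment_id: str, amount_cents: int, provider_call) -> str:
    ledger.append({"payment_id": payment_id,
                   "amount_cents": amount_cents,
                   "state": "pending"})               # durable intent first
    try:
        provider_ref = provider_call()
        ledger[-1].update(state="authorized", provider_ref=provider_ref)
    except Exception:
        # The attempt stays visible; a reconciliation job resolves it later.
        ledger[-1]["state"] = "needs_reconciliation"
    return ledger[-1]["state"]

def failing_call():
    raise TimeoutError("provider unreachable")

ok = authorize_payment("pay_1", 5000, lambda: "auth_ref_42")
bad = authorize_payment("pay_2", 900, failing_call)
# ok -> "authorized"; bad -> "needs_reconciliation", never a lost charge.
```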

A concise closing sentence that works well: "I would launch with this minimal architecture, monitor p95 latency, error-budget burn, and queue lag, then scale the first saturated component before adding further complexity."

๐Ÿ—๏ธ Advanced Concepts for Production Evolution

When interviewers ask follow-up scaling questions, use a phased approach:

  1. Stabilize critical path dependencies with better observability.
  2. Increase throughput by isolating heavy side effects asynchronously.
  3. Reduce hotspot pressure through key redesign and repartitioning.
  4. Improve resilience using automated failover and tested runbooks.
  5. Expand to multi-region only when latency, compliance, or reliability targets require it.

This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.

📚 Lessons Learned

  • Start with actors and use cases before drawing any diagram.
  • Define in-scope and out-of-scope boundaries to prevent architecture sprawl.
  • Convert NFRs into measurable SLO-style targets.
  • Separate functional HLD from non-functional deep dive reasoning.
  • Scale the first measured bottleneck, not the most visible component.

📌 Summary & Key Takeaways

  • Template-aligned answers are clearer, faster to evaluate, and easier to communicate.
  • Good HLDs explain both request flow and state update flow.
  • Non-functional architecture determines reliability under pressure.
  • Phase-based evolution outperforms one-shot overengineering.
  • Theory-linked reasoning improves consistency across different interview prompts.

📝 Practice Quiz

  1. Why should system design answers begin with actors and use cases?

A) To avoid architecture work entirely
B) To anchor architecture decisions to workload and user behavior
C) To skip non-functional requirements

Correct Answer: B

  2. Which section should define p95 and p99 targets?

A) Non-Functional Requirements
B) Only the quiz section
C) Only the related posts section

Correct Answer: A

  3. What is the primary benefit of separating synchronous and asynchronous paths?

A) It removes all consistency trade-offs
B) It isolates latency-critical user flows from heavy side effects
C) It eliminates monitoring needs

Correct Answer: B

  4. Open-ended challenge: for this design, which component would you scale first at 10x traffic, and which metric would you use to justify that decision?

Written by

Abstract Algorithms

@abstractalgorithms