System Design HLD Example: Chat and Messaging Platform
A clear HLD for real-time chat with ordering, delivery states, and offline synchronization.
TLDR: A distributed chat system must balance low-latency delivery with strong per-conversation ordering. The architectural crux is the WebSocket Gateway for persistent stateful connections and Cassandra for append-heavy message storage partitioned by conversation. By using a Kafka-driven Fan-out model and Redis-based Presence heartbeats, the system can handle millions of concurrent users while ensuring no message is lost, even across device reconnects and network shifts.
The Real-Time Connectivity Paradox
Imagine you are building a chat application like WhatsApp or Slack. On day one, with 100 users, it's easy. A simple HTTP poll or a single WebSocket server works perfectly. But then you scale to 2 billion users.
Suddenly, you're handling 1.16 million messages per second. Your servers are maintaining 100 million concurrent WebSocket connections. If a user in India sends a message to a user in Brazil, it needs to arrive in under 500ms. If the recipient is offline, the message must be buffered securely. When they reconnect on a train with a spotty 3G connection, their entire history must sync perfectly, in the correct order, without duplicates.
The design paradox of chat is this: it is a stateful problem (connections) that must be solved with a stateless mindset (scaling). Every WebSocket connection is a "heavy" resource that ties a specific user to a specific server. If that server crashes, 100,000 users "disappear" from the network. This tension between persistent connection state and horizontal scalability drives every major decision in a production-grade messaging platform.
Chat Platform: Use Cases & Requirements
Actors & Journeys
- Message Sender: Initiates a message (text, media, or emoji). Expects a "sent" ACK immediately and "delivered/read" updates later.
- Message Recipient: Receives push updates in real-time if online, or push notifications if offline.
- System Administrator: Monitors global delivery lag, connection churn, and partition health.
- Group Participant: Participates in multi-user conversations where fan-out load is multiplied by the number of active members.
In/Out Scope
- In-Scope: Real-time 1-on-1 and group messaging, delivery/read receipts, online/offline presence, message persistence, and reconnect synchronization.
- Out-of-Scope: End-to-end encryption (Signal Protocol implementation), voice/video calling (WebRTC signalling), and full-text search across billions of messages.
Functional Requirements
- Low-Latency Delivery: p95 delivery time < 200ms for online users.
- Message Ordering: Messages must appear in the same order for all participants in a conversation.
- Reliability: At-least-once delivery guarantee; no messages should be lost due to server failures.
- Presence: Real-time "Online/Away/Typing" indicators with < 30s staleness.
- Multi-device Sync: Messages sent from a phone must appear on the web/desktop client.
Non-Functional Requirements (NFRs)
- High Availability: 99.99% uptime for the messaging core.
- Scalability: Support 100M concurrent connections and horizontal scaling of all components.
- Partition Tolerance: The system must remain functional even if a specific gateway cluster or database node fails.
Foundations: State vs. Statelessness in Chat
In a typical RESTful system, servers are stateless. Any server can handle any request. Chat is different. Because we need push capabilities (server-to-client), we maintain persistent WebSockets.
This means:
- The Gateway is Stateful: Server A "knows" that User 123 is connected to it.
- The Registry is Global: If User 456 wants to send a message to User 123, the system must "lookup" which server User 123 is currently connected to.
- Connection Churn is Constant: Mobile clients drop and reconnect every time a user switches from WiFi to 4G. The system must handle this "flicker" without overwhelming the database.
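The global registry lookup described above can be sketched with an in-memory dict standing in for Redis and an injected clock for determinism (all class and parameter names here are illustrative, not a real driver API):

```python
from typing import Optional


class SessionRegistry:
    """In-memory stand-in for the Redis session registry (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        # user_id -> (gateway address, expiry time)
        self._sessions = {}

    def register(self, user_id: str, gateway: str, now: float) -> None:
        # Called on connect and again on every heartbeat to renew the TTL.
        self._sessions[user_id] = (gateway, now + self.ttl)

    def lookup(self, user_id: str, now: float) -> Optional[str]:
        # Returns the gateway holding the user's socket, or None if the
        # entry expired (the user is treated as offline).
        entry = self._sessions.get(user_id)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]
```

A crashed gateway simply stops renewing its users' entries, so lookups start returning None within one TTL window; reconnect "flicker" is absorbed by re-registering against whichever gateway the client lands on.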
The Message Lifecycle: From Sender to Recipient
The mechanism of a message send involves a carefully orchestrated sequence of events to ensure durability before confirmation.
- The Handshake: Client establishes a WebSocket connection with a Gateway Node.
- The Submission: Client sends a message over the socket.
- The Buffer: The Gateway forwards the message to the Message Service.
- The Durability Gate: The Message Service writes to Cassandra.
- The Distribution: Once persisted, the message is published to a Kafka Topic (keyed by Conversation ID).
- The Fan-out: Fan-out workers consume the message, look up recipients' active sessions in Redis, and route the message to the specific Gateways holding those connections.
- The ACK: Only after the message is in Kafka does the sender receive a "Sent" tick.
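The steps above can be sketched end-to-end with in-memory lists standing in for Cassandra and Kafka (a minimal sketch; none of these names are a real driver API):

```python
import uuid


class MessageService:
    """Illustrative sketch of the durability-gated send path (steps 3-7)."""

    def __init__(self):
        self.cassandra = []   # stand-in for the messages table
        self.kafka = []       # stand-in for the conversation-keyed topic
        self.dedup = {}       # client_message_id -> server message_id

    def submit(self, conversation_id, sender_id, body, client_message_id):
        # Idempotency check: a retransmitted message returns the original ID.
        if client_message_id in self.dedup:
            return {"status": "sent", "message_id": self.dedup[client_message_id]}
        message_id = uuid.uuid1()  # TIMEUUID: time-ordered clustering key
        # Durability gate: persist to storage before acknowledging.
        self.cassandra.append((conversation_id, message_id, sender_id, body))
        # Publish to the message bus, keyed by conversation for ordered fan-out.
        self.kafka.append({"key": conversation_id, "message_id": message_id})
        self.dedup[client_message_id] = message_id
        # Only now does the sender get the "Sent" tick.
        return {"status": "sent", "message_id": message_id}
```

The key design point is the order of operations: persistence happens before the ACK, so a crash anywhere after the Kafka append loses nothing the sender believes was sent.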
Estimations & Design Goals
Capacity Math (The WhatsApp Scale)
Assumptions for a global-scale service:
- Daily Active Users (DAU): 500M.
- Concurrent Connections: 100M.
- Messages per Day: 100 Billion.
- Average Message Size: 1 KB (including metadata).
- Write Throughput: $100B / 86,400 \approx \mathbf{1.16M \text{ msgs/sec}}$.
- Storage Growth: $100B \times 1KB = \mathbf{100 \text{ TB/day}}$.
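The capacity math above can be checked in a few lines (assuming 1 KB = 1,024 bytes here; the rounder 100 TB/day figure in the text uses 1 KB = 1,000 bytes):

```python
# Back-of-envelope capacity math from the assumptions above.
messages_per_day = 100e9        # 100 billion messages/day
seconds_per_day = 86_400
avg_message_bytes = 1_024       # 1 KB including metadata

writes_per_sec = messages_per_day / seconds_per_day
storage_per_day_tb = messages_per_day * avg_message_bytes / 1e12

print(f"Write throughput: {writes_per_sec / 1e6:.2f}M msgs/sec")  # ~1.16M
print(f"Storage growth:   {storage_per_day_tb:.0f} TB/day")       # ~102 TB
```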
Scaling Targets
- Gateway Capacity: Each node (e.g., 16 vCPU, 64GB RAM) handles roughly 500K WebSocket connections with conservative memory headroom.
- Redis Throughput: Presence heartbeats generate a massive volume of cheap O(1) writes; we need a partitioned Redis cluster.
- Cassandra Write Latency: Must be $< 10ms$ for high-concurrency appends.
High-Level Design: The Distributed Messaging Architecture
The architecture separates the Stateful Connection Layer from the Stateless Logic Layer.
```mermaid
graph TD
  ClientA[Sender Client] -- WebSocket --> GW1[Gateway Node 1]
  ClientB[Recipient Client] -- WebSocket --> GW2[Gateway Node 2]
  GW1 --> MsgSvc[Message Service]
  MsgSvc --> Cassandra[(Cassandra: Messages)]
  MsgSvc --> Kafka{Kafka: Message Bus}
  Kafka --> Fanout[Fan-out Workers]
  Fanout --> Redis[(Redis: Session Registry)]
  Fanout -- Push --> GW2
  GW2 -- Push --> ClientB
  ClientA -- Heartbeat --> Presence[Presence Service]
  Presence --> RedisP[(Redis: Presence)]
```
The diagram illustrates the clean separation between the stateful connection layer and the stateless processing layer. Gateway Nodes are stateful: each holds a pool of persistent WebSocket connections from specific clients. The Message Service and Fan-out Workers are completely stateless and horizontally scalable. Redis serves double duty: the Session Registry maps user IDs to the specific gateway node address holding their connection, while the Presence Store maintains heartbeat-based online/offline state. This layered design allows the gateway tier to scale independently from the message processing tier, and allows Fan-out Workers to grow in parallel with message volume without touching the connection infrastructure.
Deep Dive: How Cassandra and the Session Registry Power Distributed Chat at Scale
Internals: How TUUID Clustering Enforces Ordering Without Application-Level Sorting
Cassandra's Time-UUID (version 1 UUID) embeds a 60-bit timestamp measured in 100-nanosecond intervals, plus a 14-bit clock sequence number that prevents collisions when multiple UUIDs are generated within the same interval. The 60-bit timestamp counts 100-nanosecond intervals since October 15, 1582 (the Gregorian reform date used by RFC 4122). Because Cassandra uses TUUIDs as the clustering column, rows within a partition are stored on disk in TUUID order, which is chronological order, without any runtime sorting.
When a query requests the last 50 messages in a conversation, Cassandra performs a reversed range scan within the partition: it seeks to the end of the partition's on-disk SSTable segment for that conversation_id and reads backward. This is sequential I/O, the most efficient access pattern for an LSM-tree-based storage engine, so the query completes in time proportional to the number of results returned, not the total number of messages in the conversation.
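Python's uuid module generates the same version-1 (time-based) UUIDs, so the embedded timestamp field can be inspected directly; this sketch shows why sorting by the TUUID timestamp reproduces generation order:

```python
import uuid

# uuid1() is an RFC 4122 version-1 UUID: a 60-bit timestamp counting
# 100 ns intervals since 1582-10-15, plus a 14-bit clock sequence.
first = uuid.uuid1()
second = uuid.uuid1()

# The .time attribute exposes the raw 60-bit timestamp field.
assert first.version == 1
assert second.time >= first.time  # a later UUID never has an earlier timestamp

# Sorting by the timestamp field reproduces generation order, which is
# what Cassandra's clustering on a TIMEUUID column relies on.
batch = [uuid.uuid1() for _ in range(5)]
assert sorted(batch, key=lambda u: u.time) == batch
```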
Performance Analysis: Gateway Capacity, Message Throughput, and Redis Write Load
Understanding the numbers behind chat architecture prevents under-provisioning during capacity planning:
| Component | Metric | Calculation | Value |
| --- | --- | --- | --- |
| Messages per second | Write throughput | 100B messages/day ÷ 86,400 | 1.16M msgs/sec |
| Cassandra nodes needed | At 100K writes/sec per node | 1.16M ÷ 100K | 12 nodes (36 with replication factor 3) |
| Redis session registry writes | Presence heartbeats (naive) | 100M users × 1 heartbeat/15s | 6.7M SETEX/sec |
| Redis session registry writes (batched) | Heartbeat batching per gateway node | 200 gateways × 1 batch/15s | ~13 ops/sec (200 writes per 15s window) |
| Gateway nodes (WebSocket connections) | 500K connections per node | 100M ÷ 500K | 200 Gateway Nodes |
| Kafka partitions for message bus | conversation_id hashed across partitions | ~10M active conversations mapped onto shared partitions | 10,000 partitions (across brokers) |
The starkest number in the table is the heartbeat optimization: batching reduces the presence write load from 6.7 million operations per second to roughly 13 batched operations per second (200 writes per 15-second window), a reduction of about five orders of magnitude, achieved simply by aggregating at the Gateway tier.
The two hardest technical problems in distributed chat are (1) guaranteeing message ordering within a conversation across multiple writes and (2) routing a delivered message to the correct gateway when thousands of gateway instances exist. Both problems require careful data modeling decisions.
Ordering Messages with Cassandra's Time-UUID Clustering
Cassandra's data model is purpose-built for chat workloads. Messages are stored in a table partitioned by conversation_id, with a Time-UUID (TUUID) as the clustering column. Because a TUUID encodes a timestamp at 100-nanosecond resolution, rows within a partition are automatically ordered chronologically without any application-level sorting. A query for the last 50 messages in a conversation is a single-partition range scan, the most efficient operation in Cassandra's execution model.
Messages Table Design:
| Column | Type | Role |
| --- | --- | --- |
| conversation_id | UUID | Partition key: all messages in one conversation are co-located on the same node |
| message_id | TIMEUUID | Clustering column: time-ordered within the partition |
| sender_id | UUID | Message author reference |
| body | TEXT | Message content (max 10KB enforced at API layer) |
| message_type | TEXT | ENUM: text, image, system, reaction, file |
| sent_at | TIMESTAMP | Wall-clock time for display purposes |
| delivery_status | TEXT | ENUM: sent, delivered, read |
| client_message_id | UUID | Client-generated idempotency key for deduplication |
The client_message_id is the deduplication key. If a client retransmits a message due to a network retry, the Message Service checks Redis for this key before inserting into Cassandra. If the key already exists, the server returns the original message_id and sends an ACK; the message is not duplicated in storage.
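The Redis-side semantics here are essentially `SET key value NX EX <ttl>`: store only if absent, with expiry. A dict-backed stand-in (illustrative names, not the real redis-py API) captures the behavior:

```python
class DedupCache:
    """Stand-in for Redis `SET key value NX EX 300` (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # client_message_id -> (message_id, expires_at)

    def put_if_absent(self, client_message_id, message_id, now):
        """Return the canonical message_id: the new one if the key is
        absent or expired, otherwise the originally stored one."""
        entry = self._store.get(client_message_id)
        if entry is not None and now < entry[1]:
            return entry[0]  # retransmission: return original ID, skip insert
        self._store[client_message_id] = (message_id, now + self.ttl)
        return message_id
```

The TTL bounds memory: a retry storm within the window is deduplicated, while keys older than the window (by which time client retries have long stopped) are reclaimed.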
The Session Registry: Routing Fan-out to the Right Gateway
When a Fan-out Worker determines that a recipient is online, it must route the message to the specific Gateway Node holding that user's WebSocket connection. The Session Registry in Redis stores this mapping with automatic expiry:
| Redis Key | Value | TTL | Purpose |
| --- | --- | --- | --- |
| session:{user_id} | gateway-node-7:8080 (host and port) | 30 seconds, renewed by heartbeat | Routes fan-out to correct gateway node |
| presence:{user_id} | online / away / offline | 5 minutes | Presence state for UI display |
| typing:{conv_id}:{user_id} | 1 (flag) | 10 seconds (auto-expires) | Typing indicator state |
| dedup:{client_message_id} | {message_id} (stored ID) | 5 minutes | Idempotency check for retransmissions |
The 30-second TTL on session keys is the recovery mechanism for Gateway Node failures. If a node crashes, all its users' session keys expire within 30 seconds, and those users are automatically treated as offline. When they reconnect to a healthy gateway, they re-register their session key. Fan-out Workers then route subsequent messages to the updated gateway address.
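A Fan-out Worker's routing decision against that registry reduces to one branch: live push if a non-expired session exists, mobile push notification otherwise. A minimal sketch with injected delivery callbacks (all names illustrative):

```python
def fan_out(recipients, sessions, now, deliver_ws, notify_push):
    """Route one message to each recipient. `sessions` maps
    user_id -> (gateway_address, expires_at), as in the Session Registry."""
    routed = {}
    for user_id in recipients:
        entry = sessions.get(user_id)
        if entry is not None and now < entry[1]:
            # Online: push via the gateway holding the WebSocket.
            deliver_ws(entry[0], user_id)
            routed[user_id] = "websocket"
        else:
            # Offline (or gateway crashed and TTL lapsed): push notification.
            notify_push(user_id)
            routed[user_id] = "push"
    return routed
```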
Delivery State Machine
Every message transitions through a well-defined state machine from client submission to recipient acknowledgment:
```mermaid
graph LR
  A[Client Submits] --> B["Sent: Kafka ACK received"]
  B --> C["Delivered: Gateway pushed to recipient device"]
  C --> D["Read: Recipient opens the conversation"]
  B --> E["Failed: 3 retry attempts exhausted"]
  E --> F["Dead Letter Queue: manual review or replay"]
```
The state machine shows how a message moves from client submission through the Kafka acknowledgment (Sent tick), gateway delivery confirmation (double tick), and recipient read receipt (blue double tick). Messages that exhaust retry attempts move to the Dead Letter Queue, where operators can inspect the failure reason and replay the message without data loss.
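The legal transitions in that diagram can be captured as a small table, so an illegal jump (say, read back to sent) fails loudly instead of corrupting delivery state; the state names and function below are an illustrative sketch:

```python
# Legal transitions in the delivery state machine, per the diagram above.
TRANSITIONS = {
    "submitted": {"sent"},
    "sent": {"delivered", "failed"},
    "delivered": {"read"},
    "failed": {"dead_letter"},
}


def advance(current: str, target: str) -> str:
    """Move a message to the next delivery state, rejecting illegal jumps."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```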
Real-World Applications: How WhatsApp, Slack, and Discord Scaled Their Chat Architectures
WhatsApp handles over 100 billion messages per day across 2 billion users. WhatsApp's Gateway layer is built on Erlang/OTP, whose lightweight BEAM processes (far lighter than OS threads) allow a single node to maintain millions of persistent connections with microsecond context-switching overhead. Message storage uses a customized Mnesia cluster (Erlang's built-in distributed store) sharded by phone number hash. Because WhatsApp uses end-to-end encryption via the Signal Protocol, servers store only ciphertext blobs; the platform never holds plaintext message content.
Slack operates a fundamentally different fan-out model because Slack channels have many concurrent readers (team channels, not just 1-on-1 DMs). Slack's fan-out is channel-based: when a message arrives in a large channel with 10,000 members, online users receive an immediate push, while the message is also written to the channel's durable message store so users who join later can scroll back through history. Slack's full-text message search, across billions of messages, runs on Apache Solr with per-workspace index sharding.
Discord famously migrated its message store from MongoDB to Cassandra in 2017, then to ScyllaDB (a high-performance C++ Cassandra-compatible store) in 2022. Discord's post-migration analysis showed a 4× reduction in p99 read latency and a 50% reduction in infrastructure cost after the ScyllaDB migration. The key insight from Discord's experience: Cassandra is excellent for append-heavy chat workloads, but high-traffic Discord servers (with millions of messages per day in a single channel) created hot partitions that required manual time-window partitioning strategies to resolve.
Trade-offs and Failure Modes in Distributed Chat Architecture
WebSocket Stickiness vs. Stateless Gateway Scaling
Maintaining a persistent WebSocket connection to a specific Gateway Node is unavoidable for push delivery; this is an inherently stateful design. The trade-off is that Gateway Node failures cause connection drops for all users on that node. The mitigation is a health-check system that detects failed gateways within 10 seconds, combined with client-side automatic reconnect with exponential backoff. Clients reconnect to any healthy gateway, register their new session in Redis, and resume receiving messages; the Fan-out Worker picks up the new routing address automatically on the next delivery attempt.
At-Least-Once vs. Exactly-Once Delivery
Kafka guarantees at-least-once delivery, meaning a message can be delivered to a recipient's gateway more than once in failure and retry scenarios. The client suppresses duplicate message_id values when rendering the conversation view; this is the standard industry approach. Exactly-once delivery end-to-end would require distributed transactions across Kafka, Cassandra, and the Gateway, which introduces complexity and latency penalties that no production chat system has deemed worth the cost.
Presence Heartbeat Storm
If 100 million online users each send a presence heartbeat every 15 seconds, that generates 6.7 million Redis SETEX operations per second, a severe write load. Production systems use Gateway-Level Heartbeat Batching: each Gateway Node aggregates all heartbeats from its connected users and sends a single batched update to Redis every 15 seconds, rather than one Redis write per user. This reduces the Redis write load by roughly five orders of magnitude, from millions of per-user operations per second to a couple of hundred per-gateway batch writes per 15-second window.
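The batching trick is a few lines of gateway-side code: accumulate user IDs in memory and emit one batched write per flush interval, regardless of how many users heartbeat in between. A sketch with an injected batch writer (names illustrative; in practice this would be a single pipelined Redis round trip):

```python
class BatchingGateway:
    """Sketch of gateway-level heartbeat batching."""

    def __init__(self, redis_batch_writer):
        self._pending = set()
        self._write_batch = redis_batch_writer  # e.g. one pipelined round trip

    def on_heartbeat(self, user_id):
        # O(1) in-memory accumulation; no Redis write on the hot path.
        self._pending.add(user_id)

    def flush(self):
        # One Redis operation per flush interval, regardless of user count.
        batch, self._pending = self._pending, set()
        if batch:
            self._write_batch(batch)
        return len(batch)
```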
Decision Guide: Key Architecture Choices When Designing a Chat System
| Decision Point | Option A | Option B | When to Choose Which |
| --- | --- | --- | --- |
| Connection Protocol | HTTP Long-Polling (every 1-2 sec) | WebSockets (persistent) | WebSockets for sub-200ms latency; long-polling only for firewall-constrained enterprise environments |
| Message Storage | MySQL with sharding | Cassandra with TUUID | Cassandra for > 1M messages per day or > 100K concurrent conversations |
| Fan-out Strategy | Fan-out on Write (push all) | Fan-out on Read (pull per request) | Push for groups under 500 members; pull for large public channels |
| Presence Storage | Redis TTL heartbeats | Dedicated Presence Service | Redis TTL for simplicity; dedicated service for complex "away after N minutes of inactivity" rules |
| Message Ordering | Client-side timestamp sort | Server-side TUUID cluster | Server-side TUUID is authoritative and eliminates clock-skew disputes between devices |
| Delivery Guarantee | At-least-once + client dedup | Exactly-once distributed transaction | At-least-once is 10× simpler and adequate; client dedup handles rare duplicates cleanly |
Interview Delivery Example: Walking Through Chat System Design in 45 Minutes
Minute 1-5: Scope clarification. Ask: "Is this 1-on-1 only, or do we need group chat with large channel support? What is the maximum group size? Do we need message history search? Are read receipts required?" These questions reveal the design differences between WhatsApp (1-on-1 optimized) and Slack (channel-scale optimized), and show the interviewer you think in product terms.
Minute 6-15: Connection model. Open with WebSockets: "We need persistent, bidirectional connections for server-initiated push delivery. HTTP polling at scale would generate 100 million requests per second from clients alone, before any actual messages are sent. A WebSocket connection is persistent and message delivery is O(1) per push." Describe the Gateway Node role and the Session Registry that makes routing possible.
Minute 16-30: Message lifecycle walkthrough. Walk through all 7 steps: WebSocket submission → Message Service → Cassandra write → Kafka publish → Fan-out Worker → Session Registry lookup → Gateway push. Emphasize: "The sender receives an ACK only after Kafka persistence; this gives us the durability guarantee. Even if the Fan-out Worker crashes after publishing, Kafka retains the message and a replacement consumer reprocesses it."
Minute 31-40: Data model and scale. Present the Cassandra schema. Explain the TUUID clustering column and why it enables natural ordering without a separate sort step. Discuss write throughput: "1.16M messages per second across the Cassandra cluster, with each node handling roughly 100K writes per second, means we need approximately 12 Cassandra nodes at baseline, scaling to 36 nodes with 3× replication."
Minute 41-45: Failure modes. Address Gateway Node failure, Redis eviction under memory pressure, and Kafka consumer lag. Showing that you have planned recovery paths, not just the happy path, is what separates intermediate from senior-level answers.
Cassandra, Kafka, and Redis: The Production Infrastructure for Scalable Chat
Apache Cassandra is the industry-standard message store for high-volume chat due to its write-optimized log-structured storage engine and its ability to model ordered time-series data per conversation partition. Cassandra's replication factor (typically 3) ensures message durability even if two nodes fail simultaneously. Operators tune consistency levels: LOCAL_QUORUM for message writes (strong durability) and LOCAL_ONE for presence reads (lowest latency acceptable for eventually-consistent data).
Apache Kafka provides the durable message bus between the Gateway tier and the Fan-out Workers. Topics are partitioned by conversation_id, ensuring all messages within one conversation are processed in order by the same consumer partition. Kafka's configurable message retention (7-30 days) allows Fan-out Workers to replay missed events after recovering from a consumer failure; zero messages are permanently lost due to processing failures.
Redis Cluster stores the Session Registry, Presence data, and typing indicators. Per-key TTLs (SETEX) give session and presence entries automatic expiry, while Sorted Sets (ZADD/ZRANGEBYSCORE) can track last-heartbeat timestamps as scores for efficient range queries over stale members. In some implementations, Redis Pub/Sub broadcasts gateway-routing events across the cluster when session keys are updated, enabling Fan-out Workers to react to reconnects within milliseconds without polling the Session Registry.
Lessons Learned from Building Production Chat Systems
Hot Conversation Partitions Are Real and Painful. A Cassandra partition for a single high-traffic public server or channel (10 million messages per day) concentrates all writes on one Cassandra node, creating a hot partition that degrades p99 write latency for that conversation by 10× or more. Discord solved this by splitting conversation partitions by time window (one partition per day per conversation). The query changes from one partition scan to potentially two (spanning a day boundary), but hot-partition degradation is eliminated.
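The time-window split amounts to a composite partition key of (conversation_id, day_bucket); a sketch of the bucketing function (illustrative, assuming UTC epoch-day buckets as one reasonable choice):

```python
from datetime import datetime, timezone


def partition_key(conversation_id: str, sent_at: datetime, bucket_days: int = 1):
    """Composite partition key (conversation_id, day_bucket) that spreads a
    hot conversation's writes across one partition per time window."""
    epoch_day = int(sent_at.timestamp()) // 86_400
    return (conversation_id, epoch_day // bucket_days)
```

Reads for "the last 50 messages" then scan the current bucket first and only fall back to the previous bucket when a day boundary is crossed.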
Gateway Node Capacity Requires Conservative Planning. Each WebSocket connection consumes roughly 64KB of memory on a Linux server with tuned TCP socket buffers. A 64GB Gateway Node can theoretically hold 1 million connections, but at a 50% safety margin for memory spikes and GC pressure, plan for 500K connections per node. Serving 100M concurrent users requires at least 200 Gateway Nodes at baseline, and more during peak hours.
Presence Eventual Consistency Is the Right Default. Showing a user as "online" for 30 seconds after they disconnect is a universally accepted UX trade-off across WhatsApp, Telegram, and iMessage. Attempting strongly-consistent presence with distributed locks introduces coordination overhead that no production system justifies. Design presence as eventually consistent from day one and document the staleness window explicitly in product specs.
Mobile Network Turbulence Is Brutal at Scale. Mobile clients switch between WiFi and cellular constantly. Each network switch drops the TCP connection, triggering a WebSocket reconnect. Production Gateway Nodes handle up to 100K reconnects per second during commute hours. Use non-blocking I/O frameworks (Netty on the JVM, epoll on Linux) that decouple connection count from thread count. A thread-per-connection model fails catastrophically under this reconnect churn.
Key Takeaways: Chat System Design
- Chat is a stateful problem (persistent WebSocket connections) that must be solved with a stateless mindset (horizontal scaling behind the gateway). The Gateway tier is stateful; everything behind it (Message Service, Fan-out Workers, Presence Service) must be stateless and independently scalable.
- Cassandra with TUUID clustering columns is the standard message store for high-volume chat. It provides natural chronological ordering, write optimization via LSM trees, and linear horizontal scaling by adding nodes.
- The Session Registry in Redis is the routing table for fan-out. It maps user IDs to specific gateway node addresses and auto-expires via TTL on gateway failure, enabling automatic recovery without manual intervention.
- Kafka between the Message Service and Fan-out Workers guarantees at-least-once message delivery and enables fan-out workers to scale independently, replay missed events, and recover from failures without data loss.
- Presence is inherently eventually consistent and should be implemented with Redis TTL heartbeats, not distributed locks. A 30-second staleness window is universally acceptable across all major chat platforms.
- Client-side deduplication using message_id is the standard mechanism for handling at-least-once delivery. Do not attempt exactly-once delivery end-to-end; the coordination cost far exceeds the benefit.
Written by
Abstract Algorithms
@abstractalgorithms