
Real-Time Communication: WebSockets, SSE, and Long Polling Explained

When to use WebSockets vs Server-Sent Events vs Long Polling — with real production examples.

Abstract Algorithms · 23 min read

TLDR: 🔌 WebSockets = bidirectional persistent channel — use for chat, gaming, collaborative editing. SSE = one-way server push over HTTP with built-in reconnect — use for AI streaming, live logs, notifications. Long Polling = held HTTP requests — the pragmatic fallback when WebSockets are blocked. Short Polling = simplest but most wasteful — only for low-frequency, low-traffic scenarios. The decisive question is always: does the client need to send data back in real time?


📖 The Problem HTTP Was Never Designed to Solve

When you type a message in Slack and it appears instantly on your colleague's screen in Tokyo, that's not magic — it's a persistent WebSocket connection that stays open between your browser and Slack's servers. But when you check Twitter's notification bell, it doesn't use WebSockets — it uses Long Polling. And ChatGPT's streaming responses use a third approach entirely: Server-Sent Events. Why do three similar "show me new data" problems get three different solutions?

The root cause is HTTP's fundamental design. HTTP is a request-response protocol: the client asks, the server answers, the connection closes. The server has no way to push data to the client unprompted. Every piece of data you see on a webpage started with your browser explicitly asking for it.

This worked fine for 1990s web pages. Modern applications need the opposite: server-initiated events. Your Slack client needs to know when a message arrives — without constantly asking "anything new?"

The naive fix is to poll the server on a timer. At 1 million users polling every second, you generate 1 million HTTP requests per second. At roughly 1 KB per request and response, that's 1 GB/s of bandwidth — 99% of which returns an empty "nothing new" response. The polling tax compounds with scale.

Real-time communication patterns are engineering solutions to this HTTP limitation, each trading off latency, connection overhead, infrastructure complexity, and directionality in different ways.


๐Ÿ” Four Approaches to Real-Time Data: An Overview

All four patterns solve the same problem — getting fresh data from server to client — but they differ along two critical axes: how long the connection stays open and whether the client can also send data.

| Pattern | Connection Lifetime | Direction | Relative Latency | Relative Server Overhead |
|---|---|---|---|---|
| Short Polling | New request each interval | Client → Server → Client | High (interval delay) | High (requests/sec) |
| Long Polling | Held open until data arrives | Client → Server → Client | Low (~100 ms) | Moderate (held connections) |
| Server-Sent Events (SSE) | Single stream, kept open | Server → Client only | Very low (~50 ms) | Low (open streams) |
| WebSockets | Persistent after upgrade | Bidirectional | Excellent (~20 ms) | Low per message; high at scale |

The right choice depends not just on latency needs but on infrastructure reality: corporate proxies, CDNs, and load balancers often make WebSockets harder to deploy than their latency profile would suggest.


โš™๏ธ How Each Pattern Delivers Data in Practice

Short Polling: The Simplest Approach That Doesn't Scale

Short Polling is the default mental model for most engineers: the client sends a regular HTTP GET on a timer and immediately receives whatever data is available — including an empty response when there's nothing new.

GET /api/notifications HTTP/1.1
→  {"events": []}               ← 99% of the time
→  {"events": [{"id": 42}]}     ← occasionally

The appeal is simplicity: any HTTP server handles it, no special infrastructure, easy to debug. It works acceptably for truly infrequent updates (status checks every 30 seconds, config polling) with small user counts. Every request carries full HTTP headers (~500 bytes), opens a connection, completes a TLS handshake, and tears down — even when returning nothing. At 100,000 users polling every 2 seconds, that's 50,000 requests/second for a feature that's idle most of the day.
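A short-polling client is just a timer around a fetch. A minimal sketch (the endpoint name and 2-second interval are illustrative, and the fetch function is injected so the loop can run without a live server):

```javascript
// Short-polling sketch: ask on a timer, usually get nothing back.
// fetchFn is injectable (e.g. globalThis.fetch in a browser); the
// endpoint /api/notifications is hypothetical.
async function pollOnce(fetchFn, url) {
  const res = await fetchFn(url);
  const { events } = await res.json();
  return events; // [] on the vast majority of polls (the polling tax)
}

function startShortPolling(fetchFn, url, onEvents, intervalMs = 2000) {
  const timer = setInterval(async () => {
    const events = await pollOnce(fetchFn, url);
    if (events.length > 0) onEvents(events);
  }, intervalMs);
  return () => clearInterval(timer); // call to stop polling
}
```

Every tick pays full request overhead whether or not `events` is empty, which is exactly the cost profile described above.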

Long Polling: Holding the Line Until There's News

Long Polling improves on Short Polling by holding the HTTP request open on the server until data becomes available. The server responds only when it has something to say.

Client: GET /api/events        → (server holds the HTTP response object)
                               → (20 seconds pass, nothing happens)
Server:                        → 200 OK {"event": "new_message", "id": 7}
Client: GET /api/events        → (immediately re-opens the request)

This is called the Hanging GET pattern. The client immediately re-sends the request after receiving a response, maintaining near-continuous server coverage. A message published server-side reaches the client in ~50–100 ms (network RTT), not after a full polling interval. Twitter used Long Polling for notification dots and timeline updates for years — sub-second delivery without the complexity of WebSockets.

The limitation: each held-open request occupies a file descriptor and memory on the server. At 500,000 concurrent users, that's 500,000 open connections. Async I/O runtimes (Node.js, Netty) handle this efficiently, but stateful load balancers and proxies sometimes close "idle" connections prematurely, triggering immediate reconnect storms.
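Server-side, a "held response" is just a stored resolver. A minimal hub sketch (class and method names are hypothetical) that parks waiters until publish() fires, or until the hold timeout answers with an empty body so the client re-polls:

```javascript
// Long-polling hub sketch: each wait() parks a resolver until an event
// is published or the hold timeout fires (empty response; client re-polls).
class LongPollHub {
  constructor() { this.waiters = new Set(); }

  wait(timeoutMs = 30_000) {
    return new Promise((resolve) => {
      const waiter = {
        resolve,
        timer: setTimeout(() => {        // hold timeout: answer with nothing
          this.waiters.delete(waiter);
          resolve({ events: [] });
        }, timeoutMs),
      };
      this.waiters.add(waiter);
    });
  }

  publish(event) {                        // data arrived: answer every waiter
    for (const w of this.waiters) {
      clearTimeout(w.timer);
      w.resolve({ events: [event] });
    }
    this.waiters.clear();
  }
}
```

An HTTP handler would `await hub.wait()` and send the result; each entry in `waiters` is the per-client memory cost discussed above.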

Server-Sent Events: A Persistent One-Way HTTP Stream

Server-Sent Events establish a single long-lived HTTP response over which the server continuously streams events. Unlike Long Polling, the connection stays open indefinitely — the server keeps writing to it.

GET /stream HTTP/1.1
Accept: text/event-stream

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"type":"notification","id":1}

event: heartbeat
data: ping

data: {"type":"message","id":2}
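
The stream above is plain line-oriented text: optional id: and event: fields, a data: line, and a blank line terminating each event. A minimal serializer sketch (single-line payloads only; the full spec splits multi-line data across several data: lines):

```javascript
// Serialize one Server-Sent Events frame. A server would call
// response.write(sseFrame(...)) for each event on the open stream.
function sseFrame({ id, event, data }) {
  let frame = '';
  if (id !== undefined) frame += `id: ${id}\n`;
  if (event !== undefined) frame += `event: ${event}\n`;
  const payload = typeof data === 'string' ? data : JSON.stringify(data);
  frame += `data: ${payload}\n\n`;      // blank line ends the event
  return frame;
}
```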

The browser-side EventSource API makes this remarkably simple to consume:

const es = new EventSource('/stream');
es.onmessage = (e) => console.log(JSON.parse(e.data));
es.addEventListener('heartbeat', () => console.log('alive'));
// EventSource automatically reconnects on disconnect — no manual retry logic needed
// On reconnect, the browser sends Last-Event-ID so the server can replay missed events

SSE's built-in reconnection is its most underrated feature. When a mobile network drops, EventSource automatically retries and includes Last-Event-ID, allowing the server to replay any events the client missed — at-least-once delivery semantics with zero application code. ChatGPT uses SSE to stream generated tokens to your browser as the model produces them. GitHub Actions uses it to stream live build logs line by line.
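Replay on reconnect only requires the server to keep a short, ID-ordered history. A sketch (buffer size and names are illustrative):

```javascript
// Sketch of Last-Event-ID replay: keep a bounded in-memory history and,
// on reconnect, return every event newer than the client's last seen ID.
class EventLog {
  constructor(capacity = 1000) {
    this.capacity = capacity;
    this.nextId = 1;
    this.events = [];                  // { id, data } in ascending id order
  }

  append(data) {
    const event = { id: this.nextId++, data };
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift();
    return event;
  }

  // Called when a reconnecting client presents its Last-Event-ID header.
  since(lastEventId) {
    return this.events.filter((e) => e.id > lastEventId);
  }
}
```

The bounded buffer is the trade-off: clients that stay disconnected longer than the buffer covers must fall back to a full refetch.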

The hard constraint: SSE is strictly one-directional — server to client only. The client cannot send data back over the SSE channel. For scenarios where the client also needs to push (chat, live editing), a separate HTTP request or a different pattern is required.

WebSockets: Full-Duplex Over a Persistent TCP Channel

WebSockets begin as an ordinary HTTP request that negotiates a protocol upgrade; once the upgrade succeeds, the same TCP connection stops speaking HTTP and carries WebSocket frames:

GET /chat HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

After the 101 response, HTTP is gone. The TCP connection carries framed messages in both directions simultaneously. Either side can write at any time — no polling, no request overhead, no half-duplex constraint. A WebSocket frame for a small message (< 126 bytes) carries only 2 bytes of framing overhead, compared to 500+ bytes of HTTP headers for the equivalent polling request.
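The Sec-WebSocket-Accept value in that handshake is deterministic: per RFC 6455 it is the base64-encoded SHA-1 of the client's key concatenated with a fixed GUID. In Node:

```javascript
import { createHash } from 'node:crypto';

// RFC 6455: accept = base64(SHA-1(clientKey + fixed GUID))
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

function secWebSocketAccept(clientKey) {
  return createHash('sha1').update(clientKey + WS_GUID).digest('base64');
}
```

Feeding it the sample key from the request above reproduces the accept value in the 101 response, which is how the client verifies it is talking to a real WebSocket server rather than a confused HTTP cache.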

Slack keeps one WebSocket connection open per active browser tab. Typing a message sends a text frame directly to the server. A reply from Tokyo arrives as an incoming frame without any client-side request. Figma synchronizes cursor positions across all participants using WebSocket binary frames — each mouse movement generates a ~20-byte frame fanned out to every connected client in the same document.


🧠 Deep Dive: Connection State, Internals, and Protocol Costs

The Internals: What the Server Actually Holds

Each pattern has a fundamentally different server-side resource profile, and choosing the wrong one for your scale can break production.

Short Polling is stateless. Each request is independent. Any HTTP server handles it natively — no persistent state, no special infrastructure. The cost is pure throughput: CPU, bandwidth, and request-handling capacity consumed proportionally to polling frequency.

Long Polling requires the server to hold the HTTP response object in memory for each waiting client. In Node.js, a held response is just a deferred callback on the event loop — lightweight. In thread-per-request frameworks (early Java Servlet containers), each held connection occupies a thread. At 100,000 concurrent users, that's 100,000 blocked threads — a deployment that needs async I/O or you'll run out of threads before you run out of memory.

SSE holds a single HTTP response stream per client. The server calls response.write() whenever a new event is ready. The connection object lives in application memory for the session lifetime. SSE over HTTP/2 is especially efficient: multiple streams multiplex over a shared TCP connection. Over HTTP/1.1, however, each SSE stream occupies one of the browser's roughly six per-origin connections — a real constraint to watch for.

WebSockets hold a raw TCP socket per client, managed outside the normal HTTP request lifecycle. At 1 million concurrent connections you need roughly:

  • 1 GB RAM minimum (1 KB per socket × 1 M sockets) just for socket state
  • A message broker (Redis pub/sub or Kafka) to fan out messages across server instances — a WebSocket connection is pinned to the server that accepted it
  • Sticky sessions or a connection router so outbound messages reach the correct server

Performance Analysis: Overhead per Delivered Message

| Metric | Short Polling | Long Polling | SSE | WebSockets |
|---|---|---|---|---|
| Header overhead per event | ~500 bytes | ~500 bytes | ~20 bytes | 2–14 bytes |
| TCP handshake | Every request | Per reconnect | Once per session | Once per session |
| Effective latency | Interval-length | ~50–200 ms | ~30–100 ms | ~10–50 ms |
| Memory: 1 M concurrent users | Stateless | ~500 MB | ~200 MB | ~1 GB |
| Fan-out complexity | None (stateless) | Moderate | Moderate | High (requires broker) |

WebSockets dominate on raw latency and per-message bandwidth. Short Polling is the worst performer at scale. SSE occupies a well-balanced middle ground for one-directional workloads — nearly as efficient as WebSockets per event, with dramatically simpler server infrastructure and no need for a message broker.

The practical trade-off: WebSockets are best when you need bidirectionality and can absorb the scaling infrastructure cost. For everything else, SSE is almost always the more economical choice.


📊 Connection Lifecycle: Seeing All Four Patterns Side by Side

graph LR
  subgraph SP["① Short Polling"]
    C1[Client] -->|GET every N seconds| S1[Server]
    S1 -->|Instant response — empty or data| C1
  end

  subgraph LP["② Long Polling"]
    C2[Client] -->|GET /updates| S2[Server]
    S2 -->|Held open — responds only when data ready| C2
  end

  subgraph SSE_["③ Server-Sent Events"]
    C3[Client] -->|GET /stream Accept: text/event-stream| S3[Server]
    S3 -->|Persistent stream — server writes events continuously| C3
  end

  subgraph WS_["④ WebSockets"]
    C4[Client] -->|HTTP GET + Upgrade: websocket| S4[Server]
    S4 -->|101 Switching Protocols| C4
    C4 <-->|Full-duplex framed messages — either side writes any time| S4
  end
  end

Reading the diagram: Short Polling and Long Polling both use repeated HTTP requests — the only difference is how long the server waits before responding. SSE opens a single HTTP stream that the server continuously writes to, with the client as a passive reader. WebSockets are the only pattern where the client and server each hold a write handle simultaneously — the bidirectional arrow is the defining architectural distinction.

Why connection lifetime matters: Every new HTTP connection pays a TCP + TLS handshake cost (about two round trips with TLS 1.3, three with TLS 1.2). Short Polling pays this on every interval. Long Polling pays it on every event delivery (reconnect after response). SSE and WebSockets pay it exactly once per session, then amortize that cost across all subsequent messages.


๐ŸŒ Real-World Applications: Who Uses WebSockets, SSE, and Long Polling in Production

Slack → WebSockets. Every active Slack session maintains a persistent WebSocket. Messages, typing indicators, read receipts, and presence heartbeats all transit the same channel. Slack's backend routes WebSocket frames through a Redis pub/sub layer so that when Alice's message arrives on Server A, it fans out to Bob's connection on Server B. The bidirectionality of WebSockets is essential — the client sends typing events and ACKs back to the server over the same connection.

ChatGPT → Server-Sent Events. GPT-4's output is generated token-by-token. OpenAI streams each token as an SSE event the moment the model produces it. The "typing" effect you see in the interface is real — tokens are generated sequentially and delivered via SSE with ~50 ms latency per batch. SSE is the correct choice here because communication is one-way only: the server streams, the client displays. The initial prompt was already sent via a regular HTTP POST.

Twitter (historically) → Long Polling. Twitter's notification bell and real-time timeline used Long Polling for years. Their engineering team documented that WebSockets added sticky session and fan-out complexity that was disproportionate to their primarily-read workload — notifications flow from server to client, and sub-second delivery was achievable without a persistent bidirectional channel.

GitHub Actions → Server-Sent Events. Live build logs stream to your browser as each pipeline step executes. The server appends log lines to an SSE stream in sequence. If your network drops mid-build, Last-Event-ID lets the browser resume the stream from the last received line — no missed log output and no full-page reload.

Online multiplayer games → WebSockets. Agar.io, Slither.io, and similar browser games use WebSockets for bidirectional position sync at 20–60 messages/second per player. At this frequency, the 2-byte WebSocket frame header is critical — an equivalent polling approach would spend more bandwidth on HTTP headers than on the actual game state.


โš–๏ธ Trade-offs & Failure Modes: Why Each Real-Time Pattern Has a Price

| Dimension | Short Polling | Long Polling | SSE | WebSockets |
|---|---|---|---|---|
| Delivery latency | High — interval delay | Low — ~100 ms | Very low — ~50 ms | Excellent — ~20 ms |
| Bandwidth efficiency | Poor — full headers per poll | Moderate — headers per event | High — framing only | Highest — 2-byte frames |
| Server state | None (stateless) | Held response objects | Held stream objects | Raw TCP sockets |
| Bidirectional? | No | No | No | Yes |
| Auto-reconnect | Client implements | Client must re-poll | Built-in (EventSource) | Manual or library |
| Proxy/firewall compatibility | Excellent | Good | Good | Variable — often blocked |
| Scaling infrastructure | None needed | Async server | Async server | Broker + sticky sessions |
| HTTP/2 benefit | Multiplexing | Multiplexing | Multiplexing + stream reuse | Not applicable (upgrades away from HTTP) |

Failure modes to plan for:

WebSocket connection loss at scale. Mobile networks and corporate proxies aggressively terminate connections that appear idle. Without server-to-client heartbeat pings every 25–30 seconds, clients silently disconnect while the server still believes the connection is alive. Implement ping/pong and exponential-backoff reconnect on every WebSocket deployment.
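The reconnect half of that advice is usually exponential backoff with jitter. A sketch of the delay schedule (the base, cap, and full-jitter strategy are illustrative choices):

```javascript
// Exponential backoff with full jitter: the ceiling doubles per attempt
// up to a cap, and a random fraction of it is used so that thousands of
// clients dropped at once do not all reconnect in the same instant.
function backoffDelay(attempt, baseMs = 1000, capMs = 30_000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}
```

A client would await `backoffDelay(attempt)` before each retry and reset `attempt` to zero after a successful reconnect.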

Long Polling timeout storms. If your server imposes a 30-second hold timeout and 100,000 clients all hit it simultaneously (e.g., after a server restart), you receive a 100,000-request wave the instant the timeout expires. Jitter individual timeouts by ±5–10 seconds to spread the reconnect load.
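The jitter itself is one line. A sketch (the 30 s base and ±8 s spread are illustrative values in the range suggested above):

```javascript
// Jitter a long-poll hold timeout so clients that connected together
// do not all time out, and then re-poll, in the same instant.
function jitteredTimeoutMs(baseMs = 30_000, jitterMs = 8_000, random = Math.random) {
  return baseMs + (random() * 2 - 1) * jitterMs; // base ± jitter
}
```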

SSE and the HTTP/1.1 browser connection limit. Browsers cap HTTP/1.1 connections per origin at 6. If you open six SSE streams from a single tab (or a single page uses multiple SSE-backed widgets), you exhaust the limit and subsequent requests queue. The fix is HTTP/2 (which multiplexes over one TCP connection with a far higher stream limit, typically 100+) or consolidating all server events into a single SSE endpoint filtered by topic.

WebSocket fan-out across instances. A WebSocket connection is pinned to the server instance that accepted the upgrade. When you scale horizontally, a message arriving at Instance A must reach users connected to Instance B, C, and D. The standard solution is a Redis pub/sub topic or Kafka partition that every instance subscribes to — but this pub/sub layer becomes a throughput bottleneck under heavy fan-out.


🧭 Decision Guide: Choosing the Right Real-Time Protocol

graph TD
    A{Is real-time\ncommunication\nbidirectional?} -->|Yes| WS["WebSockets\nSlack · Figma · Google Docs\nLive trading · Games"]
    A -->|No — server push only| B{Can you use\nlong-lived HTTP?}
    B -->|"Yes (most environments)"| C{How frequent\nare updates?}
    B -->|"No — proxy or CDN blocks it"| LP["Long Polling\nTwitter-style notifications\nLegacy proxy environments"]
    C -->|High or continuous stream| SSE["Server-Sent Events\nChatGPT · GitHub Actions\nStock tickers · Notifications"]
    C -->|Low — minutes apart| D{User scale?}
    D -->|"Small (< 10,000 users)"| SP["Short Polling\nAdmin dashboards\nStatus checks"]
    D -->|"Large (> 10,000 users)"| LP2["Long Polling\nNotification dots\nEmail clients"]

Start at the bidirectional question every time. Engineers frequently reach for WebSockets by default, but WebSockets add broker infrastructure, sticky session requirements, and reconnect management that SSE or Long Polling handle more simply for unidirectional workloads.

| Use When | Recommended Pattern |
|---|---|
| Chat, multiplayer games, collaborative editing | WebSockets |
| AI token streaming, live build logs, push notifications | Server-Sent Events |
| Notification dots, timeline updates, enterprise firewall environments | Long Polling |
| Config checks, status polling with < 10,000 users | Short Polling |
| Updates are strictly server-to-client | SSE preferred over WebSockets |
| Corporate proxies or CDNs block persistent connections | Long Polling as fallback |
| You need client-initiated messages at high frequency | WebSockets |

๐Ÿ› ๏ธ Socket.IO: Intelligent Fallback for Any Environment

Socket.IO is the most widely used real-time library for Node.js. Its core value is automatic transport negotiation: it attempts a WebSocket upgrade first, falls back to HTTP Long Polling if WebSockets are blocked (by a proxy or CDN), and degrades further if needed — transparently, without any application code change. This makes it the pragmatic choice for public-facing applications where you cannot control every client network environment.

Socket.IO also layers on rooms (named broadcast groups), namespaces (logical sub-connections), built-in exponential-backoff reconnect, and acknowledgment callbacks — infrastructure you would otherwise write yourself.

// Server — Node.js + Socket.IO
import { createServer } from 'http';
import { Server } from 'socket.io';

const httpServer = createServer();
const io = new Server(httpServer, {
  transports: ['websocket', 'polling'], // try WebSocket first; fall back to Long Polling
  pingInterval: 25_000,                 // heartbeat ping every 25 seconds
  pingTimeout:  20_000,                 // close if no pong within 20 seconds
  cors: { origin: 'https://your-app.com' },
});

io.on('connection', (socket) => {
  socket.join('room:project-42');                     // subscribe socket to a named room
  socket.on('message', (msg) => {
    io.to('room:project-42').emit('message', msg);    // broadcast to everyone in the room
  });
  socket.on('disconnect', () => console.log('user left'));
});

httpServer.listen(3000);

// Client — browser
import { io } from 'socket.io-client';

const socket = io('https://your-server.com');   // auto-negotiates best transport
socket.on('message', (msg) => renderMessage(msg));
socket.emit('message', { text: 'Hello from client' });

Scaling Socket.IO beyond one server requires the @socket.io/redis-adapter, which uses Redis pub/sub to relay events across instances. Without it, a message emitted on Instance A never reaches clients connected to Instance B. The same fan-out problem applies to raw WebSockets — Socket.IO just makes the adapter integration one npm install away.

For a full treatment of Socket.IO's clustering model and Redis adapter configuration, see the official Socket.IO multi-node documentation.


🧪 Designing a Live Notification System

Scenario: A project management tool (similar to Linear or Jira) needs to deliver real-time notifications to users when a task is assigned, a comment is posted, or a sprint is kicked off.

Traffic profile: 200,000 active users, averaging 5 notifications per user per hour, with bursts of 10,000 notifications in 10 seconds when a sprint launches.

Pattern Selection Walkthrough

| Decision Point | Answer | Implication |
|---|---|---|
| Bidirectional communication? | No — server pushes to browser | Eliminates the WebSocket mandate |
| Update frequency? | Low on average; burst-able | SSE handles bursts without reconnect storms |
| Reconnect behavior needed? | Yes — mobile clients drop constantly | SSE's built-in Last-Event-ID replay handles this |
| Infrastructure available? | HTTP/2 load balancer (AWS ALB v2) | SSE streams multiplex over shared TCP; no per-stream limit |
| WebSockets blocked by proxies? | Some enterprise users affected | SSE is HTTP — passes through all standard proxies |

Verdict: Server-Sent Events. Each authenticated user opens one EventSource('/api/notifications/stream') connection. The backend publishes a notification event whenever one is triggered. On mobile reconnect, the browser automatically retries with Last-Event-ID, and the server replays any events the client missed.

What the backend holds: At 200,000 open SSE streams, the server holds approximately 400 MB of active HTTP response stream objects — well within budget for a service of this profile. No message broker fan-out is needed if notifications are user-specific (each user has their own stream); a simple in-memory pub/sub per user-id is sufficient.
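That per-user pub/sub can be as small as a map from user ID to that user's open streams. A sketch (all names are hypothetical; write() stands in for writing to an SSE response object):

```javascript
// Per-user stream registry: notifications are user-specific, so fan-out
// is just "write to every open stream this user currently has".
class NotificationStreams {
  constructor() { this.byUser = new Map(); } // userId -> Set of stream-like objects

  subscribe(userId, stream) {
    if (!this.byUser.has(userId)) this.byUser.set(userId, new Set());
    this.byUser.get(userId).add(stream);
    return () => {                           // call on client disconnect
      const streams = this.byUser.get(userId);
      streams.delete(stream);
      if (streams.size === 0) this.byUser.delete(userId);
    };
  }

  publish(userId, event) {
    const streams = this.byUser.get(userId);
    if (!streams) return 0;                  // user offline; notification can be stored
    for (const s of streams) s.write(`data: ${JSON.stringify(event)}\n\n`);
    return streams.size;                     // how many open tabs/devices were reached
  }
}
```

Because each event targets exactly one user, no cross-instance broker is needed as long as a user's streams land on one instance (or a connection router directs them there).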

Why not WebSockets? The communication is strictly one-way — the browser never needs to push notification data back to the server. Using WebSockets would require sticky sessions, a Redis adapter for fan-out, and reconnect handling logic — all for zero functional gain over SSE on this workload.

Why not Long Polling? SSE's persistent stream is more efficient than repeatedly reconnecting Long Polling connections, especially during the 10,000-notification burst. A burst triggers 10,000 Long Polling responses in 10 seconds, causing 10,000 immediate reconnect requests. SSE delivers the same 10,000 events as writes to open streams — no reconnect wave.


📚 Lessons Learned from Real Production Systems

Don't reach for WebSockets by default. The most common mistake is defaulting to WebSockets because they feel "more real-time." WebSockets are the right tool for bidirectional workloads. For one-way server push, SSE is almost always the more operationally simple and cost-effective choice — fewer moving parts, no broker dependency, and automatic reconnect built in.

Corporate proxies break WebSockets more often than engineers expect. HTTP CONNECT proxies, layer-7 load balancers, and enterprise firewalls routinely reject or time out WebSocket upgrade requests. Always design a fallback (Socket.IO's transport negotiation exists precisely because this is a pervasive real-world problem, not a theoretical edge case).

Heartbeats are non-negotiable for any persistent connection. Without server-to-client pings every 25–30 seconds, network middleboxes silently kill "idle" connections after 60–300 seconds. The server's connection map diverges from reality — it believes 200,000 clients are connected while the actual number is far lower. Measure active vs. registered connections continuously.

SSE's Last-Event-ID is free at-least-once delivery — use it. Many teams implement SSE without event IDs and only discover the gap when mobile users report missed notifications after a network transition. Assigning a monotonically increasing ID to every event and replaying from Last-Event-ID on reconnect costs almost nothing server-side and eliminates a whole class of production bugs.

Model the fan-out math before you scale WebSockets. A single Redis pub/sub topic delivering to 1 million subscribers generates roughly 1 million network writes per message published. At 100 messages/second, that's 100 million writes/second — well beyond a single Redis instance. Partition by topic, room, or user cohort before you hit this wall, not after.


📌 TLDR: Summary & Key Takeaways

  • Short Polling is simple and stateless but wastes bandwidth proportionally to user count — viable only for low-frequency, small-scale polling scenarios.
  • Long Polling achieves near-real-time delivery over standard HTTP with no special protocol — the pragmatic fallback when WebSockets are blocked by infrastructure.
  • SSE offers one-way server push with automatic reconnect and Last-Event-ID replay baked in — the best default for notifications, AI token streaming, and live logs.
  • WebSockets provide full-duplex bidirectional real-time communication at minimal per-message overhead — essential for chat, collaborative editing, gaming, and live trading.
  • The bidirectionality question drives the primary decision: if the client never needs to push data back in real time, SSE is almost always simpler and cheaper than WebSockets.
  • Socket.IO adds transport fallback, rooms, and built-in reconnect — production-ready scaffolding for Node.js real-time apps, with Redis adapter for horizontal scale.
  • Scaling WebSockets requires a message broker for fan-out, sticky sessions or a routing layer, and careful heartbeat management — plan this architecture before you need it.

The one-liner: Use SSE when the server talks, WebSockets when both sides talk, Long Polling when infrastructure gets in the way, and Short Polling only when immediacy doesn't matter.


๐Ÿ“ Practice Quiz

  1. Your team is building an AI assistant that streams generated text token-by-token from server to browser. The client only sends prompts via a regular HTTP POST. Which transport should handle the streaming response?

    • A) WebSockets — bidirectional, so the browser can cancel mid-stream
    • B) Server-Sent Events — one-way server push with built-in reconnect and no extra infrastructure
    • C) Long Polling — simpler to implement and sufficient for token-by-token delivery

    Correct Answer: B
  2. A multiplayer browser game needs to sync player positions between all connected clients at 30 updates per second. Each position update is ~50 bytes. The client must both send its own position and receive all other players' positions. Which pattern is most appropriate?

    • A) Server-Sent Events with a separate HTTP POST for each position update from the client
    • B) Long Polling — sufficient for 30 updates/second with manageable reconnect overhead
    • C) WebSockets — bidirectional, persistent, with 2-byte frame overhead versus 500+ bytes per HTTP request

    Correct Answer: C
  3. You deploy a WebSocket service with 4 horizontally scaled instances behind a load balancer. User Alice is connected to Instance A. User Bob is connected to Instance B. Alice sends a message to Bob. What happens without additional infrastructure, and what is the standard fix?

    • A) The load balancer automatically routes the frame to Bob's instance — no fix needed
    • B) Bob never receives the message; fix with a Redis pub/sub broker that all instances subscribe to
    • C) Alice's message is queued in a database and Bob polls for it on next heartbeat

    Correct Answer: B
  4. (Open-ended challenge) A financial trading dashboard displays live bid/ask prices for 500 stock symbols, updating every 200 ms per symbol. At 50,000 concurrent users, each watching a personalized subset of 10–20 symbols, evaluate the trade-offs between Server-Sent Events and WebSockets for this workload. Consider: total message volume, fan-out architecture, client-side filtering versus server-side subscriptions, and whether the client ever needs to push data to the server. Which pattern would you choose and why?

    • No single correct answer — consider whether per-user subscription filtering on the server reduces SSE stream volume enough to avoid a fan-out broker, versus the complexity WebSockets add for a purely server-to-client workload.


Written by Abstract Algorithms (@abstractalgorithms)