The 8 Fallacies of Distributed Systems

The Network is Reliable. Latency is Zero. Bandwidth is Infinite. If you believe these, your system will fail. We debunk the 8 fallacies.

Abstract Algorithms · 5 min read

TLDR: In 1994, L. Peter Deutsch at Sun Microsystems codified the false assumptions developers make about distributed systems (seven at first; James Gosling added the eighth in 1997). Believing them leads to hard-to-reproduce bugs, timeout cascades, and security holes. Knowing them is a prerequisite for designing systems that actually work at scale.


📖 The Eight Assumptions That Will Break Your System

These are not theoretical warnings. They are a field guide to the most common production bugs in distributed software.

The eight fallacies:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

🔢 Network Fallacies 1–4: Reliability, Latency, Bandwidth, and Security

Fallacy 1: The network is reliable.

Packets are dropped. Connections are reset. Load balancers time out. A remote call that succeeds 99.9% of the time still fails once per 1,000 requests, and distributed systems make thousands of such calls per second.

Design response: Retry with exponential backoff. Use circuit breakers (Hystrix, Resilience4j). Design for idempotency.
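
A minimal hand-rolled sketch of this pattern, assuming the call is idempotent; sendRequest() is a hypothetical stand-in for the real network call:

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff {
    // Hypothetical stand-in for the real network call; it must be idempotent to retry safely.
    static String sendRequest() throws IOException { return "200 OK"; }

    static String callWithBackoff(int maxAttempts) throws Exception {
        long baseDelayMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return sendRequest();
            } catch (IOException e) {
                if (attempt == maxAttempts) throw e;      // retry budget exhausted: surface the failure
                long cap = baseDelayMs << (attempt - 1);  // exponential window: 100, 200, 400 ms...
                // Full jitter: pick a random delay in [0, cap] so clients don't retry in lockstep.
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithBackoff(3));
    }
}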

Fallacy 2: Latency is zero.

A function call on the same machine takes nanoseconds. A call to a service in the same data center takes ~0.5 ms. A call across regions takes 50–200 ms. Chain 100 sequential cross-region calls and you have accumulated as much as 20 seconds of pure latency.

Design response: Avoid deep synchronous call chains. Use async messaging. Parallelize independent calls.
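
A sketch of parallelizing two independent downstream calls with Java's CompletableFuture; fetchUser and fetchOrders are illustrative stubs:

import java.util.concurrent.CompletableFuture;

public class ParallelCalls {
    // Illustrative stubs for two independent downstream services (imagine ~50 ms each).
    static String fetchUser(int id)   { return "user-" + id; }
    static String fetchOrders(int id) { return "orders-" + id; }

    public static void main(String[] args) {
        CompletableFuture<String> user   = CompletableFuture.supplyAsync(() -> fetchUser(42));
        CompletableFuture<String> orders = CompletableFuture.supplyAsync(() -> fetchOrders(42));

        // Wall-clock cost is roughly max(latencyA, latencyB), not their sum.
        String profile = user.thenCombine(orders, (u, o) -> u + " | " + o).join();
        System.out.println(profile);
    }
}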

Fallacy 3: Bandwidth is infinite.

Sending large JSON payloads feels cheap in development, where everything runs on a fast LAN. In production, AWS cross-AZ bandwidth costs money, and serializing large object graphs creates GC pressure.

Design response: Use binary serialization (Protobuf, Avro). Filter fields at the API boundary. Use compression for large payloads.
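
As a sketch of the compression point, repetitive JSON shrinks dramatically under gzip using only the JDK; the payload below is synthetic:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressPayload {
    // Gzip a payload before it crosses the wire; repetitive JSON compresses very well.
    static byte[] gzip(String json) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Synthetic payload: 500 near-identical line items.
        String json = "{\"items\":[" + "{\"sku\":\"A-1\",\"qty\":1},".repeat(500) + "{}]}";
        System.out.printf("raw=%d bytes, gzipped=%d bytes%n", json.length(), gzip(json).length);
    }
}

Binary formats like Protobuf and Avro go further by keeping field names out of the wire format entirely.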

Fallacy 4: The network is secure.

Traffic between services inside your VPC is not automatically encrypted or authenticated. An attacker who gains access to your network can intercept or inject requests.

Design response: mTLS between services. Zero-trust network model. Never pass secrets in logs.
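
A minimal sketch of client-side mTLS setup with the JDK's SSLContext; the keystore paths and passwords are placeholders, and in practice a service mesh often handles this instead:

import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;
import java.io.FileInputStream;
import java.security.KeyStore;

public class MtlsContext {
    static SSLContext build() throws Exception {
        char[] password = "changeit".toCharArray(); // placeholder: load from a secret store, never log it

        // This service's certificate + private key: proves our identity to the peer.
        KeyStore identity = KeyStore.getInstance("PKCS12");
        identity.load(new FileInputStream("client-identity.p12"), password); // placeholder path
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(identity, password);

        // Trust store holding the internal CA that signs peer service certificates.
        KeyStore trust = KeyStore.getInstance("PKCS12");
        trust.load(new FileInputStream("internal-ca.p12"), password); // placeholder path
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trust);

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        return ctx; // hand this to your HTTP client so every hop is encrypted and mutually authenticated
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build().getProtocol());
    }
}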


⚙️ Infrastructure Fallacies 5–8: Topology, Administration, Cost, and Compatibility

Fallacy 5: Topology doesn't change.

Servers fail. Auto-scaling adds and removes instances. Kubernetes restarts pods. Hardcoding IP addresses breaks as soon as a node is replaced.

Design response: Use service discovery (Consul, Kubernetes DNS, AWS Cloud Map). Never hardcode IPs.
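
A sketch of the "resolve by name, every time" habit; the Kubernetes-style DNS name below is illustrative, not from the article:

import java.net.InetAddress;

public class DiscoverService {
    public static void main(String[] args) throws Exception {
        // Look the downstream up by its stable name on every (re)connect;
        // the hostname is an illustrative Kubernetes-DNS-style name.
        InetAddress[] replicas = InetAddress.getAllByName("orders.default.svc.cluster.local");
        for (InetAddress addr : replicas) {
            System.out.println(addr.getHostAddress()); // membership changes as pods come and go
        }
    }
}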

Fallacy 6: There is one administrator.

Real systems involve the platform team, the security team, the application team, and the database team. A schema migration "owned" by the app team may be blocked by the DBA team for a week.

Design response: Design for backward and forward compatibility. Feature flags for deployments. Self-service infra via IaC.
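
One concrete compatibility tactic is the tolerant reader: ignore fields you don't recognize, so another team's additive schema change can't break you. A sketch with Jackson (the OrderEvent DTO is hypothetical):

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TolerantReader {
    // Hypothetical DTO for an event owned by another team.
    public static class OrderEvent {
        public String orderId;
        public long amountCents;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
                // Forward compatibility: fields added by a newer producer are ignored, not fatal.
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        // A newer producer has already added couponCode; we still parse cleanly.
        String payload = "{\"orderId\":\"o-17\",\"amountCents\":999,\"couponCode\":\"SAVE10\"}";
        OrderEvent evt = mapper.readValue(payload, OrderEvent.class);
        System.out.println(evt.orderId + " / " + evt.amountCents);
    }
}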

Fallacy 7: Transport cost is zero.

Serializing a Java object to JSON, compressing it, encrypting it, sending it over a socket, and deserializing it on the other side all cost CPU cycles, memory, and money (cloud egress charges).

Design response: Right-size payloads. Batch small messages. Cache at the boundary.
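
A sketch of size-triggered batching to amortize per-message transport cost; sendBatch() stands in for the real transport call, and a production version would also flush on a timer so messages don't sit in the buffer indefinitely:

import java.util.ArrayList;
import java.util.List;

public class Batcher {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBatch;

    Batcher(int maxBatch) { this.maxBatch = maxBatch; }

    // Each send pays serialization, encryption, and egress once per request,
    // so accumulate messages and ship them together.
    void add(String message) {
        buffer.add(message);
        if (buffer.size() >= maxBatch) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        sendBatch(new ArrayList<>(buffer)); // one round trip amortizes per-call overhead
        buffer.clear();
    }

    // Hypothetical transport call.
    static void sendBatch(List<String> batch) {
        System.out.println("sending " + batch.size() + " messages in one request");
    }

    public static void main(String[] args) {
        Batcher batcher = new Batcher(100);
        for (int i = 0; i < 250; i++) batcher.add("event-" + i);
        batcher.flush(); // don't forget the partial tail batch
    }
}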

Fallacy 8: The network is homogeneous.

Mobile clients, desktop browsers, IoT devices, and internal services all speak different protocols, have different MTUs, and fail in different ways. Expecting all consumers to behave like your tested Java client will lead to interoperability bugs.

Design response: Use standard protocols (HTTP/1.1, HTTP/2, gRPC). Handle content negotiation. Test with diverse client types.
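
A minimal content-negotiation sketch using the JDK's built-in com.sun.net.httpserver; the endpoint and payloads are illustrative:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class NegotiatingServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/status", exchange -> {
            String accept = String.valueOf(exchange.getRequestHeaders().getFirst("Accept"));
            // Serve what this client can parse instead of assuming one tested client type.
            byte[] body;
            if (accept.contains("application/json")) {
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                body = "{\"status\":\"ok\"}".getBytes();
            } else {
                exchange.getResponseHeaders().set("Content-Type", "text/plain");
                body = "ok".getBytes();
            }
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
    }
}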


🧠 The Practical Antidote: Designing for Failure

flowchart LR
    Call[Service A calls B] --> Retry{Retry logic?}
    Retry -- No --> Crash["Hard failure<br/>no retry = cascading outage"]
    Retry -- Yes --> CB{Circuit breaker?}
    CB -- No --> Flood["B is down<br/>A floods it with retries"]
    CB -- Yes --> Timeout["Fail fast<br/>return fallback"]

The minimal production checklist for every service call (a combined sketch follows the list):

  • Timeout set (never rely on OS default)
  • Retry with exponential backoff + jitter
  • Circuit breaker to stop cascade
  • Bulkhead (limit concurrent calls per downstream)
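
The article names Resilience4j, so here is a sketch wiring three of these four items around a single call with that library; the "orders" name and callOrders() stub are illustrative, API details vary by version, and the timeout itself belongs on the underlying HTTP client:

import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.util.function.Supplier;

public class GuardedCall {
    // Stand-in for the real network call; set an explicit timeout on the HTTP client itself.
    static String callOrders() { return "200 OK"; }

    public static void main(String[] args) {
        // Retry with exponential backoff plus randomization (jitter).
        Retry retry = Retry.of("orders", RetryConfig.custom()
                .maxAttempts(3)
                .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 2.0))
                .build());

        CircuitBreaker breaker = CircuitBreaker.ofDefaults("orders"); // stops the retry flood when B is down
        Bulkhead bulkhead = Bulkhead.ofDefaults("orders");            // caps concurrent in-flight calls

        Supplier<String> guarded = Decorators.ofSupplier(GuardedCall::callOrders)
                .withRetry(retry)
                .withCircuitBreaker(breaker)
                .withBulkhead(bulkhead)
                .decorate();

        System.out.println(guarded.get());
    }
}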

⚖️ Why These Fallacies Still Bite Senior Engineers

These fallacies are taught in university, yet they are still violated in code every week, because:

  • Local development masks network problems (everything runs on localhost)
  • Unit tests don't simulate network partitions or latency spikes
  • Monolith-to-microservices migrations often copy in-process assumptions to network calls

The most common production outage pattern: a service that worked fine in staging fails under load in production because no retry logic handles the 1-in-1,000 packet drop rate.


📌 Key Takeaways

  • All 8 fallacies are false assumptions developers make about networks; each one surfaces as a class of production bugs.
  • The four network fallacies: reliability, latency, bandwidth, security.
  • The four infrastructure fallacies: topology, administration, transport cost, homogeneity.
  • Every service call needs: timeout, retry with backoff, circuit breaker, and idempotency.
  • These bugs appear in production, not in local development, because localhost masks all of them.

🧩 Test Your Understanding

  1. Your service has no retry logic on a downstream call. Which fallacy are you relying on?
  2. A developer hardcodes a database IP after migration tests pass. Which fallacy does this violate?
  3. Cross-AZ traffic in AWS is not free. Which fallacy does billing prove false?
  4. You add mTLS between services. Which fallacy are you addressing?

Written by Abstract Algorithms (@abstractalgorithms)