How Kubernetes Works: The Container Orchestrator
Docker runs containers. Kubernetes manages them. We explain Pods, Nodes, Deployments, and Services to demystify the world's most popular orchestrator.
TLDR: Kubernetes (K8s) is an operating system for the cloud. It manages clusters of computers (Nodes) and schedules applications (Pods) onto them via a continuous declarative control loop: you describe what you want, and Kubernetes continuously reconciles reality to match it, self-healing crashes and scaling replicas without manual intervention.
From Manual SSH to Automated Orchestration: Why Kubernetes Exists
Before Kubernetes, deploying an app meant SSH-ing into servers and running commands manually. If a server died, so did your app. If traffic spiked, you provisioned a new server yourself. There was no standard way to restart crashed processes, spread load, or move workloads away from failing hardware.
Kubernetes introduces a Shipping Port Manager model:
- Container (Docker image): A standardized, portable shipping container.
- Pod: A crane holding one or more containers together on the same network.
- Node: A cargo ship (server) carrying Pods.
- Control Plane: The port manager in the tower. She says "keep 3 cranes running at all times" and enforces it continuously, even when ships sink.
You never SSH into ships. You talk to the manager, declare your intent, and she handles execution.
Pods, Nodes, Deployments, and Services: The Core Object Model
Kubernetes organizes everything into typed objects stored in etcd, its distributed key-value database. The four you will use on day one:
| Object | What it is | Analogy |
| --- | --- | --- |
| Pod | Smallest schedulable unit; wraps 1+ containers sharing an IP | A crane on a ship |
| Node | A worker server running Pods | A cargo ship |
| Deployment | Declares a desired number of Pod replicas and manages rolling updates | The port manager's standing order |
| Service | A stable virtual IP + DNS name load-balancing to a set of Pods | The radio frequency that always reaches the right crane |
Pods are ephemeral: they crash, restart, and change IPs constantly. Services give you a stable address. Deployments ensure you always have the right number of healthy Pods running.
The Control Loop: How Kubernetes Reconciles Desired State
This is the one concept that unlocks everything else in Kubernetes.
flowchart LR
YAML["Desired State in etcd (replicas: 3)"] --> CM["kube-controller-manager watches etcd continuously"]
CM --> Obs["Observe Current State (replicas: 2, one crashed)"]
Obs --> Act["Reconcile: schedule 1 new Pod on an available Node"]
Act --> CM
style CM fill:#f0f4ff,stroke:#4a6cf7
The loop never stops. Every few seconds, each controller:
- Reads the desired state from etcd.
- Observes the current state: how many Pods are actually running, and on which Nodes.
- If they differ, acts: starts, stops, or reschedules Pods.
This is why Kubernetes is declarative: you write what you want (a YAML spec), not how to do it. K8s figures out the "how" and keeps retrying until the world matches your spec.
apiVersion: apps/v1
kind: Deployment
metadata:
name: storefront
spec:
replicas: 3
selector:
matchLabels:
app: storefront
template:
metadata:
labels:
app: storefront
spec:
containers:
- name: storefront
image: acme/storefront:v2.1
ports:
- containerPort: 8080
Apply this once. Kubernetes creates 3 Pods and maintains exactly 3 forever, auto-replacing any that crash.
Deep Dive: The Scheduler, Reconciler Pattern, and Custom Resources
How the Scheduler Places Pods
When a new Pod needs to be placed, kube-scheduler runs a two-phase algorithm:
- Filter: eliminate Nodes that cannot fit the Pod (insufficient CPU/RAM, taint mismatches, wrong node labels).
- Score: rank the remaining Nodes (prefer more free resources, spread replicas across failure zones).
The highest-scoring Node wins. The scheduler writes the binding to etcd; the Node's kubelet picks it up and starts the container.
| Phase | What It Evaluates |
| --- | --- |
| Filter | CPU/memory headroom, nodeSelector, taints, pod affinity rules |
| Score | Resource balance, topology spread, inter-pod affinity bonuses |
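These phases are driven by fields in the Pod spec itself. Below is a minimal sketch of the fields the table refers to; the labels, image, and numbers are illustrative choices, not defaults:
# pod-scheduling.yaml - illustrative fields read by the scheduler's Filter and Score phases
apiVersion: v1
kind: Pod
metadata:
  name: storefront-worker
  labels:
    app: storefront
spec:
  nodeSelector:                       # Filter: only Nodes carrying this label are candidates
    disktype: ssd
  containers:
    - name: storefront
      image: acme/storefront:v2.1
      resources:
        requests:                     # Filter: the Node must have this much unreserved CPU and memory
          cpu: "500m"
          memory: "512Mi"
  topologySpreadConstraints:          # Score: prefer spreading replicas across availability zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: storefront
If no Node survives the Filter phase, the Pod simply stays Pending until capacity appears.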
Pod Scheduling Sequence
sequenceDiagram
participant U as kubectl apply
participant A as API Server
participant E as etcd
participant S as Scheduler
participant K as Kubelet (Node)
U->>A: POST /apis/apps/v1/deployments
A->>E: store Deployment spec
A-->>U: 201 Created
S->>A: watch for unscheduled Pods
A-->>S: new Pod (node not assigned)
S->>S: Filter + Score nodes
S->>A: bind Pod to Node X
A->>E: store binding
K->>A: watch assigned Pods
A-->>K: Pod spec for Node X
K->>K: pull image + start container
This sequence diagram traces the complete lifecycle of a kubectl apply from the moment the user submits a manifest to the moment a container starts running on a node. The key insight is that no component talks directly to another: everything is mediated through the API Server and persisted in etcd. The Scheduler watches for unscheduled Pods, selects a node, and writes that binding back to etcd โ only then does the Kubelet pick up the assignment and act on it.
Pod Lifecycle States
stateDiagram-v2
[*] --> Pending : Pod created, awaiting scheduling
Pending --> Running : Node assigned, container started
Running --> Succeeded : all containers exited 0
Running --> Failed : container exited non-zero
Running --> Terminating : delete signal sent
Terminating --> [*] : graceful shutdown complete
Pending --> Failed : image pull error / no node fits
This state diagram shows every phase a Pod can be in from creation to termination. The Pending state is where scheduling and image-pulling happen, making it the most common place Pods stall: an ImagePullBackOff keeps a Pod in Pending rather than advancing to Running. The Terminating state represents the graceful-shutdown window (controlled by terminationGracePeriodSeconds) where the container can finish in-flight requests before being forcibly stopped.
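The length of that graceful-shutdown window is set per Pod. A minimal sketch of the relevant fields (the 45-second value and the preStop sleep are illustrative choices; only the 30-second baseline is a Kubernetes default):
# pod spec fragment - controlling the Terminating window (illustrative values)
spec:
  terminationGracePeriodSeconds: 45   # default is 30; raise it for long in-flight requests
  containers:
    - name: storefront
      image: acme/storefront:v2.1
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]   # runs before SIGTERM, giving Service endpoints a moment to drain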
Reconcilers: The Universal Pattern
Every Kubernetes resource type has a dedicated controller: a reconcile loop watching one object kind and acting on divergence. The Deployment controller watches Deployments and manages ReplicaSets. This pattern is intentionally modular: a new controller adds a new capability with zero changes to the core.
Custom Resource Definitions (CRDs)
Extend Kubernetes with your own object types using CRDs. Istio's VirtualService, Argo's Workflow, and Cert-Manager's Certificate are all custom resources with custom reconcilers. See the Service Mesh pattern for CRDs in action at scale.
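A minimal CRD sketch, assuming a hypothetical BackupSchedule resource (the group, names, and schema fields are invented purely for illustration):
# backupschedule-crd.yaml - a hypothetical custom resource type
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backupschedules.acme.io        # must be <plural>.<group>
spec:
  group: acme.io
  scope: Namespaced
  names:
    plural: backupschedules
    singular: backupschedule
    kind: BackupSchedule
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronExpression:
                  type: string
                retentionDays:
                  type: integer
Once applied, kubectl get backupschedules behaves like any built-in type; a custom controller then reconciles these objects exactly as the Deployment controller reconciles Deployments.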
The Request Journey: From Browser to Pod
Here is the complete path a request takes from the internet to a Pod inside your cluster:
flowchart TD
Browser["Browser / API Client"] --> LB["Cloud Load Balancer (AWS ALB / GCP LB)"]
LB --> Ingress["Ingress Controller (nginx-ingress Pod)"]
Ingress --> SVC["Service (ClusterIP: 10.96.0.42)"]
SVC --> P1["Pod 1 (10.244.1.5)"]
SVC --> P2["Pod 2 (10.244.2.8)"]
SVC --> P3["Pod 3 (10.244.3.2)"]
style LB fill:#fff3cd,stroke:#f0ad4e
style SVC fill:#d4edda,stroke:#28a745
External traffic enters through the cloud load balancer, hits the Ingress controller (host/path routing), then reaches the Service's stable ClusterIP. The Service distributes requests across all healthy Pods, even as individual Pods restart and get new IPs.
Every Pod has a unique cluster-internal IP. Services expose stable DNS names inside the cluster:
http://payment-service.default.svc.cluster.local:8080
No hardcoded IPs. The DNS name resolves to the ClusterIP, which balances across healthy Pods automatically.
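The Ingress hop in the diagram is itself just another declarative object. A minimal sketch routing a hypothetical hostname to the storefront Service (the ingressClassName and host are assumptions, matching the nginx-ingress controller shown above):
# ingress.yaml - host/path routing to the storefront Service (illustrative values)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: storefront
spec:
  ingressClassName: nginx              # assumes the nginx-ingress controller from the diagram
  rules:
    - host: shop.example.com           # hypothetical public hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: storefront       # the Service's ClusterIP load-balances across the Pods
                port:
                  number: 80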
Real-World Application: Running a Production E-Commerce Platform on Kubernetes
Shopify, Zalando, and Airbnb run Kubernetes clusters handling millions of requests per hour. A production slice: three services (storefront, cart, payment), each with a Deployment and a Service. An Ingress exposes storefront externally. A HorizontalPodAutoscaler (HPA) scales storefront automatically on CPU.
flowchart LR
HPA["HorizontalPodAutoscaler (CPU > 70% -> add Pods)"] -.->|controls| SF
Ingress["Ingress (public)"] --> SF["storefront Service Pods x3-20"]
SF --> Cart["cart Service Pods x2"]
SF --> Pay["payment Service Pods x2"]
style HPA fill:#d4edda,stroke:#28a745
During Black Friday, CPU crosses 70% and K8s scales from 3 to 15 storefront Pods automatically. After the rush it scales back down. Zero manual intervention, zero over-provisioning at idle.
Kubernetes handles these scenarios automatically:
- A Node is drained for OS maintenance: K8s evicts its Pods and reschedules them onto healthy Nodes.
- A bad storefront:v3 deploy causes crash-loops: the rolling update stalls and v2 keeps serving traffic (see the rollout-strategy sketch below).
- A canary deployment routes 10% of traffic to storefront:v3 (via a progressive-delivery tool such as Argo Rollouts or Flagger): the rollout completes automatically once error rates stay clean.
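The crash-loop protection above comes from the Deployment's rollout settings. A minimal sketch of the relevant fields (values are illustrative, not universal defaults):
# deployment spec fragment - how a bad rollout stalls instead of taking down v2 (illustrative values)
spec:
  replicas: 3
  progressDeadlineSeconds: 300      # report the rollout as failed if it makes no progress for 5 minutes
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                   # at most one extra Pod during the rollout
      maxUnavailable: 0             # never remove a healthy v2 Pod before its v3 replacement is Ready
With maxUnavailable: 0, a crash-looping v3 never passes its readiness probe, so the existing v2 Pods keep serving until you roll forward with a fixed image or run kubectl rollout undo.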
Trade-offs, Failure Modes, and the Operational Complexity Tax
Kubernetes is powerful, but the operational cost is real.
| Concern | Real-world impact | Mitigation |
| --- | --- | --- |
| Steep learning curve | RBAC, CRDs, NetworkPolicies, admission webhooks: weeks of ramp-up before production confidence | Use managed K8s (GKE, EKS, AKS) to offload control-plane operations |
| Failure: missing resource limits | A Pod without requests/limits can consume an entire Node, evicting its neighbours | Set namespace-level LimitRange objects as a safety floor |
| Failure: misconfigured liveness probes | An over-aggressive probe kills healthy Pods in a restart loop | Use startupProbe for slow-starting apps; tune failureThreshold conservatively |
| Networking complexity | Services, Ingresses, NetworkPolicies, and CNI plugins interact in non-obvious ways | Start with a managed CNI; add NetworkPolicies incrementally |
| Cluster upgrade risk | Skipping minor versions breaks deprecated APIs and admission webhooks | Upgrade one minor version at a time; check for deprecated APIs first (e.g. with kubent or the kubectl deprecations plugin) |
The honest trade-off: Kubernetes removes individual server management toil but introduces platform management toil. For teams without a dedicated platform engineer, this swap rarely pays off until you are running many services at real scale.
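The LimitRange mitigation from the table is a single namespaced object. A minimal sketch that gives any container without declared requests/limits a sane default (the namespace and numbers are hypothetical):
# limitrange.yaml - namespace-level safety floor for resource requests and limits
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments-prod        # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container declares no requests
        cpu: "100m"
        memory: "128Mi"
      default:                    # applied when a container declares no limits
        cpu: "500m"
        memory: "512Mi"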
Decision Guide: When Kubernetes Pays Off
| Situation | Recommendation |
| --- | --- |
| 10+ microservices, multiple teams | Kubernetes: the automation ROI justifies the platform investment |
| Cloud-hosted, need auto-scaling | Start with managed K8s (EKS / GKE / AKS): the control plane is handled for you |
| 1-3 services, single team, steady traffic | Docker Compose on a VM or a PaaS (Railway, Render, Fly.io): far less overhead |
| Serverless / event-driven workloads | AWS Lambda / Google Cloud Run: no cluster to manage |
| Batch or ML training jobs | Kubernetes + Argo Workflows or Kueue, or a dedicated tool like Airflow |
| Startup, pre-product-market fit | Skip K8s; return when your team is 5+ engineers and you have real scaling pain |
Practical Example: Auto-Scaling Storefront with an HPA
The HorizontalPodAutoscaler watches the metrics-server and adjusts your Deployment's replica count continuously:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: storefront-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: storefront
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
With this applied:
- Deploy normally with kubectl apply -f deployment.yaml and kubectl apply -f hpa.yaml.
- Black Friday traffic hits and average CPU crosses 70%: the HPA adds replicas (re-evaluating every 15 seconds) until average CPU returns to ~70%.
- Traffic subsides: the HPA scales back down, respecting stabilizationWindowSeconds to avoid flapping (see the sketch below).
No code changes. No manual intervention. The Bulkhead Pattern adds per-namespace resource quotas so one noisy service cannot consume all cluster capacity.
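The scale-down damping mentioned in the list above lives in the HPA's optional behavior block. A minimal sketch (the window and policy values are illustrative; 300 seconds happens to match the API's scale-down default):
# hpa spec fragment - damping scale-down to avoid flapping (illustrative values)
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # scale down only to the highest recommendation seen in the last 5 minutes
      policies:
        - type: Pods
          value: 2                      # remove at most 2 Pods per minute
          periodSeconds: 60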
Minikube & k3s: Containerizing a Spring Boot App and Deploying to Kubernetes
Minikube runs a single-node Kubernetes cluster locally on your laptop: the fastest way to test Deployments, Services, and HPAs without a cloud account. k3s is a lightweight, production-grade K8s distribution packaged as a single binary, ideal for edge, IoT, and CI pipelines.
The example below containerizes a Spring Boot application and deploys it as a Kubernetes Deployment with a Service, applying the control-loop, Pod, and Deployment concepts from this post end-to-end.
# Dockerfile - multi-stage build for a Spring Boot fat JAR
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY . .
RUN ./mvnw package -DskipTests
FROM eclipse-temurin:21-jre
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
# deployment.yaml - Kubernetes Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
name: payments-api
spec:
replicas: 3 # desired state: K8s control loop maintains exactly 3
selector:
matchLabels:
app: payments-api
template:
metadata:
labels:
app: payments-api
spec:
containers:
- name: payments-api
image: acme/payments-api:1.0.0
ports:
- containerPort: 8080
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
readinessProbe: # K8s won't route traffic until Spring Boot is ready
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 15
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: payments-api
spec:
selector:
app: payments-api
ports:
- port: 80
targetPort: 8080
type: ClusterIP
# Local development with Minikube
minikube start
eval $(minikube docker-env) # point Docker CLI at Minikube's daemon
docker build -t acme/payments-api:1.0.0 .
kubectl apply -f deployment.yaml
kubectl rollout status deployment/payments-api
kubectl port-forward svc/payments-api 8080:80
curl http://localhost:8080/actuator/health
Spring Boot's Actuator /actuator/health/readiness endpoint maps directly to the Kubernetes readinessProbe: Kubernetes will not route traffic to a Pod until the probe returns 200 OK, preventing the cold-start request failures described in the lessons section below.
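For slow-starting JVM services, the lessons below also recommend a startupProbe so that liveness checks do not kill a Pod that is still warming up. A sketch of the extra probe fields for the same container, assuming Spring Boot's liveness health group is enabled (thresholds are illustrative):
# container fragment - probes for a slow-starting JVM service (illustrative thresholds)
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
startupProbe:                       # liveness and readiness checks are held back until this succeeds once
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 5
  failureThreshold: 30              # tolerate up to ~150 seconds of JVM startup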
For a full deep-dive on deploying Spring Boot to Kubernetes with Helm, GitOps, and Argo CD, a dedicated follow-up post is planned.
Lessons from Running Kubernetes in Production
- Never skip resource requests and limits. A single Pod without limits can evict its entire Node's neighbours during a memory spike. This is the number-one newcomer mistake.
- Liveness probes kill healthy Pods. An HTTP health check that times out during a garbage-collection pause triggers a restart loop. Use startupProbe for JVM-based services.
- Namespaces are cost and policy boundaries. Use separate namespaces per team and per environment (payments-prod, payments-staging). Add ResourceQuota and LimitRange from day one (see the quota sketch below).
- kubectl apply is idempotent; kubectl create is not. Use apply in CI/CD pipelines so re-runs never fail on "already exists".
- etcd is the cluster brain: back it up. Managed K8s (GKE/EKS/AKS) handles etcd backup automatically; self-hosted clusters need a dedicated etcd backup CronJob.
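A minimal ResourceQuota sketch for the per-namespace boundary mentioned above (the namespace and the numbers are hypothetical):
# resourcequota.yaml - cap total resource consumption of one namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments-prod        # hypothetical namespace
spec:
  hard:
    requests.cpu: "8"             # total CPU all Pods in this namespace may request
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"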
TLDR: Summary & Key Takeaways
- Pods are the atomic scheduling unit: they wrap one or more containers sharing a network namespace.
- Deployments declare a desired replica count; Kubernetes's control loop maintains it indefinitely and rolls out updates safely.
- Services give stable DNS names and virtual IPs to ephemeral Pods, providing transparent load balancing without hardcoded IPs.
- The control loop (desired state → observe → reconcile) is the core idea; every other K8s feature is a specific implementation of it.
- etcd holds all desired state; the scheduler, controllers, and kubelet all read from and write to it.
- HPA auto-scales Pods based on metrics, so there is no manual scaling during traffic spikes.
- Kubernetes trades server management toil for platform management toil. Reach for it when you have real scaling or resilience problems, not as a default for every project.
Practice Quiz
1. You set replicas: 3 in a Deployment and one Pod crashes. What does Kubernetes do next?
A) Sends an alert to the on-call engineer and waits for a manual restart.
B) Starts a replacement Pod automatically to restore the desired count of 3.
C) Marks the Deployment as degraded and halts all other scheduling.
D) Evicts one of the remaining healthy Pods to rebalance the cluster.
Correct Answer: B. The Deployment controller's reconcile loop detects current (2) != desired (3) and immediately schedules a new Pod on an available Node.
2. Why should you always connect to a Service's DNS name rather than a Pod's IP address directly?
A) Pod IPs are only routable outside the cluster, not internally.
B) Kubernetes encrypts Pod IPs so they cannot be resolved by other services.
C) Pods are ephemeral: they crash and restart with different IPs; a Service provides a stable endpoint that always routes to healthy Pods.
D) Direct Pod connections bypass RBAC authorization policies.
Correct Answer: C. Services abstract away ephemeral Pod IPs with a stable ClusterIP and DNS name that always points to the live Pod set.
3. What does etcd do in a Kubernetes cluster?
A) Acts as the container runtime that starts and stops containers on each Node.
B) Serves as a distributed key-value store holding the entire desired state of the cluster.
C) Provides the external load balancer for Services of type LoadBalancer.
D) Runs liveness and readiness health probes on behalf of the kubelet.
Correct Answer: B. Every kubectl apply write lands in etcd first; all controllers and the scheduler read desired state from etcd.
4. Your storefront Deployment runs 3 Pods. You add an HPA with minReplicas: 3, maxReplicas: 20, and a CPU target of 70%. During a flash sale, average CPU hits 90%. What happens?
A) Kubernetes evicts existing Pods and replaces them with instances on larger Nodes.
B) The HPA ignores the spike because minReplicas: 3 is already satisfied.
C) The HPA scales the Deployment upward, adding Pods until average CPU returns to ~70%, capped at 20 replicas.
D) Kubernetes restarts the Node hosting the highest-CPU Pod to free capacity.
Correct Answer: C. The HPA continuously monitors the CPU metric and increases the replica count until the average converges back to the target utilisation, bounded by maxReplicas.
Related Posts
- Service Mesh Pattern: Control Plane, Data Plane, and Zero-Trust Traffic - the natural infrastructure layer on top of Kubernetes: Envoy sidecars, mTLS, and traffic policy applied cluster-wide without touching application code.
- Canary Deployment Pattern: Progressive Delivery with SLOs - how to ship new Kubernetes Deployment versions to a small traffic slice and auto-rollback if SLOs degrade.
- Circuit Breaker Pattern: Prevent Cascading Failures - resilience patterns that protect your Kubernetes services from cascading failures when a downstream dependency degrades.
- Bulkhead Pattern: Isolate Capacity and Failure Domains - namespace-level resource quotas and Pod Disruption Budgets to contain blast radius inside a Kubernetes cluster.
