How Kubernetes Works: The Container Orchestrator
Docker runs containers. Kubernetes manages them. We explain Pods, Nodes, Deployments, and Services to demystify the world's most popular orchestrator.
TLDR
Kubernetes (K8s) is an operating system for the cloud. It manages clusters of computers (Nodes) and schedules applications (Pods) onto them via a continuous declarative control loop: you describe what you want, and Kubernetes continuously reconciles reality to match it, self-healing crashes and scaling replicas without manual intervention.
From Manual SSH to Automated Orchestration: Why Kubernetes Exists
Before Kubernetes, deploying an app meant SSH-ing into servers and running commands manually. If a server died, so did your app. If traffic spiked, you provisioned a new server yourself. There was no standard way to restart crashed processes, spread load, or move workloads away from failing hardware.
Kubernetes introduces a Shipping Port Manager model:
- Container (Docker image): A standardized, portable shipping container.
- Pod: A crane holding one or more containers together on the same network.
- Node: A cargo ship (server) carrying Pods.
- Control Plane: The port manager in the tower. She says "keep 3 cranes running at all times" and enforces it continuously, even when ships sink.
You never SSH into ships. You talk to the manager, declare your intent, and she handles execution.
Pods, Nodes, Deployments, and Services: The Core Object Model
Kubernetes organizes everything into typed objects stored in etcd, its distributed key-value database. The four you will use on day one:
| Object | What it is | Analogy |
| --- | --- | --- |
| Pod | Smallest schedulable unit; wraps 1+ containers sharing an IP | A crane on a ship |
| Node | A worker server running Pods | A cargo ship |
| Deployment | Declares a desired number of Pod replicas and manages rolling updates | The port manager's standing order |
| Service | A stable virtual IP + DNS name load-balancing to a set of Pods | The radio frequency that always reaches the right crane |
Pods are ephemeral: they crash, restart, and change IPs constantly. Services give you a stable address. Deployments ensure you always have the right number of healthy Pods running.
The Control Loop: How Kubernetes Reconciles Desired State
This is the one concept that unlocks everything else in Kubernetes.
```mermaid
flowchart LR
    YAML["Desired State in etcd (replicas: 3)"] --> CM["kube-controller-manager watches etcd continuously"]
    CM --> Obs["Observe Current State (replicas: 2, one crashed)"]
    Obs --> Act["Reconcile: schedule 1 new Pod on an available Node"]
    Act --> CM
    style CM fill:#f0f4ff,stroke:#4a6cf7
```
The loop never stops. Every few seconds, each controller:
- Reads the desired state from etcd.
- Observes the current state: how many Pods are actually running, on which Nodes.
- If they differ, acts (start, stop, or reschedule Pods).
This is why Kubernetes is declarative: you write what you want (a YAML spec), not how to do it. K8s figures out the "how" and keeps retrying until the world matches your spec.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  replicas: 3
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      containers:
        - name: storefront
          image: acme/storefront:v2.1
          ports:
            - containerPort: 8080
```
Apply this once. Kubernetes creates 3 Pods and maintains exactly 3 forever, auto-replacing any that crash.
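The reconciliation step itself is simple enough to sketch in a few lines of Python. This is a toy model, not real controller code; `reconcile`, `start_pod`, and `stop_pod` are hypothetical stand-ins for the controller machinery:

```python
def reconcile(desired_replicas, running_pods, start_pod, stop_pod):
    """One pass of a toy Deployment controller: converge current -> desired."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        for _ in range(diff):            # too few Pods: start replacements
            running_pods.append(start_pod())
    elif diff < 0:
        for _ in range(-diff):           # too many Pods: scale down
            stop_pod(running_pods.pop())
    return running_pods

# Desired state says 3 replicas, but one Pod just crashed.
pods = ["storefront-a", "storefront-b"]  # current state: only 2 running
pods = reconcile(3, pods,
                 start_pod=lambda: "storefront-new",
                 stop_pod=lambda pod: None)
print(pods)  # -> ['storefront-a', 'storefront-b', 'storefront-new']
```

Run this in a loop every few seconds and you have the essence of a controller: it never asks *how* the gap appeared, it only closes it.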
Deep Dive: The Scheduler, Reconciler Pattern, and Custom Resources
How the Scheduler Places Pods
When a new Pod needs to be placed, kube-scheduler runs a two-phase algorithm:
- Filter: eliminate Nodes that cannot fit the Pod (insufficient CPU/RAM, taint mismatches, wrong node labels).
- Score: rank remaining Nodes (prefer more free resources, spread replicas across failure zones).
The highest-scoring Node wins. The scheduler writes the binding to etcd; the Node's kubelet picks it up and starts the container.
| Phase | What It Evaluates |
| --- | --- |
| Filter | CPU/memory headroom, nodeSelector, taints, pod affinity rules |
| Score | Resource balance, topology spread, inter-pod affinity bonuses |
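The two-phase idea can be sketched in Python. The real kube-scheduler uses pluggable filter and score plugins; the field names and the scoring heuristic below are made up for illustration:

```python
def schedule(pod, nodes):
    """Toy two-phase scheduler: filter out nodes that can't fit, score the rest."""
    # Phase 1 - Filter: the node must have enough free CPU/RAM and matching labels.
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and all(n["labels"].get(k) == v
                for k, v in pod.get("node_selector", {}).items())
    ]
    if not feasible:
        return None  # no node fits: the Pod stays Pending
    # Phase 2 - Score: prefer the node with the most headroom left over.
    return max(feasible,
               key=lambda n: (n["free_cpu"] - pod["cpu"]) + (n["free_mem"] - pod["mem"]))

nodes = [
    {"name": "node-a", "free_cpu": 2.0, "free_mem": 4, "labels": {"zone": "us-east-1a"}},
    {"name": "node-b", "free_cpu": 8.0, "free_mem": 16, "labels": {"zone": "us-east-1b"}},
]
pod = {"cpu": 1.0, "mem": 2, "node_selector": {}}
print(schedule(pod, nodes)["name"])  # -> node-b (most free resources)
```

Swap the `max` key for a topology-spread heuristic and you have the flavor of a score plugin.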
Pod Scheduling Sequence
```mermaid
sequenceDiagram
    participant U as kubectl apply
    participant A as API Server
    participant E as etcd
    participant S as Scheduler
    participant K as Kubelet (Node)
    U->>A: POST /apis/apps/v1/deployments
    A->>E: store Deployment spec
    A-->>U: 201 Created
    S->>A: watch for unscheduled Pods
    A-->>S: new Pod (node not assigned)
    S->>S: Filter + Score nodes
    S->>A: bind Pod to Node X
    A->>E: store binding
    K->>A: watch assigned Pods
    A-->>K: Pod spec for Node X
    K->>K: pull image + start container
```
This sequence diagram traces the complete lifecycle of a kubectl apply from the moment the user submits a manifest to the moment a container starts running on a node. The key insight is that no component talks directly to another: everything is mediated through the API Server and persisted in etcd. The Scheduler watches for unscheduled Pods, selects a node, and writes that binding back to etcd; only then does the Kubelet pick up the assignment and act on it.
Pod Lifecycle States
```mermaid
stateDiagram-v2
    [*] --> Pending : Pod created, awaiting scheduling
    Pending --> Running : Node assigned, container started
    Running --> Succeeded : all containers exited 0
    Running --> Failed : container exited non-zero
    Running --> Terminating : delete signal sent
    Terminating --> [*] : graceful shutdown complete
    Pending --> Failed : image pull error / no node fits
```
This state diagram shows every phase a Pod can be in from creation to termination. The Pending state is where scheduling and image-pulling happen, making it the most common place Pods stall; an ImagePullBackOff keeps a Pod in Pending rather than advancing to Running. The Terminating state represents the graceful-shutdown window (controlled by terminationGracePeriodSeconds) where the container can finish in-flight requests before being forcibly stopped.
Reconcilers: The Universal Pattern
Every Kubernetes resource type has a dedicated controller โ a reconcile loop watching one object kind and acting on divergence. The Deployment controller watches Deployments and manages ReplicaSets. This pattern is intentionally modular: a new controller adds a new capability with zero changes to the core.
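The modularity can be illustrated with a minimal sketch (the controller classes and cluster model here are invented for illustration): each controller owns exactly one kind, and adding a capability means adding a class, not touching the loop:

```python
# Toy illustration of the "one controller per resource kind" pattern.
class ReplicaController:
    kind = "ReplicaSet"
    def reconcile(self, obj, cluster):
        # Scale the named group of Pods up to the declared replica count.
        have = len(cluster.setdefault(obj["name"], []))
        cluster[obj["name"]] += [f'{obj["name"]}-{i}' for i in range(have, obj["replicas"])]

class JobController:
    kind = "Job"
    def reconcile(self, obj, cluster):
        # Run one Pod to completion for the Job.
        cluster.setdefault(obj["name"], []).append(f'{obj["name"]}-runner')

# The registry: a new kind plugs in with zero changes to the loop below.
CONTROLLERS = {c.kind: c() for c in (ReplicaController, JobController)}

def control_loop(objects, cluster):
    for obj in objects:
        CONTROLLERS[obj["kind"]].reconcile(obj, cluster)

cluster = {}
control_loop([{"kind": "ReplicaSet", "name": "web", "replicas": 2}], cluster)
print(cluster)  # -> {'web': ['web-0', 'web-1']}
```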
Custom Resource Definitions (CRDs)
Extend Kubernetes with your own object types using CRDs. Istio's VirtualService, Argo's Workflow, and Cert-Manager's Certificate are all custom resources with custom reconcilers. See the Service Mesh pattern to see CRDs in action at scale.
The Request Journey: From Browser to Pod
Here is the complete path a request takes from the internet to a Pod inside your cluster:
```mermaid
flowchart TD
    Browser["Browser / API Client"] --> LB["Cloud Load Balancer (AWS ALB / GCP LB)"]
    LB --> Ingress["Ingress Controller (nginx-ingress Pod)"]
    Ingress --> SVC["Service (ClusterIP: 10.96.0.42)"]
    SVC --> P1["Pod 1 (10.244.1.5)"]
    SVC --> P2["Pod 2 (10.244.2.8)"]
    SVC --> P3["Pod 3 (10.244.3.2)"]
    style LB fill:#fff3cd,stroke:#f0ad4e
    style SVC fill:#d4edda,stroke:#28a745
```
External traffic enters through the cloud load balancer, hits the Ingress controller (host/path routing), then reaches the Service's stable ClusterIP. The Service distributes requests across all healthy Pods, even as individual Pods restart and get new IPs.
Every Pod has a unique cluster-internal IP. Services expose stable DNS names inside the cluster:
http://payment-service.default.svc.cluster.local:8080
No hardcoded IPs. The DNS name resolves to the ClusterIP, which balances across healthy Pods automatically.
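A toy model makes the contract concrete (the class, IPs, and round-robin policy here are illustrative; real Services are implemented by kube-proxy rules, not application code): a stable handle whose endpoint set churns underneath it.

```python
class Service:
    """Toy ClusterIP Service: a stable name balancing over a changing Pod set."""
    def __init__(self, name):
        self.name = name
        self.endpoints = []   # kept in sync by an endpoints controller
        self._rr = 0          # round-robin cursor

    def route(self):
        # Pick one of whatever Pods are healthy *right now*.
        ip = self.endpoints[self._rr % len(self.endpoints)]
        self._rr += 1
        return ip

svc = Service("payment-service.default.svc.cluster.local")
svc.endpoints = ["10.244.1.5", "10.244.2.8"]
print(svc.route())  # -> 10.244.1.5
svc.endpoints = ["10.244.2.8", "10.244.3.9"]  # Pod 1 died, a replacement joined
print(svc.route())  # same name, new Pod set: clients never notice
```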
Real-World Application: Running a Production E-Commerce Platform on Kubernetes
Shopify, Zalando, and Airbnb run Kubernetes clusters handling millions of requests per hour. A production slice: three services (storefront, cart, payment), each with a Deployment and Service. An Ingress exposes storefront externally. A HorizontalPodAutoscaler (HPA) scales storefront automatically on CPU.
```mermaid
flowchart LR
    HPA["HorizontalPodAutoscaler (CPU > 70%: add Pods)"] -.->|controls| SF
    Ingress["Ingress (public)"] --> SF["storefront Service (Pods x3-20)"]
    SF --> Cart["cart Service (Pods x2)"]
    SF --> Pay["payment Service (Pods x2)"]
    style HPA fill:#d4edda,stroke:#28a745
```
During Black Friday, CPU crosses 70% and K8s scales from 3 to 15 storefront Pods automatically. After the rush it scales back down. Zero manual intervention, zero over-provisioning at idle.
Kubernetes handles these scenarios automatically:
- A Node is drained for OS maintenance: K8s reschedules its Pods onto healthy Nodes before the drain starts.
- A bad `storefront:v3` deploy causes crash-loops: K8s pauses the rolling update and keeps `v2` serving traffic.
- A canary deployment routes 10% of traffic to `storefront:v3`: the rollout completes automatically once error rates stay clean.
Trade-offs, Failure Modes, and the Operational Complexity Tax
Kubernetes is powerful, but the operational cost is real.
| Concern | Real-world impact | Mitigation |
| --- | --- | --- |
| Steep learning curve | RBAC, CRDs, networking policies, admission webhooks: weeks before production confidence | Use managed K8s (GKE, EKS, AKS) to offload control-plane operations |
| Failure: missing resource limits | A Pod without requests/limits consumes an entire Node, evicting its neighbours | Set namespace-level LimitRange objects as a safety floor |
| Failure: misconfigured liveness probes | An over-aggressive probe kills healthy Pods in a restart loop | Use startupProbe for slow-starting apps; tune failureThreshold conservatively |
| Networking complexity | Services, Ingresses, NetworkPolicies, and CNI plugins interact in non-obvious ways | Start with a managed CNI; add NetworkPolicies incrementally |
| Cluster upgrade risk | Skipping minor versions breaks deprecated APIs and admission webhooks | Upgrade one minor version at a time; run a deprecated-API scanner before each upgrade |
The honest trade-off: Kubernetes removes individual server management toil but introduces platform management toil. For teams without a dedicated platform engineer, this swap rarely pays off until you are running many services at real scale.
Decision Guide: When Kubernetes Pays Off
| Situation | Recommendation |
| --- | --- |
| 10+ microservices, multiple teams | Kubernetes: the automation ROI justifies the platform investment |
| Cloud-hosted, need auto-scaling | Start with managed K8s (EKS / GKE / AKS): the control plane is handled for you |
| 1-3 services, single team, steady traffic | Docker Compose on a VM or a PaaS (Railway, Render, Fly.io): far less overhead |
| Serverless / event-driven workloads | AWS Lambda / Google Cloud Run: no cluster to manage |
| Batch or ML training jobs | Kubernetes + Argo Workflows or Kueue, or a dedicated tool like Airflow |
| Startup, pre-product-market fit | Skip K8s. Return when your team is 5+ engineers and you have real scaling pain |
Practical Example: Auto-Scaling Storefront with an HPA
The HorizontalPodAutoscaler watches the metrics-server and adjusts your Deployment's replica count continuously:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
With this applied:
- Deploy normally with `kubectl apply -f deployment.yaml` and `kubectl apply -f hpa.yaml`.
- Black Friday traffic hits and average CPU crosses 70%: the HPA adds replicas (re-evaluating every 15 seconds) until CPU returns to ~70%.
- Traffic subsides: the HPA scales back down, respecting `stabilizationWindowSeconds` to avoid flapping.
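The core scaling rule the HPA applies is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal Python sketch of that rule (omitting the real HPA's tolerance band and stabilization window):

```python
from math import ceil

def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                         min_replicas=3, max_replicas=20):
    """Core HPA scaling rule: scale proportionally to the metric ratio,
    then clamp to the configured bounds."""
    desired = ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# Black Friday: 3 replicas averaging 350% CPU against a 70% target.
print(hpa_desired_replicas(3, 350, 70))   # -> 15
# Traffic subsides: 15 replicas idling at 10% CPU scale back to the floor.
print(hpa_desired_replicas(15, 10, 70))   # -> 3
```

This is why the Black Friday scenario above lands on exactly 15 Pods: 3 × 350 / 70 = 15.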
No code changes. No manual intervention. The Bulkhead Pattern adds per-namespace resource quotas so one noisy service cannot consume all cluster capacity.
Minikube & k3s: Containerizing a Spring Boot App and Deploying to Kubernetes
Minikube runs a single-node Kubernetes cluster locally on your laptop: the fastest way to test Deployments, Services, and HPAs without a cloud account. k3s is a lightweight, production-grade K8s distribution packaged as a single binary, ideal for edge, IoT, and CI pipelines.
The example below containerizes a Spring Boot application and deploys it as a Kubernetes Deployment with a Service, applying the control-loop, Pod, and Deployment concepts from this post end-to-end.
```dockerfile
# Dockerfile: multi-stage build for a Spring Boot fat JAR
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY . .
RUN ./mvnw package -DskipTests

FROM eclipse-temurin:21-jre
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```
```yaml
# deployment.yaml: Kubernetes Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3  # desired state: the K8s control loop maintains exactly 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: acme/payments-api:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:  # K8s won't route traffic until Spring Boot is ready
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```
```bash
# Local development with Minikube
minikube start
eval $(minikube docker-env)   # point the Docker CLI at Minikube's daemon
docker build -t acme/payments-api:1.0.0 .
kubectl apply -f deployment.yaml
kubectl rollout status deployment/payments-api
kubectl port-forward svc/payments-api 8080:80
curl http://localhost:8080/actuator/health
```
Spring Boot's Actuator `/actuator/health/readiness` endpoint maps perfectly to the Kubernetes readinessProbe: the control loop will not route traffic to a Pod until the probe returns 200 OK, preventing the cold-start request failures described in the lessons below.
For a full deep-dive on deploying Spring Boot to Kubernetes with Helm, GitOps, and Argo CD, a dedicated follow-up post is planned.
Lessons from Running Kubernetes in Production
- Never skip resource requests and limits. A single Pod without limits can evict its entire Node's neighbours during a memory spike. This is the number-one newcomer mistake.
- Liveness probes kill healthy Pods. An HTTP health check that times out during a garbage-collection pause triggers a restart loop. Use `startupProbe` for JVM-based services.
- Namespaces are cost and policy boundaries. Use separate namespaces per team and per environment (`payments-prod`, `payments-staging`). Add `ResourceQuota` and `LimitRange` from day one.
- `kubectl apply` is idempotent; `kubectl create` is not. Use `apply` in CI/CD pipelines so re-runs never fail on "already exists."
- etcd is the cluster brain: back it up. Managed K8s (GKE/EKS/AKS) handles etcd backup automatically; self-hosted clusters need a dedicated etcd backup CronJob.
TLDR: Summary & Key Takeaways
- Pods are the atomic scheduling unit: they wrap one or more containers sharing a network namespace.
- Deployments declare a desired replica count; the Kubernetes control loop maintains it indefinitely and rolls out updates safely.
- Services give stable DNS names and virtual IPs to ephemeral Pods, providing transparent load balancing without hardcoded IPs.
- The control loop (desired state → observe → reconcile) is the core idea; every other K8s feature is a specific implementation of it.
- etcd holds all desired state; the scheduler, controllers, and kubelet all read from and write to it.
- The HPA auto-scales Pods based on metrics: no manual scaling during traffic spikes.
- Kubernetes trades server-management toil for platform-management toil: reach for it when you have real scaling or resilience problems, not as a default for every project.
Related Posts
- Service Mesh Pattern: Control Plane, Data Plane, and Zero-Trust Traffic. The natural infrastructure layer on top of Kubernetes: Envoy sidecars, mTLS, and traffic policy applied cluster-wide without touching application code.
- Canary Deployment Pattern: Progressive Delivery with SLOs. How to ship new Kubernetes Deployment versions to a small traffic slice and auto-rollback if SLOs degrade.
- Circuit Breaker Pattern: Prevent Cascading Failures. Resilience patterns that protect your Kubernetes services from cascading failures when a downstream dependency degrades.
- Bulkhead Pattern: Isolate Capacity and Failure Domains. Namespace-level resource quotas and Pod Disruption Budgets to contain blast radius inside a Kubernetes cluster.

Written by
Abstract Algorithms
@abstractalgorithms