How Kubernetes Works: The Container Orchestrator
Docker runs containers. Kubernetes manages them. We explain Pods, Nodes, Deployments, and Services to demystify the world's most popular orchestrator.
TLDR
Kubernetes (K8s) is an operating system for the cloud. It manages clusters of computers (Nodes) and schedules applications (Pods) onto them via a continuous declarative control loop: you describe what you want, and Kubernetes continuously reconciles reality to match it, self-healing crashes and scaling replicas without manual intervention.
From Manual SSH to Automated Orchestration: Why Kubernetes Exists
Before Kubernetes, deploying an app meant SSH-ing into servers and running commands manually. If a server died, so did your app. If traffic spiked, you provisioned a new server yourself. There was no standard way to restart crashed processes, spread load, or move workloads away from failing hardware.
Kubernetes introduces a Shipping Port Manager model:
- Container (Docker image): A standardized, portable shipping container.
- Pod: A crane holding one or more containers together on the same network.
- Node: A cargo ship (server) carrying Pods.
- Control Plane: The port manager in the tower. She says "keep 3 cranes running at all times" and enforces it continuously, even when ships sink.
You never SSH into ships. You talk to the manager, declare your intent, and she handles execution.
Pods, Nodes, Deployments, and Services: The Core Object Model
Kubernetes organizes everything into typed objects stored in etcd, its distributed key-value database. The four you will use on day one:
| Object | What it is | Analogy |
| --- | --- | --- |
| Pod | Smallest schedulable unit; wraps 1+ containers sharing an IP | A crane on a ship |
| Node | A worker server running Pods | A cargo ship |
| Deployment | Declares a desired number of Pod replicas and manages rolling updates | The port manager's standing order |
| Service | A stable virtual IP + DNS name load-balancing to a set of Pods | The radio frequency that always reaches the right crane |
Pods are ephemeral: they crash, restart, and change IPs constantly. Services give you a stable address. Deployments ensure you always have the right number of healthy Pods running.
The Control Loop: How Kubernetes Reconciles Desired State
This is the one concept that unlocks everything else in Kubernetes.
```mermaid
flowchart LR
    YAML["Desired State in etcd (replicas: 3)"] --> CM["kube-controller-manager watches etcd continuously"]
    CM --> Obs["Observe Current State (replicas: 2, one crashed)"]
    Obs --> Act["Reconcile: schedule 1 new Pod on an available Node"]
    Act --> CM
    style CM fill:#f0f4ff,stroke:#4a6cf7
```
The loop never stops. Every few seconds, each controller:
- Reads the desired state from etcd.
- Observes the current state: how many Pods are actually running, on which Nodes.
- If they differ, acts (start, stop, or reschedule Pods).
This is why Kubernetes is declarative: you write what you want (a YAML spec), not how to do it. K8s figures out the "how" and keeps retrying until the world matches your spec.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  replicas: 3
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      containers:
        - name: storefront
          image: acme/storefront:v2.1
          ports:
            - containerPort: 8080
```
Apply this once. Kubernetes creates 3 Pods and maintains exactly 3 forever, auto-replacing any that crash.
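The reconciliation step itself is simple enough to sketch in a few lines of Python. This is a toy model, not real controller code; `reconcile`, `start_pod`, and `stop_pod` are hypothetical stand-ins for the controller machinery:

```python
def reconcile(desired_replicas, running_pods, start_pod, stop_pod):
    """One pass of a toy Deployment controller: converge current -> desired."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        for _ in range(diff):            # too few Pods: start replacements
            running_pods.append(start_pod())
    elif diff < 0:
        for _ in range(-diff):           # too many Pods: scale down
            stop_pod(running_pods.pop())
    return running_pods

# Desired state says 3 replicas, but one Pod just crashed.
pods = ["storefront-a", "storefront-b"]  # current state: only 2 running
pods = reconcile(3, pods,
                 start_pod=lambda: "storefront-new",
                 stop_pod=lambda pod: None)
print(pods)  # -> ['storefront-a', 'storefront-b', 'storefront-new']
```

Run this in a loop every few seconds and you have the essence of a controller: it never asks *how* the gap appeared, it only closes it.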
Deep Dive: The Scheduler, Reconciler Pattern, and Custom Resources
How the Scheduler Places Pods
When a new Pod needs to be placed, kube-scheduler runs a two-phase algorithm:
- Filter: eliminate Nodes that cannot fit the Pod (insufficient CPU/RAM, taint mismatches, wrong node labels).
- Score: rank remaining Nodes (prefer more free resources, spread replicas across failure zones).
The highest-scoring Node wins. The scheduler writes the binding to etcd; the Node's kubelet picks it up and starts the container.
| Phase | What It Evaluates |
| --- | --- |
| Filter | CPU/memory headroom, nodeSelector, taints, pod affinity rules |
| Score | Resource balance, topology spread, inter-pod affinity bonuses |
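The two-phase idea can be sketched in Python. The real kube-scheduler uses pluggable filter and score plugins; the field names and the scoring heuristic below are made up for illustration:

```python
def schedule(pod, nodes):
    """Toy two-phase scheduler: filter out nodes that can't fit, score the rest."""
    # Phase 1 - Filter: the node must have enough free CPU/RAM and matching labels.
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and all(n["labels"].get(k) == v
                for k, v in pod.get("node_selector", {}).items())
    ]
    if not feasible:
        return None  # no node fits: the Pod stays Pending
    # Phase 2 - Score: prefer the node with the most headroom left over.
    return max(feasible,
               key=lambda n: (n["free_cpu"] - pod["cpu"]) + (n["free_mem"] - pod["mem"]))

nodes = [
    {"name": "node-a", "free_cpu": 2.0, "free_mem": 4, "labels": {"zone": "us-east-1a"}},
    {"name": "node-b", "free_cpu": 8.0, "free_mem": 16, "labels": {"zone": "us-east-1b"}},
]
pod = {"cpu": 1.0, "mem": 2, "node_selector": {}}
print(schedule(pod, nodes)["name"])  # -> node-b (most free resources)
```

Swap the `max` key for a topology-spread heuristic and you have the flavor of a score plugin.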
Pod Scheduling Sequence
```mermaid
sequenceDiagram
    participant U as kubectl apply
    participant A as API Server
    participant E as etcd
    participant S as Scheduler
    participant K as Kubelet (Node)
    U->>A: POST /apis/apps/v1/deployments
    A->>E: store Deployment spec
    A-->>U: 201 Created
    S->>A: watch for unscheduled Pods
    A-->>S: new Pod (node not assigned)
    S->>S: Filter + Score nodes
    S->>A: bind Pod to Node X
    A->>E: store binding
    K->>A: watch assigned Pods
    A-->>K: Pod spec for Node X
    K->>K: pull image + start container
```
This sequence diagram traces the complete lifecycle of a kubectl apply from the moment the user submits a manifest to the moment a container starts running on a node. The key insight is that no component talks directly to another: everything is mediated through the API Server and persisted in etcd. The Scheduler watches for unscheduled Pods, selects a node, and writes that binding back to etcd; only then does the Kubelet pick up the assignment and act on it.
Pod Lifecycle States
```mermaid
stateDiagram-v2
    [*] --> Pending : Pod created, awaiting scheduling
    Pending --> Running : Node assigned, container started
    Running --> Succeeded : all containers exited 0
    Running --> Failed : container exited non-zero
    Running --> Terminating : delete signal sent
    Terminating --> [*] : graceful shutdown complete
    Pending --> Failed : image pull error / no node fits
```
This state diagram shows every phase a Pod can be in from creation to termination. The Pending state is where scheduling and image-pulling happen, making it the most common place Pods stall; an ImagePullBackOff keeps a Pod in Pending rather than advancing to Running. The Terminating state represents the graceful-shutdown window (controlled by terminationGracePeriodSeconds) where the container can finish in-flight requests before being forcibly stopped.
Reconcilers: The Universal Pattern
Every Kubernetes resource type has a dedicated controller โ a reconcile loop watching one object kind and acting on divergence. The Deployment controller watches Deployments and manages ReplicaSets. This pattern is intentionally modular: a new controller adds a new capability with zero changes to the core.
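The modularity can be illustrated with a minimal sketch (the controller classes and cluster model here are invented for illustration): each controller owns exactly one kind, and adding a capability means adding a class, not touching the loop:

```python
# Toy illustration of the "one controller per resource kind" pattern.
class ReplicaController:
    kind = "ReplicaSet"
    def reconcile(self, obj, cluster):
        # Scale the named group of Pods up to the declared replica count.
        have = len(cluster.setdefault(obj["name"], []))
        cluster[obj["name"]] += [f'{obj["name"]}-{i}' for i in range(have, obj["replicas"])]

class JobController:
    kind = "Job"
    def reconcile(self, obj, cluster):
        # Run one Pod to completion for the Job.
        cluster.setdefault(obj["name"], []).append(f'{obj["name"]}-runner')

# The registry: a new kind plugs in with zero changes to the loop below.
CONTROLLERS = {c.kind: c() for c in (ReplicaController, JobController)}

def control_loop(objects, cluster):
    for obj in objects:
        CONTROLLERS[obj["kind"]].reconcile(obj, cluster)

cluster = {}
control_loop([{"kind": "ReplicaSet", "name": "web", "replicas": 2}], cluster)
print(cluster)  # -> {'web': ['web-0', 'web-1']}
```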
Custom Resource Definitions (CRDs)
Extend Kubernetes with your own object types using CRDs. Istio's VirtualService, Argo's Workflow, and Cert-Manager's Certificate are all custom resources with custom reconcilers. See the Service Mesh pattern to see CRDs in action at scale.
The Request Journey: From Browser to Pod
Here is the complete path a request takes from the internet to a Pod inside your cluster:
```mermaid
flowchart TD
    Browser["Browser / API Client"] --> LB["Cloud Load Balancer (AWS ALB / GCP LB)"]
    LB --> Ingress["Ingress Controller (nginx-ingress Pod)"]
    Ingress --> SVC["Service (ClusterIP: 10.96.0.42)"]
    SVC --> P1["Pod 1 (10.244.1.5)"]
    SVC --> P2["Pod 2 (10.244.2.8)"]
    SVC --> P3["Pod 3 (10.244.3.2)"]
    style LB fill:#fff3cd,stroke:#f0ad4e
    style SVC fill:#d4edda,stroke:#28a745
```
External traffic enters through the cloud load balancer, hits the Ingress controller (host/path routing), then reaches the Service's stable ClusterIP. The Service distributes requests across all healthy Pods, even as individual Pods restart and get new IPs.
Every Pod has a unique cluster-internal IP. Services expose stable DNS names inside the cluster:
http://payment-service.default.svc.cluster.local:8080
No hardcoded IPs. The DNS name resolves to the ClusterIP, which balances across healthy Pods automatically.
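A toy model makes the contract concrete (the class, IPs, and round-robin policy here are illustrative; real Services are implemented by kube-proxy rules, not application code): a stable handle whose endpoint set churns underneath it.

```python
class Service:
    """Toy ClusterIP Service: a stable name balancing over a changing Pod set."""
    def __init__(self, name):
        self.name = name
        self.endpoints = []   # kept in sync by an endpoints controller
        self._rr = 0          # round-robin cursor

    def route(self):
        # Pick one of whatever Pods are healthy *right now*.
        ip = self.endpoints[self._rr % len(self.endpoints)]
        self._rr += 1
        return ip

svc = Service("payment-service.default.svc.cluster.local")
svc.endpoints = ["10.244.1.5", "10.244.2.8"]
print(svc.route())  # -> 10.244.1.5
svc.endpoints = ["10.244.2.8", "10.244.3.9"]  # Pod 1 died, a replacement joined
print(svc.route())  # same name, new Pod set: clients never notice
```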
Real-World Application: Running a Production E-Commerce Platform on Kubernetes
Shopify, Zalando, and Airbnb run Kubernetes clusters handling millions of requests per hour. A production slice: three services (storefront, cart, payment), each with a Deployment and Service. An Ingress exposes storefront externally. A HorizontalPodAutoscaler (HPA) scales storefront automatically on CPU.
```mermaid
flowchart LR
    HPA["HorizontalPodAutoscaler (CPU > 70%: add Pods)"] -.->|controls| SF
    Ingress["Ingress (public)"] --> SF["storefront Service (Pods x3-20)"]
    SF --> Cart["cart Service (Pods x2)"]
    SF --> Pay["payment Service (Pods x2)"]
    style HPA fill:#d4edda,stroke:#28a745
```
During Black Friday, CPU crosses 70% and K8s scales from 3 to 15 storefront Pods automatically. After the rush it scales back down. Zero manual intervention, zero over-provisioning at idle.
Kubernetes handles these scenarios automatically:
- A Node is drained for OS maintenance: K8s reschedules its Pods onto healthy Nodes before the drain starts.
- A bad `storefront:v3` deploy causes crash-loops: K8s pauses the rolling update and keeps `v2` serving traffic.
- A canary deployment routes 10% of traffic to `storefront:v3`: the rollout completes automatically once error rates stay clean.
Trade-offs, Failure Modes, and the Operational Complexity Tax
Kubernetes is powerful, but the operational cost is real.
| Concern | Real-world impact | Mitigation |
| --- | --- | --- |
| Steep learning curve | RBAC, CRDs, networking policies, admission webhooks: weeks before production confidence | Use managed K8s (GKE, EKS, AKS) to offload control-plane operations |
| Failure: missing resource limits | A Pod without requests/limits consumes an entire Node, evicting its neighbours | Set namespace-level LimitRange objects as a safety floor |
| Failure: misconfigured liveness probes | An over-aggressive probe kills healthy Pods in a restart loop | Use startupProbe for slow-starting apps; tune failureThreshold conservatively |
| Networking complexity | Services, Ingresses, NetworkPolicies, and CNI plugins interact in non-obvious ways | Start with a managed CNI; add NetworkPolicies incrementally |
| Cluster upgrade risk | Skipping minor versions breaks deprecated APIs and admission webhooks | Upgrade one minor version at a time; run a deprecated-API scanner before each upgrade |
The honest trade-off: Kubernetes removes individual server management toil but introduces platform management toil. For teams without a dedicated platform engineer, this swap rarely pays off until you are running many services at real scale.
Decision Guide: When Kubernetes Pays Off
| Situation | Recommendation |
| --- | --- |
| 10+ microservices, multiple teams | Kubernetes: the automation ROI justifies the platform investment |
| Cloud-hosted, need auto-scaling | Start with managed K8s (EKS / GKE / AKS): the control plane is handled for you |
| 1-3 services, single team, steady traffic | Docker Compose on a VM or a PaaS (Railway, Render, Fly.io): far less overhead |
| Serverless / event-driven workloads | AWS Lambda / Google Cloud Run: no cluster to manage |
| Batch or ML training jobs | Kubernetes + Argo Workflows or Kueue, or a dedicated tool like Airflow |
| Startup, pre-product-market fit | Skip K8s. Return when your team is 5+ engineers and you have real scaling pain |
Practical Example: Auto-Scaling Storefront with an HPA
The HorizontalPodAutoscaler watches the metrics-server and adjusts your Deployment's replica count continuously:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
With this applied:
- Deploy normally with `kubectl apply -f deployment.yaml` and `kubectl apply -f hpa.yaml`.
- Black Friday traffic hits and average CPU crosses 70%: the HPA adds replicas (re-evaluating every 15 seconds) until CPU returns to ~70%.
- Traffic subsides: the HPA scales back down, respecting `stabilizationWindowSeconds` to avoid flapping.
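The core scaling rule the HPA applies is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal Python sketch of that rule (omitting the real HPA's tolerance band and stabilization window):

```python
from math import ceil

def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                         min_replicas=3, max_replicas=20):
    """Core HPA scaling rule: scale proportionally to the metric ratio,
    then clamp to the configured bounds."""
    desired = ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# Black Friday: 3 replicas averaging 350% CPU against a 70% target.
print(hpa_desired_replicas(3, 350, 70))   # -> 15
# Traffic subsides: 15 replicas idling at 10% CPU scale back to the floor.
print(hpa_desired_replicas(15, 10, 70))   # -> 3
```

This is why the Black Friday scenario above lands on exactly 15 Pods: 3 × 350 / 70 = 15.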
No code changes. No manual intervention. The Bulkhead Pattern adds per-namespace resource quotas so one noisy service cannot consume all cluster capacity.
Minikube & k3s: Containerizing a Spring Boot App and Deploying to Kubernetes
Minikube runs a single-node Kubernetes cluster locally on your laptop: the fastest way to test Deployments, Services, and HPAs without a cloud account. k3s is a lightweight, production-grade K8s distribution packaged as a single binary, ideal for edge, IoT, and CI pipelines.
The example below containerizes a Spring Boot application and deploys it as a Kubernetes Deployment with a Service, applying the control-loop, Pod, and Deployment concepts from this post end-to-end.
```dockerfile
# Dockerfile: multi-stage build for a Spring Boot fat JAR
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY . .
RUN ./mvnw package -DskipTests

FROM eclipse-temurin:21-jre
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```
```yaml
# deployment.yaml: Kubernetes Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3  # desired state: the K8s control loop maintains exactly 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: acme/payments-api:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:  # K8s won't route traffic until Spring Boot is ready
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```
```bash
# Local development with Minikube
minikube start
eval $(minikube docker-env)   # point the Docker CLI at Minikube's daemon
docker build -t acme/payments-api:1.0.0 .
kubectl apply -f deployment.yaml
kubectl rollout status deployment/payments-api
kubectl port-forward svc/payments-api 8080:80
curl http://localhost:8080/actuator/health
```
Spring Boot's Actuator `/actuator/health/readiness` endpoint maps perfectly to the Kubernetes readinessProbe: the control loop will not route traffic to a Pod until the probe returns 200 OK, preventing the cold-start request failures described in the lessons below.
For a full deep-dive on deploying Spring Boot to Kubernetes with Helm, GitOps, and Argo CD, a dedicated follow-up post is planned.
Lessons from Running Kubernetes in Production
- Never skip resource requests and limits. A single Pod without limits can evict its entire Node's neighbours during a memory spike. This is the number-one newcomer mistake.
- Liveness probes kill healthy Pods. An HTTP health check that times out during a garbage-collection pause triggers a restart loop. Use `startupProbe` for JVM-based services.
- Namespaces are cost and policy boundaries. Use separate namespaces per team and per environment (`payments-prod`, `payments-staging`). Add `ResourceQuota` and `LimitRange` from day one.
- `kubectl apply` is idempotent; `kubectl create` is not. Use `apply` in CI/CD pipelines so re-runs never fail on "already exists."
- etcd is the cluster brain: back it up. Managed K8s (GKE/EKS/AKS) handles etcd backup automatically; self-hosted clusters need a dedicated etcd backup CronJob.
TLDR: Summary & Key Takeaways
- Pods are the atomic scheduling unit: they wrap one or more containers sharing a network namespace.
- Deployments declare a desired replica count; the Kubernetes control loop maintains it indefinitely and rolls out updates safely.
- Services give stable DNS names and virtual IPs to ephemeral Pods, providing transparent load balancing without hardcoded IPs.
- The control loop (desired state → observe → reconcile) is the core idea; every other K8s feature is a specific implementation of it.
- etcd holds all desired state; the scheduler, controllers, and kubelet all read from and write to it.
- The HPA auto-scales Pods based on metrics: no manual scaling during traffic spikes.
- Kubernetes trades server-management toil for platform-management toil: reach for it when you have real scaling or resilience problems, not as a default for every project.
Related Posts
- Service Mesh Pattern: Control Plane, Data Plane, and Zero-Trust Traffic. The natural infrastructure layer on top of Kubernetes: Envoy sidecars, mTLS, and traffic policy applied cluster-wide without touching application code.
- Canary Deployment Pattern: Progressive Delivery with SLOs. How to ship new Kubernetes Deployment versions to a small traffic slice and auto-rollback if SLOs degrade.
- Circuit Breaker Pattern: Prevent Cascading Failures. Resilience patterns that protect your Kubernetes services from cascading failures when a downstream dependency degrades.
- Bulkhead Pattern: Isolate Capacity and Failure Domains. Namespace-level resource quotas and Pod Disruption Budgets to contain blast radius inside a Kubernetes cluster.

Written by
Abstract Algorithms
@abstractalgorithms