
How Kubernetes Works: The Container Orchestrator

Docker runs containers. Kubernetes manages them. We explain Pods, Nodes, Deployments, and Services to demystify the world's most popular orchestrator.

Abstract Algorithms
· 13 min read

AI-assisted content.

TLDR

Kubernetes (K8s) is an operating system for the cloud. It manages clusters of computers (Nodes) and schedules applications (Pods) onto them via a continuous declarative control loop: you describe what you want, and Kubernetes continuously reconciles reality to match it, self-healing crashes and scaling replicas without manual intervention.


📖 From Manual SSH to Automated Orchestration: Why Kubernetes Exists

Before Kubernetes, deploying an app meant SSH-ing into servers and running commands manually. If a server died, so did your app. If traffic spiked, you provisioned a new server yourself. There was no standard way to restart crashed processes, spread load, or move workloads away from failing hardware.

Kubernetes introduces a Shipping Port Manager model:

  • Container (Docker image): A standardized, portable shipping container.
  • Pod: A crane holding one or more containers together on the same network.
  • Node: A cargo ship (server) carrying Pods.
  • Control Plane: The port manager in the tower; she says "keep 3 cranes running at all times" and enforces it continuously, even when ships sink.

You never SSH into ships. You talk to the manager, declare your intent, and she handles execution.


๐Ÿ” Pods, Nodes, Deployments, and Services: The Core Object Model

Kubernetes organizes everything into typed objects stored in etcd, its distributed key-value database. The four you will use on day one:

| Object | What it is | Analogy |
|---|---|---|
| Pod | Smallest schedulable unit; wraps 1+ containers sharing an IP | A crane on a ship |
| Node | A worker server running Pods | A cargo ship |
| Deployment | Declares a desired number of Pod replicas and manages rolling updates | The port manager's standing order |
| Service | A stable virtual IP + DNS name load-balancing to a set of Pods | The radio frequency that always reaches the right crane |

Pods are ephemeral: they crash, restart, and change IPs constantly. Services give you a stable address. Deployments ensure you always have the right number of healthy Pods running.


โš™๏ธ The Control Loop: How Kubernetes Reconciles Desired State

This is the one concept that unlocks everything else in Kubernetes.

```mermaid
flowchart LR
    YAML["Desired State in etcd (replicas: 3)"] --> CM["kube-controller-manager watches etcd continuously"]
    CM --> Obs["Observe Current State (replicas: 2, one crashed)"]
    Obs --> Act["Reconcile: schedule 1 new Pod on an available Node"]
    Act --> CM
    style CM fill:#f0f4ff,stroke:#4a6cf7
```

The loop never stops. Every few seconds, each controller:

  1. Reads the desired state from etcd.
  2. Observes the current state: how many Pods are actually running, and on which Nodes.
  3. If they differ, acts (start, stop, or reschedule Pods).

This is why Kubernetes is declarative: you write what you want (a YAML spec), not how to do it. K8s figures out the "how" and keeps retrying until the world matches your spec.
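The three-step loop can be sketched in a few lines. This is an illustrative Python toy, not real controller code: plain dicts stand in for etcd (desired state) and the live cluster (current state).

```python
# Toy reconcile loop: read desired state, observe current state, act on the diff.
# (Real controllers use watches against the API server, not polling over dicts.)

def reconcile(desired: dict, current: dict) -> list[str]:
    """Compare desired vs. current replica counts and return the actions
    a controller would take to converge them."""
    actions = []
    for name, want in desired.items():
        have = current.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} pod(s) for {name}")
        elif have > want:
            actions.append(f"stop {have - want} pod(s) for {name}")
    return actions

# Desired state (from etcd) says 3 replicas; one Pod just crashed.
desired = {"storefront": 3}
current = {"storefront": 2}
print(reconcile(desired, current))  # ['start 1 pod(s) for storefront']
```

Because the loop re-runs continuously, a second crash a minute later produces the same corrective action: convergence is retried, never assumed.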

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  replicas: 3
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      containers:
        - name: storefront
          image: acme/storefront:v2.1
          ports:
            - containerPort: 8080
```

Apply this once. Kubernetes creates 3 Pods and maintains exactly 3 forever, auto-replacing any that crash.


🧠 Deep Dive: The Scheduler, Reconciler Pattern, and Custom Resources

How the Scheduler Places Pods

When a new Pod needs to be placed, kube-scheduler runs a two-phase algorithm:

  1. Filter: eliminate Nodes that cannot fit the Pod (insufficient CPU/RAM, taint mismatches, wrong node labels).
  2. Score: rank remaining Nodes (prefer more free resources, spread replicas across failure zones).

The highest-scoring Node wins. The scheduler writes the binding to etcd; the Node's kubelet picks it up and starts the container.

| Phase | What It Evaluates |
|---|---|
| Filter | CPU/memory headroom, nodeSelector, taints, pod affinity rules |
| Score | Resource balance, topology spread, inter-pod affinity bonuses |

📊 Pod Scheduling Sequence

```mermaid
sequenceDiagram
    participant U as kubectl apply
    participant A as API Server
    participant E as etcd
    participant S as Scheduler
    participant K as Kubelet (Node)

    U->>A: POST /apis/apps/v1/deployments
    A->>E: store Deployment spec
    A-->>U: 201 Created
    S->>A: watch for unscheduled Pods
    A-->>S: new Pod (node not assigned)
    S->>S: Filter + Score nodes
    S->>A: bind Pod to Node X
    A->>E: store binding
    K->>A: watch assigned Pods
    A-->>K: Pod spec for Node X
    K->>K: pull image + start container
```

This sequence diagram traces the complete lifecycle of a kubectl apply from the moment the user submits a manifest to the moment a container starts running on a node. The key insight is that no component talks directly to another: everything is mediated through the API Server and persisted in etcd. The Scheduler watches for unscheduled Pods, selects a node, and writes that binding back to etcd; only then does the Kubelet pick up the assignment and act on it.
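The hub-and-spoke watch pattern can be mimicked with a callback registry. Everything here (the `ApiServer` class, the in-memory store) is an illustrative stand-in; the real API server streams watch events over HTTP and persists objects to etcd:

```python
# Sketch of the watch pattern: components register interest with the API
# server, which fans out every state change. No component calls another.

class ApiServer:
    def __init__(self):
        self.store = {}        # stands in for etcd
        self.watchers = []     # callbacks registered by components

    def watch(self, callback):
        self.watchers.append(callback)

    def put(self, key, obj):
        self.store[key] = obj
        for cb in self.watchers:      # notify every watcher of the change
            cb(key, obj)

events = []
api = ApiServer()
api.watch(lambda k, o: events.append(("scheduler saw", k)))  # kube-scheduler
api.watch(lambda k, o: events.append(("kubelet saw", k)))    # kubelet

api.put("pod/storefront-abc", {"node": None})  # kubectl apply -> API server
print(events)
```

Both watchers observe the new Pod without knowing about each other, which is exactly why new controllers can be added to a cluster without touching existing ones.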

📊 Pod Lifecycle States

```mermaid
stateDiagram-v2
    [*] --> Pending : Pod created, awaiting scheduling
    Pending --> Running : Node assigned, container started
    Running --> Succeeded : all containers exited 0
    Running --> Failed : container exited non-zero
    Running --> Terminating : delete signal sent
    Terminating --> [*] : graceful shutdown complete
    Pending --> Failed : image pull error / no node fits
```

This state diagram shows every phase a Pod can be in from creation to termination. The Pending state is where scheduling and image-pulling happen, making it the most common place Pods stall; an image-pull failure (ImagePullBackOff) keeps a Pod in Pending rather than advancing to Running. The Terminating state represents the graceful-shutdown window (controlled by terminationGracePeriodSeconds) in which the container can finish in-flight requests before being forcibly stopped.
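The diagram's transitions can be captured as a small lookup table. This is a sketch of the phase model only, not the kubelet's actual bookkeeping (which also tracks per-container states):

```python
# Allowed Pod phase transitions from the state diagram above.
TRANSITIONS = {
    "Pending":     {"Running", "Failed"},
    "Running":     {"Succeeded", "Failed", "Terminating"},
    "Terminating": {"Deleted"},
}

def advance(phase: str, target: str) -> str:
    """Move to the target phase, rejecting transitions the diagram forbids."""
    if target not in TRANSITIONS.get(phase, set()):
        raise ValueError(f"illegal transition {phase} -> {target}")
    return target

phase = advance("Pending", "Running")   # node assigned, container started
phase = advance(phase, "Terminating")   # delete signal sent
print(phase)                            # Terminating
```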

Reconcilers: The Universal Pattern

Every Kubernetes resource type has a dedicated controller: a reconcile loop watching one object kind and acting on divergence. The Deployment controller watches Deployments and manages ReplicaSets. This pattern is intentionally modular: a new controller adds a new capability with zero changes to the core.

Custom Resource Definitions (CRDs)

Extend Kubernetes with your own object types using CRDs. Istio's VirtualService, Argo's Workflow, and Cert-Manager's Certificate are all custom resources with custom reconcilers. See the Service Mesh pattern for CRDs in action at scale.


📊 The Request Journey: From Browser to Pod

Here is the complete path a request takes from the internet to a Pod inside your cluster:

```mermaid
flowchart TD
    Browser["Browser / API Client"] --> LB["Cloud Load Balancer (AWS ALB / GCP LB)"]
    LB --> Ingress["Ingress Controller (nginx-ingress Pod)"]
    Ingress --> SVC["Service (ClusterIP: 10.96.0.42)"]
    SVC --> P1["Pod 1 (10.244.1.5)"]
    SVC --> P2["Pod 2 (10.244.2.8)"]
    SVC --> P3["Pod 3 (10.244.3.2)"]
    style LB fill:#fff3cd,stroke:#f0ad4e
    style SVC fill:#d4edda,stroke:#28a745
```

External traffic enters through the cloud load balancer, hits the Ingress controller (host/path routing), then reaches the Service's stable ClusterIP. The Service distributes requests across all healthy Pods, even as individual Pods restart and get new IPs.

Every Pod has a unique cluster-internal IP. Services expose stable DNS names inside the cluster:

http://payment-service.default.svc.cluster.local:8080

No hardcoded IPs. The DNS name resolves to the ClusterIP, which balances across healthy Pods automatically.
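Conceptually, the Service is a stable front for a rotating set of backends. The sketch below fakes this with round-robin selection in Python; in a real cluster, kube-proxy programs iptables/IPVS rules and the selection is not strictly round-robin:

```python
# Toy Service: one stable entry point, many ephemeral backend Pod IPs.
import itertools

class Service:
    def __init__(self, pod_ips: list[str]):
        self.healthy = list(pod_ips)
        self._rr = itertools.cycle(self.healthy)  # round-robin iterator

    def route(self) -> str:
        """Pick a backend for one request; callers never see pod IPs."""
        return next(self._rr)

svc = Service(["10.244.1.5", "10.244.2.8", "10.244.3.2"])
print([svc.route() for _ in range(4)])
# cycles through the three pod IPs, then wraps around
```

If a Pod dies and a replacement appears with a new IP, only the Service's backend list changes; every caller keeps using the same DNS name.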


๐ŸŒ Real-World Application: Running a Production E-Commerce Platform on Kubernetes

Shopify, Zalando, and Airbnb run Kubernetes clusters handling millions of requests per hour. A production slice: three services (storefront, cart, payment), each with a Deployment and Service. An Ingress exposes storefront externally. A HorizontalPodAutoscaler (HPA) scales storefront automatically on CPU.

```mermaid
flowchart LR
    HPA["HorizontalPodAutoscaler (CPU > 70%: add Pods)"] -.->|controls| SF
    Ingress["Ingress (public)"] --> SF["storefront Service, Pods x3-20"]
    SF --> Cart["cart Service, Pods x2"]
    SF --> Pay["payment Service, Pods x2"]
    style HPA fill:#d4edda,stroke:#28a745
```

During Black Friday, CPU crosses 70% and K8s scales from 3 to 15 storefront Pods automatically. After the rush it scales back down: zero manual intervention, zero over-provisioning at idle.

Kubernetes handles these scenarios automatically:

  • A Node is drained for OS maintenance: K8s reschedules its Pods onto healthy Nodes before the drain starts.
  • A bad storefront:v3 deploy causes crash-loops: K8s pauses the rolling update and keeps v2 serving traffic.
  • A canary deployment routes 10% of traffic to storefront:v3; the rollout completes automatically once error rates stay clean.

โš–๏ธ Trade-offs & Failure Modes: Trade-offs, Failure Modes, and the Operational Complexity Tax

Kubernetes is powerful, but the operational cost is real.

| Concern | Real-world impact | Mitigation |
|---|---|---|
| Steep learning curve | RBAC, CRDs, networking policies, admission webhooks: weeks of learning before production confidence | Use managed K8s (GKE, EKS, AKS) to offload control-plane operations |
| Failure: missing resource limits | A Pod without requests/limits consumes an entire Node, evicting its neighbours | Set namespace-level LimitRange objects as a safety floor |
| Failure: misconfigured liveness probes | An over-aggressive probe kills healthy Pods in a restart loop | Use startupProbe for slow-starting apps; tune failureThreshold conservatively |
| Networking complexity | Services, Ingresses, NetworkPolicies, and CNI plugins interact in non-obvious ways | Start with a managed CNI; add NetworkPolicies incrementally |
| Cluster upgrade risk | Skipping minor versions breaks deprecated APIs and admission webhooks | Upgrade one minor version at a time; scan for deprecated API usage (e.g., with kubent) before each upgrade |

The honest trade-off: Kubernetes removes individual server management toil but introduces platform management toil. For teams without a dedicated platform engineer, this swap rarely pays off until you are running many services at real scale.


🧭 Decision Guide: When Kubernetes Pays Off

| Situation | Recommendation |
|---|---|
| 10+ microservices, multiple teams | Kubernetes: automation ROI justifies the platform investment |
| Cloud-hosted, need auto-scaling | Start with EKS / GKE / AKS managed K8s; the control plane is handled for you |
| 1–3 services, single team, steady traffic | Docker Compose on a VM, or a PaaS (Railway, Render, Fly.io): far less overhead |
| Serverless / event-driven workloads | AWS Lambda / Google Cloud Run: no cluster to manage |
| Batch or ML training jobs | Kubernetes + Argo Workflows or Kueue, or a dedicated tool like Airflow |
| Startup, pre-product-market fit | Skip K8s. Return when your team is 5+ engineers and you have real scaling pain |

🧪 Practical Example: Auto-Scaling Storefront with an HPA

The HorizontalPodAutoscaler watches the metrics-server and adjusts your Deployment's replica count continuously:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

With this applied:

  1. Deploy normally with kubectl apply -f deployment.yaml and kubectl apply -f hpa.yaml.
  2. Black Friday traffic hits and average CPU crosses 70%: the HPA, re-evaluating roughly every 15 seconds, adds replicas until CPU returns to ~70%.
  3. Traffic subsides: the HPA scales back down, respecting stabilizationWindowSeconds to avoid flapping.
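The scaling decision itself follows the formula documented for the HPA, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch, ignoring the tolerance band and stabilization windows the real controller also applies:

```python
# HPA replica calculation: scale proportionally to metric overshoot.
import math

def hpa_desired(current_replicas: int, current_cpu: float, target_cpu: float,
                min_r: int = 3, max_r: int = 20) -> int:
    desired = math.ceil(current_replicas * current_cpu / target_cpu)
    return max(min_r, min(max_r, desired))  # clamp to [minReplicas, maxReplicas]

print(hpa_desired(3, current_cpu=95, target_cpu=70))   # overshoot: scale up to 5
print(hpa_desired(15, current_cpu=10, target_cpu=70))  # idle: back down to minReplicas (3)
```

The proportionality is the point: a small overshoot adds one or two Pods, a traffic spike adds many at once, and utilization at exactly the target leaves the replica count alone.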

No code changes. No manual intervention. The Bulkhead Pattern adds per-namespace resource quotas so one noisy service cannot consume all cluster capacity.


๐Ÿ› ๏ธ Minikube & k3s: Containerizing a Spring Boot App and Deploying to Kubernetes

Minikube runs a single-node Kubernetes cluster locally on your laptop: the fastest way to test Deployments, Services, and HPAs without a cloud account. k3s is a lightweight, production-grade K8s distribution packaged as a single binary, ideal for edge, IoT, and CI pipelines.

The example below containerizes a Spring Boot application and deploys it as a Kubernetes Deployment with a Service, applying the control-loop, Pod, and Deployment concepts from this post end-to-end.

```dockerfile
# Dockerfile: multi-stage build for a Spring Boot fat JAR
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY . .
RUN ./mvnw package -DskipTests

FROM eclipse-temurin:21-jre
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```

```yaml
# deployment.yaml: Kubernetes Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3                   # desired state: K8s control loop maintains exactly 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: acme/payments-api:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:       # K8s won't route traffic until Spring Boot is ready
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```

```bash
# Local development with Minikube
minikube start
eval $(minikube docker-env)            # point Docker CLI at Minikube's daemon
docker build -t acme/payments-api:1.0.0 .
kubectl apply -f deployment.yaml
kubectl rollout status deployment/payments-api
kubectl port-forward svc/payments-api 8080:80
curl http://localhost:8080/actuator/health
```

Spring Boot's Actuator /actuator/health/readiness endpoint maps directly to the Kubernetes readinessProbe: the control loop will not route traffic to a Pod until the probe returns 200 OK, preventing the cold-start request failures described in the lessons section below.

For a full deep-dive on deploying Spring Boot to Kubernetes with Helm, GitOps, and Argo CD, a dedicated follow-up post is planned.


📚 Lessons from Running Kubernetes in Production

  • Never skip resource requests and limits. A single Pod without limits can evict its entire Node's neighbours during a memory spike. This is the number-one newcomer mistake.
  • Liveness probes kill healthy Pods. An HTTP health check that times out during a garbage collection pause triggers a restart loop. Use startupProbe for JVM-based services.
  • Namespaces are cost and policy boundaries. Use separate namespaces per team and per environment (payments-prod, payments-staging). Add ResourceQuota and LimitRange from day one.
  • kubectl apply is idempotent; kubectl create is not. Use apply in CI/CD pipelines so re-runs never fail on "already exists."
  • etcd is the cluster brain; back it up. Managed K8s (GKE/EKS/AKS) handles etcd backup automatically; self-hosted clusters need a dedicated etcd backup CronJob.

📌 TLDR: Summary & Key Takeaways

  • Pods are the atomic scheduling unit; they wrap one or more containers sharing a network namespace.
  • Deployments declare a desired replica count; Kubernetes's control loop maintains it indefinitely and rolls out updates safely.
  • Services give stable DNS names and virtual IPs to ephemeral Pods, providing transparent load balancing without hardcoded IPs.
  • The control loop (desired state → observe → reconcile) is the core idea; every other K8s feature is a specific implementation of it.
  • etcd holds all desired state; the scheduler, controllers, and kubelet all read from and write to it.
  • HPA auto-scales Pods based on metrics: no manual scaling during traffic spikes.
  • Kubernetes trades server management toil for platform management toil; reach for it when you have real scaling or resilience problems, not as a default for every project.


Written by Abstract Algorithms (@abstractalgorithms)