
Handbook for Kubernetes Platforms at Scale

Running Kubernetes at scale isn’t about deploying many clusters. It’s about enabling the business through safe, repeatable, and observable delivery. A few years ago most teams had one cluster; today many organizations operate many clusters across clouds, regions, data centers, and the edge. That proliferation is driven by tenant isolation needs, hardware differences, security boundaries, and the reality of multiple environments (dev/stage/prod) plus compliance zones.

This is the personal handbook I use to design, implement, and operate platforms. It covers the principles, trade‑offs, and concrete practices that make Kubernetes platforms manageable at scale.

Guiding Paradigms

GitOps

GitOps is the foundational delivery model for this handbook. Desired state lives in Git and pull-based controllers continuously reconcile clusters to match that state. The value is auditability, automated drift reconciliation, safe rollbacks, and repeatability.
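
To make the reconcile idea concrete, here is a minimal in-memory sketch in Python. It is not how Argo CD or Flux are implemented; the names plan, reconcile, and Action are hypothetical, and plain dicts stand in for Git (desired) and the cluster (actual). It only illustrates the core loop: diff the desired state against the actual state, then converge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str  # "create", "update", or "delete"
    name: str  # name of the resource the action applies to


def plan(desired: dict, actual: dict) -> list:
    """Compute the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            # Drift: the live object differs from what Git declares.
            actions.append(Action("update", name))
    for name in sorted(actual):
        if name not in desired:
            # Pruning: the object exists in the cluster but not in Git.
            actions.append(Action("delete", name))
    return actions


def reconcile(desired: dict, actual: dict) -> list:
    """Apply the plan to the in-memory `actual` state and return the actions taken."""
    actions = plan(desired, actual)
    for action in actions:
        if action.verb == "delete":
            del actual[action.name]
        else:
            actual[action.name] = dict(desired[action.name])
    return actions


if __name__ == "__main__":
    git_state = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    cluster_state = {"web": {"replicas": 5}, "old-job": {"replicas": 1}}
    for action in reconcile(git_state, cluster_state):
        print(action.verb, action.name)
    # A second pass finds nothing to do once the states match.
    print(reconcile(git_state, cluster_state))
```

Running the loop again after convergence is a no-op, which is the property that makes drift correction and rollbacks (reverting the commit, then reconciling) safe to automate.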

For detailed practices, repository patterns, and CI/CD workflows see GitOps Practices.

Platform Engineering

Platform teams own the full cluster lifecycle and the product that developers consume: you create the paved road. That includes what's in the stack, how it's updated and governed, and who owns what. Day‑2 operations matter as much as day‑1 provisioning.

There is always a central balancing act: strong, centralized consistency and guardrails vs. developer autonomy and speed. Mature platforms expose self‑service within safe defaults: golden templates, quotas, policies, and clear ownership boundaries.
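
One way to picture self‑service within safe defaults is a small admission function: the request is merged over platform defaults and then checked against platform limits. The sketch below is illustrative only. The field names, DEFAULTS, LIMITS, and the admit function are assumptions made up for this example, not part of any real tool.

```python
# Hypothetical platform defaults and ceilings; the values are illustrative only.
DEFAULTS = {"cpu": 1, "memory_gib": 2, "replicas": 2}
LIMITS = {"cpu": 8, "memory_gib": 32, "replicas": 10}


def admit(request: dict):
    """Merge a self-service request over the defaults and check it against the limits.

    Returns (effective_request, violations). An empty violations list means
    the request stays inside the guardrails and can be applied without a ticket.
    """
    effective = {**DEFAULTS, **request}
    violations = []
    for field in sorted(effective):
        if field not in LIMITS:
            violations.append(f"{field}: not an allowed field")
        elif effective[field] > LIMITS[field]:
            violations.append(f"{field}: {effective[field]} exceeds limit {LIMITS[field]}")
    return effective, violations


if __name__ == "__main__":
    print(admit({"cpu": 4}))             # within guardrails, defaults fill the rest
    print(admit({"cpu": 16, "gpu": 1}))  # rejected with two violations
```

In a real platform this logic would usually live in an admission policy engine rather than in application code; the point is that developers get an immediate, explainable answer instead of a review queue.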

Some opinions I’ve formed in the field:

  • Draw a hard line between core platform and app‑level infra:
    • Core platform is clusters, VPC, and IAM. It should be owned centrally by the platform team.
    • App‑level infra is buckets, queues, and databases. Teams should be able to self‑serve with guardrails.
  • Use the right tool per layer:
    • Cloud-agnostic IaC like Terraform or Pulumi is best for foundational cloud construction and gives you portability.
    • Crossplane shines for app‑adjacent resources via claims because it keeps the developer feedback loop inside Kubernetes. However, you'll need to introduce strong policy enforcement to ensure proper guardrails.
  • Do not overcomplicate the platform. More tools = more maintenance, cognitive load, and points of failure. Keep it lean and documented.
  • Don't build what you don't need yet. An IDP makes sense at 50+ developers, not 5. A service mesh solves specific problems. If you don't have those problems, you don't need the mesh.

You can read more about the concept of platform engineering here: https://platformengineering.org/

Key Architecture Decisions

  • Cattle, not pets. Be ready to recreate clusters; don't hand‑raise them. Treat a cluster as completely ephemeral: the state lives in Git.
  • Cluster per environment vs namespace per environment: prefer cluster‑per‑env beyond the tiniest shops. It limits blast radius and simplifies upgrades and guardrails.
  • Multi‑tenancy vs cluster per team: in the cloud I lean toward cluster per team for isolation and simpler RBAC and policy. When costs force multi‑tenancy, invest in quotas, network policy, and admission control.
  • Cloud‑agnostic vs. cloud‑native: be pragmatic. Keep portability at the interfaces (container, ingress, storage classes, identity) while embracing managed services where they meaningfully reduce toil.
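
For the multi‑tenancy case, a per‑tenant baseline can be generated rather than hand‑written. The following Python sketch builds a Namespace, a ResourceQuota, and a default‑deny NetworkPolicy as plain dicts and renders them as a JSON List. The function names, the tenant label, and the quota values are assumptions for illustration; admission control is not covered here.

```python
import json


def tenant_baseline(tenant: str, cpu: str = "4", memory: str = "8Gi", pods: str = "50") -> list:
    """Return baseline manifests for one tenant namespace (illustrative values)."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": tenant, "labels": {"tenant": tenant}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": tenant},
        # Caps the total requests and pod count across the namespace.
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory, "pods": pods}},
    }
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": tenant},
        # An empty podSelector selects every pod in the namespace; listing both
        # policy types without any rules denies all ingress and egress.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    return [namespace, quota, default_deny]


def render(tenant: str) -> str:
    """Render the baseline as a single JSON document wrapped in a v1 List."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": tenant_baseline(tenant)},
        indent=2,
    )


if __name__ == "__main__":
    print(render("team-a"))
```

The output is meant to be committed to Git and reconciled like any other manifest, so every tenant starts from the same quota and network posture and exceptions show up as reviewable diffs.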

For detailed architectural patterns, tooling choices, and minimal platform stacks see Platform Architecture.

Diving Deeper

This handbook goes beyond principles. Each section translates theory into concrete implementation guidance:

  • GitOps Practices: Repository patterns, CI/CD separation, trunk-based workflows, and why pull-based reconciliation matters.
  • Platform Architecture: Cluster topology decisions, minimal viable platform stack, and IaC tooling trade-offs.
  • Operating Model: Access control, developer onboarding, observability, policy enforcement, FinOps, and the often-overlooked Day-2 operations.
  • Reference Architectures: Real-world implementations showing centralized, decentralized, and hybrid patterns in practice.

The reference architectures section is where theory meets reality. It shows how these principles and decisions play out in actual deployments across different organizational contexts and constraints.

Closing Principles

  • Keep the blast radius small. Prefer isolation over clever multi‑tenancy.
  • Everything declarative, everything documented, everything observable.
  • Fewer tools, less cognitive load, faster recovery.
  • Self‑service within guardrails beats tickets and shadow ops.

This is a living handbook. I’ll keep refining it as the platform and the reality of running clusters at scale evolve.