Handbook for Running Kubernetes Platforms at Scale

Running Kubernetes at scale isn’t about “making clusters.” It’s about enabling the business through safe, repeatable, and observable delivery. A few years ago most teams had one cluster. Today many organizations operate lots of clusters across clouds, regions, data centers, and the edge. That proliferation is driven by tenant isolation needs, hardware differences, security boundaries, and the reality of multiple environments (dev/stage/prod) plus compliance zones.

This is the handbook I use to build platforms: the principles, trade‑offs, and concrete practices that keep Kubernetes platforms manageable at scale.

Guiding Paradigms

GitOps

I stick to the canonical definitions from the GitOps community (e.g., Codefresh and OpenGitOps): desired state lives in Git, and a pull‑based controller continuously reconciles clusters to match that state. The value is simple: auditability, automated drift reconciliation, safe rollbacks, and repeatability.

GitOps best practices I recommend

  • Fully declarative everything: clusters, platform components, and app workloads.
  • Separate source code, infra, Argo CD Application manifests, and platform manifests into dedicated repositories. They have different lifecycles and different owners, which calls for a strict separation of concerns.
  • Eliminate drift with Kustomize or Helm plus continuous reconciliation. Kubernetes’ superpower is reconciliation against a consistent API; use it.
  • Environment folders, not long‑lived branches. Promotion is a PR from dev → stage → prod with version pins. Branches are for work, not for state.
  • Trunk‑based development. Keep changes small, reviewable, and flowing.
  • Agents reconcile automatically and are the only writers to the cluster. Humans commit to Git; controllers write the cluster. This closed loop eliminates silent configuration drift.
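
A minimal sketch of that closed loop, assuming an Argo CD Application watching an environment folder (repo URL, project, and resource names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: payments
  source:
    repoURL: https://github.com/example-org/payments-k8s-resources.git
    targetRevision: main          # trunk-based: state lives in folders, not branches
    path: envs/prod               # environment folder with pinned versions
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true                 # remove resources that were deleted from Git
      selfHeal: true              # revert manual changes, no silent drift
    syncOptions:
      - CreateNamespace=true
```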

My recommendation for repo separation that works at scale

You want clear separation of concerns. Infra and app manifests should not live in the same repo. I use three repos for the platform layer and two repos per dev team / domain.

| Repository | Purpose | Ownership |
| --- | --- | --- |
| platform | Installs Argo CD and core platform components | Platform (cluster-admin) |
| argocd-apps | Argo CD root application (app-of-apps pattern) | Platform |
| k8s-resources | Platform workload manifests (custom apps, post-deploy tasks, etc.) | Platform |
| <team>-argocd-apps | Team Argo CD root app (projects, RBAC, cluster targets scaffolded) | Dev Team |
| <team>-k8s-resources | App manifests (Helm/Kustomize) + optional CI that bumps env version pins | Dev Team |

Why this split?

  • Platform is minimal and stable. It brings up Argo CD and only the bare minimum of components Argo CD itself needs to run (ESO, ingress controller, ...).
  • Environments are the source of truth for what runs where.
  • Argo CD root apps hold all RBAC-related config. When onboarding a new dev team, the platform team provides a default scaffolding template with the correct Argo CD projects and cluster names pre-configured (see the sketch after this list).
  • Team app repos own their runtime and declare only what they need, not how the platform is built.
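
For illustration, the scaffolding handed to a new team in its <team>-argocd-apps repo could pre-configure an AppProject plus a root app-of-apps Application roughly like this; the team name, repo URLs, and cluster endpoint are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  description: Payments team applications
  sourceRepos:
    - https://github.com/example-org/payments-k8s-resources.git
  destinations:
    - server: https://payments-prod.example.internal   # pre-configured cluster target
      namespace: payments-*
  # authz is granted per project via Argo CD SSO groups (see Access Model below)
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-payments-root
  namespace: argocd
spec:
  project: team-payments
  source:
    repoURL: https://github.com/example-org/payments-argocd-apps.git
    targetRevision: main
    path: apps                    # child Application manifests live here (app-of-apps)
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```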

Platform Engineering

Platform teams own the full cluster lifecycle and the product that developers consume: you create the paved road. That includes what’s in the stack, how it’s updated and governed, and who owns what. Day‑2 operations matter as much as day‑1 provisioning.

There is always a central balancing act: strong, centralized consistency and guardrails vs. developer autonomy and speed. Mature platforms expose self‑service within safe defaults: golden templates, quotas, policies, and clear ownership boundaries.

Some opinions I’ve formed in the field:

  • Draw a hard line between core platform and app‑level infra:
    • Core platform is clusters, VPCs, and IAM. It should be owned centrally by the platform team.
    • App‑level infra is buckets, queues, and databases. Teams should be able to self‑serve it with guardrails.
  • Use the right tool per layer:
    • Terraform or cloud‑native IaC is great for foundational cloud construction.
    • Crossplane shines for app‑adjacent resources via claims because it keeps the developer feedback loop inside Kubernetes. However, you'll need to introduce strong policy enforcement to ensure proper guardrails.
  • Do not overcomplicate the platform. More tools = more maintenance, cognitive load, and points of failure. Keep it lean and documented.
  • Don't build what you don't need yet. An IDP makes sense at 50+ developers, not 5. A service mesh solves specific problems. If you don't have those problems, you don't need the mesh.

You can read more about the concept of platform engineering here: https://platformengineering.org/

Key Architecture Decisions

  • Cattle, not pets. Be ready to recreate clusters; don’t hand‑raise them. A cluster should be treated as completely ephemeral; its state lives in Git.
  • Cluster per environment vs. namespace per environment: prefer cluster‑per‑env beyond the tiniest shops. It simplifies blast radius, upgrades, and guardrails.
  • Multi‑tenancy vs. cluster per team: in cloud I lean toward cluster per team for isolation and simpler RBAC and policy. When costs force multi‑tenancy, invest in quotas, network policy, and admission control.
  • Cloud‑agnostic vs. cloud‑native: be pragmatic. Keep portability at the interfaces (container, ingress, storage classes, identity) while embracing managed services where they meaningfully reduce toil.

Operating Model and Practices

Access Model: GitOps‑First

  • Default: no direct cluster access. Developers interact with Argo CD; the platform team holds admin. Make exceptions only for a few trusted DevEx champions.
  • Centralize authz in Argo CD using SSO groups. Keep cluster RBAC scoped and minimal.

Why: it removes config drift, centralizes changelog/audit in Git, and keeps the platform reproducible.
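
A minimal sketch of centralizing authz in Argo CD with SSO groups; the group and project names below are made up:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly            # anyone authenticated can look, nobody can touch
  policy.csv: |
    # platform admins (SSO group) get full admin
    g, sso-platform-admins, role:admin
    # payments devs may manage applications only inside their own project
    p, role:team-payments, applications, *, team-payments/*, allow
    g, sso-team-payments, role:team-payments
```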

Minimal Viable Platform Stack

Keep it lean. More tools means more maintenance.

| Capability | Purpose | Examples |
| --- | --- | --- |
| Secrets | Manage app and platform secrets | Sealed Secrets, External Secrets Operator, cloud secret managers |
| Continuous Delivery | Reconcile desired state to clusters | Argo CD with app‑of‑apps |
| Policy | Enforce guardrails and defaults | Kyverno |
| Packaging | Reusable app and platform manifests | Helm, Kustomize overlays |
| Networking and Ingress | North‑south and east‑west traffic | CNI of choice, Gateway API or Ingress Controller, mesh only when needed |
| Storage and CSI | Persistent storage and backups | Cloud CSI drivers, Velero |
| Observability | Metrics, logs, tracing, dashboards, alerts | Prometheus, Loki or ELK, Tempo or Jaeger, Grafana and Alertmanager |
| Supply Chain Security | Scan, sign, and verify | Trivy, Cosign, admission checks |

Document every component and the SLO it supports. If it isn’t declarative and documented, it’s tech debt.
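
As one example from the stack above, an External Secrets Operator ExternalSecret that pulls a value from a cloud secret manager might look roughly like this; the ClusterSecretStore name and the remote secret path are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db-credentials
  namespace: team-payments
spec:
  refreshInterval: 1h                      # periodically re-sync from the backing store
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager              # configured once by the platform team
  target:
    name: payments-db-credentials          # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db
        property: password
```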

Developer Onboarding and Templates

  • Provide golden repo templates: an app service with Helm or Kustomize, Crossplane claims for common resources, and example tests plus basic alerts.
  • Offer preview environments via GitOps for PRs where it makes sense.
  • Keep service‑level docs in the repo: runbooks, dependency contracts, and SLOs.

Environment Parity and Promotion

  • Keep environments aligned. Promote via PRs through each environment in order; do not let prod drift through "one‑off fixes".
  • Pin chart and module versions; upgrade in lower envs first, promote once healthy. No long‑lived env branches.
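
A sketch of what such a pin can look like with Kustomize, assuming an envs/<env> folder layout and an illustrative image name; promotion to prod is then a PR that copies the same pin into envs/prod:

```yaml
# envs/stage/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                    # shared manifests, identical across environments
images:
  - name: registry.example.com/payments/api
    newTag: 1.42.3                # the only thing that changes between environments
```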

Observability and Alerting

  • Golden signals for every service. Latency, traffic, errors, and saturation.
  • SLOs with burn‑rate alerts. Page on user impact. Ticket on toil.
  • Platform dashboards for cluster health, control plane, etcd, CNI and CSI, and Argo sync status.
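
For example, a fast-burn alert for a 99.9% availability SLO (multi-window burn-rate style, assuming a checkout service and a generic http_requests_total metric) could be sketched as a PrometheusRule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-burn
  namespace: monitoring
spec:
  groups:
    - name: checkout-slo
      rules:
        - alert: CheckoutFastErrorBudgetBurn
          expr: |
            (
              sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{job="checkout"}[5m]))
            ) > (14.4 * 0.001)    # burning the 99.9% error budget ~14x too fast
          for: 2m
          labels:
            severity: page         # page on user impact
          annotations:
            summary: checkout is burning its error budget too fast
```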

Policies and Security Automation

  • Kyverno policy packs: required labels and annotations, disallow :latest, enforce resource requests/limits, restrict hostPath and privileged containers, default NetworkPolicy, restrict node selectors, runtimeClass, image provenance (see the policy sketch after this list).
  • Security baseline: CIS Kubernetes Benchmarks and related guidance where relevant (NSA/CISA). Automate checks in CI and at admission where possible.
  • Supply chain: sign images with Cosign and verify in admission; scan images and manifests; block secrets via scanners.
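
As one concrete policy from such a pack, a Kyverno rule that disallows the :latest tag might look like this; starting in Audit mode and flipping to Enforce once teams are compliant is a reasonable rollout path:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # use Audit first, then flip to Enforce
  background: true
  rules:
    - name: validate-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, ':latest' is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```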

FinOps and Cost Guardrails

  • Right‑size requests and limits and review periodic reports. Enforce defaults with policy; Robusta’s Kubernetes Resource Recommender (KRR) is a solid baseline, and KEDA plus HPA/VPA keep scaling tight so idle resources don’t cost you money.
  • TTL for ephemeral resources and preview namespaces.
  • Use tools like kor to find unused and dangling resources.

Central Crossplane Pattern (Self‑Service with Guardrails)

  • Run Crossplane in a management cluster or account. Configure provider credentials to assume minimal roles into workload accounts and projects.
  • Publish Compositions for common app resources like Postgres, buckets, queues, and cache with defaults and quotas. Developers request them via Kubernetes claims, and Crossplane provisions them in the right cloud account (see the sketch below).
  • Keep core cloud guardrails like VPCs, org policies, and identity in Terraform or native cloud IaC owned by the platform or landing‑zone team.

The result: app teams get fast, safe infra self‑service, while the platform keeps control and visibility.
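
A developer-facing claim might look roughly like the sketch below; the API group, kind, and parameters are whatever the platform team publishes in its XRD and Composition, so treat all names here as placeholders:

```yaml
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance            # claim kind published by the platform's Composition
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    version: "16"
    storageGB: 20                   # quotas and limits enforced by policy on the claim
  compositionSelector:
    matchLabels:
      provider: aws                 # select the Composition for the target cloud
  writeConnectionSecretToRef:
    name: orders-db-conn            # the app reads credentials from this Secret
```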

Central Argo CD Platform Cluster Pattern

  • Run a single highly available Argo CD instance in a management (platform) account. It holds the root apps and platform components.
  • Automated onboarding of new clusters via GitOps using my custom integrations for both Azure and AWS:
    • Automated Cluster Onboarding Using Terraform
    • Declarative Cluster Onboarding Using Terraform & External Secrets Operator
  • Using the custom integrations we can easily recreate downstream workload clusters in a disaster recovery scenario or if we need to scale out to new regions or accounts.
    • Declarative onboarding requires an extra commit to the Argo CD platform repository, but you do not need to re-run each cluster’s Terraform workspace if the management cluster needs to be redeployed (sketched below).
      • This is the classic reconciliation loop vs. push‑based approach trade-off; pick your poison.
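
Under the hood, declarative onboarding boils down to committing (or rendering via External Secrets Operator) an Argo CD cluster Secret like the sketch below, here in an EKS flavour with placeholder endpoint, role ARN, and CA:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: workload-prod-eu
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # Argo CD registers this as a cluster
type: Opaque
stringData:
  name: workload-prod-eu
  server: https://ABC123.gr7.eu-west-1.eks.amazonaws.com   # placeholder API endpoint
  config: |
    {
      "awsAuthConfig": {
        "clusterName": "workload-prod-eu",
        "roleARN": "arn:aws:iam::111111111111:role/argocd-deployer"
      },
      "tlsClientConfig": {
        "caData": "<base64-encoded-cluster-ca>"
      }
    }
```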

Choosing Your Platform Stack

How to Build Clusters

Start with one question: what will you run? Regulatory constraints, latency, data gravity, GPU needs, and SLAs will shape choices.

  • Managed control planes like EKS, GKE, and AKS: the fastest path in the cloud, reducing undifferentiated ops.
  • Cluster API for consistent provisioning across clouds and on‑prem. Great for fleet standardization (see the sketch after this list).
  • RKE2 / K3s (or equivalents) on self‑hosted: full control when you need it, but budget for lifecycle automation, OS hardening, and upgrades.
  • Baseline OS images with Packer; automate node bootstrap; keep images small and boring.
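
As a rough sketch of the Cluster API option, a fleet repo holds provider-agnostic Cluster objects that point at provider-specific resources; the names, namespace, and CIDRs below are arbitrary and the AWS kinds are just one example:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-eu-1
  namespace: fleet
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:                  # provider-agnostic pointers to provider-specific CRs
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: workload-eu-1-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: workload-eu-1
```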

How to Deliver Software

WARNING

Push‑based pipelines that apply manifests directly to clusters are an anti‑pattern for Kubernetes. They fight reconciliation, hide drift, and fragment the audit trail. If you want push‑only delivery, you do not need Kubernetes.

Embrace pull‑based with agents

  • CI builds, tests, scans, and signs artifacts.
  • CI commits or updates version pins in the env Git repo (a minimal sketch follows this list).
  • Argo CD pulls, reconciles, and surfaces health and drift. Rollbacks are a Git revert and effectively instant.
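
To make the pin-bumping step concrete, here is a deliberately simple sketch in GitHub Actions syntax; the repo name, token secret, and image are assumptions, it assumes kustomize is available on the runner, and in practice you would open a PR for higher environments instead of pushing directly:

```yaml
# .github/workflows/promote-stage.yaml
name: bump-stage-pin
on:
  workflow_dispatch:
    inputs:
      tag:
        description: Image tag that passed CI
        required: true
jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: example-org/payments-k8s-resources
          token: ${{ secrets.ENV_REPO_TOKEN }}   # token with write access to the env repo
      - name: Update the stage pin
        run: |
          cd envs/stage
          kustomize edit set image registry.example.com/payments/api:${{ github.event.inputs.tag }}
      - name: Commit the new pin
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@example.com"
          git commit -am "promote payments/api ${{ github.event.inputs.tag }} to stage"
          git push
```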

Day-2 operations

Each cluster needs to be maintained:

  • Regular Kubernetes version upgrades
  • At least monthly OS updates
  • Regular certificate renewal and continuous security scanning
  • Any other required changes

Some cloud distros make these operations easier, but they rarely cover all your needs, so expect to do some heavy lifting here.

Reference Architectures

The diagrams below capture typical platform topologies, guardrails, and delivery flows that I use as starting points and adapt per client:

TODO

Closing Principles

  • Keep the blast radius small. Prefer isolation over clever multi‑tenancy.
  • Everything declarative, everything documented, everything observable.
  • Fewer tools, less cognitive load, faster recovery.
  • Self‑service within guardrails beats tickets and shadow ops.

This is a living handbook. I’ll keep refining it as the platform and the reality of running clusters at scale evolves.