Platform Architecture Decisions
Making the right architectural decisions early shapes your platform's scalability, operational overhead, and developer experience. This section covers cluster topology choices, minimal platform components, and infrastructure tooling trade-offs.
How to Build Clusters
Start with two questions:
1. What will you run?
Workload shape drives everything. Regulatory constraints, latency, data gravity, GPU demand, and SLAs narrow the viable paths. If you need GPUs, strict data residency, or ultra-low latency, you immediately prune options.
2. Where will you run it?
Cloud or on-prem is the primary fork. Cloud gives you managed control planes and faster iteration. On-prem gives you sovereignty and hardware-level control but adds lifecycle overhead. Go hybrid only if you have a clear reason: a residency split, edge latency, or a staged migration.
Cloud
Managed control planes like EKS, GKE, and AKS are almost always the right choice in cloud. They reduce undifferentiated ops and give you the fastest path to production.
If you need fleet standardization, Cluster API or a custom Terraform module works well for consistent provisioning across clouds and edge.
On-Prem
RKE2 and K3s give you full control but require custom controllers, lifecycle automation, OS hardening, and upgrade orchestration. Rancher products excel here because they are batteries-included distributions with built-in security hardening, lifecycle management, and fleet tooling, and they are cloud-agnostic.
Build baseline OS images with Packer, automate node bootstrap, and keep images small and boring.
Minimal Viable Platform Stack
Key Advice
Keep it lean. More tools means more maintenance and cognitive load.
Below are what I consider the minimum capabilities a modern production-ready platform should have:
| Capability | Purpose | Examples |
|---|---|---|
| Secrets | Manage app and platform secrets | Sealed Secrets, External Secrets Operator, cloud secret managers |
| Continuous Delivery | Reconcile desired state to clusters | Argo CD with app-of-apps |
| Policy | Enforce guardrails and defaults | Kyverno |
| Packaging | Reusable app and platform manifests | Helm, Kustomize overlays |
| Networking and Ingress | North-south and east-west traffic | CNI of choice, Gateway API or Ingress Controller, mesh only when needed |
| Storage and CSI | Persistent storage and backups | Cloud CSI drivers, Velero |
| Observability | Metrics, logs, tracing, dashboards, alerts | Prometheus, Loki or ELK, Tempo or Jaeger, Grafana and Alertmanager |
| Supply Chain Security | Scan, sign, and verify | Trivy, Cosign, admission checks |
Document every component and the SLO it supports. If a component isn't declarative and documented, treat it as technical debt.
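To make "declarative" concrete for the Continuous Delivery row, here is a minimal sketch of an Argo CD app-of-apps root. It assumes Argo CD is installed in the `argocd` namespace; the repository URL, the `apps` path, and the `platform-root` name are hypothetical placeholders, not values from this article.

```yaml
# Root Application: Argo CD syncs every child Application manifest
# found under `path`, so the whole platform stack is reconciled from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root            # hypothetical name
  namespace: argocd              # assumes the default Argo CD namespace
spec:
  project: default
  source:
    repoURL: https://git.example.org/platform/gitops.git  # hypothetical repo
    targetRevision: main
    path: apps                   # hypothetical directory of child Applications
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: argocd            # child Application objects live here
  syncPolicy:
    automated:
      prune: true                # remove resources deleted from Git
      selfHeal: true             # revert manual drift in the cluster
```

Each file under `apps` would be another `Application` pointing at one platform component, which keeps the inventory of components reviewable in a single directory.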
Use one tool for each purpose: it's rarely a good idea to rely on two tools that do the same thing. I say rarely because there are exceptions, like using both Crossplane and Terraform. That combo can make sense when you want to keep developers focused purely on Kubernetes manifests. Of course, it adds extra strain on the platform team, who now have to manage another tool for infrastructure, but it can also reduce day-to-day operational work.
At the end of the day, platform design is about organizational context and making deliberate choices with clear trade-offs. Only adopt features if they genuinely support your business goals or cultural direction.
Infrastructure as Code Tooling
When choosing an infrastructure automation tool, weigh portability, how often you provision clusters, and who needs to provision infrastructure. Beyond the technical criteria, there are often cultural reasons to prefer one tool over another.
Cloud-agnostic is almost always the right choice. Tools like Terraform, Pulumi, and Crossplane let you define infrastructure that can move between clouds without a complete rewrite, although provider-specific resources still need adapting. Provider-native IaC like CloudFormation, ARM templates, or AWS CDK locks you into a single provider. That might be fine if you're certain you'll never need portability, but most organizations eventually hit multi-cloud requirements through acquisition, compliance, or strategic diversification.
The main questions to ask yourself:
How fast and how often do you need to deploy new clusters? If you need to bootstrap new clusters daily, you are best served by fleet-level automation such as Cluster API, Rancher Fleet, or similar tools. If your cluster requirements are fairly static, I would opt for a more controlled approach like Terraform or Pulumi.
Who needs to interact with infrastructure provisioning? If only the platform team provisions infrastructure, Terraform or Pulumi works well. If developers need self-service infrastructure, Crossplane Compositions let them request resources via Kubernetes manifests without leaving their workflow.
General guidance:
- Keep core cloud guardrails like VPCs, org policies, and identity in Terraform, Pulumi, or equivalent IaC owned by the platform or landing zone team.
- Publish Crossplane Compositions for common app resources like Postgres, buckets, queues, and caches with defaults and quotas. Developers request them via Kubernetes claims.
- Prefer cloud-agnostic tooling unless you have a compelling reason to lock in.
Common Anti-Pattern: Single IaC Tool for Everything
I've seen teams use a single IaC tool like Terraform to manage every cloud resource, including app-level infrastructure that developers need daily like databases, buckets, queues, and caches. This creates a bottleneck.
Developers end up filing tickets or opening PRs in the platform team's IaC repo just to get a database. The platform team becomes a request queue, not a product team. Velocity drops and frustration builds on both sides.
The fix is simple: draw a clear boundary.
Use Terraform or Pulumi for foundational infrastructure the platform team owns: VPCs, IAM roles, org policies, cluster provisioning, and baseline security guardrails. These change infrequently and need tight central control.
Use Crossplane Compositions (or equivalent) for app-adjacent resources that developers self-serve: databases, buckets, queues, caches. Developers submit a claim via a Kubernetes manifest and get back a resource that meets platform standards. The platform team defines the Compositions once with quotas, tagging, backup policies, and encryption baked in. Developers get autonomy within guardrails.
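As a rough sketch of what such a claim can look like, assuming Crossplane v1-style claims: the API group `database.example.org`, the `PostgreSQLInstance` kind, and everything under `spec.parameters` are hypothetical, because the platform team defines that schema itself in its CompositeResourceDefinition. Only `compositionSelector` and `writeConnectionSecretToRef` are standard claim fields.

```yaml
# Developer-facing claim: a request for a database that the platform
# team's Composition turns into real cloud resources.
apiVersion: database.example.org/v1alpha1   # hypothetical group/version from the XRD
kind: PostgreSQLInstance                    # hypothetical claim kind
metadata:
  name: orders-db
  namespace: team-orders                    # claims are namespaced
spec:
  parameters:                               # hypothetical, platform-defined schema
    storageGB: 20
    version: "15"
  compositionSelector:
    matchLabels:
      provider: aws                         # hypothetical label on a Composition
  writeConnectionSecretToRef:
    name: orders-db-conn                    # Secret with connection details, created in this namespace
```

The developer sees only a few parameters. Quotas, tagging, backup policies, and encryption stay inside the Composition, which is where the guardrails live.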
This isn't about replacing your IaC tool. It's about using the right tool at the right layer. Terraform and Pulumi excel at foundational cloud construction with operator-driven workflows. Crossplane excels at developer self-service within Kubernetes through pull-based reconciliation.
If you force everything through a single IaC tool you're optimizing for central control at the expense of developer velocity. That trade-off rarely makes sense past a certain team size.
One-liners
Some strong one-liners to keep in mind when you are designing a platform:
- Ease of use almost always means vendor lock-in.
- Tooling should solve problems, not create new ones.
- The best engineering decisions are about trade-offs, not trends.
