Operating Model and Best Practices
Building a platform is one thing. Running it successfully is another. This section covers the operational practices, access models, and hard-earned lessons from keeping platforms healthy and teams productive across different organizations and scales.
Access Model: GitOps‑First
- Default: no direct cluster access. Developers interact with Argo CD; the platform team holds admin. Make exceptions only for a few trusted DevEx champions.
- Centralize authZ in Argo CD using SSO groups (see the sketch below). Keep cluster RBAC scoped and minimal. Preferably use an external provider for authN.
Why: it removes config drift, centralizes changelog/audit in Git, and keeps the platform reproducible.
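As a minimal sketch of what centralizing authZ in Argo CD can look like, the RBAC ConfigMap below maps SSO groups to scoped roles; the group and project names are placeholders for whatever your identity provider exposes:

```yaml
# argocd-rbac-cm: map SSO groups to Argo CD roles (group and project names are placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # Developers can view and sync applications in their team's project only
    p, role:team-a-dev, applications, get, team-a/*, allow
    p, role:team-a-dev, applications, sync, team-a/*, allow
    g, sso-group-team-a, role:team-a-dev
    # The platform team holds admin
    g, sso-group-platform, role:admin
```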
Developer Onboarding and Templates
Developer onboarding should be a smooth experience for all parties involved. It will greatly determine the perceived quality of your platform.
Getting developers to deploy applications should be as easy as handing them two template repositories.
The platform team should provide the following golden repo templates:
- Argo CD root app repo with Helm or Kustomize.
  - Shows the required Argo CD project structure, cluster names, ... and demonstrates an example deployment (see the sketch after this list).
- Kubernetes resources repository with an example deployment and Crossplane claims for common app-related cloud resources. You could also include example tests and basic alerts.
  - This repo should also include a link to the Argo CD instance if you are using a centralized approach.
- Offer preview environments via GitOps for PRs where it makes sense.
- Keep service-level docs in their respective repos:
  - Source code docs live in the source code repos.
  - Document app usage decisions in the Kubernetes resources repo, as well as any dependencies and SLOs.
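To make the root app template concrete, here is a minimal app-of-apps sketch for such a repo; the repo URL, project name, path layout, and namespaces are placeholders you would replace with your own conventions:

```yaml
# Root Application pointing at a directory of child Application manifests (app-of-apps pattern).
# All names, URLs, and paths below are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-root
  namespace: argocd
spec:
  project: team-a                       # Argo CD project the team is scoped to
  source:
    repoURL: https://git.example.com/team-a/argocd-apps.git
    targetRevision: main
    path: apps/dev                      # one directory per environment
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-dev
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The same template works with Kustomize overlays or Helm values per environment; the point is that a new team only edits names and paths, not structure.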
Environment Parity and Promotion
- Keep environments aligned. Promote changes via PRs through each environment in order; do not let production drift through "one-off fixes".
- Pin chart and module versions; upgrade in lower envs first, promote once healthy. No long‑lived env branches.
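One way to keep versions pinned and promotion PR-driven is to give each environment its own Application with an explicit chart version; the chart, repo URL, and namespaces here are illustrative:

```yaml
# Dev environment Application: bump targetRevision here first, then promote the same
# change to staging and prod by PR. Chart name and repo URL are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-dev
  namespace: argocd
spec:
  project: payments
  source:
    repoURL: https://charts.example.com
    chart: payments
    targetRevision: 1.4.2     # pinned chart version, never a branch or "latest"
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-dev
```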
Observability and Alerting
- Golden signals for every service: latency, traffic, errors, and saturation.
- SLOs with burn-rate alerts (see the sketch after this list). Page on user impact. Ticket on toil.
- Platform dashboards for cluster health, control plane, etcd, CNI and CSI, and Argo sync status.
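As a rough sketch of a burn-rate alert for a 99.9% availability SLO, following the common multi-window pattern; the metric names, job label, and thresholds are placeholders you would adapt to your own SLOs:

```yaml
# Fast-burn alert: error ratio exceeds 14.4x the 0.1% error budget over both 1h and 5m windows.
# Metric and job names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo-burn
spec:
  groups:
    - name: payments-slo
      rules:
        - alert: PaymentsHighErrorBudgetBurn
          expr: |
            (
              sum(rate(http_requests_total{job="payments", code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="payments"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="payments", code=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{job="payments"}[5m]))
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: page
          annotations:
            summary: "payments is burning its error budget roughly 14x too fast"
```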
Policies and Security Automation
- Kyverno policy packs: required labels and annotations, disallow :latest, enforce resource requests/limits, restrict hostPath and privileged containers, default NetworkPolicy, restrict node selectors and runtimeClass, image provenance (see the policy sketch after this list).
- Security baseline: CIS Kubernetes Benchmarks and related guidance where relevant (NSA/CISA). Automate checks in CI and at admission where possible.
- Supply chain: sign images with Cosign and verify in admission; scan images and manifests; block secrets via scanners.
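As a minimal sketch of one such Kyverno policy (scope and enforcement action would be tuned per cluster), a pack entry that rejects :latest and untagged images could look like this:

```yaml
# Sketch: require a pinned image tag and reject :latest. Adapted from the common
# disallow-latest-tag pattern; tune match scope and exclusions for your clusters.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "An explicit image tag is required."
        pattern:
          spec:
            containers:
              - image: "*:*"
    - name: disallow-latest
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Mutable tags such as :latest are not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```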
FinOps and Cost Guardrails
- Right-size requests and limits and review periodic reports. Enforce defaults with policy (a LimitRange sketch follows this list); Robusta's KRR (Kubernetes Resource Recommender) is a good baseline, and KEDA plus HPA/VPA keep scaling tight so idle resources do not cost you money.
- TTL for ephemeral resources and preview namespaces.
- Use tools like kor to find unused and dangling resources.
- Kubecost for cost visibility and optimization.
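One simple way to enforce request/limit defaults is a per-namespace LimitRange (a Kyverno mutate policy works just as well); the namespace and values below are placeholders:

```yaml
# Defaults applied to containers that declare no requests or limits. Values are illustrative.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: team-a-dev
spec:
  limits:
    - type: Container
      defaultRequest:       # used when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:              # used when a container sets no limits
        cpu: 500m
        memory: 512Mi
```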
Day-2 Operations
Day-2 operations cover the ongoing maintenance and lifecycle management of your clusters. This is where platforms succeed or fail at scale. Automation is not optional.
Cluster Lifecycle Management
Each cluster needs regular maintenance:
- Kubernetes version upgrades: Plan quarterly upgrades at minimum. Test in lower environments first.
- OS updates: At least monthly patches for nodes. Automate with node image pipelines or managed node groups.
- Certificate renewal: Automate certificate rotation. Do not rely on manual renewal.
- Security scanning: Continuous vulnerability scanning for images, manifests, and cluster configuration.
- Capacity planning: Monitor resource utilization trends. Scale or optimize before hitting limits.
Cloud vs On-Prem Trade-offs
Managed cloud distributions (EKS, GKE, AKS) handle some Day-2 tasks:
- Control plane upgrades managed by the provider.
- Managed node pools with automated OS patching.
- Integrated certificate management.
However, they rarely cover everything:
- You still own workload compatibility testing across Kubernetes versions.
- Addon upgrades (CNI, CSI, observability agents) remain your responsibility.
- Policy enforcement and configuration drift detection require custom automation.
On-prem or self-managed clusters require full lifecycle ownership:
- Custom controllers for OS image updates and node rotation.
- Automated backup and disaster recovery procedures.
- Manual or scripted control plane upgrades with careful testing.
- Hardware lifecycle management and capacity forecasting.
Automation Strategies
- Use GitOps for cluster configuration updates. Version and test changes in lower environments.
- Implement automated node image pipelines with Packer or equivalent.
- Build runbooks for emergency procedures. Practice disaster recovery regularly.
- Monitor control plane and etcd health. Alert on certificate expiration windows (see the sketch after this list).
- Document upgrade procedures and maintain a tested rollback plan.
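For the certificate-expiration alert mentioned above, here is a sketch assuming the blackbox exporter is already probing the relevant endpoints; the threshold and labels are placeholders:

```yaml
# Alert when any probed TLS certificate expires within 21 days.
# probe_ssl_earliest_cert_expiry is exposed by the blackbox exporter.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry
spec:
  groups:
    - name: certificates
      rules:
        - alert: CertificateExpiringSoon
          expr: (probe_ssl_earliest_cert_expiry - time()) < 21 * 24 * 3600
          for: 1h
          labels:
            severity: ticket
          annotations:
            summary: "TLS certificate for {{ $labels.instance }} expires in under 21 days"
```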
The operational burden is real. Budget team capacity for Day-2 work or accept the risk of running outdated, vulnerable clusters.
Lessons from the Field
Platform transformations are rarely purely technical. The hardest challenges are cultural and organizational, not architectural. Here's what I've learned deploying platforms across greenfield startups and large brownfield enterprises.
Transformations Are Cultural First
Adopting GitOps, platform engineering, and Kubernetes at scale requires shifts in how teams work, not just what tools they use.
Developers need to embrace declarative thinking and trust reconciliation loops instead of imperative scripts. Operations teams need to move from ticket-driven requests to product ownership. Leadership needs to accept that platform teams are not cost centers but force multipliers.
The technical implementation is often straightforward. The organizational change is where most transformations stall.
Invest early in:
- Clear ownership boundaries between platform and app teams.
- Training and documentation that meets teams where they are.
- Early wins that demonstrate value before asking for major workflow changes.
- Executive sponsorship that protects platform investment when short-term pressure builds.
If you treat this as a tooling migration instead of a cultural shift, you will struggle.
Greenfield vs Brownfield Reality
Greenfield projects are easier. You can architect cleanly from day one without legacy constraints, political baggage, or migration risk. The handbook principles apply directly.
Brownfield transformations require pragmatism and sequencing.
You will inherit:
- Existing deployment pipelines teams depend on.
- Snowflake clusters with undocumented manual changes.
- Hard-coded configuration scattered across wikis, scripts, and institutional knowledge.
- Compliance requirements that weren't designed for declarative workflows.
Your migration strategy matters as much as your target architecture.
Start small:
- Pick one new project or non-critical workload to prove the model.
- Run dual-stack during migration. Let old pipelines coexist with GitOps until trust builds.
- Document everything as you migrate. Turn tribal knowledge into declarative config.
- Automate the boring parts first. Let teams feel the toil reduction before asking them to change workflows.
Do not attempt a big-bang migration. Brownfield transformations succeed through incremental adoption and early proof points, not grand plans.
Common Migration Pitfalls
Underestimating the effort to make existing workloads declarative. Migrating 200 apps that were deployed via bash scripts and manual clicks into Helm charts or Kustomize manifests takes real engineering time.
Skipping the platform team entirely. Telling app teams to "just adopt GitOps" without providing templates, guardrails, or support creates chaos and resentment.
Over-engineering before proving value. Building a full Internal Developer Platform with service catalogs, golden paths, and policy engines before running a single workload in production is a recipe for wasted effort and misalignment with actual needs.
Ignoring the operational learning curve. GitOps and Kubernetes introduce new failure modes. Teams need time to learn debugging, observability, and incident response in this model. Budget for that learning.
Organizational Change Management
Platform engineering is a product discipline. Your customers are internal development teams. Treat them like customers.
- Gather feedback continuously. What's blocking them? What's causing toil?
- Prioritize developer experience as much as technical correctness.
- Provide clear migration guides and golden templates.
- Celebrate early adopters and share their success stories.
- Accept that adoption will be uneven. Some teams will move fast, others will resist. Meet them where they are.
The best platform is one teams actually use. If you build something technically perfect but culturally misaligned, it will fail.
Technology is the easy part. People and process are where platforms succeed or collapse.
