Operating Model and Best Practices
Building a platform is one thing. Running it successfully is another. This section covers the operational practices, access models, and hard-earned lessons from keeping platforms healthy and teams productive across different organizations and scales.
Access Model: GitOps‑First
- Default: no direct cluster access. Developers interact with Argo CD; the platform team holds admin. Make exceptions only for a few trusted DevEx champions.
- Centralize authZ in Argo CD using SSO groups (see the sketch below). Keep cluster RBAC scoped and minimal. Preferably use an external provider for authN.
Why: it removes config drift, centralizes changelog/audit in Git, and keeps the platform reproducible.
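As a minimal sketch of what centralizing authZ in Argo CD can look like, the RBAC ConfigMap below maps SSO groups to scoped roles; the group and project names are placeholders for whatever your identity provider exposes:

```yaml
# argocd-rbac-cm: map SSO groups to Argo CD roles (group and project names are placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # Developers can view and sync applications in their team's project only
    p, role:team-a-dev, applications, get, team-a/*, allow
    p, role:team-a-dev, applications, sync, team-a/*, allow
    g, sso-group-team-a, role:team-a-dev
    # The platform team holds admin
    g, sso-group-platform, role:admin
```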
Developer Onboarding and Templates
Developer onboarding should be a smooth experience for all parties involved. It will greatly determine the perceived quality of your platform.
Getting developers to deploy applications should be as easy as handing them two template repositories.
The platform team should provide the following golden repo templates:
- Argo CD root app repo with Helm or Kustomize.
  - Shows the required Argo CD project structure, cluster names, ... and demonstrates an example deployment (see the sketch after this list).
- Kubernetes resources repository with an example deployment and Crossplane claims for common app-related cloud resources. You could also include example tests and basic alerts.
  - This repo should also include a link to the Argo CD instance if you are using a centralized approach.
- Offer preview environments via GitOps for PRs where it makes sense.
- Keep service-level docs in their respective repos:
  - Source code docs live in the source code repos.
  - Document app usage decisions in the Kubernetes resources repo, as well as any dependencies and SLOs.
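To make the root app template concrete, here is a minimal app-of-apps sketch for such a repo; the repo URL, project name, path layout, and namespaces are placeholders you would replace with your own conventions:

```yaml
# Root Application pointing at a directory of child Application manifests (app-of-apps pattern).
# All names, URLs, and paths below are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-root
  namespace: argocd
spec:
  project: team-a                       # Argo CD project the team is scoped to
  source:
    repoURL: https://git.example.com/team-a/argocd-apps.git
    targetRevision: main
    path: apps/dev                      # one directory per environment
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-dev
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The same template works with Kustomize overlays or Helm values per environment; the point is that a new team only edits names and paths, not structure.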
Environment Parity and Promotion
- Keep environments aligned. Promote changes via PRs through each environment in order; do not let production drift through "one-off fixes".
- Pin chart and module versions; upgrade in lower envs first, promote once healthy. No long‑lived env branches.
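One way to keep versions pinned and promotion PR-driven is to give each environment its own Application with an explicit chart version; the chart, repo URL, and namespaces here are illustrative:

```yaml
# Dev environment Application: bump targetRevision here first, then promote the same
# change to staging and prod by PR. Chart name and repo URL are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-dev
  namespace: argocd
spec:
  project: payments
  source:
    repoURL: https://charts.example.com
    chart: payments
    targetRevision: 1.4.2     # pinned chart version, never a branch or "latest"
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-dev
```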
Observability and Alerting
- Golden signals for every service: latency, traffic, errors, and saturation.
- SLOs with burn-rate alerts (see the sketch after this list). Page on user impact. Ticket on toil.
- Platform dashboards for cluster health, control plane, etcd, CNI and CSI, and Argo sync status.
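As a rough sketch of a burn-rate alert for a 99.9% availability SLO, following the common multi-window pattern; the metric names, job label, and thresholds are placeholders you would adapt to your own SLOs:

```yaml
# Fast-burn alert: error ratio exceeds 14.4x the 0.1% error budget over both 1h and 5m windows.
# Metric and job names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo-burn
spec:
  groups:
    - name: payments-slo
      rules:
        - alert: PaymentsHighErrorBudgetBurn
          expr: |
            (
              sum(rate(http_requests_total{job="payments", code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="payments"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="payments", code=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{job="payments"}[5m]))
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: page
          annotations:
            summary: "payments is burning its error budget roughly 14x too fast"
```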
Policies and Security Automation
- Kyverno policy packs: required labels and annotations, disallow :latest, enforce resource requests/limits, restrict hostPath and privileged containers, default NetworkPolicy, restrict node selectors and runtimeClass, image provenance (see the policy sketch after this list).
- Security baseline: CIS Kubernetes Benchmarks and related guidance where relevant (NSA/CISA). Automate checks in CI and at admission where possible.
- Supply chain: sign images with Cosign and verify in admission; scan images and manifests; block secrets via scanners.
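As a minimal sketch of one such Kyverno policy (scope and enforcement action would be tuned per cluster), a pack entry that rejects :latest and untagged images could look like this:

```yaml
# Sketch: require a pinned image tag and reject :latest. Adapted from the common
# disallow-latest-tag pattern; tune match scope and exclusions for your clusters.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "An explicit image tag is required."
        pattern:
          spec:
            containers:
              - image: "*:*"
    - name: disallow-latest
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Mutable tags such as :latest are not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```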
FinOps and Cost Guardrails
- Right-size requests and limits and review periodic reports. Enforce defaults with policy (a LimitRange sketch follows this list); Robusta's KRR (Kubernetes Resource Recommender) is a good baseline, and KEDA plus HPA/VPA keep scaling tight so idle resources do not cost you money.
- TTL for ephemeral resources and preview namespaces.
- Use tools like kor to find unused and dangling resources.
- Kubecost for cost visibility and optimization.
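One simple way to enforce request/limit defaults is a per-namespace LimitRange (a Kyverno mutate policy works just as well); the namespace and values below are placeholders:

```yaml
# Defaults applied to containers that declare no requests or limits. Values are illustrative.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: team-a-dev
spec:
  limits:
    - type: Container
      defaultRequest:       # used when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:              # used when a container sets no limits
        cpu: 500m
        memory: 512Mi
```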
Day-2 Operations
Day-2 operations cover the ongoing maintenance and lifecycle management of your clusters. This is where platforms succeed or fail at scale. Automation is not optional.
Cluster Lifecycle Management
Each cluster needs regular maintenance:
- Kubernetes version upgrades: Plan quarterly upgrades at minimum. Test in lower environments first.
- OS updates: At least monthly patches for nodes. Automate with node image pipelines or managed node groups.
- Certificate renewal: Automate certificate rotation. Do not rely on manual renewal.
- Security scanning: Continuous vulnerability scanning for images, manifests, and cluster configuration.
- Capacity planning: Monitor resource utilization trends. Scale or optimize before hitting limits.
Cloud vs On-Prem Trade-offs
Managed cloud distributions (EKS, GKE, AKS) handle some Day-2 tasks:
- Control plane upgrades managed by the provider.
- Managed node pools with automated OS patching.
- Integrated certificate management.
However, they rarely cover everything:
- You still own workload compatibility testing across Kubernetes versions.
- Addon upgrades (CNI, CSI, observability agents) remain your responsibility.
- Policy enforcement and configuration drift detection require custom automation.
On-prem or self-managed clusters require full lifecycle ownership:
- Custom controllers for OS image updates and node rotation.
- Automated backup and disaster recovery procedures.
- Manual or scripted control plane upgrades with careful testing.
- Hardware lifecycle management and capacity forecasting.
Automation Strategies
- Use GitOps for cluster configuration updates. Version and test changes in lower environments.
- Implement automated node image pipelines with Packer or equivalent.
- Build runbooks for emergency procedures. Practice disaster recovery regularly.
- Monitor control plane and etcd health. Alert on certificate expiration windows (see the sketch after this list).
- Document upgrade procedures and maintain a tested rollback plan.
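For the certificate-expiration alert mentioned above, here is a sketch assuming the blackbox exporter is already probing the relevant endpoints; the threshold and labels are placeholders:

```yaml
# Alert when any probed TLS certificate expires within 21 days.
# probe_ssl_earliest_cert_expiry is exposed by the blackbox exporter.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry
spec:
  groups:
    - name: certificates
      rules:
        - alert: CertificateExpiringSoon
          expr: (probe_ssl_earliest_cert_expiry - time()) < 21 * 24 * 3600
          for: 1h
          labels:
            severity: ticket
          annotations:
            summary: "TLS certificate for {{ $labels.instance }} expires in under 21 days"
```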
The operational burden is real. Budget team capacity for Day-2 work or accept the risk of running outdated, vulnerable clusters.
Lessons from the Field
Platform transformations are rarely purely technical. The hardest challenges are cultural and organizational, not architectural. Here's what I've learned deploying platforms across greenfield startups and large brownfield enterprises.
Transformations Are Cultural First
Adopting GitOps, platform engineering, and Kubernetes at scale requires shifts in how teams work, not just what tools they use.
Developers need to embrace declarative thinking and trust reconciliation loops instead of imperative scripts. Operations teams need to move from ticket-driven requests to product ownership. Leadership needs to accept that platform teams are not cost centers but force multipliers.
The technical implementation is often straightforward. The organizational change is where most transformations stall.
Invest early in:
- Clear ownership boundaries between platform and app teams.
- Training and documentation that meets teams where they are.
- Early wins that demonstrate value before asking for major workflow changes.
- Executive sponsorship that protects platform investment when short-term pressure builds.
If you treat this as a tooling migration instead of a cultural shift, you will struggle.
Greenfield vs Brownfield Reality
Greenfield projects are easier. You can architect cleanly from day one without legacy constraints, political baggage, or migration risk. The handbook principles apply directly.
Brownfield transformations require pragmatism and sequencing.
You will inherit:
- Existing deployment pipelines teams depend on.
- Snowflake clusters with undocumented manual changes.
- Hard-coded configuration scattered across wikis, scripts, and institutional knowledge.
- Compliance requirements that weren't designed for declarative workflows.
Your migration strategy matters as much as your target architecture.
Start small:
- Pick one new project or non-critical workload to prove the model.
- Run dual-stack during migration. Let old pipelines coexist with GitOps until trust builds.
- Document everything as you migrate. Turn tribal knowledge into declarative config.
- Automate the boring parts first. Let teams feel the toil reduction before asking them to change workflows.
Do not attempt a big-bang migration. Brownfield transformations succeed through incremental adoption and early proof points, not grand plans.
Common Migration Pitfalls
Underestimating the effort to make existing workloads declarative. Migrating 200 apps that were deployed via bash scripts and manual clicks into Helm charts or Kustomize manifests takes real engineering time.
Skipping the platform team entirely. Telling app teams to "just adopt GitOps" without providing templates, guardrails, or support creates chaos and resentment.
Over-engineering before proving value. Building a full Internal Developer Platform with service catalogs, golden paths, and policy engines before running a single workload in production is a recipe for wasted effort and misalignment with actual needs.
Ignoring the operational learning curve. GitOps and Kubernetes introduce new failure modes. Teams need time to learn debugging, observability, and incident response in this model. Budget for that learning.
Organizational Change Management
Platform engineering is a product discipline. Your customers are internal development teams. Treat them like customers.
- Gather feedback continuously. What's blocking them? What's causing toil?
- Prioritize developer experience as much as technical correctness.
- Provide clear migration guides and golden templates.
- Celebrate early adopters and share their success stories.
- Accept that adoption will be uneven. Some teams will move fast, others will resist. Meet them where they are.
The best platform is one teams actually use. If you build something technically perfect but culturally misaligned, it will fail.
Technology is the easy part. People and process are where platforms succeed or collapse.
