Client overview
- Industry focus
- Enterprise SaaS
- Portfolio segment
- SaaS / Enterprise
- Organization profile
- B2B SaaS unicorn scale-up, ~650 engineers across 11 tribes
Kubernetes sprawl multiplied as teams self-served clusters without guardrails; security exceptions piled up faster than remediation. Incident retrospectives blamed "culture" without measurable platform leverage. CFO questioned rising cloud spend nonlinear with ARR.
Problem
Undifferentiated infra toil and inconsistent pipelines slowed safe releases and inflated incident rates.
Each tribe maintained bespoke Terraform forks; drift detection was aspirational only. QA environments diverged wildly from prod, masking defects until Fridays.
Secrets rotation playbooks lived in Notion pages engineers ignored until scanners screamed.
No tiered service catalog meant product teams negotiated capacity via Slack instead of APIs.
Solution
Internal developer platform with golden Terraform modules, GitOps promotion model, ephemeral preview environments per PR, automated policy checks, and SRE coaching embedded in squads.
Platform team published opinionated stacks (Node/Java baseline) with baked-in observability exporters and cost allocation tags. Backstage portal exposed self-service RDS + Redis patterns with quotas.
Deploy pipelines promoted artifacts through staging governed by progressive delivery (Argo Rollouts & canaries). Synthetic monitors gated promotions using user-journey probes.
Incident tooling integrated PagerDuty with unified runbooks referencing live query packs; blameless RCA templates fed roadmap funding for systemic fixes.
Implementation
- 1
Baseline chaos
Tagged services by criticality map; injected failure drills exposing missing circuit breakers. Established error budget math leadership could understand financially.
- 2
Golden path rollout
Pilot tribes adopted templates; friction points prioritized weekly. Coaches paired with skeptical teams skeptical on "central platform" narrative.
- 3
Continuous compliance
Policy-as-code for network segments and IAM; automated evidence exports for SOC2 auditors.
Tools & platforms
- Backstage
- Argo CD/Rollouts
- Terraform Cloud
- OPA/Gatekeeper
- GitHub Actions
Engineering challenges addressed
- Negotiating autonomy vs. standards — solved with escape hatches taxed via architecture review SLA.
Program artifacts & environments


Tech stack
- Kubernetes
- Terraform
- Argo CD
- Prometheus
- Grafana
- PagerDuty
- AWS
- GitHub Actions
Results
- Deploy frequency per service up 4.2× YoY median
- MTTR down 61% after platform telemetry + runbooks
- Annual Sev-1 count down 58% YoY despite traffic growth
Quantified impact
61% MTTR improvement
Measured from page to verified mitigation.
Cloud unit cost per MAU −19%
Rightsizing plus autoscaling tuned to request patterns.
Key takeaways
- Platform engineering must sell reductions in cognitive load — not abstract "best practices."
- Golden paths without escape valves breed shadow IT worse than no paths.
- Executive sponsorship links reliability investment to revenue risk — quantify it.
