Client overview
- Industry focus: Developer tools
- Portfolio segment: SaaS / Enterprise
- Organization profile: Series D developer workflow SaaS, ~55k tenants
Large enterprise tenants saturated shared clusters during CI spikes; smaller tenants saw latency that tracked their Fortune 500 neighbors' usage. Finance questioned infra COGS ahead of IPO readiness. Security wanted a logical-isolation story stronger than "we partition by tenant_id."
Problem
Shared-pool multi-tenant clusters caused noisy-neighbor incidents and gave enterprise procurement a weak isolation story.
Postgres connection storms during heavy CI bursts stalled unrelated OLTP workloads. Background jobs lacked fair scheduling across tenant tiers.
Backup/restore granularity could not meet enterprise RPO without also restoring neighboring tenants, which was unacceptable.
Sales promised "dedicated-like" SLAs without pricing guardrails from engineering.
Solution
Cell architecture with deterministic routing, tiered pools (shared vs. performance vs. isolated cells), automated placement algorithm using historical CPU/IO signatures, per-cell observability, and backup scopes matching commercial contracts.
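As an illustration of the automated placement step, a minimal sketch might reduce each tenant's historical CPU and IO samples to a p95 signature and map it to one of the three pool tiers. This is not the program's actual algorithm: the threshold values, the `Signature` type, and the function names are hypothetical, and the source states only that placement used historical CPU/IO signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    cpu_p95: float  # cores
    io_p95: float   # IOPS


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def fingerprint(cpu: list[float], io: list[float]) -> Signature:
    """Collapse historical samples into a compact CPU/IO signature."""
    return Signature(cpu_p95=p95(cpu), io_p95=p95(io))


# Hypothetical tier thresholds; real values would come from capacity data.
ISOLATED = Signature(cpu_p95=32.0, io_p95=20000.0)
PERFORMANCE = Signature(cpu_p95=8.0, io_p95=5000.0)


def placement_bucket(sig: Signature) -> str:
    """Pick the smallest tier whose limits the tenant stays under."""
    if sig.cpu_p95 >= ISOLATED.cpu_p95 or sig.io_p95 >= ISOLATED.io_p95:
        return "isolated"
    if sig.cpu_p95 >= PERFORMANCE.cpu_p95 or sig.io_p95 >= PERFORMANCE.io_p95:
        return "performance"
    return "shared"
```

With a nearest-rank p95, one spike in twenty samples does not move a tenant out of the shared pool, while sustained load does.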
The edge gateway resolves tenant→cell using signed config served from the control plane with strong consistency; migrations are orchestrated online with dual-write verification windows.
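The tenant→cell lookup could be sketched as follows. The sketch assumes an HMAC-SHA256 signature over a JSON routing table and a monotonically increasing version number. The source does not say which signing scheme or config format was used, so `sign_config`, `CellRouter`, and the payload layout are hypothetical.

```python
import hashlib
import hmac
import json


def sign_config(routes: dict[str, str], version: int, key: bytes) -> dict:
    """Control-plane side: serialize the routing table and attach an HMAC."""
    payload = json.dumps({"version": version, "routes": routes}, sort_keys=True)
    signature = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


class CellRouter:
    """Gateway side: accept only authentic, newer routing tables."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._version = -1
        self._routes: dict[str, str] = {}

    def load(self, signed: dict) -> bool:
        expected = hmac.new(
            self._key, signed["payload"].encode(), hashlib.sha256
        ).hexdigest()
        if not hmac.compare_digest(expected, signed["signature"]):
            return False  # tampered, or signed with a different key
        config = json.loads(signed["payload"])
        if config["version"] <= self._version:
            return False  # never fall back to an older or equal table
        self._version = config["version"]
        self._routes = config["routes"]
        return True

    def resolve(self, tenant_id: str) -> str:
        # Deterministic for a given table version; raises KeyError if unknown.
        return self._routes[tenant_id]
```

Rejecting stale versions keeps a gateway from routing a migrated tenant back to its old cell if an outdated table is replayed.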
Cells run on smaller Kubernetes clusters to limit blast radius, with regional pairs for DR. Each cell has its own Postgres, sharded horizontally only when telemetry demands it.
Cost simulation priced "isolation uplift" transparently for sales overlays; FinOps dashboards compared cell utilization quarterly.
Implementation
- 1. Traffic fingerprinting: Historical metrics modeled tenant CPU/IO signatures and classified tenants into placement buckets. The model was validated during stress weeks.
- 2. Pilot cell + enterprise migrations: Highest-ARR tenants moved first; synthetic probes verified routing. Rollback runbooks were rehearsed with Terraform state discipline.
- 3. Automated rebalancing: Quarterly jobs suggested migrations with change risk scores; CAB approved large moves.
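The quarterly rebalancing job in step 3 could be sketched along these lines. The scoring inputs, weights, saturation points, and the `CAB_THRESHOLD` cutoff are all illustrative assumptions; the source says only that suggested migrations carried change risk scores and that large moves went to CAB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tenant_id: str
    source_cell: str
    target_cell: str
    data_gb: float
    peak_rps: float
    enterprise: bool


# Hypothetical cutoff above which a move needs CAB approval.
CAB_THRESHOLD = 0.6


def change_risk(c: Candidate) -> float:
    """Score in [0, 1]; weights and saturation points are assumptions."""
    size = min(c.data_gb / 1000.0, 1.0)      # saturates at 1 TB
    traffic = min(c.peak_rps / 5000.0, 1.0)  # saturates at 5k requests/s
    tier = 1.0 if c.enterprise else 0.0
    return 0.5 * size + 0.3 * traffic + 0.2 * tier


def plan_moves(candidates: list[Candidate]) -> list[dict]:
    """Order suggestions from lowest to highest risk and flag CAB reviews."""
    scored = sorted(
        ((change_risk(c), c) for c in candidates), key=lambda pair: pair[0]
    )
    return [
        {
            "tenant": c.tenant_id,
            "move": f"{c.source_cell} -> {c.target_cell}",
            "risk": risk,
            "needs_cab": risk >= CAB_THRESHOLD,
        }
        for risk, c in scored
    ]
```

Sorting by risk lets low-risk moves proceed first while the flagged ones wait for review.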
Tools & platforms
- Envoy tenant filters
- Crossplane
- Terraform
- Cilium network policies
Engineering challenges addressed
- Online migration cutovers without dual-write bugs on idempotency keys.
- Educating support on new diagnostics when tenants asked "which cell am I on?"
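To make the idempotency-key concern concrete, here is a minimal sketch of a dual-write verification window, using in-memory dictionaries in place of the per-cell databases. `CellStore`, `dual_write`, and `verify` are hypothetical names, and real cutover tooling would also need failure handling and ordering guarantees that are omitted here.

```python
class CellStore:
    """In-memory stand-in for one cell's database."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}  # idempotency key -> payload

    def apply(self, key: str, payload: dict) -> bool:
        # First write wins; a replay with the same key is a no-op.
        if key in self.rows:
            return False
        self.rows[key] = payload
        return True


def dual_write(source: CellStore, target: CellStore, key: str, payload: dict) -> None:
    """Apply the same keyed write to the old cell and the new cell."""
    source.apply(key, payload)
    target.apply(key, payload)


def verify(source: CellStore, target: CellStore) -> list[str]:
    """Return the keys that are missing on one side or differ in content."""
    keys = source.rows.keys() | target.rows.keys()
    return sorted(k for k in keys if source.rows.get(k) != target.rows.get(k))
```

Because each store ignores a repeated key, a client retry during the window cannot create duplicates, and replaying a write that reached only one cell repairs the gap. Cutover would proceed only once `verify` returns an empty list.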
Tech stack
- Kubernetes
- Envoy
- PostgreSQL
- Kafka
- Crossplane
- Terraform
- AWS
- Go
- OpenTelemetry
Results
- Noisy-neighbor Sev-2 frequency down 82% YoY after the program
- Median API latency for SMB tenants improved 38% after rebalancing
- Infra unit cost per MAU down 24% via higher average utilization
Quantified impact
- 82% incident reduction: incidents attributable to the noisy-neighbor class, after cell rollout.
- 24% infra cost per MAU reduction: blended across cells, including rightsizing gains.
Key takeaways
- Multi-tenancy is a product + finance problem — isolation tiers need commercial truth, not engineering idealism.
- Placement automation beats manual sharding projects that never finish.
- Migration tooling quality determines whether sales will ever confidently sell isolation again.
