Client overview
- Industry focus: Enterprise SaaS
- Portfolio segment: SaaS / Enterprise
- Organization profile: B2B vertical SaaS, ARR ~$140M, 1,900 customers
Product velocity remained high, but enterprise accounts were benchmarking the product against competitors in POCs. CS flagged "spinning dashboards" during quarter close; engineering blamed the database but lacked unified tracing from browser to warehouse. In diligence calls, investors linked soft gross retention to qualitative performance complaints.
Problem
Tail latency and dashboard stalls threatened enterprise renewals; profiling culture was immature.
APM tools reported averages that hid multi-second outliers on invoice list endpoints. ORM-generated queries multiplied during nested dashboard loads, and nightly background exporters monopolized connection pools.
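To illustrate how an average can hide tail latency, here is a minimal sketch that compares the mean with nearest-rank percentiles. The latency values are invented for illustration and are not measurements from this engagement.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 multi-second stalls.
latencies_ms = [120] * 97 + [4000, 5000, 6000]

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

With these made-up numbers the mean stays near 266 ms while p99 is 5,000 ms, which is the kind of gap an averages-only dashboard would not surface.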
The frontend shipped large JS bundles without route-based code splitting; LCP exceeded 4s on low-bandwidth hospitality clients.
Without SLO definitions, teams optimized features rather than customer-impacting journeys.
Solution
An SLO-first program: tracing, a query review board, a Redis edge cache for hot aggregates, worker autoscaling with backpressure, and a frontend performance budget in CI.
OpenTelemetry bridged browser spans to Postgres query plans; a weekly perf council prioritized fixes by dollar-at-risk estimates derived from CS tagging.
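One way to picture dollar-at-risk prioritization is the sketch below. It assumes each performance issue carries the ARR of the accounts CS tagged against it; the `PerfIssue` type, the endpoints, and the dollar figures are all hypothetical, and the source does not describe the council's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfIssue:
    endpoint: str
    # Hypothetical shape: account name -> ARR in dollars, from CS tags.
    arr_by_tagged_account: dict


def dollars_at_risk(issue):
    """Sum the ARR of every account tagged as affected by this issue."""
    return sum(issue.arr_by_tagged_account.values())


def prioritize(issues):
    """Order issues so the largest dollar-at-risk comes first."""
    return sorted(issues, key=dollars_at_risk, reverse=True)


# Illustrative data only.
issues = [
    PerfIssue("/invoices", {"acme": 250_000, "globex": 400_000}),
    PerfIssue("/dashboard/kpis", {"initech": 900_000}),
    PerfIssue("/exports", {}),
]

for issue in prioritize(issues):
    print(issue.endpoint, dollars_at_risk(issue))
```

Summing raw ARR is the simplest possible estimate; a real council might also weight by renewal date or churn likelihood.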
Critical endpoints gained explicit indexes and covering patterns; Hibernate fetch graphs were rewritten where necessary. Redis cached tenant-scoped KPI tiles with TTL jitter.
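TTL jitter spreads cache expirations so that tiles written at the same moment do not all expire, and get recomputed, at once. The sketch below shows the idea under assumed values: the 20% jitter fraction, the key layout, and the function names are illustrative and not taken from the program.

```python
import random


def jittered_ttl(base_ttl_s, jitter_fraction=0.2, rng=random):
    """Return a whole-second TTL drawn uniformly within +/- jitter_fraction of the base."""
    if base_ttl_s <= 0:
        raise ValueError("base_ttl_s must be positive")
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    spread = base_ttl_s * jitter_fraction
    return max(1, round(base_ttl_s + rng.uniform(-spread, spread)))


def tile_cache_key(tenant_id, tile_name):
    """Tenant-scoped key (hypothetical layout) so tiles are never shared across tenants."""
    return f"kpi:{tenant_id}:{tile_name}"


key = tile_cache_key("t-42", "mrr")
ttl = jittered_ttl(300)  # somewhere between 240 and 360 seconds
print(key, ttl)
```

With the redis-py client, the result could be passed as the expiry when writing a tile, for example `client.set(key, payload, ex=jittered_ttl(300))`.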
Adopting Next.js on the marketing app improved LCP; the dashboard SPA deferred heavy charts behind intersection observers so they load only when scrolled into view.
Implementation
1. Instrumentation & baseline
Deployed tracing with sampling tuned for noisy tenants; captured Core Web Vitals in RUM pipelines. Established p95/p99 budgets per endpoint family.
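A per-family latency budget can be expressed as a small table plus a check, as in the sketch below. The budget values and family names are placeholders, since the source does not state the actual thresholds.

```python
import math

# Hypothetical budgets in milliseconds per endpoint family.
BUDGETS_MS = {
    "billing": {"p95": 400, "p99": 900},
    "dashboard": {"p95": 600, "p99": 1500},
}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def budget_violations(family, samples_ms):
    """Return {percentile_name: (observed_ms, budget_ms)} for each exceeded budget."""
    violations = {}
    for name, budget in BUDGETS_MS[family].items():
        observed = percentile(samples_ms, int(name[1:]))  # "p95" -> 95
        if observed > budget:
            violations[name] = (observed, budget)
    return violations


# Illustrative samples: 10% of billing requests take a full second.
print(budget_violations("billing", [200] * 90 + [1000] * 10))
```

A check like this could run against a sliding window of trace data and feed an alert or a weekly report.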
2. Hot path burn-down
Two-sprint cycles per domain team with shared DBA office hours. A kill list of N+1 offenders was published in the wiki for pride/shame accountability.
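An N+1 pattern shows up as the same query shape repeated many times within a single request. The sketch below is a simple heuristic for spotting that in a list of captured SQL strings; the regex normalization and the threshold of 5 are assumptions for illustration, not the tooling the teams used.

```python
import re
from collections import Counter

# Matches quoted strings and standalone integers so they can be replaced by "?".
_LITERAL = re.compile(r"'[^']*'|\b\d+\b")


def query_shape(sql):
    """Collapse literals and whitespace so queries differing only in values match."""
    return " ".join(_LITERAL.sub("?", sql).split()).lower()


def n_plus_one_suspects(queries, threshold=5):
    """Return {shape: count} for shapes seen at least `threshold` times in one request."""
    counts = Counter(query_shape(q) for q in queries)
    return {shape: n for shape, n in counts.items() if n >= threshold}


# Illustrative trace: one parent query followed by one child query per invoice.
trace = ["SELECT * FROM invoices WHERE tenant_id = 7"] + [
    f"SELECT * FROM line_items WHERE invoice_id = {i}" for i in range(10)
]
print(n_plus_one_suspects(trace))
```

Grouping by shape rather than exact text is what makes the repeated child lookups stand out as one offender instead of ten distinct queries.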
3. Renewal firewall
CS playbooks referenced performance attestations before QBR decks; synthetic checks simulated the top 20 tenant configurations hourly.
Tools & platforms
- OpenTelemetry
- Datadog
- pg_stat_statements
- k6
- Redis
- Next.js
Engineering challenges addressed
- Tenant hot spots skewing benchmarks — solved with weighted SLOs by revenue band.
- Balancing cache freshness with finance close deadlines.
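As a rough illustration of SLOs weighted by revenue band, the sketch below computes compliance where each request counts in proportion to its tenant's band. The band names and weights are hypothetical; the source does not give the actual bands or weighting scheme.

```python
# Hypothetical revenue bands and weights (assumption, not from the source).
BAND_WEIGHTS = {"enterprise": 5.0, "mid_market": 2.0, "smb": 1.0}


def weighted_compliance(requests):
    """Weighted fraction of requests within target.

    `requests` is an iterable of (band, within_target) pairs, where
    within_target is True when the request met its latency target.
    """
    total = 0.0
    good = 0.0
    for band, within_target in requests:
        weight = BAND_WEIGHTS[band]
        total += weight
        if within_target:
            good += weight
    if total == 0:
        raise ValueError("no requests to score")
    return good / total


# Illustrative sample: 90 fast SMB requests, 10 slow enterprise requests.
sample = [("smb", True)] * 90 + [("enterprise", False)] * 10
unweighted = sum(ok for _, ok in sample) / len(sample)
print(f"unweighted={unweighted:.2f} weighted={weighted_compliance(sample):.2f}")
```

In this made-up sample, unweighted compliance is 0.90 while the weighted figure is about 0.64, so slow requests from high-revenue tenants are no longer masked by high-volume small tenants.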
Tech stack
- Next.js
- React
- Java
- PostgreSQL
- Redis
- Kafka
- Kubernetes
- AWS
- OpenTelemetry
Results
- 63% reduction in API p99 for billing and dashboard families
- Enterprise logo churn down 3.2 points YoY after performance program
- Median LCP improved from 4.1s to 1.9s on hospitality profile
Quantified impact
- 63% API p99 reduction on targeted routes: pre/post comparison across a 30-day steady state.
- $6.7M expansion pipeline re-engaged: opportunities previously stalled on performance POCs.
Key takeaways
- Performance is a product discipline, not a heroics sprint before renewal season.
- SLOs should be revenue-aware; not all tenants deserve equal latency targets.
- Frontend and backend traces must stitch together; otherwise teams optimize the wrong layers.
