SaaS Performance Engineering Case Study | Latency & Renewals

Client overview

Industry focus: Enterprise SaaS
Portfolio segment: SaaS / Enterprise
Organization profile: B2B vertical SaaS, ARR ~$140M, 1,900 customers

Product velocity remained high, but enterprise accounts were benchmarking the product against competitors in POCs. CS flagged "spinning dashboards" during quarter close; engineering blamed the database but lacked unified tracing from browser to warehouse. In diligence calls, investors linked softness in gross retention to qualitative performance complaints.

Problem

Tail latency and dashboard stalls threatened enterprise renewals; profiling culture was immature.

APM dashboards reported averages, which hid multi-second outliers on invoice list endpoints. ORM-generated queries multiplied during nested dashboard loads (the N+1 pattern). Nightly background exporters monopolized connection pools.
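To illustrate how an average can hide multi-second outliers, here is a minimal sketch that compares the mean with nearest-rank percentiles. The latency values are made up for illustration and are not measurements from this program.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 97 fast requests and 3 stalls.
latencies_ms = [120] * 97 + [4000, 5000, 6000]

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
# prints: mean=266.4 ms  p50=120 ms  p99=5000 ms
```

In this example the mean suggests a sub-300 ms endpoint, while the p99 shows that one request in a hundred takes around five seconds.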

The frontend shipped large JS bundles without route-based code splitting; LCP exceeded 4s for hospitality clients on low-bandwidth connections.

Without SLO definitions, teams optimized features rather than customer-impacting journeys.

Solution

An SLO-first program: end-to-end tracing, a query review board, a Redis edge cache for hot aggregates, worker autoscaling with backpressure, and a frontend performance budget enforced in CI.
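As one way to picture "worker autoscaling with backpressure", the sketch below pairs a bounded job queue (producers are refused when it is full) with a worker-count calculation that is capped so exporters cannot exhaust the connection pool. The class name, function name, and all thresholds are hypothetical; the case study does not describe the actual mechanism.

```python
import math
import queue


class ExportQueue:
    """Bounded queue: rejects new jobs when full so producers must slow down."""

    def __init__(self, max_pending):
        if max_pending < 1:
            raise ValueError("max_pending must be at least 1")
        self._jobs = queue.Queue(maxsize=max_pending)

    def submit(self, job):
        """Return True if the job was accepted, False if the queue is full."""
        try:
            self._jobs.put_nowait(job)
            return True
        except queue.Full:
            return False  # backpressure signal: caller should retry later

    def take(self):
        """Remove and return the oldest job; raises queue.Empty if none."""
        return self._jobs.get_nowait()

    def pending(self):
        return self._jobs.qsize()


def desired_workers(pending, jobs_per_worker, min_workers, max_workers):
    """Scale with queue depth, clamped to [min_workers, max_workers].

    max_workers is assumed to be sized below the database connection limit.
    """
    wanted = math.ceil(pending / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

The cap on workers is the design choice that matters here: scaling on queue depth alone would let a nightly export burst claim every connection, which is the failure mode described in the problem statement.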

OpenTelemetry traces linked browser spans to the Postgres queries behind them, so a slow page load could be followed down to its query plans. A weekly perf council prioritized fixes by dollar-at-risk estimates derived from CS tagging.

Critical endpoints gained explicit indexes and covering-index patterns; Hibernate fetch graphs were rewritten where necessary. Redis cached tenant-scoped KPI tiles with TTL jitter so that entries written together did not all expire at once.
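TTL jitter for tenant-scoped tiles can be sketched as below. The key format, the 10% jitter fraction, and the function names are assumptions for illustration, not details from the program.

```python
import random


def tile_key(tenant_id, tile):
    """Hypothetical tenant-scoped cache key, so tenants never share entries."""
    return f"kpi:{tenant_id}:{tile}"


def jittered_ttl(base_seconds, jitter_fraction=0.1, rng=random):
    """Return the base TTL shifted by a uniform random offset of up to
    +/- jitter_fraction, rounded to whole seconds (minimum 1)."""
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    spread = base_seconds * jitter_fraction
    return max(1, round(base_seconds + rng.uniform(-spread, spread)))


# Example: a nominal 5-minute TTL lands somewhere between 270 and 330 seconds.
ttl = jittered_ttl(300)
print(tile_key("tenant-42", "open_invoices"), ttl)
```

With the redis-py client, a value like this would typically be passed as the `ex` argument of `set`. Spreading expiry times keeps a batch of tiles cached at the same moment from all missing at the same moment and hitting Postgres together.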

Adopting Next.js for the marketing app improved LCP; the dashboard SPA used intersection observers to defer heavy charts until they scrolled into view.
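The frontend performance budget in CI could take many forms; one minimal sketch is a script that compares per-route bundle sizes against limits and reports the routes that exceed them. The route names, sizes, and limits below are hypothetical.

```python
def budget_violations(bundle_sizes_kb, budgets_kb, default_budget_kb):
    """Return {route: (size_kb, limit_kb)} for every route over its budget.

    Routes without an explicit entry in budgets_kb use default_budget_kb.
    """
    violations = {}
    for route, size_kb in sorted(bundle_sizes_kb.items()):
        limit_kb = budgets_kb.get(route, default_budget_kb)
        if size_kb > limit_kb:
            violations[route] = (size_kb, limit_kb)
    return violations


# Hypothetical build output (KB of JavaScript per route) and budgets.
sizes = {"/dashboard": 410, "/invoices": 180, "/settings": 95}
budgets = {"/dashboard": 350}

over = budget_violations(sizes, budgets, default_budget_kb=200)
for route, (size_kb, limit_kb) in over.items():
    print(f"{route}: {size_kb} KB exceeds budget of {limit_kb} KB")
```

A CI job would then fail the build whenever the returned mapping is non-empty, which turns bundle growth into a reviewable event rather than a gradual drift.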

Implementation

1. Instrumentation & baseline
Deployed tracing with sampling tuned for noisy tenants and captured Core Web Vitals in RUM pipelines. Established p95/p99 budgets per endpoint family.
2. Hot path burn-down
Ran two-sprint cycles per domain team, supported by shared DBA office hours. A kill list of N+1 offenders was published in the wiki for pride/shame accountability.
3. Renewal firewall
CS playbooks referenced performance attestations before QBR decks were prepared; synthetic checks simulated the top 20 tenant configurations hourly.
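A kill list of N+1 offenders needs some way to spot them. One simple heuristic, sketched here under the assumption that the SQL statements issued during a single request are available as a list, is to strip literals from each statement and count repeated shapes. The threshold and the sample queries are hypothetical.

```python
import re
from collections import Counter

# Matches single-quoted strings and standalone integers.
_LITERAL = re.compile(r"'[^']*'|\b\d+\b")


def normalize(sql):
    """Replace literals with '?' and collapse whitespace so statements that
    differ only in their parameters share one shape."""
    return re.sub(r"\s+", " ", _LITERAL.sub("?", sql)).strip().lower()


def n_plus_one_suspects(queries, threshold=5):
    """Return {shape: count} for shapes repeated at least threshold times
    within one request."""
    counts = Counter(normalize(q) for q in queries)
    return {shape: n for shape, n in counts.items() if n >= threshold}


# Hypothetical request: one parent query, then one child query per invoice.
request_queries = ["SELECT * FROM invoices WHERE tenant_id = 7"] + [
    f"SELECT * FROM line_items WHERE invoice_id = {i}" for i in range(1, 21)
]

print(n_plus_one_suspects(request_queries))
# prints: {'select * from line_items where invoice_id = ?': 20}
```

pg_stat_statements applies a similar normalization across the whole server, grouping statements that differ only in constants; the per-request view is what ties a repeated shape back to a specific endpoint.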

Tools & platforms

OpenTelemetry
Datadog
pg_stat_statements
k6
Redis
Next.js

Engineering challenges addressed

Tenant hot spots skewing benchmarks: solved with SLOs weighted by revenue band.
Balancing cache freshness with finance close deadlines.
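SLOs weighted by revenue band can be computed in several ways. The sketch below is one assumed formulation: each tenant's success ratio counts once, multiplied by a weight for its band, so a high-traffic tenant cannot dominate through request volume alone. The band names and weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantWindow:
    tenant: str
    revenue_band: str
    good_requests: int   # requests that met the latency target
    total_requests: int


# Hypothetical weights; the case study does not publish its bands.
BAND_WEIGHTS = {"enterprise": 5.0, "mid_market": 2.0, "smb": 1.0}


def weighted_compliance(windows, weights=BAND_WEIGHTS):
    """Weighted mean of per-tenant success ratios, weighted by revenue band.
    Tenants with no traffic in the window are skipped."""
    numerator = 0.0
    denominator = 0.0
    for w in windows:
        if w.total_requests == 0:
            continue
        weight = weights[w.revenue_band]
        numerator += weight * w.good_requests / w.total_requests
        denominator += weight
    if denominator == 0:
        raise ValueError("no tenant traffic in window")
    return numerator / denominator


windows = [
    TenantWindow("acme", "enterprise", good_requests=900, total_requests=1000),
    TenantWindow("corner-cafe", "smb", good_requests=1000, total_requests=1000),
]
print(f"{weighted_compliance(windows):.4f}")
# prints: 0.9167
```

Pooling the same requests without weights would give 1900/2000 = 0.95, so the weighted figure makes the enterprise tenant's misses more visible.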

Program artifacts & environments

[Image: software developer looking at multiple monitors] Query plan review boards paired DBAs with feature teams weekly.

[Image: application performance monitoring charts] Tenant-weighted SLOs changed prioritization debates from opinion to math.

Tech stack

Next.js
React
Java
PostgreSQL
Redis
Kafka
Kubernetes
AWS
OpenTelemetry

Results

63% reduction in API p99 for billing and dashboard families
Enterprise logo churn down 3.2 points YoY after performance program
Median LCP improved from 4.1s to 1.9s on hospitality profile

Quantified impact

63% API p99 reduction on targeted routes: pre/post comparison across a 30-day steady state.
$6.7M expansion pipeline re-engaged: opportunities previously stalled on performance POCs.

Key takeaways

Performance is a product discipline — not a heroics sprint before renewal season.
SLOs should be revenue-aware; not all tenants deserve equal latency targets.
Frontend and backend tracing must stitch — otherwise teams optimize wrong layers.