Client overview
- Industry focus: EdTech
- Portfolio segment: EdTech
- Organization profile: public benefit corporation, ~250 staff, learners in 90+ countries
Product-market fit pushed monthly active users (MAU) past the platform's infrastructure assumptions; Sunday exam windows saturated API clusters. Educators demanded proctored assessments with integrity signals, while finance needed predictable unit economics before the next curriculum partnership with a national ministry.
Problem
Live class and exam bursts overwhelmed the monolithic APIs, and caching inconsistencies corrupted assessment submissions.
Edge POPs lacked tuned WebRTC TURN allocations; learners in LATAM experienced audio drift during breakout rooms. Redis single-node caches caused thundering herds when hero teacher sessions began simultaneously worldwide.
The assessment service stored attempts in the same tables as course content, locking hot partitions. Students occasionally saw stale progress bars, leading to duplicate retries.
Vendor CDN bills spiked nonlinearly because adaptive bitrate ladders were untuned for mobile-first learners on 3G-equivalent links.
Solution
Decomposed hot paths: stateless signaling, sharded Redis, read-scaled Postgres, hierarchical CDN caching with signed segment URLs, and integrity streaming to the proctoring subsystem.
We split read and write paths: a command bus for attempts, CQRS projections for dashboards, and an outbox pattern feeding analytics. The live class control plane moved to dedicated autoscaling groups with predictive scaling informed by timetable imports.
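The read side of that split can be sketched as a pure projection: attempt events emitted by the command side are folded into a dashboard read model, so dashboard traffic never touches the write tables. All names and event shapes here are illustrative, not the production schema.

```typescript
// Minimal CQRS read-model projection sketch (hypothetical event shapes).
type AttemptEvent =
  | { kind: "started"; attemptId: string; learnerId: string }
  | { kind: "scored"; attemptId: string; learnerId: string; score: number };

interface DashboardRow {
  learnerId: string;
  attempts: number;
  bestScore: number;
}

// Fold the event stream into per-learner dashboard rows.
function project(events: AttemptEvent[]): Map<string, DashboardRow> {
  const rows = new Map<string, DashboardRow>();
  for (const e of events) {
    const row = rows.get(e.learnerId) ?? {
      learnerId: e.learnerId,
      attempts: 0,
      bestScore: 0,
    };
    if (e.kind === "started") row.attempts += 1;
    if (e.kind === "scored") row.bestScore = Math.max(row.bestScore, e.score);
    rows.set(e.learnerId, row);
  }
  return rows;
}
```

Because the projection is a deterministic fold, it can be rebuilt from the event log at any time, which is what makes the dashboards safe to scale independently of the attempt writers.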
Redis Cluster with replica reads handled session state and rate limits; fairness policies avoided teacher-induced hotspots. Postgres added partitioning by tenant, with a Citus-style coordinator for heavy tenants.
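The thundering-herd problem from simultaneous hero-teacher sessions is typically tamed with single-flight request coalescing in front of the cache: concurrent misses on the same key share one origin fetch. A minimal sketch, with illustrative names (`fetchFromOrigin`, `coalescedGet`) rather than the production client:

```typescript
// Single-flight coalescing: a cold key triggers one origin fetch,
// and concurrent callers join the in-flight promise.
const inFlight = new Map<string, Promise<string>>();
let originCalls = 0; // instrumentation for this sketch only

async function fetchFromOrigin(key: string): Promise<string> {
  originCalls += 1;
  return `session-data-for-${key}`;
}

async function coalescedGet(key: string): Promise<string> {
  const pending = inFlight.get(key);
  if (pending) return pending; // join the fetch already in flight
  const p = fetchFromOrigin(key).finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

In production the same idea sits between the Redis miss path and Postgres, so a cache expiry at class start produces one database read per key instead of thousands.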
CDN optimization tuned ladder variants per geography; prefetch hints aligned with LMS navigation graphs.
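The signed segment URLs mentioned above were CloudFront signed URLs in this engagement; the hand-rolled HMAC variant below is only a shape sketch of the same idea (path plus expiry signed at issue time, verified before serving). The secret, paths, and parameter names are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const SECRET = "demo-secret"; // hypothetical; a real key lives in a KMS

// Sign a media segment path with an expiry timestamp.
function signSegment(path: string, expiresAt: number): string {
  const sig = createHmac("sha256", SECRET)
    .update(`${path}:${expiresAt}`)
    .digest("hex");
  return `${path}?exp=${expiresAt}&sig=${sig}`;
}

// Verify expiry and signature before serving the segment.
function verifySegment(url: string, now: number): boolean {
  const [path, query] = url.split("?");
  const params = new URLSearchParams(query);
  const exp = Number(params.get("exp"));
  const sig = params.get("sig") ?? "";
  if (!Number.isFinite(exp) || now > exp) return false; // expired or malformed
  const expected = createHmac("sha256", SECRET)
    .update(`${path}:${exp}`)
    .digest("hex");
  return (
    sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig), Buffer.from(expected))
  );
}
```

Short TTLs on segment URLs are what make the hierarchical CDN caching safe: a leaked link stops working within minutes, without invalidating cached objects.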
Implementation
1. Traffic characterization: captured WebRTC telemetry, HTTP waterfall baselines, and Redis slow logs during peak Sundays; chaos experiments injected regional latency to validate adaptive UX messaging.
2. Incremental extraction: strangled assessment APIs behind a BFF with feature flags; replay tests compared scoring outputs against the legacy monolith across millions of archived attempts.
3. Cost-performance tradeoffs: committed-use discounts aligned to predictable exam windows; lifecycle policies moved cold recordings to cheaper storage classes.
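The replay testing in step 2 amounts to scoring every archived attempt twice and flagging divergence. A hedged sketch, where both scorers and the attempt shape are stand-ins for the real services:

```typescript
// Replay harness sketch: score archived attempts through the legacy
// path and the extracted path, report any IDs whose scores diverge.
interface Attempt {
  id: string;
  answers: boolean[]; // true = correct answer, simplified for the sketch
}

type Scorer = (a: Attempt) => number;

// Stand-in scorers; the real ones are the monolith and the new service.
const legacyScore: Scorer = (a) =>
  a.answers.filter(Boolean).length / a.answers.length;

const extractedScore: Scorer = (a) =>
  a.answers.reduce((acc, ok) => acc + (ok ? 1 : 0), 0) / a.answers.length;

function replayCompare(attempts: Attempt[]): string[] {
  const mismatches: string[] = [];
  for (const attempt of attempts) {
    const legacy = legacyScore(attempt);
    const next = extractedScore(attempt);
    if (Math.abs(legacy - next) > 1e-9) mismatches.push(attempt.id);
  }
  return mismatches;
}
```

An empty mismatch list over the archived corpus is the gate for flipping the feature flag; any non-empty result pins down exactly which attempts to debug.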
Tools & platforms
- Kubernetes HPA/VPA
- CloudFront + Lambda@Edge
- Redis Cluster
- pg_partman
- k6 load tests
Engineering challenges addressed
- Maintaining assessment integrity while scaling out writers without serializing globally.
- Teacher trust during migration, maintained through transparent status pages and rollback drills.
Tech stack
- Next.js
- Node.js
- PostgreSQL
- Redis
- Kubernetes
- AWS
- WebRTC
- Kafka
- OpenTelemetry
Results
- Sustained 40k concurrent learners with <1.5s p95 interaction latency
- 61% lower infra cost per MAU after 6 months of tuning
- Assessment integrity incidents dropped 92% vs. prior term
Quantified impact
- 40k concurrent learners validated in load tests and production peaks, observed during a national exam collaboration.
- 78% reduction in CDN overspend via ladder tuning and segment TTL discipline.
Key takeaways
- EdTech peaks are calendar-driven: invest in predictive autoscaling tied to academic calendars, not just CPU metrics.
- CQRS pays off when reads dwarf writes during revision windows.
- Integrity features must be designed with observability built in, not bolted on after cheating incidents surface on Twitter.
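The first takeaway can be made concrete with a small sketch: desired replica count derived from imported timetable entries, scaled out ahead of class start instead of reacting to CPU. Capacity per replica, the warm-up window, and all names are assumptions for illustration.

```typescript
// Calendar-driven predictive scaling sketch (hypothetical ratios).
interface TimetableEntry {
  startsAtMs: number;
  endsAtMs: number;
  expectedLearners: number;
}

const LEARNERS_PER_REPLICA = 500; // assumed capacity per pod
const WARMUP_MS = 10 * 60 * 1000; // scale out 10 minutes early

function desiredReplicas(
  timetable: TimetableEntry[],
  nowMs: number,
  minReplicas = 2,
): number {
  // Sum expected learners for entries active now or starting soon.
  const active = timetable
    .filter((e) => nowMs >= e.startsAtMs - WARMUP_MS && nowMs < e.endsAtMs)
    .reduce((sum, e) => sum + e.expectedLearners, 0);
  return Math.max(minReplicas, Math.ceil(active / LEARNERS_PER_REPLICA));
}
```

A controller evaluating this function each minute and writing the result to a Kubernetes HPA minimum gives the "timetable-informed predictive scaling" described in the solution, while CPU-based autoscaling remains as a safety net.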
