Client overview
- Industry focus: EdTech
- Portfolio segment: EdTech
- Organization profile: public benefit corporation, ~250 staff, learners in 90+ countries
Product-market fit pushed monthly active users (MAU) past the platform's infrastructure assumptions; Sunday exam windows saturated API clusters. Educators demanded proctored assessments with integrity signals, while finance needed predictable unit economics before the next curriculum partnership with a national ministry.
Problem
Live class and exam bursts overwhelmed the monolithic APIs, and caching inconsistencies corrupted assessment submissions.
Edge POPs lacked tuned WebRTC TURN allocations; learners in LATAM experienced audio drift during breakout rooms. Redis single-node caches caused thundering herds when hero teacher sessions began simultaneously worldwide.
The assessment service stored attempts in the same tables as course content, locking hot partitions. Students occasionally saw stale progress bars, leading to duplicate retries.
Vendor CDN bills spiked nonlinearly because adaptive bitrate ladders were untuned for mobile-first learners on 3G-equivalent links.
Solution
Decomposed hot paths: stateless signaling, sharded Redis, read-scaled Postgres, hierarchical CDN caching with signed segment URLs, and integrity streaming to the proctoring subsystem.
We split read and write paths: a command bus for attempts, CQRS projections for dashboards, and an outbox pattern feeding analytics. The live class control plane moved to dedicated autoscaling groups with predictive scaling informed by timetable imports.
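The read side of that split can be sketched as a pure projection: attempt events emitted by the command side are folded into a dashboard read model, so dashboard traffic never touches the write tables. All names and event shapes here are illustrative, not the production schema.

```typescript
// Minimal CQRS read-model projection sketch (hypothetical event shapes).
type AttemptEvent =
  | { kind: "started"; attemptId: string; learnerId: string }
  | { kind: "scored"; attemptId: string; learnerId: string; score: number };

interface DashboardRow {
  learnerId: string;
  attempts: number;
  bestScore: number;
}

// Fold the event stream into per-learner dashboard rows.
function project(events: AttemptEvent[]): Map<string, DashboardRow> {
  const rows = new Map<string, DashboardRow>();
  for (const e of events) {
    const row = rows.get(e.learnerId) ?? {
      learnerId: e.learnerId,
      attempts: 0,
      bestScore: 0,
    };
    if (e.kind === "started") row.attempts += 1;
    if (e.kind === "scored") row.bestScore = Math.max(row.bestScore, e.score);
    rows.set(e.learnerId, row);
  }
  return rows;
}
```

Because the projection is a deterministic fold, it can be rebuilt from the event log at any time, which is what makes the dashboards safe to scale independently of the attempt writers.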
Redis Cluster with replica reads handled session state and rate limits; fairness policies avoided teacher-induced hotspots. Postgres added partitioning by tenant, with a Citus-style coordinator for heavy tenants.
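The thundering-herd problem from simultaneous hero-teacher sessions is typically tamed with single-flight request coalescing in front of the cache: concurrent misses on the same key share one origin fetch. A minimal sketch, with illustrative names (`fetchFromOrigin`, `coalescedGet`) rather than the production client:

```typescript
// Single-flight coalescing: a cold key triggers one origin fetch,
// and concurrent callers join the in-flight promise.
const inFlight = new Map<string, Promise<string>>();
let originCalls = 0; // instrumentation for this sketch only

async function fetchFromOrigin(key: string): Promise<string> {
  originCalls += 1;
  return `session-data-for-${key}`;
}

async function coalescedGet(key: string): Promise<string> {
  const pending = inFlight.get(key);
  if (pending) return pending; // join the fetch already in flight
  const p = fetchFromOrigin(key).finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

In production the same idea sits between the Redis miss path and Postgres, so a cache expiry at class start produces one database read per key instead of thousands.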
CDN optimization tuned ladder variants per geography; prefetch hints aligned with LMS navigation graphs.
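The signed segment URLs mentioned above were CloudFront signed URLs in this engagement; the hand-rolled HMAC variant below is only a shape sketch of the same idea (path plus expiry signed at issue time, verified before serving). The secret, paths, and parameter names are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const SECRET = "demo-secret"; // hypothetical; a real key lives in a KMS

// Sign a media segment path with an expiry timestamp.
function signSegment(path: string, expiresAt: number): string {
  const sig = createHmac("sha256", SECRET)
    .update(`${path}:${expiresAt}`)
    .digest("hex");
  return `${path}?exp=${expiresAt}&sig=${sig}`;
}

// Verify expiry and signature before serving the segment.
function verifySegment(url: string, now: number): boolean {
  const [path, query] = url.split("?");
  const params = new URLSearchParams(query);
  const exp = Number(params.get("exp"));
  const sig = params.get("sig") ?? "";
  if (!Number.isFinite(exp) || now > exp) return false; // expired or malformed
  const expected = createHmac("sha256", SECRET)
    .update(`${path}:${exp}`)
    .digest("hex");
  return (
    sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig), Buffer.from(expected))
  );
}
```

Short TTLs on segment URLs are what make the hierarchical CDN caching safe: a leaked link stops working within minutes, without invalidating cached objects.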
Implementation
1. Traffic characterization: captured WebRTC telemetry, HTTP waterfall baselines, and Redis slow logs during peak Sundays; chaos experiments injected regional latency to validate adaptive UX messaging.
2. Incremental extraction: strangled assessment APIs behind a BFF with feature flags; replay tests compared scoring outputs against the legacy monolith across millions of archived attempts.
3. Cost-performance tradeoffs: committed-use discounts aligned to predictable exam windows; lifecycle policies moved cold recordings to cheaper storage classes.
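The replay testing in step 2 amounts to scoring every archived attempt twice and flagging divergence. A hedged sketch, where both scorers and the attempt shape are stand-ins for the real services:

```typescript
// Replay harness sketch: score archived attempts through the legacy
// path and the extracted path, report any IDs whose scores diverge.
interface Attempt {
  id: string;
  answers: boolean[]; // true = correct answer, simplified for the sketch
}

type Scorer = (a: Attempt) => number;

// Stand-in scorers; the real ones are the monolith and the new service.
const legacyScore: Scorer = (a) =>
  a.answers.filter(Boolean).length / a.answers.length;

const extractedScore: Scorer = (a) =>
  a.answers.reduce((acc, ok) => acc + (ok ? 1 : 0), 0) / a.answers.length;

function replayCompare(attempts: Attempt[]): string[] {
  const mismatches: string[] = [];
  for (const attempt of attempts) {
    const legacy = legacyScore(attempt);
    const next = extractedScore(attempt);
    if (Math.abs(legacy - next) > 1e-9) mismatches.push(attempt.id);
  }
  return mismatches;
}
```

An empty mismatch list over the archived corpus is the gate for flipping the feature flag; any non-empty result pins down exactly which attempts to debug.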
Tools & platforms
- Kubernetes HPA/VPA
- CloudFront + Lambda@Edge
- Redis Cluster
- pg_partman
- k6 load tests
Engineering challenges addressed
- Maintaining assessment integrity while scaling out writers without serializing globally.
- Teacher trust during migration, maintained through transparent status pages and rollback drills.
Tech stack
- Next.js
- Node.js
- PostgreSQL
- Redis
- Kubernetes
- AWS
- WebRTC
- Kafka
- OpenTelemetry
Results
- Sustained 40k concurrent learners with <1.5s p95 interaction latency
- 61% lower infra cost per MAU after 6 months of tuning
- Assessment integrity incidents dropped 92% vs. prior term
Quantified impact
- 40k concurrent learners validated in load tests and production peaks, observed during a national exam collaboration.
- 78% reduction in CDN overspend via ladder tuning and segment TTL discipline.
Key takeaways
- EdTech peaks are calendar-driven: invest in predictive autoscaling tied to academic calendars, not just CPU metrics.
- CQRS pays off when reads dwarf writes during revision windows.
- Integrity features must be designed with observability built in, not bolted on after cheating incidents surface on Twitter.
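The first takeaway can be made concrete with a small sketch: desired replica count derived from imported timetable entries, scaled out ahead of class start instead of reacting to CPU. Capacity per replica, the warm-up window, and all names are assumptions for illustration.

```typescript
// Calendar-driven predictive scaling sketch (hypothetical ratios).
interface TimetableEntry {
  startsAtMs: number;
  endsAtMs: number;
  expectedLearners: number;
}

const LEARNERS_PER_REPLICA = 500; // assumed capacity per pod
const WARMUP_MS = 10 * 60 * 1000; // scale out 10 minutes early

function desiredReplicas(
  timetable: TimetableEntry[],
  nowMs: number,
  minReplicas = 2,
): number {
  // Sum expected learners for entries active now or starting soon.
  const active = timetable
    .filter((e) => nowMs >= e.startsAtMs - WARMUP_MS && nowMs < e.endsAtMs)
    .reduce((sum, e) => sum + e.expectedLearners, 0);
  return Math.max(minReplicas, Math.ceil(active / LEARNERS_PER_REPLICA));
}
```

A controller evaluating this function each minute and writing the result to a Kubernetes HPA minimum gives the "timetable-informed predictive scaling" described in the solution, while CPU-based autoscaling remains as a safety net.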
