Flaky Tests: Root Causes Our Teams Actually See in Production Repos
By Priyatham Rama Sai
Sleep calls, shared mutable state, and environments that drift from prod cause failures that disappear on rerun. We catalogue the fixes that stuck (clocks, isolation, data seeds) and go beyond ‘increase timeout’ cargo cults.
Patterns we remove first
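Hard waits (fixed sleeps) mask races until load shifts and the sleep is no longer long enough. Dynamic data created without factories bakes in ordering assumptions: a test passes only when an earlier test has left the right records behind. Shared login sessions leak cookies across parallel workers.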
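
As a rough illustration of removing a hard wait, the sketch below polls for a condition until a deadline instead of sleeping for a fixed time. The helper name wait_until and its parameters are hypothetical, not taken from any particular test framework; many frameworks ship their own equivalent, which is usually the better choice.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper: returns as soon as the condition holds, so a fast
    run does not pay for the worst case, and raises TimeoutError otherwise.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)


# Instead of: time.sleep(3); assert job.done
# one might write: wait_until(lambda: job.done, timeout=10)
```

The timeout here is an upper bound rather than a fixed cost, so raising it does not slow down passing runs the way a longer sleep would.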
Practices that stabilize suites
Freeze time at boundaries. Spin disposable accounts per test where feasible. Align staging toggles with production defaults so feature flags stop surprising assertions.
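
A minimal sketch of two of these practices, under stated assumptions: we read "freeze time at boundaries" as injecting a clock wherever code reads the current time, and we model a disposable account as a record with a unique identifier. Session, frozen_clock, and disposable_account are hypothetical names for illustration, and a real suite would create the account through its own API or factory.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict

Clock = Callable[[], datetime]


def system_clock() -> datetime:
    """Production clock: the only place that reads real time."""
    return datetime.now(timezone.utc)


def frozen_clock(at: datetime) -> Clock:
    """Test clock that always returns the same instant."""
    return lambda: at


@dataclass(frozen=True)
class Session:
    issued_at: datetime
    ttl: timedelta

    def is_expired(self, clock: Clock = system_clock) -> bool:
        # The clock is a parameter, so tests decide what "now" means.
        return clock() >= self.issued_at + self.ttl


def disposable_account(prefix: str = "test") -> Dict[str, str]:
    """Build unique account details so parallel tests never share a login."""
    token = uuid.uuid4().hex
    return {
        "username": f"{prefix}-{token}",
        "email": f"{prefix}-{token}@example.test",
    }
```

With the clock injected, an expiry test can assert on the exact boundary instant rather than sleeping until a real timeout passes.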
Governance
Track flake rate per suite owner. Quarantine chronic offenders until they are rewritten; reruns should not subsidize broken tests indefinitely.
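
One way this tracking could be computed is sketched below. It assumes a flake means "failed at least once, then passed on rerun" and that CI results are available as per-run records; the RunRecord shape, the function names, and the threshold values are all hypothetical and would need tuning against a real suite.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RunRecord:
    owner: str                  # team or person that owns the suite
    test_id: str
    attempts: Tuple[str, ...]   # outcomes in order, e.g. ("fail", "pass")


def is_flaky(run: RunRecord) -> bool:
    """Assumed definition: some attempt failed, but the final attempt passed."""
    return "fail" in run.attempts and run.attempts[-1] == "pass"


def flake_rate_by_owner(runs: Iterable[RunRecord]) -> Dict[str, float]:
    """Fraction of each owner's runs that only passed thanks to a rerun."""
    totals: Dict[str, int] = defaultdict(int)
    flaky: Dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run.owner] += 1
        if is_flaky(run):
            flaky[run.owner] += 1
    return {owner: flaky[owner] / totals[owner] for owner in totals}


def quarantine_candidates(
    runs: Iterable[RunRecord], threshold: float = 0.05, min_runs: int = 20
) -> List[str]:
    """Tests whose flake rate meets the threshold over enough runs to judge."""
    totals: Dict[str, int] = defaultdict(int)
    flaky: Dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run.test_id] += 1
        if is_flaky(run):
            flaky[run.test_id] += 1
    return sorted(
        test_id
        for test_id, count in totals.items()
        if count >= min_runs and flaky[test_id] / count >= threshold
    )
```

The min_runs guard keeps a single unlucky run from sending a test to quarantine, while the per-owner rate gives each team a number to watch.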