Flaky Test Detection
What is a flaky test?
A flaky test is one that flips between Pass and Fail across multiple runs without any code changes. It is neither consistently passing nor consistently failing, so its result cannot be trusted.
Flaky tests erode confidence in your test suite. When a test fails intermittently, teams start ignoring failures ("oh, that one's flaky"), which can mask real bugs.
How TestIntel detects flaky tests
Every time test results are imported (via Web UI upload, CLI, API, or Xray/Zephyr pull), TestIntel records each test's Pass/Fail status in a per-test run log.
TC-0001: [Pass, Pass, Fail, Pass, Fail, Pass] ← flaky
TC-0002: [Pass, Pass, Pass, Pass, Pass, Pass] ← stable
TC-0003: [Fail, Fail, Fail, Fail, Fail, Fail] ← broken (not flaky)
TC-0004: [Pass, Pass, Pass, Fail, Pass, Pass] ← borderline (single failure, score 0.4)
Flakiness score
The flakiness score measures how often a test's status changes:
Flakiness score = number of status flips / (number of runs - 1)
| Score | Meaning |
|---|---|
| 0.0 | Perfectly stable (never flips) |
| 0.3 | Default threshold: scores at or above this are flagged as flaky |
| 0.5 | Flips half the time |
| 1.0 | Flips every single run |
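
To make the formula concrete, here is a minimal Python sketch that applies it to the four run logs shown earlier. The function name `flakiness_score` and the choice to return 0.0 for fewer than two runs are assumptions for illustration, not TestIntel's actual implementation.

```python
from typing import Sequence


def flakiness_score(history: Sequence[str]) -> float:
    """Return status flips divided by (runs - 1) for one test's run log."""
    if len(history) < 2:
        # Assumption: with fewer than two runs there is nothing to compare.
        return 0.0
    # A flip is any pair of consecutive runs whose statuses differ.
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / (len(history) - 1)


run_logs = {
    "TC-0001": ["Pass", "Pass", "Fail", "Pass", "Fail", "Pass"],
    "TC-0002": ["Pass", "Pass", "Pass", "Pass", "Pass", "Pass"],
    "TC-0003": ["Fail", "Fail", "Fail", "Fail", "Fail", "Fail"],
    "TC-0004": ["Pass", "Pass", "Pass", "Fail", "Pass", "Pass"],
}

for test_id, history in run_logs.items():
    print(test_id, round(flakiness_score(history), 2))
# TC-0001 0.8
# TC-0002 0.0
# TC-0003 0.0
# TC-0004 0.4
```

Note that a single isolated failure counts as two flips (Pass to Fail, then Fail to Pass), which is why TC-0004 scores 0.4 rather than 0.2.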
Requirements to be flagged
A test is only flagged as flaky when:
- At least 3 runs are recorded (can't determine flakiness from 1-2 data points)
- Both Pass and Fail appear in recent history (all-pass or all-fail is not flaky)
- Score meets the threshold (default 0.3, configurable)
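
The three conditions can be combined into a single check. The sketch below is illustrative only: the name `is_flaky`, the `min_runs` parameter, and the reading of "meets the threshold" as greater than or equal are assumptions, as is applying all three conditions to the same window of recent runs.

```python
from typing import Sequence


def is_flaky(
    history: Sequence[str],
    window: int = 10,
    threshold: float = 0.3,
    min_runs: int = 3,
) -> bool:
    """Apply the three flagging conditions to the most recent runs."""
    recent = list(history)[-window:]  # assumption: only the window is evaluated

    # 1. At least 3 runs are recorded.
    if len(recent) < min_runs:
        return False

    # 2. Both Pass and Fail appear in recent history.
    if not {"Pass", "Fail"} <= set(recent):
        return False

    # 3. Score (flips / (runs - 1)) meets the threshold.
    flips = sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)
    return flips / (len(recent) - 1) >= threshold


print(is_flaky(["Pass", "Fail"]))                       # False: too few runs
print(is_flaky(["Fail"] * 6))                           # False: broken, not flaky
print(is_flaky(["Pass", "Fail", "Pass", "Fail", "Pass"]))  # True: score 1.0
```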
Root cause suggestions
TestIntel analyzes the pattern of flips and suggests a likely cause:
| Pattern | Example | Suggestion |
|---|---|---|
| Alternates every run | P, F, P, F, P | Test ordering dependency or shared state between tests |
| Occasional failures (<30% fail rate) | P, P, P, F, P, P, P, P, F, P | Timing/race condition or environment dependency |
| Mostly failing (>70% fail rate) | F, F, F, P, F, F, F, F | Real bug with intermittent workaround |
| Mixed | P, P, F, F, P, F, P | Investigate test data, environment, or external service dependencies |
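
The table above can be read as a small decision procedure. The following sketch is one possible interpretation: the function names, the order in which the patterns are checked, and the assumption that the history has already been flagged as flaky are all illustrative, not a description of TestIntel's internals.

```python
from typing import List, Sequence


def expand(shorthand: str) -> List[str]:
    """Turn the table's shorthand ('PFP') into full statuses."""
    return [{"P": "Pass", "F": "Fail"}[c] for c in shorthand]


def suggest_cause(history: Sequence[str]) -> str:
    """Map a flaky run log to the suggestion from the pattern table."""
    runs = list(history)
    if len(runs) < 2:
        raise ValueError("need at least two runs to look for a pattern")

    fail_rate = runs.count("Fail") / len(runs)
    alternates = all(prev != cur for prev, cur in zip(runs, runs[1:]))

    # Assumption: strict alternation is checked before the fail-rate rules.
    if alternates:
        return "Test ordering dependency or shared state between tests"
    if fail_rate < 0.30:
        return "Timing/race condition or environment dependency"
    if fail_rate > 0.70:
        return "Real bug with intermittent workaround"
    return "Investigate test data, environment, or external service dependencies"


for example in ["PFPFP", "PPPFPPPPFP", "FFFPFFFF", "PPFFPFP"]:
    print(example, "->", suggest_cause(expand(example)))
```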
How to use it
CLI
```bash
# Default settings (last 10 runs, threshold 0.3)
testintel flaky-report

# Custom window and threshold
testintel flaky-report --window 20 --threshold 0.2
```
API
GET /history/flaky?window=10&threshold=0.3
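
If you want to call the endpoint from a script, a sketch using only the Python standard library might look like this. The host is borrowed from the CI examples in this document; the use of the `x-api-key` header on this endpoint and a JSON response body are assumptions, so check them against your deployment.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Host taken from the CI examples; adjust for your deployment.
BASE_URL = "http://testintel.internal:8000"


def flaky_report_url(
    window: int = 10, threshold: float = 0.3, base_url: str = BASE_URL
) -> str:
    """Build the GET /history/flaky URL with its query parameters."""
    query = urlencode({"window": window, "threshold": threshold})
    return f"{base_url}/history/flaky?{query}"


def fetch_flaky_report(api_key: str, window: int = 10, threshold: float = 0.3):
    """Fetch the report.

    Assumptions: the endpoint accepts the same x-api-key header as the
    upload endpoint and returns a JSON document.
    """
    request = urllib.request.Request(
        flaky_report_url(window, threshold),
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


print(flaky_report_url(window=20, threshold=0.2))
# http://testintel.internal:8000/history/flaky?window=20&threshold=0.2
```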
Web UI
The flaky test count appears in the History tab dashboard and feeds into the Test Maintenance Score.
Building enough data
Flaky detection requires multiple result imports over time. The more frequently results flow in, the faster flaky tests are detected.
Minimum data requirements:
- A test needs at least 3 recorded runs before it can be evaluated for flakiness
- For reliable detection, 5+ runs are recommended (with only 3 runs the score can only be 0.0, 0.5, or 1.0, so only extreme flakiness is distinguishable)
- The evaluation window defaults to the last 10 runs per test (configurable via the `window` parameter)
- Results must include `app_version` and `environment` context for the most accurate detection
Timeline example: If your CI runs tests daily, you'll start seeing flaky flags after ~3-5 days. If you run weekly, it takes 3-5 weeks.
Ways to feed results:
- Upload JUnit XML via the Inventory tab "Upload Results" button
- CLI: `testintel results-import results.xml`
- API: `POST /results/upload` (add to your CI/CD pipeline)
- Xray/Zephyr: Pull results via sync buttons or CLI
CI/CD integration example
Add this step to your CI pipeline to automatically feed results after every test run:
GitHub Actions:
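```yaml
- name: Upload results to TestIntel
  # always() runs this step even when the test step failed,
  # so failing runs are recorded too
  if: always()
  run: |
    curl -X POST http://testintel.internal:8000/results/upload \
      -H "x-api-key: ${{ secrets.TESTINTEL_KEY }}" \
      -F "file=@results.xml"
```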
Jenkins:
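```groovy
post {
    // "always" runs regardless of the build result,
    // so failing runs are uploaded as well
    always {
        sh '''
            curl -X POST http://testintel.internal:8000/results/upload \
              -H "x-api-key: ${TESTINTEL_KEY}" \
              -F "file=@target/surefire-reports/TEST-results.xml"
        '''
    }
}
```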
Impact on Test Maintenance Score
Flaky tests reduce your Test Maintenance Score. The Flakiness component (10% weight) is scored as follows:
- 100/100: no flaky tests
- 0/100: 20% or more of tests are flaky
Fixing flaky tests is one of the fastest ways to improve your Test Maintenance Score.
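
Only the two endpoints of the Flakiness component are given above (100 with no flaky tests, 0 at 20% or more). As a rough way to estimate the effect of fixing tests, the sketch below assumes the score falls linearly between those endpoints. That linear shape and the function name `flakiness_component` are assumptions; TestIntel may use a different curve.

```python
def flakiness_component(
    flaky_tests: int, total_tests: int, zero_at: float = 0.20
) -> float:
    """Estimate the 0-100 Flakiness component.

    Assumption: linear interpolation between the two documented
    endpoints (0% flaky -> 100, 20% or more flaky -> 0).
    """
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    ratio = flaky_tests / total_tests
    return round(max(0.0, 100.0 * (1.0 - ratio / zero_at)), 1)


WEIGHT = 0.10  # the Flakiness component's weight in the overall score

for flaky in (0, 20, 40, 100):
    component = flakiness_component(flaky, total_tests=200)
    print(f"{flaky}/200 flaky -> component {component}, "
          f"contributes {round(component * WEIGHT, 2)} points")
```

Under this assumption, a suite of 200 tests with 20 flaky ones (10%) would score 50 on the component and contribute 5 of a possible 10 points.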
Common fixes for flaky tests
| Cause | Fix |
|---|---|
| Timing/race conditions | Add explicit waits, avoid sleep(), use polling |
| Shared state | Isolate test data, reset state in setup/teardown |
| Test ordering dependency | Make each test independent, don't rely on execution order |
| Environment dependency | Mock external services, use test containers |
| Non-deterministic data | Use fixed seeds, deterministic test data |
| Network flakiness | Retry transient failures, mock HTTP calls |
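
As an illustration of the first row, a fixed `sleep()` can be replaced with a polling helper that returns as soon as the condition holds and fails clearly on timeout. The helper name `wait_until` and its defaults are hypothetical; many test frameworks ship an equivalent.

```python
import threading
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Short pause between checks; the total wait adapts to how long
        # the condition actually takes instead of being a fixed guess.
        time.sleep(interval)


# Usage: a background task finishes after roughly 0.1 s.
ready = threading.Event()
threading.Timer(0.1, ready.set).start()

# Instead of time.sleep(1) followed by an assertion, wait only as long as needed.
assert wait_until(ready.is_set, timeout=2.0), "background task did not finish"
```

The generous timeout only matters on slow machines; on a fast one the test continues almost immediately, which removes both the race and the wasted time.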