Flaky Test Detection

What is a flaky test?

A flaky test is one that flips between Pass and Fail across multiple runs without any code changes. It is neither consistently passing nor consistently failing; its result is unreliable.

Flaky tests erode confidence in your test suite. When a test fails intermittently, teams start ignoring failures ("oh, that one's flaky"), which can mask real bugs.

How TestIntel detects flaky tests

Every time test results are imported (via Web UI upload, CLI, API, or Xray/Zephyr pull), TestIntel records each test's Pass/Fail status in a per-test run log.


TC-0001: [Pass, Pass, Fail, Pass, Fail, Pass]  ← flaky
TC-0002: [Pass, Pass, Pass, Pass, Pass, Pass]  ← stable
TC-0003: [Fail, Fail, Fail, Fail, Fail, Fail]  ← broken (not flaky)
TC-0004: [Pass, Pass, Pass, Fail, Pass, Pass]  ← borderline
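To make the run log concrete, here is a minimal sketch of how a per-test history could be accumulated across imports. The names `run_log` and `record_import` are hypothetical and only illustrate the data shape; they are not TestIntel's internal storage.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical in-memory run log: test ID -> statuses in import order.
run_log: Dict[str, List[str]] = defaultdict(list)


def record_import(results: Dict[str, str]) -> None:
    """Append one import's Pass/Fail statuses to each test's history."""
    for test_id, status in results.items():
        if status not in ("Pass", "Fail"):
            raise ValueError(f"unexpected status for {test_id}: {status!r}")
        run_log[test_id].append(status)


# Two imports, one per test run.
record_import({"TC-0001": "Pass", "TC-0002": "Pass"})
record_import({"TC-0001": "Fail", "TC-0002": "Pass"})
print(dict(run_log))
# {'TC-0001': ['Pass', 'Fail'], 'TC-0002': ['Pass', 'Pass']}
```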

Flakiness score

The flakiness score measures how often a test's status changes:


Flakiness score = number of status flips / (number of runs - 1)

| Score | Meaning |
| --- | --- |
| 0.0 | Perfectly stable (never flips) |
| 0.3 | Threshold (flagged as flaky) |
| 0.5 | Flips half the time |
| 1.0 | Flips every single run |
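The formula can be sketched in a few lines of Python. The function name `flakiness_score` is hypothetical, and returning 0.0 for fewer than two runs is an assumption made here to avoid dividing by zero.

```python
from typing import Sequence


def flakiness_score(statuses: Sequence[str]) -> float:
    """Return status flips divided by (runs - 1)."""
    if len(statuses) < 2:
        return 0.0  # assumption: a single run has no transitions to count
    # A flip is any pair of consecutive runs with different statuses.
    flips = sum(1 for prev, curr in zip(statuses, statuses[1:]) if prev != curr)
    return flips / (len(statuses) - 1)


# The example histories from the run log above:
print(flakiness_score(["Pass", "Pass", "Fail", "Pass", "Fail", "Pass"]))  # 0.8
print(flakiness_score(["Pass", "Pass", "Pass", "Pass", "Pass", "Pass"]))  # 0.0
print(flakiness_score(["Pass", "Pass", "Pass", "Fail", "Pass", "Pass"]))  # 0.4
```

A single isolated failure counts as two flips (Pass to Fail, then Fail to Pass), which is why TC-0004 scores 2/5 = 0.4.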

Requirements to be flagged

A test is only flagged as flaky when:

  1. At least 3 runs are recorded (can't determine flakiness from 1-2 data points)
  2. Both Pass and Fail appear in recent history (all-pass or all-fail is not flaky)
  3. Score meets the threshold (default 0.3, configurable)
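The three conditions can be combined into one check, sketched below. The function `is_flagged_flaky` is hypothetical. It assumes that "recent history" means the last `window` runs and that "meets the threshold" means greater than or equal to it.

```python
from typing import Sequence


def is_flagged_flaky(
    statuses: Sequence[str],
    window: int = 10,
    threshold: float = 0.3,
    min_runs: int = 3,
) -> bool:
    """Apply the three flagging requirements to a test's status history."""
    recent = list(statuses)[-window:]  # assumption: only the last `window` runs count
    # 1. At least `min_runs` runs are recorded.
    if len(recent) < min_runs:
        return False
    # 2. Both Pass and Fail appear in the recent history.
    if not {"Pass", "Fail"} <= set(recent):
        return False
    # 3. The flakiness score meets the threshold (assumed to mean >=).
    flips = sum(1 for prev, curr in zip(recent, recent[1:]) if prev != curr)
    return flips / (len(recent) - 1) >= threshold


print(is_flagged_flaky(["Pass", "Pass", "Fail", "Pass", "Fail", "Pass"]))  # True
print(is_flagged_flaky(["Fail"] * 6))                                      # False
print(is_flagged_flaky(["Pass", "Fail"]))                                  # False
```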

Root cause suggestions

TestIntel analyzes the pattern of flips and suggests a likely cause:

| Pattern | Example | Suggestion |
| --- | --- | --- |
| Alternates every run | P, F, P, F, P | Test ordering dependency or shared state between tests |
| Occasional failures (<30% fail rate) | P, P, P, F, P, P, P, P, F, P | Timing/race condition or environment dependency |
| Mostly failing (>70% fail rate) | F, F, F, P, F, F, F, F | Real bug with intermittent workaround |
| Mixed | P, P, F, F, P, F, P | Investigate test data, environment, or external service dependencies |
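As a rough illustration, the pattern matching could be approximated with the rules below. This is a sketch, not TestIntel's actual analysis: the function `suggest_root_cause` is hypothetical, and the order in which the patterns are checked is an assumption.

```python
from typing import Sequence


def suggest_root_cause(statuses: Sequence[str]) -> str:
    """Map a Pass/Fail history to one of the documented suggestions."""
    runs = list(statuses)
    if not runs:
        raise ValueError("at least one run is required")
    fail_rate = runs.count("Fail") / len(runs)
    flips = sum(1 for prev, curr in zip(runs, runs[1:]) if prev != curr)
    # Assumption: strict alternation is checked before the fail-rate rules.
    if len(runs) >= 2 and flips == len(runs) - 1:
        return "Test ordering dependency or shared state between tests"
    if fail_rate < 0.3:
        return "Timing/race condition or environment dependency"
    if fail_rate > 0.7:
        return "Real bug with intermittent workaround"
    return "Investigate test data, environment, or external service dependencies"


history = ["Pass", "Pass", "Pass", "Fail", "Pass", "Pass", "Pass", "Pass", "Fail", "Pass"]
print(suggest_root_cause(history))
# Timing/race condition or environment dependency
```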

How to use it

CLI


# Default settings (last 10 runs, threshold 0.3)
testintel flaky-report

# Custom window and threshold
testintel flaky-report --window 20 --threshold 0.2

API


GET /history/flaky?window=10&threshold=0.3
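A minimal Python client for this endpoint might look like the sketch below. Several details are assumptions: the host is borrowed from the CI examples in this guide, the `x-api-key` header is assumed to apply to this endpoint as it does to uploads, and the response is assumed to be JSON. The helper names `flaky_report_url` and `fetch_flaky_report` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: same host as in the CI upload examples; adjust for your deployment.
BASE_URL = "http://testintel.internal:8000"


def flaky_report_url(window: int = 10, threshold: float = 0.3, base_url: str = BASE_URL) -> str:
    """Build the /history/flaky URL with its two query parameters."""
    query = urllib.parse.urlencode({"window": window, "threshold": threshold})
    return f"{base_url}/history/flaky?{query}"


def fetch_flaky_report(api_key: str, window: int = 10, threshold: float = 0.3):
    """Call the endpoint and decode the body (assumed to be JSON)."""
    request = urllib.request.Request(
        flaky_report_url(window, threshold),
        headers={"x-api-key": api_key},  # assumption: same auth header as uploads
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


print(flaky_report_url(window=20, threshold=0.2))
# http://testintel.internal:8000/history/flaky?window=20&threshold=0.2
```

Calling `fetch_flaky_report("<your key>")` would then return the decoded report, provided the assumptions above hold for your installation.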

Web UI

The flaky test count appears in the History tab dashboard and feeds into the Test Maintenance Score.

Building enough data

Flaky detection requires multiple result imports over time. The more frequently results flow in, the faster flaky tests are detected.

Minimum data requirements: at least 3 recorded runs per test, with both Pass and Fail in the recent history.

Timeline example: If your CI runs tests daily, you'll start seeing flaky flags after ~3-5 days. If you run weekly, it takes 3-5 weeks.

Ways to feed results: Web UI upload, CLI, API, or Xray/Zephyr pull.

CI/CD integration example

Add this step to your CI pipeline to automatically feed results after every test run:

GitHub Actions:


- name: Upload results to TestIntel
  if: always()
  run: |
    curl -X POST http://testintel.internal:8000/results/upload \
      -H "x-api-key: ${{ secrets.TESTINTEL_KEY }}" \
      -F "file=@results.xml"

Jenkins:


post {
    always {
        sh '''
            curl -X POST http://testintel.internal:8000/results/upload \
              -H "x-api-key: ${TESTINTEL_KEY}" \
              -F "file=@target/surefire-reports/TEST-results.xml"
        '''
    }
}

Impact on Test Maintenance Score

Flaky tests reduce your health score through the Flakiness component, which carries a 10% weight.

Fixing flaky tests is one of the fastest ways to improve your Test Maintenance Score.

Common fixes for flaky tests

| Cause | Fix |
| --- | --- |
| Timing/race conditions | Add explicit waits, avoid sleep(), use polling |
| Shared state | Isolate test data, reset state in setup/teardown |
| Test ordering dependency | Make each test independent, don't rely on execution order |
| Environment dependency | Mock external services, use test containers |
| Non-deterministic data | Use fixed seeds, deterministic test data |
| Network flakiness | Retry transient failures, mock HTTP calls |
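As an illustration of the first fix, a fixed `sleep()` can be replaced with a polling helper that returns as soon as the condition holds and gives up after a timeout. This is a generic sketch; `wait_until` is a hypothetical helper, not part of TestIntel or any particular test framework.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # short pause between polls, not a fixed wait


# Example: a background task finishes after roughly 0.1 s. The test waits only
# as long as needed instead of sleeping for a worst-case duration.
done = threading.Event()
threading.Timer(0.1, done.set).start()
assert wait_until(done.is_set, timeout=2.0)
```

The timeout bounds the worst case, while the polling loop keeps the common case fast, so the test neither fails on a slow machine nor wastes time on a fast one.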