Flaky Test Detection

What is a flaky test?

A flaky test is one that flips between Pass and Fail across multiple runs without any code changes. It's not consistently passing or consistently failing — it's unreliable.

Flaky tests erode confidence in your test suite. When a test fails intermittently, teams start ignoring failures ("oh, that one's flaky"), which can mask real bugs.

How TestIntel detects flaky tests

Every time test results are imported (via Web UI upload, CLI, API, or Xray/Zephyr pull), TestIntel records each test's Pass/Fail status in a per-test run log.


TC-0001: [Pass, Pass, Fail, Pass, Fail, Pass]  ← flaky
TC-0002: [Pass, Pass, Pass, Pass, Pass, Pass]  ← stable
TC-0003: [Fail, Fail, Fail, Fail, Fail, Fail]  ← broken (not flaky)
TC-0004: [Pass, Pass, Pass, Fail, Pass, Pass]  ← borderline

Flakiness score

The flakiness score measures how often a test's status changes from one run to the next:


Flakiness score = number of status flips / (number of runs - 1)

Score	Meaning
0.0	Perfectly stable (never flips)
0.3	Default threshold: tests scoring at or above this value are flagged as flaky
0.5	Flips half the time
1.0	Flips every single run
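
As a worked example, the formula can be applied to the four run logs shown earlier. This is a minimal Python sketch, not TestIntel's implementation; the function name `flakiness_score` and the choice to return 0.0 when fewer than two runs exist are assumptions made for illustration.

```python
def flakiness_score(statuses):
    """Return status flips / (runs - 1) for a list ordered oldest to newest."""
    if len(statuses) < 2:
        # Assumption: with fewer than two runs there is nothing to compare.
        return 0.0
    # Count each pair of consecutive runs whose status differs.
    flips = sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)
    return flips / (len(statuses) - 1)


run_log = {
    "TC-0001": ["Pass", "Pass", "Fail", "Pass", "Fail", "Pass"],
    "TC-0002": ["Pass", "Pass", "Pass", "Pass", "Pass", "Pass"],
    "TC-0003": ["Fail", "Fail", "Fail", "Fail", "Fail", "Fail"],
    "TC-0004": ["Pass", "Pass", "Pass", "Fail", "Pass", "Pass"],
}

scores = {test_id: flakiness_score(runs) for test_id, runs in run_log.items()}
for test_id, score in scores.items():
    print(f"{test_id}: {score:.1f}")
# TC-0001: 0.8
# TC-0002: 0.0
# TC-0003: 0.0
# TC-0004: 0.4
```

Note that a single isolated failure (TC-0004) counts as two flips, one into Fail and one back to Pass, which is why it scores 0.4 over six runs.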

Requirements to be flagged

A test is flagged as flaky only when all of the following hold:

At least 3 runs are recorded (can't determine flakiness from 1-2 data points)

Both Pass and Fail appear in recent history (all-pass or all-fail is not flaky)

Score meets the threshold (default 0.3, configurable)
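
The three conditions can be combined into a single check. The sketch below is illustrative only: the function names, the `min_runs` parameter, and the reading of "recent history" as the last `window` runs (default 10) are assumptions, not TestIntel internals.

```python
def flakiness_score(statuses):
    """Status flips / (runs - 1) for a list ordered oldest to newest."""
    if len(statuses) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)
    return flips / (len(statuses) - 1)


def is_flaky(statuses, window=10, threshold=0.3, min_runs=3):
    """Apply the three flagging requirements to the most recent runs."""
    recent = statuses[-window:]  # assumption: only the last `window` runs count
    if len(recent) < min_runs:
        return False  # requirement 1: at least 3 recorded runs
    if len(set(recent)) < 2:
        return False  # requirement 2: both Pass and Fail must appear
    return flakiness_score(recent) >= threshold  # requirement 3: score meets threshold


print(is_flaky(["Pass", "Fail"]))                                  # False (too few runs)
print(is_flaky(["Fail"] * 6))                                      # False (all-fail is broken, not flaky)
print(is_flaky(["Pass", "Pass", "Fail", "Pass", "Fail", "Pass"]))  # True  (score 0.8)
print(is_flaky(["Pass"] * 9 + ["Fail"]))                           # False (score 1/9, below 0.3)
```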

Root cause suggestions

TestIntel analyzes the pattern of flips and suggests a likely cause:

Pattern	Example	Suggestion
Alternates every run	P, F, P, F, P	Test ordering dependency or shared state between tests
Occasional failures (<30% fail rate)	P, P, P, F, P, P, P, P, F, P	Timing/race condition or environment dependency
Mostly failing (>70% fail rate)	F, F, F, P, F, F, F, F	Real bug with intermittent workaround
Mixed	P, P, F, F, P, F, P	Investigate test data, environment, or external service dependencies
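
One way to read the table is as a small rule set evaluated top to bottom. The sketch below is a hypothetical reconstruction: the function name `suggest_cause`, the rule order (alternation checked before fail rate), and the exact comparisons are assumptions; only the thresholds and suggestion texts come from the table.

```python
def suggest_cause(statuses):
    """Map a Pass/Fail history (oldest first) to a suggestion from the table."""
    if not statuses:
        raise ValueError("at least one run is required")
    fail_rate = statuses.count("Fail") / len(statuses)
    # True when every run differs from the one before it.
    alternates = len(statuses) >= 2 and all(
        prev != cur for prev, cur in zip(statuses, statuses[1:])
    )
    if alternates:  # assumption: this rule takes precedence over fail rate
        return "Test ordering dependency or shared state between tests"
    if fail_rate < 0.30:
        return "Timing/race condition or environment dependency"
    if fail_rate > 0.70:
        return "Real bug with intermittent workaround"
    return "Investigate test data, environment, or external service dependencies"


def expand(pattern):
    """Turn a compact string such as 'PFP' into ['Pass', 'Fail', 'Pass']."""
    return ["Pass" if char == "P" else "Fail" for char in pattern]


for pattern in ("PFPFP", "PPPFPPPPFP", "FFFPFFFF", "PPFFPFP"):
    print(pattern, "->", suggest_cause(expand(pattern)))
```

Run against the four example patterns from the table, each one lands on its own row's suggestion.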

How to use it

CLI


# Default settings (last 10 runs, threshold 0.3)
testintel flaky-report

# Custom window and threshold
testintel flaky-report --window 20 --threshold 0.2

API


GET /history/flaky?window=10&threshold=0.3
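
To call the endpoint from a script, something like the following Python sketch could work. Several details are assumptions: the base URL is borrowed from the CI/CD examples in this guide, the `x-api-key` header is documented here only for uploads, and the response is assumed to be JSON. Check your deployment before relying on any of them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same host as in the CI/CD examples; adjust for your deployment.
BASE_URL = "http://testintel.internal:8000"


def flaky_report_url(base_url=BASE_URL, window=10, threshold=0.3):
    """Build the GET /history/flaky URL with its two query parameters."""
    query = urlencode({"window": window, "threshold": threshold})
    return f"{base_url}/history/flaky?{query}"


def fetch_flaky_report(api_key, window=10, threshold=0.3):
    """Request the report and parse the body as JSON (assumed format)."""
    request = Request(
        flaky_report_url(window=window, threshold=threshold),
        headers={"x-api-key": api_key},  # assumption: same auth header as uploads
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)


print(flaky_report_url(window=20, threshold=0.2))
# http://testintel.internal:8000/history/flaky?window=20&threshold=0.2
```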

Web UI

The flaky test count appears on the History tab dashboard and feeds into the Test Maintenance Score.

Building enough data

Flaky detection requires multiple result imports over time. The more frequently results flow in, the faster flaky tests are detected.

Minimum data requirements:

A test needs at least 3 recorded runs before it can be evaluated for flakiness
For reliable detection, 5+ runs are recommended (3 runs can only detect extreme flakiness)
The evaluation window defaults to the last 10 runs per test (configurable via window parameter)
Results should include app_version and environment context for the most accurate detection

Timeline example: If your CI runs tests daily, you'll start seeing flaky flags after ~3-5 days. If you run weekly, it takes 3-5 weeks.

Ways to feed results:

Web UI: Upload JUnit XML with the "Upload Results" button on the Inventory tab
CLI: testintel results-import results.xml
API: POST /results/upload (add to your CI/CD pipeline)
Xray/Zephyr: Pull results via sync buttons or CLI

CI/CD integration example

Add this step to your CI pipeline to automatically feed results after every test run:

GitHub Actions:


- name: Upload results to TestIntel
  # always() runs this step even when the test step failed, so failing runs are recorded too
  if: always()
  run: |
    curl -X POST http://testintel.internal:8000/results/upload \
      -H "x-api-key: ${{ secrets.TESTINTEL_KEY }}" \
      -F "file=@results.xml"

Jenkins:


post {
    // "always" runs after every build, pass or fail, so failing runs are recorded too
    always {
        sh '''
            curl -X POST http://testintel.internal:8000/results/upload \
              -H "x-api-key: ${TESTINTEL_KEY}" \
              -F "file=@target/surefire-reports/TEST-results.xml"
        '''
    }
}
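
For CI systems without curl, the same upload can be sketched in Python with only the standard library. This mirrors the curl calls above (POST to /results/upload, `x-api-key` header, a multipart form field named `file`); the function names, the `application/xml` part type, and the use of the HTTP status as the return value are assumptions for illustration.

```python
import uuid
from pathlib import Path
from urllib.request import Request, urlopen


def build_multipart(field, filename, content, boundary):
    """Encode one file as a multipart/form-data body (what curl -F produces)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/xml\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail


def upload_results(path, api_key, base_url="http://testintel.internal:8000"):
    """POST a JUnit XML file to /results/upload and return the HTTP status."""
    boundary = uuid.uuid4().hex  # random boundary that will not occur in the XML
    file_path = Path(path)
    body = build_multipart("file", file_path.name, file_path.read_bytes(), boundary)
    request = Request(
        f"{base_url}/results/upload",
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urlopen(request, timeout=30) as response:
        return response.status
```

A pipeline step would then call `upload_results("results.xml", os.environ["TESTINTEL_KEY"])` after the tests finish, whether they passed or failed.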

Impact on Test Maintenance Score

Flaky tests reduce your Test Maintenance Score. The Flakiness component (10% weight) is scored as follows:

100/100 — no flaky tests
0/100 — 20% or more of tests are flaky
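
Only the two endpoints of this scale are documented. For a rough estimate of values in between, the sketch below assumes a straight line from 100 at 0% flaky to 0 at 20% flaky. That linear shape, the function name, and the handling of an empty suite are assumptions and may not match how TestIntel actually scores.

```python
def flakiness_component(flaky_tests, total_tests, zero_at=0.20):
    """Estimate the Flakiness component (0-100), assuming linear interpolation."""
    if total_tests == 0:
        return 100.0  # assumption: an empty suite has no flaky tests
    ratio = flaky_tests / total_tests
    # 100 at ratio 0, falling linearly to 0 at `zero_at`, never below 0.
    return max(0.0, 100.0 * (1 - ratio / zero_at))


print(flakiness_component(0, 50))   # 100.0 (no flaky tests)
print(flakiness_component(5, 50))   # 50.0  (10% flaky, under the linear assumption)
print(flakiness_component(10, 50))  # 0.0   (20% flaky)
print(flakiness_component(25, 50))  # 0.0   (more than 20% flaky)
```

With a 10% weight, a fully lost Flakiness component would cost at most 10 points of the overall score.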

Fixing flaky tests is one of the fastest ways to improve your Test Maintenance Score.

Common fixes for flaky tests

Cause	Fix
Timing/race conditions	Use explicit waits or polling on a condition instead of fixed `sleep()` calls
Shared state	Isolate test data, reset state in setup/teardown
Test ordering dependency	Make each test independent, don't rely on execution order
Environment dependency	Mock external services, use test containers
Non-deterministic data	Use fixed seeds, deterministic test data
Network flakiness	Retry transient failures, mock HTTP calls
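
To illustrate the first fix, here is a minimal polling helper in Python. It is a generic sketch rather than part of TestIntel, and the name `wait_until` is hypothetical. Instead of sleeping for a fixed guess, the test checks a condition repeatedly and continues as soon as it holds, failing only if the timeout expires.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # short pause between polls, not a fixed wait


# Usage: a stand-in for "the background job has finished".
attempts = {"count": 0}


def job_finished():
    attempts["count"] += 1
    return attempts["count"] >= 3  # becomes true on the third poll


assert wait_until(job_finished, timeout=1.0, interval=0.01)
```

The short `time.sleep(interval)` only spaces out the polls; the total wait adapts to how long the condition takes, which is what removes the timing guess from the test.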