The red build nobody read

A teammate of mine shipped a null-pointer regression to production. The test that should have caught it had already failed in CI. It went out anyway.

Here is how that happens, and it is depressingly ordinary. The PR was green on the second run. The first run had one red test, an integration check that touches a shared fixture, and that test had been flaky for weeks. So nobody read the failure. They clicked re-run, it passed, they merged. The failure was real. The retry papered over it. Production found out for us, on a Friday.

That is the part the tooling conversation usually misses. The cost of flaky tests is not the wasted CI minutes, though those add up fast. The real cost is that flakiness teaches your engineers to stop reading failures. Microsoft's research on test flakiness found the same pattern we see in our own data: once a developer gets burned by a flaky test, they investigate the next failure far less carefully. Trust erodes, and erosion spreads.

Why the obvious fixes fail

When flakiness starts hurting, every team reaches for one of three levers. All three are bad in the same way.

  • Blanket retries. Re-run everything two or three times, take the best result. CI goes green, the dashboards look healthy, and you have just built a machine that hides regressions. A test that fails for a real reason and passes on retry is indistinguishable from a flaky one. You are no longer testing; you are sampling until you get the answer you want.
  • Muting. Silence the noisy test. Fine in theory, except the mute outlives the fix by months, the test still burns compute, and "we will unmute it later" is a sentence with no future tense.
  • Ignoring red. The terminal state. The build is decorative. People merge on vibes.
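To put a rough number on "sampling until you get the answer you want", here is a minimal sketch. It assumes a real but intermittent regression fails each run independently with a fixed probability, which is a simplification, and the 60% figure is purely illustrative rather than something we measured. The function name is hypothetical.

```python
def prob_masked_by_retries(fail_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` runs passes, so a
    "take the best result" policy reports green.

    Assumes independent runs that each fail with probability `fail_rate`
    (an illustrative simplification, not a model of any real suite).
    """
    if not 0.0 <= fail_rate <= 1.0:
        raise ValueError("fail_rate must be in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    return 1.0 - fail_rate ** attempts


# A real regression that triggers on 60% of runs (made-up number):
for attempts in (1, 2, 3):
    print(attempts, round(prob_masked_by_retries(0.6, attempts), 3))
```

Under those assumptions the regression slips through about 40% of the time with a single run, 64% with two attempts, and roughly 78% with three. Each extra retry makes the build greener and the regression harder to see.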

The common failure is that all three answer the wrong question. They ask: how do I make this failure go away? The question that matters is: is this failure signal or noise? And you cannot answer that by eye, because a single failing test tells you almost nothing on its own. You need the test's history, the environment it ran in, the shape of the build around it. You need features.

What we built

Testhide's Flakiness Predictor is one of eight diagnostic models in our pipeline, and on paper it is the least glamorous: gradient-boosted trees. No transformer, no embeddings. Boring, in the way that load-bearing things are boring.

The model scores a single failure with one question in mind. Given everything we know about this test and this build, what is the probability that this failure is a real regression rather than noise? It runs over roughly 41 engineered features, grouped into four families:

  • Temporal: recency and timing. How long the test has existed, how recently it last flipped, time-of-day and day-of-week patterns, gaps between runs.
  • Failure-history: the heavy hitters. Recent pass/fail ratio, current and historical streak lengths, how many distinct failure signatures the test has produced, whether it has ever failed and then passed with no code change in between.
  • Environment: OS, runner image, executor (we run in parallel through our TPS provider), matrix cell, resource pressure on the agent.
  • Build-context: what else failed in this build, whether the failure correlates with the diff, how many other tests touching the same files also went red.

That last family matters more than it looks. A lone test failing in an otherwise-green build with no related diff is a very different animal from one test in a cluster of twenty that all touch the file you just changed. The first smells like noise. The second smells like you broke something.
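The contrast between the lone failure and the cluster can be sketched in a few lines. This is an illustration, not the predictor's feature code: `FailedTest`, `build_context_features`, the feature names, and the idea that we have a ready-made mapping from each test to the files it touches are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FailedTest:
    name: str
    touched_files: frozenset[str]  # hypothetical test-to-file mapping


def build_context_features(
    target: FailedTest,
    failures: list[FailedTest],
    changed_files: set[str],
) -> dict[str, int]:
    """Describe the build around one failing test."""
    others = [f for f in failures if f.name != target.name]
    # Other failures that exercise at least one file in the diff.
    co_failures = [f for f in others if f.touched_files & changed_files]
    return {
        "other_failures_in_build": len(others),
        "touches_diff": int(bool(target.touched_files & changed_files)),
        "co_failures_touching_diff": len(co_failures),
    }


# Lone failure, unrelated diff: smells like noise.
lone = FailedTest("test_checkout", frozenset({"cart.py"}))
print(build_context_features(lone, [lone], {"README.md"}))

# One of twenty failures that all touch the changed file: smells like a break.
cluster = [FailedTest(f"test_cart_{i}", frozenset({"cart.py"})) for i in range(20)]
print(build_context_features(cluster[0], cluster, {"cart.py"}))
```

The first call yields all zeros. The second reports 19 other failures, all of them touching the diff, which is the pattern a model can learn to treat as a likely real break.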

Which features actually move the needle

This is the part worth being honest about, because the list above implies all four families pull their weight. They do not.

In our data, failure-history dominates. Recent failure ratio and streak features alone get you most of the way to a usable model. The single strongest signal is close to "has this exact test flipped pass-to-fail-to-pass with no intervening code change" — which is almost the definition of flaky, and the model leans on it hard.
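For a feel of what these history features look like, here is a minimal sketch that derives three of them from a test's run log. The `Run` record, the function name, the 20-run window, and the exact definition of a "flip" (three consecutive runs on the same commit going pass, fail, pass) are assumptions made for illustration, not the definitions the predictor uses.

```python
from typing import NamedTuple, Sequence


class Run(NamedTuple):
    passed: bool
    commit: str  # commit the run executed against


def failure_history_features(runs: Sequence[Run], window: int = 20) -> dict:
    """Compute illustrative history features from runs ordered oldest first."""
    if window < 1:
        raise ValueError("window must be >= 1")

    recent = runs[-window:]
    fail_ratio = sum(not r.passed for r in recent) / len(recent) if recent else 0.0

    # Length of the failing streak that ends at the most recent run.
    streak = 0
    for run in reversed(runs):
        if run.passed:
            break
        streak += 1

    # pass -> fail -> pass on one commit: the outcome changed, the code did not.
    flipped = any(
        a.passed and not b.passed and c.passed and a.commit == b.commit == c.commit
        for a, b, c in zip(runs, runs[1:], runs[2:])
    )

    return {
        "recent_fail_ratio": fail_ratio,
        "current_fail_streak": streak,
        "flipped_without_code_change": flipped,
    }


history = [Run(True, "a1"), Run(False, "a1"), Run(True, "a1"), Run(False, "b2")]
print(failure_history_features(history))
```

In this toy history the test flipped on commit a1 with no code change, so its reputation is flaky, and it is now failing on a new commit b2. History alone cannot tell whether that latest failure is noise.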

Build-context comes second, and it is the family that saves you from the expensive mistake. Diff-correlation flags the dangerous case: a test with a flaky reputation that fails and sits in a cluster correlated with the current change. That is exactly the regression-in-flaky-clothing that slips through retries. History says "ignore me"; context says "look closer." The model learns to weight context up when it disagrees with history.

Temporal features are mild contributors. Time-of-day has a real but small effect — it mostly proxies for load on shared infra. Test age matters early in a test's life, then flattens out.

Environment is the surprise. We expected OS and runner image to carry weight. Mostly they do not, with one exception: a couple of specific matrix cells are genuinely flakier than the rest, and there the signal is sharp. So the family as a whole reads as near-noise, but a few individual features inside it are gold. If we had judged the family by its average importance, we would have pruned the two features that earn their keep. Worth remembering before you cut by feature group.
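The pruning trap is easy to reproduce with toy numbers. In the sketch below the importance values, the feature names, and the 0.03 cut-off are all invented for illustration. They are not our model's importances. Only the shape of the problem is the point.

```python
from statistics import mean

# Made-up importance values, for illustration only.
IMPORTANCES = {
    "history.recent_fail_ratio": 0.30,
    "history.flipped_without_code_change": 0.35,
    "env.os": 0.002,
    "env.runner_image": 0.003,
    "env.resource_pressure": 0.004,
    "env.matrix_cell_a": 0.060,
    "env.matrix_cell_b": 0.055,
}


def family(feature: str) -> str:
    """Family is the prefix before the first dot, e.g. 'env'."""
    return feature.split(".", 1)[0]


def keep_by_family_mean(importances: dict, threshold: float) -> set:
    """Keep every feature of a family whose mean importance clears the bar."""
    by_family: dict = {}
    for name, value in importances.items():
        by_family.setdefault(family(name), []).append(value)
    kept = {fam for fam, values in by_family.items() if mean(values) >= threshold}
    return {name for name in importances if family(name) in kept}


def keep_by_feature(importances: dict, threshold: float) -> set:
    """Keep each feature on its own merits."""
    return {name for name, value in importances.items() if value >= threshold}


print(sorted(keep_by_family_mean(IMPORTANCES, 0.03)))
print(sorted(keep_by_feature(IMPORTANCES, 0.03)))
```

With these numbers the environment family averages below the cut-off, so pruning by family drops both matrix-cell features, while pruning per feature keeps them and discards only the three that contribute almost nothing.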

From a score to a decision

A probability is useless until it changes what CI does. The predictor is a first-class build step, not a dashboard you remember to check. It sits in the pipeline YAML next to your test steps:

steps:
  - name: integration
    type: pytest
    paths: tests/integration
    flakiness:
      model: flakiness-predictor
      # at or above this, treat the failure as a real regression
      regression_threshold: 0.80
      on_real_regression: block_pr
      on_likely_flake: quarantine

The logic is the opposite of blanket retries. When a test fails, the predictor scores it. High probability of real regression, at or above the threshold: the step blocks the PR. No retry, no mercy. Below the threshold: the failure is quarantined. The result is recorded, the test keeps running in the background so we keep gathering history, but it does not redden the build or gate the merge.

This is smart quarantine. It is not muting, because the test still runs and still reports. It is not retrying, because we never re-run hoping for green. We make a prediction about whether the failure deserves your attention and act on it, with the score and its top contributing features attached to the build so a human can overrule it.
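Stripped of the pipeline around it, the gate is a single comparison. The sketch below mirrors the YAML keys above (`regression_threshold`, `block_pr`, `quarantine`); the `Decision` type, the `decide` function, and the example feature names are hypothetical, written for illustration rather than taken from the step's implementation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Decision:
    action: str                     # "block_pr" or "quarantine"
    probability: float              # predicted probability of a real regression
    top_features: Tuple[str, ...]   # attached so a human can overrule the call


def decide(
    probability: float,
    top_features: Sequence[str],
    regression_threshold: float = 0.80,
) -> Decision:
    """Map a regression probability to a CI action. No retry branch exists."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # At or above the threshold the failure is treated as a real regression.
    action = "block_pr" if probability >= regression_threshold else "quarantine"
    return Decision(action, probability, tuple(top_features))


print(decide(0.91, ["co_failures_touching_diff", "touches_diff"]))
print(decide(0.12, ["flipped_without_code_change"]))
```

Note what is missing: there is no code path that re-runs the test. The only two outcomes are blocking the PR or quarantining the failure, and both carry the evidence with them.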

Keeping the model honest

Models that score production traffic rot. Your suite changes, your infra changes, last quarter's flake patterns are this quarter's stable tests. A predictor that quietly drifts is worse than no predictor, because it carries the authority of one.

So we monitor feature drift with PSI, the Population Stability Index, on every input feature. When a feature's live distribution drifts far enough from the training distribution, PSI crosses a threshold and we get an alert to retrain — rather than discovering the problem when a regression escapes. Diff-correlation is the feature we watch hardest. When teams reorganize their test layout, its distribution moves, and that is exactly when the model's most important judgment gets shaky.
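For readers who have not met PSI: bin the feature, compare the share of training and live values in each bin, and sum (live share − training share) × ln(live share / training share) over the bins. A minimal sketch follows, assuming equal-width bins taken from the training range and a small floor to avoid dividing by zero. The binning scheme, bin count, and function name are choices made for this example, not a description of our monitoring code.

```python
import math
from typing import List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index of `actual` (live) against `expected` (training)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    if bins < 1:
        raise ValueError("bins must be >= 1")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


train = [i / 100 for i in range(100)]
live_shifted = [0.5 + i / 200 for i in range(100)]
print(psi(train, train))         # identical distributions
print(psi(train, live_shifted))  # live data squeezed into the upper half
```

Identical distributions give a PSI of zero, and every term is non-negative, so the index only grows as the shapes diverge. A commonly cited rule of thumb treats values under 0.1 as stable and values over 0.25 as a significant shift, but the right alert threshold depends on the feature and on how costly a stale model is.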

The results, in our own numbers

These are internal results on our own pipelines, not a benchmark, and your suite will differ.

After we switched from blanket retries to predictor-driven quarantine:

  • CI minutes spent on retries dropped sharply. We had been re-running a large share of integration suites two or three times, and most of that was pure waste.
  • The median PR stopped carrying a "re-run and hope" step, which took real wall-clock latency out of every merge.
  • And the one that actually matters: a handful of real regressions that historically would have been retried into green got blocked at the PR, because diff-correlation outvoted a test's flaky reputation.

That last category is small in count and large in value. Catching even a few production regressions a quarter at the PR, instead of in an incident channel, pays for the whole model several times over.

The lesson

The instinct with flaky tests is to suppress the symptom — retry, mute, ignore. Every one of those trains your team to distrust CI, and a CI you do not trust is not catching regressions. It is generating noise with a green checkmark on top.

The fix is not a smarter retry. It is a model that answers the only question that matters, signal or noise, and a pipeline that acts on the answer instead of gambling on a re-run. Boring gradient-boosted trees, a feature set where two of the four families do most of the work, and the honesty to find the handful of features elsewhere that still carry load.

Self-host the whole thing and point it at your suite: /installation/. The full set of diagnostic models lives at /features/. And if you want clean test history feeding the predictor, the open-source reporters and the Report Format spec are at /plugins/report-format/.

