The pager goes off and you already know the drill

The deploy pipeline goes red on a Tuesday. There is a stack trace you half-recognize: ConnectionResetError deep inside the payments suite, three frames you have definitely seen before. Except you cannot remember when, or whether it was real, or who fixed it.

So you do the thing. You paste the error into Slack search. You try "payments timeout", then "ConnectionReset", then "flaky payments" because someone always calls it flaky. You find a thread from four months ago that trails off with "nvm, restarted the runner." You check Jira. You ask in #eng-help and wait. Forty minutes later you have a vague memory that someone dealt with this in Q1, and that someone is on PTO.

The maddening part is that the answer existed. Somebody already debugged this exact failure. The knowledge was real and recent, and it was locked inside a dead thread and one person's head. "Have we seen this before?" is the most common question in on-call, and we answer it with the worst possible tool: full-text search over chat logs written by tired people.

Why grep and keyword search fall down

The naive approaches share one flaw. They match strings, and failures do not repeat as strings.

The same root cause shows up with a different hostname, a different container ID, a different line number after a refactor, a slightly different timeout value. Keyword search treats pod-7f3a and pod-9b21 as completely different events when they are the same incident. Meanwhile it happily matches the word "timeout" across two hundred unrelated tests.

You can try to be clever with normalization, and we did. But normalization alone does not capture semantic similarity. "Connection reset by peer" and "broken pipe during upstream read" are the same story told two ways. A human on-call engineer knows that instantly. String matching never will.
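To make that limit concrete, here is a minimal sketch of regex normalization. The masking rules and log lines are invented for illustration; they are not the Log Signature Miner's actual patterns.

```python
import re

# Hypothetical masking rules; a real signature miner would carry many more.
VOLATILE = [
    (re.compile(r"\bpod-[0-9a-f]{4,}\b"), "<POD>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<ADDR>"),
    (re.compile(r"\b\d+(?:\.\d+)?(?:ms|s)\b"), "<DURATION>"),
]


def normalize(line: str) -> str:
    """Replace volatile tokens with placeholders so repeats compare equal."""
    for pattern, placeholder in VOLATILE:
        line = pattern.sub(placeholder, line)
    return line


# Invented log lines: same failure on two pods, then the same story reworded.
a = normalize("pod-7f3a: connection reset by peer after 30s (10.0.4.17:5432)")
b = normalize("pod-9b21: connection reset by peer after 45s (10.0.9.2:5432)")
c = normalize("pod-9b21: broken pipe during upstream read after 45s (10.0.9.2:5432)")

print(a == b)  # the two pods collapse to one template after masking
print(a == c)  # reworded failure: string equality still calls it unrelated
```

Masking fixes the pod-7f3a versus pod-9b21 problem. It does nothing for the reworded failure, and that gap is what the embedding step has to close.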

The honest framing: this is a nearest-neighbour problem, not a search-box problem. You want the closest past failures to the current one, ranked by how alike they actually are, fast enough to matter on a pager.

How Testhide answers it: the Failure Retriever

The Failure Retriever is one of Testhide's eight diagnostic models, and it has a deliberately narrow job. Given a failure, return the most similar past failures, fast enough to feel instant inside a CI run.

The pipeline is unglamorous, which is the point:

  • The Log Signature Miner first turns raw, noisy logs into stable templates, using a hybrid of regex master-patterns and Drain3. This strips the volatile parts (IDs, timestamps, ports) so two instances of the same failure look the same before we ever embed them.
  • We embed the templated failure plus a little structured metadata into a dense vector.
  • We index those vectors in FAISS and query for nearest neighbours.
  • Each neighbour comes back with its history: what it was, whether it was a real bug or noise, which PR or Jira ticket closed it.

So instead of "here are messages containing the word timeout," you get "this failure is 0.94 cosine-similar to one from six weeks ago that the Root-Cause Classifier marked as a product bug, fixed in PR #4127." That is the difference between a search box and an answer.
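A toy version of that lookup, as a sketch only. The PastFailure record, the three-dimensional vectors, and the PR-A / TICKET-B labels are all hypothetical, and the scan is plain Python and not FAISS. The top_k and min_similarity defaults mirror the pipeline config.

```python
import math
from dataclasses import dataclass


@dataclass
class PastFailure:
    template: str    # normalized log signature
    verdict: str     # e.g. "product bug" or "noise"
    closed_by: str   # PR or ticket reference (hypothetical labels below)
    vector: list     # embedding of the template


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, history, top_k=5, min_similarity=0.82):
    """Exhaustive scan: score everything, drop weak matches, keep the best."""
    scored = [(cosine(query, f.vector), f) for f in history]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


history = [
    PastFailure("connection reset by peer in <TEST>", "product bug", "PR-A", [0.9, 0.1, 0.4]),
    PastFailure("runner out of disk", "noise", "TICKET-B", [0.0, 1.0, 0.1]),
    PastFailure("assertion failed in <TEST>", "product bug", "PR-C", [0.2, 0.3, 0.9]),
]

matches = retrieve([0.8, 0.2, 0.5], history)
for score, failure in matches:
    print(f"{score:.2f}  {failure.verdict}  {failure.closed_by}  {failure.template}")
```

Only the first record clears the 0.82 threshold here. The other two are filtered out before they can clutter the PR.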

In the pipeline YAML it is just another first-class build step, sitting next to your tests:

steps:
  - name: unit
    type: pytest
    paths:
      - tests/unit

  - name: retrieve-similar-failures
    type: failure_retriever
    top_k: 5
    min_similarity: 0.82
    annotate_pr: true
    on_fail: continue

When a test fails, the retriever attaches the closest matches straight to the PR. The on-call engineer opens the build and the answer to "have we seen this before?" is already sitting there, with links. No Slack archaeology.

The engineering decision: we tried to be fancy and the data said no

Here is the part worth writing a post about.

FAISS gives you index_factory, a function that builds an index from a one-line description string. The sophisticated choice for "scale" is IVF-PQ: partition the vectors into clusters (IVF), then compress each vector with product quantization (PQ) so a million vectors fit in a sliver of RAM. It is genuinely clever. We reached for it first, because of course we did.

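import faiss

dim = 384

# Cosine similarity in FAISS is inner product over L2-normalized vectors
# (faiss.normalize_L2). Without the metric argument, index_factory uses L2.
metric = faiss.METRIC_INNER_PRODUCT

# the clever option (must be trained on sample vectors before adding data)
ivfpq = faiss.index_factory(dim, "IVF256,PQ32", metric)

# the boring option (no training step)
flat = faiss.index_factory(dim, "Flat", metric)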

Then we measured on our actual corpus, and the boring option won. Not narrowly. On both axes that matter.

The thing the PQ tutorials skip: IVF-PQ is a compression-and-approximation technique, and it pays off when you are drowning in vectors. Product quantization throws away precision to save memory, which costs recall. IVF searches only the few clusters nearest the query (the nprobe setting in FAISS), which costs more recall and can miss the true nearest neighbour entirely if it sits just over a cluster boundary. Those are reasonable trades at a hundred million vectors. They are pure downside at the scale of one team's CI history.
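The cluster-boundary miss is easy to reproduce by hand. This sketch uses two made-up 2-D clusters and plain Python, not FAISS; the nprobe parameter plays the same role as the FAISS setting of that name, but everything else is illustrative.

```python
import math

# Two coarse cells, each with a centroid and the vectors assigned to it.
centroids = {"left": (0.0, 0.0), "right": (10.0, 0.0)}
clusters = {
    "left": [(1.0, 0.0), (2.0, 3.0)],
    "right": [(5.2, 0.0), (9.0, 1.0)],
}

# The query sits just on the "left" side of the boundary at x = 5.
query = (4.8, 0.0)


def flat_search(q):
    """Exact search: compare against every stored vector."""
    everything = [v for members in clusters.values() for v in members]
    return min(everything, key=lambda v: math.dist(q, v))


def ivf_search(q, nprobe):
    """IVF-style search: only look inside the nprobe closest cells."""
    nearest_cells = sorted(centroids, key=lambda c: math.dist(q, centroids[c]))[:nprobe]
    candidates = [v for c in nearest_cells for v in clusters[c]]
    return min(candidates, key=lambda v: math.dist(q, v))


print(flat_search(query))     # the true neighbour, which lives in the "right" cell
print(ivf_search(query, 1))   # probing one cell never sees it
print(ivf_search(query, 2))   # probing both cells finds it, at flat-search cost
```

Raising nprobe recovers the neighbour, but on a small corpus you end up scanning most of the data anyway, with extra bookkeeping on top.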

And that scale is small. A busy repo might log tens of thousands of distinct failure signatures over a couple of years, not tens of millions. After the Log Signature Miner collapses near-duplicates, the working set is smaller still. At those sizes a flat index (plain exhaustive cosine over every stored vector) is not the fallback. It is the right answer.
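Some back-of-envelope arithmetic shows how little there is to compress. The 50,000-vector corpus is an assumed stand-in for "tens of thousands"; the 32 bytes per vector follows from PQ32 with the usual 8-bit codes, ignoring codebooks and IVF bookkeeping.

```python
def flat_bytes(n_vectors, dim=384, bytes_per_float=4):
    """Raw float32 storage for an uncompressed flat index."""
    return n_vectors * dim * bytes_per_float


def pq_bytes(n_vectors, code_bytes=32):
    """PQ32 at 8 bits per sub-quantizer: 32 bytes per vector, codebooks excluded."""
    return n_vectors * code_bytes


# Assumed corpus sizes: one team's CI history versus a web-scale collection.
for n in (50_000, 100_000_000):
    flat_mb = flat_bytes(n) / 1e6
    pq_mb = pq_bytes(n) / 1e6
    print(f"{n:,} vectors: flat {flat_mb:,.1f} MB, PQ codes {pq_mb:,.1f} MB")
```

At 50,000 vectors the uncompressed index is about 77 MB, which fits comfortably in memory on any CI box. At a hundred million it is roughly 154 GB, and that is where a 48x reduction starts to matter.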

The flat-first choice lines up with FAISS's own index-selection guidance and with what practitioners say out loud: a flat index gives exact, perfect-recall results and is the natural choice for small corpora, while IVF-PQ earns its keep only once you genuinely cannot afford exact search. The trap is treating "approximate and compressed" as automatically more advanced, and reaching for it before your data has earned it.

What we actually measured

On our internal corpus (low tens of thousands of failure signatures, 384-dimension embeddings), after switching the default from IVF-PQ to flat:

  • Recall@5 went from roughly 0.9 to 1.0. Flat is exhaustive, so it returns the true nearest neighbours by definition. The IVF-PQ index was quietly dropping a real prior failure out of the top 5 about one query in ten, the worst possible failure mode for a tool whose entire job is "have we seen this before."
  • Median query latency dropped, not rose. Sub-30ms per lookup on commodity CPU. At our corpus size the brute-force scan is so cheap that IVF-PQ's cluster-probing and dequantization overhead made it slower per query, on top of being less accurate.
  • We deleted a training step. PQ codebooks need fitting; flat indexes need nothing. Less code, no quantizer to retrain as the corpus drifts, fewer parameters to get wrong.
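For anyone repeating the measurement, here is one common way to compute recall@k: treat the flat index's results as ground truth and count how much of that top k the approximate index also returned. This is a sketch of that definition, not necessarily the exact harness behind the numbers above, and the ID lists are made up.

```python
def recall_at_k(exact_ids, approx_ids, k=5):
    """Mean fraction of the exact top-k that the approximate index also returned."""
    total = 0.0
    for exact, approx in zip(exact_ids, approx_ids):
        truth = set(exact[:k])
        total += len(truth & set(approx[:k])) / len(truth)
    return total / len(exact_ids)


# Two made-up queries: the approximate index nails the first and
# swaps one true neighbour for an impostor (id 99) on the second.
exact = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 14]]
approx = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 99]]

score = recall_at_k(exact, approx)
print(f"recall@5 = {score:.2f}")
```

Comparing a flat index against itself returns 1.0 by construction, which is why flat doubles as the ground truth for any other index you want to evaluate.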

To be clear about the framing: these are our numbers, on our data, at our scale. Your corpus may be bigger. That is exactly why the retriever is built on index_factory and not hard-wired to one index. The default is flat. When a corpus genuinely outgrows exact search, you change one string:

steps:
  - name: retrieve-similar-failures
    type: failure_retriever
    index_factory: "HNSW32"   # graph index: high recall, scales past flat
    top_k: 5

Our rough guidance on where each one wins:

  • Flat: up to a few hundred thousand vectors. Exact recall, trivial ops, often the fastest in practice. The default, and probably what you want.
  • HNSW: the next step up. A graph index that keeps recall high into the millions while staying fast, at the cost of more memory. The first thing we reach for when flat starts to drag.
  • IVF-PQ: tens of millions of vectors and up, or when memory is the binding constraint. Trades recall for a tiny footprint. Powerful, and almost certainly overkill for one team's test history.
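The same guidance as a small helper, purely as a sketch. The suggest_index name, the cut-offs of 300,000 and 20,000,000, and the "IVF4096,PQ32" string are assumptions chosen to match the rough ranges above; they are not Testhide defaults, and the right numbers should come from measuring your own corpus.

```python
def suggest_index(n_vectors, memory_bound=False):
    """Map corpus size to a FAISS factory string, using illustrative thresholds."""
    if memory_bound or n_vectors >= 20_000_000:
        # Compressed and approximate; the list count here is a placeholder.
        return "IVF4096,PQ32"
    if n_vectors > 300_000:
        # Graph index: high recall, more memory.
        return "HNSW32"
    # Exact search, no training step.
    return "Flat"


for n in (40_000, 2_000_000, 80_000_000):
    print(f"{n:,} vectors -> {suggest_index(n)}")
```

The helper only proposes a starting point. The recall and latency numbers on real queries are what should move you from one string to the next.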

The lesson

The headline is not "FAISS is great" (it is) or "embeddings beat grep" (they do). It is narrower and more useful: pick the index your data has earned, not the one that sounds impressive in the design review.

We assumed scale and reached for the scale tool. The measurement said our problem was small, and small problems deserve exact answers. Flat search gave us perfect recall, lower latency, and less code to maintain. The index_factory indirection means we did not paint ourselves into a corner; the day a corpus outgrows flat, it is a one-line change to HNSW or IVF-PQ. We just refuse to pay for that complexity before the data demands it.

"Have we seen this before?" should take one second and live inside the build, not forty minutes inside Slack. That is the whole feature.

Tests fail. AI explains why, and tells you who already fixed it.

Self-host it in one command: see /installation/. The Failure Retriever and the other seven diagnostic models are covered in /features/. And if you want failures flowing in from your existing test suite, the open report format and reporters are documented at /plugins/report-format/.

