The build that lied to me

Friday afternoon. A teammate ships a small prompt tweak to our support agent. The CI run is a wall of green: 41 unit tests passing, the eval suite we had at the time passing too. He merges and goes home. Good engineer, clean PR.

By Monday the agent is quietly refusing refunds it should approve. Not crashing. Not erroring. Just confidently doing the wrong thing in about one in every twenty real conversations. No alarm fired, because nothing failed. The tests were green because they were testing a version of the agent that does not exist in production.

That gap is the most expensive thing about shipping agents, and almost nobody's CI is built to catch it. Here is why it happens, and what we built to close it.

Why your unit tests are lying to you

The things that make a unit test fast, cheap, and deterministic are the exact things that delete the failure modes you care about.

A good agent unit test does three things, and each one quietly removes a real production risk:

  • It mocks the model. You stub the LLM call so the test runs fast and free. Now you are testing control flow against a frozen, hand-written response. The real model, with temperature, sampling, and a silent weekend weight update from your provider, never runs.
  • It freezes the tools. You mock the search API, the database, the payment call. So you never see the tool that times out, returns a slightly different schema, or hands back an empty list the agent did not expect.
  • It pins the inputs. Your fixtures are the conversations you imagined. Real users ask in an order you did not plan, in three languages, with typos and half-finished sentences.
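
To make the first bullet concrete, here is a minimal sketch of a test that stays green no matter what you do to the prompt. Every name in it (decide_refund, fake_llm, run_refund_check, the prompt strings) is hypothetical and only there for illustration; it is not our agent's real code.

```python
# Hypothetical agent code: every name here is illustrative only.
from typing import Callable


def decide_refund(llm: Callable[[str], str], prompt: str, message: str) -> bool:
    """Ask the model for a verdict and approve only on a leading 'APPROVE'."""
    reply = llm(f"{prompt}\n\nCustomer: {message}")
    return reply.strip().upper().startswith("APPROVE")


def fake_llm(_text: str) -> str:
    """Stub model: returns the same hand-written reply for any input."""
    return "APPROVE: item arrived damaged"


def run_refund_check(prompt: str) -> bool:
    """The 'unit test': control flow exercised against a frozen response."""
    return decide_refund(fake_llm, prompt, "My order arrived broken.")


# The check passes with the original prompt and with a tweak that could
# change the real model's behaviour, because the stub never reads the prompt.
old_prompt = "Approve refunds for damaged items."
new_prompt = "Be conservative with refunds."
results = [run_refund_check(old_prompt), run_refund_check(new_prompt)]
```

Both entries in results are True. The stub ignores its input, so the prompt change that matters in production is invisible to the test.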

A deterministic system fails loudly. A non-deterministic agent fails quietly and differently every run. Demos work because the inputs match the mental model of the person who built them, and the first real users immediately wander off that distribution. The failure is rarely the model in isolation either. It is the seam between the model, the retrieval step, the tool call, and the orchestration holding them together.

The unit test asserts your code is correct in a fake world. Production is not the fake world.

The two halves nobody connects

When teams notice their agent is flaky in prod, they reach for one of two fixes, and pick exactly one.

Camp one writes more eval tests. Golden datasets, LLM-as-judge, the works. Good instinct. But an eval suite is a static snapshot of inputs you thought of, and your users are a moving target. You can pass every eval and still drift away from reality the week after you deploy.

Camp two buys observability. A dashboard, traces, token counts. Also good. But a dashboard tells you something is wrong after a customer already felt it, and it lives in a separate tool nobody opens until an incident.

You need both, and they need to be the same system. Eval has to run in CI so a bad change blocks the PR before it ships. Monitoring has to watch prod so the drift you could not have written a test for gets caught after it ships. One without the other is half a seatbelt.

That is the whole reason Testhide exists. Test intelligence and LLM evaluation are first-class CI build steps, sitting right next to your unit and integration tests, and the same models keep watching once the change is live.

Half one: make behaviour a build step

The fix for the mocked-model problem is to stop mocking the model in the step that matters. We add an llm_eval step that runs the real agent against a golden set and scores the output with a judge. It runs in under thirty seconds in CI and blocks the merge if the score drops below threshold.

steps:
  - name: unit
    type: pytest
    on_fail: block_pr

  - name: agent-behaviour
    type: llm_eval
    dataset: golden/support_refunds.jsonl
    judge: gpt-4o
    snapshot:
      mode: semantic   # exact | fuzzy | semantic
      threshold: 0.85
    on_fail: block_pr

A few things matter here that are easy to skip:

  • The snapshot judge runs in three modes. Use exact for outputs that must not move, fuzzy for near-matches, and semantic when you care that the meaning held even if the wording drifted. Refund-or-not is semantic. A generated invoice ID is exact.
  • Every prompt goes through a prompt registry with SHA-256 dedup and semver, so the Friday-afternoon tweak is a versioned, diffable artifact rather than a mystery string. When a score moves, you can see which prompt version moved it.
  • The score lands right in the PR. Reviewers see "behaviour 0.82, below 0.85, blocked" before merge, and can open the Eval Explorer to read the exact transcripts that regressed.
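
The three modes can be sketched roughly like this. It is an illustration of the idea, not Testhide's implementation: fuzzy uses Python's difflib as a stand-in for whatever near-match metric you prefer, and toy_decision_judge is a hypothetical keyword check standing in for a real LLM or embedding judge.

```python
import difflib
from typing import Callable, Optional

Judge = Callable[[str, str], float]


def toy_decision_judge(expected: str, actual: str) -> float:
    """Hypothetical stand-in for an LLM judge: compares the refund decision only."""
    def decision(text: str) -> str:
        return "approve" if "approve" in text.lower() else "deny"
    return 1.0 if decision(expected) == decision(actual) else 0.0


def snapshot_score(expected: str, actual: str, mode: str,
                   judge: Optional[Judge] = None) -> float:
    """Score an output against its snapshot, returning a value in [0, 1]."""
    if mode == "exact":
        return 1.0 if expected == actual else 0.0
    if mode == "fuzzy":
        # Character-level similarity ratio between the two strings.
        return difflib.SequenceMatcher(None, expected, actual).ratio()
    if mode == "semantic":
        if judge is None:
            raise ValueError("semantic mode needs a judge callable")
        return judge(expected, actual)
    raise ValueError(f"unknown mode: {mode}")


def passes(expected: str, actual: str, mode: str, threshold: float = 0.85,
           judge: Optional[Judge] = None) -> bool:
    """True when the score meets the threshold (0.85 matches the config above)."""
    return snapshot_score(expected, actual, mode, judge) >= threshold


# Wording drifted, decision held: the semantic check passes.
same_meaning = passes("Refund approved.", "I have approved your refund.",
                      "semantic", judge=toy_decision_judge)
# Decision flipped: the semantic check blocks.
flipped = passes("Refund approved.", "Sorry, we cannot refund this order.",
                 "semantic", judge=toy_decision_judge)
```

The point of injecting the judge is that the threshold logic stays the same whichever scorer you plug in.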

The refund bug from the top of this post? A semantic snapshot judge over a refund golden set catches it on the PR, because the meaning of the agent's decision changed even though the code and the tests did not.
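
For the prompt registry mentioned above, the SHA-256 dedup and semver idea fits in a few lines. This is a toy sketch under my own assumptions, not the registry's actual code: the PromptRegistry class, its register method, and the "start at 1.0.0, bump patch by default" policy are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PromptRegistry:
    """Toy registry: per prompt name, an ordered list of (version, sha256)."""
    history: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def register(self, name: str, text: str, bump: str = "patch") -> str:
        if bump not in ("major", "minor", "patch"):
            raise ValueError(f"unknown bump: {bump}")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        versions = self.history.setdefault(name, [])
        # Dedup: identical text maps back to the version it already has.
        for version, known_digest in versions:
            if known_digest == digest:
                return version
        if not versions:
            new_version = "1.0.0"  # assumed starting version
        else:
            major, minor, patch = (int(p) for p in versions[-1][0].split("."))
            if bump == "major":
                major, minor, patch = major + 1, 0, 0
            elif bump == "minor":
                minor, patch = minor + 1, 0
            else:
                patch += 1
            new_version = f"{major}.{minor}.{patch}"
        versions.append((new_version, digest))
        return new_version


registry = PromptRegistry()
v1 = registry.register("support_refunds", "Approve refunds for damaged items.")
v1_again = registry.register("support_refunds", "Approve refunds for damaged items.")
v2 = registry.register("support_refunds", "Be conservative with refunds.")
```

Here v1 and v1_again are both "1.0.0" and the tweak becomes "1.0.1", which is what lets a score change be pinned to a specific prompt version.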

When it does fail: product bug or environment?

Catching a failure is half the battle. The other half is the 10pm question: is this our change, or is the world having a bad day? A flaky tool timeout and a genuine logic regression look identical in a red build, and engineers burn hours guessing.

The Root-Cause Classifier does that triage. It is a DistilBERT model that reads each failure from two inputs, the log output and the build metadata, and labels it a product bug, a test issue, or an environment problem. For agents that is the difference between "the model started reasoning differently" and "the search API was throwing 503s during the run." Same red X, completely different response.

Behind it sits the Flakiness Predictor: gradient-boosted trees over about 41 engineered features spanning temporal, failure-history, environment, and build-context signals, with PSI-based drift monitoring on top. It decides whether a flap is real signal or noise, and quarantines the genuinely flaky checks so they stop crying wolf. With agents, non-deterministic tests are the norm, not the exception, so triaging the noise is not optional.
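
If PSI (Population Stability Index) is new to you, it compares how one feature is distributed in a baseline window versus a recent window. A minimal sketch of the standard formula follows; the equal-width binning, the bin count, and the epsilon floor for empty bins are my assumptions, not a description of how the Flakiness Predictor bins its features.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back if the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # e.g. last month's values
shifted = [0.5 + i / 200 for i in range(100)]  # recent values, squeezed upward
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

Identical samples score 0.0 and the shifted sample scores well above it. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but treat that cutoff as a convention to tune, not a law.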

Half two: watch prod, because your tests are already stale

Here is what the eval-only camp misses. The day you freeze your golden set, your users start drifting off it. New question types, new phrasing, a new feature that changes what people ask. No CI test covers an input you have never seen.

So the same models keep running after deploy.

The Novelty / OOD Detector is an autoencoder trained on your normal input and failure distribution. When the agent starts hitting inputs or producing failures that look like nothing in that distribution, reconstruction error spikes and the emerging mode surfaces early, while it is still rare. That is your signal that real users have wandered off the map your golden set was drawn on.

The Emerging Issues Detector works on the time series instead, using changepoint detection plus STL seasonality decomposition. It separates "Tuesdays are always busier" from "something genuinely changed last Thursday at 14:00." When a new failure pattern trends up, it tells you it is trending, not just that it exists.
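
To show the shape of that idea without pulling in a stats library, here is a deliberately crude sketch. It swaps STL for a per-weekday mean and uses a one-sided CUSUM as the changepoint test, so it is a simplified stand-in for the technique, not the detector itself. The function names, the k and h parameters, and the sample data are all made up for illustration.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def weekday_residuals(series: Sequence[float], baseline_weeks: int) -> List[float]:
    """Subtract each weekday's baseline mean so 'Tuesdays are busier' cancels out."""
    n_base = baseline_weeks * 7
    profile = [mean(series[d:n_base:7]) for d in range(7)]
    return [x - profile[i % 7] for i, x in enumerate(series)]


def cusum_upward(residuals: Sequence[float], n_base: int,
                 k: float = 0.5, h: float = 5.0) -> Optional[int]:
    """One-sided CUSUM: index of the first upward-shift alarm, or None."""
    sigma = pstdev(residuals[:n_base]) or 1.0  # guard against a flat baseline
    s = 0.0
    for i in range(n_base, len(residuals)):
        # Accumulate standardized excess over the slack k; never go below zero.
        s = max(0.0, s + residuals[i] / sigma - k)
        if s > h:
            return i
    return None


# Daily failure counts: a Tuesday spike every week, four baseline weeks with
# small week-to-week wobble, one normal week, then a week shifted up by 4.
pattern = [10, 20, 10, 10, 10, 10, 10]
offsets = [1, -1, 1, -1, 0, 4]
series = [p + off for off in offsets for p in pattern]

alarm = cusum_upward(weekday_residuals(series, baseline_weeks=4), n_base=28)
# alarm == 36: the second day of the shifted week
```

The weekly Tuesday spike never raises an alarm because it is part of the baseline profile. The sustained shift does, within two days.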

The loop closes like this: OOD finds the weird new input in prod, you turn it into a fixture, it joins the golden set, and the next PR's llm_eval step is now defending against a failure mode your users discovered for you.
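
The "turn it into a fixture" step is small enough to sketch. The JSONL path matches the dataset in the config above, but the record fields (id, input, expected, source) and the add_fixture helper are assumptions of mine, not a documented golden-set schema.

```python
import hashlib
import json
from pathlib import Path


def add_fixture(golden_path: Path, user_input: str, expected: str) -> bool:
    """Append a reviewed prod input to the golden set; skip exact repeats."""
    key = hashlib.sha256(user_input.strip().encode("utf-8")).hexdigest()
    existing = set()
    if golden_path.exists():
        for line in golden_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                existing.add(json.loads(line).get("id"))
    if key in existing:
        return False  # already defended by the golden set
    record = {
        "id": key,
        "input": user_input,
        "expected": expected,   # filled in by a human reviewer, not the agent
        "source": "prod_ood",   # hypothetical tag for where the case came from
    }
    golden_path.parent.mkdir(parents=True, exist_ok=True)
    with golden_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

A call like add_fixture(Path("golden/support_refunds.jsonl"), flagged_input, "approve") returns True the first time and False on a repeat, so the golden set grows without duplicates. The expected value should come from a person who reviewed the transcript, otherwise you are just snapshotting the bug.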

What changed for us

Keeping this honest and limited to our own data, here is what changed after we moved agent behaviour into the build instead of a side dashboard:

  • Behaviour regressions that used to reach prod started getting blocked at the PR, because the eval step ran against the real model instead of a mock. Most of our agent escapes had been silently-wrong outputs, exactly the class a passing unit test cannot see.
  • Triage time on red agent builds dropped sharply once the Root-Cause Classifier was calling "environment" versus "product bug" up front. Engineers stopped opening logs just to discover a tool had timed out.
  • Two of our larger prod incidents would have been caught earlier by the OOD detector. The inputs that broke the agent were genuinely novel, the kind nobody writes a test for because nobody has imagined them yet.

No magic multipliers, no invented customer logos. Just fewer 3am surprises.

The lesson

Green CI on an agent means "my code is correct in a world I mocked." Necessary, nowhere near sufficient. Agents fail on the three things tests deliberately remove: a model that is non-deterministic and quietly updated, tools that misbehave, and inputs that drift the moment real users show up.

The fix is not more mocks or another dashboard. It is treating behaviour as a build step that blocks the PR, and keeping the same models watching after deploy so the failures you could not have predicted still get caught. Eval-in-CI plus post-deploy monitoring, in one system, sharing one definition of "wrong."

Testhide is self-hosted, so your golden sets, transcripts, and prod traffic never leave your infra. Spin it up with Docker Compose from the installation guide, see the eval step and the eight diagnostic models on the features page, and if you want your existing test output to flow straight in, the open report format spec has reporters for pytest, JS, and .NET.

Tests fail. Testhide explains why, before and after you ship.

