The prompt change that passed every test
Tuesday afternoon. Someone trims a few sentences off the support assistant's system prompt, tightens the tone, opens a PR. Every unit test is green. The JSON schema validates. Latency is flat. It merges in under an hour.
Around dinner, the Slack thread starts. "Is the bot okay? It just answered a customer with 'Yes.' and nothing else." The edit had quietly taught the model to be terse to the point of uselessness. One-word replies. Technically correct, completely unhelpful.
Here is the uncomfortable part: nothing was broken in any way our tooling could see. No exception. No failed assertion. No red X on the PR. The code was fine. The behaviour regressed, and behaviour was the one thing none of our tests were watching.
If you ship anything backed by an LLM, you have lived a version of this. The model's output is your product, and it is the one part of the product with no test gate in front of it.
Why the usual fixes miss it
The instinct is to reach for tools we already trust. Each one falls short in a specific way.
Unit tests on the prompt string. You can assert the prompt contains certain words, or that the output parses as JSON. That catches structural breakage and nothing else. "Yes." parses fine.
Exact-match assertions on output. Tempting, and it holds for about a week. Then a model update nudges the phrasing, every assertion fails at once, and you burn an afternoon rewriting goldens that were never actually wrong. The team learns to ignore the suite, which is worse than not having one.
A human eyeballs the diff. What most teams actually do. It does not scale, it varies by reviewer, and it leans on someone imagining every input that might break. The terse-answer regression sailed through review because the single example the reviewer tried happened to be a long question.
A separate eval dashboard. Plenty of good eval tools exist. But if the eval lives in a panel someone opens after merging, it is a postmortem instrument, not a guardrail. By the time the chart turns red, the regression is already in production. The signal arrives. It just arrives too late to matter.
One pattern runs through all four: quality gets checked somewhere other than the place that can stop a bad change. The entire point of CI is that the gate sits in the path. LLM quality has to live in that same path.
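To make the first two failure modes concrete, here is a minimal sketch in Python. The payload shape (a JSON object with a reply field) and both helper names are assumptions made for illustration, not anything from our codebase. The structural check happily passes the one-word answer, and the exact-match check fails an answer that is merely reworded.

```python
import json

# Hypothetical assistant payloads: one helpful, one regressed, one reworded.
helpful = json.dumps({"reply": "Yes, you can change the delivery address "
                               "until the order ships. Open Orders and choose Edit address."})
terse = json.dumps({"reply": "Yes."})
reworded = json.dumps({"reply": "Yes. You can edit the delivery address "
                                "from Orders any time before the order ships."})


def structural_check(raw: str) -> bool:
    """Schema-style test: valid JSON with a non-empty string 'reply' field."""
    payload = json.loads(raw)
    reply = payload.get("reply")
    return isinstance(reply, str) and len(reply) > 0


def exact_match_check(raw: str, golden: str) -> bool:
    """Golden-string test: passes only when the reply is identical to the golden."""
    return json.loads(raw)["reply"] == golden


golden_reply = json.loads(helpful)["reply"]

print(structural_check(terse))                    # the useless answer passes
print(exact_match_check(reworded, golden_reply))  # the perfectly good answer fails
```

Neither check is wrong about what it measures. They just measure structure and spelling, and the regression lived in neither.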
The fix: make the judge a build step
Testhide's whole premise is that test intelligence and LLM evaluation belong next to your unit and integration tests as first-class CI steps, not in a side dashboard. Same YAML, same pass/fail, same merge gate. If quality drops below threshold, the build goes red and the PR is blocked. No "I'll check the chart later."
The mechanism is an llm_eval step: it runs your prompt against each input in a golden set, hands every output to a judge model along with a scoring rubric and the reference answer, and aggregates the scores. In our pipelines it finishes in under 30 seconds, because a golden set of a few dozen cases is small and the judge calls run in parallel.
Here is what it looks like in a pipeline:
steps:
  - name: unit-tests
    type: pytest
    on_fail: block_pr
  - name: assistant-quality
    type: llm_eval
    judge: gpt-4o                    # OpenAI, Anthropic, or Google Gemini
    prompt: [email protected]        # pinned from the prompt registry
    golden_set: goldens/support.yaml
    snapshot: semantic               # exact | fuzzy | semantic
    threshold: 0.82
    trials: 3                        # median score, to tame judge noise
    on_fail: block_pr
A few lines in that block are carrying real weight.
The judge scores against goldens, not in a vacuum. Each golden is an input plus a reference answer and a rubric ("must directly answer, must stay polite, must not invent policy"). The judge reads the candidate output and scores it. We aggregate the per-case scores, compare the result against the threshold and the last known-good baseline, and block the merge if it drops.
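The gating logic itself is small. What follows is a sketch under stated assumptions, not Testhide's implementation: the Golden fields, the mean as the aggregate, and the max_drop tolerance against the baseline are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class Golden:
    """One golden case: an input, a reference answer, and rubric lines."""
    case_id: str
    user_input: str
    reference: str
    rubric: list[str]


def gate(scores: dict[str, float], threshold: float,
         baseline: float, max_drop: float = 0.05) -> tuple[bool, float]:
    """Return (passed, aggregate) for one eval run.

    Assumption: the aggregate is the mean of per-case scores, and the run
    fails if it is below the threshold or has dropped more than max_drop
    from the last known-good baseline.
    """
    aggregate = mean(scores.values())
    passed = aggregate >= threshold and (baseline - aggregate) <= max_drop
    return passed, aggregate


example = Golden(
    case_id="refund-window",
    user_input="Can I still return this?",
    reference="Yes, returns are accepted within the stated window. Here is how to start one.",
    rubric=["must directly answer", "must stay polite", "must not invent policy"],
)

ok, agg = gate({"refund-window": 0.90, "address-change": 0.88}, threshold=0.82, baseline=0.90)
bad, bad_agg = gate({"refund-window": 0.90, "address-change": 0.52}, threshold=0.82, baseline=0.90)
print(ok, bad)
```

Checking against the baseline as well as the threshold is the design choice worth copying: a run that slides from 0.95 to 0.83 still clears 0.82, but it is a regression you probably want to hear about.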
trials: 3 is there because judges are noisy. Score the same case twice and you might get 0.81, then 0.74. We run each case a few times and take the median. It costs a little more in judge tokens and cuts false failures sharply. A flaky gate that cries wolf gets muted within a week, so this is not optional polish.
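The median trick fits in a few lines. This is a sketch, assuming the judge is just a callable that returns a score; scripted_judge is a hypothetical stand-in that replays fixed scores so the example runs without calling a model.

```python
from statistics import median
from typing import Callable, Iterable

Judge = Callable[[str, str], float]  # (candidate, reference) -> score


def median_score(judge: Judge, candidate: str, reference: str, trials: int = 3) -> float:
    """Score the same case several times and keep the median."""
    return median([judge(candidate, reference) for _ in range(trials)])


def scripted_judge(scores: Iterable[float]) -> Judge:
    """Hypothetical stand-in for a real judge call: replays fixed scores."""
    it = iter(scores)

    def judge(candidate: str, reference: str) -> float:
        return next(it)

    return judge


noisy = scripted_judge([0.81, 0.74, 0.83])
score = median_score(noisy, "candidate answer", "reference answer")
print(score)  # the 0.74 outlier is discarded
```

With an odd number of trials the median is always one of the scores the judge actually returned, so a single low outlier cannot drag the result down the way it would with a mean.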
snapshot sets how strict the comparison is. Use exact for output that must be byte-identical, like a formatted invoice line. Use fuzzy to tolerate whitespace and ordering. Use semantic when "Sure, I can help with that" and "Of course, happy to help" should both pass. The terse-answer regression dies right here: a one-word reply scores far below the reference on a semantic rubric, every single time.
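One way to picture the three modes is as three comparison functions of increasing tolerance. The sketch below is an assumption about what each mode means, based only on the descriptions above: fuzzy is modelled as whitespace-normalised, order-insensitive line comparison, and semantic delegates to a scorer. The stub_scorer is a deliberately fake stand-in for a judge call so the example runs offline.

```python
import re
from typing import Callable


def exact(candidate: str, reference: str) -> bool:
    """Pass only on identical strings."""
    return candidate == reference


def _normalize(text: str) -> list[str]:
    """Collapse whitespace within lines, drop blank lines, ignore line order."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return sorted(line for line in lines if line)


def fuzzy(candidate: str, reference: str) -> bool:
    """Tolerate whitespace and ordering differences."""
    return _normalize(candidate) == _normalize(reference)


def semantic(candidate: str, reference: str,
             scorer: Callable[[str, str], float], threshold: float = 0.82) -> bool:
    """Pass when a judge-style scorer rates the candidate at or above threshold."""
    return scorer(candidate, reference) >= threshold


def stub_scorer(candidate: str, reference: str) -> float:
    """Fake judge for the demo only: punishes one-word replies, nothing more."""
    return 0.1 if len(candidate.split()) <= 1 else 0.9


reference = "Of course, happy to help"
print(exact("Sure, I can help with that", reference))                  # False
print(semantic("Sure, I can help with that", reference, stub_scorer))  # True
print(semantic("Yes.", reference, stub_scorer))                        # False
```

In a real pipeline the scorer would be the judge model applying the rubric; the stub exists only to show where that call plugs in.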
A prompt registry, so you know what you're testing
You cannot gate on a prompt you cannot pin down. In most codebases prompts are f-strings scattered across three files, mutated at runtime, impossible to diff cleanly.
Testhide keeps a prompt registry instead. Every version is content-addressed with a SHA-256 hash, so identical prompts deduplicate on their own, and tagged with a semantic version. The pinned prompt reference in the YAML above points to one exact, immutable artifact. When the eval fails, you know precisely which prompt produced the regression, and you can diff 2.3.0 against 2.2.0 to find the one sentence that did it.
This is the part people underrate. The registry turns "the prompt changed somehow" into "this line, this commit, this score delta."
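Content addressing plus version tags is simple enough to sketch with the standard library. This is an illustration of the idea, not the registry's actual code: the PromptRegistry class, its method names, and the prompt name "support" are all hypothetical.

```python
from __future__ import annotations

import difflib
import hashlib


class PromptRegistry:
    """Toy content-addressed store: text keyed by SHA-256, tags point at hashes."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}  # sha256 hex digest -> prompt text
        self._tags: dict[str, str] = {}   # "name@version" -> sha256 hex digest

    def publish(self, name: str, version: str, text: str) -> str:
        ref = f"{name}@{version}"
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if ref in self._tags and self._tags[ref] != digest:
            raise ValueError(f"{ref} is immutable and already points to other content")
        self._blobs.setdefault(digest, text)  # identical text is stored once
        self._tags[ref] = digest
        return digest

    def get(self, ref: str) -> str:
        return self._blobs[self._tags[ref]]

    def diff(self, old_ref: str, new_ref: str) -> list[str]:
        """Unified diff between two pinned versions, one entry per line."""
        return list(difflib.unified_diff(
            self.get(old_ref).splitlines(), self.get(new_ref).splitlines(),
            fromfile=old_ref, tofile=new_ref, lineterm=""))


registry = PromptRegistry()
registry.publish("support", "2.2.0", "You are a support assistant.\nAnswer fully and politely.")
registry.publish("support", "2.3.0", "You are a support assistant.\nBe brief.")
print("\n".join(registry.diff("support@2.2.0", "support@2.3.0")))
```

The diff output isolates the changed instruction, which is exactly the "this line" half of "this line, this commit, this score delta."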
When it fails, you get an explanation, not a number
A red build that just says score 0.71 < 0.82 is annoying. You still have to go figure out why.
This is where Testhide's broader job kicks in. Tests fail, AI explains why. The Eval Explorer shows the per-case breakdown for a failing run: which goldens dropped, the judge's reasoning for each, and the candidate output side by side with the reference. For our terse-answer case, it would have surfaced a cluster of cases scoring near zero on "directly and completely answers," with the judge noting that the responses were a single word. The fix becomes obvious in the time it takes to read three rows.
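The per-case breakdown boils down to comparing each case against its baseline and sorting by how far it fell. Here is a rough sketch of that idea; the CaseResult fields, the min_drop cutoff, and the sample data are invented for illustration and do not reflect the Eval Explorer's real data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    score: float      # score in the failing run
    baseline: float   # score in the last known-good run
    judge_note: str   # the judge's stated reasoning


def regressed(results: list[CaseResult], min_drop: float = 0.1) -> list[CaseResult]:
    """Cases whose score fell by at least min_drop, worst drop first."""
    drops = [r for r in results if r.baseline - r.score >= min_drop]
    return sorted(drops, key=lambda r: r.baseline - r.score, reverse=True)


results = [
    CaseResult("refund-window", 0.05, 0.90, "Response is a single word."),
    CaseResult("address-change", 0.88, 0.90, "Complete and polite."),
    CaseResult("invoice-copy", 0.30, 0.85, "Answers, but omits the steps."),
]

for r in regressed(results):
    print(f"{r.case_id}: {r.baseline:.2f} -> {r.score:.2f}  ({r.judge_note})")
```

Sorting by drop rather than by raw score puts the cases that changed at the top, which is what the author of the PR actually needs to see.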
End to end, the merge-blocking flow:
push to PR branch
-> agent runs pipeline
-> unit-tests: pass
-> assistant-quality (llm_eval): 0.71 (median of 3) < 0.82
-> on_fail: block_pr -> PR marked failing, merge button disabled
-> Eval Explorer link posted: 6 of 40 goldens regressed
-> author sees the "answers too short" cluster, reverts the prompt trim
-> re-run: 0.89 -> green -> merge
The regression never reaches a customer. It dies in the PR, which is the only place a regression should ever die.
What it cost us, honestly
Numbers from our own usage, not a benchmark deck.
- A 40-case golden set, with gpt-4o as judge and trials: 3, runs in roughly 20 to 30 seconds in CI. It runs alongside our other steps on the TPS parallel executors, so it does not extend the critical path of the build.
- After we put a real golden set behind the gate, prompt-quality regressions stopped showing up in production review and started showing up in PRs. The category did not vanish. It moved left, which is the whole point.
- The trials median was the line between a gate people trusted and a gate people muted. Our first single-trial version flapped enough that someone asked to turn it off in week two. Three trials settled it.
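The timing in the first bullet only holds because judge calls overlap instead of running one after another. A minimal sketch of that fan-out, assuming a thread pool is an acceptable model for I/O-bound judge requests; fake_judge, its sleep, and the worker count are placeholders, not measurements from our setup.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def fake_judge(case_id: int) -> float:
    """Placeholder for a network call to the judge model."""
    time.sleep(0.01)  # stands in for request latency
    return 0.9


def score_case(case_id: int, trials: int = 3) -> float:
    """Median of several judge calls for one golden case."""
    return median([fake_judge(case_id) for _ in range(trials)])


def score_all(case_ids: list[int], workers: int = 16) -> dict[int, float]:
    """Score every case concurrently; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(case_ids, pool.map(score_case, case_ids)))


scores = score_all(list(range(40)))
print(len(scores), min(scores.values()))
```

Threads are a reasonable fit here because the work is waiting on a remote API rather than burning CPU. The real ceiling is usually the judge provider's rate limit, not the executor.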
One caveat worth stating plainly: an LLM judge is not ground truth. It is a cheap, consistent reviewer that never gets tired and never skips a case, and it will occasionally disagree with a human. The goal is not a perfect oracle. It is to make a quality regression as loud and as early as a failing unit test. That bar is very reachable.
The lesson
For years we treated LLM output as the one thing in the product you could only inspect by hand, and only after shipping. That made silent quality regressions structurally inevitable. There was no gate, so things slipped through the gap where a gate should have been.
Putting the judge in CI closes the gap. The prompt change that turns your assistant monosyllabic now shows up the same way a null dereference does: a red build, an explanation, and a merge button you cannot click until it is fixed. Behaviour becomes testable, and testable things stop surprising you at dinner.
To wire this into your own pipeline, the self-hosted deploy is a single Docker Compose command and your data never leaves your infra. Start at /installation/, read the full llm_eval and prompt-registry docs under /features/, and if you want your existing test output flowing in alongside it, the open Report Format spec and reporters live at /plugins/report-format/.
Tests fail. AI explains why. Now that includes the tests you used to run in your head.