The 3 AM prompt change

It starts at 3 AM. A senior engineer, coffee in hand, merges a change to the prompt that powers your chat feature. The change looks good: temperature down to 0.7, max_tokens tweaked, a small system prompt refinement. The unit tests pass. The integration tests pass. Looks solid.

At 7 AM, the first customer complaint lands: "My assistant is giving one-word answers now." By 9 AM, there's a critical incident. By 10 AM, you've rolled back the prompt and filed a post-mortem. By noon, you're adding manual QA steps to your release process.

This is the problem Testhide was built to solve.

The eval-tool category, briefly

The AI eval space is healthy. Braintrust, Langfuse, LangSmith — they're all strong platforms. I respect the work. Langfuse ships open-source code with 10k+ GitHub stars. Braintrust raised $36M. LangSmith is part of a $10B+ company. They've built dashboards that let you compare prompt versions, grade outputs, and iterate on evals. That's valuable.

But here's what they are not: they are not in your CI/CD system. They live in a browser tab. Your prompts live in your Git repo. Your evals live in Braintrust's dashboard. When you push a prompt change, your CI runs unit tests and integration tests — but it doesn't run the eval. When you see an eval result, you're looking at a dashboard that has no idea what your build number is or whether you've shipped the code yet.

You're juggling tools. That's the real problem.

What's missing in the seam

The seam between "eval" and "CI" is where things break:

CI doesn't see the eval. You push a prompt. Your pipeline runs. Green checkmark. But if the eval failed in Braintrust, CI has no idea. Your PR gets merged anyway.
Eval doesn't run on every PR. You open a pull request. Braintrust doesn't know about it. You manually click "evaluate this version" in the dashboard. Days later, you get results. Your PR is already stale.
Failure modes aren't reported the same way. A unit test fails → red X in your PR check. An eval fails → an email from Braintrust → a dashboard tab you forgot about → your code ships anyway.
No version control. You can't see when the eval changed. Was it the prompt? The model? The dataset? Good luck digging through Braintrust's activity log.

The result: engineers treat eval failures as suggestions, not blockers. The 3 AM prompt change happens because the system didn't make it hard to ship bad code.

What CI-native looks like

CI-native means: when you push a prompt change, your pipeline runs the eval automatically. It takes <30 seconds. It scores the output against your golden test set. If the score drops below your threshold, the PR check turns red. You can't merge broken code.

That's it. It's the same primitive that made unit tests non-negotiable 20 years ago.
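Here's a minimal sketch of that gate in Python. It is not Testhide's implementation. The golden set, the phrase-matching scorer, the 0.8 threshold, and the `fake_generate` stand-in for a model call are all assumptions chosen so the example runs offline.

```python
# Hypothetical golden set: each case pairs a prompt with phrases the answer must contain.
GOLDEN_SET = [
    {"prompt": "How do I reset my password?", "must_include": ["reset", "email"]},
    {"prompt": "What is your refund policy?", "must_include": ["refund", "30 days"]},
]

THRESHOLD = 0.8  # assumed pass mark; tune per project


def score_case(output: str, must_include: list[str]) -> float:
    """Fraction of required phrases present in the output (case-insensitive)."""
    text = output.lower()
    hits = sum(1 for phrase in must_include if phrase.lower() in text)
    return hits / len(must_include)


def run_eval(generate, golden_set) -> float:
    """Mean score across the golden set. `generate` maps a prompt to model output."""
    scores = [score_case(generate(c["prompt"]), c["must_include"]) for c in golden_set]
    return sum(scores) / len(scores)


def gate(score: float, threshold: float = THRESHOLD) -> int:
    """Exit code for CI: 0 passes the check, 1 fails it."""
    return 0 if score >= threshold else 1


def fake_generate(prompt: str) -> str:
    # Stand-in for a real model call; mimics the "one-word answers" regression.
    return "Yes."


def main() -> int:
    score = run_eval(fake_generate, GOLDEN_SET)
    print(f"eval score: {score:.2f} (threshold {THRESHOLD})")
    return gate(score)
```

In a pipeline step, ending the script with `sys.exit(main())` is what would turn the PR check red, since CI systems treat a non-zero exit code as a failed step.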

LLM eval as a first-class build step means:

It runs on every commit. No manual clicks. No separate dashboard.
It blocks merges if it fails (unless you explicitly bypass it).
It shows up in your build history, timestamped, versioned, auditable.
It integrates with your existing CI tooling: GitHub Actions, GitLab CI, Jenkins, Bitbucket Pipelines.
It lives next to your unit tests, integration tests, visual regression tests — all one pipeline.

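That's the thesis: if you treat LLM eval like a real test (because it is), the failure modes above stop being the default. CI sees the eval, the eval runs on every PR, and a failed eval blocks the merge like any other red check.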
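The "timestamped, versioned, auditable" point can be illustrated with a small sketch: store each score next to digests of the prompt, model, and dataset that produced it, so a later diff shows which input changed. The record layout and the function names here are hypothetical, not a Testhide schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(text: str) -> str:
    """Short, stable SHA-256 digest of a text input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def eval_record(prompt: str, model: str, dataset: str, score: float, commit: str) -> dict:
    """Bundle a score with digests of everything that produced it."""
    return {
        "commit": commit,
        "model": model,
        "prompt_sha": fingerprint(prompt),
        "dataset_sha": fingerprint(dataset),
        "score": score,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def changed_inputs(before: dict, after: dict) -> list[str]:
    """Names of the inputs that differ between two records."""
    return [k for k in ("prompt_sha", "model", "dataset_sha") if before[k] != after[k]]


# Illustrative values only: same model and dataset, different prompt.
old = eval_record("You are a helpful assistant.", "model-a", "golden-v1", 0.91, "abc123")
new = eval_record("You are a terse assistant.", "model-a", "golden-v1", 0.42, "def456")
print(json.dumps(new, indent=2))
print(changed_inputs(old, new))  # ['prompt_sha']
```

With records like these in the build history, "Was it the prompt? The model? The dataset?" becomes a lookup instead of an investigation.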

The 8 ML models we built in

Real CI-native eval requires real production ML. We couldn't just wrap Braintrust's API. So we built 8 models trained on 15+ years of CI/CD data:

Root-Cause Classifier: Is this a product bug, test issue, or environment failure? (DistilBERT, multi-modal)
Flakiness Predictor: Is this a real regression or noise? (gradient-boosted, engineered features)
Failure Retriever: Find similar failures across history. (sub-second latency)
Novelty Detector: Surface emerging failure modes early.
Visual Diff Analyzer: Catch UI regressions. (pixel + perceptual)
Log Signature Miner: Deduplicate log noise. (regex-based template extraction)
Bug Linker: Auto-link failures to Jira tickets.
Emerging Issues Detector: Time-series monitoring for shifts in test behavior.

These aren't "AI-powered" in the vague marketing sense. These are specific models with specific metrics, trained on specific data, running in production today.
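To make one of these concrete, here is a rough sketch of what regex-based template extraction can look like for log deduplication. The post says nothing about the Log Signature Miner beyond that phrase, so the masking patterns, the placeholder tokens, and the sample log lines below are all assumptions made for illustration.

```python
import re
from collections import Counter

# Order matters: mask the most specific patterns first.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TS>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),
]


def template(line: str) -> str:
    """Replace variable parts of a log line with placeholder tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def dedupe(lines) -> Counter:
    """Count how many raw lines collapse into each template."""
    return Counter(template(line) for line in lines)


# Made-up log lines for the example.
logs = [
    "2024-05-01T03:00:01Z worker-3 timeout after 30s",
    "2024-05-01T03:00:09Z worker-7 timeout after 45s",
    "2024-05-01T03:01:12Z segfault at 0x7ffe12ab",
]

for tpl, count in dedupe(logs).most_common():
    print(count, tpl)
```

The two timeout lines collapse into a single template, which is the whole point: thousands of near-identical lines become a handful of signatures you can actually read.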

What's next

Testhide is free to self-host. Cloud tiers start at $49/month. Enterprise on-prem deployments are available for teams that need them.

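The category is forming. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Weak testing feeds directly into that last category, and the teams that want to stay out of that statistic will demand better eval tooling. The platforms that make eval a native CI/CD primitive will win.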

Install free. Explore features. Build AI features with real confidence.


Why we built CI/CD-native LLM evaluation (and why Braintrust wasn't enough)