/ about
Built by people who lived the problem.
15+ years running QA at AAA scale taught us one thing: LLM evaluation belongs in the build step — not bolted on with webhooks after the merge is already done.
/ by the numbers
What Testhide actually is.
embedded in your CI pipeline
your entire eval pipeline
third-party servers
forever, no credit card
/ our story
Why we built Testhide.
We spent years building QA automation infrastructure at scale — millions of test runs, hundreds of flaky tests tracked by hand, and nightly triage sessions that ate into engineering cycles. When LLMs entered the picture, we watched teams bolt evaluation tools onto the side of their CI pipelines with webhooks, shell scripts, and manual review queues.
The problem wasn't the eval tools — Braintrust and Langfuse are excellent. The problem was that LLM evaluation was treated as a sidecar rather than a build step. Failures didn't block PRs. Results arrived after the merge. Teams ran three dashboards simultaneously and none of them talked to each other.
Testhide exists to fix that. One pipeline YAML. One dashboard. Eight AI models analyzing your failures automatically. Eval failures block merges — just like unit test failures have always done.
"The build is the right abstraction. Everything else is a workaround."
/ contact
Get in touch.
/ the ai engine
Eight models. One pipeline.
Every build automatically passes through a stack of specialized ML models. No configuration required — they run in the background and surface results in your dashboard.
Fine-tuned DistilBERT classifies every failure: infra error, regression, flaky, timeout, or OOM. Returns a confidence score you can audit. (Free tier)
Statistical model tracks per-test pass rates and failure windows over time. Flags genuinely unreliable tests before they waste CI budget. (Free tier)
FAISS vector index over your entire failure history. Shows the top similar failures from past builds, with their root causes and the commits that fixed them. (Cloud Starter+)
Out-of-distribution detection flags test inputs that deviate from your training distribution. Catch model drift and prompt regressions before they reach production. (Cloud Starter+)
Drain3-based template extraction clusters thousands of log lines into a handful of semantic patterns. Turns noise into signal, no regex required. (Cloud Starter+)
Correlates test failures with recent commits by comparing blame lines against failure stack traces. Points to the most likely breaking change automatically. (Cloud Starter+)
CLIP + trained MLP ensemble detects meaningful UI regressions in screenshot tests. Distinguishes cosmetic pixel noise from real functional regressions. (Cloud Starter+)
Local LLM agent (Phi-3.5-mini or Llama 3) performs a structured 7-stage deep diagnosis: sandbox isolation, signal gathering, reasoning, and a plain-English conclusion. Runs fully on-prem, no API key needed. (Cloud Team+)
/ under the hood
No black boxes. Every model named.
We publish the architecture, the model weights, and the confidence scores. If a model says it's 94% sure a test is flaky, you can inspect the features that drove that decision.
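To make "inspect the features" concrete, here is a minimal sketch of a flaky-test signal whose inputs stay visible next to the verdict. It is an illustration only, not Testhide's actual model: the function name `score_flakiness`, the 20-run window, and the 0.2 flip threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlakyReport:
    """The features behind the verdict, kept alongside it for auditing."""
    pass_rate: float   # share of passing runs in the window
    flip_rate: float   # share of adjacent runs whose outcome changed
    is_flaky: bool


def score_flakiness(history: List[bool],
                    window: int = 20,
                    flip_threshold: float = 0.2) -> FlakyReport:
    """Hypothetical heuristic: a test looks flaky when it alternates often.

    `history` is ordered oldest to newest, True meaning the run passed.
    A clean regression (pass, pass, fail, fail) flips once and stays
    below the threshold; an unreliable test flips repeatedly.
    """
    recent = history[-window:]
    n = len(recent)
    if n < 2:
        # Not enough runs to observe any alternation.
        pass_rate = float(sum(recent)) / n if n else 0.0
        return FlakyReport(pass_rate, 0.0, False)

    pass_rate = sum(recent) / n
    flips = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    flip_rate = flips / (n - 1)
    # Require mixed outcomes so an always-failing test is not called flaky.
    is_flaky = flip_rate >= flip_threshold and 0.0 < pass_rate < 1.0
    return FlakyReport(pass_rate, flip_rate, is_flaky)
```

Calling `score_flakiness([True, False, True, True, False, True])` returns a report with both rates attached, so a reviewer can see why the test was flagged instead of trusting a bare label.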
/ principles
How we build Testhide.
Self-hosted by default. Your test data, model traces, and LLM outputs stay on your infrastructure. We never see them.
One YAML file. One dashboard. No new concepts beyond what you already know from CI/CD. If you can write a GitHub Actions step, you can use Testhide.
Every AI feature is grounded in published research. No proprietary magic — explainable models with confidence scores and audit trails you can inspect.
Free self-hosted forever. Cloud tiers priced transparently. No surprise overages, no usage-based traps, no features gated behind a sales call.
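The idea that eval failures should block merges the same way unit test failures do comes down to an exit code. The sketch below shows that mechanism in plain Python. It does not use any Testhide API: the function names, the metric names, and the thresholds are hypothetical and exist only for the example.

```python
import sys
from typing import Dict, List


def failing_evals(scores: Dict[str, float],
                  thresholds: Dict[str, float]) -> List[str]:
    """Return the names of evals that are missing or below their minimum."""
    failures = []
    for name, minimum in sorted(thresholds.items()):
        score = scores.get(name)
        # A missing score counts as a failure: no result, no merge.
        if score is None or score < minimum:
            failures.append(name)
    return failures


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Return a process exit code: 0 lets the build pass, 1 blocks it."""
    failures = failing_evals(scores, thresholds)
    for name in failures:
        print(f"eval failed: {name}", file=sys.stderr)
    return 1 if failures else 0
```

A CI step would finish with something like `sys.exit(gate(scores, thresholds))`. Because most CI systems treat a non-zero exit as a failed step, a low eval score stops the PR without webhooks or a separate review queue.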