/ about

Built by people who lived the problem.

15+ years running QA at AAA scale taught us one thing: LLM evaluation belongs in the build step — not bolted on with webhooks after the merge is already done.

/ by the numbers

What Testhide actually is.

8 specialized AI models embedded in your CI pipeline
1 YAML file to configure your entire eval pipeline
0 test data sent to third-party servers
Self-hosted tier: free forever, no credit card

/ our story

Why we built Testhide.

We spent years building QA automation infrastructure at scale — millions of test runs, hundreds of flaky tests tracked by hand, and nightly triage sessions that ate into engineering cycles. When LLMs entered the picture, we watched teams bolt evaluation tools onto the side of their CI pipelines with webhooks, shell scripts, and manual review queues.

The problem wasn't the eval tools — Braintrust and Langfuse are excellent. The problem was that LLM evaluation was treated as a sidecar rather than a build step. Failures didn't block PRs. Results arrived after the merge. Teams ran three dashboards simultaneously and none of them talked to each other.

Testhide exists to fix that. One pipeline YAML. One dashboard. Eight AI models analyzing your failures automatically. Eval failures block merges — just like unit test failures have always done.

"The build is the right abstraction. Everything else is a workaround."
pytest, jest, xUnit, JUnit, Playwright, any CI system

/ the ai engine

Eight models. One pipeline.

Every build automatically passes through a stack of specialized ML models. No configuration required — they run in the background and surface results in your dashboard.

🎯
Root Cause Classifier

Fine-tuned DistilBERT classifies every failure as an infra error, regression, flaky test, timeout, or out-of-memory (OOM) crash. Returns a confidence score you can audit.

Free tier
📊
Flakiness Predictor

Statistical model tracks per-test pass rates and failure windows over time. Flags genuinely unreliable tests before they waste CI budget.

Free tier
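
A minimal sketch of the idea, not Testhide's actual model: keep a sliding window of recent outcomes per test and flag tests whose results flip often. The class name, the window size, and the flip-rate threshold are all assumptions made for illustration.

```python
from collections import deque


class FlakinessTracker:
    """Hypothetical tracker: sliding window of pass/fail outcomes per test."""

    def __init__(self, window=20, min_runs=5, threshold=0.2):
        self.window = window        # how many recent runs to remember (assumed)
        self.min_runs = min_runs    # do not judge a test on too little data
        self.threshold = threshold  # flip rate at or above this counts as flaky
        self.history = {}

    def record(self, test_id, passed):
        runs = self.history.setdefault(test_id, deque(maxlen=self.window))
        runs.append(bool(passed))

    def pass_rate(self, test_id):
        runs = self.history.get(test_id)
        if not runs:
            return None
        return sum(runs) / len(runs)

    def flip_rate(self, test_id):
        # Fraction of consecutive run pairs whose outcome changed.
        runs = list(self.history.get(test_id, ()))
        if len(runs) < 2:
            return 0.0
        flips = sum(1 for a, b in zip(runs, runs[1:]) if a != b)
        return flips / (len(runs) - 1)

    def is_flaky(self, test_id):
        runs = self.history.get(test_id, ())
        if len(runs) < self.min_runs:
            return False
        return self.flip_rate(test_id) >= self.threshold


tracker = FlakinessTracker()
for outcome in [True, False, True, False, True, True]:
    tracker.record("test_checkout", outcome)
for outcome in [False] * 6:
    tracker.record("test_broken", outcome)
```

Using the flip rate rather than the raw pass rate is one way to separate an unreliable test from a consistently broken one: here `test_checkout` is flagged, while `test_broken` fails every time and is left to be treated as a regression.
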
🔍
Failure Retriever

FAISS vector index over your entire failure history. Shows the most similar failures from past builds, along with their root causes and the commits that fixed them.

Cloud Starter+
🧬
OOD Detector

Out-of-distribution detection flags test inputs that deviate from your training distribution, so you catch model drift and prompt regressions before they reach production.

Cloud Starter+
📝
Log Signature Miner

Drain3-based template extraction clusters thousands of log lines into a handful of semantic patterns. Turns noise into signal — no regex required.

Cloud Starter+
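
To give a feel for template extraction, here is a heavily simplified sketch. It is not Drain3 and does not use its API: it only mimics the merging step, grouping lines with the same token count and replacing the tokens that differ with a wildcard. The function name and the similarity threshold are assumptions.

```python
WILDCARD = "<*>"


def mine_templates(lines, threshold=0.5):
    """Cluster log lines into templates (simplified stand-in for Drain3)."""
    clusters = []  # each entry is [template_tokens, line_count]
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        best, best_score = None, 0.0
        for cluster in clusters:
            template = cluster[0]
            if len(template) != len(tokens):
                continue  # only compare lines with the same token count
            same = sum(1 for t, tok in zip(template, tokens) if t == tok)
            score = same / len(tokens)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            # Keep shared tokens, mask the positions that differ.
            best[0] = [t if t == tok else WILDCARD
                       for t, tok in zip(best[0], tokens)]
            best[1] += 1
        else:
            clusters.append([tokens, 1])
    return [(" ".join(template), count) for template, count in clusters]


logs = [
    "Connection to 10.0.0.1 timed out",
    "Connection to 10.0.0.2 timed out",
    "Worker 7 exited with code 137",
    "Worker 9 exited with code 1",
]
templates = mine_templates(logs)
```

With these four lines, `templates` holds two patterns, `Connection to <*> timed out` and `Worker <*> exited with code <*>`, each seen twice.
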
🔗
Bug Linker

Correlates test failures with recent commits by comparing blame lines against failure stack traces. Points to the most likely breaking change automatically.

Cloud Starter+
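
One way to picture the correlation, as a sketch under stated assumptions rather than Testhide's implementation: look up which commit last touched each line in the stack trace, ignore commits that are not recent, and weight frames nearer the top of the trace more heavily. The function name, the input shapes, and the 1/(depth+1) weighting are hypothetical; in practice the blame map could be built from `git blame` output.

```python
from collections import Counter


def rank_commits(stack_frames, blame, recent_commits):
    """Rank recent commits by how many stack-trace lines they last touched.

    stack_frames:   list of (path, line_number), top of the trace first
    blame:          dict mapping (path, line_number) -> commit SHA
    recent_commits: set of SHAs considered candidate breaking changes
    """
    scores = Counter()
    for depth, frame in enumerate(stack_frames):
        sha = blame.get(frame)
        if sha is None or sha not in recent_commits:
            continue
        # Frames closer to the failure point count for more (assumed weighting).
        scores[sha] += 1.0 / (depth + 1)
    return scores.most_common()


frames = [("app/cart.py", 42), ("app/api.py", 10), ("lib/http.py", 99)]
blame = {
    ("app/cart.py", 42): "abc123",
    ("app/api.py", 10): "def456",
    ("lib/http.py", 99): "old000",
}
ranking = rank_commits(frames, blame, recent_commits={"abc123", "def456"})
```

Here `ranking[0]` is the commit that last changed the line at the top of the trace, and `old000` is dropped because it is not among the recent commits.
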
👁
Visual Diff Analyzer

CLIP + trained MLP ensemble detects meaningful UI regressions in screenshot tests. Distinguishes cosmetic pixel noise from real functional regressions.

Cloud Starter+
🤖
AI Investigator Agent

A local LLM agent (Phi-3.5-mini or Llama 3) performs a structured 7-stage diagnosis, with stages that include sandbox isolation, signal gathering, reasoning, and a plain-English conclusion. Runs fully on-prem, with no API key needed.

Cloud Team+

/ under the hood

No black boxes. Every model named.

We publish the architecture, the model weights, and the confidence scores. If a model says it's 94% sure a test is flaky, you can inspect the features that drove that decision.

Backend
Python / aiohttp, MongoDB, Redis, Docker Compose, SSE streaming
AI / ML
HuggingFace Transformers, DistilBERT, sentence-transformers, FAISS, CLIP (ViT-B/32), llama.cpp (GGUF), Drain3, scikit-learn
Frontend
Angular 17, Angular Material, Server-Sent Events, WebSocket RPC
CI Agent (.NET 8)
C# / .NET 8, pytest adapter, jest adapter, xUnit adapter, JUnit adapter, ONNX Runtime (Edge AI), Docker pool

/ principles

How we build Testhide.

🔐
Privacy first

Self-hosted by default. Your test data, model traces, and LLM outputs stay on your infrastructure. We never see them.

Developer experience

One YAML file. One dashboard. No new concepts beyond what you already know from CI/CD. If you can write a GitHub Actions step, you can use Testhide.

🔬
Research-backed ML

Every AI feature is grounded in published research. No proprietary magic — explainable models with confidence scores and audit trails you can inspect.

💡
Honest pricing

Free self-hosted forever. Cloud tiers priced transparently. No surprise overages, no usage-based traps, no features gated behind a sales call.