May 29, 2026 · 8 min read · Testhide Team

Why your AI agent passes unit tests but fails in prod

Your agent is green in CI and broken in prod. Unit tests mock the model, freeze the tools, and pin the inputs, which deletes exactly the non-determinism, tool flakiness, and input drift that break agents in the wild. Here is how we close that gap at Testhide with eval-in-CI plus post-deploy monitoring in one system.

May 28, 2026 · 8 min read · Testhide Team

Visual regression in the age of LLM-generated UIs

AI coding tools regenerate your UI on every run, and naive pixel diffs drown in false positives until someone turns them off — right before a real layout break ships. Here is how we swapped the pixel count for a perceptual-plus-layout signal that ignores anti-aliasing noise and blocks actual regressions.

May 26, 2026 · 7 min read · Testhide Team

Self-hosted CI test intelligence in 5 minutes: a Docker quickstart

Cloud CI dashboards want your logs, your source, and sometimes your prompts. For regulated teams that is a non-starter. Here is how to stand up Testhide self-hosted with docker compose, register a .NET agent that runs the diagnostic models on your own boxes, and plug in your existing test suite unchanged.

May 23, 2026 · 8 min read · Testhide Team

LLM-as-judge in CI: a 30-second eval step

A prompt tweak passes every unit test, merges, and quietly turns your assistant into a one-word-answer machine. LLM behaviour has no red X in the PR. Here is how Testhide's llm_eval step puts a judge model in CI, scores output against a golden set in under 30 seconds, and blocks the merge before the regression reaches a customer.

May 21, 2026 · 8 min read · Testhide Team

FAISS for failure retrieval: when brute force beats IVF-PQ

Every on-call engineer eventually asks "have we seen this before?" and then loses 40 minutes in Slack. Here is how Testhide's Failure Retriever answers it in under a second, and why at real CI corpus sizes a flat FAISS index beat IVF-PQ on both recall and latency in our data.

May 19, 2026 · 7 min read · Testhide Team

Building a flakiness predictor: the features that actually matter

Blanket retries keep CI green by disguising real regressions as noise. Here is how we built Testhide's Flakiness Predictor on ~41 engineered features, which feature families actually earn their keep, and how it powers smart quarantine instead of retry-and-hope.

May 15, 2026 · 7 min read · Testhide Team

Regex vs Drain3 for log mining: we kept both

Drain3 is the industry standard for log template extraction. We had hand-rolled regex master patterns. The "right answer" turned out to be both. Here is why.

May 8, 2026 · 8 min read · Testhide Team

Why we built CI/CD-native LLM evaluation (and why Braintrust wasn't enough)

Braintrust and Langfuse weren't enough for us. Here is why LLM evaluation needs to be a first-class CI/CD build step, and what happens when it isn't.

Insights on AI testing.