
Insights on AI testing.

LLM evaluation, CI/CD patterns, and QA automation from the Testhide team.

· 8 min read · Testhide Team

Why your AI agent passes unit tests but fails in prod

Your agent is green in CI and broken in prod. Unit tests mock the model, freeze the tools, and pin the inputs, which deletes exactly the non-determinism, tool flakiness, and input drift that break agents in the wild. Here is how we close that gap at Testhide with eval-in-CI plus post-deploy monitoring in one system.

· 8 min read · Testhide Team

Visual regression in the age of LLM-generated UIs

AI coding tools regenerate your UI on every run, and naive pixel diffs drown in false positives until someone turns them off, right before a real layout break ships. Here is how we swapped the pixel count for a perceptual-plus-layout signal that ignores anti-aliasing noise and blocks actual regressions.
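
To make the idea concrete, here is a minimal sketch of a tolerance-plus-layout check. It is not the Testhide implementation: the `Box` type, the function names, and the thresholds are hypothetical, images are plain lists of grayscale rows, and a per-pixel intensity tolerance stands in for a real perceptual metric such as SSIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one UI element in pixels (hypothetical layout dump)."""
    x: int
    y: int
    w: int
    h: int


def naive_pixel_diff(a, b):
    """Count pixels that differ at all between two equal-sized grayscale images."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def tolerant_diff_ratio(a, b, tolerance=16):
    """Fraction of pixels whose intensity differs by more than `tolerance`.

    Small deltas (typical of anti-aliasing) are ignored. This is a crude
    stand-in for a perceptual metric, chosen to keep the sketch dependency-free.
    """
    total = sum(len(row) for row in a)
    changed = sum(
        abs(pa - pb) > tolerance
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
    )
    return changed / total


def layout_shift(before, after):
    """Largest move or resize of any element, in pixels.

    An element that disappeared counts as an infinite shift.
    """
    worst = 0
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            return float("inf")
        worst = max(
            worst,
            abs(old.x - new.x),
            abs(old.y - new.y),
            abs(old.w - new.w),
            abs(old.h - new.h),
        )
    return worst


def is_regression(img_a, img_b, boxes_a, boxes_b, max_ratio=0.01, max_shift=2):
    """Flag a regression if either the pixel signal or the layout signal trips."""
    return (
        tolerant_diff_ratio(img_a, img_b) > max_ratio
        or layout_shift(boxes_a, boxes_b) > max_shift
    )
```

With these assumed thresholds, a re-render that only nudges edge pixels by a few intensity levels still produces a non-zero naive diff but passes the check, while a button that moves 30 pixels fails it even if the pixel signal stays quiet.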

· 7 min read · Testhide Team

Self-hosted CI test intelligence in 5 minutes: a Docker quickstart

Cloud CI dashboards want your logs, your source, and sometimes your prompts. For regulated teams that is a non-starter. Here is how to stand up Testhide self-hosted with docker compose, register a .NET agent that runs the diagnostic models on your own boxes, and plug in your existing test suite unchanged.

· 8 min read · Testhide Team

LLM-as-judge in CI: a 30-second eval step

A prompt tweak passes every unit test, merges, and quietly turns your assistant into a one-word-answer machine. LLM behaviour has no red X in the PR. Here is how Testhide's llm_eval step puts a judge model in CI, scores output against a golden set in under 30 seconds, and blocks the merge before the regression reaches a customer.
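
As a rough illustration of what a judge-based gate does, here is a minimal sketch. It does not show the actual llm_eval step: `stub_judge`, `evaluate`, the golden-set shape, and the 0.7 threshold are all assumptions, and the stub scores by token overlap so the example runs without a model. In a real pipeline the judge would be a call to a judge model.

```python
from statistics import mean


def stub_judge(question, answer, reference):
    """Stand-in for a judge-model call (assumption, not a real API).

    Returns the share of reference tokens that appear in the answer,
    as a score between 0.0 and 1.0.
    """
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref) if ref else 0.0


def evaluate(assistant, golden_set, judge=stub_judge, threshold=0.7):
    """Score the assistant on every golden case and compare the mean to a threshold.

    Returns (passed, mean_score, per_case_scores).
    """
    scores = [
        judge(case["question"], assistant(case["question"]), case["reference"])
        for case in golden_set
    ]
    avg = mean(scores)
    return avg >= threshold, avg, scores


# Hypothetical golden set with a single case.
GOLDEN_SET = [
    {
        "question": "How do I reset my password?",
        "reference": "open settings then choose reset password",
    },
]


def helpful_assistant(question):
    return "Open settings then choose reset password"


def terse_assistant(question):
    # The "one-word-answer machine" regression.
    return "Settings"
```

A CI job built on this would call `evaluate` and exit non-zero when `passed` is false, which is what turns a behaviour regression into a red X on the PR.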

· 8 min read · Testhide Team

FAISS for failure retrieval: when brute force beats IVF-PQ

Every on-call engineer eventually asks "have we seen this before?" and then loses 40 minutes in Slack. Here is how Testhide's Failure Retriever answers it in under a second, and why at real CI corpus sizes a flat FAISS index beat IVF-PQ on both recall and latency in our data.
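
For readers new to the terminology: a flat index compares the query against every stored vector, so its results are exact, whereas IVF-PQ trades some recall for speed through clustering and compression. The sketch below shows the flat approach in plain Python. It is not the Failure Retriever itself: `FlatIndex` is a hypothetical class, the vectors are hand-written toy embeddings, and in practice a library index such as `faiss.IndexFlatIP` over normalized vectors would do the same exhaustive search much faster.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class FlatIndex:
    """Exhaustive (brute-force) cosine-similarity search over stored vectors."""

    def __init__(self):
        self._vectors = []
        self._payloads = []

    def add(self, vector, payload):
        """Store one embedding together with the failure it describes."""
        self._vectors.append(normalize(vector))
        self._payloads.append(payload)

    def search(self, query, k=3):
        """Return the k most similar payloads as (score, payload), best first."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), payload)
            for v, payload in zip(self._vectors, self._payloads)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]
```

Because every stored vector is scored, nothing can be missed by a bad cluster assignment or a lossy code; the cost is that query time grows linearly with the number of stored failures.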