The form nobody wanted to sign
A few months back I watched a friend's team get blocked on a CI vendor for six weeks. The product was fine. The security review stalled on one line in the contract: build logs, source snapshots, and a copy of every prompt sent to the eval step would pass through, and be retained on, the vendor's infrastructure. The team shipped a payments product. Their compliance lead read that line, put the pen down, and asked the obvious question. Where does the data actually live?
That is the conversation a lot of cloud CI dashboards quietly lose. They are excellent right up until someone in legal reads the data-processing addendum and realizes your stack traces, which routinely leak customer identifiers and internal hostnames, are now sitting in a third party's object store. Engineers say the same thing in their own words on forums all the time: logs carry PII you never meant to ship, and once it crosses your perimeter your audit story depends on a vendor you cannot inspect.
Then there is the bill. The other reason teams come to us is that per-minute CI pricing punishes you for having a big test suite. A mid-size team can burn several thousand dollars a month on compute, and a chunk of that is pure waste: flaky retries, suites billed triple because they ran sequentially, credit math you need a spreadsheet to forecast. You grow your coverage like a responsible adult and the invoice treats it like a sin.
So the requirement is easy to state and annoying to satisfy. Keep the data on our infrastructure. Run the smart stuff there too. And do not make me rewrite my test suite to get it.
Why the obvious fixes do not quite work
The naive answer is "self-host the runner." Plenty of clouds let you bring your own compute. That solves the per-minute bill, sometimes, but it does nothing for the data question, because the intelligence layer (the part that classifies failures, embeds logs, judges LLM output) still calls home. You moved the muscle on-prem and left the brain in someone else's data center. The most sensitive artifact, the actual log text, is exactly what those models need to read.
The other naive answer is "turn off the AI and self-host plain CI." Now your data is safe and your pipeline is dumb again. You are back to a wall of red Xs at 2 AM with no idea whether this is a real regression, a flake, or the staging database falling over. The whole point of test intelligence was to stop paying engineers to triage that by hand.
What you actually want is for the intelligence to run where the data already is. That is the design Testhide is built around: tests fail, AI explains why, and the explaining happens on your hardware.
docker compose up, and the data never leaves
Testhide self-hosts as a Docker Compose stack, free to run. One control plane, your database, your object storage, all inside your perimeter. You add a build agent (ours is a .NET service) that runs your tests and the AI on the same box. Nothing about a failure has to leave your network to get explained.
Here is the shape of a quickstart. Clone the deploy repo, set a couple of secrets, bring it up.
git clone https://github.com/testhide/deploy testhide
cd testhide
cp .env.example .env
# edit .env: set THIDE_ADMIN_EMAIL, THIDE_SECRET_KEY, and your DB password
docker compose up -d
That gives you the control plane on localhost:8080. The first-run wizard creates an admin user and prints an agent enrollment token. Now register an agent. The agent is the part that touches your code, so it lives next to your code, on a runner inside your VPC or a box in the office.
docker run -d --name testhide-agent \
  -e THIDE_SERVER=https://ci.internal.example.com \
  -e THIDE_ENROLL_TOKEN=the-token-from-the-wizard \
  -e THIDE_EDGE_AI=on \
  -v /var/run/docker.sock:/var/run/docker.sock \
  testhide/agent:1
The piece that matters for privacy-conscious teams is THIDE_EDGE_AI=on. With it set, the agent runs the diagnostic models locally:
- An ONNX build of CLIP does the visual-diff pre-screen for screenshot tests, so perceptual comparison happens on the agent, not in a cloud GPU.
- An optional local GGUF LLM handles the Build Investigator's narrative root-cause write-up, if you would rather not call out to OpenAI, Anthropic, or Gemini for that. You point it at a model file and it runs in-process.
- Tool-calling during investigation happens in a read-only sandbox. The investigator can read logs and diffs and grep the workspace, but it cannot mutate your repo or reach the network. It looks; it does not touch.
If you are comfortable using a hosted judge for the LLM-eval step, plug in OpenAI, Anthropic, or Gemini and only that step talks out. If you are not, the local GGUF path keeps everything inside the fence. Your call, per step.
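To make the "it looks; it does not touch" idea concrete, here is a minimal sketch of a read-only tool surface. It is not Testhide's sandbox: the class name ReadOnlyWorkspace, the dispatch helper, and the two tool names are hypothetical, and a real sandbox would also need process and network isolation, which this does not attempt. It only shows the two cheap guarantees you can enforce in the dispatcher itself: an allowlist of read-only tools, and path confinement to the workspace root.

```python
import re
from pathlib import Path


class ReadOnlyWorkspace:
    """Hypothetical read-only view of a build workspace (illustration only)."""

    def __init__(self, root):
        self.root = Path(root).resolve()

    def _resolve(self, rel_path):
        # Resolve symlinks and ".." first, then refuse anything outside the root.
        candidate = (self.root / rel_path).resolve()
        if not candidate.is_relative_to(self.root):  # Python 3.9+
            raise PermissionError(f"{rel_path!r} escapes the workspace")
        return candidate

    def read_file(self, rel_path, max_bytes=65536):
        # Cap the read so a huge log cannot flood the model's context.
        with open(self._resolve(rel_path), "rb") as handle:
            return handle.read(max_bytes).decode("utf-8", errors="replace")

    def grep(self, pattern, rel_path="."):
        regex = re.compile(pattern)
        base = self._resolve(rel_path)
        files = [base] if base.is_file() else sorted(
            p for p in base.rglob("*") if p.is_file()
        )
        hits = []
        for path in files:
            # Skip symlinks that point outside the workspace.
            if not path.resolve().is_relative_to(self.root):
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((path.relative_to(self.root).as_posix(), number, line))
        return hits


# Only tools with no write or network side effects are exposed.
ALLOWED_TOOLS = {"read_file", "grep"}


def dispatch(workspace, tool, **kwargs):
    """Run a tool requested by the investigator, rejecting anything not allowlisted."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} is not allowed")
    return getattr(workspace, tool)(**kwargs)
```

A call such as dispatch(ws, "grep", pattern="ERROR") returns (path, line number, line) tuples, while a request for an unlisted tool or for a path like "../secrets" raises PermissionError before anything is touched.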
Pointing it at a repo
Connect your SCM (GitHub, GitLab, or Bitbucket, including self-managed instances) and Testhide writes status back as a normal commit check. The pipeline is YAML and lives in your repo, so each step is a first-class build step that can block the PR, not a dashboard you check after the fact. Matrix builds and parallel executors are part of the same config.
pipeline:
  - name: unit
    type: pytest
    on_fail: block_pr
  - name: dotnet-integration
    type: dotnet
    project: tests/Integration.Tests.csproj
    on_fail: block_pr
  - name: answer-quality
    type: llm_eval
    judge: local-gguf # or gpt-4o / claude / gemini if you allow egress
    golden_set: evals/support-bot.jsonl
    threshold: 0.85
    on_fail: block_pr
That last step is the LLM-as-judge eval: it scores model output against a golden set and blocks the merge below threshold, and it is built to finish in under 30 seconds inside CI so it sits next to your unit tests rather than off in a separate tool.
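If the gate logic sounds abstract, here is roughly what a threshold check over a golden set looks like. Treat it as a sketch under assumptions, not Testhide's implementation: the JSONL field names "input" and "expected", the function names, and the exact-match stand-in judge are all hypothetical, and averaging the per-case scores is just one plausible way to aggregate. A real judge would be a model call, local or hosted.

```python
import json
from statistics import mean


def load_golden_set(lines):
    """Parse JSONL lines into cases; the 'input'/'expected' keys are assumed."""
    return [json.loads(line) for line in lines if line.strip()]


def exact_match_judge(question, answer, expected):
    """Stand-in judge: 1.0 on a case-insensitive match, else 0.0.
    A real judge would be an LLM returning a graded score."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def run_gate(cases, generate, judge, threshold=0.85):
    """Score every case, average the scores, and compare against the threshold."""
    if not cases:
        raise ValueError("golden set is empty")
    scores = [
        judge(case["input"], generate(case["input"]), case["expected"])
        for case in cases
    ]
    score = mean(scores)
    return {"score": score, "passed": score >= threshold, "cases": len(scores)}
```

Here generate is whatever produces your system's answer for one input. The CI step would then fail the build whenever "passed" is False, which is the behavior on_fail: block_pr describes.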
When the integration step goes red, the agent runs the diagnostics right there. The Root-Cause Classifier (a multi-modal DistilBERT over the log plus build metadata) tags the failure as a product bug, a test issue, or an environment problem. The Flakiness Predictor, gradient-boosted trees over roughly 41 engineered features (temporal, failure history, environment, build context), decides whether this is a real regression worth blocking on or noise worth quarantining. The Failure Retriever runs sub-second similarity over log embeddings to answer the only question anyone asks at 2 AM: have we seen this before? All of it on your hardware, on your logs.
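The "have we seen this before?" lookup is, at its core, a nearest-neighbor search over embedding vectors. The sketch below shows that idea with plain cosine similarity in pure Python. It is an illustration only: the function names, the in-memory dictionary index, and the min_score cutoff are assumptions, and it says nothing about which embedding model or index structure the actual Failure Retriever uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=3, min_score=0.7):
    """Return up to k (failure_id, score) pairs, best match first.
    `index` maps a hypothetical failure id to its stored log embedding."""
    scored = [(cosine(query, vector), failure_id) for failure_id, vector in index.items()]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(reverse=True)
    return [(failure_id, score) for score, failure_id in kept[:k]]
```

A linear scan like this is fine for a few thousand stored failures. Past that you would normally swap in an approximate nearest-neighbor index, which is a separate design decision from the similarity measure itself.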
That is three of the eight diagnostic models. The rest (an OOD detector for emerging failure modes, the visual diff analyzer, a log signature miner, a Jira bug linker, and changepoint detection over test time-series) run the same way, on the agent.
Your existing tests plug in unchanged
This is the part teams do not believe until they try it. You do not rewrite tests to adopt Testhide. The agent consumes Report Format v1, a JUnit-extended XML, and we ship open-source reporters for the runners you already use: pytest, unittest, .NET (xUnit, NUnit, MSTest), and the JS stack (Jest, Mocha, Vitest, Playwright).
For the .NET integration step above, that is one package and one logger flag:
dotnet add package Testhide.Reporter.DotNet
dotnet test --logger "testhide;output=reports/results.xml"
The reporter emits the richer XML, the agent ingests it, and the models get clean structured input including the metadata they need to classify well. The spec is open and published at testhide.com/plugins/report-format/, so if your runner is exotic you can emit the format yourself. Nothing proprietary, nothing locked.
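If you do end up emitting the format yourself, the starting point is ordinary JUnit XML. The sketch below builds only that common baseline (testsuite, testcase, failure) with the standard library. The Testhide-specific extensions in Report Format v1 are defined in the published spec and are deliberately not guessed at here, and the function name and the shape of the input dictionaries are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_junit_xml(suite_name, results):
    """Serialize test results as a plain JUnit <testsuite> document.
    Each result is a dict with 'classname', 'name', 'time' (seconds) and an
    optional 'failure' dict holding 'message' and 'details'."""
    failures = sum(1 for result in results if result.get("failure"))
    total_time = sum(result["time"] for result in results)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
        time=f"{total_time:.3f}",
    )
    for result in results:
        case = ET.SubElement(
            suite,
            "testcase",
            classname=result["classname"],
            name=result["name"],
            time=f"{result['time']:.3f}",
        )
        failure = result.get("failure")
        if failure:
            node = ET.SubElement(case, "failure", message=failure["message"])
            node.text = failure.get("details", "")
    return ET.tostring(suite, encoding="unicode")
```

Write the returned string to the path your pipeline step reads (reports/results.xml in the .NET example), then layer on whatever extra attributes the spec requires.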
What we actually saw
A few honest numbers from running this on our own pipeline, framed as exactly that: our internal results, not a benchmark with your name on it.
After we moved our diagnostics onto the edge-AI agent, a failed build's first-pass root-cause note landed in well under a minute, with no egress, because the heavy lifting happened on the same box that ran the test. The CLIP pre-screen cut the number of visual diffs that needed any escalation by a wide margin, since most screenshot "failures" are sub-pixel rendering noise that perceptual distance shrugs off. And because the Flakiness Predictor quarantines noisy tests instead of blindly retrying them, we stopped paying the flake tax that quietly inflates so many CI bills.
Your mileage will vary with your suite and your hardware. That is the point of self-hosting. It is your hardware.
The lesson
The trade-off everyone assumes, smart CI or private CI, pick one, is a false choice. It only exists because most intelligence layers were built to phone home. Move the models to the agent, keep the agent inside your perimeter, and the data question and the AI question stop fighting. Failures get explained where they happen, your compliance lead keeps the pen capped, and your invoice stops scaling with your good intentions about test coverage.
Five minutes of docker compose up, one reporter package, and the tests you already have. That is the whole quickstart.
Start here: Installation walks the full self-hosted deploy and agent enrollment, Features covers the eight diagnostic models and the LLM-eval step in depth, and the Report Format v1 spec has the open reporters so your existing suite plugs in unchanged.