/ installation

Self-hosted in under 15 minutes.

13 Docker services pulled from Docker Hub — application core + observability stack. One compose file. Your data stays on your infrastructure.

System requirements

Testhide runs 13 containers — backend × 2 replicas, AI inference + training workers, MongoDB, Redis, sidecar, plus a full observability stack (Prometheus + Grafana + Tempo + Loki + Alloy). Size your server accordingly.

Minimum

CPU: 8 cores

RAM: 32 GB

Disk: 100 GB SSD

OS: Linux (Ubuntu 22.04+)

Docker: 24.0+ · Compose v2

AI Worker is capped at 12 GB and AI API at 6 GB. On 32 GB hosts, reduce AI_ASSIST_INFER_WORKERS and skip YOLO training with AI_YOLO_MIN_ANNOTATIONS=50.

Recommended

CPU: 16 cores

RAM: 64 GB

Disk: 500 GB SSD

OS: Linux (Ubuntu 22.04+)

Docker: 24.0+ · Compose v2

Comfortable headroom for concurrent builds, AI training, observability retention, and log streaming via Alloy/Loki.

Memory per service

frontend: 256 MB

backend ×2: 2.5 GB each

ai-api: 1.5–6 GB elastic

ai-worker: 5–12 GB elastic

mongo + redis: 3.7 GB

observability: ~1.8 GB total

AI API and AI Worker share an elastic memory contract via Redis — sum of their mem_limit can exceed physical RAM safely (anti-correlated workloads).
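As a rough sanity check on the memory contract, the sketch below adds up the mem_limit and mem_reservation values from the compose file on this page. The table is transcribed by hand for illustration (it is not read from Docker), and the helper name total_mb is hypothetical.

```python
# Memory figures in MB, transcribed by hand from the compose file on this page.
# Each entry is (mem_limit, mem_reservation); reservation is None where only
# a limit is configured. backend appears twice because it runs 2 replicas.
SERVICES = {
    "frontend": (256, None),
    "backend-1": (2560, None),
    "backend-2": (2560, None),
    "ai-api": (6144, 1500),
    "ai-worker": (12288, 5120),
    "mongo": (3072, None),
    "redis": (768, None),
    "redisinsight": (256, None),
    "sidecar-docker": (128, None),
    "prometheus": (512, None),
    "grafana": (256, None),
    "tempo": (512, None),
    "loki": (256, None),
    "alloy": (256, None),
}


def total_mb(use_reservations: bool) -> int:
    """Sum memory in MB, counting reservations instead of limits if asked."""
    total = 0
    for limit, reservation in SERVICES.values():
        if use_reservations and reservation is not None:
            total += reservation
        else:
            total += limit
    return total


if __name__ == "__main__":
    print(f"every limit reached:    {total_mb(False) / 1024:.1f} GB")
    print(f"AI at reservation only: {total_mb(True) / 1024:.1f} GB")
```

With these figures the totals come to 29824 MB (about 29.1 GB) when every limit is reached and 18012 MB (about 17.6 GB) when both AI services sit at their reservations, which is the gap the elastic contract relies on.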

/ 13-service architecture

Edge

Nginx · Angular :80 · :443 · :7771

↓

Application

Backend ×2 REST · WebSocket

AI API LLM · CLIP · FAISS

AI Worker Train · YOLO · Edge

Sidecar Docker socket proxy

↓

Data

MongoDB 8.2 :27017

Redis 7 cache · pub/sub

Redis Insight admin UI (optional)

↓

Observability

Prometheus metrics · :9090 (loopback)

Grafana dashboards · :3000 (HTTPS)

Tempo OTel traces · :4317-4318

Loki log storage · :3100

Alloy log/metric collector

C# Agents (.NET 6) run on your build hosts and connect to Backend via WebSocket — Windows, Linux, or macOS. A one-shot config-init container syncs baked observability configs into named volumes on every --force-recreate deploy.

Download `docker-compose.yaml`

All images are pre-built on Docker Hub — no source code needed. Create a deployment directory and save this file.

bash — prepare deployment directory

$ mkdir -p /opt/testhide/{ssl,data/mongo,ssh_keys,sandbox_data,releases,monitoring_scripts}

$ cd /opt/testhide

# Download docker-compose.yaml and .env template

$ curl -O https://testhide.com/static/landing/docker-compose.yaml

$ curl -O https://testhide.com/static/landing/.env.example && cp .env.example .env

docker-compose.yaml: 13 services · latest images · ~17 KB

.env.example: all variables · defaults included · ~9 KB

docker-compose.yaml

# ==========================================
# Testhide — Production Docker Compose (13 services)
# Images: hub.docker.com/u/thuesdays
# Download the full file via the button above.
# ==========================================

name: testhide

services:

  # ── Config-init (one-shot, syncs baked configs to named volumes) ─
  config-init:
    image: thuesdays/testhide-backend:latest
    container_name: testhide-config-init
    entrypoint: []
    command: ["bash", "/app/scripts/sync-configs-to-volumes.sh"]
    environment:
      - SKIP_LLM_CHECK=true
      - LOAD_AI_MODELS=false
      - RUN_AI_WORKER=false
      - DISABLE_CRON=true
    volumes:
      - grafana_dashboards:/mnt/grafana_dashboards
      - grafana_provisioning:/mnt/grafana_provisioning
      - loki_config:/mnt/loki_config
      - prometheus_config:/mnt/prometheus_config
      - tempo_config:/mnt/tempo_config
      - alloy_config:/mnt/alloy_config
    restart: "no"

  # ── Frontend (Nginx + Angular SPA, TLS termination) ─────────────
  frontend:
    container_name: testhide_frontend
    image: thuesdays/testhide-frontend:latest
    restart: unless-stopped
    ports: [ "80:80", "443:443", "7771:7771" ]
    env_file: [ .env ]
    environment:
      - CERT_FILE=${CERT_FILE}
      - CERT_KEY=${CERT_KEY}
    volumes:
      - ./ssl:/etc/ssl:ro
      - testhide_static_data:/usr/share/nginx/html/static:ro
    depends_on: [ backend, ai-api ]
    networks: [ testhide-net ]
    mem_limit: 256m

  # ── Backend (Python API + WebSocket, 2 replicas) ─────────────────
  backend:
    image: thuesdays/testhide-backend:latest
    restart: unless-stopped
    env_file: [ .env ]
    environment:
      - LOAD_AI_MODELS=false
      - RUN_AI_WORKER=false
      - ENVIRONMENT=production
      - SERVICE_NAME=testhide
      - BUILD_RCA_RETRAIN_THRESHOLD=9999999   # MUST NEVER train
      - LOG_LEVEL=${LOG_LEVEL:-INFO}
      - SIDECAR_URL=http://sidecar-docker:8081
      - CORS_ALLOWED_ORIGINS=${CORS_ALLOWED_ORIGINS:-${PUBLIC_URL}}
    volumes:
      - testhide_static_data:/app/static
      - ./releases:/app/releases
      - ./monitoring_scripts:/app/monitoring_scripts
      - ./sandbox_data:/app/sandbox_data
      - ./ssh_keys:/app/ssh_keys     # persists across deploys
      - ./docker-compose.yaml:/app/docker-compose.yaml:ro
      - testhide_hf_cache:/app/hf_cache
    depends_on:
      mongo: { condition: service_healthy }
      redis: { condition: service_healthy }
      sidecar-docker: { condition: service_healthy }
    networks: [ testhide-net, testhide-internal ]
    mem_limit: 2.5g
    deploy:
      replicas: 2
      resources:
        limits: { cpus: "2.0", memory: 2.5g }

  # ── AI API (LLM / CLIP / FAISS inference — HTTP only) ────────────
  # Elastic memory: 1.5 GB reservation, 6 GB limit. Anti-correlated
  # with ai-worker via Redis (ai:training:in_progress).
  ai-api:
    container_name: testhide_ai_api
    image: thuesdays/testhide-backend:latest
    env_file: [ .env ]
    environment:
      - LOAD_AI_MODELS=true
      - RUN_AI_WORKER=false
      - DISABLE_CRON=true
      - HF_HOME=/app/hf_cache
      - HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-0}
      - RC_MODEL_NAME=${RC_MODEL_NAME:-distilbert-base-uncased}
      - CORS_ALLOWED_ORIGINS=${CORS_ALLOWED_ORIGINS:-${PUBLIC_URL}}
    volumes:
      - testhide_static_data:/app/static
      - testhide_hf_cache:/app/hf_cache
      - ./sandbox_data:/app/sandbox_data
      - ./releases:/app/releases
    depends_on:
      mongo: { condition: service_healthy }
      redis: { condition: service_healthy }
    networks: [ testhide-net ]
    mem_limit: 6g
    mem_reservation: 1500m

  # ── AI Worker (training, vectorization, Edge AI) ─────────────────
  # Elastic memory: 5 GB reservation, 12 GB limit.
  ai-worker:
    container_name: testhide_ai_worker
    image: thuesdays/testhide-backend:latest
    env_file: [ .env ]
    environment:
      - LOAD_AI_MODELS=true
      - RUN_AI_WORKER=true
      - DISABLE_CRON=true
      - AI_MEMORY_LIMIT_MB=${AI_MEMORY_LIMIT_MB:-7500}
      - AI_HARD_EXIT_PERCENT=${AI_HARD_EXIT_PERCENT:-88}
      - AI_LLM_CTX=${AI_LLM_CTX:-8192}
      - AI_MERGE_BATCH_SIZE=${AI_MERGE_BATCH_SIZE:-150}
      # AI Pipeline streaming (Phase 1 + Phase 2 §3.7)
      - AI_DATASET_SHARD_ROWS=${AI_DATASET_SHARD_ROWS:-50000}
      - AI_COMPACT_AFTER_SHARDS=${AI_COMPACT_AFTER_SHARDS:-32}
      - AI_BUDGET_SAFETY_FACTOR=${AI_BUDGET_SAFETY_FACTOR:-0.5}
      - AI_TRAINING_LOCK_TTL_SEC=${AI_TRAINING_LOCK_TTL_SEC:-3600}
      - AI_VECTORS_SIDECAR_ENABLED=${AI_VECTORS_SIDECAR_ENABLED:-1}
      - AI_YOLO_FORCE_CPU=${AI_YOLO_FORCE_CPU:-true}
      - HF_HOME=/app/hf_cache
      - RC_MODEL_NAME=${RC_MODEL_NAME:-distilbert-base-uncased}
    volumes:
      - testhide_static_data:/app/static
      - testhide_hf_cache:/app/hf_cache
      - ./sandbox_data:/app/sandbox_data
      - ./releases:/app/releases
    depends_on:
      mongo: { condition: service_healthy }
      redis: { condition: service_healthy }
    networks: [ testhide-net ]
    mem_limit: 12g
    mem_reservation: 5g

  # ── MongoDB 8 ──────────────────────────────────────────────────
  mongo:
    container_name: testhide_mongo
    image: mongo:8.2.3
    restart: unless-stopped
    command: ["mongod","--auth","--wiredTigerCacheSizeGB","2"]
    environment:
      - MONGO_INITDB_ROOT_USERNAME=${MONGO_USER}
      - MONGO_INITDB_ROOT_PASSWORD=${MONGO_PASS}
    ports: [ "27017:27017" ]
    volumes: [ "${MONGO_DATA_PATH}:/data/db" ]
    networks: [ testhide-net ]
    healthcheck:
      test: ["CMD","mongosh","--quiet","-u","${MONGO_USER}","-p","${MONGO_PASS}","--authenticationDatabase","admin","--eval","db.adminCommand('ping')"]
      interval: 30s
      timeout: 5s
      retries: 5
    mem_limit: 3g

  # ── Redis 7 ───────────────────────────────────────────────────
  redis:
    container_name: testhide_redis
    image: redis:7-alpine
    command: ["redis-server","--maxmemory","512mb","--maxmemory-policy","allkeys-lru","--appendonly","no","--requirepass","${REDIS_PASSWORD}"]
    networks: [ testhide-net ]
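    mem_limit: 768m
    # NOTE: this excerpt shows no healthcheck for redis, yet backend, ai-api,
    # ai-worker and redisinsight wait for `condition: service_healthy`.
    # Use the full downloaded file rather than copying this excerpt.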

  # ── Redis Insight (optional admin UI, proxied via nginx) ───────
  redisinsight:
    container_name: testhide_redisinsight
    image: redis/redisinsight:latest
    depends_on:
      redis: { condition: service_healthy }
    networks: [ testhide-net ]
    mem_limit: 256m

  # ── Docker Socket Sidecar (SEC-004) ────────────────────────────
  # Sidecar token auto-derived from JWT_SECRET via HMAC-SHA256.
  sidecar-docker:
    container_name: testhide_sidecar_docker
    image: thuesdays/testhide-sidecar:latest
    environment:
      - JWT_SECRET=${JWT_SECRET}
      - SIDECAR_ALLOWED_IMAGES=${SIDECAR_ALLOWED_IMAGES}
      - SIDECAR_ALLOWED_EXEC_PATTERNS=${SIDECAR_ALLOWED_EXEC_PATTERNS}
      - SIDECAR_ALLOWED_NETWORKS=${SIDECAR_ALLOWED_NETWORKS}
      - SIDECAR_PORT=8081
    volumes: [ "/var/run/docker.sock:/var/run/docker.sock" ]
    networks: [ testhide-internal ]
    mem_limit: 128m
    healthcheck:
      test: ["CMD","wget","-qO-","http://localhost:8081/health"]

  # ── Observability: Prometheus (metrics) ────────────────────────
  # Loopback bind by default. Override via PROMETHEUS_PORT_BIND.
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: testhide-prometheus
    depends_on:
      config-init: { condition: service_completed_successfully }
    volumes:
      - prometheus_config:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle
      - --web.enable-remote-write-receiver
    ports: [ "${PROMETHEUS_PORT_BIND:-127.0.0.1:9090}:9090" ]
    networks: [ testhide-net ]
    mem_limit: 512m

  # ── Observability: Grafana (HTTPS, dashboards) ─────────────────
  grafana:
    image: grafana/grafana:10.4.2
    container_name: testhide-grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?Set in .env}
      GF_SERVER_ROOT_URL: ${GRAFANA_ROOT_URL:-${PUBLIC_URL}:3000}
      GF_SERVER_PROTOCOL: https
      GF_SERVER_CERT_FILE: /etc/grafana/ssl/server.crt
      GF_SERVER_CERT_KEY: /etc/grafana/ssl/server.key
      GF_FEATURE_TOGGLES_ENABLE: traceqlEditor
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - grafana_provisioning:/etc/grafana/provisioning:ro
      - grafana_dashboards:/var/lib/grafana/dashboards:ro
      - grafana_data:/var/lib/grafana
      - ${GRAFANA_CERT_HOST_CRT:-./ssl/testhide.crt}:/etc/grafana/ssl/server.crt:ro
      - ${GRAFANA_CERT_HOST_KEY:-./ssl/testhide.key}:/etc/grafana/ssl/server.key:ro
    ports: [ "3000:3000" ]
    depends_on:
      config-init: { condition: service_completed_successfully }
      prometheus:  { condition: service_started }
      tempo:       { condition: service_started }
      loki:        { condition: service_started }
    networks: [ testhide-net ]
    mem_limit: 256m

  # ── Observability: Tempo (distributed traces, OTel) ────────────
  tempo:
    image: grafana/tempo:2.4.1
    container_name: testhide-tempo
    command: ["-config.file=/etc/tempo/tempo.yml"]
    depends_on:
      config-init: { condition: service_completed_successfully }
    volumes:
      - tempo_config:/etc/tempo:ro
      - tempo_data:/tmp/tempo
    ports: [ "4317:4317", "4318:4318", "3200:3200" ]
    networks: [ testhide-net ]
    mem_limit: 512m

  # ── Observability: Loki (log storage) ──────────────────────────
  loki:
    image: grafana/loki:2.9.4
    container_name: testhide-loki
    command: -config.file=/etc/loki/loki.yml
    depends_on:
      config-init: { condition: service_completed_successfully }
    volumes:
      - loki_config:/etc/loki:ro
      - loki_data:/loki
    ports: [ "3100:3100" ]
    networks: [ testhide-net ]
    mem_limit: 256m

  # ── Observability: Grafana Alloy (log/metric collector) ────────
  alloy:
    image: grafana/alloy:v1.5.1
    container_name: testhide-alloy
    volumes:
      - alloy_config:/etc/alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - testhide_static_data:/var/log/testhide:ro
    ports: [ "12345:12345" ]
    depends_on:
      config-init: { condition: service_completed_successfully }
      loki:       { condition: service_started }
      prometheus: { condition: service_started }
    networks: [ testhide-net ]
    mem_limit: 256m

volumes:
  testhide_static_data:    { name: testhide_static_data }
  testhide_hf_cache:       { name: testhide_hf_cache }
  prometheus_data:         { name: testhide_prometheus_data }
  grafana_data:            { name: testhide_grafana_data }
  tempo_data:              { name: testhide_tempo_data }
  loki_data:               { name: testhide_loki_data }
  # Config volumes — populated by config-init from the image.
  grafana_dashboards:      { name: testhide_grafana_dashboards }
  grafana_provisioning:    { name: testhide_grafana_provisioning }
  loki_config:             { name: testhide_loki_config }
  prometheus_config:       { name: testhide_prometheus_config }
  tempo_config:            { name: testhide_tempo_config }
  alloy_config:            { name: testhide_alloy_config }

networks:
  testhide-net:      { name: testhide-net }
  testhide-internal: { driver: bridge, internal: true }

Configure `.env`

Edit .env. Every variable is listed below. Variables marked REQUIRED must be set before first boot; all others have working defaults.
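A small pre-flight check can catch an unset REQUIRED variable before the first docker compose up. The sketch below is one way to do it, assuming a plain KEY=VALUE file without quoting or export prefixes; the REQUIRED list is collected from the entries tagged REQUIRED on this page, and the function names are hypothetical.

```python
from pathlib import Path

# Variables tagged REQUIRED on this page.
REQUIRED = [
    "PUBLIC_URL", "API_URL", "WS_URL", "FRONTEND_PUBLIC_URL",
    "JWT_SECRET", "CERT_FILE", "CERT_KEY",
    "MONGO_USER", "MONGO_PASS", "MONGO_DATA_PATH",
    "REDIS_PASSWORD", "GRAFANA_ADMIN_PASSWORD",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """Return REQUIRED names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_required(parse_env(env_path.read_text()))
        print("missing:", ", ".join(missing) if missing else "none")
    else:
        print("no .env file in the current directory")
```

Run it from the deployment directory; a commented-out line such as `# JWT_SECRET=...` counts as missing.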

Domain & Frontend

PUBLIC_URL

REQUIRED

PUBLIC_URL=https://testhide.yourcompany.com

Your public-facing URL. Must match SSL cert domain exactly. Used by CORS_ALLOWED_ORIGINS, GRAFANA_ROOT_URL, Angular bundles.

API_URL

REQUIRED

API_URL=https://testhide.yourcompany.com:7771

Angular embeds this at build time. Port 7771 is the API port published by the Nginx frontend (the same port WS_URL uses).

WS_URL

REQUIRED

WS_URL=wss://testhide.yourcompany.com:7771

WebSocket endpoint for agent connections and real-time build logs.

FRONTEND_PUBLIC_URL

REQUIRED

FRONTEND_PUBLIC_URL=https://testhide.yourcompany.com

Used for generating absolute links in notifications and emails.

CORS_ALLOWED_ORIGINS

default: PUBLIC_URL

CORS_ALLOWED_ORIGINS=https://testhide.yourcompany.com,https://staging.example.com

Comma-separated allow-list. Defaults to PUBLIC_URL. Add staging/dev origins if needed. Previous wildcard default * raised a startup security warning.

ENVIRONMENT
PRODUCTION
DEBUG

default: production / true / false

ENVIRONMENT=production
PRODUCTION=true
DEBUG=false

Set ENVIRONMENT=production in all prod/staging deployments — SEC-002 refuses startup with a default JWT_SECRET in those modes. Keep DEBUG=false for security headers and to disable stacktrace responses.

TESTHIDE_TRUST_PROXY

default: 0

TESTHIDE_TRUST_PROXY=1

Backend sits behind nginx — set to 1 so client IPs are read from X-Forwarded-For (rate limits, ban store, IP allowlists work correctly). Default 0 uses the TCP peer.
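To illustrate what the flag changes, here is a minimal sketch of client-IP resolution. The backend's real logic is not shown on this page: the function name, the header lookup, and the choice of the left-most X-Forwarded-For entry are all assumptions.

```python
def client_ip(peer_ip: str, headers: dict[str, str], trust_proxy: bool) -> str:
    """Pick the address used for rate limits, bans and IP allowlists.

    Hypothetical helper: with trust_proxy off, only the TCP peer counts.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    if trust_proxy and forwarded:
        # Assumption: the left-most entry is the original client, which is
        # only reliable when the proxy overwrites any client-supplied header.
        return forwarded.split(",")[0].strip()
    return peer_ip
```

With trust disabled, every request behind Nginx appears to come from the proxy's own address, so a single rate-limit bucket would be shared by all users.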

INTERNAL_API_URL
INTERNAL_WS_URL
PORT

do not change

INTERNAL_API_URL=http://backend:8080
INTERNAL_WS_URL=ws://backend:8080
PORT=8080

Internal Docker service names. Only change if you rename services in the compose file.

Security

JWT_SECRET

REQUIRED

JWT_SECRET=<generated 64-character hex string>

Master secret for JWT tokens and for Docker sidecar auth (HMAC-SHA256). Generate it once; rotating it requires a full redeploy. To generate a value:
python3 -c "import secrets; print(secrets.token_hex(32))"

SIDECAR_AUTH_TOKEN

auto-derived

# SIDECAR_AUTH_TOKEN= ← leave commented

Auto-derived from JWT_SECRET using HMAC-SHA256. Only override if you need an explicit token separate from JWT_SECRET.
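The derivation can be pictured as a keyed hash over a fixed label. The sketch below shows the general HMAC-SHA256 technique only: the label value and both function names are hypothetical, and the message Testhide actually signs is not documented on this page.

```python
import hashlib
import hmac


def derive_sidecar_token(jwt_secret: str, label: bytes = b"sidecar-auth") -> str:
    """Derive a hex token from the master secret.

    The label is an illustrative placeholder, not Testhide's real input.
    """
    return hmac.new(jwt_secret.encode("utf-8"), label, hashlib.sha256).hexdigest()


def tokens_match(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid timing side channels."""
    return hmac.compare_digest(presented, expected)
```

Because the token is a pure function of JWT_SECRET, both containers arrive at the same value without sharing a second secret, which is also why changing JWT_SECRET affects sidecar auth.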

SSL / TLS

CERT_FILE
CERT_KEY

REQUIRED

CERT_FILE=/etc/ssl/testhide.crt
CERT_KEY=/etc/ssl/testhide.key

Paths inside the container. Place your files in ./ssl/ on the host — that directory is bind-mounted to /etc/ssl.

USE_SSL

default: false

USE_SSL=false

Nginx handles SSL termination — keep this false. Only set true if you bypass Nginx and run the Python backend directly on HTTPS.

MongoDB

MONGO_USER
MONGO_PASS

REQUIRED

MONGO_USER=testhide_user
MONGO_PASS=your_strong_password

MongoDB root credentials. Set once — changing after first boot requires manual DB user update.

MONGO_DATA_PATH

REQUIRED

MONGO_DATA_PATH=/opt/testhide/data/mongo

Absolute path on the HOST where MongoDB stores data files. Must exist and be writable before first launch.

MONGO_HOST
MONGO_PORT
MONGO_DB_NAME
MONGO_AUTH_SOURCE

defaults shown

MONGO_HOST=mongo
MONGO_PORT=27017
MONGO_DB_NAME=testhide_database
MONGO_AUTH_SOURCE=testhide_database

Only change if using an external MongoDB instance outside the compose network.

Redis

REDIS_PASSWORD

REQUIRED

REDIS_PASSWORD=your_strong_redis_password

Redis is on the internal Docker network only, but always set a password in production.

REDIS_HOST
REDIS_PORT
REDIS_DB

defaults shown

REDIS_HOST=redis
REDIS_PORT=6379
REDIS_DB=0

Only change if using an external Redis instance.

Gunicorn (Python WSGI)

GUNICORN_WORKERS

default: 3

GUNICORN_WORKERS=3

Rule of thumb: (2 × CPU cores) + 1. Backend runs 2 replicas — total processes = replicas × workers.
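The arithmetic is easy to get wrong once replicas are involved, so here is a small sketch of it. The helper names are hypothetical; the inputs come from this page (2 replicas, default of 3 workers, and a limit of 2.0 CPUs per backend replica in the compose file).

```python
def rule_of_thumb_workers(cpu_cores: int) -> int:
    """Common Gunicorn sizing heuristic: (2 x cores) + 1."""
    return 2 * cpu_cores + 1


def total_processes(replicas: int, workers_per_replica: int) -> int:
    """Worker processes across all backend replicas."""
    return replicas * workers_per_replica


if __name__ == "__main__":
    print("default total:", total_processes(2, 3))
    print("heuristic for a 2-CPU replica:", rule_of_thumb_workers(2))
```

The default of 3 workers gives 6 processes in total, while the heuristic applied to a 2-CPU replica would suggest 5 workers per replica, so the default is on the conservative side.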

GUNICORN_TIMEOUT
GUNICORN_GRACEFUL_TIMEOUT
GUNICORN_KEEPALIVE
GUNICORN_LOG_LEVEL

defaults shown

GUNICORN_TIMEOUT=120
GUNICORN_GRACEFUL_TIMEOUT=30
GUNICORN_KEEPALIVE=5
GUNICORN_LOG_LEVEL=info

Increase GUNICORN_TIMEOUT if you see 502 errors during large report uploads or long AI inference calls.

AI & Local LLM

AI_LLM_REPO
AI_LLM_FILE

defaults shown

AI_LLM_REPO=bartowski/Phi-3.5-mini-instruct-GGUF
AI_LLM_FILE=Phi-3.5-mini-instruct-Q5_K_M.gguf

Local LLM for diagnostic summaries. Q5_K_M ≈ 3.5 GB on disk. Switch to Q4_K_M to save ~800 MB on 32 GB servers.

AI_LLM_CTX
AI_LLM_THREADS
AI_LLM_GPU_LAYERS

defaults shown

AI_LLM_CTX=32768
AI_LLM_THREADS=8
AI_LLM_GPU_LAYERS=0

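Set AI_LLM_THREADS to the physical CPU core count. Reduce AI_LLM_CTX to 8192 on 32 GB servers to cut peak RAM by ~4 GB; the compose file already falls back to 8192 for ai-worker when the variable is unset. AI_LLM_GPU_LAYERS=0 means CPU-only.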

RC_MODEL_NAME

default shown

RC_MODEL_NAME=distilbert-base-uncased

DistilBERT backbone for the Root-Cause Classifier. Must be identical in ai-api and ai-worker — they share the testhide_hf_cache volume.

HF_HUB_OFFLINE
TRANSFORMERS_OFFLINE

default: 0

HF_HUB_OFFLINE=0
TRANSFORMERS_OFFLINE=0

Set both to 1 after a successful first boot for air-gapped mode. Models persist in the testhide_hf_cache Docker volume between restarts.

AI Worker — Memory & Concurrency

AI_MEMORY_LIMIT_MB
AI_HARD_EXIT_PERCENT

defaults: 7500 / 88

AI_MEMORY_LIMIT_MB=7500
AI_HARD_EXIT_PERCENT=88

Two-layer OOM protection. Soft limit (MB): worker refuses new tasks above this RSS. Must stay BELOW Docker mem_limit=12g (12288 MB) so it fires first. Hard exit (% of container limit) triggers graceful shutdown if soft limit is missed.
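The relationship between the two thresholds and the container limit can be checked numerically. This is a sketch under the assumption that the hard-exit percentage applies to the Docker mem_limit and that the soft limit should sit below the hard-exit point; the function names are hypothetical.

```python
def hard_exit_mb(container_limit_mb: float, hard_exit_percent: float) -> float:
    """RSS in MB at which the hard exit would trigger."""
    return container_limit_mb * hard_exit_percent / 100


def thresholds_ordered(soft_limit_mb: float, hard_exit_percent: float,
                       container_limit_mb: float) -> bool:
    """True when soft limit < hard exit < container limit."""
    hard = hard_exit_mb(container_limit_mb, hard_exit_percent)
    return soft_limit_mb < hard < container_limit_mb


if __name__ == "__main__":
    print(hard_exit_mb(12288, 88))
    print(thresholds_ordered(7500, 88, 12288))
```

With the defaults, the hard exit lands at 10813.44 MB, so the sequence is 7500 MB (stop accepting tasks), then about 10813 MB (graceful shutdown), then 12288 MB (kernel OOM kill by Docker).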

AI_MERGE_BATCH_SIZE
AI_LLM_CTX

defaults: 150 / 8192

AI_MERGE_BATCH_SIZE=150
AI_LLM_CTX=8192

Pandas merge peak is ~24 MB per batch at 150 rows. LLM context window: 8192 saves ~1.5 GB KV-cache vs 32768 — recommended on 32 GB hosts.

AI_RETRAIN_QUEUE_BUSY_THRESHOLD
AI_LOCK_STALE_SECS

defaults: 200 / 300

AI_RETRAIN_QUEUE_BUSY_THRESHOLD=200
AI_LOCK_STALE_SECS=300

Skip retraining when the build queue is severely backlogged (200+ pending). Stale lock eviction prevents zombie training holds.

AI_ASSIST_INFER_WORKERS
AI_ASSIST_IO_WORKERS
AI_ASSIST_LLM_WORKERS

defaults: 4 / 4 / 1

AI_ASSIST_INFER_WORKERS=4
AI_ASSIST_IO_WORKERS=4
AI_ASSIST_LLM_WORKERS=1

Worker parallelism tuned for 6 CPU cores. Reduce infer/io to 2 on 32 GB servers if you see OOM events during heavy build load.

EDGE_AI_ENABLED
EDGE_AI_TIMEOUT_SEC
EDGE_LLM_MODEL
EDGE_EMBEDDER_MODEL

defaults shown

EDGE_AI_ENABLED=true
EDGE_AI_TIMEOUT_SEC=600
EDGE_LLM_MODEL=Phi-3.5-mini-instruct-Q5_K_M.gguf
EDGE_EMBEDDER_MODEL=minilm-l6-v2.onnx

Edge AI runs inference directly on agent nodes. EDGE_LLM_MODEL must match the LLM file name set in AI_LLM_FILE.

AI Pipeline — Streaming & Vector Sidecar (Phase 1 + §3.7)

AI_DATASET_SHARD_ROWS
AI_DATASET_MAX_ROWS
AI_COMPACT_AFTER_SHARDS

defaults: 50000 / 2000000 / 32

AI_DATASET_SHARD_ROWS=50000
AI_DATASET_MAX_ROWS=2000000
AI_COMPACT_AFTER_SHARDS=32

Training data is stored as Parquet shards (50k rows each). Old shards rotate out at AI_DATASET_MAX_ROWS. Compaction merges small shards every 32 new shards to keep file count bounded.
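A quick calculation shows what these defaults imply for the number of files on disk. The helpers below are a hypothetical sketch of that arithmetic only; they say nothing about how the pipeline implements rotation or compaction.

```python
def shards_at_cap(max_rows: int, shard_rows: int) -> int:
    """Number of full-size shards needed to hold max_rows (rounded up)."""
    return -(-max_rows // shard_rows)


def compaction_runs(new_shards: int, compact_after: int) -> int:
    """How many compactions have fired after new_shards were written."""
    return new_shards // compact_after


if __name__ == "__main__":
    print(shards_at_cap(2_000_000, 50_000))
    print(compaction_runs(100, 32))
```

At the defaults the dataset tops out at 40 full shards before old ones rotate out.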

AI_BUDGET_SAFETY_FACTOR
AI_BUDGET_MIN_CHUNK_ROWS
AI_BUDGET_MAX_CHUNK_ROWS

defaults: 0.5 / 16 / 100000

AI_BUDGET_SAFETY_FACTOR=0.5
AI_BUDGET_MIN_CHUNK_ROWS=16
AI_BUDGET_MAX_CHUNK_ROWS=100000

Adaptive memory-budget controller: chunk sizes grow with headroom and shrink under pressure. safety_factor=0.5 means use up to 50% of available headroom per batch. Lower on tight hosts.
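One plausible reading of the controller is a clamp: take a fraction of the free headroom, divide by the per-row cost, and bound the result by the min and max settings. The sketch below implements that reading; the formula, the bytes_per_row input, and the function name are assumptions, not the controller's documented behaviour.

```python
def chunk_rows(headroom_bytes: int, bytes_per_row: int,
               safety_factor: float = 0.5,
               min_rows: int = 16, max_rows: int = 100_000) -> int:
    """Rows per chunk for the current memory headroom (hypothetical model)."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    # Spend only a fraction of the headroom on a single batch.
    budget = max(headroom_bytes, 0) * safety_factor
    rows = int(budget // bytes_per_row)
    # Clamp to the configured bounds.
    return max(min_rows, min(rows, max_rows))
```

For example, 1,000,000 bytes of headroom at 1,000 bytes per row yields 500 rows with the default factor and 250 rows with a factor of 0.25, which is the effect of lowering the setting on a tight host.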

AI_TRAINING_LOCK_TTL_SEC

default: 3600

AI_TRAINING_LOCK_TTL_SEC=3600

Redis-coordinated training mutex TTL. Only one training session at a time. Set higher only if individual training runs exceed 1 hour.

AI_VECTORS_SIDECAR_ENABLED

default: 1

AI_VECTORS_SIDECAR_ENABLED=1

Vector sidecar memmap (Phase 2 §3.7). Extracts text_vector/image_vector from Parquet rows into raw float32 sidecar files, which cuts dataset size by about 60% and speeds up training 3–10×. On first deploy, migrate_vectors_to_sidecar.py runs automatically (one-shot, idempotent). A safety invariant (dataset_signature hash) blocks reads if shards diverge from the sidecar. Set AI_VECTORS_SIDECAR_ENABLED=0 to disable.

Build Investigator — Root-Cause ML

BUILD_RCA_ML_MODE

default: shadow

BUILD_RCA_ML_MODE=shadow

Three modes:
shadow: predictions are persisted to db_ai_diagnostics but not shown (safe default).
active: the ML label replaces the regex baseline when confidence ≥ 0.6.
disabled: the model is not loaded.
The mode auto-promotes from shadow to active when the shadow cron observes ≥ 70% agreement with the regex baseline over 7 days with n ≥ 50 samples.
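The thresholds in this description can be written out as two small predicates. This is a sketch of the stated rules only: the function names and argument shapes are hypothetical, and the real cron job may weigh additional signals.

```python
def should_promote(agreement: float, samples: int, window_days: int) -> bool:
    """Shadow -> active promotion rule as described on this page."""
    return window_days >= 7 and samples >= 50 and agreement >= 0.70


def pick_label(mode: str, ml_label: str, ml_confidence: float,
               regex_label: str) -> str:
    """Label shown to users for a given BUILD_RCA_ML_MODE."""
    if mode == "active" and ml_confidence >= 0.6:
        return ml_label
    # shadow and disabled both fall back to the regex baseline.
    return regex_label
```

In shadow mode the ML prediction is never returned here, however confident it is; it would only be recorded for the agreement statistics.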

BUILD_RCA_RETRAIN_THRESHOLD

default: 200

BUILD_RCA_RETRAIN_THRESHOLD=200

Threshold on new RCA samples in db_rca_corpus before auto-retraining fires. Backend service overrides this to 9999999 in compose — training MUST run only in ai-worker (RUN_AI_WORKER=true).

Observability — Logging, Metrics, Traces

LOG_FORMAT
LOG_LEVEL

defaults: json / INFO

LOG_FORMAT=json
LOG_LEVEL=INFO

Structured JSON logs include trace_id/span_id so Loki + Tempo derive trace links automatically. Set LOG_LEVEL=DEBUG for verbose output (testhide loggers only — does not affect external libs).

METRICS_ENABLED

default: true

METRICS_ENABLED=true

Exposes /api/v3/metrics in Prometheus exposition format. 12 baseline metrics (HTTP rps + latency, build events, WS connections, LLM cost/tokens, AI precompute duration). Scraped every 30s by Prometheus.

OTEL_TRACES_ENABLED
OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_SAMPLE_RATE
SERVICE_NAME

defaults shown

OTEL_TRACES_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4318
OTEL_SAMPLE_RATE=0.1
SERVICE_NAME=testhide

OpenTelemetry traces. Services reach Tempo by its Docker network name, so they do not rely on the published host ports. Sampling: 1.0 = 100% (dev), 0.1 = 10% (recommended for production).

GRAFANA_ADMIN_PASSWORD

REQUIRED

GRAFANA_ADMIN_PASSWORD=<generated password>

Grafana admin password. Set it before the first deploy; compose refuses to start if it is unset. To generate a value: python3 -c "import secrets; print(secrets.token_urlsafe(32))"

GRAFANA_ROOT_URL

default: ${PUBLIC_URL}:3000

GRAFANA_ROOT_URL=https://testhide.yourcompany.com:3000

Grafana public URL — used in UI links and OAuth redirects. Defaults to ${PUBLIC_URL}:3000 if unset.

GRAFANA_CERT_HOST_CRT
GRAFANA_CERT_HOST_KEY

defaults: ./ssl/testhide.crt/.key

GRAFANA_CERT_HOST_CRT=./ssl/testhide.crt
GRAFANA_CERT_HOST_KEY=./ssl/testhide.key

Grafana SSL cert files (host paths relative to docker-compose.yaml). By default reuses the main testhide cert. Override only if Grafana needs a separate certificate.

PROMETHEUS_PORT_BIND

default: 127.0.0.1:9090

PROMETHEUS_PORT_BIND=127.0.0.1:9090

Loopback binding by default — Prometheus has no auth, so the UI must not be exposed externally. Reach it via SSH port-forward, or query through Grafana (uses the docker network). NEVER use 0.0.0.0:9090 in production.

Security — Extra Hardening

DB_DUMP_ALLOWED_IPS

fail-closed default

DB_DUMP_ALLOWED_IPS=10.0.0.0/8,127.0.0.1

SEC-014: CSV of IPs/CIDRs allowed to call /api/v3/db/dump. Fail-closed — leave empty and the endpoint denies every request with HTTP 403.
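Fail-closed means an empty list denies everyone rather than allowing everyone. The sketch below shows that behaviour with Python's standard ipaddress module; the function names are hypothetical and the backend's actual check is not shown on this page.

```python
import ipaddress


def parse_allowlist(csv_value: str) -> list:
    """Turn a CSV of IPs/CIDRs into network objects (a bare IP becomes a /32)."""
    return [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in csv_value.split(",")
        if item.strip()
    ]


def is_allowed(client_ip: str, csv_value: str) -> bool:
    """Allow only addresses inside the list; an empty list denies all."""
    networks = parse_allowlist(csv_value)
    if not networks:
        return False  # fail-closed
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)
```

With the example value `10.0.0.0/8,127.0.0.1`, a request from 10.1.2.3 passes, one from 192.168.1.1 is rejected, and with an empty value both are rejected.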

TESTHIDE_ALLOW_INSECURE_WEBHOOKS

default: 0

TESTHIDE_ALLOW_INSECURE_WEBHOOKS=0

SEC-011: only set to 1 if you have a documented reason (self-signed internal webhooks, on-prem MS Teams behind a private CA). Even then, the individual webhook row must also set allow_insecure_tls=True — both gates must agree before TLS verification is skipped.

YOLO Visual Regression

AI_YOLO_MODEL
AI_YOLO_FORCE_CPU

defaults shown

AI_YOLO_MODEL=yolov8n.pt
AI_YOLO_FORCE_CPU=true

CPU training is the default — no GPU required. yolov8n.pt is the nano model: fast, low memory. Use yolov8s.pt for better accuracy at the cost of ~2× RAM.

AI_YOLO_EPOCHS
AI_YOLO_BATCH
AI_YOLO_FREEZE
AI_YOLO_MIN_ANNOTATIONS
AI_YOLO_WEIGHT_CORRECTION
AI_YOLO_WEIGHT_MANUAL

defaults shown

AI_YOLO_EPOCHS=5
AI_YOLO_BATCH=8
AI_YOLO_FREEZE=10
AI_YOLO_MIN_ANNOTATIONS=10
AI_YOLO_WEIGHT_CORRECTION=10
AI_YOLO_WEIGHT_MANUAL=5

Increase AI_YOLO_MIN_ANNOTATIONS to skip training on small datasets. On 32 GB servers set to 50 to avoid OOM during training runs.

License

LICENSE_API_URL

default shown

LICENSE_API_URL=https://service.testhide.com/api/v1/

Leave as-is for cloud license validation. For air-gapped deployments, contact support for an offline key.

LICENSE_API_CA_BUNDLE

optional

# LICENSE_API_CA_BUNDLE=/etc/ssl/ldap-ca.pem

Only needed if your corporate egress proxy intercepts HTTPS (TLS inspection). Point it at the same CA bundle as LDAP_CA_BUNDLE.

Docker Sidecar (SEC-004)

SIDECAR_ALLOWED_IMAGES

defaults shown

SIDECAR_ALLOWED_IMAGES=thuesdays/testhide-agent:latest,thuesdays/testhide-backend:latest

Comma-separated allowlist — only these images can be started via the API. Add your custom agent image if using a custom build.

SIDECAR_ALLOWED_EXEC_PATTERNS
SIDECAR_ALLOWED_NETWORKS

defaults shown

SIDECAR_ALLOWED_EXEC_PATTERNS=^python\s,^bash\s,^sh\s
SIDECAR_ALLOWED_NETWORKS=testhide-internal,bridge,testhide-net

Restrict which exec commands and networks are allowed through the sidecar proxy. Tighten in high-security environments.
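To see what the default patterns admit, the sketch below splits the CSV value and tests a few commands against it. It assumes the sidecar treats each entry as a regular expression searched against the command string, and that patterns contain no commas; both points and the function names are assumptions.

```python
import re


def compile_patterns(csv_value: str) -> list:
    """Compile each comma-separated entry as a regular expression."""
    return [re.compile(p) for p in csv_value.split(",") if p]


def exec_allowed(command: str, patterns: list) -> bool:
    """True if any pattern matches the command string."""
    return any(p.search(command) for p in patterns)


if __name__ == "__main__":
    patterns = compile_patterns(r"^python\s,^bash\s,^sh\s")
    for cmd in ("python script.py", "python3 script.py", "rm -rf /"):
        print(cmd, "->", exec_allowed(cmd, patterns))
```

Under these assumptions `python script.py` is accepted, while `python3 script.py` is not, because `^python\s` requires whitespace directly after `python`. Add an explicit pattern if your agents invoke `python3`.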

Integrations (all optional)

JIRA_URL
JIRA_USERNAME
JIRA_PASSWORD
BITBUCKET_API_VERSION

optional

JIRA_URL=https://jira.yourcompany.com/
JIRA_USERNAME=testhide-service
JIRA_PASSWORD=api_token
BITBUCKET_API_VERSION=1.0

Enables the Jira Auto-Linker ML model. Use a Jira API token (not password). Leave empty to disable.

LDAP_USE_SSL
LDAP_PORT
LDAP_CA_BUNDLE
LDAP_TLS_INSECURE

optional

LDAP_USE_SSL=1
LDAP_PORT=636
LDAP_CA_BUNDLE=/etc/ssl/ldap-ca.pem
LDAP_TLS_INSECURE=0

LDAPS on port 636 by default. Drop your CA PEM at ./ssl/ldap-ca.pem on the host. Set LDAP_TLS_INSECURE=1 only for local dev — never in production.

GOOGLE_AI_API_KEY

optional

GOOGLE_AI_API_KEY=your_gemini_api_key

Optional quality upgrade: Gemini is used as a fallback when local Phi-3.5 inference is busy. The local LLM runs without this key.

Place SSL certificates

Create a ./ssl/ directory next to docker-compose.yaml — it's bind-mounted read-only into Nginx.

bash

# Expected layout

$ ls -la ssl/

-rw-r--r-- ssl/testhide.crt # certificate (PEM)
-rw------- ssl/testhide.key # private key (chmod 600)
-rw-r--r-- ssl/ldap-ca.pem # optional: LDAP / proxy CA bundle

$ chmod 600 ssl/testhide.key

# Quick self-signed cert (dev only)

$ openssl req -x509 -nodes -newkey rsa:4096 -days 365 \
  -keyout ssl/testhide.key -out ssl/testhide.crt \
  -subj "/CN=testhide.local" && chmod 600 ssl/testhide.key

Launch the stack

First run downloads ~4–6 GB of images and AI models. Allow 10–15 minutes.

bash

# First deploy: --force-recreate forces config-init to refresh observability configs

$ docker compose up -d --force-recreate

[+] Running 14/14
✔ Container testhide-config-init Exited (0) — configs synced
✔ Container testhide_mongo Healthy
✔ Container testhide_redis Healthy
✔ Container testhide_redisinsight Started
✔ Container testhide_sidecar_docker Healthy
✔ Container testhide_backend Started (×2 replicas)
✔ Container testhide_ai_api Started
✔ Container testhide_ai_worker Started
✔ Container testhide_frontend Started
✔ Container testhide-prometheus Started
✔ Container testhide-tempo Started
✔ Container testhide-loki Started
✔ Container testhide-grafana Started
✔ Container testhide-alloy Started

# Watch startup (Ctrl+C to stop following)

$ docker compose logs -f backend ai-api ai-worker

/ expected startup sequence

0:00  config-init copies baked observability configs to named volumes, exits

0:05  MongoDB + Redis healthchecks pass

0:15  Sidecar healthy, backend replicas accept traffic

0:20  Prometheus, Tempo, Loki, Alloy start scraping/collecting

0:30  Nginx + Grafana ready (UI on :443, dashboards on :3000)

1:00  AI API loads DistilBERT + FAISS index

5:00+  AI Worker downloads HuggingFace models on first boot (~3.5 GB)

10:00+  Phi-3.5-mini LLM loaded; all AI diagnostics active

Connect build agents

A lightweight launcher wraps the agent binary, managing service lifecycle and auto-updates. Agents connect via WebSocket. Available for Windows, Linux, macOS, and Docker.

App mode — interactive

Runs in the current user session. Auto-starts on login. Console window visible. Good for developer workstations where someone is always logged in.

./testhide

Service mode — CI servers

Runs as a system service (SYSTEM on Windows, root on Linux). Starts at boot with no user login. Recommended for dedicated build machines and containers.

./testhide --daemon

bash — universal installer (auto-detects Linux · macOS)

# Detects OS — installs via apt · yum · brew · or binary download

$ curl -fsSL https://dl.testhide.com/install.sh | sudo bash

PowerShell — run as Administrator

# 1. Download launcher

PS> Invoke-WebRequest https://dl.testhide.com/stable/testhide-win-x64.exe -OutFile testhide.exe

# 2. Configure: WebSocket URL + license key

PS> .\testhide.exe config --url wss://YOUR_DOMAIN:7771 --license-key YOUR_LICENSE_KEY

# 3a. Install as Windows Service (recommended for CI servers)

PS> .\testhide.exe install-service

✔ Service "TesthideAgent" installed and started
Auto-start on boot · runs as SYSTEM · no login required

# 3b. Or run interactively (App mode, current user session)

PS> .\testhide.exe

ℹ Service name: TesthideAgent. Manage it via services.msc, stop it with net stop TesthideAgent, or remove it with sc.exe delete TesthideAgent.

ℹ Config stored at C:\ProgramData\Testhide\config.json. Verify with .\testhide.exe config --show.

bash — Ubuntu / Debian

# 1. Add APT repository

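# apt-key is deprecated; store the key in a dedicated keyring instead
# (assumes gpg.key is ASCII-armored)

$ curl -fsSL https://dl.testhide.com/apt/gpg.key | \
  sudo gpg --dearmor -o /usr/share/keyrings/testhide.gpg

$ echo "deb [signed-by=/usr/share/keyrings/testhide.gpg] https://dl.testhide.com/apt stable main" | \
  sudo tee /etc/apt/sources.list.d/testhide.list

$ sudo apt-get update && sudo apt-get install -y testhide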

# 2. Configure

$ sudo testhide config --url wss://YOUR_DOMAIN:7771 --license-key YOUR_LICENSE_KEY

# 3. Enable systemd service

$ sudo systemctl enable --now testhide

✔ testhide.service enabled and started
Restarts automatically on crash · logs via journalctl

ℹ Installs launcher + agent + systemd unit. Update later via sudo apt-get update && sudo apt-get install --only-upgrade testhide.

ℹ Logs: journalctl -u testhide -f · Status: systemctl status testhide

bash — CentOS / RHEL / Fedora

# 1. Add YUM repository

$ sudo tee /etc/yum.repos.d/testhide.repo <<'EOF'
[testhide]
name=Testhide Agent
baseurl=https://dl.testhide.com/rpm
enabled=1
gpgcheck=0
EOF

$ sudo dnf install -y testhide

# 2. Configure

$ sudo testhide config --url wss://YOUR_DOMAIN:7771 --license-key YOUR_LICENSE_KEY

# 3. Enable systemd service

$ sudo systemctl enable --now testhide

✔ testhide.service enabled and started

ℹ Installs launcher + agent + systemd unit. Update via sudo dnf upgrade testhide.

ℹ Logs: journalctl -u testhide -f · Status: systemctl status testhide

bash — any Linux x64 (manual / air-gapped)

# 1. Download launcher binary

$ sudo mkdir -p /opt/testhide

$ sudo curl -fsSL https://dl.testhide.com/stable/testhide-linux-x64 \
-o /opt/testhide/testhide && sudo chmod +x /opt/testhide/testhide

# 2. Configure

$ sudo /opt/testhide/testhide config \
--url wss://YOUR_DOMAIN:7771 \
--license-key YOUR_LICENSE_KEY

# 3. Create systemd service

$ sudo tee /etc/systemd/system/testhide.service <<'EOF'
[Unit]
Description=Testhide CI/CD Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
Environment=TESTHIDE_SERVICE_MODE=1
ExecStart=/opt/testhide/testhide --daemon
WorkingDirectory=/opt/testhide
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

$ sudo systemctl daemon-reload && sudo systemctl enable --now testhide

ℹ The launcher downloads the agent binary on first run to ~/.testhide/client/{version}/ from dl.testhide.com.

ℹ Config stored at /root/.testhide/config.json when running as root.
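Because the unit uses Restart=always with RestartSec=10, a wrong path in ExecStart shows up as a restart loop every 10 seconds rather than a clear error. A minimal pre-flight sketch, assuming the paths used in the steps above (the helper name preflight is hypothetical):

```shell
#!/bin/sh
# preflight BINARY UNIT_FILE: confirm that the launcher is executable and that
# the unit file's ExecStart points at it, before enabling the service.
preflight() {
  bin=$1; unit=$2
  [ -x "$bin" ] || { echo "not executable: $bin" >&2; return 1; }
  grep -q "^ExecStart=$bin " "$unit" || { echo "ExecStart mismatch in $unit" >&2; return 1; }
  echo "preflight ok"
}

# Paths from the steps above; the check is skipped on hosts without the unit file.
if [ -f /etc/systemd/system/testhide.service ]; then
  preflight /opt/testhide/testhide /etc/systemd/system/testhide.service
fi
```

On hosts with systemd, systemd-analyze verify /etc/systemd/system/testhide.service gives a more thorough syntax check of the unit file.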

bash — macOS Intel x64 · Apple Silicon arm64

# Auto-detect architecture and download

$ ARCH=$(uname -m | sed 's/x86_64/x64/') \
&& curl -fsSL https://dl.testhide.com/stable/testhide-macos-$ARCH \
-o testhide && chmod +x testhide

# Configure

$ ./testhide config --url wss://YOUR_DOMAIN:7771 --license-key YOUR_LICENSE_KEY

# Run (App mode)

$ ./testhide

ℹ If macOS blocks the binary on first run, run xattr -d com.apple.quarantine ./testhide or allow it via System Settings → Privacy & Security.

ℹ Downloads: testhide-macos-x64 (Intel) · testhide-macos-arm64 (M1/M2/M3)
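The one-liner above passes any unrecognized uname -m value straight into the download URL, so an unexpected architecture only fails at the HTTP request. A small sketch that maps the two listed artifacts explicitly and rejects anything else (the function name testhide_macos_arch is hypothetical):

```shell
#!/bin/sh
# Map `uname -m` output to the two macOS artifact suffixes listed above.
# Unknown architectures fail here instead of producing a broken download URL.
testhide_macos_arch() {
  case "$1" in
    x86_64) echo "x64" ;;
    arm64)  echo "arm64" ;;
    *)      echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Usage: build the download URL only when the mapping succeeds.
if ARCH=$(testhide_macos_arch "$(uname -m)"); then
  echo "https://dl.testhide.com/stable/testhide-macos-$ARCH"
fi
```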

bash — Docker ephemeral agent

# Run agent container

$ docker run -d \
  --name testhide-agent \
  --restart unless-stopped \
  -e TESTHIDE_SERVICE_MODE=1 \
  -e TESTHIDE_NODE_TYPE=dynamic \
  thuesdays/testhide-agent:latest

# Configure inside the container

$ docker exec testhide-agent testhide config \
--url wss://YOUR_DOMAIN:7771 \
--license-key YOUR_LICENSE_KEY

$ docker restart testhide-agent

# Logs

$ docker logs -f testhide-agent

# Co-locate in your server docker-compose.yaml:

# image: thuesdays/testhide-agent:latest

# environment: [TESTHIDE_SERVICE_MODE=1, TESTHIDE_NODE_TYPE=dynamic]

# depends_on: [backend]

# networks: [testhide-net]

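ℹ Image: thuesdays/testhide-agent:latest · Base: Python 3.12 slim · Includes SSH server for backend remote-terminal access during test runs.

ℹ Add -v /var/run/docker.sock:/var/run/docker.sock only if your tests need to spawn Docker containers. This mounts the host's Docker socket, so spawned containers run as siblings on the host daemon (not true Docker-in-Docker), and the agent gains root-equivalent control of that daemon.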
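The commented co-location lines above can be turned into a Compose override file, which keeps the downloaded docker-compose.yaml untouched. This is a sketch under assumptions: the service name agent and the file name docker-compose.agent.yaml are made up for illustration, and backend and testhide-net must already be defined in your main compose file.

```shell
#!/bin/sh
# Write a Compose override file containing the agent service sketched in the
# commented lines above. The service name "agent" and the file name are
# assumptions; "backend" and "testhide-net" must exist in docker-compose.yaml.
OVERRIDE_FILE="${OVERRIDE_FILE:-docker-compose.agent.yaml}"

cat > "$OVERRIDE_FILE" <<'EOF'
services:
  agent:
    image: thuesdays/testhide-agent:latest
    restart: unless-stopped
    environment:
      - TESTHIDE_SERVICE_MODE=1
      - TESTHIDE_NODE_TYPE=dynamic
    depends_on:
      - backend
    networks:
      - testhide-net
EOF

echo "wrote $OVERRIDE_FILE"
```

Start it together with the stack via docker compose -f docker-compose.yaml -f docker-compose.agent.yaml up -d. The agent still needs the testhide config step shown above before it can connect.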

testhide config — flag reference

--url / -u

REQUIRED

--url wss://testhide.yourcompany.com:7771

WebSocket server URL. Must start with ws:// or wss://. Must match WS_URL in your server .env.

--license-key / -k

REQUIRED

--license-key YOUR_LICENSE_KEY

License key from your Testhide dashboard. Stored as a SHA-256 hash — never written to disk in plain text.

--instance-id / -i

optional

--instance-id build-server-01

Human-readable node name shown in the Agents dashboard. Defaults to machine hostname if not set.

--show / -s

read-only

testhide config --show

Print current configuration to stdout without making any changes. Useful for verifying before first run.

--license-url / -l

optional

--license-url https://service.testhide.com/api/v1/

Override the license validation endpoint. Only needed for air-gapped deployments — contact support for an offline license key.

install-service
uninstall-service

Windows only

.\testhide.exe install-service
.\testhide.exe uninstall-service

Install or remove the TesthideAgent Windows Service. Requires Administrator. The launcher installs itself as a service that auto-starts on boot.

--daemon

Linux / macOS

testhide --daemon

Run in daemon (service) mode. Sets TESTHIDE_SERVICE_MODE=1 internally. Used in the systemd ExecStart line — do not use for interactive sessions.

↗ Find your license key: Dashboard → Settings → License after first login.

↗ Agents auto-update hourly from dl.testhide.com. The launcher rolls back to the previous version on repeated crashes and re-downloads if needed.
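When the config step is scripted across many build hosts, the --url rule from the reference above can be checked before the launcher is called. A minimal sketch that enforces only the documented prefix rule (the helper name valid_ws_url is hypothetical; the launcher may apply further checks of its own):

```shell
#!/bin/sh
# Pre-check a value before passing it to `testhide config --url`.
# Only the documented rule is enforced: the URL must start with ws:// or wss://.
valid_ws_url() {
  case "$1" in
    ws://?*|wss://?*) return 0 ;;
    *)                return 1 ;;
  esac
}

for url in "wss://testhide.yourcompany.com:7771" "https://testhide.yourcompany.com"; do
  if valid_ws_url "$url"; then
    echo "ok:       $url"
  else
    echo "rejected: $url"
  fi
done
```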

Verify & health-check

Run these before connecting your first pipeline.

bash

$ docker compose ps

# Backend health

$ curl -sk https://localhost/api/v3/health | python3 -m json.tool

{"status": "ok", "version": "6.1.6", "mongo": "connected", "redis": "connected"}

# AI API health

$ curl -sk https://localhost:7771/api/v3/ai/health | python3 -m json.tool

{"status": "ok", "models_loaded": true, "llm": "phi-3.5-mini"}

# Prometheus metrics endpoint

$ curl -sk https://localhost/api/v3/metrics | head -20

# HELP http_requests_total Total HTTP requests by route
http_requests_total{method="GET",route="/api/v3/health",status="200"} 12
...

# Open Grafana in your browser — admin / GRAFANA_ADMIN_PASSWORD from .env

$ open https://localhost:3000 # macOS; on Linux use xdg-open, or browse to your domain
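Right after the stack starts, the backend may need some time before the health endpoint reports ok. The sketch below wraps the documented health URL in a retry loop and checks the three fields from the sample response above. The helper names health_ok, retry, and backend_healthy are hypothetical, and the field matching is plain text matching with grep rather than a full JSON parser.

```shell
#!/bin/sh
# health_ok reads a health JSON document on stdin and succeeds only when the
# fields shown in the sample response above report a healthy backend.
health_ok() {
  body=$(cat)
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' &&
  printf '%s' "$body" | grep -Eq '"mongo"[[:space:]]*:[[:space:]]*"connected"' &&
  printf '%s' "$body" | grep -Eq '"redis"[[:space:]]*:[[:space:]]*"connected"'
}

# retry ATTEMPTS DELAY_SECONDS COMMAND...: rerun COMMAND until it succeeds
# or ATTEMPTS runs have failed.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Hypothetical wrapper around the documented health endpoint.
backend_healthy() {
  curl -sk https://localhost/api/v3/health | health_ok
}

# Example (needs a running stack):
#   retry 30 5 backend_healthy && echo "backend is up"
```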


Upgrades & operations

bash

# Pull latest images + restart. --force-recreate re-runs config-init so

# observability configs (Grafana dashboards etc.) refresh from the image.

$ docker compose pull && docker compose up -d --force-recreate

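# Backup MongoDB. MONGO_USER and MONGO_PASS are expanded by the calling shell, so set them first;

# the dump is written inside the container, under its /data/db directory.

$ docker exec testhide_mongo mongodump \
  -u "$MONGO_USER" -p "$MONGO_PASS" \
  --authenticationDatabase admin \
  --out /data/db/backup_$(date +%Y%m%d)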

# Follow AI worker logs (or use Grafana → Loki for searchable history)

$ docker logs -f testhide_ai_worker

# Restart only one service (no full stack downtime)

$ docker compose up -d --force-recreate --no-deps ai-worker
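Daily dumps named backup_YYYYMMDD accumulate quickly. One possible retention sketch, assuming you copy each dump out of the container to a host directory first (for example with docker cp testhide_mongo:/data/db/backup_YYYYMMDD into a directory of your choice); the helper name prune_backups and the keep-newest-N policy are illustrative choices, not part of Testhide:

```shell
#!/bin/sh
# prune_backups DIR KEEP: delete all but the KEEP newest backup_YYYYMMDD
# directories in DIR. Date-stamped names sort chronologically, so a reverse
# lexical sort puts the newest first.
prune_backups() {
  dir=$1; keep=$2
  ls -1d "$dir"/backup_* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
  while IFS= read -r old; do
    rm -rf -- "$old"
  done
}

# Demo on a throwaway directory (no Docker needed).
demo=$(mktemp -d)
mkdir "$demo/backup_20250101" "$demo/backup_20250102" "$demo/backup_20250103"
prune_backups "$demo" 2
ls "$demo"
```

The demo keeps the two most recent directories and removes the oldest one.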

Rather skip the infrastructure? We run it for you.

Cloud Starter ($49/mo) — managed instance, 3 concurrent builds, 30-day retention, full 8-model dashboard.

