Testing

Clawdbot has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners. This doc is a “how we test” guide:
  • What each suite covers (and what it deliberately does not cover)
  • Which commands to run for common workflows (local, pre-push, debugging)
  • How live tests discover credentials and select models/providers
  • How to add regressions for real-world model/provider issues

Quick start

Most days:
  • Full gate (expected before push): pnpm lint && pnpm build && pnpm test
When you touch tests or want extra confidence:
  • Coverage gate: pnpm test:coverage
  • E2E suite: pnpm test:e2e
When debugging real providers/models (requires real creds; skipped by default):
  • Live suite (models only): CLAWDBOT_LIVE_TEST=1 pnpm test:live
  • Live suite (models + providers): LIVE=1 pnpm test:live
Tip: when you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.

Test suites (what runs where)

Think of the suites as “increasing realism” (and increasing flakiness/cost):

Unit / integration (default)

  • Command: pnpm test
  • Config: vitest.config.ts
  • Files: src/**/*.test.ts
  • Scope:
    • Pure unit tests
    • In-process integration tests (gateway auth, routing, tooling, parsing, config)
    • Deterministic regressions for known bugs
  • Expectations:
    • Runs in CI
    • No real keys required
    • Should be fast and stable
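
For flavor, a deterministic regression in this suite is plain Vitest: no network, no keys. The helper below is hypothetical, not a real Clawdbot export; it only shows the style.

```ts
import { describe, expect, it } from "vitest";

// Hypothetical helper standing in for the kind of pure function these
// tests pin down.
function parseModelRef(ref: string): { provider: string; model: string } {
  const [provider, ...rest] = ref.split("/");
  return { provider, model: rest.join("/") };
}

describe("model ref parsing", () => {
  it("keeps slashes inside the model id (regression)", () => {
    expect(parseModelRef("openrouter/google/gemini-flash-latest")).toEqual({
      provider: "openrouter",
      model: "google/gemini-flash-latest",
    });
  });
});
```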

E2E (gateway smoke)

  • Command: pnpm test:e2e
  • Config: vitest.e2e.config.ts
  • Files: src/**/*.e2e.test.ts
  • Scope:
    • Multi-instance gateway end-to-end behavior
    • WebSocket/HTTP surfaces, node pairing, and heavier networking
  • Expectations:
    • Runs in CI (when enabled in the pipeline)
    • No real keys required
    • More moving parts than unit tests (can be slower)
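
The suites are split purely by config file and glob. As a rough sketch (the include glob comes from this doc; everything else is illustrative), vitest.e2e.config.ts could look like:

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Only the e2e files; unit/integration files stay in vitest.config.ts.
    include: ["src/**/*.e2e.test.ts"],
    // e2e has more moving parts, so a longer timeout is plausible here.
    testTimeout: 120_000,
  },
});
```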

Live (real providers + real models)

  • Command: pnpm test:live
  • Config: vitest.live.config.ts
  • Files: src/**/*.live.test.ts
  • Default: skipped unless CLAWDBOT_LIVE_TEST=1 or LIVE=1
  • Scope:
    • “Does this provider/model actually work today with real creds?”
    • Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
  • Expectations:
    • Not CI-stable by design (real networks, real provider policies, quotas, outages)
    • Costs money / uses rate limits
    • Prefer running narrowed subsets instead of “everything”
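
A live file can gate itself on those env vars with Vitest's skipIf. This is a sketch of the pattern, not the exact code in the repo:

```ts
import { describe, it } from "vitest";

const LIVE =
  process.env.CLAWDBOT_LIVE_TEST === "1" || process.env.LIVE === "1";

// The whole suite is skipped (not failed) unless live mode is opted into.
describe.skipIf(!LIVE)("live provider smoke", () => {
  it("completes against a real model", async () => {
    // ...real network call here; costs money and consumes rate limits
  });
});
```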

Which suite should I run?

Use this quick decision guide:
  • Editing logic/tests: run pnpm test (and pnpm test:coverage if you changed a lot)
  • Touching gateway networking / WS protocol / pairing: add pnpm test:e2e
  • Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed pnpm test:live

Live: model smoke (profile keys)

Live tests are split into two layers so we can isolate failures:
  • “Direct model” tells us the provider/model can answer at all with the given key.
  • “Gateway smoke” tells us the full gateway+agent pipeline works for that model (sessions, history, tools, sandbox policy, etc.).

Layer 1: Direct model completion (no gateway)

  • Test: src/agents/models.profiles.live.test.ts
  • Goal:
    • Enumerate discovered models
    • Use getApiKeyForModel to select models you have creds for
    • Run a small completion per model (and targeted regressions where needed)
  • How to enable:
    • CLAWDBOT_LIVE_TEST=1 or LIVE=1
    • CLAWDBOT_LIVE_ALL_MODELS=1 (required for this test to run)
  • How to select models:
    • CLAWDBOT_LIVE_MODELS=all to run everything with keys
    • or CLAWDBOT_LIVE_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-5,..." (comma allowlist)
  • How to select providers:
    • CLAWDBOT_LIVE_PROVIDERS="google,google-antigravity,google-gemini-cli" (comma allowlist)
  • Where keys come from:
    • By default: profile store and env fallbacks
    • Set CLAWDBOT_LIVE_REQUIRE_PROFILE_KEYS=1 to enforce profile store only
  • Why this exists:
    • Separates “provider API is broken / key is invalid” from “gateway agent pipeline is broken”
    • Contains small, isolated regressions (example: OpenAI Responses/Codex Responses reasoning replay + tool-call flows)
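
Conceptually, the selection logic boils down to the sketch below. discoverModels and getApiKeyForModel are the real names referenced in this doc, but the signatures here are assumptions:

```ts
type Model = { id: string /* "provider/model" */; provider: string };

// Assumed signatures; the real exports may differ.
declare function discoverModels(): Promise<Model[]>;
declare function getApiKeyForModel(model: Model): Promise<string | undefined>;

async function selectLiveModels(): Promise<Model[]> {
  const modelAllow = process.env.CLAWDBOT_LIVE_MODELS; // "all" or comma allowlist
  const providerAllow = process.env.CLAWDBOT_LIVE_PROVIDERS; // comma allowlist

  const providers = providerAllow?.split(",");
  const models =
    modelAllow && modelAllow !== "all" ? modelAllow.split(",") : undefined;

  const picked: Model[] = [];
  for (const m of await discoverModels()) {
    if (providers && !providers.includes(m.provider)) continue;
    if (models && !models.includes(m.id)) continue;
    if (await getApiKeyForModel(m)) picked.push(m); // only models with creds
  }
  return picked;
}
```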

Layer 2: Gateway + dev agent smoke (what “@clawdbot” actually does)

  • Test: src/gateway/gateway-models.profiles.live.test.ts
  • Goal:
    • Spin up an in-process gateway
    • Create/patch an agent:dev:* session (model override per run)
    • Iterate over the models you have keys for and assert:
      • “meaningful” response (no tools)
      • a real tool invocation works (read probe)
      • optional extra tool probes (bash+read probe)
      • OpenAI regression paths (tool-call-only → follow-up) keep working
  • How to enable:
    • CLAWDBOT_LIVE_TEST=1 or LIVE=1
    • CLAWDBOT_LIVE_GATEWAY=1 (required for this test to run)
  • How to select models:
    • CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 to scan all discovered models with keys
    • or set CLAWDBOT_LIVE_GATEWAY_MODELS="provider/model,provider/model,..." to narrow quickly
  • How to select providers (avoid “OpenRouter everything”):
    • CLAWDBOT_LIVE_GATEWAY_PROVIDERS="google,google-antigravity,google-gemini-cli,openai,anthropic,zai,minimax" (comma allowlist)
  • Optional tool-calling stress:
    • CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 enables an extra “bash writes file → read reads it back → echo nonce” check.
    • This is specifically meant to catch tool-calling compatibility issues across providers (formatting, history replay, tool_result pairing, etc.).
  • Optional image send smoke:
    • CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 sends a real image attachment through the gateway agent pipeline (multimodal message) and asserts the model can read back a per-run code from the image.
    • Flow (high level):
      • Test generates a tiny PNG with “CAT” + random code (src/gateway/live-image-probe.ts)
      • Sends it via agent attachments: [{ mimeType: "image/png", content: "<base64>" }]
      • Gateway parses attachments into images[] (src/gateway/server-methods/agent.ts + src/gateway/chat-attachments.ts)
      • Embedded agent forwards a multimodal user message to the model
      • Assertion: reply contains cat + the code (OCR tolerance: minor mistakes allowed)
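
In payload terms, the image probe amounts to something like this sketch (the attachments shape is from this doc; the message field name and PNG source are assumptions):

```ts
import { readFileSync } from "node:fs";

// The real helper (src/gateway/live-image-probe.ts) renders "CAT" plus a
// per-run code; here we just load PNG bytes from disk for illustration.
const png = readFileSync("probe.png");

const agentParams = {
  message: "What animal and code are in this image?", // field name assumed
  attachments: [{ mimeType: "image/png", content: png.toString("base64") }],
};
// The gateway parses attachments into images[], the embedded agent forwards
// a multimodal user message, and the test asserts the reply contains "cat"
// plus the code (minor OCR mistakes tolerated).
```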

Narrow, explicit allowlists are fastest and least flaky:
  • Single model, direct (no gateway):
    • CLAWDBOT_LIVE_TEST=1 CLAWDBOT_LIVE_ALL_MODELS=1 CLAWDBOT_LIVE_MODELS="openai/gpt-5.2" pnpm test:live src/agents/models.profiles.live.test.ts
  • Single model, gateway smoke:
    • LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts
  • Tool calling across several providers (bash + read probe):
    • LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-5,google/gemini-flash-latest,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts
  • Google focus (Gemini API key + Antigravity):
    • Gemini (API key): LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="google/gemini-flash-latest" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts
    • Antigravity (OAuth): LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="google-antigravity/claude-opus-4-5-thinking,google-antigravity/gemini-3-pro-high" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts

Live: model matrix (what we cover)

There is no fixed “CI model list” (live is opt-in), but these are the recommended models to cover regularly on a dev machine with keys.

Baseline: tool calling (Read + optional Bash)

Pick at least one per provider family:
  • OpenAI: openai/gpt-5.2 (or openai/gpt-5-mini)
  • Anthropic: anthropic/claude-opus-4-5 (or anthropic/claude-sonnet-4-5)
  • Google: google/gemini-flash-latest (or google/gemini-2.5-pro)
  • Z.AI (GLM): zai/glm-4.7
  • MiniMax: minimax/minimax-m2.1
Optional additional coverage (nice to have):
  • xAI: xai/grok-4 (or latest available)
  • Mistral: mistral/… (pick one “tools” capable model you have enabled)
  • Cerebras: cerebras/… (if you have access)
  • LM Studio: lmstudio/… (local; tool calling depends on API mode)

Vision: image send (attachment → multimodal message)

Run with CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 and include at least one image-capable model in CLAWDBOT_LIVE_GATEWAY_MODELS (Claude/Gemini/OpenAI vision-capable variants, etc.).

Aggregators / alternate gateways

If you have keys enabled, we also support testing via:
  • OpenRouter: openrouter/... (hundreds of models; use clawdbot models scan to find tool+image capable candidates)
  • OpenCode Zen: opencode-zen/... (requires OPENCODE_ZEN_API_KEY)
Tip: don’t try to hardcode “all models” in docs. The authoritative list is whatever discoverModels(...) returns on your machine + whatever keys are available.

Credentials (never commit)

Live tests discover credentials the same way the CLI does. Credentials live in:
  • Profile store: ~/.clawdbot/credentials/ (preferred; what “profile keys” means in the tests)
  • Config: ~/.clawdbot/clawdbot.json (or CLAWDBOT_CONFIG_PATH)
Practical implications:
  • If the CLI works, live tests should find the same keys.
  • If a live test says “no creds”, debug the same way you’d debug clawdbot models list / model selection.
If you want to rely on env keys (e.g. exported in your ~/.profile), run local tests from a shell that has sourced ~/.profile, or use the Docker runners below (they can mount ~/.profile into the container).
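
As a sketch, the lookup order described above looks like this (the helper name is hypothetical; only the order and the env var come from this doc):

```ts
// Reads a provider key from the profile store (~/.clawdbot/credentials/).
declare function readProfileKey(provider: string): string | undefined;

function resolveLiveKey(provider: string, envVar: string): string | undefined {
  const fromProfile = readProfileKey(provider);
  if (fromProfile) return fromProfile; // profile store wins
  if (process.env.CLAWDBOT_LIVE_REQUIRE_PROFILE_KEYS === "1") {
    return undefined; // enforce profile store only; ignore env keys
  }
  return process.env[envVar]; // e.g. a key exported from ~/.profile
}
```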

Docker runners (optional “works in Linux” checks)

These runners execute inside the repo Docker image, mounting your local config dir and workspace (and sourcing ~/.profile if mounted):
  • Direct models: pnpm test:docker:live-models (script: scripts/test-live-models-docker.sh)
  • Gateway + dev agent: pnpm test:docker:live-gateway (script: scripts/test-live-gateway-models-docker.sh)
  • Onboarding wizard (TTY, full scaffolding): pnpm test:docker:onboard (script: scripts/e2e/onboard-docker.sh)
  • Gateway networking (two containers, WS auth + health): pnpm test:docker:gateway-network (script: scripts/e2e/gateway-network-docker.sh)
Useful env vars:
  • CLAWDBOT_CONFIG_DIR=... (default: ~/.clawdbot) mounted to /home/node/.clawdbot
  • CLAWDBOT_WORKSPACE_DIR=... (default: ~/clawd) mounted to /home/node/clawd
  • CLAWDBOT_PROFILE_FILE=... (default: ~/.profile) mounted to /home/node/.profile and sourced before running tests
  • CLAWDBOT_LIVE_GATEWAY_MODELS=... / CLAWDBOT_LIVE_MODELS=... to narrow the run
  • CLAWDBOT_LIVE_REQUIRE_PROFILE_KEYS=1 to ensure creds come from the profile store (not env)

Docs sanity

Run docs checks after doc edits: pnpm docs:list.

Offline regression (CI-safe)

These are “real pipeline” regressions without real providers:
  • Gateway tool calling (mock OpenAI, real gateway + agent loop): src/gateway/gateway.tool-calling.mock-openai.test.ts
  • Gateway wizard (WS wizard.start/wizard.next, writes config + auth enforced): src/gateway/gateway.wizard.e2e.test.ts
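
The core trick of the mock-OpenAI test, heavily condensed (the real test covers much more): serve a canned tool-call completion from a local HTTP server and point the gateway's OpenAI-compatible base URL at it.

```ts
import { createServer } from "node:http";

// Minimal fake chat-completions endpoint: always answers with one tool call.
const mockOpenAI = createServer((_req, res) => {
  res.setHeader("content-type", "application/json");
  res.end(
    JSON.stringify({
      id: "chatcmpl-mock",
      choices: [
        {
          index: 0,
          finish_reason: "tool_calls",
          message: {
            role: "assistant",
            content: null,
            tool_calls: [
              {
                id: "call_1",
                type: "function",
                function: { name: "read", arguments: '{"path":"README.md"}' },
              },
            ],
          },
        },
      ],
    }),
  );
});

mockOpenAI.listen(0, () => {
  // The gateway under test is then configured to use this server's address
  // as its provider base URL, so the full agent loop runs with no real keys.
});
```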

Adding regressions (guidance)

When you fix a provider/model issue discovered in live:
  • Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
  • If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
  • Prefer targeting the smallest layer that catches the bug:
    • provider request conversion/replay bug → direct models test
    • gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test
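
For the “capture the exact request-shape transformation” case, a snapshot over the converter output is often enough. A sketch, with a toy converter standing in for the real history-to-provider-request mapping:

```ts
import { expect, it } from "vitest";

// Toy converter; the real one lives wherever the bug was fixed.
function toProviderRequest(history: Array<Record<string, unknown>>) {
  return {
    messages: history.map((turn) =>
      turn.role === "tool"
        ? { role: "tool", tool_call_id: turn.toolCallId, content: turn.content }
        : turn,
    ),
  };
}

it("pairs a tool-call-only turn with its tool result (regression)", () => {
  const history = [
    { role: "assistant", toolCalls: [{ id: "call_1", name: "read" }] },
    { role: "tool", toolCallId: "call_1", content: "file contents" },
  ];
  // Pins the exact request shape; conversion regressions show up as
  // snapshot diffs with no network involved.
  expect(toProviderRequest(history)).toMatchSnapshot();
});
```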