---
summary: "Testing kit: unit/e2e/live suites, Docker runners, and what each test covers"
read_when:
  - Running tests locally or in CI
  - Adding regressions for model/provider bugs
  - Debugging gateway + agent behavior
title: "Testing"
---

# Testing

OpenClaw has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners.

This doc is a “how we test” guide:

- What each suite covers (and what it deliberately does _not_ cover)
- Which commands to run for common workflows (local, pre-push, debugging)
- How live tests discover credentials and select models/providers
- How to add regressions for real-world model/provider issues

## Quick start

Most days:

- Full gate (expected before push): `pnpm build && pnpm check && pnpm test`
- Faster local full-suite run on a roomy machine: `pnpm test:max`

When you touch tests or want extra confidence:

- Coverage gate: `pnpm test:coverage`
- E2E suite: `pnpm test:e2e`

When debugging real providers/models (requires real creds):

- Live suite (models + gateway tool/image probes): `pnpm test:live`

Tip: when you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.

## Test suites (what runs where)

Think of the suites as “increasing realism” (and increasing flakiness/cost):

### Unit / integration (default)

- Command: `pnpm test`
- Config: `scripts/test-parallel.mjs` (runs `vitest.unit.config.ts`, `vitest.extensions.config.ts`, `vitest.gateway.config.ts`)
- Files: `src/**/*.test.ts`, `extensions/**/*.test.ts`
- Scope:
  - Pure unit tests
  - In-process integration tests (gateway auth, routing, tooling, parsing, config)
  - Deterministic regressions for known bugs
- Expectations:
  - Runs in CI
  - No real keys required
  - Should be fast and stable
- Scheduler note:
  - `pnpm test` now keeps a small checked-in behavioral manifest for true pool/isolation overrides and a separate timing snapshot for the slowest unit files.
  - Extension-only local runs now also use a checked-in extensions timing snapshot plus a slightly coarser shared batch target on high-memory hosts, so the shared extensions lane avoids spawning an extra batch when two measured shared runs are enough.
  - High-memory local extension shared batches also run with a slightly higher worker cap than before, which shortened the two remaining shared extension batches without changing the isolated extension lanes.
  - High-memory local channel runs now reuse the checked-in channel timing snapshot to split the shared channels lane into a few measured batches instead of one long shared worker.
  - High-memory local channel shared batches also run with a slightly lower worker cap than shared unit batches, which helped targeted channel reruns avoid CPU oversubscription once isolated channel lanes are already in flight.
  - Targeted local channel reruns now start splitting shared channel work a bit earlier, which keeps medium-sized targeted reruns from leaving one oversized shared channel batch on the critical path.
  - Targeted local unit reruns also split medium-sized shared unit selections into measured batches, which helps large focused reruns overlap instead of waiting behind one long shared unit lane.
  - High-memory local multi-surface runs also use slightly coarser shared `unit-fast` batches so the mixed planner spends less time spinning up extra shared unit workers before the later surfaces can overlap.
  - Shared unit, extension, channel, and gateway runs all stay on Vitest `forks`.
  - The wrapper keeps measured fork-isolated exceptions and heavy singleton lanes explicit in `test/fixtures/test-parallel.behavior.json`.
  - The wrapper peels the heaviest measured files into dedicated lanes instead of relying on a growing hand-maintained exclusion list.
  - For surface-only local runs, unit, extension, and channel shared lanes can overlap their isolated hotspots instead of waiting behind one serial prefix.
  - For multi-surface local runs, the wrapper keeps the shared surface phases ordered, but batches inside the same shared phase now fan out together, deferred isolated work can overlap the next shared phase, and spare `unit-fast` headroom now starts that deferred work earlier instead of leaving those slots idle.
  - Refresh the timing snapshots with `pnpm test:perf:update-timings` and `pnpm test:perf:update-timings:extensions` after major suite shape changes.
- Embedded runner note:
  - When you change message-tool discovery inputs or compaction runtime context, keep both levels of coverage.
  - Add focused helper regressions for pure routing/normalization boundaries.
  - Also keep the embedded runner integration suites healthy: `src/agents/pi-embedded-runner/compact.hooks.test.ts`, `src/agents/pi-embedded-runner/run.overflow-compaction.test.ts`, and `src/agents/pi-embedded-runner/run.overflow-compaction.loop.test.ts`.
  - Those suites verify that scoped ids and compaction behavior still flow through the real `run.ts` / `compact.ts` paths; helper-only tests are not a sufficient substitute for those integration paths.
- Pool note:
  - Base Vitest config still defaults to `forks`.
  - Unit, channel, extension, and gateway wrapper lanes all default to `forks`.
  - Unit, channel, and extension configs default to `isolate: false` for faster file startup.
  - `pnpm test` also passes `--isolate=false` at the wrapper level.
  - Opt back into Vitest file isolation with `OPENCLAW_TEST_ISOLATE=1 pnpm test`.
  - `OPENCLAW_TEST_NO_ISOLATE=0` or `OPENCLAW_TEST_NO_ISOLATE=false` also force isolated runs.
- Fast-local iteration note:
  - `pnpm test:changed` runs the wrapper with `--changed origin/main`.
  - `pnpm test:changed:max` keeps the same changed-file filter but uses the wrapper's aggressive local planner profile.
  - `pnpm test:max` exposes that same planner profile for a full local run.
  - On supported local Node versions, including Node 25, the normal profile can use top-level lane parallelism.
    `pnpm test:max` still pushes the planner harder when you want a more aggressive local run.
  - The base Vitest config marks the wrapper manifests/config files as `forceRerunTriggers` so changed-mode reruns stay correct when scheduler inputs change.
  - The wrapper keeps `OPENCLAW_VITEST_FS_MODULE_CACHE` enabled on supported hosts, but assigns a lane-local `OPENCLAW_VITEST_FS_MODULE_CACHE_PATH` so concurrent Vitest processes do not race on one shared experimental cache directory.
  - Set `OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/abs/path` if you want one explicit cache location for direct single-run profiling.
- Perf-debug note:
  - `pnpm test:perf:imports` enables Vitest import-duration reporting plus import-breakdown output.
  - `pnpm test:perf:imports:changed` scopes the same profiling view to files changed since `origin/main`.
  - `pnpm test:perf:profile:main` writes a main-thread CPU profile for Vitest/Vite startup and transform overhead.
  - `pnpm test:perf:profile:runner` writes runner CPU+heap profiles for the unit suite with file parallelism disabled.

### E2E (gateway smoke)

- Command: `pnpm test:e2e`
- Config: `vitest.e2e.config.ts`
- Files: `src/**/*.e2e.test.ts`, `test/**/*.e2e.test.ts`
- Runtime defaults:
  - Uses Vitest `forks` for deterministic cross-file isolation.
  - Uses adaptive workers (CI: up to 2, local: 1 by default).
  - Runs in silent mode by default to reduce console I/O overhead.
- Useful overrides:
  - `OPENCLAW_E2E_WORKERS=<n>` to force worker count (capped at 16).
  - `OPENCLAW_E2E_VERBOSE=1` to re-enable verbose console output.
- Scope:
  - Multi-instance gateway end-to-end behavior
  - WebSocket/HTTP surfaces, node pairing, and heavier networking
- Expectations:
  - Runs in CI (when enabled in the pipeline)
  - No real keys required
  - More moving parts than unit tests (can be slower)

### E2E: OpenShell backend smoke

- Command: `pnpm test:e2e:openshell`
- File: `test/openshell-sandbox.e2e.test.ts`
- Scope:
  - Starts an isolated OpenShell gateway on the host via Docker
  - Creates a sandbox from a temporary local Dockerfile
  - Exercises OpenClaw's OpenShell backend over real `sandbox ssh-config` + SSH exec
  - Verifies remote-canonical filesystem behavior through the sandbox fs bridge
- Expectations:
  - Opt-in only; not part of the default `pnpm test:e2e` run
  - Requires a local `openshell` CLI plus a working Docker daemon
  - Uses isolated `HOME` / `XDG_CONFIG_HOME`, then destroys the test gateway and sandbox
- Useful overrides:
  - `OPENCLAW_E2E_OPENSHELL=1` to enable the test when running the broader e2e suite manually
  - `OPENCLAW_E2E_OPENSHELL_COMMAND=/path/to/openshell` to point at a non-default CLI binary or wrapper script

### Live (real providers + real models)

- Command: `pnpm test:live`
- Config: `vitest.live.config.ts`
- Files: `src/**/*.live.test.ts`
- Default: **enabled** by `pnpm test:live` (sets `OPENCLAW_LIVE_TEST=1`)
- Scope:
  - “Does this provider/model actually work _today_ with real creds?”
  - Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
- Expectations:
  - Not CI-stable by design (real networks, real provider policies, quotas, outages)
  - Costs money / uses rate limits
  - Prefer running narrowed subsets instead of “everything”
- Live runs will source `~/.profile` to pick up missing API keys
- API key rotation (provider-specific): set `*_API_KEYS` with comma/semicolon format or `*_API_KEY_1`, `*_API_KEY_2` (for example `OPENAI_API_KEYS`, `ANTHROPIC_API_KEYS`, `GEMINI_API_KEYS`) or per-live override via `OPENCLAW_LIVE_*_KEY`; tests retry on rate limit
  responses.
- Progress/heartbeat output:
  - Live suites now emit progress lines to stderr so long provider calls are visibly active even when Vitest console capture is quiet.
  - `vitest.live.config.ts` disables Vitest console interception so provider/gateway progress lines stream immediately during live runs.
  - Tune direct-model heartbeats with `OPENCLAW_LIVE_HEARTBEAT_MS`.
  - Tune gateway/probe heartbeats with `OPENCLAW_LIVE_GATEWAY_HEARTBEAT_MS`.

## Which suite should I run?

Use this quick decision guide:

- Editing logic/tests: run `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- Touching gateway networking / WS protocol / pairing: add `pnpm test:e2e`
- Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed `pnpm test:live`

## Live: Android node capability sweep

- Test: `src/gateway/android-node.capabilities.live.test.ts`
- Script: `pnpm android:test:integration`
- Goal: invoke **every command currently advertised** by a connected Android node and assert command contract behavior.
- Scope:
  - Preconditioned/manual setup (the suite does not install/run/pair the app).
  - Command-by-command gateway `node.invoke` validation for the selected Android node.
- Required pre-setup:
  - Android app already connected + paired to the gateway.
  - App kept in foreground.
  - Permissions/capture consent granted for capabilities you expect to pass.
- Optional target overrides:
  - `OPENCLAW_ANDROID_NODE_ID` or `OPENCLAW_ANDROID_NODE_NAME`.
  - `OPENCLAW_ANDROID_GATEWAY_URL` / `OPENCLAW_ANDROID_GATEWAY_TOKEN` / `OPENCLAW_ANDROID_GATEWAY_PASSWORD`.
- Full Android setup details: [Android App](/platforms/android)

## Live: model smoke (profile keys)

Live tests are split into two layers so we can isolate failures:

- “Direct model” tells us the provider/model can answer at all with the given key.
- “Gateway smoke” tells us the full gateway+agent pipeline works for that model (sessions, history, tools, sandbox policy, etc.).
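
As a rough illustration of how the two layers narrow down a failure, the sketch below maps the outcome of each layer to a likely culprit. The `LayerResult` type and the `triageLiveFailure` helper are hypothetical names that do not exist in the repo; they only restate the reasoning behind the split.

```typescript
// Hypothetical helper, not part of OpenClaw: it only encodes the triage
// reasoning behind the two-layer live test split.
type LayerResult = "pass" | "fail" | "skipped";

function triageLiveFailure(direct: LayerResult, gateway: LayerResult): string {
  if (direct === "fail") {
    // The provider/model cannot answer at all with the given key.
    return "provider API or key problem";
  }
  if (gateway === "fail") {
    if (direct === "pass") {
      // Direct completion works, so the provider and key are fine.
      return "gateway + agent pipeline problem";
    }
    // The direct layer was not run, so the failure cannot be attributed yet.
    return "inconclusive: run the direct model layer to isolate";
  }
  return "no failure detected";
}
```

In practice this means that when only the gateway smoke was run and it failed, rerunning the same model through the direct layer is usually the cheapest next step.
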

### Layer 1: Direct model completion (no gateway)

- Test: `src/agents/models.profiles.live.test.ts`
- Goal:
  - Enumerate discovered models
  - Use `getApiKeyForModel` to select models you have creds for
  - Run a small completion per model (and targeted regressions where needed)
- How to enable:
  - `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
  - Set `OPENCLAW_LIVE_MODELS=modern` (or `all`, alias for modern) to actually run this suite; otherwise it skips to keep `pnpm test:live` focused on gateway smoke
- How to select models:
  - `OPENCLAW_LIVE_MODELS=modern` to run the modern allowlist (Opus/Sonnet/Haiku 4.5, GPT-5.x + Codex, Gemini 3, GLM 4.7, MiniMax M2.7, Grok 4)
  - `OPENCLAW_LIVE_MODELS=all` is an alias for the modern allowlist
  - or `OPENCLAW_LIVE_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-6,..."` (comma allowlist)
- How to select providers:
  - `OPENCLAW_LIVE_PROVIDERS="google,google-antigravity,google-gemini-cli"` (comma allowlist)
- Where keys come from:
  - By default: profile store and env fallbacks
  - Set `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to enforce **profile store** only
- Why this exists:
  - Separates “provider API is broken / key is invalid” from “gateway agent pipeline is broken”
  - Contains small, isolated regressions (example: OpenAI Responses/Codex Responses reasoning replay + tool-call flows)

### Layer 2: Gateway + dev agent smoke (what "@openclaw" actually does)

- Test: `src/gateway/gateway-models.profiles.live.test.ts`
- Goal:
  - Spin up an in-process gateway
  - Create/patch an `agent:dev:*` session (model override per run)
  - Iterate models-with-keys and assert:
    - “meaningful” response (no tools)
    - a real tool invocation works (read probe)
    - optional extra tool probes (exec+read probe)
    - OpenAI regression paths (tool-call-only → follow-up) keep working
- Probe details (so you can explain failures quickly):
  - `read` probe: the test writes a nonce file in the workspace and asks the agent to `read` it and echo the nonce back.
  - `exec+read` probe: the test asks the agent to `exec`-write a nonce into a temp file, then `read` it back.
  - image probe: the test attaches a generated PNG (cat + randomized code) and expects the model to return `cat` followed by that code.
  - Implementation reference: `src/gateway/gateway-models.profiles.live.test.ts` and `src/gateway/live-image-probe.ts`.
- How to enable:
  - `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
- How to select models:
  - Default: modern allowlist (Opus/Sonnet/Haiku 4.5, GPT-5.x + Codex, Gemini 3, GLM 4.7, MiniMax M2.7, Grok 4)
  - `OPENCLAW_LIVE_GATEWAY_MODELS=all` is an alias for the modern allowlist
  - Or set `OPENCLAW_LIVE_GATEWAY_MODELS="provider/model"` (or comma list) to narrow
- How to select providers (avoid “OpenRouter everything”):
  - `OPENCLAW_LIVE_GATEWAY_PROVIDERS="google,google-antigravity,google-gemini-cli,openai,anthropic,zai,minimax"` (comma allowlist)
- Tool + image probes are always on in this live test:
  - `read` probe + `exec+read` probe (tool stress)
  - image probe runs when the model advertises image input support
  - Flow (high level):
    - Test generates a tiny PNG with “CAT” + random code (`src/gateway/live-image-probe.ts`)
    - Sends it via `agent` `attachments: [{ mimeType: "image/png", content: "" }]`
    - Gateway parses attachments into `images[]` (`src/gateway/server-methods/agent.ts` + `src/gateway/chat-attachments.ts`)
    - Embedded agent forwards a multimodal user message to the model
    - Assertion: reply contains `cat` + the code (OCR tolerance: minor mistakes allowed)

Tip: to see what you can test on your machine (and the exact `provider/model` ids), run:

```bash
openclaw models list
openclaw models list --json
```

## Live: Anthropic setup-token smoke

- Test: `src/agents/anthropic.setup-token.live.test.ts`
- Goal: verify Claude Code CLI setup-token (or a pasted setup-token profile) can complete an Anthropic prompt.
- Enable:
  - `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
  - `OPENCLAW_LIVE_SETUP_TOKEN=1`
- Token sources (pick one):
  - Profile: `OPENCLAW_LIVE_SETUP_TOKEN_PROFILE=anthropic:setup-token-test`
  - Raw token: `OPENCLAW_LIVE_SETUP_TOKEN_VALUE=sk-ant-oat01-...`
- Model override (optional):
  - `OPENCLAW_LIVE_SETUP_TOKEN_MODEL=anthropic/claude-opus-4-6`

Setup example:

```bash
openclaw models auth paste-token --provider anthropic --profile-id anthropic:setup-token-test
OPENCLAW_LIVE_SETUP_TOKEN=1 OPENCLAW_LIVE_SETUP_TOKEN_PROFILE=anthropic:setup-token-test pnpm test:live src/agents/anthropic.setup-token.live.test.ts
```

## Live: CLI backend smoke (Claude Code CLI or other local CLIs)

- Test: `src/gateway/gateway-cli-backend.live.test.ts`
- Goal: validate the Gateway + agent pipeline using a local CLI backend, without touching your default config.
- Enable:
  - `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
  - `OPENCLAW_LIVE_CLI_BACKEND=1`
- Defaults:
  - Model: `claude-cli/claude-sonnet-4-6`
  - Command: `claude`
  - Args: `["-p","--output-format","json","--permission-mode","bypassPermissions"]`
- Overrides (optional):
  - `OPENCLAW_LIVE_CLI_BACKEND_MODEL="claude-cli/claude-opus-4-6"`
  - `OPENCLAW_LIVE_CLI_BACKEND_MODEL="codex-cli/gpt-5.4"`
  - `OPENCLAW_LIVE_CLI_BACKEND_COMMAND="/full/path/to/claude"`
  - `OPENCLAW_LIVE_CLI_BACKEND_ARGS='["-p","--output-format","json","--permission-mode","bypassPermissions"]'`
  - `OPENCLAW_LIVE_CLI_BACKEND_CLEAR_ENV='["ANTHROPIC_API_KEY","ANTHROPIC_API_KEY_OLD"]'`
  - `OPENCLAW_LIVE_CLI_BACKEND_IMAGE_PROBE=1` to send a real image attachment (paths are injected into the prompt).
  - `OPENCLAW_LIVE_CLI_BACKEND_IMAGE_ARG="--image"` to pass image file paths as CLI args instead of prompt injection.
  - `OPENCLAW_LIVE_CLI_BACKEND_IMAGE_MODE="repeat"` (or `"list"`) to control how image args are passed when `IMAGE_ARG` is set.
  - `OPENCLAW_LIVE_CLI_BACKEND_RESUME_PROBE=1` to send a second turn and validate resume flow.
  - `OPENCLAW_LIVE_CLI_BACKEND_DISABLE_MCP_CONFIG=0` to keep Claude Code CLI MCP config enabled (default disables MCP config with a temporary empty file).

Example:

```bash
OPENCLAW_LIVE_CLI_BACKEND=1 \
  OPENCLAW_LIVE_CLI_BACKEND_MODEL="claude-cli/claude-sonnet-4-6" \
  pnpm test:live src/gateway/gateway-cli-backend.live.test.ts
```

Docker recipe:

```bash
pnpm test:docker:live-cli-backend
```

Notes:

- The Docker runner lives at `scripts/test-live-cli-backend-docker.sh`.
- It runs the live CLI-backend smoke inside the repo Docker image as the non-root `node` user, because Claude CLI rejects `bypassPermissions` when invoked as root.
- For `claude-cli`, it installs the Linux `@anthropic-ai/claude-code` package into a cached writable prefix at `OPENCLAW_DOCKER_CLI_TOOLS_DIR` (default: `~/.cache/openclaw/docker-cli-tools`).
- It copies `~/.claude` into the container when available, but on machines where Claude auth is backed by `ANTHROPIC_API_KEY`, it also preserves `ANTHROPIC_API_KEY` / `ANTHROPIC_API_KEY_OLD` for the child Claude CLI via `OPENCLAW_LIVE_CLI_BACKEND_PRESERVE_ENV`.
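
Two of the CLI backend overrides in this section (`OPENCLAW_LIVE_CLI_BACKEND_ARGS` and `OPENCLAW_LIVE_CLI_BACKEND_CLEAR_ENV`) take a JSON array of strings. A minimal sketch of how such a value could be validated is shown below; `parseJsonStringArray` is a hypothetical helper, and the live test's real parsing may differ.

```typescript
// Hypothetical helper, not taken from the OpenClaw source. It returns the
// fallback when the variable is unset or blank, and rejects anything that
// is not a JSON array of strings.
function parseJsonStringArray(raw: string | undefined, fallback: string[]): string[] {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed: unknown = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (!Array.isArray(parsed) || !parsed.every((item) => typeof item === "string")) {
    throw new Error("expected a JSON array of strings");
  }
  return parsed as string[];
}

// The documented default args for the CLI backend smoke.
const defaultCliArgs = ["-p", "--output-format", "json", "--permission-mode", "bypassPermissions"];
```

Quoting matters on the shell side: wrapping the JSON in single quotes, as in the override examples, keeps the inner double quotes intact.
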

### Recommended live recipes

Narrow, explicit allowlists are fastest and least flaky:

- Single model, direct (no gateway):
  - `OPENCLAW_LIVE_MODELS="openai/gpt-5.2" pnpm test:live src/agents/models.profiles.live.test.ts`
- Single model, gateway smoke:
  - `OPENCLAW_LIVE_GATEWAY_MODELS="openai/gpt-5.2" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
- Tool calling across several providers:
  - `OPENCLAW_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-6,google/gemini-3-flash-preview,zai/glm-4.7,minimax/MiniMax-M2.7" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
- Google focus (Gemini API key + Antigravity):
  - Gemini (API key): `OPENCLAW_LIVE_GATEWAY_MODELS="google/gemini-3-flash-preview" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
  - Antigravity (OAuth): `OPENCLAW_LIVE_GATEWAY_MODELS="google-antigravity/claude-opus-4-6-thinking,google-antigravity/gemini-3-pro-high" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`

Notes:

- `google/...` uses the Gemini API (API key).
- `google-antigravity/...` uses the Antigravity OAuth bridge (Cloud Code Assist-style agent endpoint).
- `google-gemini-cli/...` uses the local Gemini CLI on your machine (separate auth + tooling quirks).
- Gemini API vs Gemini CLI:
  - API: OpenClaw calls Google’s hosted Gemini API over HTTP (API key / profile auth); this is what most users mean by “Gemini”.
  - CLI: OpenClaw shells out to a local `gemini` binary; it has its own auth and can behave differently (streaming/tool support/version skew).

## Live: model matrix (what we cover)

There is no fixed “CI model list” (live is opt-in), but these are the **recommended** models to cover regularly on a dev machine with keys.
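
The allowlist values used in the live recipes follow a simple shape: either the keyword `modern` (or its alias `all`) or a comma-separated list of `provider/model` ids. The sketch below shows one way to interpret such a value. `parseModelAllowlist` and its return type are hypothetical, and the real tests may normalize input differently; what an unset variable means (skip vs. default allowlist) is left to the caller because it differs per suite.

```typescript
// Hypothetical sketch, not the repo's implementation.
type ModelAllowlist =
  | { kind: "modern" }
  | { kind: "explicit"; ids: string[] };

function parseModelAllowlist(raw: string): ModelAllowlist {
  const value = raw.trim();
  // `all` is documented as an alias for the modern allowlist.
  if (value === "modern" || value === "all") {
    return { kind: "modern" };
  }
  const ids = value
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  for (const id of ids) {
    // Require a non-empty provider and model around the first slash.
    const slash = id.indexOf("/");
    if (slash <= 0 || slash === id.length - 1) {
      throw new Error(`expected provider/model, got "${id}"`);
    }
  }
  return { kind: "explicit", ids };
}
```

For example, `parseModelAllowlist("openai/gpt-5.2, zai/glm-4.7")` yields an explicit list of two ids, while a bare `gpt-5.2` is rejected because it lacks a provider prefix.
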

### Modern smoke set (tool calling + image)

This is the “common models” run we expect to keep working:

- OpenAI (non-Codex): `openai/gpt-5.2` (optional: `openai/gpt-5.1`)
- OpenAI Codex: `openai-codex/gpt-5.4`
- Anthropic: `anthropic/claude-opus-4-6` (or `anthropic/claude-sonnet-4-6`)
- Google (Gemini API): `google/gemini-3.1-pro-preview` and `google/gemini-3-flash-preview` (avoid older Gemini 2.x models)
- Google (Antigravity): `google-antigravity/claude-opus-4-6-thinking` and `google-antigravity/gemini-3-flash`
- Z.AI (GLM): `zai/glm-4.7`
- MiniMax: `minimax/MiniMax-M2.7`

Run gateway smoke with tools + image:
`OPENCLAW_LIVE_GATEWAY_MODELS="openai/gpt-5.2,openai-codex/gpt-5.4,anthropic/claude-opus-4-6,google/gemini-3.1-pro-preview,google/gemini-3-flash-preview,google-antigravity/claude-opus-4-6-thinking,google-antigravity/gemini-3-flash,zai/glm-4.7,minimax/MiniMax-M2.7" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`

### Baseline: tool calling (Read + optional Exec)

Pick at least one per provider family:

- OpenAI: `openai/gpt-5.2` (or `openai/gpt-5-mini`)
- Anthropic: `anthropic/claude-opus-4-6` (or `anthropic/claude-sonnet-4-6`)
- Google: `google/gemini-3-flash-preview` (or `google/gemini-3.1-pro-preview`)
- Z.AI (GLM): `zai/glm-4.7`
- MiniMax: `minimax/MiniMax-M2.7`

Optional additional coverage (nice to have):

- xAI: `xai/grok-4` (or latest available)
- Mistral: `mistral/`… (pick one “tools” capable model you have enabled)
- Cerebras: `cerebras/`… (if you have access)
- LM Studio: `lmstudio/`… (local; tool calling depends on API mode)

### Vision: image send (attachment → multimodal message)

Include at least one image-capable model in `OPENCLAW_LIVE_GATEWAY_MODELS` (Claude/Gemini/OpenAI vision-capable variants, etc.) to exercise the image probe.
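
For orientation, the image probe's assertion (the reply must contain `cat` plus the random code, with minor OCR mistakes tolerated) could look roughly like the sketch below. The one-character tolerance and the `replyMatchesImageProbe` helper are assumptions made for illustration; the actual logic lives in `src/gateway/live-image-probe.ts` and may differ.

```typescript
// Hypothetical sketch of a tolerant image-probe check; not the repo's code.

// Counts differing positions; callers pass strings of equal length.
function hammingDistance(a: string, b: string): number {
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff++;
  }
  return diff;
}

function replyMatchesImageProbe(reply: string, code: string, maxMistakes = 1): boolean {
  const text = reply.toLowerCase();
  const wanted = code.toLowerCase();
  if (!text.includes("cat")) {
    return false;
  }
  // Slide a window of the code's length over the reply and accept any
  // position that differs from the expected code in at most maxMistakes chars.
  for (let start = 0; start + wanted.length <= text.length; start++) {
    const candidate = text.slice(start, start + wanted.length);
    if (hammingDistance(candidate, wanted) <= maxMistakes) {
      return true;
    }
  }
  return false;
}
```

A substitution-only tolerance like this handles common OCR confusions such as `O` vs `Q`, but not dropped or inserted characters; an edit-distance check would be the natural extension if that matters.
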

### Aggregators / alternate gateways

If you have keys enabled, we also support testing via:

- OpenRouter: `openrouter/...` (hundreds of models; use `openclaw models scan` to find tool+image capable candidates)
- OpenCode: `opencode/...` for Zen and `opencode-go/...` for Go (auth via `OPENCODE_API_KEY` / `OPENCODE_ZEN_API_KEY`)

More providers you can include in the live matrix (if you have creds/config):

- Built-in: `openai`, `openai-codex`, `anthropic`, `google`, `google-vertex`, `google-antigravity`, `google-gemini-cli`, `zai`, `openrouter`, `opencode`, `opencode-go`, `xai`, `groq`, `cerebras`, `mistral`, `github-copilot`
- Via `models.providers` (custom endpoints): `minimax` (cloud/API), plus any OpenAI/Anthropic-compatible proxy (LM Studio, vLLM, LiteLLM, etc.)

Tip: don’t try to hardcode “all models” in docs. The authoritative list is whatever `discoverModels(...)` returns on your machine + whatever keys are available.

## Credentials (never commit)

Live tests discover credentials the same way the CLI does. Practical implications:

- If the CLI works, live tests should find the same keys.
- If a live test says “no creds”, debug the same way you’d debug `openclaw models list` / model selection.
- Profile store: `~/.openclaw/credentials/` (preferred; what “profile keys” means in the tests)
- Config: `~/.openclaw/openclaw.json` (or `OPENCLAW_CONFIG_PATH`)

If you want to rely on env keys (e.g. exported in your `~/.profile`), run local tests after `source ~/.profile`, or use the Docker runners below (they can mount `~/.profile` into the container).

## Deepgram live (audio transcription)

- Test: `src/media-understanding/providers/deepgram/audio.live.test.ts`
- Enable: `DEEPGRAM_API_KEY=... DEEPGRAM_LIVE_TEST=1 pnpm test:live src/media-understanding/providers/deepgram/audio.live.test.ts`

## BytePlus coding plan live

- Test: `src/agents/byteplus.live.test.ts`
- Enable: `BYTEPLUS_API_KEY=...
  BYTEPLUS_LIVE_TEST=1 pnpm test:live src/agents/byteplus.live.test.ts`
- Optional model override: `BYTEPLUS_CODING_MODEL=ark-code-latest`

## Image generation live

- Test: `src/image-generation/runtime.live.test.ts`
- Command: `pnpm test:live src/image-generation/runtime.live.test.ts`
- Scope:
  - Enumerates every registered image-generation provider plugin
  - Loads missing provider env vars from your login shell (`~/.profile`) before probing
  - Uses live/env API keys ahead of stored auth profiles by default, so stale test keys in `auth-profiles.json` do not mask real shell credentials
  - Skips providers with no usable auth/profile/model
  - Runs the stock image-generation variants through the shared runtime capability:
    - `google:flash-generate`
    - `google:pro-generate`
    - `google:pro-edit`
    - `openai:default-generate`
- Current bundled providers covered:
  - `openai`
  - `google`
- Optional narrowing:
  - `OPENCLAW_LIVE_IMAGE_GENERATION_PROVIDERS="openai,google"`
  - `OPENCLAW_LIVE_IMAGE_GENERATION_MODELS="openai/gpt-image-1,google/gemini-3.1-flash-image-preview"`
  - `OPENCLAW_LIVE_IMAGE_GENERATION_CASES="google:flash-generate,google:pro-edit"`
- Optional auth behavior:
  - `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to force profile-store auth and ignore env-only overrides

## Docker runners (optional "works in Linux" checks)

These Docker runners split into two buckets:

- Live-model runners: `test:docker:live-models` and `test:docker:live-gateway` run `pnpm test:live` inside the repo Docker image, mounting your local config dir and workspace (and sourcing `~/.profile` if mounted).
- Container smoke runners: `test:docker:openwebui`, `test:docker:onboard`, `test:docker:gateway-network`, and `test:docker:plugins` boot one or more real containers and verify higher-level integration paths.

The live-model Docker runners also bind-mount only the needed CLI auth homes (or all supported ones when the run is not narrowed), then copy them into the container home before the run so external-CLI OAuth can refresh tokens without mutating the host auth store:

- Direct models: `pnpm test:docker:live-models` (script: `scripts/test-live-models-docker.sh`)
- CLI backend smoke: `pnpm test:docker:live-cli-backend` (script: `scripts/test-live-cli-backend-docker.sh`)
- Gateway + dev agent: `pnpm test:docker:live-gateway` (script: `scripts/test-live-gateway-models-docker.sh`)
- Open WebUI live smoke: `pnpm test:docker:openwebui` (script: `scripts/e2e/openwebui-docker.sh`)
- Onboarding wizard (TTY, full scaffolding): `pnpm test:docker:onboard` (script: `scripts/e2e/onboard-docker.sh`)
- Gateway networking (two containers, WS auth + health): `pnpm test:docker:gateway-network` (script: `scripts/e2e/gateway-network-docker.sh`)
- Plugins (install smoke + `/plugin` alias + Claude-bundle restart semantics): `pnpm test:docker:plugins` (script: `scripts/e2e/plugins-docker.sh`)

The live-model Docker runners also bind-mount the current checkout read-only and stage it into a temporary workdir inside the container. This keeps the runtime image slim while still running Vitest against your exact local source/config. They also set `OPENCLAW_SKIP_CHANNELS=1` so gateway live probes do not start real Telegram/Discord/etc. channel workers inside the container. `test:docker:live-models` still runs `pnpm test:live`, so pass through `OPENCLAW_LIVE_GATEWAY_*` as well when you need to narrow or exclude gateway live coverage from that Docker lane.

`test:docker:openwebui` is a higher-level compatibility smoke: it starts an OpenClaw gateway container with the OpenAI-compatible HTTP endpoints enabled, starts a pinned Open WebUI container against that gateway, signs in through Open WebUI, verifies `/api/models` exposes `openclaw/default`, then sends a real chat request through Open WebUI's `/api/chat/completions` proxy. The first run can be noticeably slower because Docker may need to pull the Open WebUI image and Open WebUI may need to finish its own cold-start setup. This lane expects a usable live model key, and `OPENCLAW_PROFILE_FILE` (`~/.profile` by default) is the primary way to provide it in Dockerized runs. Successful runs print a small JSON payload like `{ "ok": true, "model": "openclaw/default", ... }`.

Manual ACP plain-language thread smoke (not CI):

- `bun scripts/dev/discord-acp-plain-language-smoke.ts --channel ...`
- Keep this script for regression/debug workflows. It may be needed again for ACP thread routing validation, so do not delete it.

Useful env vars:

- `OPENCLAW_CONFIG_DIR=...` (default: `~/.openclaw`) mounted to `/home/node/.openclaw`
- `OPENCLAW_WORKSPACE_DIR=...` (default: `~/.openclaw/workspace`) mounted to `/home/node/.openclaw/workspace`
- `OPENCLAW_PROFILE_FILE=...` (default: `~/.profile`) mounted to `/home/node/.profile` and sourced before running tests
- `OPENCLAW_DOCKER_CLI_TOOLS_DIR=...` (default: `~/.cache/openclaw/docker-cli-tools`) mounted to `/home/node/.npm-global` for cached CLI installs inside Docker
- External CLI auth dirs under `$HOME` are mounted read-only under `/host-auth/...`, then copied into `/home/node/...` before tests start
  - Default: mount all supported dirs (`.codex`, `.claude`, `.minimax`)
  - Narrowed provider runs mount only the needed dirs inferred from `OPENCLAW_LIVE_PROVIDERS` / `OPENCLAW_LIVE_GATEWAY_PROVIDERS`
  - Override manually with `OPENCLAW_DOCKER_AUTH_DIRS=all`, `OPENCLAW_DOCKER_AUTH_DIRS=none`, or a comma list like `OPENCLAW_DOCKER_AUTH_DIRS=.claude,.codex`
- `OPENCLAW_LIVE_GATEWAY_MODELS=...` / `OPENCLAW_LIVE_MODELS=...` to narrow the run
- `OPENCLAW_LIVE_GATEWAY_PROVIDERS=...` / `OPENCLAW_LIVE_PROVIDERS=...` to filter providers in-container
- `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to ensure creds come from the profile store (not env)
- `OPENCLAW_OPENWEBUI_MODEL=...` to choose the model exposed by the gateway for the Open WebUI smoke
- `OPENCLAW_OPENWEBUI_PROMPT=...` to override the nonce-check prompt used by the Open WebUI smoke
- `OPENWEBUI_IMAGE=...` to override the pinned Open WebUI image tag

## Docs sanity

Run docs checks after doc edits: `pnpm docs:list`.

## Offline regression (CI-safe)

These are “real pipeline” regressions without real providers:

- Gateway tool calling (mock OpenAI, real gateway + agent loop): `src/gateway/gateway.test.ts` (case: "runs a mock OpenAI tool call end-to-end via gateway agent loop")
- Gateway wizard (WS `wizard.start`/`wizard.next`, writes config + auth enforced): `src/gateway/gateway.test.ts` (case: "runs wizard over ws and writes auth token config")

## Agent reliability evals (skills)

We already have a few CI-safe tests that behave like “agent reliability evals”:

- Mock tool-calling through the real gateway + agent loop (`src/gateway/gateway.test.ts`).
- End-to-end wizard flows that validate session wiring and config effects (`src/gateway/gateway.test.ts`).

What’s still missing for skills (see [Skills](/tools/skills)):

- **Decisioning:** when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
- **Compliance:** does the agent read `SKILL.md` before use and follow required steps/args?
- **Workflow contracts:** multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.

Future evals should stay deterministic first:

- A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
- A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
- Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.

## Contract tests (plugin and channel shape)

Contract tests verify that every registered plugin and channel conforms to its interface contract. They iterate over all discovered plugins and run a suite of shape and behavior assertions.

### Commands

- All contracts: `pnpm test:contracts`
- Channel contracts only: `pnpm test:contracts:channels`
- Provider contracts only: `pnpm test:contracts:plugins`

### Channel contracts

Located in `src/channels/plugins/contracts/*.contract.test.ts`:

- **plugin** - Basic plugin shape (id, name, capabilities)
- **setup** - Setup wizard contract
- **session-binding** - Session binding behavior
- **outbound-payload** - Message payload structure
- **inbound** - Inbound message handling
- **actions** - Channel action handlers
- **threading** - Thread ID handling
- **directory** - Directory/roster API
- **group-policy** - Group policy enforcement
- **status** - Channel status probes
- **registry** - Plugin registry shape

### Provider contracts

Located in `src/plugins/contracts/*.contract.test.ts`:

- **auth** - Auth flow contract
- **auth-choice** - Auth choice/selection
- **catalog** - Model catalog API
- **discovery** - Plugin discovery
- **loader** - Plugin loading
- **runtime** - Provider runtime
- **shape** - Plugin shape/interface
- **wizard** - Setup wizard

### When to run

- After changing plugin-sdk exports or subpaths
- After adding or modifying a channel or provider plugin
- After refactoring plugin registration or discovery

Contract tests run in CI and do not require real API keys.
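
To make the idea of a shape contract concrete, here is a minimal sketch of a check applied to every registered plugin. The `ChannelPluginLike` type, `checkPluginShape`, and `findContractViolations` are hypothetical stand-ins; the real contract suites assert considerably more than the three fields named in the **plugin** contract (id, name, capabilities).

```typescript
// Hypothetical sketch of a shape-style contract check; not the repo's code.
interface ChannelPluginLike {
  id?: unknown;
  name?: unknown;
  capabilities?: unknown;
}

// Returns a list of human-readable problems; an empty list means the
// plugin satisfies this (deliberately tiny) shape contract.
function checkPluginShape(plugin: ChannelPluginLike): string[] {
  const problems: string[] = [];
  if (typeof plugin.id !== "string" || plugin.id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  if (typeof plugin.name !== "string" || plugin.name.length === 0) {
    problems.push("name must be a non-empty string");
  }
  if (typeof plugin.capabilities !== "object" || plugin.capabilities === null) {
    problems.push("capabilities must be an object");
  }
  return problems;
}

// Iterates over every plugin, mirroring how contract tests cover all
// discovered plugins rather than a hand-picked subset.
function findContractViolations(
  plugins: ChannelPluginLike[],
): { index: number; problems: string[] }[] {
  return plugins
    .map((plugin, index) => ({ index, problems: checkPluginShape(plugin) }))
    .filter((entry) => entry.problems.length > 0);
}
```

Running the same assertions over the whole registry is what lets a newly added plugin fail fast when it misses part of the interface.
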

## Adding regressions (guidance)

When you fix a provider/model issue discovered in live:

- Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
- If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
- Prefer targeting the smallest layer that catches the bug:
  - provider request conversion/replay bug → direct models test
  - gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test
- SecretRef traversal guardrail:
  - `src/secrets/exec-secret-ref-id-parity.test.ts` derives one sampled target per SecretRef class from registry metadata (`listSecretTargetRegistryEntries()`), then asserts traversal-segment exec ids are rejected.
  - If you add a new `includeInPlan` SecretRef target family in `src/secrets/target-registry-data.ts`, update `classifyTargetClass` in that test. The test intentionally fails on unclassified target ids so new classes cannot be skipped silently.
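
As an illustration of what “traversal-segment exec ids are rejected” means, the following sketch flags any id that contains a `.` or `..` path segment. `hasTraversalSegment` is a hypothetical helper and the separator handling is an assumption; the real validation under `src/secrets/` may apply different rules.

```typescript
// Hypothetical sketch, not the repo's validation logic. An id is treated
// as unsafe when any slash- or backslash-separated segment is "." or "..".
function hasTraversalSegment(id: string): boolean {
  return id
    .split(/[\\/]/)
    .some((segment) => segment === "." || segment === "..");
}
```

Note that dots inside a segment (for example `a..b/c`) are not treated as traversal here; only whole segments count.
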