The standard advice for testing AI agents in 2026 is to add an LLM-as-a-judge step to your pipeline. Score the output. Gate on the score. Monte Carlo reports that LLM-as-a-judge evals routinely cost 10x what it costs to run the agent itself. That is not CI. That is a budget spiral.
I run blueclaw, an open-source terminal AI agent I’ve been building in public. I have a test suite, a GitHub Actions pipeline, and a YAML spec that defines exactly how the agent must behave for a given task. After a prompt change silently degraded blueclaw’s tool selection, I built a CI layer on top of the test runner. Since then it has caught four regressions I would have shipped.
TL;DR: Behavioral assertions – what tools the agent called, in what order, within what budget – are deterministic enough for CI and far cheaper than LLM-as-a-judge. For genuinely non-deterministic tasks, Wilson confidence intervals give you a three-verdict gate: Pass, Fail, or Inconclusive.
The CI for AI agents that actually works does not evaluate outputs. It enforces behavioral contracts.
The Regression That Built This
I changed blueclaw’s system prompt to improve its web research behavior. The change looked clean. I spot-checked five queries. Everything worked.
Two days later, I noticed it was calling `http_request` directly instead of routing through `web_search`. Same results, technically. But `http_request` bypasses the domain allowlist. It was a safety regression – the kind that does not surface in output quality checks because the output is correct.
I had been testing outcomes (“did it find the right information?”) not behaviors (“which tools did it use to get there?”). The outcome tests passed. The behavior had changed.
That is when I realized the testing stack was upside down.
The Common Pattern That Misses Regressions
Here is how most agent testing works:
Change prompt → Spot-check 5 queries → Looks fine → Merge
This catches gross failures. It misses:
- Tool substitution (using `http_request` instead of `web_search`)
- Step count regression (task now takes 8 steps instead of 3)
- Cost regression (token usage doubled after a prompt change)
- Safety violations (agent calls a forbidden tool on edge-case input)
- Tool ordering violations (agent searches after it was supposed to write)
All of these can look fine in manual spot-checks. None produce obviously wrong answers. All of them are breaking changes.
The Behavioral Contract Pattern
A behavioral contract is a YAML spec that defines how your agent must behave for a given task – not just what it should output.
Here is a contract from blueclaw’s actual test suite:
```yaml
tests:
  - goal: search for Python web frameworks and save the results to frameworks.txt
    expected_tools: [web_search, shell_command]
    forbidden_tools: [http_request]
    tool_order: [web_search, shell_command]
    expected_file_contains:
      frameworks.txt: "Django"
    max_steps: 5
    max_cost: 0.02
```
This contract asserts six things:
1. The agent must use `web_search` and `shell_command`
2. The agent must NOT use `http_request`
3. `web_search` must come before `shell_command`
4. The output file must contain "Django"
5. The task must complete in 5 steps or fewer
6. The task must cost $0.02 or less
Assertions 1, 2, 3, 5, and 6 are fully deterministic. The agent either called those tools or it did not. It either stayed within budget or it did not. No LLM judge needed.
Only assertion 4 requires the agent to produce correct content. Five of six checks run fast, cheap, and without any secondary model.
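This is what keeps the CI layer cheap. A minimal sketch of the deterministic checks, assuming a hypothetical trace format and spec schema (the names are mine, not blueclaw's internals):

```python
def check_contract(trace: list[dict], cost: float, spec: dict) -> list[str]:
    """Check one recorded agent run against a behavioral contract.

    `trace` is the ordered list of tool calls, e.g. [{"tool": "web_search"}, ...].
    Hypothetical schema; the point is that none of this needs a judge model.
    """
    tools_called = [step["tool"] for step in trace]
    failures = []

    for tool in spec.get("expected_tools", []):
        if tool not in tools_called:
            failures.append(f"expected tool never called: {tool}")

    for tool in spec.get("forbidden_tools", []):
        if tool in tools_called:
            failures.append(f"forbidden tool called: {tool}")

    # tool_order: first occurrences must appear in the specified relative order
    firsts = [tools_called.index(t) for t in spec.get("tool_order", []) if t in tools_called]
    if firsts != sorted(firsts):
        failures.append(f"tools called out of order, expected {spec['tool_order']}")

    if "max_steps" in spec and len(trace) > spec["max_steps"]:
        failures.append(f"took {len(trace)} steps, budget was {spec['max_steps']}")

    if "max_cost" in spec and cost > spec["max_cost"]:
        failures.append(f"cost ${cost:.4f}, budget was ${spec['max_cost']:.2f}")

    return failures
```

List membership, index comparisons, and two numeric bounds. Nothing here is probabilistic.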
The Four Regressions It Caught
Since shipping this, behavioral contracts have caught four regressions before they reached production:
1. Tool substitution. A prompt change caused blueclaw to prefer `http_request` over `web_search` for research tasks. Caught by `forbidden_tools`. The output was identical. The behavior was wrong.
2. Step regression. Refactoring the context management layer changed how blueclaw approached multi-file tasks. It started using 7 steps where it previously used 3. Caught by `max_steps`. No wrong answers – just silent inefficiency that would have compounded at scale.
3. Cost regression. A longer system prompt increased token consumption by 34% on simple tasks. Caught by `max_cost`. Would have been invisible until the next billing cycle.
4. Tool order violation. After adding a new tool, blueclaw occasionally tried to write files before searching for the content to write. Caught by `tool_order`. The task still completed – but the logic was wrong in a way that would have caused failures on tasks requiring search-before-write ordering.
None of these produced obviously wrong answers in spot-checks. All four would have shipped.
For how blueclaw handles the failures these contracts surface, see AI Agent Error Handling Patterns.
The 12 Assertions That Matter
These are the 12 assertion types in blueclaw’s test runner. Four of them have already caught real regressions (above). The rest are scoped to the failure modes I’ve seen in production agent systems:
| Assertion | What it catches |
|---|---|
| `expected_tools` | Missing tool calls (capability regression) |
| `forbidden_tools` | Unauthorized tool use (safety regression) |
| `tool_order` | Wrong execution sequence |
| `max_steps` | Efficiency regression |
| `max_cost` | Token cost regression |
| `max_duration_s` | Latency regression |
| `expected_files` | Missing file creation |
| `expected_file_contains` | Wrong file content |
| `expected_output_contains` | Missing output text |
| `forbidden_output_contains` | Unexpected output (tracebacks, error leaks) |
| `output_regex` | Output format regression |
| `runs` + `threshold` | Non-deterministic reliability testing |
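The output-facing assertions reduce to string and regex comparisons the same way. A sketch, under the same hypothetical schema as above:

```python
import re

def check_output(output: str, spec: dict) -> list[str]:
    """Deterministic output assertions; hypothetical scalar-field schema."""
    failures = []

    needle = spec.get("expected_output_contains")
    if needle and needle not in output:
        failures.append(f"missing expected output: {needle!r}")

    bad = spec.get("forbidden_output_contains")
    if bad and bad in output:
        failures.append(f"forbidden output present: {bad!r}")

    pattern = spec.get("output_regex")
    if pattern and not re.search(pattern, output):
        failures.append(f"output does not match /{pattern}/")

    return failures
```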
The last one is where non-determinism actually lives – and where most testing guides give up.
The Three-Verdict Gate: Wilson CI for Non-Deterministic Tasks
Some agent tasks genuinely are non-deterministic. The agent’s answer to “check the weather in Tokyo” depends on tool availability, model sampling, and what the weather API returns. You cannot write a hard pass/fail assertion for reliability on tasks like that.
The standard response is to skip CI for those cases or accept noisy binary results.
I used Wilson confidence intervals instead.
A Wilson CI gives a 95% confidence interval around a binomial proportion. Run the agent 5 times. Get 4 successes. The Wilson interval is approximately [0.38, 0.96] at 95% confidence. Compare that interval to your threshold (say, 0.55):
- Lower bound >= threshold: Pass (statistically confident it meets the bar)
- Upper bound < threshold: Fail (statistically confident it does not)
- Interval straddles the threshold: Inconclusive (not enough data yet)
In blueclaw’s CI, Inconclusive exits with code 0. It does not break the build. Fail exits with code 1. Actual regressions block the merge. Uncertain results do not.
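A minimal sketch of that gate, using the standard Wilson score formula and the exit-code mapping above (function names are assumptions, not blueclaw's actual API):

```python
import math
import sys

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

def verdict(successes: int, n: int, threshold: float) -> str:
    lo, hi = wilson_interval(successes, n)
    if lo >= threshold:
        return "pass"          # statistically confident it meets the bar
    if hi < threshold:
        return "fail"          # statistically confident it does not
    return "inconclusive"      # interval straddles the threshold

# Only Fail blocks the merge; Pass and Inconclusive both exit 0.
EXIT_CODES = {"pass": 0, "inconclusive": 0, "fail": 1}

if __name__ == "__main__":
    v = verdict(successes=4, n=5, threshold=0.55)
    print(v)  # inconclusive: [0.38, 0.96] straddles 0.55
    sys.exit(EXIT_CODES[v])
```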
Here is the multi-run spec from blueclaw’s test suite:
```yaml
- goal: check the current weather in Tokyo using wttr.in
  expected_tools: [http_request]
  expected_output_contains: Tokyo
  max_steps: 4
  max_cost: 0.05
  runs: 5
  threshold: 0.55
```
Run 5 times against a 0.55 threshold. A raw count would pass at 3 of 5; the Wilson gate is stricter, because it compares the whole interval to the threshold – at n=5, only 5/5 clears the bar with confidence, 0/5 fails outright, and everything in between lands Inconclusive instead of producing a noisy binary verdict. The AgentAssay paper (Bhardwaj, March 2026) provides the formal foundation for this pattern, showing that the three-verdict gate plus confidence intervals delivers a 78-100% cost reduction versus naive multi-run testing while maintaining rigorous statistical guarantees.
The GitHub Actions Pipeline
Standard jobs run lint and unit tests across Python 3.11-3.14 on every push (fail-fast: false, so I see failures on all versions, not just the first). Behavioral contract tests run separately – they make real agent calls and cost real money:
```bash
blueclaw test test-spec.yaml --format junit \
  > test-results.xml
```
JUnit XML plugs into any CI dashboard. TAP v13 is available for terminal-native workflows. Behavioral tests run manually before merge and on a nightly schedule – not on every push. I don’t want a pull request waiting on a $0.20 agent run to pass lint.
What Behavioral Contracts Do Not Replace
Behavioral contracts are the CI layer – the equivalent of type-checking and linting for agent behavior. Quality degradation still requires evals. Production reliability still requires monitoring.
The testing stack I use now:
- Unit tests → every commit (fully deterministic, free)
- Behavioral contracts → pre-merge + nightly (fast, cheap, deterministic)
- Evals → nightly (LLM judge, slow, expensive)
- Monitoring → production (continuous)
Behavioral contracts catch structural regressions: wrong tools, wrong order, blown budgets. They are the prerequisite for the layers above, not a substitute for them. For production reliability, see Monitoring AI Agents in Production.
The Spec Is the Contract
The `test-spec.yaml` is not just a test file. It is the document that answers “what does this agent actually do?” It is readable by a human and checkable by a machine.
When a new contributor asks what blueclaw is supposed to do, the spec answers that question more precisely than the prompt or the README.
When a prompt change breaks a tool contract, the spec fails before you ship.
When a model upgrade changes step efficiency, the cost contract catches it.
Your agent’s behavior will drift. A behavioral contract makes that drift visible before it reaches users – not after.
Try It
blueclaw ships its test runner as part of the CLI:
```bash
pip install blueclaw
blueclaw test test-spec.yaml
```
The spec format, all 12 assertion types, Wilson CI implementation, and JUnit/TAP output are in the blueclaw docs.
For the testing fundamentals this builds on, see How to Test AI Agents Before They Break Production.
For what to do when behavioral contracts catch a failure, see How I Debug AI Agents Like Code.