When my production agent returned wrong data, the old approach was: re-run it, watch the logs scroll, read the output, guess what broke, tweak the prompt, repeat. That loop could run for hours. The new approach: `blueclaw trace show <run_id>`. Every tool call is visible. The failure is obvious in 30 seconds.
I built and open-sourced blueclaw, a terminal AI agent running daily research pipelines with MCP tool servers. After debugging enough production failures the wrong way, I added 10 trace CLI commands that map directly to the debugging primitives every developer already uses.
TL;DR: Every agent run writes a structured JSON trace automatically, no extra code, no hosted service. Four commands handle most debugging sessions: find the run, find the failing step, rule out latency, verify the fix. The rest of the toolkit lives in a summary table at the end.
Most teams have observability that collects traces. What most are missing are the tools to read them.
The trace is the stack trace you never had.
The Failure
My research pipeline returned a plausible-but-wrong summary for a fetch task. It went into a report before I noticed. I only caught it the next day when the numbers didn’t add up. The logs showed everything succeeded. No errors in the output. Nothing obviously broken. Silent wrong is worse than loud broken. At least a loud error tells you something failed.
The old approach would have been: re-run it, watch the scroll, guess, tweak something. With a 7-step pipeline where the bug is somewhere in the middle, that loop can go for an hour.
Here is how I found it instead.
Step 1: Find the Run (trace list)
```
$ blueclaw trace list
20260315-054426  success  3  1,247 tok  $0.0021  search Python 3.13...
20260315-062111  error    7  3,892 tok  $0.0063  fetch and summarize...
20260315-071035  success  5  2,103 tok  $0.0034  compile weekly report
```
Failed runs print in red. One glance: 20260315-062111, 7 steps, error status. That is the run.
This is the equivalent of `git log`: a history of runs with enough metadata to spot the failure without opening anything.
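For a sense of how little machinery this needs: each run is one JSON file (described later in this post), so the listing is a directory scan plus a print loop. A minimal sketch, where the field names and storage path are illustrative assumptions, not blueclaw's actual schema:

```python
# Hypothetical sketch of the `trace list` view: one JSON file per run,
# scanned and printed. Field names and the storage path are assumptions.
import json
from pathlib import Path

def list_runs(trace_dir: Path) -> None:
    # Run ids are timestamps, so filename order is chronological order.
    for path in sorted(trace_dir.glob("*.json")):
        t = json.loads(path.read_text())
        print(f"{t['run_id']}  {t['status']:<7}  {len(t['steps'])}  "
              f"{t['tokens']:>5,} tok  ${t['cost']:.4f}  {t['task'][:30]}")

list_runs(Path("~/.blueclaw/traces").expanduser())  # hypothetical path
```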
Step 2: Find the Failing Step (trace show)
```
$ blueclaw trace show 20260315-062111
Run:    20260315-062111
Task:   fetch and summarize research
Model:  claude-sonnet-4-6
Status: error

#  Tool          Duration   Status
1  web_search       312ms   success
2  web_search       287ms   success
3  http_request   4,103ms   success
4  http_request       2ms   error
5  web_search       401ms   success
6  web_search       318ms   success
7  shell              1ms   error

Total: 7 steps · 5,640ms · 3,892 tokens · $0.0063
```
Step 4: `http_request` at 2ms, error.
2ms. That is not a slow request; it is a request that never happened. A real HTTP round trip takes tens of milliseconds at minimum. 2ms means the connection was refused before the request ever left the machine. The endpoint is dead.
That is the whole diagnosis. Not prompt drift. Not model behavior. A dead API endpoint.
I found this in under a minute. Before `trace show`, I would have re-run the agent and tried to catch the error in the log scroll.
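The tell generalizes beyond this one run: an error step whose duration is far below any plausible network round trip never reached the network at all. A sketch of that heuristic, assuming the same illustrative step fields as above:

```python
# Flag error steps that failed too fast to have reached the network.
# A real HTTP round trip takes tens of milliseconds at minimum, so a
# low-single-digit-millisecond error is almost always local: connection
# refused, a DNS miss, or a bad URL. Step fields are illustrative.
FAST_FAIL_MS = 10

def fast_failures(steps: list[dict]) -> list[dict]:
    return [s for s in steps
            if s["status"] == "error" and s["duration_ms"] < FAST_FAIL_MS]

# Applied to the trace above, steps 4 and 7 are flagged; for an HTTP
# tool, a flag means the request never left the machine.
```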
Step 3: Rule Out Latency (trace timeline)
Step 3 looked suspicious too: http_request at 4,103ms. Four seconds. Before closing the investigation I wanted to be sure that was not a secondary problem.
```
$ blueclaw trace timeline 20260315-062111
#  Tool          Start      Duration   Bar
1  web_search    +0ms          312ms   ██
2  web_search    +314ms        287ms   █
3  http_request  +601ms      4,103ms   ████████████████████
4  http_request  +4,706ms        2ms   █
5  web_search    +4,710ms      401ms   ██
6  web_search    +5,113ms      318ms   ██
7  shell         +5,433ms        1ms   █

Tool time: 5,424ms · Wall time: 5,640ms · Overhead: 216ms (4%)
```
Step 3’s bar dominates but it succeeded. A large payload on a slow endpoint, not broken. Step 4’s bar is one block. The timeline confirms: one dead endpoint, everything else normal.
The bottom line is the other thing I check: `Overhead: 216ms (4%)`. That is the wall-clock time spent outside tool execution, which is mostly the model reasoning between steps. At 4%, this run is almost entirely tool time. If overhead were 40%, I would look at the prompt and context size: the model would be doing more work than it should.
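The overhead figure is plain arithmetic over the same trace: wall time minus the sum of tool durations. A sketch, again with illustrative field names:

```python
# Overhead = wall-clock run time minus time spent inside tools: the
# model's reasoning plus orchestration. Field names are illustrative.
def overhead(trace: dict) -> tuple[int, float]:
    tool_ms = sum(s["duration_ms"] for s in trace["steps"])
    extra_ms = trace["wall_ms"] - tool_ms
    return extra_ms, extra_ms / trace["wall_ms"]

# For the run above: 5,640 - 5,424 = 216ms, and 216 / 5,640 ≈ 4%.
```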
Step 4: Verify the Fix (trace diff)
I updated the endpoint configuration and re-ran the task. Then:
```
$ blueclaw trace diff 20260315-062111 20260316-081244
Steps:  7 → 3              (-4)
Tokens: 3,892 → 1,247      (-2,645)
Cost:   $0.0063 → $0.0021  (-$0.0042)
Time:   5,640ms → 965ms    (-4,675ms)
```
Four fewer steps, two-thirds fewer tokens, 83% faster. The diff proves the fix worked.
Before, I would have eyeballed two log files and hoped the numbers were comparable. `trace diff` is what I reach for after any change: tool configuration, system prompt, or model. If the fix made things worse on some dimension, you see it immediately.
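Mechanically, the diff is four subtractions over two run summaries. A sketch (the summary fields are illustrative):

```python
# Compare two runs on the four dimensions that matter after a change.
# Negative deltas are improvements for every metric here.
def diff(before: dict, after: dict) -> None:
    for key, fmt in [("steps", "d"), ("tokens", ",d"),
                     ("cost", ".4f"), ("wall_ms", ",d")]:
        delta = after[key] - before[key]
        print(f"{key}: {before[key]:{fmt}} -> {after[key]:{fmt}} ({delta:+{fmt}})")

# The two runs from above:
diff({"steps": 7, "tokens": 3892, "cost": 0.0063, "wall_ms": 5640},
     {"steps": 3, "tokens": 1247, "cost": 0.0021, "wall_ms": 965})
```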
The Rest of the Toolkit
Those four commands handle most debugging sessions. The other six cover more specific needs:
| Command | What it does |
|---|---|
| `trace graph` | Tree view of execution: tool calls in order, with durations and status |
| `trace replay` | Step-through with Enter key: input, output, and error for each step |
| `trace stats` | Aggregate metrics: avg steps/tokens/cost, p95 duration, error rates by type |
| `trace explain` | Sends the trace to an LLM and asks it to explain what the agent did and why |
| `trace ui` | Opens a browser dashboard at localhost:8111 for exploring and comparing runs |
| `trace purge` | Deletes traces older than N days (default: 30, configurable) |
`trace stats` is where I catch patterns across runs. A spike in error rate means an API changed. A jump in avg steps per run means the agent is looping on something. A p95 spike means a flaky dependency appeared. `trace replay --stub-tools` re-runs the agent against recorded outputs instead of calling real APIs, useful for testing prompt changes without spending tokens. If you’re building a full monitoring strategy on top of this, Monitoring AI Agents in Production covers the 4-layer approach.
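None of those aggregate signals need anything exotic; error rates by category and p95 duration fall out of one pass over the trace files. A sketch, with the same illustrative fields as the earlier snippets:

```python
import statistics
from collections import Counter

# Aggregate the signals called out above: error counts by category,
# average steps per run, and p95 wall time across recorded runs.
def stats(traces: list[dict]) -> None:
    errors = Counter(
        s.get("error_category") or "unknown"
        for t in traces
        for s in t["steps"]
        if s["status"] == "error"
    )
    avg_steps = sum(len(t["steps"]) for t in traces) / len(traces)
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile.
    p95_ms = statistics.quantiles([t["wall_ms"] for t in traces], n=20)[-1]
    print(f"runs={len(traces)}  avg_steps={avg_steps:.1f}  p95={p95_ms:,.0f}ms")
    for category, count in errors.most_common():
        print(f"  {category}: {count}")
```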
How Traces Are Recorded
Zero instrumentation required. An `ObserverHooks` class hooks into the Strands SDK’s `BeforeToolCallEvent` and `AfterToolCallEvent` callbacks, and every tool call is recorded automatically: tool name, input summary (truncated to 200 chars per key), output summary, duration, status, and any error.
At the end of each run, a `RunTrace` JSON file is written to the workspace directory. Nothing is added to your tool code. Nothing is added to your agent prompt. The trace just appears.
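In outline, the recorder is a pair of callbacks and a JSON dump. What follows is a sketch, not blueclaw’s actual code: the callback names come from the paragraph above, but the event attributes and hook registration are simplified assumptions, not the exact Strands SDK surface.

```python
# Sketch of hook-based trace recording. Event attributes (tool_name,
# tool_input, result, error) and registration are assumptions.
import json
import time
from pathlib import Path

MAX_CHARS = 200  # per-key truncation for input/output summaries

class ObserverHooks:
    def __init__(self, workspace: Path):
        self.workspace = workspace
        self.steps: list[dict] = []
        self._start = 0.0

    def on_before_tool_call(self, event) -> None:
        # Tool calls in this agent run sequentially, so one timestamp
        # is enough to compute the next step's duration.
        self._start = time.monotonic()

    def on_after_tool_call(self, event) -> None:
        self.steps.append({
            "tool": event.tool_name,
            "input": {k: str(v)[:MAX_CHARS] for k, v in event.tool_input.items()},
            "output": str(event.result)[:MAX_CHARS],
            "duration_ms": int((time.monotonic() - self._start) * 1000),
            "status": "error" if event.error else "success",
            "error": str(event.error) if event.error else None,
        })

    def write_run_trace(self, run_id: str) -> None:
        # One RunTrace JSON file per run, written to the workspace dir.
        out = self.workspace / f"{run_id}.json"
        out.write_text(json.dumps({"run_id": run_id, "steps": self.steps}, indent=2))
```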
Errors are automatically classified into categories (`timeout`, `rate_limit`, `auth`, `not_found`, `schema`, `network`, `sandbox`), so `trace stats` gives you a breakdown without manual log parsing.
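Classification can be as simple as a pattern match over the error text at record time. A rough sketch; the patterns here are illustrative, and a real classifier would also key off exception types and HTTP status codes:

```python
import re

# Map error text to the categories listed above. Patterns are
# illustrative assumptions, not blueclaw's actual rules.
CATEGORIES = [
    ("timeout",    r"timed? ?out|deadline exceeded"),
    ("rate_limit", r"rate limit|429|too many requests"),
    ("auth",       r"401|403|unauthorized|forbidden|invalid.*key"),
    ("not_found",  r"404|not found|no such"),
    ("schema",     r"validation|schema|missing required|type error"),
    ("network",    r"connection refused|dns|unreachable|reset by peer"),
    ("sandbox",    r"sandbox|permission denied|operation not permitted"),
]

def classify(error: str) -> str:
    text = error.lower()
    for category, pattern in CATEGORIES:
        if re.search(pattern, text):
            return category
    return "unknown"

# The 2ms failure from Step 4: "connection refused" -> "network".
```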
The Part Most Observability Tooling Skips
The industry consensus right now: install LangSmith, connect Langfuse, add OpenTelemetry spans. More observability is better.
That is half right. You need structured traces. But structured traces without the right tools for reading them just give you more data to be confused by. Ninety-four percent of production AI teams have some observability, per the 2025 Galileo survey. Most are still debugging by re-running and guessing.
The missing piece is not more data. It is developer tools for reading the data you already have.
You do not debug code by adding more print statements. You use a debugger, git diff, a profiler. Those tools exist because they map to how code actually fails: at specific lines, in measurable time, with inputs and outputs you can inspect. AI agents fail the same way: at specific steps, in measurable time, with the same kinds of inputs and outputs.
Build the primitives once. Debug in seconds instead of hours. For what to do once you have a diagnosis, see AI Agent Error Handling Patterns.
blueclaw is open source: github.com/jztan/blueclaw