How I Debug AI Agents Like Code (Not Guesswork)

When my production agent returned wrong data, the old approach was: re-run it, watch the logs scroll, read the output, guess what broke, tweak the prompt, repeat. That loop could run for hours. The new approach: blueclaw trace show <run_id>. Every tool call is visible. The failure is obvious in 30 seconds.

I built and open-sourced blueclaw, a terminal AI agent running daily research pipelines with MCP tool servers. After debugging enough production failures the wrong way, I added 10 trace CLI commands that map directly to the debugging primitives every developer already uses.

TL;DR: Every agent run writes a structured JSON trace automatically, no extra code, no hosted service. Four commands handle most debugging sessions: find the run, find the failing step, rule out latency, verify the fix. The rest of the toolkit lives in a summary table at the end.

Most teams have observability that collects traces. What most are missing are the tools to read them.

The trace is the stack trace you never had.


The Failure

My research pipeline returned a plausible-but-wrong summary for a fetch task. It went into a report before I noticed. I only caught it the next day when the numbers didn’t add up. The logs showed everything succeeded. No errors in the output. Nothing obviously broken. Silent wrong is worse than loud broken. At least a loud error tells you something failed.

The old approach would have been: re-run it, watch the scroll, guess, tweak something. With a 7-step pipeline where the bug is somewhere in the middle, that loop can run for an hour.

Here is how I found it instead.


Step 1: Find the Run (trace list)

$ blueclaw trace list

20260315-054426  success  3  1,247 tok  $0.0021  search Python 3.13...
20260315-062111  error    7  3,892 tok  $0.0063  fetch and summarize...
20260315-071035  success  5  2,103 tok  $0.0034  compile weekly report

Failed runs print in red. One glance: 20260315-062111, 7 steps, error status. That is the run.

This is the equivalent of git log. A history of runs with enough metadata to spot the failure without opening anything.
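The idea is simple enough to sketch. Assuming one JSON file per run named by run id, with status, steps, tokens, cost, and task fields (a guess at the layout, not blueclaw's actual schema), a run lister is a few lines:

```python
import json
from pathlib import Path

def list_runs(trace_dir):
    """Print a one-line summary per recorded run, newest first.

    Assumes each trace is a <run_id>.json file with status, steps,
    tokens, cost, and task fields -- illustrative, not blueclaw's
    actual trace schema.
    """
    lines = []
    # Run ids are timestamps, so reverse filename order is newest first.
    for path in sorted(Path(trace_dir).glob("*.json"), reverse=True):
        t = json.loads(path.read_text())
        lines.append(
            f"{path.stem}  {t['status']:<7}  {len(t['steps'])}  "
            f"{t['tokens']:,} tok  ${t['cost']:.4f}  {t['task'][:30]}"
        )
    return lines
```

The point is the column layout: enough metadata per line to spot the failure without opening anything.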


Step 2: Find the Failing Step (trace show)

$ blueclaw trace show 20260315-062111

Run:     20260315-062111
Task:    fetch and summarize research
Model:   claude-sonnet-4-6
Status:  error

#    Tool             Duration   Status
1    web_search       312ms      success
2    web_search       287ms      success
3    http_request     4,103ms    success
4    http_request     2ms        error
5    web_search       401ms      success
6    web_search       318ms      success
7    shell            1ms        error

Total: 7 steps · 5,640ms · 3,892 tokens · $0.0063

Step 4: http_request at 2ms, error.

2ms. That is not a slow request. A real HTTP round-trip takes at least 50ms. 2ms means the connection was refused before a request could even be sent. The endpoint is dead.

That is the whole diagnosis. Not prompt drift. Not model behavior. A dead API endpoint.

I found this in under a minute. Before trace show, I would have re-run the agent and tried to catch the error in the log scroll.
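The read generalizes to a one-line heuristic: an errored step that finished faster than any real network round trip is a connection-level failure, not a slow or misbehaving service. A sketch, with a 10ms threshold that is my rule of thumb rather than a blueclaw setting:

```python
def flag_instant_failures(steps, threshold_ms=10):
    """Return (step number, tool) for errored steps that failed too
    fast to have reached the network -- usually a refused connection
    or a dead endpoint.

    threshold_ms=10 is a rule of thumb, not a blueclaw setting;
    field names are assumed for illustration.
    """
    return [
        (i, s["tool"])
        for i, s in enumerate(steps, start=1)
        if s["status"] == "error" and s["duration_ms"] < threshold_ms
    ]
```

Run against the trace above, it flags steps 4 and 7: the two instant failures.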


Step 3: Rule Out Latency (trace timeline)

Step 3 looked suspicious too: http_request at 4,103ms. Four seconds. Before closing the investigation I wanted to be sure that was not a secondary problem.

$ blueclaw trace timeline 20260315-062111

#    Tool           Start     Duration   Bar
1    web_search     +0ms      312ms      ██
2    web_search     +314ms    287ms      █
3    http_request   +601ms    4,103ms    ████████████████████
4    http_request   +4,706ms  2ms        █
5    web_search     +4,710ms  401ms      ██
6    web_search     +5,113ms  318ms      ██
7    shell          +5,433ms  1ms        █

Tool time: 5,424ms · Wall time: 5,640ms · Overhead: 216ms (4%)

Step 3’s bar dominates but it succeeded. A large payload on a slow endpoint, not broken. Step 4’s bar is one block. The timeline confirms: one dead endpoint, everything else normal.

The bottom line is also worth a check: Overhead: 216ms (4%). That is the gap between wall time and tool time, i.e. the time spent on model reasoning between tool calls rather than on the tools themselves. At 4%, this run is almost entirely tool time. If overhead were 40%, I would look at the prompt and context size; the model would be doing more work than it should.
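The overhead line falls straight out of the step durations. A minimal sketch, with field names assumed for illustration:

```python
def run_overhead(steps, wall_ms):
    """Split a run's wall-clock time into tool time and overhead.

    Overhead is everything outside tool execution: model reasoning
    between calls plus framework bookkeeping. Field names are
    assumed, not blueclaw's actual schema.
    """
    tool_ms = sum(s["duration_ms"] for s in steps)
    overhead_ms = wall_ms - tool_ms
    return tool_ms, overhead_ms, round(100 * overhead_ms / wall_ms)
```

Fed the seven durations above and the 5,640ms wall time, it reproduces the footer: 5,424ms tool time, 216ms overhead, 4%.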


Step 4: Verify the Fix (trace diff)

I updated the endpoint configuration and re-ran the task. Then:

$ blueclaw trace diff 20260315-062111 20260316-081244

Steps:  7 → 3 (-4)
Tokens: 3,892 → 1,247 (-2,645)
Cost:   $0.0063 → $0.0021 (-$0.0042)
Time:   5,640ms → 965ms (-4,675ms)

Four fewer steps, two-thirds fewer tokens, 83% faster. The diff proves the fix worked.

Before, I would have eyeballed two log files and hoped the numbers were comparable. trace diff is what I reach for after any change: tool configuration, system prompt, or model. If the fix made things worse on some dimension, you see it immediately.
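Under the hood a run diff is just per-metric deltas between two summaries. A sketch with assumed field names, where a negative delta means the second run improved:

```python
def diff_runs(a, b):
    """Compare two run summaries metric by metric.

    Negative deltas mean the second run improved on that dimension.
    Keys are assumed summary fields, not blueclaw's actual schema;
    rounding keeps the cost delta free of float noise.
    """
    return {k: round(b[k] - a[k], 6)
            for k in ("steps", "tokens", "cost", "time_ms")}
```

Because every dimension is in the output, a fix that helps one metric but hurts another cannot hide.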


The Rest of the Toolkit

Those four commands handle most debugging sessions. The other six cover more specific needs:

Command        What it does
trace graph    Tree view of execution: tool calls in order, with durations and status
trace replay   Step-through with Enter key: input, output, and error for each step
trace stats    Aggregate metrics: avg steps/tokens/cost, p95 duration, error rates by type
trace explain  Sends the trace to an LLM and asks it to explain what the agent did and why
trace ui       Opens a browser dashboard at localhost:8111 for exploring and comparing runs
trace purge    Deletes traces older than N days (default: 30, configurable)

trace stats is where I catch patterns across runs. A spike in error rate means an API changed. A jump in avg steps per run means the agent is looping on something. A p95 spike means a flaky dependency appeared. trace replay --stub-tools re-runs the agent against recorded outputs instead of calling real APIs, useful for testing prompt changes without spending tokens. If you’re building a full monitoring strategy on top of this, Monitoring AI Agents in Production covers the 4-layer approach.
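The aggregates behind those patterns are a few lines of stdlib. A sketch of the fleet-level numbers, with field names assumed for illustration:

```python
from statistics import quantiles

def fleet_stats(runs):
    """Aggregate the per-run numbers a stats view would report:
    average steps per run, p95 wall time, and error rate.

    Field names are assumed for illustration, not blueclaw's
    actual schema.
    """
    n = len(runs)
    avg_steps = sum(r["steps"] for r in runs) / n
    # quantiles(n=20) splits into 20 buckets; the last cut is p95.
    p95_ms = quantiles([r["time_ms"] for r in runs], n=20)[-1]
    err_rate = sum(r["status"] == "error" for r in runs) / n
    return avg_steps, p95_ms, err_rate
```

Watching these three numbers over time is what turns one-off debugging into trend detection: error rate for API changes, avg steps for loops, p95 for flaky dependencies.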


How Traces Are Recorded

Zero instrumentation required. An ObserverHooks class subscribes to the Strands SDK's BeforeToolCallEvent and AfterToolCallEvent callbacks. Every tool call is automatically recorded: tool name, input summary (truncated to 200 chars per key), output summary, duration, status, and any error.

At the end of each run, a RunTrace JSON file is written to the workspace directory. Nothing added to your tool code. Nothing added to your agent prompt. The trace just appears.
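The recording pattern itself is generic. A minimal sketch of a before/after hook pair that mirrors the idea; the method names and fields here are illustrative, not the Strands SDK's actual API:

```python
import json
import time

class ObserverHooks:
    """Sketch of the before/after recording pattern behind the
    Strands SDK's BeforeToolCallEvent / AfterToolCallEvent callbacks.
    Method names and fields here are illustrative, not the SDK's
    actual API.
    """

    def __init__(self):
        self.steps = []
        self._start = None
        self._input = None

    def before_tool_call(self, tool_name, tool_input):
        # Stamp the start time and keep a truncated input summary
        # (200 chars per key, as the trace format describes).
        self._start = time.monotonic()
        self._input = {k: str(v)[:200] for k, v in tool_input.items()}

    def after_tool_call(self, tool_name, output=None, error=None):
        self.steps.append({
            "tool": tool_name,
            "input": self._input,
            "output": str(output)[:200] if output is not None else None,
            "duration_ms": round((time.monotonic() - self._start) * 1000),
            "status": "error" if error else "success",
            "error": error,
        })

    def write_trace(self, path, run_id):
        # One trace JSON file per run, written at the end.
        with open(path, "w") as f:
            json.dump({"run_id": run_id, "steps": self.steps}, f, indent=2)
```

Because the hooks sit at the framework boundary, tool code and prompts never see them; the trace just appears.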

Errors are automatically classified into categories: timeout, rate_limit, auth, not_found, schema, network, sandbox. So trace stats gives you a breakdown without manual log parsing.
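Classification like that usually comes down to keyword rules over the error message. A sketch; the rules below are my guesses at reasonable matchers, not blueclaw's actual classifier:

```python
def classify_error(message):
    """Map a raw error message to one of the trace categories.

    The keyword rules here are illustrative guesses, not blueclaw's
    actual classifier; order matters, first match wins.
    """
    msg = message.lower()
    rules = [
        ("timeout", ("timed out", "timeout")),
        ("rate_limit", ("429", "rate limit")),
        ("auth", ("401", "403", "unauthorized", "forbidden")),
        ("not_found", ("404", "not found")),
        ("schema", ("validation", "schema", "unexpected field")),
        ("network", ("connection refused", "dns", "unreachable", "reset")),
        ("sandbox", ("sandbox", "permission denied")),
    ]
    for category, needles in rules:
        if any(n in msg for n in needles):
            return category
    return "unknown"
```

Tagging each step at record time is what lets a stats view break errors down by type with no log parsing afterward.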


The Part Most Observability Tooling Skips

The industry consensus right now: install LangSmith, connect Langfuse, add OpenTelemetry spans. More observability is better.

That is half right. You need structured traces. But structured traces without the right tools for reading them just give you more data to be confused by. Ninety-four percent of production AI teams have some observability, per the 2025 Galileo survey. Most are still debugging by re-running and guessing.

The missing piece is not more data. It is developer tools for reading the data you already have.

You do not debug code by adding more print statements. You use a debugger, git diff, a profiler. Those tools exist because they map to how code actually fails: at specific lines, in measurable time, with inputs and outputs you can inspect. AI agents fail the same way: at specific steps, in measurable time, with the same kinds of inputs and outputs.

Build the primitives once. Debug in seconds instead of hours. For what to do once you have a diagnosis, see AI Agent Error Handling Patterns.


blueclaw is open source: github.com/jztan/blueclaw

ai-agents llm blueclaw observability
Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.