
Every API call returned 200. Reports landed in my inbox on schedule. Then one morning: 709,000 characters of repeated text. The model had gone into a repetition loop, and nothing in my setup caught it before delivery.
I had infrastructure monitoring. I didn’t have agent monitoring.
Monitoring AI agents is not the same as monitoring APIs. A system can be technically healthy while the model quietly produces garbage.
TL;DR: I added four monitoring layers to my AI agent. Each catches a class of failure the previous one misses: token tracking (cost), validation gates (correctness), output logging (trends), and quality scoring (depth). Start with token tracking. Add the rest as production teaches you what breaks.
This post is part of The Production AI Agent Playbook, which covers all eight disciplines for shipping reliable AI agents.
Here’s how I built each layer, and the failures each one caught.
## The Agent
My agent runs a daily research pipeline on a GitHub Actions cron schedule. Each job calls Gemini with Google Search grounding, runs parallel research steps, synthesizes everything into a structured briefing, validates the output, and emails me the report.
```
GitHub Actions (cron)
├── Research steps (parallel)
│   ├── market_overview (Gemini + Search)
│   ├── major_news (Gemini + Search)
│   ├── watchlist (Gemini + Search)
│   └── forward_look (Gemini + Search)
├── Synthesis (Gemini, up to 3 retries)
├── Validation gate
└── Email delivery
```
When I first deployed it, my monitoring was: open the email, read the report, decide if it looked right. That lasted until the 709K character incident.
## Layer 1: Token Tracking
The first thing I added was visibility into what each run actually consumed. Every LLM call now tracks prompt and completion tokens through a TokenUsage dataclass that accumulates across the job:
```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def add(self, prompt: int, completion: int) -> None:
        self.prompt_tokens += prompt
        self.completion_tokens += completion
        self.total_tokens += prompt + completion
```
After each Gemini call, the provider extracts token counts from the response metadata and feeds them into the accumulator:
```python
def _track_usage(self, response) -> None:
    meta = getattr(response, "usage_metadata", None)
    if meta:
        self.usage.add(
            prompt=getattr(meta, "prompt_token_count", 0) or 0,
            completion=getattr(meta, "candidates_token_count", 0) or 0,
        )
```
At the end of each job, the totals are logged at INFO level:
```
Token usage -- prompt: 12840, completion: 3215, total: 16055
```
A normal daily run uses around 15,000-20,000 total tokens. The 709K runaway would have been obvious: tens of thousands of completion tokens from a single synthesis call. Without token tracking, the only signal was a garbled email in my inbox.
This takes about an afternoon to implement and immediately pays for itself. But it only tells you how much the agent consumed, not whether the output was correct.
## Layer 2: Validation Gates
Token tracking told me how much the agent spent. Validation gates tell me whether the output is structurally correct before it reaches my inbox.
Every report passes through validate_synthesis() before delivery. The checks are simple:
- Length bounds: 500 to 50,000 characters. The 709K runaway fails immediately.
- Required headers: Must contain a markdown `##` header for every configured section.
- Config-driven quality checks: Specific sections must contain markdown tables. Others must include citation links. Citation timestamps must be fresh (within 48 hours).
The validation config lives in the job YAML:
```yaml
validation:
  min_section_length: 50
  sections_with_tables:
    - Markets Snapshot
    - Watchlist Report
```
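The gate itself is a short function. A simplified sketch of what `validate_synthesis()` checks, using the bounds and config keys above (the internals here are illustrative, not the full implementation):

```python
import re

class SynthesisValidationError(Exception):
    pass

def validate_synthesis(report, sections, validation=None):
    """Collect every structural problem, then raise once with the full list."""
    cfg = validation or {}
    issues = []
    if not 500 <= len(report) <= 50_000:
        issues.append(f"report length {len(report)} outside 500-50000 bounds")
    for section in sections:
        if f"## {section}" not in report:
            issues.append(f"missing required header '## {section}'")
    for section in cfg.get("sections_with_tables", []):
        # Only look at text after the section header for its table.
        body = report.split(f"## {section}", 1)[-1]
        if not re.search(r"\|.*---.*\|", body):
            issues.append(f"section '{section}' is missing required table")
    if issues:
        raise SynthesisValidationError(
            f"Synthesis validation failed ({len(issues)} issues): "
            + "; ".join(issues)
        )
```

Collecting all issues before raising matters: the retry prompt gets the complete list, not just the first failure.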
When validation fails, the error message is fed back to the LLM for self-correction:
```python
for attempt in range(1, MAX_RETRIES + 1):
    result = provider.synthesize(
        context=context,
        instruction=config["instruction"],
        sections=config.get("sections"),
        prior_errors=last_error_msg,
    )
    try:
        validate_synthesis(result, config["sections"],
                           validation=config.get("validation"))
        return result
    except SynthesisValidationError as e:
        last_error_msg = str(e)
# Falling out of the loop means retries are exhausted: send a failure alert.
```
The prompt appends the validation errors so the model knows exactly what to fix:
```
IMPORTANT: Your previous attempt failed validation:
Synthesis validation failed (2 issues):
Section 'Markets Snapshot' is missing required table;
Report contains no markdown headers
Fix these issues.
```
This retry-with-feedback loop caught two real bugs.
The table regex bug. The model produced valid markdown tables using alignment markers (| :--- | :--- |). My validator used a strict check for ---| that missed the aligned syntax. Validation rejected valid output, retried 3 times, exhausted retries, and sent me a failure alert. The fix was one regex change: from a literal substring check to r'\|.*---.*\|'.
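The difference is easy to demonstrate: an aligned separator row has a space before each closing pipe, so it contains no literal `---|` substring, while the looser regex matches both styles.

```python
import re

aligned = "| Ticker | Price |\n| :--- | :--- |\n| AAPL | 230 |"
plain = "| Ticker | Price |\n|---|---|\n| AAPL | 230 |"

print("---|" in aligned)                         # → False: old check misses it
print(bool(re.search(r"\|.*---.*\|", aligned)))  # → True: new regex catches it
print(bool(re.search(r"\|.*---.*\|", plain)))    # → True: plain tables still pass
```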
The blind retry problem. Before I added error feedback, the model received the exact same prompt on retry. It made the same structural mistake three times in a row: no markdown headers, just plain text. After I added prior_errors to the retry prompt, the model could see what it got wrong and fix it. First-attempt failures still happen, but second-attempt successes went from rare to the norm.
With these two layers, the 709K runaway would be caught twice: token tracking (cost spike) and validation (length bounds). No more gibberish in my inbox.
## Layer 3: Output Logging and Timing
Validation catches problems in real time. Output logging lets me investigate after the fact.
Without historical logs, every incident becomes guesswork. Every successful run saves two files:
```
output/logs/stock-daily/
├── 2026-02-27T07-00-12.md
└── 2026-02-27T07-00-12.meta.json
```
The .meta.json sidecar captures everything I need for post-run analysis:
```json
{
  "job_name": "Daily Market Briefing",
  "job_slug": "stock-daily",
  "timestamp": "2026-02-27T07-00-12",
  "report_length": 4821,
  "sections": [
    "Markets Snapshot",
    "Big Movers & Why",
    "Watchlist Report"
  ],
  "token_usage": {
    "prompt_tokens": 12840,
    "completion_tokens": 3215,
    "total_tokens": 16055
  }
}
```
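Writing the pair takes a few lines of stdlib code. A sketch (the `save_run_artifacts` name is illustrative; the file layout matches the listing above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_run_artifacts(report, meta, log_dir):
    """Write the report and its .meta.json sidecar under one timestamped stem."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    (log_dir / f"{stamp}.md").write_text(report)
    meta_path = log_dir / f"{stamp}.meta.json"
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path
```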
GitHub Actions uploads these as artifacts with 90-day retention. When a delivered report looks off, I can pull the metadata and compare it to previous runs. If report_length dropped from 5,000 to 800, or total_tokens spiked to 3x normal, something changed.
A _log_duration() context manager wraps every research step and synthesis attempt:
```
Research step 'market_overview' completed in 4.2s
Research step 'major_news' completed in 3.7s
Synthesis (attempt 1/3) completed in 2.1s
```
This is how I’d notice a model upgrade or prompt change making the agent slower. I don’t have automated alerting on timing yet, but the logs are there when I need them.
## Layer 4: Quality Scoring and Golden Corpus
Validation catches structural failures: missing sections, dropped tables, runaway length. But a report can pass every validation check and still be shallow, outdated, or unhelpful.
That’s what quality scoring is for. A separate eval module scores reports on four dimensions using an LLM-as-judge rubric:
```python
DIMENSIONS = [
    "analytical_depth",
    "factual_consistency",
    "actionability",
    "completeness",
]

def score_report(report, provider):
    prompt = f"{RUBRIC}\n\n---\n\n{report}"
    scores = provider.generate_json(prompt)
    scores["overall"] = (
        sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    )
    return scores
```
Reports scoring above 4.0 overall get promoted to a golden corpus via promote_golden(). These golden reports become regression test fixtures, extending the three-layer testing strategy I use for the agent. CI validates them against current validation rules on every change. If I tighten a validation rule and it breaks a previously good report, CI tells me before I merge.
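Promotion is just a threshold check plus a file copy. A sketch (the 4.0 threshold is from above; the directory layout and signature are illustrative):

```python
import shutil
from pathlib import Path

GOLDEN_THRESHOLD = 4.0

def promote_golden(report_path, scores, golden_dir="tests/golden"):
    """Copy a report into the golden corpus if its overall score clears the bar."""
    if scores.get("overall", 0.0) <= GOLDEN_THRESHOLD:
        return False
    golden = Path(golden_dir)
    golden.mkdir(parents=True, exist_ok=True)
    shutil.copy(report_path, golden / Path(report_path).name)
    return True
```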
This layer runs on-demand, not on every production run. It’s the closest thing I have to automated output quality monitoring. The gap: it doesn’t run inline, so a quality regression can slip through until I manually trigger a scoring pass.
## What I’d Add Next
My monitoring is adequate for a batch agent that runs twice a day and delivers to one user (me). For anything more, I’d need:
**Real-time cost alerting.** Right now I see token costs in the next day’s logs. A runaway loop could burn through a full day’s budget before I notice. A simple alert on any run exceeding 3x the median cost would catch this.
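The check itself would be a one-liner over the logged totals. Something like this (a hypothetical helper; the `total_tokens` history would come from the .meta.json sidecars):

```python
from statistics import median

def cost_alert(history, current, factor=3.0):
    """Flag a run whose total tokens exceed factor x the historical median."""
    if not history:
        return False
    return current > factor * median(history)

# cost_alert([15000, 16000, 17000], 60000) -> True  (runaway)
# cost_alert([15000, 16000, 17000], 16055) -> False (normal run)
```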
**Tool call success rate tracking.** This is my biggest blind spot. I have no visibility into whether the Gemini Search grounding actually returned useful results, only whether the final synthesis passed validation. A tool that returns empty results doesn’t throw an error. The agent just reasons on thin context. Last week, the watchlist step returned zero search results for a ticker symbol. The synthesis still passed validation because it had enough context from other steps to fill in the section. But the data was stale. I only noticed because I happened to read the report closely that day.
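A first cut wouldn’t need tracing infrastructure, just a per-step count of returned sources. Assuming each research step’s grounding sources were collected into a dict, detecting the watchlist incident would look like (helper name illustrative):

```python
def empty_result_steps(step_sources):
    """Names of research steps whose search grounding returned zero sources."""
    return [name for name, sources in step_sources.items() if not sources]

# empty_result_steps({"watchlist": [], "major_news": ["https://..."]})
# -> ["watchlist"]
```

Logging that list per run, and alerting when it is non-empty, would have surfaced the stale watchlist data the same morning.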
**Distributed tracing.** My logs are flat: one INFO line per step. For multi-step agents with branching logic, you need correlated spans showing the full decision tree. OpenTelemetry’s GenAI semantic conventions are maturing fast. That’s where I’d start.
A deeper dive into the seven metrics that matter most, alert thresholds, and a progressive monitoring roadmap is coming in a future post.
## The Pattern
Each monitoring layer caught a class of problem the previous layer missed:
| Layer | What it catches | What it misses |
|---|---|---|
| Token tracking | Cost anomalies, runaway generation | Whether the output is correct |
| Validation gates | Structural failures, missing sections | Whether the content is good |
| Output logging | Trends, regressions over time | Real-time problems |
| Quality scoring | Shallow or unhelpful output | Only runs on-demand |
Don’t try to build all four on day one. Start with token tracking. Add validation gates when you have enough production data to know what “good” looks like structurally. Layer in output logging for post-incident analysis. Add quality scoring when you’re ready to invest in automated evaluation.
The monitoring that matters is the monitoring you actually ship. Everything else is a roadmap.
Part of the AI agents reliability series. Previously: Testing AI Agents (three-layer testing), Error Handling Patterns (runtime recovery).