AI Agents Fail Silently: 5 Error Handling Patterns for Production


TL;DR: AI agents fail differently from traditional software. The most dangerous failures look like success. Five patterns make agents safe in production: circuit breakers for LLM quality failures, validation gates before tool execution, idempotent workflows with saga rollbacks, token and cycle budget guardrails, and human escalation for high-risk actions. These patterns ensure failures are detected early, contained tightly, and surfaced deliberately.

Last year I had an agent running a data enrichment pipeline. It pulled records from an external API, mapped fields into our schema, and wrote them to a database. Every API call returned 200 OK. The agent reported success on every step. The dashboard showed green across the board.

Six hours later, a downstream team flagged the data. Half the field mappings were hallucinated. The agent had confidently mapped company_revenue to employee_count, invented values for fields that didn’t exist in the source, and written duplicates for records it had already processed. Hundreds of bad rows, all marked as verified.

Nobody noticed because nothing “failed.” The API calls worked. The writes succeeded. The agent completed its tasks. It was working perfectly, and producing garbage.

That night taught me that AI agents need fundamentally different error handling. Traditional try/catch assumes failures are obvious. With agents, the most dangerous failures look exactly like success.

When I say “AI agents,” I mean systems that call tools, mutate state, and trigger real side effects, not chatbots. Think database writes, API calls, infra changes, or workflow automation running unattended. In Why AI Agents Fail in Production, I wrote about the silent architectural failures that make agents break under real load. That post was the diagnosis. This one is the prescription.


The production AI agent toolkit. This post covers runtime error handling. For catching failures before deploy, see Testing AI Agents in Production. For the complete framework, see The Production AI Agent Playbook.


Five Patterns That Actually Work

After breaking things in production more times than I’d like to admit, these are the five patterns I now treat as non-negotiable. Together, they ensure failures are detected early, contained tightly, and surfaced deliberately, not silently propagated. The examples below use the Strands Agents SDK, but every pattern is framework-agnostic. The same ideas apply whether you’re using LangGraph, CrewAI, or raw function calling. They form a natural progression: detect failures (circuit breakers), prevent them (validation), contain partial failures (sagas), limit blast radius (budget guardrails), and know when to stop (escalation).

1. Circuit Breakers for LLM Calls

After a model provider degradation, our agent started returning malformed JSON. Every API call “succeeded,” but the output was unusable. We burned 40 minutes of compute before anyone noticed, because nothing in our error handling checked output quality — only HTTP status codes.

The classic circuit breaker pattern (closed, open, half-open) adapts well to AI agents, with one critical difference: you’re not just tracking HTTP failures. You’re tracking quality failures: any output that violates schema, fails a semantic invariant, or produces an unsafe action, even if the API call itself succeeded.

import json
import time

from strands.hooks import HookProvider, HookRegistry
from strands.hooks.events import (
    BeforeToolCallEvent,
    AfterToolCallEvent,
)


class CircuitBreakerHook(HookProvider):
    def __init__(self, failure_threshold=3, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed, open, half-open
        self.last_failure_time = None

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.check_circuit)
        registry.add_callback(AfterToolCallEvent, self.track_quality)

    def check_circuit(self, event: BeforeToolCallEvent) -> None:
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # Allow one probe
            else:
                event.cancel_tool = True  # Block execution

    def track_quality(self, event: AfterToolCallEvent) -> None:
        content = event.result.get("content", [])
        result_text = content[0].get("text", "") if content else ""
        # Crude heuristic; replace with structured error detection for your tools
        has_error = "error" in result_text.lower()
        if has_error or not self._passes_validation(result_text):
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
        else:
            if self.state == "half-open":
                self.state = "closed"
                self.failures = 0

    def _passes_validation(self, result_text: str) -> bool:
        # Quality gate: here "valid" simply means the result parses as JSON.
        # Substitute your own schema or semantic checks.
        try:
            json.loads(result_text)
            return True
        except (ValueError, TypeError):
            return False

The key: when the circuit opens, stop. Don’t burn tokens on a model that’s producing garbage. Wait, then probe with a single request before resuming.

Using the Strands SDK’s HookProvider, the circuit breaker plugs directly into the agent lifecycle. BeforeToolCallEvent blocks execution when the circuit is open, and AfterToolCallEvent inspects results for quality failures. No wrapper functions, no monkey-patching. The hook fires on every tool call automatically.

I track validation failures, not just HTTP errors. If the agent produces three consecutive outputs that fail schema validation, the circuit opens, even though every API call “succeeded.”


2. Validate Before You Execute

An agent mapped a delete_all action to what it interpreted as a cleanup task. The API accepted it. 47 records gone before the next human review. The agent was “confident” in its action. The action was valid. The intent was completely wrong.

In the previous post, I argued that constraints beat roles and structured output is the API contract for LLMs. This pattern is the concrete implementation of that principle.

Never let an agent’s output directly trigger a side effect. Always validate first.

from strands.hooks import HookProvider, HookRegistry
from strands.hooks.events import BeforeToolCallEvent

# Targets this agent is allowed to touch (example values)
ALLOWED_TARGETS = {"staging_db", "sandbox_queue"}


class ValidationHook(HookProvider):
    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.validate)

    def validate(self, event: BeforeToolCallEvent) -> None:
        tool_name = event.tool_use.get("name", "")
        tool_input = event.tool_use.get("input", {})

        # Sanity check: block dangerous operations
        if (
            tool_name == "delete_records"
            and tool_input.get("count", 0) > 100
        ):
            event.cancel_tool = True

        # Boundary check: if the tool targets something, it must be in scope
        if (
            "target" in tool_input
            and tool_input["target"] not in ALLOWED_TARGETS
        ):
            event.cancel_tool = True

With Strands’ BeforeToolCallEvent, you intercept every tool call before execution. The hook inspects event.tool_use (the tool name and inputs) and cancels via event.cancel_tool = True if anything fails validation. No separate validation function to remember to call; it fires automatically.

Three layers of validation:

  1. Schema: Is the output structurally correct? (Missing a required field, wrong type, malformed JSON.)
  2. Sanity: Does the action make sense? (Deleting 10,000 records? Probably not.)
  3. Boundary: Is the agent operating within its allowed scope? (Cross-tenant access, targeting a production table from a staging workflow.)
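None of these layers needs a framework. Here's a minimal sketch of the schema layer as a plain type check — the required fields and their types are hypothetical examples, not anything the SDK prescribes:

```python
# Minimal structural check on a tool call's input, independent of any framework.
# The field names and types here are hypothetical examples.
REQUIRED_FIELDS = {"record_id": str, "target": str, "count": int}


def passes_schema(tool_input: dict) -> bool:
    """Return True only if every required field is present with the right type."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in tool_input:
            return False
        if not isinstance(tool_input[field], expected_type):
            return False
    return True
```

Run this before the sanity and boundary checks: a structurally invalid payload should never reach them.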

This builds directly on the tool design lesson from giving my agent full API access. When I built pdf-mcp, splitting one monolithic tool into eight focused tools eliminated most validation failures before they could happen. If you’re deciding between MCP and native function calling for your tool layer, see MCP vs Function Calling for a detailed comparison. Constrain what the agent can do, and you prevent most errors at the source.


3. Idempotent Workflows (The Saga Pattern)

A three-step workflow failed on step 2. Step 1 had already created a customer record. The retry created a duplicate. We found 200+ orphaned records a week later, each one a customer who received double billing notifications. The agent had no concept of “I already did step 1.”

AI agents retry. Models have transient failures. Networks drop. If your agent workflow isn’t idempotent, retries create duplicate side effects.
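The standard fix is an idempotency key: derive a stable key from a step's inputs and skip work that's already done. A minimal sketch, where `run_once`, `idempotency_key`, and the in-memory `_completed` set are hypothetical names — in production the completion record would live in a durable store:

```python
import hashlib
import json

_completed: set[str] = set()  # stand-in for a durable store of finished steps


def idempotency_key(step_name: str, payload: dict) -> str:
    """Stable key derived from the step name and its inputs."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{step_name}:{body}".encode()).hexdigest()


def run_once(step_name: str, payload: dict, action) -> bool:
    """Execute the action only if this exact step hasn't already completed.

    Returns True if the action ran, False if it was skipped as a duplicate.
    """
    key = idempotency_key(step_name, payload)
    if key in _completed:
        return False  # retry of a completed step: no duplicate side effect
    action(payload)
    _completed.add(key)
    return True
```

With this in place, a retry of "create customer record" after a mid-workflow failure becomes a no-op instead of a duplicate row.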

Idempotency prevents duplicate effects; the saga pattern handles partial completion. You need both once agents can fail mid-workflow. Borrowing from the saga pattern in distributed systems, each step records its completion and defines a compensation action.

steps = [
    Step("fetch_data", compensate=None),  # Read-only, safe
    Step("transform", compensate=None),  # Pure function, safe
    Step("write_to_db", compensate="delete_record"),  # Reversible
    Step("send_notification", compensate="send_correction"),  # Compensatable
]

Classify every step:

  • Read-only: Safe to retry freely
  • Reversible: Can undo (delete what you created)
  • Compensatable: Can’t undo, but can correct (send a follow-up notification)
  • Final: Can’t undo at all (payment processed). These need the most validation before execution, and should go through a human escalation flow (Pattern #5) so an irreversible action never fires without explicit approval

When an agent fails mid-workflow, you walk backwards through completed steps and run compensation. No orphaned records. No half-finished operations.
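The walk-backwards logic fits in a few lines. Here's a framework-agnostic sketch where, unlike the classification snippet above, compensations are callables rather than names — `Step` and `run_saga` are hypothetical helpers, not SDK APIs:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Optional[Callable[[], None]] = None  # None = nothing to undo


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # Walk backwards through what finished and undo/correct each step.
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate()
            return False
    return True
```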

In Strands, the GraphBuilder multi-agent pattern provides a natural structure for this: each node is an agent, conditional edges route to compensation nodes on failure, and the graph handles execution order:

from strands import Agent
from strands.multiagent import GraphBuilder
from strands.multiagent.base import Status

order_agent = Agent(
    name="order",
    system_prompt="Create the order.",
    callback_handler=None,
)
payment_agent = Agent(
    name="payment",
    system_prompt="Process payment.",
    callback_handler=None,
)
fulfillment_agent = Agent(
    name="fulfillment",
    system_prompt="Ship the order.",
    callback_handler=None,
)
rollback_agent = Agent(
    name="rollback",
    system_prompt="Cancel the order and notify the customer.",
    callback_handler=None,
)


def payment_succeeded(state):
    return (
        "payment" in state.results
        and state.results["payment"].status == Status.COMPLETED
    )


def payment_failed(state):
    return (
        "payment" in state.results
        and state.results["payment"].status == Status.FAILED
    )

builder = GraphBuilder()
builder.add_node(order_agent, "create_order")
builder.add_node(payment_agent, "payment")
builder.add_node(fulfillment_agent, "fulfillment")
builder.add_node(rollback_agent, "compensate_order")

builder.add_edge("create_order", "payment")
builder.add_edge("payment", "fulfillment", condition=payment_succeeded)
builder.add_edge("payment", "compensate_order", condition=payment_failed)
builder.set_entry_point("create_order")

graph = builder.build()

If payment fails, the graph routes to compensate_order instead of fulfillment. No orphaned orders, no half-finished workflows.


4. Budget and Token Guardrails

An agent entered a generate-validate-fail loop on a malformed input. It would generate output, fail validation, adjust, fail again, adjust differently, fail again. Twenty minutes and $180 in tokens later, it was still looping. It never occurred to the agent to stop and report the failure. It just kept trying.

An IDC survey found that 92% of organizations implementing agentic AI reported costs higher than expected. The primary driver? Runaway loops.

An agent that retries endlessly, or recursively calls tools, will happily burn through your API budget while producing nothing useful.

from strands.hooks import HookProvider, HookRegistry
from strands.hooks.events import AfterInvocationEvent


class BudgetExceeded(Exception):
    """Raised when the agent blows past its token or cycle ceiling."""


class ExecutionGuardHook(HookProvider):
    def __init__(self, max_tokens=100_000, max_cycles=20):
        self.total_tokens = 0
        self.total_cycles = 0
        self.max_tokens = max_tokens
        self.max_cycles = max_cycles

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(AfterInvocationEvent, self.check_budget)

    def check_budget(self, event: AfterInvocationEvent) -> None:
        if event.result is None:
            return
        usage = event.result.metrics.accumulated_usage
        self.total_tokens += usage["totalTokens"]
        self.total_cycles += event.result.metrics.cycle_count

        if self.total_tokens > self.max_tokens:
            raise BudgetExceeded("Token limit reached")
        if self.total_cycles > self.max_cycles:
            raise BudgetExceeded("Cycle limit reached - possible loop")
Hard limits, not hopes. The Strands SDK exposes result.metrics.accumulated_usage and result.metrics.cycle_count after each invocation: real token counts and reasoning cycles, not estimates. Set a ceiling for both, and the AfterInvocationEvent hook enforces it automatically.

This also catches the subtle failure where an agent enters a self-correction loop: generate, validate, fail, regenerate, validate, fail… Each cycle burns tokens with no progress. A step counter catches this immediately.
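A cheap complement to the step counter is to hash each output and flag repeats — if the agent produces the same output twice, it's spinning, not progressing. A framework-agnostic sketch; `LoopDetector` is a hypothetical helper:

```python
import hashlib


class LoopDetector:
    """Flags when the agent emits the same output repeatedly - the signature
    of a generate-validate-fail loop that is burning tokens without progress."""

    def __init__(self, max_repeats: int = 2):
        self.seen: dict[str, int] = {}
        self.max_repeats = max_repeats

    def is_looping(self, output: str) -> bool:
        digest = hashlib.sha256(output.encode()).hexdigest()
        self.seen[digest] = self.seen.get(digest, 0) + 1
        return self.seen[digest] >= self.max_repeats
```

Feed each model output through `is_looping` and kill the run the moment it returns True, before the budget ceiling ever comes into play.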


5. Know When to Ask for Help

An agent was 95% confident about a production database migration. It analyzed the schema, generated the migration script, and was ready to execute. The 5% case was a foreign key constraint it hadn’t seen in the test data. If it had run, it would have corrupted referential integrity across three tables. The only thing that saved us was a hard rule: destructive operations always require human approval, regardless of confidence.

The hardest pattern to get right: when should the agent stop and escalate to a human?

I use a simple three-tier framework:

| Risk Level | Confidence | Action |
|------------|------------|--------|
| Low | High | Agent retries autonomously |
| Medium | Uncertain | Agent completes in draft/read-only mode, flags for review |
| High | Any | Agent stops immediately, escalates with context |

The key insight is risk level, not confidence alone. An agent that’s 90% sure about a read-only query can proceed. An agent that’s 90% sure about deleting production data should still ask.

Confidence isn’t model-reported, since that’s unreliable. Instead, classify tools by blast radius upfront. The risk map is static and deterministic: read tools run freely, write tools get validation (Pattern #2), destructive tools trigger interrupt() which pauses execution and surfaces the full decision context to a human. The agent resumes only after explicit approval.

from strands.hooks import HookProvider, HookRegistry
from strands.hooks.events import BeforeToolCallEvent

DANGEROUS_TOOLS = {
    "delete_records",
    "drop_table",
    "revoke_access",
}


class EscalationHook(HookProvider):
    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.check_risk)

    def check_risk(self, event: BeforeToolCallEvent) -> None:
        tool_name = event.tool_use.get("name", "")
        if tool_name not in DANGEROUS_TOOLS:
            return

        approval = event.interrupt(
            "high-risk-approval",
            reason={
                "tool": tool_name,
                "inputs": event.tool_use.get("input", {}),
            },
        )

        if approval.lower() != "y":
            event.cancel_tool = "Human denied permission"

When the agent does escalate, include the full decision context: tool inputs, model output, validation failures, and the agent’s last reasoning step. An escalation without context just shifts the debugging burden to a human.

If you’re wondering how this composes with Pattern #2’s ValidationHook, both register on BeforeToolCallEvent, and Strands fires Before* hooks in registration order. Register validation first, escalation second: that way invalid inputs get rejected cheaply before the escalation hook ever prompts a human.
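To see why registration order matters, here's a toy registry — a deliberately simplified model of ordered hook firing, not the actual Strands semantics — where the first callback to cancel stops the chain:

```python
class MiniRegistry:
    """Toy stand-in for a hook registry: callbacks fire in registration order,
    and the first one to cancel short-circuits the rest."""

    def __init__(self):
        self.callbacks = []

    def add_callback(self, fn):
        self.callbacks.append(fn)

    def fire(self, event: dict) -> list[str]:
        ran = []
        for fn in self.callbacks:
            ran.append(fn.__name__)
            if fn(event):  # truthy return = cancel the tool call
                break
        return ran


def validate(event):
    return event.get("count", 0) > 100  # cheap structural check first


def escalate(event):
    return event.get("tool") == "delete_records"  # human prompt second


registry = MiniRegistry()
registry.add_callback(validate)   # registered first: rejects bad input cheaply
registry.add_callback(escalate)   # registered second: only fires on valid input
```

An oversized delete never reaches `escalate`, so no human gets paged for input that validation would have rejected anyway.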


What Actually Changed

After implementing all five patterns, I expected the biggest win to come from circuit breakers or budget guardrails, the clever engineering patterns.

It didn’t.

The biggest improvement came from Pattern #2: better tool design. When you constrain what an agent can do (smaller tools, clear boundaries, built-in validation), most errors never happen in the first place. When I built pdf-mcp, splitting one monolithic tool into eight focused tools eliminated entire categories of validation failures at the source.

The data enrichment agent that started this post? After adding these patterns, it still occasionally gets field mappings wrong. But now: the circuit breaker catches quality degradation within three calls instead of six hours. The validation gate blocks any write where field types don’t match the target schema. The budget guardrail kills runaway loops before they cost more than a few dollars. And the escalation policy means ambiguous mappings get flagged for human review instead of silently committed.

The real lesson: most agent errors aren’t runtime failures. They’re design failures. An agent that can’t write to a table it shouldn’t touch will never accidentally delete 47 records. An agent with a hard token ceiling will never burn $180 in a retry loop. The best error handling is the error that’s structurally impossible.


The Full Production Toolkit

Error handling catches failures at runtime. But a complete production agent needs both layers:

  1. Testing AI Agents in Production: Unit tests, evals, and integration tests for non-deterministic agents. Catch failures before they reach production.
  2. Error handling (this post): Circuit breakers, validation, sagas, guardrails, and escalation. Contain failures when they do happen.
  3. Monitoring AI Agents in Production: Token tracking, validation gates, output logging, and quality scoring. Detect failures that slip past error handling and testing.

For a hands-on walkthrough of the Strands SDK used throughout this post, see Strands Agents SDK: Building My First AI Agent.


Built from production failures, so yours can be less painful.

Written by Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
