Testing AI Agents in Production: Unit Tests, Evals, and Integration Tests

Testing AI agents requires three distinct layers: unit tests for deterministic logic, evals for LLM output quality, and integration tests for end-to-end workflows. Here’s the testing strategy I use in production, with code examples for each layer.

In AI Agent Error Handling Patterns, I covered five patterns that catch failures at runtime: circuit breakers, validation gates, saga rollbacks, budget guardrails, and escalation policies. But runtime error handling is reactive. Testing is how you catch problems before they reach production.

If you’ve been “vibe checking” your agents (change a prompt, try a few inputs, ship if it looks good), you’re not alone. Per LangChain’s State of Agent Engineering survey, only 52% of teams run offline evals and just 37% run evals in production. Quality is the #1 blocker for agent deployments (32%), ahead of latency, cost, and hallucination.

The problem is that agents are non-deterministic. You can’t assertEqual on LLM output. A retry returns a different answer. Even temperature=0 produces up to 15% variation across runs. Traditional unit testing breaks down.

But that doesn’t mean agents are untestable. It means you need three layers.

Think of it as a testing pyramid adapted for non-deterministic systems.

         ┌──────────────────────┐
         │  Integration Tests   │  Slow, expensive. Run nightly.
         └──────────────────────┘
      ┌────────────────────────────┐
      │           Evals            │  Medium speed. Run on PRs.
      └────────────────────────────┘
   ┌──────────────────────────────────┐
   │            Unit Tests            │  Fast, free. Run pre-commit.
   └──────────────────────────────────┘

Layer 1: Unit Tests. Test What Must Be Predictable

Most of your agent’s code is deterministic: tool selection logic, argument parsing, routing decisions, retry behavior, and output format validation. Test it like you would any other code.

The key: mock the LLM, test everything around it.

Start with a stub. Instead of patching individual methods ad-hoc, create a test double that implements your LLM provider interface with canned responses:

class StubProvider(LLMProvider):
    """A test-only provider that returns canned responses."""
    def __init__(self, research_response="stub research",
                 synthesize_response="stub synthesis",
                 generate_json_response=None):
        super().__init__()
        self._research_response = research_response
        self._synthesize_response = synthesize_response
        self._generate_json_response = generate_json_response or {}

    def research(self, instruction, search=True):
        return self._research_response

    def synthesize(self, context, instruction, sections=None, prior_errors=None):
        return self._synthesize_response

    def generate_json(self, prompt):
        return self._generate_json_response

This StubProvider is your workhorse. Every test that doesn’t care about LLM behavior uses it. No API keys, no mocking boilerplate, and you can inject specific responses when needed.
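The base class the stub extends can be a minimal abstract interface. If you don’t have one yet, something like this works (a sketch; the real LLMProvider in your codebase will wrap an actual SDK client):

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Interface shared by real providers (Gemini, OpenAI, ...) and test stubs."""

    @abstractmethod
    def research(self, instruction, search=True):
        """Run one research step and return its text output."""

    @abstractmethod
    def synthesize(self, context, instruction, sections=None, prior_errors=None):
        """Combine research outputs into a final report."""

    @abstractmethod
    def generate_json(self, prompt):
        """Return a parsed JSON object (used by LLM-as-judge scoring)."""
```

Because the methods are abstract, forgetting to implement one in a new provider fails loudly at instantiation time instead of at runtime.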

Now use it to test the deterministic logic:

def test_research_steps_all_succeed(mock_provider, mocker):
    """Research steps run in parallel via ThreadPoolExecutor.
    Use a function-based side_effect to handle nondeterministic ordering."""
    research_steps = [
        {"name": "market_overview", "instruction": "Get market data"},
        {"name": "major_news", "instruction": "Get top news"},
    ]

    def _side_effect(instruction, search=True):
        return {
            "Get market data": "S&P 500 up 0.5%",
            "Get top news": "NVDA beats earnings",
        }[instruction]

    mocker.patch.object(mock_provider, "research", side_effect=_side_effect)
    results = run_research_steps(mock_provider, research_steps, search_enabled=True)

    assert results == {
        "market_overview": "S&P 500 up 0.5%",
        "major_news": "NVDA beats earnings",
    }
    assert mock_provider.research.call_count == 2

Notice the side_effect function that maps instructions to outputs. Since ThreadPoolExecutor makes execution order nondeterministic, a sequential list of return values would be flaky. A dictionary-based side_effect handles any ordering.
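For context, run_research_steps itself can be as small as a fan-out over a thread pool (a sketch under that assumption, not the article’s exact implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_research_steps(provider, research_steps, search_enabled=True, max_workers=4):
    """Run research steps concurrently; return results keyed by step name."""
    def _run(step):
        return step["name"], provider.research(step["instruction"], search=search_enabled)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results, but the steps
        # themselves execute in whatever order the pool schedules them.
        return dict(pool.map(_run, research_steps))
```

This is exactly why the test above cannot assume call order: the pool decides which instruction hits the provider first.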

For retry logic, the challenge is that production retries use exponential backoff (via tenacity), which would make tests slow. Patch the wait strategy to zero:

@pytest.fixture
def _no_retry_wait(mocker):
    """Patch tenacity wait to zero so retry tests run instantly."""
    from tenacity import wait_none
    mocker.patch.object(GeminiProvider.research.retry, "wait", wait_none())
    mocker.patch.object(GeminiProvider.synthesize.retry, "wait", wait_none())


from google.genai.errors import ClientError, ServerError  # the SDK's error types

def test_retries_on_server_error(mocker, _no_retry_wait):
    mock_genai = mocker.patch("my_agent.providers.gemini.genai")
    mock_client = mocker.MagicMock()
    mock_genai.Client.return_value = mock_client

    mock_response = mocker.MagicMock()
    mock_response.text = "Success after retries"
    mock_client.models.generate_content.side_effect = [
        ServerError(503, {}),   # Attempt 1: fails
        ServerError(503, {}),   # Attempt 2: fails
        mock_response,          # Attempt 3: succeeds
    ]

    provider = GeminiProvider(api_key="test-key")
    result = provider.research("test instruction", search=False)

    assert result == "Success after retries"
    assert mock_client.models.generate_content.call_count == 3


def test_no_retry_on_client_error(mocker):
    """400 errors should NOT be retried. Only 429/503 are retryable."""
    mock_genai = mocker.patch("my_agent.providers.gemini.genai")
    mock_client = mocker.MagicMock()
    mock_genai.Client.return_value = mock_client
    mock_client.models.generate_content.side_effect = ClientError(400, {"error": "bad request"})

    provider = GeminiProvider(api_key="test-key")
    with pytest.raises(ClientError):
        provider.research("test instruction", search=False)

    assert mock_client.models.generate_content.call_count == 1  # No retry

What to unit test on every agent:

  • Step routing and orchestration: Given a config, does the agent dispatch to the correct steps in the correct order?
  • Argument validation: Are inputs in the correct format, within bounds, and schema-valid?
  • Retry logic: Does the agent retry on transient errors (503, 429)? Does it not retry on client errors (400)? Does it give up after N attempts?
  • Output parsing: Does structured output (JSON, function call schemas) parse without errors?
  • Guardrails: Do safety filters, validation gates, and scope boundaries trigger correctly?
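A guardrail test, for instance, is just a plain assertion. Here is a toy sketch (is_in_scope is a hypothetical scope-boundary function, not from the article’s codebase):

```python
def is_in_scope(request, allowed_topics=("markets", "earnings", "economy")):
    """Toy scope guardrail: accept only requests mentioning an allowed topic."""
    return any(topic in request.lower() for topic in allowed_topics)

def test_out_of_scope_request_rejected():
    assert not is_in_scope("Write me a poem about cats")

def test_in_scope_request_allowed():
    assert is_in_scope("Summarize today's earnings reports")
```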

These tests are fast, free (no LLM API calls), and catch real bugs. The StubProvider pattern scales. Whether your agent has 3 steps or 30, inject canned responses and test everything around the LLM.

If you’re using Pydantic AI, its TestModel gives you a similar stub out of the box, no API keys needed.


Layer 2: Evals. Test What the LLM Actually Says

Unit tests cover the deterministic scaffolding. Evals cover the part you can’t predict: the LLM’s actual output.

An eval is a scored test case: you give the agent an input and measure output quality against a rubric. The difference from a unit test is that you’re scoring on a spectrum, not asserting exact equality.

Not all evals are equal. Some are cheap and deterministic. Some require another LLM. Some trade precision for flexibility. Start from the cheapest and escalate:

Deterministic checks (fast, free)

Regex matches, keyword presence, JSON schema validation, length constraints. Use these for “does the output contain X” or “is the response valid JSON” checks. They catch obvious failures cheaply.
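Many of these are nearly one-liners. A “valid JSON with required keys” check, for example, might look like this sketch:

```python
import json

def is_valid_json_with_keys(output, required_keys):
    """Cheap structural eval: parses as a JSON object and contains every key."""
    try:
        data = json.loads(output)
    except (ValueError, TypeError):
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)
```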

My agent generates daily briefings with required sections, markdown tables, and citations. I validate all of this structurally, with no LLM needed:

import re

def validate_synthesis(report, sections, validation=None, reference_time=None):
    errors = []

    if len(report) < 500:
        errors.append(f"Report too short: {len(report)} chars (minimum 500)")

    headers = re.findall(r'^#{1,6}\s+(.+)$', report, re.MULTILINE)
    if not headers:
        errors.append("Report contains no markdown headers")

    headers_lower = [h.lower() for h in headers]
    missing = [s for s in sections if not any(s.lower() in h for h in headers_lower)]
    if missing:
        errors.append(f"Missing sections: {', '.join(missing)}")

    # Split the report into per-section bodies, keyed by header text
    parts = re.split(r'^#{1,6}\s+(.+)$', report, flags=re.MULTILINE)
    section_content = {
        parts[i].strip(): parts[i + 1] for i in range(1, len(parts) - 1, 2)
    }

    # Config-driven quality checks
    if validation:
        for section in validation.get("sections_with_tables", []):
            content = section_content.get(section, "")
            if not re.search(r'\|.*---.*\|', content):
                errors.append(f"Section '{section}' is missing required table")

    # (citation freshness checks, anchored to reference_time, omitted for brevity)

    if errors:
        raise SynthesisValidationError(
            f"Validation failed ({len(errors)} issue(s)): " + "; ".join(errors)
        )

Then test it with production-representative fixtures:

STOCK_DAILY_REPORT = (
    "## Markets Snapshot\n\n"
    "| Index | Level | Change |\n|---|---|---|\n"
    "| S&P 500 | 5,320.15 | +0.5% |\n| NASDAQ | 16,780.42 | +0.8% |\n\n"
    "## Big Movers & Why\n\n"
    "NVIDIA surged 5% on record data center revenue.\n\n"
    "## Watchlist Report\n\n"
    "| Ticker | Price | Change | Notes |\n|---|---|---|---|\n"
    "| AAPL | $185.50 | +1.2% | Services revenue beat |\n\n"
    "## Tomorrow's Radar\n\n"
    "CPI data at 8:30 AM ET.\n\n"
    "## Bottom Line\n\n"
    "Bullish momentum continues."
)

VALIDATION_CONFIG = {
    "min_section_length": 50,
    "sections_with_tables": ["Markets Snapshot", "Watchlist Report"],
}

def test_well_formed_report_passes():
    validate_synthesis(STOCK_DAILY_REPORT, STOCK_DAILY_SECTIONS,
                       validation=VALIDATION_CONFIG)


def test_missing_table_fails():
    report = STOCK_DAILY_REPORT.replace(
        "| Index | Level | Change |\n|---|---|---|\n"
        "| S&P 500 | 5,320.15 | +0.5% |\n| NASDAQ | 16,780.42 | +0.8% |\n",
        "Markets were broadly higher.\n"
    )
    with pytest.raises(SynthesisValidationError, match="Markets Snapshot.*table"):
        validate_synthesis(report, STOCK_DAILY_SECTIONS,
                           validation=VALIDATION_CONFIG)


def test_truncated_output_fails():
    report = STOCK_DAILY_REPORT[:400]
    with pytest.raises(SynthesisValidationError, match="too short"):
        validate_synthesis(report, STOCK_DAILY_SECTIONS)

These fixtures mirror real production output. When the LLM changes its formatting, drops a section, or skips a table, these tests catch it before delivery.

LLM-as-a-Judge (medium cost, scalable)

Use a capable LLM to score another LLM’s output against a rubric. This is the most popular automated eval method (53% adoption). The key is writing clear rubrics. Vague criteria produce arbitrary scores.

DIMENSIONS = ["analytical_depth", "factual_consistency", "actionability", "completeness"]

RUBRIC = """\
Score the following report on these dimensions (1-5 each):
- analytical_depth: How deep and insightful is the analysis?
- factual_consistency: Are claims well-sourced and internally consistent?
- actionability: Does the report provide clear, actionable takeaways?
- completeness: Are all major topics covered with sufficient detail?

Return ONLY a JSON object with these four keys and integer scores 1-5.
"""

def score_report(report, provider):
    prompt = f"{RUBRIC}\n\n---\n\n{report}"
    scores = provider.generate_json(prompt)

    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"LLM response missing dimension(s): {', '.join(missing)}")

    for dim in DIMENSIONS:
        if not (1 <= scores[dim] <= 5):
            raise ValueError(f"Score for '{dim}' out of range: {scores[dim]}")

    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores

In tests, mock the provider so you’re testing the scoring logic, not the LLM itself:

from unittest.mock import MagicMock

def _mock_provider(scores_dict):
    provider = MagicMock(spec=LLMProvider)
    provider.generate_json.return_value = scores_dict
    return provider

def test_score_report_valid_response():
    provider = _mock_provider({
        "analytical_depth": 4, "factual_consistency": 5,
        "actionability": 3, "completeness": 4,
    })
    result = score_report("## Summary\n\nMarkets rallied.", provider)

    assert result["overall"] == pytest.approx(4.0)


def test_score_report_missing_dimension_raises():
    provider = _mock_provider({"analytical_depth": 4, "factual_consistency": 5})

    with pytest.raises(ValueError, match="missing.*dimension"):
        score_report("report", provider)


def test_score_report_out_of_range_raises():
    provider = _mock_provider({
        "analytical_depth": 7, "factual_consistency": 5,  # 7 is out of 1-5 range
        "actionability": 3, "completeness": 4,
    })

    with pytest.raises(ValueError, match="out of range"):
        score_report("report", provider)

Golden Corpus Regression Tests

Once you have scored outputs, promote the best ones into a “golden corpus” that every future change is tested against:

import json
from pathlib import Path
from datetime import datetime, timezone, timedelta

SGT = timezone(timedelta(hours=8))

def test_golden_corpus_passes_validation():
    """Every golden report must still pass validation
    under current rules."""
    golden_dir = Path("evals/golden")
    if not golden_dir.exists():
        pytest.skip("No golden corpus yet")

    for md_file in golden_dir.rglob("*.md"):
        meta_path = md_file.with_suffix(".meta.json")
        meta = json.loads(meta_path.read_text())
        report = md_file.read_text()

        # Anchor freshness checks to when the report
        # was produced, not datetime.now().
        ts = meta["timestamp"]
        ref_time = datetime.strptime(
            ts, "%Y-%m-%dT%H-%M-%S"
        ).replace(tzinfo=SGT)

        validate_synthesis(
            report, meta["sections"],
            validation=meta.get("validation_config"),
            reference_time=ref_time,
        )
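Getting a report into the corpus is just writing the file plus a metadata sidecar. A sketch (promote_to_golden is a hypothetical helper; the field names match what the regression test reads back):

```python
import json
from datetime import datetime, timezone, timedelta
from pathlib import Path

SGT = timezone(timedelta(hours=8))

def promote_to_golden(report, sections, validation_config=None,
                      golden_dir=Path("evals/golden")):
    """Save a report plus the metadata the golden-corpus test expects."""
    golden_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.now(SGT).strftime("%Y-%m-%dT%H-%M-%S")
    (golden_dir / f"{ts}.md").write_text(report)
    (golden_dir / f"{ts}.meta.json").write_text(json.dumps({
        "timestamp": ts,
        "sections": sections,
        "validation_config": validation_config,
    }, indent=2))
```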

The reference_time anchoring is important. Without it, citation freshness checks fail as the golden reports age, creating flaky CI failures that have nothing to do with code changes.
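A freshness check built on reference_time compares cited dates against when the report was produced, never the wall clock. A sketch (the ISO date regex and 7-day window are assumptions, not the article’s exact rules):

```python
import re
from datetime import datetime, timedelta

def check_freshness(report, reference_time, max_age_days=7):
    """Return cited dates older than max_age_days relative to reference_time."""
    stale = []
    for m in re.finditer(r'\b(\d{4}-\d{2}-\d{2})\b', report):
        cited = datetime.strptime(m.group(1), "%Y-%m-%d").replace(
            tzinfo=reference_time.tzinfo)
        if reference_time - cited > timedelta(days=max_age_days):
            stale.append(m.group(1))
    return stale
```

Because the comparison is against reference_time, a golden report from last year stays green forever; only citations that were stale at generation time fail.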

Start small: Anthropic recommends starting with 20-50 test cases drawn from real failures. Don’t try to build a comprehensive eval suite on day one. Pull from actual production bugs. If it broke once, test for it forever.

I had a prompt change that improved helpfulness but introduced subtle hallucinations. My factuality eval caught it in CI before the change was merged. Without the eval, it would have shipped.

One warning: be careful not to overfit prompts to your eval suite. If you optimize only for passing tests, you risk gaming your own metrics. Rotate real user queries into your eval set regularly to keep it honest.


Layer 3: Integration Tests. Test the Full Loop

Unit tests check components. Evals check output quality. Integration tests check whether the whole agent actually completes its job end-to-end.

Test the outcome, not the path

Agents can take different routes to the same correct answer. Don’t assert on intermediate steps. Assert on the final result. Did the agent complete the workflow? Did it call the right number of steps? Did it deliver to the right recipient?

def test_full_pipeline(mocker):
    config = {
        "name": "Test Job",
        "schedule": "0 0 * * *",
        "model": {
            "provider": "gemini",
            "name": "gemini-2.5-flash",
            "search_enabled": True,
        },
        "research_steps": [
            {"name": "s1", "instruction": "do stuff"},
        ],
        "synthesis": {
            "instruction": "synth",
            "sections": ["Summary"],
        },
        "delivery": [{
            "method": "email",
            "to": "user@example.com",
            "subject_template": "T",
            "template": "t.html",
        }],
        "alerts": {"on_failure": "alert@example.com"},
    }
    mocker.patch(
        "my_agent.runner.load_config",
        return_value=config,
    )
    mock_create = mocker.patch(
        "my_agent.runner.create_provider",
    )
    mock_research = mocker.patch(
        "my_agent.runner.run_research_steps",
        return_value={"s1": "research output"},
    )
    mock_synthesis = mocker.patch(
        "my_agent.runner.run_synthesis",
        return_value="Final report",
    )
    mock_delivery = mocker.patch(
        "my_agent.runner.run_delivery",
    )
    mock_alert = mocker.patch(
        "my_agent.runner.send_alert",
    )
    mocker.patch.dict("os.environ", {
        "GEMINI_API_KEY": "test-key",
        "GMAIL_ADDRESS": "sender@gmail.com",
        "GMAIL_APP_PASSWORD": "app-pass",
    })

    run_job("jobs/test-job.yaml")

    mock_create.assert_called_once()
    mock_research.assert_called_once()
    mock_synthesis.assert_called_once()
    mock_delivery.assert_called_once()
    mock_alert.assert_not_called()

Test the retry-with-feedback loop

When synthesis fails validation, the error message should be fed back to the LLM for self-correction. This is a critical integration point between the validation layer and the LLM layer, and the kind of bug that only shows up when you test the interaction between components:

def test_synthesis_retries_on_validation_failure(mock_provider, mocker):
    mocker.patch.object(mock_provider, "synthesize", side_effect=[
        "Too short, no headers",  # Attempt 1: fails validation
        GOOD_SYNTHESIS,            # Attempt 2: passes
    ])
    synthesis_config = {
        "instruction": "Combine into a briefing",
        "sections": ["Market Pulse", "Big Movers"],
    }
    result = run_synthesis(mock_provider, sample_research_outputs, synthesis_config)

    assert "Market Pulse" in result
    assert mock_provider.synthesize.call_count == 2
    # First call has no prior_errors
    first_call = mock_provider.synthesize.call_args_list[0]
    assert first_call.kwargs.get("prior_errors") is None
    # Second call includes the validation error as feedback
    second_call = mock_provider.synthesize.call_args_list[1]
    assert "validation failed" in second_call.kwargs["prior_errors"].lower()

Test failure propagation and alerting

The happy path is easy. Test the unhappy path: does a research failure trigger an alert? Does the alert failure not mask the original error?

def test_research_failure_sends_alert(mocker):
    mocker.patch("my_agent.runner.load_config", return_value={...})
    mocker.patch("my_agent.runner.create_provider")
    mocker.patch("my_agent.runner.run_research_steps",
                  side_effect=Exception("API timeout"))
    mock_alert = mocker.patch("my_agent.runner.send_alert")
    mocker.patch.dict("os.environ", {
        "GEMINI_API_KEY": "test-key", "GMAIL_ADDRESS": "s@g.com",
        "GMAIL_APP_PASSWORD": "p",
    })

    with pytest.raises(Exception, match="API timeout"):
        run_job("jobs/test-job.yaml")

    mock_alert.assert_called_once()
    assert str(mock_alert.call_args.kwargs["error"]) == "API timeout"


def test_alert_failure_still_raises_original(mocker):
    """If alerting itself fails, the original error must still propagate."""
    mocker.patch("my_agent.runner.load_config", return_value={...})
    mocker.patch("my_agent.runner.create_provider")
    mocker.patch("my_agent.runner.run_research_steps",
                  side_effect=Exception("Original error"))
    mocker.patch("my_agent.runner.send_alert",
                  side_effect=Exception("SMTP connection failed"))
    mocker.patch.dict("os.environ", {
        "GEMINI_API_KEY": "test-key", "GMAIL_ADDRESS": "s@g.com",
        "GMAIL_APP_PASSWORD": "p",
    })

    with pytest.raises(Exception, match="Original error"):
        run_job("jobs/test-job.yaml")

Run multiple trials

For tests that hit real LLM APIs (nightly runs, not CI), a single run tells you nothing about reliability. Run the same test 5-10 times and measure the success rate. A drop from 92% to 78% is a real regression; a single failure is noise.

import statistics

def test_agent_reliability():
    """Run 10 trials and assert minimum success rate."""
    results = []
    for _ in range(10):
        try:
            output = agent.run("Generate the daily briefing")
            passed = output is not None and len(output) > 500
            results.append(passed)
        except Exception:
            results.append(False)

    success_rate = statistics.mean(results)
    assert success_rate >= 0.8, (
        f"Success rate {success_rate:.0%} below 80%"
    )

After a model upgrade, my agent’s end-to-end success rate dropped from 92% to 78%. The model passed all evals, but failed composition. Unit tests and evals didn’t catch it because the individual components were fine. Only the integration tests saw the full picture.


Putting It Together: The CI/CD Pipeline

These three layers run at different speeds and serve different purposes. Run the cheapest layer as often as possible, and push expensive LLM-based evals later in the pipeline:

Layer             | When it runs | Speed   | Cost                 | What it catches
------------------|--------------|---------|----------------------|-----------------
Unit tests        | Pre-commit   | Seconds | Free                 | Routing bugs, schema errors, guardrail gaps
Evals             | PR check     | Minutes | $ (LLM calls)        | Quality regressions, hallucinations, relevance drift
Integration tests | Nightly      | 10+ min | $$ (multiple trials) | End-to-end failures, composition bugs, reliability drops

Treat prompts like production code. Version them, review changes, require eval approval before merge. A prompt tweak that fixes one edge case can silently break three others. This is what regression testing is for.

Here’s what this looks like in practice:

# .github/workflows/agent-evals.yml
name: Agent Evals
on:
  pull_request:
    paths: ['prompts/**', 'agents/**']

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: pytest tests/evals/ --tb=short

If you prefer declarative configs, Promptfoo has its own GitHub Action with YAML-based eval definitions. DeepEval builds on pytest, so evals use the same test structure your team already knows. Either works. Pick the one that fits your stack.


The Minimum Viable Testing Setup

You don’t need a dedicated eval team or an enterprise platform. Start here:

  1. pytest + mocked LLM for unit tests on tool routing, argument validation, and guardrails
  2. 20 eval cases drawn from real production failures, run on every PR that touches prompts
  3. One CI/CD gate that blocks merges if evals regress

That’s it. You can set this up in an afternoon and expand from there.

Reliability is layered defense. Error handling prevents failures. Testing blocks regressions. Monitoring catches drift.

Start with the tests that would have caught your last production bug.


Part of the AI agents reliability series. Previously: Error Handling Patterns.

Written by Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
