My AI Agent Passed Every Check. 67% of It Was Wrong.


Valid JSON. Correct schema. All fields populated.

Completely wrong data.

That’s the most dangerous LLM failure mode. Most production systems won’t catch it. My AI agent’s report passed every structural check while 67% of its verifiable facts were fabricated. It’s the hardest hallucination to detect because nothing looks broken.

I run a daily AI agent pipeline called Athena that generates market briefings using local and cloud LLMs with search tools and structured outputs. When I tested three local models against the same real-time data task, the most capable model was the most dangerous.

TL;DR: I tested three local LLMs on a real-time data task. The worst model was the easiest to catch. The best model was the most dangerous, producing polished output where 67% of verifiable claims were wrong. Structural validation catches nothing here. Production systems need factual verification against ground truth, and you need to treat model confidence as unreliable by default.

Confidence is a style signal, not a truth signal. The model that fills every field is more suspicious than the one that returns “N/A”.

My agent handed me what looked like a perfect report. Clean formatting. Every section filled. Numbers with decimal precision. If you skimmed it, you would trust it.

Then I checked the data. Every single number was wrong.

Amazon up 9.2%. Actually down 9.5%. Not just wrong. Wrong in the opposite direction. The S&P 500 down 0.37%. Actually up 0.78%. Gold off by more than $3,000.

If I hadn’t manually spot-checked against live data, this would have gone out as a morning briefing. The models that break obviously are easy problems. The models that sound confident while being wrong cause real damage.


The Confidence Spectrum

I tested each model against the same market briefing task to see if any could replace the cloud API. Each failed differently, and the pattern revealed something I now think about every time I evaluate an LLM for production use.

Level 1: Loud Failure

Qwen 3.5 (9B) couldn’t produce valid JSON to call its search tools. Three out of four attempts threw a KeyError. When it did manage a query, it sent the entire instruction prompt as the search string.

Production risk: Low. This breaks loudly. Error handling catches it. Monitoring alerts fire. Nobody acts on bad data because there’s no data.

If you have error handling patterns in place, this is the failure mode they’re built for.

Level 2: Honest Uncertainty

Qwen 3 (14B) used its tools correctly. It retrieved 12-15 results per step. But when it couldn’t find a specific number, it said so. The Markets Snapshot showed “N/A” for nearly every close price.

It also ignored the configured watchlist and substituted its own, which is a different problem. But on the core question, “what do you do when you don’t have the data?”, it gave the right answer: admit it.

Production risk: Moderate, but manageable. You can build around honest uncertainty. A monitoring layer flags “N/A” responses and triggers a fallback. The agent says “I don’t know,” and your system handles it. This is how well-monitored agents are supposed to work.

Level 3: Confident Hallucination

Gemma 3 (27B) was the most capable model by every metric except the one that matters.

JSON generation: reliable (4/4 passed). Search tool use: correct (12-15 results). Report structure: complete. Formatting: professional.

Factual accuracy: 33% of verifiable claims.

I spot-checked six market data points. Every single one was wrong:

| Claim | Reported | Actual | Error | Source |
|---|---|---|---|---|
| Gold price | $2,155/oz | ~$5,175/oz | -$3,000 | Yahoo Finance |
| Hang Seng | 16,992 | 25,321 | -8,000+ | Yahoo Finance |
| AMZN weekly | +9.21% | -9.53% | Opposite direction | Yahoo Finance |
| Dow Jones | -0.55% | +0.66% | Opposite direction | Yahoo Finance |
| S&P 500 | -0.37% | +0.78% | Opposite direction | Yahoo Finance |
| Bitcoin | $66,000 | ~$72,855 | ≈ -$7,000 | CoinGecko |

Six checks. Six failures. This wasn’t noise. It was systematic hallucination.

Production risk: Critical. This passes every structural check. No error flags. No uncertainty markers. A human scanning the output sees a professional report with plausible numbers. Only someone who independently verifies the data catches it.

The model that sounds the most confident is often the most dangerous when it’s wrong.


Why Structural Validation Fails

Most production LLM systems validate outputs the same way:

  • Is the response valid JSON?
  • Does it match the expected schema?
  • Are all required fields present?
  • Are values within expected ranges?

Gemma 3 passed all of these. The gold price of $2,155 is within a “reasonable” range if you last checked gold prices in 2024. The percentage changes are small, plausible numbers. Nothing triggers an outlier alert.
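To make that gap concrete, here is a minimal sketch of a range-based structural check. The field names and thresholds are illustrative, not my actual pipeline's schema, but the point stands: the fabricated numbers sail through.

```python
def validate_structure(report: dict) -> bool:
    """Structural checks only: required fields, types, 'reasonable' ranges."""
    required = {"gold_price", "sp500_change_pct"}
    if not required <= report.keys():
        return False
    if not isinstance(report["gold_price"], (int, float)):
        return False
    # A range that looked sane in 2024-era training data.
    if not 1500 <= report["gold_price"] <= 6000:
        return False
    # Daily index moves are almost always small.
    if not -10 <= report["sp500_change_pct"] <= 10:
        return False
    return True

# The hallucinated values from the table above.
hallucinated = {"gold_price": 2155, "sp500_change_pct": -0.37}
passes = validate_structure(hallucinated)  # True: valid structure, wrong data
```

Every check passes, because every check is about shape, not truth.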

This is the gap. Structural validation confirms the output looks right. It says nothing about whether the output is right.

In traditional software, the right type usually implies the right value. In LLM systems, a model can produce perfectly structured output that is completely wrong. Structural validation assumes the system is deterministic. LLMs are not.

Hallucination benchmarks put the best models at under 1% on general tasks. Financial queries push that to 15-25%. The local models I tested performed worse.

This is why LLM hallucination detection cannot rely on schema validation alone. If you’re testing your agents, this is the failure class that unit tests and structural checks miss entirely. You need a different layer.


Why LLMs Hallucinate with Confidence

The root cause isn’t a bug. It’s a training incentive.

When a model searches for “gold price today” and gets back headlines instead of numbers, it has three options:

  1. Refuse to answer. Almost no model does this unprompted.
  2. Signal uncertainty. Qwen 3 did this with “N/A”. The safe choice.
  3. Generate a plausible answer. Gemma 3 did this. The dangerous choice.

Option 3 is the default because LLMs are trained to be helpful. Returning “N/A” feels like failure to the model. Generating a plausible number feels like completing the task.

Research on LLM training shows a consistent pattern: benchmarks reward answering confidently more than admitting uncertainty. Next-token prediction provides no training signal for “I don’t know” responses, so the training loop selects for confident fabrication.

MIT research quantified this: models are 34% more likely to use phrases like “definitely” and “certainly” when providing incorrect information. The less a model knows, the more confident it sounds.

This is why giving agents search tools doesn’t prevent hallucination. If the search results don’t contain the exact number the model needs, it fills the gap from training data. The search tool gave Gemma 3 context about gold markets. It just didn’t give it today’s price. So the model picked the most plausible price from training.

The result: a number that looks right, sounds right, and is months out of date.

Financial data is hit hardest because market prices have a shelf life measured in hours. LLM training data has a shelf life measured in months.


3 Detection Layers for Confident Hallucination

After this experience, I stopped treating hallucination detection as a single validation step. I now treat it like a safety system: several small checks that fail independently instead of one monolithic gate.

1. Ground Truth Spot-Checks

For any output containing verifiable claims, sample a subset and check them against authoritative sources.

import random

def spot_check_prices(
    report: dict, api_client
) -> list[dict]:
    """Check reported prices against live data."""
    checks = []
    watchlist = report["watchlist"]
    sample = random.sample(
        watchlist, min(3, len(watchlist))
    )
    for ticker in sample:
        reported = ticker["close"]
        actual = api_client.get_price(
            ticker["symbol"]
        )
        drift = abs(reported - actual) / actual
        checks.append({
            "symbol": ticker["symbol"],
            "reported": reported,
            "actual": actual,
            "drift_pct": drift * 100,
            "pass": drift < 0.05,  # 5% tolerance
        })
    return checks

You don’t need to verify every claim. Spot-checking 2-3 data points catches systematic hallucination. If the model is pulling numbers from stale training data, multiple claims will be off, not just one.

2. Cross-Reference Consistency

Claims within the same report should be internally consistent. If the report says the S&P 500 is down but 8 of 10 sectors are up, something is wrong.

def check_internal_consistency(
    report: dict,
) -> list[str]:
    """Flag contradictions within the report."""
    flags = []
    sp500_up = (
        report["sp500"]["change_percent"] > 0
    )
    sectors = report["sectors"]
    sectors_up = sum(
        1 for s in sectors
        if s["change_percent"] > 0
    )
    ratio = sectors_up / len(sectors)
    if sp500_up and ratio < 0.3:
        flags.append(
            f"Index up but only "
            f"{sectors_up}/{len(sectors)} "
            f"sectors up"
        )
    if not sp500_up and ratio > 0.7:
        flags.append(
            f"Index down but "
            f"{sectors_up}/{len(sectors)} "
            f"sectors up"
        )
    return flags

Confident hallucinations often create internally inconsistent reports because each number is generated independently from training data, not derived from a coherent dataset.

3. Temporal Drift Detection

If you run the same agent daily, track the magnitude of changes between reports. A stock price jumping 50% overnight without a corresponding news event is a hallucination signal.

def find_ticker(report: dict, symbol: str):
    """Return the watchlist entry for a symbol, or None."""
    for entry in report["watchlist"]:
        if entry["symbol"] == symbol:
            return entry
    return None

def check_temporal_drift(
    current: dict,
    previous: dict,
    max_daily_drift: float = 0.15,
) -> list[str]:
    """Flag values that changed too much."""
    flags = []
    for symbol in current["watchlist"]:
        prev = find_ticker(
            previous, symbol["symbol"]
        )
        if prev and prev["close"] > 0:
            drift = (
                abs(symbol["close"] - prev["close"])
                / prev["close"]
            )
            if drift > max_daily_drift:
                flags.append(
                    f"{symbol['symbol']} moved "
                    f"{drift:.0%} in one day"
                )
    return flags

Gold jumping from $5,175 to $2,155 between reports would trigger this immediately. The model isn’t reporting a market crash. It’s reporting a different reality.


Building Confidence-Aware Systems

The deeper lesson goes beyond financial data. Any production system where an LLM generates claims that humans or downstream systems act on needs to treat model confidence as unreliable.

Separate data retrieval from synthesis

Don’t let the model fetch and interpret data in the same step. Retrieve structured data through APIs, not search, then pass verified data to the model for narrative synthesis. The model writes the story. The data comes from an authoritative source. There’s no gap to fill.
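As a sketch of that separation (the `StubPriceAPI` client and prompt wording are assumptions for illustration, not my actual pipeline), the model only ever sees numbers that an API already verified:

```python
class StubPriceAPI:
    """Stand-in for a real market data client (assumption for this sketch)."""
    def get_price(self, symbol: str) -> float:
        return {"AMZN": 178.25, "GLD": 5175.0}[symbol]

def build_synthesis_prompt(api_client, symbols: list[str]) -> str:
    """Fetch verified numbers first, then hand them to the model
    as fixed facts it may narrate but not invent."""
    lines = []
    for symbol in symbols:
        price = api_client.get_price(symbol)  # authoritative source
        lines.append(f"{symbol}: {price}")
    facts = "\n".join(lines)
    return (
        "Write a market briefing using ONLY these verified figures. "
        "Do not state any number not listed below.\n\n" + facts
    )

prompt = build_synthesis_prompt(StubPriceAPI(), ["AMZN", "GLD"])
```

The prompt carries the ground truth; the model contributes only prose around it.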

Make uncertainty a first-class output

Design your agent’s output schema to include provenance. Instead of "gold_price": 5175, return "gold_price": {"value": 5175, "source": "api", "timestamp": "2026-03-05T14:30:00Z"}. If the source is “model” instead of “api”, your monitoring layer treats it differently.
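In Python terms, that provenance schema might look like this sketch (field names mirror the JSON above; the helper is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class SourcedValue:
    value: float
    source: str      # "api" or "model"
    timestamp: str   # ISO 8601

def flag_model_sourced(report: dict[str, SourcedValue]) -> list[str]:
    """Return field names whose values came from the model, not an API."""
    return [name for name, v in report.items() if v.source != "api"]

report = {
    "gold_price": SourcedValue(5175.0, "api", "2026-03-05T14:30:00Z"),
    "bitcoin": SourcedValue(66000.0, "model", "2026-03-05T14:30:00Z"),
}
flags = flag_model_sourced(report)  # ["bitcoin"]
```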

Validate before distribution

Never distribute LLM-generated reports without at least one verification pass. Spot-check a sample against a live API. The cost of a few API calls is negligible compared to the cost of distributing wrong information to stakeholders who will act on it.
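A minimal distribution gate might combine the layers like this (a sketch; it assumes check results shaped like the spot-check output above):

```python
def safe_to_distribute(
    spot_checks: list[dict],
    consistency_flags: list[str],
    drift_flags: list[str],
) -> bool:
    """Block distribution if any verification layer failed."""
    all_passed = all(c["pass"] for c in spot_checks)
    return all_passed and not consistency_flags and not drift_flags
```

The gate is deliberately conservative: one failed spot-check or one open flag is enough to hold the report for review.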

Assume polished output is the riskiest

This is counterintuitive but critical. The messy output with “N/A” values is safer than the clean report where every field is filled. If every field has a value and no uncertainty markers, be more suspicious, not less. The model may have filled every gap with a guess.


Key Takeaways

  1. The most dangerous model sounds confident. Obvious failures are easy to catch. Polished, professional, wrong output passes every structural check and reaches users unchallenged.

  2. Structural validation is necessary but not sufficient. Valid JSON and correct schema don’t mean correct content. You need factual verification against ground truth.

  3. Honest uncertainty is a feature. A model that returns “N/A” when it doesn’t know is more valuable in production than one that fills every field with a plausible guess. Design your system to reward uncertainty over fabrication.

  4. Build three detection layers. Ground truth spot-checks catch systematic hallucination. Consistency checks catch contradictions. Temporal drift detection catches stale training data leaking into outputs.

  5. Separate data from synthesis. Retrieve facts through authoritative APIs. Let the model write the narrative. When data and synthesis are separated, there are no gaps to fill.

If your system trusts confident output, it will eventually ship confident lies.


This is part of my series on building AI agents in production. For the experiment behind these findings, see Why Local LLMs Hallucinate When Your AI Agent Has Search. For the architectural fix, see My AI Agent Got 10 of 15 Facts Wrong. The Fix Was the Tool, Not the Model. Related: Error Handling Patterns, Testing AI Agents, Monitoring AI Agents.

Written by Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
