Why Local LLMs Hallucinate When Your AI Agent Has Search

I gave a local LLM web search. It still got 10 out of 15 financial facts wrong.

Not because it didn’t search. Because search doesn’t return data.

I run a daily market briefing agent that pulls stock prices, macro indicators, and sector news every morning. I tested Gemini 2.5 Flash against three Ollama models on the same real-time financial task.

TL;DR: I tested Gemini 2.5 Flash against three local Ollama models on a real-time market briefing agent. The best-performing local model (Gemma 3 27B) got 10 of 15 financial claims wrong, even with web search. The root cause: search returns headlines, not data. When the model can’t find the number, it fills in a confident guess from stale training data. For real-time factual tasks, local LLMs aren’t ready.

If your agent is using search for facts, it is guessing.

Most agents actually behave like this:

Agent → Search → Headlines → Missing data → LLM guess

This is a common LLM hallucination pattern when using search tools in AI agents.


The Experiment

The agent is called Athena. In production, it runs on Gemini 2.5 Flash with native Google Search grounding. Reports are accurate, structured, and ready in about 2 to 3 minutes.

I wanted to see if local models could do the same job. Lower cost, no API dependency, full data privacy. If a 14B or 27B parameter model running on Ollama could produce the same report, I could cut the cloud dependency entirely.

I tested four models on the same task: generate a Daily Market Briefing with current prices, macro data, sector analysis, and a watchlist update.

| Model | Provider | JSON Reliability | Search Results | Report Quality | Speed |
|---|---|---|---|---|---|
| Gemini 2.5 Flash | Google AI | Native | Native Google Search | Production-grade | ~2-3 min |
| Qwen 3.5 (9B) | Ollama | 3/4 failed | 0-1 results | Unusable | ~36 min |
| Gemma 3 (27B) | Ollama | 4/4 passed | 12-15 results | Wrong data | ~15 min |
| Qwen 3 (14B) | Ollama | 4/4 passed | 12-15 results | Missing data | ~17 min |

The failures weren’t random. Each model failed in a different, predictable way.


Where It Broke: Three Failure Modes

Failure Mode 1: Tooling Breakdown

Qwen 3.5 (9B) couldn’t reliably produce the JSON needed to call search tools. Three out of four attempts threw a KeyError. The model would output something close to the expected format but with wrong keys or malformed structure.

When it did manage a search query, it sent the entire instruction prompt as the query string. DuckDuckGo returned zero or one results for a 200-word “query.”

This is the simplest failure mode. The model can’t use its tools. No tools, no grounding, no useful output. At 36 minutes per run on local hardware, it wasn’t worth debugging further.

If you’ve dealt with structured output failures in production, the pattern is familiar. I covered why structured output is the API contract for LLMs in a previous post about why AI agents fail in production. Small models break that contract more often.

Failure Mode 2: Confident Hallucination (High Risk)

Gemma 3 (27B) was the most dangerous result. It passed every structural check. JSON was valid. All sections were present. The report looked professional. It even cited reasonable-sounding numbers.

The problem: 10 of 15 verifiable claims were wrong.

| Claim | Gemma 3 Reported | Actual Value |
|---|---|---|
| Gold price | $2,155/oz | ~$5,175/oz |
| Hang Seng Index | 16,992 | 25,321 |
| AMZN weekly change | +9.21% | -9.53% |
| Dow Jones direction | -0.55% | +0.66% |
| S&P 500 change | -0.37% | +0.78% |
| Bitcoin price | $66,000 | ~$72,855 |
| Jobs report date | Released Mar 5 | Scheduled Mar 6 |

Gold was off by more than $3,000. Bitcoin was off by nearly $7,000.

Not just wrong. Wrong in the opposite direction.

Six of nine watchlist tickers were missing close prices entirely. The model either left them blank or filled them with fabricated numbers.

Failure Mode 3: Instruction Drift

Qwen 3 (14B) took a different path. It generated valid JSON. Search queries worked. It retrieved 12 to 15 results per step.

But it ignored the configured watchlist entirely. Instead of reporting on NVDA, AMZN, MSFT, and the other tickers I specified, it invented its own list: Duolingo, Sea Limited, Toyota, Alibaba, BP, Siemens.

It also hallucinated a detailed “Tomorrow’s Radar” section with specific event times that couldn’t be verified.

To its credit, Qwen 3 was mostly honest about missing data. The Markets Snapshot section showed “N/A” for nearly every close price. That honesty is more useful than Gemma 3’s confident wrong answers.


The Headline Gap

I call this the Headline Gap: search tools return context about a number, not the number itself.

When Gemma 3 searched DuckDuckGo for “S&P 500 today,” it got back something like:

Markets rally as investors digest Fed comments - Reuters
S&P 500 rises amid broader market recovery - CNBC

Headlines. Snippets. Context about the market. But not the actual number. This creates a failure loop:

Search → Context without numbers → Missing data
→ Model fills gap from training data → Confident guess

When the model can’t find the number, it generates the most plausible one from training data, which is already stale. No uncertainty flag. Just a confident, outdated number.

Research confirms this pattern. Even frontier models equipped with financial tools achieve only 67.4% accuracy on adversarial financial trading tasks. LLMs preferentially choose web search (55.5% of tool invocations) over authoritative data APIs, making them vulnerable to incomplete or misleading snippets.

Gemini avoids this because Google Search grounding returns structured data, not just snippets. When Gemini searches for a stock price, it gets the actual number from Google’s financial data sources. It isn’t that Gemini is “smarter”; it simply has better inputs. Local models using DuckDuckGo don’t have that advantage.


The Confidence Spectrum

The most important lesson from this experiment isn’t that local models hallucinate. It’s how they hallucinate.

Not all hallucinations are equal. Some are survivable. Some are production incidents.

| Model | Behavior | Production Risk |
|---|---|---|
| Qwen 3.5 (9B) | Can’t call tools | Obvious failure, easy to catch |
| Qwen 3 (14B) | Admits missing data (“N/A”) | Manageable, build error handling around it |
| Gemma 3 (27B) | Fabricates numbers confidently | Dangerous, passes all structural checks |

Qwen 3.5 couldn’t use its tools at all. Obvious failure. Easy to catch.

Qwen 3 used tools correctly, retrieved real results, but admitted when it didn’t have data. It showed “N/A” instead of guessing. You can build error handling around honest uncertainty. If you’re designing error handling patterns for agents, a model that says “I don’t know” is far easier to work with than one that says “$2,155” when the answer is $5,175.
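Honest uncertainty is machine-checkable in a way a fabricated number never is. A minimal sketch of building error handling around it, with hypothetical field names:

```python
# Sketch: treat "N/A" as an explicit, machine-checkable missing value during
# report validation. Field names are hypothetical illustrations.
def classify_field(value: str) -> str:
    """Label a report field 'missing' (flag it, retry it) or 'present'."""
    if value.strip().upper() in {"N/A", "NA", "UNKNOWN", ""}:
        return "missing"
    return "present"

report = {"NVDA_close": "N/A", "AMZN_close": "178.22"}
flags = {k: classify_field(v) for k, v in report.items()}
# flags == {"NVDA_close": "missing", "AMZN_close": "present"}
```

A "missing" flag can trigger a retry or a human review. A confidently wrong "$2,155" triggers nothing.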

Gemma 3 used tools correctly, retrieved real results, then confidently produced wrong numbers. No hedging. No uncertainty markers. Just polished, professional, incorrect data.

This is the most dangerous failure mode in production. It passes every structural validation. It looks right. Only a human with access to the real data would catch it.

If you’re testing AI agents, this is exactly the kind of failure that unit tests and structural evals miss. You need factual verification against ground truth data, which means your test suite needs access to the same live data sources your agent uses.


What This Means for Production

Don’t assume search equals grounding

Giving an agent a search tool doesn’t mean it will use search results correctly. If the search results don’t contain the specific data the model needs, it will fill in gaps from training data. Silently. Confidently.

If your agent needs numbers, web search is the wrong tool

For real-time financial data, weather, sports scores, or anything with a ground truth number, your agent needs a structured API that returns the actual data point. Not a news article about it. Not a snippet mentioning it. The number itself, from an authoritative source.
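In code, the difference is an endpoint that returns the field itself. A minimal sketch, where the URL and response shape are hypothetical stand-ins for whatever authoritative data API you actually use:

```python
# Sketch: ask a structured quote endpoint for the number itself instead of
# searching for articles about it. The base URL and payload shape below are
# hypothetical, not a real service.
import json
from urllib.request import urlopen

def parse_quote(payload: dict) -> float:
    """Return the close price from a structured response, or raise; never guess."""
    if "close" not in payload:
        raise KeyError("quote payload missing 'close'; refusing to guess")
    return float(payload["close"])

def fetch_close(symbol: str, base_url: str = "https://data.example.com") -> float:
    # The data point arrives as a field, not a snippet: there is no gap for
    # the model to fill from stale training data.
    with urlopen(f"{base_url}/v1/quote/{symbol}") as resp:
        return parse_quote(json.load(resp))
```

Raising on a missing field is the point: a loud `KeyError` is recoverable, a silently invented price is not.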

Model size matters for tool use reliability

The 9B model couldn’t even call tools correctly. The 14B and 27B models could, but they still struggled with the interpretation step. For agentic workflows that require reliable tool use and faithful synthesis, smaller local models are not ready.

Validate outputs against ground truth

If your agent produces numerical claims, you need a monitoring layer that checks those claims against authoritative data. This is especially true for financial, medical, or safety-critical domains where confident wrong answers cause real harm.
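The check itself can be simple. A sketch of a relative-tolerance comparison against a reference value (the 1% tolerance is an assumption; the reference feed is whatever authoritative source your domain requires):

```python
# Sketch of a monitoring check: flag numerical claims that drift beyond a
# relative tolerance from an authoritative reference value. The 1% default
# is an assumed threshold, not a recommendation for every domain.
def verify_claim(claimed: float, actual: float, tolerance: float = 0.01) -> bool:
    """True if the claim is within `tolerance` (relative) of ground truth."""
    if actual == 0:
        return claimed == 0
    return abs(claimed - actual) / abs(actual) <= tolerance

# Gemma 3's gold price vs. the actual value fails the check decisively:
assert verify_claim(claimed=2155, actual=5175) is False
```

Wire this into the same pipeline that renders the report, and a confident hallucination becomes an alert instead of a briefing.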

Honest uncertainty is a feature

A model that returns “N/A” when it doesn’t know is more valuable than one that returns a plausible but wrong number. When evaluating models for production, test for calibrated uncertainty, not just output quality.


When Local LLMs Make Sense

This isn’t a blanket case against local models. They work well for:

  • Text generation and summarization where training data is sufficient context
  • Code assistance where correctness is verifiable through execution
  • Classification tasks where outputs are constrained to known categories
  • Privacy-sensitive workflows where data cannot leave your network

They struggle with:

  • Real-time data tasks that require current, accurate numbers
  • Agentic workflows that depend on reliable tool use and faithful synthesis
  • High-stakes domains where confident hallucination causes harm

The question isn’t whether local models can replace cloud models.

The question is whether your agent is allowed to guess.

If your system produces numbers, guessing isn’t intelligence. It’s a production bug with a confident tone.

This is part of my series on building AI agents in production. The architectural fix is in Why AI Agents Need APIs, Not Search. For a deeper look at confident hallucinations passing every structural check, see My AI Agent Passed Every Check. 67% of It Was Wrong. Related: Why AI Agents Fail in Production, Error Handling Patterns, Testing AI Agents, and Monitoring AI Agents.

ai-agents llm production-systems
Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.