Why Local LLMs Hallucinate When Your AI Agent Has Search


I replaced Gemini with a local LLM for my AI research agent.

Three hours later it confidently told me:

Amazon was up 9.2% this week.

In reality it was down 9.5%.

The agent had web search. It still hallucinated.

TL;DR: I tested Gemini 2.5 Flash against three local Ollama models on a real-time market briefing agent. The best-performing local model (Gemma 3 27B) got 10 of 15 financial claims wrong, even with web search. The root cause: search returns headlines, not data. When the model can’t find the number, it fills in a confident guess from stale training data. For real-time factual tasks, local LLMs aren’t ready.

That experiment exposed a deeper problem with local LLMs and search grounding. The model had access to search, yet it still fabricated numbers, because the search results didn't contain the data it needed. Having a search tool and using search results correctly are two very different things.


The Experiment

I run an AI agent called Athena that generates daily market briefings. It pulls stock prices, macro indicators, sector news, and watchlist updates into a structured report every morning.

In production, it runs on Gemini 2.5 Flash with native Google Search grounding. It works. Reports are accurate, structured, and ready in about 2 to 3 minutes.

I wanted to see if local models could do the same job. The motivation was straightforward: lower cost, no API dependency, full data privacy. If a 14B or 27B parameter model running on Ollama could produce the same report, I could cut the cloud dependency entirely.

I tested four models on the same task: generate a Daily Market Briefing with current prices, macro data, sector analysis, and a watchlist update.

| Model | Provider | JSON Reliability | Search Results | Report Quality | Speed |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | Google AI | Native | Native Google Search | Production-grade | ~2-3 min |
| Qwen 3.5 (9B) | Ollama | 3/4 failed | 0-1 results | Unusable | ~36 min |
| Gemma 3 (27B) | Ollama | 4/4 passed | 12-15 results | Wrong data | ~15 min |
| Qwen 3 (14B) | Ollama | 4/4 passed | 12-15 results | Missing data | ~17 min |

The results were not just worse. They failed in completely different ways.


Where It Broke: Three Failure Modes

Failure Mode 1: Broken Tool Use

Qwen 3.5 (9B) couldn’t reliably produce the JSON needed to call search tools. Three out of four attempts threw a KeyError. The model would output something close to the expected format but with wrong keys or malformed structure.

When it did manage a search query, it sent the entire instruction prompt as the query string. DuckDuckGo returned zero or one results for a 200-word “query.”

This is the simplest failure mode. The model can’t use its tools. No tools, no grounding, no useful output. At 36 minutes per run on local hardware, it wasn’t worth debugging further.

If you’ve dealt with structured output failures in production, the pattern is familiar. I covered why structured output is the API contract for LLMs in a previous post. Small models break that contract more often.
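A thin validation layer catches this failure mode before it becomes a stack trace. Here is a minimal sketch, assuming a hypothetical tool-call contract with `tool` and `query` keys (the key names and the 20-word limit are illustrative, not from the actual agent):

```python
import json

REQUIRED_KEYS = {"tool", "query"}  # hypothetical tool-call contract


def parse_tool_call(raw: str):
    """Validate a model's tool-call JSON instead of indexing keys blindly."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(call, dict) or not REQUIRED_KEYS <= call.keys():
        return None  # valid JSON, wrong keys: the KeyError failure mode
    query = call.get("query")
    if not isinstance(query, str) or len(query.split()) > 20:
        return None  # entire instruction prompt sent as the query string
    return call
```

Rejecting a malformed call with `None` lets the agent retry or fail loudly, instead of crashing mid-run on a `KeyError`.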

Failure Mode 2: Confident Hallucination (The Dangerous One)

Gemma 3 (27B) was the most dangerous result. It passed every structural check. JSON was valid. All sections were present. The report looked professional. It even cited reasonable-sounding numbers.

The problem: 10 of 15 verifiable claims were wrong.

| Claim | Gemma 3 Reported | Actual Value |
| --- | --- | --- |
| Gold price | $2,155/oz | ~$5,175/oz |
| Hang Seng Index | 16,992 | 25,321 |
| AMZN weekly change | +9.21% | -9.53% |
| Dow Jones direction | -0.55% | +0.66% |
| S&P 500 change | -0.37% | +0.78% |
| Bitcoin price | $66,000 | ~$72,855 |
| Jobs report date | Released Mar 5 | Scheduled Mar 6 |

Gold was off by more than $3,000. Bitcoin was off by nearly $7,000. The Amazon number wasn’t just wrong, it was wrong in the opposite direction.

Six of nine watchlist tickers were missing close prices entirely. The model filled them with nothing or with fabricated numbers.

Failure Mode 3: Ignoring Instructions

Qwen 3 (14B) took a different path. It generated valid JSON. Search queries worked. It retrieved 12 to 15 results per step.

But it ignored the configured watchlist entirely. Instead of reporting on NVDA, AMZN, MSFT, and the other tickers I specified, it invented its own list: Duolingo, Sea Limited, Toyota, Alibaba, BP, Siemens.

It also hallucinated a detailed “Tomorrow’s Radar” section with specific event times that couldn’t be verified.

To its credit, Qwen 3 was mostly honest about missing data. The Markets Snapshot section showed “N/A” for nearly every close price. That honesty is more useful than Gemma 3’s confident wrong answers.


Why Search Didn't Help

When Gemma 3 searched DuckDuckGo for “S&P 500 today,” it got back something like:

  • Markets rally as investors digest Fed comments - Reuters
  • S&P 500 rises amid broader market recovery - CNBC

Headlines. Snippets. Context about the market. But not the actual number.

This exposes the core limitation of search grounding for local LLMs. Web search returns context, not data.

Headlines mention markets moving. They rarely include the exact number an agent needs. When the model can’t find the number, it generates the most plausible one from training data. Training data that was months old. No uncertainty flag. Just a confident, stale number.
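A toy illustration of that gap: scanning the snippets above for an actual index level finds nothing. The headlines mention the S&P 500 moving, but no snippet contains a number the agent could extract, and that is exactly the hole the model fills by guessing.

```python
import re

# Typical DuckDuckGo snippets for "S&P 500 today" (from the example above)
snippets = [
    "Markets rally as investors digest Fed comments - Reuters",
    "S&P 500 rises amid broader market recovery - CNBC",
]


def extract_index_level(texts):
    """Look for an actual index level (e.g. 5,123.45) in search snippets."""
    pattern = re.compile(r"\b\d{1,2},\d{3}(?:\.\d+)?\b")
    for text in texts:
        match = pattern.search(text)
        if match:
            return match.group()
    return None  # context found, number absent
```

`extract_index_level(snippets)` returns `None` here; only a snippet that actually states a level, like "S&P 500 closes at 5,123.45", would yield a value.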

Research confirms this pattern. Even frontier models equipped with financial tools achieve only 67.4% accuracy on adversarial financial trading tasks. LLMs preferentially choose web search (55.5% of tool invocations) over authoritative data APIs, making them vulnerable to incomplete or misleading snippets.

Gemini avoids this because Google Search grounding returns structured data, not just snippets. When Gemini searches for a stock price, it gets the actual number from Google’s financial data sources. Local models using DuckDuckGo don’t have that advantage.


The Confidence Spectrum

The most important lesson from this experiment isn’t that local models hallucinate. It’s how they hallucinate.

| Model | Behavior | Production Risk |
| --- | --- | --- |
| Qwen 3.5 (9B) | Can’t call tools | Obvious failure, easy to catch |
| Qwen 3 (14B) | Admits missing data (“N/A”) | Manageable, build error handling around it |
| Gemma 3 (27B) | Fabricates numbers confidently | Dangerous, passes all structural checks |

Qwen 3.5 couldn’t use its tools at all. Obvious failure. Easy to catch.

Qwen 3 used tools correctly, retrieved real results, but admitted when it didn’t have data. It showed “N/A” instead of guessing. You can build error handling around honest uncertainty. If you’re designing error handling patterns for agents, a model that says “I don’t know” is far easier to work with than one that says “$2,155” when the answer is $5,175.

Gemma 3 used tools correctly, retrieved real results, then confidently produced wrong numbers. No hedging. No uncertainty markers. Just polished, professional, incorrect data.

This is the most dangerous failure mode in production. It passes every structural validation. It looks right. Only a human with access to the real data would catch it.

If you’re testing AI agents, this is exactly the kind of failure that unit tests and structural evals miss. You need factual verification against ground truth data, which means your test suite needs access to the same live data sources your agent uses.
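A sketch of that kind of check: compare each numerical claim against a reference value within a relative tolerance. The `ground_truth` dict here is a stand-in for whatever authoritative source your test harness queries; in a real suite it would come from a live market data API, not from another LLM.

```python
def verify_claims(claims, ground_truth, rel_tolerance=0.01):
    """Compare an agent's numerical claims against authoritative values.

    Returns the list of (key, claimed, actual) triples that fail the check.
    """
    failures = []
    for key, claimed in claims.items():
        actual = ground_truth.get(key)
        if actual is None:
            continue  # no reference value available for this claim
        if abs(claimed - actual) > rel_tolerance * abs(actual):
            failures.append((key, claimed, actual))
    return failures


# Gemma 3's gold and AMZN numbers from the table above:
claims = {"gold_usd_oz": 2155.0, "amzn_weekly_pct": 9.21}
truth = {"gold_usd_oz": 5175.0, "amzn_weekly_pct": -9.53}
```

Both claims land far outside any reasonable tolerance, so `verify_claims(claims, truth)` flags both, even though the report containing them passed every structural check.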


What This Means for Production

Don’t assume search equals grounding

Giving an agent a search tool doesn’t mean it will use search results correctly. If the search results don’t contain the specific data the model needs, it will fill in gaps from training data. Silently. Confidently.

Structured data APIs beat web search for factual tasks

For real-time financial data, weather, sports scores, or anything with a ground truth number, your agent needs a tool that returns the actual data point. Not a news article about it. Not a snippet mentioning it. The number itself, from an authoritative source.
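A sketch of what such a tool looks like from the agent's side. Everything here is hypothetical: `get_quote`, the `example-feed` source, and the price are stand-ins for a real market data API (Alpha Vantage, Polygon, an exchange feed). The point is the shape of the return value: a number with provenance, or an explicit absence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quote:
    symbol: str
    price: float
    source: str  # provenance from an authoritative feed, not a headline


def get_quote(symbol: str) -> Optional[Quote]:
    """Hypothetical structured-data tool. In production this would call a
    market data API and return the actual data point, not a snippet."""
    feed = {"AMZN": 178.25}  # placeholder for a live API response
    price = feed.get(symbol)
    return Quote(symbol, price, "example-feed") if price is not None else None
```

The agent either gets a sourced number or an explicit `None` it must handle; there is no snippet to misread and no gap to fill with a guess.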

Model size matters for tool use reliability

The 9B model couldn’t even call tools correctly. The 14B and 27B models could, but they still struggled with the interpretation step. For agentic workflows that require reliable tool use and faithful synthesis, smaller local models are not ready.

Validate outputs against ground truth

If your agent produces numerical claims, you need a monitoring layer that checks those claims against authoritative data. This is especially true for financial, medical, or safety-critical domains where confident wrong answers cause real harm.

Honest uncertainty is a feature

A model that returns “N/A” when it doesn’t know is more valuable than one that returns a plausible but wrong number. When evaluating models for production, test for calibrated uncertainty, not just output quality.
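In code terms, that means treating missing data as a first-class value instead of coercing it into a number. A minimal sketch of the rendering side:

```python
def render_close(ticker: str, close) -> str:
    """Render a close price, treating 'unknown' as a legitimate outcome."""
    if close is None:
        return f"{ticker}: N/A (no close price retrieved)"
    return f"{ticker}: {close:.2f}"
```

An explicit `N/A` is renderable, auditable, and catchable downstream; a fabricated number is none of those.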


When Local LLMs Make Sense

This isn’t a blanket case against local models. They work well for:

  • Text generation and summarization where training data is sufficient context
  • Code assistance where correctness is verifiable through execution
  • Classification tasks where outputs are constrained to known categories
  • Privacy-sensitive workflows where data cannot leave your network

They struggle with:

  • Real-time data tasks that require current, accurate numbers
  • Agentic workflows that depend on reliable tool use and faithful synthesis
  • High-stakes domains where confident hallucination causes harm

The question isn’t whether local models can replace cloud models.

The question is whether your agent is allowed to guess.

If the task requires real numbers, guessing isn’t intelligence. It’s a production incident waiting to happen.


Key Takeaways

  1. Search tools don’t prevent hallucination. If search results return headlines instead of data, the model fills gaps from stale training data.

  2. The most dangerous model is the one that sounds confident. Gemma 3 got 10 of 15 claims wrong but produced a polished, professional report. Qwen 3 admitted what it didn’t know.

  3. Grounding requires structured data, not snippets. For real-time factual tasks, your agent needs APIs that return actual values, not search results that mention the topic.

  4. Test for factual accuracy, not just structure. Valid JSON and correct formatting don’t mean correct content. Build evals that check claims against ground truth.

  5. Local models aren’t ready for real-time data agents. For tasks that require current, accurate numerical data with tool use, cloud models with native grounding still have a significant edge.


This is part of my series on building AI agents in production. Related posts: Why AI Agents Fail in Production, Error Handling Patterns, Testing AI Agents, and Monitoring AI Agents.

Written by Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
