Hybrid search (BM25 + vector with RRF fusion) outperformed a regex query router in a 7-scenario benchmark on real PDFs: MRR 1.00 vs 0.67, at a latency cost of ~0.5ms. The router’s failure mode (“Router Trap”) returns zero results when a query looks structured but isn’t a verbatim phrase in the document.

Hybrid search in pdf-mcp runs two engines every time, keyword and semantic, and fuses their results with RRF (Reciprocal Rank Fusion: a scoring method that merges ranked lists by summing inverse ranks). That sounds wasteful. The obvious optimization is a query router: detect whether the query looks like a keyword lookup or a conceptual question, and fire only the right engine. Cheaper, simpler, and surely good enough.
I built a benchmark to confirm this. It confirmed the opposite.
pdf-mcp is an open-source MCP server that gives Claude and other AI agents structured access to PDF documents: search, extraction, and caching (how I built it). To validate hybrid search before shipping it, I ran a task-centric benchmark across seven scenarios on real public PDFs, with keyword, semantic, hybrid, and a regex-based router all measured side by side.
TL;DR: I benchmarked hybrid RRF against a query router on real PDFs across seven agentic scenarios. Hybrid scored MRR 1.00; the router scored 0.67. At 4ms latency, hybrid is the right default.
- The router’s single failure cost an agent a complete miss: not a bad result, nothing.
- The “true fusion” scenario is the clearest case: keyword found page 34, semantic found pages 36 and 39, only hybrid found all three. No routing strategy gets there.
- Hybrid costs ~3× keyword in latency. At 4ms, that’s the right tradeoff.
- Navigation by concept fails for every mode including hybrid. That’s an honest finding worth keeping.
Verdict: A router optimizes for average queries. Hybrid covers the edge cases that break agent workflows.
| Mode | MRR (Q&A) | Recall@10 (Context) | Latency | Best for |
|---|---|---|---|---|
| Keyword | 0.33 | 33% | ~1.5ms | Exact phrase lookups |
| Semantic | 1.00 | 67% | ~3.8ms | Conceptual questions |
| Hybrid | 1.00 | 100% | ~4.2ms | Mixed or unpredictable queries |
| Router | 0.67 | 67% | ~3.7ms | When query types are fully known at design time |
## Why Routing Seems Like the Right Answer
The intuition behind routing is sound. Some queries are clearly keyword-friendly: exact section titles, product codes, error identifiers. Others are clearly semantic: “why does the model generalize well?” or “what are the limitations of this approach?” If you can classify the query first, you avoid paying for the engine you don’t need.
A typical regex router looks like this:
```python
import re

def route(query: str) -> str:
    # Runs of uppercase letters, digits, and hyphens look like identifiers
    if re.search(r"[A-Z0-9\-]{4,}", query):
        return "keyword"
    return "semantic"
```
The logic: uppercase letters, digits, and hyphens grouped together signal a structured identifier such as a model name, a metric, or a standard. Route it to keyword. Everything else goes to semantic.
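Exercised on a few sample queries (the queries here are illustrative; the function is re-declared so the snippet runs standalone):

```python
import re

def route(query: str) -> str:
    # Same heuristic: identifier-like runs of 4+ chars go to keyword.
    if re.search(r"[A-Z0-9\-]{4,}", query):
        return "keyword"
    return "semantic"

print(route("BLEU-4 score table"))                   # → keyword
print(route("why does the model generalize well?"))  # → semantic
print(route("WMT-2014 generalization capability"))   # → keyword
```

The last query is the one that matters below: it routes to keyword on pure surface pattern, with no knowledge of what the document actually contains.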
This is a reasonable heuristic. A more sophisticated version might use an LLM to classify queries, but even LLM-based routers inherit the same fundamental problem: they pick one engine and inherit its blind spots. The regex version just makes the failure mode obvious. The problem is what happens the rest of the time.
(For a deeper comparison of when keyword vs. semantic search each wins on their own, see Semantic vs Keyword Search for AI Agents.)
## The Router Trap Pattern
Consider the query: WMT-2014 generalization capability
The regex sees WMT-2014: eight characters, all matching [A-Z0-9\-]. It routes to keyword.
Keyword search wraps all queries in FTS5 phrase syntax. “WMT-2014 generalization capability” is not a verbatim phrase in the paper. Keyword returns zero results.
The agent gets nothing. Not a bad answer: nothing.
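The phrase-wrapping mechanics are easy to reproduce with SQLite's FTS5 directly (available in Python's bundled sqlite3 on most builds; the sample sentence is illustrative, not text from the paper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
conn.execute(
    "INSERT INTO pages VALUES "
    "('We evaluate generalization on the WMT 2014 English-German task.')"
)

def phrase_search(q: str) -> int:
    # Wrapping the query in double quotes forces FTS5 phrase matching,
    # which is what keyword mode does to every query.
    row = conn.execute(
        "SELECT count(*) FROM pages WHERE pages MATCH ?", (f'"{q}"',)
    ).fetchone()
    return row[0]

print(phrase_search("WMT-2014"))                            # 1: tokens match
print(phrase_search("WMT-2014 generalization capability"))  # 0: not a verbatim phrase
```

The tokenizer splits "WMT-2014" into `wmt 2014`, so the identifier alone matches. The full query only matches if all four tokens appear consecutively, which they don't.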
Here’s the full benchmark result for this scenario (Scenario 1c, Mixed query on “Attention Is All You Need”):
```
Query: 'WMT-2014 generalization capability'   K=5

Mode          Recall@5      RR     First hit  Top-5 pages
────────────  ────────────  ─────  ─────────  ────────────────────
keyword       0/1    0%     0.00   ∞          (none)
semantic      1/1  100%     1.00   rank 1     8, 10, 9, 12, 11
hybrid        1/1  100%     1.00   rank 1     8, 10, 9, 12, 11
router(key)   0/1    0%     0.00   ∞          (none)
```
The router looked confident. It was completely wrong. Semantic finds page 8 (the Machine Translation results table, which discusses newstest2014 results). Hybrid inherits that find via RRF. The router fires keyword and returns an empty result set.
This is the Router Trap: a query that looks structured (a model name, a year, a standard abbreviation) but that the document never contains as a verbatim phrase. The heuristic fires precisely because the query looks pattern-like, and it fails for exactly the same reason.
The router’s failure rate in this benchmark: one scenario out of three Q&A scenarios. That doesn’t sound like much. But the consequences of a miss for an agent running autonomously are not “slightly worse answer”: wrong tool call, agent backtracks, user gets nothing. This is one of the silent failure patterns that make AI agents hard to debug in production.
Q&A Group MRR results:
| Mode | MRR |
|---|---|
| keyword | 0.33 |
| semantic | 1.00 |
| hybrid | 1.00 |
| router | 0.67 |
## The True Fusion Scenario
The router failure above is about a single relevant page. The context-building scenarios reveal a deeper problem: what happens when no single engine has full coverage.
Scenario 2b queries the GPT-3 paper (“Language Models are Few-Shot Learners”) with bias fairness. The relevant pages are 34, 36, and 39: the Broader Impacts section and its subsections on fairness and representation challenges.
```
Query: 'bias fairness'   K=10
Relevant pages: [34, 36, 39]

Mode          Recall@10     Top-10 pages
────────────  ────────────  ─────────────────────────────────
keyword       1/3   33%     34, 6
semantic      2/3   67%     39, 36, 37, 43, 6, 73, 44, 27...
hybrid        3/3  100%     6, 34, 39, 36, 37, 43, 73, 44...
router(sem)   2/3   67%     39, 36, 37, 43, 6, 73, 44, 27...
```
Keyword finds page 34: the Broader Impacts overview section, which mentions “bias, fairness” in its summary. It misses 36 and 39 (the detailed subsections) because those pages use language like “representation” and “challenges” without repeating the exact two-word phrase.
Semantic finds pages 36 and 39: the embedding model picks up the conceptual content of those subsections. It misses page 34, the section introduction with the verbatim phrase, possibly because the surrounding context dilutes the signal.
Hybrid finds all three. Neither engine alone had full coverage. RRF combined their signals: keyword’s exact-match hit on 34, semantic’s conceptual reach to 36 and 39. The agent gets the complete picture.
This is the True Fusion scenario: keyword finds {A}, semantic finds {B, C}, hybrid finds {A, B, C}. No routing strategy can reproduce this result. Routing picks one engine and inherits its blind spots. Fusion doesn’t.
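The fusion arithmetic can be checked by hand. A generic RRF scorer (k=60, the value pdf-mcp ships with; this is a re-implementation for illustration, not pdf-mcp's internal code) applied to the two engine rankings from the table reproduces the hybrid row, including page 6 rising to rank 1 because it appears in both lists:

```python
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each list contributes 1 / (k + rank) per page; k damps the
    # dominance of any single engine's top result.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_pages = [34, 6]                           # exact-match hits
semantic_pages = [39, 36, 37, 43, 6, 73, 44, 27]  # conceptual hits
fused = rrf([keyword_pages, semantic_pages])
print(fused[:8])  # → [6, 34, 39, 36, 37, 43, 73, 44]
```

Pages 34 and 39 tie at 1/61 and fall to insertion order here; the benchmark's hybrid row shows the same ordering. All three relevant pages land in the top four of the fused list even though no single engine found more than two of them.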
## The Honest Finding: When Even Hybrid Fails
The navigation group tests a harder problem: following a cross-reference by concept.
Scenario 3b queries “Attention Is All You Need” with: the parallelization advantage over sequential recurrence. The relevant page is page 6: the “Why Self-Attention” section, which formally analyzes parallelization complexity vs. recurrent layers.
```
Mode          Recall@3      RR     Top-3 pages
────────────  ────────────  ─────  ─────────────
keyword       0/1    0%     0.00   (none)
semantic      1/1  100%     0.50   2, 6, 11
hybrid        1/1  100%     0.50   2, 6, 11
router(sem)   1/1  100%     0.50   2, 6, 11
```
Hybrid finds page 6. But it finds it at rank 2, not rank 1. Recall@1 = 0 for all modes, including hybrid.
The introduction (page 2) ranks first. It mentions parallelization in passing. Page 6 is the formal analysis. Semantic and hybrid score both in the top-3, but not at position 1.
For a navigation task where the agent needs to land on exactly the right page, this is a failure, if a soft one: the right page sits at rank 2, and a human reading the top two results would find it. But an agent acting on rank 1 alone would read the introduction and call it done.
This finding is worth stating plainly: hybrid search isn't a universal win. When the relevant page shares conceptual territory with nearby pages, even embedding-based retrieval can rank a related page above the target.
Navigation Summary (Recall@1):
| Scenario | kw | sem | hybrid | router |
|---|---|---|---|---|
| 3a: Exact section heading | ✓ | ✓ | ✓ | ✓ |
| 3b: Cross-reference by concept | ✗ | ✗ | ✗ | ✗ |
All modes pass on the exact heading query. All modes fail at Recall@1 for cross-reference by concept. That’s a leveled playing field that deserves acknowledgment.
## The Latency Cost Is Real but Acceptable
Hybrid search runs two engines per query. The latency numbers from the benchmark (3 warm-cache runs, median):
| Task Group | keyword | semantic | hybrid | router |
|---|---|---|---|---|
| Q&A | 1.2ms | 3.4ms | 4.2ms | 3.7ms |
| Context Building | 1.8ms | 4.8ms | 5.1ms | 4.4ms |
| Navigation | 1.6ms | 3.3ms | 3.4ms | 2.9ms |
Hybrid is roughly 2–3.5× slower than keyword alone, depending on task group. At 4–5ms absolute, that's not a meaningful latency budget concern for document retrieval in an agent tool call. The LLM inference that follows will take 1–10 seconds. The search is not the bottleneck.
The router’s latency advantage (2.9–4.4ms vs hybrid’s 3.4–5.1ms) is real but thin. You’re not trading reliability for speed. You’re trading reliability for about half a millisecond.
## The Benchmark Design
Standard RAG benchmarks test retrieval on dense vector stores. This benchmark is structured around how AI agents actually use search, covering three agentic task types with distinct metrics:
| Task Type | Agent Behavior | Primary Metric |
|---|---|---|
| Q&A | Issues one query, acts on first hit | MRR (agent stops at first relevant result) |
| Context Building | Issues one query, reads all K results | Recall@K (agent needs completeness) |
| Navigation | Follows a reference to a specific location | Recall@1 (exact page matters) |
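The two ranking metrics are compact enough to define inline (a sketch; ranked lists are page numbers in result order, and MRR is simply the mean of per-query reciprocal ranks):

```python
def reciprocal_rank(ranked: list[int], relevant: set[int]) -> float:
    # RR: 1 / position of the first relevant result, 0 if none found.
    for i, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    # Fraction of ground-truth pages present in the top-k results.
    return len(set(ranked[:k]) & relevant) / len(relevant)

# Scenario 3b's shape: the right page (6) at rank 2 gives RR = 0.5,
# Recall@3 = 1.0, but Recall@1 = 0.0.
print(reciprocal_rank([2, 6, 11], {6}))  # 0.5
print(recall_at_k([2, 6, 11], {6}, 3))   # 1.0
print(recall_at_k([2, 6, 11], {6}, 1))   # 0.0
```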
All seven scenarios run against real public PDFs: “Attention Is All You Need” (Vaswani et al., 2017) and “Language Models are Few-Shot Learners” (Brown et al., 2020), with ground truth manually annotated. No synthetic tokens. No toy documents.
A k-sensitivity sweep on Scenario 1b (k=10, 30, 60, 120) confirmed this isn’t a top-K tuning problem: keyword stays at 0% recall regardless of how many results you request. Increasing K doesn’t fix a mode mismatch. It just returns more irrelevant pages.
## What This Means for pdf-mcp
pdf-mcp’s pdf_search tool now defaults to mode="auto" (hybrid RRF, k=60). Keyword and semantic run in parallel; results are fused via RRF; the top-K pages are returned ranked by combined score. The routing logic described here is available as a comparison point, not a recommendation.
The implementation cost of hybrid over routing is one extra search call per query. The reliability benefit, as the benchmark shows, is avoiding complete misses on structurally tricky queries.
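In sketch form, auto mode amounts to two parallel engine calls plus a fuse. The engine functions below are stand-ins returning fixed pages; the real implementations query the FTS5 index and the embedding index:

```python
from concurrent.futures import ThreadPoolExecutor

def keyword_search(query: str) -> list[int]:
    # Stand-in: the real engine runs an FTS5 phrase query.
    return [34, 6]

def semantic_search(query: str) -> list[int]:
    # Stand-in: the real engine ranks pages by embedding similarity.
    return [39, 36, 37, 43, 6]

def pdf_search_auto(query: str, k: int = 60, top: int = 10) -> list[int]:
    # Run both engines in parallel, then fuse their rankings with RRF.
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw = pool.submit(keyword_search, query)
        sem = pool.submit(semantic_search, query)
        rankings = [kw.result(), sem.result()]
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(pdf_search_auto("bias fairness"))
```

The extra engine call is the entire marginal cost; everything after it is dictionary arithmetic.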
If you’re building an agent that reads PDFs, the question isn’t “keyword or semantic?” It’s “can your search mode survive a query type you didn’t design for?” Hybrid can.
## The Named Patterns
Two patterns from this benchmark worth naming:
The Router Trap: A query that looks syntactically structured (uppercase, digits, hyphens) routes to keyword. The document doesn’t contain the query as a verbatim phrase. The agent gets nothing. This isn’t a rare edge case: any query containing an acronym, year, or proper noun in mixed-case text can trigger it.
True Fusion: Keyword finds page set A. Semantic finds page set B. A and B don’t fully overlap. Only hybrid, via RRF, finds A ∪ B. No routing strategy reaches this result. It requires running both engines and fusing, which is exactly what routing was designed to avoid.
## Conclusion
The router looks smart on paper. It probably works fine most of the time. But “most of the time” is not the right bar for an agent that users trust to autonomously read documents and synthesize answers.
The Router Trap and True Fusion scenarios aren’t invented edge cases. They’re the kinds of queries real agents issue against real documents. One involves a dataset name. The other involves a cross-cutting topic scattered across pages that no single engine fully covers.
Hybrid RRF doesn’t know which queries are easy or hard. It runs both engines, fuses the results, and handles both. At 4–5ms per query, with less than 1ms latency advantage for routing, that’s not a tradeoff worth making.
If your agent is using a query router, make sure you’re logging what it routes. You might be surprised what it silently discards.
The benchmark code and ground truth annotations are available in the pdf-mcp repository under scripts/benchmark_rrf.py and benchmark_data/ground_truth.json.