Hybrid search (BM25 + vector with RRF fusion) outperformed a regex query router in a 7-scenario benchmark on real PDFs: MRR 1.00 vs 0.67, at a latency cost of ~0.5ms. The router’s failure mode (“Router Trap”) returns zero results when a query looks structured but isn’t a verbatim phrase in the document.

Hybrid search in pdf-mcp runs two engines every time, keyword and semantic, and fuses their results with RRF (Reciprocal Rank Fusion: a scoring method that merges ranked lists by summing inverse ranks). That sounds wasteful. The obvious optimization is a query router: detect whether the query looks like a keyword lookup or a conceptual question, and fire only the right engine. Cheaper, simpler, and surely good enough.
I built a benchmark to confirm this. It confirmed the opposite.
pdf-mcp is an open-source MCP server that gives Claude and other AI agents structured access to PDF documents: search, extraction, and caching (how I built it). To validate hybrid search before shipping it, I ran a task-centric benchmark across seven scenarios on real public PDFs, with keyword, semantic, hybrid, and a regex-based router all measured side by side.
TL;DR: I benchmarked hybrid RRF against a query router on real PDFs across seven agentic scenarios. Hybrid scored MRR 1.00; the router scored 0.67. At 4ms latency, hybrid is the right default.
- The router’s single failure cost an agent a complete miss: not a bad result, nothing.
- The “true fusion” scenario is the clearest case: keyword found page 34, semantic found pages 36 and 39, only hybrid found all three. No routing strategy gets there.
- Hybrid costs ~3× keyword in latency. At 4ms, that’s the right tradeoff.
- Navigation by concept fails for every mode including hybrid. That’s an honest finding worth keeping.
Verdict: A router optimizes for average queries. Hybrid covers the edge cases that break agent workflows.
| Mode | MRR (Q&A) | Recall@10 (Context) | Latency | Best for |
|---|---|---|---|---|
| Keyword | 0.33 | 33% | ~1.5ms | Exact phrase lookups |
| Semantic | 1.00 | 67% | ~3.8ms | Conceptual questions |
| Hybrid | 1.00 | 100% | ~4.2ms | Mixed or unpredictable queries |
| Router | 0.67 | 67% | ~3.7ms | When query types are fully known at design time |
## Why Routing Seems Like the Right Answer
The intuition behind routing is sound. Some queries are clearly keyword-friendly: exact section titles, product codes, error identifiers. Others are clearly semantic: “why does the model generalize well?” or “what are the limitations of this approach?” If you can classify the query first, you avoid paying for the engine you don’t need.
A typical regex router looks like this:
```python
import re

def route(query: str) -> str:
    # Runs of uppercase letters, digits, and hyphens look like identifiers
    if re.search(r"[A-Z0-9\-]{4,}", query):
        return "keyword"
    return "semantic"
```
The logic: uppercase letters, digits, and hyphens grouped together signal a structured identifier such as a model name, a metric, or a standard. Route it to keyword. Everything else goes to semantic.
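Exercised on a few sample queries (the queries here are illustrative; the function is re-declared so the snippet runs standalone):

```python
import re

def route(query: str) -> str:
    # Same heuristic: identifier-like runs of 4+ chars go to keyword.
    if re.search(r"[A-Z0-9\-]{4,}", query):
        return "keyword"
    return "semantic"

print(route("BLEU-4 score table"))                   # → keyword
print(route("why does the model generalize well?"))  # → semantic
print(route("WMT-2014 generalization capability"))   # → keyword
```

The last query is the one that matters below: it routes to keyword on pure surface pattern, with no knowledge of what the document actually contains.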
This is a reasonable heuristic. A more sophisticated version might use an LLM to classify queries, but even LLM-based routers inherit the same fundamental problem: they pick one engine and inherit its blind spots. The regex version just makes the failure mode obvious. The problem is what happens the rest of the time.
(For a deeper comparison of when keyword vs. semantic search each wins on their own, see Semantic vs Keyword Search for AI Agents.)
## The Router Trap Pattern
Consider the query: WMT-2014 generalization capability
The regex sees WMT-2014: eight characters, all matching [A-Z0-9\-]. It routes to keyword.
Keyword search wraps all queries in FTS5 phrase syntax. “WMT-2014 generalization capability” is not a verbatim phrase in the paper. Keyword returns zero results.
The agent gets nothing. Not a bad answer: nothing.
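The phrase-wrapping mechanics are easy to reproduce with SQLite's FTS5 directly (available in Python's bundled sqlite3 on most builds; the sample sentence is illustrative, not text from the paper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
conn.execute(
    "INSERT INTO pages VALUES "
    "('We evaluate generalization on the WMT 2014 English-German task.')"
)

def phrase_search(q: str) -> int:
    # Wrapping the query in double quotes forces FTS5 phrase matching,
    # which is what keyword mode does to every query.
    row = conn.execute(
        "SELECT count(*) FROM pages WHERE pages MATCH ?", (f'"{q}"',)
    ).fetchone()
    return row[0]

print(phrase_search("WMT-2014"))                            # 1: tokens match
print(phrase_search("WMT-2014 generalization capability"))  # 0: not a verbatim phrase
```

The tokenizer splits "WMT-2014" into `wmt 2014`, so the identifier alone matches. The full query only matches if all four tokens appear consecutively, which they don't.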
Here’s the full benchmark result for this scenario (Scenario 1c, Mixed query on “Attention Is All You Need”):
```
Query: 'WMT-2014 generalization capability'   K=5

Mode          Recall@5      RR     First hit  Top-5 pages
────────────  ────────────  ─────  ─────────  ────────────────────
keyword       0/1    0%     0.00   ∞          (none)
semantic      1/1  100%     1.00   rank 1     8, 10, 9, 12, 11
hybrid        1/1  100%     1.00   rank 1     8, 10, 9, 12, 11
router(key)   0/1    0%     0.00   ∞          (none)
```
The router looked confident. It was completely wrong. Semantic finds page 8 (the Machine Translation results table, which discusses newstest2014 results). Hybrid inherits that find via RRF. The router fires keyword and returns an empty result set.
This is the Router Trap: a query that looks structured (a model name, a year, a standard abbreviation) but that the document never contains as a verbatim phrase. The heuristic fires precisely because the query looks pattern-like, and it fails for exactly the same reason.
The router’s failure rate in this benchmark: one scenario out of three Q&A scenarios. That doesn’t sound like much. But the consequences of a miss for an agent running autonomously are not “slightly worse answer”: wrong tool call, agent backtracks, user gets nothing. This is one of the silent failure patterns that make AI agents hard to debug in production.
Q&A Group MRR results:
| Mode | MRR |
|---|---|
| keyword | 0.33 |
| semantic | 1.00 |
| hybrid | 1.00 |
| router | 0.67 |
## The True Fusion Scenario
The router failure above is about a single relevant page. The context-building scenarios reveal a deeper problem: what happens when no single engine has full coverage.
Scenario 2b queries the GPT-3 paper (“Language Models are Few-Shot Learners”) with bias fairness. The relevant pages are 34, 36, and 39: the Broader Impacts section and its subsections on fairness and representation challenges.
```
Query: 'bias fairness'   K=10
Relevant pages: [34, 36, 39]

Mode          Recall@10     Top-10 pages
────────────  ────────────  ─────────────────────────────────
keyword       1/3   33%     34, 6
semantic      2/3   67%     39, 36, 37, 43, 6, 73, 44, 27...
hybrid        3/3  100%     6, 34, 39, 36, 37, 43, 73, 44...
router(sem)   2/3   67%     39, 36, 37, 43, 6, 73, 44, 27...
```
Keyword finds page 34: the Broader Impacts overview section, which mentions “bias, fairness” in its summary. It misses 36 and 39 (the detailed subsections) because those pages use language like “representation” and “challenges” without repeating the exact two-word phrase.
Semantic finds pages 36 and 39: the embedding model picks up the conceptual content of those subsections. It misses page 34, the section introduction with the verbatim phrase, possibly because the surrounding context dilutes the signal.
Hybrid finds all three. Neither engine alone had full coverage. RRF combined their signals: keyword’s exact-match hit on 34, semantic’s conceptual reach to 36 and 39. The agent gets the complete picture.
This is the True Fusion scenario: keyword finds {A}, semantic finds {B, C}, hybrid finds {A, B, C}. No routing strategy can reproduce this result. Routing picks one engine and inherits its blind spots. Fusion doesn’t.
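The fusion arithmetic can be checked by hand. A generic RRF scorer (k=60, the value pdf-mcp ships with; this is a re-implementation for illustration, not pdf-mcp's internal code) applied to the two engine rankings from the table reproduces the hybrid row, including page 6 rising to rank 1 because it appears in both lists:

```python
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each list contributes 1 / (k + rank) per page; k damps the
    # dominance of any single engine's top result.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_pages = [34, 6]                           # exact-match hits
semantic_pages = [39, 36, 37, 43, 6, 73, 44, 27]  # conceptual hits
fused = rrf([keyword_pages, semantic_pages])
print(fused[:8])  # → [6, 34, 39, 36, 37, 43, 73, 44]
```

Pages 34 and 39 tie at 1/61 and fall to insertion order here; the benchmark's hybrid row shows the same ordering. All three relevant pages land in the top four of the fused list even though no single engine found more than two of them.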
## The Honest Finding: When Even Hybrid Fails
The navigation group tests a harder problem: following a cross-reference by concept.
Scenario 3b queries “Attention Is All You Need” with: the parallelization advantage over sequential recurrence. The relevant page is page 6: the “Why Self-Attention” section, which formally analyzes parallelization complexity vs. recurrent layers.
```
Mode          Recall@3      RR     Top-3 pages
────────────  ────────────  ─────  ─────────────
keyword       0/1    0%     0.00   (none)
semantic      1/1  100%     0.50   2, 6, 11
hybrid        1/1  100%     0.50   2, 6, 11
router(sem)   1/1  100%     0.50   2, 6, 11
```
Hybrid finds page 6. But it finds it at rank 2, not rank 1. Recall@1 = 0 for all modes, including hybrid.
The introduction (page 2) ranks first. It mentions parallelization in passing. Page 6 is the formal analysis. Semantic and hybrid score both in the top-3, but not at position 1.
For a navigation task where the agent needs to land on exactly the right page, this is a failure, if a soft one: the right page sits at rank 2, and a human reading the top two results would find it. But an agent acting on rank 1 alone would read the introduction and call it done.
This finding is worth stating plainly: hybrid search isn't a universal win. When the relevant page shares conceptual territory with nearby pages, even embedding-based retrieval can rank a related page above the target.
Navigation Summary (Recall@1):
| Scenario | kw | sem | hybrid | router |
|---|---|---|---|---|
| 3a: Exact section heading | ✓ | ✓ | ✓ | ✓ |
| 3b: Cross-reference by concept | ✗ | ✗ | ✗ | ✗ |
All modes pass on the exact heading query. All modes fail at Recall@1 for cross-reference by concept. That’s a leveled playing field that deserves acknowledgment.
## The Latency Cost Is Real but Acceptable
Hybrid search runs two engines per query. The latency numbers from the benchmark (3 warm-cache runs, median):
| Task Group | keyword | semantic | hybrid | router |
|---|---|---|---|---|
| Q&A | 1.2ms | 3.4ms | 4.2ms | 3.7ms |
| Context Building | 1.8ms | 4.8ms | 5.1ms | 4.4ms |
| Navigation | 1.6ms | 3.3ms | 3.4ms | 2.9ms |
Hybrid is roughly 2–3.5× slower than keyword alone, depending on task group. At 4–5ms absolute, that's not a meaningful latency budget concern for document retrieval in an agent tool call. The LLM inference that follows will take 1–10 seconds. The search is not the bottleneck.
The router’s latency advantage (2.9–4.4ms vs hybrid’s 3.4–5.1ms) is real but thin. You’re not trading reliability for speed. You’re trading reliability for about half a millisecond.
## The Benchmark Design
Standard RAG benchmarks test retrieval on dense vector stores. This benchmark is structured around how AI agents actually use search, covering three agentic task types with distinct metrics:
| Task Type | Agent Behavior | Primary Metric |
|---|---|---|
| Q&A | Issues one query, acts on first hit | MRR (agent stops at first relevant result) |
| Context Building | Issues one query, reads all K results | Recall@K (agent needs completeness) |
| Navigation | Follows a reference to a specific location | Recall@1 (exact page matters) |
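The two ranking metrics are compact enough to define inline (a sketch; ranked lists are page numbers in result order, and MRR is simply the mean of per-query reciprocal ranks):

```python
def reciprocal_rank(ranked: list[int], relevant: set[int]) -> float:
    # RR: 1 / position of the first relevant result, 0 if none found.
    for i, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    # Fraction of ground-truth pages present in the top-k results.
    return len(set(ranked[:k]) & relevant) / len(relevant)

# Scenario 3b's shape: the right page (6) at rank 2 gives RR = 0.5,
# Recall@3 = 1.0, but Recall@1 = 0.0.
print(reciprocal_rank([2, 6, 11], {6}))  # 0.5
print(recall_at_k([2, 6, 11], {6}, 3))   # 1.0
print(recall_at_k([2, 6, 11], {6}, 1))   # 0.0
```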
All seven scenarios run against real public PDFs: “Attention Is All You Need” (Vaswani et al., 2017) and “Language Models are Few-Shot Learners” (Brown et al., 2020), with ground truth manually annotated. No synthetic tokens. No toy documents.
A k-sensitivity sweep on Scenario 1b (k=10, 30, 60, 120) confirmed this isn’t a top-K tuning problem: keyword stays at 0% recall regardless of how many results you request. Increasing K doesn’t fix a mode mismatch. It just returns more irrelevant pages.
## What This Means for pdf-mcp
pdf-mcp’s pdf_search tool now defaults to mode="auto" (hybrid RRF, k=60). Keyword and semantic run in parallel; results are fused via RRF; the top-K pages are returned ranked by combined score. The routing logic described here is available as a comparison point, not a recommendation.
The implementation cost of hybrid over routing is one extra search call per query. The reliability benefit, as the benchmark shows, is avoiding complete misses on structurally tricky queries.
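In sketch form, auto mode amounts to two parallel engine calls plus a fuse. The engine functions below are stand-ins returning fixed pages; the real implementations query the FTS5 index and the embedding index:

```python
from concurrent.futures import ThreadPoolExecutor

def keyword_search(query: str) -> list[int]:
    # Stand-in: the real engine runs an FTS5 phrase query.
    return [34, 6]

def semantic_search(query: str) -> list[int]:
    # Stand-in: the real engine ranks pages by embedding similarity.
    return [39, 36, 37, 43, 6]

def pdf_search_auto(query: str, k: int = 60, top: int = 10) -> list[int]:
    # Run both engines in parallel, then fuse their rankings with RRF.
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw = pool.submit(keyword_search, query)
        sem = pool.submit(semantic_search, query)
        rankings = [kw.result(), sem.result()]
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(pdf_search_auto("bias fairness"))
```

The extra engine call is the entire marginal cost; everything after it is dictionary arithmetic.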
If you’re building an agent that reads PDFs, the question isn’t “keyword or semantic?” It’s “can your search mode survive a query type you didn’t design for?” Hybrid can.
## The Named Patterns
Two patterns from this benchmark worth naming:
The Router Trap: A query that looks syntactically structured (uppercase, digits, hyphens) routes to keyword. The document doesn’t contain the query as a verbatim phrase. The agent gets nothing. This isn’t a rare edge case: any query containing an acronym, year, or proper noun in mixed-case text can trigger it.
True Fusion: Keyword finds page set A. Semantic finds page set B. A and B don’t fully overlap. Only hybrid, via RRF, finds A ∪ B. No routing strategy reaches this result. It requires running both engines and fusing, which is exactly what routing was designed to avoid.
## Conclusion
The router looks smart on paper. It probably works fine most of the time. But “most of the time” is not the right bar for an agent that users trust to autonomously read documents and synthesize answers.
The Router Trap and True Fusion scenarios aren’t invented edge cases. They’re the kinds of queries real agents issue against real documents. One involves a dataset name. The other involves a cross-cutting topic scattered across pages that no single engine fully covers.
Hybrid RRF doesn’t know which queries are easy or hard. It runs both engines, fuses the results, and handles both. At 4–5ms per query, with less than 1ms latency advantage for routing, that’s not a tradeoff worth making.
If your agent is using a query router, make sure you’re logging what it routes. You might be surprised what it silently discards.
The benchmark code and ground truth annotations are available in the pdf-mcp repository under scripts/benchmark_rrf.py and benchmark_data/ground_truth.json.