Section-Level RAG: Why BM25 Beat Hybrid Search in My Benchmark

Page-grain hybrid search won my last benchmark (Hybrid Search vs Query Routing). Section-grain hybrid search lost the next one. Same fusion technique, opposite verdict, because the granularity changed which engine owned the lexical signal.

The intuition trap is hard to resist. If hybrid (BM25 + dense, fused with RRF) beats keyword-only at page grain, and section grain beats page grain (Section Chunking vs Page Chunking), then hybrid-at-section grain looks like the obvious next step. Most teams would have shipped it. I built a kill-switch before the feature, ran a small confirmation benchmark, and watched it fail decisively enough that I never wrote the production code.

TL;DR: Adding semantic fusion at section grain caused a 33% lexical regression (0.93 → 0.63 MRR at section grain) on a 45-query benchmark across three arxiv PDFs. Section titles are the lexical signal, so RRF demoted BM25’s rank-1 hits without compensating from the semantic side. Recommendation: keyword-only at section grain. The pdf-mcp hybrid-section roadmap item was rejected after Phase 1 validation.

The 33% regression is the within-category number, not an inflated cell-aggregate. Fusion broke the thing it was supposed to leave alone.
Across all categories, BM25-only keyword-section was the strongest cell overall (0.53 mean MRR vs 0.36 for hybrid). It is not a marginal call.
The granularity flip from the page-grain result is the actual finding: same fusion, opposite verdict, because BM25 already owns the lexical signal at section grain.
The kill-switch process that produced this result saved an estimated 25 to 35 hours of implementation work for a feature with no value claim.

Verdict: Pick the search technique your chunk unit actually needs. Fusion is not a universal upgrade.

The result

The benchmark covers four cells, the full 2×2 over {keyword, hybrid} × {page, section}. Here, hybrid is RRF (k=60) over BM25 + BAAI/bge-small-en-v1.5 dense embeddings, the same configuration that won at page grain in the previous benchmark. The corpus is three arxiv PDFs: a graph neural networks review, a large language model survey, and the GPT-3 paper. Queries are split into three categories per PDF: lexical (section-title style), paraphrase-semantic (different vocabulary, same meaning), and mixed-distractor (the tokens appear in a wrong-meaning section while the conceptual match is elsewhere).

Grouped bar chart of per-cell MRR by query category. On lexical queries keyword-section scores 0.93 while hybrid-section drops to 0.63; on mixed-distractor keyword-section scores 0.65 versus hybrid-section 0.38; overall keyword-section leads at 0.53 versus hybrid-section 0.36.

Per-cell mean reciprocal rank (micro-mean across all 45 queries):

cell	lexical	paraphrase	mixed-distractor	all
keyword-page	0.44	0.00	0.32	0.25
hybrid-page	0.39	0.19	0.29	0.29
keyword-section	0.93	0.00	0.65	0.53
hybrid-section	0.63	0.07	0.38	0.36
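To pin down the metric behind the table, here is a minimal sketch of mean reciprocal rank. It assumes a query whose gold section is missing from the returned list scores 0.0, which is the usual convention but is not stated for this benchmark. The section ids and the three sample queries are made up for illustration.

```python
from typing import Hashable, List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], gold_id: Hashable) -> float:
    """1 / rank of the gold item (1-indexed), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[Hashable], Hashable]]) -> float:
    """Micro-mean: every query counts once, regardless of category or PDF."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


# Hypothetical results for three queries: (ranked section ids, gold section id).
runs = [
    (["s3", "s1", "s7"], "s3"),  # gold at rank 1 -> 1.0
    (["s2", "s5", "s9"], "s9"),  # gold at rank 3 -> 1/3
    (["s4", "s6", "s8"], "s1"),  # gold not retrieved -> 0.0
]
print(round(mean_reciprocal_rank(runs), 4))  # 0.4444
```

Under this convention a cell at 0.00 means the gold section never appeared in the results for any query in that category.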

Walk the table by column.

Lexical column. keyword-section lands at 0.93 MRR, near-perfect. hybrid-section drops to 0.63. That is a 33% relative regression (computed from the unrounded 0.9333 and 0.6250) on the category fusion was supposed to leave alone. RRF didn’t fail to add value here. It actively removed value that BM25 had already produced.

Paraphrase column. Both section-grain cells are at or near floor (0.00 keyword-only, 0.07 hybrid). This is the category where semantic fusion is supposed to shine. A 0.07 lift on a 15-query subset is within noise, and I would not expect it to survive a larger corpus. Page-grain hybrid is materially better here (0.19): semantic fusion does add value, but at the wrong granularity to claim it for the section feature.

Mixed-distractor column. This is the category my Phase 1 gate was specifically designed around: a query whose surface tokens match the wrong section, where the right section requires conceptual matching. Hybrid was supposed to win here by construction. keyword-section scored 0.65; hybrid-section scored 0.38. The cell that was built to win the category lost it by 0.27 MRR.

Across all categories, BM25-only keyword-section was the strongest cell overall (0.53 mean MRR vs 0.36 for hybrid). It is not a marginal call.

Caveat (the methodological asterisk). The 45-query corpus is a calibration run, not the 180-query frozen gold corpus the spec called for. The spec’s frozen-corpus protocol was deliberately waived: I authored all 45 queries in a single sitting, as the spec author who knew the gate shape and could (consciously or otherwise) bias queries toward hybrid winning. That bias should help hybrid; it didn’t. The failure margins are 3.7× and 5.2× the gate thresholds (more on that in “How I caught this before shipping”), and the literature predicts the same direction independently. I trust the result well enough to publish it. I would not trust it to defend a 0.05 MRR difference. The full 180-query authorship is not blocked: it just stopped being the cheapest way to answer the question.

One more disclosure on the same theme. The 45-query calibration corpus was authored at runtime and was not preserved on disk; the benchmark results JSON for this run was also not archived. Re-verification against pdf-mcp 1.12.1 with the same queries is therefore not possible. This is itself a methodological note: validation-first engineering with cheap, time-boxed corpora trades reproducibility for decision speed. The sibling section-chunking benchmark from the prior week followed the more disciplined research-dossier-and-results-archive pattern; this hybrid-search confirmation run did not, because the 30-minute experiment was not expected to carry weight beyond the immediate decision. It carries weight in this post, so the limitation is named here.

Why fusion breaks at section grain

The numbers are the result; the mechanism is the post.

Section titles are the lexical signal. When an agent queries “Spatial GNN Methods,” the FTS5 section index has that exact phrase as the section’s stored title field. BM25 nails it at rank 1 with a high score, because the query and the document field share most of their token mass. The match is sharp.

The dense ranker doesn’t have a corresponding sharp signal. Section titles in the embedding space sit in roughly the same neighborhood as body text from the same section and adjacent sections. The dense ranker spreads probability mass across a cluster of conceptually-related sections. BM25 concentrates it on the right one.

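RRF then fuses the two rankings by summing each engine’s reciprocal rank, 1/(k + rank). BM25 puts the right section at rank 1. Dense puts it somewhere in rank 5 to 10. Neighboring sections that sit near the top of both lists outscore it, and the fused rank lands around 3.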

The regression is one sentence wide: rank 1 became rank 3 because the second engine had nothing useful to add.

This is RRF doing exactly what it was designed to do, applied to a grain where one engine has already converged and the other one is noisy. The regression appears in the lexical category specifically because that is where BM25 was already winning by the largest margin, and therefore where RRF had the most to demote.
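The demotion can be reproduced with a few lines of arithmetic. This is a minimal sketch of reciprocal rank fusion with k=60, not the pdf-mcp implementation, and the two ranked lists are hypothetical: BM25 puts the target first, while the dense ranker puts two neighboring sections ("a" and "b") ahead of it and the target at rank 7.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical ranked lists; "target" is the section the query is really about.
bm25 = ["target", "a", "b", "c", "d", "e", "f"]
dense = ["a", "b", "x", "y", "z", "w", "target"]

fused = rrf_fuse([bm25, dense])
print(fused[:3])                  # ['a', 'b', 'target']
print(fused.index("target") + 1)  # 3
```

The target scores 1/61 + 1/67 ≈ 0.0313, while "a" scores 1/62 + 1/61 ≈ 0.0325 and "b" scores 1/63 + 1/62 ≈ 0.0320. RRF only sees ranks, so a sharp BM25 win counts for no more than a marginal one.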

The granularity flip from page grain

This is the part worth tweeting.

At page grain, BM25 misses conceptual matches because page-mode FTS5 wraps queries in phrase syntax: verbatim or nothing. A query like “WMT-2014 generalization capability” returns zero results when the document doesn’t contain that phrase verbatim, even though the relevant content is right there on page 8. The previous post calls this the Router Trap. Semantic fills that gap. Hybrid wins.
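The verbatim-or-nothing behavior is a property of FTS5 phrase queries and can be checked with SQLite directly. The sketch below assumes a Python build whose SQLite has FTS5 enabled. The one-column table and the page text are invented for illustration and are not pdf-mcp’s schema or the GPT-3 paper’s wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column page index; not pdf-mcp's real schema.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
conn.execute(
    "INSERT INTO pages (body) VALUES (?)",
    (
        "The model shows strong generalization on WMT-2014 translation, "
        "a capability that emerges without task-specific fine-tuning.",
    ),
)


def count_hits(match_expr: str) -> int:
    """Return how many pages match an FTS5 MATCH expression."""
    rows = conn.execute(
        "SELECT rowid FROM pages WHERE pages MATCH ?", (match_expr,)
    ).fetchall()
    return len(rows)


# Quoted: the tokens must appear consecutively and in this order.
phrase_hits = count_hits('"WMT-2014 generalization capability"')
# Unquoted with OR: any one of the tokens is enough.
token_hits = count_hits("wmt OR generalization OR capability")
print(phrase_hits, token_hits)  # 0 1
```

Every query token is on the page, but not as a contiguous phrase, so the quoted form returns nothing.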

At section grain, BM25 has the lexical signal and enough body context to disambiguate. The query “training process” matches the section titled “2.3 Training Process” with overwhelming token overlap. The dense ranker can’t add precision; it can only spread mass. There is nothing left for fusion to do except dilute the signal that BM25 already had right.

Slope chart of overall MRR across two retrieval grains. Hybrid leads at page grain (0.29 versus keyword 0.25) but keyword-only overtakes at section grain (0.53 versus hybrid 0.36). The two lines cross between the grains, so the verdict flips with granularity.

Same fusion technique, opposite verdict. Granularity changed which engine owned the lexical signal, and that change inverted the architecture decision. This is the kind of result that does not show up in vendor benchmarks, because vendor benchmarks usually fix one grain and vary the engine. The real-world question is the inverse: which engine, at which grain.

Literature confirmation

Three findings from the IR literature point the same direction.

BEIR SciFact (the closest published analog to single-paper section retrieval) puts a ceiling on the hybrid lift. On that dataset Bruch et al. measured RRF fusion at 0.730 nDCG@100 against BM25’s 0.698, a ~4.6% relative gain (0.032 absolute) on scientific papers. My gate threshold for clause 1 was +0.10 absolute MRR. The metrics are not identical, but the published ceiling on this domain is well below the bar I set, and I set the bar deliberately to filter out marginal results.

Bruch et al. (ACM TOIS 2023) documents lexical regression as a known failure mode of poorly-tuned hybrid retrieval. The exact pattern I observed (fusion scoring below the sparse baseline on lexical-style queries) is in the operational literature.

PaperQA2 (Future-House, 2024, current SOTA on the RAG-QA Arena science benchmark) does not use BM25/dense fusion at all. It uses semantic embeddings followed by LLM-based reranking. The dominant production architecture for scientific document retrieval has moved past the paradigm I tested.

A more rigorous Phase 2 candidate would compare keyword-section against keyword-section + LLM-rerank, not against keyword-section + dense fusion.

What to do on your system

Three decision rules, applicable beyond pdf-mcp:

Retrieval grain	Default search technique	Why
Page-grain	Hybrid (BM25 + dense, RRF)	Page-mode FTS5 phrase syntax misses conceptual matches; semantic fills the gap
Section-grain	BM25-only	Section titles already carry the lexical signal; dense fusion dilutes it
Section-grain, want better quality	LLM rerank on top of BM25	Rerank disambiguates near-duplicate sections without disrupting BM25 wins

The third row is the one to think about. Why LLM rerank instead of more fusion? Rerank reads the top-K candidates and picks the best with conceptual judgment. Fusion blindly merges two ranked lists and inherits both engines’ biases. At section grain where BM25 already has the lexical signal pinned at rank 1, rerank can disambiguate the mixed-distractor case (surface tokens in the wrong section, concept in another) without disturbing BM25’s existing wins. Fusion can’t make that distinction: it has no read on which engine is right for this query.

Practical starting points: a small cross-encoder like BAAI/bge-reranker-base for latency-bound use, or an LLM call (Claude, GPT-4o) for quality-bound use. The next pdf-mcp benchmark will compare both against keyword-section.
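The shape of "rerank on top of BM25" can be sketched without committing to a model. In the sketch below the scoring function is a stand-in: in practice it would wrap a cross-encoder or an LLM call. The section ids, the scores, and the `rerank_top_k` helper are all hypothetical, not part of pdf-mcp.

```python
from typing import Callable, List, Sequence


def rerank_top_k(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    k: int = 5,
) -> List[str]:
    """Reorder only the first k BM25 candidates by a relevance score.

    The tail keeps its BM25 order, and ties inside the head keep theirs too,
    because Python's sort is stable.
    """
    head = sorted(candidates[:k], key=lambda c: score(query, c), reverse=True)
    return head + list(candidates[k:])


# Hypothetical mixed-distractor case: BM25 ranks the section that shares the
# query's surface tokens first, while the conceptual match sits at rank 3.
bm25_order = ["sec-distractor", "sec-other", "sec-concept", "sec-tail"]

# Stand-in judge. A real one would be a cross-encoder or an LLM call.
judged = {"sec-concept": 0.9, "sec-distractor": 0.4}


def toy_score(query: str, section_id: str) -> float:
    return judged.get(section_id, 0.0)


reranked = rerank_top_k("example query", bm25_order, toy_score, k=3)
print(reranked)
# ['sec-concept', 'sec-distractor', 'sec-other', 'sec-tail']
```

Unlike fusion, the reranker only reorders candidates BM25 already surfaced. When the judge agrees with BM25, the order stays as it was.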

For pdf-mcp specifically: the hybrid-section roadmap item was retired after Phase 1 validation on 2026-05-04. The next experiment slot is a keyword-section + LLM-rerank candidate against the same Phase 1 framework, using the same gate clauses adapted for rerank-specific value claims (mixed-distractor uplift, latency budget).

How I caught this before shipping

This section exists to back the sections above. Here is the process that produced the numbers, and why I trust them.

The design rule was validation-first: write the kill-switch before writing the feature. The spec defined a benchmark, a frozen query corpus, and a three-clause gate. None of the implementation code was allowed to be written until the gate had been agreed and the corpus committed.

The gate, compactly:

Mixed-distractor uplift. MRR(hybrid-section, mixed-distractor) ≥ MRR(next-best-cell, mixed-distractor) + 0.10. Fusion has to win the category it was designed to win.
No lexical regression. MRR(hybrid-section, lexical) ≥ MRR(keyword-section, lexical) − 0.05. Don’t break what BM25 does well.
No overall regression. MRR(hybrid-section, all) ≥ MRR(hybrid-page, all). Section-with-fusion has to be at least as good as today’s smartest mode at page grain.

After the Phase 1 pipeline was built (15 commits on a feature branch, 550 tests), I projected 14 to 20 hours to hand-author the 180-query frozen corpus the spec required. Before sinking that, I spent 30 minutes on a literature review and surfaced the three findings above. They predicted the gate would fail.

I authored a 45-query confirmation corpus (5 queries per category per PDF, across 3 categories and 3 PDFs) in another 30 minutes: explicitly not the frozen gold corpus, explicitly biased toward hybrid winning by an author who knew the gate shape. The result:

Clause	Required	Actual	Result
1: mixed-distractor uplift ≥ +0.10	hybrid-section ≥ 0.7500	0.3840	FAIL by 0.37 (3.7× threshold)
2: lexical regression ≥ −0.05	hybrid-section ≥ 0.8833	0.6250	FAIL by 0.26 (5.2× threshold)
3: overall ≥ hybrid-page	hybrid-section ≥ 0.2883	0.3588	pass

Two of three clauses failed by 3.7× and 5.2× the gate thresholds, on a corpus the author had every incentive to favor. That is not a borderline result; it is decisive.
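The three clauses are simple enough to express as a function, which is roughly what a kill-switch amounts to. This is a sketch, not the spec’s actual code. The class and field names are invented, and the two baseline values are back-derived from the "Required" column (0.7500 − 0.10 and 0.8833 + 0.05) rather than quoted directly.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GateInputs:
    hybrid_section_mixed: float
    next_best_mixed: float
    hybrid_section_lexical: float
    keyword_section_lexical: float
    hybrid_section_all: float
    hybrid_page_all: float


def evaluate_gate(m: GateInputs) -> Dict[str, bool]:
    """Return pass/fail per clause; the feature proceeds only if all pass."""
    return {
        "mixed_distractor_uplift": m.hybrid_section_mixed >= m.next_best_mixed + 0.10,
        "no_lexical_regression": m.hybrid_section_lexical >= m.keyword_section_lexical - 0.05,
        "no_overall_regression": m.hybrid_section_all >= m.hybrid_page_all,
    }


# Values from the clause table; the two baselines are back-derived (see above).
observed = GateInputs(
    hybrid_section_mixed=0.3840,
    next_best_mixed=0.6500,
    hybrid_section_lexical=0.6250,
    keyword_section_lexical=0.9333,
    hybrid_section_all=0.3588,
    hybrid_page_all=0.2883,
)
result = evaluate_gate(observed)
print(result)
print("ship" if all(result.values()) else "reject")  # reject
```

Writing the gate as code before the feature exists is the point: the thresholds cannot drift once the numbers come in.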

Total spent: ~6.5 hours including spec authorship, plan, framework build, literature review, confirmation run, and verdict write-up. Estimated cost saved: 14 to 20 hours of corpus authorship plus 10 to 15 hours of Phase 2 implementation (cache schema, embedding helpers, server dispatch, integration tests) that the gate would have rejected anyway. Roughly 25 to 35 hours of work avoided for a feature that had no value claim.

The Phase 1 framework is retained. It was built to evaluate one Phase 2 candidate; it can evaluate the next one (keyword-section + LLM-rerank) without rework. The kill-switch did its job, and the infrastructure that ran it remains as durable value.

What’s next

The interesting question is no longer “should I add semantic fusion to section search?” The literature, the gate, and the diagnostic findings all answer that one in the same direction. The interesting question is “how much can LLM reranking lift keyword-section on the mixed-distractor category, and at what latency cost?”

That is a different gate. Latency matters in a way it didn’t for fusion: rerank adds an LLM round-trip per query, which dense retrieval does not. A new spec gets written before any code lands, and the Phase 1 framework supports the comparison directly.

The earlier post on section vs page chunking closed with “pick the chunk unit your content actually has, not the unit the format names.” This post adds the inverse: pick the search technique your chunk unit actually needs. Granularity is a first-class architecture decision. The right search technique at one grain is the wrong technique at another.

pdf-mcp ships section-grain search as pdf_search(granularity="section"), BM25-only by design after this benchmark. For the broader retrieval architecture, see Hybrid Search vs Query Routing and Section Chunking vs Page Chunking.


Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
