An AI agent reading the GPT-3 paper with page-mode PDF search issues 5.79 extra pdf_read_pages calls per query to recover full section context. On a 144-page LLM survey it is 4.71. On a denser GNN review it is 2.23. Section-aware search delivers all three in zero. The cost of page-walking scales with section length, and the gap does not close as document shape varies. Section mode is invariant. Page mode is sensitive.

I built pdf-mcp, an open-source MCP server used by Claude Desktop and Cursor for large PDF processing. Section-mode search shipped in v1.10.0 after a benchmark on three real arxiv PDFs: GPT-3, an LLM survey, and a GNN review. The numbers below are from a re-run on v1.12.1, after pdf-mcp 1.12.0 tokenised keyword search and changed which pages page-mode lands on for many queries (see LLM-Free QA for MCP Servers for why the re-run was needed).

TL;DR: Page-mode PDF search makes AI agents walk pages to rebuild multi-page sections. On the GPT-3 paper, that costs 5.79 extra pdf_read_pages calls per query, and 11 of 24 sections are never recovered inside a 10-call budget. Section-aware search using TOC-first dispatch plus a 7-signal heuristic detector hits F1 between 0.796 and 0.936 on three real PDFs and delivers full section content in one call. Available now via pdf_search(path, query, granularity="section").

Verdict: If your agent reads PDFs page by page, every multi-page section is a tool-call leak. Pages are not the right chunk unit for retrieval.


The Page-Walk Trap

Most PDF tooling treats the page as the natural chunk. Page numbers are stable, page boundaries are unambiguous, and pdf_read_pages is the obvious API. RAG frameworks default to it. So did pdf-mcp’s first version.

Here is what an agent actually does with page-mode search when it needs section content:

1. pdf_search("approach")                  -> page 9
2. pdf_read_pages([9])                     -> partial match
3. pdf_read_pages([10])  (walk forward)    -> still partial
4. pdf_read_pages([8])   (walk backward)   -> still partial
5. pdf_read_pages([11])
6. pdf_read_pages([7])
... up to a 10-call cap

The keyword hit lands on one page. The section spans more. The agent walks N+1, N-1, N+2, N-2 outward until it has covered enough of the section’s tokens, or until it gives up. On the GPT-3 paper, 11 of 24 evaluated sections never reach 95% token coverage inside a 10-call budget. The agent runs out of moves before it sees the section.
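The outward walk described above can be sketched in a few lines. This is an illustrative reconstruction, not pdf-mcp code: the function name `walk_order` and its parameters are hypothetical, and the forward-then-backward order is taken from the trace shown earlier.

```python
from itertools import islice
from typing import Iterator


def walk_order(hit_page: int, first_page: int, last_page: int) -> Iterator[int]:
    """Yield neighbouring pages in the order N+1, N-1, N+2, N-2, ...

    Pages outside [first_page, last_page] are skipped. The hit page itself
    is not yielded because the agent has already read it.
    """
    offset = 1
    while hit_page + offset <= last_page or hit_page - offset >= first_page:
        if hit_page + offset <= last_page:
            yield hit_page + offset
        if hit_page - offset >= first_page:
            yield hit_page - offset
        offset += 1


# First five extra reads after a keyword hit on page 9 of a 75-page PDF.
extra_reads = list(islice(walk_order(9, 1, 75), 5))
```

For a hit on page 9 the first extra reads are pages 10, 8, 11, 7, 12, which is the pattern in the trace. Every one of those reads is a separate tool call.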

This is the Page-Walk Trap: a multi-page section forces page-mode search to walk pages, and any walk that hits a section boundary returns content from the wrong section. Adjacent walks are not just slow; they are noisy.


Validation: Three Real arxiv PDFs

I picked three PDFs to span structural diversity in academic publishing:

| PDF | Pages | Sections | Avg pages/section | Why it’s interesting |
|---|---|---|---|---|
| GNN: A Review (1812.08434) | 26 | 31 | ~1 | Tightly numbered, mostly single-page subsections |
| LLM Survey (2303.18223) | 144 | 93 | ~1.5 | Multi-word unnumbered titles (“Background for LLMs”) |
| GPT-3 (2005.14165) | 75 | 24 | ~3 | Multi-page numbered sections, the worst case |

The benchmark runs each gold section’s title as a query, takes the rank-1 page hit from page-mode search, and walks outward until the agent has 95% of the section’s unique tokens (or hits 10 walks). Section mode delivers the full section in one call by construction.
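A minimal sketch of that measurement, assuming a simple lowercase word tokenizer (the real benchmark's tokenizer may differ). The names `unique_tokens` and `extra_reads_to_cover` are hypothetical, and the read order is passed in so that any walk strategy can be plugged in.

```python
import re
from typing import Dict, List, Optional, Set


def unique_tokens(text: str) -> Set[str]:
    """Lowercased word tokens; a stand-in for the benchmark's tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extra_reads_to_cover(
    section_text: str,
    page_texts: Dict[int, str],
    read_order: List[int],
    threshold: float = 0.95,
    max_extra_reads: int = 10,
) -> Optional[int]:
    """Count extra page reads needed to cover `threshold` of the section's unique tokens.

    read_order[0] is the rank-1 page hit; later entries are the walk.
    Returns None when the budget (or the read order) runs out first.
    """
    gold = unique_tokens(section_text)
    if not gold:
        return 0
    seen: Set[str] = set()
    # Slot 0 is the initial hit, so the budget allows max_extra_reads further pages.
    for extra, page in enumerate(read_order[: max_extra_reads + 1]):
        seen |= unique_tokens(page_texts.get(page, ""))
        if len(gold & seen) / len(gold) >= threshold:
            return extra
    return None


# Toy document: the section spans pages 2 and 3.
pages = {1: "unrelated preface", 2: "alpha beta", 3: "gamma delta"}
reads = extra_reads_to_cover("alpha beta gamma delta", pages, [2, 3, 1])
```

In the toy case the hit page covers half of the section's tokens, so one extra read is needed. A `None` result corresponds to a section that is never recovered within the budget.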

The Headline Numbers

Extra pdf_read_pages calls per query: page mode needs 2.23 on GNN, 4.71 on the LLM survey, and 5.79 on GPT-3, so page-mode search costs roughly 2 to 6 extra calls per query. Section mode needs zero on all three PDFs.

| Metric | GNN | LLM survey | GPT-3 |
|---|---|---|---|
| Sections evaluated | 31 | 93 | 24 |
| Page-mode mean extra reads | 2.23 | 4.71 | 5.79 |
| Page-mode 0-extra-reads rate | 58% | 25% | 17% |
| Section-mode 0-extra-reads rate | 100% | 100% | 100% |

GPT-3 is the strongest case. Not a single section in the paper fits on one page. The LLM survey is the middle case (mid-sized sections). GNN is the weak case (most sections are already single-page, so page-mode does less walking).

The pattern: section-granularity wins precisely when sections span multiple pages. This is also the pattern that breaks page-mode RAG silently in production, because the agent never reports “I needed to walk 9 pages and gave up.” It just answers with whatever was on the first page it found.

Per-Section Drama

Three queries that hit the 10-call cap on GPT-3 page-mode (agent never recovered the section):

  • “2 Approach”
  • “3 Results”
  • “3.4 Winograd-Style Tasks”

These are the sections an agent reading the paper for benchmark numbers actually queries. They are also the ones page-mode search structurally cannot deliver inside any reasonable tool-call budget. “2 Approach” is the cleanest cautionary case: the 1.12.0 keyword tokenisation made it worse (8 reads to 10), because “approach” is common enough that BM25 picks a body page over the heading page, and the agent walks the wrong way from there.


Why Detection Is Hard

Section-aware search needs section boundaries. The PDF’s table of contents gives them when present, but the heuristic fallback is where it gets interesting. My first attempt was a regex:

# Iteration 1: regex-only
PATTERN = (
    r"^(?:\d+(?:\.\d+)*\s+[A-Z]"
    r"|(?i:Chapter|Section|Part)\s+\d+)"
)

Numbered headings (“1.1 Background”) and the obvious keywords. F1 numbers:

| Metric | GNN | LLM survey |
|---|---|---|
| Regex-only F1 | 0.750 | 0.324 |
| Recall | 0.55 | 0.22 |

The LLM survey collapsed to 0.32. Its TOC is full of titles like “Background for LLMs” and “Resources of LLMs”: multi-word phrases with no leading number. Regex caught 17 of 67 gold boundaries. Widening the regex to grab “Abstract”, “References”, “Appendix [A-Z]” did not help. The missing titles are not regex-shaped.
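To see why the missing titles are not regex-shaped, here is a small check of the iteration-1 pattern against a few sample lines. The candidate strings are illustrative; only the two survey titles come from the paper's TOC as quoted above.

```python
import re

# Same pattern as the regex-only iteration above.
PATTERN = (
    r"^(?:\d+(?:\.\d+)*\s+[A-Z]"
    r"|(?i:Chapter|Section|Part)\s+\d+)"
)
HEADING_RE = re.compile(PATTERN)

candidates = [
    "1.1 Background",       # numbered heading
    "Chapter 3",            # keyword heading
    "Background for LLMs",  # unnumbered survey title
    "Resources of LLMs",    # unnumbered survey title
]
matches = {line: bool(HEADING_RE.match(line)) for line in candidates}
```

The first two lines match and the two survey titles do not. Nothing in the text of an unnumbered title distinguishes it from an ordinary short sentence, so the signal has to come from somewhere other than the characters.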

The Font-Face Surprise

I assumed the signal was font size: bigger font means heading. That is what the PyMuPDF examples and most PDF-extraction tutorials suggest. On the LLM survey:

  • Body text: URWPalladioL-Roma, 9.5pt, regular
  • Heading “Background for LLMs”: NimbusSanL-Bold, 9.5pt, bold flag set

Same size. The signal is in (font_name, is_bold), not size. On the GNN review the relationship is even worse:

  • Body: AdvTTe692faf0, 8.0pt, regular
  • Heading: AdvTTc9617e0c.B, 7.97pt, bold flag NOT set (bold is encoded in the font name’s .B suffix)

Heading slightly smaller than body, no bold flag, bold encoded in the font name. A unified detector needs both font_face != body_face and (is_bold flag OR name has bold marker like '.B', '-Bold'). This is the kind of finding that does not show up in spec-driven design. It only shows up when you point the detector at three different publishers and watch it fail differently each time.
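The combined condition can be sketched as a pair of predicates. This is a hedged illustration rather than the shipped detector: the function names are hypothetical, the name markers are limited to the two mentioned above plus a generic "bold" substring, and the bold bit value of 16 assumes PyMuPDF's span `flags` convention.

```python
# Assumption: bit 4 (value 16) of a PyMuPDF span's "flags" field marks bold.
PYMUPDF_BOLD_FLAG = 16


def has_bold_marker(font_name: str, flags: int) -> bool:
    """True if the bold flag is set OR the font name itself signals bold."""
    name = font_name.lower()
    name_says_bold = name.endswith(".b") or "bold" in name
    return bool(flags & PYMUPDF_BOLD_FLAG) or name_says_bold


def is_heading_font(font_name: str, flags: int, body_font_name: str) -> bool:
    """Heading candidate: a different face from the body AND some bold marker."""
    return font_name != body_font_name and has_bold_marker(font_name, flags)


# The two cases from the text: flag-based bold and name-based bold.
survey_heading = is_heading_font("NimbusSanL-Bold", 16, "URWPalladioL-Roma")
gnn_heading = is_heading_font("AdvTTc9617e0c.B", 0, "AdvTTe692faf0")
```

Font size is deliberately absent from both predicates. In the two examples above it is either identical to the body size or slightly smaller, so it carries no usable signal.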

The 7-Signal Heuristic

I replaced the regex with a weighted-score detector:

| Signal | Weight | What it catches |
|---|---|---|
| regex_match | 3 | Numbered headings, keyword headings |
| face_delta | 2 | Different font face from body majority |
| bold_marker | 2 | Bold flag OR font name has bold marker |
| whitespace_above | 1 | Vertical gap >= 1.5x line height |
| top_of_page | 1 | Within top 15% of page height |
| title_case_or_caps | 1 | Title Case or ALL CAPS |
| short_line | 1 | <= 80 characters |

A line is a heading iff score >= 4. Sections rendered with the number on a separate line from the title (the LLM survey case again) are merged via a post-pass.
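A minimal sketch of the scoring rule and the merge post-pass, using the weights and threshold stated above. How each signal is computed from the PDF is out of scope here: the functions take precomputed booleans, and all names (`heading_score`, `is_heading`, `merge_split_numbers`) are hypothetical. The merge rule shown is an assumption about what such a post-pass could look like, not the shipped one.

```python
import re
from typing import Dict, List

# Weights copied from the signal table; the threshold is the stated cut-off.
WEIGHTS = {
    "regex_match": 3,
    "face_delta": 2,
    "bold_marker": 2,
    "whitespace_above": 1,
    "top_of_page": 1,
    "title_case_or_caps": 1,
    "short_line": 1,
}
THRESHOLD = 4


def heading_score(signals: Dict[str, bool]) -> int:
    """Sum the weights of every signal that fired for a line."""
    return sum(weight for name, weight in WEIGHTS.items() if signals.get(name))


def is_heading(signals: Dict[str, bool]) -> bool:
    return heading_score(signals) >= THRESHOLD


NUMBER_ONLY = re.compile(r"^\d+(?:\.\d+)*\.?$")


def merge_split_numbers(lines: List[str]) -> List[str]:
    """Join a bare section number with the title line that follows it."""
    merged: List[str] = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if NUMBER_ONLY.match(line) and i + 1 < len(lines):
            merged.append(f"{line} {lines[i + 1].strip()}")
            i += 2
        else:
            merged.append(line)
            i += 1
    return merged


# An unnumbered bold title in a different face clears the bar with no regex hit.
unnumbered_title = is_heading({"face_delta": True, "bold_marker": True})
```

The weights explain the recall gain on the LLM survey: face_delta plus bold_marker reaches 4 on its own, so a title like “Background for LLMs” no longer needs a leading number. The same pair is also how a bold caption can score as a heading.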

Heading detector F1 scores: regex-only is 0.750 on GNN and 0.324 on LLM survey; the 7-signal heuristic improves to 0.936 and 0.800. The 7-signal heuristic recovers the LLM survey’s unnumbered titles that regex misses entirely.

| Metric | GNN | LLM survey | GPT-3 |
|---|---|---|---|
| Multi-signal F1 | 0.936 | 0.800 | 0.796 |
| Recall | 1.000 | 1.000 | 1.000 |
| Precision | 0.880 | 0.667 | 0.667 |

Recall is 1.0 across all three. Precision drops to 0.67 because the detector occasionally fires on a bold figure caption or a table header. The downstream BM25 ranker absorbs most of those false positives because section bodies are long enough that a misplaced heading does not change the top-ranked match.
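For readers who want to connect precision and recall to the F1 column, here is the standard harmonic-mean formula applied to the GNN and LLM survey rows. The helper name `f1` is just for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gnn_f1 = round(f1(0.880, 1.0), 3)     # GNN row
survey_f1 = round(f1(0.667, 1.0), 3)  # LLM survey row

# With perfect recall, every error is a false positive:
# 1 - precision is the share of detected headings that are spurious.
survey_false_positive_share = round(1 - 0.667, 3)
```

With recall fixed at 1.0, F1 is driven entirely by precision. An F1 of 0.80 therefore means about one in three detected headings is spurious, not that one in five real headings is missed.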


What I Shipped

Section-mode search is now wired end-to-end:

pdf_search(
    "https://arxiv.org/pdf/2005.14165",
    "training process",
    granularity="section",
)
# {"sections": [
#   {"title": "2.3 Training Process",
#    "start_page": 9, "end_page": 9, "score": 4.68},
#   {"title": "2 Approach",
#    "start_page": 6, "end_page": 9, "score": 3.56}],
#  "search_mode": "section",
#  "total_sections": 32}

Architecture:

  • pdf_section_fts SQLite FTS5 virtual table, parallel to the existing pdf_search_fts page index. Same Porter+unicode61 tokenizer, same BM25 ranker.
  • TOC-first dispatch: if the PDF has a TOC, sections come from extract_toc_sections. If not, derive_sections falls back to the 7-signal heuristic detector.
  • Lazy index: first call per PDF derives sections and populates the FTS5 table; warm-cache subsequent calls are pure FTS5 queries (same shape as page-mode search, see hybrid search benchmarks for the keyword/semantic side). I have not formally benchmarked first-call latency end-to-end yet.
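The TOC-first dispatch can be sketched as follows. This is not the code behind extract_toc_sections or derive_sections: the names `toc_to_sections` and `dispatch_sections` are hypothetical, the TOC shape `(level, title, page)` assumes what PyMuPDF's `get_toc()` returns, and the end-page rule (stop one page before the next entry at the same or a shallower level) is an assumption for illustration.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Section(NamedTuple):
    title: str
    start_page: int
    end_page: int


TocEntry = Tuple[int, str, int]  # (level, title, page)


def toc_to_sections(toc: Sequence[TocEntry], total_pages: int) -> List[Section]:
    """Turn TOC entries into page ranges; parents span their children."""
    sections = []
    for i, (level, title, page) in enumerate(toc):
        end = total_pages  # default: runs to the end of the document
        for next_level, _next_title, next_page in toc[i + 1:]:
            if next_level <= level:  # next sibling or ancestor closes this one
                end = max(page, next_page - 1)
                break
        sections.append(Section(title, page, end))
    return sections


def dispatch_sections(
    toc: Sequence[TocEntry],
    total_pages: int,
    heuristic: Callable[[], List[Section]],
) -> List[Section]:
    """TOC first; call the heuristic detector only when no TOC exists."""
    if toc:
        return toc_to_sections(toc, total_pages)
    return heuristic()


toc = [
    (1, "1 Introduction", 1),
    (1, "2 Approach", 6),
    (2, "2.1 Model", 8),
    (2, "2.3 Training Process", 9),
    (1, "3 Results", 10),
]
sections = dispatch_sections(toc, 75, heuristic=lambda: [])
```

Passing the heuristic as a callable keeps the expensive font analysis off the TOC path entirely. The overlapping parent and leaf ranges this produces are what the section ranker then has to choose between.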

The smoke test above is the result that matters: BM25 prefers the leaf “2.3 Training Process” (BM25 score 4.68) over its parent container “2 Approach” (3.56). The naive “rank-1-page-then-find-containing-section” shim I tried first ranked the parent first because the container has more text. BM25 over section text fixes that. (For the broader keyword vs semantic tradeoff in PDF retrieval, semantic vs keyword search covers when each wins.)
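A toy reproduction of the leaf-over-parent behaviour, using Python's built-in sqlite3 module with an FTS5 table and the Porter plus unicode61 tokenizer. The table name `demo_section_fts`, its columns, and the section texts are made up for this sketch and are not the pdf_section_fts schema. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE VIRTUAL TABLE demo_section_fts USING fts5(
        title UNINDEXED,
        body,
        tokenize = 'porter unicode61'
    )
    """
)

# The parent section contains the leaf's text plus a lot of unrelated text.
filler = "few shot evaluation on downstream benchmarks " * 20
rows = [
    ("1 Introduction", "language models and in-context learning"),
    ("2 Approach", "the training process uses model parallelism " + filler),
    ("2.3 Training Process", "the training process uses model parallelism"),
    ("3 Results", "accuracy on question answering tasks"),
    ("4 Limitations", "weaknesses in text synthesis"),
]
conn.executemany(
    "INSERT INTO demo_section_fts (title, body) VALUES (?, ?)", rows
)

# FTS5's built-in rank column is BM25; ascending order puts the best match first.
hits = conn.execute(
    "SELECT title FROM demo_section_fts "
    "WHERE demo_section_fts MATCH ? ORDER BY rank",
    ("training process",),
).fetchall()
top_titles = [title for (title,) in hits]
```

Both sections contain the query terms the same number of times, but BM25 normalises by document length, so the short leaf ranks ahead of the long container. That is one plausible mechanism for the ordering in the smoke test; the real index may differ in columns and weighting.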


When Section Mode Does Not Help

Worth saying directly: section mode is not a universal win.

  • GNN review: Most sections are 1-2 pages. Page-mode mean extra reads is 2.23. Section mode delivers the same content with marginal benefit. On the content-recall benchmark (Group 2), section mode is barely ahead (+0.073) because BM25 occasionally picks a sibling subsection. Tool-call savings, not content recall, are the case for section mode on this corpus.
  • PDFs without a TOC: The heuristic detector hits F1 around 0.80, measured against TOC ground truth. With recall at 1.0 and precision as low as 0.67, that means up to a third of detected boundaries are false positives, not that 20% of real boundaries are missed. The dispatcher still falls back to it, but detector precision matters more there.
  • Single-page queries: If the agent’s query maps to a one-page answer, page-mode and section-mode return the same content with the same latency.
  • Heuristic-mode is unvalidated on TOC-less PDFs. All three benchmark PDFs have TOCs. The 7-signal detector exists as a fallback, but I have not measured its F1 on a real TOC-less document. If you are pointing this at scanned PDFs, books, or anything OCR’d, treat the savings claim as unproven.

The benchmark also measures content recall (not just tool-call cost). On GPT-3, section mode delivers 65.0% more of the section’s tokens than page mode (median 87.9% recall delta). On the LLM survey, the gain is 24.0%. On GNN, it is +7.3%. Document shape decides whether section mode wins on content quality. The tool-call savings claim survives across all three.


What This Does Not Prove

Three honest limits worth flagging:

  1. It is a retrieval benchmark, not an agent benchmark. Whether agents answer questions better with section granularity is a separate eval (LLM grading, downstream task accuracy). The 5.79 number measures tool-call cost to recover content, not answer quality.
  2. Page-mode is simulated. I implemented page-mode as keyword search plus a mechanical walk. Real agents may issue refined queries instead of walking, which would lower the 5.79 figure. So treat 5.79 as an upper bound on the savings, not the expected value.
  3. The validation set is three arxiv PDFs. Books, contracts, scanned documents, and anything without a TOC are out of scope. The heuristic-mode fallback exists but I have not validated it on a TOC-less PDF yet.

The Reproducibility Path

git clone https://github.com/jztan/pdf-mcp
cd pdf-mcp
git checkout v1.12.1

python scripts/benchmark_sections.py \
    --calibrate \
    --include-blog-pdf

Output in benchmark_results/sections_<timestamp>.json. Variants:

# TOC-derived sections, leaf-only (non-overlapping)
python scripts/benchmark_sections.py --calibrate \
    --include-blog-pdf --toc-flatten=leaves

# Heuristic detector instead of TOC
python scripts/benchmark_sections.py --calibrate \
    --include-blog-pdf --detector-source=heuristic

The benchmark is the single source of truth. Group 1 measures detector F1 against TOC ground truth. Group 2 measures content recall delta (section vs page). Group 3 produces the 5.79 number.


Conclusion

The Page-Walk Trap is not a corner case. Any PDF with multi-page sections produces it, and a 75-page paper from 2020 produces it on every section. Pages are the unit the file format gives you, not the unit retrieval should respect. Pick the chunk unit your content actually has, not the one the format names. Across three corpora and a tokenisation change that shifted page-mode in different directions on each, section mode stayed at zero extra reads on all three. The invariance is the point.

Section-aware search is in pdf_search(granularity="section") as of pdf-mcp v1.10.0. The benchmark is reproducible, the detector is one file, and if your agent reads PDFs in production, this is the kind of leak that does not show up in monitoring until you measure it.


Benchmark code, ground truth, and the section detector are in the pdf-mcp repository. For the broader retrieval architecture this fits into, see Hybrid Search vs Query Routing for AI Agents and Semantic vs Keyword Search.

Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.