An AI agent reading the GPT-3 paper with page-mode PDF search issues 5.79 extra pdf_read_pages calls per query to recover full section context. On a 144-page LLM survey it is 4.71. On a denser GNN review it is 2.23. Section-aware search delivers all three in zero. The cost of page-walking scales with section length, and the gap does not close as document shape varies. Section mode is invariant. Page mode is sensitive.
I built pdf-mcp, an open-source MCP server used by Claude Desktop and Cursor for large PDF processing. Section-mode search shipped in v1.10.0 after a benchmark on three real arxiv PDFs: GPT-3, an LLM survey, and a GNN review. The numbers below are from a re-run on v1.12.1, after pdf-mcp 1.12.0 tokenised keyword search and changed which pages page-mode lands on for many queries (see LLM-Free QA for MCP Servers for why the re-run was needed).
TL;DR: Page-mode PDF search makes AI agents walk pages to rebuild multi-page sections. On the GPT-3 paper, that costs 5.79 extra pdf_read_pages calls per query, and 11 of 24 sections are never recovered inside a 10-call budget. Section-aware search using TOC-first plus a 7-signal heuristic detector hits F1 ≈ 0.80 or better on three real PDFs and delivers full section content in one call. Available now via pdf_search(path, query, granularity="section").
Verdict: If your agent reads PDFs page by page, every multi-page section is a tool-call leak. Pages are not the right chunk unit for retrieval.
The Page-Walk Trap
Most PDF tooling treats the page as the natural chunk. Page numbers are stable, page boundaries are unambiguous, and pdf_read_pages is the obvious API. RAG frameworks default to it. So did pdf-mcp’s first version.
Here is what an agent actually does with page-mode search when it needs section content:
1. pdf_search("approach") -> page 9
2. pdf_read_pages([9]) -> partial match
3. pdf_read_pages([10]) (walk forward) -> still partial
4. pdf_read_pages([8]) (walk backward) -> still partial
5. pdf_read_pages([11])
6. pdf_read_pages([7])
... up to a 10-call cap
The keyword hit lands on one page. The section spans more. The agent walks N+1, N-1, N+2, N-2 outward until it has covered enough of the section’s tokens, or until it gives up. On the GPT-3 paper, 11 of 24 evaluated sections never reach 95% token coverage inside a 10-call budget. The agent runs out of moves before it sees the section.
This is the Page-Walk Trap: a multi-page section forces page-mode search to walk pages, and any walk that hits a section boundary returns content from the wrong section. Adjacent walks are not just slow; they are noisy.
Validation: Three Real arxiv PDFs
I picked three PDFs to span structural diversity in academic publishing:
| PDF | Pages | Sections | Avg pages/section | Why it’s interesting |
|---|---|---|---|---|
| GNN: A Review (1812.08434) | 26 | 31 | ~1 | Tightly numbered, mostly single-page subsections |
| LLM Survey (2303.18223) | 144 | 93 | ~1.5 | Multi-word unnumbered titles (“Background for LLMs”) |
| GPT-3 (2005.14165) | 75 | 24 | ~3 | Multi-page numbered sections, the worst case |
The benchmark runs each gold section’s title as a query, takes the rank-1 page hit from page-mode search, and walks outward until the agent has 95% of the section’s unique tokens (or hits 10 walks). Section mode delivers the full section in one call by construction.
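The walk described above can be sketched in a few lines. This is a minimal illustration, not the benchmark script itself: the tokeniser, the function names (`tokens`, `walk_order`, `extra_reads`), and the choice to leave the first read of the hit page uncounted are all assumptions made for the example.

```python
import re


def tokens(text):
    """Unique lowercased word tokens; a stand-in for the real tokeniser."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def walk_order(hit, page_count):
    """Yield pages in the order N+1, N-1, N+2, N-2, ... staying in range."""
    for step in range(1, page_count):
        for page in (hit + step, hit - step):
            if 1 <= page <= page_count:
                yield page


def extra_reads(pages, section_text, hit, coverage=0.95, max_walks=10):
    """Count page reads beyond the first hit needed to reach the coverage target.

    pages: dict mapping 1-based page number -> page text.
    Returns (extra_reads, recovered).
    """
    gold = tokens(section_text)
    seen = tokens(pages[hit]) & gold
    if len(seen) >= coverage * len(gold):
        return 0, True
    reads = 0
    for page in walk_order(hit, len(pages)):
        if reads == max_walks:
            break
        reads += 1
        seen |= tokens(pages[page]) & gold
        if len(seen) >= coverage * len(gold):
            return reads, True
    return reads, False


# Toy document: the section starts on page 2 and spills onto page 3.
toy_pages = {
    1: "intro alpha",
    2: "method beta gamma",
    3: "delta epsilon",
    4: "results zeta",
}
result = extra_reads(toy_pages, "method beta gamma delta epsilon", hit=2)
```

On the toy document the section is recovered after one extra read. A section whose tokens sit mostly on distant pages exhausts `max_walks` and comes back with `recovered` set to `False`.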
The Headline Numbers
Page-mode search costs 2 to 6 extra pdf_read_pages calls per query. Section mode delivers all three PDFs in zero.
| Metric | GNN | LLM survey | GPT-3 |
|---|---|---|---|
| Sections evaluated | 31 | 93 | 24 |
| Page-mode mean extra reads | 2.23 | 4.71 | 5.79 |
| Page-mode 0-extra-reads rate | 58% | 25% | 17% |
| Section-mode 0-extra-reads rate | 100% | 100% | 100% |
GPT-3 is the strongest case: its sections average about three pages, and only 17% of them are recovered from the first page hit. The LLM survey is the middle case (mid-sized sections). GNN is the weak case (most sections are already single-page, so page-mode does less walking).
The pattern: section-granularity wins precisely when sections span multiple pages. This is also the pattern that breaks page-mode RAG silently in production, because the agent never reports “I needed to walk 9 pages and gave up.” It just answers with whatever was on the first page it found.
Per-Section Drama
Three queries that hit the 10-call cap on GPT-3 page-mode (agent never recovered the section):
- “2 Approach”
- “3 Results”
- “3.4 Winograd-Style Tasks”
These are the sections an agent reading the paper for benchmark numbers actually queries. They are also the ones page-mode search structurally cannot deliver inside any reasonable tool-call budget. “2 Approach” is the cleanest cautionary case: the 1.12.0 keyword tokenisation made it worse (8 reads to 10), because “approach” is common enough that BM25 picks a body page over the heading page, and the agent walks the wrong way from there.
Why Detection Is Hard
Section-aware search needs section boundaries. The PDF’s table of contents gives them when present, but the heuristic fallback is where it gets interesting. My first attempt was a regex:
# Iteration 1: regex-only
PATTERN = (
r"^(?:\d+(?:\.\d+)*\s+[A-Z]"
r"|(?i:Chapter|Section|Part)\s+\d+)"
)
Numbered headings (“1.1 Background”) and the obvious keywords. F1 numbers:
| Metric | GNN | LLM survey |
|---|---|---|
| Regex-only F1 | 0.750 | 0.324 |
| Recall | 0.55 | 0.22 |
The LLM survey collapsed to 0.32. Its TOC is full of titles like “Background for LLMs” and “Resources of LLMs”: multi-word phrases with no leading number. Regex caught 17 of 67 gold boundaries. Widening the regex to grab “Abstract”, “References”, “Appendix [A-Z]” did not help. The missing titles are not regex-shaped.
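To make the failure concrete, here is the iteration-1 pattern run against a handful of titles quoted in this post. The helper name `regex_is_heading` and the sample list are illustrative only.

```python
import re

# Same pattern as iteration 1, compiled once.
HEADING_RE = re.compile(
    r"^(?:\d+(?:\.\d+)*\s+[A-Z]"
    r"|(?i:Chapter|Section|Part)\s+\d+)"
)


def regex_is_heading(line):
    """True when the line starts like a numbered or keyword heading."""
    return HEADING_RE.match(line.strip()) is not None


samples = [
    "1.1 Background",       # numbered heading
    "Chapter 3",            # keyword heading
    "Background for LLMs",  # unnumbered multi-word title
    "Resources of LLMs",    # unnumbered multi-word title
]
hits = [s for s in samples if regex_is_heading(s)]
misses = [s for s in samples if not regex_is_heading(s)]
```

The two unnumbered titles land in `misses`: nothing in their text distinguishes them from an ordinary short line, which is why widening the pattern does not recover them.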
The Font-Face Surprise
I assumed the signal was font size. Bigger font = heading. That is what the PyMuPDF examples and most PDF-extraction tutorials suggest. On the LLM survey:
- Body text: URWPalladioL-Roma, 9.5pt, regular
- Heading “Background for LLMs”: NimbusSanL-Bold, 9.5pt, bold flag set
Same size. The signal is in (font_name, is_bold), not size. On the GNN review the relationship is even worse:
- Body: AdvTTe692faf0, 8.0pt, regular
- Heading: AdvTTc9617e0c.B, 7.97pt, bold flag NOT set (bold is encoded in the font name’s .B suffix)
Heading slightly smaller than body, no bold flag, bold encoded in the font name. A unified detector needs both font_face != body_face and (is_bold flag OR name has bold marker like '.B', '-Bold'). This is the kind of finding that does not show up in spec-driven design. It only shows up when you point the detector at three different publishers and watch it fail differently each time.
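A minimal sketch of that combined check, written against plain values rather than a PDF library so it runs on its own. The function names, the character-weighted majority vote for the body face, and the exact marker matching are assumptions for illustration. Only the two markers named above ('.B' and '-Bold') are checked.

```python
from collections import Counter


def body_face(spans):
    """Most common font name, weighted by how many characters use it.

    spans: iterable of (font_name, text) pairs.
    """
    counts = Counter()
    for font_name, text in spans:
        counts[font_name] += len(text)
    return counts.most_common(1)[0][0]


def has_bold_marker(font_name, bold_flag):
    """Bold flag set, or bold encoded in the font name itself."""
    return bool(bold_flag) or font_name.endswith(".B") or "-Bold" in font_name


def looks_like_heading_font(font_name, bold_flag, body):
    """Different face from the body AND some indication of bold."""
    return font_name != body and has_bold_marker(font_name, bold_flag)


# Font names taken from the two examples above; the text lengths are made up.
survey_spans = [
    ("URWPalladioL-Roma", "x" * 500),
    ("NimbusSanL-Bold", "Background for LLMs"),
]
survey_body = body_face(survey_spans)
survey_heading = looks_like_heading_font("NimbusSanL-Bold", True, survey_body)
gnn_heading = looks_like_heading_font("AdvTTc9617e0c.B", False, "AdvTTe692faf0")
```

Both headings pass even though neither is larger than its body text, and the GNN one passes with the bold flag unset.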
The 7-Signal Heuristic
I replaced the regex with a weighted-score detector:
| Signal | Weight | What it catches |
|---|---|---|
| regex_match | 3 | Numbered headings, keyword headings |
| face_delta | 2 | Different font face from body majority |
| bold_marker | 2 | Bold flag OR font name has bold marker |
| whitespace_above | 1 | Vertical gap >= 1.5x line height |
| top_of_page | 1 | Within top 15% of page height |
| title_case_or_caps | 1 | Title Case or ALL CAPS |
| short_line | 1 | <= 80 characters |
A line is a heading iff score >= 4. Sections rendered with the number on a separate line from the title (the LLM survey case again) are merged via a post-pass.
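The scoring rule is small enough to sketch directly. The weights and the threshold come from the table above. The function names and the dict-of-booleans input are assumptions for this example, and the code that computes each signal from a PDF line is left out.

```python
# Weights as listed in the signal table.
WEIGHTS = {
    "regex_match": 3,
    "face_delta": 2,
    "bold_marker": 2,
    "whitespace_above": 1,
    "top_of_page": 1,
    "title_case_or_caps": 1,
    "short_line": 1,
}
THRESHOLD = 4


def heading_score(signals):
    """Sum the weights of every signal that fired for a line."""
    return sum(weight for name, weight in WEIGHTS.items() if signals.get(name))


def is_heading(signals):
    """A line counts as a heading when its score reaches the threshold."""
    return heading_score(signals) >= THRESHOLD


# An unnumbered bold title: no regex match, but the font signals carry it.
unnumbered_title = {"face_delta": True, "bold_marker": True, "short_line": True}
# A short Title Case body line: only the two weak signals fire.
short_body_line = {"title_case_or_caps": True, "short_line": True}

title_score = heading_score(unnumbered_title)
body_score = heading_score(short_body_line)
```

The unnumbered title scores 5 and passes, while the short body line scores 2 and does not. A bold figure caption can fire the same three signals as the title, which is one way a false positive gets through.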
The 7-signal heuristic recovers the LLM survey’s unnumbered titles that regex misses entirely.
| Metric | GNN | LLM survey | GPT-3 |
|---|---|---|---|
| Multi-signal F1 | 0.936 | 0.800 | 0.796 |
| Recall | 1.000 | 1.000 | 1.000 |
| Precision | 0.880 | 0.667 | 0.667 |
Recall is 1.0 across all three. Precision drops to 0.67 on the LLM survey and GPT-3 because the detector occasionally fires on a bold figure caption or a table header. The downstream BM25 ranker absorbs most of those false positives because section bodies are long enough that a misplaced heading does not change the top-ranked match.
What I Shipped
Section-mode search is now wired end-to-end:
pdf_search(
"https://arxiv.org/pdf/2005.14165",
"training process",
granularity="section",
)
# {"sections": [
# {"title": "2.3 Training Process",
# "start_page": 9, "end_page": 9, "score": 4.68},
# {"title": "2 Approach",
# "start_page": 6, "end_page": 9, "score": 3.56}],
# "search_mode": "section",
# "total_sections": 32}
Architecture:
- pdf_section_fts SQLite FTS5 virtual table, parallel to the existing pdf_search_fts page index. Same Porter+unicode61 tokenizer, same BM25 ranker.
- TOC-first dispatch: if the PDF has a TOC, sections come from extract_toc_sections. If not, derive_sections falls back to the 7-signal heuristic detector.
- Lazy index: first call per PDF derives sections and populates the FTS5 table; warm-cache subsequent calls are pure FTS5 queries (same shape as page-mode search, see hybrid search benchmarks for the keyword/semantic side). I have not formally benchmarked first-call latency end-to-end yet.
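The TOC-first dispatch can be sketched as follows. This is a hypothetical illustration, not the code in pdf-mcp: `sections_from_toc` and `dispatch_sections` are made-up names, the TOC is assumed to be a list of (level, title, start_page) entries, and the rule that a section ends on the page before the next same-or-shallower entry is a simplification that drops any tail sharing a page with the next heading.

```python
def sections_from_toc(toc, page_count):
    """Turn (level, title, start_page) TOC entries into page ranges.

    Assumption: a section ends on the page before the next entry at the
    same or a shallower level begins (clamped so end >= start).
    """
    sections = []
    for i, (level, title, start) in enumerate(toc):
        end = page_count  # the last section runs to the end of the document
        for next_level, _, next_start in toc[i + 1:]:
            if next_level <= level:
                end = max(start, next_start - 1)
                break
        sections.append({"title": title, "start_page": start, "end_page": end})
    return sections


def dispatch_sections(toc, page_count, heuristic_fallback):
    """TOC-first: use the TOC when present, otherwise call the fallback."""
    if toc:
        return "toc", sections_from_toc(toc, page_count)
    return "heuristic", heuristic_fallback()


# Toy TOC with a container section and two subsections (made-up data).
toy_toc = [
    (1, "1 Intro", 1),
    (1, "2 Method", 3),
    (2, "2.1 Setup", 3),
    (2, "2.2 Training", 5),
    (1, "3 Results", 6),
]
source, toy_sections = dispatch_sections(toy_toc, 10, lambda: [])
ranges = [(s["start_page"], s["end_page"]) for s in toy_sections]
```

Container sections keep their full span here ("2 Method" covers pages 3 to 5 and overlaps its children), so the index holds both a parent and its leaves.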
The smoke test above is the result that matters: BM25 prefers the leaf “2.3 Training Process” (BM25 score 4.68) over its parent container “2 Approach” (3.56). The naive “rank-1-page-then-find-containing-section” shim I tried first ranked the parent first because the container has more text. BM25 over section text fixes that. (For the broader keyword vs semantic tradeoff in PDF retrieval, semantic vs keyword search covers when each wins.)
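The leaf-over-parent behaviour can be reproduced with a toy FTS5 index. This sketch assumes a Python build whose SQLite includes FTS5. The table name `section_fts`, the two-column schema, and the section texts are invented for the example and are not the pdf-mcp schema.

```python
import sqlite3


def build_section_index(sections):
    """Create an in-memory FTS5 table with the Porter + unicode61 tokenizer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE section_fts USING fts5("
        "title, body, tokenize='porter unicode61')"
    )
    conn.executemany(
        "INSERT INTO section_fts (title, body) VALUES (?, ?)", sections
    )
    return conn


def search_sections(conn, query, limit=5):
    """Return (title, score) pairs, best match first."""
    rows = conn.execute(
        "SELECT title, bm25(section_fts) FROM section_fts "
        "WHERE section_fts MATCH ? ORDER BY bm25(section_fts) LIMIT ?",
        (query, limit),
    ).fetchall()
    # FTS5's bm25() is lower (more negative) for better matches; flip the sign.
    return [(title, -score) for title, score in rows]


filler = "The model architecture, datasets, and evaluation settings are described. "
toy_sections = [
    ("2 Approach", "Our approach includes the training process. " + filler * 5),
    ("2.3 Training Process", "The training process uses large batches. Training runs in parallel."),
    ("3 Results", "We report accuracy on several benchmarks."),
    ("4 Limitations", "The model still struggles with long documents."),
    ("5 Related Work", "Prior work studied scaling of language models."),
    ("6 Conclusion", "Larger models show stronger few-shot behaviour."),
]
conn = build_section_index(toy_sections)
results = search_sections(conn, "training process")
```

Both the parent and the leaf match the query, but the leaf ranks first: BM25 normalises by document length, so the short section with repeated query terms beats the long container that mentions them once.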
When Section Mode Does Not Help
Worth saying directly: section mode is not a universal win.
- GNN review: Most sections are 1-2 pages. Page-mode mean extra reads is 2.23. Section mode delivers the same content with marginal benefit. On Group 2 content-recall, section mode is barely ahead (+0.073) because BM25 occasionally picks a sibling subsection. Tool-call savings, not content recall, is the case for section mode on this corpus.
- PDFs without a TOC: The heuristic detector hits F1 around 0.80. With recall at 1.0 and precision near 0.67 on two of the three PDFs, up to a third of detected boundaries are false positives. The dispatcher still falls back to it, but detector precision matters more there.
- Single-page queries: If the agent’s query maps to a one-page answer, page-mode and section-mode return the same content with the same latency.
- Heuristic-mode is unvalidated on TOC-less PDFs. All three benchmark PDFs have TOCs. The 7-signal detector exists as a fallback, but I have not measured its F1 on a real TOC-less document. If you are pointing this at scanned PDFs, books, or anything OCR’d, treat the savings claim as unproven.
The benchmark also measures content recall (not just tool-call cost). On GPT-3, section mode recovers 65.0 percentage points more of the section’s tokens than page mode (median recall delta: 87.9 points). On the LLM survey, the gain is 24.0 points. On GNN, it is 7.3 points. Document shape decides whether section mode wins on content quality. The tool-call savings claim survives across all three.
What This Does Not Prove
Three honest limits worth flagging:
- It is a retrieval benchmark, not an agent benchmark. Whether agents answer questions better with section granularity is a separate eval (LLM grading, downstream task accuracy). The 5.79 number measures tool-call cost to recover content, not answer quality.
- Page-mode is simulated. I implemented page-mode as keyword + mechanical walk. Real agents may issue refined queries instead of walking, which would lower the 5.79 figure. So treat 5.79 as an upper bound on the savings, not the expected value.
- The validation set is three arxiv PDFs. Books, contracts, scanned documents, and anything without a TOC are out of scope. The heuristic-mode fallback exists but I have not validated it on a TOC-less PDF yet.
The Reproducibility Path
git clone https://github.com/jztan/pdf-mcp
cd pdf-mcp
git checkout v1.12.1
python scripts/benchmark_sections.py \
--calibrate \
--include-blog-pdf
Output in benchmark_results/sections_<timestamp>.json. Variants:
# TOC-derived sections, leaf-only (non-overlapping)
python scripts/benchmark_sections.py --calibrate \
--include-blog-pdf --toc-flatten=leaves
# Heuristic detector instead of TOC
python scripts/benchmark_sections.py --calibrate \
--include-blog-pdf --detector-source=heuristic
The benchmark is the single source of truth. Group 1 measures detector F1 against TOC ground truth. Group 2 measures content recall delta (section vs page). Group 3 produces the 5.79 number.
Conclusion
The Page-Walk Trap is not a corner case. Any PDF with multi-page sections produces it, and a 75-page paper from 2020 produces it on most of its sections. Pages are the unit the file format gives you, not the unit retrieval should respect. Pick the chunk unit your content actually has, not the one the format names. Across three corpora and a tokenisation change that shifted page-mode in different directions on each, section mode stayed at zero extra reads on all three. The invariance is the point.
Section-aware search is in pdf_search(granularity="section") as of pdf-mcp v1.10.0. The benchmark is reproducible, the detector is one file, and if your agent reads PDFs in production, this is the kind of leak that does not show up in monitoring until you measure it.
Benchmark code, ground truth, and the section detector are in the pdf-mcp repository. For the broader retrieval architecture this fits into, see Hybrid Search vs Query Routing for AI Agents and Semantic vs Keyword Search.