Hand an AI agent a 100-page PDF and, unless the tool layer gives it better options, it reads far more than it needs. It pulls page after page into context, spends its token budget on text that never mattered, and either truncates silently or dies with a context-overflow error. Newer agents search first when they can, but that instinct only helps if the tools let them act on it. The PDF was never the problem. The reading strategy was, and the reading strategy is something the tool layer decides.
I build and maintain pdf-mcp, an open-source MCP server that AI coding tools use to read and search PDFs without overflowing their context. Since it launched in January 2026 it has passed 26,000 PyPI downloads across 20 releases. That volume is the only reason this post exists: these five patterns are not what I designed on day one. They are what survived contact with real agents reading real documents at scale, and several of them I got wrong before I got them right.
None of this is specific to my server, or even to MCP. The same patterns apply whether you are building a RAG pipeline, an MCP server, a document agent, or an internal knowledge assistant. If you use another PDF reader, a document loader, or your own tool layer, they still hold: they are about how an agent should navigate a document, not about which library does the parsing.
TL;DR: Five patterns for agents reading PDFs at scale: scout the structure before reading content, decompose tools by interaction step instead of by feature, cache the slow extraction and invalidate it by a cheap signal, recover reading order before it poisons retrieval, and route the search mode to the query type. Each one is implementation-agnostic.
Five Patterns That Survived Production
These patterns build on each other. The first governs how an agent approaches a document; the last governs how it finds the one paragraph that matters:
- Inspect first, read last keeps the agent from dumping pages it never needed.
- Decompose tools by interaction step lets the agent make one small decision at a time.
- Cache the slow part turns a per-conversation tax into a one-time cost.
- Recover reading order stops a parsing bug from quietly poisoning everything downstream.
- Match search mode to query type finds the right passage whether the query is an exact identifier or a vague concept.
The examples reference pdf-mcp’s tool surface because that is the system I have production data for, but every pattern applies to any PDF tool an agent can call.
1. Inspect First, Read Last
The naive tool design is a single read_pdf(path) that returns every page. It feels complete, and it fails the moment a document is large. A 100-page report is tens of thousands of tokens of mostly irrelevant text. The agent pays for all of it to answer a question that lived on page 38.
The fix is to give the agent a way to scout before it commits. pdf-mcp’s tool descriptions push this explicitly: pdf_info is documented as “always call this first,” and it returns page count, metadata, and the table of contents without returning a single page of body text. From there the agent has cheap moves before any expensive one:
pdf_info -> structure: pages, metadata, TOC
pdf_get_toc -> full outline when the TOC is large
pdf_search -> locate the pages that match the query
pdf_read_pages -> read only the matching ranges
The interesting part is what the agents did with this on their own. Early on, I watched them call pdf_info as a confidence check and branch on document size: smaller PDFs got fuller reads, larger ones triggered search-first. I never wrote a prompt telling them to do that. The important part was not the tools themselves. It was that scouting had become the cheapest available action, and agents gravitate to the cheapest action. The boundaries did the work a prompt usually tries to. For the full story of how that choreography shifted across releases, see how an AI coding tool actually reads PDFs.
The reason this matters is cost as much as correctness. Reading everything is the single largest source of wasted tokens in a document workflow, the same dynamic I measured when I cut an agent’s token costs: most of an agent’s tokens are tool output, not reasoning. Scout-then-read attacks that at the source.
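The branching the agents settled into can be sketched in a few lines. This is a minimal illustration, not pdf-mcp's code: `plan_read`, `merge_pages`, and the 10-page threshold are hypothetical names and values, and `page_count` and `hit_pages` stand in for whatever your own scouting and search tools return.

```python
from dataclasses import dataclass


@dataclass
class ReadPlan:
    strategy: str                        # "read_all" or "search_first"
    page_ranges: list[tuple[int, int]]   # inclusive, 1-based page ranges


def merge_pages(pages: list[int], gap: int = 1) -> list[tuple[int, int]]:
    """Collapse page hits into inclusive ranges, joining hits at most `gap` pages apart."""
    ranges: list[tuple[int, int]] = []
    for page in sorted(set(pages)):
        if ranges and page - ranges[-1][1] <= gap:
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


def plan_read(page_count: int, hit_pages: list[int], small_doc_pages: int = 10) -> ReadPlan:
    """Small documents get a full read; large ones read only the pages search located."""
    # Assumption: 10 pages is an arbitrary cutoff chosen for illustration.
    if page_count <= small_doc_pages:
        return ReadPlan("read_all", [(1, page_count)])
    return ReadPlan("search_first", merge_pages(hit_pages))
```

For a 100-page document with hits on pages 38, 39, and 41, the plan is two targeted reads: pages 38 to 39 and page 41. The other 97 pages never enter context.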
2. Decompose Tools by Interaction Step, Not by Feature
A monolithic read_pdf tool forces the agent to decide everything up front, before it knows anything about the document. How many pages? Which ones? With images or without? The agent guesses, over-requests, and pays for the guess.
Decomposing by interaction step inverts that. Each tool represents one decision the agent makes after seeing the result of the last one. pdf-mcp exposes nine tools, and the split is by step, not by feature:
| Step | Tool |
|---|---|
| Understand structure | pdf_info, pdf_get_toc |
| Locate content | pdf_search |
| Read targeted content | pdf_read_pages |
| Read sequentially | pdf_read_all |
| See the page as an image | pdf_render_pages |
| Check capabilities | server_info |
| Manage the cache | pdf_cache_stats, pdf_cache_clear |
Each call is small, and crucially each one is safe to get wrong. If the agent searches for the wrong term, it has spent a few hundred tokens, not the whole document. The blast radius of a bad decision shrinks to a single step. Before pdf-mcp’s reader was split this way, the failure was routine: handed one do-everything tool, agents would pull an entire document when they needed five pages, spend thousands of tokens on the rest, and sometimes overflow the context window outright. Splitting that monolith into focused tools made entire categories of “agent over-requested and crashed” simply stop happening.
This is the same lesson that shows up everywhere good tool design does. It is why I argue for giving agents narrow, well-bounded tools rather than broad ones, and it is the principle behind cutting MCP tool sprawl: more tools is not the goal, the right seams are. If you are wiring this into your own server, my walkthrough on building an MCP server in Python covers where to draw those seams.
3. Cache the Slow Part, Invalidate by the Fast Signal
MCP’s STDIO transport spawns a fresh server process per conversation. Without persistence, that means every new chat re-extracts the entire PDF from scratch: re-parses the text, re-renders the images, rebuilds the search index. The slow part runs again and again for a document that has not changed.
The pattern is to persist the expensive output and key its validity on something cheap to check. pdf-mcp stores extracted text, images, and the search index in a SQLite database under ~/.cache/pdf-mcp, and it invalidates entries by file path plus modification time. Checking an mtime is effectively free; re-extracting a PDF is not. A 24-hour TTL sits behind that as a backstop. The result is a clean split: the first conversation pays the extraction cost, and every conversation after it reads from cache. In practice that turns reopening a large paper from several seconds of re-extraction into a near-instant SQLite read, on every conversation that would otherwise start from zero.
cache key = (file path, mtime)
fast check = stat the file, compare mtime
on miss = extract once, persist to SQLite
backstop = 24h TTL
The trap is assuming the file is the only thing that can go stale. The extraction code changes too. When I shipped a fix to how the extractor handled multi-column pages, every PDF already in the cache still held text produced by the old, broken code. The mtime had not changed, so the cache happily served poisoned text and the bug looked unfixed even after the new code shipped. The fix was a second invalidation signal: an extraction-version marker (a PRAGMA user_version on the SQLite database) that drops stale text, embeddings, and search index whenever the extraction logic changes. Cache the slow part, but invalidate on every signal that can make it wrong, not just the obvious one.
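A minimal sketch of the two invalidation signals working together, using only the standard library. The table layout, the function names, and the `extract` callback are assumptions made for illustration rather than pdf-mcp's actual schema, and the TTL backstop is left out to keep it short.

```python
import os
import sqlite3
from typing import Callable


def open_cache(db_path: str, extraction_version: int) -> sqlite3.Connection:
    """Open the cache and drop everything if the extraction logic has changed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "path TEXT, mtime_ns INTEGER, page INTEGER, text TEXT, "
        "PRIMARY KEY (path, page))"
    )
    stored = conn.execute("PRAGMA user_version").fetchone()[0]
    if stored != extraction_version:
        # Second signal: text produced by older extractor code is stale
        # even though no file on disk has changed.
        conn.execute("DELETE FROM pages")
        # PRAGMA values cannot be bound as parameters; int() keeps this safe.
        conn.execute(f"PRAGMA user_version = {int(extraction_version)}")
        conn.commit()
    return conn


def get_pages(
    conn: sqlite3.Connection, path: str, extract: Callable[[str], list[str]]
) -> list[str]:
    """Return cached page text, re-extracting only when the file's mtime changed."""
    mtime_ns = os.stat(path).st_mtime_ns  # first signal: one cheap stat call
    rows = conn.execute(
        "SELECT text FROM pages WHERE path = ? AND mtime_ns = ? ORDER BY page",
        (path, mtime_ns),
    ).fetchall()
    if rows:
        return [row[0] for row in rows]
    texts = extract(path)  # the slow part, paid once per (path, mtime)
    conn.execute("DELETE FROM pages WHERE path = ?", (path,))
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?, ?)",
        [(path, mtime_ns, number, text) for number, text in enumerate(texts, start=1)],
    )
    conn.commit()
    return texts
```

Bumping `extraction_version` in the same change that touches the extractor is what keeps a fix from hiding behind old cache entries.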
4. Recover Reading Order Before It Poisons Retrieval
This is the pattern I would have missed entirely if I had not been staring at production output. A PDF does not store reading order. It stores fragments of text, each pinned to an (x, y) position, emitted in whatever sequence the generator felt like. For a plain single-column page, sorting fragments top-to-bottom and left-to-right reconstructs the text correctly. For anything else, it quietly produces nonsense.
A two-column paper is the clearest case. Sort by vertical position and you read the first line of the left column, jump to the first line of the right column, come back for the second line of the left, and so on. Every sentence gets spliced to a sentence from the other column. Nothing errors. The text is grammatical and complete and in the wrong order, and every downstream stage inherits the corruption: chunks that never existed, embeddings that encode two unrelated arguments, search hits that look relevant and are not.
The fix is to detect the layout first, then read each region in full before moving on:
boxes = detect_column_boxes(page)
if is_multi_column(boxes):
    parts = (
        page.get_text("text", clip=box, sort=True)
        for box in boxes
    )
    text = "\n\n".join(p.strip() for p in parts if p.strip())
else:
    # single-column path, unchanged
    ...
The edge case is what makes this a real pattern and not a one-liner. Naive column detection treats any page with multiple side-by-side boxes as multi-column, which scrambles the author grid on an academic title page: read down each column instead of across each row, and the first author of a paper quietly gets renamed. The signal that separates a genuine column from a grid cell is height. A real text column runs most of the page; an author cell or a caption occupies a short band. Gate the column path on tall boxes and the title page falls back to the safe positional sort. In my own reading-order benchmark, that gate lifted two-column fidelity from 0.564 to 0.816, while the title-page regression it quietly introduced barely dented the aggregate metric at all. An averaged quality score hides localized corruption, and the smaller the corrupted region, the better it hides.
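One way the height gate could look, as a sketch on plain bounding boxes so it runs without a PDF library. The name `has_real_columns` and the 0.5 height ratio are assumptions for illustration; the right threshold depends on your documents.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def has_real_columns(
    boxes: list[Box], page_height: float, min_height_ratio: float = 0.5
) -> bool:
    """True only when at least two tall, side-by-side boxes exist on the page."""
    # Keep boxes that span a large share of the page height. Author cells and
    # captions occupy a short band, so they are filtered out here.
    tall = sorted(
        (box for box in boxes if box[3] - box[1] >= min_height_ratio * page_height),
        key=lambda box: box[0],
    )
    if len(tall) < 2:
        return False
    # Neighbouring columns must not overlap horizontally.
    return all(left[2] <= right[0] for left, right in zip(tall, tall[1:]))
```

On an 800-point page, two boxes running from y=80 to y=760 pass the gate, while a row of three author cells 60 points tall does not, so the title page falls back to the positional sort.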
The same failure has a second axis: vertical scripts. Japanese tategaki reads top-to-bottom, right-to-left, and a horizontal sort shreds it identically. The reflex is OCR, but on a born-digital file the characters are not lost, only misordered, and recognition cannot fix an ordering bug. Reconstruct the order from glyph geometry instead: that needs no new dependency and no recognition at all. OCR earns its place only when the text really is pixels.
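A rough sketch of geometry-based ordering for vertical text, assuming each glyph arrives as a character with an (x, y) position. The `order_vertical` name and the fixed tolerance are hypothetical, and real pages need more care (ruby text, mixed horizontal runs), but the core idea fits in a dozen lines.

```python
Glyph = tuple[str, float, float]  # (character, x, y), origin at top-left


def order_vertical(glyphs: list[Glyph], column_tolerance: float = 2.0) -> str:
    """Group glyphs into vertical columns, read columns right-to-left, each top-to-bottom."""
    columns: list[tuple[float, list[tuple[float, str]]]] = []
    # Walk glyphs from the rightmost x to the leftmost; a glyph within the
    # tolerance of the current column's x joins it, otherwise it starts a new column.
    for char, x, y in sorted(glyphs, key=lambda glyph: -glyph[1]):
        if columns and abs(columns[-1][0] - x) <= column_tolerance:
            columns[-1][1].append((y, char))
        else:
            columns.append((x, [(y, char)]))
    lines = []
    for _, cells in columns:
        cells.sort(key=lambda cell: cell[0])  # top-to-bottom inside the column
        lines.append("".join(char for _, char in cells))
    return "\n".join(lines)
```

Nothing here recognizes characters. It only reorders the ones the file already contains.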
The generalizable claim: treat “we extracted the text” and “we extracted the text in the right order” as two different guarantees with two different bugs.
The text your model reads is not the text you see, until you prove otherwise.
5. Match Search Mode to Query Type
Search is how the scout pattern actually locates the right pages, and the mode you search in decides whether it works. Ask an agent for “the section on ISO 27001 controls” and the result hinges entirely on search mode: the exact identifier needs a keyword match, the surrounding description needs a semantic one. Whichever single mode you hardcode, it serves one half of that request and fails the other.
Pure keyword search misses concepts. A query for “revenue growth” will not match a passage that says “topline expansion,” even though they mean the same thing. Pure semantic search misses exact tokens. Ask for “ISO 27001” or “GPT-3” and an embedding model will happily return passages that are about compliance or about language models while skipping the page that contains the literal identifier. Identifiers, citations, and error codes are exactly where keyword search wins and semantics lose.
The pattern is to route the mode to the query, not to pick one mode forever. pdf-mcp’s pdf_search exposes three:
mode="keyword" -> exact terms, identifiers, citations
mode="semantic" -> concepts, natural-language questions
mode="auto" -> both, fused via Reciprocal Rank Fusion
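If the caller has to pick the mode, a crude router is often enough to start with. The heuristic below is an assumption made for illustration, not how pdf-mcp chooses: it treats tokens containing digits or written in all caps as identifier-like and routes on how much of the query they make up.

```python
def looks_like_identifier(token: str) -> bool:
    """Heuristic: digits or an all-caps token suggest a literal identifier."""
    stripped = token.strip(".,;:!?()\"'")
    has_digit = any(char.isdigit() for char in stripped)
    is_acronym = len(stripped) >= 2 and stripped.isupper()
    return has_digit or is_acronym


def choose_mode(query: str) -> str:
    """Return "keyword", "semantic", or "auto" for a search query."""
    flags = [looks_like_identifier(token) for token in query.split()]
    if flags and all(flags):
        return "keyword"   # nothing but identifiers: match them literally
    if not any(flags):
        return "semantic"  # plain language: match on meaning
    return "auto"          # mixed query: let both signals vote
```

Under this heuristic, "ISO 27001" and "GPT-3" route to keyword, "revenue growth" routes to semantic, and "the section on ISO 27001 controls" routes to auto.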
auto fuses keyword and semantic results with Reciprocal Rank Fusion, which is the right default when you do not know the query shape ahead of time. But “default to hybrid” is not a law, and granularity changes the answer. When I benchmarked search at the section level rather than the page level, plain BM25 keyword search beat the hybrid fusion, because section-grain text is long enough that the keyword signal is already strong and RRF mostly added noise. That result, and when to trust hybrid over keyword, is in BM25 vs hybrid search for section RAG and the broader semantic vs keyword tradeoff. The pattern is not “always hybrid.” It is “match the mode to the query and the granularity.”
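For reference, Reciprocal Rank Fusion itself is small. This sketch uses the standard formula, where each result scores the sum of 1 / (k + rank) over every ranking it appears in, with the conventional k = 60. The function name and the use of page numbers as result ids are illustrative choices, and pdf-mcp's own constants may differ.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse several best-first rankings of result ids into one ranking."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # A result earns more for ranking high and for appearing in several lists.
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


keyword_hits = [12, 3, 40]   # page numbers, best match first
semantic_hits = [3, 7, 12]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# fused == [3, 12, 7, 40]
```

Pages 3 and 12 rise to the top because both rankings contain them. That also shows where the noise comes from: a page that only one ranking found, such as 7 or 40, still lands in the fused list.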
Applying These Without pdf-mcp
None of these five depend on my server. They are properties of how an agent and a document interact, and you can hold any PDF tool to them. A short checklist for evaluating one:
- Can the agent inspect structure without reading content? If the only tool is read_pdf, the scout pattern is impossible and large documents will blow context.
- Is the tool surface decomposed by interaction step? Count the decisions the agent must make before its first call. More than one or two is a monolith in disguise.
- Does extraction persist across conversations, and what invalidates it? Ask specifically what happens when the source file changes and when the extractor changes. Many tools handle neither.
- Does it recover reading order, or just extract text? Hand it a two-column paper and a vertical-script page, then compare the output to what you see on screen. If it has never been tested against layout, assume it scrambles.
- Can you control the search mode? A single hardcoded mode will fail on either identifiers or concepts. You want to choose.
A tool that passes all five will make agents efficient without a single custom prompt. A tool that fails them will produce wasteful, occasionally wrong behavior no matter how carefully you craft your system message. Prompt engineering cannot rescue a tool layer that forces agents into bad decisions. The fastest way to improve a document agent is usually not a better prompt. It is a better seam.
Try It On Your Own Documents
pdf-mcp is open source. The fastest way to see these patterns in action, with nothing to install, is the live browser demo: upload a PDF and watch an agent inspect, search, and read only the pages it needs. On its sample contract, two pages answered the question while 98.6% of the document’s tokens never entered context. To run it for real, install it and point it at your own documents:
pip install pdf-mcp
claude mcp add pdf-mcp -- pdf-mcp
The code lives on GitHub.
What to Read Next
Depending on what you are building:
- Shipping your first pattern? Start with the pdf-mcp build story for a reference implementation of the scout and decomposition patterns.
- Building your own MCP server? How to build an MCP server in Python walks through the tool seams from scratch.
- Going deep on retrieval? The reading-order and search-mode patterns each have a dedicated benchmark: section vs page chunking and BM25 vs hybrid search.
Five patterns, 26,000 downloads of feedback, and several wrong turns. Yours can skip the wrong turns.