Claude was halfway through a PDF retrieval task when the tool response came back. The matches array had five items in it. The total_matches field, sitting one line below, said 0. The model had to decide which half of its own input to trust. No test had failed. No schema was malformed. No exception was raised. The bug lived in the experience of receiving the whole payload, not in any single field of it. Real LLM use surfaces a class of bugs conventional schema and field-level tests rarely catch, because the failure is the consumer’s, not the producer’s.
I ship pdf-mcp, an open-source MCP server for large-PDF workflows that I use through Claude Desktop and Cursor every day. The four bugs in this post all surfaced in that daily use, across releases 1.11.0 through 1.12.1.
TL;DR: Schema tests assert that fields are correct. They do not assert that the whole payload is usable to an LLM caller. Four schema and response-shape bugs in pdf-mcp surfaced not in CI but in Claude Desktop, because Claude tripped over them during real PDF work. “Free QA” means free of additional effort on top of work you were already doing. It does not replace your test suite. It tells you which bugs your test suite never had a chance of seeing.
Verdict: Your tests check that fields are correct. Your LLM checks that the payload is usable. Those are not the same job.
Payload UX
What Claude saw in that hybrid search response was not a unit-test failure, not a schema-validation failure, not a crash. It was a consistency failure between two fields of the same response, only visible because something downstream had to act on both at once. Schema tests check each field in isolation. The LLM consumes the whole payload as a single artifact and reasons across it.
I have started calling this surface Payload UX: the experience of receiving the whole response as an LLM caller that has to act on it without re-reading the docs. Producers have an API. Consumers have a payload. The two are not the same thing, and the gap between them is where the bugs in this post live.
Four of them surfaced in pdf-mcp during one stretch of daily use:
- A hybrid search response whose two count fields contradicted each other.
- A pdf_info default that returned 56 metadata objects on a 56-page PDF.
- An auto-mode search that silently degraded to keyword when the embedding model would not load.
- A heuristic section-title detector that returned mid-paragraph fragments as titles.
None of these failed a test. All of them got in Claude’s way.
Why my tests missed these
pdf-mcp has tests. They pass. They cover individual fields, return-type contracts, error paths, and a respectable share of branching logic. They do not cover the experience of receiving the whole payload as an LLM caller. Four concrete reasons, one per bug above:
- Field-level coverage does not catch cross-field contradictions. Test 1 asserts matches is a list of the right shape. Test 2 asserts total_matches is a non-negative integer. Neither test asserts the two agree, because no one writes that test until they see the bug.
- Tests assert fields are correct when present. They rarely assert fields are present when they should be. The silent fallback bug had no test because the missing field had no name yet.
- Heuristic outputs are hard to assert tightly. The strongest assertion you can write against a section-title detector without overfitting is “title is a non-empty string.” Mid-paragraph fragments clear that bar.
- Tests have no token budget. The 56-page response was correct. It was just expensive to receive, and “expensive to receive” is not a property unit tests measure.
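Once a cost like this has been noticed, it can be pinned down after the fact. The sketch below is one rough way to do that, not something pdf-mcp is known to ship: it uses serialized character count as a crude proxy for token cost. The helper names, the 500-character budget, and the per-page keys are all assumptions made for illustration.

```python
import json
from typing import Any


def approx_payload_chars(payload: Any) -> int:
    """Serialized length in characters; a crude stand-in for token cost."""
    return len(json.dumps(payload, separators=(",", ":")))


def assert_within_budget(payload: Any, max_chars: int) -> None:
    """Fail loudly when a default response grows past the agreed budget."""
    size = approx_payload_chars(payload)
    assert size <= max_chars, f"default payload is {size} chars, budget is {max_chars}"


# Illustrative payloads only; the per-page keys are made up for this sketch.
summary = {"page_count": 56, "has_toc": True, "encrypted": False}
per_page = [
    {"page": n, "rotation": 0, "image_count": 0, "text_length": 0}
    for n in range(1, 57)
]

# The summary fits the (arbitrary) budget; the 56-element array would not.
assert_within_budget(summary, max_chars=500)
```

A real budget would come from measuring what the caller can afford, and a tokenizer would give a better estimate than character count.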
The structured-testing companion to this post is the one on building CI for AI agents. That CI catches behavioural regressions and schema breakage. It is the load-bearing side of agent reliability. This post is about a different surface entirely. Tests assert correctness. Payloads assert usability. Those are different jobs, and the bugs below sit in the second one.
Four bugs the LLM surfaced
The four are arranged so the reader hits the most legible failure mode first, an easy second example next, the most insidious one third, and the most interpretive one last. Each subsection labels the Payload UX failure mode it represents, so the generalisation in the next section has hooks to reach back to.
The self-contradicting response (contradiction)
Hybrid search in pdf-mcp 1.11.0 returned this shape:
{
  "matches": [
    {"page": 8, "score": 4.21, "snippet": "..."},
    {"page": 12, "score": 3.97, "snippet": "..."},
    {"page": 3, "score": 3.55, "snippet": "..."},
    {"page": 19, "score": 3.12, "snippet": "..."},
    {"page": 22, "score": 2.88, "snippet": "..."}
  ],
  "total_matches": 0,
  "search_mode": "hybrid"
}
Five items in the array. total_matches: 0 next to them. The producer-side bug was simple. total_matches was being filled from the BM25 leg’s hit count, before fusion replaced the result set. The two fields had different upstream sources and no one had wired them together.
The consumer-side effect was not simple. Claude paused on the contradiction. In some sessions it re-issued the query with different parameters to disambiguate. In others it caveated the answer (“the tool returned five matches but reports zero, so I will summarise the matches with low confidence”). Both responses were honest. Both wasted tokens and time on a bug that no field-level test was ever going to see.
Fixed in 1.12.0: total_matches is now computed from the matches array after fusion, in one place, with a regression test that explicitly checks len(matches) == total_matches.
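A minimal sketch of the shape of that fix and its regression test. The helper name `build_search_response` is hypothetical, not pdf-mcp's actual function. The point is only that the count is derived from the same list the caller receives:

```python
from typing import Any


def build_search_response(
    fused_matches: list[dict[str, Any]], search_mode: str
) -> dict[str, Any]:
    """Assemble the response after fusion; the count has exactly one source."""
    return {
        "matches": fused_matches,
        # Derived from the returned list, not from either search leg.
        "total_matches": len(fused_matches),
        "search_mode": search_mode,
    }


def test_total_matches_agrees_with_matches() -> None:
    fused = [
        {"page": 8, "score": 4.21, "snippet": "..."},
        {"page": 12, "score": 3.97, "snippet": "..."},
    ]
    response = build_search_response(fused, "hybrid")
    # The cross-field assertion that no field-level test was making.
    assert len(response["matches"]) == response["total_matches"]
```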
The token-tax response (default-payload tax)
pdf_info on a 56-page PDF in 1.11.0 returned a 56-element array. Each element carried per-page metadata: dimensions, rotation, font list, image count, text length. The shape was internally consistent and the schema was tight. It was also 56 objects of metadata Claude almost never needed.
The default response shape is a UX decision, not just a correctness one. If your default is “everything,” you have made the LLM pay for completeness it did not ask for. In practice Claude was burning context on a metadata blob every time it touched a new PDF, before doing the actual search the user had asked for.
Fixed in 1.12.0 with a new detail parameter that defaults to False:
{
  "page_count": 56,
  "title": "Language Models are Few-Shot Learners",
  "size_bytes": 8421344,
  "has_toc": true,
  "encrypted": false
}
The full per-page array is still available with detail=True. The default just stopped charging for it.
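The pattern can be sketched as below. This is an illustration under assumptions rather than pdf-mcp's implementation: `build_pdf_info`, the in-memory `document` dict, and the `pages` key are hypothetical names chosen for the example.

```python
from typing import Any


def build_pdf_info(document: dict[str, Any], detail: bool = False) -> dict[str, Any]:
    """Return a cheap summary by default; per-page metadata only on request."""
    pages = document["pages"]
    info: dict[str, Any] = {
        "page_count": len(pages),
        "title": document.get("title"),
        "size_bytes": document["size_bytes"],
        "has_toc": bool(document.get("toc")),
        "encrypted": bool(document.get("encrypted", False)),
    }
    if detail:
        # Hypothetical key name; the caller opted in, so the caller pays.
        info["pages"] = pages
    return info
```

The design choice is that the expensive path is opt-in: a caller that never passes `detail=True` never receives the per-page array.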
The silent fallback (silent degradation)
This one is the worst class of bug in the post, and the one I would have shipped indefinitely without LLM signal.
pdf_search(mode="auto") is supposed to pick keyword or semantic search based on the query and the index state. When the local embedding model failed to load (cold cache, network blip, OOM during init), auto mode caught the failure and fell back to keyword. Gracefully. Silently. The response shape was identical to a successful semantic search. The payload gave the caller no way to know.
Claude consumed those responses believing it had semantic search behind them. Sometimes the keyword result was good enough. Sometimes it was not, and the answer was wrong in a way no one was going to catch from the outside. The bug had the worst possible operational signature: it was invisible in tests because the fallback path was working as designed, and it was invisible in production logs because the response was shaped correctly.
It was visible in Claude’s behaviour. Once I noticed a pattern of paraphrase-heavy queries returning nothing useful on a PDF I knew contained the answer, I pulled the response payload and saw what the test suite could not see: there was no field in the response that said which mode actually ran.
Fixed in 1.12.0 by adding two fields:
{
  "matches": [...],
  "semantic_unavailable": true,
  "semantic_unavailable_reason": "embedding_model_load_failed"
}
If a capability degraded, the response says so. That is the rule. The LLM downstream now has correct beliefs about what mode it is in, and that is the only thing that fixes a silent failure.
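A simplified sketch of a fallback that names itself. It assumes a hypothetical `EmbeddingModelLoadError` exception and injected search callables, and it always tries semantic search first, which is simpler than the real auto-mode selection. Whether the success path also carries the two fields is an assumption of this sketch.

```python
from typing import Any, Callable

Matches = list[dict[str, Any]]


class EmbeddingModelLoadError(RuntimeError):
    """Hypothetical error raised when the embedding model cannot be loaded."""


def auto_search(
    query: str,
    keyword_search: Callable[[str], Matches],
    semantic_search: Callable[[str], Matches],
) -> dict[str, Any]:
    """Fall back to keyword search on model failure, and say so in the payload."""
    try:
        matches = semantic_search(query)
    except EmbeddingModelLoadError:
        return {
            "matches": keyword_search(query),
            "semantic_unavailable": True,
            "semantic_unavailable_reason": "embedding_model_load_failed",
        }
    return {
        "matches": matches,
        "semantic_unavailable": False,
        "semantic_unavailable_reason": None,
    }
```

The fallback itself is unchanged from the graceful version. The only difference is that the response no longer looks identical to a successful semantic search.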
The section titles that were not (unprovenanced heuristics)
The section-aware search feature uses a TOC when present and a heuristic detector when not. The heuristic detector is a 7-signal weighted classifier (font face delta, bold marker, whitespace above, top-of-page position, title case, short line length, regex match for numbered headings). On a TOC-less arXiv preprint in 1.11.0, it returned this section title for a real query:
{
  "title": "the methodology used in this study suggests",
  "start_page": 6,
  "end_page": 8,
  "score": 3.41
}
That is not a heading. It is the start of a sentence from the middle of paragraph 3, which happened to land at the top of a page in a font face that matched the heuristic’s “different from body” check. The detector fired with a borderline score, and the response surfaced the result as a section title with no flag that it was a guess.
The downstream effect was interpretive failure. Claude had to decide whether “the methodology used in this study suggests” was a real section heading or a parsing artifact. Sometimes it flagged the title as suspicious. Sometimes it built follow-up retrieval queries on top of it.
A partial fix shipped in 1.12.0: a tighter heuristic threshold, plus a post-pass that merges body-text fragments into the preceding section. The full fix in 1.12.1 added provenance:
{
  "title": "2.3 Training Process",
  "title_source": "toc",
  "start_page": 9,
  "end_page": 9
}
title_source is one of "toc", "heading_detected", or null. The consumer can now reason about confidence. A TOC-sourced title is reliable. A heuristic-sourced title is a guess that happened to clear a threshold. A null source is the detector punting on a span it could not name. Three states, three different downstream policies, all visible in the payload.
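On the consumer side, the three states might map to policies like this. The policy names (`trust`, `verify`, `ignore_title`) are hypothetical labels invented for this sketch; the right policy is the consumer's call.

```python
from typing import Any, Optional


def title_policy(section: dict[str, Any]) -> str:
    """Map title_source to a downstream policy (policy names are illustrative)."""
    source: Optional[str] = section.get("title_source")
    if source == "toc":
        # Came from the document's own table of contents.
        return "trust"
    if source == "heading_detected":
        # A heuristic guess that cleared a threshold; check before building on it.
        return "verify"
    # null: the detector could not name this span.
    return "ignore_title"
```

For example, a caller could use a `"verify"` title for display while declining to build follow-up retrieval queries on top of it.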
What watching the LLM teaches you that tests cannot
The four bugs above are not a random sample. They cluster into four failure modes that other MCP server builders can apply as a checklist. The signal is never “the LLM says your server is buggy.” The signal is “the LLM hesitates, asks a clarifying question, retries with different parameters, or produces an answer that is quietly wrong.” Watch for these four shapes:
- Contradictions. Two fields of the same response telling you different things. Schema tests almost never check pairs of fields against each other. LLMs notice immediately, because they consume the whole payload.
- Silent degradation. Your server caught an internal failure and kept going, but the response does not say so. Schema honesty: if a capability degraded, return a field that names the degradation. semantic_unavailable: true is cheaper than perfect availability.
- Unprovenanced heuristics. Your server guesses (section detection, language detection, type inference, classification, anything probabilistic). Tell the caller it guessed, and how. title_source is cheaper than perfect title detection, because confidence policy belongs to the consumer.
- Default-payload tax. Your default response shape is a UX decision, not just a correctness one. If “everything” is the default, you have priced the LLM out of the cheap path. Optional fields and detail=False defaults shift the cost back to the caller that actually needs the detail.
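Three of the four shapes can be turned into whole-payload checks once they have names. The sketch below is an assumption-heavy illustration: `payload_ux_issues` and the `sections` key are hypothetical, and the field names are the ones from the pdf-mcp examples in this post. The default-payload tax is left out because it depends on a size budget rather than on field relationships.

```python
from typing import Any


def payload_ux_issues(response: dict[str, Any]) -> list[str]:
    """Return whole-payload problems that field-level checks would miss."""
    issues: list[str] = []
    matches = response.get("matches")
    total = response.get("total_matches")

    # Contradiction: two fields of one response must agree.
    if isinstance(matches, list) and total is not None and total != len(matches):
        issues.append("contradiction: total_matches disagrees with len(matches)")

    # Silent degradation: a search response should state whether it degraded.
    if isinstance(matches, list) and "semantic_unavailable" not in response:
        issues.append("silent degradation: semantic_unavailable is missing")

    # Unprovenanced heuristics: every title should say where it came from.
    for section in response.get("sections", []):  # hypothetical key
        if "title" in section and "title_source" not in section:
            issues.append(f"unprovenanced title: {section['title']!r}")

    return issues
```

Run against the 1.11.0 hybrid search payload shown earlier, this reports both the count contradiction and the missing degradation field.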
The operational instruction underneath all four is the same. When you use your own server through Claude for real work, write down every moment Claude hesitates, retries, or asks a clarifying question. Each of those moments is a Payload UX failure with a payload you can inspect. That is the loop.
What this does not replace
The “free QA” framing earns its keep only with the right qualifier. Free as in “you are already paying for it via real usage.” Not free as in “this replaces your test suite.” Four honest limits:
- It is unstructured. You find the bugs you happen to trip over. The bugs you do not trip over are still in there. Coverage is whatever your real workload looks like, no more.
- It is biased toward your tasks. If you only use your PDF MCP for English-language papers, you will not find the bugs that show up on Japanese PDFs.
- Tokens are not free. Real LLM sessions cost money. Structured tests, once written, are essentially free per run. “Free QA” means free of additional effort, not free of cost.
- Structured testing still matters. The CI-for-AI-agents post covers the load-bearing side of this. This post is about the incidental signal alongside it, not instead of it.
The case for “free QA” is that it costs zero additional effort on top of work you were already doing. That is the only claim being made.
What to do this week
The next time you open Claude Desktop and use your own MCP server for actual work, keep a scratch file open next to it. Every time Claude hesitates, asks a clarifying question, retries with different parameters, or produces an answer that is quietly wrong, write down the moment and the response payload that produced it. After a week, you will have a list of bugs that did not show up in your test suite, ranked by how often they actually got in your way. That ranking is more honest than any backlog you would have written from scratch, because the order was decided by use, not by guess.
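The scratch file can be as simple as a JSON Lines log. This is one possible sketch, with hypothetical function names and record fields; a plain text file works just as well.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def log_moment(path: str, kind: str, note: str, payload: Any) -> None:
    """Append one observed moment (hesitation, retry, ...) with its payload."""
    record = {
        "at": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "note": note,
        "payload": payload,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def rank_kinds(path: str) -> list[tuple[str, int]]:
    """Count logged moments by kind, most frequent first."""
    counts: Counter[str] = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["kind"]] += 1
    return counts.most_common()
```

After a week, `rank_kinds` gives the ordering described above: the moments that got in the way most often come first.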
The fixes above shipped in pdf-mcp releases 1.12.0 and 1.12.1. For the structured-testing companion that catches the bugs schema tests can see, see “I Built CI for My AI Agent.” For the production-shipping companion (auth, safety, what breaks), see “What It Actually Takes to Ship a Production MCP Server.”