How to Evaluate an MCP Server With an LLM: 17 Bugs Found and Fixed | Kevin Tan

The standard advice for testing an MCP server is unit tests for tool logic, schema tests for contract drift, and MCP Inspector for manual exploration. That advice is correct. It also misses an entire class of bug: the bugs that only appear when an LLM tries to use your server to do something. I evaluated my own redmine-mcp-server by driving it as an agent across eight rounds, and the exercise surfaced 17 bugs that my existing 830-test suite had not caught. Sixteen of the fixes shipped within the first week. The seventeenth followed shortly after.

I run a few open-source MCP servers: pdf-mcp (26k+ PyPI downloads), redmine-mcp-server (19k+ PyPI downloads, OAuth2, multi-contributor), and a few smaller ones. All of them have unit tests. All of them have schema tests. None of those tests would have found what eight rounds of LLM-driven probing found. Each one started from the same foundation I documented in how to build an MCP server in Python; this post is what comes after the build, once real agents start using it.

TL;DR: Unit tests prove each tool works. Schema tests prove the contract doesn’t drift. MCP Inspector lets you click through tools manually. None of these catch the bugs that break agent reasoning: discovery tools that lie, hints that promise features that don’t exist, silent acceptance of garbage inputs, cross-tool inconsistencies. The way to find those is to drive your server like an agent would. I did that to my own server and shipped 17 fixes.

If your MCP server passes unit tests but agents still get confused, the bug is not in the tools. It is in the gaps between them.

Testing layers, with the LLM-as-driver pass added on top of unit tests, schema tests, and MCP Inspector

The common failure pattern

Most MCP server testing looks like this:

Tool A unit test → passes
Tool B unit test → passes  
Schema test → no drift
Inspector manual call → works
Ship it

That covers each tool in isolation. It does not cover the agent’s experience of using the server.

An agent reads tool A’s output to decide how to call tool B. If A says one thing and B enforces another, both tests pass and the agent fails. An agent reads error hints to recover from mistakes. If a hint promises a recovery path that does not exist, both tests pass and the agent loops. An agent looks at a schema to know what’s valid. If the runtime is more permissive than the schema or rejects values the docstring promises, both tests pass and the agent learns wrong rules.

These are real bugs. Unit tests cannot see them because there is no single tool to test against. I will show you four from my own server.

The LLM-as-driver pattern

The LLM-as-driver loop: give a task, watch the trace, collect a friction report, ship fixes, repeat

This runs after the standard pyramid, not instead of it. Unit tests, schema tests, and Inspector still gate every PR; the LLM-as-driver pass is the layer on top that catches what those cannot see.

Here is the method.

1. Give an LLM access to your MCP server with no other context.
2. Ask it to do a realistic task that exercises multiple tools.
3. After each round, ask it to report every moment of confusion, every retry, every workaround it had to invent.
4. Fix the bugs.
5. Repeat with the same harness on the next task.

That is it. The friction lives in the reports. Anything that makes the agent stop and re-plan is signal. Anything that makes it call a discovery tool more than once is signal. Anything that makes the response too noisy to scan is signal.
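Two of those signals can be pulled out of a tool-call trace mechanically. The sketch below is a minimal illustration, not the harness used for this evaluation: the trace format (a list of dicts with `tool`, `args`, and `ok` keys) and the function name `friction_signals` are assumptions made for the example.

```python
import json
from collections import Counter


def friction_signals(trace, discovery_tools):
    """Return human-readable friction signals found in one round's trace.

    `trace` is assumed to be a list of {"tool", "args", "ok"} dicts;
    that shape is hypothetical, not a format any MCP client emits.
    """
    signals = []

    # Signal 1: a discovery tool was called more than once.
    calls = Counter(step["tool"] for step in trace)
    for tool in sorted(discovery_tools):
        if calls[tool] > 1:
            signals.append(f"discovery tool {tool} called {calls[tool]} times")

    # Signal 2: a call that already failed was repeated with identical arguments.
    seen_failures = set()
    for step in trace:
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        if not step["ok"]:
            if key in seen_failures:
                signals.append(
                    f"failed call to {step['tool']} repeated with identical arguments"
                )
            seen_failures.add(key)
    return signals


trace = [
    {"tool": "list_project_issue_custom_fields", "args": {"project_id": 1}, "ok": True},
    {"tool": "create_redmine_issue", "args": {"project_id": 1, "subject": "demo"}, "ok": False},
    {"tool": "list_project_issue_custom_fields", "args": {"project_id": 1}, "ok": True},
    {"tool": "create_redmine_issue", "args": {"project_id": 1, "subject": "demo"}, "ok": False},
]
signals = friction_signals(trace, {"list_project_issue_custom_fields"})
```

A script like this only flags where to look. The friction report written by the LLM is still the primary source, because it explains why the agent re-planned.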

The reason this works is that the LLM is exactly the consumer your server is built for. Every confusion the LLM reports is a confusion that production agents will hit at scale. Your unit tests do not have opinions about your API ergonomics. The LLM does.

I ran this against redmine-mcp-server across eight rounds. The bugs below are four representative findings from those rounds. There were thirteen more.

The evaluation harness

Nothing exotic. A recent Claude model driving the server through Claude Code’s MCP client, one realistic task per round, no other context loaded. Each round was supervised: I watched the trace live and intervened only to ask for the post-task friction report. Prompts were standardized across rounds (same task template, same “list every confusion, retry, and workaround” closing instruction) so the friction reports were comparable. Reproducing this setup costs a few hours per server.
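Standardizing the prompt can be as simple as a fixed template with one variable slot. This is a sketch under assumptions: the exact wording of the prompts is not reproduced here, and `build_round_prompt` and `CLOSING_INSTRUCTION` are illustrative names.

```python
# The closing instruction is identical in every round so the
# friction reports stay comparable. The wording here is illustrative.
CLOSING_INSTRUCTION = (
    "When the task is done, list every confusion, retry, and workaround "
    "you encountered while using the tools."
)


def build_round_prompt(task: str) -> str:
    """Combine one round's task with the fixed closing instruction."""
    return f"Task: {task.strip()}\n\n{CLOSING_INSTRUCTION}"


prompt = build_round_prompt("Create an issue in project 1 and log time against it.")
```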

Bug 1: The discovery tool that lied

The first call I made:

list_project_issue_custom_fields(project_id=1)
→ [
    {"id": 1, "name": "Priority Level", "is_required": false},
    {"id": 2, "name": "Department",     "is_required": false}
  ]

Both fields say is_required: false. So I tried to create an issue without setting either:

create_redmine_issue(project_id=1, subject="...")
→ {"error": "Validation failed: Department cannot be blank"}

The discovery tool said the field was optional. The create call said it was required. Both were doing their job correctly in isolation. The bug was that they disagreed.

The root cause turned out to be that Redmine’s is_required flag reflects only the field-definition setting. Workflow rules, role-based field permissions, and tracker-bound required-field settings can also make a field required at runtime, and the bare API does not surface any of that. My discovery tool was reporting the field-definition flag and calling it the answer.

Unit tests for list_project_issue_custom_fields passed forever because the response was correctly shaped. Unit tests for create_redmine_issue passed forever because the validation error was correctly handled. There is no unit test you can write against a single tool that detects “these two tools have inconsistent views of reality.”
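A test that spans both tools can see the disagreement. The sketch below uses in-memory fakes that mimic the behavior described above; the fake functions, their signatures, and `discovery_disagreements` are assumptions for illustration, not the server's real code or test suite.

```python
def fake_list_fields(project_id):
    # Mimics the discovery tool: reports the field-definition flag only.
    return [
        {"id": 1, "name": "Priority Level", "is_required": False},
        {"id": 2, "name": "Department", "is_required": False},
    ]


def fake_create_issue(project_id, subject, custom_fields=None):
    # Mimics the create tool: a runtime rule makes Department mandatory.
    provided = custom_fields or {}
    if "Department" not in provided:
        return {"error": "Validation failed: Department cannot be blank"}
    return {"id": 101}


def discovery_disagreements(list_fields, create_issue, project_id):
    """Create an issue using only the fields discovery marks as required.

    Returns the names of fields that discovery called optional but the
    create call complained about. An empty list means the tools agree.
    """
    fields = list_fields(project_id)
    required = {f["name"]: "probe" for f in fields if f["is_required"]}
    result = create_issue(project_id, "consistency probe", required)
    if "error" not in result:
        return []
    optional = [f["name"] for f in fields if not f["is_required"]]
    return [name for name in optional if name in result["error"]]


disagreements = discovery_disagreements(fake_list_fields, fake_create_issue, 1)
```

Against a live Redmine the probe would create a real issue when the tools agree, so a check like this belongs in a disposable test project.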

The fix shipped two changes. First, the docstring on the discovery tool now explicitly says is_required reflects only the field-definition setting, and links to a documented workaround. Second, when the create tool returns a “cannot be blank” validation error, it now augments the response with a structured missing_required_fields: ["Department"] array and a hint that names the recovery path. An agent that hits this no longer has to re-read the discovery docstring. The next call works.
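The second change might look roughly like the following. It is a sketch that assumes the upstream message is a comma-separated list of "<Field> cannot be blank" clauses; the function name, the `VALIDATION_FAILED` code, and the hint wording are hypothetical, and only the `missing_required_fields` key comes from the fix described above.

```python
PREFIX = "Validation failed:"
SUFFIX = " cannot be blank"


def augment_validation_error(message: str) -> dict:
    """Turn a 'cannot be blank' message into a structured error envelope."""
    body = message[len(PREFIX):] if message.startswith(PREFIX) else message
    missing = [
        part.strip()[: -len(SUFFIX)]
        for part in body.split(",")
        if part.strip().endswith(SUFFIX)
    ]
    envelope = {"error": message, "code": "VALIDATION_FAILED"}  # code is hypothetical
    if missing:
        envelope["missing_required_fields"] = missing
        envelope["hint"] = (
            "Supply values for: " + ", ".join(missing) + ", then retry the create call."
        )
    return envelope


envelope = augment_validation_error("Validation failed: Department cannot be blank")
```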

I only realized this needed fixing after watching the agent waste a turn re-checking the discovery tool, find no contradiction, and try the same broken create call again.

Bug 2: The hint that promised a feature that did not exist

Round eight. I was verifying a fix from round seven. The fix had added a missing_required_fields array to validation errors, plus a hint string that explained how to recover. The hint said:

"Recovery: pass values for the listed fields via the `fields` parameter 
 (custom-field-name lookup is supported)..."

So I followed the hint. I tried fields={"custom_fields": [{"name": "Department", "value": "Engineering"}]}. Failed. Tried fields={"Department": "Engineering"}. Failed. Every shape the hint suggested came back rejected.

The hint promised name-based custom-field lookup. Name-based lookup was implemented for update_redmine_issue. It was not implemented for create_redmine_issue. The hint had been copy-pasted into the wrong tool’s error path.

This is the bug I am most embarrassed about. A unit test for the hint string would have passed: the string contains the word “lookup”, it is formatted correctly, no errors are thrown. The bug only exists if you try to follow the hint and find the path it describes does not work.

Two fixes shipped. First, the broken hint was rewritten to describe a recovery path that actually works on the create side. Second, a new issue was filed (and shipped two days later) to add name-based lookup to the create tool. The update-side helper was refactored into a shared _resolve_named_custom_fields function called from both paths, so the next change touches one site instead of two.

The general lesson is that a hint string is part of your API contract. A hint that promises a feature that does not exist is worse than no hint at all, because an agent that sees “lookup is supported” will spend turns trying call shapes that cannot work.

Bug 3: Silent acceptance of garbage

This one nearly slipped past me.

list_redmine_issues(assigned_to_id="notmeortheid")
→ []

The parameter type was Optional[Union[int, str]]. The str branch existed to accept the literal "me" (Redmine’s sentinel for “the current user”). But because the type was a permissive str, anything matched. Garbage strings were passed straight through to Redmine’s filter API, which then returned an empty list because no user was named "notmeortheid".

For an agent driving the tool, the result is indistinguishable from “no issues match this assignee.” The agent does not know it passed a malformed argument. It assumes the filter ran correctly and there were no results.

I call this the silent failure pattern: an invalid input that returns plausible-but-wrong output instead of an error. It is one of the worst classes of bug in agent tooling because it does not produce a retry signal. The agent moves on with a wrong conclusion.

The fix was a tightened type signature. Optional[Union[int, str]] became Optional[Union[int, Literal["me"]]]. Now the boundary middleware rejects anything that is neither an integer nor the exact literal "me":

list_redmine_issues(assigned_to_id="notmeortheid")
→ {
    "error": "Invalid value for parameter 'assigned_to_id': Input should be a 
              valid integer or Input should be 'me'",
    "code": "INVALID_ARGUMENTS",
    "hint": "Got 'notmeortheid' (type=str)..."
  }
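The same rule can be written out by hand. The server relies on its boundary middleware and the type annotation for this, so the function below is only a stdlib stand-in showing the check and the envelope shape; `check_assigned_to_id` and its error wording are hypothetical.

```python
from typing import Optional


def check_assigned_to_id(value: object) -> Optional[dict]:
    """Return None if the value is acceptable, else an error envelope."""
    valid = (
        value is None
        or value == "me"
        # bool is a subclass of int in Python, so exclude it explicitly.
        or (isinstance(value, int) and not isinstance(value, bool))
    )
    if valid:
        return None
    return {
        "error": "Invalid value for parameter 'assigned_to_id': "
                 "expected an integer or 'me'",
        "code": "INVALID_ARGUMENTS",
        "hint": f"Got {value!r} (type={type(value).__name__})",
    }


rejected = check_assigned_to_id("notmeortheid")
```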

Same fix pattern shipped to list_time_entries.user_id. Audit found two affected tools. Both fixed in one PR. The drift guard test pins the schema so a future “just accept any string for convenience” refactor breaks CI.
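A drift guard can be as small as a test that pins the annotation. The sketch below pins the Python type hint rather than the generated JSON schema, which is an assumption about how such a guard might be written; the stand-in function exists only so the example runs on its own.

```python
from typing import Literal, Optional, Union, get_type_hints


def list_redmine_issues(
    assigned_to_id: Optional[Union[int, Literal["me"]]] = None,
) -> list:
    """Stand-in for the real tool; only the signature matters here."""
    return []


def test_assigned_to_id_annotation_is_pinned():
    hints = get_type_hints(list_redmine_issues)
    # Fails if someone widens the parameter back to a bare str.
    assert hints["assigned_to_id"] == Optional[Union[int, Literal["me"]]]
    assert hints["assigned_to_id"] != Optional[Union[int, str]]
```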

Bug 4: The meta-bug

This one is embarrassing in a different way, and it is the most useful.

By round seven, the verification loop had a recurring problem. I would identify a bug from the LLM’s friction report, ship the fix, and re-probe at the MCP boundary, and the old behavior would still appear. Was the fix wrong, or was the deployment stale?

I had no way to tell. The MCP server did not expose its version anywhere. The bug I was diagnosing was that I had no way to diagnose deployment lag.

I shipped a new tool, get_mcp_server_info, that returns the package version, the auth mode, the read-only flag, and the set of enabled plugin flags. An agent verifying a fix can call this first, compare server_version to the version it expects, and either skip the verification (pre-fix build) or run it (post-fix build).
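The version gate an agent (or a verification script) applies could be sketched like this. It assumes plain dotted numeric versions such as "2.0.0"; `parse_version` and `should_verify` are illustrative names, and pre-release suffixes would need a real version parser.

```python
def parse_version(text: str) -> tuple:
    """Parse a dotted numeric version like '2.0.0' into (2, 0, 0)."""
    return tuple(int(part) for part in text.split("."))


def should_verify(server_info: dict, fix_version: str) -> bool:
    """True if the running build is new enough to contain the fix."""
    return parse_version(server_info["server_version"]) >= parse_version(fix_version)


# server_info stands in for the response of get_mcp_server_info.
run_check = should_verify({"server_version": "2.0.0"}, "1.10.0")
```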

The first time I tried to call it after restart, the entire server hung. Every tool timed out for 4 minutes.

The new tool’s startup path was probing every plugin endpoint to populate plugin_flags. One of those endpoints had no timeout. When it did not respond, the whole boot hung. The diagnostic tool I had built to detect deployment problems was the deployment problem.

I rebuilt the plugin probe to be lazy and timeout-bounded. The server came back. The next time I probed for a fix, get_mcp_server_info returned server_version: 2.0.0 and the verification loop closed cleanly.
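"Lazy and timeout-bounded" can be sketched with the standard library alone. This is not the server's implementation: the class and function names are assumptions, the probes are plain callables standing in for HTTP requests, and a real probe should also set a timeout on the request itself.

```python
import threading
import time


def call_with_timeout(fn, timeout_s):
    """Run fn in a daemon thread and stop waiting after timeout_s seconds."""
    result = {}

    def runner():
        try:
            result["value"] = fn()
        except Exception as exc:  # report probe failures instead of raising
            result["error"] = repr(exc)

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The thread is abandoned, not killed; it cannot block the caller.
        return {"status": "timeout"}
    if "error" in result:
        return {"status": "error", "detail": result["error"]}
    return {"status": "ok", "value": result["value"]}


class LazyPluginFlags:
    """Probe plugins on first request rather than at server boot."""

    def __init__(self, probes, timeout_s=2.0):
        self._probes = probes
        self._timeout_s = timeout_s
        self._cache = None

    def get(self):
        if self._cache is None:
            # Probes run one after another, so the worst case is
            # len(probes) * timeout_s, paid once on the first call.
            self._cache = {
                name: call_with_timeout(probe, self._timeout_s)
                for name, probe in self._probes.items()
            }
        return self._cache


flags = LazyPluginFlags(
    {"fast_plugin": lambda: True, "hung_plugin": lambda: time.sleep(1)},
    timeout_s=0.05,
).get()
```

Constructing `LazyPluginFlags` does no I/O, so nothing in the boot path can hang on a plugin endpoint.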

The general pattern is the diagnostic-tool boundary: any tool that exposes server state to an agent is itself part of the agent-facing surface. It needs the same care as your business tools, plus the discipline that it must never block boot. If your version-info tool can crash your server, you have one tool, not two.

What I shipped

Eight rounds of probing. Seventeen issues filed. All seventeen shipped. Three hundred ninety-one new tests pinning the new behavior. Two API patterns codified as drift-guarded standards: the structured error envelope ({error, code, hint}) and the two-phase destructive operation (preview the blast radius before commit). The full inventory is in the redmine-mcp-server changelog.
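The two-phase destructive operation can be illustrated with a small in-memory sketch. The confirm-token mechanism, the `delete_issues` name, and the `CONFIRMATION_MISMATCH` code are assumptions chosen for the example; the changelog is the reference for how the server actually does it. The error branch uses the {error, code, hint} envelope.

```python
import hashlib


def _token(issue_ids):
    """Derive a short token from the exact set of issues to be deleted."""
    joined = ",".join(str(i) for i in sorted(issue_ids))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]


def delete_issues(store, issue_ids, confirm_token=None):
    """Phase 1 (no token): preview. Phase 2 (matching token): commit."""
    targets = sorted(i for i in set(issue_ids) if i in store)
    expected = _token(targets)
    if confirm_token is None:
        return {"preview": True, "would_delete": targets, "confirm_token": expected}
    if confirm_token != expected:
        return {
            "error": "Confirmation token does not match the current preview",
            "code": "CONFIRMATION_MISMATCH",
            "hint": "Call again without confirm_token to get a fresh preview.",
        }
    for issue_id in targets:
        del store[issue_id]
    return {"deleted": targets}


store = {1: "a", 2: "b", 3: "c"}
preview = delete_issues(store, [1, 2, 9])
```

Because the token is derived from the target set, a commit fails if the set changed between preview and commit, which forces the agent to look at the new blast radius first.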

The thirteen I did not detail above followed the same patterns: inconsistent pagination across list tools, missing retry hints on transient upstream errors, ambiguous destructive-operation previews, schema-vs-runtime mismatches on optional fields, plugin capability drift between environments, duplicate-key behavior that differed by tool, and a handful of error envelopes that pre-dated the standard and had to be retrofitted. None of them needed clever fixes. All of them needed the eval to notice they existed.

The drift guards are the real ship. The bugs the eval found will recur if someone reverts to “the old way” for simplicity. The tests pinning the new behavior are how I make sure they don’t. They are also evidence, for the next person doing this kind of eval, that the changes were deliberate.

The important part was not fixing seventeen bugs. It was discovering that agent-facing reliability lives in the consistency between tools, not just the correctness inside them.

What this method does not catch

The LLM-as-driver pattern is not a substitute for the rest of your testing pyramid. It will not find:

Performance regressions. An LLM driver makes a handful of calls. It cannot tell you that a tool got 200 ms slower under load.
Concurrency bugs. Race conditions, partial-update inconsistencies, lock contention. Single-threaded probing will not surface them.
Long-horizon state corruption. Bugs that only appear after thousands of writes need a different harness.
Security vulnerabilities. An LLM driver might stumble onto a missing permission check, but it is not adversarial. Use a security audit for that.

What it finds is the class of bug that breaks agent reasoning: ergonomics, consistency, discoverability, hint accuracy, error envelope shape. That is a real class of bug. It is also the class that production agent traces will show you, repeatedly, after launch.

Try it on your own server

If you have an MCP server in production, this evaluation takes a few hours and costs almost nothing. The protocol:

1. Pick a non-trivial task that touches at least three of your tools. “Create a record, attach a file, link it to a related record” is a good template.
2. Give an LLM with no prior context access to your server.
3. Watch where it gets confused. Watch where it has to call a discovery tool twice. Watch where the error envelopes are inconsistent.
4. After each task, ask the LLM to write a structured report of every friction point.
5. Fix the bugs that fall out. Most are smaller than you expect. Discovery-tool docstrings. Error envelope shape. Type tightenings. Hint accuracy.

These are not the bugs your existing tests fail on. They are the bugs your existing tests cannot see. That is the point.

If your MCP server passes unit tests but agents still get confused, the bug is not in the tools. It is in the gaps between them.

Drive the server like an agent does.

Find the gaps.

Fill them.

Code and full issue history for the evaluation: redmine-mcp-server on GitHub. For the broader pattern of building MCP servers that agents can drive reliably, see How to Ship an MCP Server to Production and The Production AI Agent Playbook.


Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
