Switching models is the first thing engineers reach for when AI agent costs get out of hand. I did the same. Then I ran the benchmarks and found out the model wasn’t the problem.
The problem was everything my agent kept in context that it never needed to read again.
I run blueclaw, a terminal AI agent I use for daily research tasks. It runs on Claude Sonnet, uses the Strands Agents SDK under the hood, and regularly runs 10-to-12-turn sessions fetching and analyzing web pages. After a few months of production use, I built a benchmarking harness to measure what was actually driving costs. The answer surprised me.
TL;DR: In my benchmarks, over 80% of tokens in a typical agent turn are tool output, not reasoning. Replacing old tool outputs with placeholders (“observation masking”) cut costs 21% on multi-step research tasks and up to 62% on retrieval-heavy workloads. The “smart” alternative (LLM summarization) costs more and slows agents down. It’s one config change in Strands Agents SDK.
Verdict: If your agent is fetching pages or calling APIs, old tool outputs are your biggest cost driver. Not the model.
The Short Answer
If your AI agent is expensive, it’s probably not the model.
In production agents:
- Most tokens come from tool outputs (web pages, API responses)
- These outputs are rarely reused after the turn they were fetched
- Keeping them in context inflates every subsequent request
Replacing old tool outputs with placeholders (“observation masking”):
- Cuts costs 20-60%
- Requires no model change
- Takes one config change in most frameworks
The Common Failure Pattern
Most agents are built like this:
```
Turn 1: fetch page A → model gets full page content
Turn 2: fetch page B → model gets full page A + full page B
Turn 3: fetch page C → model gets full page A + B + C
...
Turn 8: summarize → model gets all 7 pages, even though
        it only needs the last one
```
By turn 8, the agent is paying to process 7 full web pages in every request, even though it fetched page A seven turns ago and hasn’t referenced it since. The model read it once. Now it’s just weight.
This is context bloat. And it compounds. A 12-turn research session can accumulate 100,000+ characters of tool output, most of which is stale by the second half of the session.
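To see how fast this compounds, here's a back-of-the-envelope sketch. The page size and placeholder size are assumed round numbers for illustration, not measurements from blueclaw:

```python
# Rough model of context growth: every turn re-sends all prior tool
# output, so total processed chars grow quadratically with turns.
PAGE_CHARS = 12_000   # assumed average fetched-page size
TURNS = 12

unmanaged = sum(PAGE_CHARS * turn for turn in range(1, TURNS + 1))

# With masking (keep only the latest page, collapse older ones to a
# ~30-char placeholder), each turn re-sends one page plus stubs.
masked = sum(PAGE_CHARS + 30 * (turn - 1) for turn in range(1, TURNS + 1))

print(f"unmanaged: {unmanaged:,} chars processed")   # 936,000
print(f"masked:    {masked:,} chars processed")      # 145,980
```

The unmanaged total grows with the square of session length; the masked total grows linearly. That gap is the entire cost story.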
A common fix in agentic frameworks like OpenHands is LLM summarization: use a separate model call to compress old context into a summary. It sounds smart. The benchmarks say otherwise.
This kind of bloat rarely shows up in unit tests. It only emerges when you run full multi-turn trajectories against real workloads. See Testing AI Agents in Production for how I approach that.
The Complexity Trap
A 2025 NeurIPS workshop paper from JetBrains Research (“The Complexity Trap”, Lindenbauer et al.) tested both strategies on 500 SWE-bench Verified instances across five models.
Their finding: simple observation masking performs just as well as LLM summarization, at half the cost of an unmanaged agent.
More striking: LLM summarization made agents run 13-15% longer. When summaries compressed old context, agents lost signal about previous failures and repeated work they’d already done. The summary also consumed up to 7% of total inference cost depending on model configuration. You’re paying for a feature that makes your agent slower.
I ran my own benchmarks on blueclaw to verify. The pattern held.
What Observation Masking Actually Does
Instead of summarizing old tool outputs, masking replaces them with a placeholder:
[output omitted -- 14832 chars]
The agent keeps all its reasoning history intact. It still knows what it did, what tools it called, and what decisions it made. It just can’t re-read the raw output from 8 turns ago.
This works because agents need to remember their reasoning, not re-read their observations. The content of a web page fetched in turn 2 is irrelevant by turn 10. The agent’s decision based on that page is what matters, and that’s in the assistant messages, which masking never touches.
The implementation in blueclaw uses what I call the Observation Window: keep the last M=10 tool outputs intact, mask everything older. M=10 is the optimal value from the paper, and it’s what I use in production.
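The rule itself is tiny. A framework-free sketch of the Observation Window (the message shapes and field names here are illustrative, not any SDK's):

```python
def mask_old_observations(messages, keep_recent=10):
    """Keep the last `keep_recent` tool outputs; stub out the rest."""
    tool_msgs = [m for m in messages if m["type"] == "tool_output"]
    stale = tool_msgs[:-keep_recent] if keep_recent else tool_msgs
    for m in stale:
        m["text"] = f"[output omitted -- {len(m['text'])} chars]"
    return messages

history = [
    {"type": "tool_output", "text": "x" * 14832},             # stale page
    {"type": "assistant", "text": "Decided to compare pricing."},
    {"type": "tool_output", "text": "y" * 9000},              # recent, kept
]
mask_old_observations(history, keep_recent=1)
print(history[0]["text"])   # [output omitted -- 14832 chars]
print(history[1]["text"])   # Decided to compare pricing.
```

Note that the assistant message passes through untouched: reasoning survives, only stale observations are stubbed out.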
The Benchmarks
I built bench_context.py to run the same prompt sequences through mask and summarize strategies side-by-side, measuring per-turn tokens, cost, and time.
Full Research Session (12 turns, Claude Sonnet)
12 turns of competitive research on Rust vs Go: search queries, page fetches, synthesis.
| Metric | Mask | Summarize | Delta |
|---|---|---|---|
| Tokens | 1,252,477 | 1,641,180 | -23.7% |
| Cost | $3.996 | $5.069 | -21.2% |
| Steps | 101 | 65 | +55.4% |
| Time | 44 min | 54 min | -17.7% |
| Masked chars | 105,464 | 0 | n/a |
Note what happened with steps: masking used 55% more tool calls but cost 21% less. More steps don't mean more cost when each step processes less context. Masking's steps were cheap because each one carried a small context; summarization used fewer steps, but each was more expensive.
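The per-step economics fall straight out of the table:

```python
# Per-step cost from the benchmark table: masking took more steps,
# but each step carried less context and so cost far less.
mask_cost, mask_steps = 3.996, 101
summ_cost, summ_steps = 5.069, 65

print(f"mask:      ${mask_cost / mask_steps:.4f}/step")   # $0.0396/step
print(f"summarize: ${summ_cost / summ_steps:.4f}/step")   # $0.0780/step
```

A summarize step cost roughly twice as much as a mask step, which is why 55% more steps still came out 21% cheaper.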
Masking saved $1.07 on a single 12-turn session. At production volume, that compounds fast.
Workload Breakdown (Haiku model)
The savings vary significantly by workload type:
| Workload | Output size | Mask savings | Notes |
|---|---|---|---|
| Search only | 1-2k chars/result | ~7% | Modest; search results are small |
| Mixed (search + 1 page) | Varies | ~6% | One fetch doesn’t build much pressure |
| Retrieval-heavy | 5-20k chars/result | 62% | Pages explode context fast |
The pattern is clear: masking savings scale with tool output size. If your agent primarily uses search, you’ll see modest savings. If it fetches pages or processes large API responses, masking can cut costs by more than half.
My blueclaw sessions are retrieval-heavy. The agent regularly fetches full documentation pages. That’s why I see 21% savings on average, with spikes higher on sessions with many page fetches.
The Code (Strands Agents SDK)
This shipped in blueclaw v1.3. If you're using the Strands Agents SDK, it's a single config change. Here's the ObservationMaskingManager I built for blueclaw:
```python
from strands.agent.conversation_manager import (
    ConversationManager,
    SummarizingConversationManager,
)
from strands.hooks import BeforeModelCallEvent, HookRegistry

MASK_PLACEHOLDER = "[output omitted -- {n} chars]"


class ObservationMaskingManager(ConversationManager):
    """Replace old tool results with placeholders instead of summarizing.

    Based on Lindenbauer et al. 2025 "The Complexity Trap":
    observation masking preserves all reasoning while replacing
    distant environment observations with a size placeholder.
    """

    def __init__(self, mask_after: int = 10) -> None:
        super().__init__()
        self.mask_after = mask_after
        self._masked_chars = 0

    def register_hooks(self, registry: HookRegistry, **kwargs) -> None:
        super().register_hooks(registry, **kwargs)
        registry.add_callback(BeforeModelCallEvent, self._on_before_model_call)

    def _on_before_model_call(self, event) -> None:
        self._apply_masking(event.agent)

    def apply_management(self, agent, **kwargs) -> None:
        self._apply_masking(agent)

    def reduce_context(self, agent, e=None, **kwargs) -> None:
        # Aggressive fallback on context overflow: mask everything,
        # then let the SDK's summarizer compress what remains.
        self._apply_masking(agent, override_mask_after=0)
        SummarizingConversationManager().reduce_context(agent, e=e)

    def _apply_masking(self, agent, override_mask_after=None) -> None:
        messages = agent.messages
        if override_mask_after is not None:
            m = override_mask_after
        else:
            m = self.mask_after
        cutoff = self._find_mask_cutoff(messages, m)
        for i in range(cutoff):
            self._mask_tool_results(messages[i])

    def _mask_tool_results(self, message: dict) -> None:
        if message.get("role") != "user":
            return
        for block in message.get("content", []):
            if "toolResult" not in block:
                continue
            tr = block["toolResult"]
            items = tr.get("content", [])
            total = sum(len(item.get("text", "")) for item in items)
            if total == 0:
                continue
            if len(items) == 1:
                t = items[0].get("text", "")
                if t.startswith("[output omitted"):
                    continue  # Already masked
            self._masked_chars += total
            tr["content"] = [{"text": MASK_PLACEHOLDER.format(n=total)}]

    def _find_mask_cutoff(self, messages: list, keep_recent: int) -> int:
        if keep_recent <= 0:
            return len(messages)  # mask everything, newest included
        count = 0
        for idx in range(len(messages) - 1, -1, -1):
            msg = messages[idx]
            if msg.get("role") == "assistant" and any(
                "toolUse" in c for c in msg.get("content", [])
            ):
                count += 1
                if count >= keep_recent:
                    return idx
        return 0
```
Wire it into your agent:
```python
from strands import Agent

agent = Agent(
    model=model,
    tools=tools,
    conversation_manager=ObservationMaskingManager(mask_after=10),
)
```
Or if you’re using a config file approach (like blueclaw’s blueclaw.yaml):
```yaml
context:
  strategy: mask
  mask_after: 10  # Keep last 10 tool outputs intact
```
The BeforeModelCallEvent hook is important. It means masking happens before each model call within a multi-step turn, not just at turn boundaries. Without it, a 13-step turn would keep growing context within the turn before masking kicked in at the end.
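A simplified sketch of the difference, using an assumed 8k chars of tool output per step. This ignores context carried over from earlier turns, where per-call masking helps even more, so it understates the real gap:

```python
STEP_OUTPUT = 8_000   # assumed chars of tool output per step
STEPS = 13            # the 13-step turn from the text
WINDOW = 10           # mask_after: keep the last 10 outputs intact

# Masking only at turn boundaries: every model call within the turn
# re-sends all prior outputs from this turn.
turn_boundary_only = sum(STEP_OUTPUT * s for s in range(1, STEPS + 1))

# Masking before every model call: context is capped at WINDOW outputs.
per_model_call = sum(STEP_OUTPUT * min(s, WINDOW) for s in range(1, STEPS + 1))

print(f"{turn_boundary_only:,} vs {per_model_call:,} chars processed")
# 728,000 vs 680,000 chars processed
```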
When Each Strategy Wins
There is no single best strategy. It depends on your workload.
Use masking when:
- Your agent fetches web pages, documents, or large API responses
- Tool outputs are large (>5k chars each)
- Context dependency is low: later turns don’t need to re-read earlier tool outputs
- You want predictable, low-overhead context management
Use summarization when:
- Tool outputs are small (search snippets, short API responses)
- Context dependency is high: the agent needs to cross-reference earlier findings
- Session length is short (fewer than 5-6 turns)
Use hybrid when:
- Mixed workloads: start with masking and fall back to summarization after N=43 turns, the threshold the paper optimized empirically (Section 5.3) as the point where very long trajectories benefit from compression on top of masking
- The paper found hybrid adds 7% savings over masking and 11% over summarization for very long sessions
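Those rules of thumb condense into a picker. The thresholds are the ones from this section; the function name and signature are mine, for illustration:

```python
def pick_strategy(avg_output_chars, cross_references, expected_turns):
    """Map this section's rules of thumb to a context strategy."""
    if expected_turns > 43:                 # paper's hybrid threshold
        return "hybrid"
    if avg_output_chars > 5_000 and not cross_references:
        return "mask"                       # large, rarely re-read outputs
    if cross_references or expected_turns < 6:
        return "summarize"                  # small outputs, high dependency
    return "mask"                           # safe default

print(pick_strategy(12_000, False, 12))   # mask
print(pick_strategy(1_500, True, 4))      # summarize
print(pick_strategy(8_000, False, 60))    # hybrid
```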
For most production agents doing research or retrieval: masking is the right default. The Strands Agents SDK’s SummarizingConversationManager is designed for reactive context overflow handling. It doesn’t proactively manage costs the way masking does.
Observability
Masking adds one useful metric: how many chars were masked per session.
```
Done · 13 steps · 92,450 tokens · $0.287 · 48.3s
Context: mask · 18,943 chars masked
```
A high masked-chars count means a retrieval-heavy session where masking is doing real work; a low count means a search-heavy session where the overhead is near zero. It also tells you when to tune mask_after: if masking never activates on your workload, you don't need it.
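If your framework doesn't expose the counter directly, you can recover the figure by scanning the final message list for placeholders. A sketch (the message shapes are illustrative; blueclaw tracks this counter internally):

```python
import re

# Matches the placeholder format used by observation masking.
PLACEHOLDER = re.compile(r"\[output omitted -- (\d+) chars\]")

def masked_chars(messages):
    """Sum the sizes recorded in mask placeholders across a session."""
    return sum(
        int(m.group(1))
        for msg in messages
        for m in PLACEHOLDER.finditer(msg.get("text", ""))
    )

history = [
    {"text": "[output omitted -- 14832 chars]"},
    {"text": "Compared the two pricing pages."},
    {"text": "[output omitted -- 4111 chars]"},
]
print(f"Context: mask · {masked_chars(history):,} chars masked")
# Context: mask · 18,943 chars masked
```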
If you’re wiring this into a production agent, the monitoring layer matters too. See Monitoring AI Agents in Production for the full instrumentation approach.
The Production Reality
The 21% number is conservative. It’s from a mixed research session: some search turns (where masking barely helps) and some retrieval turns (where masking helps a lot).
On retrieval-heavy workloads (an agent that primarily fetches and processes large documents) the savings are closer to 62%. That's not a theoretical number. It's from the retrieval-heavy benchmark run on Haiku, the third row in the workload table: $0.071 vs $0.188 for the same 5-turn task.
At production volume, 62% savings on your most expensive workload class is not a nice-to-have. It’s the difference between a sustainable cost model and a bill that compounds out of control.
The model was never the problem.
You were paying to re-read data your agent already forgot. Fix the context, and the cost fixes itself.
Resources
- blueclaw: open source terminal AI agent with context management built in
- blueclaw roadmap: full feature history and what’s next
- Lindenbauer et al. 2025, “The Complexity Trap”: the paper this is based on
- Strands Agents SDK conversation management docs
- JetBrains Research blog post: accessible write-up of the paper
Related: Testing AI Agents in Production · Monitoring AI Agents in Production