I Cut My AI Agent's Token Costs 21% Without Changing the Model

Switching models is the first thing engineers reach for when AI agent costs get out of hand. I did the same. Then I ran the benchmarks and found out the model wasn’t the problem.

The problem was everything my agent kept in context that it never needed to read again.

I run blueclaw, a terminal AI agent I use for daily research tasks. It runs on Claude Sonnet, uses Strands Agents SDK under the hood, and regularly does 10-12 turn sessions fetching and analyzing web pages. After a few months of production use, I built a benchmarking harness to measure what was actually driving costs. The answer surprised me.

TL;DR: In my benchmarks, over 80% of tokens in a typical agent turn are tool output, not reasoning. Replacing old tool outputs with placeholders (“observation masking”) cut costs 21% on multi-step research tasks and up to 62% on retrieval-heavy workloads. The “smart” alternative (LLM summarization) costs more and slows agents down. It’s one config change in Strands Agents SDK.

Verdict: If your agent is fetching pages or calling APIs, old tool outputs are your biggest cost driver. Not the model.


The Short Answer

If your AI agent is expensive, it’s probably not the model.

In production agents:

  • Most tokens come from tool outputs (web pages, API responses)
  • These outputs are rarely reused after the turn they were fetched
  • Keeping them in context inflates every subsequent request

Replacing old tool outputs with placeholders (“observation masking”):

  • Cuts costs 20-60%
  • Requires no model change
  • Takes one config change in most frameworks

The Common Failure Pattern

Most agents are built like this:

Turn 1: fetch page A  → model gets full page content
Turn 2: fetch page B  → model gets full page A + full page B
Turn 3: fetch page C  → model gets full page A + B + C
...
Turn 8: summarize     → model gets all 7 pages, even though
                         it only needs the last one

By turn 8, the agent is paying to process 7 full web pages in every request, even though it fetched page A seven turns ago and hasn’t referenced it since. The model read it once. Now it’s just dead weight.

This is context bloat. And it compounds. A 12-turn research session can accumulate 100,000+ characters of tool output, most of which is stale by the second half of the session.
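
To see how fast it compounds, here’s a back-of-envelope sketch in Python. The numbers are hypothetical (one 10k-char page fetched per turn, roughly 4 chars per token), but the shape is the point: tokens processed grow quadratically with turns.

CHARS_PER_PAGE = 10_000   # hypothetical fetched-page size
CHARS_PER_TOKEN = 4       # rough chars-per-token ratio

total = 0
for turn in range(1, 9):
    # By turn N, the prompt carries all N pages fetched so far.
    prompt_tokens = turn * CHARS_PER_PAGE // CHARS_PER_TOKEN
    total += prompt_tokens
    print(f"turn {turn}: ~{prompt_tokens:,} prompt tokens of tool output")

print(f"total: ~{total:,} tokens across 8 turns")
# Eight turns re-process 36 page-reads (1+2+...+8), not 8.
# Most of those re-reads are of pages the agent is done with.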

A common fix in agentic frameworks like OpenHands is LLM summarization: use a separate model call to compress old context into a summary. It sounds smart. The benchmarks say otherwise.

This kind of bloat rarely shows up in unit tests. It only emerges when you run full multi-turn trajectories against real workloads. See Testing AI Agents in Production for how I approach that.


The Complexity Trap

A 2025 NeurIPS workshop paper from JetBrains Research (“The Complexity Trap”, Lindenbauer et al.) tested both strategies on 500 SWE-bench Verified instances across five models.

Their finding: simple observation masking performs just as well as LLM summarization, at half the cost of an unmanaged agent.

More striking: LLM summarization made agents run 13-15% longer. When summaries compressed old context, agents lost signal about previous failures and repeated work they’d already done. The summary also consumed up to 7% of total inference cost depending on model configuration. You’re paying for a feature that makes your agent slower.

I ran my own benchmarks on blueclaw to verify. The pattern held.


What Observation Masking Actually Does

Instead of summarizing old tool outputs, masking replaces them with a placeholder:

[output omitted -- 14,832 chars]

The agent keeps all its reasoning history intact. It still knows what it did, what tools it called, and what decisions it made. It just can’t re-read the raw output from 8 turns ago.

This works because agents need to remember their reasoning, not re-read their observations. The content of a web page fetched in turn 2 is irrelevant by turn 10. The agent’s decision based on that page is what matters, and that’s in the assistant messages, which masking never touches.
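
Concretely, here’s the before and after on a single message, using the Bedrock-style content blocks that Strands keeps in agent.messages (the tool-use ID and sizes here are made up):

before = {
    "role": "user",
    "content": [{
        "toolResult": {
            "toolUseId": "tooluse_abc123",  # hypothetical ID
            "content": [{"text": "<14,832 chars of raw fetched HTML...>"}],
        }
    }],
}

after = {
    "role": "user",
    "content": [{
        "toolResult": {
            "toolUseId": "tooluse_abc123",  # unchanged
            "content": [{"text": "[output omitted -- 14,832 chars]"}],
        }
    }],
}
# Assistant messages -- the reasoning and tool choices -- are never touched.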

The implementation in blueclaw uses what I call the Observation Window: keep the last M=10 tool outputs intact, mask everything older. M=10 is the optimal value from the paper, and it’s what I use in production.


The Benchmarks

I built bench_context.py to run the same prompt sequences through mask and summarize strategies side-by-side, measuring per-turn tokens, cost, and time.
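
The harness loop itself is simple. A minimal sketch, assuming the Strands result object exposes metrics.accumulated_usage (it does in recent SDK versions); the real bench_context.py also tracks per-turn tokens and cost:

import time

def run_benchmark(build_agent, prompts):
    """Run one prompt sequence through a fresh agent, collect totals.

    build_agent is a factory so each strategy gets a clean Agent.
    Illustrative only -- not the real bench_context.py.
    """
    agent = build_agent()
    elapsed = 0.0
    result = None
    for prompt in prompts:
        start = time.monotonic()
        result = agent(prompt)
        elapsed += time.monotonic() - start
    # accumulated_usage is cumulative over the agent's lifetime,
    # so reading it once at the end gives session totals.
    usage = result.metrics.accumulated_usage
    return {
        "input_tokens": usage["inputTokens"],
        "output_tokens": usage["outputTokens"],
        "seconds": elapsed,
    }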

Full Research Session (12 turns, Claude Sonnet)

12 turns of competitive research on Rust vs Go: search queries, page fetches, synthesis.

Metric         Mask        Summarize   Delta
------------   ---------   ---------   ------
Tokens         1,252,477   1,641,180   -23.7%
Cost           $3.996      $5.069      -21.2%
Steps          101         65          +55.4%
Time           44 min      54 min      -17.7%
Masked chars   105,464     0           n/a

Note what happened with steps: masking used 55% more tool calls but spent 21% less money. More steps does not mean more cost when each step processes less context: masking averaged roughly $0.040 per step against $0.078 for summarize, because every request carried a smaller prompt. Summarize used fewer steps, but each one was more expensive.

Masking saved $1.07 on a single 12-turn session. At production volume, that compounds fast.

Workload Breakdown (Haiku model)

The savings vary significantly by workload type:

Workload                  Output size          Mask savings   Notes
-----------------------   ------------------   ------------   -------------------------------------
Search only               1-2k chars/result    ~7%            Modest; search results are small
Mixed (search + 1 page)   varies               ~6%            One fetch doesn't build much pressure
Retrieval-heavy           5-20k chars/result   62%            Pages explode context fast

The pattern is clear: masking savings scale with tool output size. If your agent primarily uses search, you’ll see modest savings. If it fetches pages or processes large API responses, masking can cut costs by more than half.

My blueclaw sessions are retrieval-heavy. The agent regularly fetches full documentation pages. That’s why I see 21% savings on average, with spikes higher on sessions with many page fetches.


The Code (Strands Agents SDK)

This shipped in blueclaw v1.3. If you’re using the Strands Agents SDK, it’s a single config change. Here’s the ObservationMaskingManager I built for blueclaw:

from strands.agent.conversation_manager import (
    ConversationManager,
    SummarizingConversationManager,
)
from strands.hooks import BeforeModelCallEvent, HookRegistry

MASK_PLACEHOLDER = "[output omitted -- {n:,} chars]"

class ObservationMaskingManager(ConversationManager):
    """Replace old tool results with placeholders
    instead of summarizing.

    Based on Lindenbauer et al. 2025 "The Complexity Trap":
    observation masking preserves all reasoning while
    replacing distant environment observations with
    a size placeholder.
    """

    def __init__(self, mask_after: int = 10) -> None:
        super().__init__()
        self.mask_after = mask_after
        self._masked_chars = 0

    def register_hooks(
        self, registry: HookRegistry, **kwargs
    ) -> None:
        super().register_hooks(registry, **kwargs)
        registry.add_callback(
            BeforeModelCallEvent, self._on_before_model_call
        )

    def _on_before_model_call(self, event) -> None:
        self._apply_masking(event.agent)

    def apply_management(self, agent, **kwargs) -> None:
        self._apply_masking(agent)

    def reduce_context(self, agent, e=None, **kwargs) -> None:
        # Aggressive fallback: mask everything, then summarize
        self._apply_masking(agent, override_mask_after=0)
        SummarizingConversationManager().reduce_context(
            agent, e=e
        )

    def _apply_masking(
        self, agent, override_mask_after=None
    ) -> None:
        messages = agent.messages
        if override_mask_after is not None:
            m = override_mask_after
        else:
            m = self.mask_after
        cutoff = self._find_mask_cutoff(messages, m)
        for i in range(cutoff):
            self._mask_tool_results(messages[i])

    def _mask_tool_results(self, message: dict) -> None:
        if message.get("role") != "user":
            return
        for block in message.get("content", []):
            if "toolResult" not in block:
                continue
            tr = block["toolResult"]
            items = tr.get("content", [])
            total = sum(
                len(item.get("text", "")) for item in items
            )
            if total == 0:
                continue
            if len(items) == 1:
                t = items[0].get("text", "")
                if t.startswith("[output omitted"):
                    continue  # Already masked
            self._masked_chars += total
            tr["content"] = [
                {"text": MASK_PLACEHOLDER.format(n=total)}
            ]

    def _find_mask_cutoff(
        self, messages: list, keep_recent: int
    ) -> int:
        # Walk backwards counting assistant messages that issued
        # tool calls. Everything before the keep_recent-th most
        # recent one gets masked; that message and its toolResult
        # stay intact.
        if keep_recent <= 0:
            return len(messages)  # fallback path: mask everything
        count = 0
        for idx in range(len(messages) - 1, -1, -1):
            msg = messages[idx]
            if msg.get("role") == "assistant" and any(
                "toolUse" in c
                for c in msg.get("content", [])
            ):
                count += 1
                if count >= keep_recent:
                    return idx
        return 0

Wire it into your agent:

from strands import Agent

agent = Agent(
    model=model,
    tools=tools,
    conversation_manager=ObservationMaskingManager(
        mask_after=10
    ),
)

Or if you’re using a config file approach (like blueclaw’s blueclaw.yaml):

context:
  strategy: mask
  mask_after: 10  # Keep last 10 tool outputs intact
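
If you load a config like that yourself, the mapping to a manager is a few lines. A hypothetical loader, assuming PyYAML (blueclaw’s actual loader may differ):

import yaml

with open("blueclaw.yaml") as f:
    ctx = yaml.safe_load(f)["context"]

if ctx.get("strategy") == "mask":
    manager = ObservationMaskingManager(
        mask_after=ctx.get("mask_after", 10)
    )
else:
    # Fall back to the SDK's built-in summarizing manager.
    manager = SummarizingConversationManager()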

The BeforeModelCallEvent hook is important. It means masking happens before each model call within a multi-step turn, not just at turn boundaries. Without it, a 13-step turn would keep growing context within the turn before masking kicked in at the end.


When Each Strategy Wins

There is no single best strategy. It depends on your workload.

Use masking when:

  • Your agent fetches web pages, documents, or large API responses
  • Tool outputs are large (>5k chars each)
  • Context dependency is low: later turns don’t need to re-read earlier tool outputs
  • You want predictable, low-overhead context management

Use summarization when:

  • Tool outputs are small (search snippets, short API responses)
  • Context dependency is high: the agent needs to cross-reference earlier findings
  • Session length is short (fewer than 5-6 turns)

Use hybrid when:

  • Mixed workloads: start with masking, fall back to summarization after N=43 turns, the threshold the paper optimized empirically in Section 5.3 as the point where very long trajectories benefit from compression on top of masking (a sketch follows this list)
  • The paper found hybrid adds 7% savings over masking and 11% over summarization for very long sessions
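
Here’s a minimal sketch of that hybrid, built on the ObservationMaskingManager above. The turn counting is simplified and re-summarization isn’t deduplicated; a production version would track what has already been compressed:

class HybridContextManager(ObservationMaskingManager):
    """Mask first; add summarization for very long sessions."""

    def __init__(self, mask_after: int = 10, summarize_after: int = 43):
        super().__init__(mask_after=mask_after)
        self.summarize_after = summarize_after
        self._summarizer = SummarizingConversationManager()

    def apply_management(self, agent, **kwargs) -> None:
        super().apply_management(agent, **kwargs)
        # Count tool-using turns: assistant messages with a toolUse block.
        turns = sum(
            1 for m in agent.messages
            if m.get("role") == "assistant"
            and any("toolUse" in c for c in m.get("content", []))
        )
        if turns > self.summarize_after:
            # Compress the (already masked) older history.
            self._summarizer.reduce_context(agent)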

For most production agents doing research or retrieval: masking is the right default. The Strands Agents SDK’s SummarizingConversationManager is designed for reactive context overflow handling. It doesn’t proactively manage costs the way masking does.


Observability

Masking adds one useful metric: how many chars were masked per session.

Done · 13 steps · 92,450 tokens · $0.287 · 48.3s
Context: mask · 18,943 chars masked

A high masked-char count means a retrieval-heavy session where masking is doing real work; a low count means a search-heavy session where the overhead is near zero. It also tells you when to tune mask_after: if masking never activates on your workload, you don’t need it.
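
Surfacing that metric only requires holding a reference to the manager; it reads the _masked_chars counter from the class above:

manager = ObservationMaskingManager(mask_after=10)
agent = Agent(
    model=model,  # same model/tools as the wiring example above
    tools=tools,
    conversation_manager=manager,
)

result = agent("Compare Rust and Go for CLI tooling")
print(f"Context: mask · {manager._masked_chars:,} chars masked")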

If you’re wiring this into a production agent, the monitoring layer matters too. See Monitoring AI Agents in Production for the full instrumentation approach.


The Production Reality

The 21% number is conservative. It’s from a mixed research session: some search turns (where masking barely helps) and some retrieval turns (where masking helps a lot).

On retrieval-heavy workloads (an agent that primarily fetches and processes large documents) the savings are closer to 62%. That’s not a theoretical number. That’s from the category 2 benchmark run on Haiku: $0.071 vs $0.188 for the same 5-turn task.

At production volume, 62% savings on your most expensive workload class is not a nice-to-have. It’s the difference between a sustainable cost model and a bill that compounds out of control.

The model was never the problem.

You were paying to re-read data your agent already forgot. Fix the context, and the cost fixes itself.


Related: Testing AI Agents in Production · Monitoring AI Agents in Production

ai-agents llm python blueclaw
Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.