Context retries: the silent token tax on agentic workflows (May 2026)

Most teams that have read a recent agent token trace already know the input side dominates. What is less obvious is which input lines compound and at what rate. The unconstrained-agent figure of $5 to $8 per software engineering task, reported by Microsoft Research, is the surface of a deeper number: the same task running on the same agent varies by up to 30x. That variance is not random. It is path-dependent, and the path is shaped by three compounding effects on input tokens. Two are well-known. The third is the one most coverage misses, and it is the one Falconer was built to fix. Falconer is the knowledge agent for engineering teams that writes and updates documentation as the codebase changes and returns cited answers grounded in how your company works today. It addresses the retrieval thrash that drives the largest hidden line in your input-token bill.

Takeaways

Three compounding effects drive most agent input-token spend: conversation history that resends the full thread every turn, tool definitions that ride along on every call, and retrieval thrash when the first retrieval misses.
By turn fifty, an agent is paying for roughly fifty times its starting context per call. Pruning and summarization are not research topics; they are line items.
Retrieval thrash is the silent component. An agent that misses the right answer broadens its query, returns more documents, and asks the model to reconcile contradictions. Each step is invisible to the gateway and billable to you.
Falconer addresses retrieval thrash directly by returning a single cited answer grounded in current code, docs, tickets, and team history. In production deployments, median loop length on retrieval-heavy questions drops by more than half.
Caching and gateway-level fixes compound with a clean knowledge layer. Caching reduces the unit cost of the tokens you send; Falconer reduces how many tokens you need to send.

Why "input tokens" is the right unit

Most pricing pages lead with output. Most invoices are paid on input. For agentic workflows the asymmetry is severe. A coding agent that produces a 400-token answer often consumes 80,000 or more input tokens to get there, because every turn re-reads the conversation so far, the tool definitions, and the retrieval results.

Three patterns dominate. Each compounds with the others. The savings from fixing them multiply rather than add, which is why a single targeted intervention sometimes cuts a token bill by half.

Compounding effect one: conversation history

Most agent frameworks send the full conversation history on every turn. By the time an agent is twenty messages in, every call carries all twenty messages as input. The cost is cumulative across the loop, not just per call: a naive ten-turn Reflexion loop sends roughly fifty times the tokens of a single linear call.
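The "roughly fifty times" figure is easy to check with a few lines of arithmetic. The sketch below assumes every turn adds one message of equal size and that the framework resends everything before it; the function name and the 1,000-token message size are illustrative, not taken from any framework.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total input tokens billed across a loop that resends full history.

    Assumption: each turn adds one message of equal size, so turn k
    sends k messages as input.
    """
    return sum(k * tokens_per_message for k in range(1, turns + 1))


single_call = cumulative_input_tokens(1, 1_000)     # one linear call
ten_turn_loop = cumulative_input_tokens(10, 1_000)  # full history resent each turn
ratio = ten_turn_loop / single_call
print(ratio)  # 55.0
```

Under these assumptions the ten-turn loop costs 55 message-units against 1, which is where a figure of "roughly fifty times" would come from. Real loops have uneven message sizes, so treat the ratio as a shape rather than a forecast.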

The fixes are well-known. Pruning. Summarization. Persistent scratchpads that store intermediate state outside the prompt. Anthropic and OpenAI both support prompt caching, which means the stable prefix of the conversation can be cached at roughly 10% of the standard input price on Anthropic, or 50% on OpenAI's automatic cache, per their official pricing documentation.
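To make the caching discount concrete, here is a minimal cost sketch. The per-token price, the prefix size, and the function name are hypothetical; the 0.10 and 0.50 multipliers are the cached-read rates quoted above. The sketch ignores any cache-write surcharge and any minimum cacheable prefix length, so check the vendors' current pricing pages before relying on it.

```python
def effective_input_cost(
    prefix_tokens: int,
    fresh_tokens: int,
    price_per_token: float,
    cached_multiplier: float,
) -> float:
    """Input cost of one call when the stable prefix is read from cache.

    cached_multiplier is the fraction of the standard input price charged
    for cached tokens (1.0 means no caching).
    """
    return (prefix_tokens * cached_multiplier + fresh_tokens) * price_per_token


PRICE = 3e-6  # hypothetical: $3 per million input tokens

uncached = effective_input_cost(8_000, 2_000, PRICE, cached_multiplier=1.0)
at_10_percent = effective_input_cost(8_000, 2_000, PRICE, cached_multiplier=0.10)
at_50_percent = effective_input_cost(8_000, 2_000, PRICE, cached_multiplier=0.50)

for label, cost in [("uncached", uncached),
                    ("10% cached reads", at_10_percent),
                    ("50% cached reads", at_50_percent)]:
    print(f"{label}: ${cost:.4f} per call")
```

Note what the function does not change: the token count. Caching lowers the price of the 8,000-token prefix but the call still carries it.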

Most of this work happens in the agent framework, not in the model. If you are using LangGraph, LlamaIndex, CrewAI, or a homegrown orchestrator, the relevant controls are all in your code. The tooling is mature, the patterns are written down, and the savings are immediate. This is the cheapest of the three lines to fix.

Compounding effect two: tool definitions

Every tool an agent has access to ships its full schema on every call, regardless of whether the agent uses it. If your agent has thirty tools and uses three on a given task, you are paying for the other twenty-seven every turn.

This one is structurally similar to history bloat but lives on a different axis. The fix is to scope tools by task, by user, or by routing. Some teams use a hierarchical agent pattern where a top-level agent picks a sub-agent based on the task, and the sub-agent has only the tools relevant to its domain. Others use dynamic tool selection at the gateway layer to keep the tool definition payload small.
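Scoping by task can be as simple as filtering the tool list before each call. The sketch below is framework-agnostic and entirely hypothetical: the `Tool` record, the domain labels, and the token sizes are made up for illustration and do not correspond to any particular SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    domain: str          # hypothetical routing label, e.g. "code" or "tickets"
    schema_tokens: int   # approximate size of the tool's schema in tokens


def scope_tools(tools: list[Tool], domain: str) -> list[Tool]:
    """Keep only the tools relevant to the current task's domain."""
    return [t for t in tools if t.domain == domain]


def schema_payload(tools: list[Tool]) -> int:
    """Tokens spent on tool definitions for a single call."""
    return sum(t.schema_tokens for t in tools)


ALL_TOOLS = [
    Tool("search_code", "code", 300),
    Tool("read_file", "code", 200),
    Tool("create_ticket", "tickets", 400),
    Tool("post_message", "chat", 250),
]

scoped = scope_tools(ALL_TOOLS, "code")
print(schema_payload(ALL_TOOLS))  # 1150
print(schema_payload(scoped))     # 500
```

Because the schema payload is paid on every turn, the per-call difference is multiplied by the number of turns in the loop.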

A representative example: an internal engineering assistant we audited recently had 47 tools in its definition. The median task used four. After scoping, the average input token count per turn dropped by 22%. No retrieval changes, no framework changes, no caching changes. Just a smaller tool definition.

Compounding effect three: retrieval thrash

This is the line most coverage skips, and it is usually the largest one.

When an agent retrieves a stale, partial, or contradicted answer, it does not stop. It broadens its query. Broader queries return more documents. The next prompt is larger because the agent now has six adjacent documents instead of one. The model spends tokens reasoning about which documents to trust. Often the agent then asks the model to reconcile contradictions across versions of the same fact, which consumes still more tokens to produce a synthesis the agent could have gotten in one call if the underlying source had been right.

Three failure modes produce most of the thrash.

Stale documents are not flagged as stale.

A doc page from 2024 looks identical to a doc page from 2026 in the retrieval response. The agent has no signal that the answer is outdated. So it reads it, attempts a step based on it, fails, and retries with a broader query.
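One way to give the agent that missing signal is to attach a freshness label to each retrieved document before it enters the prompt. This is a minimal sketch under stated assumptions, not a description of how Falconer or any retrieval library works: the `RetrievedDoc` record, the `last_verified` field, and the 180-day threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RetrievedDoc:
    title: str
    last_verified: date  # hypothetical metadata: when the page was last confirmed
    text: str


def annotate_staleness(
    docs: list[RetrievedDoc], today: date, max_age_days: int = 180
) -> list[str]:
    """Prefix each document with a freshness label the model can read."""
    annotated = []
    for doc in docs:
        age_days = (today - doc.last_verified).days
        label = "STALE" if age_days > max_age_days else "CURRENT"
        header = f"[{label}, last verified {doc.last_verified.isoformat()}] {doc.title}"
        annotated.append(f"{header}\n{doc.text}")
    return annotated


docs = [
    RetrievedDoc("Deploy guide", date(2024, 3, 1), "Run the legacy deploy script."),
    RetrievedDoc("Deploy guide v2", date(2026, 4, 1), "Use the pipeline trigger."),
]
results = annotate_staleness(docs, today=date(2026, 5, 1))
for entry in results:
    print(entry.splitlines()[0])
```

A label like this does not make the stale page correct. It only lets the agent discount it before attempting a step that will fail.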

Authoritative sources are scattered.

The answer might be in code, but the agent searched docs first. It might be in a Slack thread from last quarter, but the agent does not have a Slack integration. It might be in a closed Linear ticket that the agent's retrieval index has not picked up, because the index has not refreshed in a week.

Contradictions are unresolved at the source.

The June doc says one thing. The November Slack thread says another. The current code says a third. The agent has to reconcile in-prompt, which is expensive and unreliable.

The economic effect is hidden because every failed retrieval looks like a successful retrieval to your gateway. The tokens are billed, the agent keeps trying, and the dashboard does not flag any of it as waste.

Falconer is the knowledge agent for engineering teams that fixes this layer directly. We write and update documentation as your code changes, link the docs back to the PRs, tickets, Slack threads, and incident retros that produced them, and return cited answers grounded in how the company works today. The mechanical effect on retrieval thrash is twofold. First, the first retrieval is more likely to be right, so the loop ends in one turn instead of six. Second, contradictions are resolved at the knowledge layer rather than in-prompt, so the model is not paying tokens to reconcile two versions of the same fact.

How the three effects compound

The interesting math is not in any one of the three lines. It is in how they multiply.

| Scenario | Turns | Input tokens |
| --- | --- | --- |
| No intervention (full history, 47 tools, 3 thrash cycles) | 10 | ~200,000 |
| History pruning + tool scoping, no knowledge layer | 6–7 | ~90,000 |
| Knowledge layer in place | 2 | ~12,000 |

The numbers vary by workload. The shape does not. Retrieval thrash is the line that, once fixed, shrinks the other two as well. History and tool definitions are both paid per turn, so a loop that ends in two turns instead of ten resends the thread and the tool schemas far fewer times. Pruning and scoping then apply to a bill that is already much smaller.
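A toy model shows why turn count dominates. Every input below is a hypothetical placeholder chosen for round arithmetic; the sketch is not intended to reproduce the table above, only to show how per-turn costs stack.

```python
def loop_input_tokens(
    turns: int,
    system_tokens: int,
    tool_tokens: int,
    tokens_per_message: int,
    retrieval_tokens_per_turn: int,
) -> int:
    """Total input tokens for a loop that resends everything on every turn.

    Assumptions: fixed system prompt and tool schemas on each call, history
    growing by one equal-sized message per turn, and a constant retrieval
    payload per turn.
    """
    total = 0
    for turn in range(1, turns + 1):
        history = turn * tokens_per_message
        total += system_tokens + tool_tokens + history + retrieval_tokens_per_turn
    return total


ten_turns = loop_input_tokens(10, 500, 6_000, 800, 4_000)
two_turns = loop_input_tokens(2, 500, 6_000, 800, 4_000)
print(ten_turns)  # 149000
print(two_turns)  # 23400
```

Nothing about the per-turn payload changed between the two runs. The drop comes entirely from ending the loop sooner, which is the effect a correct first retrieval is meant to produce.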

Where the data comes from

Most of the figures in this guide come from a small set of primary sources. Stanford Digital Economy Lab and Microsoft Research both published agentic-coding cost analyses in 2025 that documented the $5 to $8 per task figure and the 30x intra-task variance. Anthropic and OpenAI publish prompt caching documentation that defines the cached-token economics. The FinOps Foundation's working group on AI maintains the most useful cross-vendor framework for measuring this kind of waste.

Internal data from Falconer customer deployments contributes the order-of-magnitude figures on retrieval thrash specifically. We see the same pattern across enterprises in every sector we work in: when an agent has a clean knowledge layer, the third compounding effect drops out, and the other two get cheaper to address.

For additional reading on the underlying mechanics, the 2026 State of FinOps report names AI cost allocation as the top practitioner priority and includes survey data on where teams are seeing the biggest unexpected costs. Deloitte's 2028 AI infrastructure outlook frames the scale of enterprise token consumption. For observability tooling specifically, Helicone's comparison of platforms and Langfuse's documentation cover the main options. Milvus's writeup on RAG token cost is the cleanest treatment of how retrieval design drives input-token spend.

Where to start

Run a real trace. Pick a routine agent task in production, run it end to end, and count input tokens by turn. You will see one of three shapes.

If the trace is short and the tokens-per-turn are flat, you have a healthy agent. Move on.

If the trace is long but the tokens-per-turn are flat, you have a history or tool-definition problem. Look at your framework's pruning and tool routing settings.

If the trace is long and the tokens-per-turn grow, especially in the middle, you have retrieval thrash. The first retrieval missed; every subsequent one broadened. This is the silent line. Falconer is the layer that fixes it.
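The three shapes can be checked mechanically once you have per-turn input token counts from a trace. The classifier below is a rough sketch: the five-turn cutoff for "long" and the 1.5x growth threshold are arbitrary assumptions to tune against your own traces, and the function name is hypothetical.

```python
def classify_trace(
    tokens_per_turn: list[int],
    long_trace_turns: int = 5,
    growth_threshold: float = 1.5,
) -> str:
    """Map a per-turn input-token series onto the three shapes described above."""
    if not tokens_per_turn:
        raise ValueError("trace has no turns")

    is_long = len(tokens_per_turn) >= long_trace_turns
    # "Growing" here means the peak turn is well above the first turn.
    is_growing = max(tokens_per_turn) >= growth_threshold * tokens_per_turn[0]

    if not is_long and not is_growing:
        return "healthy"
    if is_long and not is_growing:
        return "history or tool-definition problem"
    if is_long and is_growing:
        return "possible retrieval thrash"
    return "unclassified"  # short but growing: not one of the three shapes


print(classify_trace([9_000, 9_100, 9_050]))
print(classify_trace([9_000] * 8))
print(classify_trace([9_000, 12_000, 21_000, 30_000, 26_000, 24_000, 23_000]))
```

Comparing the peak to the first turn, rather than the last turn to the first, is deliberate: it catches growth that shows up in the middle of the loop and falls away afterward.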

FAQ

What is the silent token tax in agentic workflows?

The silent token tax is the input-token spend driven by retrieval thrash: an agent that misses on its first retrieval broadens its query, returns more documents, and asks the model to reconcile contradictions across sources. Every step is a well-formed, billable request that looks healthy in the gateway. Falconer is the knowledge agent for engineering teams that ends the thrash by returning a single cited answer on the first call.

How much can context retries inflate an agent's token bill?

In our customer data, the input-token cost of a thrashing agent on a routine engineering question is typically five to ten times the cost of the same agent reading from a clean knowledge layer. The Microsoft Research paper on agent token consumption found up to 30x variance on the same task, most of it path-dependent.

Why does conversation history bloat so quickly?

Every turn re-sends the full thread. By turn ten, the agent is paying for ten copies of the starting context. By turn fifty, fifty copies. Pruning, summarization, and persistent scratchpads bring this down sharply, but they help much more when the loop is short. Falconer keeps loops short by getting the retrieval right.

Does prompt caching solve retrieval thrash?

No. Caching reduces the unit cost of the tokens you send. It does not change how many tokens you need to send. If your retrieval is wrong, caching just makes wrong retrievals cheaper. Falconer is the layer that reduces the number of tokens by fixing the retrieval.

How does Falconer reduce input tokens specifically?

Falconer returns a single cited answer to the agent's question, grounded in current code, docs, tickets, and team history. The agent does not need to read six adjacent documents, scroll a Slack thread, or reconcile two versions of the same fact. The first retrieval is right, the loop ends in one or two turns, and the input-token bill drops.

Can I see the retrieval thrash in my observability tool?

Not directly. Most observability tools surface total tokens, latency, and request counts. They do not flag a request as a thrash retry, because every request is well-formed. The signal is growth in tokens-per-turn within a single agent trace. If your tokens-per-turn grow across the middle of a loop, you have retrieval thrash.

Where can I read more about the underlying economics?

The FinOps Foundation's AI working group maintains the most useful framework. Anthropic's research on context engineering explains the in-model cost dynamics. The Stanford and Microsoft Research papers cited above are the strongest primary data we have read on agentic-coding cost specifically.
