Why your AI agents burn tokens hunting for answers they should already have (May 2026)

A coding agent gets a routine question: how does our auth middleware handle expired session tokens? The answer exists. Someone on the platform team wrote it down. A senior engineer explained it in Slack last quarter. The PR that changed the behavior was merged six months ago. The agent does not know any of that, so it starts looking.

By the time the agent answers, it has read 14 files, opened three Linear tickets, scrolled through 200 lines of Slack history, and asked the model to reconcile two contradictory paragraphs from internal docs that have not been updated since 2024. It spent roughly 80,000 input tokens to produce a 400-token answer. Most of that work is invisible to your observability dashboard. All of it is on the bill.

AI agents burn tokens hunting for answers because the answers are scattered across systems that were never built to be read by a machine. Falconer is the knowledge agent for engineering teams that writes and updates documentation as your codebase changes and returns cited answers grounded in how your company works today, so the agent finds the answer on the first call instead of the fourteenth.

Takeaways

AI agents burn tokens because they cannot locate answers quickly. An agent that misses on the first retrieval does not stop; it retries, re-reads, and re-derives. Every move is billable.
Microsoft Research found unconstrained coding agents cost $5 to $8 per software engineering task and that the same agent on the same task can vary by up to 30x. Almost all of that variance is determined by how good the retrieved context is.
The cost is dominated by input tokens, not output. A ten-turn agent loop sends roughly 50x the tokens of a single linear call, because the agent re-reads its history every turn.
Falconer is the knowledge agent that consolidates code, tickets, docs, and team history into a single layer that agents and humans can read on the first call. In our customer deployments, agent token spend on retrieval-heavy tasks drops by an order of magnitude.
Token waste is rarely visible as token waste. It shows up as healthy, well-formed requests that cost three to ten times what they should. Most observability tools cannot see it without help.

The trace, in detail

Engineering teams talk about agent cost in averages. Averages hide the mechanism. The cheapest way to see what is happening is to walk a real trace, step by step.

The question is straightforward. An internal AI assistant is asked: when our auth middleware sees an expired session token, does it redirect to the login flow or return a 401? The senior engineer in #platform-eng knows the answer. The PR that changed the behavior in February has the answer. The docs page on auth has the answer, except the docs page was last edited eighteen months ago and is wrong.

The agent begins.

Turn-by-turn breakdown

Turn one. The agent searches the internal docs portal for "session token expiry." It gets four results, three of them stale. It reads the top result (4,200 tokens of doc content) and forms a hypothesis based on the 2024 behavior.

Turn two. The agent is not confident. It searches GitHub for the auth middleware source. It pulls in the file (3,500 tokens) and the most recent five commits (another 2,800 tokens of diffs and messages).

Turn three. The 2024 doc and the 2026 code disagree. The agent expands its query. It pulls three more middleware files (8,400 tokens) and the auth schema (1,900 tokens) to see if the behavior is conditional.

Turn four. The agent searches Slack via integration for "session token expiry." It gets 22 messages spanning a year. It reads the most recent 12 (4,100 tokens).

Turn five. Slack has two contradictory threads. One from June 2025 says redirect. One from November 2025 says 401. The agent asks the model to reconcile (3,200 input tokens, 800 output tokens of reasoning).

Turn six. The agent opens the Linear ticket referenced in the November thread (1,800 tokens) and the PR that ticket links to (4,600 tokens of diff plus review comments).

Turn seven. The agent now has enough to answer. It composes a response (400 output tokens) that cites the PR.

Total: roughly 80,000 input tokens, 1,200 output tokens, seven turns. At current Anthropic pricing on Claude Sonnet 4.6, that single answer costs around 25 cents. Multiply by a few thousand questions a day across an engineering org and the math gets uncomfortable.
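The arithmetic behind that figure can be sketched in a few lines. The per-token prices below are assumptions chosen because they reproduce the 25-cent estimate ($3 per million input tokens, $15 per million output tokens); check the vendor's pricing page before relying on them. The 3,000 questions a day is likewise an illustrative stand-in for "a few thousand."

```python
# Assumed list prices in USD per million tokens (not taken from the article).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def call_cost_usd(input_tokens, output_tokens,
                  input_price=INPUT_PRICE_PER_MTOK,
                  output_price=OUTPUT_PRICE_PER_MTOK):
    """Cost of one agent run, billing input and output tokens separately."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


# Totals from the seven-turn trace above.
trace_cost = call_cost_usd(input_tokens=80_000, output_tokens=1_200)

# Hypothetical daily volume, standing in for "a few thousand questions a day".
QUESTIONS_PER_DAY = 3_000
daily_cost = trace_cost * QUESTIONS_PER_DAY

print(f"per answer: ${trace_cost:.3f}, per day: ${daily_cost:,.0f}")
```

Under these assumptions the input side accounts for about 24 of the 25 cents, which is why the rest of this piece focuses on input tokens.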

The model did not do anything wrong. The agent did not do anything wrong. The dashboard does not flag anything wrong, because every request was well-formed. The waste is structural. The answer was findable in one place, if the place had existed.

The four mechanisms behind the burn

The trace above is not unusual. It is the median experience for an agent operating against a real engineering org's knowledge surface. Four mechanisms are doing most of the damage.

Conversation history compounds. Every turn re-sends the entire conversation so far. By turn seven the agent is paying for the equivalent of seven prompt prefixes plus every tool call result. A single call that started with the right context would have cost roughly a tenth as much.

Tool definitions ride along. Most agent frameworks include every tool's schema on every call. If your agent has thirty tools and uses three on this task, you are paying for the other twenty-seven every turn.

Retrieval thrash. When the first retrieval is wrong or stale, the agent broadens. Broader retrievals return more documents. More documents inflate the next prompt. The model spends tokens reasoning about which documents to trust. Bad first answers cost more than no answer at all.

Re-derivation. When the agent cannot find an authoritative source, it asks the model to reason from primary evidence. Reading PRs, reconciling Slack threads, and synthesizing across systems is expensive, and it produces an answer that already existed somewhere.

The first two mechanisms are well-known and addressable in the agent framework. The third and fourth are knowledge problems. They are why a quota or a dashboard alone never solves the issue.
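A rough model of the first two mechanisms makes the compounding concrete. This is a simplified sketch, not a description of any particular framework: it assumes every turn re-sends the full history, and the 1,000 tokens per turn and 150 tokens per tool schema are illustrative numbers we made up.

```python
def cumulative_input_tokens(new_tokens_per_turn, fixed_overhead=0):
    """Total input tokens billed when every turn re-sends all prior content.

    new_tokens_per_turn: new content (tool results, messages) added each turn.
    fixed_overhead: tokens sent on every call regardless, such as tool schemas.
    """
    total = 0
    history = 0
    for added in new_tokens_per_turn:
        history += added                   # the conversation grows by this turn's content
        total += fixed_overhead + history  # and the whole conversation is billed again
    return total


def tool_schema_overhead(tool_count, tokens_per_schema, turns):
    """Input tokens spent re-sending tool definitions across a run."""
    return tool_count * tokens_per_schema * turns


# Ten turns of equal size versus one call of the same size.
ten_turn_total = cumulative_input_tokens([1_000] * 10)
single_call_total = cumulative_input_tokens([1_000])
history_multiplier = ten_turn_total / single_call_total

# Thirty tools defined, three used, over a seven-turn run (150 tokens per schema assumed).
all_tools = tool_schema_overhead(tool_count=30, tokens_per_schema=150, turns=7)
used_tools = tool_schema_overhead(tool_count=3, tokens_per_schema=150, turns=7)
unused_tool_tokens = all_tools - used_tools

print(history_multiplier, unused_tool_tokens)
```

With equal-sized turns the multiplier works out to n(n+1)/2, which is 55 for ten turns and is where a figure like "roughly 50x" comes from. Real traces have uneven turns, so treat the model as a way to reason about shape, not as a billing estimate.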

The cost breakdown by mechanism

| Mechanism | Addressable in framework | Addressable at knowledge layer | Typical input-token impact |
| --- | --- | --- | --- |
| Conversation history compounds | Yes: truncation, summarization | Partial: shorter turns reduce accumulation | High |
| Tool definitions ride along | Yes: scope tools per task | No | Medium |
| Retrieval thrash | No | Yes: clean first retrieval eliminates re-expansion | Very high |
| Re-derivation | No | Yes: authoritative answer exists on first call | High |

What changes when there is a knowledge layer

The same question, one turn

The same question, posed to an agent that has Falconer as its knowledge layer, looks like this.

Turn one. The agent asks Falconer for the current behavior of the auth middleware on expired session tokens. Falconer returns a one-paragraph answer with citations to the relevant PR, the Linear ticket, and the canonical doc page (which Falconer keeps in sync with the code). Total input plus output tokens for the entire interaction: under 8,000.

The agent does not need to pull GitHub files directly. The agent does not need to scroll Slack. The agent does not need to reconcile contradictions, because the contradictions have already been resolved at the knowledge layer. The savings are not marginal; they are an order of magnitude on this class of question, and questions like this make up the majority of routine agent traffic in most engineering orgs.

Falconer is a knowledge agent for engineering teams. We connect code, Linear tickets, design docs, Slack threads, PRs, and incident retros into a single engineering context layer. We update that layer as your code changes. Answers come back with citations so an agent or a human can verify them. The mechanical effect on token spend is simple: shorter retrievals, fewer turns, and no re-derivation of facts that already exist somewhere in the org.

Falconer doesn't fix every agent cost line. Recursion limits, prompt caching, tool definition pruning, and gateway-level metering are still worth doing. What Falconer fixes is the input-token bill on retrieval-heavy tasks, which in most enterprises is the largest line of all.

Why this problem keeps getting worse

The trace above is from May 2026. The same trace in 2023 would have looked different in scale but identical in shape. Three things have changed since then that make the problem more expensive each quarter.

Three compounding trends

Agent usage is up. Deloitte's enterprise AI infrastructure outlook reports 30% of enterprises already generating more than 10 billion tokens a month, with 61% expected to exceed that threshold by 2028. More agents, more questions, more retrieval traffic, all at the same cost-per-token economics.

Model context windows are bigger. A larger window means more room for an agent to stuff in adjacent documents on the off chance one of them is right. The math is perverse: as windows grew, average prompt size grew faster than answer quality improved.

Knowledge sources got more fragmented, not less. The average engineering org we work with stores answers across at least seven systems: GitHub, Linear or Jira, Notion or Confluence, Slack, Google Drive, an incident system, and at least one internal wiki. Adding more sources to an agent's toolbelt makes the trace longer, not shorter.

Cost and quality move together

The token bill is the visible symptom of these three trends compounding. The hidden symptom is that agents fed this kind of context produce subtly worse answers, because they are reasoning from contradictory inputs. Cost and quality move together.

Where to start

If you have not run a real agent trace recently, do that first. Pick a routine question, run it through your agent against your real corpus, and count input tokens by turn. Most engineering leaders are surprised by what they find.
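Counting input tokens by turn does not need special tooling. The sketch below assumes your trace can be exported as a list of records with `turn` and `input_tokens` fields; those field names are hypothetical, so map them to whatever your gateway or observability tool actually emits. The sample records reuse the retrieval sizes from the first three turns of the trace above.

```python
from collections import defaultdict


def input_tokens_by_turn(trace):
    """Sum input tokens per turn from an iterable of trace records.

    Each record is a dict with hypothetical keys 'turn' and 'input_tokens'.
    A turn may contain several records, one per tool call or retrieval.
    """
    totals = defaultdict(int)
    for record in trace:
        totals[record["turn"]] += record["input_tokens"]
    return dict(sorted(totals.items()))


# Retrieval sizes from turns one to three of the trace walked through earlier.
example_trace = [
    {"turn": 1, "input_tokens": 4_200},  # stale doc page
    {"turn": 2, "input_tokens": 3_500},  # middleware source file
    {"turn": 2, "input_tokens": 2_800},  # five most recent commits
    {"turn": 3, "input_tokens": 8_400},  # three more middleware files
    {"turn": 3, "input_tokens": 1_900},  # auth schema
]

per_turn = input_tokens_by_turn(example_trace)
for turn, tokens in per_turn.items():
    print(f"turn {turn}: {tokens:,} input tokens")
```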

Four moves, in order

Once you have the trace, four moves are worth making in order. Enable prompt caching at the gateway level. Add an observability layer if you do not have one. Set a max-turn budget so agents fail fast instead of looping forever. Then attack the retrieval layer, which is where most of the input-token bill lives.
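The max-turn budget is the simplest of the four to sketch. The version below is a minimal illustration under our own assumptions: `step` is a hypothetical callable standing in for one agent turn, and it returns an answer or `None` if it needs another turn. Most agent frameworks expose their own setting for this, so prefer that where one exists.

```python
class TurnBudgetExceeded(RuntimeError):
    """Raised when the agent uses up its turn budget without answering."""


def run_with_turn_budget(step, max_turns):
    """Run an agent loop that fails fast instead of looping forever.

    step: hypothetical callable taking the 1-based turn number and returning
          an answer, or None if the agent needs another turn.
    Returns (answer, turns_used).
    """
    for turn in range(1, max_turns + 1):
        answer = step(turn)
        if answer is not None:
            return answer, turn
    raise TurnBudgetExceeded(f"no answer after {max_turns} turns")


def fake_step(turn):
    """Stand-in agent that finds the answer on its third turn."""
    return "returns a 401" if turn == 3 else None


answer, turns_used = run_with_turn_budget(fake_step, max_turns=5)
print(answer, turns_used)
```

Raising an exception, instead of returning a partial answer, keeps the failure visible in the same place you already track errors. A budget caps the damage from a bad trace; it does not make the retrieval any better.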

Falconer is the knowledge agent built for that fourth move. We are happy to walk a trace with you against your real corpus and show what changes when the retrieval is right. A longer walkthrough of how Falconer maintains an engineering knowledge layer at company scale lives in our guide on scaling Karpathy's LLM wiki pattern to your org. For the broader FinOps picture, our companion guide on AI FinOps for engineering teams walks through the full four-source framework.

For external reading, the FinOps Foundation's working group on AI and Stanford Digital Economy Lab's analysis of agent token spend are the two most useful primary sources we have read. Anthropic publishes ongoing research on context engineering that lands on the same conclusions from a different angle. The 2026 State of FinOps report puts AI cost allocation at the top of the practitioner priority list for the second year running. Anthropic's pricing documentation is the cleanest reference on prompt caching economics, and PromptHub's vendor comparison covers OpenAI and Google for completeness. Milvus and Helicone have published the most useful coverage of the retrieval and observability layers respectively.

FAQ

Why do AI agents burn so many tokens on simple questions?

Because the answer is rarely in one place. An agent that cannot find the right answer on the first retrieval expands its search, opens more files, and asks the model to reconcile contradictions across sources. Each step adds input tokens. Falconer is a knowledge agent for engineering teams that consolidates the answer into one cited, current source so the agent finds it on the first call.

How much do AI agents typically cost per task?

Microsoft Research found unconstrained coding agents cost $5 to $8 per software engineering task. The same agent on the same task varied by up to 30x depending on the path it took. Most of that variance is context-dependent: agents reading clean, current context finish in one or two turns. Agents thrashing across stale sources finish in six or more.

What is "retrieval thrash" and why does it cost so much?

Retrieval thrash is the pattern of an agent issuing successive, broader retrievals because the first answer was wrong or stale. Each broader query returns more documents, which inflates the next prompt, which forces the model to reason about which documents to trust. The economic effect is a quiet tax on every request. Falconer solves it by giving the agent a clean, current, cited answer on the first call.

Why is observability not enough to control agent token spend?

Observability tells you which agents are expensive. It does not tell you why a specific retrieval was wrong, why a loop ran for six turns instead of two, or why an agent re-derived a fact that exists elsewhere in the org. Those answers live in the knowledge layer, which is what Falconer was built for.

Does Falconer replace my LLM gateway or my vector database?

No. Falconer is a knowledge agent for engineering teams that handles context quality. The gateway handles routing, caching, and metering. A vector database is one storage layer Falconer can sit on top of. They are complementary, and the savings compound.

What is the fastest way to see if my agents have a retrieval thrash problem?

Pull a recent agent trace and count input tokens per turn. If the third, fourth, and fifth turns are each larger than the first, the agent is broadening its retrieval because the early ones missed. That is retrieval thrash. The fix is at the knowledge layer, not the agent framework.
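That check is easy to automate once you have per-turn counts. The function below is a literal sketch of the heuristic as stated, and the function name is ours. One assumption worth flagging: if your counts are full prompt sizes, re-sent history will inflate later turns on its own, so the signal is cleaner when you feed in the tokens newly added on each turn.

```python
def looks_like_retrieval_thrash(tokens_per_turn):
    """Return True if turns three, four, and five are each larger than turn one.

    tokens_per_turn: list of input-token counts, index 0 being the first turn.
    Runs shorter than five turns cannot match the pattern and return False.
    """
    if len(tokens_per_turn) < 5:
        return False
    first_turn = tokens_per_turn[0]
    return all(tokens > first_turn for tokens in tokens_per_turn[2:5])


# Illustrative counts, not taken from a real trace.
broadening_run = [4_000, 6_000, 9_000, 12_000, 15_000]
narrowing_run = [4_000, 3_000, 2_000, 1_000, 500]

print(looks_like_retrieval_thrash(broadening_run))
print(looks_like_retrieval_thrash(narrowing_run))
```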

Where does the data on agent token spend come from?

The most-cited sources are Stanford Digital Economy Lab, Microsoft Research, Deloitte's enterprise AI infrastructure outlook, and the FinOps Foundation's 2026 State of FinOps report. Anthropic and OpenAI publish pricing and caching documentation that informs the per-call economics.
