# Agent loops are burning your token budget. Here's the fix.

> Agent loops burn tokens because the first retrieval misses. A 10-turn loop sends roughly 50x the tokens of a single linear call, and most of the run-to-run variance in that spend is path-dependent, driven by retrieval thrash that observability tools can't see. This guide explains why prompt caching and pruning don't fix it, and how a shared knowledge layer like Falconer cuts median loop length by more than half on retrieval-heavy questions.

- Date: 2026-06-09
- Tags: ai-agents, agent-loops, token-efficiency, mcp, finops

---
Agent loops burn tokens because the first retrieval misses. When the lookup returns stale or partial context, the agent broadens its query, pulls more docs, and asks the model to reconcile contradictions across sources. By turn ten, a routine loop has sent roughly 50x the tokens of a single linear call. The fix lives upstream of the loop: a shared, current knowledge layer the agent can read and write, so the first retrieval is right.

The people closest to the problem have already moved past prompting. Boris Cherny, head of Claude Code at Anthropic, [doesn't prompt agents anymore](https://addyosmani.com/blog/loop-engineering/). He designs loops that prompt them. Peter Steinberger, creator of OpenClaw, [puts it more bluntly](https://addyosmani.com/blog/loop-engineering/): *"You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."* Steinberger spent [$1.3 million on OpenAI tokens in 30 days](https://thenextweb.com/news/openclaw-peter-steinberger-1-3-million-openai-token-bill) running 100 Codex instances on a single project.

The frontier moved from the prompt to the loop. The bill moved with it.

## TLDR

- An agent loop is a small program that prompts a coding agent, reads what it produced, decides whether the work is done, and prompts it again if it isn't. It is the new unit of agentic coding.
- A 10-turn loop sends roughly 50x the tokens of a single linear call. The [Stanford Digital Economy Lab](https://digitaleconomy.stanford.edu/publication/how-do-ai-agents-spend-your-money-analyzing-and-predicting-token-consumption-in-agentic-coding-tasks/) found agentic coding tasks consume \~1000x more tokens than chat or code reasoning, with up to 30x variance on the same task.
- The largest hidden line on the token bill is retrieval thrash: when the first retrieval misses, the agent broadens its query, pulls more docs, and reconciles contradictions in-prompt. Every step looks healthy in your gateway.
- Prompt caching, conversation pruning, and tool scoping lower per-call cost but do not fix why the loop got long in the first place.
- The upstream fix is a shared knowledge layer the loop can read and write. One cited answer, grounded in current code, docs, tickets, and team history. The loop ends in one or two turns instead of six.
- [Falconer](https://falconer.com/?utm_source=blog&utm_medium=referral&utm_campaign=content) is a knowledge layer for engineering teams that writes and updates docs as code ships, returns cited answers from the agent's first call, and reads and writes through MCP. In customer deployments, median loop length on retrieval-heavy questions drops by more than half and input-token cost per question drops by an order of magnitude.

A coding agent that produces a 400-token answer often burns 80,000+ input tokens to get there. The [Stanford Digital Economy Lab's study of agentic coding tasks](https://digitaleconomy.stanford.edu/publication/how-do-ai-agents-spend-your-money-analyzing-and-predicting-token-consumption-in-agentic-coding-tasks/) found these workloads consume roughly 1000x more tokens than chat or code reasoning, with up to 30x variance on the same task. Most of that variance is path-dependent, driven by the quality of the retrieved context. Our [AI FinOps for engineering teams guide](https://falconer.com/guides/ai-finops-engineering-teams) breaks down where the spend actually comes from.

![Chart showing per-task cost variance for unconstrained coding agents from Microsoft Research benchmark](https://falconer.com/api/file/s3/images/1781025202340-hwegq.png)

Most fixes for agent token spend attack the symptom. Caching lowers the unit cost of repeated tokens. Pruning trims older turns. Tool scoping removes unused schemas from each call. All are useful at the margin, but none of them fixes why the loop got long.

The loop got long because the first retrieval missed.

| Approach | What it does | What it doesn't fix |
| --- | --- | --- |
| Prompt caching | Reduces unit cost of repeated context tokens | Why the agent keeps re-retrieving the same context |
| Conversation pruning | Trims older turns from the context window | Why the loop got long in the first place |
| Tool scoping | Removes unused tool schemas from each call | Why the first retrieval missed |
| Shared knowledge layer | Returns one cited answer from a current, structured source | Tool or framework bugs unrelated to retrieval |

## What is retrieval thrash?

Retrieval thrash is the input-token spend driven by an agent that misses on its first lookup, then broadens the query, pulls more documents, and asks the model to reconcile contradictions across stale and current sources. Every step is well-formed and billable, so each request looks healthy in your gateway and the pattern across requests goes unseen. We covered the mechanics in detail in our guide on [why your AI agents burn tokens hunting for answers they should already have](https://falconer.com/guides/ai-agent-token-waste/).

### Why your dashboard can't see it

Retrieval thrash is the largest hidden line in an agent's input-token bill. Most observability tools can't see it because every individual request looks healthy.

Andrej Karpathy [described agents](https://karpathy.bearblog.dev/sequoia-ascent-2026/) as interns with great recall and weak judgment. The recall part is doing more work than people give it credit for. An intern with a stale handbook will still answer your question. Loudly, wrong. Your agent does the same thing, except 100 times an hour across every branch in your repo. There is no jagged-frontier fix for this. You can't RL your way out of bad source material.

```mermaid
flowchart TB
    Q1["User query"]
    A1["Agent"]
    M1["Retrieval miss"]
    M2["Broaden query"]
    M3["Pull more docs"]
    M4["Reconcile<br/>contradictions"]

    Q1 --> A1
    A1 --> M1
    M1 --> M2
    M2 --> M3
    M3 --> M4
    M4 -->|"loop 6+ turns<br/>tokens compound"| A1
```

## How does a shared knowledge layer fix retrieval thrash?

A shared knowledge layer, sometimes called a company brain, is a single, current source that knows what the company knows: code, docs, tickets, and team history, kept synchronized as work happens. The agent queries it once, gets a cited answer grounded in current state, and exits the loop. The first retrieval is right. The loop ends in one or two turns instead of six.

In customer deployments, median loop length on retrieval-heavy questions drops by more than half. Input-token cost per question drops by an order of magnitude.

![Chart showing reduction in agent loop length after introducing a shared knowledge layer in customer deployments](https://falconer.com/api/file/s3/images/1781025265310-tggqu.png)

The next two sections cover why this works.

```mermaid
flowchart TB
    subgraph Sources["Live sources"]
        direction LR
        PR["PR diffs"]
        Slack["Slack threads"]
        Linear["Linear tickets"]
        Docs["Docs"]
    end

    Sources ==>|"auto-update"| Brain["Company brain"]

    Q["User query"] --> A["Agent"]
    A <==>|"reads + writes back<br/>every loop"| Brain
    A --> Ans["One cited answer<br/>(loop ends in 1 or 2 turns)"]
```

## Why do agent loops need docs that update with the code?

Stale docs are the worst kind of context for an agent. They look identical to current docs in the retrieval response. The agent reads them, acts, fails, retries, and burns tokens on every cycle.

Karpathy's framing for why this keeps happening is that [docs are still written for humans](https://karpathy.bearblog.dev/sequoia-ascent-2026/). His complaint at AI Native: *"Why are people still telling me what to do? I don't want to do anything. What is the thing I should copy paste to my agent?"*

### How Falconer keeps docs current

Falconer reads every PR diff, identifies which docs are affected, and proposes the exact edits. Architecture docs, API references, runbooks, and implementation plans update as code ships.

The point is that context rot gets prevented before the session starts, not managed inside the session.

Two modes are available: **review** (Falconer proposes edits and an owner accepts them in Slack) and **self-driving** (Falconer applies edits directly). Multiple PRs touching the same doc get reconciled into one consolidated diff.

## Why should every agent read from the same knowledge source?

Steinberger's $1.3M bill is the obvious version of the same problem at fleet scale. When every agent re-derives context independently, you pay for context production once per agent, per call. Multiply that by however many agents your organization runs. We made the full case for shared-layer economics in [tokenmaxxing vs. token efficiency](https://falconer.com/guides/tokenmaxxing-token-efficiency/).

A shared knowledge layer flips the math. The first agent pays once to consolidate the answer. Every subsequent agent reads the consolidated answer. PR reviewer, on-call assistant, code search agent, Linear triage bot: same source, same freshness, same cited answer.

### How the loop writes back through MCP

Model Context Protocol (MCP) is an [open protocol developed by Anthropic](https://www.anthropic.com/news/model-context-protocol) for connecting AI agents to data sources. Falconer's MCP server reads and writes. When an agent drafts a postmortem mid-incident or updates a runbook to match a config change, that work lands back in Falconer. Every loop run leaves the knowledge base better than it found it.
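To make the read and write halves concrete, here is a minimal sketch of the messages an agent could send. MCP carries tool invocations as JSON-RPC 2.0 requests with the `tools/call` method. The tool names `search_knowledge` and `update_doc`, and their argument shapes, are hypothetical placeholders for illustration, not Falconer's actual tool names.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry an id so replies can be matched


def tools_call(tool_name: str, arguments: dict) -> str:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


# Hypothetical tool names and arguments; a real server advertises
# its own tools and input schemas through `tools/list`.
read_request = tools_call("search_knowledge", {"query": "How do we rotate API keys?"})
write_request = tools_call(
    "update_doc",
    {"doc_id": "runbooks/key-rotation", "content": "Updated rotation steps."},
)
```

In practice an MCP client library builds and sends these messages for you. The point of the sketch is that reading and writing use the same call shape against the same server.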

Token savings compound across the fleet instead of getting re-spent inside each agent.

## How do you diagnose your own agent loop?

Run a real trace on a routine agent task. Count input tokens per turn and match the pattern to the table below.

| Trace shape | Tokens-per-turn pattern | Likely cause | Where to fix it |
| --- | --- | --- | --- |
| Short trace | Flat across turns | Healthy agent | Nothing to fix |
| Long trace | Flat across turns | History or tool-schema bloat | Framework layer (caching, pruning, scoping) |
| Long trace | Growing through middle turns | Retrieval thrash | Knowledge layer (single, current source) |

![Diagram of the three agent loop trace shapes used to diagnose token spend in coding agents](https://falconer.com/api/file/s3/images/1781025288579-obj2y.png)
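The three trace shapes can be checked mechanically once you have per-turn input-token counts. The sketch below is one rough way to do it. The function name, the `short_max_turns` cutoff, and the `growth_ratio` threshold are all assumptions to tune against your own traces, not values from any benchmark.

```python
from statistics import mean
from typing import List


def classify_trace(
    input_tokens_per_turn: List[int],
    short_max_turns: int = 3,    # assumed cutoff for a "short" trace
    growth_ratio: float = 1.5,   # assumed threshold for "growing"
) -> str:
    """Map a trace to one of the three shapes in the diagnosis table."""
    turns = len(input_tokens_per_turn)
    if turns == 0:
        raise ValueError("trace has no turns")
    if turns <= short_max_turns:
        return "healthy"
    # Compare the middle turns (everything except first and last)
    # against the first turn to see whether spend grew mid-loop.
    middle = input_tokens_per_turn[1:-1]
    if mean(middle) >= growth_ratio * input_tokens_per_turn[0]:
        return "retrieval_thrash"
    return "history_or_schema_bloat"


short = classify_trace([8_000, 8_200])
flat_long = classify_trace([8_000] * 8)
growing_long = classify_trace([8_000, 12_000, 20_000, 30_000, 30_000, 31_000])
```

With these illustrative inputs, `short` is classified as healthy, `flat_long` as history or tool-schema bloat, and `growing_long` as retrieval thrash.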

If your traces look like the third row, caching and pruning won't fix it. The agent is reading bad source material, and the fix lives upstream of the loop. [Falconer](https://falconer.com/?utm_source=blog&utm_medium=referral&utm_campaign=content) is built for that case: try it on a single agent first, then expand to the fleet once you see the loop length drop.

## FAQ

### What are agent loops?

An agent loop is a small program that prompts a coding agent, reads what it produced, decides whether the work is done, and if not, prompts the agent again. Steve Kinney's framing is the cleanest: a minimum viable agent is a `while` loop, a model call, and a tool registry. Everything else is engineering around the edges. The loop replaces one-shot prompting with a self-running cycle of plan, act, verify, and iterate.

### How do you build an agent loop?

Start with five components: a goal, a model call, a tool registry the agent can use to act on the world, an evaluation gate that decides if the work is done, and a stopping condition. The eval gate is what separates a loop from a slop machine. Without one, the loop just iterates faster on the wrong answer. Closed loops (bounded scope, defined steps, eval at each step) are the version that pays off in production today. Open loops (exploratory, wide solution space, agent picks its own path) are expensive and prone to slop without a tight eval standard.
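The five components fit in a few lines. This is a minimal sketch, not a production framework: `run_loop`, `fake_model`, and the two stub tools are illustrative stand-ins so the example runs without a real model or real tools.

```python
from typing import Callable, Dict, List


def run_loop(
    goal: str,
    call_model: Callable[[List[str]], Dict[str, str]],
    tools: Dict[str, Callable[[str], str]],
    is_done: Callable[[str], bool],
    max_turns: int = 5,
) -> Dict[str, object]:
    history = [goal]                                    # 1. goal seeds the context
    for turn in range(1, max_turns + 1):                # 5. stopping condition (turn budget)
        action = call_model(history)                    # 2. model call picks the next action
        output = tools[action["tool"]](action["arg"])   # 3. tool registry executes it
        history.append(output)
        if is_done(output):                             # 4. eval gate decides if work is done
            return {"done": True, "turns": turn, "output": output}
    return {"done": False, "turns": max_turns, "output": history[-1]}


def fake_model(history: List[str]) -> Dict[str, str]:
    """Stub model: search on the first turn, then answer from the notes."""
    if len(history) == 1:
        return {"tool": "search", "arg": history[0]}
    return {"tool": "answer", "arg": history[-1]}


tools = {
    "search": lambda query: f"notes for: {query}",
    "answer": lambda notes: f"ANSWER based on {notes}",
}

result = run_loop(
    "How do we rotate API keys?",
    fake_model,
    tools,
    is_done=lambda output: output.startswith("ANSWER"),
)
```

Here the stub loop finishes on its second turn. Swap in an eval gate that never passes and the same loop runs until `max_turns`, which is why the stopping condition is a separate component from the gate.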

### Why do agent loops burn tokens?

Every turn re-reads the full conversation history and re-sends the tool definitions, so a 10-turn loop sends roughly 50x the tokens of a single linear call. On top of that, when retrieval misses, the agent broadens its query, pulls more documents, and asks the model to reconcile contradictions across stale and current sources. Each step is well-formed and billable, so each request looks healthy in your gateway and the pattern across requests goes unseen. The [Stanford Digital Economy Lab](https://digitaleconomy.stanford.edu/publication/how-do-ai-agents-spend-your-money-analyzing-and-predicting-token-consumption-in-agentic-coding-tasks/) found agentic coding tasks consume \~1000x more tokens than chat or code reasoning, with up to 30x variance on the same task. Most of that variance is path-dependent, driven by the quality of the retrieved context.
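The roughly-50x figure falls out of simple arithmetic when each turn re-sends everything before it. A minimal sketch, assuming every turn adds about as much context as the original call did (a simplification; `base_tokens` and the function name are illustrative):

```python
def loop_input_tokens(turns: int, base_tokens: int) -> int:
    """Total input tokens when turn k re-sends k units of context."""
    # Turn 1 sends 1 unit, turn 2 sends 2 units, ... turn n sends n units.
    return sum(k * base_tokens for k in range(1, turns + 1))


single_call = loop_input_tokens(1, base_tokens=1_000)
ten_turns = loop_input_tokens(10, base_tokens=1_000)
ratio = ten_turns / single_call  # 55.0, i.e. n * (n + 1) / 2 for n = 10
```

Under this assumption, cumulative input grows with the square of the turn count, so halving loop length cuts input tokens by well over half.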

### What is retrieval thrash?

Retrieval thrash is the input-token spend driven by an agent that misses on its first retrieval, broadens its query, pulls more documents, and asks the model to reconcile contradictions across sources. Every step is well-formed and billable, and each one looks healthy in your gateway. It is usually the largest hidden line on an agent's input-token bill, and the line that observability tools can't see directly.

### Does prompt caching solve retrieval thrash?

No. Caching reduces the unit cost of the tokens you send. It does not change how many tokens you need to send. If your retrieval is wrong, caching just makes wrong retrievals cheaper. A clean knowledge layer reduces the number of tokens by fixing the retrieval itself.
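The two levers can be separated in a toy cost model: caching changes the price per token, while better retrieval changes the token count. Everything numeric below is an illustrative assumption (the price, the cached fraction, the discount, and the 8,000-token short loop). Only the 80,000-token figure echoes the example earlier in this guide.

```python
def input_cost(
    tokens: int,
    price_per_million: float,
    cached_fraction: float = 0.0,
    cache_discount: float = 0.0,
) -> float:
    """Input cost in dollars; cached tokens are billed at a discounted rate."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    billable = uncached + cached * (1.0 - cache_discount)
    return billable * price_per_million / 1_000_000


PRICE = 3.00  # hypothetical dollars per million input tokens

# Same long loop, with and without caching: the token count is unchanged.
thrash_no_cache = input_cost(80_000, PRICE)
thrash_cached = input_cost(80_000, PRICE, cached_fraction=0.5, cache_discount=0.9)

# A shorter loop sends fewer tokens in the first place.
short_loop = input_cost(8_000, PRICE)
```

The ordering of these three results depends entirely on the assumed values, so treat it as a way to plug in your own numbers, not as a benchmark.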

### Who should be using agent loops?

Engineering teams running coding agents on routine work (PR reviews, bug triage, runbook execution, codebase questions, postmortem drafting) are the obvious early adopters. The threshold isn't team size; it's repeatability. If the same kind of task runs more than a few times a week, a closed loop with a clear eval gate will outperform one-shot prompting. Teams without a current knowledge layer should fix that first. A loop reading from a stale wiki at 100 calls an hour produces wrong answers faster, not better ones.

### How do you connect an agent loop to a single source of truth?

The loop has to read from and write to the same knowledge layer. Reading is the obvious half: the agent queries a current, structured source (code, docs, tickets, team history) instead of crawling tools independently every turn. Writing is the half most teams miss. When the agent drafts a postmortem mid-incident, updates a runbook to match a config change, or resolves an ambiguity, that work has to land back in the source. Otherwise the loop's output disappears into a chat window and the next loop starts from the same degraded baseline. MCP is the protocol most coding agents use to query enterprise knowledge layers, and the implementations worth using read and write.

### Where does Falconer fit in an agent loop?

Falconer is a knowledge layer for engineering teams that connects to coding agents via MCP. The agent reads from Falconer instead of thrashing across docs, tickets, and Slack. The first retrieval lands a cited answer grounded in current state. The loop ends in one or two turns instead of six. As the agent produces new artifacts (postmortems, runbook updates, decisions), it writes back through the MCP server so the next loop starts from a smarter baseline. Token savings compound across the fleet instead of getting re-spent inside each agent.