Agent loops are fine; retrieval is what you're paying for

Agent loops are a hot topic right now. But without a strong context layer, most teams end up with massive token spend running loops they didn’t really need in the first place.

If you’ve been anywhere near dev Twitter or your team’s Slack this week, you’ve watched someone wire up an agent loop and let it plan, act, observe, and repeat until the task is done. It looks like magic in a screen recording, until you check the bill.

We’ve been running these loops in our own product for weeks, long enough to get a clear view of where the cost comes from: the loop is almost never the problem. What you feed it is.

Why the demos look so easy

A loop becomes cheap and reliable under two conditions: a generous token budget, and clean, well-organized context for the agent to search. The most impressive demos usually have both, which goes a long way toward explaining why they look so good. Frontier labs are a good example: they have abundant compute and carefully maintained internal context, and those two things do more of the heavy lifting than the loop itself.

The other variable is how isolated the task is. A loop that occasionally wanders is fine on a self-contained job, where nothing downstream depends on it being efficient. On the core of a system, that slack disappears fast. So when a demo runs smoothly, it’s usually some mix of plenty of budget, good context, and a contained task. Change any of those and the same loop starts to strain.

The token bill

Here’s our own agent as an example, because it makes the cost concrete.

A 20-turn conversation runs about $9.50 in input tokens alone. With basic context management, meaning we clear tool results we no longer need and summarize earlier turns, the same conversation drops to about $3.20. That’s roughly two-thirds cheaper, purely from being more careful about what we keep.
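The two techniques above are clearing tool results the agent no longer needs and summarizing earlier turns. Below is a minimal sketch of both, under stated assumptions. It is not how our agent actually implements them. The `Message` shape and the `keep_recent` window are hypothetical. The `summarize` function is also hypothetical: it is a stand-in that just keeps the first few words rather than calling a model. Token counts are approximated by word count.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def approx_tokens(messages: list[Message]) -> int:
    # Rough proxy for token count: whitespace-separated words (a real system would use a tokenizer).
    return sum(len(m.content.split()) for m in messages)


def summarize(messages: list[Message], max_words: int = 30) -> Message:
    # Hypothetical stand-in for an LLM summarization call: it only keeps the first few words.
    words = " ".join(m.content for m in messages).split()
    return Message("assistant", "Summary of earlier turns: " + " ".join(words[:max_words]))


def manage_context(history: list[Message], keep_recent: int = 4) -> list[Message]:
    """Clear stale tool results and summarize everything before the last `keep_recent` messages."""
    if len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    # Drop tool outputs from older turns; the assistant replies already carry their conclusions.
    older = [m for m in older if m.role != "tool"]
    compacted = [summarize(older)] if older else []
    return compacted + recent


# Usage: ten turns, each with a bulky tool result that is dead weight once the turn is over.
history: list[Message] = []
for turn in range(10):
    history.append(Message("user", f"question {turn} " + "detail " * 20))
    history.append(Message("tool", "search result " * 200))
    history.append(Message("assistant", f"answer {turn} " + "reasoning " * 30))

print(approx_tokens(history), approx_tokens(manage_context(history)))  # prints: 4540 520
```

The most recent tool result is deliberately kept. The agent may still be reasoning about it, and only older turns are safe to compact.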

The bigger version of this is easy to hit: a single workflow can burn through 4 million tokens in ten minutes. It’s doing exactly what it was told, on a context window that keeps growing because nothing stops it.
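The number climbs this fast because, in the usual stateless setup, every call resends the whole conversation so far as input. If each turn adds roughly the same number of tokens, cumulative input grows with the square of the turn count. The sketch below shows that arithmetic. The 2,000-token base prompt and 5,000 tokens per turn are made-up illustrative values, not measurements from our agent.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int, base_prompt: int = 0) -> int:
    """Total input tokens billed across `turns` calls when each call resends the full history.

    Call i (1-indexed) sees the base prompt plus i turns' worth of accumulated context, so the
    total is base_prompt * n + tokens_added_per_turn * n * (n + 1) / 2.
    """
    return sum(base_prompt + i * tokens_added_per_turn for i in range(1, turns + 1))


# Illustrative numbers only: a 2,000-token base prompt and 5,000 new tokens per turn.
for turns in (10, 20, 40):
    print(turns, cumulative_input_tokens(turns, 5_000, base_prompt=2_000))
```

With these made-up numbers, 20 turns bill about 1.09M input tokens and 40 turns about 4.18M. Doubling the length of the loop nearly quadruples the input bill.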

So why does the context grow?

Retrieval quality is most of the answer.

The agent runs its first search, and the results come back incomplete: close, but missing the one point that mattered. So it does the reasonable thing, widens the query, and searches again. Now it’s pulling more documents and more noise, and carrying all of it forward into every turn that follows. The loop gets drastically longer simply because the first pass didn’t find what it needed.
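Here is a toy model of that spiral. The agent searches, and if the key document isn't in the results, it widens the query by doubling how many documents it asks for. Everything it retrieves stays in context, including overlapping results from earlier searches. The corpus, the `needle` document, and the two rankings are hypothetical. They are chosen only to show how a weak first pass inflates context.

```python
def run_search_loop(ranked_docs: list[str], needle: str, k: int = 3,
                    max_turns: int = 10) -> tuple[int, int]:
    """Simulate an agent that widens its search (doubling k) until `needle` shows up.

    `ranked_docs` is the retriever's ranking for the query, best first.
    Returns (turns_taken, documents_carried_in_context).
    """
    context: list[str] = []
    for turn in range(1, max_turns + 1):
        results = ranked_docs[:k]
        context.extend(results)   # every result lands in the transcript, overlaps included
        if needle in results:
            return turn, len(context)
        k *= 2                    # the "reasonable" move: widen the query and try again
    return max_turns, len(context)


corpus = [f"doc-{i}" for i in range(100)]
needle = "doc-42"

# Good retrieval: the relevant document is ranked first.
good = [needle] + [d for d in corpus if d != needle]
# Weak retrieval: the relevant document is buried at rank 20.
weak = [d for d in corpus if d != needle]
weak.insert(20, needle)

print(run_search_loop(good, needle))  # (1, 3)
print(run_search_loop(weak, needle))  # (4, 45)
```

The model and the task are the same in both runs. The only difference is where the first pass ranked the answer, and the weak ranking ends up carrying fifteen times as many documents.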

The root cause is the quality of the context going in. That makes it a context engineering problem, not a model problem, so you can’t fix it by swapping in a smarter model or capping the number of turns. You fix it upstream, at retrieval.
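One common way to act on this at retrieval is to over-fetch candidates and rerank them. Only the candidates that clear a relevance bar and fit a token budget are admitted into the agent's context. This is a generic sketch, not how any particular product does it. The `term_overlap` scorer is a hypothetical stand-in for a real reranker, and the `min_score` and `token_budget` defaults are arbitrary.

```python
def term_overlap(query: str, doc: str) -> float:
    # Hypothetical relevance score: fraction of query terms that also appear in the doc.
    # A real system would use a trained reranker or embedding similarity instead.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def select_context(query: str, candidates: list[str],
                   min_score: float = 0.5, token_budget: int = 50) -> list[str]:
    """Rerank candidates; keep the best ones that clear `min_score` and fit `token_budget`."""
    scored = sorted(((term_overlap(query, d), d) for d in candidates),
                    key=lambda pair: pair[0], reverse=True)
    selected: list[str] = []
    used = 0
    for score, doc in scored:
        if score < min_score:
            break                  # sorted descending, so every later doc scores no higher
        cost = len(doc.split())    # rough token estimate: word count
        if used + cost > token_budget:
            continue               # doesn't fit; a shorter, lower-ranked doc still might
        selected.append(doc)
        used += cost
    return selected


docs = [
    "retry policy for the billing service uses exponential backoff",
    "team offsite agenda and lunch options",
    "billing service retry limits were lowered after the march incident",
    "style guide for internal documentation",
]
print(select_context("billing retry policy", docs))
```

Here the two billing documents are kept and the unrelated ones never reach the agent. Because of that, the first pass hands the loop less noise to carry forward.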

Hyper-productivity without the infinite budget

When retrieval is good and the task is small and self-contained, such as renaming something across a repo or writing a migration, agent loops are excellent: they’ll run, finish quickly, and end up being fairly cheap.

They have a harder time on the core of a system, where the back-and-forth with a human is the point and the hard part is the judgment happening inside the loop. So how do you get lab-grade async productivity on a normal budget? You don’t make the loop smarter; you make the first retrieval better.

That’s the bet we’re making with Falconer. Keep a company’s context, the decisions and the docs and the reasoning behind them, in one place that stays current, and let agents pull from it directly. We’ve been running Claude Code against our own Falconer instance over MCP, and the difference is clear: when the first pull is accurate, the loop is short, and short loops are cheap.

The takeaway from this week’s debate sits somewhere in the middle: loops aren’t the whole future, and they aren’t a gimmick. Get the first retrieval right, and most of the cost goes with it.
