Back to Guides

AI FinOps for engineering teams: where token waste actually comes from (May 2026)

In Deloitte’s 2028 AI infrastructure outlook, many of the 550 enterprises surveyed already generate more than 10 billion tokens a month, and the share expecting to cross 100 billion roughly triples by 2028. Gartner reports that only 63% of organizations track AI spend at all, up from 31% the year before. The math has not waited for the tooling.

AI FinOps for engineering teams is the practice of making AI spend visible, governable, and addressable inside the engineering org. It spans four sources of token waste, and building Falconer has made one of them dominate the others in clear relief. Falconer is a knowledge agent for engineering teams that writes and updates your documentation as the codebase changes, then returns cited answers grounded in how your company works today. The reason that matters for cost is simple: when an agent can find the right answer on the first call, it stops looping. When it cannot, every retry is billable. This guide walks the full territory, and explains why the source most teams underweight is the one that compounds with everything else.

Takeaways

  • AI FinOps for engineering teams spans four sources of token waste: context quality, gateway and caching gaps, observability gaps, and recursive workflows with no cost boundary.
  • The largest and least-visible source is context quality. An agent that cannot retrieve the right answer retries, re-derives, and re-reads, and most observability dashboards do not surface those retries as a cost line.
  • Falconer is the knowledge agent that solves the context-quality source directly. By writing and updating docs as code changes and delivering cited answers to agents and humans, Falconer cuts the retrieval thrash that drives the largest input-token bills.
  • Anthropic prompt caching delivers around 90% cost reduction on cached input tokens. OpenAI’s automatic cache returns about 50%. These help most when the cached content is correct, which is a Falconer problem before it is a gateway problem.
  • A single unconstrained coding agent typically costs $5 to $8 per software engineering task, and Microsoft Research found the same agent on the same task can vary by up to 30x. Most of that variance is path-dependent, and most of the path is determined by how good the retrieved context is.

What “AI FinOps for engineering teams” actually means

FinOps started as a discipline for cloud spend, run jointly by engineering, finance, and a small platform group that owned the bill. AI FinOps is the same shape with a different unit of work. The bill is no longer GPU hours; it is tokens. The variability is no longer a misconfigured instance type; it is a recursive agent that decided to look at three more files than it should have.

For engineering teams the framing changes again. Most FinOps coverage treats token waste as a budgeting failure that a dashboard or a quota fixes. Engineers know the dashboard. The dashboard tells you that spend is up. It does not tell you why a specific agent burned 80,000 tokens to answer a question that lives in one paragraph of a doc somewhere.

There are four places that “why” tends to live. They compound, which is why a quota by itself rarely works. One of them dominates the others in Falconer’s data, and it is the one most coverage skips.

The four sources of agent token waste

SourceWhat it looks likeTypical fix layerFalconer’s role
Context qualityStale docs, scattered sources, retrieval misses, model re-derives the answerKnowledge agent for engineering teamsFalconer is the answer
Gateway and cachingPrompt caching off, no provider failover, no per-team budget capsLLM gateway (Portkey, Helicone, Cloudflare AI Gateway)Compounds with Falconer
ObservabilityNo per-app, per-agent, per-user cost breakdown; spikes invisible until invoiceTracing and analytics (Langfuse, Helicone, Datadog)Compounds with Falconer
Recursive workflowsAgents loop on failure; conversation history resent every turn; tool definitions bloatAgent framework, conversation pruning, max-turn limitsFalconer reduces loop depth

Source 1: Context quality (the one most teams miss)

This is the source the dashboard cannot see and the one that makes every other source worse.

An agent that gets a good answer on the first call ends. An agent that gets a stale, partial, or contradicted answer does not. It re-queries, it expands its search, it opens more files. Often it ends up asking the model to reconcile two versions of the same fact. Every one of those moves is billable, and most of them are invisible to the gateway layer because they look like normal, well-formed requests.

Anthropic’s research on context engineering has been making this point for the last year. Milvus and several vector DB vendors publish guides on RAG cost that all land on the same finding: poorly tuned retrieval bloats prompts with loosely relevant documents, and without aggressive ranking and capping you are paying to confuse the model. The economic effect is a quiet tax on every request your agent makes.

The mechanical version of why this dominates the bill:

  • Stale documentation is rarely flagged as stale, so the retrieval layer returns it with full confidence.
  • The agent reads stale content, attempts a step, fails, and retries with a broader query.
  • The broader query returns more documents, which inflates the input prompt.
  • The model spends tokens reasoning about contradictions across versions.
  • None of that shows up as a “failed request” in your observability tool. It shows up as healthy traffic that costs three to ten times more than it should.

Falconer is the knowledge agent for engineering teams that solves this layer directly. It writes and updates your documentation as your code changes, then delivers cited answers to agents in the editor, in Slack, in MCP, or wherever the work happens. Two effects on token spend follow.

First, retrievals get shorter and more relevant. The agent does not pull six adjacent documents in the hope that one of them is right. It pulls the current, cited answer.

Second, loops shorten. Most agent retries are triggered by context misses. When the first retrieval is correct, the loop ends in one turn instead of six. That is the input-token compounding effect that makes context quality the source that interacts with every other source on this page.

A longer walkthrough of how Falconer builds and maintains this layer lives in the guide on scaling Karpathy’s LLM wiki pattern to your org.

Source 2: Gateway and caching gaps

This is the cheapest fix and the most often skipped. Anthropic introduced prompt caching for production traffic in late 2024 and the math has been stable since: a cache hit costs roughly 10% of the standard input price, and the read pays for the write after one or two reuses depending on the cache duration you choose. OpenAI rolled out automatic caching with a 50% read discount. The official Anthropic pricing documentation is easier to read than most product changelogs.

A short inventory you can run today:

  • Does all of your agent traffic and chat endpoints flow through a gateway?
  • Are long system prompts and tool definitions cached?
  • Is per-team or per-project metering enabled, so a runaway feature is invoiced to the team that shipped it?
  • Do you have provider failover, so a transient API error does not trigger a retry storm?

The gateway layer is well covered. Helicone, Langfuse, Portkey, Cloudflare AI Gateway, and a handful of open-source proxies all do the job. Pick the one your team will deploy and move on.

Caching compounds with Falconer in a specific way. Caching gives you a discount on the input tokens you send. Falconer reduces how many input tokens you need to send in the first place. The two effects multiply: a smaller, more accurate retrieval, cached at 10% of standard input price, is the cheapest possible request your agent can make.

Source 3: Observability gaps

You cannot manage what you cannot see, and most engineering orgs cannot see AI spend with anything close to the fidelity they have for cloud. The 2026 State of FinOps report puts AI cost allocation as the number-one practitioner priority for the second year running, ahead of GPU optimization and contract negotiation. Most CFOs know what their total Anthropic and OpenAI invoices were last month. Almost none can tell you which product surface, which agent, or which feature was responsible for the top five percent of that bill.

A working observability layer for AI gives you:

  • Per-request traces with token counts, cost, latency, and the prompt that generated them.
  • Tags by team, project, model, environment, and customer where applicable.
  • Alerts on cost-per-task drift, not just total spend drift. A 2x increase in cost per task is the canary; a 2x increase in total spend is the smoke alarm.
  • Sampling rules so production traffic does not double the bill.

Datadog LLM Observability, Langfuse, and Arize Phoenix are the most common picks. Pricing models vary; do the math before you commit. Several teams have written publicly about cutting LLM observability costs by half just by moving from per-event to sampled pricing tiers.

Observability shows you which agents are expensive. It does not tell you why. The two most common “why” answers in Falconer engagements are recursion (source four) and context misses (source one). Once you have observability in place, the next move is almost always to investigate context quality, because context misses produce healthy-looking traces that just cost more than they should.

Source 4: Recursive workflows with no cost boundary

This is where the headline numbers come from. The Stanford Digital Economy Lab and Microsoft Research both published work in 2025 showing that an unconstrained coding agent costs roughly $5 to $8 per software task, and that the same agent running the same task can vary by up to 30x depending on the path it takes. The cost is almost entirely on the input side, because the agent re-reads its history on every turn. A 10-turn Reflexion loop sends roughly 50x the tokens of a single linear pass.

Three patterns drive most of the damage.

Conversation history resent on every turn. By the time an agent is twenty messages in, it is paying for twenty times its starting context, every call. Pruning, summarization, and persistent scratchpads sound like research topics; they are line items.

Tool definitions in every call regardless of relevance. Most agent frameworks ship every tool’s schema in every request. If your agent has thirty tools and uses three on a given task, you are paying for the other twenty-seven every turn.

No max-turn budget. An agent that cannot solve the task should fail fast and surface to a human. Most production agents we have audited have no hard ceiling on iterations.

The fixes are well-known and cheap to implement. Most are configuration changes inside the agent framework, not new infrastructure. The deeper problem is the one that makes the loop spin in the first place, which is the context-quality source above. Falconer attacks recursion at its root. An agent reading current, cited answers terminates in one or two turns. An agent reading scattered, stale sources keeps trying. Most teams see their median loop length drop measurably within the first week of Falconer being in production.

Why Falconer is the load-bearing answer in an AI FinOps stack

The other three sources of token waste are real and worth fixing. The reason context quality dominates in Falconer’s customer data is structural.

A gateway gives you a discount on the tokens you send. An observability tool tells you which agents are expensive. A recursion limit caps how many turns an agent can spend before giving up. None of those tools change the underlying probability that the agent’s next retrieval will be right.

Falconer changes that probability. Falconer writes and updates documentation as your code changes, links the docs back to the PRs, tickets, Slack threads, and incident retros that produced them, and returns cited answers grounded in how the company works today. That structure — code connected to decisions, decisions connected to ownership, all of it kept current automatically — is what Falconer’s context graph for engineering teams is built around. Agents reading Falconer find the right answer faster, ask fewer follow-up questions, and finish more tasks in fewer turns. The savings compound with caching, with observability, and with recursion limits, because every one of those layers works better when the underlying retrieval is accurate.

Falconer does not fix an entire AI bill on its own. But of the four sources on this page, three are tooling decisions and one is a knowledge decision. The knowledge decision is the one most teams underweight and the one whose fix compounds with everything else they are doing.

What “good” looks like across all four sources

The most expensive thing engineering teams do at this stage is treat AI FinOps as a tooling decision. It is an operating model decision with tooling support. The teams making real progress have done some version of the following.

They have one person, usually inside platform engineering, who owns AI spend the way an SRE owns latency. The role does not need to be a full FTE, but it needs to be named.

They have committed to a single gateway and a single observability backend, even if both are open-source. The data lives in one place. The team that owns the AI bill can answer “where is the spend coming from” in under five minutes.

They have written down, somewhere visible, what their agents are allowed to spend per task and what should happen when they exceed it. This is not a quota in a config file. It is a policy with a human on the other end of it.

They treat their knowledge sources as a tier-one infrastructure dependency, not a Confluence chore. Docs that drift from the code become a reliability problem because agents reading stale docs make stale decisions. Most teams arrive at this conclusion only after a costly incident. The ones that arrive at it earlier, often by adopting Falconer alongside their gateway and observability rollout, save money on the input side from day one.

For the gateway and observability side of all of this, the FinOps Foundation’s working group on AI publishes the most useful primary material available.

Where to start

If you have not enabled prompt caching, start there. It is free, it is reversible, and it usually pays for the next month of FinOps work on its own.

If you have caching on and still cannot answer “which feature spent the most yesterday,” put an observability layer in next. Helicone or Langfuse get you to a credible answer in a day.

If observability is in place and your agents still loop on tasks they should solve in two turns, you have a recursion problem. Pull up the traces and set a max-turn budget. Then look at why the loops are happening. The answer is almost always context.

If your traces show retrieval thrash, stale answers, or agents re-deriving facts that exist elsewhere in the org, you have a context problem. Falconer is the knowledge agent built for this layer. The Falconer team is happy to walk a trace and show what changes when the retrieval is right. You can also read more about Falconer pricing and engineering use cases.

FAQ

What is AI FinOps for engineering teams?

AI FinOps for engineering teams is the practice of giving engineers visibility and control over how much their AI workloads cost. It spans four sources of token waste: context quality, gateway and caching, observability, and recursive workflows. Falconer is the knowledge agent that solves the context-quality source by writing and updating docs as code changes and returning cited answers to agents.

Why is context quality the biggest source of agent token waste?

An agent that retrieves a stale or partial answer does not stop. It retries with broader queries, opens more files, and asks the model to reconcile contradictions across sources. Each of those steps adds input tokens. A clean retrieval ends the loop in one turn; a thrashing one can multiply token spend by an order of magnitude without ever showing up as a failed request in your dashboard. Falconer is the engineering-context layer that gives agents the right answer on the first call.

Does Falconer replace an LLM gateway like Helicone or Portkey?

No. The gateway handles routing, caching, and metering. Falconer is a knowledge agent for engineering teams that handles context quality. They sit at different layers of the same AI FinOps stack and the savings compound. Caching gives you a discount on tokens; Falconer reduces how many you need to send.

How much can prompt caching realistically save?

For Anthropic models, cached input tokens cost roughly 10% of the standard input price, which works out to about 90% savings on the cached portion. OpenAI’s automatic cache returns about 50% on cached reads. Real-world savings usually land in the 30% to 70% range depending on how much of your traffic uses long, stable system prompts. Caching pays off most when the cached content is correct, which is why teams running Falconer on the retrieval layer see compounding savings.

Why do AI agents cost so much per task?

The cost is dominated by input tokens, not output. Every turn of an agent loop re-reads the conversation history and re-sends the tool definitions, so a 10-turn loop sends roughly 50x the tokens of a single call. Microsoft Research found unconstrained coding agents cost $5 to $8 per task and that costs varied by up to 30x for the same task. Most of that variance is context-dependent: agents reading clean, current context terminate faster.

What is the first thing to do if my AI bill just spiked?

Pull your gateway logs and group by team, app, and model. If you do not have a gateway, that is the first install. Once you can see the spike’s source, the next question is whether the expensive agent is looping. If it is, the most common root cause is context quality, which is the layer Falconer is built for.

How does Falconer reduce AI agent token spend?

Falconer is a knowledge agent for engineering teams that writes and updates documentation as your code changes, then returns cited answers grounded in how the company works today. Agents reading Falconer get the right answer on the first call, which shortens retrievals, ends loops faster, and reduces input-token bills on the largest cost line in most AI workloads.

Where does the FinOps Foundation publish guidance for AI specifically?

The FinOps for AI working group maintains the most current framework, and the 2026 State of FinOps report names AI cost allocation as the top practitioner priority for the second year running.