Engineering · 9 min read · April 10, 2026

Why Your AI Agents Are Burning Money (And How to Stop It)

AI agents consume 10-50x more tokens than chatbots. Learn how agent loop detection, per-step model routing, and context compression cut agent costs by 65-75%.


NeuralRouting Team

April 10, 2026

AI agents are incredible at racking up your API bill.

A chatbot sends a message, gets a response, done. An agent reads 15 files, calls 4 tools, retries twice when something fails, and accumulates context across dozens of turns. By the time it finishes a task, it might have consumed 500K+ tokens. On Opus 4.6 at $5/$25 per MTok, that single agent session costs $2-5 before you even count the output.

Multiply that by hundreds of users, and you've got a problem that no amount of fundraising solves.

I've been thinking about this a lot while building NeuralRouting, because agent workloads are where cost optimization matters most, and where most teams are doing it worst.

The agent cost problem is different from chatbot costs

With a chatbot, costs are roughly predictable. User sends message, model responds, you can estimate the average cost per conversation. Easy.

Agents are different in three ways that mess up your budget:

Token accumulation. Every turn appends to the conversation history. An agent that takes 20 steps to complete a task resends the entire context on each step. By step 20, you're paying input token costs on a massive context window — not because the task is complex, but because the conversation grew.

Tool call overhead. Each tool call generates tokens for the function definition, the arguments, the result, and the model's interpretation of the result. A coding agent that reads a file, edits it, runs tests, and reads the error output can burn through 50K tokens just on the tool call overhead.

Retry loops. When an agent hits an error, it retries. Sometimes it retries the same thing 3-4 times with slightly different approaches. Each retry resends the full context. These loops are where money goes to die.

Cognition (the team behind Devin) measured that 60% of agent compute goes to search and context retrieval (reading files, looking things up), not to actual code generation. You're paying input prices for the agent to think about what to do, not for the output you actually want.

Agent loop detection: the first thing to fix

The most expensive agent failure mode is the infinite loop. Agent tries something, fails, tries a variation, fails, tries another variation, and keeps going. I've seen sessions where an agent burned through $10-15 in tokens looping on a problem it was never going to solve.

You need a kill switch. Here''s what to look for:

  • Step limits. Set a hard cap on how many turns an agent can take per task. 25-30 is reasonable for most workflows. If it hasn't solved it in 30 steps, it's not going to.
  • Repetition detection. If the agent is sending substantially similar tool calls on consecutive turns, something is stuck. Flag it and escalate to a human or a different model.
  • Cost caps per session. Set a dollar limit. When a session crosses $2 (or whatever your threshold is), kill it. No task is worth infinite money.

NeuralRouting has agent loop detection built into the routing layer, which catches these before they spiral. But even a simple counter on your side works. The point is that you need something.

Route agent steps individually

Most teams get this wrong. They pick a model for the agent, say Opus 4.6, and every step runs on Opus. But agent tasks aren't uniformly complex.

A typical coding agent session might look like:

  1. Read the file structure → Simple, Haiku can do this
  2. Analyze the error message → Medium, Sonnet handles it
  3. Generate a fix → Complex reasoning, Opus makes sense here
  4. Write the code → Medium, Sonnet
  5. Run tests and interpret results → Simple, Haiku
  6. Summarize what was done → Simple, Haiku

Out of 6 steps, only 1 actually needs Opus. But most teams run all 6 on Opus because it's easier to configure.

If you route each step to the right model:

  • Steps 1, 5, 6 on Haiku 4.5: $1/$5 per MTok
  • Steps 2, 4 on Sonnet 4.6: $3/$15 per MTok
  • Step 3 on Opus 4.6: $5/$25 per MTok

That's roughly 50-60% cheaper than running everything on Opus. Same output quality, because the easy steps don't need a frontier model.

The challenge is that most agent frameworks don't support per-step model selection natively. You either have to hack it yourself or use a routing layer (like NeuralRouting) that scores each step's complexity and picks the model automatically.
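A hacked-together version can be as simple as mapping step types to model tiers. This sketch uses the per-MTok prices quoted above; the step-type classifier is a stand-in for whatever your framework knows about each step (a real router would score complexity from the step's actual content).

```python
# Per-step model routing sketch. Step types and tiers mirror the six-step
# coding-agent example above; the classification itself is a placeholder.

PRICES = {  # (input, output) in $ per MTok, from the figures above
    "haiku-4.5":  (1, 5),
    "sonnet-4.6": (3, 15),
    "opus-4.6":   (5, 25),
}

def route_step(step_type: str) -> str:
    simple = {"read_files", "run_tests", "summarize"}
    medium = {"analyze_error", "write_code"}
    if step_type in simple:
        return "haiku-4.5"
    if step_type in medium:
        return "sonnet-4.6"
    return "opus-4.6"  # complex reasoning (e.g. generating the fix)

def session_cost(steps: list[tuple[str, int, int]]) -> float:
    """Estimate cost for (step_type, input_tokens, output_tokens) triples."""
    total = 0.0
    for step_type, tokens_in, tokens_out in steps:
        price_in, price_out = PRICES[route_step(step_type)]
        total += tokens_in * price_in / 1e6 + tokens_out * price_out / 1e6
    return total
```

The same `session_cost` with `route_step` hardcoded to `"opus-4.6"` gives you the baseline to compare against.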

Cache aggressively between agent steps

Agents are repetitive by nature. The system prompt, the project context, the conversation history up to the current point — all of this gets resent on every turn.

Anthropic's prompt caching helps here. If your agent's system prompt is 50K tokens and you cache it, every subsequent step costs 90% less on those 50K tokens. Over a 20-step session on Opus, that's the difference between roughly $5 of repeated input and well under a dollar.
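The arithmetic is worth seeing explicitly. This sketch assumes cached reads bill at roughly 10% of the base input rate and the initial cache write at roughly 1.25x, which matches Anthropic's published multipliers at the time of writing; check current pricing before relying on the exact numbers.

```python
# Rough cost arithmetic for caching a 50K-token system prompt over a
# 20-step session on Opus at $5/MTok input. Assumed multipliers: cache
# reads ~0.1x base input price, the first cache write ~1.25x.

prompt_tokens = 50_000
steps = 20
base = 5 / 1_000_000  # $ per input token on Opus

uncached = prompt_tokens * steps * base
cached = (prompt_tokens * base * 1.25               # first step: cache write
          + prompt_tokens * (steps - 1) * base * 0.10)  # later steps: cache reads
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

On these assumptions the repeated prompt drops from $5.00 to about $0.79, an 84% saving on that portion of the input.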

But there''s a more interesting opportunity: semantic caching across users. If 10 different users ask their agents to do roughly the same thing ("fix the type error in the login component"), the first agent does the work and the next 9 get a cached response. This only works for common enough requests, but in production apps with many users doing similar tasks, it adds up.

Compress context between turns

This is an aggressive optimization but it works: instead of keeping the full conversation history, summarize older turns into a compact representation.

The agent doesn't need the exact text of step 3 when it's on step 18. It needs to know what happened. A 200-token summary of "read the user model, found a missing validation on the email field" carries the same information as the 5,000 tokens of the original tool call and response.

Use a cheap model (Haiku) to summarize older context, then feed the compressed version to the expensive model for the current step. You''re trading a few cents on summarization for dollars saved on context that would otherwise keep growing.

The risk: you lose detail. If the agent needs to reference something specific from step 3, the summary might not have it. A good middle ground is keeping the last 3-5 steps in full detail and summarizing everything before that.
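That middle ground looks like this in sketch form. The `cheap_summarize` function is a placeholder for an actual call to an inexpensive model like Haiku; only the keep-recent/compress-older structure is the point.

```python
# Rolling context compression: keep the last few turns verbatim, collapse
# everything older into one summary produced by a cheap model.

def cheap_summarize(turns: list[str]) -> str:
    # Placeholder: a real implementation would send `turns` to a cheap
    # model (e.g. Haiku) and return its summary.
    return f"[summary of {len(turns)} earlier turns]"

def compress_history(turns: list[str], keep_last: int = 4) -> list[str]:
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [cheap_summarize(older)] + recent
```

Run it before each expensive-model call; the history the frontier model sees stays bounded at `keep_last + 1` entries no matter how long the session runs.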

Extended thinking: powerful but expensive

Claude's extended thinking feature lets the model reason step-by-step before producing a response. The thinking tokens are billed as output tokens at the same rate.

For agents, this gets expensive fast. A complex reasoning step might generate 3,000-5,000 thinking tokens at $25/MTok (on Opus). That''s $0.075-$0.125 just for the model to think, before it produces any visible output.

Extended thinking is worth it on hard problems where accuracy matters. It's waste on simple steps. If your agent is using extended thinking on "read this file and tell me what's in it," you're paying for thinking the model doesn't need to do.

Route the thinking budget like you route the model: simple steps get no extended thinking, complex steps get it when needed.

What real savings look like

A team running a coding agent on Claude Opus 4.6, no optimization:

  • Average session: 25 steps, 400K total input tokens, 80K output tokens
  • Cost per session: $2.00 input + $2.00 output = $4.00
  • 500 sessions/day = $2,000/day = $60,000/month

After per-step routing + caching + context compression + loop detection:

  • 60% of steps routed to cheaper models
  • Prompt caching cuts input costs 40% on cached portions
  • Context compression reduces average input tokens by 30%
  • Loop detection kills 5% of sessions early that would've been 3x more expensive

Effective cost: roughly $15,000-20,000/month. Still not cheap, but 65-75% cheaper than the starting point.
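The baseline numbers above are easy to verify from the per-MTok prices. A quick check of the unoptimized figures:

```python
# Verifying the baseline: Opus 4.6 at $5/$25 per MTok, 500 sessions/day.
input_tokens, output_tokens = 400_000, 80_000
per_session = input_tokens * 5 / 1e6 + output_tokens * 25 / 1e6
per_month = per_session * 500 * 30
print(f"per session: ${per_session:.2f}, per month: ${per_month:,.0f}")
```

The optimized figure is harder to pin down exactly because the four savings interact (routing changes which prices caching discounts, compression shrinks what gets cached), which is why the post quotes a $15,000-20,000 range rather than a single number.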

This only gets worse from here

The agent cost problem is only going to get worse. As agents take on more complex, longer-running tasks, token consumption per session will grow. Models will get cheaper per token, but sessions will get longer. The net effect on your bill depends on which trend wins.

The teams that'll be fine are the ones treating token cost as an engineering problem right now (routing, caching, compressing, and monitoring) instead of hoping that the next model price drop saves them.

If you're building with agents and your cost monitoring consists of checking your Anthropic dashboard once a week, you're going to be surprised at some point. Probably soon.
