Engineering · 10 min read · April 10, 2026

How to Reduce Claude API Costs by 60-80% Without Sacrificing Quality

Claude Opus 4.6 costs $5/$25 per MTok. Most teams overpay by 3-5x. Learn how model routing, prompt caching, batch API, and semantic caching cut your Anthropic bill by 60-80%.

NeuralRouting Team

April 10, 2026

Everyone writes about reducing OpenAI costs. Almost nobody talks about Anthropic.

That's weird, because Claude Opus 4.6 costs $5/$25 per million tokens (input/output), and Sonnet 4.6 sits at $3/$15. These aren't cheap. If you're running a production app on Claude and you're not thinking about cost optimization, you're probably spending 3-5x more than you need to.

I went through this myself while building NeuralRouting. Here's what actually moves the needle, ranked by impact.

The biggest mistake: using Opus for everything

Opus 4.6 is a beast. It scores highest on reasoning benchmarks, handles complex multi-step problems, and gives you a 1M token context window at standard pricing. It also costs 5x more than Haiku 4.5.

The problem is that most requests don't need Opus. A customer support response, a text summary, a simple classification — Haiku handles these fine. Sonnet covers the middle ground. Opus should only touch the hard stuff.

A rough breakdown:

  • Haiku 4.5 ($1/$5 per MTok): Classification, extraction, simple Q&A, content moderation, translation
  • Sonnet 4.6 ($3/$15 per MTok): Code generation, analysis, longer-form writing, multi-step reasoning
  • Opus 4.6 ($5/$25 per MTok): Complex research, agentic workflows, tasks where accuracy is critical

Most production workloads are 60-70% simple tasks. That means 60-70% of your requests could run on a model that costs 80% less. The math here isn't complicated — it's just that nobody bothers to do it.
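To put a number on it, assume 65% of an all-Opus workload moves to Haiku, using the input prices from the list above:

```python
# Input-token cost per MTok if 65% of an all-Opus workload moves
# to Haiku, using the prices listed above.
opus_in, haiku_in = 5.0, 1.0

blended = 0.35 * opus_in + 0.65 * haiku_in  # $ per MTok
print(f"blended input rate: ${blended:.2f}/MTok")        # $2.40
print(f"savings vs all-Opus: {1 - blended / opus_in:.0%}")  # 52%
```

That's half the input bill gone before you touch caching or batching, and it scales further once some of the remaining traffic fits Sonnet.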

Prompt caching: the single biggest cost lever

Anthropic's prompt caching is genuinely impressive, and I don't think enough teams use it. The idea is simple: if you're sending the same system prompt or document context with every request, you're paying full input token price every time.

With caching, the first request pays a small premium (1.25x for 5-minute TTL, 2x for 1-hour TTL). Every subsequent request that hits cache pays 10% of the standard input price.
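In the Messages API, you opt in by attaching a cache_control marker to the content block you want cached. A minimal payload sketch (the field layout follows Anthropic's documented prompt-caching format; the model id string and prompt are illustrative):

```python
# Sketch of a Messages API payload with prompt caching enabled.
# The long, stable system prompt is marked with cache_control so
# subsequent requests within the TTL read it from cache.

LONG_SYSTEM_PROMPT = "You are a support assistant. " * 2000  # stands in for ~50K tokens

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4.6",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # "ephemeral" is the 5-minute cache; the 1-hour TTL
                # costs more to write but survives longer gaps.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_request("How do I reset my password?")
```

Only the stable prefix (here, the system prompt) should carry the marker; anything after it that changes per request is billed normally.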

Quick example with Sonnet 4.6:

Your system prompt is 50,000 tokens. Without caching, 20 requests cost you $3.00 in system prompt tokens alone. With 5-minute caching, the first request costs $0.19 (write), and the next 19 cost $0.015 each. Total: $0.47.

That's an 84% reduction on input costs, just from caching.

The catch: your cache only lives for 5 minutes (or 1 hour at the higher write cost). If your requests are spaced further apart, you won't get hits. For chatbots and agents that process multiple requests in quick succession, this is free money.
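The arithmetic above is easy to reproduce:

```python
# Reproduce the Sonnet 4.6 caching example: 50K-token system prompt,
# 20 requests, 5-minute cache (1.25x write premium, 0.1x cache read).

INPUT_PRICE = 3.00 / 1_000_000  # Sonnet 4.6, $ per input token
PROMPT_TOKENS = 50_000
REQUESTS = 20

uncached = REQUESTS * PROMPT_TOKENS * INPUT_PRICE
write = PROMPT_TOKENS * INPUT_PRICE * 1.25                    # first request: cache write
reads = (REQUESTS - 1) * PROMPT_TOKENS * INPUT_PRICE * 0.10   # 19 cache hits
cached = write + reads

print(f"uncached: ${uncached:.2f}")              # $3.00
print(f"cached:   ${cached:.2f}")                # $0.47
print(f"savings:  {1 - cached / uncached:.0%}")  # 84%
```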

Batch API: 50% off if you can wait

Anthropic's Batch API gives you a flat 50% discount on everything — input and output tokens — in exchange for async processing within 24 hours.

This works for:

  • Content generation pipelines
  • Document analysis batches
  • Data classification at scale
  • Anything that doesn't need a real-time response

Stack batch processing with prompt caching and you're looking at up to 95% total savings on eligible workloads. That number sounds aggressive but it's straight from Anthropic's documentation.
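That 95% follows from stacking the two discounts on cached input tokens, assuming they multiply (which is what the figure implies): the batch discount halves the price, and a cache read then pays 10% of that.

```python
# Stacked discounts on cached input tokens, per the numbers in this post:
# Batch API = 50% of standard price, cache read = 10% of standard.
batch_multiplier = 0.50
cache_read_multiplier = 0.10

effective = batch_multiplier * cache_read_multiplier
print(f"cached input in a batch costs {effective:.0%} of standard "
      f"-> {1 - effective:.0%} savings")  # 5% -> 95% savings
```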

The limitation is obvious: if your user is waiting for a response, batch doesn't help. But a surprising amount of production AI work isn't user-facing. Log analysis, content pipelines, scheduled reports — all of this can run async.

Model routing: automate the model selection

Here's where I'm biased, because this is what NeuralRouting does. But the concept matters regardless of what tool you use.

The idea: instead of hardcoding model: "claude-opus-4.6" in every API call, route each request to the cheapest model that can handle it. Score the prompt complexity, check if it's a simple task or a hard one, and pick accordingly.

You can do this manually with if/else logic. It works until you have 50 different prompt types and the rules get messy. Or you can use a router that does it automatically.

Either way, the principle is the same: match the model to the task. A $5 model answering a $1 question is waste. Not dramatic waste — just steady, compounding, unnecessary spending that adds up to thousands per month at scale.
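A minimal version of that if/else routing might look like this. The keyword lists and thresholds are purely illustrative; real routers score complexity with a classifier or a small model, not substring matching:

```python
# Naive complexity-based router: score a prompt, pick the cheapest
# tier that can plausibly handle it. The heuristic below is a sketch,
# not a production scoring method.

HARD_SIGNALS = ("multi-step", "agent", "debug", "architecture")
MEDIUM_SIGNALS = ("write", "code", "analyze", "explain", "summarize")

def route(prompt: str) -> str:
    text = prompt.lower()
    if any(word in text for word in HARD_SIGNALS) or len(text) > 4000:
        return "claude-opus-4.6"
    if any(word in text for word in MEDIUM_SIGNALS) or len(text) > 800:
        return "claude-sonnet-4.6"
    return "claude-haiku-4.5"

print(route("Classify this ticket as billing or technical"))  # haiku
print(route("Write a Python script that parses these logs"))  # sonnet
print(route("Debug this multi-step agent workflow"))          # opus
```

Even a crude version like this captures most of the savings, because the bulk of traffic falls through to the cheap tier.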

Semantic caching: stop paying for repeat questions

This is different from Anthropic's prompt caching. Semantic caching stores the actual LLM responses and returns them when a similar enough question comes in.

"What's your return policy?" and "How do I return a product?" are different strings but the same question. Exact-match caching misses this. Semantic caching catches it.

In production, we see 30-40% cache hit rates on typical customer-facing applications. That's 30-40% of your requests answered instantly at zero token cost.

The tradeoff: you need a vector database (pgvector works fine) and you need to decide on a similarity threshold. Too loose and you'll serve wrong answers. Too tight and you won't get many hits. 0.92-0.95 cosine similarity is a decent starting point.
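A sketch of the mechanism, using a toy bag-of-words embedding as a stand-in for a real embedding model. In production you'd embed with an actual model and store vectors in pgvector; the class and queries here are illustrative:

```python
import math
import re
from collections import Counter

# Toy semantic cache: bag-of-words vectors + cosine similarity.
# A real system would use an embedding model and a vector database.

def embed(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.93):
        self.threshold = threshold
        self.entries = []  # list of (vector, cached response)

    def get(self, query):
        qvec = embed(query)
        for vec, response in self.entries:
            if cosine(qvec, vec) >= self.threshold:
                return response  # cache hit: zero token cost
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is your return policy", "Returns are accepted within 30 days.")
print(cache.get("What is your return policy?"))   # hit: cached answer
print(cache.get("how do i cancel my subscription"))  # miss: None
```

The bag-of-words stand-in only catches near-identical phrasings; a real embedding model is what makes "How do I return a product?" match too.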

Trim your context window

This one's boring but it matters. Every token you send costs money. Long conversation histories, bloated system prompts, documents included "just in case" — all of it adds up.

Some practical cuts:

  • Summarize conversation history instead of sending the full transcript after 10+ turns
  • Only include document sections relevant to the current query, not the entire document
  • Keep system prompts tight — every word costs tokens and most system prompts are 3x longer than they need to be
  • Use Haiku to pre-process and extract relevant chunks before sending to a more expensive model

A 200K token input on Opus 4.6 costs $1.00. The same information compressed to 20K tokens costs $0.10. Same answer, 90% cheaper input.
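The pre-processing idea in the last bullet can be as simple as scoring document chunks against the query before the expensive call. A crude keyword-overlap version (a real pipeline might use Haiku or embeddings for the scoring step):

```python
import re

# Crude context trimmer: keep only the document chunks that share
# vocabulary with the query, instead of sending the whole document.

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def relevant_chunks(document, query, top_k=3):
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
    qtokens = tokenize(query)
    scored = sorted(chunks, key=lambda c: len(tokenize(c) & qtokens),
                    reverse=True)
    return scored[:top_k]

doc = "Refund policy: 30 days.\n\nShipping: 5 business days.\n\nWarranty: 1 year."
print(relevant_chunks(doc, "what is the refund policy", top_k=1))
# ['Refund policy: 30 days.']
```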

What this looks like combined

Let's say you're running a production app doing 100K Claude requests per month on Opus 4.6.

Before optimization:

  • Average 2K tokens in, 500 tokens out per request
  • Cost: (100K × 2K × $5/1M) + (100K × 500 × $25/1M) = $1,000 + $1,250 = $2,250/month

After routing + caching + context trimming:

  • 70% of requests routed to Haiku ($1/$5)
  • 20% to Sonnet ($3/$15)
  • 10% stays on Opus ($5/$25)
  • 35% cache hit rate eliminates those requests entirely
  • Average context trimmed 40%

Effective cost: roughly $350-500/month. That''s a 75-85% reduction.
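Those numbers are easy to re-derive. A small cost model using the same prices and assumptions as above:

```python
# Before/after cost model for 100K requests/month, using the prices
# and optimization assumptions from this post.

PRICES = {  # $ per MTok: (input, output)
    "haiku": (1, 5),
    "sonnet": (3, 15),
    "opus": (5, 25),
}

REQUESTS = 100_000
IN_TOK, OUT_TOK = 2_000, 500

def cost(requests, in_tok, out_tok, model):
    pin, pout = PRICES[model]
    return (requests * in_tok * pin + requests * out_tok * pout) / 1_000_000

before = cost(REQUESTS, IN_TOK, OUT_TOK, "opus")

# After: 35% semantic cache hits (no model call), 40% context trim,
# remaining traffic split 70/20/10 across Haiku/Sonnet/Opus.
served = REQUESTS * (1 - 0.35)
trimmed_in = IN_TOK * (1 - 0.40)
mix = {"haiku": 0.70, "sonnet": 0.20, "opus": 0.10}
after = sum(cost(served * share, trimmed_in, OUT_TOK, m)
            for m, share in mix.items())

print(f"before: ${before:,.0f}/month")         # $2,250
print(f"after:  ${after:,.0f}/month")          # $433
print(f"reduction: {1 - after / before:.0%}")  # 81%
```

It lands around $433/month, an 81% reduction, squarely inside the ranges quoted above.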

These aren't theoretical numbers. They're based on the routing patterns we see in the NeuralRouting pipeline. Your mileage varies depending on your workload mix, but the direction is consistent.

But won't prices just keep dropping?

Anthropic keeps making their models cheaper. Opus 4.6 is 67% cheaper than Opus 4.1 was. Haiku 4.5 is dirt cheap. The pricing trend is clearly downward.

So you could just wait and let the prices drop. But "wait for it to get cheaper" isn't a cost strategy. You're overpaying right now, today, on every request. And the techniques here — routing, caching, context management — work regardless of what the per-token price is. When Anthropic drops prices again, your optimized setup gets even cheaper.

Start with the easy wins: prompt caching and model tiering. Those two alone will cut your bill in half. Then add semantic caching and routing when you're ready for the next jump.

Ready to cut your AI costs?

Start saving up to 80% on token costs today. Free tier available.

Get Started Free →