Everyone writes about reducing OpenAI costs. Almost nobody talks about Anthropic.
That's weird, because Claude Opus 4.6 costs $5/$25 per million tokens (input/output), and Sonnet 4.6 sits at $3/$15. These aren't cheap. If you're running a production app on Claude and you're not thinking about cost optimization, you're probably spending 3-5x more than you need to.
I went through this myself while building NeuralRouting. Here's what actually moves the needle, ranked by impact.
The biggest mistake: using Opus for everything
Opus 4.6 is a beast. It scores highest on reasoning benchmarks, handles complex multi-step problems, and gives you a 1M token context window at standard pricing. It also costs 5x more than Haiku 4.5.
The problem is that most requests don't need Opus. A customer support response, a text summary, a simple classification — Haiku handles these fine. Sonnet covers the middle ground. Opus should only touch the hard stuff.
A rough breakdown:
- Haiku 4.5 ($1/$5 per MTok): Classification, extraction, simple Q&A, content moderation, translation
- Sonnet 4.6 ($3/$15 per MTok): Code generation, analysis, longer-form writing, multi-step reasoning
- Opus 4.6 ($5/$25 per MTok): Complex research, agentic workflows, tasks where accuracy is critical
Most production workloads are 60-70% simple tasks. That means 60-70% of your requests could run on a model that costs 80% less. The math here isn't complicated — it's just that nobody bothers to do it.
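As a back-of-the-envelope sketch: here's what a tiered workload does to the blended input price, using the per-MTok rates above. The 65/25/10 split is an illustrative assumption in the article's 60-70% range, not a measurement.

```python
# Blended input cost per million tokens when simple requests move off
# Opus. Prices are the article's input rates; the split is assumed.
PRICES_IN = {"haiku": 1.00, "sonnet": 3.00, "opus": 5.00}

def blended_input_cost(simple=0.65, medium=0.25, hard=0.10):
    """Weighted input price per MTok for a tiered workload."""
    return (simple * PRICES_IN["haiku"]
            + medium * PRICES_IN["sonnet"]
            + hard * PRICES_IN["opus"])

all_opus = PRICES_IN["opus"]      # $5.00/MTok if everything runs on Opus
tiered = blended_input_cost()     # $1.90/MTok with the assumed split
savings = 1 - tiered / all_opus   # ~62% cheaper on input alone
```

Even before caching or batching, routing by itself takes more than half off the input bill under these assumptions.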
Prompt caching: the single biggest cost lever
Anthropic's prompt caching is genuinely impressive, and I don't think enough teams use it. The idea is simple: if you're sending the same system prompt or document context with every request, you're paying full input token price every time.
With caching, the first request pays a small premium (1.25x for 5-minute TTL, 2x for 1-hour TTL). Every subsequent request that hits cache pays 10% of the standard input price.
Quick example with Sonnet 4.6:
Your system prompt is 50,000 tokens. Without caching, 20 requests cost you $3.00 in system prompt tokens alone. With 5-minute caching, the first request costs $0.19 (write), and the next 19 cost $0.015 each. Total: $0.47.
That's an 84% reduction on input costs, just from caching.
The catch: your cache only lives for 5 minutes (or 1 hour at the higher write cost). If your requests are spaced further apart, you won't get hits. For chatbots and agents that process multiple requests in quick succession, this is free money.
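The worked example above, as a small calculator. The rates are the Sonnet input price and the caching multipliers already quoted; the function is just the arithmetic.

```python
# Sonnet caching example: a 50K-token system prompt over 20 requests,
# 5-minute cache (1.25x on the write, 0.1x on every read).
INPUT_PRICE = 3.00 / 1_000_000   # Sonnet input, $/token
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10

def cached_cost(prompt_tokens, requests):
    """First request writes the cache; the rest read it."""
    write = prompt_tokens * INPUT_PRICE * CACHE_WRITE_MULT
    reads = (requests - 1) * prompt_tokens * INPUT_PRICE * CACHE_READ_MULT
    return write + reads

uncached = 20 * 50_000 * INPUT_PRICE   # $3.00
cached = cached_cost(50_000, 20)       # ~$0.47
reduction = 1 - cached / uncached      # ~84%
```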
Batch API: 50% off if you can wait
Anthropic's Batch API gives you a flat 50% discount on everything — input and output tokens — in exchange for async processing within 24 hours.
This works for:
- Content generation pipelines
- Document analysis batches
- Data classification at scale
- Anything that doesn''t need a real-time response
Stack batch processing with prompt caching and you're looking at up to 95% total savings on eligible workloads. That number sounds aggressive but it's straight from Anthropic's documentation.
The limitation is obvious: if your user is waiting for a response, batch doesn't help. But a surprising amount of production AI work isn't user-facing. Log analysis, content pipelines, scheduled reports — all of this can run async.
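For the curious, the 95% figure falls out of two multipliers, assuming the batch discount and the cache-read rate stack multiplicatively (which is what the "up to 95%" claim implies):

```python
# Why batching + caching reaches "up to 95%" on cached input tokens.
BATCH_DISCOUNT = 0.50    # Batch API: flat 50% off input and output
CACHE_READ_RATE = 0.10   # cached input reads pay 10% of the input price

effective_input_rate = BATCH_DISCOUNT * CACHE_READ_RATE  # 0.05x full price
savings = 1 - effective_input_rate                       # 0.95 → 95% off
```

Output tokens only get the 50% batch discount, so the blended number for a real workload lands below 95% — hence "up to."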
Model routing: automate the model selection
Here's where I'm biased, because this is what NeuralRouting does. But the concept matters regardless of what tool you use.
The idea: instead of hardcoding model: "claude-opus-4.6" in every API call, route each request to the cheapest model that can handle it. Score the prompt complexity, check if it's a simple task or a hard one, and pick accordingly.
You can do this manually with if/else logic. It works until you have 50 different prompt types and the rules get messy. Or you can use a router that does it automatically.
Either way, the principle is the same: match the model to the task. A $5 model answering a $1 question is waste. Not dramatic waste — just steady, compounding, unnecessary spending that adds up to thousands per month at scale.
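The manual if/else version might look something like this. To be clear: the keyword lists and thresholds here are invented for illustration, not NeuralRouting's actual logic — a real router would use a learned classifier, and this is exactly the kind of heuristic that gets messy at 50 prompt types.

```python
# Toy complexity router: score the prompt with cheap heuristics and
# pick the cheapest tier that clears the bar. Hints and cutoffs are
# illustrative assumptions, not a product's routing rules.
HARD_HINTS = ("prove", "architecture", "multi-step", "research", "agent")
MEDIUM_HINTS = ("refactor", "analyze", "write code", "explain why")

def route(prompt: str) -> str:
    text = prompt.lower()
    score = 0
    score += 2 * sum(h in text for h in HARD_HINTS)    # hard hints weigh more
    score += 1 * sum(h in text for h in MEDIUM_HINTS)
    score += len(text) // 2000                         # very long prompts trend harder
    if score >= 2:
        return "claude-opus-4.6"
    if score == 1:
        return "claude-sonnet-4.6"
    return "claude-haiku-4.5"

route("Classify this ticket: refund request")    # → "claude-haiku-4.5"
route("Please analyze this quarterly report")    # → "claude-sonnet-4.6"
route("Analyze this codebase and refactor it")   # → "claude-opus-4.6"
```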
Semantic caching: stop paying for repeat questions
This is different from Anthropic's prompt caching. Semantic caching stores the actual LLM responses and returns them when a similar enough question comes in.
"What's your return policy?" and "How do I return a product?" are different strings but the same question. Exact-match caching misses this. Semantic caching catches it.
In production, we see 30-40% cache hit rates on typical customer-facing applications. That''s 30-40% of your requests answered instantly at zero token cost.
The tradeoff: you need a vector database (pgvector works fine) and you need to decide on a similarity threshold. Too loose and you'll serve wrong answers. Too tight and you won't get many hits. 0.92-0.95 cosine similarity is a decent starting point.
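A minimal sketch of the mechanism. The in-memory list stands in for pgvector, and the embeddings would come from a real embedding model in production; only the threshold logic is the point here.

```python
# Semantic cache sketch: store (embedding, response) pairs and serve a
# stored answer when cosine similarity clears the threshold. A Python
# list stands in for a real vector store like pgvector.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold=0.93):   # inside the 0.92-0.95 band
        self.threshold = threshold
        self.entries = []                 # list of (embedding, response)

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]),
                   default=None)
        if best and cosine(embedding, best[0]) >= self.threshold:
            return best[1]                # cache hit: zero token cost
        return None                       # miss: fall through to the API

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A near-duplicate query (a vector close to a stored one) gets the stored answer back; an unrelated query returns `None` and goes to the model as usual.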
Trim your context window
This one's boring but it matters. Every token you send costs money. Long conversation histories, bloated system prompts, documents included "just in case" — all of it adds up.
Some practical cuts:
- Summarize conversation history instead of sending the full transcript after 10+ turns
- Only include document sections relevant to the current query, not the entire document
- Keep system prompts tight — every word costs tokens and most system prompts are 3x longer than they need to be
- Use Haiku to pre-process and extract relevant chunks before sending to a more expensive model
A 200K token input on Opus 4.6 costs $1.00. The same information compressed to 20K tokens costs $0.10. Same answer, 90% cheaper input.
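The "only include relevant sections" cut can be as crude as a keyword-overlap filter with a token budget. A production setup would use embeddings or a Haiku pre-pass as suggested above; this sketch just shows the shape of the filter.

```python
# Rank document chunks by word overlap with the query and keep the
# best ones until a character budget runs out. Purely illustrative —
# real systems would score relevance with embeddings.
def trim_context(chunks, query, budget_chars=2000):
    q_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    kept, used = [], 0
    for chunk in scored:
        if used + len(chunk) > budget_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept
```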
What this looks like combined
Let's say you're running a production app doing 100K Claude requests per month on Opus 4.6.
Before optimization:
- Average 2K tokens in, 500 tokens out per request
- Cost: (100K × 2K × $5/1M) + (100K × 500 × $25/1M) = $1,000 + $1,250 = $2,250/month
After routing + caching + context trimming:
- 70% of requests routed to Haiku ($1/$5)
- 20% to Sonnet ($3/$15)
- 10% stays on Opus ($5/$25)
- 35% cache hit rate eliminates those requests entirely
- Average context trimmed 40%
Effective cost: roughly $350-500/month. That's a 75-85% reduction.
These aren't theoretical numbers. They're based on the routing patterns we see in the NeuralRouting pipeline. Your mileage varies depending on your workload mix, but the direction is consistent.
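The before/after math above can be reproduced in a few lines. The mix, hit rate, and trim factor are the scenario's assumptions, not measurements; plug in your own.

```python
# Combined cost model for the scenario above. Prices are $/MTok
# (input, output) for each tier.
PRICES = {"haiku": (1, 5), "sonnet": (3, 15), "opus": (5, 25)}

def monthly_cost(requests, in_tok, out_tok, mix, hit_rate=0.0, trim=0.0):
    served = requests * (1 - hit_rate)           # cache hits cost nothing
    in_m = served * in_tok * (1 - trim) / 1e6    # input MTok after trimming
    out_m = served * out_tok / 1e6
    return sum(share * (in_m * PRICES[m][0] + out_m * PRICES[m][1])
               for m, share in mix.items())

before = monthly_cost(100_000, 2_000, 500, {"opus": 1.0})      # $2,250
after = monthly_cost(100_000, 2_000, 500,
                     {"haiku": 0.7, "sonnet": 0.2, "opus": 0.1},
                     hit_rate=0.35, trim=0.40)                 # ~$433
```

With these inputs the model lands at about $433/month, an ~81% reduction — inside the $350-500 range quoted above.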
But won't prices just keep dropping?
Anthropic keeps making their models cheaper. Opus 4.6 is 67% cheaper than Opus 4.1 was. Haiku 4.5 is dirt cheap. The pricing trend is clearly downward.
So you could just wait and let the prices drop. But "wait for it to get cheaper" isn't a cost strategy. You're overpaying right now, today, on every request. And the techniques here — routing, caching, context management — work regardless of what the per-token price is. When Anthropic drops prices again, your optimized setup gets even cheaper.
Start with the easy wins: prompt caching and model tiering. Those two alone will cut your bill in half. Then add semantic caching and routing when you're ready for the next jump.