Every CTO running a meaningful LLM workload eventually asks the question: should we self-host?
On the surface, the math looks compelling. Running Llama 3.3 70B on your own GPUs can cost well under a dollar per million tokens, while a frontier API like GPT-4o charges $5 per million. The gap seems obvious.
But the total cost of self-hosting is rarely what it appears to be in a napkin calculation. This guide walks through the honest break-even analysis — including the costs most teams forget to count.
The Three Cost Buckets Most Teams Ignore
Before running the numbers, you need to account for costs that don't show up in your GPU invoice:
1. Engineering time
Somebody has to set up, tune, monitor, and maintain your inference stack. A capable ML infra engineer costs $150–250k/year fully loaded. Even at 20% allocation, that's $30–50k/year of hidden cost.
2. Reliability overhead
Self-hosted LLMs have downtime. You need redundancy (at least 2 GPU instances), auto-scaling, health checks, and failover. Doubling your GPU cost for reliability is a safe assumption.
3. Model quality gap
The open-source frontier narrows every year, but it still exists. Llama 3.3 70B is excellent for many tasks — but not all. If 20% of your use cases require frontier model quality, you're running a hybrid stack anyway, which means maintaining two infrastructure layers.
The Self-Hosting Cost Model
Let's build a realistic monthly cost for self-hosting a 70B parameter model in 2026.
GPU options
| Option | Specs | Monthly Cost |
|---|---|---|
| Lambda Labs A100 80GB | 8× A100 | $7,920/mo |
| AWS p4d.24xlarge | 8× A100 | ~$9,800/mo |
| Vast.ai (spot) | 8× A100 | $3,200–5,500/mo |
| RunPod (secure) | 8× A100 | $5,600/mo |
A Llama 3.3 70B model (in FP16) requires ~140GB VRAM for inference, so 2× A100 80GB is the minimum for comfortable inference. For production throughput, 8× A100 is more realistic.
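A quick back-of-envelope check of the VRAM math (FP16 weights only; KV-cache and activation overhead are workload-dependent):

```python
# Rough VRAM check for serving Llama 3.3 70B in FP16.
# This counts weights only; KV cache and activations need extra headroom.
PARAMS_B = 70        # parameters, in billions
BYTES_PER_PARAM = 2  # FP16 = 2 bytes per parameter

weights_gb = PARAMS_B * BYTES_PER_PARAM  # ~140 GB of weights
vram_gb = 2 * 80                         # 2x A100 80GB = 160 GB
headroom_gb = vram_gb - weights_gb       # ~20 GB left for KV cache

print(weights_gb, headroom_gb)  # 140 20
```

With only ~20 GB left over for KV cache across both GPUs, 2× A100 is workable but tight — which is why 8× A100 is the realistic production footprint.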
Assumption: 2 redundant 8×A100 nodes on Lambda Labs = $15,840/month
Engineering and operational overhead
| Cost Item | Monthly Estimate |
|---|---|
| ML infra engineer (20% allocation) | $3,500 |
| Monitoring tools (Grafana, alerting) | $200 |
| Storage, networking, egress | $400 |
| Total overhead | $4,100/mo |
Total self-hosting cost
GPU infrastructure: $15,840/mo
Operational overhead: $4,100/mo
──────────────────────────────────
Total: $19,940/mo
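The same tally as a small sketch, with the figures copied from the tables above:

```python
# Illustrative monthly self-hosting cost model (figures from the tables above).
gpu_node_monthly = 7_920  # one 8x A100 node on Lambda Labs
redundant_nodes = 2       # doubled for reliability/failover

overhead = {
    "ml_infra_engineer_20pct": 3_500,
    "monitoring": 200,
    "storage_network_egress": 400,
}

gpu_cost = gpu_node_monthly * redundant_nodes  # $15,840
overhead_cost = sum(overhead.values())         # $4,100
total = gpu_cost + overhead_cost               # $19,940

print(f"${total:,}/mo")  # $19,940/mo
```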
The API Cost Model (with Intelligent Routing)
Now let's model the same workload going through a managed gateway with intelligent routing.
Scenario: 500M tokens/month (realistic for a mid-size product)
Without routing — all GPT-4o:
500M tokens × $5/M = $2,500/month
Wait — at 500M tokens, the API is dramatically cheaper than self-hosting.
The self-hosting math only starts to work at very high volumes. Let's find the break-even.
The break-even volume
At $19,940/month of self-hosting infrastructure, and a blended API rate (with routing) of ~$0.80/M tokens:
Break-even = $19,940 / $0.80 per M tokens
= 24.9 billion tokens/month
For context: 24.9 billion tokens/month is roughly the volume of a mid-size enterprise AI product — not a startup.
With unoptimized routing (all GPT-4o at $5/M):
Break-even = $19,940 / $5 per M tokens
= 3.99 billion tokens/month
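A minimal helper reproduces both break-even figures (rates in $ per million tokens):

```python
def break_even_tokens_bn(monthly_infra_cost: float, api_rate_per_m: float) -> float:
    """Monthly token volume (in billions) at which self-hosting cost
    equals API spend at the given per-million-token rate."""
    return monthly_infra_cost / api_rate_per_m / 1_000

print(round(break_even_tokens_bn(19_940, 0.80), 1))  # 24.9 (routed blend)
print(round(break_even_tokens_bn(19_940, 5.00), 2))  # 3.99 (all GPT-4o)
```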
Still a significant volume. But here's the critical insight: with intelligent routing, the break-even point is 6× higher than naive API usage, because routing dramatically lowers your per-token cost.
The Hybrid Stack Reality
Most teams that self-host end up in a hybrid model anyway:
- Self-hosted: bulk, simple requests (classification, extraction, summaries)
- API: frontier reasoning, code generation, complex tasks
This is exactly what an intelligent routing gateway does — but without the self-hosting overhead. NeuralRouting routes simple requests to cheap hosted models ($0.06–0.12/M tokens) and complex requests to GPT-4o or Claude, automatically.
The result: the cost profile of self-hosting, with the reliability and simplicity of managed APIs.
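For intuition on where a blended rate near $0.80/M can come from, here is one illustrative mix. The traffic split and the $0.09/M cheap-model rate are assumptions for the sketch, not measured routing numbers:

```python
# Illustrative blended per-token rate under routing.
# Assumption: ~85% of traffic is simple enough for a cheap hosted model.
cheap_rate, frontier_rate = 0.09, 5.00  # $/M tokens
cheap_share = 0.855                     # fraction routed to cheap models

blended = cheap_share * cheap_rate + (1 - cheap_share) * frontier_rate
print(f"${blended:.2f}/M tokens")  # $0.80/M tokens
```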
When Self-Hosting Does Make Sense
Self-hosting isn't wrong — it's just right for a narrower set of situations than most teams assume.
You should self-host if:
- Your token volume exceeds 5 billion/month AND you have dedicated ML infra resources
- You have strict data residency requirements that no managed provider can satisfy (defense, healthcare in specific jurisdictions)
- Your use case requires fine-tuned models that can't be served via API
- You have existing GPU capacity from other workloads (amortized cost)
You should stay on managed APIs (with routing) if:
- Your monthly token volume is under 5 billion
- You don't have dedicated ML infrastructure engineers
- You need model flexibility across providers
- You want to ship product instead of maintain servers
A Practical Decision Framework
Monthly API spend (with routing) < $20k?
→ Stay managed. Not worth the infra overhead.
Monthly API spend $20k–$80k?
→ Optimize routing first. Self-hosting is premature.
Monthly API spend > $80k?
→ Run the hybrid model analysis with real numbers.
Self-hosting select workloads may be justified.
Data residency / compliance requirement?
→ Self-host the specific workloads that require it.
Keep everything else on managed APIs.
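The framework above can be sketched as a simple function; the thresholds are this article's rules of thumb, not hard limits:

```python
def recommend(monthly_api_spend: float, residency_required: bool = False) -> str:
    """Encode the decision framework above (thresholds are rules of thumb)."""
    if residency_required:
        return "self-host the specific workloads that require it"
    if monthly_api_spend < 20_000:
        return "stay managed"
    if monthly_api_spend <= 80_000:
        return "optimize routing first"
    return "run the hybrid model analysis"

print(recommend(12_000))  # stay managed
```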
The Fast Path to Lower Costs
For teams not yet at self-hosting scale, the highest-leverage move is intelligent routing — getting the cost profile of cheap models for the majority of requests, without maintaining any infrastructure.
NeuralRouting routes each prompt to the cheapest capable model in real time, with semantic caching on top. Most teams see 70–97% cost reduction versus routing everything through a frontier model.
The infrastructure is zero to maintain. The integration is a one-line URL change. And the savings show up in your dashboard the same day.