Semantic Caching for LLM APIs: Complete Implementation Guide
Every LLM API call costs money. But research shows that over 30% of LLM queries are semantically similar to previously seen queries (MeanCache, 2024). That means nearly a third of your AI spend goes to questions you've already answered, just worded slightly differently.
Semantic caching solves this by matching queries by meaning, not exact text. This guide covers the architecture, implementation, and real-world performance data.
Why Exact-Match Caching Fails
Traditional caching uses exact string matching. The problem:
- "What is the capital of France?" → cache hit
- "what is the capital of france?" → cache miss (different case)
- "Capital of France?" → cache miss (different phrasing)
- "Tell me France's capital city" → cache miss (completely different wording)
All four queries have the same answer. Exact-match catches 1 out of 4. Semantic caching catches all 4.
How Semantic Caching Works
The architecture has two layers:
Layer 1: Exact Hash Match (sub-millisecond)
Query → SHA-256 hash → Redis lookup → hit/miss
This catches exact duplicates instantly. No embedding computation needed.
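A minimal sketch of the Layer 1 lookup, using an in-memory dict in place of Redis (the `ExactCache` name is illustrative, not part of any SDK; in production you'd use redis-py's `get`/`set` with the same hash keys):

```python
import hashlib

class ExactCache:
    """Layer 1 sketch: exact-match lookup keyed on a SHA-256 of the query."""

    def __init__(self):
        self._store = {}  # stands in for Redis

    @staticmethod
    def _key(query: str) -> str:
        # Hash the raw query string — any wording change produces a new key,
        # which is exactly the limitation Layer 2 exists to cover.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def set(self, query: str, response: str) -> None:
        self._store[self._key(query)] = response
```

Hashing the raw string (rather than a normalized form) keeps this layer trivially fast, at the cost of missing even case-only variants.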
Layer 2: Vector Similarity Search (1-5ms)
Query → Embedding model → Vector DB search → similarity threshold → hit/miss
When exact match fails, the query is embedded and compared against stored query embeddings using cosine similarity.
Similarity threshold is critical:
- Above 0.95: Very conservative, few false positives, ~20% hit rate
- Above 0.90: Balanced, occasional edge cases, ~30% hit rate
- Above 0.85: Aggressive, more false positives, ~40% hit rate
NeuralRouting uses 0.92 as the default threshold — balancing hit rate with accuracy.
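The threshold check itself is a cosine-similarity scan over stored query embeddings. A self-contained sketch (a real deployment would use a vector database's nearest-neighbor search rather than a linear scan; function names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, cache_entries, threshold=0.92):
    """Return the cached response whose stored embedding is most similar
    to query_vec, if it clears the threshold; otherwise None (cache miss).
    cache_entries is a list of (embedding, response) pairs."""
    best_score, best_response = 0.0, None
    for stored_vec, response in cache_entries:
        score = cosine_similarity(query_vec, stored_vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

Raising `threshold` toward 0.95 trades hit rate for fewer wrong-answer-served false positives, exactly as the list above describes.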
Implementation Architecture
┌─────────────┐
│ Request │
└──────┬──────┘
│
┌───▼───┐
│ Hash │──── hit ──→ Return cached response ($0)
│ Check │
└───┬───┘
│ miss
┌───▼────────┐
│ Embed │──── hit ──→ Return cached response ($0)
│ + Vector │ (similarity > 0.92)
│ Search │
└───┬────────┘
│ miss
┌───▼───┐
│ LLM │──→ Store response + embedding in cache
│ Call │
└───────┘
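The whole flow above can be sketched as one function. This is a schematic, not NeuralRouting's implementation: `embed` and `call_llm` are placeholders you'd wire to your embedding model and LLM client, and the vector layer is a linear scan rather than a real vector DB:

```python
import hashlib

def cached_completion(query, exact_cache, vector_entries, embed, call_llm,
                      threshold=0.92):
    """Two-layer cache lookup: exact hash, then embedding similarity,
    then (on a double miss) a paid LLM call that populates both layers.
    exact_cache is a dict; vector_entries is a list of (embedding, response)."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()

    # Layer 1: exact hash match — free and near-instant.
    if key in exact_cache:
        return exact_cache[key]

    # Layer 2: embed the query and scan stored embeddings.
    qvec = embed(query)
    best_score, best_response = 0.0, None
    for vec, response in vector_entries:
        dot = sum(a * b for a, b in zip(qvec, vec))
        norm = (sum(a * a for a in qvec) ** 0.5) * (sum(b * b for b in vec) ** 0.5)
        score = dot / norm if norm else 0.0
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= threshold:
        return best_response

    # Miss on both layers: pay for the LLM call, then cache the result.
    response = call_llm(query)
    exact_cache[key] = response
    vector_entries.append((qvec, response))
    return response
```

Note that only genuine misses reach `call_llm`; repeats and paraphrases return from the two cache layers at $0.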
Embedding Model Selection
| Model | Dimensions | Speed | Cost | Quality |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | $0.02/1M | Good |
| text-embedding-3-large | 3072 | Medium | $0.13/1M | Best |
| all-MiniLM-L6-v2 (local) | 384 | Fastest | $0 | Decent |
For caching, text-embedding-3-small offers the best balance. The embedding cost ($0.02/1M tokens) is negligible compared to the LLM call it saves ($2-$15/1M tokens).
Production Hit Rates
Real-world cache hit rates depend heavily on your use case:
| Application Type | Expected Hit Rate | Annual Savings (at $10K/mo spend) |
|---|---|---|
| Customer support bots | 35-50% | $42K-$60K |
| Internal knowledge Q&A | 25-40% | $30K-$48K |
| Code assistance | 15-25% | $18K-$30K |
| Creative writing | 5-10% | $6K-$12K |
Customer support has the highest hit rates because users ask similar questions repeatedly. Creative tasks have the lowest because each query is unique.
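The savings column is simple arithmetic: cached requests cost $0, so annual savings scale linearly with hit rate. A quick estimator that reproduces the customer-support row:

```python
def annual_savings(monthly_spend, hit_rate):
    """Rough estimate of yearly savings from a given cache hit rate.
    Ignores embedding and storage overhead, which is negligible next to
    the LLM calls being saved."""
    return monthly_spend * 12 * hit_rate

# Customer-support row: 35-50% hit rate at $10K/mo spend.
low = annual_savings(10_000, 0.35)   # $42,000
high = annual_savings(10_000, 0.50)  # $60,000
```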
Multi-Turn Conversation Caching
Single-turn caching is straightforward. Multi-turn is where it gets tricky.
Naive approach (cache only the last message): High false positive rate. "Tell me more" could match any previous "Tell me more" regardless of context.
Context-aware approach (cache a sliding window): Encode the last 3-5 messages as the cache key. This preserves conversational context and reduces false positives dramatically.
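One way to sketch the context-aware key (the function name and window size are illustrative, not NeuralRouting's API): serialize the last few turns and key on the whole window, so "Tell me more" in different conversations produces different keys.

```python
import hashlib

def conversation_cache_key(messages, window=4):
    """Build a cache key from a sliding window of the most recent turns.
    `messages` is a list of {'role': ..., 'content': ...} dicts, oldest first."""
    recent = messages[-window:]
    serialized = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

For the semantic layer, the same idea applies: embed the serialized window rather than the last message alone, so similarity is measured against the conversational context.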
NeuralRouting's semantic cache uses context-aware encoding for multi-turn conversations, preventing cross-conversation contamination.
Cache Invalidation
LLM responses don't have the same staleness problems as database caches, but there are edge cases:
- Time-sensitive queries: "What's the weather today?" should have short TTL (minutes)
- Factual queries: "What is photosynthesis?" can be cached indefinitely
- Personalized queries: Queries referencing user-specific data should include user context in the cache key
Default TTL recommendation: 24 hours for most applications, with category-based overrides.
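A category-based override table might look like this (the categories and durations below are illustrative defaults, not a prescribed schema):

```python
import time

# Hypothetical category-based TTLs, in seconds; tune per application.
TTL_BY_CATEGORY = {
    "time_sensitive": 5 * 60,    # minutes, for weather-style queries
    "factual": None,             # None = cache indefinitely
    "default": 24 * 60 * 60,     # the 24-hour baseline
}

def is_fresh(entry_created_at, category="default", now=None):
    """True if a cache entry created at `entry_created_at` (epoch seconds)
    is still within its category's TTL."""
    ttl = TTL_BY_CATEGORY.get(category, TTL_BY_CATEGORY["default"])
    if ttl is None:
        return True
    now = time.time() if now is None else now
    return (now - entry_created_at) <= ttl
```

Stale entries are simply treated as misses and overwritten on the next LLM call; for personalized queries, folding a user ID into the cache key achieves isolation without any TTL logic.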
Getting Started with NeuralRouting's Semantic Cache
NeuralRouting includes semantic caching at every pricing tier. No setup required — it activates automatically:
- First request: LLM processes the query, response is cached
- Similar request: Cache serves the response in < 5ms, $0 cost
- Dashboard shows cache hit rate, estimated savings, and entry count
The cache gets smarter over time as it accumulates your specific query patterns.