Semantic Caching for LLM APIs: Complete Implementation Guide (Save 40-70%)
Exact-match caching misses 95% of duplicate queries. Semantic caching catches them. Here's how to implement it and what hit rates to expect in production.
NR
NeuralRouting Team
April 10, 2026
Semantic Caching for LLM APIs: Complete Implementation Guide
Every LLM API call costs money. But research shows that over 30% of queries to LLMs are semantically similar (MeanCache, 2024). That means nearly a third of your AI spend is going to questions you've already answered — just worded slightly differently.
Semantic caching solves this by matching queries by meaning, not exact text. This guide covers the architecture, implementation, and real-world performance data.
Why Exact-Match Caching Fails
Traditional caching uses exact string matching. The problem:
"What is the capital of France?" → cache hit
"what is the capital of france?" → cache miss (different case)
"Capital of France?" → cache miss (different phrasing)
"Tell me France's capital city" → cache miss (completely different wording)
All four queries have the same answer. Exact-match catches 1 out of 4. Semantic caching catches all 4.
How Semantic Caching Works
The architecture has two layers:
Layer 1: Exact Hash Match (sub-microsecond)
Query → SHA-256 hash → Redis lookup → hit/miss
This catches exact duplicates instantly. No embedding computation needed.
Layer 2: Vector Similarity Search (1-5ms)
Query → Embedding model → Vector DB search → similarity threshold → hit/miss
When exact match fails, the query is embedded and compared against stored query embeddings using cosine similarity.
Similarity threshold is critical:
> 0.95: Very conservative, few false positives, ~20% hit rate
> 0.90: Balanced, occasional edge cases, ~30% hit rate
> 0.85: Aggressive, more false positives, ~40% hit rate
NeuralRouting uses 0.92 as the default threshold — balancing hit rate with accuracy.
Implementation Architecture
┌─────────────┐
│ Request │
└──────┬──────┘
│
┌───▼───┐
│ Hash │──── hit ──→ Return cached response ($0)
│ Check │
└───┬───┘
│ miss
┌───▼────────┐
│ Embed │──── hit ──→ Return cached response ($0)
│ + Vector │ (similarity > 0.92)
│ Search │
└───┬────────┘
│ miss
┌───▼───┐
│ LLM │──→ Store response + embedding in cache
│ Call │
└───────┘
Embedding Model Selection
Model
Dimensions
Speed
Cost
Quality
text-embedding-3-small
1536
Fast
$0.02/1M
Good
text-embedding-3-large
3072
Medium
$0.13/1M
Best
all-MiniLM-L6-v2 (local)
384
Fastest
$0
Decent
For caching, text-embedding-3-small offers the best balance. The embedding cost ($0.02/1M tokens) is negligible compared to the LLM call it saves ($2-$15/1M tokens).
Production Hit Rates
Real-world cache hit rates depend heavily on your use case:
Application Type
Expected Hit Rate
Annual Savings (at $10K/mo spend)
Customer support bots
35-50%
$42K-$60K
Internal knowledge Q&A
25-40%
$30K-$48K
Code assistance
15-25%
$18K-$30K
Creative writing
5-10%
$6K-$12K
Customer support has the highest hit rates because users ask similar questions repeatedly. Creative tasks have the lowest because each query is unique.
Multi-Turn Conversation Caching
Single-turn caching is straightforward. Multi-turn is where it gets tricky.
Naive approach (cache only the last message): High false positive rate. "Tell me more" could match any previous "Tell me more" regardless of context.
Context-aware approach (cache a sliding window): Encode the last 3-5 messages as the cache key. This preserves conversational context and reduces false positives dramatically.