The Hidden Cost of Repeated Prompts
In most production AI applications, users ask similar questions repeatedly. Customer support bots see the same 200 questions 80% of the time. Search assistants get near-identical queries. FAQ systems process thousands of variations of a dozen core questions.
Without caching, every one of these pays full inference cost. With semantic caching, the second — and hundredth — similar request costs essentially nothing.
How Semantic Caching Works
Step 1: Embed the Prompt
When a new request arrives, it's converted to a vector embedding using a lightweight embedding model (text-embedding-3-small at $0.02/M tokens). This takes ~20ms and costs a fraction of a cent.
Step 2: Search the Cache
The embedding is compared against previously stored embeddings using cosine similarity. If similarity exceeds the threshold (typically 0.92), the cached response is returned immediately.
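The similarity check reduces to a normalized dot product. A minimal sketch in plain Python (a real deployment would use pgvector or a vector index rather than a loop; the 0.92 default comes from the threshold discussion later in this article):

```python
import math

THRESHOLD = 0.92  # typical production default; tune per workload

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_emb: list[float], cached_emb: list[float]) -> bool:
    return cosine_similarity(query_emb, cached_emb) >= THRESHOLD
```

With 1536-dimensional embeddings the math is identical; only the vector length changes.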
Step 3: Store New Responses
When no cache hit is found, the request proceeds to the LLM. The response is stored asynchronously with its embedding for future lookups.
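Fire-and-forget storage can be sketched with `asyncio`: the handler returns the LLM response immediately and schedules the cache write as a background task. All names here are illustrative, not NeuralRouting's actual API:

```python
import asyncio

# In-memory stand-in for the real vector store
cache: dict[str, tuple[list[float], str]] = {}

async def store_entry(prompt: str, embedding: list[float], response: str) -> None:
    # Simulated write latency (e.g., an INSERT into a pgvector table)
    await asyncio.sleep(0)
    cache[prompt] = (embedding, response)

async def handle_cache_miss(prompt: str, embedding: list[float], response: str) -> str:
    # Schedule the write but do not await it, so storage adds no latency
    # to the user-facing response. (In production, keep a reference to the
    # task so it isn't garbage-collected before it runs.)
    asyncio.get_running_loop().create_task(store_entry(prompt, embedding, response))
    return response
```

The caller gets the response back before the cache write completes; the entry becomes available for lookups a few milliseconds later.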
Two-Level Cache Architecture
For maximum efficiency, implement a two-level lookup:
- Exact hash match — MD5 of the normalized prompt string. Instant O(1) lookup, zero embedding cost.
- Semantic similarity search — pgvector cosine similarity across stored embeddings. Catches paraphrases and near-duplicates.
This architecture handles both identical repeated requests (exact hash) and semantically equivalent variations (vector search).
The Economics
| Metric | Without Cache | With Cache (30% hit rate) |
|---|---|---|
| 100k requests/mo | $500 | $350 |
| 500k requests/mo | $2,500 | $1,750 |
| 1M requests/mo | $5,000 | $3,500 |
At scale, a 30% cache hit rate saves $1,500/month on a 1M request workload, and the savings grow in proportion to the hit rate.
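The table's assumption is that cache hits cost approximately nothing, so spend scales with the miss rate. That makes the savings a one-line calculation; the $0.005/request figure below is implied by $5,000 per 1M requests:

```python
def monthly_cost(requests: int, cost_per_request: float, hit_rate: float) -> float:
    # Only cache misses pay full inference cost; hits are ~free
    return requests * cost_per_request * (1 - hit_rate)

baseline = monthly_cost(1_000_000, 0.005, 0.0)    # no cache: ~$5,000
cached = monthly_cost(1_000_000, 0.005, 0.30)     # 30% hit rate: ~$3,500
savings = baseline - cached                        # ~$1,500/month
```

Embedding costs for the lookup itself (at $0.02 per million tokens) are small enough to ignore at this scale.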
Setting the Similarity Threshold
The threshold controls the trade-off between cache hit rate and response accuracy:
- 0.98+ — Near-identical only. Very high accuracy, low hit rate (~5–10%)
- 0.92–0.97 — Strong semantic similarity. Balanced accuracy and hit rate (~20–35%)
- 0.85–0.91 — Broader matching. Higher hit rate but occasional mismatches
For most production workloads, 0.92 is a good starting point; validate it against a sample of real traffic before loosening it.
NeuralRouting's Semantic Cache
NeuralRouting includes a production-grade semantic cache powered by pgvector:
- Two-level lookup (exact hash → cosine similarity)
- Configurable similarity threshold
- Fire-and-forget async storage (zero latency impact)
- Cache hit stats in the FinOps dashboard
- Per-user and global cache scoping
No configuration needed. Every request through NeuralRouting is automatically cached.