Engineering · 10 min read · April 9, 2026

Semantic Caching for LLM APIs: Complete Implementation Guide (Save 40-70%)

Exact-match caching misses 95% of duplicate queries. Semantic caching catches them. Here's how to implement it and what hit rates to expect in production.


NeuralRouting Team



Every LLM API call costs money, yet research suggests that over 30% of LLM queries are semantically similar to queries asked before (MeanCache, 2024). That means nearly a third of your AI spend goes to questions you've already answered — just worded slightly differently.

Semantic caching solves this by matching queries by meaning, not exact text. This guide covers the architecture, implementation, and real-world performance data.


Why Exact-Match Caching Fails

Traditional caching uses exact string matching. The problem:

  • "What is the capital of France?" → cache hit
  • "what is the capital of france?" → cache miss (different case)
  • "Capital of France?" → cache miss (different phrasing)
  • "Tell me France's capital city" → cache miss (completely different wording)

All four queries have the same answer, but exact-match caching serves only the one whose text was cached verbatim. Semantic caching catches all four.


How Semantic Caching Works

The architecture has two layers:

Layer 1: Exact Hash Match (sub-millisecond)

Query → SHA-256 hash → Redis lookup → hit/miss

This catches exact duplicates instantly. No embedding computation needed.

Layer 2: Vector Similarity Search (1-5ms)

Query → Embedding model → Vector DB search → similarity threshold → hit/miss

When exact match fails, the query is embedded and compared against stored query embeddings using cosine similarity.

Similarity threshold is critical:

  • > 0.95: Very conservative, few false positives, ~20% hit rate
  • > 0.90: Balanced, occasional edge cases, ~30% hit rate
  • > 0.85: Aggressive, more false positives, ~40% hit rate

NeuralRouting uses 0.92 as the default threshold — balancing hit rate with accuracy.


Implementation Architecture

┌─────────────┐
│   Request   │
└──────┬──────┘
       │
   ┌───▼───┐
   │ Hash  │──── hit ──→ Return cached response ($0)
   │ Check │
   └───┬───┘
       │ miss
   ┌───▼───────┐
   │ Embed     │──── hit ──→ Return cached response ($0)
   │ + Vector  │             (similarity > 0.92)
   │ Search    │
   └───┬───────┘
       │ miss
   ┌───▼───┐
   │ LLM   │──→ Store response + embedding in cache
   │ Call  │
   └───────┘

Embedding Model Selection

Model                       Dimensions   Speed     Cost       Quality
text-embedding-3-small      1536         Fast      $0.02/1M   Good
text-embedding-3-large      3072         Medium    $0.13/1M   Best
all-MiniLM-L6-v2 (local)    384          Fastest   $0         Decent

For caching, text-embedding-3-small offers the best balance. The embedding cost ($0.02/1M tokens) is negligible compared to the LLM call it saves ($2-$15/1M tokens).
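A back-of-the-envelope check of that claim, using the prices above (the per-query token counts are illustrative assumptions, not measured values):

```python
# Illustrative assumptions: a 50-token query, a 500-token response.
embed_cost_per_1m = 0.02     # text-embedding-3-small, $ per 1M tokens
llm_cost_per_1m = 10.00      # mid-range value from the $2-$15/1M band

query_tokens, response_tokens = 50, 500
embed_cost = query_tokens / 1_000_000 * embed_cost_per_1m
saved_cost = response_tokens / 1_000_000 * llm_cost_per_1m

print(f"embedding a query:          ${embed_cost:.7f}")
print(f"LLM output avoided on hit:  ${saved_cost:.5f}")
print(f"ratio:                      {saved_cost / embed_cost:,.0f}x")
```

Under these assumptions each cache hit saves several thousand times what the embedding lookup cost — the embedding step only needs a tiny hit rate to pay for itself.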


Production Hit Rates

Real-world cache hit rates depend heavily on your use case:

Application Type         Expected Hit Rate   Annual Savings (at $10K/mo spend)
Customer support bots    35-50%              $42K-$60K
Internal knowledge Q&A   25-40%              $30K-$48K
Code assistance          15-25%              $18K-$30K
Creative writing         5-10%               $6K-$12K

Customer support has the highest hit rates because users ask similar questions repeatedly. Creative tasks have the lowest because each query is unique.


Multi-Turn Conversation Caching

Single-turn caching is straightforward. Multi-turn is where it gets tricky.

Naive approach (cache only the last message): High false positive rate. "Tell me more" could match any previous "Tell me more" regardless of context.

Context-aware approach (cache a sliding window): Encode the last 3-5 messages as the cache key. This preserves conversational context and reduces false positives dramatically.
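One way to build such a key is to join the trailing window of messages before hashing and embedding. The window size and join format below are illustrative assumptions, not NeuralRouting's actual encoding:

```python
def context_key(messages: list[str], window: int = 4) -> str:
    # Join the last few turns; this string is what gets hashed (layer 1)
    # and embedded (layer 2), so "Tell me more" only matches when the
    # preceding turns are also similar.
    return "\n".join(messages[-window:])

convo_a = ["What is semantic caching?", "Tell me more"]
convo_b = ["How do I reset my password?", "Tell me more"]
assert context_key(convo_a) != context_key(convo_b)  # context disambiguates
```

With the naive last-message key, both conversations above would collide on "Tell me more"; the windowed key keeps them apart.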

NeuralRouting's semantic cache uses context-aware encoding for multi-turn conversations, preventing cross-conversation contamination.


Cache Invalidation

LLM responses don't have the same staleness problems as database caches, but there are edge cases:

  • Time-sensitive queries: "What's the weather today?" should have short TTL (minutes)
  • Factual queries: "What is photosynthesis?" can be cached indefinitely
  • Personalized queries: Queries referencing user-specific data should include user context in the cache key

Default TTL recommendation: 24 hours for most applications, with category-based overrides.


Getting Started with NeuralRouting's Semantic Cache

NeuralRouting includes semantic caching at every pricing tier. No setup required — it activates automatically:

  1. First request: LLM processes the query, response is cached
  2. Similar request: Cache serves the response in < 5ms, $0 cost
  3. Dashboard shows cache hit rate, estimated savings, and entry count

The cache gets smarter over time as it accumulates your specific query patterns.


Ready to cut your AI costs?

Start saving up to 80% on token costs today. Free tier available.

Get Started Free →