Neural Research · 5 min read · April 5, 2026

Semantic Caching for LLMs: Make Repeat Requests Cost Zero

Semantic caching stores vector embeddings of LLM responses and returns them instantly when similar prompts arrive. Here's how it works and what savings to expect.


NeuralRouting Team


The Hidden Cost of Repeated Prompts

In most production AI applications, users ask similar questions repeatedly. Customer support bots see the same 200 questions 80% of the time. Search assistants get near-identical queries. FAQ systems process thousands of variations of a dozen core questions.

Without caching, every one of these pays full inference cost. With semantic caching, the second — and hundredth — similar request costs essentially nothing.

How Semantic Caching Works

Step 1: Embed the Prompt

When a new request arrives, it's converted to a vector embedding using a lightweight embedding model (text-embedding-3-small at $0.02/M tokens). This takes ~20ms and costs a fraction of a cent.

Step 2: Search the Cache

The embedding is compared against previously stored embeddings using cosine similarity. If similarity exceeds the threshold (typically 0.92), the cached response is returned immediately.
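The similarity check itself is a few lines of arithmetic. A minimal sketch in pure Python (the 0.92 threshold is from the text; `is_cache_hit` is an illustrative helper name, and embeddings are assumed to be plain lists of floats):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query: list[float], cached: list[float],
                 threshold: float = 0.92) -> bool:
    return cosine_similarity(query, cached) >= threshold
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so the threshold is a dial between "exact paraphrase only" and "loosely related".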

Step 3: Store New Responses

When no cache hit is found, the request proceeds to the LLM. The response is stored asynchronously with its embedding for future lookups.

Two-Level Cache Architecture

For maximum efficiency, implement a two-level lookup:

  1. Exact hash match — MD5 of the normalized prompt string. Instant O(1) lookup, zero embedding cost.
  2. Semantic similarity search — pgvector cosine similarity across stored embeddings. Catches paraphrases and near-duplicates.

This architecture handles both identical repeated requests (exact hash) and semantically equivalent variations (vector search).

The Economics

Metric              Without Cache   With Cache (30% hit rate)
100k requests/mo    $500            $350
500k requests/mo    $2,500          $1,750
1M requests/mo      $5,000          $3,500

At a 30% hit rate, a 1M request workload saves $1,500/month; FAQ-heavy workloads with more repetition can push both the hit rate and the savings considerably higher.
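The arithmetic behind these figures is simple. This sketch assumes the flat $0.005-per-request inference cost implied by the uncached column and ignores the small embedding-lookup overhead on each request:

```python
COST_PER_REQUEST = 0.005  # implied by $500 per 100k uncached requests

def monthly_cost(requests: int, hit_rate: float = 0.0) -> float:
    # Cache hits skip inference entirely; misses pay full price.
    return requests * (1 - hit_rate) * COST_PER_REQUEST
```

For example, `monthly_cost(1_000_000, hit_rate=0.30)` yields $3,500, versus $5,000 uncached.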

Setting the Similarity Threshold

The threshold controls the trade-off between cache hit rate and response accuracy:

  • 0.98+ — Near-identical only. Very high accuracy, low hit rate (~5–10%)
  • 0.92–0.97 — Strong semantic similarity. Balanced accuracy and hit rate (~20–35%)
  • 0.85–0.91 — Broader matching. Higher hit rate but occasional mismatches

For most production workloads, 0.92 is a good starting point; tune it against a sample of real traffic before loosening it further.

NeuralRouting's Semantic Cache

NeuralRouting includes a production-grade semantic cache powered by pgvector:

  • Two-level lookup (exact hash → cosine similarity)
  • Configurable similarity threshold
  • Fire-and-forget async storage (zero latency impact)
  • Cache hit stats in the FinOps dashboard
  • Per-user and global cache scoping

No configuration needed. Every request through NeuralRouting is automatically cached.


Ready to cut your AI costs?

Start saving up to 80% on token costs today. Free tier available.

Get Started Free →