The Hidden Cost of Repeated Prompts
In most production AI applications, users ask similar questions repeatedly. Customer support bots see the same 200 questions 80% of the time. Search assistants get near-identical queries. FAQ systems process thousands of variations of a dozen core questions.
Without caching, every one of these requests pays full inference cost. With semantic caching, the second similar request, and the hundredth, costs only an embedding call and a cache lookup instead of a full model call.
How Semantic Caching Works
Step 1: Embed the Prompt
When a new request arrives, it is converted to a vector embedding with a lightweight embedding model (for example, text-embedding-3-small at $0.02 per million tokens). This takes about 20 ms and costs a small fraction of a cent.
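To put "a fraction of a cent" in numbers, here is a rough back-of-the-envelope sketch. The price comes from the figure quoted above; the 50-token prompt length is an assumption chosen only for illustration, and `embedding_cost_usd` is a hypothetical helper name.

```python
# Price quoted in the text for text-embedding-3-small.
PRICE_PER_MILLION_TOKENS_USD = 0.02


def embedding_cost_usd(prompt_tokens: int) -> float:
    """Return the embedding cost in US dollars for a prompt of the given length."""
    return prompt_tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000


# Assumed prompt length of 50 tokens (illustrative, not a measured value).
cost = embedding_cost_usd(50)
print(f"Embedding a 50-token prompt costs about ${cost:.6f}")
```

At that assumed length, a single embedding costs about one millionth of a dollar, so even a million cache lookups add up to roughly one dollar of embedding spend.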
Step 2: Search the Cache
The embedding is compared against previously stored embeddings using cosine similarity. If similarity exceeds the threshold (typically 0.92), the cached response is returned immediately.
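The lookup step can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: `SemanticCache`, `store`, and `lookup` are hypothetical names, the embeddings are assumed to be plain lists of floats, and the linear scan stands in for whatever vector index a real deployment would use.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    # 0.92 is the typical threshold mentioned in the text.
    threshold: float = 0.92
    # Each entry pairs a stored prompt embedding with its cached response.
    entries: list[tuple[list[float], str]] = field(default_factory=list)

    def store(self, embedding: list[float], response: str) -> None:
        self.entries.append((embedding, response))

    def lookup(self, embedding: list[float]) -> Optional[str]:
        """Return the best-matching cached response, or None on a cache miss."""
        best_score = -1.0
        best_response = None
        for stored_embedding, response in self.entries:
            score = cosine_similarity(embedding, stored_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # A hit requires the similarity to exceed the threshold.
        return best_response if best_score > self.threshold else None


# Toy 2-dimensional vectors stand in for real embeddings.
cache = SemanticCache()
cache.store([1.0, 0.0], "You can reset your password from the login page.")
hit = cache.lookup([0.99, 0.05])   # nearly the same direction as the stored vector
miss = cache.lookup([0.0, 1.0])    # orthogonal to the stored vector
```

Here `hit` holds the cached response because the two vectors point in almost the same direction, while `miss` is `None` because the similarity falls below the threshold.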