2026-05-21 4 min read Tanuj Garg

Semantic Caching at Scale: How We Cut LLM API Costs by 73%


Introduction

LLM API costs do not scale linearly with users; they scale with tokens. A support bot that answers 10,000 queries per day at 2,000 tokens per response generates 20 million tokens per day and burns $400–$800/day on inference alone (which implies roughly $20–$40 per million tokens), before embeddings, reranking, or agent orchestration.

Exact-match caching helps for identical prompts. But real workloads are messy: users rephrase the same question, add filler words, or submit near-duplicates that produce the same answer. Exact-match caches typically reach hit rates of only 5–15%. Semantic caching pushes that to 40–70%.

In a recent production deployment, combining semantic caching with token-aware rate limiting reduced total LLM API spend by 73%—without measurable quality regression on evaluated queries.


Section 1: How Semantic Caching Works

Traditional cache: hash(prompt) → response.

Semantic cache: embed(prompt) → nearest neighbor in vector space → response.

When a new query arrives:

  1. embed the user message,
  2. search the cache index for vectors above a similarity threshold (typically 0.92–0.97 cosine similarity),
  3. if a match exists, return the cached response,
  4. if not, call the LLM, store the embedding + response pair.

The threshold is the tuning knob. Too low: wrong answers from dissimilar queries. Too high: cache misses on valid paraphrases.
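The lookup flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the `SemanticCache` class, the `embed` callable, and the `call_llm` callable are hypothetical names, and the linear scan stands in for a real vector index.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    # Hypothetical embedding function supplied by the caller.
    embed: Callable[[str], Sequence[float]]
    # Assumed default, inside the 0.92-0.97 range mentioned above.
    threshold: float = 0.95
    # Stored (embedding, response) pairs; a real system would use a vector index.
    entries: List[Tuple[Sequence[float], str]] = field(default_factory=list)

    def _nearest(self, vector: Sequence[float]) -> Optional[str]:
        """Return the response of the most similar entry if it clears the threshold."""
        best_score, best_response = -1.0, None
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def get_or_call(self, query: str, call_llm: Callable[[str], str]) -> str:
        """Steps 1-4: embed, search, return on hit, otherwise call the LLM and store."""
        vector = self.embed(query)
        cached = self._nearest(vector)
        if cached is not None:
            return cached
        response = call_llm(query)
        self.entries.append((vector, response))
        return response
```

The query is embedded once and the same vector is reused for both the search and the store, so a cache miss costs one embedding call rather than two.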


Section 2: Architecture Components

A production semantic cache needs four layers:

Embedding service

Use the same embedding model when writing cache keys and when looking them up. Mixing models breaks similarity scores, because vectors from different models are not comparable. For most workloads, a small, fast model (text-embedding-3-small or equivalent) is sufficient.

Vector index

Store (embedding, response, metadata) tuples. Options:

  • Redis with vector search for low-latency, moderate scale,
  • pgvector if you already run Postgres and want operational simplicity,
  • Dedicated vector DB (Pinecone, Weaviate) at higher scale.

Invalidation strategy

Cached responses go stale when:

  • underlying data changes (product catalog, policy docs),
  • the system prompt changes,
  • the model version changes.

Tag cache entries with source_version, model_id, and system_prompt_hash. Invalidate an entry when any of these changes.
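One way to apply these tags is to store them with each entry and compare them against the current deployment at read time. The sketch below assumes that approach; `CacheTags`, `CacheEntry`, `prompt_hash`, and `is_fresh` are illustrative names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from dataclasses import dataclass


def prompt_hash(system_prompt: str) -> str:
    """Stable fingerprint of the system prompt (SHA-256 is an assumed choice)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CacheTags:
    source_version: str      # version of the underlying data (catalog, policy docs)
    model_id: str            # model that produced the response
    system_prompt_hash: str  # fingerprint of the system prompt in use


@dataclass
class CacheEntry:
    response: str
    tags: CacheTags


def is_fresh(entry: CacheEntry, current: CacheTags) -> bool:
    """An entry is usable only if all three tags match the current deployment."""
    return entry.tags == current
```

Checking tags at read time means a version bump invalidates old entries lazily, without a bulk delete; stale entries can then be evicted in the background.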

Bypass rules

Never cache:

  • queries with PII that should not be stored,
  • responses that include time-sensitive data,
  • agent tool-call results that depend on live system state.
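These rules can be expressed as a single check that runs before anything is written to the cache. The sketch below is illustrative only: the `Exchange` type and its flags are hypothetical, and the two regular expressions are a deliberately crude stand-in for real PII detection.

```python
import re
from dataclasses import dataclass

# Crude, assumed patterns for illustration; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


@dataclass
class Exchange:
    query: str
    response: str
    time_sensitive: bool = False   # set by the caller when the answer contains live data
    used_live_tools: bool = False  # set when an agent tool call read live system state


def should_cache(exchange: Exchange) -> bool:
    """Return False for any exchange that matches one of the bypass rules."""
    if EMAIL_RE.search(exchange.query) or PHONE_RE.search(exchange.query):
        return False  # query appears to contain PII
    if exchange.time_sensitive:
        return False  # response would go stale immediately
    if exchange.used_live_tools:
        return False  # result depends on live system state
    return True
```

Defaulting to "do not cache" whenever a rule is uncertain is the safer failure mode: a missed cache write costs one extra LLM call, while a wrongly cached entry can leak data or serve stale answers.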

Section 3: Token-Aware Rate Limiting

Caching reduces cost; rate limiting prevents runaway spend. Token-aware limits are stricter than request-count limits because a single agent loop can consume 50,000 tokens in one "request."

Implement limits at three levels:

  • Per user/session: prevent individual abuse,
  • Per tool/agent: cap agent loop iterations and total tokens per task,
  • Per tenant/org: budget enforcement for multi-tenant platforms.

When a limit is hit, degrade gracefully: return cached partial results, switch to a cheaper model, or queue for async processing.
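A minimal sketch of multi-level token limits follows, assuming a fixed-window budget per key. `TokenBudget`, `TokenLimiter`, the level names, and the limits in the usage note are all hypothetical, and a production limiter would keep this state in a shared store rather than in process memory.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TokenBudget:
    limit: int              # maximum tokens allowed per window
    window_seconds: float   # window length
    window_start: float     # timestamp when the current window began
    used: int = 0

    def remaining(self, now: float) -> int:
        """Reset the window if it has expired, then report tokens left."""
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        return self.limit - self.used


class TokenLimiter:
    def __init__(self, limits: Dict[str, Tuple[int, float]]):
        # Maps a level name (e.g. "user", "agent", "tenant") to (limit, window seconds).
        self.limits = limits
        self.budgets: Dict[Tuple[str, str], TokenBudget] = {}

    def _budget(self, level: str, key: str, now: float) -> TokenBudget:
        if (level, key) not in self.budgets:
            limit, window = self.limits[level]
            self.budgets[(level, key)] = TokenBudget(limit, window, window_start=now)
        return self.budgets[(level, key)]

    def try_consume(self, keys: Dict[str, str], tokens: int,
                    now: Optional[float] = None) -> bool:
        """Charge `tokens` to every level in `keys`, or to none if any level is short."""
        now = time.monotonic() if now is None else now
        budgets = [self._budget(level, key, now) for level, key in keys.items()]
        if any(b.remaining(now) < tokens for b in budgets):
            return False
        for b in budgets:
            b.used += tokens
        return True
```

A caller might build `TokenLimiter({"user": (50_000, 3600.0), "tenant": (2_000_000, 3600.0)})` and call `try_consume({"user": user_id, "tenant": org_id}, estimated_tokens)` before each LLM call. When it returns `False`, that is the point to apply one of the degradation paths described above instead of failing the request.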


Section 4: Measuring Cache Effectiveness

Track these metrics:

Metric                     Target
Cache hit rate             40–70% for support/FAQ workloads
False positive rate        < 1% (wrong answer from cache)
Cost per resolved query    50–80% reduction vs uncached
p95 latency (cache hit)    < 50ms

Run weekly evals on a golden dataset: compare cached vs fresh responses. If quality drifts, raise the similarity threshold (a threshold that is too low is what lets dissimilar queries match) or tighten invalidation.
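The first three metrics can be computed from a simple eval log. The sketch below assumes each golden-dataset query produces one record; `EvalRecord` and `summarize` are hypothetical names, and the false positive rate is taken here as the share of cache hits judged wrong, which is one possible reading of the target above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class EvalRecord:
    cache_hit: bool           # was the query served from the cache?
    correct: bool             # did the served answer match the fresh response (as judged)?
    cost_usd: float           # what the query actually cost
    uncached_cost_usd: float  # what it would have cost without the cache


def summarize(records: Iterable[EvalRecord]) -> Dict[str, float]:
    """Compute hit rate, false positive rate (per cache hit), and cost reduction."""
    rows = list(records)
    if not rows:
        return {"hit_rate": 0.0, "false_positive_rate": 0.0, "cost_reduction": 0.0}
    hits = [r for r in rows if r.cache_hit]
    wrong_hits = [r for r in hits if not r.correct]
    actual = sum(r.cost_usd for r in rows)
    baseline = sum(r.uncached_cost_usd for r in rows)
    return {
        "hit_rate": len(hits) / len(rows),
        "false_positive_rate": len(wrong_hits) / len(hits) if hits else 0.0,
        "cost_reduction": 1.0 - actual / baseline if baseline else 0.0,
    }
```

Tracking the false positive rate alongside the hit rate matters because the two move together: a threshold change that lifts the hit rate usually lifts wrong answers too, and only the pair shows whether the trade was worth it.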


Section 5: When Semantic Caching Is Not Enough

Semantic caching works best for:

  • FAQ and support bots,
  • RAG systems with stable knowledge bases,
  • classification and routing tasks.

It works poorly for:

  • creative generation (every response should be unique),
  • real-time data queries (stock prices, inventory),
  • multi-step agent workflows where each step depends on prior tool results.

For those workloads, focus on prompt compression, model routing (cheap model for simple tasks), and context window management instead.


Conclusion

Semantic caching has moved from "nice optimization" to architectural requirement. At LLM scale, you cannot afford to re-infer identical intent on every request.

Start with your highest-volume, lowest-variance endpoint—usually support or search—and measure hit rate and false positive rate for two weeks before expanding.
