Caching Strategies for AI Applications: Managing High Read Loads and Latency-Sensitive Inference
Introduction
AI applications are read-heavy by nature. Every user query triggers multiple reads: embedding lookups, similarity searches, context retrieval, model inference, and result formatting. Unlike traditional applications where reads hit a database, AI application reads hit databases, vector stores, model endpoints, and caching layers.
The challenge compounds: LLM inference is slow (hundreds of milliseconds to seconds), embedding generation is expensive (API costs or GPU time), and vector searches across large datasets add latency. Without aggressive caching, your AI application becomes unusably slow and your costs explode.
Effective caching for AI applications requires understanding what to cache (responses, embeddings, intermediate results), where to cache (memory, distributed cache, CDN), and how to invalidate (time-based, semantic, event-driven).
Section 1: What to Cache in AI Applications
AI applications have multiple cacheable layers. Cache at each layer for maximum impact.
LLM response caching
LLM response caching is the most impactful layer, because LLM API calls are both slow and expensive:
class CachedLLM {
  private cache: Cache;
  private llm: LLM;

  async complete(prompt: string, options: LLMOptions = {}): Promise<string> {
    // Create a cache key from prompt + options
    const cacheKey = this.getCacheKey(prompt, options);

    // Check cache first
    const cached = await this.cache.get(cacheKey);
    if (cached) {
      return cached;
    }

    // Call the LLM on a miss
    const response = await this.llm.complete(prompt, options);

    // Cache the response
    await this.cache.set(cacheKey, response, {
      ttl: options.cacheTTL || 3600 // 1 hour default
    });
    return response;
  }

  private getCacheKey(prompt: string, options: LLMOptions): string {
    // Include every option that affects output in the cache key
    return `llm:${hash(prompt)}:${hash(JSON.stringify(options))}`;
  }
}
Cache hit scenarios: identical prompts, repeated user questions, system prompts that don't change.
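Exact-match keys are brittle: a trailing space or different capitalization produces a miss on an otherwise identical prompt. A light normalization step before hashing can raise the hit rate considerably. A minimal sketch, with a hypothetical `normalizePrompt` helper (lowercasing is only safe if responses shouldn't depend on the prompt's casing):

```typescript
// Hypothetical helper: normalize a prompt before hashing it into a key
function normalizePrompt(prompt: string): string {
  return prompt
    .trim()               // drop leading/trailing whitespace
    .replace(/\s+/g, " ") // collapse runs of internal whitespace
    .toLowerCase();       // case-insensitive matching, where acceptable
}
```

The cache key would then be built from `hash(normalizePrompt(prompt))` instead of `hash(prompt)`, so "  What is RAG? " and "what is rag?" share one entry.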
Embedding caching
Embeddings are expensive to compute. Cache them aggressively:
class CachedEmbedder {
  private cache: Cache;
  private embedder: Embedder;

  async embed(text: string): Promise<number[]> {
    const cacheKey = `emb:${hash(text)}`;
    const cached = await this.cache.get(cacheKey);
    if (cached) return cached;

    const embedding = await this.embedder.embed(text);
    // Embeddings don't change for a given model, so cache for a long time
    await this.cache.set(cacheKey, embedding, { ttl: 86400 * 365 }); // 1 year
    return embedding;
  }

  async embedBatch(texts: string[]): Promise<number[][]> {
    // Check the cache for all texts, then embed only the misses
    const results: number[][] = [];
    const toEmbed: { index: number; text: string }[] = [];

    for (let i = 0; i < texts.length; i++) {
      const cached = await this.cache.get(`emb:${hash(texts[i])}`);
      if (cached) {
        results[i] = cached;
      } else {
        toEmbed.push({ index: i, text: texts[i] });
      }
    }

    if (toEmbed.length > 0) {
      const embeddings = await this.embedder.embedBatch(toEmbed.map(t => t.text));
      for (let j = 0; j < toEmbed.length; j++) {
        results[toEmbed[j].index] = embeddings[j];
        await this.cache.set(`emb:${hash(toEmbed[j].text)}`, embeddings[j], { ttl: 86400 * 365 });
      }
    }
    return results;
  }
}
RAG context caching
In RAG systems, the retrieved context is often reusable:
class CachedRAGRetriever {
  private cache: Cache;

  async retrieve(query: string): Promise<Document[]> {
    const cacheKey = `rag:${hash(query)}`;
    const cached = await this.cache.get(cacheKey);
    if (cached) return cached;

    // Retrieve relevant documents
    const queryEmbedding = await embed(query);
    const documents = await vectorDb.search(queryEmbedding, 10);

    // Cache for a moderate duration (the underlying content may update)
    await this.cache.set(cacheKey, documents, { ttl: 3600 });
    return documents;
  }
}
Section 2: Multi-Level Caching Architecture
A single cache layer isn't enough for high-read AI applications. Use multi-level caching:
Request → L1 Cache (In-Memory) → L2 Cache (Redis) → L3 Cache (Persistent) → Source
L1: In-memory cache (nanosecond latency)
Fastest cache, but limited by process memory:
class L1Cache {
  private cache: Map<string, CacheEntry> = new Map();
  private maxSize: number = 10000;

  get(key: string): any | null {
    const entry = this.cache.get(key);
    if (!entry) return null;

    // Check TTL
    if (Date.now() - entry.timestamp > entry.ttl) {
      this.cache.delete(key);
      return null;
    }
    return entry.value;
  }

  set(key: string, value: any, ttlMs: number = 3600000): void {
    // Evict the oldest entry (insertion order) if at capacity
    if (this.cache.size >= this.maxSize) {
      const oldestKey = this.cache.keys().next().value;
      this.cache.delete(oldestKey);
    }
    this.cache.set(key, { value, timestamp: Date.now(), ttl: ttlMs });
  }
}
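One caveat: because `Map` iterates in insertion order, the eviction above is FIFO, so a frequently read key can be evicted simply because it was inserted first. Re-inserting an entry on every hit turns it into LRU. A sketch of that variant (same interface, hypothetical `L1LruCache` name):

```typescript
class L1LruCache {
  private cache: Map<string, { value: any; timestamp: number; ttl: number }> = new Map();
  constructor(private maxSize: number = 10000) {}

  get(key: string): any | null {
    const entry = this.cache.get(key);
    if (!entry) return null;
    if (Date.now() - entry.timestamp > entry.ttl) {
      this.cache.delete(key);
      return null;
    }
    // Re-insert so this key moves to the back of the iteration order (LRU)
    this.cache.delete(key);
    this.cache.set(key, entry);
    return entry.value;
  }

  set(key: string, value: any, ttlMs: number = 3600000): void {
    this.cache.delete(key); // refresh position on overwrite
    if (this.cache.size >= this.maxSize) {
      // The first key in iteration order is now the least recently used
      this.cache.delete(this.cache.keys().next().value!);
    }
    this.cache.set(key, { value, timestamp: Date.now(), ttl: ttlMs });
  }
}
```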
L2: Distributed cache (Redis, millisecond latency)
Shared across processes and servers:
class L2Cache {
  private redis: Redis;

  async get(key: string): Promise<any | null> {
    const value = await this.redis.get(key);
    return value ? JSON.parse(value) : null;
  }

  async set(key: string, value: any, ttlSeconds: number = 3600): Promise<void> {
    await this.redis.setex(key, ttlSeconds, JSON.stringify(value));
  }
}
L3: Persistent cache (database, tens of milliseconds latency)
For cacheable data that survives process restarts:
class L3Cache {
  async get(key: string): Promise<any | null> {
    const rows = await db.query(
      'SELECT value FROM cache WHERE key = $1 AND expires_at > NOW()',
      [key]
    );
    return rows.length > 0 ? JSON.parse(rows[0].value) : null;
  }

  async set(key: string, value: any, ttlSeconds: number = 3600): Promise<void> {
    // Compute the expiry in application code so the TTL is a bound
    // parameter instead of being interpolated into the SQL string
    const expiresAt = new Date(Date.now() + ttlSeconds * 1000);
    await db.query(`
      INSERT INTO cache (key, value, expires_at)
      VALUES ($1, $2, $3)
      ON CONFLICT (key) DO UPDATE SET value = $2, expires_at = $3
    `, [key, JSON.stringify(value), expiresAt]);
  }
}
Tiered cache implementation
class TieredCache {
  constructor(
    private l1: L1Cache,
    private l2: L2Cache,
    private l3: L3Cache
  ) {}

  async get(key: string): Promise<any | null> {
    // Try L1
    let value = this.l1.get(key);
    if (value !== null) return value;

    // Try L2
    value = await this.l2.get(key);
    if (value !== null) {
      this.l1.set(key, value); // Populate L1
      return value;
    }

    // Try L3
    value = await this.l3.get(key);
    if (value !== null) {
      this.l1.set(key, value);
      await this.l2.set(key, value);
      return value;
    }
    return null;
  }
}
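The read path above needs a matching write path, and the pattern generalizes beyond three fixed tiers. A sketch of a write-through variant over a hypothetical `Tier` interface (TTL handling omitted for brevity): a hit backfills every faster tier, and a write populates all tiers:

```typescript
// Minimal tier interface; TTLs omitted to keep the sketch short
interface Tier {
  get(key: string): Promise<any | null>;
  set(key: string, value: any): Promise<void>;
}

class WriteThroughTieredCache {
  // Tiers ordered fastest to slowest, e.g. [l1, l2, l3]
  constructor(private tiers: Tier[]) {}

  async get(key: string): Promise<any | null> {
    for (let i = 0; i < this.tiers.length; i++) {
      const value = await this.tiers[i].get(key);
      if (value !== null) {
        // Backfill every faster tier so the next read hits sooner
        for (let j = 0; j < i; j++) {
          await this.tiers[j].set(key, value);
        }
        return value;
      }
    }
    return null;
  }

  async set(key: string, value: any): Promise<void> {
    // Write-through: populate all tiers on every write
    await Promise.all(this.tiers.map(t => t.set(key, value)));
  }
}
```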
Section 3: Semantic Caching for LLM Responses
Traditional caching uses exact key matching. Semantic caching uses similarity—if a new query is semantically similar to a cached query, return the cached response.
How semantic caching works
class SemanticCache {
  private vectorDb: VectorDatabase;
  // In production, back this with a persistent store rather than a Map
  private responseStore: Map<string, string> = new Map();

  async get(query: string, similarityThreshold: number = 0.95): Promise<string | null> {
    // Embed the query
    const queryEmbedding = await embed(query);

    // Search for the most similar cached query
    const similar = await this.vectorDb.search(queryEmbedding, 1);
    if (similar.length > 0 && similar[0].score >= similarityThreshold) {
      return this.responseStore.get(similar[0].id) || null;
    }
    return null;
  }

  async set(query: string, response: string): Promise<void> {
    const queryEmbedding = await embed(query);
    const id = `cache:${Date.now()}:${hash(query)}`;
    await this.vectorDb.upsert([{
      id,
      values: queryEmbedding,
      metadata: { query, timestamp: Date.now() }
    }]);
    this.responseStore.set(id, response);
  }
}
Benefit: "What's the weather in NYC?" and "What's the weather like in New York?" return the same cached response.
Tradeoff: the embedding and vector search add latency to every cache lookup, but that is still far faster than LLM inference.
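In practice the two approaches layer well: check the cheap exact-match cache first, fall back to the semantic cache, and only call the model when both miss. A sketch of that lookup order, with hypothetical interfaces standing in for the two caches and the model call:

```typescript
// Hypothetical minimal interfaces for the two cache layers
interface ExactCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}
interface SimilarityCache {
  get(query: string): Promise<string | null>;
  set(query: string, response: string): Promise<void>;
}

async function cachedComplete(
  query: string,
  exact: ExactCache,
  semantic: SimilarityCache,
  llm: (q: string) => Promise<string>
): Promise<string> {
  // 1. Cheapest first: exact key match, no embedding needed
  const hit = await exact.get(query);
  if (hit !== null) return hit;

  // 2. Semantic match: pays one embedding + vector search,
  //    still far cheaper than inference
  const similar = await semantic.get(query);
  if (similar !== null) {
    await exact.set(query, similar); // promote to the exact cache
    return similar;
  }

  // 3. Full inference, then populate both caches
  const response = await llm(query);
  await Promise.all([exact.set(query, response), semantic.set(query, response)]);
  return response;
}
```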
Section 4: Cache Invalidation Strategies
AI application caches need smart invalidation—data changes, embeddings get updated, and models improve.
Time-based invalidation
Simple but effective for data that ages out:
// Cache LLM responses for 1 hour
await cache.set(key, response, { ttl: 3600 });

// Cache embeddings essentially forever (they don't change for a given model)
await cache.set(key, embedding, { ttl: 86400 * 365 });

// Cache RAG context for a moderate time (content may update)
await cache.set(key, context, { ttl: 1800 });
Event-driven invalidation
Invalidate cache when underlying data changes:
class EventDrivenCache {
  private cache: Cache;

  constructor() {
    // Listen for data change events
    eventBus.subscribe('document_updated', (event) => {
      this.invalidateRelatedCache(event.documentId);
    });
    eventBus.subscribe('model_updated', (event) => {
      this.invalidateAllEmbeddings(); // New model = new embeddings (not shown)
    });
  }

  private async invalidateRelatedCache(documentId: string): Promise<void> {
    // Invalidate RAG cache entries that might include this document
    const keys = await this.cache.keys('rag:*');
    for (const key of keys) {
      const cached = await this.cache.get(key);
      if (cached?.some((doc: Document) => doc.id === documentId)) {
        await this.cache.del(key);
      }
    }
  }
}
Model-version invalidation
When the embedding model is upgraded, every embedding cached under the old model is stale and must be dropped:
async function invalidateOldEmbeddings(oldModel: string): Promise<void> {
  // Find all cached embeddings produced by the old model
  const oldKeys = await cache.keys(`emb:*:model:${oldModel}`);
  // Invalidate them
  await cache.del(oldKeys);
}
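Note that this wildcard pattern only works if the model name is part of every embedding key, which the plain `emb:${hash(text)}` keys from Section 1 don't include. A versioned key builder makes old-model entries addressable. A sketch (the truncated sha256 stands in for whatever `hash` function the rest of the cache uses):

```typescript
import { createHash } from "crypto";

// Stand-in for the document's hash() helper
const hash = (s: string): string =>
  createHash("sha256").update(s).digest("hex").slice(0, 16);

// Embedding cache key that carries the model name, so keys shaped like
// emb:<hash>:model:<name> can be invalidated per model with a wildcard
function embeddingKey(text: string, model: string): string {
  return `emb:${hash(text)}:model:${model}`;
}
```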
Section 5: Handling Cache Stampedes
When a cached item expires and many requests try to recompute it simultaneously, you get a cache stampede. This is especially problematic for AI applications where recomputation is expensive (LLM calls, embedding generation).
Lock-based prevention
class StampedeProtectedCache {
  private cache: Cache;
  private locks: Map<string, Promise<any>> = new Map();

  async get(key: string, compute: () => Promise<any>, ttl: number): Promise<any> {
    // Check cache
    const value = await this.cache.get(key);
    if (value !== null) return value;

    // If another request is already computing this key, wait for it
    if (this.locks.has(key)) {
      return this.locks.get(key);
    }

    // Compute under a lock so concurrent misses share one computation
    const computePromise = compute().then(result => {
      this.cache.set(key, result, ttl);
      this.locks.delete(key);
      return result;
    }).catch(error => {
      this.locks.delete(key);
      throw error;
    });
    this.locks.set(key, computePromise);
    return computePromise;
  }
}
Early expiration
Refresh cache before it expires:
class EarlyRefreshCache {
  private cache: Cache;

  async get(key: string, compute: () => Promise<any>, ttl: number): Promise<any> {
    const cached = await this.cache.getWithMeta(key);
    if (!cached) {
      // Cache miss: compute and store
      const value = await compute();
      await this.cache.set(key, value, ttl);
      return value;
    }

    // Cache hit: refresh proactively once 80% of the TTL has elapsed
    const age = Date.now() - cached.timestamp;
    if (age > ttl * 0.8) {
      // Refresh in the background (don't block the request)
      void this.refreshInBackground(key, compute, ttl);
    }
    return cached.value;
  }

  private async refreshInBackground(key: string, compute: () => Promise<any>, ttl: number): Promise<void> {
    try {
      const value = await compute();
      await this.cache.set(key, value, ttl);
    } catch (error) {
      // Log but don't throw: this is a background refresh
      logger.error('Background cache refresh failed', error);
    }
  }
}
Section 6: CDN Caching for AI-Generated Content
For AI applications that generate content (articles, images, summaries), use CDN caching:
// Set appropriate cache headers for AI-generated content
app.get('/api/ai-content/:topic', async (req, res) => {
  const { topic } = req.params;

  // Check if content was previously generated
  const cached = await contentStore.get(topic);
  if (cached) {
    res.set('Cache-Control', 'public, max-age=86400'); // 1 day
    return res.json(cached);
  }

  // Generate content
  const content = await llm.generate({ prompt: `Write about ${topic}` });

  // Store it, and let the CDN cache the response
  await contentStore.set(topic, content);
  res.set('Cache-Control', 'public, max-age=86400');
  res.json(content);
});
Benefit: repeated requests for the same content are served from the CDN edge, not your application.
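If your CDN supports the stale-while-revalidate extension to Cache-Control (RFC 5861), expired content can be served from the edge while the CDN refetches in the background, which hides regeneration latency entirely. A small helper to build the header value (verify your CDN honors the directive before relying on it):

```typescript
// Cache-Control builder using the stale-while-revalidate extension
// (RFC 5861): within the stale window, the CDN serves the old copy
// and revalidates in the background instead of blocking the user
function cacheControl(maxAgeSec: number, staleSec: number): string {
  return `public, max-age=${maxAgeSec}, stale-while-revalidate=${staleSec}`;
}

// res.set('Cache-Control', cacheControl(86400, 3600));
```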
Section 7: Monitoring Cache Performance
Track these metrics to optimize your caching strategy:
class CacheMetrics {
private hits: number = 0;
private misses: number = 0;
private computeTime: number = 0;
private computeCount: number = 0;
recordHit(): void {
this.hits++;
}
recordMiss(computeMs: number): void {
this.misses++;
this.computeTime += computeMs;
this.computeCount++;
}
getMetrics() {
const total = this.hits + this.misses;
return {
hitRate: total > 0 ? this.hits / total : 0,
avgComputeTime: this.computeCount > 0 ? this.computeTime / this.computeCount : 0,
totalRequests: total
};
}
}
Key metrics to track:
- Hit rate: the percentage of requests served from cache.
- Compute time: how long cache misses take to recompute.
- Cache size: how much memory or storage the cache uses.
- Eviction rate: how often items are evicted before their TTL expires.
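Wiring these metrics into the lookup path is straightforward. A sketch of an instrumented get, with minimal hypothetical interfaces for the cache and the metrics sink:

```typescript
// Hypothetical minimal shapes for the pieces this helper needs
interface MetricsSink {
  recordHit(): void;
  recordMiss(computeMs: number): void;
}
interface AsyncCache {
  get(key: string): Promise<any | null>;
  set(key: string, value: any): Promise<void>;
}

// Instrumented lookup: records a hit or, on a miss, how long the
// recomputation took before caching the result
async function instrumentedGet(
  key: string,
  cache: AsyncCache,
  compute: () => Promise<any>,
  metrics: MetricsSink
): Promise<any> {
  const cached = await cache.get(key);
  if (cached !== null) {
    metrics.recordHit();
    return cached;
  }
  const start = Date.now();
  const value = await compute();
  metrics.recordMiss(Date.now() - start);
  await cache.set(key, value);
  return value;
}
```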
Conclusion
Caching is the difference between an AI application that feels responsive and one that users abandon. LLM responses, embeddings, and retrieval results should all be cached aggressively at multiple levels.
Start with LLM response caching—it has the highest impact on both cost and latency. Add embedding caching next. Then build multi-level caching with L1 (in-memory), L2 (Redis), and L3 (persistent) tiers. Consider semantic caching for queries that are similar but not identical.
The best AI applications are often the best-cached ones. Every cache hit is a faster response and lower cost. Design your caching strategy before you have scale problems, because by then, your users have already felt the latency.
Related Service: AI Systems & Automation
Need help building high-performance caching layers for your AI application?