Caching Strategies for LLM Systems
LLM API calls are slow and expensive. Caching can dramatically reduce both, but naive caching strategies fail because prompts rarely repeat exactly. Here's what works.
Caching is a fundamental optimization in software engineering. Store the result of an expensive operation, reuse it when the same input appears again. Simple, effective, well-understood.
But LLM systems break the standard caching model. Users rarely ask the exact same question twice. Prompts include dynamic context that changes on every request. Even if the question is similar, the full prompt (system message + context + user query) is different.
Naive caching (cache key = exact prompt text) gets hit rates near 0%. You need smarter strategies.
Why Exact-Match Caching Fails
A user asks: “What’s the weather in San Francisco?” You cache the response.
Five minutes later, another user asks the same question. But your prompt includes:
- A timestamp
- A session ID
- Retrieved context that might differ
- User-specific metadata
The full prompts aren’t identical, so the cache misses. You call the LLM again, wasting money and time.
This happens constantly in production. Caching seems like an obvious win, but without careful design, it doesn’t help.
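To make the failure concrete, here is a minimal sketch (the prompt layout and field names are hypothetical) showing why a cache keyed on the full rendered prompt almost never hits: the volatile fields make every key unique, even when the question is identical.

```python
import hashlib

def render_prompt(question: str, session_id: str, timestamp: str) -> str:
    # The full prompt bundles volatile fields with the stable question.
    return f"[session={session_id}][time={timestamp}]\nUser: {question}"

def cache_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()

# Two users ask the same question a few minutes apart.
k1 = cache_key(render_prompt("What's the weather in San Francisco?", "sess-a", "2024-05-01T12:00"))
k2 = cache_key(render_prompt("What's the weather in San Francisco?", "sess-b", "2024-05-01T12:05"))

print(k1 == k2)  # False: identical questions, zero cache hits
```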
Semantic Caching
Instead of exact string matching, cache based on semantic similarity.
How it works: Embed the user query (or the full prompt) into a vector. When a new query comes in, embed it and search for similar cached embeddings. If you find one above a similarity threshold (e.g., cosine similarity > 0.95), return the cached result.
What this catches: “What’s the weather in SF?” and “What’s the weather in San Francisco?” are semantically identical. A user asking “How do I reset my password?” and another asking “How can I change my password?” might get the same cached answer.
The tradeoff: Embedding queries adds latency (though much less than an LLM call). And you need to tune the similarity threshold—too high and you miss valid cache hits, too low and you return wrong answers.
When it works best: FAQ-style queries, customer support, search-and-summarize workflows where many users ask similar questions.
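A minimal semantic-cache sketch, assuming you already have an embed() function (a call to your embedding model) that returns a fixed-length vector:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # tune per application

class SemanticCache:
    def __init__(self):
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def _cosine(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query_embedding: np.ndarray) -> str | None:
        # Linear scan for clarity; swap in a vector index (FAISS, pgvector, ...) at scale.
        best, best_sim = None, 0.0
        for emb, resp in zip(self.embeddings, self.responses):
            sim = self._cosine(query_embedding, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= SIMILARITY_THRESHOLD else None

    def put(self, query_embedding: np.ndarray, response: str) -> None:
        self.embeddings.append(query_embedding)
        self.responses.append(response)
```

On each request: embed the query, call get(); on a miss, call the LLM and put() the new pair.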
Prompt Prefix Caching
Many LLM providers now support caching at the token level. If two prompts share a long prefix, the provider caches the prefix and only processes the unique suffix.
How it works: Your system prompt might be 2,000 tokens and rarely changes. The user query is 50 tokens and changes every request. The provider reuses the cached 2,000-token prefix, so you pay full price only for the 50-token suffix; cached prefix tokens are typically billed at a steep discount rather than reprocessed at the full rate.
What this catches: Requests that differ only in user input but share system prompts and context.
The tradeoff: You may pay a small cache-write fee, but you save on processing the repeated prefix tokens. Prefix caching only helps if the prefix is stable (it must match exactly, token for token) and large relative to the unique suffix.
When it works best: Applications with long, stable system prompts and short user queries. Especially effective for multi-turn conversations where context grows but the system prompt stays fixed.
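The economics are easy to sanity-check. As a sketch, assume a provider that bills cached prefix tokens at 10% of the normal input rate; actual discounts, cache-write fees, and mechanics vary by provider.

```python
def input_cost(prefix_tokens: int, suffix_tokens: int,
               rate_per_token: float, cached_discount: float = 0.10) -> float:
    # Cached prefix tokens are billed at a fraction of the normal rate;
    # the unique suffix is billed in full.
    return prefix_tokens * rate_per_token * cached_discount + suffix_tokens * rate_per_token

full = (2000 + 50) * 0.000003            # no caching: every input token at full rate
cached = input_cost(2000, 50, 0.000003)  # with prefix caching

print(f"per-request input cost: {full:.6f} -> {cached:.6f}")
# Roughly an 85-90% reduction when the stable prefix dominates the prompt.
```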
Time-Based Expiration
Some queries have answers that don’t change often.
How it works: Cache responses with a TTL (time-to-live). “What’s the capital of France?” can be cached for days or weeks. “What’s the current stock price?” should expire in minutes.
What this catches: Factual queries with stable answers, documentation lookups, translation tasks.
The tradeoff: You need to set appropriate TTLs. Too short and you don’t save much; too long and you serve stale data.
When it works best: Knowledge base queries, documentation search, educational content, FAQs.
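A minimal TTL cache sketch; per-entry TTLs let stable facts live for days while volatile answers expire in minutes.

```python
import time

class TTLCache:
    def __init__(self):
        self._store: dict[str, tuple[float, str]] = {}  # key -> (expiry, response)

    def get(self, key: str) -> str | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, response = entry
        if time.monotonic() > expiry:
            del self._store[key]   # expired: treat as a miss
            return None
        return response

    def put(self, key: str, response: str, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, response)

cache = TTLCache()
cache.put("capital of France", "Paris", ttl_seconds=7 * 24 * 3600)   # days
cache.put("current AAPL price", "fetch live quote", ttl_seconds=60)  # minutes
```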
User-Specific vs. Global Caching
Should you cache per-user or across all users?
Global caching: If two users ask the same question, they get the same cached answer. Maximizes the cache hit rate, but risks leaking one user's information to another if personalized or sensitive content ever ends up in a globally cached response.
Per-user caching: Cache results per user. Lower hit rate, but no risk of cross-user leakage. Useful when responses are personalized.
Hybrid approach: Use global caching for non-sensitive queries, per-user caching for personalized ones.
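One way to express the hybrid approach is in the cache key itself. A sketch, with the sensitivity check left as an application-specific stub:

```python
import hashlib

def is_sensitive(query: str) -> bool:
    # Application-specific: personalization, PII, account data, etc.
    return "my account" in query.lower()

def cache_key(query: str, user_id: str) -> str:
    scope = f"user:{user_id}" if is_sensitive(query) else "global"
    return hashlib.sha256(f"{scope}|{query}".encode()).hexdigest()

# Same question, two users:
print(cache_key("How do I reset my password?", "u1") ==
      cache_key("How do I reset my password?", "u2"))    # True: shared globally
print(cache_key("What's in my account history?", "u1") ==
      cache_key("What's in my account history?", "u2"))  # False: scoped per user
```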
Partial Caching for RAG Systems
In RAG, you retrieve documents, then generate a response. You can cache at multiple levels.
Cache retrieval results: If the same query appears, reuse the retrieved documents without re-searching. This saves on vector DB or search API costs.
Cache generated responses: If the same query + same retrieved docs appear, reuse the LLM response. This saves on LLM costs.
Cache embeddings: If you’re embedding user queries repeatedly, cache embeddings to avoid reprocessing.
The tradeoff: More cache layers mean more complexity. Invalidation becomes harder—if your knowledge base updates, cached retrieval results might be stale.
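A sketch of this layering, with embed(), retrieve(), and generate() standing in for your embedding model, vector DB, and LLM calls:

```python
import hashlib

embedding_cache: dict[str, list[float]] = {}   # query -> embedding
retrieval_cache: dict[str, list[str]] = {}     # query -> retrieved documents
response_cache: dict[str, str] = {}            # (query + docs) -> LLM response

def answer(query: str, embed, retrieve, generate) -> str:
    # Layer 1: embedding cache
    if query not in embedding_cache:
        embedding_cache[query] = embed(query)

    # Layer 2: retrieval cache
    if query not in retrieval_cache:
        retrieval_cache[query] = retrieve(embedding_cache[query])
    docs = retrieval_cache[query]

    # Layer 3: response cache, keyed on query + retrieved docs so a
    # knowledge-base change (different docs) naturally produces a new key
    key = hashlib.sha256((query + "|".join(docs)).encode()).hexdigest()
    if key not in response_cache:
        response_cache[key] = generate(query, docs)
    return response_cache[key]
```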
Cache Invalidation Strategies
The hard part of caching is knowing when to invalidate.
Time-based expiration: Simple but imprecise. Cached data might become stale before expiration or remain valid long after.
Event-driven invalidation: When the underlying data changes (knowledge base update, model update, prompt change), invalidate related caches.
Version-based invalidation: Tag cached entries with model version and prompt version. When either changes, old cache entries become invalid.
Manual invalidation: Provide tools to clear caches when necessary (after deploying fixes, after discovering bad cached responses).
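Version-based invalidation can be as simple as folding the versions into the cache key, so a model or prompt bump makes old entries unreachable (they then age out via TTLs or size limits). The version strings below are illustrative.

```python
import hashlib

MODEL_VERSION = "model-2024-05"   # illustrative identifiers
PROMPT_VERSION = "support-v7"

def versioned_key(query: str) -> str:
    raw = f"{MODEL_VERSION}|{PROMPT_VERSION}|{query}"
    return hashlib.sha256(raw.encode()).hexdigest()

# Deploying a new prompt changes PROMPT_VERSION, so every lookup misses
# the old entries and the cache repopulates with fresh responses.
```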
Negative Caching
Cache failures too, not just successes.
How it works: If a query fails (user asks something your system can’t answer), cache that failure. If the same query comes again within a short window, return the cached failure without retrying.
What this catches: Repeated requests for unavailable information, malformed queries, out-of-scope questions.
The tradeoff: You need shorter TTLs for negative caches (maybe minutes instead of hours) because the system’s capabilities might improve.
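A sketch built on the same TTL idea, using a sentinel value for failures and a much shorter TTL for them:

```python
import time

FAILURE = object()            # sentinel distinguishing "cached failure" from "miss"
NEGATIVE_TTL = 5 * 60         # failures expire quickly (seconds)
POSITIVE_TTL = 24 * 3600      # successes can live much longer

_store: dict[str, tuple[float, object]] = {}

def get(key: str):
    entry = _store.get(key)
    if entry and time.monotonic() < entry[0]:
        return entry[1]       # may be FAILURE: skip the retry, return a canned message
    return None               # miss or expired: go to the LLM

def put(key: str, value, failed: bool = False) -> None:
    ttl = NEGATIVE_TTL if failed else POSITIVE_TTL
    _store[key] = (time.monotonic() + ttl, FAILURE if failed else value)
```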
Cache Hit Rate Monitoring
Track cache performance to understand what’s working.
Hit rate by query type: Which types of queries get cache hits? Which never do? This tells you where caching helps.
Cost savings: Measure how much you’re saving on LLM calls due to caching. This justifies the complexity.
Latency improvements: Compare latency for cached vs. uncached requests. Caching should be significantly faster.
Staleness incidents: Track how often cached responses are wrong or outdated. If this is frequent, your TTLs are too long or invalidation is insufficient.
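Instrumentation doesn't need to be elaborate; a few counters around the cache lookup answer most of these questions. The query-type labels and per-call cost below are illustrative placeholders.

```python
from collections import defaultdict

stats = defaultdict(lambda: {"hits": 0, "misses": 0, "stale_reports": 0})
EST_COST_PER_LLM_CALL = 0.004   # illustrative average cost, in dollars

def record(query_type: str, hit: bool) -> None:
    stats[query_type]["hits" if hit else "misses"] += 1

def report() -> None:
    for qtype, s in stats.items():
        total = s["hits"] + s["misses"]
        hit_rate = s["hits"] / total if total else 0.0
        savings = s["hits"] * EST_COST_PER_LLM_CALL
        print(f"{qtype}: hit rate {hit_rate:.1%}, est. savings ${savings:.2f}, "
              f"stale reports {s['stale_reports']}")
```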
When Not to Cache
Caching isn’t always beneficial.
Highly dynamic queries: If every query is unique and context-dependent, caching won’t help. Don’t add complexity for negligible hit rates.
Real-time requirements: If you need up-to-the-second data (stock prices, live events), caching introduces unacceptable staleness.
Privacy-sensitive applications: If responses contain PII or sensitive information, caching creates security risks. Even per-user caching might violate compliance requirements.
Low-traffic applications: If you only get a few requests per hour, cache hit rates will be low. The overhead of caching infrastructure might not be worth it.
Storage Considerations
Caching requires storage, which has costs and limits.
Cache size limits: Set maximum cache sizes. Evict old or least-used entries when limits are reached (LRU eviction).
Cost of storage: Storing millions of cached responses (especially with embeddings for semantic search) gets expensive. Balance cache size against cost.
Cache warming: For high-traffic applications, pre-populate caches with common queries during low-traffic periods.
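The LRU eviction mentioned above is a few lines with an ordered dictionary; a minimal in-process sketch:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._store: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str) -> str | None:
        if key not in self._store:
            return None
        self._store.move_to_end(key)          # mark as recently used
        return self._store[key]

    def put(self, key: str, value: str) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)   # evict the least recently used entry
```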
Implementation Patterns
In-memory caching: Use Redis or Memcached for fast access. Good for high-traffic, low-latency requirements.
Database caching: Store cached responses in a database with indexing on query embeddings or text. Slower than in-memory but more durable.
Provider-level caching: Use caching features from LLM providers (like OpenAI’s prompt caching). Easiest to implement but less control.
Multi-tier caching: Combine in-memory (for hottest data), database (for warm data), and provider-level caching. Maximizes hit rates but adds complexity.
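A multi-tier lookup reads the fast tier first and backfills it on a hit from a slower tier. A sketch, with plain dicts standing in for the in-memory store and the database:

```python
l1_memory: dict[str, str] = {}     # hottest entries (stand-in for Redis)
l2_database: dict[str, str] = {}   # durable tier (stand-in for a DB table)

def lookup(key: str) -> str | None:
    if key in l1_memory:
        return l1_memory[key]
    if key in l2_database:
        l1_memory[key] = l2_database[key]   # promote to the fast tier
        return l2_database[key]
    return None                             # full miss: call the LLM, then store()

def store(key: str, response: str) -> None:
    l1_memory[key] = response
    l2_database[key] = response
```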
Practical Workflow
Start with simple caching: Exact-match cache with time-based expiration. Measure hit rates.
Add semantic caching if hit rates are low: Embed queries, use vector similarity search. Measure improvement.
Tune TTLs based on data: Monitor how often cached responses become stale. Adjust TTLs accordingly.
Implement invalidation: When data or models change, invalidate affected caches.
Monitor continuously: Track hit rates, cost savings, staleness incidents. Optimize based on real usage patterns.
What Good Looks Like
A well-designed caching strategy:
- Achieves meaningful cache hit rates (20%+ in most applications)
- Reduces costs and latency significantly
- Has appropriate invalidation to prevent stale data
- Monitors cache performance and tunes based on real usage
- Balances complexity against benefits
Caching LLM outputs is harder than caching traditional API responses, but the payoff is substantial. With the right strategies, you can cut costs by 30-50% and improve latency dramatically.
But don’t over-engineer. Start simple, measure impact, and add complexity only when it’s justified by real gains.