Prompt Caching — The Production Cost-Optimization Layer for LLM Applications
Prompt Caching
TL;DR: Prompt caching reuses LLM input tokens across requests so the model doesn’t have to re-process the same prefix on every call. Anthropic cache reads cost 10% of base input price (90% discount); OpenAI cached inputs run 75-90% cheaper. Combined caching strategies achieve 70-80% total cost reduction in production workloads. The 2026 production stack has three distinct mechanisms: prompt cache (vendor-level prefix reuse — Anthropic
cache_controlmarkers, OpenAI automatic), semantic cache (vector-similarity-based response reuse via Redis / GPTCache — typically 50%+ cost reduction with repetitive query patterns), and KV cache (model-internal attention key-value cache within a single inference; not a practitioner-facing control). Distinct from glossary/agentic-memory (cross-session memory persistence) — prompt caching is per-request cost optimization; agentic memory is persistent context.
Simple explanation
When you send a prompt to an LLM, the model processes every token in the input — system instructions, tool definitions, conversation history, the user’s question. For long prompts (especially in agent applications with extensive system instructions + tool definitions + RAG context), most of the input is the same across calls. The model re-processes those tokens every time, paying the input-token cost on each request.
Prompt caching solves this. The LLM provider caches the processed state of repeated prefixes. The next time you send a request with the same prefix, the cached state is reused — you pay a fraction of the normal input cost. The model still processes any new tokens fresh; only the repeated prefix is cached.
The economics are striking. For an agent with a 50,000-token system prompt that runs 1,000 times a day, prompt caching can cut the monthly input-token bill by 80-90% without changing any model behavior.
Why it matters for business
The cost structure of agentic applications is dominated by repeated context. Every agent invocation typically reloads:
- System instructions defining the agent’s role (often 1,000-10,000 tokens)
- Tool definitions describing available functions (often 5,000-20,000 tokens in production agents)
- RAG context with retrieved documents (often 10,000-50,000 tokens)
- Conversation history (variable, often >5,000 tokens for ongoing sessions)
- Few-shot examples for brand voice / format / style (often 2,000-10,000 tokens)
This repeated context is what prompt caching exploits. The cost asymmetry is real:
- Without caching: every call pays full input price for the full prefix
- With caching: the cached prefix costs 10% (Anthropic) or up to 90% less (OpenAI) on subsequent reads
Real-world results from 2026 production deployments: ProjectDiscovery reported 59% cost reduction post-implementation, growing to 70% after optimization. Multiple sources report 50-60% reductions as typical and 80-90% achievable with combined strategies.
The three distinct caching mechanisms
Practitioner conversations often conflate three different caching layers that operate independently:
1. Prompt cache (vendor-level prefix caching)
The mechanism most “prompt caching” discussions refer to. The LLM vendor caches the processed state of a prefix; subsequent requests with the same prefix get the cached state.
Anthropic Claude (cache_control markers):
- Explicit opt-in: mark cache breakpoints in the request with
cache_controlblocks - Two TTL options: 5-minute default (1.25× base input price to write; 0.1× to read) and 1-hour extended (2× to write; 0.1× to read)
- Cache pays off after 1 read at 5-min TTL or 2 reads at 1-hour TTL
- Anthropic quietly changed the default from 60 minutes to 5 minutes in early 2026, increasing many production workloads’ costs by 30-60% overnight — a real failure mode worth tracking when costs spike
- Workspace-level isolation as of Feb 5, 2026
- Cache references the full prefix: tools + system + messages, in that order, up to and including the block designated
cache_control
OpenAI (automatic prompt caching):
- Automatic on GPT-4o, GPT-4o mini, o1 family, GPT-5.x family, and fine-tuned variants
- No code changes needed — applied to any request with a matching common prefix
- GPT-5.5 cached input runs $0.50/M tokens (90% discount from regular)
- No additional fees beyond the cheaper cached-input rate
Google Gemini: explicit prompt caching API with similar mechanics; pricing varies by model.
2. Semantic cache (vector-similarity response reuse)
A different mechanism. Instead of caching the processed state of a prefix, semantic caching stores the full response keyed by the semantic meaning of the query. Subsequent queries with similar intent get the cached response.
Mechanics:
- Convert incoming queries into vector embeddings (typically 768 or 1,536 dimensions — see glossary/embeddings)
- Measure cosine similarity to cached query embeddings
- If similarity exceeds threshold (commonly 0.85-0.95), return the cached response
- If no match, call the LLM and store the new response
Production results: typical 50%+ cost reduction with repetitive query patterns. A 60% hit rate on a 1M-requests-per-day workload translates to roughly $846/month saved at H100 on-demand pricing.
2026 production landscape:
- Redis + RedisSemanticCache (LangChain integration) — most common production substrate
- GPTCache (open-source, integrates with LangChain and llama_index)
- Weaviate, Qdrant, Bifrost as vector-store substrates
- Azure Managed Redis for AI agents at production scale
The trade-off: semantic cache hits return responses in milliseconds (vs. seconds for fresh LLM generation) at substantial cost savings. The risk is false-positive cache hits — a query semantically similar but factually different to a cached one returns a stale or wrong answer. Threshold tuning is the practitioner’s craft.
3. KV cache (model-internal attention key-value cache)
Not a practitioner-facing control, but worth naming for clarity. Inside a single LLM inference, the model maintains an internal cache of attention key-value tensors so it doesn’t recompute attention for tokens it’s already processed. KV cache lives within a single request; prompt cache and semantic cache persist across requests.
The distinction matters because “caching” in LLM-engineering conversations often slides between layers without naming which one. KV cache is the model’s internal optimization; prompt cache is the vendor’s cross-request optimization; semantic cache is the application-level cross-request optimization.
When caching helps (and when it doesn’t)
Caching helps most when:
- Long, stable prompt prefixes (system + tools + few-shot examples) repeat across many requests
- High request volume per session (the cache amortizes)
- Tool definitions are extensive (caching them is high-leverage)
- RAG context with stable retrieval sets per session
- Agents that handle many small variations of similar tasks
Caching helps less when:
- Prompts are short or highly variable
- Request volume is low (5-minute TTL expires between requests, paying write cost twice)
- Every request has substantially different context
- The cache write cost (1.25-2× base) isn’t recovered by enough reads
Operational rule: at Anthropic’s 5-minute TTL, you need at least 1 cache hit per 5 minutes for the cache to pay for itself; at 1-hour TTL you need at least 2 cache hits per hour. Below those thresholds, caching costs more than it saves.
Proactive cache warming (the under-applied practice)
A critical 2026 practitioner finding: proactive cache warming is essential — never rely on parallel LLM calls to create their own caches. When multiple parallel calls hit a cold cache simultaneously, each pays the write cost; the cache exists multiple times redundantly.
The fix: before launching parallel processing, make a single dedicated call to warm the cache. The first call pays the write cost once; all subsequent parallel calls read at 10% of base price. This is operationally trivial and frequently overlooked.
Connection to wiki frameworks
- glossary/agent-engineering — Prompt caching is a core production-discipline element of Karpathy’s agent-engineering framing. Agents with extensive tool definitions and system instructions get the most leverage from caching.
- glossary/agentic-memory — Distinct but adjacent. Prompt caching is per-request cost optimization; agentic memory is cross-session context persistence. Both reduce work, but on different timescales and for different reasons.
- glossary/tool-use — Tool definitions are usually the second-largest cacheable component (after system prompt). Production agents with 20+ tools see disproportionate benefit from caching the tool layer.
- glossary/llm — The underlying technology that makes both prompt cache and semantic cache work.
- glossary/rag — Retrieved documents are cacheable when retrieval sets are stable within a session. Semantic caching also pairs well with RAG — same query intent → same retrieved set → same response cached.
- glossary/embeddings — Semantic caching is built on the same embedding mechanism RAG uses for retrieval.
- tools/claude-managed-agents — Managed-platform deployments inherit Anthropic’s
cache_controlmechanics; the 5-min default TTL change affected Managed-Agent workloads disproportionately. - glossary/advisor-strategy — The cheap-executor + expensive-advisor pattern composes with prompt caching: cache the advisor’s expensive context, call the cheap executor without cache for variable inputs.
- glossary/automation-eats-execution — Prompt caching is execution-layer cost optimization. The decision which prefixes to cache and at what TTL is strategy work that stays human-leveraged.
Honest limits
- The 5-min vs. 1-hour TTL choice is non-trivial. At Anthropic, 1-hour TTL costs 60% more to write than 5-min; the break-even depends on your read pattern. Misjudging the read pattern means paying more for caching than for non-caching.
- Cache TTL changes can silently inflate costs. The Anthropic February 2026 default-TTL change cost many production workloads 30-60% in cost increases overnight without code changes. Monitor cache-hit rates over time, not just at deployment.
- Semantic cache false positives are real. A 0.95 similarity threshold can still return wrong answers for factually distinct but semantically similar queries. Threshold tuning is application-specific and requires evaluation.
- Caching doesn’t fix bad prompts. Caching saves on input-token cost; it doesn’t help when the prompt is producing wrong outputs. Run quality measurement separately from cost measurement.
- Vendor lock-in compounds. Cache mechanics differ across Anthropic, OpenAI, Google. An application optimized for Anthropic’s
cache_controlmechanics requires re-engineering for OpenAI’s automatic caching, and vice versa. - The “70-80% reduction” headline numbers are upper-bound. Real workloads see 30-70% reductions depending on prompt structure, cache hit rate, and TTL choice. Treat vendor-published numbers as marketing-optimistic.
Related
- glossary/agent-engineering — Production-discipline context
- glossary/agentic-memory — Adjacent but distinct mechanism (cross-session persistence vs. per-request optimization)
- glossary/tool-use — Tool definitions are high-leverage cache targets
- glossary/llm — Underlying technology
- glossary/rag — Retrieval context is often cacheable; semantic caching pairs with RAG
- glossary/embeddings — Semantic caching builds on embedding similarity
- tools/claude-managed-agents — Managed-platform caching mechanics
- glossary/advisor-strategy — Cost-optimization pattern that composes with prompt caching
- glossary/automation-eats-execution — Caching is execution-layer optimization
Key Takeaways
- Prompt caching cuts input-token costs by 75-90% on repeated prefixes. Combined with other strategies, 70-80% total cost reduction is achievable in production.
- Three distinct mechanisms: prompt cache (vendor-level prefix reuse), semantic cache (vector-similarity response reuse), KV cache (model-internal). Conversations often conflate them.
- Anthropic uses explicit
cache_controlmarkers with 5-min default TTL (changed from 60-min in early 2026, increasing many production costs 30-60% overnight) and 1-hour extended option. Cache reads cost 10% of base price. - OpenAI applies prompt caching automatically on GPT-4o family, GPT-5.x family, and fine-tuned variants. GPT-5.5 cached input runs $0.50/M (90% discount).
- Semantic caching uses vector embedding similarity (0.85-0.95 threshold typical) to reuse responses for semantically similar queries. Typical 50%+ cost reduction with repetitive patterns. Redis + RedisSemanticCache is the dominant production substrate.
- Proactive cache warming is essential — never rely on parallel calls to create their own caches. Make a single dedicated warming call first; parallel calls then read the warmed cache at 10% of base price.
- The cache pays off after 1 read at 5-min TTL or 2 reads at 1-hour TTL for Anthropic. Below those thresholds, caching costs more than it saves.
- Prompt caching is per-request cost optimization; agentic memory is cross-session persistence. Both reduce work; they’re not interchangeable.
Sources
Anthropic:
- Prompt caching — Claude API Docs — official documentation
- Anthropic API Pricing in 2026: Complete Guide (Finout)
- Claude API Cache Pricing 2026: 90% Input Savings Explained (TokenMix)
- Claude Prompt Caching in 2026: The 5-Minute TTL Change That’s Costing You Money (DEV) — TTL change documentation
- Anthropic: Claude quota drain not caused by cache tweaks (The Register, April 2026) — context on the TTL controversy
OpenAI:
- Prompt Caching in the API (OpenAI) — original announcement
- Prompt caching — OpenAI API guide
- OpenAI API Pricing 2026 (DevTk.AI)
Semantic caching:
- What is semantic caching? Guide to faster, smarter LLM apps (Redis)
- Semantic Caching for LLM Inference (Spheron) — production setup guide
- Top LLM Gateways That Support Semantic Caching in 2026 (DEV)
- GPTCache (GitHub) — open-source implementation
- Azure Managed Redis for AI Agents (ITNEXT)
Combined-strategy cost optimization: