Skip to content

Prompt Caching — The Production Cost-Optimization Layer for LLM Applications

Prompt Caching

TL;DR: Prompt caching reuses LLM input tokens across requests so the model doesn’t have to re-process the same prefix on every call. Anthropic cache reads cost 10% of base input price (90% discount); OpenAI cached inputs run 75-90% cheaper. Combined caching strategies achieve 70-80% total cost reduction in production workloads. The 2026 production stack has three distinct mechanisms: prompt cache (vendor-level prefix reuse — Anthropic cache_control markers, OpenAI automatic), semantic cache (vector-similarity-based response reuse via Redis / GPTCache — typically 50%+ cost reduction with repetitive query patterns), and KV cache (model-internal attention key-value cache within a single inference; not a practitioner-facing control). Distinct from glossary/agentic-memory (cross-session memory persistence) — prompt caching is per-request cost optimization; agentic memory is persistent context.

Simple explanation

When you send a prompt to an LLM, the model processes every token in the input — system instructions, tool definitions, conversation history, the user’s question. For long prompts (especially in agent applications with extensive system instructions + tool definitions + RAG context), most of the input is the same across calls. The model re-processes those tokens every time, paying the input-token cost on each request.

Prompt caching solves this. The LLM provider caches the processed state of repeated prefixes. The next time you send a request with the same prefix, the cached state is reused — you pay a fraction of the normal input cost. The model still processes any new tokens fresh; only the repeated prefix is cached.

The economics are striking. For an agent with a 50,000-token system prompt that runs 1,000 times a day, prompt caching can cut the monthly input-token bill by 80-90% without changing any model behavior.

Why it matters for business

The cost structure of agentic applications is dominated by repeated context. Every agent invocation typically reloads:

  • System instructions defining the agent’s role (often 1,000-10,000 tokens)
  • Tool definitions describing available functions (often 5,000-20,000 tokens in production agents)
  • RAG context with retrieved documents (often 10,000-50,000 tokens)
  • Conversation history (variable, often >5,000 tokens for ongoing sessions)
  • Few-shot examples for brand voice / format / style (often 2,000-10,000 tokens)

This repeated context is what prompt caching exploits. The cost asymmetry is real:

  • Without caching: every call pays full input price for the full prefix
  • With caching: the cached prefix costs 10% (Anthropic) or up to 90% less (OpenAI) on subsequent reads

Real-world results from 2026 production deployments: ProjectDiscovery reported 59% cost reduction post-implementation, growing to 70% after optimization. Multiple sources report 50-60% reductions as typical and 80-90% achievable with combined strategies.

The three distinct caching mechanisms

Practitioner conversations often conflate three different caching layers that operate independently:

1. Prompt cache (vendor-level prefix caching)

The mechanism most “prompt caching” discussions refer to. The LLM vendor caches the processed state of a prefix; subsequent requests with the same prefix get the cached state.

Anthropic Claude (cache_control markers):

  • Explicit opt-in: mark cache breakpoints in the request with cache_control blocks
  • Two TTL options: 5-minute default (1.25× base input price to write; 0.1× to read) and 1-hour extended (2× to write; 0.1× to read)
  • Cache pays off after 1 read at 5-min TTL or 2 reads at 1-hour TTL
  • Anthropic quietly changed the default from 60 minutes to 5 minutes in early 2026, increasing many production workloads’ costs by 30-60% overnight — a real failure mode worth tracking when costs spike
  • Workspace-level isolation as of Feb 5, 2026
  • Cache references the full prefix: tools + system + messages, in that order, up to and including the block designated cache_control

OpenAI (automatic prompt caching):

  • Automatic on GPT-4o, GPT-4o mini, o1 family, GPT-5.x family, and fine-tuned variants
  • No code changes needed — applied to any request with a matching common prefix
  • GPT-5.5 cached input runs $0.50/M tokens (90% discount from regular)
  • No additional fees beyond the cheaper cached-input rate

Google Gemini: explicit prompt caching API with similar mechanics; pricing varies by model.

2. Semantic cache (vector-similarity response reuse)

A different mechanism. Instead of caching the processed state of a prefix, semantic caching stores the full response keyed by the semantic meaning of the query. Subsequent queries with similar intent get the cached response.

Mechanics:

  • Convert incoming queries into vector embeddings (typically 768 or 1,536 dimensions — see glossary/embeddings)
  • Measure cosine similarity to cached query embeddings
  • If similarity exceeds threshold (commonly 0.85-0.95), return the cached response
  • If no match, call the LLM and store the new response

Production results: typical 50%+ cost reduction with repetitive query patterns. A 60% hit rate on a 1M-requests-per-day workload translates to roughly $846/month saved at H100 on-demand pricing.

2026 production landscape:

  • Redis + RedisSemanticCache (LangChain integration) — most common production substrate
  • GPTCache (open-source, integrates with LangChain and llama_index)
  • Weaviate, Qdrant, Bifrost as vector-store substrates
  • Azure Managed Redis for AI agents at production scale

The trade-off: semantic cache hits return responses in milliseconds (vs. seconds for fresh LLM generation) at substantial cost savings. The risk is false-positive cache hits — a query semantically similar but factually different to a cached one returns a stale or wrong answer. Threshold tuning is the practitioner’s craft.

3. KV cache (model-internal attention key-value cache)

Not a practitioner-facing control, but worth naming for clarity. Inside a single LLM inference, the model maintains an internal cache of attention key-value tensors so it doesn’t recompute attention for tokens it’s already processed. KV cache lives within a single request; prompt cache and semantic cache persist across requests.

The distinction matters because “caching” in LLM-engineering conversations often slides between layers without naming which one. KV cache is the model’s internal optimization; prompt cache is the vendor’s cross-request optimization; semantic cache is the application-level cross-request optimization.

When caching helps (and when it doesn’t)

Caching helps most when:

  • Long, stable prompt prefixes (system + tools + few-shot examples) repeat across many requests
  • High request volume per session (the cache amortizes)
  • Tool definitions are extensive (caching them is high-leverage)
  • RAG context with stable retrieval sets per session
  • Agents that handle many small variations of similar tasks

Caching helps less when:

  • Prompts are short or highly variable
  • Request volume is low (5-minute TTL expires between requests, paying write cost twice)
  • Every request has substantially different context
  • The cache write cost (1.25-2× base) isn’t recovered by enough reads

Operational rule: at Anthropic’s 5-minute TTL, you need at least 1 cache hit per 5 minutes for the cache to pay for itself; at 1-hour TTL you need at least 2 cache hits per hour. Below those thresholds, caching costs more than it saves.

Proactive cache warming (the under-applied practice)

A critical 2026 practitioner finding: proactive cache warming is essential — never rely on parallel LLM calls to create their own caches. When multiple parallel calls hit a cold cache simultaneously, each pays the write cost; the cache exists multiple times redundantly.

The fix: before launching parallel processing, make a single dedicated call to warm the cache. The first call pays the write cost once; all subsequent parallel calls read at 10% of base price. This is operationally trivial and frequently overlooked.

Connection to wiki frameworks

  • glossary/agent-engineering — Prompt caching is a core production-discipline element of Karpathy’s agent-engineering framing. Agents with extensive tool definitions and system instructions get the most leverage from caching.
  • glossary/agentic-memory — Distinct but adjacent. Prompt caching is per-request cost optimization; agentic memory is cross-session context persistence. Both reduce work, but on different timescales and for different reasons.
  • glossary/tool-use — Tool definitions are usually the second-largest cacheable component (after system prompt). Production agents with 20+ tools see disproportionate benefit from caching the tool layer.
  • glossary/llm — The underlying technology that makes both prompt cache and semantic cache work.
  • glossary/rag — Retrieved documents are cacheable when retrieval sets are stable within a session. Semantic caching also pairs well with RAG — same query intent → same retrieved set → same response cached.
  • glossary/embeddings — Semantic caching is built on the same embedding mechanism RAG uses for retrieval.
  • tools/claude-managed-agents — Managed-platform deployments inherit Anthropic’s cache_control mechanics; the 5-min default TTL change affected Managed-Agent workloads disproportionately.
  • glossary/advisor-strategy — The cheap-executor + expensive-advisor pattern composes with prompt caching: cache the advisor’s expensive context, call the cheap executor without cache for variable inputs.
  • glossary/automation-eats-execution — Prompt caching is execution-layer cost optimization. The decision which prefixes to cache and at what TTL is strategy work that stays human-leveraged.

Honest limits

  • The 5-min vs. 1-hour TTL choice is non-trivial. At Anthropic, 1-hour TTL costs 60% more to write than 5-min; the break-even depends on your read pattern. Misjudging the read pattern means paying more for caching than for non-caching.
  • Cache TTL changes can silently inflate costs. The Anthropic February 2026 default-TTL change cost many production workloads 30-60% in cost increases overnight without code changes. Monitor cache-hit rates over time, not just at deployment.
  • Semantic cache false positives are real. A 0.95 similarity threshold can still return wrong answers for factually distinct but semantically similar queries. Threshold tuning is application-specific and requires evaluation.
  • Caching doesn’t fix bad prompts. Caching saves on input-token cost; it doesn’t help when the prompt is producing wrong outputs. Run quality measurement separately from cost measurement.
  • Vendor lock-in compounds. Cache mechanics differ across Anthropic, OpenAI, Google. An application optimized for Anthropic’s cache_control mechanics requires re-engineering for OpenAI’s automatic caching, and vice versa.
  • The “70-80% reduction” headline numbers are upper-bound. Real workloads see 30-70% reductions depending on prompt structure, cache hit rate, and TTL choice. Treat vendor-published numbers as marketing-optimistic.

Key Takeaways

  • Prompt caching cuts input-token costs by 75-90% on repeated prefixes. Combined with other strategies, 70-80% total cost reduction is achievable in production.
  • Three distinct mechanisms: prompt cache (vendor-level prefix reuse), semantic cache (vector-similarity response reuse), KV cache (model-internal). Conversations often conflate them.
  • Anthropic uses explicit cache_control markers with 5-min default TTL (changed from 60-min in early 2026, increasing many production costs 30-60% overnight) and 1-hour extended option. Cache reads cost 10% of base price.
  • OpenAI applies prompt caching automatically on GPT-4o family, GPT-5.x family, and fine-tuned variants. GPT-5.5 cached input runs $0.50/M (90% discount).
  • Semantic caching uses vector embedding similarity (0.85-0.95 threshold typical) to reuse responses for semantically similar queries. Typical 50%+ cost reduction with repetitive patterns. Redis + RedisSemanticCache is the dominant production substrate.
  • Proactive cache warming is essential — never rely on parallel calls to create their own caches. Make a single dedicated warming call first; parallel calls then read the warmed cache at 10% of base price.
  • The cache pays off after 1 read at 5-min TTL or 2 reads at 1-hour TTL for Anthropic. Below those thresholds, caching costs more than it saves.
  • Prompt caching is per-request cost optimization; agentic memory is cross-session persistence. Both reduce work; they’re not interchangeable.

Sources

Anthropic:

OpenAI:

Semantic caching:

Combined-strategy cost optimization: