Your LLM API bill doesn’t have to scale linearly with usage. Prompt caching — the mechanism that lets providers skip re-processing tokens they’ve already seen — can cut your input token costs by 50–90%. But most implementations either miss the break-even threshold entirely or quietly destroy their own cache hits through subtle prompt mutations nobody notices until the bill arrives.
This guide covers the actual mechanics behind prompt caching LLM cost savings, how each provider’s approach differs architecturally, working implementation code, and the anti-patterns that drain your budget even when you think caching is enabled.
How Prompt Caching Works at the Infrastructure Level
When a language model processes a prompt, it computes key-value attention matrices for every token in the context window. These matrices are expensive. On a 100,000-token prompt, that computation alone can take over 11 seconds.
Prompt caching stores those KV matrices server-side. Subsequent requests with an identical prefix skip the computation entirely — the model picks up from the cached state and only processes new tokens.
The result: Anthropic reports up to 85% latency reduction for long prompts. A 100,000-token prompt drops time-to-first-token from 11.5 seconds to 2.4 seconds. Cost savings are separate and potentially even more significant.
The critical word is prefix. Caching only works on content that appears at the beginning of the prompt and stays byte-for-byte identical. Change a single character, add a timestamp, or reorder a JSON key, and the cached state is unusable. This single constraint explains most caching failures in production.
Provider Comparison at a Glance — Anthropic vs. OpenAI vs. Google
| Feature | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Cache trigger | Explicit cache_control | Automatic | Implicit (auto) or explicit (manual) |
| Min tokens | 1,024–4,096 (by model) | 1,024 | 32,768 (explicit only) |
| Cache read discount | 90% off | 50% off | 90% off (Gemini 2.5) |
| Write cost premium | 1.25x–2x | None | Storage: $4.50/MTok/hr |
| TTL | 5 min or 1 hour | Auto / up to 24 hr (extended) | User-defined (explicit) |
| Storage fees | No | No | Yes (explicit only) |
Each provider made a fundamentally different design decision about who controls cache placement. Anthropic puts it in your hands. OpenAI handles it automatically. Google offers both modes, each with distinct tradeoffs.
Implementing Prompt Caching with Anthropic Claude
Anthropic’s caching is explicit — you tell the API exactly where to draw the cache boundary using cache_control markers.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer. " + large_codebase_context,
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[
        {"role": "user", "content": user_question}
    ]
)

# Verify it worked
print(response.usage.cache_creation_input_tokens)  # Non-zero on first call
print(response.usage.cache_read_input_tokens)      # Non-zero on subsequent calls
```
You can place up to 4 cache breakpoints per request. Each breakpoint tells Anthropic: cache everything up to this point.
TTL tiers matter more than most developers realize:
– "type": "ephemeral" = 5-minute TTL at 1.25x the base write cost, reads at 0.1x (90% off)
– For the 1-hour TTL variant, the write cost jumps to 2x the base price
Cache reads at 0.1x base price means once you’ve paid the write premium, subsequent reads are extremely cheap. The break-even section below covers when each tier pays off.
The February 2026 workspace isolation change
As of February 5, 2026, Anthropic moved caching from organization-level to workspace-level isolation. If your architecture spans multiple workspaces — common in enterprise setups or multi-tenant SaaS products — cached prefixes no longer cross workspace boundaries. Amazon Bedrock and Google Vertex AI deployments retain organization-level isolation, so the behavior differs depending on your deployment environment.
Minimum token thresholds also vary by model: Claude Sonnet 4/4.5 and Opus 4/4.1 require 1,024 tokens; Opus 4.5/4.6 and Haiku 4.5 require 4,096. Prompts under the threshold silently miss the cache with no error — the request runs at full price.
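Because sub-threshold prompts fail silently, a cheap pre-flight check can flag them before you deploy. The sketch below uses the rough chars-per-token heuristic and the threshold table from this section — both the heuristic and the exact model-name keys are illustrative assumptions; use your provider's tokenizer or token-counting endpoint for exact counts.

```python
# Rough pre-flight check: prompts below the model's minimum silently skip the cache.
# The chars/4 heuristic and the model-name keys are illustrative assumptions —
# use your provider's token-counting endpoint for exact numbers.
MIN_CACHE_TOKENS = {
    "claude-sonnet-4-5": 1024,
    "claude-opus-4-1": 1024,
    "claude-opus-4-5": 4096,
    "claude-haiku-4-5": 4096,
}

def estimate_tokens(text: str) -> int:
    # ~4 characters per token is a common rough estimate for English text
    return len(text) // 4

def likely_cacheable(model: str, prefix_text: str) -> bool:
    # Unknown models fall back to the stricter 4,096-token limit
    threshold = MIN_CACHE_TOKENS.get(model, 4096)
    return estimate_tokens(prefix_text) >= threshold
```

An ~8,000-character prefix (~2,000 tokens) would pass the Sonnet threshold but silently miss on Haiku 4.5 — exactly the kind of gap this check surfaces.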
Implementing Prompt Caching with OpenAI
OpenAI took the opposite design philosophy: you configure nothing. Any prompt with a prefix of 1,024+ tokens is automatically eligible. No cache_control markers, no TTL settings, no storage fees.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},
        {"role": "user", "content": user_query}
    ]
)

# Check cache status
usage = response.usage
print(usage.prompt_tokens_details.cached_tokens)  # Non-zero if cache hit
```
OpenAI caches in 128-token increments and extends the window on each hit. The default TTL is short (roughly 5–60 minutes), but you can opt into extended caching up to 24 hours — useful for batch jobs or applications with predictable traffic patterns.
The 50% discount on cached input tokens is automatic. Tool definitions and images in the prompt prefix are also cacheable, which matters for function-calling-heavy applications.
The tradeoff: because you have no explicit control, you can’t force a cache write at a specific point or inspect cache health proactively. You’re relying on OpenAI’s heuristics.
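What you can do is measure the discount you're actually receiving. A minimal sketch, assuming dict-shaped usage payloads with the `prompt_tokens` and `prompt_tokens_details.cached_tokens` fields shown above:

```python
def cache_hit_rate(usages) -> float:
    """Fraction of prompt tokens served from cache across a batch of responses.

    `usages` is an iterable of dicts mirroring OpenAI's usage payload:
    {"prompt_tokens": int, "prompt_tokens_details": {"cached_tokens": int}}.
    """
    total = cached = 0
    for u in usages:
        total += u["prompt_tokens"]
        cached += u["prompt_tokens_details"]["cached_tokens"]
    return cached / total if total else 0.0
```

Logging this ratio per route makes regressions visible: a deploy that mutates your prompt prefix shows up as a sudden drop, even though no request ever errors.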
Implementing Context Caching with Google Gemini
Google offers two distinct modes, and choosing the wrong one can cost you more than not caching at all.
Implicit caching (Gemini 2.5+ default) works like OpenAI — automatic, no configuration, no storage cost. The model identifies reused prefixes and applies a discount. For most developers, this is the right starting point.
Explicit caching gives you API-level control with a user-defined TTL. But it comes with ongoing storage costs.
```python
import google.generativeai as genai
import datetime

# Create explicit cache
cache = genai.caching.CachedContent.create(
    model="gemini-2.5-pro",
    contents=[large_document_content],
    ttl=datetime.timedelta(hours=1),
    display_name="product-docs-cache"
)

# Use the cache
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize the key findings")
```
The minimum for explicit caching is 32,768 tokens — roughly 25,000 words. The storage math is unforgiving: Google charges $4.50 per million tokens per hour. A 10M-token cache runs $45/hour, or $1,080/day continuously. Gemini 2.5 Pro brings cached reads down from $1.25 to $0.13 per million tokens (90% discount), but you need enough read volume to offset storage overhead.
Explicit caching is only viable at high, sustained request volumes against large, stable document sets — think legal research platforms or enterprise RAG systems with thousands of queries per hour against the same corpus. Everyone else should use implicit caching.
The Three-Layer Prompt Architecture
Most teams structure their prompts wrong for caching. Dynamic content near the top means the cacheable prefix is tiny and cache hit rates are terrible.
The pattern that works:
Layer 1 — Static System Layer (always cached)
System instructions, persona definition, tool/function schemas. This never changes between requests. Mark it with a cache breakpoint. Aim for several thousand tokens.
Layer 2 — Session-Stable Reference Layer (cached with TTL)
RAG chunks, retrieved documents, conversation history summaries, reference codebases. Changes per session but not per turn. Cache with a shorter TTL that matches your session length.
Layer 3 — Dynamic Query Layer (never cached)
The current user message, real-time data, timestamps, per-request personalization. Keep this minimal. It always goes at the end.
The failure mode is injecting dynamic context early in the system prompt — which truncates the cacheable prefix at the first mutable byte and leaves hit rates near zero.
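The three layers map directly onto Anthropic's system-block format from earlier. This helper is a hypothetical sketch — the function name is ours, but the payload shape follows the documented `cache_control` format:

```python
def build_layered_request(static_system: str, session_context: str,
                          user_message: str) -> dict:
    """Assemble messages.create kwargs with the three-layer structure.

    Layers 1 and 2 each end in a cache breakpoint; Layer 3 goes last
    and is never cached.
    """
    return {
        "system": [
            {   # Layer 1: instructions, persona, tool schemas — stable across sessions
                "type": "text",
                "text": static_system,
                "cache_control": {"type": "ephemeral"},
            },
            {   # Layer 2: RAG chunks, history summaries — stable within a session
                "type": "text",
                "text": session_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Layer 3: the current query, timestamps, per-request data — always last
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because Layer 2 sits after Layer 1's breakpoint, swapping session context invalidates only the second cache segment; the static system layer keeps hitting.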
For agentic workflows, this layered prompt architecture is especially critical. Every agent turn reprocesses the system prompt unless it’s explicitly cached. On a 1,225-step task with 67.5 million input tokens, the difference between an uncached and cached system prompt isn’t marginal — it’s the entire cost basis of the operation.
The Break-Even Calculator — When Caching Saves Money vs. When It Costs More
Caching always costs something upfront. Here’s the math:
Anthropic (5-minute TTL):
– Write cost: 1.25x base
– Read cost: 0.1x base
– Break-even: a single cache read (the 0.25x write premium is well under the 0.9x saved per read)
– If a prompt is called twice or more within 5 minutes, caching is clearly profitable
Anthropic (1-hour TTL):
– Write cost: 2x base
– Read cost: 0.1x base
– Break-even: after 2 cache reads
– Best for: prompts called repeatedly over longer windows — batch processing, agentic loops
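The tier math above can be checked numerically. A minimal sketch in base-price units, using the multipliers from this section:

```python
def anthropic_cache_cost(reads: int, write_mult: float,
                         read_mult: float = 0.1) -> float:
    """Total cost in base-price units: one cached write plus `reads` cached reads."""
    return write_mult + reads * read_mult

def uncached_cost(calls: int) -> float:
    """The same number of calls with no caching, each at full base price."""
    return float(calls)

# 5-minute tier (1.25x write): profitable from the very first reuse
# write + 1 read = 1.35 vs 2.0 uncached
# 1-hour tier (2x write): one read isn't enough (2.1 vs 2.0);
# two reads pull ahead (2.2 vs 3.0)
```

Run side by side, the 1-hour tier only makes sense when reuse is spread out enough that the 5-minute cache would expire between calls — otherwise the cheaper write premium wins.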
Google Explicit Caching:
Break-even read volume (MTok/hour) = hourly storage cost / (base read price − cached read price)
At 10M tokens with Gemini 2.5 Pro: storage is $45/hr and each full-context read saves $11.20 ($1.12/MTok × 10 MTok) — you need roughly 4 full-context reads per hour just to cover storage. Below that volume, implicit caching wins.
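That calculation generalizes. A sketch using the Gemini 2.5 Pro figures quoted above as defaults (pass your own prices for other models or tiers):

```python
def gemini_breakeven_reads_per_hour(cache_mtok: float,
                                    storage_per_mtok_hr: float = 4.50,
                                    base_read: float = 1.25,
                                    cached_read: float = 0.13) -> float:
    """Full-context reads per hour needed for read savings to cover storage."""
    hourly_storage = cache_mtok * storage_per_mtok_hr
    saving_per_read = cache_mtok * (base_read - cached_read)
    return hourly_storage / saving_per_read
```

Note that the cache size cancels out of the ratio: at these prices the threshold is ~4 full-context reads per hour whether the cache holds 1M or 10M tokens. What changes with size is the absolute dollar amount at risk if traffic dips.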
When caching backfires:
– Traffic spikes that don’t sustain (cache expires before break-even)
– Highly personalized prompts where the “stable” prefix is only a few hundred tokens
– Prompts under the minimum token threshold (silent miss, full-price charge)
– Long Google explicit TTLs on low-traffic routes
A 2025 analysis of 86,000 developers found that 40–60% of LLM API budgets go to operational inefficiencies rather than necessary model usage — caching misconfiguration is a significant contributor.
Cache Invalidation Anti-Patterns That Silently Destroy Your Savings
These issues generate no errors. The cache misses, and you pay full price.
Non-deterministic JSON serialization. If you’re serializing tool schemas into your system prompt, Python’s json.dumps() emits keys in whatever insertion order each dict happens to have. Two calls with the same logical content but keys assembled in a different order produce different byte sequences — and different cache keys.
```python
# Wrong — key order depends on how the dict was built
schema_str = json.dumps(tool_schema)

# Right — deterministic output
schema_str = json.dumps(tool_schema, sort_keys=True)
```
Timestamps inside cached blocks. Adding "generated_at": datetime.now() anywhere in a cached prefix — even buried in a comment — invalidates the cache on every call. Move all dynamic fields below your last cache breakpoint.
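The fix is verifiable: hash the would-be cached prefix across calls and confirm it never changes. A sketch (the helper names are illustrative, not any provider's API) with the timestamp confined to the uncached user turn:

```python
import hashlib
import json
from datetime import datetime, timezone

def prefix_fingerprint(system_text: str) -> str:
    # Hash of the would-be cached prefix; any byte difference means a cache miss
    return hashlib.sha256(system_text.encode()).hexdigest()

STATIC_SYSTEM = "You are a support agent. Follow the policies below.\n<policy text>"

def build_prompt(user_query: str) -> tuple[str, str]:
    # The timestamp lives in the dynamic user turn, never in the cached system block
    stamped_query = json.dumps(
        {"query": user_query,
         "generated_at": datetime.now(timezone.utc).isoformat()},
        sort_keys=True,
    )
    return STATIC_SYSTEM, stamped_query

sys_a, _ = build_prompt("reset my password")
sys_b, _ = build_prompt("cancel my order")
# The cacheable prefix is byte-identical across calls even though the query
# and timestamp both changed
assert prefix_fingerprint(sys_a) == prefix_fingerprint(sys_b)
```

Had `generated_at` been interpolated into `STATIC_SYSTEM` instead, the two fingerprints would differ on every call — a full-price miss with no error.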
Tool definition changes. Updating a function’s description or adding a parameter changes the prefix. During development, this is expected, but in production it means your cache resets on every deploy. Version your tool definitions and warm the cache after deployments.
Model switching mid-session. Cached KV states are model-specific. Switching between claude-sonnet-4-5 and claude-opus-4 mid-session doesn’t reuse the cache — each model has its own representation.
Modifying tool choice settings. Changing tool_choice between calls (e.g., auto vs. a forced tool name) can affect what’s considered part of the cacheable prefix.
Real-World Benchmark — ProjectDiscovery’s 7% to 84% Cache Hit Rate
ProjectDiscovery’s engineering team documented a caching overhaul worth studying because the problem they solved is one most agentic application developers face.
Their AI-powered security scanning tool runs long agentic workflows — hundreds to thousands of steps — with the same system prompt repeated on every agent turn. At a 7% cache hit rate, they were paying near-full price on almost every token.
The fix wasn’t implementing caching — they already had it. The fix was where dynamic content lived in the prompt. They were injecting working memory, runtime context, and intermediate results near the top of the system prompt. Moving this content to the message tail, after all cache breakpoints, pushed hit rates to 84%.
On their most demanding task (1,225 steps, 67.5 million input tokens), they achieved a 91.8% cache rate and reduced total LLM costs by 59–70%.
Architecture, not configuration, determines cache efficiency.
How to Verify Caching Is Actually Working
This step gets skipped constantly. Developers add cache markers, deploy, assume it’s working. It often isn’t.
Anthropic — check usage fields on every response:
```python
response = client.messages.create(...)
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Regular input: {response.usage.input_tokens}")
```
On the first call, cache_creation_input_tokens should be non-zero. On subsequent calls with the same prefix, cache_read_input_tokens should be non-zero and cache_creation_input_tokens should be zero. If both fields are zero on repeated calls, your prompt is mutating between requests.
OpenAI:
```python
cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")
```
Google Gemini explicit: Look for cached_content_token_count in the response metadata. If it’s absent, the cache wasn’t hit.
Debugging checklist when cache hits are zero:
1. Serialize your prompt to a string and compare SHA hashes across two consecutive calls — any diff reveals what’s changing
2. Confirm you’re above the minimum token threshold
3. Test with rapid back-to-back requests to rule out TTL expiry
4. For Anthropic: verify you’re in the correct workspace (post-February 2026 isolation change matters here)
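Step 1 of the checklist is worth automating. A small sketch that compares the exact serialized prompts from two consecutive calls and, when they differ, returns the unified-diff lines showing what mutated:

```python
import difflib
import hashlib

def prompt_diff(prompt_a: str, prompt_b: str) -> list[str]:
    """Return [] if the two serialized prompts are byte-identical,
    otherwise the unified-diff lines pinpointing the mutation."""
    if hashlib.sha256(prompt_a.encode()).digest() == \
       hashlib.sha256(prompt_b.encode()).digest():
        return []
    return list(difflib.unified_diff(prompt_a.splitlines(),
                                     prompt_b.splitlines(),
                                     lineterm=""))
```

Run it on the exact strings you send — a sneaky timestamp or reordered JSON key surfaces in seconds instead of showing up on next month's invoice.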
Once your cache hit rate is verified and stable, combining prompt caching with Anthropic’s Batch API can reduce effective API spend by up to 95% on eligible workloads.
Getting Prompt Caching Right
Prompt caching LLM cost savings are real — but they require understanding each provider’s architecture, not merely copying a code snippet. The developers seeing 90% cost reductions aren’t necessarily calling the API differently; they’re structuring prompts so that the stable, expensive-to-process content is always at the front, always identical, and always within TTL.
Start with one provider, verify hits are occurring using the usage response fields, and run the break-even math before committing to any caching architecture. If you’re building agentic workflows, the three-layer prompt structure isn’t a nice-to-have — it’s the difference between caching being marginally useful and being the biggest LLM cost lever in your stack.
Pick the provider you’re already using, implement caching this week, and check your usage dashboard after 48 hours. The token counts don’t lie.