Prompt caching complete guide 2026: cut LLM API costs by 50–90%
Updated May 23, 2026 · By Byron Malone
TL;DR: Prompt caching is typically the highest-ROI LLM cost optimization: 2–4 hours of engineering investment that can reduce monthly API costs by 50–90% on system-prompt-heavy or document-context applications. The core mechanic: instead of billing the full standard input rate every time you send a large repeated prefix, the provider charges a much lower “read” rate on cache hits. Anthropic's Claude charges $0.30/M on cache reads vs. $3.00/M standard, a 90% discount. OpenAI charges $2.50/M cached vs. $5.00/M standard, 50% off. OpenAI requires no code change; Anthropic requires explicit cache_control markers.
What prompt caching actually does: the KV cache explained
When a transformer model processes an input sequence, it computes attention key-value (KV) pairs for every token in the input. These KV tensors are expensive to compute, and the work grows with the length of the input. On every uncached request, the model recomputes the KV pairs for the full input from scratch, including your system prompt and any static document context.
Prompt caching stores the computed KV tensors for a designated prefix segment on the provider's infrastructure. When a subsequent request arrives with the same prefix, the provider loads the cached KV tensors instead of recomputing them. The model “remembers” the prefix without recalculating it. This is why cache reads are so much cheaper: the expensive compute has already been done and the result is stored.
The economics are driven by the compute-to-storage tradeoff: the provider pays to compute the KV tensors once (the cache write), stores them in high-speed memory (the ongoing storage cost during the TTL), and then retrieves rather than recomputes on cache hits. The cache write premium ($3.75/M vs. $3/M for Claude Sonnet — a 25% premium) reflects the compute and storage cost of the write. The dramatic cache read discount ($0.30/M) reflects the fact that the expensive compute has already been amortized.
Caching is a prefix-match mechanism, not a content fingerprint system. The cached segment must be the first part of the input — you cannot cache an arbitrary middle section. This means your static content (system prompt, fixed document context, tool schemas) must come first, before the dynamic per-request content (user query, conversation history).
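The prefix-match rule can be illustrated with a toy sketch. It compares hypothetical token lists and counts how many leading tokens two requests share; real providers tokenize differently and cache at marked block boundaries rather than at arbitrary token positions, so treat this only as an illustration of why ordering matters.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], incoming: Sequence[str]) -> int:
    """Count how many leading tokens of `incoming` match `cached`."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token lists, for illustration only.
static_prefix = ["<system>", "You", "are", "a", "contract", "analyst", "."]
request_a = static_prefix + ["Summarize", "clause", "4"]
request_b = static_prefix + ["List", "the", "termination", "terms"]
# A dynamic value placed first makes the prefix diverge at token 0.
request_c = ["2026-05-23T10:00"] + static_prefix + ["Summarize", "clause", "4"]

reusable_b = shared_prefix_len(request_a, request_b)
reusable_c = shared_prefix_len(request_a, request_c)
print(reusable_b, reusable_c)
```

Request B shares the whole static prefix with request A, while request C shares nothing, even though it contains exactly the same static text one position later.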
Provider comparison: Anthropic vs. OpenAI vs. Google Gemini
The three major providers implement caching with meaningfully different mechanics. Understanding the differences determines which optimization applies to your application.
- Anthropic Claude — explicit cache_control. You mark content blocks in your messages array with a cache_control parameter ({"type": "ephemeral"}). Anthropic caches the KV tensors for the marked prefix. Cache write: $3.75/M (25% premium over standard $3/M input). Cache read: $0.30/M (90% discount vs. $3/M standard). TTL: 5 minutes from the last cache hit (refreshes on every hit). Minimum cacheable length: 1,024 tokens for Sonnet and Opus; 2,048 for Haiku. Up to 4 cache_control breakpoints per request. You can measure hit rate from the API response: usage.cache_read_input_tokens and usage.cache_creation_input_tokens.
- OpenAI GPT-4o — automatic caching. No code change required. OpenAI detects repeated input prefixes across requests from the same API key and applies a 50% discount automatically. Standard: $5/M. Cached: $2.50/M. No write premium. TTL: approximately 5–10 minutes (not officially documented). Minimum prefix: approximately 1,024 tokens. Measure via usage.prompt_tokens_details.cached_tokens in the response. The no-write-premium model means caching is net positive from the very first cache hit — no break-even calculation needed.
- Google Gemini — implicit + explicit context_cache. Implicit caching happens automatically like OpenAI's model. The explicit context_cache API lets you create a named cache object with client.caches.create(), specify the content and TTL (default 60 minutes), and pass the cache name in subsequent requests. Minimum for explicit caching: 32,768 tokens (32K) — significantly higher than Anthropic's and OpenAI's 1,024-token minimum. Storage cost: $1/M tokens per hour while the cache is live. Longer TTL (60 min vs. 5 min) is more forgiving for low-traffic applications.
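To make the pricing differences concrete, here is a minimal sketch that prices a cached prefix on a miss and on a hit, using only the per-million-token figures quoted in this comparison. The `CachePricing` class and `prefix_cost` function are hypothetical names for illustration. Gemini is left out because this guide quotes only its storage cost, not per-token rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePricing:
    """Prices in USD per million input tokens, as quoted in this guide."""
    standard: float
    write: float
    read: float


# Figures copied from the comparison above; verify them before relying on them.
PRICING = {
    "anthropic_sonnet": CachePricing(standard=3.00, write=3.75, read=0.30),
    "openai_gpt4o": CachePricing(standard=5.00, write=5.00, read=2.50),
}


def prefix_cost(p: CachePricing, prefix_tokens: int, hit: bool) -> float:
    """Cost in USD of the cacheable prefix for a single request."""
    rate = p.read if hit else p.write
    return prefix_tokens * rate / 1_000_000


for name, p in PRICING.items():
    miss = prefix_cost(p, 10_000, hit=False)
    hit = prefix_cost(p, 10_000, hit=True)
    print(f"{name}: miss ${miss:.4f}, hit ${hit:.4f}")
```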
Which applications benefit most: patterns that cache well
Not every application benefits equally from prompt caching. The impact is determined by: (1) what fraction of your input tokens are identical across requests (the “cache ratio”), and (2) how often those tokens repeat within the cache TTL window.
High-cache-benefit patterns:
- SaaS with a shared system prompt. If your product has a 2,000-token system prompt that is identical for all users and all sessions, that system prompt is cached once and amortized across all traffic. In a high-traffic application (10,000+ requests/day), the effective cache hit rate on the system prompt approaches 99%. At Claude Sonnet pricing, a 2,000-token system prompt costs $0.0006 per request on a cache hit instead of $0.006 uncached, a 90% reduction on the system prompt tokens.
- Document analysis and Q&A tools. If users ask multiple questions about the same document within a session (e.g., “Analyze this contract” → multiple follow-up questions), the document context can be cached and each follow-up question only pays for the small per-question input. A 10,000-token contract analyzed with 10 follow-up questions caches the contract after the first question and reads it at $0.30/M for the remaining 9.
- Tool-heavy agents. If your agent system passes 50+ tool definitions on every request (function schemas can be large), the tool schema block is a strong caching candidate. Tool schemas don't change per request; they are a fixed static prefix that benefits from caching.
- Batch jobs with uniform prefixes. If you process a batch of 10,000 documents all with the same system prompt and instructions, the first request writes the cache and all 9,999 subsequent requests are cache hits. Cache hit rate approaches 99.99%.
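For the tool-heavy agent pattern, one thing that can silently break the prefix is tool ordering. The sketch below builds a tools block in a stable order and puts a cache_control marker on the last tool. `build_tools_block` and the tool schemas are hypothetical, and placing cache_control on a tool definition is an assumption to confirm against Anthropic's current documentation.

```python
import json


def build_tools_block(tool_defs: dict) -> list:
    """Return tool definitions in a stable order, with a cache marker on the last one."""
    tools = [
        {"name": name, **tool_defs[name]}
        for name in sorted(tool_defs)  # stable order keeps the prefix identical
    ]
    if tools:
        # Assumption: a marker on the final tool covers the whole tools block.
        tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral"}}
    return tools


# Hypothetical tool schemas, for illustration only.
TOOLS_A = {
    "search_orders": {
        "description": "Look up an order.",
        "input_schema": {"type": "object", "properties": {}},
    },
    "refund_order": {
        "description": "Issue a refund.",
        "input_schema": {"type": "object", "properties": {}},
    },
}
# The same tools, registered in a different order.
TOOLS_B = dict(reversed(list(TOOLS_A.items())))

block_a = build_tools_block(TOOLS_A)
block_b = build_tools_block(TOOLS_B)
print(json.dumps(block_a) == json.dumps(block_b))
```

Sorting by name is only one possible convention; any deterministic order works as long as every request uses the same one.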
Low-cache-benefit patterns (do not invest engineering time here):
- RAG with per-query document chunks. If each query retrieves different document chunks from your vector database, the cached content changes per query and hit rate approaches 0. The retrieval step defeats prefix caching.
- Short system prompts (under 1,024 tokens). Below the provider minimum, caching doesn't apply. A 500-token system prompt cannot be cached on Anthropic or OpenAI.
- Dynamic system prompt injection. If your system prompt includes the user's name, current timestamp, or any per-request value, the prefix changes on every request and the cache hit rate is 0. Move all dynamic values after the static cached prefix.
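The fix for dynamic injection can be sketched as a request builder that keeps the cached system block static and moves per-request values into the user turn. `build_request` and the placeholder prompt are hypothetical; the payload shape follows the Anthropic example later in this guide, and no API call is made here.

```python
from datetime import datetime, timezone

# Placeholder text; a real prompt must meet the provider's minimum cacheable length.
STATIC_SYSTEM_PROMPT = "You are a support assistant for an invoicing product."


def build_request(user_name: str, user_query: str, now: datetime) -> dict:
    """Keep the cached system block static; put per-request values in the user turn."""
    return {
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # Dynamic values live after the cached prefix.
                "content": f"[user: {user_name}] [time: {now.isoformat()}]\n{user_query}",
            }
        ],
    }


req_1 = build_request("Ana", "Where is my invoice?",
                      datetime(2026, 5, 23, 9, 0, tzinfo=timezone.utc))
req_2 = build_request("Raj", "Cancel my plan.",
                      datetime(2026, 5, 23, 9, 5, tzinfo=timezone.utc))
print(req_1["system"] == req_2["system"])
```

Both requests carry an identical system block, so the second one can hit the cache written by the first.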
Implementation guide: Anthropic cache_control in 15 minutes
For Anthropic Claude, the implementation requires one structural change: move your static content to the front of the messages array and add cache_control markers.
```python
# Python example: Anthropic SDK
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a helpful assistant..."  # ≥1,024 tokens
LARGE_DOCUMENT = "..."  # large static document context
user_query = "..."  # per-request question from the user

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": LARGE_DOCUMENT,
                    "cache_control": {"type": "ephemeral"},  # Also cache document
                },
                {
                    "type": "text",
                    "text": user_query,  # Per-request, not cached
                },
            ],
        }
    ],
)

# Verify caching from response
print(response.usage.cache_creation_input_tokens)  # >0 on write
print(response.usage.cache_read_input_tokens)  # >0 on hit
```
For OpenAI: no code change needed for automatic caching. To maximize cache hit rate, ensure your system prompt content is identical across requests (no dynamic values in the system message, same ordering of instructions, same whitespace). Measure via usage.prompt_tokens_details.cached_tokens.
Simon Willison (simonwillison.net) recommends logging cache_creation_input_tokens and cache_read_input_tokens from every response to your observability stack and alerting when the cache hit rate drops below your baseline — a sudden drop indicates your prompt structure changed in a way that broke the prefix match.
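A minimal sketch of that monitoring idea follows. `CacheHitMonitor`, its window size, and its alert threshold are hypothetical; it assumes each response's usage has been converted to a plain dict that carries the two Anthropic field names mentioned above, and it leaves the actual alerting hook to your observability stack.

```python
from collections import deque


class CacheHitMonitor:
    """Rolling token-weighted cache hit rate over the last `window` responses."""

    def __init__(self, window: int = 100, alert_below: float = 0.8):
        self.samples = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, usage: dict) -> None:
        """Store the cached-read and cache-write token counts of one response."""
        read = usage.get("cache_read_input_tokens", 0) or 0
        written = usage.get("cache_creation_input_tokens", 0) or 0
        self.samples.append((read, written))

    def hit_rate(self) -> float:
        """Share of cacheable tokens that were read from cache."""
        read = sum(r for r, _ in self.samples)
        written = sum(w for _, w in self.samples)
        total = read + written
        return read / total if total else 0.0

    def should_alert(self) -> bool:
        return bool(self.samples) and self.hit_rate() < self.alert_below


monitor = CacheHitMonitor(window=4, alert_below=0.8)
# Hypothetical usage payloads: one write followed by three hits.
monitor.record({"cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0})
for _ in range(3):
    monitor.record({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000})
print(monitor.hit_rate(), monitor.should_alert())
```

The rate is weighted by tokens rather than by request count, which matches how the bill is computed; a request-count version would be an equally reasonable choice.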
Measuring and modeling cache savings: the break-even formula
The break-even formula determines the minimum number of cache reads required to recover the cache write premium:
break_even_hits = (cache_write_price − standard_price) / (standard_price − cache_read_price)
Claude 3.5 Sonnet: break_even_hits = ($3.75 − $3.00) / ($3.00 − $0.30) = $0.75 / $2.70 ≈ 0.28
Interpretation: each write costs $0.75/M more than an uncached request and each hit saves $2.70/M, so caching is net positive as soon as a cached prefix is read at least once before it expires. A prefix that is written and never read costs 25% more than leaving caching off.
OpenAI does not charge a write premium, so break-even is 0: any repeat request that hits the cache saves money.
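A direct way to sanity-check a break-even point is to total the cost of one cache write plus n reads and compare it with n + 1 uncached requests. The sketch below does that with the per-million-token prices quoted in this guide; `total_cost` and `min_reads_to_profit` are hypothetical helper names.

```python
def total_cost(prefix_tokens, reads, standard, write, read, cached):
    """USD cost of one initial request plus `reads` repeats of the same prefix."""
    if cached:
        per_m = write + reads * read        # one write, then cheap reads
    else:
        per_m = (1 + reads) * standard      # full price every time
    return prefix_tokens * per_m / 1_000_000


def min_reads_to_profit(standard, write, read, max_reads=1000):
    """Smallest whole number of reads per write at which caching is cheaper."""
    for n in range(max_reads + 1):
        with_cache = total_cost(1, n, standard, write, read, cached=True)
        without = total_cost(1, n, standard, write, read, cached=False)
        if with_cache < without:
            return n
    return None


claude = min_reads_to_profit(standard=3.00, write=3.75, read=0.30)
openai = min_reads_to_profit(standard=5.00, write=5.00, read=2.50)
print(claude, openai)
```

With zero reads the Claude write costs $3.75/M against $3.00/M uncached, so caching loses; with one read it costs $4.05/M against $6.00/M, so caching wins.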
Modeling your monthly savings:
monthly_savings =
    (monthly_total_requests × cache_hit_rate × cached_prefix_tokens
        × (standard_price − cache_read_price))
    − (cache_write_events × cached_prefix_tokens × cache_write_premium)

Example (Claude Sonnet, 10K-token prefix, 50K sessions/month, 85% hit rate):
    = (50,000 × 0.85 × 10,000 × ($3.00 − $0.30) / 1,000,000)
    − (50,000 × 0.15 × 10,000 × $0.75 / 1,000,000)
    = (425,000,000 × $2.70 / 1,000,000) − (75,000,000 × $0.75 / 1,000,000)
    = $1,147.50 − $56.25
    = $1,091.25/month savings

The Prompt Caching Savings Calculator on this site computes this for your specific inputs: model, prefix size, sessions, and hit rate.
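The same model can be written as a small function. This is a sketch under the example's own assumption that every cache miss is a write event; `monthly_savings` and its parameter names are hypothetical, and prices are in USD per million tokens.

```python
def monthly_savings(requests, hit_rate, prefix_tokens, standard, read, write):
    """Net monthly savings in USD from caching a fixed prefix."""
    hits = requests * hit_rate
    writes = requests * (1 - hit_rate)  # assumption: every miss triggers a write
    read_savings = hits * prefix_tokens * (standard - read) / 1_000_000
    write_premium = writes * prefix_tokens * (write - standard) / 1_000_000
    return read_savings - write_premium


example = monthly_savings(
    requests=50_000,
    hit_rate=0.85,
    prefix_tokens=10_000,
    standard=3.00,
    read=0.30,
    write=3.75,
)
print(f"${example:,.2f}")
```

Setting hit_rate to 0 shows the downside case: the result is negative, because every request pays the write premium and nothing is ever read back.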
TTL management, cache invalidation, and when caching breaks
Understanding when and why caches expire is essential for production reliability:
- Anthropic TTL: 5 minutes from last hit. In steady traffic, the TTL renews on every request and the cache stays warm indefinitely. In bursty or low-traffic applications (nightly batch jobs, business-hours-only SaaS), the cache expires between bursts and you pay the write premium at the start of each active period. Model this by multiplying your theoretical hit rate by the fraction of requests arriving within active TTL windows.
- Google Gemini TTL: 60 minutes (default). More forgiving for low-traffic applications. A user session that spans 60 minutes can cache the document context and benefit throughout the session without the cache expiring mid-session. Storage cost ($1/M tokens/hour) adds to the economics when many sessions run simultaneously.
- Prompt content changes invalidate caches. Any change to the cached prefix — including whitespace, punctuation, version bumps in the system prompt, tool schema updates — creates a cache miss on the next request. Cache invalidation on a prompt update causes a spike in cache write costs as all active sessions re-warm. If you update your system prompt frequently, build a deployment process that pre-warms the new cache before cutting over traffic.
- Model updates can invalidate caches. Caches are associated with a specific model version. When providers release new model versions (e.g., claude-3-5-sonnet-20241022 → a new version), existing caches may not carry over. Monitor for cache hit rate drops after provider model updates.
- Multi-tenant isolation. Caches are isolated per API key, not per user. All users of your application sharing the same API key share the cache — which is the desired behavior for caching a shared system prompt. Never include user-specific data (user ID, email, personal information) in a cached prefix block.
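The effect of traffic shape on a 5-minute TTL can be estimated with a small simulation. This is a sketch that assumes the TTL is refreshed by every hit and every write, and that a request arriving exactly at expiry still counts as a hit; `simulate_ttl` is a hypothetical helper and does not reflect any provider's internal eviction behavior.

```python
def simulate_ttl(arrivals_s, ttl_s=300):
    """Count cache hits and writes for request arrival times (seconds, ascending)."""
    hits = writes = 0
    expires_at = None
    for t in arrivals_s:
        if expires_at is not None and t <= expires_at:
            hits += 1
        else:
            writes += 1               # cold or expired cache: pay the write premium
        expires_at = t + ttl_s        # assumption: hits and writes both refresh the TTL
    return hits, writes


steady = simulate_ttl(range(0, 3600, 60))    # one request per minute for an hour
sparse = simulate_ttl(range(0, 3600, 900))   # one request every fifteen minutes
print(steady, sparse)
```

Feeding real request timestamps from your logs into a function like this gives a realized hit rate to use in the savings model instead of a guess.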
Modeling caching savings in practice — a worked example
When I decide whether to turn caching on, I’ve found the headline read discount is the wrong thing to fixate on — the traffic pattern decides it. Worked example on Claude Sonnet with a 10,000-token static prefix and 50,000 sessions/month at an 85% cache hit rate: savings from reads are 50,000 × 0.85 × 10,000 × ($3.00 − $0.30)/M ≈ $1,147.50, and the write premium on the 15% misses is 50,000 × 0.15 × 10,000 × $0.75/M ≈ $56.25, for roughly $1,091/month net savings. Now run the same prefix behind a low-traffic internal tool that fires once every fifteen minutes — outside the 5-minute TTL nearly every call becomes a fresh write at the ~1.25× premium, so caching there costs more than leaving it off. Same prompt, opposite verdict, because the hit rate — not the discount — is the variable that matters.
Assumptions: cache-read and cache-write multipliers, the minimum cacheable block size, and the TTL are provider-specific and change over time. The ~0.10× read / ~1.25× write figures are dated Anthropic examples, OpenAI’s automatic ~50% discount uses a different model, and Google Gemini adds an hourly storage cost. Verify current numbers against each provider’s pricing page. Savings apply only to the stable prefix, and the realized hit rate depends on call frequency relative to the TTL.
The break-even formula, the savings model, and the assumptions above are operationalized in the prompt-caching methodology and the open-source calculator source on GitHub (packages/calc).
Last reviewed by Byron Malone, 2026-05-23. Pricing and implementation details sourced from Anthropic Prompt Caching Documentation (docs.anthropic.com/en/docs/build-with-claude/prompt-caching), OpenAI Prompt Caching Guide (platform.openai.com/docs/guides/prompt-caching), and Google Gemini Context Caching Documentation (ai.google.dev/gemini-api/docs/caching). Not financial advice or vendor endorsement. Verify pricing at official provider pages before production planning.
By Byron Malone, Founder & Editor, Bedrocka Tools. Last verified against Anthropic, OpenAI & Google Gemini prompt-caching documentation.