Part 2 of 3
Ghostwritten by Claude Opus 4.5 | Curated by Tom Hundley
Every token you send to an LLM costs money and takes time to process. When your application sends the same 50,000-token context document with every request, you're paying for that processing over and over again. Context caching changes this equation dramatically—allowing you to pay once for processing expensive prompts and reuse that work across subsequent requests.
This isn't a theoretical optimization. The major LLM providers have all implemented caching mechanisms that can reduce costs by up to 90% and latency by up to 85% for the right workloads. But the implementations differ significantly, and understanding these differences is essential for designing cost-effective AI systems.
Consider a typical enterprise AI application: a customer service chatbot that needs access to a 40,000-token product manual. Every customer question requires sending that entire manual as context so the model can answer accurately.
At Claude Sonnet 4.5's base pricing of $3 per million input tokens, processing that manual costs about $0.12 per request. Handle 10,000 customer queries per month, and you're looking at $1,200 just for repeatedly processing the same static document. With prompt caching, that same workload could cost closer to $120—or less.
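The arithmetic is simple enough to sanity-check (a rough sketch using the base rate above and the cache-read rate discussed later in this article):

# Rough monthly cost of re-sending a 40,000-token manual with every request,
# using Claude Sonnet 4.5's base rate and the cache-read rate covered below.
manual_tokens = 40_000
queries_per_month = 10_000

base_rate = 3.00 / 1_000_000        # $ per input token
cache_read_rate = 0.30 / 1_000_000  # $ per input token read from cache

print(f"Uncached: ${manual_tokens * base_rate * queries_per_month:,.2f}/month")          # $1,200.00
print(f"Cache reads: ${manual_tokens * cache_read_rate * queries_per_month:,.2f}/month")  # $120.00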
The savings come from how LLMs actually process text.
To understand caching, you need to understand what happens when an LLM processes your prompt. At the heart of every transformer model is the self-attention mechanism, which compares each token against every other token in the sequence. This computation scales quadratically with input length, making the initial processing (called "prefill") the most expensive part of generation.
During prefill, the model computes Key (K) and Value (V) tensors for each token in your prompt. These KV pairs are stored in what's called the KV cache. For subsequent token generation, the model pulls these existing values from memory rather than recomputing them.
Prompt caching extends this principle across requests: if two requests share the same prompt prefix, they can share the same cached KV tensors. The provider stores these computed values, and when you send a request with a matching prefix, the model skips the expensive prefill computation and goes straight to generation.
This is why prompt structure matters so much. KV pairs for tokens at position i depend only on tokens 1 through i (due to causal attention). Same prefix means identical KV cache—regardless of what follows.
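A loose way to picture this (purely illustrative; real serving stacks cache the K/V tensors server-side, not strings) is a lookup table keyed by the shared prefix:

# Toy sketch: work done for a prompt prefix is stored once and reused whenever
# another request starts with exactly the same tokens.
kv_cache: dict[tuple[str, ...], str] = {}

def prefill(prefix: tuple[str, ...], suffix: tuple[str, ...]) -> str:
    if prefix not in kv_cache:
        kv_cache[prefix] = f"<KV tensors for {len(prefix)} prefix tokens>"  # the expensive step
    return kv_cache[prefix] + f" + fresh work for {len(suffix)} suffix tokens"

manual = ("system", "prompt", "plus", "product", "manual")
prefill(manual, ("customer", "question", "one"))  # computes and stores the prefix work
prefill(manual, ("customer", "question", "two"))  # identical prefix, so that work is reused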
Anthropic's approach gives developers explicit control over caching through cache breakpoints. You mark specific points in your prompt where caching should occur using the cache_control parameter.
For Claude Sonnet 4.5, the relevant prices are:

| Token Type | Price per Million Tokens |
|---|---|
| Base Input | $3.00 |
| Cache Write (5-minute TTL) | $3.75 (1.25x base) |
| Cache Write (1-hour TTL) | $6.00 (2x base) |
| Cache Read | $0.30 (0.1x base) |
The economics work out favorably for repeated use. A 5,000-token cached section costs $0.01875 on first request (cache write) but only $0.0015 on subsequent requests (cache read)—a 92% reduction.
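The arithmetic, using the Claude Sonnet 4.5 rates in the table above:

# Cost of a 5,000-token cached section at the rates above.
tokens = 5_000
write_cost = tokens * 3.75 / 1_000_000  # first request (cache write): $0.01875
read_cost = tokens * 0.30 / 1_000_000   # subsequent requests (cache read): $0.0015
base_cost = tokens * 3.00 / 1_000_000   # the same tokens sent uncached: $0.015

# Reads are 92% cheaper than the initial write and 90% cheaper than uncached input.
print(f"write ${write_cost:.5f}, read ${read_cost:.4f}, uncached ${base_cost:.4f}")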
Prompt caching is available for Claude Opus 4.5, Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Haiku 4.5, Haiku 3.5, and Claude 3 Opus.
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer service agent with access to our product documentation...",
        },
        {
            "type": "text",
            "text": large_product_manual,  # 40,000 tokens
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint
        }
    ],
    messages=[
        {"role": "user", "content": customer_question}
    ]
)

The explicit control is valuable for complex prompts where you have multiple cacheable sections with different stability characteristics.
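Before layering on more breakpoints, it is worth confirming the cache is actually being hit: send a follow-up request with byte-identical system blocks and inspect the usage fields Anthropic returns. A minimal sketch, reusing the client and variables from the example above:

# Second request with the exact same system blocks: the manual should now be a cache read.
followup = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer service agent with access to our product documentation...",
        },
        {
            "type": "text",
            "text": large_product_manual,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "A different customer question"}
    ]
)

# cache_creation_input_tokens counts tokens written to the cache on this request;
# cache_read_input_tokens counts tokens served from it.
print(followup.usage.cache_creation_input_tokens, followup.usage.cache_read_input_tokens)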
OpenAI took a different approach: prompt caching happens automatically with no code changes required. Starting October 2024, caching is enabled by default for GPT-4o, GPT-4o mini, o1-preview, o1-mini, and their fine-tuned variants.
OpenAI offers a flat 50% discount on cached input tokens across supported models. There's no separate cache write cost—you simply pay full price for the first request and half price when tokens hit the cache.
| Model | Base Input | Cached Input |
|---|---|---|
| GPT-4o | $2.50/M | $1.25/M |
| GPT-4o mini | $0.15/M | $0.075/M |
The API automatically caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. Cache hits are only possible for exact prefix matches.
from openai import OpenAI

client = OpenAI()

# First request - full price
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},  # prompt must reach 1,024 tokens before caching applies
        {"role": "user", "content": "Question 1"}
    ]
)

# Second request with same prefix - 50% off cached tokens
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},  # Same prefix
        {"role": "user", "content": "Question 2"}  # Different suffix
    ]
)

Caches typically clear after 5-10 minutes of inactivity. During off-peak periods, they may persist up to one hour, but all entries are evicted within one hour of their last use.
The automatic approach reduces implementation complexity but offers less control. You can't explicitly extend cache TTL or prioritize certain content for caching.
Google offers both implicit (automatic) and explicit caching, giving developers flexibility based on their needs.
Implicit caching is enabled by default for Gemini 2.5 and later models. When your request shares a prefix with previous requests, Google automatically passes on the cost savings, with no code changes needed.
Check the usage_metadata field in responses to see how many tokens hit the cache.
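For example (a sketch using the google-genai SDK that also appears in the explicit-caching example below; cached_content_token_count may be absent when nothing was served from cache):

from google import genai

client = genai.Client()
response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents=shared_context + user_question,  # placeholders: a long shared prefix plus the new question
)

# How many input tokens were billed at the cached rate on this request.
print(response.usage_metadata.cached_content_token_count)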
For predictable savings at scale, you can create explicit caches with controlled TTL:
from google import genai
from google.genai import types

client = genai.Client()

# Create a cache
cache = client.caches.create(
    model='models/gemini-2.5-flash',
    config=types.CreateCachedContentConfig(
        display_name='product_docs_cache',
        system_instruction='You are a product specialist...',
        contents=[large_document],
        ttl="3600s"  # 1 hour
    )
)

# Use the cache
response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents='User question here',
    config=types.GenerateContentConfig(
        cached_content=cache.name
    )
)

Gemini's caching pricing has three components:
For Gemini 2.5 Flash:
| Component | Price |
|---|---|
| Standard Input | $0.30/M |
| Cached Input | $0.03/M |
| Storage | $1.00/M tokens/hour |
Note that storage pricing varies by model. Gemini 2.5 Pro has higher storage costs at $4.50/M tokens/hour compared to Flash's $1.00/M tokens/hour.
Each model also enforces a minimum prompt size before content becomes cacheable:

| Model | Minimum Tokens |
|---|---|
| Gemini 2.5 Flash | 1,024 |
| Gemini 2.5 Pro | 4,096 |
| Gemini 3 Pro Preview | 4,096 |
The storage cost is unique to Google's approach. For long-running caches, factor this into your cost calculations.
Regardless of which provider you use, the same structural principle applies: static first, dynamic last.
Cache hits require an exact prefix match. Even tiny differences (a single space, changed punctuation, different JSON key ordering) break the cache. Structure every prompt with this in mind, and watch for these common pitfalls:
Timestamps in cached sections:
from datetime import datetime

# BAD: Timestamp changes break cache
system = f"Current time: {datetime.now()}. You are a helpful assistant..."
# GOOD: Keep dynamic content after cache boundary
system = "You are a helpful assistant..."
# Pass timestamp in user message or separate dynamic section

Variable formatting:
import json

# BAD: Different JSON formatting breaks cache
context = json.dumps(data)  # Formatting might vary
# GOOD: Ensure consistent serialization
context = json.dumps(data, sort_keys=True, separators=(',', ':'))

User-specific content in cached sections:
# BAD: User ID varies between users
system = f"You are helping user {user_id}..."
# GOOD: Move user-specific content after cache boundary
system = "You are a helpful assistant..."
messages = [{"role": "user", "content": f"[User: {user_id}] {question}"}]

Caching isn't universally beneficial. Its value depends on your specific use case.
For Anthropic's caching, you pay 1.25x base price for cache writes, so a cached prompt that is never read again costs 25% more than an uncached one. A single cache hit more than recoups that premium (1.25x + 0.1x = 1.35x across two requests, versus 2x uncached), and every additional hit widens the savings.
For Google's explicit caching, storage costs add up. A 100,000-token cache stored for 24 hours at $1/M/hour costs $2.40 in storage alone. You need sufficient read volume to justify that fixed cost.
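A rough break-even sketch, using the Gemini 2.5 Flash prices quoted above (and ignoring the one-time cost of creating the cache):

# How many cache reads does a 100,000-token, 24-hour explicit cache need to pay for itself?
cache_tokens = 100_000
hours_alive = 24

storage_cost = (cache_tokens / 1_000_000) * 1.00 * hours_alive  # $2.40 fixed
saving_per_read = (cache_tokens / 1_000_000) * (0.30 - 0.03)    # $0.027 saved per request

print(f"Storage: ${storage_cost:.2f}; break-even at ~{storage_cost / saving_per_read:.0f} reads")
# Roughly 89 reads over those 24 hours before the cache comes out ahead.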
For multi-turn conversations, cache grows with each turn:
# Turn 1: Cache system + initial context
# Turn 2: Cache system + initial context + turn 1
# Turn 3: Cache system + initial context + turns 1-2
# ...

Each turn benefits from cached previous turns. This works naturally with Anthropic's approach, where you can place cache breakpoints after key exchanges.
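One way to implement this (a sketch, assuming the Anthropic client from earlier plus a system_blocks list holding the static system prompt) is to keep a single breakpoint on the newest message, so the whole conversation so far becomes the cached prefix for the next turn:

# Sketch: move the conversation's cache breakpoint to the latest message each turn.
conversation = []

def ask(question: str) -> str:
    # Drop breakpoints from older messages (Anthropic allows at most 4 per request).
    for message in conversation:
        for block in message["content"]:
            block.pop("cache_control", None)

    conversation.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": question,
            "cache_control": {"type": "ephemeral"},  # everything up to here gets cached
        }],
    })

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_blocks,  # static, separately cached system prompt
        messages=conversation,
    )

    answer = response.content[0].text
    conversation.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})
    return answer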
Separate your prompt into a cached template and dynamic variables:
CACHED_TEMPLATE = """You are a specialist in {domain}.
Reference the following documentation:
{documentation}
Answer questions accurately based only on this information."""
# Cache the resolved template once per domain
cached_prompt = CACHED_TEMPLATE.format(
    domain="product support",
    documentation=load_documentation()
)

# Each request only varies the question

For complex applications, use multiple cache tiers:
# Tier 1: Global (changes rarely)
# - Core system instructions
# - Universal few-shot examples
# Tier 2: Domain-specific (changes daily/weekly)
# - Domain documentation
# - Domain-specific examples
# Tier 3: Session-specific (changes per conversation)
# - Conversation history
# - User preferences

With Anthropic's 4 breakpoints, you can explicitly mark each tier.
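A sketch of what that can look like, with placeholder variables (core_instructions, domain_docs, conversation_history, user_question) standing in for each tier's content:

# One breakpoint per tier, ordered from most to least stable.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        # Tier 1: global instructions and universal examples (changes rarely)
        {"type": "text", "text": core_instructions,
         "cache_control": {"type": "ephemeral"}},
        # Tier 2: domain documentation and examples (changes daily/weekly)
        {"type": "text", "text": domain_docs,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # Tier 3: session-specific history, with a breakpoint after the latest exchange
        *conversation_history,
        {"role": "user", "content": [
            {"type": "text", "text": user_question,
             "cache_control": {"type": "ephemeral"}},
        ]},
    ],
)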
Review requests with low cache hit rates. Common causes include dynamic content (timestamps, user IDs) placed before the cache boundary, inconsistent serialization, prompts that fall below the provider's minimum token count, and gaps between requests that exceed the cache TTL.
Anthropic returns cache_creation_input_tokens and cache_read_input_tokens in responses. Monitor these to understand cache behavior.
OpenAI's response includes cached_tokens under prompt_tokens_details in the usage object when caching applies.
Google's usage_metadata shows cache hit counts for implicit caching.
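A small helper can normalize these fields for logging (a sketch based on the current response shapes of each SDK; field names may shift between versions):

def cached_tokens(provider: str, response) -> int:
    """Best-effort count of input tokens served from cache for one response."""
    if provider == "anthropic":
        return response.usage.cache_read_input_tokens or 0
    if provider == "openai":
        details = response.usage.prompt_tokens_details
        return details.cached_tokens if details else 0
    if provider == "google":
        return response.usage_metadata.cached_content_token_count or 0
    raise ValueError(f"unknown provider: {provider}")

# Log this next to total input tokens to track cache hit rates over time.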
Instead of sending multiple small documents, combine them into a single large cached block. This maximizes the ratio of cached to uncached tokens.
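The combined block has to be byte-identical across requests, so build it deterministically. A sketch:

# Merge documents into one cache-friendly block with a stable ordering,
# so the resulting prefix is identical on every request.
def build_cached_context(documents: dict[str, str]) -> str:
    sections = []
    for name in sorted(documents):  # fixed order keeps the prefix stable
        sections.append(f"## {name}\n{documents[name].strip()}")
    return "\n\n".join(sections)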
For Anthropic: the default 5-minute TTL handles steady or bursty traffic well; reserve the 1-hour TTL, which doubles the write cost, for workloads where requests arrive too far apart to keep a 5-minute cache warm.

For Google explicit caching: set the TTL to match the window in which you actually expect reads, since storage is billed per token-hour whether or not the cache is used, and delete caches once they are no longer needed.
If you have control over request timing, batch requests that share context within cache TTL windows. This maximizes cache hits without extending TTL.
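One way to do this (a sketch, assuming each pending request is tagged with the shared context it depends on and a hypothetical handle() function sends it):

from collections import defaultdict

# Process requests grouped by shared context so that requests reusing the same
# prefix land back-to-back, inside the cache TTL window.
def process_in_cache_friendly_order(requests):  # requests: iterable of (context_id, payload)
    groups = defaultdict(list)
    for context_id, payload in requests:
        groups[context_id].append(payload)
    for context_id, payloads in groups.items():
        for payload in payloads:
            handle(context_id, payload)  # hypothetical: sends one request with this context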
Different providers suit different use cases:
| Requirement | Best Fit |
|---|---|
| Maximum control, complex prompts | Anthropic |
| Simple integration, no code changes | OpenAI |
| Predictable long-running caches | Google (explicit) |
| Variable usage patterns | OpenAI or Google (implicit) |
Context caching is evolving rapidly. Anthropic has progressively expanded supported models and TTL options. Google's implicit caching represents a trend toward automatic optimization. OpenAI's automatic approach shows that caching can become invisible infrastructure.
For production AI systems, caching is no longer optional—it's a fundamental architectural consideration. The difference between a naive implementation and an optimized one can be an order of magnitude in both cost and latency.
In the next article in this series, we'll explore hybrid search architectures that combine semantic embeddings with traditional search techniques—another pattern for building efficient, production-ready AI systems.