Part 2 of 3
Ghostwritten by Claude Opus 4.5 | Curated by Tom Hundley
Every token you send to an LLM costs money and takes time to process. When your application sends the same 50,000-token context document with every request, you're paying for that processing over and over again. Context caching changes this equation dramatically—allowing you to pay once for processing expensive prompts and reuse that work across subsequent requests.
This isn't a theoretical optimization. The major LLM providers have all implemented caching mechanisms that can reduce costs by up to 90% and latency by up to 85% for the right workloads. But the implementations differ significantly, and understanding these differences is essential for designing cost-effective AI systems.
Consider a typical enterprise AI application: a customer service chatbot that needs access to a 40,000-token product manual. Every customer question requires sending that entire manual as context so the model can answer accurately.
At Claude Sonnet 4.5's base pricing of $3 per million input tokens, processing that manual costs about $0.12 per request. Handle 10,000 customer queries per month, and you're looking at $1,200 just for repeatedly processing the same static document. With prompt caching, that same workload could cost closer to $120—or less.
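The arithmetic is simple enough to sanity-check (a rough sketch using the base rate above and the cache-read rate discussed later in this article):

# Rough monthly cost of re-sending a 40,000-token manual with every request,
# using Claude Sonnet 4.5's base rate and the cache-read rate covered below.
manual_tokens = 40_000
queries_per_month = 10_000

base_rate = 3.00 / 1_000_000        # $ per input token
cache_read_rate = 0.30 / 1_000_000  # $ per input token read from cache

print(f"Uncached: ${manual_tokens * base_rate * queries_per_month:,.2f}/month")          # $1,200.00
print(f"Cache reads: ${manual_tokens * cache_read_rate * queries_per_month:,.2f}/month")  # $120.00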
The savings come from how LLMs actually process text.
To understand caching, you need to understand what happens when an LLM processes your prompt. At the heart of every transformer model is the self-attention mechanism, which compares each token against every other token in the sequence. This computation scales quadratically with input length, making the initial processing (called "prefill") the most expensive part of generation.
During prefill, the model computes Key (K) and Value (V) tensors for each token in your prompt. These KV pairs are stored in what's called the KV cache. For subsequent token generation, the model pulls these existing values from memory rather than recomputing them.
Prompt caching extends this principle across requests: if two requests share the same prompt prefix, they can share the same cached KV tensors. The provider stores these computed values, and when you send a request with a matching prefix, the model skips the expensive prefill computation and goes straight to generation.
This is why prompt structure matters so much. KV pairs for tokens at position i depend only on tokens 1 through i (due to causal attention). Same prefix means identical KV cache—regardless of what follows.
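A loose way to picture this (purely illustrative; real serving stacks cache the K/V tensors server-side, not strings) is a lookup table keyed by the shared prefix:

# Toy sketch: work done for a prompt prefix is stored once and reused whenever
# another request starts with exactly the same tokens.
kv_cache: dict[tuple[str, ...], str] = {}

def prefill(prefix: tuple[str, ...], suffix: tuple[str, ...]) -> str:
    if prefix not in kv_cache:
        kv_cache[prefix] = f"<KV tensors for {len(prefix)} prefix tokens>"  # the expensive step
    return kv_cache[prefix] + f" + fresh work for {len(suffix)} suffix tokens"

manual = ("system", "prompt", "plus", "product", "manual")
prefill(manual, ("customer", "question", "one"))  # computes and stores the prefix work
prefill(manual, ("customer", "question", "two"))  # identical prefix, so that work is reused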
Anthropic's approach gives developers explicit control over caching through cache breakpoints. You mark specific points in your prompt where caching should occur using the cache_control parameter.
For Claude Sonnet 4.5, the relevant prices are:

| Token Type | Price per Million Tokens |
|---|---|
| Base Input | $3.00 |
| Cache Write (5-minute TTL) | $3.75 (1.25x base) |
| Cache Write (1-hour TTL) | $6.00 (2x base) |
| Cache Read | $0.30 (0.1x base) |
The economics work out favorably for repeated use. A 5,000-token cached section costs $0.01875 on first request (cache write) but only $0.0015 on subsequent requests (cache read)—a 92% reduction.
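The arithmetic, using the Claude Sonnet 4.5 rates in the table above:

# Cost of a 5,000-token cached section at the rates above.
tokens = 5_000
write_cost = tokens * 3.75 / 1_000_000  # first request (cache write): $0.01875
read_cost = tokens * 0.30 / 1_000_000   # subsequent requests (cache read): $0.0015
base_cost = tokens * 3.00 / 1_000_000   # the same tokens sent uncached: $0.015

# Reads are 92% cheaper than the initial write and 90% cheaper than uncached input.
print(f"write ${write_cost:.5f}, read ${read_cost:.4f}, uncached ${base_cost:.4f}")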
Prompt caching is available for Claude Opus 4.5, Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Haiku 4.5, Haiku 3.5, and Claude 3 Opus.
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer service agent with access to our product documentation...",
        },
        {
            "type": "text",
            "text": large_product_manual,  # 40,000 tokens
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint
        }
    ],
    messages=[
        {"role": "user", "content": customer_question}
    ]
)

The explicit control is valuable for complex prompts where you have multiple cacheable sections with different stability characteristics.
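Before layering on more breakpoints, it is worth confirming the cache is actually being hit: send a follow-up request with byte-identical system blocks and inspect the usage fields Anthropic returns. A minimal sketch, reusing the client and variables from the example above:

# Second request with the exact same system blocks: the manual should now be a cache read.
followup = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer service agent with access to our product documentation...",
        },
        {
            "type": "text",
            "text": large_product_manual,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "A different customer question"}
    ]
)

# cache_creation_input_tokens counts tokens written to the cache on this request;
# cache_read_input_tokens counts tokens served from it.
print(followup.usage.cache_creation_input_tokens, followup.usage.cache_read_input_tokens)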
OpenAI took a different approach: prompt caching happens automatically with no code changes required. Starting October 2024, caching is enabled by default for GPT-4o, GPT-4o mini, o1-preview, o1-mini, and their fine-tuned variants.
OpenAI offers a flat 50% discount on cached input tokens across supported models. There's no separate cache write cost—you simply pay full price for the first request and half price when tokens hit the cache.
| Model | Base Input | Cached Input |
|---|---|---|
| GPT-4o | $2.50/M | $1.25/M |
| GPT-4o mini | $0.15/M | $0.075/M |
The API automatically caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. Cache hits are only possible for exact prefix matches.
from openai import OpenAI

client = OpenAI()

# First request - full price
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},  # prompt must reach 1,024 tokens before caching applies
        {"role": "user", "content": "Question 1"}
    ]
)

# Second request with same prefix - 50% off cached tokens
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},  # Same prefix
        {"role": "user", "content": "Question 2"}  # Different suffix
    ]
)

Caches typically clear after 5-10 minutes of inactivity. During off-peak periods, they may persist up to one hour, but all entries are evicted within one hour of their last use.
The automatic approach reduces implementation complexity but offers less control. You can't explicitly extend cache TTL or prioritize certain content for caching.
Google offers both implicit (automatic) and explicit caching, giving developers flexibility based on their needs.
Implicit caching is enabled by default for Gemini 2.5 and later models. When your request shares a prefix with previous requests, Google automatically passes on the cost savings, with no code changes needed.
Check the usage_metadata field in responses to see how many tokens hit the cache.
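For example (a sketch using the google-genai SDK that also appears in the explicit-caching example below; cached_content_token_count may be absent when nothing was served from cache):

from google import genai

client = genai.Client()
response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents=shared_context + user_question,  # placeholders: a long shared prefix plus the new question
)

# How many input tokens were billed at the cached rate on this request.
print(response.usage_metadata.cached_content_token_count)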
For predictable savings at scale, you can create explicit caches with controlled TTL:
from google import genai
from google.genai import types

client = genai.Client()

# Create a cache
cache = client.caches.create(
    model='models/gemini-2.5-flash',
    config=types.CreateCachedContentConfig(
        display_name='product_docs_cache',
        system_instruction='You are a product specialist...',
        contents=[large_document],
        ttl="3600s"  # 1 hour
    )
)

# Use the cache
response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents='User question here',
    config=types.GenerateContentConfig(
        cached_content=cache.name
    )
)

Gemini's caching pricing has three components:
For Gemini 2.5 Flash:
| Component | Price |
|---|---|
| Standard Input | $0.30/M |
| Cached Input | $0.03/M |
| Storage | $1.00/M tokens/hour |
Note that storage pricing varies by model. Gemini 2.5 Pro has higher storage costs at $4.50/M tokens/hour compared to Flash's $1.00/M tokens/hour.
Each model also enforces a minimum prompt size before content becomes cacheable:

| Model | Minimum Tokens |
|---|---|
| Gemini 2.5 Flash | 1,024 |
| Gemini 2.5 Pro | 4,096 |
| Gemini 3 Pro Preview | 4,096 |
The storage cost is unique to Google's approach. For long-running caches, factor this into your cost calculations.
Regardless of which provider you use, the same structural principle applies: static first, dynamic last.
Cache hits require an exact prefix match. Even tiny differences (a single space, changed punctuation, different JSON key ordering) break the cache. Structure every prompt with this in mind, and watch for these common pitfalls:
Timestamps in cached sections:
from datetime import datetime

# BAD: Timestamp changes break cache
system = f"Current time: {datetime.now()}. You are a helpful assistant..."
# GOOD: Keep dynamic content after cache boundary
system = "You are a helpful assistant..."
# Pass timestamp in user message or separate dynamic section

Variable formatting:
import json

# BAD: Different JSON formatting breaks cache
context = json.dumps(data)  # Formatting might vary
# GOOD: Ensure consistent serialization
context = json.dumps(data, sort_keys=True, separators=(',', ':'))

User-specific content in cached sections:
# BAD: User ID varies between users
system = f"You are helping user {user_id}..."
# GOOD: Move user-specific content after cache boundary
system = "You are a helpful assistant..."
messages = [{"role": "user", "content": f"[User: {user_id}] {question}"}]

Caching isn't universally beneficial. Its value depends on your specific use case.
For Anthropic's caching, you pay 1.25x base price for cache writes, so a cached prompt that is never read again costs 25% more than an uncached one. A single cache hit more than recoups that premium (1.25x + 0.1x = 1.35x across two requests, versus 2x uncached), and every additional hit widens the savings.
For Google's explicit caching, storage costs add up. A 100,000-token cache stored for 24 hours at $1/M/hour costs $2.40 in storage alone. You need sufficient read volume to justify that fixed cost.
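A rough break-even sketch, using the Gemini 2.5 Flash prices quoted above (and ignoring the one-time cost of creating the cache):

# How many cache reads does a 100,000-token, 24-hour explicit cache need to pay for itself?
cache_tokens = 100_000
hours_alive = 24

storage_cost = (cache_tokens / 1_000_000) * 1.00 * hours_alive  # $2.40 fixed
saving_per_read = (cache_tokens / 1_000_000) * (0.30 - 0.03)    # $0.027 saved per request

print(f"Storage: ${storage_cost:.2f}; break-even at ~{storage_cost / saving_per_read:.0f} reads")
# Roughly 89 reads over those 24 hours before the cache comes out ahead.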
For multi-turn conversations, cache grows with each turn:
# Turn 1: Cache system + initial context
# Turn 2: Cache system + initial context + turn 1
# Turn 3: Cache system + initial context + turns 1-2
# ...

Each turn benefits from cached previous turns. This works naturally with Anthropic's approach, where you can place cache breakpoints after key exchanges.
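One way to implement this (a sketch, assuming the Anthropic client from earlier plus a system_blocks list holding the static system prompt) is to keep a single breakpoint on the newest message, so the whole conversation so far becomes the cached prefix for the next turn:

# Sketch: move the conversation's cache breakpoint to the latest message each turn.
conversation = []

def ask(question: str) -> str:
    # Drop breakpoints from older messages (Anthropic allows at most 4 per request).
    for message in conversation:
        for block in message["content"]:
            block.pop("cache_control", None)

    conversation.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": question,
            "cache_control": {"type": "ephemeral"},  # everything up to here gets cached
        }],
    })

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_blocks,  # static, separately cached system prompt
        messages=conversation,
    )

    answer = response.content[0].text
    conversation.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})
    return answer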
Separate your prompt into a cached template and dynamic variables:
CACHED_TEMPLATE = """You are a specialist in {domain}.
Reference the following documentation:
{documentation}
Answer questions accurately based only on this information."""
# Cache the resolved template once per domain
cached_prompt = CACHED_TEMPLATE.format(
    domain="product support",
    documentation=load_documentation()
)

# Each request only varies the question

For complex applications, use multiple cache tiers:
# Tier 1: Global (changes rarely)
# - Core system instructions
# - Universal few-shot examples
# Tier 2: Domain-specific (changes daily/weekly)
# - Domain documentation
# - Domain-specific examples
# Tier 3: Session-specific (changes per conversation)
# - Conversation history
# - User preferences

With Anthropic's 4 breakpoints, you can explicitly mark each tier.
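A sketch of what that can look like, with placeholder variables (core_instructions, domain_docs, conversation_history, user_question) standing in for each tier's content:

# One breakpoint per tier, ordered from most to least stable.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        # Tier 1: global instructions and universal examples (changes rarely)
        {"type": "text", "text": core_instructions,
         "cache_control": {"type": "ephemeral"}},
        # Tier 2: domain documentation and examples (changes daily/weekly)
        {"type": "text", "text": domain_docs,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # Tier 3: session-specific history, with a breakpoint after the latest exchange
        *conversation_history,
        {"role": "user", "content": [
            {"type": "text", "text": user_question,
             "cache_control": {"type": "ephemeral"}},
        ]},
    ],
)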
Review requests with low cache hit rates. Common causes include dynamic content (timestamps, user IDs) placed before the cache boundary, inconsistent serialization, prompts that fall below the provider's minimum token count, and gaps between requests that exceed the cache TTL.
Anthropic returns cache_creation_input_tokens and cache_read_input_tokens in responses. Monitor these to understand cache behavior.
OpenAI's response includes cached_tokens under prompt_tokens_details in the usage object when caching applies.
Google's usage_metadata shows cache hit counts for implicit caching.
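A small helper can normalize these fields for logging (a sketch based on the current response shapes of each SDK; field names may shift between versions):

def cached_tokens(provider: str, response) -> int:
    """Best-effort count of input tokens served from cache for one response."""
    if provider == "anthropic":
        return response.usage.cache_read_input_tokens or 0
    if provider == "openai":
        details = response.usage.prompt_tokens_details
        return details.cached_tokens if details else 0
    if provider == "google":
        return response.usage_metadata.cached_content_token_count or 0
    raise ValueError(f"unknown provider: {provider}")

# Log this next to total input tokens to track cache hit rates over time.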
Instead of sending multiple small documents, combine them into a single large cached block. This maximizes the ratio of cached to uncached tokens.
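The combined block has to be byte-identical across requests, so build it deterministically. A sketch:

# Merge documents into one cache-friendly block with a stable ordering,
# so the resulting prefix is identical on every request.
def build_cached_context(documents: dict[str, str]) -> str:
    sections = []
    for name in sorted(documents):  # fixed order keeps the prefix stable
        sections.append(f"## {name}\n{documents[name].strip()}")
    return "\n\n".join(sections)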
For Anthropic: the default 5-minute TTL handles steady or bursty traffic well; reserve the 1-hour TTL, which doubles the write cost, for workloads where requests arrive too far apart to keep a 5-minute cache warm.

For Google explicit caching: set the TTL to match the window in which you actually expect reads, since storage is billed per token-hour whether or not the cache is used, and delete caches once they are no longer needed.
If you have control over request timing, batch requests that share context within cache TTL windows. This maximizes cache hits without extending TTL.
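One way to do this (a sketch, assuming each pending request is tagged with the shared context it depends on and a hypothetical handle() function sends it):

from collections import defaultdict

# Process requests grouped by shared context so that requests reusing the same
# prefix land back-to-back, inside the cache TTL window.
def process_in_cache_friendly_order(requests):  # requests: iterable of (context_id, payload)
    groups = defaultdict(list)
    for context_id, payload in requests:
        groups[context_id].append(payload)
    for context_id, payloads in groups.items():
        for payload in payloads:
            handle(context_id, payload)  # hypothetical: sends one request with this context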
Different providers suit different use cases:
| Requirement | Best Fit |
|---|---|
| Maximum control, complex prompts | Anthropic |
| Simple integration, no code changes | OpenAI |
| Predictable long-running caches | Google (explicit) |
| Variable usage patterns | OpenAI or Google (implicit) |
Context caching is evolving rapidly. Anthropic has progressively expanded supported models and TTL options. Google's implicit caching represents a trend toward automatic optimization. OpenAI's automatic approach shows that caching can become invisible infrastructure.
For production AI systems, caching is no longer optional—it's a fundamental architectural consideration. The difference between a naive implementation and an optimized one can be an order of magnitude in both cost and latency.
In the next article in this series, we'll explore hybrid search architectures that combine semantic embeddings with traditional search techniques—another pattern for building efficient, production-ready AI systems.